Subhabrata Posted March 27, 2011

Dear Group,

While reading through some aspects of Machine Learning, I came across two interpretations of the popular training methods Maximum Likelihood Estimation (MLE) and the Expectation-Maximization (EM) algorithm. I am posting both versions here. If any of the learned members could kindly help me out, I would be grateful.

I. MLE AND EM, VERSION 1

We calculate the likelihood
L(f;p) = Π_{x∈X} p(x)^f(x),
where f is a non-empty and finite corpus on a countable set X of types. The maximum-likelihood estimate is
p' = argmax_p L(f;p).

After the maximum-likelihood step we apply Laplace smoothing and then the Expectation-Maximization algorithm, which aims at finding maximum-likelihood estimates in settings where this appears to be difficult, if not impossible. The trick of the EM algorithm is to map the given data to complete data on which it is well known how to perform maximum-likelihood estimation. Instead of a complete-data corpus, the input of the EM algorithm is an incomplete-data corpus together with a so-called symbolic analyzer: a device assigning to each incomplete-data type a set of analyses, where each analysis is a complete-data type. Thus the missing complete-data part can be partly compensated. The application of the symbolic analyzer to the incomplete-data corpus leads to an ambiguous complete-data corpus; the ambiguity arises as a consequence of the inherent analytical ambiguity of the symbolic analyzer.

The expectation-maximization algorithm performs a sequence of runs over the resulting ambiguous complete-data corpus. Each of these runs consists of an expectation step followed by a maximization step. In the E step, the algorithm combines the symbolic analyzer with an instance of the probability model; the result is a statistical analyzer which is able to resolve the analytical ambiguity introduced by the symbolic analyzer. In the M step, the algorithm calculates an ordinary maximum-likelihood estimate on the resolved complete-data corpus.

Let X and Y be non-empty and countable sets. A function A: Y -> 2^X is called a symbolic analyzer if the (possibly empty) sets of analyses A(y) ⊆ X are pairwise disjoint and their union is the complete set X:
X = Σ_{y∈Y} A(y).   (1)
In this case, Y is called the set of incomplete-data types, whereas X is called the set of complete-data types. In other words, the sets of analyses A(y) of the incomplete-data types y form a partition of the set X of complete-data types. Therefore, for each x∈X there exists a unique y∈Y, the so-called yield of x, such that x is an analysis of y:
y = yield(x) if and only if x∈A(y).

A pair <A, p>, consisting of a symbolic analyzer A and a probability distribution p on the complete-data types X, is called a statistical analyzer. We use a statistical analyzer to induce probabilities for the incomplete-data types y∈Y:
p(y) := Σ_{x∈A(y)} p(x).
Even more important, we use a statistical analyzer to resolve the analytical ambiguity of an incomplete-data type y∈Y by looking at the conditional probabilities of its analyses x∈A(y):
p(x|y) := p(x) / p(y), where y = yield(x).
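To make the statistical analyzer concrete, here is a minimal Python sketch of the two induced quantities p(y) and p(x|y). The types y1, y2, x1, x2, x3 and the distribution p_x are made-up toy values, not taken from the text above.

# Toy symbolic analyzer: maps each incomplete-data type y to its set of
# analyses A(y); the sets are pairwise disjoint and cover X.
A = {
    "y1": {"x1", "x2"},
    "y2": {"x3"},
}

# An instance p of the complete-data model (a distribution over X).
p_x = {"x1": 0.2, "x2": 0.3, "x3": 0.5}

# yield(x): the unique y with x in A(y); well defined because the sets A(y)
# form a partition of X.
yield_of = {x: y for y, xs in A.items() for x in xs}

def p_incomplete(y):
    # p(y) := sum of p(x) over all analyses x in A(y)
    return sum(p_x[x] for x in A[y])

def p_conditional(x):
    # p(x|y) := p(x) / p(y), where y = yield(x)
    y = yield_of[x]
    return p_x[x] / p_incomplete(y)

print(p_incomplete("y1"))   # 0.5
print(p_conditional("x1"))  # 0.4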
Input, Procedure and Output of an EM algorithm:

Input:
(i) A symbolic analyzer, i.e., a function A which assigns a set of analyses A(y) ⊆ X to each incomplete-data type y∈Y, such that all sets of analyses form a partition of the set X of complete-data types,
X = Σ_{y∈Y} A(y).
(ii) A non-empty and finite incomplete-data corpus, i.e., a frequency distribution f on the set of incomplete-data types,
f: Y -> R, such that f(y) >= 0 for all y∈Y and 0 < |f| < ∞.
(iii) A complete-data model M ⊆ M(X), i.e., each instance p∈M is a probability distribution on the set of complete-data types,
p: X -> [0,1] and Σ_{x∈X} p(x) = 1.   (2)
(iv) Implicit input: an incomplete-data model M ⊆ M(Y) induced by the symbolic analyzer and the complete-data model. Together with a given instance of the complete-data model, the symbolic analyzer constitutes a statistical analyzer which, in turn, induces the following instance of the incomplete-data model,
p: Y -> [0,1] and p(y) = Σ_{x∈A(y)} p(x).   (3)
(v) A (randomly generated) starting instance p0 of the complete-data model M. (Note: if permitted by M, then p0 should not assign a probability of zero to any x∈X.)

The procedure of the EM algorithm is as follows (a small code sketch is given after the procedure):
(i) for each i = 1, 2, 3, ... do
(ii) q := p_{i-1}
(iii) E step: compute the complete-data corpus f_q: X -> R expected by q,
f_q(x) = f(y) * q(x|y), where y = yield(x).
(iv) M step: compute a maximum-likelihood estimate p' of M on f_q,
L(f_q; p') = max_p L(f_q; p).
(v) p_i := p'
(vi) end // for each i
(vii) Output: print p0, p1, p2, p3, ...
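Below is a minimal Python sketch of this procedure, under the simplifying assumption (not stated above) that the complete-data model M is the unrestricted set of all distributions over X, so the M step reduces to relative-frequency estimation on the expected corpus f_q. The toy analyzer, corpus and starting distribution are made up for illustration.

def em(A, f, p0, iterations=10):
    # A  : dict mapping each incomplete-data type y to its set of analyses A(y)
    # f  : dict mapping each y to its corpus frequency f(y)
    # p0 : dict mapping each complete-data type x to a starting probability
    p = dict(p0)
    history = [dict(p)]
    for _ in range(iterations):
        q = p
        # E step: expected complete-data corpus f_q(x) = f(y) * q(x|y)
        f_q = {}
        for y, analyses in A.items():
            q_y = sum(q[x] for x in analyses)          # induced q(y)
            for x in analyses:
                f_q[x] = f.get(y, 0.0) * (q[x] / q_y)  # q(x|y) = q(x)/q(y)
        # M step: maximum-likelihood estimate on f_q
        # (for the unrestricted model: relative frequencies)
        total = sum(f_q.values())
        p = {x: c / total for x, c in f_q.items()}
        history.append(dict(p))
    return history

# Toy usage:
A = {"y1": {"x1", "x2"}, "y2": {"x3"}}
f = {"y1": 6, "y2": 4}
p0 = {"x1": 0.4, "x2": 0.1, "x3": 0.5}
for i, p in enumerate(em(A, f, p0, iterations=3)):
    print(i, p)   # prints p0, p1, p2, p3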
II. MLE AND EM, VERSION 2

To calculate the maximum likelihood, the category distribution is estimated by:
p'(c) = freq(c) / Σ_{c'} freq(c')   (4)
where freq(c) is the number of times the category c showed up in the training data, with the denominator being the total number of training instances (each instance has a unique category). The maximum-likelihood estimates for words in a category are computed similarly:
p'(w|c) = freq(w,c) / Σ_{w'} freq(w',c)   (5)
where freq(w,c) is the number of times the word w appeared in a document labeled with category c.

Smoothing: Maximum-likelihood estimation assigns zero probability to words that were not seen in a category during training. To overcome this gross underestimation, it is common to smooth the estimated distribution. A typical way of doing this corresponds to assigning a Dirichlet prior to the multinomial parameters (a typical Bayesian prior), then computing the maximum a posteriori (MAP) estimate instead of the maximum-likelihood estimate. In practice, this works out to a technique introduced by Laplace, known as additive smoothing. The Dirichlet prior is parameterized by a prior number of counts per outcome.

For instance, we can take 0.5 as the prior number of counts for a category, then estimate:
p"(c) = (freq(c) + 0.5) / Σ_{c'} (freq(c') + 0.5)   (6)
and similarly for the word-in-category estimates, which we might give a prior count of 0.01:
p"(w|c) = (freq(w,c) + 0.01) / Σ_{w'} (freq(w',c) + 0.01)   (7)

Given estimates of p(c) and p(w|c), we can classify a new text consisting of a sequence of words ws as follows, using Bayes's rule (which is where the technique gets the "Bayes" in its name):
p(c|ws) = p(ws|c) * p(c) / p(ws)   (8)

We expand the probability of the words by assuming that each word is generated independently (this is the naive assumption from which the technique gets the rest of its name), and hence the probability of all the words is the product of the probabilities of the individual words:
p(ws|c) = Π_i p(w_i|c)   (9)
In statistical terms, the naive assumption is that the distribution of words is a multinomial. We can compute the marginal p(ws) by summing over its probability in each category, weighted by the probability of that category:
p(ws) = Σ_{c'} p(ws|c') * p(c')   (10)

A small code sketch of these estimates is given below.
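Here is a minimal Python sketch of equations (4)-(10) with the additive smoothing above. The function names, the toy documents, and the choice to simply skip words unseen in training at classification time are my own illustrative assumptions, not part of the description above.

from collections import Counter, defaultdict
import math

def train_naive_bayes(docs, cat_prior=0.5, word_prior=0.01):
    # docs       : list of (category, list_of_words) training instances
    # cat_prior  : additive prior count per category, as in eq. (6)
    # word_prior : additive prior count per word, as in eq. (7)
    cat_freq = Counter(cat for cat, _ in docs)
    word_freq = defaultdict(Counter)
    vocab = set()
    for cat, words in docs:
        word_freq[cat].update(words)
        vocab.update(words)

    # Smoothed category estimates p"(c), eq. (6)
    denom_c = sum(cat_freq[c] + cat_prior for c in cat_freq)
    p_c = {c: (cat_freq[c] + cat_prior) / denom_c for c in cat_freq}

    # Smoothed word-in-category estimates p"(w|c), eq. (7)
    p_w_given_c = {}
    for c in cat_freq:
        denom_w = sum(word_freq[c][w] + word_prior for w in vocab)
        p_w_given_c[c] = {w: (word_freq[c][w] + word_prior) / denom_w
                          for w in vocab}
    return p_c, p_w_given_c

def classify(ws, p_c, p_w_given_c):
    # p(c|ws) for each category, via eqs. (8)-(10), computed in log space
    log_joint = {}
    for c in p_c:
        score = math.log(p_c[c])
        for w in ws:
            if w in p_w_given_c[c]:       # words unseen in training are skipped
                score += math.log(p_w_given_c[c][w])
        log_joint[c] = score
    # Marginal p(ws) = Σ_{c'} p(ws|c') * p(c'), eq. (10), used to normalize
    total = sum(math.exp(s) for s in log_joint.values())
    return {c: math.exp(s) / total for c, s in log_joint.items()}

# Toy usage with made-up documents:
docs = [("sports", ["goal", "match"]),
        ("politics", ["vote", "election"]),
        ("sports", ["match", "team"])]
p_c, p_w_given_c = train_naive_bayes(docs)
print(classify(["match", "goal"], p_c, p_w_given_c))

Best Regards,
Subhabrata Banerjee,
Gurgaon, India.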