Generative Models, Maximum Likelihood, Soft Clustering, and Expectation Maximization

Aris Anagnostopoulos

We will see why we design models for data, how to learn their parameters, and how considering a mixture model of Gaussians can give us an algorithm for soft clustering. I have included many of the derivations that we omitted in class for those who are interested in the details; there are a few formulas, but the underlying ideas are simple. Feel free to skip the calculations of Section 5 if you'd like, but make sure you understand the main ideas.

1 Modeling

An important part of understanding our data is modeling. Modeling is the design of a (usually mathematical) process that creates data. A good model has a few desired properties:

- It creates data that have statistical properties similar to the ones observed in reality.
- The generation process seems natural, corresponding in some sense to what actually happens in reality.
- It is amenable to analysis.

These properties are rather vague and not always attainable together. For instance, very often, for a model to create realistic data it has to become complicated; this means that it will probably be hard to analyze. Thus we often consider models that replicate only some of the characteristics of the underlying data, precisely the characteristics that we are interested in understanding at the given time. This is why the whole process of modeling is usually hard, requires experience, and is often both a science and an art.

Why are we interested in designing models for our data? First, designing a good model can help us make sense of our data and of the process by which the data are created. For instance, the model of Watts and Strogatz [2], even though it is not realistic, can help us explain the small-world phenomenon in networks. The Barabási–Albert preferential-attachment model [1] can explain the power law observed in the degree distribution. Second, a good model can help us make predictions. Assume that we have a history of data, say 100 days of transportation information. What we will often do is design a model (i.e., come up with a family of models characterized by a set of parameters) and use the largest part of the past data, say 90 days, to train the model, that is, to find the values of the parameters of the model. Then, using the part of the data that was left out, we evaluate the model: we check whether it can create data for the remaining 10 days that are somehow similar to what the real data are in these 10 days, or we check what probability the model gives to the real data in these 10 days. This process, which is often iterative and usually more complicated than what was just described, allows us to evaluate the model and see whether the family of models that we have chosen and the parameters that we have estimated are proper. Assuming that this is the case, we can then use the model to make predictions for the future: if the model that was trained on 90 days in the past is able to predict the last 10 days, then we assume (hope!) that it can predict the next 10 days as well. This, of course, requires several assumptions about our real-life scenarios.

Taking this one step further, we can now make decisions for the future. For instance, if the model predicts that a lot of people will want to go from the Colosseum to St. Peter's Cathedral, we can increase the number of buses that make this route. Finally, a model can also be used to design tools. For instance, we will see that PageRank, the algorithm first used by Google (among many other features) to assign scores to web pages, is actually the value obtained in the random-surfer model: in this model we consider a user who visits pages and follows outgoing links randomly, whereas with some probability he stops and restarts browsing at a random page; then the PageRank score of a given web page p is the long-term frequency with which this user visits page p.

In the next section we will see some of the above concepts by considering a very simple model for the height of a part of the population. This will bring us to the question of how we can learn the parameters of the model using maximum likelihood. Then, by generalizing this example to include heights of people from different parts of the population, we will introduce a more complicated model, which, in the process of learning its parameters, will also allow us to perform soft clustering of the underlying data.

2 A simple Gaussian model

To become familiar with some of the concepts of modeling, let us assume that we observe the heights of n Italian women. Let x_i be the height of the i-th person. A reasonable model for such a population is the Gaussian model: we assume that each x_i is drawn from some Gaussian distribution with mean µ and variance σ², each draw being independent. We can think that when God decides the height of the i-th person, he picks a value x_i distributed according to a normal distribution N(µ, σ²). Define the entire set of data to be the vector x = (x_1, x_2, ..., x_n).

Recall that the Gaussian or normal distribution, symbolized by N(µ, σ²), is the continuous probability distribution with range R and probability density function given by

    p(x) = N(x; \mu, \sigma^2) \triangleq \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}},

where the meaning of \triangleq is that we define N(x; µ, σ²) to be the expression on the right. The expected value of the distribution is µ and its variance is σ².

Recall that when we draw a value from a continuous probability distribution (a distribution whose cumulative distribution function (CDF) is continuous), the probability that we obtain a particular value x is 0. Thus we can characterize the distribution by its probability density function (PDF), which can give us the probability that we obtain a value in a given interval [a, b]. The probability that we obtain a value in a very small interval of length δx around x is approximately equal to p(x) δx, and so, using the same approach that we use when we define integrals, the probability that we obtain a value in the range [a, b] is

    \Pr(x \in [a, b]) = \int_a^b p(x) \, dx.

Nevertheless, it is often easy to think of p(x) as the likelihood that we will obtain the value x, and many of the properties that hold when we talk about probabilities of discrete random variables hold also for PDFs. For instance, if two random variables X, Y with a joint PDF p_{X,Y}(x, y) are independent, then we have that p_{X,Y}(x, y) = p_X(x) p_Y(y), where p_X(x) and p_Y(y) are the PDFs of the random variables X and Y, respectively.

We call the values µ and σ² the parameters of our model. Often we denote the collection of all our parameters with the vector θ, so here we have θ = (µ, σ²).
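To see the density in action, here is a minimal Python sketch (ours, not from the notes; the function name gaussian_pdf and the parameter values are made up for illustration). It evaluates N(x; µ, σ²) and checks numerically that Pr(x ∈ [a, b]) is close to 1 when [a, b] is wide enough, as the integral formula above suggests:

    import numpy as np

    def gaussian_pdf(x, mu, sigma2):
        # Density N(x; mu, sigma2); x may be a scalar or a NumPy array.
        return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

    mu, sigma2 = 165.0, 49.0   # e.g., heights in cm: mean 165, standard deviation 7

    # Pr(x in [a, b]) = integral of the PDF over [a, b], here a Riemann sum.
    a, b = 130.0, 200.0
    grid = np.linspace(a, b, 100_001)
    dx = grid[1] - grid[0]
    print(gaussian_pdf(grid, mu, sigma2).sum() * dx)  # ~1.0: [a, b] spans 5 sds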

Given the data and the model, a natural question is: what is the best set of parameters for the data that we have? This is often called fitting the model to the data. There are different ways to define "best," and in the next section we will see one of the most natural and most commonly used.

3 Maximum Likelihood

Assuming that the data came from a normal distribution N(µ, σ²), the likelihood that we observe point x_i is

    p_\theta(x_i) = N(x_i; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x_i-\mu)^2}{2\sigma^2}},

where we have put θ as a subscript to make explicit the dependence on the parameters θ = (µ, σ²). Assuming also that all the x_i's are jointly independent, we have that the likelihood that we observe the values x_1, x_2, ..., x_n is

    p_\theta(x) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}.

Note that this expression is a function of the data x = (x_1, ..., x_n) and of the parameters of the model θ = (µ, σ²). Because the data are given, and because we want to find the best value for θ, we can see it as a function of θ. We thus define the likelihood function L(θ; x) as

    L(\theta; x) = L(\mu, \sigma^2; x) \triangleq \prod_{i=1}^{n} p_\theta(x_i).

Therefore, the likelihood function tells us what the likelihood is that we observe the data for a given set of model parameters θ. The maximum-likelihood approach to finding the best model is to find the value θ̂ = (µ̂, σ̂²) that maximizes L(θ; x). In some sense, it is the most probable model for the given data among all the models of the family that we have selected (here, a Gaussian distribution).

It turns out that instead of maximizing L(θ; x), it is more convenient to maximize its logarithm. Thus we define the log-likelihood function LL(θ; x) \triangleq \ln L(θ; x), and we try to maximize this one. Because ln is an increasing function, the values θ̂ that maximize LL(θ; x) are the values that maximize L(θ; x). In our case we have:

    LL(\theta; x) = \ln \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}
                  = \sum_{i=1}^{n} \ln\!\left( \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} \right)
                  = \sum_{i=1}^{n} \left( -\frac{1}{2}\ln(2\pi\sigma^2) - \frac{(x_i-\mu)^2}{2\sigma^2} \right)
                  = -\sum_{i=1}^{n} \frac{(x_i-\mu)^2}{2\sigma^2} - \frac{n}{2}\ln\sigma^2 - \frac{n}{2}\ln(2\pi).
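As a quick aside on why we work with the logarithm in practice (a sketch of ours, not from the notes): the likelihood is a product of n densities, each typically much smaller than 1, so for even a few hundred points it underflows to 0 in floating-point arithmetic, while the log-likelihood is a well-behaved sum:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(165.0, 7.0, size=500)   # synthetic heights
    mu, sigma2 = 165.0, 49.0

    dens = np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

    print(np.prod(dens))         # 0.0: the product of 500 values < 0.06 underflows
    print(np.sum(np.log(dens)))  # about -1680: the sum is perfectly stable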

To find where LL(µ, σ²; x) attains its maximum, we set the partial derivatives with respect to µ and σ² equal to 0. Then we obtain

    \frac{\partial LL}{\partial \mu} = 0
    \iff \sum_{i=1}^{n} \frac{x_i - \mu}{\sigma^2} = 0
    \iff \sum_{i=1}^{n} (x_i - \mu) = 0
    \iff \mu = \frac{1}{n} \sum_{i=1}^{n} x_i,

and

    \frac{\partial LL}{\partial \sigma^2} = 0
    \iff \sum_{i=1}^{n} \frac{(x_i - \mu)^2}{2\sigma^4} - \frac{n}{2\sigma^2} = 0
    \iff \sum_{i=1}^{n} (x_i - \mu)^2 - n\sigma^2 = 0
    \iff \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2.

These give what we expect: the best values for the model's mean and variance,

    \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i,
    \qquad
    \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2,

equal the sample mean and variance, respectively. (Actually, to show that this is a maximum and not a minimum or a saddle point, we have to check that the determinant of the Hessian matrix is positive, and so on; check your multivariable calculus books if interested.)
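The closed-form solution can be sanity-checked numerically. The sketch below is ours (it assumes SciPy is available, and it optimizes ln σ² so that the variance stays positive); it minimizes the negative log-likelihood with a generic optimizer and compares the result with the sample mean and the (biased, divide-by-n) sample variance:

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(1)
    x = rng.normal(165.0, 7.0, size=1000)
    n = len(x)

    def neg_ll(theta):
        mu, log_sigma2 = theta
        sigma2 = np.exp(log_sigma2)
        # Negative of LL(theta; x) derived above, constants included.
        return 0.5 * np.sum((x - mu) ** 2) / sigma2 + 0.5 * n * np.log(2 * np.pi * sigma2)

    res = minimize(neg_ll, x0=np.array([150.0, np.log(10.0)]))  # rough initial guess
    mu_hat, sigma2_hat = res.x[0], np.exp(res.x[1])

    print(mu_hat, x.mean())     # both ~165: the numerical optimum matches mu-hat
    print(sigma2_hat, x.var())  # both ~49: np.var divides by n, as in sigma2-hat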

4 Gaussian Mixture Model and Soft Clustering

Now we will see how, using the ideas of the previous section, we can soft cluster a dataset. When we soft cluster a dataset, instead of assigning each point to a cluster, we give to each point, for every cluster, a probability that it belongs to that cluster.

Assume that we have a population of women who are either from Italy or from China, and whose heights are given, as before, by a vector x = (x_1, x_2, ..., x_n). We would like to try to cluster them based on their height. We will do this by using a more complicated model for generating the data, which will be a mixture of Gaussian distributions. The data-generation process for a height x_i has two steps:

1. We flip a biased coin, and with probability π_I we choose to have an Italian woman and with probability π_C a Chinese one.
2. If the coin selected an Italian woman, then we choose the height x_i distributed as N(µ_I, σ_I²). If the coin selected a Chinese one, then we choose the height x_i distributed as N(µ_C, σ_C²).

All the choices are mutually independent. In the above model we have six parameters, θ = (π_I, π_C, µ_I, σ_I², µ_C, σ_C²), and we have the constraint that π_I + π_C = 1, so that the coin flip at the first step is properly defined.

Given this model, it is natural to ask the same question as before: what is the best value of θ? We will try to apply again the maximum-likelihood approach. For this, it is convenient to define the hidden or latent random variable

    Z_i = I if the i-th person is Italian, and Z_i = C if the i-th person is Chinese.

We have

    p_\theta(x_i) = \Pr(Z_i = I) \, p_\theta(x_i \mid Z_i = I) + \Pr(Z_i = C) \, p_\theta(x_i \mid Z_i = C)
                  = \pi_I \frac{1}{\sqrt{2\pi\sigma_I^2}} e^{-\frac{(x_i-\mu_I)^2}{2\sigma_I^2}} + \pi_C \frac{1}{\sqrt{2\pi\sigma_C^2}} e^{-\frac{(x_i-\mu_C)^2}{2\sigma_C^2}}
                  = \pi_I N(x_i; \mu_I, \sigma_I^2) + \pi_C N(x_i; \mu_C, \sigma_C^2).

As before, using the fact that the different x_i's are independent, we have that the likelihood function is

    L(\theta; x) = \prod_{i=1}^{n} \left( \pi_I N(x_i; \mu_I, \sigma_I^2) + \pi_C N(x_i; \mu_C, \sigma_C^2) \right),

and we can also consider the log-likelihood function

    LL(\theta; x) = \sum_{i=1}^{n} \ln\!\left( \pi_I N(x_i; \mu_I, \sigma_I^2) + \pi_C N(x_i; \mu_C, \sigma_C^2) \right).    (1)

This is much harder to analyze than before, because the sum prevents the logarithm from being applied to the exponentials, and we cannot obtain a closed formula. In principle we can try to maximize it numerically, and indeed this is the approach used sometimes. In the next section we will introduce an alternative method, which can be used with mixture models and, more generally, when we have to infer the parameters of a model that contains latent variables.
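The two-step generative process above translates directly into code. Here is a minimal sketch (ours; the parameter values are made up for illustration) that generates n heights from the mixture:

    import numpy as np

    rng = np.random.default_rng(2)
    n = 1000

    # Illustrative parameters: theta = (pi_I, pi_C, mu_I, sigma2_I, mu_C, sigma2_C).
    pi_I, pi_C = 0.6, 0.4
    mu_I, sigma2_I = 165.0, 49.0
    mu_C, sigma2_C = 158.0, 36.0

    # Step 1: flip the biased coin for every person (Z_i = I with probability pi_I).
    z_is_I = rng.random(n) < pi_I

    # Step 2: draw the height from the Gaussian that the coin selected.
    x = np.where(z_is_I,
                 rng.normal(mu_I, np.sqrt(sigma2_I), size=n),
                 rng.normal(mu_C, np.sqrt(sigma2_C), size=n))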

5 Expectation Maximization for Gaussian Mixtures and Soft Clustering

As we will see, we will be able to infer the values of the parameters with an algorithm similar to k-means. For those interested in the details, I have included the calculations. They are actually a bit pedantic, but if you follow them slowly, they are not too hard. Nevertheless, feel free to skip them if you want, but make sure that you understand the high-level idea.

But first, let us assume that we knew the parameter vector θ. We will now see how, using these values, we could create a soft clustering. This means that for every value x_i we will compute the probabilities Pr(Z_i = I | X_i = x_i) and Pr(Z_i = C | X_i = x_i), which represent the probability that the i-th person is Italian or Chinese, respectively, conditional on their height being x_i. We define the random variable X_i to be the height of the i-th person when we create the data with our model. Note that we condition on an event with probability 0, which, even though we have not defined it (we defined conditioning only for discrete probability spaces), is fine but needs some technical work. Nevertheless, by some hand-waving arguments, we have:

    \Pr(Z_i = I \mid X_i = x_i)
    \overset{(a)}{\approx} \Pr(Z_i = I \mid x_i - \delta \le X_i \le x_i + \delta)
    = \frac{\Pr(Z_i = I, \; x_i - \delta \le X_i \le x_i + \delta)}{\Pr(x_i - \delta \le X_i \le x_i + \delta)}
    = \frac{\Pr(x_i - \delta \le X_i \le x_i + \delta \mid Z_i = I) \Pr(Z_i = I)}{\Pr(x_i - \delta \le X_i \le x_i + \delta)}
    = \frac{\Pr(x_i - \delta \le X_i \le x_i + \delta \mid Z_i = I) \Pr(Z_i = I)}{\Pr(x_i - \delta \le X_i \le x_i + \delta \mid Z_i = I) \Pr(Z_i = I) + \Pr(x_i - \delta \le X_i \le x_i + \delta \mid Z_i = C) \Pr(Z_i = C)}
    \overset{(b)}{\approx} \frac{2\delta \, p_\theta(x_i \mid Z_i = I) \Pr(Z_i = I)}{2\delta \, p_\theta(x_i \mid Z_i = I) \Pr(Z_i = I) + 2\delta \, p_\theta(x_i \mid Z_i = C) \Pr(Z_i = C)}
    = \frac{p_\theta(x_i \mid Z_i = I) \, \pi_I}{p_\theta(x_i \mid Z_i = I) \, \pi_I + p_\theta(x_i \mid Z_i = C) \, \pi_C},

where in (a) we approximate the probability that X_i = x_i with the probability that X_i falls in a tiny interval of length 2δ around x_i; in (b) we approximate this probability with 2δ times the PDF at the center of the interval; whereas in the equations in between we use the definition of conditional probability. (We are essentially proving Bayes' rule, which was also what Problem 1 of Homework 1 was asking you to do.) Actually, by taking the limit δ → 0, the above approximations hold with equality, and so we have

    \gamma_{iI} \triangleq \Pr(Z_i = I \mid X_i = x_i) = \frac{\pi_I N(x_i; \mu_I, \sigma_I^2)}{\pi_I N(x_i; \mu_I, \sigma_I^2) + \pi_C N(x_i; \mu_C, \sigma_C^2)},    (2)

because, assuming that the person is Italian, her height is distributed according to N(µ_I, σ_I²):

    p_\theta(x_i \mid Z_i = I) = N(x_i; \mu_I, \sigma_I^2) = \frac{1}{\sqrt{2\pi\sigma_I^2}} e^{-\frac{(x_i-\mu_I)^2}{2\sigma_I^2}}.

Notice that if we know θ, we can compute exactly the probability γ_iI = Pr(Z_i = I | X_i = x_i). Likewise, we can compute

    \gamma_{iC} \triangleq \Pr(Z_i = C \mid X_i = x_i) = \frac{\pi_C N(x_i; \mu_C, \sigma_C^2)}{\pi_I N(x_i; \mu_I, \sigma_I^2) + \pi_C N(x_i; \mu_C, \sigma_C^2)}.    (3)

Therefore, if we fix the parameters of the model θ, we can assign a probability for height x_i to belong to an Italian or to a Chinese person, and this assignment induces a soft clustering.
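In code, Equations (2) and (3) are one line each. A small sketch (ours; gaussian_pdf is the density helper from before, and the parameter values are made up):

    import numpy as np

    def gaussian_pdf(x, mu, sigma2):
        return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

    # Fixed (here: made-up) parameters theta and a few observed heights.
    pi_I, mu_I, sigma2_I = 0.6, 165.0, 49.0
    pi_C, mu_C, sigma2_C = 0.4, 158.0, 36.0
    x = np.array([150.0, 160.0, 170.0, 180.0])

    num_I = pi_I * gaussian_pdf(x, mu_I, sigma2_I)  # pi_I N(x_i; mu_I, sigma2_I)
    num_C = pi_C * gaussian_pdf(x, mu_C, sigma2_C)  # pi_C N(x_i; mu_C, sigma2_C)

    gamma_I = num_I / (num_I + num_C)  # Equation (2)
    gamma_C = 1.0 - gamma_I            # Equation (3): the two sum to 1

    print(np.round(gamma_I, 3))  # taller heights get a larger "Italian" probability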

Now we will look at the inverse question: assume that we know the probabilities γ_iI and γ_iC. How can we compute the best value θ? We will try to maximize the log-likelihood function of Equation (1), as we did in Section 3. First we set the partial derivative with respect to µ_I equal to 0:

    0 = \frac{\partial LL}{\partial \mu_I}
      = \sum_{i=1}^{n} \frac{\pi_I N(x_i; \mu_I, \sigma_I^2)}{\pi_I N(x_i; \mu_I, \sigma_I^2) + \pi_C N(x_i; \mu_C, \sigma_C^2)} \cdot \frac{x_i - \mu_I}{\sigma_I^2}
      = \sum_{i=1}^{n} \gamma_{iI} \, \frac{x_i - \mu_I}{\sigma_I^2},

which gives

    \mu_I = \frac{\sum_{i=1}^{n} \gamma_{iI} x_i}{\sum_{i=1}^{n} \gamma_{iI}}.    (4)

Similarly, we get

    \mu_C = \frac{\sum_{i=1}^{n} \gamma_{iC} x_i}{\sum_{i=1}^{n} \gamma_{iC}}.    (5)

Setting the derivative with respect to σ_I² equal to 0, we obtain

    0 = \frac{\partial LL}{\partial \sigma_I^2}
      = \sum_{i=1}^{n} \gamma_{iI} \left( \frac{(x_i - \mu_I)^2}{2\sigma_I^4} - \frac{1}{2\sigma_I^2} \right),

which gives

    \sigma_I^2 = \frac{\sum_{i=1}^{n} \gamma_{iI} (x_i - \mu_I)^2}{\sum_{i=1}^{n} \gamma_{iI}},    (6)

and similarly

    \sigma_C^2 = \frac{\sum_{i=1}^{n} \gamma_{iC} (x_i - \mu_C)^2}{\sum_{i=1}^{n} \gamma_{iC}}.    (7)

Finally, because of the constraint π_I + π_C = 1, to maximize LL(θ; x) with respect to π_I we will use the technique of Lagrange multipliers (check your optimization books for details if interested). We add the constraint to the objective, multiplied by λ, and we try to maximize LL(θ; x) + λ(π_I + π_C − 1). Setting the derivative with respect to π_I equal to 0, we get:

    0 = \frac{\partial}{\partial \pi_I} \left( LL(\theta; x) + \lambda (\pi_I + \pi_C - 1) \right)
      = \sum_{i=1}^{n} \frac{N(x_i; \mu_I, \sigma_I^2)}{\pi_I N(x_i; \mu_I, \sigma_I^2) + \pi_C N(x_i; \mu_C, \sigma_C^2)} + \lambda.

To compute the value λ, we will use the constraint π_I + π_C = 1. Multiplying the above by π_I gives

    -\lambda \pi_I = \sum_{i=1}^{n} \frac{\pi_I N(x_i; \mu_I, \sigma_I^2)}{\pi_I N(x_i; \mu_I, \sigma_I^2) + \pi_C N(x_i; \mu_C, \sigma_C^2)} = \sum_{i=1}^{n} \gamma_{iI}.    (8)

With the same approach, we have

    -\lambda \pi_C = \sum_{i=1}^{n} \frac{\pi_C N(x_i; \mu_C, \sigma_C^2)}{\pi_I N(x_i; \mu_I, \sigma_I^2) + \pi_C N(x_i; \mu_C, \sigma_C^2)} = \sum_{i=1}^{n} \gamma_{iC}.    (9)

Summing the two equations and using that π_I + π_C = 1 gives

    -\lambda = \sum_{i=1}^{n} \frac{\pi_I N(x_i; \mu_I, \sigma_I^2) + \pi_C N(x_i; \mu_C, \sigma_C^2)}{\pi_I N(x_i; \mu_I, \sigma_I^2) + \pi_C N(x_i; \mu_C, \sigma_C^2)} = n.

Combining this with Equations (8) and (9), we obtain

    \pi_I = \frac{1}{n} \sum_{i=1}^{n} \gamma_{iI}    (10)

and

    \pi_C = \frac{1}{n} \sum_{i=1}^{n} \gamma_{iC}.    (11)

Combining these with Equations (4) and (5), we obtain

    \mu_I = \frac{\sum_{i=1}^{n} \gamma_{iI} x_i}{n \pi_I}    (12)    and    \mu_C = \frac{\sum_{i=1}^{n} \gamma_{iC} x_i}{n \pi_C},    (13)

and combining them with Equations (6) and (7), we obtain

    \sigma_I^2 = \frac{\sum_{i=1}^{n} \gamma_{iI} (x_i - \mu_I)^2}{n \pi_I}    (14)    and    \sigma_C^2 = \frac{\sum_{i=1}^{n} \gamma_{iC} (x_i - \mu_C)^2}{n \pi_C}.    (15)

Therefore, we see that if we know the values γ_iI and γ_iC, we can compute the best value for θ. Note that, as before, we need some more work to show that the values that we computed indeed give a local maximum, but we omit the details.

Recall that Equations (2) and (3) tell us how we can compute the estimates γ_iI and γ_iC if we know the value of θ, whereas Equations (10)–(15) tell us how we can compute θ if we know the estimates γ_iI and γ_iC. This suggests the algorithm of Figure 1 for computing the best θ, γ_iI, and γ_iC, which is similar to k-means and is called expectation maximization (EM). Initially, we pick random values for the parameter vector θ. Given θ, we estimate the probability for each height to be that of an Italian or of a Chinese woman, γ_iI = Pr(Z_i = I | X_i = x_i) and γ_iC = Pr(Z_i = C | X_i = x_i); this is the expectation step. Given these values, we proceed to the maximization step, in which we update the parameter vector θ. We repeat until we have converged, which can be tested, for instance, by checking that the value of θ does not change much from the previous round. At the end, we obtain our estimate θ̂ for the best parameters of the model, and our estimates for the probabilities Pr(Z_i = I | X_i = x_i) and Pr(Z_i = C | X_i = x_i), which induce the soft clustering that we wanted to compute in the first place.
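Putting the two steps together gives the algorithm summarized in Figure 1 below. Here is a minimal NumPy sketch of the loop (ours, not from the notes; it runs a fixed number of iterations instead of a convergence test, and uses simple data-driven starting values rather than fully random ones):

    import numpy as np

    def gaussian_pdf(x, mu, sigma2):
        return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

    def em_two_gaussians(x, n_iter=200):
        n = len(x)
        # Initialization step.
        pi_I, pi_C = 0.5, 0.5
        mu_I, mu_C = x.max(), x.min()   # nudge component I toward the taller group
        sigma2_I = sigma2_C = x.var()

        for _ in range(n_iter):
            # Expectation step: Equations (2) and (3).
            num_I = pi_I * gaussian_pdf(x, mu_I, sigma2_I)
            num_C = pi_C * gaussian_pdf(x, mu_C, sigma2_C)
            gamma_I = num_I / (num_I + num_C)
            gamma_C = 1.0 - gamma_I

            # Maximization step: Equations (10)-(15).
            pi_I, pi_C = gamma_I.mean(), gamma_C.mean()
            mu_I = np.sum(gamma_I * x) / (n * pi_I)
            mu_C = np.sum(gamma_C * x) / (n * pi_C)
            sigma2_I = np.sum(gamma_I * (x - mu_I) ** 2) / (n * pi_I)
            sigma2_C = np.sum(gamma_C * (x - mu_C) ** 2) / (n * pi_C)

        return (pi_I, pi_C, mu_I, sigma2_I, mu_C, sigma2_C), gamma_I, gamma_C

    # Try to recover the made-up parameters of the mixture from Section 4.
    rng = np.random.default_rng(2)
    z_is_I = rng.random(1000) < 0.6
    x = np.where(z_is_I, rng.normal(165.0, 7.0, 1000), rng.normal(158.0, 6.0, 1000))
    theta, gamma_I, gamma_C = em_two_gaussians(x)
    print(np.round(theta, 2))  # roughly (0.6, 0.4, 165, 49, 158, 36), up to noise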

1.  Function ExpectationMaximization(x)
2.    Initialization step:
3.      Pick (say, random reasonable) values for θ = (π_I, π_C, µ_I, σ_I², µ_C, σ_C²)
4.    while not converged
5.      Expectation step:
6.        γ_iI ← π_I N(x_i; µ_I, σ_I²) / (π_I N(x_i; µ_I, σ_I²) + π_C N(x_i; µ_C, σ_C²))    (Equation (2))
7.        γ_iC ← π_C N(x_i; µ_C, σ_C²) / (π_I N(x_i; µ_I, σ_I²) + π_C N(x_i; µ_C, σ_C²))    (Equation (3))
8.      Maximization step:
9.        π_I ← (1/n) Σ_i γ_iI,  π_C ← (1/n) Σ_i γ_iC    (Equations (10), (11))
10.       µ_I ← Σ_i γ_iI x_i / (n π_I),  µ_C ← Σ_i γ_iC x_i / (n π_C)    (Equations (12), (13))
11.       σ_I² ← Σ_i γ_iI (x_i − µ_I)² / (n π_I),  σ_C² ← Σ_i γ_iC (x_i − µ_C)² / (n π_C)    (Equations (14), (15))
12.   end while
13.   Output the final values of γ_iI, γ_iC, and θ

Figure 1: The expectation maximization (EM) algorithm for learning Gaussian mixtures.

Note that the EM algorithm resembles the k-means algorithm a lot. The expectation step corresponds to assigning points to clusters (there we only allow γ_iI and γ_iC to take the values 0 or 1; instead, in the EM algorithm they can take any value in [0, 1], but in both cases we have the constraint that they sum to 1). The maximization step corresponds to the step of k-means in which we compute the best new centers as the new means, given the assignment of the points to clusters. Indeed, one can show that k-means can be viewed as a special case of the EM algorithm if we add the additional constraint that the variances σ_I² and σ_C² of the two Gaussian distributions are very small (tend to zero). In this way we end up with a hard clustering, by performing essentially k-means.

Similarly to k-means, we can show that the EM algorithm converges to some value, which is a local maximum but not necessarily a global one. But whereas the k-means algorithm usually converges fast, EM can be much slower. Therefore, one trick often used when we want to produce a soft clustering is to first run k-means to obtain a hard clustering, and then use this assignment to initialize the EM algorithm.

Here we saw one particular example of the EM algorithm, that of learning Gaussian mixtures (actually, a special case of that). But note that the EM algorithm is much more general and can be used when we want to learn models that contain latent variables. The technical details might be more involved, but the approach is the same.

References

[1] A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286:509–512, 1999.

[2] D. J. Watts and S. H. Strogatz. Collective dynamics of 'small-world' networks. Nature, 393:440–442, 1998.