COS 511: Theoretical Machine Learning
Lecturer: Rob Schapire
Lecture #21
Scribe: Lawrence Dao
April 23, 2013

1  On-Line Log Loss

To recap the end of the last lecture, we have the following on-line problem with $N$ experts. For each round $t = 1, \ldots, T$:

- each expert $i$ predicts $p_{t,i}$, a distribution on $X$
- the master predicts $q_t$, a distribution on $X$
- observe $x_t \in X$
- loss $= -\ln q_t(x_t)$

We want to get a bound on the total loss of the master $q_t$ in comparison to the best expert:

$$-\sum_{t=1}^{T} \log q_t(x_t) \;\le\; \min_i \left[ -\sum_{t=1}^{T} \log p_{t,i}(x_t) \right] + \text{small} \qquad (1)$$

where here we use the log function of arbitrary base. We'll see that this on-line log loss setting manifests itself in many applications, such as horse racing and coding theory.

2  Coding Theory

Here, we are concerned with how to efficiently send a message from Alice to Bob in as few bits as possible. In this setting we define $X$ as the alphabet, and each $x \in X$ as a letter. Say Alice wants to send one letter $x$. Define $p(x)$ to be the probability of sending $x$, which you can estimate from a corpus. The best you can do is to take $-\lg p(x)$ bits to send $x$.

Now Alice is trying to send a sequence of letters $x_1, x_2, x_3, \ldots$. One way we can do this is to use $p(x)$ for each letter separately, but this is sub-optimal for English. For example, if we see the string of characters "I am go", we can easily predict the next letter to be "i" given the context, but if we simply use $p(x)$, then we might say that "e" is the most likely, since it is the letter of highest frequency in the English language.

Our goal is to use the context so that we need fewer bits to encode $x$. If we define $p_t(x_t)$ to be the probability distribution of $x_t$ given the context $x_1^{t-1} = x_1, \ldots, x_{t-1}$, then it takes $-\lg p_t(x_t)$ bits to encode the extra letter $x_t$. However, it is really hard to model this probability; you can't get it just by counting as we could with $p(x)$. Instead, we consider combining a collection of coding methods where we don't know which one will be best. Let's say we have $N$ coding methods ($N$ experts). We try to pick a master coding method that uses at most a small amount more bits than the best coding method.
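Before formalizing the setup, here is a quick numerical sketch of the $-\lg p(x)$ code-length cost discussed above; the letter frequencies below are made-up for illustration, not taken from any real corpus.

```python
import math

# Hypothetical single-letter frequencies (made-up numbers for illustration,
# not real corpus statistics): p(x) as estimated from a corpus.
p = {"e": 0.12, "t": 0.09, "i": 0.07, "z": 0.001}

# Cost of sending a letter x on its own: about -lg p(x) bits.
for x, px in p.items():
    print(f"'{x}' alone costs about {-math.log2(px):.2f} bits")

# A context model p_t(x_t | x_1,...,x_{t-1}) that puts probability near 1
# on "i" after seeing "I am go" would spend close to -lg(1) = 0 bits on it,
# which is exactly the saving the master coding method is trying to capture.
```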
Let $p_{t,i}(x_t)$ = probability of $x_t$ given $x_1^{t-1}$ according to the $i$-th coding method. So we have

$-\lg p_{t,i}(x_t)$ = bits used by the $i$-th coding method
$-\lg q_t(x_t)$ = bits used by an arbitrary coding method $q_t$

We are trying to come up with a coding method $q_t(x_t)$ that guarantees

$$-\sum_{t=1}^{T} \lg q_t(x_t) \;\le\; \min_i \left[ -\sum_{t=1}^{T} \lg p_{t,i}(x_t) \right] + \text{small}.$$

Such an algorithm is called a universal compression algorithm, since it works about as well as the best coding method for any input. Note that the bound should hold for any sequence of $x_t$'s, so there is no assumption of randomness on the $x_t$. Also note that this bound is of the form of (1).

3  Universal Compression Algorithm

In this section we try to determine the algorithm for choosing the master coding method. To make the math cleaner, we change the base back to $e$, and try to achieve the following bound:

$$-\sum_{t=1}^{T} \ln q_t(x_t) \;\le\; \min_i \left[ -\sum_{t=1}^{T} \ln p_{t,i}(x_t) \right] + \text{small}.$$

We also make the following notation changes:

$q_t(x_t) \;\to\; q(x_t \mid x_1^{t-1})$
$p_{t,i}(x_t) \;\to\; p_i(x_t \mid x_1^{t-1})$

Let's pretend that the $x_t$ are random, even though they're not, in order to motivate an algorithm for picking $q$. Pretend that the $x_t$ are picked as follows:

- select one expert $i^*$ with $\Pr[i^* = i] = 1/N$
- $x_1, x_2, \ldots$ are generated according to expert $i^*$:
  $\Pr[x_1 \mid i^* = i] = p_i(x_1)$
  $\Pr[x_2 \mid x_1, i^* = i] = p_i(x_2 \mid x_1)$
  $\;\vdots$
  $\Pr[x_t \mid x_1^{t-1}, i^* = i] = p_i(x_t \mid x_1^{t-1})$
Then the most natural way to pick $q$ is:

$$q(x_t \mid x_1^{t-1}) = \Pr[x_t \mid x_1^{t-1}]$$
$$= \sum_i \Pr[x_t,\, i^* = i \mid x_1^{t-1}] \qquad \text{(marginalize)}$$
$$= \sum_i \Pr[i^* = i \mid x_1^{t-1}] \cdot \Pr[x_t \mid i^* = i,\, x_1^{t-1}] \qquad \text{(conditional probability)}$$
$$= \sum_i w_{t,i}\, p_i(x_t \mid x_1^{t-1})$$

where $w_{t,i} = \Pr[i^* = i \mid x_1^{t-1}]$. If we can find these $w_{t,i}$ then we have an algorithm:

$$w_{1,i} = \Pr[i^* = i] = \frac{1}{N}$$
$$w_{t+1,i} = \Pr[i^* = i \mid x_1^{t}] = \Pr[i^* = i \mid x_1^{t-1}, x_t]$$
$$= \frac{\Pr[i^* = i \mid x_1^{t-1}] \cdot \Pr[x_t \mid i^* = i,\, x_1^{t-1}]}{\text{normalization}} \qquad \text{(Bayes rule)}$$
$$= \frac{w_{t,i}\, p_i(x_t \mid x_1^{t-1})}{\text{normalization}}$$

So we are left with the following algorithm.

Initialization: for all $i$, $w_{1,i} = 1/N$.
On round $t$:
- Choose $q(x_t \mid x_1^{t-1}) = \sum_i w_{t,i}\, p_i(x_t \mid x_1^{t-1})$.
- Update weights: for all $i$, $w_{t+1,i} = \dfrac{w_{t,i}\, p_i(x_t \mid x_1^{t-1})}{\text{normalization}}$.

We can see that this weight update is very similar to the other weight-update on-line learning algorithms we have seen in the past, which used

$$w_{t+1,i} \propto w_{t,i}\, \beta^{\text{loss}_i}.$$

Here $\text{loss}_i = -\ln p_{t,i}(x_t)$ and $\beta = e^{-1}$, so $\beta^{\text{loss}_i} = p_{t,i}(x_t)$ and

$$w_{t+1,i} \propto w_{t,i}\, \beta^{\text{loss}_i} = w_{t,i}\, p_{t,i}(x_t),$$

except that we don't have to tune $\beta$, since there is only one correct choice of $\beta = e^{-1}$ in this case.
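To make the algorithm concrete, here is a minimal Python sketch (not part of the original notes). The experts are assumed to be given as functions returning the conditional distribution $p_i(\cdot \mid x_1^{t-1})$; the two context-free experts in the demo are made-up for illustration.

```python
import numpy as np

def mixture_forecaster(experts, xs):
    """Bayes-mixture (universal compression) algorithm sketched above.

    experts: list of functions; experts[i](prefix) returns a dict mapping each
             possible letter to p_i(letter | prefix).
    xs:      the observed sequence x_1, ..., x_T (a string here).
    Returns the total log loss  -sum_t ln q(x_t | x_1^{t-1}).
    """
    N = len(experts)
    w = np.full(N, 1.0 / N)               # w_{1,i} = 1/N
    total_loss = 0.0
    for t, x in enumerate(xs):
        prefix = xs[:t]
        # p[i] = p_i(x_t | x_1^{t-1}), evaluated at the letter actually observed
        p = np.array([expert(prefix)[x] for expert in experts])
        q = float(np.dot(w, p))           # q(x_t | x_1^{t-1}) = sum_i w_{t,i} p_i(...)
        total_loss -= np.log(q)
        w = w * p                         # w_{t+1,i} proportional to w_{t,i} p_i(...)
        w /= w.sum()                      # normalization
    return total_loss

# Tiny demo with two made-up context-free experts over a binary alphabet.
expert_a = lambda prefix: {"0": 0.9, "1": 0.1}
expert_b = lambda prefix: {"0": 0.3, "1": 0.7}
print(mixture_forecaster([expert_a, expert_b], "0010001000"))
```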
4  Bounding the Log Loss

Here we are trying to prove (1), given our choice of $q(x_t \mid x_1^{t-1}) = \sum_i w_{t,i}\, p_i(x_t \mid x_1^{t-1})$.

Theorem:
$$-\sum_{t=1}^{T} \log q_t(x_t) \;\le\; \min_i \left[ -\sum_{t=1}^{T} \log p_{t,i}(x_t) \right] + \log N.$$

Define

$$q(x_1^T) = q(x_1)\, q(x_2 \mid x_1)\, q(x_3 \mid x_1, x_2) \cdots = \prod_{t=1}^{T} q(x_t \mid x_1^{t-1}) = \Pr[x_1^T] \qquad \text{(chain rule)}.$$

In the same way we can do this with each expert:

$$p_i(x_1^T) = \prod_{t=1}^{T} p_i(x_t \mid x_1^{t-1}) = \Pr[x_1^T \mid i^* = i].$$

Additionally, the total loss of our algorithm is given by the following:

$$-\sum_t \log q_t(x_t) = -\sum_t \log q(x_t \mid x_1^{t-1}) = -\log \prod_t q(x_t \mid x_1^{t-1}) = -\log q(x_1^T).$$

Similarly, for any expert $i$,

$$-\sum_t \log p_{t,i}(x_t) = -\log p_i(x_1^T).$$

So we have the following bound:

$$q(x_1^T) = \Pr[x_1^T] = \sum_i \Pr[i^* = i]\, \Pr[x_1^T \mid i^* = i] \qquad \text{(marginalize)}$$
$$= \sum_i \frac{1}{N}\, p_i(x_1^T) \;\ge\; \frac{1}{N}\, p_i(x_1^T) \quad \text{(for any fixed } i)$$
$$\Rightarrow\quad -\log q(x_1^T) \;\le\; -\log p_i(x_1^T) + \log N$$
$$\Rightarrow\quad -\sum_t \log q_t(x_t) \;\le\; \min_i \left[ -\sum_t \log p_{t,i}(x_t) \right] + \log N.$$

Here we consider $\log N$ to be small. Note that this bound does not assume any randomness for the $x_t$.

Now, let's consider an alternative encoding scheme, where Alice waits for the entire message $x_1, x_2, \ldots, x_T$, chooses the best of the $N$ candidate encoding methods, uses $\lg N$ bits to encode which encoding method she used, and finally sends her message according to this chosen method. We can see that this scheme would use just as many bits as the right-hand side of the bound, but using our on-line algorithm we don't have to wait for the whole message to start encoding/sending. We won't go into detail about decoding, but in order to decode, Bob effectively just simulates what Alice does to encode, so decoding is just as efficient as Alice's encoding, making algorithmic efficiency a non-factor.
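As a small sanity check (not part of the original notes), the following sketch computes the mixture's sequence probability $q(x_1^T) = \frac{1}{N}\sum_i p_i(x_1^T)$ directly, which is the key identity in the proof, and compares $-\log q(x_1^T)$ against the best expert's loss plus $\log N$. The constant experts and the sequence are made-up for illustration.

```python
import math

# Two made-up constant experts over a binary alphabet (illustrative only).
expert_ps = [0.2, 0.8]                 # Pr[x_t = "1"] under each expert
N = len(expert_ps)
xs = "1101001110"                      # an arbitrary observed sequence

def seq_prob(p, xs):
    """p_i(x_1^T): product of per-letter probabilities under a constant expert."""
    prob = 1.0
    for x in xs:
        prob *= p if x == "1" else 1.0 - p
    return prob

# Sequence probability assigned by the mixture: q(x_1^T) = (1/N) * sum_i p_i(x_1^T).
q_seq = sum(seq_prob(p, xs) for p in expert_ps) / N

# The bound in the theorem: -log q(x_1^T) <= min_i -log p_i(x_1^T) + log N.
lhs = -math.log(q_seq)
rhs = min(-math.log(seq_prob(p, xs)) for p in expert_ps) + math.log(N)
print(lhs, "<=", rhs)
```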
5  Variations

5.1  Using a prior

In this section we consider a prior $\Pr[i^* = i] = \pi_i$, not necessarily uniform. Everything about our algorithm stays the same except that the initial weights are now $w_{1,i} = \pi_i$, and the final bound ends up being

$$-\sum_{t=1}^{T} \log q_t(x_t) \;\le\; \min_i \left[ -\sum_{t=1}^{T} \log p_{t,i}(x_t) - \log \pi_i \right].$$

5.2  Infinite Experts

Consider the problem where $X = \{0, 1\}$, and expert $p$ predicts

$$x_t = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1 - p \end{cases}$$

where we have an expert for every $p \in [0, 1]$. We need to figure out the weights $w_{t,p}$ to get $q$. In the finite case, we had $w_{t,i} = \Pr[i^* = i \mid x_1^{t-1}]$, but applying this definition to the infinite case doesn't really make sense unless we're talking about the probability density:

$$\Pr[p \in dp \mid x_1^{t-1}] = \frac{\Pr[x_1^{t-1} \mid p \in dp]\, \Pr[p \in dp]}{\Pr[x_1^{t-1}]} \qquad \text{(Bayes rule)}$$
$$= \frac{\Pr[x_1^{t-1} \mid p \in dp]\, \Pr[p \in dp]}{\text{normalization}}$$
$$= \frac{\Pr[x_1^{t-1} \mid p \in dp]}{\text{normalization}} \qquad \text{(assuming } \Pr[p \in dp] \text{ uniform)}$$
$$\propto p^h (1 - p)^{t-1-h}$$

where $h$ is the number of heads (1's) in the first $t-1$ rounds. Now, letting $w_{t,p} = p^h (1-p)^{t-1-h}$,

$$q_t = \frac{\int_0^1 w_{t,p}\, p\, dp}{\int_0^1 w_{t,p}\, dp} \quad \text{(normalization)} \quad = \frac{h + 1}{(t - 1) + 2},$$

which is sometimes called Laplace smoothing. (A short numerical sketch of this predictor appears at the end of these notes.) We can get a similar bound as before in this case, but $-\log \pi_i$ or $\lg N$ doesn't make sense here. We'll see a bound in a future lecture.

6  Switching Experts

In this section we set up the problem for the next class. Here, we no longer assume that one expert is good all the time. Instead, we change the model so that at any step, the correct expert can switch to another expert. However, the learning algorithm has no idea when the experts are switching. Our goal is to design an algorithm that performs well with respect to the best switching sequence of experts. We'll look at this in the next lecture.
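Returning to the infinite-experts predictor of Section 5.2, here is a minimal sketch (not from the original notes) checking numerically that the integral-ratio predictor agrees with the Laplace-smoothing formula $q_t = (h+1)/((t-1)+2)$; the observed prefix below is arbitrary.

```python
def laplace_prediction(prefix):
    """Pr[x_t = 1 | x_1^{t-1}] under a uniform prior over experts p in [0,1]."""
    h = prefix.count(1)               # number of 1's ("heads") so far
    t_minus_1 = len(prefix)
    return (h + 1) / (t_minus_1 + 2)  # (h+1)/((t-1)+2), Laplace smoothing

def integral_prediction(prefix, grid=100000):
    """Numerically approximate (int_0^1 w_{t,p} p dp) / (int_0^1 w_{t,p} dp)."""
    h = prefix.count(1)
    m = len(prefix) - h               # number of 0's
    num = den = 0.0
    for k in range(1, grid):
        p = k / grid
        w = p**h * (1 - p)**m         # w_{t,p} = p^h (1-p)^{t-1-h}
        num += w * p
        den += w
    return num / den

prefix = [1, 1, 0, 1]                 # arbitrary observed bits x_1^{t-1}
print(laplace_prediction(prefix), integral_prediction(prefix))   # both about 0.667
```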