Foundations of Machine Learning II
TP1: Entropy

Guillaume Charpiat (Teacher) & Gaétan Marceau Caron (Scribe)
https://www.lri.fr/~gcharpia/machinelearningcourse/

Problem 1 (Gibbs inequality). Let $p$ and $q$ be two probability measures over a finite alphabet $X$. Prove that
    KL(p \| q) \ge 0.
Hint: for a concave function $f$ and a random variable $X$, we have Jensen's inequality $E[f(X)] \le f(E[X])$; $\ln$ is a strictly concave function.

Solution: We start by reviewing some useful results from Boyd and Vandenberghe, 2004; Cover and Thomas, 2006.

Definition 1. A function $f : \mathbb{R}^n \to \mathbb{R}$ is convex if $\operatorname{dom} f$ is a convex set and if for all $x, y \in \operatorname{dom} f$ and $\theta \in [0, 1]$, we have
    f(\theta x + (1 - \theta) y) \le \theta f(x) + (1 - \theta) f(y)    (1)
A function is strictly convex when, for $x \neq y$, equality holds in (1) only if $\theta = 0$ or $\theta = 1$. Also, a function $f$ is concave if $-f$ is convex.

Proving that a given function $f$ is convex is usually hard with the previous definition. When $f$ is twice-differentiable, it is easier to use the second-order condition:

Proposition 1. Let $f$ be a twice-differentiable function, that is, its Hessian or second derivative $\nabla^2 f$ exists at each point in $\operatorname{dom} f$, which is open. Then, $f$ is convex if and only if $\operatorname{dom} f$ is a convex set and $\nabla^2 f$ is positive semidefinite, i.e., for all $x \in \operatorname{dom} f$, we have $\nabla^2 f(x) \succeq 0$. Moreover, the function is strictly convex if $\nabla^2 f$ is positive definite.

Proof. Use a Taylor expansion (cf. Cover and Thomas, 2006, p.26 for the univariate case).

By applying this result, we easily find that $\ln$ is a strictly concave function. Now, we state Jensen's inequality, required in the proof of Gibbs' inequality.

Theorem 2 (Jensen's inequality). If $f$ is a convex function and $X$ is a random variable, then
    E[f(X)] \ge f(E[X])    (2)

Proof. By induction (cf. Cover and Thomas, 2006, p.27).
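As a quick numerical illustration (our addition, not part of the original derivation): for the strictly concave function $\ln$, Jensen's inequality reads $E[\ln X] \le \ln E[X]$ for any positive random variable $X$. The sketch below checks this on Monte Carlo samples; the exponential distribution and the sample size are arbitrary choices.

```python
import numpy as np

# Illustration of Jensen's inequality for the strictly concave function ln:
# E[ln X] <= ln E[X] for a positive random variable X.
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)  # arbitrary positive samples

lhs = np.mean(np.log(x))   # Monte Carlo estimate of E[ln X]
rhs = np.log(np.mean(x))   # ln applied to the Monte Carlo estimate of E[X]
print(f"E[ln X] = {lhs:.4f} <= ln E[X] = {rhs:.4f}")
```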
Finally, we recall the definition of the Kullback-Leibler divergence:

Definition 2. Let $p$ and $q$ be two discrete probability distributions over an alphabet $X$. The Kullback-Leibler divergence is defined as
    KL(p \| q) = \sum_{x \in X} p(x) \ln \frac{p(x)}{q(x)}    (3)
with the conventions that $0 \ln \frac{0}{0} = 0$, $0 \ln \frac{0}{b} = 0$ and $a \ln \frac{a}{0} = \infty$.

We finish with the proof given by Cover and Thomas, 2006, p.28.

Theorem 3 (Gibbs' inequality). Let $p$ and $q$ be two discrete probability distributions over an alphabet $X$. The Kullback-Leibler divergence has the following property:
    KL(p \| q) \ge 0    (4)
with equality if and only if $p(x) = q(x)$ for all $x \in X$.

Proof. Let $A = \{x : p(x) > 0\}$ be the support of $p$. Then we have
    KL(p \| q) = \sum_{x \in A} p(x) \ln \frac{p(x)}{q(x)}    (5)
               = -\sum_{x \in A} p(x) \ln \frac{q(x)}{p(x)}    (6)
               \ge -\ln \sum_{x \in A} p(x) \frac{q(x)}{p(x)}    (7)
               = -\ln \sum_{x \in A} q(x)    (8)
               \ge -\ln \sum_{x \in X} q(x)    (9)
               = -\ln 1    (10)
               = 0    (11)
Since $\ln$ is strictly concave, we have equality in eq. (7) if and only if $q(x)/p(x)$ is constant, i.e., $p(x) = c\, q(x)$. Then, we have equality in eq. (9) if and only if $\sum_{x \in A} q(x) = \sum_{x \in X} q(x)$. Finally, with both equalities, we conclude that $c = 1$.
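The following sketch (our addition) implements Definition 2 directly and checks Gibbs' inequality on randomly drawn distributions; the alphabet size and the Dirichlet sampling are arbitrary choices.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) ln(p(x)/q(x)), with the convention 0 ln(0/b) = 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    support = p > 0  # restrict the sum to A = {x : p(x) > 0}
    return float(np.sum(p[support] * np.log(p[support] / q[support])))

rng = np.random.default_rng(0)
for _ in range(1000):
    p = rng.dirichlet(np.ones(10))  # random distribution over a 10-symbol alphabet
    q = rng.dirichlet(np.ones(10))
    assert kl_divergence(p, q) >= 0.0     # Theorem 3: KL(p || q) >= 0
assert abs(kl_divergence(p, p)) < 1e-12   # equality when p = q
```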
Problem 2 (Evidence Lower Bound (ELBO)). Prove the following inequality¹:
    \ln p(D) \ge E_{\theta \sim \beta}[\ln p(D \mid \theta)] - KL(\beta \| \alpha)    (12)
where $D$ is a dataset, $p(D)$ is the probability of the dataset, $p(D \mid \theta)$ is the likelihood of the dataset given the model parameters $\theta$, $\beta$ is a distribution over the model parameters approximating the posterior distribution $\pi(\theta) := p(\theta \mid D)$, and $\alpha$ is the prior distribution over the model parameters.

¹ Further information can be found at https://www.lri.fr/~bensadon/

(a) Write down the natural logarithm of Bayes' rule in an expanded form:
    \pi(\theta) = \frac{p(D \mid \theta)\, \alpha(\theta)}{p(D)}    (13)
By applying the properties of the logarithm, we obtain:
    0 = \ln p(D \mid \theta) + \ln \alpha(\theta) - \ln p(D) - \ln \pi(\theta)    (14)

(b) Introduce a new density function $\beta$ and rewrite the expression in terms of an expectation w.r.t. $\beta$:
    0 = \int \beta(\theta) \left( \ln p(D \mid \theta) + \ln \alpha(\theta) - \ln p(D) - \ln \pi(\theta) \right) d\theta    (15)
      = \int \beta(\theta) \ln p(D \mid \theta)\, d\theta + \int \beta(\theta) \ln \alpha(\theta)\, d\theta - \int \beta(\theta) \ln p(D)\, d\theta    (16)
        + \int \beta(\theta) \ln \beta(\theta)\, d\theta - \int \beta(\theta) \ln \beta(\theta)\, d\theta - \int \beta(\theta) \ln \pi(\theta)\, d\theta    (17)
      = \int \beta(\theta) \ln p(D \mid \theta)\, d\theta - \ln p(D) + \int \beta(\theta) \ln \frac{\alpha(\theta)}{\beta(\theta)}\, d\theta + \int \beta(\theta) \ln \frac{\beta(\theta)}{\pi(\theta)}\, d\theta    (18)
      = E_\beta[\ln p(D \mid \theta)] - \ln p(D) - KL(\beta \| \alpha) + KL(\beta \| \pi)    (19)
which implies
    \ln p(D) = E_\beta[\ln p(D \mid \theta)] - KL(\beta \| \alpha) + KL(\beta \| \pi)    (20)

(c) Use the Gibbs inequality and write down the ELBO. Since $KL(\beta \| \pi) \ge 0$, dropping it from eq. (20) gives
    \ln p(D) \ge E_\beta[\ln p(D \mid \theta)] - KL(\beta \| \alpha)    (21)

(d) Interpret the ELBO in a machine learning framework (cf. variational inference, Bishop, 2006, p.462).
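To make the identity in eq. (20) concrete, here is a small numerical check (our addition) on a model with a finite parameter space, where every quantity can be computed exactly; the sizes and random values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4                                  # finite parameter space: theta in {0, ..., K-1}
alpha = rng.dirichlet(np.ones(K))      # prior alpha(theta)
lik = rng.uniform(0.01, 1.0, size=K)   # likelihoods p(D | theta), arbitrary positive values
beta = rng.dirichlet(np.ones(K))       # any approximating distribution beta(theta)

evidence = np.sum(lik * alpha)         # p(D) = sum_theta p(D | theta) alpha(theta)
pi = lik * alpha / evidence            # posterior pi(theta) = p(theta | D)

kl = lambda p, q: float(np.sum(p * np.log(p / q)))
elbo = float(np.sum(beta * np.log(lik))) - kl(beta, alpha)

# Eq. (20): ln p(D) = ELBO + KL(beta || pi); eq. (21): ELBO <= ln p(D).
assert np.isclose(np.log(evidence), elbo + kl(beta, pi))
assert elbo <= np.log(evidence)
print(f"ln p(D) = {np.log(evidence):.4f}, ELBO = {elbo:.4f}")
```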
Problem 3 (Entropy). Compute the differential entropy of the following distributions:

(a) univariate Normal distribution
    N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{(x - \mu)^2}{2\sigma^2} \right]    (22)

By taking the natural logarithm of the normal distribution, we obtain
    \ln N(x \mid \mu, \sigma^2) = -\frac{1}{2} \ln(2\pi\sigma^2) - \frac{(x - \mu)^2}{2\sigma^2}    (23)
and with the definition of the differential entropy, we have
    Ent[N(\cdot \mid \mu, \sigma^2)] = -E[\ln N(x \mid \mu, \sigma^2)]    (24)
        = E\left[ \frac{1}{2} \ln(2\pi\sigma^2) \right] + \frac{1}{2\sigma^2} E[(x - \mu)^2]    (25)
        = \frac{1}{2} \ln(2\pi\sigma^2) + \frac{1}{2}    (26)
        = \frac{1}{2} \ln(2\pi e \sigma^2)    (27)

(b) multivariate Normal distribution
    N(x \mid \mu, C) = \frac{1}{\sqrt{(2\pi)^d |C|}} \exp\left[ -\frac{1}{2} (x - \mu)^T C^{-1} (x - \mu) \right]    (28)
where $x, \mu \in \mathbb{R}^d$ and $C$ is a covariance matrix (assumed to be symmetric positive-definite). By taking the natural logarithm, we obtain
    \ln N(x \mid \mu, C) = -\frac{d}{2} \ln(2\pi) - \frac{1}{2} \ln|C| - \frac{1}{2} (x - \mu)^T C^{-1} (x - \mu)    (29)
and with the definition of the differential entropy, we have
    Ent[N(\cdot \mid \mu, C)] = -E[\ln N(x \mid \mu, C)]    (30)
        = E\left[ \frac{d}{2} \ln(2\pi) + \frac{1}{2} \ln|C| + \frac{1}{2} (x - \mu)^T C^{-1} (x - \mu) \right]    (31)
        = \frac{d}{2} \ln(2\pi) + \frac{1}{2} \ln|C| + \frac{1}{2} E[(x - \mu)^T C^{-1} (x - \mu)]    (32)
        = \frac{d}{2} \ln(2\pi) + \frac{1}{2} \ln|C| + \frac{d}{2}    (33)
        = \frac{d}{2} (\ln(2\pi) + 1) + \frac{1}{2} \ln|C|    (34)
        = \frac{1}{2} \ln\left( (2\pi e)^d |C| \right)    (35)

Note that we have used the following identity for eq. (32):
    E[(x - \mu)^T C^{-1} (x - \mu)] = E[\operatorname{tr}((x - \mu)^T C^{-1} (x - \mu))]    (36)
        = E[\operatorname{tr}(C^{-1} (x - \mu)(x - \mu)^T)]    (37)
        = \operatorname{tr}(C^{-1} E[(x - \mu)(x - \mu)^T])    (38)
        = \operatorname{tr}(C^{-1} C)    (39)
        = d    (40)
where $\operatorname{tr}$ is the trace operator. In this identity, we first use the fact that the trace of a scalar is equal to the scalar. Then, we use the well-known cyclic property of the trace. Finally, since $\operatorname{tr}$ and $E$ are linear operators, we can exchange them.
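As a sanity check (our addition), the closed forms (27) and (35) can be compared against SciPy's built-in differential entropies, which are returned in nats; the parameter values below are arbitrary.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Univariate, eq. (27): Ent = (1/2) ln(2 pi e sigma^2).
sigma = 1.7
assert np.isclose(0.5 * np.log(2 * np.pi * np.e * sigma**2),
                  norm(loc=0.0, scale=sigma).entropy())

# Multivariate, eq. (35): Ent = (1/2) ln((2 pi e)^d |C|).
rng = np.random.default_rng(0)
d = 3
A = rng.standard_normal((d, d))
C = A @ A.T + d * np.eye(d)   # a symmetric positive-definite covariance matrix
assert np.isclose(0.5 * np.log((2 * np.pi * np.e) ** d * np.linalg.det(C)),
                  multivariate_normal(mean=np.zeros(d), cov=C).entropy())
print("closed-form entropies match SciPy")
```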
Problem 4 (Mutual information). We are interested in computing the mutual information between a multivariate Normal distribution $\beta = N(x \mid \mu, C)$, where $x, \mu \in \mathbb{R}^d$, and a product of identical univariate Normal distributions $\alpha = \prod_{i=1}^d N(x_i \mid \mu', \sigma^2)$.

(a) Express the KL divergence in terms of entropy and an expectation w.r.t. $\beta$:
    KL(\beta \| \alpha) = \int \beta(x) \ln\left[ \frac{\beta(x)}{\alpha(x)} \right] dx    (41)
        = -\int \beta(x) \left[ \ln \alpha(x) - \ln \beta(x) \right] dx    (42)
        = -E_{x \sim \beta}[\ln \alpha(x)] - Ent(\beta)    (43)
where $Ent(\beta) = -\int \beta(x) \ln \beta(x)\, dx$.

(b) Compute the exact expression of $E_{x \sim \beta}[\ln \alpha(x)]$. We have
    E_{x \sim \beta}[\ln \alpha(x)] = \sum_{i=1}^d E_{x \sim \beta}[\ln N(x_i \mid \mu', \sigma^2)]    (44)
        = \sum_{i=1}^d E_{x \sim \beta}\left[ -\frac{1}{2} \ln(2\pi\sigma^2) - \frac{(x_i - \mu')^2}{2\sigma^2} \right]    (45)
        = -\frac{d}{2} \ln(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^d E_{x_i \sim \beta_i}[(x_i - \mu')^2]    (46)
        = -\frac{d}{2} \ln(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^d \left( C_{ii} + (\mu_i - \mu')^2 \right)    (47)
where we have marginalized $\beta$ for each term of the sum in eq. (45) and where we have used:
    E_{x_i \sim \beta_i}[(x_i - \mu')^2] = E[(x_i - \mu_i + \mu_i - \mu')^2]    (48)
        = E[(x_i - \mu_i)^2 + 2(x_i - \mu_i)(\mu_i - \mu') + (\mu_i - \mu')^2]    (49)
        = E[(x_i - \mu_i)^2] + (\mu_i - \mu')^2    (50)
        = C_{ii} + (\mu_i - \mu')^2    (51)
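A quick Monte Carlo check of eq. (51) (our addition; the dimensions and parameter values are arbitrary):

```python
import numpy as np

# Monte Carlo check of eq. (51): E_{x ~ beta}[(x_i - mu')^2] = C_ii + (mu_i - mu')^2.
rng = np.random.default_rng(0)
d = 3
mu = rng.standard_normal(d)           # mean of beta
A = rng.standard_normal((d, d))
C = A @ A.T + d * np.eye(d)           # covariance of beta
mu_p = 0.5                            # shared scalar mean mu' of alpha

x = rng.multivariate_normal(mu, C, size=500_000)
empirical = np.mean((x - mu_p) ** 2, axis=0)    # per-coordinate second moment about mu'
closed_form = np.diag(C) + (mu - mu_p) ** 2
print(np.max(np.abs(empirical - closed_form)))  # should be close to 0
```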
(c) Compute $KL(\beta \| \alpha)$:
    KL(\beta \| \alpha) = -E_{x \sim \beta}[\ln \alpha(x)] - Ent(\beta)    (52)
        = \frac{d}{2} \ln(2\pi\sigma^2) + \frac{1}{2\sigma^2} \sum_{i=1}^d \left( C_{ii} + (\mu_i - \mu')^2 \right) - \frac{d}{2} (\ln(2\pi) + 1) - \frac{1}{2} \ln|C|    (53)
        = \frac{1}{2} \left[ d \ln(\sigma^2) - \ln|C| - d + \frac{1}{\sigma^2} \sum_{i=1}^d \left( C_{ii} + (\mu_i - \mu')^2 \right) \right]    (54)

(d) Suppose that $\mu_i = \mu'$ and $C_{ii} = \sigma^2$ for all $i$. Simplify the previous expression:
    KL(\beta \| \alpha) = \frac{1}{2} \left[ d \ln(\sigma^2) - \ln|C| \right]    (55)

(e) How could the mutual information appear in the ELBO?
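Similarly, the closed form in eq. (54) can be checked against a Monte Carlo estimate of $KL(\beta \| \alpha) = E_{x \sim \beta}[\ln \beta(x) - \ln \alpha(x)]$ (our addition; parameter values are arbitrary):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
d = 3
mu = rng.standard_normal(d)           # mean of beta
A = rng.standard_normal((d, d))
C = A @ A.T + d * np.eye(d)           # covariance of beta
mu_p, sigma2 = 0.5, 2.0               # mean mu' and variance sigma^2 of alpha

# Closed form, eq. (54):
closed_form = 0.5 * (d * np.log(sigma2) - np.log(np.linalg.det(C)) - d
                     + np.sum(np.diag(C) + (mu - mu_p) ** 2) / sigma2)

# Monte Carlo estimate of KL(beta || alpha) = E_{x ~ beta}[ln beta(x) - ln alpha(x)]:
beta = multivariate_normal(mean=mu, cov=C)
x = beta.rvs(size=200_000, random_state=rng)
log_alpha = (-0.5 * d * np.log(2 * np.pi * sigma2)
             - np.sum((x - mu_p) ** 2, axis=1) / (2 * sigma2))
print(closed_form, np.mean(beta.logpdf(x) - log_alpha))  # the two should agree closely
```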
References

Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning (Information Science and Statistics). Secaucus, NJ, USA: Springer-Verlag New York, Inc. ISBN: 0387310738.
Boyd, Stephen and Lieven Vandenberghe (2004). Convex Optimization. New York, NY, USA: Cambridge University Press. ISBN: 0521833787.
Cover, Thomas M. and Joy A. Thomas (2006). Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience. ISBN: 0471241954.

Programming exercises

Problem 5 (Text entropy). In the following, we are interested in estimating the entropy of different texts. We will work with the novel Crime and Punishment by Fyodor Dostoyevsky. Other books in different languages are also available². To do so, we compute the entropy of different models (a minimal starting-point sketch is given after the list):

² The chosen books are available at https://www.lri.fr/~marceau/courses/centraleml/texts.zip, thanks to the Gutenberg Project: https://www.gutenberg.org/

1. Compute the entropy of a model based on the frequency of each single symbol in the chosen book (i.i.d. model).
2. Use this model to compute the cross-entropy of the distribution from another book. Compare this value with the previous entropy by computing the KL divergence.
3. Compute the entropy of a model based on the frequency of pairs of symbols, and compare it with the previous model. Explain the difference.
4. Compute the entropy rate of a Markov chain where each state is a symbol, and transition probabilities are estimated from the chosen book.
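A minimal sketch for exercises 1 and 4, assuming the book has been downloaded and saved as a UTF-8 text file (the filename below is hypothetical). Frequencies are estimated by simple counting, and entropies are reported in bits per symbol; use np.log instead of np.log2 for nats, to match the handout's convention.

```python
import numpy as np
from collections import Counter

def iid_entropy(text):
    """Exercise 1: entropy (bits/symbol) of the i.i.d. single-symbol frequency model."""
    counts = Counter(text)
    p = np.array(list(counts.values()), dtype=float) / len(text)
    return float(-np.sum(p * np.log2(p)))

def markov_entropy_rate(text):
    """Exercise 4: entropy rate (bits/symbol) of a first-order Markov chain over symbols,
    H = -sum_{s,t} p(s,t) log2 p(t|s), with probabilities estimated from bigram counts."""
    bigrams = Counter(zip(text, text[1:]))
    starts = Counter(text[:-1])        # empirical distribution of the current state
    h = 0.0
    for (s, t), n in bigrams.items():
        p_joint = n / (len(text) - 1)  # estimate of the joint probability p(s, t)
        p_trans = n / starts[s]        # estimate of the transition probability p(t | s)
        h -= p_joint * np.log2(p_trans)
    return h

# Hypothetical filename: any plain-text book from the course archive works.
with open("crime_and_punishment.txt", encoding="utf-8") as f:
    text = f.read()

print(f"i.i.d. model:       {iid_entropy(text):.3f} bits/symbol")
print(f"Markov chain model: {markov_entropy_rate(text):.3f} bits/symbol")
```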