Foundations of Machine Learning II TP1: Entropy


Guillaume Charpiat (Teacher) & Gaétan Marceau Caron (Scribe)
Course page: https://www.lri.fr/~gcharpiat/machinelearningcourse/

Problem 1 (Gibbs' inequality). Let $p$ and $q$ be two probability measures over a finite alphabet $\mathcal{X}$. Prove that
\[ \mathrm{KL}(p \,\|\, q) \ge 0. \]
Hint: for a concave function $f$ and a random variable $X$, Jensen's inequality gives $\mathbb{E}[f(X)] \le f(\mathbb{E}[X])$, and $\ln$ is a strictly concave function.

Solution: We start by reviewing some useful results from Boyd and Vandenberghe, 2004 and Cover and Thomas, 2006.

Definition 1. A function $f : \mathbb{R}^n \to \mathbb{R}$ is convex if $\mathrm{dom}\, f$ is a convex set and if for all $x, y \in \mathrm{dom}\, f$ and $\theta \in [0, 1]$ we have
\[ f(\theta x + (1 - \theta) y) \le \theta f(x) + (1 - \theta) f(y). \tag{1} \]
The function is strictly convex when, for $x \ne y$, the inequality in eq. (1) is strict for all $\theta \in (0, 1)$. Also, a function $f$ is concave if $-f$ is convex.

Proving that a given function $f$ is convex directly from this definition is usually hard. When $f$ is twice differentiable, it is easier to use the second-order condition:

Proposition 1. Let $f$ be a twice-differentiable function, that is, its Hessian or second derivative $\nabla^2 f$ exists at each point in $\mathrm{dom}\, f$, which is open. Then $f$ is convex if and only if $\mathrm{dom}\, f$ is a convex set and $\nabla^2 f$ is positive semidefinite, i.e., for all $x \in \mathrm{dom}\, f$ and all $z \in \mathbb{R}^n$ we have $z^\top \nabla^2 f(x)\, z \ge 0$. Moreover, the function is strictly convex if $\nabla^2 f$ is positive definite.

Proof. Use a Taylor expansion (cf. Cover and Thomas, 2006, p. 26 for the univariate case).

By applying this result (the second derivative of $\ln$ is $-1/x^2 < 0$ on $(0, \infty)$), we easily find that $\ln$ is a strictly concave function. Now, we state Jensen's inequality, required in the proof of Gibbs' inequality.

Theorem 2 (Jensen's inequality). If $f$ is a convex function and $X$ is a random variable, then
\[ \mathbb{E}[f(X)] \ge f(\mathbb{E}[X]). \tag{2} \]

Proof. By induction on the number of mass points (cf. Cover and Thomas, 2006, p. 27).
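As a quick sanity check (an illustration, not part of the original TP), the following Python snippet confirms Jensen's inequality for the concave $\ln$: $\mathbb{E}[\ln X] \le \ln \mathbb{E}[X]$. The choice of a log-normal $X$ is arbitrary; any positive random variable works.

    import numpy as np

    rng = np.random.default_rng(0)
    # An arbitrary positive random variable X: here, log-normal samples.
    x = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)

    lhs = np.mean(np.log(x))   # E[ln X]
    rhs = np.log(np.mean(x))   # ln E[X]
    print(lhs, rhs)            # lhs <= rhs, since ln is concave
    assert lhs <= rhs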

Finally, we recall the definition of the Kullback-Leibler divergence:

Definition 2. Let $p$ and $q$ be two discrete probability distributions over an alphabet $\mathcal{X}$. The Kullback-Leibler divergence is defined as
\[ \mathrm{KL}(p \,\|\, q) = \sum_{x \in \mathcal{X}} p(x) \ln \frac{p(x)}{q(x)}, \tag{3} \]
with the conventions that $0 \ln \frac{0}{0} = 0$, $0 \ln \frac{0}{b} = 0$ and $a \ln \frac{a}{0} = \infty$.

We finish with the proof given by Cover and Thomas, 2006, p. 28.

Theorem 3 (Gibbs' inequality). Let $p$ and $q$ be two discrete probability distributions over an alphabet $\mathcal{X}$. The Kullback-Leibler divergence has the following property:
\[ \mathrm{KL}(p \,\|\, q) \ge 0, \tag{4} \]
with equality if and only if $p(x) = q(x)$ for all $x \in \mathcal{X}$.

Proof. Let $A = \{x : p(x) > 0\}$. Then we have
\begin{align}
-\mathrm{KL}(p \,\|\, q) &= -\sum_{x \in A} p(x) \ln \frac{p(x)}{q(x)} \tag{5} \\
&= \sum_{x \in A} p(x) \ln \frac{q(x)}{p(x)} \tag{6} \\
&\le \ln \sum_{x \in A} p(x) \frac{q(x)}{p(x)} \tag{7} \\
&= \ln \sum_{x \in A} q(x) \tag{8} \\
&\le \ln \sum_{x \in \mathcal{X}} q(x) \tag{9} \\
&= \ln 1 \tag{10} \\
&= 0. \tag{11}
\end{align}
Since $\ln$ is strictly concave, we have equality in eq. (7) iff $q(x)/p(x)$ is constant on $A$, i.e., $q(x) = c\, p(x)$. Then, we have equality in eq. (9) iff $\sum_{x \in A} q(x) = \sum_{x \in \mathcal{X}} q(x)$. Finally, with both equalities, we get $c = 1$.
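The inequality is easy to confirm numerically (again an illustration, not part of the original TP): for any two distributions on a finite alphabet, the divergence below is non-negative, and it vanishes exactly when the distributions coincide. The two distributions are arbitrary examples.

    import numpy as np

    def kl(p, q):
        """KL(p || q) for discrete distributions, with the convention 0 ln 0 = 0."""
        p, q = np.asarray(p, float), np.asarray(q, float)
        mask = p > 0                 # terms with p(x) = 0 contribute 0
        return np.sum(p[mask] * np.log(p[mask] / q[mask]))

    p = np.array([0.5, 0.25, 0.25])
    q = np.array([0.4, 0.4, 0.2])
    print(kl(p, q))   # > 0
    print(kl(p, p))   # = 0: equality iff p = q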

Problem 2 (Evidence Lower Bound (ELBO)). Prove the following inequality¹:
\[ \ln p(D) \ge \mathbb{E}_{\theta \sim \beta}[\ln p(D \mid \theta)] - \mathrm{KL}(\beta \,\|\, \alpha), \tag{12} \]
where $D$ is a dataset, $p(D)$ is the probability of the dataset (the evidence), $p(D \mid \theta)$ is the likelihood of the dataset given the model parameters $\theta$, $\beta$ is a distribution over the model parameters approximating the posterior distribution $\pi(\theta) := p(\theta \mid D)$, and $\alpha$ is the prior distribution over the model parameters.

¹ Further information can be found at https://www.lri.fr/~bensadon/

(a) Write down the natural logarithm of Bayes' rule in expanded form. Starting from
\[ \pi(\theta) = \frac{p(D \mid \theta)\, \alpha(\theta)}{p(D)} \tag{13} \]
and applying the properties of the logarithm, we obtain
\[ 0 = \ln p(D \mid \theta) + \ln \alpha(\theta) - \ln p(D) - \ln \pi(\theta). \tag{14} \]

(b) Introduce a new density function $\beta$ and rewrite the expression in terms of an expectation w.r.t. $\beta$:
\begin{align}
0 &= \int \beta(\theta) \left( \ln p(D \mid \theta) + \ln \alpha(\theta) - \ln p(D) - \ln \pi(\theta) \right) d\theta \tag{15} \\
&= \int \beta(\theta) \ln p(D \mid \theta)\, d\theta + \int \beta(\theta) \ln \alpha(\theta)\, d\theta - \int \beta(\theta) \ln p(D)\, d\theta \tag{16} \\
&\quad + \int \beta(\theta) \ln \beta(\theta)\, d\theta - \int \beta(\theta) \ln \beta(\theta)\, d\theta - \int \beta(\theta) \ln \pi(\theta)\, d\theta \tag{17} \\
&= \int \beta(\theta) \ln p(D \mid \theta)\, d\theta - \ln p(D) - \int \beta(\theta) \ln \frac{\beta(\theta)}{\alpha(\theta)}\, d\theta + \int \beta(\theta) \ln \frac{\beta(\theta)}{\pi(\theta)}\, d\theta \tag{18} \\
&= \mathbb{E}_\beta[\ln p(D \mid \theta)] - \ln p(D) - \mathrm{KL}(\beta \,\|\, \alpha) + \mathrm{KL}(\beta \,\|\, \pi), \tag{19}
\end{align}
which implies
\[ \ln p(D) = \mathbb{E}_\beta[\ln p(D \mid \theta)] - \mathrm{KL}(\beta \,\|\, \alpha) + \mathrm{KL}(\beta \,\|\, \pi). \tag{20} \]

(c) Use Gibbs' inequality and write down the ELBO. Since $\mathrm{KL}(\beta \,\|\, \pi) \ge 0$, dropping it from eq. (20) gives
\[ \ln p(D) \ge \mathbb{E}_\beta[\ln p(D \mid \theta)] - \mathrm{KL}(\beta \,\|\, \alpha). \tag{21} \]

(d) Interpret the ELBO in a machine learning framework (cf. variational inference, Bishop, 2006, p. 462): maximizing the right-hand side of eq. (21) over $\beta$ trades off fitting the data (the expected log-likelihood term) against staying close to the prior (the KL term), and by eq. (20) the gap to $\ln p(D)$ is exactly $\mathrm{KL}(\beta \,\|\, \pi)$, so the maximizer is the best approximation of the posterior within the chosen family.
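The identity in eq. (20) can be checked numerically on a toy model (an illustration, not part of the original TP): a Bernoulli likelihood over a discrete grid of parameter values, a uniform prior $\alpha$, and an arbitrary approximating distribution $\beta$. The grid, the data and the shape of $\beta$ below are all arbitrary choices.

    import numpy as np

    def kl(p, q):
        mask = p > 0
        return np.sum(p[mask] * np.log(p[mask] / q[mask]))

    thetas = np.linspace(0.1, 0.9, 9)              # discrete grid of Bernoulli parameters
    alpha = np.full(thetas.size, 1 / thetas.size)  # uniform prior
    D = np.array([1, 0, 1, 1, 0, 1])               # observed coin flips

    # Log-likelihood ln p(D | theta) for each theta on the grid.
    loglik = D.sum() * np.log(thetas) + (len(D) - D.sum()) * np.log(1 - thetas)

    evidence = np.sum(alpha * np.exp(loglik))      # p(D)
    post = alpha * np.exp(loglik) / evidence       # posterior pi(theta)

    # Any distribution beta over theta works; here a bump centered at 0.5.
    beta = np.exp(-((thetas - 0.5) ** 2) / 0.02)
    beta /= beta.sum()

    lhs = np.log(evidence)
    rhs = beta @ loglik - kl(beta, alpha) + kl(beta, post)
    print(lhs, rhs)   # equal up to floating-point error; dropping
                      # KL(beta || post) leaves the lower bound of eq. (21)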

Problem 3 (Entropy). Compute the differential entropy of the following distributions:

(a) univariate Normal distribution
\[ \mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{(x - \mu)^2}{2\sigma^2} \right]. \tag{22} \]

By taking the natural logarithm of the normal distribution, we obtain
\[ \ln \mathcal{N}(x \mid \mu, \sigma^2) = -\ln \sqrt{2\pi\sigma^2} - \frac{(x - \mu)^2}{2\sigma^2}, \tag{23} \]
and with the definition of differential entropy, we have
\begin{align}
\mathrm{Ent}\left[ \mathcal{N}(\cdot \mid \mu, \sigma^2) \right] &= \mathbb{E}\left[ -\ln \mathcal{N}(x \mid \mu, \sigma^2) \right] \tag{24} \\
&= \mathbb{E}\left[ \ln \sqrt{2\pi\sigma^2} \right] + \frac{1}{2\sigma^2} \mathbb{E}\left[ (x - \mu)^2 \right] \tag{25} \\
&= \ln \sqrt{2\pi\sigma^2} + \frac{1}{2} \tag{26} \\
&= \ln \sqrt{2\pi e \sigma^2}. \tag{27}
\end{align}

(b) multivariate Normal distribution
\[ \mathcal{N}(x \mid \mu, C) = \frac{1}{\sqrt{(2\pi)^d |C|}} \exp\left[ -\frac{1}{2} (x - \mu)^\top C^{-1} (x - \mu) \right], \tag{28} \]
where $x, \mu \in \mathbb{R}^d$ and $C$ is a covariance matrix (assumed to be symmetric positive-definite). By taking the natural logarithm, we obtain
\[ \ln \mathcal{N}(x \mid \mu, C) = -\frac{d}{2} \ln(2\pi) - \frac{1}{2} \ln |C| - \frac{1}{2} (x - \mu)^\top C^{-1} (x - \mu), \tag{29} \]
and with the definition of differential entropy, we have
\begin{align}
\mathrm{Ent}\left[ \mathcal{N}(\cdot \mid \mu, C) \right] &= \mathbb{E}\left[ -\ln \mathcal{N}(x \mid \mu, C) \right] \tag{30} \\
&= \mathbb{E}\left[ \frac{d}{2} \ln(2\pi) + \frac{1}{2} \ln |C| + \frac{1}{2} (x - \mu)^\top C^{-1} (x - \mu) \right] \tag{31} \\
&= \frac{d}{2} \ln(2\pi) + \frac{1}{2} \ln |C| + \frac{1}{2} \mathbb{E}\left[ (x - \mu)^\top C^{-1} (x - \mu) \right] \tag{32} \\
&= \frac{d}{2} \ln(2\pi) + \frac{1}{2} \ln |C| + \frac{d}{2} \tag{33} \\
&= \frac{d}{2} \left( \ln(2\pi) + 1 \right) + \frac{1}{2} \ln |C| \tag{34} \\
&= \ln \sqrt{(2\pi e)^d |C|}. \tag{35}
\end{align}
Note that we have used the following identity for eq. (32):
\begin{align}
\mathbb{E}\left[ (x - \mu)^\top C^{-1} (x - \mu) \right] &= \mathbb{E}\left[ \mathrm{tr}\left( (x - \mu)^\top C^{-1} (x - \mu) \right) \right] \tag{36} \\
&= \mathbb{E}\left[ \mathrm{tr}\left( C^{-1} (x - \mu)(x - \mu)^\top \right) \right] \tag{37} \\
&= \mathrm{tr}\left( C^{-1}\, \mathbb{E}\left[ (x - \mu)(x - \mu)^\top \right] \right) \tag{38} \\
&= \mathrm{tr}\left( C^{-1} C \right) \tag{39} \\
&= d, \tag{40}
\end{align}
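A quick numerical confirmation of eq. (35) (an illustration, not part of the original TP; the covariance below is an arbitrary positive-definite example): scipy's multivariate normal exposes its differential entropy, which should match $\ln \sqrt{(2\pi e)^d |C|}$.

    import numpy as np
    from scipy.stats import multivariate_normal

    d = 3
    mu = np.zeros(d)
    C = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])   # symmetric positive-definite covariance

    closed_form = 0.5 * np.log((2 * np.pi * np.e) ** d * np.linalg.det(C))
    print(closed_form)
    print(multivariate_normal(mean=mu, cov=C).entropy())   # same value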

where $\mathrm{tr}$ is the trace operator. In this identity, we use the fact that the trace of a scalar is equal to the scalar; then we use the cyclic property of the trace; finally, since $\mathrm{tr}$ and $\mathbb{E}$ are linear operators, we can exchange them.

Problem 4 (Mutual information). We are interested in computing the mutual information between a multivariate Normal distribution $\beta = \mathcal{N}(x \mid \mu, C)$, where $x, \mu \in \mathbb{R}^d$, and a product of univariate Normal distributions with identical variance, $\alpha = \prod_{i=1}^d \mathcal{N}(x_i \mid \mu'_i, \sigma^2)$.

(a) Express the KL divergence in terms of entropy and an expectation w.r.t. $\beta$:
\begin{align}
\mathrm{KL}(\beta \,\|\, \alpha) &= \int \beta(x) \ln \frac{\beta(x)}{\alpha(x)}\, dx \tag{41} \\
&= \int \beta(x) \left[ \ln \beta(x) - \ln \alpha(x) \right] dx \tag{42} \\
&= -\mathbb{E}_{x \sim \beta}[\ln \alpha(x)] - \mathrm{Ent}(\beta), \tag{43}
\end{align}
where $\mathrm{Ent}(\beta) = -\int \beta(x) \ln \beta(x)\, dx$.

(b) Compute the exact expression of $\mathbb{E}_{x \sim \beta}[\ln \alpha(x)]$. We have
\begin{align}
\mathbb{E}_{x \sim \beta}[\ln \alpha(x)] &= \mathbb{E}_{x \sim \beta}\left[ \sum_{i=1}^d \ln \mathcal{N}(x_i \mid \mu'_i, \sigma^2) \right] \tag{44} \\
&= -\frac{d}{2} \ln(2\pi\sigma^2) - \sum_{i=1}^d \mathbb{E}_{x \sim \beta}\left[ \frac{(x_i - \mu'_i)^2}{2\sigma^2} \right] \tag{45} \\
&= -\frac{d}{2} \ln(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^d \mathbb{E}_{x \sim \beta}\left[ (x_i - \mu'_i)^2 \right] \tag{46} \\
&= -\frac{d}{2} \ln(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^d \left( C_{ii} + (\mu_i - \mu'_i)^2 \right), \tag{47}
\end{align}
where we have marginalized $\beta$ for each term of the sum in eq. (45) and where we have used:
\begin{align}
\mathbb{E}_{x \sim \beta}\left[ (x_i - \mu'_i)^2 \right] &= \mathbb{E}_{x \sim \beta}\left[ (x_i - \mu_i + \mu_i - \mu'_i)^2 \right] \tag{48} \\
&= \mathbb{E}_{x \sim \beta}\left[ (x_i - \mu_i)^2 + 2 (x_i - \mu_i)(\mu_i - \mu'_i) + (\mu_i - \mu'_i)^2 \right] \tag{49} \\
&= \mathbb{E}_{x \sim \beta}\left[ (x_i - \mu_i)^2 \right] + (\mu_i - \mu'_i)^2 \tag{50} \\
&= C_{ii} + (\mu_i - \mu'_i)^2. \tag{51}
\end{align}

(c) Compute $\mathrm{KL}(\beta \,\|\, \alpha)$:
\begin{align}
\mathrm{KL}(\beta \,\|\, \alpha) &= -\mathbb{E}_{x \sim \beta}[\ln \alpha(x)] - \mathrm{Ent}(\beta) \tag{52} \\
&= \frac{d}{2} \ln(2\pi\sigma^2) + \frac{1}{2\sigma^2} \sum_{i=1}^d \left( C_{ii} + (\mu_i - \mu'_i)^2 \right) - \frac{d}{2} \left( \ln(2\pi) + 1 \right) - \frac{1}{2} \ln |C| \tag{53} \\
&= \frac{1}{2} \left[ d \ln(\sigma^2) - \ln |C| - d + \frac{1}{\sigma^2} \sum_{i=1}^d \left( C_{ii} + (\mu_i - \mu'_i)^2 \right) \right]. \tag{54}
\end{align}

(d) Suppose that $\mu'_i = \mu_i$ and $C_{ii} = \sigma^2$ for all $i$. Simplify the previous expression:
\[ \mathrm{KL}(\beta \,\|\, \alpha) = \frac{1}{2} \left[ d \ln(\sigma^2) - \ln |C| \right]. \tag{55} \]
Note that under these assumptions $\alpha$ has the same marginals as $\beta$, and eq. (55) is exactly $\sum_i \mathrm{Ent}(\beta_i) - \mathrm{Ent}(\beta)$, the total correlation (the multivariate generalization of mutual information) between the components of $\beta$.

(e) How could the mutual information appear in the ELBO?
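The closed form in eq. (54) can be cross-checked against a Monte Carlo estimate of $\mathrm{KL}(\beta \,\|\, \alpha)$ (an illustration with arbitrary parameter values, not part of the original TP):

    import numpy as np
    from scipy.stats import multivariate_normal, norm

    rng = np.random.default_rng(0)
    d = 3
    mu = np.array([0.0, 1.0, -1.0])
    C = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
    mu_p = np.zeros(d)   # means mu'_i of alpha
    sigma2 = 2.0         # common variance of alpha

    # Closed form, eq. (54); note sum_i C_ii = trace(C).
    kl_closed = 0.5 * (d * np.log(sigma2) - np.log(np.linalg.det(C)) - d
                       + (np.trace(C) + np.sum((mu - mu_p) ** 2)) / sigma2)

    # Monte Carlo estimate of E_beta[ln beta(x) - ln alpha(x)].
    x = rng.multivariate_normal(mu, C, size=200_000)
    ln_beta = multivariate_normal(mu, C).logpdf(x)
    ln_alpha = norm(mu_p, np.sqrt(sigma2)).logpdf(x).sum(axis=1)
    kl_mc = np.mean(ln_beta - ln_alpha)

    print(kl_closed, kl_mc)   # agree up to Monte Carlo error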

References

Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning (Information Science and Statistics). Secaucus, NJ, USA: Springer-Verlag New York, Inc. ISBN: 0387310738.
Boyd, Stephen and Lieven Vandenberghe (2004). Convex Optimization. New York, NY, USA: Cambridge University Press. ISBN: 0521833787.
Cover, Thomas M. and Joy A. Thomas (2006). Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience. ISBN: 0471241954.

Programming exercises

Problem 5 (Text entropy). In the following, we are interested in estimating the entropy of different texts. We will work with the novel Crime and Punishment by Fyodor Dostoyevsky; other books in different languages are also available². To do so, we compute the entropy of different models:

² The chosen books are available at https://www.lri.fr/~marceau/courses/centraleml/texts.zip, thanks to the Gutenberg Project, https://www.gutenberg.org/

1. Compute the entropy of a model based on the frequency of each single symbol in the chosen book (i.i.d. model).
2. Use this model to compute the cross-entropy of the distribution from another book. Compare this value with the previous entropy by computing the KL divergence.
3. Compute the entropy of a model based on the frequency of pairs of symbols, and compare it with the previous model. Explain the difference.
4. Compute the entropy rate of a Markov chain where each state is a symbol, and transition probabilities are estimated from the chosen book.
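A possible starting point for exercises 1-4 (a sketch, not a reference solution; the file names are placeholders for whichever Gutenberg texts you downloaded):

    from collections import Counter
    import numpy as np

    def char_probs(text):
        """Empirical symbol distribution (i.i.d. model)."""
        counts = Counter(text)
        total = sum(counts.values())
        return {c: n / total for c, n in counts.items()}

    def entropy(p):
        """Entropy in nats: -sum_x p(x) ln p(x)."""
        return -sum(v * np.log(v) for v in p.values())

    def cross_entropy(p, q):
        """-sum_x p(x) ln q(x); strictly, infinite if q misses a symbol seen under p."""
        return -sum(v * np.log(q[c]) for c, v in p.items() if c in q)

    text = open("crime_and_punishment.txt", encoding="utf-8").read()

    # 1. i.i.d. single-symbol model.
    p = char_probs(text)
    print("H(single symbols):", entropy(p))

    # 2. Cross-entropy against another book's distribution, e.g.:
    # q = char_probs(open("another_book.txt", encoding="utf-8").read())
    # print("H(p, q):", cross_entropy(p, q), "KL:", cross_entropy(p, q) - entropy(p))

    # 3. Pair-of-symbols model; per-symbol entropy is H(pairs) / 2.
    pairs = char_probs([text[i:i + 2] for i in range(len(text) - 1)])
    print("H(pairs) / 2:", entropy(pairs) / 2)

    # 4. Markov chain entropy rate: H = -sum_s p(s) sum_t P(t|s) ln P(t|s),
    #    estimated from empirical transition counts.
    trans = Counter(zip(text, text[1:]))   # counts of (s, t) transitions
    out = Counter(text[:-1])               # outgoing counts per state s
    rate = -sum(n * np.log(n / out[s]) for (s, t), n in trans.items()) / (len(text) - 1)
    print("Markov entropy rate:", rate)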