COS 511: Theoretical Machine Learning
Lecturer: Rob Schapire    Lecture #21
Scribe: Lawrence Diao    April 23, 2013


1 On-Line Log Loss

To recap the end of the last lecture, we have the following on-line problem with N experts. For each round t = 1, ..., T:

- each expert i predicts p_{t,i}, a distribution on X
- the master predicts q_t, a distribution on X
- we observe x_t ∈ X
- loss = -ln q_t(x_t)

We want to bound the total loss of the master q_t in comparison to the best expert:

$$\sum_{t=1}^T -\log q_t(x_t) \;\le\; \min_i \sum_{t=1}^T -\log p_{t,i}(x_t) + \text{small}, \qquad (1)$$

where here we use the general log function of arbitrary base. We will see that this on-line log loss setting manifests itself in many applications, such as horse racing and coding theory.

2 Coding Theory

Here we are concerned with how to efficiently send a message from Alice to Bob in as few bits as possible. In this setting we define X as the alphabet, and each x ∈ X as a letter. Say Alice wants to send one letter x. Define p(x) to be the probability of sending x, which you can estimate from a corpus. The best you can do is to take -lg p(x) bits to send x, where lg denotes the base-2 logarithm (a worked example appears at the end of this section).

Now Alice is trying to send a sequence of letters x_1, x_2, x_3, .... One way to do this is to use p(x) for each letter separately, but this is sub-optimal for English. For example, if we see the string of characters "I am go", we can easily predict the next letter to be "i" given the context, but if we simply use p(x), then we might say that "e" is the most likely, since it is the letter of highest frequency in the English language. Our goal is to use the context so as to encode x with fewer bits.

If we define p_t(x_t) to be the probability distribution of x_t given the context x_1^{t-1} = x_1, ..., x_{t-1}, then it takes -lg p_t(x_t) bits to encode the extra letter x_t. However, it is really hard to model this probability; you cannot get it just by counting, as we could with p(x). Instead, we consider combining a collection of coding methods when we do not know which one will be best. Say we have N coding methods (N experts). We try to pick a master coding method that uses at most a small number of bits more than the best coding method.
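To make the -lg p(x) rule concrete, here is a standard worked example (added for illustration, not from the lecture). Suppose X = {a, b, c, d} with p(a) = 1/2, p(b) = 1/4, p(c) = p(d) = 1/8. Then the prefix code

$$a \mapsto 0, \qquad b \mapsto 10, \qquad c \mapsto 110, \qquad d \mapsto 111$$

spends exactly -lg p(x) bits on each letter (1, 2, 3, and 3 bits, respectively), so the expected length is (1/2)(1) + (1/4)(2) + (1/8)(3) + (1/8)(3) = 1.75 bits per letter, which matches the entropy of p.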

Let p_{t,i}(x_t) = probability of x_t given x_1^{t-1} according to the i-th coding method. So we have:

- -lg p_{t,i}(x_t) = bits used by the i-th coding method
- -lg q_t(x_t) = bits used by an arbitrary coding method q_t

We are trying to come up with a coding method q_t(x_t) that guarantees

$$\sum_{t=1}^T -\lg q_t(x_t) \;\le\; \min_i \sum_{t=1}^T -\lg p_{t,i}(x_t) + \text{small}.$$

Such an algorithm is called a universal compression algorithm, since it works about as well as the best coding method for any input. Note that the bound should hold for any sequence of x_t's, so there is no assumption of randomness on the x_t. Also note that this bound is of the form of (1).

3 Universal Compression Algorithm

In this section we derive the algorithm for choosing the master coding method. To make the math cleaner, we change the base back to e and try to achieve the following bound:

$$\sum_{t=1}^T -\ln q_t(x_t) \;\le\; \min_i \sum_{t=1}^T -\ln p_{t,i}(x_t) + \text{small}.$$

We also make the following notation changes:

$$q_t(x_t) \to q(x_t \mid x_1^{t-1}), \qquad p_{t,i}(x_t) \to p_i(x_t \mid x_1^{t-1}).$$

Let's pretend that the x_t are random, even though they are not, in order to motivate an algorithm for picking q. Pretend that the x_t are generated as follows:

- select one expert i* with Pr[i* = i] = 1/N
- x_1, x_2, ... are generated according to i*:

$$\Pr[x_1 \mid i^* = i] = p_i(x_1), \quad \Pr[x_2 \mid x_1, i^* = i] = p_i(x_2 \mid x_1), \quad \ldots, \quad \Pr[x_t \mid x_1^{t-1}, i^* = i] = p_i(x_t \mid x_1^{t-1}).$$
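To spell out the joint distribution that this story implies (an intermediate step added here; the Bayes computations below use it implicitly):

$$\Pr[i^* = i,\; x_1^t] \;=\; \frac{1}{N} \prod_{s=1}^{t} p_i(x_s \mid x_1^{s-1}).$$

Summing over i gives the marginal Pr[x_1^t], and the ratio of the two is exactly the posterior weight w_{t+1,i} computed next.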

Then the most natural way to pick q is:

$$q(x_t \mid x_1^{t-1}) = \Pr[x_t \mid x_1^{t-1}] = \sum_i \Pr[x_t,\, i^* = i \mid x_1^{t-1}] \qquad \text{(marginalize)}$$
$$= \sum_i \Pr[i^* = i \mid x_1^{t-1}]\, \Pr[x_t \mid i^* = i,\, x_1^{t-1}] \qquad \text{(conditional probability)}$$
$$= \sum_i w_{t,i}\, p_i(x_t \mid x_1^{t-1}), \qquad \text{where } w_{t,i} = \Pr[i^* = i \mid x_1^{t-1}].$$

If we can find these w_{t,i}, then we have an algorithm:

$$w_{1,i} = \Pr[i^* = i] = \frac{1}{N}$$
$$w_{t+1,i} = \Pr[i^* = i \mid x_1^t] = \Pr[i^* = i \mid x_1^{t-1}, x_t] = \frac{\Pr[i^* = i \mid x_1^{t-1}]\, \Pr[x_t \mid i^* = i,\, x_1^{t-1}]}{\text{normalization}} \qquad \text{(Bayes' rule)}$$
$$= \frac{w_{t,i}\, p_i(x_t \mid x_1^{t-1})}{\text{normalization}}.$$

So we are left with the following algorithm:

- Initialization: for all i, w_{1,i} = 1/N.
- On round t:
  - choose q(x_t | x_1^{t-1}) = Σ_i w_{t,i} p_i(x_t | x_1^{t-1})
  - update the weights: for all i, w_{t+1,i} = w_{t,i} p_i(x_t | x_1^{t-1}) / normalization.

This weight update is very similar to other weight-update on-line learning algorithms we have seen in the past, except that we do not have to tune β, since there is only one correct choice, β = e^{-1}, in this case: with loss_i = -ln p_{t,i}(x_t) and β = e^{-1},

$$\beta^{\text{loss}_i} = e^{\ln p_{t,i}(x_t)} = p_{t,i}(x_t), \qquad \text{so} \qquad w_{t+1,i} \propto w_{t,i}\, \beta^{\text{loss}_i} = w_{t,i}\, p_{t,i}(x_t),$$

which is exactly the Bayes update above.
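Here is a minimal Python sketch of the algorithm just derived (added for illustration, not from the lecture; the interface, where each expert is a function returning p_i(x | history), is an assumption made for this sketch):

```python
import numpy as np

def bayes_mixture_loss(experts, xs):
    """Run the Bayes mixture online. experts[i](x, history) must return
    p_i(x | history); xs is the observed sequence x_1, ..., x_T.
    Returns the total log loss sum_t -ln q(x_t | x_1^{t-1})."""
    N = len(experts)
    w = np.full(N, 1.0 / N)      # initialization: w_{1,i} = 1/N
    total_loss = 0.0
    history = []
    for x in xs:
        probs = np.array([p(x, history) for p in experts])
        q = float(w @ probs)     # q(x_t | x_1^{t-1}) = sum_i w_{t,i} p_i(x_t | ...)
        total_loss -= np.log(q)
        w = w * probs            # Bayes update: w_{t+1,i} proportional to w_{t,i} p_i(x_t | ...)
        w /= w.sum()             # normalization
        history.append(x)
    return total_loss
```

For instance, a context-free Bernoulli expert with bias p can be passed in as `lambda x, h, p=p: p if x == 1 else 1 - p`.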

4 Bounding the Log Loss

Here we are trying to prove (1), given our choice of q(x_t | x_1^{t-1}) = Σ_i w_{t,i} p_i(x_t | x_1^{t-1}).

Theorem:
$$\sum_{t=1}^T -\log q_t(x_t) \;\le\; \min_i \sum_{t=1}^T -\log p_{t,i}(x_t) + \log N.$$

Proof: Define

$$q(x_1^T) = q(x_1)\, q(x_2 \mid x_1)\, q(x_3 \mid x_1, x_2) \cdots = \prod_{t=1}^T q(x_t \mid x_1^{t-1}) = \Pr[x_1^T] \qquad \text{(chain rule)}$$

In the same way, we can do this with each expert:

$$p_i(x_1^T) = \prod_{t=1}^T p_i(x_t \mid x_1^{t-1}) = \Pr[x_1^T \mid i^* = i].$$

Additionally, the total loss of our algorithm is given by:

$$\sum_{t=1}^T -\log q_t(x_t) = -\log \prod_{t=1}^T q(x_t \mid x_1^{t-1}) = -\log q(x_1^T),$$

and similarly, for any expert i, Σ_t -log p_{t,i}(x_t) = -log p_i(x_1^T). So we have the following bound:

$$q(x_1^T) = \Pr[x_1^T] = \sum_i \Pr[i^* = i]\, \Pr[x_1^T \mid i^* = i] = \sum_i \frac{1}{N}\, p_i(x_1^T) \qquad \text{(marginalize)}$$

and, for any fixed expert i, dropping all other (nonnegative) terms of the sum gives q(x_1^T) ≥ (1/N) p_i(x_1^T). Taking -log of both sides,

$$-\log q(x_1^T) \le -\log p_i(x_1^T) + \log N \;\;\Longrightarrow\;\; \sum_{t=1}^T -\log q_t(x_t) \le \min_i \sum_{t=1}^T -\log p_{t,i}(x_t) + \log N.$$

Here we consider log N to be small. Note that this bound does not assume any randomness for the x_t.

Now let's consider an alternative encoding scheme, where Alice waits for the entire message x_1, x_2, ..., x_T, chooses the best of the N candidate encoding methods, uses lg N bits to encode which method she chose, and finally sends her message according to this chosen method. This scheme would use just as many bits as the right-hand side of the bound, but using our on-line algorithm, we do not have to wait for the whole message to start encoding and sending.

We won't go into detail about decoding, but in order to decode, Bob effectively just simulates what Alice does to encode, so decoding is just as efficient as Alice's encoding, making algorithmic efficiency a non-factor.
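As a quick numerical sanity check of the theorem (a made-up example, not from the lecture: five context-free Bernoulli experts, natural log, so the overhead is ln N):

```python
import numpy as np

rng = np.random.default_rng(0)
ps = np.array([0.1, 0.3, 0.5, 0.7, 0.9])   # N = 5 Bernoulli experts
xs = (rng.random(1000) < 0.7).astype(int)  # any bit sequence works: the bound is worst-case

w = np.full(len(ps), 1.0 / len(ps))
master_loss, expert_loss = 0.0, np.zeros(len(ps))
for x in xs:
    probs = ps if x == 1 else 1.0 - ps     # p_i(x_t) for each expert
    master_loss -= np.log(w @ probs)       # -ln q_t(x_t)
    expert_loss -= np.log(probs)           # -ln p_{t,i}(x_t), per expert
    w *= probs
    w /= w.sum()                           # Bayes weight update

# total master loss <= best expert's loss + ln N, as the theorem guarantees
assert master_loss <= expert_loss.min() + np.log(len(ps)) + 1e-9
```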

5 Variations

5.1 Using a prior

In this section we consider a prior Pr[i* = i] = π_i, not necessarily uniform. Everything about our algorithm stays the same, except that the initial weights are now w_{1,i} = π_i, and the final bound ends up being

$$\sum_{t=1}^T -\log q_t(x_t) \;\le\; \min_i \left[ \sum_{t=1}^T -\log p_{t,i}(x_t) - \log \pi_i \right].$$

5.2 Infinite Experts

Consider the problem where X = {0, 1}, and expert p predicts

$$x_t = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1 - p \end{cases}$$

where we have all experts p ∈ [0, 1]. We need to figure out the weights w_{t,p} to get q. In the finite case we had w_{t,i} = Pr[i* = i | x_1^{t-1}], but applying this definition to the infinite case does not really make sense unless we talk about the probability density:

$$\Pr[p^* \in dp \mid x_1^{t-1}] = \frac{\Pr[x_1^{t-1} \mid p^* \in dp]\, \Pr[p^* \in dp]}{\Pr[x_1^{t-1}]} \qquad \text{(Bayes' rule)}$$
$$= \frac{\Pr[x_1^{t-1} \mid p^* \in dp]}{\text{normalization}} \qquad \text{(assuming } \Pr[p^* \in dp] \text{ uniform)}$$
$$\propto\; p^h (1-p)^{t-1-h},$$

where h is the number of heads (1's) in the first t-1 rounds. Now, letting w_{t,p} = p^h (1-p)^{t-1-h},

$$q_t = \Pr[x_t = 1 \mid x_1^{t-1}] = \frac{\int_0^1 w_{t,p}\, p\, dp}{\int_0^1 w_{t,p}\, dp} = \frac{h+1}{(t-1)+2},$$

which is sometimes called Laplace smoothing (a small numerical sketch of this rule appears at the end of these notes). We can get a similar bound as before in this case, but -log π_i or lg N does not make sense with infinitely many experts. We'll see a bound in a future lecture.

6 Switching Experts

In this section we set up the problem for the next class. Here we no longer assume that one expert is good all the time. Instead, we change the model so that at any step, the correct expert can switch to another expert. However, the learning algorithm has no idea when the experts are switching. Our goal is to design an algorithm that performs well with respect to the best switching sequence of experts. We'll look at this in the next lecture.
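Finally, the numerical sketch promised in Section 5.2: a minimal implementation of the Laplace-smoothed predictor q_t = (h+1)/((t-1)+2) (the formula is from the notes; the generator interface is an illustrative choice):

```python
def laplace_predictions(bits):
    """Yield q_t = Pr[x_t = 1 | x_1^{t-1}] = (h + 1) / ((t - 1) + 2)
    before each bit is observed, where h counts the 1's seen so far."""
    h = 0
    for t, x in enumerate(bits, start=1):
        yield (h + 1) / ((t - 1) + 2)
        h += x                           # update the head count

print(list(laplace_predictions([1, 1, 0, 1])))   # [0.5, 0.666..., 0.75, 0.6]
```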