CHAPTER 3: BAYESIAN DECISION THEORY


Decision making under uncertainty

Programming computers to make inferences from data requires interdisciplinary knowledge from statistics and computer science. Knowledge of statistics is required to build the mathematical framework for making inferences, while knowledge of computer science is required for efficient implementation of inference methods. In real life, data come from a process that is often not completely known. This lack of knowledge can be compensated for by modeling the process as a random process: the underlying data generation process may well be deterministic, but because we do not have access to complete knowledge about it, we model it as random and use probability theory to analyze it.

Probability and Inference

Consider the event of tossing a coin, which is a random process. Tossing a coin has two outcomes: heads or tails. We define a random variable X in {1, 0} to denote these two events, where 1 corresponds to heads and 0 corresponds to tails. Such a random variable X is Bernoulli distributed, where the parameter p_0 of the distribution is the probability that the outcome is heads, i.e., P(X = 1) = p_0.

Bernoulli: P(X = x) = p_0^x (1 - p_0)^(1 - x)

Prediction of the next toss: heads if p_0 > 1/2, tails otherwise. Prediction is straightforward if we know p_0. When we do not know what p_0 is, we can estimate it from a sample.

Sample: X = {x^t}, t = 1, ..., N
Estimation: p̂_0 = #{heads} / #{tosses} = (Σ_t x^t) / N
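A minimal sketch (not from the slides) of the estimate and prediction rule just described, on a hypothetical coin-toss sample:

```python
# Maximum-likelihood estimate of the Bernoulli parameter p_0 from a sample
# of coin tosses, and the resulting prediction rule for the next toss.
tosses = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]   # hypothetical sample: 1 = heads, 0 = tails

p0_hat = sum(tosses) / len(tosses)        # p0_hat = #{heads} / #{tosses}
prediction = "heads" if p0_hat > 0.5 else "tails"

print(f"estimated p0 = {p0_hat:.2f}, predict next toss: {prediction}")
```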

Classification

Consider the credit scoring problem again, where the inputs are income and savings and the output is low-risk vs. high-risk. A customer's annual income and savings are represented by the random variables X_1 and X_2, respectively.

Input: X = [X_1, X_2]^T
Output: C in {0, 1}

The credibility of a customer is determined by a Bernoulli random variable C conditioned on the observation X = [X_1, X_2]^T, where C = 1 indicates a high-risk customer and C = 0 indicates a low-risk customer. Therefore, if we know P(C | X_1, X_2), then when an application arrives with X_1 = x_1 and X_2 = x_2 we can use the following prediction rule:

choose C = 1 if P(C = 1 | x_1, x_2) > 0.5, and C = 0 otherwise

or equivalently

choose C = 1 if P(C = 1 | x_1, x_2) > P(C = 0 | x_1, x_2), and C = 0 otherwise

Probability of error = 1 - max(P(C = 1 | x_1, x_2), P(C = 0 | x_1, x_2))

To be able to predict is the same as to be able to calculate P(C | x), where x = [x_1, x_2]^T.

Bayes' Rule

posterior = prior × likelihood / evidence:

P(C | x) = P(C) p(x | C) / p(x)

P(C = 1) is called the prior probability that C takes the value 1 (in our case, high-risk customers), regardless of what x is. It is called the prior probability because it is the knowledge we have about the value of C before looking at the observable x. Note that P(C = 0) + P(C = 1) = 1.

p(x | C) is called the class likelihood and is the conditional probability that an event belonging to class C has the associated observation value x. In our case, p(x_1, x_2 | C = 1) is the probability that a high-risk customer has X_1 = x_1 and X_2 = x_2.

p(x) is called the evidence; it is the marginal probability that an observation x is seen, regardless of whether it is a positive or a negative example:

p(x) = Σ_C p(x, C) = p(x, C = 1) + p(x, C = 0) = p(x | C = 1) P(C = 1) + p(x | C = 0) P(C = 0)

Since any observation comes from either the high-risk or the low-risk class, given any x it is always the case that P(C = 0 | x) + P(C = 1 | x) = 1.

Bayes' Rule: K > 2 Classes

P(C_i | x) = p(x | C_i) P(C_i) / p(x) = p(x | C_i) P(C_i) / Σ_{k=1}^{K} p(x | C_k) P(C_k)

with P(C_i) ≥ 0 and Σ_{i=1}^{K} P(C_i) = 1.

Choose C_i if P(C_i | x) = max_k P(C_k | x). Equivalently, choose C_i if P(C_i) p(x | C_i) = max_k P(C_k) p(x | C_k).

Bayes' Rule: Simple setting

Consider a simple setting: Y (the class label) is boolean valued, and X is a vector containing n boolean attributes (each feature/attribute is binary). Applying Bayes' theorem gives the expression shown below.
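The following is a standard statement of Bayes' theorem for this setting (boolean Y, attribute vector X = (X_1, ..., X_n)), offered as a reconstruction rather than the slide's exact formula:

```latex
P(Y = 1 \mid X_1, \ldots, X_n)
  = \frac{P(X_1, \ldots, X_n \mid Y = 1)\, P(Y = 1)}
         {\sum_{y \in \{0,1\}} P(X_1, \ldots, X_n \mid Y = y)\, P(Y = y)}
```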

Bayes' Rule: How many parameters?

If we model P(X | Y) directly, how many parameters do we need to estimate? There are 2^n - 1 parameters for each class, that is 2(2^n - 1) in total for 2 classes (K = 2).

Why is this bad? This corresponds to 2 distinct parameters for each of the 2^n distinct instances in the instance space of X. To make reliable estimates we would need to see each of those distinct instances multiple times.

How bad can this be? If X has 30 boolean features we need to estimate 2(2^30 - 1), i.e., more than 2 billion parameters. Totally impractical!

Can we do anything about it?

By using a simple modeling trick (an assumption, or inductive bias), we can reduce the number of parameters to be estimated from 2(2^n - 1) to just 2n. The trick is called conditional independence, and the resulting method (algorithm) is called the Naïve Bayes classifier.

Conditional independence

X is conditionally independent of Y given Z if the probability distribution governing X does not depend on the value of Y once the value of Z is known (a formal statement is sketched below). Why does this help?
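A standard formal statement of the definition the slide refers to, given here as a sketch (the notation is mine, not necessarily the slide's):

```latex
% X is conditionally independent of Y given Z:
(\forall i, j, k)\quad
P(X = x_i \mid Y = y_j, Z = z_k) = P(X = x_i \mid Z = z_k)
```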

Naïve Bayes

This is a classification algorithm based on Bayes' rule that assumes the attributes X_1, ..., X_n are conditionally independent of one another given the class Y. This dramatically simplifies the representation of P(X | Y). Consider first the case when X has only two attributes, i.e., X = (X_1, X_2); in general, when X = (X_1, ..., X_n), we can write the factorization shown below.
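The factorization referred to above, written out under the conditional-independence assumption (a standard reconstruction; the slide's own equation is not reproduced here):

```latex
P(X_1, X_2 \mid Y) = P(X_1 \mid X_2, Y)\, P(X_2 \mid Y) = P(X_1 \mid Y)\, P(X_2 \mid Y)
\qquad\text{and in general}\qquad
P(X_1, \ldots, X_n \mid Y) = \prod_{i=1}^{n} P(X_i \mid Y)
```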

Naïve Bayes contd.

Application of Bayes' rule yields

P(Y = y_k | X_1, ..., X_n) = P(Y = y_k) Π_i P(X_i | Y = y_k) / Σ_j P(Y = y_j) Π_i P(X_i | Y = y_j)

(posterior = prior × likelihood / evidence). The Naïve Bayes classification rule is: predict Y = y_k if it maximizes the right-hand side of the above, which simplifies to

y ← argmax_{y_k} P(Y = y_k) Π_i P(X_i | Y = y_k)

Why? Because the evidence (the denominator) is the same for every class, so it does not affect which y_k attains the maximum.

Naïve Bayes algorithm for discrete input

The setting: n input attributes/features X_i, each taking J possible discrete values; in the case of a binary feature, J = 2 and X_i takes the value 0 or 1. Y is a discrete output variable (the class label) taking K possible values; in the case of a binary classification problem, K = 2 and Y takes the value 0 or 1.

Parameters:

θ_ijk = P(X_i = x_ij | Y = y_k)

This is the probability that the i-th input feature takes the j-th discrete value given that the observation is a member of class y_k. For each pair of i, k values there are J - 1 free parameters, so there are n(J - 1)K such parameters in total, corresponding to the likelihood.

π_k = P(Y = y_k)

There are K - 1 such parameters (the prior probabilities).

Estimates: the parameter θ_ijk is estimated as

θ̂_ijk = #{X_i = x_ij and Y = y_k} / #{Y = y_k}

that is, the ratio of the number of training examples (documents, in the spam-filtering example) in which the i-th feature takes the j-th value and the class label is y_k, to the number of examples whose class label is y_k. The parameter π_k is estimated as the ratio of the number of examples having class label y_k to the total number of examples.
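A minimal sketch of the counting-based parameter estimation just described, for binary features (J = 2) on hypothetical toy data; names such as train_naive_bayes are illustrative, not from the slides:

```python
from collections import Counter, defaultdict

def train_naive_bayes(X, y):
    """Estimate pi[k] = P(Y = y_k) and theta[(i, v, k)] = P(X_i = v | Y = y_k)
    by counting, as on the slide (no smoothing; see the engineering hacks below)."""
    N = len(y)
    class_counts = Counter(y)                        # #{Y = y_k}
    pi = {k: c / N for k, c in class_counts.items()}

    n_features = len(X[0])
    counts = defaultdict(int)                        # #{X_i = v and Y = y_k}
    for xs, label in zip(X, y):
        for i, v in enumerate(xs):
            counts[(i, v, label)] += 1

    theta = {(i, v, k): counts[(i, v, k)] / class_counts[k]
             for i in range(n_features)
             for v in (0, 1)
             for k in class_counts}
    return pi, theta

# hypothetical toy data: 3 binary features, binary class label
X = [(1, 0, 1), (1, 1, 1), (0, 0, 1), (0, 1, 0)]
y = [1, 1, 0, 0]
pi, theta = train_naive_bayes(X, y)
print(pi[1], theta[(0, 1, 1)])   # P(Y=1) and P(X_0=1 | Y=1)
```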

Naïve Bayes algorithm for continuous input

The setting: n input attributes/features X_i, each taking continuous values. Y is a discrete output variable (the class label) taking K possible values; in the case of a binary classification problem, K = 2 and Y takes the value 0 or 1.

Parameters and inference: same as before, except that each feature has a continuous probability distribution. Typically a normal (Gaussian) distribution is used for this purpose. The normal distribution has two parameters, a mean and a variance, and these two parameters for each feature (and each class) are learned from the training data.
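A sketch of the Gaussian class-conditional model described above, with the usual sample estimates; the per-feature, per-class subscripts (i, k) are notation introduced here, not necessarily the slide's:

```latex
P(X_i = x \mid Y = y_k)
  = \frac{1}{\sqrt{2\pi}\,\sigma_{ik}}
    \exp\!\left(-\frac{(x - \mu_{ik})^2}{2\sigma_{ik}^2}\right),
\qquad
\hat{\mu}_{ik} = \frac{1}{N_k}\sum_{t:\, y^t = y_k} x_i^t,
\qquad
\hat{\sigma}_{ik}^2 = \frac{1}{N_k}\sum_{t:\, y^t = y_k} (x_i^t - \hat{\mu}_{ik})^2,
\qquad N_k = \#\{t : y^t = y_k\}
```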

Practical issues and Engineering hacks

Issue #1: Suppose the i-th feature X_i (in the spam-filtering example, this is the i-th word in the dictionary) does not appear in the training set. Consider class C = 1. The estimate of the parameter P(X_i = 1 | C = 1) from the training set is 0 (because X_i does not appear in the class C = 1 examples), and therefore the estimate of the parameter P(X_i = 0 | C = 1) is 1. The same holds for class C = 0. After training, a new test example x arrives whose i-th feature is not zero. What will its predicted class label be? P(C = 1 | x) = 0. Why? One of the terms in the product forming the likelihood, specifically P(X_i = 1 | C = 1) = 0, is zero, and so the whole likelihood is zero. For the same reason P(C = 0 | x) = 0. But P(C = 1 | x) + P(C = 0 | x) must equal 1, which is not the case here, and hence this causes a difficulty in prediction.

Hack #1: If a particular feature/attribute X_i does not appear in the training set, a very small probability is assigned to P(X_i = 1 | C = 1) and to P(X_i = 1 | C = 0) instead of assigning the value zero. These are sometimes called ghost examples/features: even though the feature does not appear in the training set, we still assign it a non-zero probability. Such an assignment solves the problem mentioned above. The solution may be a little biased if the training set is small, but with a large training set such bias goes away.
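One common way to realize the "ghost examples" idea is Laplace (add-l) smoothing of the counts. The following formula is a standard choice, offered as a sketch rather than the exact scheme intended on the slide:

```latex
\hat{P}(X_i = j \mid Y = y_k)
  = \frac{\#\{X_i = j \wedge Y = y_k\} + l}{\#\{Y = y_k\} + lJ},
\qquad l > 0 \ (\text{e.g. } l = 1),\quad
J = \text{number of values } X_i \text{ can take}
```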

Practical issues and Engineering hacks contd.

Issue #2: For a large number of features, computing the likelihood may be beyond the precision of a computer. This is because the likelihood expression is a product of n probability terms if there are n features, each a number in [0, 1]. Therefore the likelihood becomes an extremely small non-negative number, and precisely representing the likelihood (and thus the posterior probability) may be beyond the precision capacity of a computer, which affects the prediction decision.

Hack #2: Instead of using the probability, we use the log-probability of the numerator of the posterior expression. We discard the denominator (evidence) of the posterior expression because it is the same for all classes and thus is not crucial in making a prediction decision. The log-probability transforms the product into a sum, which does not strain the computer's precision capacity. Thus, for X = (X_1, ..., X_n) with n features, the log of the (unnormalized) posterior probability is computed as

log(P(C) Π_{i=1}^{n} P(X_i | C)) = log P(C) + Σ_{i=1}^{n} log P(X_i | C)
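A minimal sketch of log-space prediction as described above; the function name predict_log and the tiny parameter tables are illustrative (such tables could come from an estimation step like the earlier sketch):

```python
import math

def predict_log(x, pi, theta):
    """Return the class maximizing log P(Y=k) + sum_i log P(X_i = x_i | Y=k).
    A zero class-conditional probability makes the class score -inf
    (see Hack #1 for how to avoid zeros in the first place)."""
    best_class, best_score = None, float("-inf")
    for k, prior in pi.items():
        score = math.log(prior)
        for i, v in enumerate(x):
            p = theta[(i, v, k)]
            score += math.log(p) if p > 0 else float("-inf")
        if score > best_score:
            best_class, best_score = k, score
    return best_class

# tiny hypothetical parameter tables: 2 classes, 2 binary features
pi = {0: 0.5, 1: 0.5}
theta = {(0, 0, 0): 0.8, (0, 1, 0): 0.2, (1, 0, 0): 0.7, (1, 1, 0): 0.3,
         (0, 0, 1): 0.1, (0, 1, 1): 0.9, (1, 0, 1): 0.4, (1, 1, 1): 0.6}

print(predict_log((1, 1), pi, theta))   # -> 1
```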

Naïve Bayes algorithm for email SPAM filtering

Homework assignment.

Losses and Risks

Often it is the case that decisions (predictions) are not equally good or costly. An accepted low-risk applicant increases profit, while a rejected high-risk applicant decreases loss. The loss for a high-risk applicant erroneously accepted may be different from the potential gain for an erroneously rejected low-risk applicant.

Actions: let α_i be the decision to assign input x to class C_i, and let λ_ik be the loss of taking action α_i when the actual class of the input is C_k. The expected risk of taking action α_i is (Duda and Hart, 1973)

R(α_i | x) = Σ_{k=1}^{K} λ_ik P(C_k | x)

and we choose α_i if R(α_i | x) = min_k R(α_k | x). That is, we choose the action with minimum expected risk.

Losses and Risks: 0/1 Loss

Suppose α_i, i = 1, 2, ..., K, are K actions, where α_i is the action of assigning x to C_i, i.e., the decision to assign input x to class C_i. In the special case of 0/1 loss we have

λ_ik = 0 if i = k, and λ_ik = 1 if i ≠ k

That is, all correct decisions have no loss and all errors (incorrect decisions) are equally costly. The risk of taking action α_i is

R(α_i | x) = Σ_{k=1}^{K} λ_ik P(C_k | x) = Σ_{k ≠ i} P(C_k | x) = 1 - P(C_i | x)

Therefore, to minimize the expected risk we choose the most probable class. In some applications a wrong classification, that is a misclassification, may have a very high cost (e.g., in medical diagnostics). In such situations an additional action, reject/doubt, is introduced.

Losses and Risks: Reject

Suppose α_i, i = 1, 2, ..., K, are the K actions of assigning x to C_i as before, and α_{K+1} is an additional action, reject/doubt. A possible loss function is

λ_ik = 0 if i = k, λ if i = K + 1, and 1 otherwise, with 0 < λ < 1

The risk of reject is

R(α_{K+1} | x) = Σ_{k=1}^{K} λ P(C_k | x) = λ

and the risk of taking any other action α_i is

R(α_i | x) = Σ_{k ≠ i} P(C_k | x) = 1 - P(C_i | x)

Therefore, the optimal decision rule is: choose C_i if R(α_i | x) < R(α_k | x) for all k ≠ i and R(α_i | x) < R(α_{K+1} | x); reject if R(α_{K+1} | x) < R(α_i | x) for all i = 1, ..., K.

Losses and Risks: Reject

This is equivalent to the following decision rule:

choose C_i if P(C_i | x) > P(C_k | x) for all k ≠ i and P(C_i | x) > 1 - λ; reject otherwise

Note that now we choose C_i not only if it has the largest posterior probability, but also only if its posterior probability is greater than the threshold 1 - λ. What happens when λ = 0 and all other losses are 1? We always reject. Why? Because the threshold 1 - λ = 1 can never be exceeded. What happens when λ = 1 and all other losses are 1? We never reject. Why? Because the threshold 1 - λ = 0 is always met by the most probable class.

Losses and Reject: Example

Consider a 2-class classification problem where the losses are defined as follows: λ_11 = 0, λ_22 = 0, λ_12 = 10, λ_21 = 5. Wrongly choosing C_1 as the prediction is more costly.

R(α_1 | x) = 0 · P(C_1 | x) + 10 · P(C_2 | x) = 10 (1 - P(C_1 | x))
R(α_2 | x) = 5 · P(C_1 | x) + 0 · P(C_2 | x) = 5 P(C_1 | x)

Choose action α_1 (that is, predict that the output class is C_1) if R(α_1 | x) < R(α_2 | x), that is, when 10 (1 - P(C_1 | x)) < 5 P(C_1 | x), i.e., P(C_1 | x) > 2/3. Observe that the decision boundary has shifted!

Suppose now we introduce an additional (reject) action α_3 with losses λ_31 = 1, λ_32 = 1, so that R(α_3 | x) = 1. We choose action α_1 (that is, predict that the output class is C_1) if R(α_1 | x) < 1, that is, if P(C_1 | x) > 9/10. We choose action α_2 (that is, predict that the output class is C_2) if R(α_2 | x) < 1, that is, if P(C_1 | x) < 1/5. We reject otherwise, that is, if 1/5 < P(C_1 | x) < 9/10.
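A small sketch reproducing the example's risk calculations; the losses are those given above, while the decide() helper and the grid of posterior values are illustrative:

```python
LOSSES = {  # LOSSES[action][true_class]
    "alpha1": {1: 0, 2: 10},
    "alpha2": {1: 5, 2: 0},
    "alpha3": {1: 1, 2: 1},   # reject
}

def decide(p_c1):
    """Pick the action with minimum expected risk given P(C1|x) = p_c1."""
    posteriors = {1: p_c1, 2: 1 - p_c1}
    risks = {a: sum(l[k] * posteriors[k] for k in posteriors)
             for a, l in LOSSES.items()}
    return min(risks, key=risks.get), risks

for p in (0.1, 0.5, 0.95):
    action, risks = decide(p)
    rounded = {a: round(r, 2) for a, r in risks.items()}
    print(f"P(C1|x)={p:.2f} -> {action}  risks={rounded}")
# 0.10 -> alpha2 (below 1/5), 0.50 -> alpha3 (reject), 0.95 -> alpha1 (above 9/10)
```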

Different Losses and Reject

[Figure: decision regions under equal losses, unequal losses, and with a reject option.]

Discriminant Functions

Classification can also be seen as implementing a set of discriminant functions g_i(x), i = 1, ..., K, such that we choose C_i if g_i(x) = max_k g_k(x). We can represent the Bayes classifier in this way by setting

g_i(x) = -R(α_i | x)

so that the maximum discriminant function corresponds to the minimum conditional risk. When we use the 0/1 loss,

g_i(x) = P(C_i | x), or equivalently g_i(x) = p(x | C_i) P(C_i)

The discriminant functions divide the feature space into K decision regions R_1, ..., R_K, where

R_i = { x | g_i(x) = max_k g_k(x) }, i = 1, ..., K

K = 2 Classes

When K = 2, we can define a single discriminant function

g(x) = g_1(x) - g_2(x)

Consequently, the decision rule is: choose C_1 if g(x) > 0, and C_2 otherwise.

Log odds: log [ P(C_1 | x) / P(C_2 | x) ]

Association Rules

Association rule: X → Y. People who buy/click/visit/enjoy X are also likely to buy/click/visit/enjoy Y. A rule implies association, not necessarily causation.

Association measures

Support (X → Y): P(X, Y) = #{customers who bought X and Y} / #{customers}

Confidence (X → Y): P(Y | X) = P(X, Y) / P(X) = #{customers who bought X and Y} / #{customers who bought X}

Lift (X → Y): P(X, Y) / (P(X) P(Y)) = P(Y | X) / P(Y)

Association measures

Support shows the statistical significance of the rule. We are interested in maximizing the support of a rule because even if there is a dependency with a strong confidence value, if the number of such customers is small, the rule is worthless.

Confidence shows the strength of the rule. To be able to say that a rule holds with enough confidence, this value must be close to 1 and significantly larger than P(Y).

If X and Y are independent, we expect the lift to be close to 1.

Example
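As a small illustrative stand-in for a worked example, the sketch below computes the three measures just defined on hypothetical basket data (item names and counts are made up, not from the slides):

```python
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk"},
    {"milk", "bread"},
]

def measures(x, y, baskets):
    """Support, confidence, and lift of the rule x -> y over the given baskets."""
    n = len(baskets)
    n_x = sum(1 for b in baskets if x in b)
    n_y = sum(1 for b in baskets if y in b)
    n_xy = sum(1 for b in baskets if x in b and y in b)
    support = n_xy / n                      # P(X, Y)
    confidence = n_xy / n_x                 # P(Y | X)
    lift = confidence / (n_y / n)           # P(Y | X) / P(Y)
    return support, confidence, lift

s, c, l = measures("milk", "bread", baskets)
print(f"support={s:.2f} confidence={c:.2f} lift={l:.2f}")
# -> support=0.60 confidence=0.75 lift=0.94
```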

Apriori algorithm (Agrawal et al., 1996)

For (X, Y, Z), a 3-item set, to be frequent (to have enough support), (X, Y), (X, Z), and (Y, Z) must all be frequent. Conversely, if (X, Y) is not frequent, none of its supersets can be frequent. Once we find the frequent k-item sets, we convert them into rules such as X, Y → Z, ... and X → Y, Z, ...
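A minimal sketch of Apriori-style frequent-itemset mining under the pruning idea above (candidate (k+1)-item sets are grown only from frequent k-item sets); the function name and basket data are illustrative, and a full implementation would also prune candidates whose (k)-subsets are infrequent before counting:

```python
def apriori(baskets, min_support):
    """Return all itemsets whose support is at least min_support."""
    n = len(baskets)

    def support(itemset):
        return sum(1 for b in baskets if itemset <= b) / n

    items = sorted({i for b in baskets for i in b})
    frequent = [fs for fs in (frozenset([i]) for i in items)
                if support(fs) >= min_support]
    all_frequent = list(frequent)

    while frequent:
        # candidate (k+1)-item sets: unions of pairs of frequent k-item sets
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == len(a) + 1}
        frequent = [c for c in candidates if support(c) >= min_support]
        all_frequent.extend(frequent)
    return all_frequent

baskets = [{"milk", "bread"}, {"milk", "bread", "butter"},
           {"bread", "butter"}, {"milk"}, {"milk", "bread"}]
for itemset in apriori(baskets, min_support=0.4):
    print(set(itemset))
```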