CHAPTER 3: BAYESIAN DECISION THEORY
Decision making under uncertainty
Programming computers to make inferences from data requires interdisciplinary knowledge from statistics and computer science. Knowledge of statistics is required to build a mathematical framework for making inferences; knowledge of computer science is required for efficient implementation of inference methods. In real life, data comes from a process that is often not completely known. This lack of knowledge can be compensated for by modeling the process as random. The underlying data generation process may well be deterministic, but because we do not have complete knowledge about it, we model it as random and use probability theory to analyze it.
Probability and Inference
Consider the event of tossing a coin, which is a random process. Tossing a coin has two outcomes, heads or tails. We define a random variable X ∈ {1, 0} to denote these two events: 1 corresponds to heads and 0 corresponds to tails. Such a random variable X is Bernoulli distributed, where the parameter p_0 of the distribution is the probability that the outcome is heads, i.e., P(X = 1) = p_0.
Bernoulli: P(X = x) = p_0^x (1 - p_0)^(1 - x)
Prediction of the next toss: heads if p_0 > 1/2, tails otherwise.
Prediction is straightforward if we know p_0. When we do not know p_0, we can estimate it from a sample.
Sample: X = {x^t}, t = 1, ..., N
Estimation: p̂_0 = #{heads} / #{tosses} = (Σ_t x^t) / N
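As a small illustration of the estimation rule above, here is a minimal Python sketch (not part of the original slides; the sample of tosses is made up):

```python
# Minimal sketch (illustrative): estimate the Bernoulli parameter p0 from a
# sample of tosses and predict the next toss.
sample = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]    # hypothetical tosses: 1 = heads, 0 = tails

p0_hat = sum(sample) / len(sample)          # p0 = #{heads} / #{tosses}
prediction = "Heads" if p0_hat > 0.5 else "Tails"
print(f"estimated p0 = {p0_hat:.2f}, predicted next toss: {prediction}")
```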
Classification
Consider the credit scoring problem again, where the inputs are income and savings and the output is low-risk vs. high-risk. A customer's annual income and savings are represented by random variables X_1 and X_2, respectively.
Input: x = [X_1, X_2]^T; Output: C ∈ {0, 1}
The credibility of a customer is determined by a Bernoulli random variable C conditioned on the observation x = [X_1, X_2]^T, where C = 1 indicates a high-risk customer and C = 0 indicates a low-risk customer. Therefore, if we know P(C | X_1, X_2), then when an application arrives with X_1 = x_1 and X_2 = x_2 we can use the following prediction rule:
choose C = 1 if P(C = 1 | x_1, x_2) > 0.5, and C = 0 otherwise;
equivalently, choose C = 1 if P(C = 1 | x_1, x_2) > P(C = 0 | x_1, x_2), and C = 0 otherwise.
Probability of error = 1 - max( P(C = 1 | x_1, x_2), P(C = 0 | x_1, x_2) )
To be able to predict is the same as to be able to calculate P(C | x), where x = [x_1, x_2]^T.
Bayes' Rule
P(C | x) = P(C) p(x | C) / p(x), i.e., posterior = prior × likelihood / evidence.
P(C = 1) is called the prior probability that C takes the value 1 (in our case, a high-risk customer), regardless of what x is. It is called the prior probability because it is the knowledge we have about the value of C before looking at the observation x. Note that P(C = 0) + P(C = 1) = 1.
p(x | C) is called the class likelihood and is the conditional probability that an event belonging to class C has the associated observation value x. In our case, p(x_1, x_2 | C = 1) is the probability that a high-risk customer has X_1 = x_1 and X_2 = x_2.
p(x) is called the evidence, which is the marginal probability that an observation x is seen regardless of whether it is a positive or a negative example:
p(x) = Σ_C p(x, C) = p(x, C = 1) + p(x, C = 0) = p(x | C = 1) P(C = 1) + p(x | C = 0) P(C = 0)
Since any observation comes from either the high-risk or the low-risk class, for any x it is always the case that P(C = 0 | x) + P(C = 1 | x) = 1.
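A minimal sketch of this computation for the two-class case (the prior and class-likelihood values below are made-up numbers, used only to show the arithmetic):

```python
# Bayes' rule for two classes: posterior = prior * likelihood / evidence.
prior_high = 0.3                     # P(C=1), prior probability of a high-risk customer
prior_low = 1.0 - prior_high         # P(C=0)
lik_high = 0.05                      # p(x | C=1), class likelihood of the observed x
lik_low = 0.20                       # p(x | C=0)

evidence = lik_high * prior_high + lik_low * prior_low   # p(x), by total probability
post_high = lik_high * prior_high / evidence             # P(C=1 | x)
post_low = lik_low * prior_low / evidence                # P(C=0 | x)

print(post_high, post_low, post_high + post_low)         # the two posteriors sum to 1
print("high-risk" if post_high > 0.5 else "low-risk")
```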
Bayes' Rule: K > 2 Classes
P(C_i | x) = p(x | C_i) P(C_i) / p(x) = p(x | C_i) P(C_i) / Σ_{k=1}^{K} p(x | C_k) P(C_k)
with P(C_i) ≥ 0 and Σ_{i=1}^{K} P(C_i) = 1.
Choose C_i if P(C_i | x) = max_k P(C_k | x); equivalently, choose C_i if P(C_i) p(x | C_i) = max_k P(C_k) p(x | C_k).
Bayes' Rule: Simple setting
Consider a simple setting: Y (the class label) is boolean valued, and X is a vector containing n boolean attributes (each feature/attribute is binary). Applying Bayes' theorem,
P(Y = y_i | X = x_k) = P(X = x_k | Y = y_i) P(Y = y_i) / Σ_j P(X = x_k | Y = y_j) P(Y = y_j)
Bayes' Rule: How many parameters?
Let θ_ij = P(X = x_i | Y = y_j). How many parameters do we need to estimate? (2^n - 1) for each class, that is 2(2^n - 1) in total for 2 classes (K = 2).
Why is this bad? This corresponds to roughly 2 distinct parameters for each of the 2^n distinct instances in the instance space X. To make reliable estimates we need to see each of those distinct instances multiple times.
How bad can this be? If X has 30 boolean features, we need to estimate over 2 billion parameters! Totally impractical!
Can we do anything about it?
By using a simple modeling trick (an assumption, or inductive bias), we can reduce the number of parameters to be estimated from 2(2^n - 1) to just 2n. The trick is called conditional independence. The resulting method (algorithm) is called the Naïve Bayes classifier.
Conditional independence
X is conditionally independent of Y given Z if the probability distribution of X does not depend on the value of Y once the value of Z is known, i.e., P(X | Y, Z) = P(X | Z).
Why does this help? It is the assumption that lets us factor the class likelihood P(X_1, ..., X_n | Y) into a product of per-feature terms, as shown on the next slides.
Naïve Bayes
This is a classification algorithm based on Bayes' rule that assumes the attributes X_1, ..., X_n are conditionally independent of one another given Y. This dramatically simplifies the representation of P(X | Y).
Consider first the case when X has only two attributes, i.e., X = (X_1, X_2):
P(X_1, X_2 | Y) = P(X_1 | X_2, Y) P(X_2 | Y) = P(X_1 | Y) P(X_2 | Y)
In general, when X = (X_1, ..., X_n), we can write
P(X_1, ..., X_n | Y) = Π_{i=1}^{n} P(X_i | Y)
Naïve Bayes contd.
Application of Bayes' rule yields (posterior = prior × likelihood / evidence):
P(Y = y_k | X_1, ..., X_n) = P(Y = y_k) Π_i P(X_i | Y = y_k) / Σ_j P(Y = y_j) Π_i P(X_i | Y = y_j)
The Naïve Bayes classification rule is: predict Y = y_k if it maximizes the R.H.S. of the above,
Y ← argmax_{y_k} P(Y = y_k) Π_i P(X_i | Y = y_k) / Σ_j P(Y = y_j) Π_i P(X_i | Y = y_j)
which simplifies to
Y ← argmax_{y_k} P(Y = y_k) Π_i P(X_i | Y = y_k)
Why? Because the denominator (the evidence) is the same for every class and therefore does not affect which class attains the maximum.
Naïve Bayes algorithm for discrete input
The setting: n input attributes/features X_i, each taking J possible discrete values; in the case of a binary feature, J = 2 and X_i takes the value 0 or 1. Y is a discrete output variable (the class label) taking K possible values; in the case of a binary classification problem, K = 2 and Y takes the value 0 or 1.
Parameters:
θ_ijk = P(X_i = j | Y = y_k): the probability that the i-th input feature takes the j-th discrete value given that the observation is a member of class y_k, for each pair of i, k values. There are n(J - 1)K such parameters, corresponding to the likelihood.
π_k = P(Y = y_k): there are (K - 1) such parameters (the prior probabilities).
Estimates:
The parameter θ_ijk is estimated as the ratio of the number of documents in which the i-th feature takes the j-th value and the class label is y_k, to the number of documents whose class label is y_k.
The parameter π_k is estimated as the ratio of the number of documents having class label y_k to the total number of documents.
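The counting estimates above translate directly into code. The following is a hedged sketch (the function name and data layout are illustrative, not from the slides):

```python
from collections import Counter, defaultdict

def train_discrete_nb(X, y):
    """Estimate Naive Bayes parameters for discrete features.

    X: list of feature vectors (each a list of discrete values)
    y: list of class labels
    Returns priors pi[k] = P(Y = y_k) and conditionals
    theta[(i, j, k)] = P(X_i = j | Y = y_k).
    """
    n_examples = len(y)
    class_counts = Counter(y)                                    # #{Y = y_k}
    pi = {k: c / n_examples for k, c in class_counts.items()}    # pi_k

    joint_counts = defaultdict(int)                              # #{X_i = j and Y = y_k}
    for xvec, label in zip(X, y):
        for i, value in enumerate(xvec):
            joint_counts[(i, value, label)] += 1

    theta = {(i, j, k): c / class_counts[k]
             for (i, j, k), c in joint_counts.items()}
    return pi, theta
```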
Naïve Bayes algorithm for continuous input
The setting: n input attributes/features X_i, each taking continuous values. Y is a discrete output variable (the class label) taking K possible values; in the case of a binary classification problem, K = 2 and Y takes the value 0 or 1.
Parameters and inference: same as before, except that each feature has a continuous probability distribution. Typically a normal distribution is used for this purpose. The normal distribution has two parameters, mean and variance; these two parameters for each feature (and each class) are learned from the training data.
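A minimal sketch of the per-feature normal model (the training values are made up; the function names are illustrative):

```python
import math
import statistics

def fit_gaussian(values):
    """Per-feature, per-class parameters: sample mean and variance."""
    return statistics.mean(values), statistics.variance(values)

def gaussian_likelihood(x, mean, var):
    """p(x | y_k) under a normal distribution N(mean, var)."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

mean, var = fit_gaussian([1.2, 0.9, 1.5, 1.1])   # made-up training values of one feature
print(gaussian_likelihood(1.0, mean, var))
```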
Practical issues and engineering hacks
Issue #1: Suppose the i-th feature X_i (in the spam filtering example, this is the i-th word in the dictionary) does not appear in the training set. Consider class C = 1. The estimate of the parameter P(X_i = 1 | C = 1) from the training set is 0 (because X_i does not appear in any class C = 1 example); therefore, the estimate of the parameter P(X_i = 0 | C = 1) is 1. The same holds for class C = 0. After training, a new test example x arrives whose i-th feature is not zero. What will be its predicted class label? P(C = 1 | x) = 0. Why? One of the terms in the product forming the likelihood, specifically P(X_i = 1 | C = 1) = 0, is zero, and so is the whole likelihood. For the same reason P(C = 0 | x) = 0. But P(C = 1 | x) + P(C = 0 | x) must equal 1, which is not the case here, and hence this causes a difficulty in prediction.
Hack #1: If a particular feature/attribute X_i does not appear in the training set, a very small probability is assigned to P(X_i = 1 | C = 1) and to P(X_i = 1 | C = 0) instead of assigning the value zero. These are sometimes called ghost examples/features: even though the feature does not appear in the training set, we still assign it a non-zero probability. Such an assignment solves the problem mentioned above. The solution may be a little biased if the training set is small, but with a large training set the bias goes away.
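One standard way to realize Hack #1 is additive (Laplace) smoothing: every feature value gets a small pseudo-count, so no estimate is exactly zero. A sketch under that assumption:

```python
def smoothed_estimate(count_value_and_class, count_class, n_values, alpha=1.0):
    """P(X_i = j | Y = y_k) with a pseudo-count alpha for each of the n_values values."""
    return (count_value_and_class + alpha) / (count_class + alpha * n_values)

# A feature value never seen in a class still gets a small non-zero probability.
print(smoothed_estimate(0, 50, 2))    # 1 / 52 instead of 0
```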
Practical issues and engineering hacks
Issue #2: For a large number of features, computing the likelihood may be beyond the precision of a computer. This is because each probability term in the likelihood expression, which is a product of n probability terms if there are n features, is a positive number in [0, 1]. Therefore the likelihood becomes an extremely small non-negative number, and precisely representing the likelihood, and thus the posterior probability, may be beyond the precision capacity of a computer, which affects the prediction decision.
Hack #2: Instead of probabilities we use log-probabilities for the numerator of the posterior expression. We discard the denominator (evidence) in the posterior expression because it is the same for all classes and thus is not crucial for the prediction decision. The log-probability transforms the product into a sum, which does not strain the computer's precision. Thus for X = (X_1, ..., X_n) with n features, the log of the (unnormalized) posterior probability is computed as
log( P(C) Π_{i=1}^{n} P(X_i | C) ) = log P(C) + Σ_{i=1}^{n} log P(X_i | C)
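A minimal sketch of the log-space computation (the prior and likelihood numbers are made up): with a few hundred features the raw product underflows to 0.0 in floating point, while the sum of logs stays well within range.

```python
import math

def log_posterior_score(prior, likelihoods):
    """log P(C) + sum_i log P(X_i | C); the evidence is dropped because it is the
    same for every class and does not change which class attains the maximum."""
    return math.log(prior) + sum(math.log(p) for p in likelihoods)

classes = {
    1: (0.3, [0.01] * 200),     # made-up prior and per-feature likelihoods for C=1
    0: (0.7, [0.02] * 200),     # made-up prior and per-feature likelihoods for C=0
}
scores = {c: log_posterior_score(prior, liks) for c, (prior, liks) in classes.items()}
print(max(scores, key=scores.get))   # predicted class via the log scores
```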
Naïve Bayes algorithm for email SPAM filtering
Homework assignment.
Losses and Risks
Often it is the case that decisions (predictions) are not equally good or costly. An accepted low-risk applicant increases profit, while a rejected high-risk applicant decreases loss. The loss for a high-risk applicant erroneously accepted may be different from the potential gain for an erroneously rejected low-risk applicant.
Actions: let α_i be the decision to assign input x to class C_i, and let λ_ik be the loss of taking action α_i when the actual class of the input is C_k. The expected risk of taking action α_i is (Duda and Hart, 1973)
R(α_i | x) = Σ_{k=1}^{K} λ_ik P(C_k | x)
and we choose α_i if R(α_i | x) = min_k R(α_k | x); that is, we choose the action with minimum expected risk.
Losses and Risks: 0/1 Loss
Suppose α_i, i = 1, 2, ..., K are K actions, where α_i is the action of assigning x to C_i, i.e., the decision to assign input x to class C_i. In the special case of 0/1 loss we have λ_ik = 0 if i = k and λ_ik = 1 if i ≠ k; that is, all correct decisions have no loss and all errors (incorrect decisions) are equally costly. The risk of taking action α_i is then
R(α_i | x) = Σ_{k=1}^{K} λ_ik P(C_k | x) = Σ_{k ≠ i} P(C_k | x) = 1 - P(C_i | x)
Therefore, to minimize expected risk we choose the most probable class. In some applications a wrong classification, that is, a misclassification, may have a very high cost (e.g., in medical diagnostics). In such situations an additional action, reject/doubt, is introduced.
Losses and Risks: Reject
Suppose α_i, i = 1, 2, ..., K are the K actions of assigning x to C_i as before, and α_{K+1} is an additional action, reject/doubt. A possible loss function is
λ_ik = 0 if i = k, λ if i = K + 1, and 1 otherwise, where 0 < λ < 1.
The risk of reject is
R(α_{K+1} | x) = Σ_{k=1}^{K} λ P(C_k | x) = λ
The risk of taking any other action α_i is
R(α_i | x) = Σ_{k ≠ i} P(C_k | x) = 1 - P(C_i | x)
Therefore, the optimal decision rule is:
choose C_i if R(α_i | x) < R(α_k | x) for all k ≠ i and R(α_i | x) < R(α_{K+1} | x);
reject if R(α_{K+1} | x) < R(α_i | x) for i = 1, ..., K.
Losses and Risks: Reject
This is equivalent to the following decision rule:
choose C_i if P(C_i | x) > P(C_k | x) for all k ≠ i and P(C_i | x) > 1 - λ;
reject otherwise.
Note that now we choose C_i not only if it has the largest posterior probability, but also only if that posterior probability exceeds the threshold 1 - λ.
What happens when λ = 0 and all other losses are 1? We always reject. Why? Because rejecting then costs nothing, so no classification action can have lower risk.
What happens when λ = 1 and all other losses are 1? We never reject. Why? Because rejecting then costs as much as the worst misclassification, so choosing the most probable class is always at least as good.
Losses and Reject: Example
Consider a 2-class classification problem where the losses are defined as follows: λ_11 = 0, λ_22 = 0, λ_12 = 10, λ_21 = 5. Wrongly choosing C_1 as the prediction is more costly.
R(α_1 | x) = 0 · P(C_1 | x) + 10 · P(C_2 | x) = 10 (1 - P(C_1 | x))
R(α_2 | x) = 5 · P(C_1 | x) + 0 · P(C_2 | x) = 5 P(C_1 | x)
Choose action α_1 (that is, predict that the output class is C_1) if R(α_1 | x) < R(α_2 | x), that is, when P(C_1 | x) > 2/3. Observe that the decision boundary has shifted!
Suppose now we introduce an additional action α_3 (reject) with losses λ_31 = 1, λ_32 = 1, so that R(α_3 | x) = 1.
We choose action α_1 (predict C_1) if R(α_1 | x) < 1, that is, P(C_1 | x) > 9/10.
We choose action α_2 (predict C_2) if R(α_2 | x) < 1, that is, P(C_1 | x) < 1/5.
We reject otherwise, that is, if 1/5 ≤ P(C_1 | x) ≤ 9/10.
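The example translates directly into code. A sketch with the same losses (the posterior values fed in at the end are made up):

```python
def risks(p_c1, loss12=10.0, loss21=5.0, loss_reject=1.0):
    """Expected risks of alpha_1 (choose C1), alpha_2 (choose C2), alpha_3 (reject)."""
    p_c2 = 1.0 - p_c1
    r1 = loss12 * p_c2          # R(alpha_1 | x) = 10 (1 - P(C1|x))
    r2 = loss21 * p_c1          # R(alpha_2 | x) = 5 P(C1|x)
    r3 = loss_reject            # R(alpha_3 | x) = 1 * P(C1|x) + 1 * P(C2|x) = 1
    return r1, r2, r3

def decide(p_c1):
    r1, r2, r3 = risks(p_c1)
    return min([(r1, "choose C1"), (r2, "choose C2"), (r3, "reject")])[1]

for p in (0.95, 0.50, 0.10):    # > 9/10 -> C1, between 1/5 and 9/10 -> reject, < 1/5 -> C2
    print(p, decide(p))
```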
Different Losses and Reject
[Figure: decision thresholds under equal losses, unequal losses, and with the reject option]
Discriminant Functions
Classification can also be seen as implementing a set of discriminant functions g_i(x), i = 1, ..., K, such that we choose C_i if g_i(x) = max_k g_k(x). We can represent the Bayes classifier in this way by setting g_i(x) = -R(α_i | x); the maximum discriminant function then corresponds to the minimum conditional risk. When we use 0/1 loss, g_i(x) = P(C_i | x), or equivalently g_i(x) = p(x | C_i) P(C_i), since the evidence term is common to all classes.
The discriminant functions divide the feature space into K decision regions R_1, ..., R_K, where
R_i = { x | g_i(x) = max_k g_k(x) }, i = 1, ..., K
K = 2 Classes
When K = 2, we can define a single discriminant function g(x) = g_1(x) - g_2(x). Consequently, the decision rule is: choose C_1 if g(x) > 0, and C_2 otherwise.
Log odds: log [ P(C_1 | x) / P(C_2 | x) ]
Association Rules
Association rule: X → Y. People who buy/click/visit/enjoy X are also likely to buy/click/visit/enjoy Y. A rule implies association, not necessarily causation.
Association measures
Support (X → Y): P(X, Y) = #{customers who bought X and Y} / #{customers}
Confidence (X → Y): P(Y | X) = P(X, Y) / P(X) = #{customers who bought X and Y} / #{customers who bought X}
Lift (X → Y): P(X, Y) / ( P(X) P(Y) ) = P(Y | X) / P(Y)
Association measures
Support shows the statistical significance of the rule. We are interested in maximizing the support of a rule because even if there is a dependency with a strong confidence value, if the number of such customers is small, the rule is worthless.
Confidence shows the strength of the rule. To be able to say that a rule holds with enough confidence, this value must be close to 1 and significantly larger than P(Y).
If X and Y are independent, we expect Lift to be close to 1.
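As a concrete illustration of the three measures, here is a small sketch over a hypothetical list of market baskets (the data and item names are invented):

```python
# Hypothetical transactions; each basket is the set of items one customer bought.
baskets = [{"milk", "bread"}, {"milk", "bread", "butter"},
           {"bread"}, {"milk", "butter"}, {"milk", "bread"}]

def support(items):
    return sum(items <= b for b in baskets) / len(baskets)    # P(items)

def confidence(x, y):
    return support(x | y) / support(x)                        # P(Y | X)

def lift(x, y):
    return support(x | y) / (support(x) * support(y))         # P(X, Y) / (P(X) P(Y))

X, Y = {"milk"}, {"bread"}
print(support(X | Y), confidence(X, Y), lift(X, Y))
```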
Example
Apriori algorithm (Agrawal et al., 1996)
For (X, Y, Z), a 3-itemset, to be frequent (i.e., to have enough support), (X, Y), (X, Z), and (Y, Z) should all be frequent. Conversely, if (X, Y) is not frequent, none of its supersets can be frequent. Once we find the frequent k-itemsets, we convert them into rules: X, Y → Z, ...; and X → Y, Z, ...
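A compact sketch of the Apriori idea (illustrative only, not the full algorithm from the paper): candidate (k+1)-itemsets are generated from the frequent k-itemsets, and only those with enough support are kept.

```python
def apriori_frequent_itemsets(baskets, min_support):
    """Return all itemsets whose support is at least min_support."""
    n = len(baskets)

    def support(s):
        return sum(s <= b for b in baskets) / n

    # Start from the frequent 1-itemsets.
    items = {frozenset([i]) for b in baskets for i in b}
    current = {s for s in items if support(s) >= min_support}

    frequent = []
    while current:
        frequent.extend(current)
        # Candidate (k+1)-itemsets: unions of frequent k-itemsets, one item larger.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == len(a) + 1}
        current = {c for c in candidates if support(c) >= min_support}
    return frequent

baskets = [{"milk", "bread"}, {"milk", "bread", "butter"},
           {"bread"}, {"milk", "butter"}, {"milk", "bread"}]
print(apriori_frequent_itemsets(baskets, min_support=0.4))
```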