Data Mining: Anomaly Detection
Lecture Notes for Chapter 10, Introduction to Data Mining
by Tan, Steinbach, Kumar
Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004
Anomaly/Outlier Detection
  What are anomalies/outliers?
    The set of data points that are considerably different from the remainder of the data
  Variants of anomaly/outlier detection problems
    Given a database D, find all the data points x ∈ D with anomaly scores greater than some threshold t
    Given a database D, find all the data points x ∈ D having the top-n largest anomaly scores f(x)
    Given a database D, containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D
  Applications
    Credit card fraud detection, telecommunication fraud detection, network intrusion detection, fault detection
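The first two problem variants differ only in how the anomaly scores are consumed. A minimal sketch, assuming the scores f(x) have already been computed by some method; the function names and the score values are hypothetical:

```python
def above_threshold(scores, t):
    """Variant 1: indices of points whose anomaly score exceeds threshold t."""
    return [i for i, s in enumerate(scores) if s > t]


def top_n(scores, n):
    """Variant 2: indices of the n points with the largest anomaly scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:n]


# Hypothetical anomaly scores f(x) for six points (illustrative values only).
scores = [0.1, 0.9, 0.3, 0.2, 0.7, 0.05]
flagged = above_threshold(scores, t=0.5)  # points scoring above 0.5
ranked = top_n(scores, n=3)               # three highest-scoring points
```

The threshold variant can return any number of points, including none, while the top-n variant always returns exactly n, whether or not those points are truly anomalous.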
Importance of Anomaly Detection
  Ozone Depletion History
    In 1985 three researchers (Farman, Gardiner and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels
    Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations?
    The ozone concentrations recorded by the satellite were so low they were being treated as outliers by a computer program and discarded!
  Sources:
    http://exploringdata.cqu.edu.au/ozone.html
    http://www.epa.gov/ozone/science/hole/size.html
Anomaly Detection
  Challenges
    How many outliers are there in the data?
    Method is unsupervised
      Validation can be quite challenging (just like for clustering)
    Finding a needle in a haystack
  Working assumption
    There are considerably more normal observations than abnormal observations (outliers/anomalies) in the data
Anomaly Detection Schemes
  General steps
    Build a profile of the normal behavior
      Profile can be patterns or summary statistics for the overall population
    Use the normal profile to detect anomalies
      Anomalies are observations whose characteristics differ significantly from the normal profile
  Types of anomaly detection schemes
    Graphical & statistical-based
    Distance-based
    Model-based
Graphical Approaches
  Boxplot (1-D), scatter plot (2-D), spin plot (3-D)
  Limitations
    Time consuming
    Subjective
Convex Hull Method
  Extreme points are assumed to be outliers
  Use convex hull method to detect extreme values
  What if the outlier occurs in the middle of the data?
Statistical Approaches
  Assume a parametric model describing the distribution of the data (e.g., normal distribution)
  Apply a statistical test that depends on
    Data distribution
    Parameter of distribution (e.g., mean, variance)
    Number of expected outliers (confidence limit)
Grubbs' Test
  Detects outliers in univariate data
  Assumes data comes from a normal distribution
  Detects one outlier at a time, removes the outlier, and repeats
    H_0: There is no outlier in the data
    H_A: There is at least one outlier
  Grubbs' test statistic:
    $G = \frac{\max |X - \bar{X}|}{s}$
  Reject H_0 if:
    $G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{(\alpha/N,\,N-2)}}{N - 2 + t^2_{(\alpha/N,\,N-2)}}}$
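One round of the test can be sketched as follows. This is an illustration, not a full implementation: the t-distribution critical value t_(α/N, N−2) is not computed here and must be supplied by the caller (for example from a table), and the value 4.0 used below is a placeholder, not a real table entry. The function names are hypothetical.

```python
import math
import statistics


def grubbs_statistic(xs):
    """G = max |x - mean| / s, with s the sample standard deviation."""
    mean = statistics.mean(xs)
    s = statistics.stdev(xs)
    return max(abs(x - mean) for x in xs) / s


def grubbs_critical(n, t_crit):
    """Right-hand side of the rejection rule for sample size n.

    t_crit stands for t_(alpha/N, N-2); it is an input, not computed here.
    """
    t2 = t_crit ** 2
    return (n - 1) / math.sqrt(n) * math.sqrt(t2 / (n - 2 + t2))


def has_outlier(xs, t_crit):
    """True if H_0 (no outlier) is rejected for this sample."""
    return grubbs_statistic(xs) > grubbs_critical(len(xs), t_crit)


# Illustrative data with one extreme value; t_crit=4.0 is a placeholder.
sample = [10.0, 10.2, 9.9, 10.1, 9.8, 10.0, 15.0]
rejected = has_outlier(sample, t_crit=4.0)
```

To follow the "one outlier at a time" procedure, the point farthest from the mean would be removed whenever the test rejects, and the test repeated on the remaining data with a critical value for the new N.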
Statistical-based: Likelihood Approach
  Assume the data set D contains samples from a mixture of two probability distributions:
    M (majority distribution)
    A (anomalous distribution)
  General approach:
    Initially, assume all the data points belong to M
    Let L_t(D) be the log likelihood of D at time t
    For each point x_t that belongs to M, move it to A
      Let L_{t+1}(D) be the new log likelihood
      Compute the difference, Δ = L_t(D) − L_{t+1}(D)
      If Δ > c (some threshold), then x_t is declared an anomaly and moved permanently from M to A
Statistical-based: Likelihood Approach
  Data distribution, D = (1 − λ) M + λ A
  M is a probability distribution estimated from data
    Can be based on any modeling method (naïve Bayes, maximum entropy, etc.)
  A is initially assumed to be a uniform distribution
  Likelihood at time t:
    $L_t(D) = \prod_{i=1}^{N} P_D(x_i) = \left( (1-\lambda)^{|M_t|} \prod_{x_i \in M_t} P_{M_t}(x_i) \right) \left( \lambda^{|A_t|} \prod_{x_i \in A_t} P_{A_t}(x_i) \right)$
    $LL_t(D) = |M_t| \log(1-\lambda) + \sum_{x_i \in M_t} \log P_{M_t}(x_i) + |A_t| \log \lambda + \sum_{x_i \in A_t} \log P_{A_t}(x_i)$
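The procedure can be sketched for one-dimensional data under several assumptions that are not in the slides: M is modeled as a Gaussian refitted to the current members of M, A is uniform over the observed data range, and λ = 0.05 and c = 1.0 are arbitrary illustrative values. The sketch flags a point when moving it from M to A raises the log likelihood by more than c, which is one reading of the difference Δ; the sign convention is an assumption of this sketch.

```python
import math
import statistics


def log_likelihood(m, a, lam, p_a):
    """LL(D) = |M| log(1-lam) + sum log P_M(x) + |A| log(lam) + sum log P_A(x)."""
    mu = statistics.mean(m)
    sigma = statistics.pstdev(m) or 1e-9  # guard against zero spread
    ll = len(m) * math.log(1 - lam)
    for x in m:
        # Log density of the Gaussian fitted to the current M.
        ll += -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)
    # Every point in A has the same uniform density p_a.
    ll += len(a) * (math.log(lam) + math.log(p_a))
    return ll


def likelihood_anomalies(data, lam=0.05, c=1.0):
    """Return the points moved permanently from M to A."""
    p_a = 1.0 / (max(data) - min(data))  # uniform density over the data range
    m, a = list(data), []
    for x in data:
        before = log_likelihood(m, a, lam, p_a)
        m_trial = list(m)
        m_trial.remove(x)  # tentatively move x from M to A
        after = log_likelihood(m_trial, a + [x], lam, p_a)
        if after - before > c:  # large gain: keep x in A
            m, a = m_trial, a + [x]
    return a


# Illustrative data: six values near 10 and one far away.
anomalies = likelihood_anomalies([9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 25.0])
```

Removing an extreme point lets the Gaussian for M tighten around the remaining data, which is where the large likelihood gain comes from. The sketch assumes at least a few data points that are not all identical.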
Limitations of Statistical Approaches
  Most of the tests are for a single attribute
  In many cases, the data distribution may not be known
  For high-dimensional data, it may be difficult to estimate the true distribution
Distance-based Approaches
  Data is represented as a vector of features
  Three major approaches
    Nearest-neighbor based
    Density based
    Clustering based
Nearest-Neighbor Based Approach
  Approach:
    Compute the distance between every pair of data points
    There are various ways to define outliers:
      Data points for which there are fewer than p neighboring points within a distance D
      The top n data points whose distance to the kth nearest neighbor is greatest
      The top n data points whose average distance to the k nearest neighbors is greatest
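The second definition (top n points by distance to the kth nearest neighbor) can be sketched with a brute-force pairwise computation. Euclidean distance and the function names are assumptions of this sketch; the points are illustrative.

```python
import math


def kth_nn_distance(points, k):
    """For each point, the Euclidean distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def top_n_outliers(points, k, n):
    """Indices of the n points with the greatest k-th nearest neighbor distance."""
    scores = kth_nn_distance(points, k)
    return sorted(range(len(points)), key=lambda i: scores[i], reverse=True)[:n]


# Four points on a unit square and one far away.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
outliers = top_n_outliers(points, k=2, n=1)
```

Replacing `dists[k - 1]` with the mean of `dists[:k]` gives the third definition (average distance to the k nearest neighbors). The pairwise computation is quadratic in the number of points.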
Outliers in Lower Dimensional Projection
  In high-dimensional space, data is sparse and the notion of proximity becomes meaningless
    Every point is an almost equally good outlier from the perspective of proximity-based definitions
  Lower-dimensional projection methods
    A point is an outlier if in some lower-dimensional projection, it is present in a local region of abnormally low density
Outliers in Lower Dimensional Projection
  Divide each attribute into φ equal-depth intervals
    Each interval contains a fraction f = 1/φ of the records
  Consider a k-dimensional cube created by picking grid ranges from k different dimensions
    If attributes are independent, we expect the region to contain a fraction f^k of the records
    If there are N points, we can measure the sparsity of a cube D as:
      $S(D) = \frac{n(D) - N f^k}{\sqrt{N f^k (1 - f^k)}}$
      where n(D) is the number of points in cube D
    Negative sparsity indicates the cube contains a smaller number of points than expected
Example
  N = 100, φ = 5, f = 1/5 = 0.2, N × f^2 = 4
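The example's expected count of 4 points per two-dimensional cell can be checked numerically. The sketch assumes the sparsity coefficient has the form S(D) = (n(D) − N f^k) / sqrt(N f^k (1 − f^k)), where n(D) is the number of points in the cube; the function name and the cell counts tried below are illustrative.

```python
import math


def sparsity(n_in_cube, n_total, phi, k):
    """Sparsity coefficient of a k-dimensional cube on a phi-interval grid.

    Assumed form: (n(D) - N f^k) / sqrt(N f^k (1 - f^k)), with f = 1 / phi.
    """
    f = 1.0 / phi
    expected = n_total * f ** k  # expected count under independence
    return (n_in_cube - expected) / math.sqrt(expected * (1 - f ** k))


# Values from the example: N = 100, phi = 5, k = 2, so N * f^2 = 4.
empty_cell = sparsity(0, 100, 5, 2)    # fewer points than expected: negative
typical_cell = sparsity(4, 100, 5, 2)  # exactly the expected count: about 0
```

An empty cell scores roughly −2, so under this form the most negative cells are the candidate low-density regions.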
Density-based: LOF Approach
  For each point, compute the density of its local neighborhood
  Compute the local outlier factor (LOF) of a sample p as the average of the ratios of the density of sample p and the density of its nearest neighbors
  Outliers are points with the largest LOF value
  In the NN approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 as outliers
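A simplified version of this idea is sketched below. It is not the full LOF algorithm: density is taken as the inverse of the average distance to the k nearest neighbors, without the reachability distance used in the standard definition, and the ratio is oriented as neighbor density over the point's own density so that larger scores mean more outlying. The function names are hypothetical, and the sketch assumes all points are distinct.

```python
import math


def knn(points, i, k):
    """Indices of the k nearest neighbors of point i."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]


def density(points, i, k):
    """Inverse of the average distance from point i to its k nearest neighbors."""
    neighbors = knn(points, i, k)
    avg = sum(math.dist(points[i], points[j]) for j in neighbors) / k
    return 1.0 / avg


def simple_lof(points, k):
    """Average ratio of each neighbor's density to the point's own density."""
    dens = [density(points, i, k) for i in range(len(points))]
    scores = []
    for i in range(len(points)):
        neighbors = knn(points, i, k)
        scores.append(sum(dens[j] / dens[i] for j in neighbors) / k)
    return scores


# Four points on a unit square and one far away.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
lof_scores = simple_lof(points, k=2)
```

Points whose neighborhood is about as dense as their neighbors' score close to 1. Because the score is relative to local density, a point just outside a tight cluster can score high even when its absolute distance to its neighbors is small.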
Clustering-Based
  Basic idea:
    Cluster the data into groups of different density
    Choose points in small clusters as candidate outliers
    Compute the distance between candidate points and non-candidate clusters
      If candidate points are far from all other non-candidate points, they are outliers
Base Rate Fallacy
  Bayes' theorem:
    $P(A \mid B) = \frac{P(A)\,P(B \mid A)}{P(B)}$
  More generally:
    $P(A_i \mid B) = \frac{P(A_i)\,P(B \mid A_i)}{\sum_{j} P(A_j)\,P(B \mid A_j)}$
Base Rate Fallacy (Axelsson, 1999)
Base Rate Fallacy
  Even though the test is 99% certain, your chance of having the disease is 1/100, because the population of healthy people is much larger than the population of sick people
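The 1/100 figure follows from Bayes' theorem once a base rate is fixed. The sketch below assumes a prevalence of 1 in 10,000 and a test that is correct 99% of the time for both sick and healthy people; the prevalence is not stated in the text above, so treat it as an assumption of this example. The function name is hypothetical.

```python
def posterior(prior, true_positive_rate, false_positive_rate):
    """P(H | E) by Bayes' theorem for a binary hypothesis H and evidence E."""
    evidence = prior * true_positive_rate + (1 - prior) * false_positive_rate
    return prior * true_positive_rate / evidence


# Assumed numbers: prevalence 1/10,000, test right 99% of the time
# for sick people (true positives) and healthy people (1% false positives).
p_sick_given_positive = posterior(prior=1 / 10_000,
                                  true_positive_rate=0.99,
                                  false_positive_rate=0.01)
```

Under these assumptions the posterior is a little under 0.01: the false positives among the 9,999 healthy people far outnumber the true positives from the single sick person.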
Base Rate Fallacy in Intrusion Detection
  I: intrusive behavior, ¬I: non-intrusive behavior
  A: alarm, ¬A: no alarm
  Detection rate (true positive rate): P(A | I)
  False alarm rate: P(A | ¬I)
  Goal is to maximize both
    Bayesian detection rate, P(I | A)
    P(¬I | ¬A)
Detection Rate vs False Alarm Rate
  Suppose: [equation not preserved]
  Then: [equation not preserved]
  False alarm rate becomes more dominant if P(I) is very low
Detection Rate vs False Alarm Rate
  Axelsson: We need a very low false alarm rate to achieve a reasonable Bayesian detection rate
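Axelsson's point can be illustrated by evaluating the Bayesian detection rate, P(I | A) = P(I) P(A | I) / (P(I) P(A | I) + P(¬I) P(A | ¬I)), for a range of false alarm rates. The numbers below are hypothetical and chosen only to show the trend: a base rate P(I) of 1 in 100,000 and a perfect detection rate of 1.0.

```python
def bayesian_detection_rate(p_intrusion, detection_rate, false_alarm_rate):
    """P(I | A): probability that an alarm corresponds to a real intrusion."""
    true_alarms = p_intrusion * detection_rate
    false_alarms = (1 - p_intrusion) * false_alarm_rate
    return true_alarms / (true_alarms + false_alarms)


# Hypothetical values: P(I) = 1e-5 and a perfect detector, P(A | I) = 1.0.
rates = {
    far: bayesian_detection_rate(1e-5, 1.0, far)
    for far in (1e-2, 1e-3, 1e-4, 1e-5)
}
```

With these values, a 1% false alarm rate leaves fewer than 1 alarm in 1,000 pointing at a real intrusion, and the false alarm rate has to fall to about the base rate itself before half of the alarms are genuine.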