Data Mining Anomaly Detection. Lecture Notes for Chapter 10. Introduction to Data Mining

Data Mining: Anomaly Detection
Lecture Notes for Chapter 10, Introduction to Data Mining, by Tan, Steinbach, Kumar
Tan, Steinbach, Kumar. Introduction to Data Mining. 4/18/2004

Anomaly/Outlier Detection
- What are anomalies/outliers? The set of data points that are considerably different from the remainder of the data.
- Variants of anomaly/outlier detection problems:
  - Given a database D, find all the data points x ∈ D with anomaly scores greater than some threshold t.
  - Given a database D, find all the data points x ∈ D having the top-n largest anomaly scores f(x).
  - Given a database D containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D.
- Applications: credit card fraud detection, telecommunication fraud detection, network intrusion detection, fault detection.
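The first two problem variants differ only in how the anomaly scores are turned into a decision. A minimal sketch, assuming the scores f(x) have already been computed by some method (the function names and the score values are hypothetical):

```python
def anomalies_above_threshold(scores, threshold):
    """Variant 1: indices of points whose anomaly score exceeds the threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]


def top_n_anomalies(scores, n):
    """Variant 2: indices of the n points with the largest anomaly scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]


# Hypothetical anomaly scores f(x) for six data points.
scores = [0.1, 0.9, 0.2, 0.05, 0.7, 0.3]
above = anomalies_above_threshold(scores, 0.5)
top2 = top_n_anomalies(scores, 2)
```

The threshold variant can return any number of points, including none, whereas the top-n variant always returns exactly n points even when nothing in the data is unusual.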

Importance of Anomaly Detection
Ozone Depletion History
- In 1985 three researchers (Farman, Gardiner and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels.
- Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations?
- The ozone concentrations recorded by the satellite were so low they were being treated as outliers by a computer program and discarded!
Sources:
http://exploringdata.cqu.edu.au/ozone.html
http://www.epa.gov/ozone/science/hole/size.html

Anomaly Detection
- Challenges
  - How many outliers are there in the data?
  - The method is unsupervised, so validation can be quite challenging (just like for clustering).
  - Finding a needle in a haystack.
- Working assumption: there are considerably more "normal" observations than "abnormal" observations (outliers/anomalies) in the data.

Anomaly Detection Schemes
- General steps
  - Build a profile of the "normal" behavior. The profile can be patterns or summary statistics for the overall population.
  - Use the "normal" profile to detect anomalies. Anomalies are observations whose characteristics differ significantly from the normal profile.
- Types of anomaly detection schemes
  - Graphical and statistical-based
  - Distance-based
  - Model-based

Graphical Approaches
- Boxplot (1-D), scatter plot (2-D), spin plot (3-D)
- Limitations
  - Time consuming
  - Subjective

Convex Hull Method
- Extreme points are assumed to be outliers.
- Use the convex hull method to detect extreme values.
- What if the outlier occurs in the middle of the data?

Statistical Approaches
- Assume a parametric model describing the distribution of the data (e.g., normal distribution).
- Apply a statistical test that depends on:
  - the data distribution
  - the parameters of the distribution (e.g., mean, variance)
  - the number of expected outliers (confidence limit)

Grubbs' Test
- Detects outliers in univariate data.
- Assumes the data comes from a normal distribution.
- Detects one outlier at a time: remove the outlier and repeat.
  - H_0: there is no outlier in the data.
  - H_A: there is at least one outlier.
- Grubbs' test statistic:
  $$G = \frac{\max_i |X_i - \bar{X}|}{s}$$
- Reject H_0 if:
  $$G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{(\alpha/N,\,N-2)}}{N - 2 + t^2_{(\alpha/N,\,N-2)}}}$$
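The test statistic and rejection threshold can be sketched in a few lines. This is a minimal illustration using only the standard library, so the critical value t_(α/N, N-2) is passed in as a parameter rather than computed. The data and the value `t_crit=3.5` are placeholders, not values from the slides.

```python
import math
import statistics


def grubbs_statistic(xs):
    """Return (G, index): G = max |x - mean| / s and the index of that extreme point."""
    mean = statistics.mean(xs)
    s = statistics.stdev(xs)  # sample standard deviation (N - 1 denominator)
    idx = max(range(len(xs)), key=lambda i: abs(xs[i] - mean))
    return abs(xs[idx] - mean) / s, idx


def grubbs_critical_value(n, t_crit):
    """Rejection threshold, given t_crit = t_(alpha/N, N-2) from a t table."""
    t2 = t_crit ** 2
    return (n - 1) / math.sqrt(n) * math.sqrt(t2 / (n - 2 + t2))


data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 14.5]  # hypothetical measurements
g, idx = grubbs_statistic(data)
# Placeholder critical value; look up the real one for the chosen alpha and N.
threshold = grubbs_critical_value(len(data), t_crit=3.5)
is_outlier = g > threshold
```

To follow the "one outlier at a time" procedure, the point at `idx` would be removed whenever `is_outlier` is true and the test repeated on the remaining data with a new critical value for the reduced N.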

Statistical-based: Likelihood Approach
- Assume the data set D contains samples from a mixture of two probability distributions:
  - M (majority distribution)
  - A (anomalous distribution)
- General approach:
  - Initially, assume all the data points belong to M.
  - Let L_t(D) be the log likelihood of D at time t.
  - For each point x_t that belongs to M, move it to A.
    - Let L_{t+1}(D) be the new log likelihood.
    - Compute the difference, Δ = L_t(D) − L_{t+1}(D).
    - If Δ > c (some threshold), then x_t is declared an anomaly and moved permanently from M to A.

Statistical-based: Likelihood Approach
- Data distribution: D = (1 − λ) M + λ A
- M is a probability distribution estimated from data. It can be based on any modeling method (naïve Bayes, maximum entropy, etc.).
- A is initially assumed to be a uniform distribution.
- Likelihood at time t:
  $$L_t(D) = \prod_{i=1}^{N} P_D(x_i) = \Big( (1-\lambda)^{|M_t|} \prod_{x_i \in M_t} P_{M_t}(x_i) \Big) \Big( \lambda^{|A_t|} \prod_{x_i \in A_t} P_{A_t}(x_i) \Big)$$
  $$LL_t(D) = |M_t| \log(1-\lambda) + \sum_{x_i \in M_t} \log P_{M_t}(x_i) + |A_t| \log \lambda + \sum_{x_i \in A_t} \log P_{A_t}(x_i)$$
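The procedure can be illustrated for one-dimensional data. This is a sketch under several assumptions that the slides do not fix: M is modeled as a Gaussian re-estimated from its current members, A is uniform over the observed data range, and λ and c are arbitrary values. A point is moved when the log likelihood increases by more than c after the move, which is an assumption about the sign convention of Δ.

```python
import math


def gaussian_logpdf(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def log_likelihood(data, in_a, lam):
    """LL(D) = |M| log(1-lam) + sum_M log P_M(x) + |A| log(lam) + sum_A log P_A(x)."""
    m = [x for x, a in zip(data, in_a) if not a]
    n_a = len(data) - len(m)
    mu = sum(m) / len(m)
    # Maximum-likelihood standard deviation of M; guard against a zero spread.
    sigma = math.sqrt(sum((x - mu) ** 2 for x in m) / len(m)) or 1e-9
    log_p_a = -math.log(max(data) - min(data))  # uniform density over the data range
    ll = len(m) * math.log(1 - lam) + sum(gaussian_logpdf(x, mu, sigma) for x in m)
    if n_a:
        ll += n_a * (math.log(lam) + log_p_a)
    return ll


def likelihood_anomalies(data, lam=0.05, c=1.0):
    """Single pass: tentatively move each point from M to A, keep the move if LL gains > c."""
    in_a = [False] * len(data)
    ll = log_likelihood(data, in_a, lam)
    for i in range(len(data)):
        in_a[i] = True
        new_ll = log_likelihood(data, in_a, lam)
        if new_ll - ll > c:
            ll = new_ll          # permanent move to A
        else:
            in_a[i] = False      # undo the tentative move
    return [i for i, a in enumerate(in_a) if a]


data = [1.0, 1.1, 0.9, 1.05, 0.95, 1.0, 8.0, 1.02, 0.98]  # hypothetical sample
anomalies = likelihood_anomalies(data)
```

Moving the value 8.0 to A lets the Gaussian for M tighten around 1.0, which raises the log likelihood sharply. Moving any of the other points only costs the log λ penalty and gains little.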

Limitations of Statistical Approaches
- Most of the tests are for a single attribute.
- In many cases, the data distribution may not be known.
- For high-dimensional data, it may be difficult to estimate the true distribution.

Distance-based Approaches
- Data is represented as a vector of features.
- Three major approaches:
  - Nearest-neighbor based
  - Density based
  - Clustering based

Nearest-Neighbor Based Approach
- Approach: compute the distance between every pair of data points.
- There are various ways to define outliers:
  - Data points for which there are fewer than p neighboring points within a distance D.
  - The top n data points whose distance to the kth nearest neighbor is greatest.
  - The top n data points whose average distance to the k nearest neighbors is greatest.
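As an illustration of the second definition, the following sketch scores each point by its distance to its kth nearest neighbor and returns the top n. It uses brute-force pairwise Euclidean distances; the function names and the sample points are hypothetical.

```python
import math


def kth_nn_distance_scores(points, k):
    """Score each point by its distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def top_n_outliers(points, k, n):
    """Indices of the n points with the largest k-th nearest neighbor distance."""
    scores = kth_nn_distance_scores(points, k)
    return sorted(range(len(points)), key=lambda i: scores[i], reverse=True)[:n]


points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]  # hypothetical 2-D data
outliers = top_n_outliers(points, k=2, n=1)
```

Replacing `dists[k - 1]` with `sum(dists[:k]) / k` gives the third definition (average distance to the k nearest neighbors). Both versions cost O(N^2) distance computations, which is what "every pair of data points" implies.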

Outliers in Lower Dimensional Projection
- In high-dimensional space, data is sparse and the notion of proximity becomes meaningless.
  - Every point is an almost equally good outlier from the perspective of proximity-based definitions.
- Lower-dimensional projection methods:
  - A point is an outlier if, in some lower-dimensional projection, it is present in a local region of abnormally low density.

Outliers in Lower Dimensional Projection
- Divide each attribute into φ equal-depth intervals.
  - Each interval contains a fraction f = 1/φ of the records.
- Consider a k-dimensional cube created by picking grid ranges from k different dimensions.
  - If attributes are independent, we expect the region to contain a fraction f^k of the records.
  - If there are N points, we can measure the sparsity of a cube D, where n(D) is the number of points in D, as:
    $$S(D) = \frac{n(D) - N f^k}{\sqrt{N f^k (1 - f^k)}}$$
  - Negative sparsity indicates that the cube contains a smaller number of points than expected.

Example
- N = 100, φ = 5, f = 1/5 = 0.2, N f^2 = 4
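The numbers in this example can be checked with a short sketch. It takes the sparsity coefficient to be the usual standardized form, (observed count minus expected count N f^k) divided by the binomial standard deviation sqrt(N f^k (1 − f^k)). That form, and the function name, are assumptions made here for illustration.

```python
import math


def sparsity_coefficient(n_in_cube, n_total, phi, k):
    """Standardized deviation of the cube's count from its expected count."""
    f = 1.0 / phi                      # fraction of records per equal-depth interval
    expected = n_total * f ** k        # expected count if attributes are independent
    std = math.sqrt(n_total * f ** k * (1 - f ** k))
    return (n_in_cube - expected) / std


expected_count = 100 * (1 / 5) ** 2    # N * f^2, about 4 points per 2-D cell
s_empty = sparsity_coefficient(0, 100, 5, 2)   # an empty cell
s_typical = sparsity_coefficient(4, 100, 5, 2) # a cell with the expected count
```

With N = 100 and φ = 5, a two-dimensional cell holding the expected 4 points scores about 0, while an empty cell scores roughly −2, so points falling in strongly negative cells are the outlier candidates.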

Density-based: LOF Approach
- For each point, compute the density of its local neighborhood.
- Compute the local outlier factor (LOF) of a sample p as the average of the ratios of the density of sample p and the density of its nearest neighbors.
- Outliers are points with the largest LOF value.
[Figure: two clusters of different density with two marked points, p1 and p2.]
- In the NN approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 as outliers.
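The contrast between the NN and LOF approaches can be reproduced with a simplified density ratio. This sketch is not the full LOF definition with reachability distances: it assumes local density is the inverse of the average distance to the k nearest neighbors, and it takes the ratio as neighbor density over the point's own density so that outliers receive the largest values. The data layout is hypothetical and contains no duplicate points.

```python
import math


def knn(points, i, k):
    """(distance, index) pairs for the k nearest neighbors of points[i]."""
    d = sorted((math.dist(points[i], q), j) for j, q in enumerate(points) if j != i)
    return d[:k]


def local_density(points, i, k):
    """Inverse of the average distance to the k nearest neighbors."""
    return k / sum(dist for dist, _ in knn(points, i, k))


def simplified_lof(points, k):
    """Average neighbor density divided by the point's own density."""
    dens = [local_density(points, i, k) for i in range(len(points))]
    return [
        sum(dens[j] for _, j in knn(points, i, k)) / k / dens[i]
        for i in range(len(points))
    ]


dense = [(0.1 * i, 0.1 * j) for i in range(3) for j in range(3)]    # tight cluster
sparse = [(10 + 2 * i, 10 + 2 * j) for i in range(3) for j in range(3)]  # loose cluster
p2, p1 = (0.1, 0.8), (5, 20)   # p2 sits just outside the tight cluster, p1 is far away
points = dense + sparse + [p2, p1]

lof = simplified_lof(points, k=2)
top2 = sorted(range(len(points)), key=lambda i: lof[i], reverse=True)[:2]
```

Here p2 is only about 0.6 away from the tight cluster, closer than the points of the loose cluster are to each other, so a kth-nearest-neighbor distance would not rank it as unusual. Its density relative to its neighbors is low, however, so the ratio flags it together with p1.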

Clustering-Based
- Basic idea:
  - Cluster the data into groups of different density.
  - Choose points in small clusters as candidate outliers.
  - Compute the distance between candidate points and non-candidate clusters.
    - If candidate points are far from all other non-candidate points, they are outliers.
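The candidate-selection and distance steps can be sketched as follows, assuming cluster labels have already been produced by some clustering method. The parameters `min_size` and `min_dist`, the labels, and the points are all hypothetical.

```python
import math
from collections import Counter


def cluster_based_outliers(points, labels, min_size, min_dist):
    """Flag points in small clusters that are far from every point in a large cluster."""
    sizes = Counter(labels)
    candidates = [i for i, lab in enumerate(labels) if sizes[lab] < min_size]
    others = [i for i, lab in enumerate(labels) if sizes[lab] >= min_size]
    outliers = []
    for i in candidates:
        # Distance to the closest non-candidate point (assumes at least one exists).
        nearest = min(math.dist(points[i], points[j]) for j in others)
        if nearest > min_dist:
            outliers.append(i)
    return outliers


points = [(0, 0), (0, 1), (1, 0), (1, 1), (1.5, 1.5), (9, 9)]
labels = [0, 0, 0, 0, 1, 2]   # hypothetical output of a clustering step
result = cluster_based_outliers(points, labels, min_size=3, min_dist=2.0)
```

The point (1.5, 1.5) forms its own small cluster and is therefore a candidate, but it lies close to the large cluster and is not reported. Only (9, 9) is far from all non-candidate points.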

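Base Rate Fallacy
- Bayes' theorem:
  $$P(A \mid B) = \frac{P(A)\, P(B \mid A)}{P(B)}$$
- More generally, for mutually exclusive and exhaustive events A_1, ..., A_n:
  $$P(A_j \mid B) = \frac{P(A_j)\, P(B \mid A_j)}{\sum_{i=1}^{n} P(A_i)\, P(B \mid A_i)}$$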

Base Rate Fallacy (Axelsson, 1999)

Base Rate Fallacy
- Even though the test is 99% certain, your chance of having the disease is 1/100, because the population of healthy people is much larger than that of sick people.
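The 1/100 figure follows from Bayes' theorem once a disease prevalence is fixed. The sketch below assumes a prevalence of 1 in 10,000 and reads "99% certain" as 99% sensitivity and 99% specificity. Both are assumptions for illustration, since the prevalence is not stated in the text above.

```python
def posterior_sick_given_positive(prevalence, sensitivity, specificity):
    """P(sick | positive test) via Bayes' theorem."""
    true_pos = sensitivity * prevalence                 # P(positive, sick)
    false_pos = (1 - specificity) * (1 - prevalence)    # P(positive, healthy)
    return true_pos / (true_pos + false_pos)


# Assumed values: 1 in 10,000 people are sick; the test is right 99% of the time.
p = posterior_sick_given_positive(prevalence=1 / 10_000, sensitivity=0.99, specificity=0.99)
```

Under these assumptions `p` is a little under 0.01: the roughly 1% of healthy people who test positive vastly outnumber the sick people who test positive.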

Base Rate Fallacy in Intrusion Detection
- I: intrusive behavior, ¬I: non-intrusive behavior
- A: alarm, ¬A: no alarm
- Detection rate (true positive rate): P(A|I)
- False alarm rate: P(A|¬I)
- The goal is to maximize both:
  - the Bayesian detection rate, P(I|A)
  - P(¬I|¬A)

Detection Rate vs. False Alarm Rate
- Suppose:
- Then:
- The false alarm rate becomes more dominant if P(I) is very low.

Detection Rate vs. False Alarm Rate
- Axelsson: we need a very low false alarm rate to achieve a reasonable Bayesian detection rate.
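A short sweep makes this conclusion concrete. The sketch applies Bayes' theorem to P(I|A) with a hypothetical base rate P(I) = 2e-5 and a perfect detection rate P(A|I) = 1.0. Both values are assumptions chosen only to illustrate the trend.

```python
def bayesian_detection_rate(p_intrusion, detection_rate, false_alarm_rate):
    """P(I|A) = P(I) P(A|I) / (P(I) P(A|I) + P(not I) P(A|not I))."""
    hit = detection_rate * p_intrusion
    false_alarm = false_alarm_rate * (1 - p_intrusion)
    return hit / (hit + false_alarm)


p_i = 2e-5  # hypothetical base rate of intrusive behavior
rates = {
    fa: bayesian_detection_rate(p_i, 1.0, fa)
    for fa in (1e-2, 1e-3, 1e-4, 1e-5)
}
```

Even with perfect detection, a false alarm rate of 1e-3 leaves P(I|A) below 2%. The Bayesian detection rate passes one half only when the false alarm rate falls to about the same order of magnitude as P(I) itself.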