Associating Absent Frequent Itemsets with Infrequent Items to Identify Abnormal Transactions


Li-Jen Kao
Department of Computer Science and Information Engineering, Hwa Hsia Institute of Technology, New Taipei City, Taiwan 23568
lijenkao@cc.hwh.edu.tw

Yo-Ping Huang*
Department of Electrical Engineering, National Taipei University of Technology, Taipei, Taiwan 10608
yphuang@ntut.edu.tw
*corresponding author

Frode Eika Sandnes
Institute of Information Technology, Faculty of Technology, Art and Design, Oslo and Akershus University College of Applied Sciences, Oslo, Norway
Frode-Eika.Sandnes@hioa.no

Abstract

Data stored in transactional databases are vulnerable to noise and outliers and are often discarded at the early stage of data mining. Abnormal transactions in the marketing transactional database are those transactions that should contain some items but do not. However, some abnormal transactions may provide valuable information in the knowledge mining process. The literature on how to efficiently identify abnormal transactions in the database as well as determine what causes the transactions to be abnormal is scarce. This paper proposes a framework to realize abnormal transactions as well as the items that induce the abnormal transactions. Results from one synthetic and two medical data sets are presented to compare with previous work to verify the effectiveness of the proposed framework.

Keywords: data mining; abnormal transactions; absent frequent itemset; infrequent items; association rules.

1 Introduction

Data mining is an emerging technology used to discover interesting patterns from large databases. In the past, more efforts reported in literature were focused on developing efficient methods to find association rules [5, 9-11, 14, 30, 32]. Recently, outlier detection attracts attention due to its importance in detecting deviant data. Several applications rely on outlier detection for the discovery of vital information such as credit card fraud detection, network intrusion detection, abnormal numeric values in stock prices, and disease symptom diagnosis [1-2, 6-7, 12, 17-18, 28, 30-31, 34-36]. Though some breakthroughs have been reported on outlier detection, there remain critical issues to be resolved.
First, most outlier detection algorithms are designed for numerical data and rely on computing the relative distance between data points. These algorithms are not suitable for datasets with categorical attributes [17]. The following example illustrates why distance measuring methods fail to detect outliers in categorical datasets. Table 1 contains 10 transactions that can be divided into 3 types, i.e., {item1, item2}, {item1} and {item2}. If the minimum support and minimum confidence are set to 50% and 80%, respectively, the rule item1 → item2 will be an association rule, instead of the rule item2 → item1. According to the derived association rule, if someone buys item1, it is very possible that they will buy item2 at the same time. That is, a transaction with only item1 may be an outlier, but not a transaction with only item2. However, Fig. 1 shows that if the data point (item1, item2) is the center of a cluster and (item1) is an outlier, then (item2) should be an outlier, too. In other words, if transactions with only item1 are possible outliers, then transactions with only item2 should also be possible outliers according to distance. With higher dimensionalities more transactions will be incorrectly classified as outliers.

Many commercial applications rely on marketing transactional databases with both categorical and numerical attributes. Finding outlier transactions in such databases is important since outliers for instance may affect marketing management or sales strategies. Only a handful of studies have focused on the detection of outlier transactions from categorical datasets or transactional databases [15-16, 18, 21, 27]. He et al. [16] proposed an entropy-based method to detect outliers. The FindFPOF (Frequent Pattern Outlier Factor) algorithm [17] is another well-known item-based outlier detection technique. He, Xu and Deng [17] defined an outlier transaction as a transaction with few frequent patterns. FindFPOF first discovers frequent itemsets and then finds outliers by comparing each transaction with every frequent itemset. The drawback of this algorithm is that the efficiency deteriorates with the increase of frequent itemsets. Narita and Kitagawa [26] proposed another item-based approach where outliers are assumed to be transactions that violate most association rules.
Narita and Kitagawa's work shared similarities with that of He, Xu and Deng [17], but they reduced the search space to expedite the search for outliers in large datasets.

Table 1. Transactional database sample 1. TID: Transaction IDentification number.

| TID | Items |
|-----|-------|
| 1 | item1, item2 |
| 2 | item1, item2 |
| 3 | item1, item2 |
| 4 | item1, item2 |
| 5 | item1, item2 |
| 6 | item1 |
| 7 | item2 |
| 8 | item2 |
| 9 | item2 |
| 10 | item2 |

Fig. 1. The relationship between data point (item1, item2).

Secondly and most importantly, the attempts documented in the literature did not offer any suggestions on what caused the transactions to become abnormal. In fact, the abnormal transactions by themselves provide little information for decision making. For example, assume that an association rule r, {Jam, Milk} → {Bread}, is derived from a transactional database D with the high confidence value of 80%. According to the association rule r, the transaction <Bacon, Corn, Jam, Milk> may be abnormal due to the absence of Bread. Merely knowing that a certain item is absent brings no benefit. However, if one can find the reasons for the occurrence of outliers, that will help us make better decisions. There may be a variety of reasons behind the abnormal transactions. For example, a customer may want to buy bread, but found that he or she did not bring enough cash. In this example it is not easy to explore the underlying reasons for not buying bread. But if the reason is that the emergence of some items leads to the disappearance of some other items, then finding the reason is transformed into the issue of identifying which items cause the absence of other items, and this can be resolved by using the proposed method. In the aforementioned example, could Bacon or Corn be the item that makes Bread absent?

The idea of identifying which items cause the absence of other items is practical. The reason is that if users can apply association rules to find the relationships between items, they can also use association rules to find the relationships between items and absent items. This paper proposes a framework for identifying the outlier transactions in marketing databases and finding which items may cause transactions to become outliers. The framework is divided into two parts. The first part of this study utilizes association rules to efficiently identify abnormal transactions in a database. An abnormal transaction is defined as a transaction that is expected to contain some items that actually do not appear. Those items that should have been contained are marked as absent items.
Absent items themselves hardly provide any value in decision making unless the reasons that cause the items' absence can be found. The second part of this study uses an association rules mining algorithm to extract the relationship between absent frequent items and infrequent items. Typically, the infrequent items that are always ignored in association rules studies may be the key cause of abnormal transactions [13]. Our approach is to transform each transaction into absent frequent itemsets and infrequent items. These transformed transactions can be mined by employing an association rules mining algorithm to find the relationship between infrequent items and absent items.

The remaining sections of this paper are organized as follows. Section 2 introduces related work on outlier detection. Section 3 describes the proposed method and section 4 identifies items that induce abnormal transactions. Section 5 provides experimental evidence. Section 6 concludes the paper.

2 Related work

2.1 Frequent itemsets and association rules

A frequent itemset is an itemset that is contained in at least a certain number of transactions. Association rules can be derived from frequent itemsets. The well-known association rule example derived from supermarket shopping data is {diaper} → {beer}, which means people buying diapers will also buy beer at the same time. The association rules help businesses to plan proper strategies to increase their sales. The following is a brief description of how association rules are found based on a transactional database.

Let $I$ be the set of all items. A transactional database $D$ is a set of transactions where each transaction $t$ is a set of items such that $t \subseteq I$. The cardinality of the database $D$ is denoted by $|D|$. For two itemsets $X, Y \subseteq I$ with $X \cap Y = \emptyset$, the rule $X \rightarrow Y$ means that if $X$ occurs then $Y$ also occurs. An itemset $X$'s support is denoted by support($X$):

$$\mathrm{support}(X) = \frac{|\{t \in D \mid X \subseteq t\}|}{|D|}. \qquad (1)$$

An itemset is frequent if its support is larger than or equal to a pre-defined support threshold min_sup. The confidence of $X \rightarrow Y$ is defined as confidence($X \rightarrow Y$):

$$\mathrm{confidence}(X \rightarrow Y) = \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)}. \qquad (2)$$

An association rule is a rule with its confidence larger than or equal to a pre-defined threshold min_conf. The Apriori-based algorithm is usually adopted for mining association rules. The original Apriori-based algorithm is inefficient because it repeatedly scans the same database to find frequent itemsets. Various non-Apriori methods have been proposed to expedite the discovery of association rules [14, 20, 33]. FP-growth [14], a well-known non-Apriori association rules mining algorithm, scans the database twice to build an FP-tree where all the frequent itemsets are stored. Each branch in the FP-tree is a frequent itemset. The association rules are then mined from the FP-tree. Since the FP-tree is a compact structure, its performance is better compared to the Apriori family of algorithms [11, 19].

2.2 Maximal frequent itemsets

The FP-tree structure is not only used to generate association rules but is also a good data structure for applications that only need to utilize the information of frequent itemsets. However, if there are many long transaction patterns or the minimum support setting is low, the number of frequent itemsets and the FP-tree storage will be huge [11]. In this case, one can consider getting maximal frequent itemsets instead of frequent itemsets. A frequent itemset $X$ is a maximal frequent itemset (MFI) if there is no other frequent itemset $Y$ such that $X \subset Y$. Any subset of a maximal frequent itemset is a frequent itemset; that is, one can still get frequent itemsets from maximal frequent itemsets. Since the total number of maximal frequent itemsets is less than that of frequent itemsets, storage requirements are reduced. Several algorithms, such as MAFIA [5], GenMax [9] and FPmax [10], find maximal frequent itemsets.
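Eqs. (1) and (2) and the definition of a maximal frequent itemset can be checked on a small database. The sketch below is only an illustration: it enumerates itemsets by brute force rather than using FP-growth or FPmax, the function names are made up for this example, and the ten transactions are those of the sample database in Table 2 with a minimum support of 20%.

```python
from fractions import Fraction
from itertools import combinations

# The ten transactions of the sample database in Table 2 (one letter per item).
DB = [
    set("abcefo"), set("acg"), set("ei"), set("acdeg"), set("acegl"),
    set("ej"), set("abcefp"), set("acd"), set("acegm"), set("acegn"),
]

def support(itemset, db):
    """Eq. (1): fraction of transactions containing every item of `itemset`."""
    itemset = set(itemset)
    return Fraction(sum(itemset <= t for t in db), len(db))

def confidence(x, y, db):
    """Eq. (2): support(X u Y) / support(X)."""
    return support(set(x) | set(y), db) / support(x, db)

def frequent_itemsets(db, min_sup):
    """Brute-force enumeration over frequent items (illustration only)."""
    items = sorted({i for t in db for i in t if support({i}, db) >= min_sup})
    found = []
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            if support(combo, db) >= min_sup:
                found.append(frozenset(combo))
    return found

def maximal_itemsets(frequent):
    """Keep the frequent itemsets that have no frequent proper superset."""
    return {x for x in frequent if not any(x < y for y in frequent)}

freq = frequent_itemsets(DB, min_sup=Fraction(1, 5))
mfi = maximal_itemsets(freq)
print(sorted("".join(sorted(m)) for m in mfi))
print(support("ac", DB), confidence("a", "c", DB))
```

Exact fractions are used so that a support of exactly 20% compares equal to the threshold. Brute force is exponential in the number of frequent items, which is why tree-based miners are preferred on real data.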
FPmax is based on FP-growth and is proven to be a competitive algorithm [19]. FPmax builds an FP-tree like structure, called an MFI-tree, to keep track of all maximal frequent itemsets. Subsequent research has proposed more effective algorithms for acquiring maximal frequent itemsets; however, since they are extensions of FP-growth and have huge storage requirements, the algorithm proposed herein employs FPmax to find maximal frequent itemsets.

The following example illustrates how FPmax finds maximal frequent itemsets [10]. Table 2 lists a sample database that contains 10 transactions. The minimum support is set to 20%. Fig. 2 shows the final complete FP-tree. If an FP-tree has only one path, it is an MFI-tree. Since the FP-tree in Fig. 2 is not a single-path tree, we first find a conditional pattern base and conditional FP-tree for each item in the header table. For example, the corresponding conditional FP-tree of item f is shown in Fig. 3. The items in the conditional pattern base are listed in descending order according to frequency. Note that if the conditional FP-tree of an item has more than one path, the FP-tree needs to be separated into several single-path trees. The initial MFI-tree only contains the header table. The first item f's conditional FP-tree is inserted into the MFI-tree and the itemset {a, c, e, b, f} is a maximal frequent itemset. The following step involves checking if the item d's conditional FP-tree, {a, c, d}, is a subset of any maximal frequent itemset in the MFI-tree. If it is not a subset, it is inserted into the MFI-tree. Next, the item b's conditional FP-tree, {a, c, e, b}, is a subset of {a, c, e, b, f} and will not be inserted. This subset-checking step is repeated until all the items in the header table are processed. Fig. 4 shows the complete MFI-tree.

Table 2. Transactional database sample 2.

| TID | Items |
|-----|-------|
| 1 | a, b, c, e, f, o |
| 2 | a, c, g |
| 3 | e, i |
| 4 | a, c, d, e, g |
| 5 | a, c, e, g, l |
| 6 | e, j |
| 7 | a, b, c, e, f, p |
| 8 | a, c, d |
| 9 | a, c, e, g, m |
| 10 | a, c, e, g, n |

Fig. 2. The complete FP-tree derived from Table 2 (node labels: root, a:8, e:2, c:8, e:6, g:1, d:1, b:2, g:4, f:2, d:1).

Fig. 3. The conditional FP-trees for item f in header table (node labels: root, a, c, e, b).

Fig. 4. The complete MFI-tree for the dataset in Table 2 (node labels: root, f, a, c, e, d, b, g, f).

2.3 The definition of outlier transactions

There is no single agreed definition of an outlier. Different applications define outliers differently with different outlier detection approaches. Statistical methods fit the dataset to assumed distributions, and data are determined to be outliers according to how well they fit into the dataset [17]. However, the underlying distribution for a certain dataset may not match the assumed distribution and consequently affect the outlier detection accuracy. Another problem involves datasets with high dimensionality, as it is difficult to estimate multidimensional distributions [22]. Distance-based methods define a data point p in a set D as an outlier if a certain percentage of other points in D are more than a pre-defined distance away from p [4]. Approaches using this definition [1, 34] have the drawback of high computation complexity when processing large datasets, making it difficult to find local outliers [22]. Clustering methods are also used to identify outliers, that is, points that are not inside any of the clusters [2, 3, 8, 24, 29]. Cluster-based methods first identify the clusters, thus the efficiency depends on how clusters are formed. Density-based methods [18] identify the outliers by comparing the density of the input dataset, and consider the outliers as points lying in low density regions.

All the mentioned methods are intended for datasets with numerical attributes and rely on data point distance measures to determine outliers. A marketing transactional database, a dataset with multiple categorical attributes, is usually employed by commercial applications to record data. In such a multi-dimensional dataset, the concept of proximity may not be meaningful [23]. That is, an outlier transaction is not a data point; therefore, one cannot use distance measures to identify it. The aforementioned methods are also not suitable for detecting outlier transactions, even if some of the methods map the categorical attributes to numerical attributes before distances between data points are computed. The approach still faces the problem that the mapping results are not consistent across different mapping orderings.

Only a few studies have focused on identifying outliers from transactional datasets [15-16, 27]. He et al. assumed that transactions containing fewer frequent itemsets are more likely to be outlier transactions [17]. They defined the Frequent Pattern Outlier Factor (FPOF) to evaluate whether a transaction is an outlier or not. Narita and Kitagawa were interested in assessing if a transaction is likely to be an outlier when some items are supposed to appear, but actually do not appear [26]. Based on this concept, an outlier degree is defined to evaluate whether a single transaction is an outlier or not. According to their experiments, Narita and Kitagawa claim that their approach can derive more accurate results compared to other approaches such as [17].

The following example illustrates the concept of outlier transactions [26]. In order to derive association rules from the transactional database in Table 3, the minimum support and minimum confidence are set to 50% and 80%, respectively. Table 4 gives partial association rules generated from Table 3. Since all the rules in Table 4 have high confidence, we see that TID 2 <Bacon, Corn, Jam, Milk> is abnormal. By checking RID 2, this transaction does not include the item Bread that is supposed to appear in the transaction. In fact, TID 2 is an outlier according to [26].

Table 3. Transactional database sample 3.
| TID | Items |
|-----|-------|
| 1 | Bread, Jam, Milk |
| 2 | Bacon, Corn, Jam, Milk |
| 3 | Bread, Jam, Milk |
| 4 | Bacon, Bread, Corn, Egg, Milk |
| 5 | Bacon, Bread, Corn, Egg, Jam, Milk |
| 6 | Bread, Corn, Jam, Milk |
| 7 | Bacon, Bread, Egg, Milk |
| 8 | Bacon, Bread, Egg, Jam, Milk |
| 9 | Bread, Jam, Milk |
| 10 | Bacon, Egg, Milk |

Table 4. Association rules derived from Table 3. RID: association Rules IDentification number.

| RID | Rule |
|-----|------|
| 1 | {Jam} → {Bread} |
| 2 | {Jam, Milk} → {Bread} |
| 3 | {Jam} → {Bread, Milk} |
| 4 | {Bacon} → {Egg} |
| 5 | {Bacon, Milk} → {Egg} |
| 6 | {Bacon} → {Egg, Milk} |
| 7 | {Milk} → {Bread} |

2.4 The approach to detect abnormal transactions

This study aims to find the items which cause transactions to be abnormal. From the perspective of outlier management, we can proceed by identifying the outlier transactions, analyzing the causality among items and then discovering the reasons behind the abnormality. Thus, our proposed abnormal transaction detection model starts from defining what abnormal transactions are. Then, we present our new findings on detecting abnormal transactions as well as on identifying items that induce abnormal transactions. This section introduces the definitions used to derive the outlier degree.

Definition 1. Let $t$ be a transaction, $e$ be an item, and $R$ be the set of high confidence association rules. $t$'s associative closure $t^{+}$ is defined as follows:

$$t^{0} = t, \qquad t^{i+1} = t^{i} \cup \{\, e \mid e \in Y \text{ and } X \subseteq t^{i} \text{ and } X \rightarrow Y \in R \,\}, \qquad t^{+} = t^{\infty}.$$

The itemset $t^{i+1}$ includes the items that should appear in $t^{i}$ but actually do not. The sequence converges once $t^{i}$ has no more items that should appear but actually do not appear. The associative closure $t^{+}$ is an ideal form of $t$ and does not violate any association rule.

Definition 2. Let $t$ be a transaction, $R$ be the set of high confidence association rules, and $t^{+}$ be the associative closure of $t$. The outlier degree of $t$ is defined as $od(t)$:

$$od(t) = \frac{|t^{+} - t|}{|t^{+}|}. \qquad (3)$$

The outlier degree value lies between 0 and 1. For example, the associative closure $t^{+}$ for TID 2 in Table 3 is <Bacon, Corn, Bread, Egg, Jam, Milk>. The outlier degree for TID 2 is therefore equal to 2/6 = 0.33. Note that if $t^{+}$ is equal to $t$, the outlier degree $od(t)$ is equal to 0.

Definition 3. An outlier transaction is a transaction with an outlier degree $od(t)$ greater than or equal to min_od, a pre-defined outlier degree threshold.

If min_od is set to 0.3, TID 2 in Table 3 is an outlier. Since the efficiency of the algorithm deteriorates as the number of transactions increases, there is a need to improve the algorithm to reduce the time complexity. The basic idea is to reduce the size of both the transactional database and the set of association rules.

Definition 4. Let $M$ be the set of all maximal frequent itemsets, and $m$ be a maximal frequent itemset with $m \in M$. A transaction $t$'s maximal associative closure $t^{+}_{\max}$ is defined as follows:

$$t^{0} = t, \qquad t^{i+1} = t^{i} \cup \{\, e \mid e \in m \text{ and } m \cap t^{i} \neq \emptyset \text{ and } m \in M \,\}, \qquad t^{+}_{\max} = t^{\infty}.$$

Definition 5. Let $od(t)$ be $t$'s outlier degree. The upper bound of $od(t)$ is derived as follows:

$$od_{\max}(t) = \frac{|t^{+}_{\max} - t|}{|t^{+}_{\max}|}. \qquad (4)$$

If the upper bound of a transaction's outlier degree is less than min_od, then the transaction cannot be an outlier. Instead of using the association rules set to identify outliers, one can first utilize maximal frequent itemsets, which have a comparatively smaller data size, to calculate each transaction's upper bound of outlier degree and then prune transactions with upper bounds less than min_od. This helps to reduce the transaction set, and the outlier degree is only computed for the remaining transactions. Consequently, the outlier degree calculation efficiency is significantly improved.

Definition 6. An association rule $X \rightarrow Y$ is a non-redundant rule if there are no other rules $Z \rightarrow W$ and $S \rightarrow V$ such that ($Z = X$ and $Y \subset W$), and ($S \subset X$ and $V = Y$), respectively.
According to Definition 6, RID 1, 2 and RID 4, 5 in Table 4 are redundant since they can be described by RID 3 and RID 6, respectively. The redundant rules can be removed from the original association rules set and the set of all non-redundant rules is denoted as the minimal rules set $R_{\min}$. The size of $R_{\min}$ is smaller than the original association rules set, and a transaction's associative closure derived from the association rules set is the same as the associative closure derived from the minimal rules set. Definition 6 and the rules given in Table 4 indicate that the efficiency of the outlier discovery algorithm is improved when the number of association rules is reduced.
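The associative closure, the outlier degree and the effect of removing redundant rules can be tried out on TID 2 of Table 3 with the rules of Table 4. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, and the redundancy test (another rule with a subset antecedent and a superset consequent) is an assumed reading of Definition 6.

```python
from fractions import Fraction

# High-confidence rules of Table 4 as (antecedent, consequent) pairs.
RULES = [
    ({"Jam"}, {"Bread"}),
    ({"Jam", "Milk"}, {"Bread"}),
    ({"Jam"}, {"Bread", "Milk"}),
    ({"Bacon"}, {"Egg"}),
    ({"Bacon", "Milk"}, {"Egg"}),
    ({"Bacon"}, {"Egg", "Milk"}),
    ({"Milk"}, {"Bread"}),
]

def associative_closure(t, rules):
    """Definition 1: add consequents of applicable rules until nothing changes."""
    closure = set(t)
    changed = True
    while changed:
        changed = False
        for x, y in rules:
            if x <= closure and not y <= closure:
                closure |= y
                changed = True
    return closure

def outlier_degree(t, rules):
    """Eq. (3): share of the closure that is missing from the transaction."""
    closure = associative_closure(t, rules)
    if not closure:
        return Fraction(0)
    return Fraction(len(closure - set(t)), len(closure))

def minimal_rules(rules):
    """Assumed reading of Definition 6: drop a rule when a different rule has
    a subset antecedent and a superset consequent."""
    keep = []
    for x, y in rules:
        covered = any(z <= x and y <= w and (z, w) != (x, y) for z, w in rules)
        if not covered:
            keep.append((x, y))
    return keep

tid2 = {"Bacon", "Corn", "Jam", "Milk"}
r_min = minimal_rules(RULES)
print(sorted(associative_closure(tid2, RULES)))
print(outlier_degree(tid2, RULES))
print(r_min)
```

Under this reading the minimal rule set keeps RID 3, 6 and 7, and the closure of TID 2 is the same whether it is computed from all seven rules or from the three remaining ones.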

3 Infrequent items and outlier degree

The outlier degree measures how many frequent items are absent from a specific transaction. It determines how completely possible outlier transactions are detected. Looking back at Eq. (3), one can see that infrequent items also affect the outlier degree calculation. We will show that there is no need to take infrequent items into consideration when calculating the outlier degree.

Definition 7. Infrequent items are items that are not contained in any frequent itemsets.

The infrequent items in the transaction set in Table 5 are Battery and Corn, using the same minimum support and minimum confidence as in the previous example. Compared with TID 2 in Table 3, TID 2 in Table 5 has only one additional infrequent item, namely Battery. The associative closure for TID 2 is <Bacon, Corn, Jam, Milk, Battery, Bread, Egg>, and the outlier degree is equal to 2/7. If the minimum outlier degree is set to 0.3, this transaction is no longer an outlier. But according to [26], an outlier is a transaction with some items that are expected to appear, but do not. TID 2 in Table 5 is effectively the same as TID 2 in Table 3, and should be an outlier. The problem arises from the definition of associative closure, under which the more infrequent items a transaction has, the more normal the transaction appears. The outlier degrees of TID 2 and TID 10 in Table 5 are influenced by the number of infrequent items, and there are no outliers if the minimum outlier degree is set to 0.3. Based on this observation, one should remove infrequent items from the transactions before calculating the outlier degree. The outlier degree is meant to indicate how many frequent items are missing; therefore, the infrequent items should not be considered in calculating it.

Table 5. Transactional database sample 4.
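| TID | Items | Infrequent items | od(t) |
|-----|-------|------------------|-------|
| 1 | Bread, Jam, Milk | | 0 |
| 2 | Battery, Bacon, Corn, Jam, Milk | Battery, Corn | 2/7 |
| 3 | Bread, Jam, Milk | | 0 |
| 4 | Bacon, Bread, Corn, Egg, Milk | Corn | 0 |
| 5 | Bacon, Bread, Corn, Egg, Jam, Milk | Corn | 0 |
| 6 | Bread, Corn, Jam, Milk | Corn | 0 |
| 7 | Bacon, Bread, Egg, Milk | | 0 |
| 8 | Bacon, Bread, Egg, Jam, Milk | | 0 |
| 9 | Bread, Jam, Milk | | 0 |
| 10 | Battery, Bacon, Egg, Milk | Battery | 1/5 |

The conventional method of calculating the outlier degree cannot truly reflect the role of outliers in transactions. We therefore redefine a transaction's associative closure and its maximal associative closure to discover the outlier transactions.

Definition 8. Let $t$ be a transaction, $R$ be the set of high confidence association rules, and $I_r$ be the set of all infrequent items. $t^{-}$ denotes $t$'s frequent transaction, obtained by removing all the infrequent items from $t$. $t$'s associative closure $t^{+}$ is defined as follows:

$$t^{0} = t^{-} = \{\, e \mid e \in t \text{ and } e \notin I_r \,\}, \qquad t^{i+1} = t^{i} \cup \{\, e \mid e \in Y \text{ and } X \subseteq t^{i} \text{ and } X \rightarrow Y \in R \,\}, \qquad t^{+} = t^{\infty}.$$

Definition 9. Let $t$ be a transaction, $R$ be the set of high confidence association rules, and $t^{+}$ be the associative closure of $t$. The new outlier degree of $t$ is defined as $od(t)$:

$$od(t) = \frac{|t^{+} - t^{-}|}{|t^{+}|}. \qquad (5)$$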
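The effect of dropping infrequent items before computing the outlier degree can be illustrated on TID 2 and TID 10 of Table 5. The sketch below assumes that the revised degree keeps the form of Eq. (3) with the transaction replaced by its frequent transaction; the function names are hypothetical, and the rule list holds the three non-redundant rules (RID 3, 6 and 7) of Table 4.

```python
from fractions import Fraction

# Non-redundant rules RID 3, 6 and 7 of Table 4.
RULES = [
    ({"Jam"}, {"Bread", "Milk"}),
    ({"Bacon"}, {"Egg", "Milk"}),
    ({"Milk"}, {"Bread"}),
]
INFREQUENT = {"Battery", "Corn"}

def closure(items, rules):
    """Repeatedly add the consequents of rules whose antecedent is satisfied."""
    result = set(items)
    changed = True
    while changed:
        changed = False
        for x, y in rules:
            if x <= result and not y <= result:
                result |= y
                changed = True
    return result

def od_original(t, rules):
    """Eq. (3): infrequent items stay in the transaction and its closure."""
    c = closure(t, rules)
    return Fraction(len(c - t), len(c)) if c else Fraction(0)

def od_frequent_only(t, rules, infrequent):
    """Assumed form of the revised degree: infrequent items are removed first."""
    t_minus = t - infrequent
    c = closure(t_minus, rules)
    return Fraction(len(c - t_minus), len(c)) if c else Fraction(0)

tid2 = {"Battery", "Bacon", "Corn", "Jam", "Milk"}
tid10 = {"Battery", "Bacon", "Egg", "Milk"}
for name, t in (("TID 2", tid2), ("TID 10", tid10)):
    print(name, od_original(t, RULES), od_frequent_only(t, RULES, INFREQUENT))
```

With a threshold of 0.3, TID 2 falls below the threshold under the original degree (2/7) but above it once Battery and Corn are removed (2/5 under the assumed form).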

Definition 10. Let $M$ be the set of all maximal frequent itemsets. A transaction $t$'s maximal associative closure $t^{+}_{\max}$ is defined as follows:

$$t^{0} = t^{-} = \{\, e \mid e \in t \text{ and } e \notin I_r \,\}, \qquad t^{i+1} = t^{i} \cup \{\, e \mid e \in m \text{ and } m \cap t^{i} \neq \emptyset \text{ and } m \in M \,\}, \qquad t^{+}_{\max} = t^{\infty}.$$

Definition 11. Let $od(t)$ be $t$'s outlier degree. The upper bound of $od(t)$ is derived as follows:

$$od_{\max}(t) = \frac{|t^{+}_{\max} - t^{-}|}{|t^{+}_{\max}|}. \qquad (6)$$

Fig. 5 shows the proposed outlier degree algorithm.

1. Get the association rules set R from a transactional database D by employing the FP-growth algorithm.
2. Get the maximal frequent itemsets set M by employing the FPmax algorithm.
3. Reduce the size of the transactional database. Get each transaction's frequent transaction $t^{-}$ and then calculate its outlier degree upper bound $od_{\max}$. Remove transactions whose $od_{\max}$ is less than min_od. The remaining transactions are the candidates for outlier transactions. The remaining transactions set is denoted as $D_{\min}$.
4. Reduce the size of the association rules set. Remove redundant rules from R and get the minimal association rules set $R_{\min}$.
5. Get the outlier transactions set OT.
   For each $t$ in $D_{\min}$:
      Get the transaction's frequent transaction $t^{-}$ and its associative closure by checking $R_{\min}$.
      Calculate its outlier degree $od(t)$.
      If $od(t) \geq$ min_od then OT = OT ∪ {$t^{-}$}

Fig. 5. The proposed outlier degree algorithm.

4 Finding items that make transactions abnormal

The proposed outlier degree measurement method allows us to identify abnormal transactions. However, what is the benefit of discovering outliers? Can the discovery of outliers provide valuable information to further improve decision making? Usually, results from data mining help users realize unknown but important facts, and users can utilize these facts to make better decisions. For example, the famous association rule {diapers} → {beers}, mined from retail stores databases, shows that those who purchase diapers tend to also buy beers when they go grocery shopping. Based on this observation, the retailers stock diapers next to the beer coolers to increase revenues. Intrusion detection, another outlier mining example, provides time series patterns to help users predict possible intrusion events. While an abnormal transaction may be detected, there is nothing we can do about it.
At first glance, it seems that the abnormal transactions themselves do not provide valuable information for knowledge mining and that the proposed algorithm offers no major improvement over the conventional methods. However, the major contribution of the presented work lies in finding the items that cause transactions to be abnormal. To our knowledge, there is no literature that has studied how to convert outliers into useful knowledge. There could be thousands of reasons that cause transactions to be abnormal. Some reasons, like human errors, are not easy to predict, and trying to explore them is beyond the scope of this study. But infrequent items may cause abnormal behavior in some applications [13], and we should go one step further to identify which items cause a transaction to be abnormal. To address this problem, a method is proposed for analyzing the relationships between infrequent items and abnormal transactions and for identifying infrequent items that often cause the absence of certain frequent items.

Association rules mining finds items that frequently occur together. However, the mechanisms for finding association rules can also be applied to finding infrequent items that cause specific frequent items to be discarded. Before the association rules mining algorithm can be applied to find infrequent items that cause transactions to be labeled as abnormal, each transaction is transformed into two parts, namely absent frequent itemsets and infrequent items.

Definition 12. Let $t$ be a transaction and $R_{\min}$ be the set of minimal association rules. An absent frequent itemset (AF) is defined as follows:

$$AF(t) = \{\, e \mid e = X \cup Y \text{ and } X \subseteq t \text{ and } Y \not\subseteq t \text{ and } X \rightarrow Y \in R_{\min} \,\}.$$

Table 6 lists the transformed transactional database from Table 5. There are no absent frequent itemsets and no infrequent items in TID 1 in Table 5. TID 2 has three absent frequent itemsets, {Milk, Bread*}, {Milk, Jam, Bread*} and {Bacon, Milk, Egg*}, where an asterisk denotes that the item is expected but actually does not occur. The infrequent items for the transaction are Battery and Corn. TID 4 has no absent frequent itemsets, but has one infrequent item, Corn. TID 10 has one absent frequent itemset, {Milk, Bread*}, and its infrequent item is Battery. In the transformed transaction set, each absent frequent itemset is viewed as an item, and the relationships between absent frequent itemsets and infrequent items can be found. The complete algorithm, including outlier detection and finding the relationship between infrequent items and outliers, is shown in Fig. 6.

Table 7 shows an example used to verify that the proposed method can find the relationship between an outlier and its infrequent items. The items in the synthetic transaction set are a, b, c, d, e, f, g, h and i.
Table 8 shows part of the frequent itemsets derived from Table 7. If the minimum support is set to 50% and the minimum confidence to 80%, the derived association rules include {c} -> {d}, {d} -> {c}, {c, f} -> {d} and {d, f} -> {c}. According to Table 8, the infrequent items are a, e, g, h, and i. If the minimum outlier degree is set to 0.5, then transactions 5, 8, and 14 are outliers.

Table 6. The transformed transactional database. Each transaction is divided into unobserved frequent itemsets and infrequent itemsets.

| TID | Items | Note |
|---|---|---|
| 1 | ∅ | ∅ denotes no absent frequent itemset and no infrequent item. |
| 2 | Milk/Jam/Bread*, Bacon/Milk/Egg*, Milk/Bread*, Battery, Corn | The absent frequent itemset {Milk, Jam, Bread*} is viewed as an item and is denoted as Milk/Jam/Bread*. Battery and Corn are infrequent items. |
| 3 | ∅ | |
| 4 | Corn | Corn is an infrequent item. |
| 5 | Corn | Corn is an infrequent item. |
| 6 | Corn | Corn is an infrequent item. |
| 7 | ∅ | |
| 8 | ∅ | |
| 9 | ∅ | |
| 10 | Milk/Bread*, Battery | |
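The support and confidence figures quoted for the synthetic example can be checked by direct counting over the 16 transactions of Table 7. The following is a brute-force sketch for illustration, not the FP-growth procedure used in the experiments; the helper names `support` and `confidence` are assumptions of this sketch.

```python
from itertools import combinations

# The 16 transactions of Table 7, one string of item letters per TID.
TABLE7 = ["cdfg", "abcdeg", "acdf", "cdhi", "defh", "acdfeg", "bcdef", "bcfei",
          "cdefgi", "bcdf", "abcd", "bg", "cdfh", "bdfh", "bcdf", "cdfg"]
transactions = [frozenset(row) for row in TABLE7]


def support(itemset):
    """Fraction of transactions that contain every item of itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent):
    """support(X union Y) / support(X) for the rule X -> Y."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)


MIN_SUPPORT, MIN_CONFIDENCE = 0.5, 0.8
items = sorted(set().union(*transactions))
frequent_items = [i for i in items if support(i) >= MIN_SUPPORT]
infrequent_items = [i for i in items if support(i) < MIN_SUPPORT]

# Brute-force enumeration is acceptable for only four frequent items.
frequent_itemsets = [
    "".join(combo)
    for k in range(1, len(frequent_items) + 1)
    for combo in combinations(frequent_items, k)
    if support(combo) >= MIN_SUPPORT
]

print(infrequent_items)   # ['a', 'e', 'g', 'h', 'i']
print(frequent_itemsets)  # ['b', 'c', 'd', 'f', 'cd', 'cf', 'df', 'cdf']
for x, y in [("c", "d"), ("d", "c"), ("cf", "d"), ("df", "c")]:
    print(f"{{{x}}} -> {{{y}}}: confidence {confidence(x, y):.3f}")
```

Under these thresholds the four quoted rules have confidences of roughly 0.92, 0.86, 0.90 and 0.82, so all of them clear the 80% minimum.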

1. Get the outlier transaction set $OT$ from the transactional database $D$.
2. Transform $OT$ to $OT_{trans}$: transform each $t$ in $OT$ to $t_{trans}$ by dividing $t$ into two parts, absent frequent itemsets and infrequent items.
3. Get the association rule set $R_{trans}$ from $OT_{trans}$.

Fig. 6. Algorithm for finding abnormal transactions and identifying which items cause a transaction to be labelled as abnormal.

The first step involves transforming each outlier into two parts, namely absent frequent itemsets and infrequent items. TID 5 has two absent frequent itemsets, {d, c*} and {d, f, c*}, and e and h are its infrequent items. TID 8 has two absent frequent itemsets, namely {c, d*} and {c, f, d*}, and e and i are its infrequent items. The final transformation result is shown in Table 9. Table 9 can be viewed as a new transactional database in which each absent frequent itemset is treated as an item. By applying association rules mining with minimum support and minimum confidence set to 50% and 80%, respectively, we find that item h is the one that induces the abnormal transaction, since the rules {h} -> {d, f, c*} and {h} -> {d, c*} can be derived from Table 9. This means that item c should appear, but because of the infrequent item h, item c is not observed in the transaction. That is, item h causes the transaction to be marked as abnormal.

Table 7. A partial synthetic transactional database.

| TID | Items |
|---|---|
| 1 | c, d, f, g |
| 2 | a, b, c, d, e, g |
| 3 | a, c, d, f |
| 4 | c, d, h, i |
| 5 | d, e, f, h |
| 6 | a, c, d, f, e, g |
| 7 | b, c, d, e, f |
| 8 | b, c, f, e, i |
| 9 | c, d, e, f, g, i |
| 10 | b, c, d, f |
| 11 | a, b, c, d |
| 12 | b, g |
| 13 | c, d, f, h |
| 14 | b, d, f, h |
| 15 | b, c, d, f |
| 16 | c, d, f, g |

Table 8. Part of the frequent itemsets and infrequent items derived from Table 7.

| 1-item frequent itemsets | 2-item frequent itemsets | 3-item frequent itemsets | Infrequent items |
|---|---|---|---|
| {b} {c} {d} {f} | {f, c} {f, d} {c, d} | {f, c, d} | a, e, g, h, i |

Table 9. The transformed transactional database according to Table 8.

| TID | Items |
|---|---|
| 5 | d/c*, d/f/c*, e, h |
| 8 | c/d*, c/f/d*, e, i |
| 14 | d/c*, d/f/c*, h |
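Step 3 of Fig. 6 can be illustrated on Table 9 with a small counting routine. The sketch below assumes that only rules with a single infrequent item as antecedent and a single absent frequent itemset as consequent are of interest, which matches the two rules reported above but is not stated as a general restriction in the text; `mine_cause_rules` is a hypothetical name.

```python
from collections import Counter
from itertools import product

# Table 9: transformed outliers; labels ending in "*" are absent frequent itemsets.
TABLE9 = {
    5: {"d/c*", "d/f/c*", "e", "h"},
    8: {"c/d*", "c/f/d*", "e", "i"},
    14: {"d/c*", "d/f/c*", "h"},
}


def mine_cause_rules(transactions, min_support, min_confidence):
    """Return (infrequent item, absent itemset, support, confidence) tuples."""
    n = len(transactions)
    item_count = Counter()
    pair_count = Counter()
    for items in transactions:
        infrequent = sorted(i for i in items if not i.endswith("*"))
        absent = sorted(i for i in items if i.endswith("*"))
        item_count.update(infrequent)
        pair_count.update(product(infrequent, absent))
    rules = []
    for (cause, effect), count in sorted(pair_count.items()):
        support = count / n                    # share of outliers with both
        confidence = count / item_count[cause]  # share of cause that shows effect
        if support >= min_support and confidence >= min_confidence:
            rules.append((cause, effect, support, confidence))
    return rules


rules = mine_cause_rules(list(TABLE9.values()), min_support=0.5, min_confidence=0.8)
for cause, effect, sup, conf in rules:
    print(f"{{{cause}}} -> {{{effect}}}  support={sup:.2f} confidence={conf:.2f}")
```

With three outliers, item h co-occurs with d/c* and d/f/c* in two of them (support 2/3) and always when h is present (confidence 1), while e and i reach only 1/3 support with any absent itemset and are filtered out.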

5 Experimental results and discussion

Three experiments were conducted to evaluate the effectiveness of the algorithm. The proposed algorithm was implemented in Dev C++, and the experiments were run on a workstation with an Intel 2.5 GHz processor and 2 GB of memory. FP-growth is adopted to mine frequent itemsets and association rules. The maximum frequent itemsets are derived by using FPmax.

The first experiment uses a synthetic data set generated with the IBM Quest synthetic data generator. The parameter settings for the data generation are: (i) the total number of transactions D = 532, (ii) the average size per transaction = 8, and (iii) the total number of items N = 25. To get association rules from the generated 532-transaction database, minimum support and minimum confidence were set to 18% and 78%, respectively. Table 10 lists the discovered association rules and infrequent items. Before the outliers were extracted, the redundant association rules check was performed, and no redundant rules were found in this experiment. If the minimum outlier degree is set to 10%, 96 outliers are detected by the conventional algorithm [26], compared to 106 outliers by the approach proposed herein. Table 11 shows the number of outliers detected with different minimum outlier degree settings. The proposed algorithm can prune more transactions and find more outliers than the conventional methods. It is important to find any possible outliers that may induce the transactions to be abnormal.

Next, the 106 outliers were taken as the testing set to find which items cause the outliers to be marked as abnormal. First, each outlier was transformed into absent frequent itemsets and infrequent items, and then association rules mining was applied to the transformed set. Table 12 shows part of the transformed result. In order to discover the items that cause the transactions to be marked as abnormal, several settings were explored, and it was found that with minimum support and minimum confidence set to 5% and 50%, respectively, the rule {j} -> {f, i, m*} is found. This means that the infrequent item j causes transactions to be marked as abnormal.

Table 10. Association rules derived from the 532 transactions generated by the data generator.

| Association rules | Infrequent items |
|---|---|
| {d, i} -> {m} | h, j, l, n, o, q, t, u, x |
| {d, v} -> {m} | |
| {f, b} -> {m} | |
| {f, i} -> {m} | |
| {b, c} -> {m} | |
| {b, v} -> {m} | |
| {b, i, v} -> {m} | |
| {b, i, m} -> {v} | |

Table 11. Outliers discovered from the transactional database.

| Min. outlier degree | No. of transactions pruned, previous [26] | No. of transactions pruned, ours | No. of outliers discovered, previous [26] | No. of outliers discovered, ours |
|---|---|---|---|---|
| 10% | 136 | 141 | 96 | 106 |
| 20% | 175 | 188 | 33 | 45 |

Table 12. Partial transformed data set for the 106 outliers.

| TID | Items |
|---|---|
| 9 | b/i/m/v*, j, u |
| 26 | d/v/m*, j |
| 39 | b/i/m/v* |
| 89 | d/i/m*, d/v/m*, b/v/m*, b/i/v/m*, t |

The second experiment uses the Wisconsin breast cancer data set from the UCI Machine Learning Repository [37]. In order to check the efficiency, accuracy and precision rates are defined as follows:

$$\text{accuracy} = \frac{\text{no. of detected outliers that are positive}}{\text{no. of all outliers}}. \quad (7)$$

$$\text{precision} = \frac{\text{no. of detected outliers that are positive}}{\text{no. of detected outliers}}. \quad (8)$$

The original Wisconsin breast cancer data set contains 699 records, with 458 labeled as benign and 241 labeled as malignant. Each record has 9 attributes and one class attribute. The attribute information is shown in Table 13. Among the 699 records, 14 benign records and 2 malignant records containing unknown data are discarded. To form an unbalanced data set, the experiment follows the strategy outlined in [12], namely removing another 200 malignant records. The final test data contains 444 benign records and 39 malignant records. We assume that the 39 malignant records are the true abnormal records. We also assume that some attributes may cause certain records to be abnormal.

In order to derive relationships between attributes, each record is transformed into a multi-categorical attribute transaction, and the association rules algorithm is then applied to this transaction set. For example, if the first attribute value is 5, it is labeled as a5. If the second attribute value is 1, it is labeled as b1 (see Table 14). A transaction with class attribute o2 is a benign record, whereas one with o4 is a malignant record.

The third step involves mining association rules from the test data. Since the goal is to detect malignant records, which are regarded as outliers, only rules with consequent part o2 are kept. Table 15 shows the association rules discovered from the transformed data set with minimum support and confidence set to 75% and 85%, respectively. Note that, according to Definition 6, all k-item rules with k greater than 2 are redundant and are not listed in Table 15. The records with the top-k highest outlier degrees are chosen as outliers; that is, an outlier is not decided by comparing its outlier degree with a pre-defined minimum outlier degree. Table 16 lists the number of true outliers detected for top-10, top-20, top-40, and top-60, with the corresponding accuracy and precision rates. According to Table 16, the proposed algorithm yields better accuracy and precision rates than the previous approach.

The last step of the second experiment involves finding infrequent items that cause outliers to be marked as abnormal. The experiment tries to find such items in the top-40 result. Again, each outlier in the top-40 result is transformed into absent frequent itemsets and infrequent items. The minimum support and minimum confidence are set to 50% and 80%, respectively. Only the rule {fa} -> {b1, o2*} is found.
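As a quick arithmetic check, Eqs. (7) and (8) can be applied to the true-outlier counts reported in Table 16, using the 39 malignant records as the set of all outliers. The sketch below is illustrative; the function names and the rounding to whole percentages are assumptions made here. Note that the quantity called accuracy in Eq. (7) is what is often termed recall elsewhere.

```python
def accuracy(true_detected: int, all_outliers: int) -> float:
    """Eq. (7): detected outliers that are positive / all outliers."""
    return true_detected / all_outliers


def precision(true_detected: int, detected: int) -> float:
    """Eq. (8): detected outliers that are positive / detected outliers."""
    return true_detected / detected


ALL_OUTLIERS = 39  # malignant records kept in the test data

# k -> (true outliers found by the previous method [26], by the proposed method)
TABLE16_COUNTS = {10: (6, 8), 20: (15, 17), 40: (34, 35), 60: (39, 39)}


def percentages(true_detected: int, k: int):
    """Accuracy and precision for a top-k result, rounded to whole percent."""
    return (round(100 * accuracy(true_detected, ALL_OUTLIERS)),
            round(100 * precision(true_detected, k)))


for k, (previous, proposed) in TABLE16_COUNTS.items():
    print(f"top-{k}: previous {percentages(previous, k)}, "
          f"proposed {percentages(proposed, k)}")
```

For example, the proposed method finds 8 true outliers in its top-10 list, which gives 8/39 (about 21%) accuracy and 8/10 (80%) precision.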
That is, the item fa (the attribute Bare Nuclei is 10) may be the reason that a patient's tumor is malignant, even though the attribute Uniformity of Cell Size is 1.

The third experiment uses the Parkinson's telemonitoring data set from the UCI Machine Learning Repository [37]. There are a total of 5,875 records in the data set, and each record has 19 attributes capturing 16 voice measures, gender, motor-updrs score, and total-updrs score [31]. Each attribute is quantitative and needs to be discretized, that is, divided into several intervals, before the association rules mining algorithm is applied. Each attribute, except gender, is divided into three non-intersecting intervals: high, medium and low. For example, age ranges from 36 to 85; a subject older than 74 belongs to the high interval, a subject younger than 50 belongs to the low interval, and all others belong to the medium interval. After discretizing the attributes, one can proceed to discover the association rules from the Parkinson's data set. In this experiment, a record with motor-updrs_medium is assumed to be a normal record, while a record with motor-updrs_high is treated as a possible outlier record. Similar to the second experiment, only rules with consequent part motor-updrs_medium are kept. Table 17 shows part of the association rules discovered from the transformed data set with minimum support and confidence of 8% and 75%, respectively. No k-item rule with k less than 4 has consequent part motor-updrs_medium, and according to Definition 6, all k-item rules with k greater than 4 that have consequent part motor-updrs_medium are redundant. Table 18 lists the discovered infrequent items.

The second step involves finding outliers by comparing the Parkinson's data set with the discovered rules. Several possible outlier records are found, and if the minimum outlier degree is set to 0.05, the 4 records in Table 19 are true outliers. According to the discovered association rules, these 4 records should have the attribute value motor-updrs_medium, but they have motor-updrs_high. To find infrequent items that cause the records to become abnormal, each outlier in Table 19 is transformed into two parts, absent frequent itemsets and infrequent items.
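The three-interval discretization can be sketched as follows, using the age cut points given above (below 50 is low, above 74 is high). The function name `discretize` and its parameters are assumptions of this sketch, and the cut points for the other attributes are not stated in the text, so only age is shown.

```python
def discretize(name: str, value: float, low_below: float, high_above: float) -> str:
    """Map a quantitative attribute to an item such as 'age_high'."""
    if value > high_above:
        level = "high"
    elif value < low_below:
        level = "low"
    else:
        level = "medium"
    return f"{name}_{level}"


# Age cut points from the text: older than 74 is high, below 50 is low.
for age in (36, 50, 62, 74, 85):
    print(age, discretize("age", age, low_below=50, high_above=74))
```

Values that fall exactly on a cut point (50 or 74 for age) are placed in the medium interval here, which follows the wording "older than 74" and "below 50".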
Again, by applying the association rules mining algorithm with a minimum support of 50% and a minimum confidence of 50%, the infrequent item RPDE_low (a nonlinear dynamical complexity measure below 0.347) is identified as the source that causes the records to become abnormal. That is, when all the measurements are in the medium or low intervals, the RPDE measure is the key to telling a healthy subject apart from a Parkinson's patient.

These experiments show that the proposed outlier detection method is more practical than previous approaches, since it not only identifies the outlier transactions but also discovers the associations between an outlier and its infrequent items. This is useful because the mining results help users determine what causes transactions to become abnormal without having to consult expert knowledge in advance.

Table 13. The attribute information for the Wisconsin breast cancer data set.

| Attribute ID | Attribute information | Domain |
|---|---|---|
| 1 | Clump Thickness | 1-10 |
| 2 | Uniformity of Cell Size | 1-10 |
| 3 | Uniformity of Cell Shape | 1-10 |
| 4 | Marginal Adhesion | 1-10 |
| 5 | Single Epithelial Cell Size | 1-10 |
| 6 | Bare Nuclei | 1-10 |
| 7 | Bland Chromatin | 1-10 |
| 8 | Normal Nucleoli | 1-10 |
| 9 | Mitoses | 1-10 |
| 10 | Class | 2 for benign, 4 for malignant |

Table 14. Partial transformed data set in transaction format.

| Original instances (9 categorical attributes in each record) | Transformed result |
|---|---|
| <5, 1, 1, 1, 2, 1, 3, 1, 1, 2> | <a5, b1, c1, d1, e2, f1, g3, h1, i1, o2> |
| <5, 4, 4, 5, 7, 10, 3, 2, 1, 2> | <a5, b4, c4, d5, e7, fa, g3, h2, i1, o2> |
| <3, 1, 1, 1, 2, 2, 3, 1, 1, 2> | <a3, b1, c1, d1, e2, f2, g3, h1, i1, o2> |
| <9, 1, 2, 6, 4, 10, 7, 7, 2, 4> | <a9, b1, c2, d6, e4, fa, g7, h7, i2, o4> |

Table 15. Association rules mined from the Wisconsin breast cancer data set.

| 2-item rules | (support, confidence) |
|---|---|
| {b1} -> {o2} | (76.4%, 100%) |
| {f1} -> {o2} | (80.1%, 98.7%) |
| {h1} -> {o2} | (81.0%, 98.5%) |
| {i1} -> {o2} | (89.2%, 95.1%) |

Table 16. Outlier detection under different k values. Each cell gives the number of true outliers detected, followed by (accuracy, precision).

| Top-k | Previous [26] | Ours |
|---|---|---|
| top-10 | 6 (15%, 60%) | 8 (21%, 80%) |
| top-20 | 15 (38%, 75%) | 17 (44%, 85%) |
| top-40 | 34 (87%, 85%) | 35 (90%, 88%) |
| top-60 | 39 (100%, 65%) | 39 (100%, 65%) |

Table 17. Partial association rules discovered from the Parkinson's data set (Min Support=8%, Min Confidence=75%).

| Antecedent part of discovered rule | Consequent part of discovered rule |
|---|---|
| Male, shimmer_low, db_low | motor-updrs_medium |
| Male, shimmer_low, APQ3_low | motor-updrs_medium |
| Male, age_medium, db_low | motor-updrs_medium |
| Male, db_low, RPDE_medium | motor-updrs_medium |
| Male, APQ3_low, DFA_medium | motor-updrs_medium |

Table 18. Infrequent items discovered from the Parkinson's data set (Min Support=8%, Min Confidence=75%).

age_low, jitter_high, Abs_high, RAP_high, PPQ5_high, DDP_high, shimmer_high, db_high, APQ3_high, APQ5_high, APQ11_high, DDA_high, NHR_high, HNR_low, HNR_high, RPDE_low, RPDE_high, DFA_high, PPE_high
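The labelling scheme illustrated in Table 14 can be sketched as a small encoder. It assumes, based on the table rows, that the nine attributes use the letters a to i in order, that the value 10 is written as the single character a (as in fa), and that the class attribute uses the prefix o; the function names are hypothetical.

```python
ATTRIBUTE_LETTERS = "abcdefghi"  # one letter per attribute of Table 13, in order


def encode_value(value: int) -> str:
    """Values 1-9 keep their digit; 10 becomes 'a' so every item has two characters."""
    if not 1 <= value <= 10:
        raise ValueError(f"attribute value out of domain 1-10: {value}")
    return "a" if value == 10 else str(value)


def to_transaction(record):
    """Turn a 10-field record (9 attributes + class) into a list of items."""
    if len(record) != 10:
        raise ValueError("expected 9 attribute values followed by the class")
    *attributes, label = record
    items = [letter + encode_value(v)
             for letter, v in zip(ATTRIBUTE_LETTERS, attributes)]
    items.append(f"o{label}")  # o2 = benign, o4 = malignant
    return items


print(to_transaction((5, 4, 4, 5, 7, 10, 3, 2, 1, 2)))
print(to_transaction((9, 1, 2, 6, 4, 10, 7, 7, 2, 4)))
```

Encoding the attribute position in the item name is what lets a standard itemset miner distinguish, for instance, a Clump Thickness of 1 (a1) from a Mitoses value of 1 (i1).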

Table 19. The 4 outliers discovered by comparing the Parkinson's data set with the discovered association rules.

| No. | Outlier records | Outlier degree |
|---|---|---|
| 1 | age_low, Male, motor-updrs_high, jitter_low, Abs_medium, RAP_low, PPQ5_low, DDP_low, shimmer_low, db_low, APQ3_low, APQ5_low, ln,dda_low, NHR_low, HNR_medium, RPDE_medium, DFA_medium, PPE_medium | 1/19 |
| 2 | age_medium, Male, motor-updrs_high, jitter_low, Abs_low, RAP_low, PPQ5_low, DDP_low, shimmer_low, db_low, APQ3_low, APQ5_low, APQ11_low, DDA_low, NHR_low, HNR_high, RPDE_medium, DFA_low, PPE_low | 1/19 |
| 3 | age_medium, Male, motor-updrs_high, jitter_low, Abs_low, RAP_low, PPQ5_low, DDP_low, shimmer_low, db_low, APQ3_low, APQ5_low, APQ11_low, DDA_low, NHR_low, HNR_medium, RPDE_low, DFA_low, PPE_medium | 1/19 |
| 4 | age_medium, Male, motor-updrs_high, jitter_low, Abs_low, RAP_low, PPQ5_low, DDP_low, shimmer_low, db_low, APQ3_low, APQ5_low, APQ11_low, DDA_low, NHR_low, HNR_medium, RPDE_low, DFA_low, PPE_medium | 1/19 |

6 Conclusions

From the perspective of outlier management, conventional methods did not tackle the question of how to further utilize the detected outliers. The proposed framework can find the infrequent items that induce the transactions to be abnormal. To prevent infrequent items from distorting the true outlier degrees, the proposed method modifies the definition of a transaction's association closure by removing the infrequent items before the outlier degrees are calculated. After identifying the outliers, the proposed approach further discovers which infrequent items make transactions abnormal. Abnormal transactions are divided into absent frequent itemsets and infrequent items. By applying an association rule mining method, the relationships between absent frequent itemsets and infrequent items are found. Items that cause the transactions to become outliers are therefore found, and the mining results are easier to understand. The proposed framework provides a total solution not only for finding but also for managing outliers. The experimental results verify that the proposed algorithm is more effective in terms of both accuracy and precision rates.

Future improvements are possible. The calculation of outlier degree relies on associative closure.
However, the confidence values of association rules should also be considered. That is, if a transaction violates a rule with higher confidence, it should have a higher outlier degree. Next, the precedent parts of association rules affect the outlier detection. Since the proposed algorithm employs non-redundant rules to check transactions, the final result may include many known outliers, and even the infrequent items that cause abnormal outliers are revealed. In this case, setting a minimum item number for the precedent part may solve the problem.

We first apply the framework to health care data to verify the algorithm's feasibility. In the future, it is necessary to acquire more real-world data from different sources to derive abnormal transactions and find the reasons behind the abnormality. The mining results will also be shared with hospital officials to solicit their opinions. It is important to mention that the proposed framework can also be applied to any kind of transaction data set to find which infrequent items induce transactions to be abnormal. There are a variety of reasons that can lead to abnormality, and this study's contribution is to provide a way to identify the sources of confusion. Infrequent items are usually ignored in data mining, but they may provide valuable information that allows people to make better decisions.

Acknowledgments

This work was supported in part by the Ministry of Science and Technology, Taiwan under Grants NSC102-2221-E-027-083- and NSC102-2218-E-002-009-MY2, and in part by a joint project between National Taipei University of Technology and Mackay Memorial Hospital under Grant NTUT-MMH-102-03 and Grant NTUT-MMH-103-01.

References

[1] Angiulli F, Pizzuti C (2002) Fast outlier detection in high dimensional spaces. In: Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery in Databases. Helsinki, Finland: 15-26.
[2] Angiulli F, Pizzuti C (2005) Outlier mining in large high-dimensional data sets. IEEE Trans on Knowledge and Data Engineering 17:203-215.
[3] Bahrampour S, Moshiri B, Salahshoor K (2011) Weighted and constrained possibilistic C-means clustering for online fault detection and isolation. Applied Intelligence 35(2): 269-284.
[4] Bhaduri K, Matthews BL, Giannella CR (2011) Algorithms for speeding up distance-based outlier detection. In: Proceedings of ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, San Diego, CA, USA: 859-867.
[5] Burdick D, Calimlim M, Flannick J, Gehrke J, Yiu T (2005) MAFIA: A maximal frequent itemset algorithm. IEEE Trans on Knowledge and Data Engineering 17:1490-1504.

[6] Chandola V, Banerjee A, Kumar V (2009) Anomaly detection: A survey. ACM Computing Surveys 41:1-58.
[7] Chazard E, Ficheur G, Bernonville S, Luyckx M, Beuscart R (2011) Data mining to generate adverse drug events detection rules. IEEE Trans on Information Technology in Biomedicine 15:823-830.
[8] Elahi M, Li K, Nisar W, Lv X, Wang H (2008) Efficient clustering-based outlier detection algorithm for dynamic data stream. In: Proceedings of the 5th Int. Conf. on Fuzzy Systems and Knowledge Discovery, Jinan, Shandong, China 5:298-304.
[9] Gouda K, Zaki MJ (2001) Efficiently mining maximal frequent itemsets. In: Proceedings of IEEE Int. Conf. on Data Mining, San Jose, California, USA: 163-170.
[10] Grahne G, Zhu J (2003) High performance mining of maximal frequent itemsets. In: Proceedings of the 6th SIAM Workshop on High Performance Data Mining, San Francisco, CA, USA: 135-143.
[11] Grahne G, Zhu JF (2005) Fast algorithms for frequent itemset mining using FP-Trees. IEEE Trans on Knowledge and Data Engineering 17:1347-1362.
[12] Guo T, Li GY (2008) Neural data mining for credit card fraud detection. In: Proceedings of the 7th Int. Conf. on Machine Learning and Cybernetics, Kunming, China 7: 3630-3634.
[13] Haglin DJ, Manning AM (2007) On minimal infrequent itemset mining. In: Proceedings of the Int. Conf. on Data Mining, Las Vegas, Nevada, USA: 141-147.
[14] Han J, Pei J, Yin Y (2000) Mining frequent patterns without candidate generation. In: Proceedings of ACM SIGMOD Int. Conf. on Management of Data, Dallas, Texas, USA: 1-12.
[15] He Z, Deng S, Xu X (2005) An optimization model for outlier detection in categorical data. In: Proceedings of IEEE Int. Conf. on Intelligent Computing, Hefei, China: 400-409.
[16] He Z, Deng S, Xu X (2006) A fast greedy algorithm for outlier mining. In: Proceedings of the 10th Pacific-Asia Conf. on Knowledge Discovery and Data Mining, Singapore: 567-576.
[17] He Z, Xu X, Deng S (2005) Fp-outlier: Frequent pattern based outlier detection. Computer Science and Information System 2:103-118.
[18] Hido S, Tsuboi Y, Kashima H, Sugiyama M, Kanamori T (2011) Statistical outlier detection using direct density ratio estimation. Knowledge and Information Systems 26:309-336.
[19] Hu T, Sung SY, Xiong H, Fu Q (2008) Discovery of maximum length frequent itemsets. Information Sciences 178: 69-87.
[20] Huang Y-P, Kao LJ, Sandnes FE (2008) Efficient mining of salinity and temperature association rules from ARGO data. Expert Systems with Applications 35:59-68.
[21] Koufakou A, Georgiopoulos M, Anagnostopoulos GC, Reynolds KM (2007) A scalable and efficient outlier detection strategy for categorical data. In: Proceedings of IEEE Int. Conf. on Tools with Artificial Intelligence, Patras, Greece: 210-217.
[22] Koufakou A, Georgiopoulos M (2010) A fast outlier detection strategy for distributed high-dimensional data sets with mixed attributes. Data Mining and Knowledge Discovery 20: 259-289.
[23] Kriegel HP, Kröger P, Zimek A (2009) Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery from Data 3:1-58.
[24] Lei D, Zhu QH, Chen J, Lin H, Yang P (2012) Automatic PAM clustering algorithm for outlier detection. Journal of Software 7:1045-1051.
[25] Márquez-Vera C, Morales CR, Soto SV (2013) Predicting school failure and dropout by using data mining techniques. IEEE Journal of Latin-American Learning Technologies 8:7-14.
[26] Narita K, Kitagawa H (2008) Outlier detection for transaction databases using association rules. In: Proceedings of the 9th Int. Conf. on Web-Age Information Management, Zhangjiajie, Hunan, China: 373-380.
[27] Otey ME, Ghoting A, Parthasarathy A (2006) Fast distributed outlier detection in mixed-attribute data sets. Data Mining and Knowledge Discovery 12:203-228.
[28] Papadimitriou S, Kitagawa H, Gibbons PB, Faloutsos C (2003) Loci: Fast outlier detection using the local correlation integral. In: Proceedings of the 19th Int. Conf. on Data Engineering, Bangalore, India: 315-326.
[29] Shi K, Li L (2013) High performance genetic algorithm based text clustering using parts of speech and outlier elimination. Applied Intelligence 38(4): 511-519.
[30] Troiano L, Scibelli G (2014) Mining frequent itemsets in data streams within a time horizon. Data & Knowledge Engineering 89:21-37.
[31] Tsanas A, Little MA, McSharry PE, Ramig LO (2010) Accurate telemonitoring of Parkinson's disease progression by non-invasive speech tests. IEEE Transactions on Biomedical Engineering 57:884-893.
[32] Tseng VS, Shie B-E, Wu C-W, Yu PS (2013) Efficient algorithms for mining high utility itemsets from transactional databases. IEEE Trans on Knowledge and Data Engineering 25:1772-1786.
[33] Wu X, Kumar V, Ross Quinlan J, Ghosh J, Yang Q, Motoda H, McLachlan G, Ng A, Liu B, Yu P, Zhou Z-H, Steinbach M, Hand D, Steinberg D (2008) Top 10 algorithms in data mining. Knowledge and Information Systems 14:1-37.
[34] Wan Y, Bian F (2008) Cell-based outlier detection algorithm: A fast outlier detection algorithm for large datasets. In: Proceedings of the 12th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Osaka, Japan 5012:1042-1048.
[35] Yanqing J, Hao Y, Peter D, Ayman M, John T, Richard ME, Massanari R-M (2011) A potential causal association mining algorithm for screening adverse drug reactions in postmarketing surveillance. IEEE Trans on Information Technology in Biomedicine 15:428-437.
[36] Zhu C, Kitagawa H, Faloutsos C (2005) Example-based robust outlier detection in high dimensional datasets. In: Proceedings of the 5th IEEE Int. Conf. on Data Mining, Houston, Texas, USA: 829-832.
[37] UCI machine learning repository. http://www.ics.uci.edu/~mlearn/mlrepository.html.