International Journal of Computer Applications (0975-8887)
Volume 5, No. 5, August 2010

Batch Processing for Incremental FP-tree Construction

Shashikumar G. Totad, Department of CSE, GMRIT, Rajam, Srikakulam District, Andhra Pradesh, India.
Geeta R. B., Department of IT, GMRIT, Rajam, Srikakulam District, Andhra Pradesh, India.
PVGD Prasad Reddy, Department of CS & SE, Andhra University, Visakhapatnam, Andhra Pradesh, India.

ABSTRACT
Frequent patterns are very important in knowledge discovery and data mining tasks such as the mining of association rules and correlations. The prefix-tree based approach is one of the contemporary approaches for mining frequent patterns. The FP-tree is a compact representation of a transaction database that contains the frequency information of all relevant Frequent Patterns (FP) in a dataset. Since the introduction of the FP-growth algorithm for FP-tree construction, three major algorithms have been proposed, namely AFPIM, CATS tree, and CanTree, that have adopted the FP-tree for incremental mining of frequent patterns. All three methods perform incremental mining by processing one transaction of the incremental database at a time and updating it into the FP-tree of the initial (original) database. In this paper we propose a novel method that takes advantage of the FP-tree representation of the incremental transaction database for incremental mining. We propose the Batch Incremental Tree (BIT) algorithm, which merges the FP-trees of two small consecutive durations to obtain an FP-tree equivalent to the one obtained when the entire database is processed at once, from the beginning of the first duration to the end of the second duration. For large databases, our experimental results show a significant reduction in the runtime of the BIT algorithm compared to the runtime of sequential incremental algorithms.

General Terms
Data mining, FP-tree, Prefix-tree, Frequent Patterns, Incremental mining.

Keywords
Batch Incremental Mining, Batch Incremental Tree, Sequential Incremental Mining, minsup.

1. INTRODUCTION
Large databases, sometimes distributed over several remote locations, are becoming more common in the contemporary global economy. Local databases that were initially small have grown, continue to grow, and are being distributed to several remote sites as a result of globalization.
Many of the conventional data mining algorithms are ineffective and inefficient for handling large and growing data sets [1] [2]. Hence, scalable and incremental data mining has become an active area of research with many challenging problems. Large sets of evolving and distributed data can be handled efficiently by incremental data mining. Incremental data mining algorithms perform knowledge updating incrementally to amend and strengthen what was previously discovered [5] [7] [12]. They incorporate database updates without having to mine the entire dataset again.

A frequent pattern is a pattern of items or events that appears frequently in a data set. Frequent patterns are very important in knowledge discovery and data mining tasks such as the mining of association rules and correlations. Since the introduction of the concept of frequent patterns in 1993 by R. Agrawal et al. [3], there have been many considerable studies [2] [4] [6] proposing different approaches for discovering various kinds of frequent patterns and their applications. The prefix-tree-based approach is one of the contemporary approaches for mining frequent patterns.

A pattern P is said to be frequent in a given data set D if its support count sup(P, D) is greater than or equal to a predefined threshold called minsup. Given a data set D and a support threshold m, the collection of all frequent itemsets in D is F(m, D), called the space of frequent patterns.

(a) Initial dataset
Tid   Transaction
1     r, s, t, u
2     q, s, t
3     p, q, r, t
4     p, s, t, u
5     p, r, s, t

(b) Projected dataset
Tid   Transaction
1     t, s, r
2     t, s
3     t, r, p
4     t, s, p
5     t, s, r, p

(c) FP-tree
root
  t:5
    s:4
      r:2
        p:1
      p:1
    r:1
      p:1

Figure 1. (a) Initial dataset, (b) projected dataset with min threshold = 50%, (c) FP-tree

The prefix-tree compactly represents the transactions of a data set and enables fast computation of the support counts of all the frequent patterns of the dataset. Frequent patterns can be generated by traversing the prefix-tree, avoiding multiple scans of the dataset. The Frequent-Pattern tree (FP-tree) is a prefix-tree, first proposed in 2000 by Han et al. at the ACM SIGMOD international conference [13] and later published in 2004 [8]. The FP-tree is a compact representation of a transaction database that contains the frequency information of all relevant patterns in a dataset.
To construct an FP-tree for a given dataset, the dataset is first transformed into a projected dataset. The projected dataset contains only the frequent items (those with support count > min threshold), and the items of each transaction are sorted in descending order of their support count. The transactions of the projected dataset are then added to the prefix-tree one by one. Figure 1 shows the dataset, the projected dataset and the corresponding FP-tree constructed for the given dataset.
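The construction just described can be illustrated with a short Python sketch. It is only a minimal illustration under assumptions, not the authors' implementation: the names `Node`, `build_fp_tree` and `min_count` are hypothetical, the threshold is expressed as an absolute count (3 of 5 transactions), and ties in support count are broken by first appearance in the dataset, which is an assumption chosen so that the item order matches the projected dataset of Figure 1.

```python
from collections import Counter


class Node:
    """One FP-tree node: an item label, a support count, and its children."""

    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}


def build_fp_tree(transactions, min_count):
    # First scan: count the support of every item (once per transaction).
    support = Counter()
    for t in transactions:
        for item in dict.fromkeys(t):
            support[item] += 1

    # Rank items by descending support; sorted() is stable, so ties keep
    # the order of first appearance (an assumption, see the lead-in).
    ranked = sorted(support, key=lambda i: -support[i])
    rank = {item: k for k, item in enumerate(ranked)
            if support[item] >= min_count}

    # Second scan: project each transaction, sort it, insert it as a path.
    root = Node()
    for t in transactions:
        projected = sorted((i for i in dict.fromkeys(t) if i in rank),
                           key=rank.get)
        node = root
        for item in projected:
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root


# The five transactions of Figure 1(a).
transactions = [list("rstu"), list("qst"), list("pqrt"),
                list("pstu"), list("prst")]
tree = build_fp_tree(transactions, min_count=3)
t_node = tree.children["t"]
print(t_node.count, t_node.children["s"].count)
```

The two passes over `transactions` correspond to the two scans of the dataset that FP-tree construction requires.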
Figure 2. Stepwise construction of the CATS tree while processing each transaction

2. RELATED WORK
Han et al. proposed the FP-growth algorithm [8] [13] to discover frequent patterns from the FP-tree. FP-growth traverses the FP-tree in a depth-first manner. It requires only two scans of the dataset to construct the FP-tree, unlike the Apriori algorithm [3], which makes multiple scans over the dataset. Since the introduction of the FP-growth algorithm, three major algorithms have been proposed, namely AFPIM, CATS tree, and CanTree, that have adopted the FP-tree for incremental mining of frequent patterns.

AFPIM: Koh and Shieh proposed the Adjusting FP-Tree for Incremental Mining (AFPIM) algorithm [9]. This algorithm updates the previously constructed FP-tree, which contains the items that are frequent with respect to a user-specified minimum support threshold minsup, by scanning only the incremental part of the dataset. As the items are arranged in descending order of support count based on the original dataset, AFPIM re-sorts the items according to the new values of support count, based on the incremental dataset, through bubble sort. AFPIM has two major drawbacks. First, the sorting process is computationally expensive. Second, when new frequent patterns emerge as a result of scanning the incremental dataset, AFPIM has to construct a new FP-tree.

CATS Tree: The CATS tree (Compressed and Arranged Transaction Sequence Tree) [10] addresses the limitations of the AFPIM algorithm. Unlike AFPIM, the CATS tree represents all the items of the transactions in the tree, regardless of whether the items are frequent or not. This allows the CATS tree to represent even newly emerging frequent patterns from the incremental dataset. CATS arranges the nodes based on their local support count, which helps to achieve high compactness of the tree. For incremental mining, the CATS tree updates the existing tree by considering the transactions of the incremental dataset one by one and merging them with existing tree branches. Figure 2 shows how the CATS tree is constructed for the dataset of Figure 1.
However, the CATS tree too has two limitations. First, for each new transaction it is required to find the right path for the new transaction to merge into. Second, it is required to swap and merge nodes during the updates, as the nodes in the CATS tree are locally sorted.

CanTree: The CanTree (Canonical-order Tree) was proposed by Leung et al. [11]. Construction of the CanTree is very similar to that of the CATS tree, except that in the CanTree the items are arranged according to some canonical order. The canonical order can be determined by the user prior to the mining process; it can be lexicographic or based on certain property values of the items. Since the canonical order is fixed and not based on the support count, the CanTree allows easy insertion of nodes. Unlike the CATS tree, transaction insertions in the CanTree require no extensive searching for mergeable paths.

The CanTree too has some limitations. It generates a compact tree if and only if the majority of the transactions contain a common pattern-base in canonical order. Otherwise, it generates a skewed tree with too many branches and hence too many nodes. Further, although the CanTree takes less time for tree construction, it requires more memory and more time for extracting frequent patterns from the generated tree.

All three incremental prefix-tree based algorithms discussed above perform sequential incremental mining. That is, for incremental mining they consider one transaction of the incremental dataset at a time. In real scenarios, however, transaction databases are mined periodically for frequent pattern generation, and the algorithms discussed above fail to take advantage of this periodic mining. Suppose the data analyses for the first and second quarters of a year are available in the form of FP-trees, and it is required to obtain the FP-tree for the first six months of the year. All of the methods discussed above take the FP-tree of the first quarter and perform incremental mining by processing one transaction of the second-quarter database at a time. These methods do not take advantage of the FP-tree of the second quarter, which is readily available.
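The sequential incremental update that these methods share can be sketched in Python using a CanTree-style insertion. This is a minimal illustration under assumptions rather than the implementation of Leung et al.: the canonical order is taken to be lexicographic, the names `Node`, `insert_transaction` and `count_nodes` are hypothetical, and the split of the Figure 1 transactions into an original and an incremental part is made up for the example.

```python
class Node:
    """Prefix-tree node holding an item, its count, and its children."""

    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}


def insert_transaction(root, transaction):
    """Insert one transaction with its items in canonical order.

    The canonical order is assumed to be lexicographic, so no node is
    ever swapped or re-sorted after insertion.
    """
    node = root
    for item in sorted(set(transaction)):
        node = node.children.setdefault(item, Node(item))
        node.count += 1


def count_nodes(node):
    """Number of nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node.children.values())


# Hypothetical split of the Figure 1 transactions into two databases.
original = [list("rstu"), list("qst"), list("pqrt")]
increment = [list("pstu"), list("prst")]

tree = Node()
for t in original:
    insert_transaction(tree, t)

# Sequential incremental mining: every transaction of the incremental
# database is read and inserted individually, duplicates included.
for t in increment:
    insert_transaction(tree, t)

print(count_nodes(tree))
```

Because all items are kept and the order ignores support counts, the five transactions share few prefixes here, which illustrates the skewed, node-heavy tree mentioned above.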
In this paper we propose a novel method that takes advantage of such previously obtained periodic FP-trees, i.e., the FP-tree representation of the incremental transaction database, for incremental mining. We propose the Batch Incremental Tree (BIT) algorithm, which merges the FP-trees of small consecutive durations to obtain an FP-tree equivalent to the one obtained when the entire database is processed at once, from the beginning of the first duration to the end of the second duration.
3. BATCH INCREMENTAL TREE (BIT) ALGORITHM
In this section we discuss the working of the BIT algorithm for incremental mining of frequent patterns. The BIT algorithm takes the FP-trees of the two periodic datasets. It reads the itemsets of one of the FP-trees (T2) one by one, along with their frequency counts, and searches for the mergeable prefix path in the other FP-tree (T1). It then merges each itemset of T2 with the mergeable prefix by updating the frequency counts of the matching items and inserting the remaining non-prefix items (if any) by extending the tree branch after the last matching prefix item of the mergeable pattern. The algorithm given below states the steps involved in batch incremental processing.

ALGORITHM BatchIncrementalTree(FP-tree T1, FP-tree T2)
1. Get itemsets from T2 by considering each of the leaves one by one.
2. FP-tree T = T1
3. For each itemset obtained from T2 do the following steps, up to 18
4. { Read the next itemset of T2.
5.   Get the next item nk to compare, from T // Initially the 1st child of the root of T
6.   For each item j in the itemset do the following steps, up to 18
7.     if item nk is equal to item j then
8.       if nk represents a leaf node then
9.       { Update the node represented by nk.
10.        Get the remaining items from the itemset and add each item as a descendant of nk, one below the other.
11.      }
12.      else // if nk is not a leaf node
13.      { Update the node represented by nk.
14.        nk = first child of nk.
15.      }
16.    else // if item nk is not equal to item j
17.      if nk has any more child then nk = next child of nk.
18.      else Get the remaining items from the itemset and add each item as a descendant of nk, one below the other.
19. }
20. Return T.

4. TIME COMPLEXITY ANALYSIS
For incremental data mining, CanTree reads the itemsets (transactions) of the incremental database (D2) one at a time and appends each itemset to the FP-tree (T1) of the original database (D1), whereas the BIT algorithm gets the itemsets from the FP-tree (T2) of the incremental database (D2) and appends each itemset to the FP-tree (T1) of the original database (D1). Hence, the process of merging is essentially the same for both algorithms.
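The merge just described can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' implementation. Both trees are assumed to arrange items in the same fixed (here lexicographic) order so that prefixes are comparable; the names `Node`, `insert`, `itemsets`, `bit_merge`, `build` and `snapshot` are hypothetical; T1 is updated in place instead of being copied into T; and each itemset's frequency is recovered as a node's count minus the counts of its children, an assumption that also covers itemsets ending at an interior node rather than at a leaf.

```python
class Node:
    """Prefix-tree node holding an item, its count, and its children."""

    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}


def insert(root, items, count=1):
    """Add `count` occurrences of an itemset already in tree order."""
    node = root
    for item in items:
        node = node.children.setdefault(item, Node(item))
        node.count += count


def itemsets(node, prefix=()):
    """Yield one (itemset, frequency) pair per distinct itemset in the tree."""
    for child in node.children.values():
        path = prefix + (child.item,)
        # Occurrences that end exactly at this node.
        ending_here = child.count - sum(c.count for c in child.children.values())
        if ending_here > 0:
            yield path, ending_here
        yield from itemsets(child, path)


def bit_merge(t1, t2):
    """Merge every itemset of t2, with its frequency, into t1 (in place)."""
    for path, freq in itemsets(t2):
        insert(t1, path, freq)
    return t1


def build(transactions):
    """Build a tree by inserting transactions one at a time (sequentially)."""
    root = Node()
    for t in transactions:
        insert(root, sorted(t))
    return root


def snapshot(node):
    """Nested-dict view of a tree, used only to compare two trees."""
    return {c.item: (c.count, snapshot(c)) for c in node.children.values()}


# Made-up example data: d2 repeats the itemset {a, b} three times.
d1 = [("a", "b", "c"), ("a", "c"), ("b", "d")]
d2 = [("a", "b"), ("a", "b"), ("a", "b"), ("a", "b", "c"), ("b",)]

merged = bit_merge(build(d1), build(d2))
print(snapshot(merged) == snapshot(build(d1 + d2)))
print(len(d2), len(list(itemsets(build(d2)))))
```

In this example the batch merge performs three insertions for the five transactions of `d2`, because the three occurrences of {a, b} are stored as one branch with a frequency count.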
The advantage of the BIT algorithm lies in the fact that it processes multiple occurrences of the same itemset (represented with the occurrence frequency in the FP-tree T2) only once for merging, whereas CanTree performs merging for every occurrence of the itemset. In the following we bring out this difference by way of time complexity analysis.

The following notations are used for performance analysis:
$m$ - Total number of items available. (This corresponds to the maximum number of children of the root of a tree.)
$n$ - Number of leaf nodes of tree T2.
$q_i$ - Number of nodes/items in branch $i$ (itemset $i$) of T2.
$l_i$ - Number of node items of T1 that match the items of itemset $i$ (i.e., the size of the matching prefix of T1 for itemset $i$ of T2).
$T$ - Total running time of the merging process.
$T_i$ - Time required for processing itemset $i$ of T2.
$t_c$ - Time required to compare and move to the next node in the forward or downward direction (if the comparison fails).
$t_a$ - Time to create and add a node, corresponding to an item of the itemset of T2, as a descendant.

Consider the (worst case) scenario wherein, while comparing the items of an itemset of T2 at every level of the tree, the extreme right node item matches and the remaining items of the itemset are added as descendants of the extreme right leaf node of FP-tree T1. Figure 3 shows the worst-case scenario for FP-tree T1.

Figure 3. FP-tree T1 showing the worst-case scenario (the root has children 1, 2, ..., m at Level-1; at Level-2 the subtrees have children 1, 2, ..., m-1 and 1, 2, ..., m-2, and so on)

Time $T_i$ = time required for comparing the items of the $i$-th itemset of T2 and moving forward and downward + time for adding all the remaining items of the $i$-th itemset of T2. Assuming $q_i > l_i$,

$$T_i = \sum_{j=0}^{l_i} \left[ (m-j)\, t_c \right] + (q_i - l_i)\, t_a$$

In the worst case, the maximum number of items (levels) of the FP-tree would be equal to $m-1$, with $m-1$ items in a branch, i.e., $l_i = m-1$:

$$T_i = \sum_{j=0}^{m-1} \left[ (m-j)\, t_c \right] + \left( q_i - (m-1) \right) t_a = \sum_{j=0}^{m-1} \left[ (m-j)\, t_c \right] + \left( m - (m-1) \right) t_a$$

as $q_i = m$ in the worst case, i.e.,

$$T_i = \left[ m\, t_c + \dots + 1 \cdot t_c \right] + t_a = (1 + 2 + \dots + m)\, t_c + t_a = \frac{m(m+1)}{2}\, t_c + t_a$$

Therefore, the running time for the entire merge process (in the worst case) is:

$$T = \sum_{i=1}^{n} \left[ \frac{m(m+1)}{2}\, t_c + t_a \right]$$

The BIT algorithm gets its transactions from the FP-tree T2, unlike CanTree, which reads them from the database. In an FP-tree, multiple occurrences of an itemset are represented by a single branch that also carries the frequency of occurrence. Hence, in the BIT algorithm multiple occurrences of an itemset are read and processed for merging only once. Therefore the value of $n$ for BIT is always much less than that for CanTree, and hence so is the value of $T$. Further, as the database size increases, the number of itemsets with high frequency also increases. Hence, the BIT algorithm always takes much less time than CanTree.

As CanTree takes less time for FP-tree construction than the AFPIM and CATS tree algorithms, we considered CanTree as the representative of sequential incremental FP-tree algorithms. We implemented both the CanTree and BIT algorithms and made a comparative study of their performance in terms of the execution time for tree construction. For CanTree, tree construction time is measured as the time required to read the transactions from the incremental database and insert the items into the FP-tree constructed for the original database. For BIT, tree construction time is measured as the time required to read the itemsets from the existing FP-tree of the incremental database and insert them into the FP-tree of the original database.

Figure 4. Runtime: BIT vs. CanTree. (a) Runtime (in seconds) against database size (in million transactions); (b) runtime (in seconds) against % of incremental database size (in million transactions).

We tested the algorithms for their performance on dual-processor machines with 2.8 GHz speed.
We made multiple runs of the algorithms on synthetic databases of various sizes, ranging from 1 million transactions to 10 million transactions. Average itemset size of the transactions was 15 in the domain of 5 items. We tested the algorithms by measuring runtime against (i) varying database size, keeping the original and incremental database sizes in fixed proportion (6:4), and (ii) varying proportion of original and incremental database, keeping the total database size fixed. The results of the experiments are shown as graphs in Figure 4(a) and Figure 4(b). As can be observed from the graphs, the BIT algorithm takes much less time (almost half of the time required for CanTree) for the construction of the FP-tree. As the size of the database increases (Figure 4(a)), the runtime of BIT algorithm decreases. Further, the time difference between CanTree and the BIT algorithm also increases as the database size increases. This is because, as the
database size increases, the frequency of occurrence of items also increases, and hence CanTree requires more time to read the transactions from the incremental database. The BIT algorithm, in contrast, reads itemsets from the FP-tree, which contains only one representation for multiple occurrences of an itemset, so it reads each itemset only once. In Figure 4(b), the runtime decreases as the percentage of the incremental database decreases (keeping the size of the original database fixed) for both CanTree and BIT. Here again, it can be observed that the difference in runtime between CanTree and BIT is larger when the incremental database (i.e., its percentage) is larger, and reduces as its size reduces. As can be seen from the graphs in Figure 4, the runtime of the BIT algorithm reduces to nearly half of the runtime of sequential algorithms for large databases.

5. CONCLUSION
The BIT algorithm takes much less time to construct the FP-tree by using the previously generated FP-tree of the incremental database. This is possible because BIT reads the incremental transactions from the FP-tree rather than from the database, and in the FP-tree multiple occurrences of a transaction of the database are represented only once. As can be seen from the graphs in Figure 4, CanTree does more work searching for the matching prefix as the database size increases. On the contrary, the BIT algorithm does less work as the database size increases, because the probability of recurrence of itemsets also increases with database size. Hence the difference in runtime between the BIT algorithm and sequential incremental algorithms increases, i.e., BIT takes less time for tree construction.

6. REFERENCES
[1] Paul S. Bradley, J. E. Gehrke, Raghu Ramakrishnan and Ramakrishnan Srikant. Philosophies and Advances in Scaling Mining Algorithms to Large Databases. Communications of the ACM, August 2002.
[2] R. J. Bayardo, Efficient mining of long patterns from databases. In Proc. SIGMOD 1998, pp. 85-93.
[3] Agrawal, R., Imielinski, T., and Swami, A. 1993. Mining association rules between sets of items in large databases. In Proc. of ACM-SIGMOD, 1993 (SIGMOD 93), pp. 207-216.
[4] Agrawal R., Srikant R.
Fast Algorithms for Mining Association Rules. In Proc. of VLDB, Sep 12-15 1994, pp. 487-499.
[5] D. W. Cheung, J. Han, V. T. Ng, and C. Y. Wong, Maintenance of discovered association rules in large databases: an incremental updating technique. In Proc. of ICDE 1996, pp. 106-114.
[6] F. Bonchi and C. Lucchese, On closed constrained frequent pattern mining. In Proc. ICDM 2004, pp. 35-42.
[7] Lee, C-H., Lin, C-R., & Chen, M. S., Sliding window filtering: an efficient method for incremental mining on a time-variant database. In ELSEVIER Information Systems, 30(3), 2005, pp. 227-244.
[8] J. Han, J. Pei, Y. Yin and R. Mao, Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach. Data Mining and Knowledge Discovery, 8(1), 2004, pp. 53-87.
[9] Koh, J-L., & Shieh, S-F. An Efficient Approach for Maintaining Association Rules Based on Adjusting FP-tree Structures. Proceedings of the 2004 Database Systems for Advanced Applications, 2004, pp. 417-424.
[10] Cheung, W., & Zaïane, O. R. Incremental Mining of Frequent-patterns without Candidate Generation or Support Constraint. Proceedings of the 2003 International Database Engineering and Applications Symposium, 2003, pp. 111-116.
[11] Leung, C. K-S., Khan, Q. I., Li, Z., & Hoque, T. CanTree: A Tree Structure for Efficient Incremental Mining of Frequent Patterns. Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM 05), 2005.
[12] D. W. Cheung, S. D. Lee, and B. Kao, A general incremental technique for maintaining discovered association rules. In Proc. DASFAA 1997, pp. 185-194.
[13] J. Han, J. Pei, and Y. Yin, Mining Frequent Patterns without Candidate Generation. In Proc. of SIGMOD 2000, pp. 1-12.