A Data adaptive and Dynamic Segmentation Index for Whole Matching on Time Series


Yang Wang, Peng Wang, Jian Pei, Wei Wang, Sheng Huang

School of Computer Science, Fudan University, Shanghai, China
School of Computing Science, Simon Fraser University, Burnaby, BC, Canada
Information Management Team, IBM Research China, Shanghai, China

We sincerely thank Dr. Themis Palpanas for sending us the iSAX 2.0 code. We are deeply grateful to the anonymous reviewers for their insightful and constructive comments and suggestions that helped to improve the quality of this paper. We tried our best to accommodate their suggestions in this camera-ready version. The work is supported in part by NSFC under grants 639, 676, 633, IBM-Fudan Joint Study program JSA6, an NSERC Discovery Grant and a BCFRST NRAS Endowment Research Team Program project. All opinions, findings, conclusions and recommendations in this paper are those of the authors and do not necessarily reflect the views of the funding agencies.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present their results at The 39th International Conference on Very Large Data Bases, August 26th to 30th, 2013, Riva del Garda, Trento, Italy. Proceedings of the VLDB Endowment, Vol. 6, No. 10. Copyright 2013 VLDB Endowment 2150-8097/13/10... $10.00.

ABSTRACT

Similarity search on time series is an essential operation in many applications. In the state-of-the-art methods, such as the R-tree based methods, iSAX and iSAX 2.0, time series are by default divided into equi-length segments globally, that is, all time series are segmented in the same way. Those methods then focus on how to approximate or symbolize the segments and construct indexes. In this paper, we make an important observation: global segmentation of all time series may incur unnecessary cost in space and time for indexing time series. We develop DSTree, a data adaptive and dynamic segmentation index on time series. In addition to savings in space and time, our new index can provide tight upper and lower bounds on distances between time series. An extensive empirical study shows that our new index supports time series similarity search effectively and efficiently.

1. INTRODUCTION

Similarity search on time series is essential in many applications []. Given a set $TS$ of time series, a query time series $Q$, and a distance threshold $\epsilon$, a similarity search retrieves the time series $S \in TS$ such that $D(Q, S) \le \epsilon$, where $D(\cdot, \cdot)$ is a distance function. When the Euclidean distance is used and the time series in question are assumed to be of the same length, the problem is called whole matching [], which has been popularly used in various applications.

The problem is challenging in practice, since the set of time series $TS$ to be searched may contain many time series and each time series may be long. To tackle the whole matching problem, many index structures have been proposed [, 4, 5,, 6, 3], which will be briefly reviewed in Section 6. All those indexes are based on two fundamental principles.

Figure 1: Dynamic segmentation of time series (four time series, S1 to S4).

Principle 1: Dimensionality Reduction by Global Segmentation

A time series can be regarded as a point in a multidimensional space, one dimension representing a time instant. A fundamental challenge, however, is that time series are often long. A time series often contains readings at hundreds or even thousands of instants. It is highly ineffective to directly index time series using spatial indexes, such as an R-tree [7].
To tackle this problem, many existing methods apply dimensionality reduction techniques, such as Singular Value Decomposition (SVD) [8], Discrete Fourier Transform (DFT) [], Discrete Wavelet Transform (DWT) [4], Piecewise Linear Approximation (PLA) [4], Piecewise Aggregate Approximation (PAA) [], Adaptive Piecewise Constant Approximation (APCA) [] and Chebyshev Polynomials (CP) []. After dimensionality reduction, a multidimensional index, such as an R-tree [7], can be used as an index in the lower dimensional space.

Accordingly, in the state-of-the-art time series indexing methods, such as the R-tree based methods, iSAX [5], and iSAX 2.0 [6], all time series to be indexed are segmented in the same way. Thus, they are global segmentation approaches. Those methods focus on how to approximate or symbolize segments and construct indexes. The segmentation of time series is not closely integrated with index building. Does such a global segmentation method provide the best benefit to time series indexing?

EXAMPLE 1 (SEGMENTATION). Consider the 4 time series in Figure 1. Each time series has 8 time instants. To reduce the dimensionality, we can segment each time series into 4 segments, each segment consisting of 2 instants. If we notice that time series S1 and S2 have (relatively) stable values on the first 4 instants, we can segment S1 and S2 into 3 segments: the first segments consist of the first 4 instants, the second segments consist of the 5th and the 6th instants, and the last segments consist of the 7th and the 8th instants, as indicated by the dotted lines in the figure. In contrast, time series S3 and S4 have (relatively) stable values both on the first 4 instants and on the last 4 ones. Accordingly, we can segment them such that the first

segments cover the first 4 instants and the second segments cover the last 4 instants. By dynamic segmentations adaptive to data, we can reduce dimensionality further, in this example, from 4 to 3 for S1 and S2, and to 2 for S3 and S4. At the same time, we can retain good approximation quality.

Example 1, though simple, clearly shows that local segmentation enables substantial opportunities for more effective indexes. If we can segment time series in an adaptive way, we may be able to achieve better dimensionality reduction and thus save more space and query answering time. Now, the challenge is how we can dynamically segment time series in a data adaptive manner, and retain good quality.

Principle 2: Using Lower Bounds in Search

Dimensionality reduction almost unavoidably comes with errors in data representation. An essential requirement in similarity search, however, is no false dismissals. The lower bounding property (also known as the contractive property) is an important desirable property for the dimensionality reduction representation methods of time series. A dimensionality reduction method is said to hold the lower bounding property if the method comes with a distance lower bound function $D_{LB}(\tilde{S}, \tilde{S}') \le D(S, S')$ for any time series $S$ and $S'$, where $\tilde{S}$ and $\tilde{S}'$ are the approximation representations of $S$ and $S'$, respectively, in the method. A method with the lower bounding property guarantees no false negatives in search. That is, when a time series $S$ is pruned using the lower bound function because $D_{LB}(\tilde{Q}, \tilde{S}) > \epsilon$, since $D(Q, S) \ge D_{LB}(\tilde{Q}, \tilde{S}) > \epsilon$, $S$ is definitely not an answer to the similarity search.

While lower bounding is well explored in the literature, to the best of our knowledge, no existing methods consider using upper bounds in similarity search systematically. If a method comes with a distance upper bound function $D_{UB}(\tilde{S}, \tilde{S}') \ge D(S, S')$ for any time series $S$ and $S'$, where $\tilde{S}$ and $\tilde{S}'$ are the approximation representations of $S$ and $S'$, respectively, in the method, then once $D_{UB}(\tilde{Q}, \tilde{S}) < \epsilon$, we immediately know that $S$ is an answer to the similarity search without computing the exact distance $D(Q, S)$.
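The way the two bounds cut down exact distance computations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `range_search`, the candidate tuples, and the toy bound values are hypothetical and not part of the paper; the bounds are taken as given rather than derived from any particular representation.

```python
def range_search(candidates, eps):
    """Filter candidates with a lower and an upper bound before any exact work.

    Each candidate is (name, lower, upper, exact_distance), where
    lower <= D(Q, S) <= upper and exact_distance is a callable that
    computes D(Q, S) only when it is really needed.
    """
    answers = []
    exact_computations = 0
    for name, lower, upper, exact_distance in candidates:
        if lower > eps:
            # D(Q, S) >= lower > eps: safe to prune, no false dismissal.
            continue
        if upper < eps:
            # D(Q, S) <= upper < eps: an answer without touching the raw data.
            answers.append(name)
            continue
        # The bounds are inconclusive, so the exact distance is required.
        exact_computations += 1
        if exact_distance() <= eps:
            answers.append(name)
    return answers, exact_computations


# Toy candidates with made-up bounds and exact distances.
candidates = [
    ("a", 0.5, 0.9, lambda: 0.7),   # accepted by the upper bound
    ("b", 1.5, 2.0, lambda: 1.8),   # pruned by the lower bound
    ("c", 0.8, 1.4, lambda: 1.1),   # verified, rejected
    ("d", 0.6, 1.2, lambda: 0.95),  # verified, accepted
]
answers, exact_computations = range_search(candidates, eps=1.0)
```

In this toy run only two of the four candidates need an exact distance computation; the tighter the bounds, the fewer candidates fall into the inconclusive case.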
Although some previous methods propose upper bounds for time series similarity computation [7], they only consider how to define and compute the upper bound of the distance between two time series. How to utilize the upper bound in the index for a large number of time series is far from trivial and has not been solved.

Moreover, upper bounding can be used to answer interesting queries beyond similarity search. For example, consider the query "What is the distribution of the distance between Q and the time series in the database?" With both lower bounding and upper bounding, we may be able to give a bounded histogram quickly as the answer to the question based on the index only, without accessing the original data. An application example of the histogram obtained as such is to help set a meaningful threshold in similarity search. Now, the challenge is how to develop an effective upper bounding mechanism in indexes for efficient similarity search.

In this paper, we study the whole matching problem [] where the Euclidean distance is used and time series have the same length. It is a fundamental time series processing problem tackled by numerous previous studies [, 4, 5,, 6, 3]. Please note that our work can be easily extended to subsequence matching [6] where query time series are allowed to have different lengths.

We explore data-adaptive dynamic segmentation and upper bounding in time series indexes. We propose a new representation of time series that is an extension of the renowned Adaptive Piecewise Constant Approximation (APCA). It not only offers better representation accuracy, but also supports upper bound estimation, which enriches the functionality of the index greatly.
Table 1: Some frequently used symbols.

Symbol                                              Meaning
$S$, $S'$                                           time series
$|S|$                                               the length of time series $S$
$D(S, S')$                                          the (Euclidean) distance between time series $S$ and $S'$
$D_{LB}(N, Q)$, $D_{UB}(N, Q)$                      the lower and upper bound of distance between time series $Q$ and the time series in node $N$
$S = (S_1, \ldots, S_m)$                            a time series is divided into $m$ segments
$r_j$                                               the right end time instant of segment $j$
$\mu_j^S$                                           the mean of segment $j$ in time series $S$
$\sigma_j^S$                                        the standard deviation of segment $j$ in time series $S$
$SG$                                                a segmentation of a time series
$C$                                                 the number of time series indexed in the subtree rooted at a node
$Z$                                                 the synopsis at a node
$\psi$                                              leaf capacity of DSTree
$N$                                                 a node in a DSTree
$TS_N$                                              the time series assigned to node $N$
$LB_i^\mu$, $LB_i^\sigma$, $UB_i^\mu$, $UB_i^\sigma$   the lower and upper bounds using mean or standard deviation, see Equations 11, 12, 13, and 14 for detail
H-split$_M$                                         H-split using mean value
H-split$_{SD}$                                      H-split using standard deviation
V-split$_L$                                         V-split using the left subsegment
V-split$_R$                                         V-split using the right subsegment

We develop DSTree, a data adaptive and dynamic segmentation index on time series. In addition to savings in space and time, our new index can provide tight upper and lower bounds on distances between time series. An extensive empirical study shows that our new index supports time series similarity search effectively and efficiently.

The rest of the paper is organized as follows. Section 2 introduces the new representation of time series. Section 3 develops the new index and the construction method. Section 4 applies the new index in similarity search. Section 5 reports the experiment results. Section 6 reviews the related work. Section 7 concludes the paper. Table 1 summarizes the symbols frequently used in this paper.

2. EXTENDING APCA REPRESENTATION

In this section, we extend the well-known Adaptive Piecewise Constant Approximation (APCA) of time series data. Our extension, called EAPCA, will be used to represent time series in our index. We also derive upper and lower bounds of distances among time series using the extended approximation.

2.1 APCA

A time series $S = (s_1, \ldots, s_n)$ is a sequence of values.
Without loss of generality, in this paper we assume that every time series has a value at every time instant $t = 1, 2, \ldots, n$. We denote by $|S| = n$ the length of the time series $S$, and by $S[i] = s_i$ ($1 \le i \le |S|$) the value of $S$ at time instant $t = i$.

Given two time series $S$ and $S'$ such that $|S| = |S'|$, the (Euclidean) distance between $S$ and $S'$ is

\[ D(S, S') = \sqrt{\sum_{i=1}^{|S|} (S[i] - S'[i])^2}. \]

The Euclidean distance is popularly used in time series analysis. Moreover, there is strong evidence showing that the Euclidean distance is superior in accuracy compared to other similarity measures [3, 8, ]. In the rest of the paper, we assume the Euclidean distance, and, when the distance between two time series is concerned, the time series have the same length.

In many applications, it is highly desirable to estimate the distance between two time series quickly. There are existing methods providing lower bounds by segmenting time series. Here, we review a popularly used method, APCA.
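As a concrete illustration of the Euclidean distance and of a segment-based lower bound in the style of the method reviewed next, the following Python sketch replaces each segment by its mean and compares the resulting estimate with the exact distance. The helper names (`euclidean`, `segment_means`, `mean_lower_bound`) and the sample series are hypothetical; segment boundaries are given as right endpoints shared by both series.

```python
import math


def euclidean(x, y):
    """Exact Euclidean distance between two equal-length series."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def segment_means(series, right_ends):
    """Mean of each segment; right_ends are the 1-based right endpoints."""
    means, start = [], 0
    for r in right_ends:
        segment = series[start:r]
        means.append(sum(segment) / len(segment))
        start = r
    return means


def mean_lower_bound(x, y, right_ends):
    """Lower bound on D(x, y) from segment means on aligned segments."""
    total, start = 0.0, 0
    pairs = zip(right_ends, segment_means(x, right_ends), segment_means(y, right_ends))
    for r, mu_x, mu_y in pairs:
        total += (r - start) * (mu_x - mu_y) ** 2  # segment length times squared gap
        start = r
    return math.sqrt(total)


x = [1.0, 1.0, 2.0, 2.0, 5.0, 7.0, 3.0, 3.0]
y = [0.0, 2.0, 2.0, 4.0, 4.0, 4.0, 1.0, 3.0]
ends = [4, 6, 8]  # three segments: instants 1-4, 5-6, 7-8

exact = euclidean(x, y)
bound = mean_lower_bound(x, y, ends)
```

For these two series the squared exact distance is 20 while the squared bound is 11, so the estimate never overstates the distance but can be loose when values fluctuate inside a segment.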

APCA divides a time series $S = (s_1, \ldots, s_n)$ into several disjoint segments, $S = (S_1, \ldots, S_m)$ ($m \le n$), where $S_j = (s_{r_{j-1}+1}, \ldots, s_{r_j})$ ($1 \le j \le m$, $r_0 = 0$, $r_1 < \cdots < r_m = n$). APCA approximates each segment $S_j$ by a pair $(\mu_j, r_j)$, where

\[ \mu_j = \frac{\sum_{k=r_{j-1}+1}^{r_j} s_k}{r_j - r_{j-1}} \]

is the mean value of the segment. That is, $S$ can be approximated as $\tilde{S} = ((\mu_1, r_1), \ldots, (\mu_m, r_m))$.

For two time series $X$ and $Y$ such that $|X| = |Y|$, let $\tilde{X} = ((\mu_1^X, r_1), \ldots, (\mu_m^X, r_m))$ and $\tilde{Y} = ((\mu_1^Y, r_1), \ldots, (\mu_m^Y, r_m))$ be the APCA representations of $X$ and $Y$, respectively. The segmentations of the two time series are said to be aligned in $\tilde{X}$ and $\tilde{Y}$, since $\tilde{X}$ and $\tilde{Y}$ use the same $r_1, \ldots, r_m$ and $r_0 = 0$. Using the mean values, APCA can give a lower bound of the distance between two time series. We have the following.

LEMMA 1 (APCA LOWER BOUND). Given two time series $X$ and $Y$ such that $|X| = |Y|$, let $\tilde{X} = ((\mu_1^X, r_1), \ldots, (\mu_m^X, r_m))$ and $\tilde{Y} = ((\mu_1^Y, r_1), \ldots, (\mu_m^Y, r_m))$ be two aligned APCA representations of $X$ and $Y$, respectively. Then,

\[ D(X, Y) \ge \sqrt{\sum_{i=1}^{m} (r_i - r_{i-1})(\mu_i^X - \mu_i^Y)^2} \tag{1} \]

Equipped with only mean values, APCA cannot provide any upper bound on the distance between time series. Next, we show that by combining standard deviations we can derive an upper bound and a tighter lower bound on distances.

2.2 EAPCA and Upper/Lower Bounds Using Standard Deviations

We can extend APCA by including the standard deviation for every segment. Concretely, for a time series $S = (s_1, \ldots, s_n)$ and an APCA representation $((\mu_1, r_1), \ldots, (\mu_m, r_m))$, we extend the approximation to the extended APCA representation (EAPCA for short), denoted by $\tilde{S} = ((\mu_1, \sigma_1, r_1), \ldots, (\mu_m, \sigma_m, r_m))$, where

\[ \sigma_i = \sqrt{\frac{\sum_{j=r_{i-1}+1}^{r_i} s_j^2}{r_i - r_{i-1}} - \left(\frac{\sum_{j=r_{i-1}+1}^{r_i} s_j}{r_i - r_{i-1}}\right)^2} \]

is the standard deviation of the $i$-th segment ($1 \le i \le m$). We have the following results.

THEOREM 1 (BOUNDS). Given two time series $X$ and $Y$ such that $|X| = |Y|$, let $\tilde{X} = ((\mu_1^X, \sigma_1^X, r_1), \ldots, (\mu_m^X, \sigma_m^X, r_m))$ and $\tilde{Y} = ((\mu_1^Y, \sigma_1^Y, r_1), \ldots, (\mu_m^Y, \sigma_m^Y, r_m))$ be two aligned EAPCA representations of $X$ and $Y$, respectively.
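Then,

\[ D(X, Y) \ge \sqrt{\sum_{i=1}^{m} (r_i - r_{i-1})\left[(\mu_i^X - \mu_i^Y)^2 + (\sigma_i^X - \sigma_i^Y)^2\right]} \tag{2} \]

and

\[ D(X, Y) \le \sqrt{\sum_{i=1}^{m} (r_i - r_{i-1})\left[(\mu_i^X - \mu_i^Y)^2 + (\sigma_i^X + \sigma_i^Y)^2\right]} \tag{3} \]

The lower and upper bounds in Equations 2 and 3 are realizable.

PROOF.

\[ D(X, Y)^2 = \sum_{i=1}^{m} \sum_{j=r_{i-1}+1}^{r_i} (x_j - y_j)^2 = \sum_{i=1}^{m} (r_i - r_{i-1}) \frac{\sum_{j=r_{i-1}+1}^{r_i} (x_j - y_j)^2}{r_i - r_{i-1}} = \sum_{i=1}^{m} (r_i - r_{i-1})\left[(\mu_i^X - \mu_i^Y)^2 + (\sigma_i^{X-Y})^2\right] \tag{4} \]

where $\mu_i^X - \mu_i^Y = \frac{\sum_{j=r_{i-1}+1}^{r_i} (x_j - y_j)}{r_i - r_{i-1}}$ is the mean of the differences in the $i$-th segment, and

\[ \sigma_i^{X-Y} = \sqrt{\frac{\sum_{j=r_{i-1}+1}^{r_i} (x_j - y_j)^2}{r_i - r_{i-1}} - \left(\frac{\sum_{j=r_{i-1}+1}^{r_i} (x_j - y_j)}{r_i - r_{i-1}}\right)^2} \]

is their standard deviation. Due to the definition of standard deviation, we have

\[ (\sigma_i^{X-Y})^2 = (\sigma_i^X)^2 + (\sigma_i^Y)^2 - 2\,Cov(X_i, Y_i) = (\sigma_i^X)^2 + (\sigma_i^Y)^2 - 2\rho(X_i, Y_i)\,\sigma_i^X \sigma_i^Y \tag{5} \]

where

\[ Cov(X_i, Y_i) = \frac{\sum_{j=r_{i-1}+1}^{r_i} (x_j - \mu_i^X)(y_j - \mu_i^Y)}{r_i - r_{i-1}} \tag{6} \]

is the covariance between $X_i$ and $Y_i$, and

\[ \rho(X_i, Y_i) = \frac{\sum_{j=r_{i-1}+1}^{r_i} (x_j - \mu_i^X)(y_j - \mu_i^Y)}{\sqrt{\sum_{j=r_{i-1}+1}^{r_i} (x_j - \mu_i^X)^2 \sum_{j=r_{i-1}+1}^{r_i} (y_j - \mu_i^Y)^2}} \tag{7} \]

is the correlation coefficient between segments $X_i$ and $Y_i$. Since $-1 \le \rho(X_i, Y_i) \le 1$, we have

\[ (\sigma_i^X - \sigma_i^Y)^2 \le (\sigma_i^{X-Y})^2 \le (\sigma_i^X + \sigma_i^Y)^2 \tag{8} \]

Combining Equations 8 and 4, we have Equations 2 and 3.

Comparing Equations 1 and 2, the lower bound given by EAPCA uses the standard deviations to achieve a tighter bound. The bounds are realizable.

2.3 Bounding Distances to a Set of Time Series

Often, we need to estimate the distance between a time series and a set of time series. We can infer the lower and upper bounds of the distance based on Equations 2 and 3.

For a time series $X$ and a set of time series $Y_1, \ldots, Y_l$ ($|X| = |Y_1| = \cdots = |Y_l|$), let $\tilde{X} = ((\mu_1^X, \sigma_1^X, r_1), \ldots, (\mu_m^X, \sigma_m^X, r_m))$, $\tilde{Y}_1 = ((\mu_1^{Y_1}, \sigma_1^{Y_1}, r_1), \ldots, (\mu_m^{Y_1}, \sigma_m^{Y_1}, r_m))$, ..., $\tilde{Y}_l = ((\mu_1^{Y_l}, \sigma_1^{Y_l}, r_1), \ldots, (\mu_m^{Y_l}, \sigma_m^{Y_l}, r_m))$ be aligned EAPCA representations, respectively. Let the minimal and maximal mean values in the $i$-th segments of $Y_1, \ldots, Y_l$, respectively, be $\mu_i^{min} = \min_{1 \le j \le l} \{\mu_i^{Y_j}\}$ and $\mu_i^{max} = \max_{1 \le j \le l} \{\mu_i^{Y_j}\}$. Moreover, let the minimal and maximal standard deviation values in the $i$-th segments of $Y_1, \ldots, Y_l$, respectively, be $\sigma_i^{min} = \min_{1 \le j \le l} \{\sigma_i^{Y_j}\}$ and $\sigma_i^{max} = \max_{1 \le j \le l} \{\sigma_i^{Y_j}\}$. We have the following result.

THEOREM 2 (BOUNDS ON SET). For aligned EAPCA representations $\tilde{X}$ of a time series $X$ and $\tilde{Y}_1, \ldots, \tilde{Y}_l$ of a set of time series $Y_1, \ldots, Y_l$,

\[ \min_{1 \le j \le l} \{D(X, Y_j)\} \ge \sqrt{\sum_{i=1}^{m} (r_i - r_{i-1})(LB_i^\mu + LB_i^\sigma)} \tag{9} \]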

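and

\[ \max_{1 \le j \le l} \{D(X, Y_j)\} \le \sqrt{\sum_{i=1}^{m} (r_i - r_{i-1})(UB_i^\mu + UB_i^\sigma)} \tag{10} \]

where

\[ LB_i^\mu = \begin{cases} (\mu_i^{min} - \mu_i^X)^2 & \text{if } \mu_i^X \le \mu_i^{min}; \\ 0 & \text{if } \mu_i^{min} < \mu_i^X \le \mu_i^{max}; \\ (\mu_i^{max} - \mu_i^X)^2 & \text{if } \mu_i^{max} < \mu_i^X; \end{cases} \tag{11} \]

\[ LB_i^\sigma = \begin{cases} (\sigma_i^{min} - \sigma_i^X)^2 & \text{if } \sigma_i^X \le \sigma_i^{min}; \\ 0 & \text{if } \sigma_i^{min} < \sigma_i^X \le \sigma_i^{max}; \\ (\sigma_i^{max} - \sigma_i^X)^2 & \text{if } \sigma_i^{max} < \sigma_i^X; \end{cases} \tag{12} \]

\[ UB_i^\mu = \begin{cases} (\mu_i^{max} - \mu_i^X)^2 & \text{if } \mu_i^X \le \frac{\mu_i^{min} + \mu_i^{max}}{2}; \\ (\mu_i^{min} - \mu_i^X)^2 & \text{if } \frac{\mu_i^{min} + \mu_i^{max}}{2} < \mu_i^X; \end{cases} \tag{13} \]

and

\[ UB_i^\sigma = (\sigma_i^{max} + \sigma_i^X)^2 \tag{14} \]

PROOF. (Lower bound) From Equation 2, it is easy to see that for the $i$-th segment, the component $(\mu_i^X - \mu_i^Y)^2 + (\sigma_i^X - \sigma_i^Y)^2$ can be decomposed into two items: $(\mu_i^X - \mu_i^Y)^2$, which is only related to the mean value, and $(\sigma_i^X - \sigma_i^Y)^2$, which is related to the standard deviation. Since both of them are non-negative, we can obtain a lower bound from the lower bounds of these two items. We compare $\mu_i^X$ and the range of mean values of the $Y_j$'s, $[\mu_i^{min}, \mu_i^{max}]$, and have 3 cases as follows.

Case 1: If $\mu_i^X$ is not larger than the minimal mean value $\mu_i^{min}$ of the $Y_j$'s, then for any $Y_j$, $|\mu_i^X - \mu_i^{Y_j}| \ge |\mu_i^X - \mu_i^{min}|$. Thus, $(\mu_i^X - \mu_i^{min})^2 \le (\mu_i^X - \mu_i^{Y_j})^2$. We set $LB_i^\mu = (\mu_i^X - \mu_i^{min})^2$.

Case 2: If $\mu_i^X$ falls within the range, $(\mu_i^X - \mu_i^{Y_j})^2$ can be as small as 0. We set $LB_i^\mu = 0$.

Case 3: If $\mu_i^X > \mu_i^{max}$, then for any $Y_j$, $(\mu_i^X - \mu_i^{Y_j})^2 \ge (\mu_i^X - \mu_i^{max})^2$. We set $LB_i^\mu = (\mu_i^X - \mu_i^{max})^2$.

We can derive a lower bound $LB_i^\sigma$ similarly. Combining these items, we can obtain the lower bound in the theorem. The upper bound can be proved in a similar way.

Theorem 2 indicates that, for a set of time series $\{Y_1, \ldots, Y_l\}$, to compute the lower and upper bounds between the set and any other time series, we need to maintain only the minimum and maximum mean values and standard deviations of each segment over all $Y_j$'s: $\mu_i^{min}$, $\mu_i^{max}$, $\sigma_i^{min}$ and $\sigma_i^{max}$ ($1 \le i \le m$).

3. THE DSTREE INDEX

In this section, we develop our new dynamic splitting tree index (DSTree for short) on time series.

3.1 DSTree

One critical feature in our new DSTree is the segmentation information. In general, for a time series $S = (s_1, \ldots, s_n)$, a segmentation of $S$ divides $S$ into $m$ exclusive segments $S = (S_1, \ldots, S_m)$, where $S_1 = (s_1, \ldots, s_{r_1})$ and $S_i = (s_{r_{i-1}+1}, \ldots, s_{r_i})$ ($i > 1$, $r_m = n$).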
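Given a segmentation recorded by its right endpoints, the EAPCA summary of Section 2.2 and the pairwise bounds of Theorem 1 can be sketched in Python as follows. The function names (`eapca`, `eapca_bounds`) and the sample data are hypothetical; the sketch assumes both series share the same (aligned) segmentation and uses the population standard deviation, as in the definition of $\sigma_i$.

```python
import math


def euclidean(x, y):
    """Exact Euclidean distance between two equal-length series."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def eapca(series, right_ends):
    """Return [(mean, std, right_end), ...], one triple per segment."""
    summary, start = [], 0
    for r in right_ends:
        segment = series[start:r]
        mu = sum(segment) / len(segment)
        var = sum(v * v for v in segment) / len(segment) - mu * mu
        summary.append((mu, math.sqrt(max(var, 0.0)), r))  # guard tiny negatives
        start = r
    return summary


def eapca_bounds(ex, ey):
    """Lower and upper bound on D(X, Y) from two aligned EAPCA summaries."""
    lower_sq = upper_sq = 0.0
    start = 0
    for (mu_x, sd_x, r), (mu_y, sd_y, r_other) in zip(ex, ey):
        if r != r_other:
            raise ValueError("segmentations are not aligned")
        width = r - start
        mean_gap = (mu_x - mu_y) ** 2
        lower_sq += width * (mean_gap + (sd_x - sd_y) ** 2)  # Equation 2
        upper_sq += width * (mean_gap + (sd_x + sd_y) ** 2)  # Equation 3
        start = r
    return math.sqrt(lower_sq), math.sqrt(upper_sq)


x = [1.0, 1.0, 2.0, 2.0, 5.0, 7.0, 3.0, 3.0]
y = [0.0, 2.0, 2.0, 4.0, 4.0, 4.0, 1.0, 3.0]
ends = [4, 6, 8]

lower, upper = eapca_bounds(eapca(x, ends), eapca(y, ends))
exact = euclidean(x, y)
```

On this pair the exact distance lies between the two estimates, and the lower bound is at least as large as the one obtained from segment means alone, since the standard deviation term is non-negative.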
To record a segmentation, we only need to record $m$, the number of segments, and $(r_1, \ldots, r_m)$, the right endpoints of the segments, where $r_1 < \cdots < r_m = n$.

Given a time series $S$, let $SG = (r_1, \ldots, r_m)$ and $SG' = (r'_1, \ldots, r'_{m'})$ be two segmentations. $SG'$ is called a one-segment refinement of $SG$, denoted by $SG \prec SG'$, if $m' = m + 1$ and there exists a number $i_0$ ($0 \le i_0 < m$) such that, for all $i \le i_0$, $r_i = r'_i$; and for $i > i_0$, $r_i = r'_{i+1}$.

Figure 2: DSTree.

EXAMPLE 2. Consider a time series $S$ such that $|S| = 10$, and two segmentations $SG_1 = (3, 8, 10)$ and $SG_2 = (3, 5, 8, 10)$. $SG_1$ divides $S$ into 3 segments, (1, 3), (4, 8), (9, 10). $SG_2$ divides $S$ into 4 segments, (1, 3), (4, 5), (6, 8), (9, 10). $SG_2$ is a one-segment refinement of $SG_1$ since it further divides the second segment in $SG_1$ into two smaller segments.

We call a segmentation $SG'$ a refinement of segmentation $SG$, denoted by $SG \prec^{*} SG'$, if there exists a series of segmentations $SG_1, \ldots, SG_l$ ($l \ge 2$) such that $SG = SG_1$, $SG' = SG_l$, and $SG_i \prec SG_{i+1}$ for $1 \le i < l$.

As illustrated in Figure 2, a DSTree organizes the time series to be indexed into a hierarchy. There are two types of nodes: internal nodes and leaf nodes. Each node contains the following information.

1. The number $C$ of time series indexed in the subtree rooted at this node.
2. The segmentation $SG = (r_1, \ldots, r_m)$ of the time series indexed at this node, where $r_1 < \cdots < r_m = n$, and $r_i$ ($1 \le i \le m$) is the right endpoint of the $i$-th segment.
3. A synopsis $Z = (z_1, z_2, \ldots, z_m)$, where $z_i = (\mu_i^{min}, \mu_i^{max}, \sigma_i^{min}, \sigma_i^{max})$. The synopsis is used to compute the upper and lower bounds.
4. A leaf node links to a disk file that stores up to $\psi$ time series represented by the synopsis of this leaf node, where $\psi$ is the leaf capacity of the DSTree. An internal node has two pointers pointing to children nodes.
5. An internal node stores a splitting strategy $SP$, which will be discussed in detail in Section 3.3.

In Figure 2, a circle represents an internal node, and a rectangle represents a leaf node, where up to $\psi$ time series are stored. In the figure, the segmentation and the number of segments $m$ are shown for each node, too.
In a DSTree, for an internal node $N$ and its segmentation $SG_N$, the segmentation $SG_{N'}$ in any descendant node $N'$ of $N$ is either the same as $SG_N$ or a refinement of $SG_N$. Consequently, different nodes in a DSTree may have different segmentations. Different segmentations may divide time series into different numbers of segments, such as the segmentations in nodes $N_4$ and $N_5$ in Figure 2. Even if two segmentations have the same number of segments, they can still be different. For example, in Figure 2, the segmentations in nodes $N_4$ and $N_7$ both have 3 segments, but the segmentations are still different.
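The per-node information listed above, together with the set bounds of Theorem 2, can be sketched in Python as follows. The class name `Synopsis`, its methods, and the sample data are hypothetical illustrations rather than the authors' implementation; the sketch keeps only the counter $C$, the segmentation $SG$, and the synopsis $Z$, and evaluates Equations 11 to 14 for a query.

```python
import math


def euclidean(x, y):
    """Exact Euclidean distance between two equal-length series."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def segment_stats(series, right_ends):
    """Per-segment (mean, population std) for right endpoints r_1 < ... < r_m."""
    stats, start = [], 0
    for r in right_ends:
        segment = series[start:r]
        mu = sum(segment) / len(segment)
        var = sum((v - mu) ** 2 for v in segment) / len(segment)
        stats.append((mu, math.sqrt(var)))
        start = r
    return stats


class Synopsis:
    """Hypothetical holder for C, SG and Z = (z_1, ..., z_m) of one node."""

    def __init__(self, right_ends):
        self.right_ends = list(right_ends)  # segmentation SG
        self.count = 0                      # C
        # z_i = [mu_min, mu_max, sd_min, sd_max]
        self.z = [[math.inf, -math.inf, math.inf, -math.inf] for _ in right_ends]

    def add(self, series):
        """Update C and Z when a time series is routed to this node."""
        self.count += 1
        for z, (mu, sd) in zip(self.z, segment_stats(series, self.right_ends)):
            z[0], z[1] = min(z[0], mu), max(z[1], mu)
            z[2], z[3] = min(z[2], sd), max(z[3], sd)

    def bounds(self, query):
        """Return (D_LB, D_UB) between the query and all series in the node."""
        lower_sq = upper_sq = 0.0
        start = 0
        stats = segment_stats(query, self.right_ends)
        for r, (mu_min, mu_max, sd_min, sd_max), (mu_q, sd_q) in zip(
                self.right_ends, self.z, stats):
            width = r - start
            lb_mu = max(mu_min - mu_q, 0.0, mu_q - mu_max) ** 2  # Equation 11
            lb_sd = max(sd_min - sd_q, 0.0, sd_q - sd_max) ** 2  # Equation 12
            if mu_q <= (mu_min + mu_max) / 2:                    # Equation 13
                ub_mu = (mu_max - mu_q) ** 2
            else:
                ub_mu = (mu_min - mu_q) ** 2
            ub_sd = (sd_max + sd_q) ** 2                         # Equation 14
            lower_sq += width * (lb_mu + lb_sd)
            upper_sq += width * (ub_mu + ub_sd)
            start = r
        return math.sqrt(lower_sq), math.sqrt(upper_sq)


data = [
    [1.0, 1.0, 2.0, 2.0, 5.0, 7.0, 3.0, 3.0],
    [0.0, 2.0, 2.0, 4.0, 4.0, 4.0, 1.0, 3.0],
    [2.0, 2.0, 2.0, 2.0, 6.0, 6.0, 2.0, 4.0],
]
node = Synopsis([4, 6, 8])
for series in data:
    node.add(series)

query = [3.0, 1.0, 0.0, 2.0, 9.0, 9.0, 5.0, 5.0]
d_lb, d_ub = node.bounds(query)
distances = [euclidean(query, series) for series in data]
```

The point of the design is that `bounds` reads only four numbers per segment, regardless of how many time series the node covers, so internal nodes can be evaluated without touching the raw data.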

Algorithm 1: N.Insert(X), where N is a node and X is a time series

    update Z in node N according to X
    if N is a leaf node then
        if C < ψ then                          // N has space to hold X
            append X to the data file pointed to by N; C = C + 1
        else                                   // C == ψ, no space in N to hold X
            append X to the data file pointed to by N; C = C + 1
            SP = BestSplit()
            create two children nodes for N
            for each time series Y in N do
                N' = N.routeToChild(Y, SP); N'.Insert(Y)
            end for
        end if
    else
        N' = N.routeToChild(X, SP); N'.Insert(X)
    end if

Figure 3: Horizontal splitting. (a) Splitting using mean. (b) Splitting using standard deviation.

3.2 DSTree Construction

Given a set $TS$ of time series, each of length $n$, a DSTree is constructed in two steps as follows.

Step 1: Initialization. We initialize a DSTree that contains only the root node $N_R$. The segmentation is $SG = (n)$, that is, each time series is regarded as containing only one segment.

Step 2: Insertion. We insert the time series in $TS$ one by one into the DSTree. The insertion step assigns every time series $X$ to a leaf node. Ideally, similar time series are allocated to the same leaf node or subtree, so that they can be examined together in similarity search using the same segmentations. In the interest of computational efficiency, we heuristically follow a path from the root node to assign a time series $X$ to a leaf node.

Specifically, for each time series $X \in TS$, we first visit the root node $N_R$. In the case that $N_R$ is a leaf node, we assign $X$ to $N_R$ if $N_R$ has space; otherwise, we split $N_R$ according to the splitting strategy $SP$ of $N_R$, which will be discussed later in Section 3.3. If $N_R$ is an internal node, we select the child node of $N_R$ that $X$ fits better, and recursively search until a leaf node is met.

The pseudocode of function Insert is shown in Algorithm 1. A critical step of this algorithm is the function BestSplit(), which selects the best splitting strategy whenever a node is split. We provide multiple types of splitting strategies. Whenever splitting a node, we call BestSplit() to find the best one, denoted as $SP$. Function routeToChild() uses $SP$ to determine which child node a time series belongs to.
These two functions will be discussed in the next section.

3.3 Node Splitting Strategies

At an internal node whose subtree indexes a subset of time series, there are multiple possible ways to partition the time series into smaller subsets and assign them to children nodes. We need to define a good measurement to assess the benefit of different strategies, and find a good splitting strategy. In this subsection, we first demonstrate the ideas behind various splitting strategies, and then present those strategies and a quality measure.

3.3.1 Ideas

We can split a set of time series in two ways: horizontal splitting (H-split for short) and vertical splitting (V-split for short).

In an H-split, the segmentation remains unchanged, but the set of time series is split into two disjoint sets. To split, the time series are assigned to different subsets according to a selected segment. Either the mean or the standard deviation of the selected segment can be used to make the assignments. Two examples are shown in Figure 3, in which only the $i$-th segment is shown. Figure 3(b) shows an example where the time series cannot be divided well into two subsets using the mean, but can be partitioned well using the standard deviation.

Figure 4: Vertical splitting. (a) Using H-split. (b) Using V-split.

A V-split leads to a one-segment refinement of the current segmentation. We illustrate this process in Figure 4. The time series in Figure 4(a) cannot be split well using an H-split for the $i$-th segment, since the 4 time series have similar mean and standard deviation values. In a V-split, we first split the segment into two, and then cluster time series according to the mean of the left subsegment, as shown in Figure 4(b).

DSTree provides more possible ways to divide and conquer time series, and thus has the potential to achieve more similar time series in leaf nodes. All the state-of-the-art methods, such as the R-tree based methods and iSAX, only support horizontal splitting, and only the mean values can be used in splitting.
No segmentation refinement is allowed in those methods.

3.3.2 Splitting Strategies

A selection of splitting strategies happens only when a leaf node has no space to accommodate a newly assigned time series, and thus has to be split to host two children nodes. The global user-specified parameter $\psi$ defines the maximum number of time series that can be indexed by a leaf node.

Consider a leaf node $N$ that needs to hold a set $TS_N$ of $\psi + 1$ time series, where the segmentation is $SG$. We need to split $N$ into two nodes. Now we specify the splitting strategies H-split and V-split as follows.

H-split. Suppose the $i$-th segment is selected to be used in splitting. We will discuss the choice of segment when we discuss the quality measure in Section 3.3.3. In an H-split$_M$ (for H-split using mean values), suppose the range of the mean values in the $i$-th segment of $TS_N$ is $[\mu_i^{min}, \mu_i^{max}]$. We split $N$ and generate two children nodes $N_l$ and $N_r$ with the same segmentation $SG$ as in $N$. The range of mean values of the $i$-th segment in $N_l$ is $[\mu_i^{min}, \frac{\mu_i^{min} + \mu_i^{max}}{2})$, and that in $N_r$ is $[\frac{\mu_i^{min} + \mu_i^{max}}{2}, \mu_i^{max}]$. The time series in $N$ will be assigned to $N_l$ and $N_r$ according to their mean values. Similarly, in an H-split$_{SD}$ (for H-split using standard deviation values), suppose the range of the standard deviation values in the $i$-th segment is $[\sigma_i^{min}, \sigma_i^{max}]$; the range of standard deviation in the $i$-th segment in $N_l$ is $[\sigma_i^{min}, \frac{\sigma_i^{min} + \sigma_i^{max}}{2})$, and that in $N_r$ is

$[\frac{\sigma_i^{min} + \sigma_i^{max}}{2}, \sigma_i^{max}]$.

V-split. Suppose the $i$-th segment is selected to be used in splitting. Again, we will discuss the choice of segment when we discuss the quality measure in Section 3.3.3. We refine the segmentation $SG$ by splitting the $i$-th segment into two equal-length segments: $S_i^1 = [r_{i-1} + 1, r_{i-1} + \frac{r_i - r_{i-1}}{2}]$ and $S_i^2 = [r_{i-1} + \frac{r_i - r_{i-1}}{2} + 1, r_i]$. We use one of the two new subsegments and apply an H-split to partition the time series. We denote by V-split$_L$ and V-split$_R$, respectively, the cases where the left and the right subsegment is chosen. Consequently, two children nodes are created for node $N$. Clearly, a V-split contains an H-split step.

A splitting strategy can be written as a tuple $SP = (sid, strategy, measure)$, where $sid \in [1, m]$ is the id of the segment that is selected in the splitting, $m$ is the number of segments in the current segmentation $SG$, $strategy \in \{$H-split, V-split$_L$, V-split$_R\}$ is the choice of H- or V-split (and the subsegment in the case of V-split), and $measure \in \{M, SD\}$ records whether the mean values or the standard deviation values are used in the H-split. For example, in Figure 2, $SP$ in node N is (2, V-split$_L$, M), which means that the second segment is selected, a V-split is applied, and the left subsegment and the mean values are used in the H-split. $SP$ of node $N_3$ is (2, H-split, M), which means the second segment is selected and an H-split is applied using the mean values.

3.3.3 Splitting Strategy Quality Measure

When a node is split, the time series assigned to the node are then assigned to the two children nodes created in the splitting process. As just discussed, several different strategies can be used to make the assignment, including choosing H-split or V-split, the segment used, and the measurement (mean or standard deviation). We need a quality measure to evaluate the benefit of various splitting strategies in order to choose a good one.

A brute-force method to evaluate the quality of a splitting strategy is that, for every possible strategy, we compute the similarity among the time series assigned to each child node. This brute-force method, however, is very costly. For each splitting strategy, the time complexity is $O(\psi^2)$.
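If there are $m$ segments, then the total cost to find the best strategy is $O(m \psi^2)$. In this subsection, we tackle the cost by using the upper and lower bounds of the time series in the children nodes to evaluate the split quality.

Given a node $N$ in a DSTree, let $Q$ be a query time series. The effectiveness of the upper and lower bounds in node $N$ with respect to $Q$ can be measured by the bound range, which is the difference between the upper bound and the lower bound of the distances between $Q$ and the set of time series indexed in $N$, that is,

\[ R(Q) = \sum_{i=1}^{m} (r_i - r_{i-1})\left((UB_i^\mu + UB_i^\sigma) - (LB_i^\mu + LB_i^\sigma)\right) \tag{15} \]

where $UB_i^\mu$, $UB_i^\sigma$, $LB_i^\mu$, $LB_i^\sigma$ are defined in Equations 13, 14, 11, and 12, respectively.

From Equation 15, we have $R(Q) = \sum_{i=1}^{m} (r_i - r_{i-1})(R_i^\mu + R_i^\sigma)$, where $R_i^\mu = UB_i^\mu - LB_i^\mu$ and $R_i^\sigma = UB_i^\sigma - LB_i^\sigma$. For $R_i^\mu$, according to the relationship of $\mu_i^Q$ and $[\mu_i^{min}, \mu_i^{max}]$,

\[ R_i^\mu = \begin{cases} (\mu_i^{max} - \mu_i^Q)^2 - (\mu_i^{min} - \mu_i^Q)^2, & \text{if } \mu_i^Q \le \mu_i^{min}; \\ (\mu_i^{max} - \mu_i^Q)^2, & \text{if } \mu_i^{min} < \mu_i^Q \le \frac{\mu_i^{min} + \mu_i^{max}}{2}; \\ (\mu_i^{min} - \mu_i^Q)^2, & \text{if } \frac{\mu_i^{min} + \mu_i^{max}}{2} < \mu_i^Q \le \mu_i^{max}; \\ (\mu_i^{min} - \mu_i^Q)^2 - (\mu_i^{max} - \mu_i^Q)^2, & \text{if } \mu_i^{max} < \mu_i^Q; \end{cases} \tag{16} \]

In the second and third cases, the range has the same upper bound $(\mu_i^{max} - \mu_i^{min})^2$. The more similar $\mu_i^{max}$ and $\mu_i^{min}$ are, the smaller the range is. In the first and fourth cases, it also holds that the more similar $\mu_i^{max}$ and $\mu_i^{min}$ are, the smaller the range is. Thus, we can use $(\mu_i^{max} - \mu_i^{min})^2$ to evaluate the range related to the mean value.

Similarly, for $R_i^\sigma$,

\[ R_i^\sigma = \begin{cases} (\sigma_i^{max} + \sigma_i^Q)^2 - (\sigma_i^{min} - \sigma_i^Q)^2, & \text{if } \sigma_i^Q \le \sigma_i^{min}; \\ (\sigma_i^{max} + \sigma_i^Q)^2, & \text{if } \sigma_i^{min} < \sigma_i^Q \le \sigma_i^{max}; \\ (\sigma_i^{max} + \sigma_i^Q)^2 - (\sigma_i^{max} - \sigma_i^Q)^2, & \text{if } \sigma_i^{max} < \sigma_i^Q; \end{cases} \tag{17} \]

By simple transformation, we can see that, in both the first and second cases, the range is smaller than $(2\sigma_i^{max})^2$. Moreover, in the third case, the range equals $4\sigma_i^{max}\sigma_i^Q$. In all cases, it holds that the smaller $\sigma_i^{max}$ is, the smaller the range related to standard deviation is. Thus, we can use $(\sigma_i^{max})^2$ to evaluate the range related to standard deviation.

We combine the above two components and define our measurement of estimation quality as

\[ QoS = \sum_{i=1}^{m} (r_i - r_{i-1})\left((\mu_i^{max} - \mu_i^{min})^2 + (\sigma_i^{max})^2\right) \tag{18} \]

The measurement $QoS$ does not depend on the query time series.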
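Because Equation 18 reads only the node synopsis, it can be evaluated without looking at any individual time series. The Python sketch below assumes that each segment is weighted by its length $r_i - r_{i-1}$, as in Equation 15; the function name `qos` and the two toy synopses are hypothetical.

```python
def qos(right_ends, synopsis):
    """Quality measure of a node from its segmentation and synopsis.

    right_ends: right endpoints (r_1, ..., r_m) of the segmentation.
    synopsis:   one tuple (mu_min, mu_max, sd_min, sd_max) per segment.
    Assumes each segment is weighted by its length r_i - r_{i-1}.
    """
    total, start = 0.0, 0
    for r, (mu_min, mu_max, _sd_min, sd_max) in zip(right_ends, synopsis):
        total += (r - start) * ((mu_max - mu_min) ** 2 + sd_max ** 2)
        start = r
    return total


ends = [4, 8]
# A node whose first segment covers a wide range of means and deviations...
loose = [(0.0, 4.0, 0.0, 2.0), (1.0, 3.0, 0.0, 1.0)]
# ...and a node covering more homogeneous time series.
tight = [(0.0, 2.0, 0.0, 1.0), (1.0, 3.0, 0.0, 1.0)]

qos_loose = qos(ends, loose)
qos_tight = qos(ends, tight)
```

With these toy values the loose node scores 100 and the tight node scores 40, which matches the intuition that a node covering more homogeneous time series yields narrower bound ranges.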
The smaller $QoS$ is, the more effective the bounds in a node are for similarity estimation.

3.3.4 Finding Splitting Strategies

Denote by $N$ the node to be split, and by $QoS_N$ its $QoS$ value. We split $N$ into two children nodes $N_l$ and $N_r$, and denote their $QoS$ values by $QoS_l$ and $QoS_r$, respectively. We define the splitting benefit as

\[ B = QoS_N - \frac{QoS_l + QoS_r}{2}. \]

The larger $B$ is, the better the splitting strategy.

Now, we introduce function BestSplit. For each segment, we compute $B$ for all possible vertical and horizontal splitting strategies, and select the one with the maximum $B$ value as the best strategy.

After the index building, each internal node maintains its own splitting strategy, $SP$. Given the splitting strategy $SP$ of node $N$ and a query time series $Q$, the function routeToChild can correctly find the appropriate child node. The process is similar to that of reassigning time series when splitting occurs. We first transform $Q$ according to the segmentation of $N$. Then, we re-transform $Q$ according to the corresponding splitting strategy, check which child node it belongs to, and assign $Q$ to it.

3.4 Analysis of DSTree

In this section, we discuss some factors that are related to the performance of the index.

Adaptive segmentation. In all previous approaches, one has to specify the dimensionality of the time series representation, such as the number of coefficients in DFT and DWT, and the number of segments in SAX and APCA. However, it is hard to determine the optimal parameters. DSTree avoids this difficulty by automatic segmentation splitting.

Data distribution. Ideally, the performance of an index is insensitive to the distribution of the time series to be indexed. Many existing methods assume or target some distributions in design. For example, the R-tree based approaches assume that all time series can be represented well using the same number of coefficients, which may not hold in many applications. If some time series are dominated by low-frequency data and the others are dominated by high-frequency data, a DFT-based index may have poor performance. DSTree does not assume any data distribution since different nodes have their own representations.
For this reason, time series with dramatically different characteristics can still be handled well by using different nodes.

Balance of DSTree. iSAX and DSTree both may generate imbalanced index trees. Our experimental results (Table 2) show that

DSTree is better than iSAX 2.0 in terms of balancing because of the multiple splitting strategies. Heuristically, we can improve the balance of DSTrees in two ways. First, we can shuffle the data set and build the index several (e.g., 3-5) times using different input orders of time series, and then pick the best one as the final index. Second, we can adjust the tree by a post-process where we move the extraordinarily deep subtrees toward the root node. Limited by space, we omit the details here.

The major search cost in DSTree, and also in some other time series indexes such as iSAX, is to retrieve time series data from disk. Searching internal nodes in the index is relatively quick. Therefore, keeping similar time series in a leaf node can help to reduce the number of I/O operations needed in a similarity search. This is very different from lookup queries using a B+-tree or similar indexes, where the whole index is stored on disk.

Extension to subsequence matching. The subsequence matching problem is to find matching subsequences between two time series, which may have different lengths. The state-of-the-art approaches partition (long) time series into a set of equal-length subsequences based on overlapped windows, and then build the index for these subsequences for fast similarity search. The search results are assembled to compute matching subsequences. Since the time series after partitioning for similarity search are of the same length, DSTree can be used to support subsequence matching directly.

4. QUERY ANSWERING ALGORITHMS

A DSTree supports two types of queries. The first one is the traditional similarity search, which returns the time series nearest to the query time series. The second type is to estimate the distance distribution, which returns a histogram of distances between the query time series and all indexed time series.
Algorithm exactSearch(Q)
Input: A query time series Q
Output: The nearest time series TS with distance D_bsf
    N_bsf = HeuristicSearch(Q);
    (TS, D_bsf) = calcMinDist(N_bsf, Q);
    Initialize distance priority queue pq;
    pq.add(N_R, D_LB(N_R, Q));
    while !pq.isEmpty() do
        (N_cur, LB_cur) = pq.popMin();
        if LB_cur > D_bsf then
            break;
        end if
        if N_cur is a leaf node then
            (X, Dist) = calcMinDist(N_cur, Q);
            if Dist < D_bsf then
                D_bsf = Dist; TS = X;
            end if
        else
            for all children nodes N of N_cur do
                if D_LB(N, Q) < D_bsf then
                    pq.add(N, D_LB(N, Q));
                end if
            end for
        end if
    end while
    return TS, D_bsf;

4.1 Similarity Search

Before introducing the exact similarity search, we first introduce a heuristic search method, which is more efficient and will be used in the exact search method later.

4.1.1 A Heuristic Method

Algorithm 3 Histogram(Q)
 1: Input: A query time series Q
 2: Output: A distance histogram Hist
 3: Initialize a distance range count list L;
 4: Initialize a node stack Stack;
 5: Stack.push(Root, 0, +∞);
 6: while !Stack.isEmpty() do
 7:     (N, LB_p, UB_p) = Stack.pop();
 8:     LB = D_LB(N, Q);
 9:     UB = D_UB(N, Q);
10:     LB = max(LB, LB_p);
11:     UB = min(UB, UB_p);
12:     if N is a leaf node then
13:         Count = N.C; L.add(LB, UB, Count);
14:     else
15:         for all child nodes N' of N do
16:             Stack.push(N', LB, UB);
17:         end for
18:     end if
19: end while
20: Hist = BuildHistogram(L);
21: return Hist;

Instead of finding the exact most similar time series by checking all possible nodes in a DStree, a heuristic search only investigates one leaf node, and tries to find the most similar time series in this node. This method is based on the heuristic that similar time series are often indexed in the same node. Specifically, given a query Q, we start from the root node. If the root node is not a leaf node, then we find a child node of the root node that can hold Q as if Q were inserted into the index. This search process is conducted recursively until a leaf node N is met. Then, we calculate the distance D(S, Q) for every time series S ∈ TS_N, and return the time series with the shortest distance.
Please note that the heuristic method, as the name suggests, may not find the most similar time series in the whole data set.

4.1.2 The Exact Search

To speed up search, we combine the heuristic search method and the lower bounding distance function to prune the search space. The pseudocode is given in Algorithm exactSearch. The algorithm begins with a best-so-far (BSF) answer returned by the heuristic search method. The intuition is that, by quickly obtaining a time series that is likely similar to the query time series, a large portion of the search space may be pruned effectively.

Once a BSF is obtained, a priority queue, denoted by pq, is created to examine nodes that may host time series that are potentially more similar to the query time series than the BSF answer. This priority queue is initialized to include only the root node. The algorithm then repeatedly extracts the node with the smallest lower bound distance from the priority queue until either the priority queue becomes empty or an early termination condition is met. The early termination occurs when the lower bound distance is greater than or equal to the distance of the BSF answer. When the condition is satisfied, the remaining nodes in the priority queue cannot contain the nearest neighbor and can be pruned.

To process a node from the priority queue, two cases may happen. (1) If the node is a leaf node, we fetch its time series from disk and compute the distance from the query to these time series, recording the minimum distance. If this distance is less than our BSF answer, we update the BSF answer. (2) If the node is an internal node, its children nodes are inserted into the priority queue provided their lower bound distances to the query time series are less than the distance of the BSF answer.
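The heuristic descent and the best-first exact search described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the DStree implementation: the toy index below splits on a single coordinate per level, and its lower bound is the distance from the query to a node's axis-aligned bounding box, which stands in for the paper's mean and standard deviation based bound D_LB. All names (`Node`, `build`, `lower_bound`, `calc_min_dist`, `heuristic_search`, `exact_search`) are hypothetical.

```python
import heapq
import itertools
import math
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Series = Tuple[float, ...]


def dist(a: Series, b: Series) -> float:
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


@dataclass
class Node:
    lo: Series                      # per-coordinate minimum over the subtree
    hi: Series                      # per-coordinate maximum over the subtree
    series: List[Series] = field(default_factory=list)  # leaf payload
    split_dim: int = 0              # assumed toy splitting strategy
    split_val: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    @property
    def is_leaf(self) -> bool:
        return self.left is None


def build(series: List[Series], capacity: int = 4, depth: int = 0) -> Node:
    """Build a toy binary index (NOT the DStree splitting policy)."""
    n = len(series[0])
    node = Node(lo=tuple(min(s[i] for s in series) for i in range(n)),
                hi=tuple(max(s[i] for s in series) for i in range(n)))
    dim = depth % n
    mid = sorted(s[dim] for s in series)[len(series) // 2]
    left = [s for s in series if s[dim] < mid]
    right = [s for s in series if s[dim] >= mid]
    if len(series) <= capacity or not left or not right:
        node.series = list(series)
        return node
    node.split_dim, node.split_val = dim, mid
    node.left = build(left, capacity, depth + 1)
    node.right = build(right, capacity, depth + 1)
    return node


def lower_bound(node: Node, q: Series) -> float:
    """Distance from q to the node's bounding box (assumed stand-in for D_LB)."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for l, x, h in zip(node.lo, q, node.hi)))


def calc_min_dist(leaf: Node, q: Series) -> Tuple[Series, float]:
    """Scan one leaf and return its closest series with the distance."""
    best = min(leaf.series, key=lambda s: dist(s, q))
    return best, dist(best, q)


def heuristic_search(root: Node, q: Series) -> Node:
    """Route q to the single leaf it would be inserted into."""
    node = root
    while not node.is_leaf:
        node = node.left if q[node.split_dim] < node.split_val else node.right
    return node


def exact_search(root: Node, q: Series) -> Tuple[Series, float]:
    """Best-first search seeded with the heuristic best-so-far answer."""
    ts, bsf = calc_min_dist(heuristic_search(root, q), q)
    tie = itertools.count()         # tie-breaker so nodes are never compared
    pq = [(lower_bound(root, q), next(tie), root)]
    while pq:
        lb, _, node = heapq.heappop(pq)
        if lb > bsf:                # early termination
            break
        if node.is_leaf:
            x, d = calc_min_dist(node, q)
            if d < bsf:
                ts, bsf = x, d
        else:
            for child in (node.left, node.right):
                child_lb = lower_bound(child, q)
                if child_lb < bsf:  # prune children that cannot improve
                    heapq.heappush(pq, (child_lb, next(tie), child))
    return ts, bsf
```

Because the bounding-box distance never exceeds the true distance to any series stored below a node, pruning on it cannot discard the nearest neighbor; any other valid lower bound could be substituted in `lower_bound` without changing the search loop.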

4.2 Distance Distribution Histogram

Algorithm 3 gives the pseudocode for computing an equi-width histogram of the distances from a query time series to all time series indexed by a DStree. We collect all statistical information of the leaf nodes to form a list, denoted by L, in which each entry represents the number of time series falling in a certain distance range. The range can be estimated based on Theorem .

EXAMPLE 3. A list L = ([, ], ), ([5, 3], 5), ([4, 5], ) means that there are 3 leaf nodes: N_1, N_2 and N_3. N_1 includes time series, and their distance from Q is between [, ]. The distance range and number of time series in N_2 and N_3 are ([5, 3], 5) and ([4, 5], ) respectively.

Since the entries of any two leaf nodes are disjoint, there is no redundant information in L. Thus, we can obtain a corresponding histogram quickly.

One issue is that in some cases, the lower (or upper) bound of a child node may be smaller (or larger) than that of its parent node. In other words, the bounds in the parent node may be tighter than those in the children nodes. Using the bounds at such children nodes causes less accurate estimation for the children nodes. We address this with Theorem 3, which is easy to show.

THEOREM 3. If the estimated range of the distance in a node is [LB, UB], and that in its parent node is [LB_p, UB_p], then [max(LB, LB_p), min(UB, UB_p)] is a tighter and correct range for this node.

Using Theorem 3, whenever a node is traversed, we first compute the lower and upper bounds according to Theorem . Then we compare the bounds with those in its parent node, and use the tighter bounds instead (Lines 7-11 in Algorithm 3).

Furthermore, one may build a histogram more quickly by traversing some internal nodes instead of all leaf nodes. Specifically, we propose an approach here, called α-level (0 < α ≤ 1), to compute a histogram. Denote by H the height of a DStree. For each path from the root to a leaf node, we select the αH-th internal node instead of the corresponding leaf node to generate the list L. If the length of a path is shorter than αH, we simply use the leaf node.
In other words, we use the nodes located in a certain cross section of the whole tree to generate L. Algorithm 3 can be extended to this case easily. The experimental results show that we can obtain good estimation with 3 -level.

Once we obtain the list L, we can compute a histogram based on it. There are multiple ways to compute a histogram. A straightforward way is to assume that the time series contained in a node are distributed uniformly.

EXAMPLE 4. Consider the list L in Example 3. We can estimate the number of time series within the range [5, ] as =. 3 5

In general, we can assume that the time series in a node follow some distribution, such as a Gaussian distribution. We can spend some extra space to maintain the parameters of the model, such as mean and variance, which allow more accurate and efficient estimation. Limited by space, we omit the details here.

5. EMPIRICAL EVALUATION

In this section, we report extensive experiments to verify the effectiveness of DStree. We compare both PAA-index (using PAA as representation and R-tree as index) and iSAX 2.0 with DStree in index efficiency, approximate search error rate and pruning power. We also showcase the lower bound tightness and the accuracy of histogram estimation. All experiments were executed on a laptop computer with an Intel Core 5.5GHz CPU and 4GB main memory. All experimental results were averaged over 5 runs.

5.1 Data Sets and Default Setting

The time series in both synthetic and real data sets were normalized with Z-normalization.

5.1.1 Synthetic Data Sets

Each of our synthetic data sets is a combination of four types of time series as follows.

Random walk time series. The start point is picked randomly from range [-5, 5] and the step length is chosen randomly in range [, ];

One-segment Gaussian time series. The values in the whole time series are picked from a Gaussian distribution with mean value and standard deviation randomly selected in ranges [-5, 5] and [, ], respectively;

Multi-segment Gaussian time series. Such a time series is a concatenation of multiple one-segment Gaussian time series. The number of segments is randomly set between 3 and .
Mixed sine time series. Each time series is a mixture of several sine waves whose period is randomly set in range [, ], amplitude is randomly set in range [, ], and mean value is randomly chosen in range [-5, 5].

To generate a time series, the synthetic data generator first randomly chooses a type, and then picks the corresponding parameters randomly to generate the time series. We generated four synthetic data sets of time series lengths 64, 128, 256 and 512, respectively. Each data set contains one million time series by default. We also use synthetic data sets of up to million time series in the scalability test.

5.1.2 Real Data Sets

We used a real data set collected in a bridge condition monitoring system. In this system, data was collected from about one thousand sensors of more than types, such as thermometers, accelerometers, strain gauges, displacement meters, and fatigue meters. The length of each time series is 256, and one million time series were collected. The total storage space is about 3GB.

5.1.3 Parameters

To verify the effectiveness of data-adaptive and dynamic segmentation versus global segmentation, we compared DStree with PAA-index (implemented by ourselves) and iSAX 2.0 (source code provided by the authors). Both PAA-index and iSAX 2.0 use fixed, global segmentations. To test the performance extensively, we built PAA-index and iSAX 2.0 with segment sizes of 8, , 6 and respectively. The leaf capacity threshold, ψ, was set to . The FBL size for iSAX 2.0 was set to ,. The fill factor of the R-tree in PAA-index was set to .

5.2 Index Size

We did not implement iSAX 2.0 by ourselves. Instead, we used the implementation provided by the authors. We realize that the implementation details, particularly the storage methods, in iSAX 2.0 and DStree may be different. To avoid any confusion, we report the absolute index size for the methods we implemented but not for iSAX 2.0.

The first group of experiments compares the index space cost of DStree, PAA-index and iSAX 2.0 with respect to the length of time series.
Specifically, we report three measurements, namely the number of nodes in the tree, the physical index size, and the average number of time series contained in a leaf node. The number of nodes includes both internal and leaf nodes. Considering the difference in data representations among the three approaches, we also compare the physical index size for DStree and PAA. We use the average number of time series in leaf nodes to evaluate the balance of the index nodes.

Figure 5: Index size on the synthetic data sets. Panels: (a) Number of nodes, (b) Index size (MB), (c) Average #ts per leaf, (d) # segments/node; x-axis: length of time series.

Figure 6: Index size on the real data set. Panels: (a) Number of nodes, (b) Index size (MB), (c) Average #ts per leaf, (d) # segments/node.

The results on the synthetic data sets are shown in Figure 5, and those on the real data set are in Figure 6. Four different segmentation sizes, 8, , 6 and , were tested for both PAA-index and iSAX 2.0. Label PAA-6 means PAA-index with 6 segments.

Figure 5(a) shows that, in all three approaches, the number of nodes is insensitive to the length of time series. However, the number of segments has different effects on iSAX 2.0 and PAA-index. In iSAX 2.0, the number of nodes increases exponentially as the number of segments increases; for example, iSAX-6 and iSAX- have many more nodes. PAA-index is insensitive to the number of segments. The number of nodes in DStree is stable and far less than those in iSAX-6 and iSAX-. The number of nodes affects the search efficiency. If it is too large, the average number of time series per leaf node decreases and more I/O cost is needed to read data from disk.

Figure 5(b) compares the absolute index size; the smaller, the better. The size of an index is determined by two factors: the number of nodes and the unit space cost per node. For PAA-index, the space cost of each node increases almost linearly as the number of segments increases. Since DStree needs to maintain both mean and standard deviation values, it has a larger unit space cost. However, benefiting from the dynamic splitting strategies, the average segment size of DStree is small.
Figure 5(c) shows the average number of time series per leaf node. The smaller the number, the fewer time series in expectation can be retrieved from a leaf node. The number in DStree is about 5. The number in PAA-index is the largest (about 6) due to the R-tree structure. In iSAX 2.0, this value decreases when the number of segments increases for two reasons. First, in iSAX 2.0, the root node has too many children nodes (for example, 6 for iSAX-6). Second, it uses fixed segmentation. In some cases, it is difficult to split a set of time series only based on the mean value.

Since DStree uses a dynamic segmentation strategy, the segment size varies in different nodes. We report the average segment size with respect to the length of time series, that is, the ratio of the total number of segments in all nodes against the number of nodes. Figure 5(d) shows the results. The average segment size of DStree increases moderately when the length of time series increases, which confirms the effectiveness of our splitting strategies. The average segment sizes of the other approaches are insensitive to the length of time series.

Figure 6 shows the results on the real data set. The trends are similar to those on the synthetic data sets. The number of nodes of DStree is similar to those of iSAX- and smaller than those of iSAX-6 and iSAX-. The average number of segments per node of DStree is . Although the time series in the real data set are more diverse, DStree can still represent the time series with a small number of segments, which verifies the effectiveness of the dynamic splitting strategy in DStree.

Both iSAX 2.0 and DStree are binary trees. To examine the balance of those indexes, Table compares the average height of iSAX 2.0 and DStree. We use the normalized standard deviation (that is, standard deviation divided by the average) of the tree height to measure the balance of the trees. We do not consider PAA-index in this comparison because the R-tree, though balanced, has a much larger fan-out factor. The height of all indexes increases very moderately as the time series length increases. These indexes are all scalable with respect to long time series. iSAX 2.0
are substantially shorter than DStree in average height, but clearly taller than DStree in maximum height. The dynamic splitting strategy in DStree can effectively avoid long branches. The small normalized standard deviation values in DStree clearly show that DStree has good balance.

Table 3 examines the effect of the leaf capacity threshold ψ. The number of nodes and index size of DStree, iSAX-, and iSAX-8 decrease dramatically as ψ increases. iSAX-6 and iSAX- do not gain much from a larger leaf capacity threshold. One reason is that iSAX 2.0 with m segments may have up to 2^m children nodes of the root node, though many such nodes may contain a very small number of time series.

Table : Average height (Avg), normalized standard deviation (NSD), and maximum length (Max) of the indexes. (S256 denotes the synthetic data set with time series of length 256, and R256 denotes the real data set with length 256.)

Table 3: Number of nodes and index size (MB) versus leaf capacity threshold ψ.

5.3 Accuracy

We tested the effectiveness of the indexes in similarity search, including both heuristic search and exact search. The accuracy of heuristic search is measured by the error rate $E = \frac{D' - D}{D}$, where D and D' are the distances between the query time series and the exact nearest neighbor and the heuristic search result, respectively. For exact search, we compare the pruning power, which is the ratio of the number of time series pruned against the total number of time series. For both heuristic and exact search, time series were used as the queries, half of them picked randomly from the data set, and the rest generated randomly.

Figures 7 and 8, respectively, show the results on the synthetic and real data sets. In Figure 7(a), although the error rate increases as the length of time series increases for all three methods, DStree clearly outperforms the others. In Figure 8(a), the error rate decreases for PAA-index and iSAX 2.0 when the segment size increases, since using more segments can represent the time series more accurately. With the same segment size, iSAX 2.0 outperforms PAA-index.

Interestingly, when the query time series is picked from the data sets, both iSAX 2.0 and DStree correctly find the leaf node containing the right time series due to the disjoint space division property of these two approaches. PAA-index finds the wrong node in some of such cases, because of the intersection of MBRs in the R-tree. When the query is generated randomly, DStree is more accurate in finding the corresponding leaf nodes than iSAX 2.0 because of our data-adaptive splitting strategy.

Figures 7(b) and 8(b) show the pruning power of exact similarity search.
The pruning power of DStree is greater than 95% on all synthetic data sets and is 98% on the real data set, which is clearly better than those of the other two approaches. The pruning power of PAA-index increases dramatically as the segment size increases from 8 to . However, the marginal performance gain decreases as the segment size increases further. A reason is that the R-tree performs poorly with high dimensionality. iSAX 2.0 has a similar trend.

The advantages of DStree come from two factors. First, a tighter lower bound helps to prune more nodes. Second and more importantly, the proposed data-adaptive splitting strategies can cluster similar time series better. Consequently, the heuristic search in DStree is more accurate, which gives a good starting point in exact search. Moreover, fewer data files are visited since similar time series are clustered better into fewer nodes.

Figure 7: Error rate and pruning power on the synthetic data sets. Panels: (a) Heuristic search error rate, (b) Exact search pruning power; x-axis: length of time series.

Figure 8: Error rate and pruning power on the real data set. Panels: (a) Heuristic search error rate, (b) Exact search pruning power.

5.4 Lower Bound Tightness

We tested the tightness of the proposed lower bound estimation approach. We measure the lower bound tightness by the ratio of the estimated lower bound distance against the minimum distance from a query to all time series indexed in a node. This ratio is between 0 and 1; the larger, the better. We collected this information during the processing of exact search. Figure 9 shows the results on both the synthetic and real data sets. The lower bound using both mean and standard deviation values is tighter than that using only mean values.

5.5 Histogram Computation

Figure 10 compares the exact distance histogram obtained by a full scan of the data and the distance histogram estimated by the α-level method (Section 4.2).
For the latter, three cases are shown. Full level uses all leaf nodes to estimate. /3 level and /3 level, respectively, use internal nodes located at the /3-level and /3-level cross sections to compute the histogram. Although the full and /3
