RECONCILING ATTRIBUTE VALUES FROM MULTIPLE DATA SOURCES

Zhengrui Jiang
School of Management, University of Texas at Dallas, Richardson, TX U.S.A.
zx011000@utdallas.edu

Prabuddha De
Krannert School of Management, Purdue University, West Lafayette, IN U.S.A.
pde@purdue.edu

Sumit Sarkar
School of Management, University of Texas at Dallas, Richardson, TX U.S.A.
sumit@utdallas.edu

Debabrata Dey
University of Washington Business School, Seattle, WA U.S.A.
ddey@uwashington.edu

2004 Twenty-Fifth International Conference on Information Systems

Abstract

Because of the heterogeneous nature of multiple data sources, data integration is often one of the most challenging tasks of today's information systems. While the existing literature has focused on problems such as schema integration and entity identification, our current study attempts to answer a basic question: When an attribute value for a real-world entity is recorded differently in two databases, how should the "best" value be chosen from the set of possible values? We first show how probabilities for attribute values can be derived, and then propose a framework for deciding the cost-minimizing value based on the total cost of type I, type II, and misrepresentation errors.

Keywords: Data integration, heterogeneous databases, probabilistic databases, misclassification errors, misrepresentation errors

Introduction

Business decisions often require data from multiple sources. As has been widely documented, integrating data from several existing independent databases poses a variety of complex problems. Data integration problems arising from heterogeneous data sources can be divided into two broad categories: schema-level problems and instance-level problems (Rahm and Do 2000). Topics such as schema integration (Batini et al. 1986) and semantic conflict resolution (Ram and Park 2004) belong to the first category, while problems such as entity identification and matching (Dey et al. 1998b) and data cleaning and duplication removal (Hernandez and Stolfo 1998) belong to the second. All of these problems have been extensively studied and various solutions have been proposed.

However, after schema integration and entity matching, another problem emerges: What should be done if, once all schema-level problems have been resolved, all real-world entities optimally matched, and duplicates removed, we still face two conflicting data values for the same attribute of a real-world entity? How should we deal with the conflicting attribute values when we merge, for example, alumni data stored separately by a university and by one of its departments and encounter two different work addresses for the same person? One solution is to store all conflicting values with associated probabilities; probabilistic relational models (Dekhtyar et al. 2001; Dey and Sarkar 1996) have been proposed in that context. However, despite the theoretical progress on the probabilistic database model, it is not commercially available as yet. Even if it becomes readily available in the future, because of the significant overhead associated with storing and handling probabilistic data, it remains to be seen whether it is cost-justifiable to implement a probabilistic database model. Therefore, storing the most likely value, or the "best" value based on some given criteria, seems to be a more practical solution at this point.

To choose a single deterministic value, we first need to evaluate the probabilities assigned to each conflicting value. Various pieces of information, such as values of related attributes, time stamps of stored values in different data sources, and data source reliabilities, may be utilized to estimate the probability distributions associated with all possible true attribute values. The approach we propose in this study is as follows: We first estimate the probability for each conflicting attribute value based on source data (attribute) reliability, and then determine the best value to store, based on the total expected error costs associated with each candidate value. We examine stochastic attributes with only discrete domains in this study.

The paper is organized as follows. In the next section, we derive the probability associated with each possible attribute value based on source data reliability. We then classify queries based on possible errors that may result from incorrect attribute values. Such classifications are used in computing the total expected cost associated with incorrect values being stored in the database. We demonstrate how the cost-minimizing attribute values can be determined for a discrete attribute. We extend the solution to multiple discrete attributes. The last section provides concluding remarks and discusses possible extensions.

Computing Attribute Value Probabilities

We first derive the probabilities for a single discrete attribute. Consider data sources $S_1$ and $S_2$. We denote by $A^1$ the value of attribute $A$ for a particular entity instance as observed in $S_1$, and by $A^2$ the value of attribute $A$ for the same entity instance as observed in $S_2$. For example, we may find $a_1$ in $S_1$ ($A^1 = a_1$) and $a_2$ in $S_2$ ($A^2 = a_2$).
For any number of reasons, the data in these data sources may be inaccurate (Dey et al. 1998b). We would like to determine the probability that a specific value (which may or may not be the value observed from a data source) is indeed the true value of an attribute. When multiple sources are involved, this requires us to consider the reliability of the different data sources. In general, the required probability terms can be expressed as $P(A = a_k \mid A^1 = a_i, A^2 = a_j)$. We identify the following situations that cover all the possibilities:

Case 1a: $k = i = j$; $P(A = a_i \mid A^1 = a_i, A^2 = a_i)$,
Case 1b: $k \neq i = j$; $P(A = a_k \mid A^1 = a_i, A^2 = a_i)$; and
Case 2a: $k = i \neq j$; $P(A = a_i \mid A^1 = a_i, A^2 = a_j)$,
Case 2b: $k \neq i \neq j$; $P(A = a_k \mid A^1 = a_i, A^2 = a_j)$.

What Is Available?

We can sample $S_1$ and $S_2$ to find what proportion of values of attribute $A$ is accurate in $S_1$, and what proportion of values of attribute $A$ is accurate in $S_2$, in general. For example, we may sample $S_1$ and $S_2$, and find that attribute $A$ is accurate in $S_1$ 80 percent of the time (i.e., 80 percent of the values for attribute $A$ are correct in the sample from $S_1$), and that it is accurate in $S_2$ 90 percent of the time (Morey 1982). Then, $P(A^1 = a_i \mid A = a_i) = 0.8$, $P(A^1 \neq a_i \mid A = a_i) = 0.2$; and $P(A^2 = a_j \mid A = a_j) = 0.9$, $P(A^2 \neq a_j \mid A = a_j) = 0.1$.

Assumptions

Our objective is to determine the desired probabilities for the attribute values based on sample estimates from each individual data source. In order to do that, we need to make a few assumptions. These are listed next.

Assumption 1: The value of $A$ recorded in $S_1$ (i.e., $A^1$) is not dependent on the value of $A$ recorded in $S_2$, once we know the true value of $A$ (and vice versa). This implicitly assumes that the causes of errors are independent in the two data sources. Mathematically, this implies

$$P(A^1 = a_i \mid A = a_k, A^2 = a_j) = P(A^1 = a_i \mid A = a_k) \quad \forall\, i, j, k.$$

The above assumption would, of course, be violated if the data in one source are derived from the data in the other source.

Assumption 2: Our priors for $P(A = a_i)$, $P(A = a_j)$, etc. are the same if we have no reason a priori to believe that one value is more likely to occur than another. This would be true in general when the domain is large or quite unpredictable. Mathematically, this implies

$$P(A = a_k) = 1/|A| \quad \forall\, k,$$

where $|A|$ is the number of possible realizations of attribute $A$. If the domain is restricted to just a few values, and one (or a subset of those values) is known to be predominant, then the appropriate priors should be used. These priors can be incorporated in our estimate (we omit this analysis here for space considerations).

Assumption 3: All possible values of $A$ other than that observed from a particular data source are assumed to be equally likely. Mathematically, this implies

$$P(A^1 = a_k \mid A = a_i) = \frac{P(A^1 \neq a_i \mid A = a_i)}{|A| - 1} = \frac{1 - P(A^1 = a_i \mid A = a_i)}{|A| - 1} \quad \forall\, k \neq i,$$

where $|A|$ is the number of possible realizations of attribute $A$. The implications are similar to those of the previous assumption.

Probability Estimates for Attribute Values: Single Attribute Case

Based on the reliability information about an attribute in two data sources and the assumptions discussed above, we derive the probabilities of true values in various situations as follows (detailed derivations are provided in the appendix).

Case 1a: $k = i = j$;

$$P(A = a_i \mid A^1 = a_i, A^2 = a_i) = \frac{P(A^1 = a_i \mid A = a_i)\, P(A^2 = a_i \mid A = a_i)}{P(A^1 = a_i \mid A = a_i)\, P(A^2 = a_i \mid A = a_i) + P(A^1 \neq a_i \mid A = a_i)\, P(A^2 \neq a_i \mid A = a_i) / [|A| - 1]}$$

The above expression illustrates that, as the number of possible values of an attribute increases, the likelihood that both sources have incorrectly captured $A$ goes down.

Case 1b: $k \neq i = j$;

$$P(A = a_k \mid A^1 = a_i, A^2 = a_i) = \frac{P(A^1 \neq a_i \mid A = a_i)\, P(A^2 \neq a_i \mid A = a_i) / [|A| - 1]^2}{P(A^1 = a_i \mid A = a_i)\, P(A^2 = a_i \mid A = a_i) + P(A^1 \neq a_i \mid A = a_i)\, P(A^2 \neq a_i \mid A = a_i) / [|A| - 1]}$$

Case 2a: $k = i \neq j$;

$$P(A = a_i \mid A^1 = a_i, A^2 = a_j) = \frac{P(A^1 = a_i \mid A = a_i)\, P(A^2 \neq a_j \mid A = a_j)}{P(A^1 = a_i \mid A = a_i)\, P(A^2 \neq a_j \mid A = a_j) + P(A^1 \neq a_i \mid A = a_i)\, P(A^2 = a_j \mid A = a_j) + P(A^1 \neq a_i \mid A = a_i)\, P(A^2 \neq a_j \mid A = a_j)\, [|A| - 2] / [|A| - 1]}$$

In the situation where $A$ can have only two values $a_i$ and $a_j$, the above expression becomes

$$P(A = a_i \mid A^1 = a_i, A^2 = a_j) = \frac{P(A^1 = a_i \mid A = a_i)\, P(A^2 \neq a_j \mid A = a_j)}{P(A^1 = a_i \mid A = a_i)\, P(A^2 \neq a_j \mid A = a_j) + P(A^1 \neq a_i \mid A = a_i)\, P(A^2 = a_j \mid A = a_j)}$$

The expression for $P(A = a_j \mid A^1 = a_i, A^2 = a_j)$ is analogous.

Case 2b: $k \neq i \neq j$;

$$P(A = a_k \mid A^1 = a_i, A^2 = a_j) = \frac{P(A^1 \neq a_i \mid A = a_i)\, P(A^2 \neq a_j \mid A = a_j) / [|A| - 1]}{P(A^1 = a_i \mid A = a_i)\, P(A^2 \neq a_j \mid A = a_j) + P(A^1 \neq a_i \mid A = a_i)\, P(A^2 = a_j \mid A = a_j) + P(A^1 \neq a_i \mid A = a_i)\, P(A^2 \neq a_j \mid A = a_j)\, [|A| - 2] / [|A| - 1]}$$

Probability Estimates for Attribute Values: Multiple Attributes Case

In the above derivation, only one common attribute is considered. The solutions can, however, be easily extended to situations where multiple attributes are common across the databases. If the reliabilities of the attributes are independent of one another, the analysis presented in the previous subsection applies to each attribute individually. When the reliabilities across attributes are dependent, we treat that group of attributes as one composite attribute and all possible combinations of the attribute values as the possible realizations of the composite attribute. If two attributes $A$ and $B$ form a composite attribute, then the number of realizations for this composite attribute is $|A| \cdot |B|$. For example, if the values stored for a composite attribute are the same in two locations, then analogous to case 1a, we have

$$P(A = a_i, B = b_j \mid A^1 = a_i, B^1 = b_j, A^2 = a_i, B^2 = b_j) = \frac{N}{N + D},$$

where

$$N = P(A^1 = a_i, B^1 = b_j \mid A = a_i, B = b_j)\, P(A^2 = a_i, B^2 = b_j \mid A = a_i, B = b_j),$$

$$D = P\big((A^1, B^1) \neq (a_i, b_j) \mid A = a_i, B = b_j\big)\, P\big((A^2, B^2) \neq (a_i, b_j) \mid A = a_i, B = b_j\big) / [|A| \cdot |B| - 1].$$

The desired probabilities for the other cases can be obtained in the same manner.

Alumni Database Example

Consider alumni data that are collected independently in both a university database ($S_1$) and a department database ($S_2$), and suppose we want to merge them into one database. Some attribute values for the same person could be different in the two sources.

Binary Attribute

Suppose there exists a binary attribute Self-Employed (SE for brevity), which can take a value of $a_1$ = Yes or $a_2$ = No. Assume that, based on sampling, we have found that this attribute is accurate in $S_1$ 80 percent of the time and that it is accurate in $S_2$ 90 percent of the time, i.e., $P(SE^1 = a_i \mid SE = a_i) = 0.8$, $P(SE^1 \neq a_i \mid SE = a_i) = 0.2$, $\forall\, i = 1, 2$; and $P(SE^2 = a_j \mid SE = a_j) = 0.9$, $P(SE^2 \neq a_j \mid SE = a_j) = 0.1$, $\forall\, j = 1, 2$.
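The single-attribute estimates above reduce to a few lines of arithmetic once the two source reliabilities are known. The following is a minimal sketch under Assumptions 1 to 3; the function name `value_posteriors`, its parameters, and the dictionary keys are illustrative choices and do not come from the paper.

```python
def value_posteriors(r1, r2, n, agree):
    """Posterior probabilities of the true value under Assumptions 1-3.

    r1, r2: sampled reliabilities of the attribute in S1 and S2.
    n:      number of possible realizations |A| (n >= 2).
    agree:  True if both sources store the same value.
    The 'other' entry is the probability of EACH single value that
    neither source stores.
    """
    e1, e2 = 1.0 - r1, 1.0 - r2          # error rates of the two sources
    if agree:
        # Cases 1a and 1b: both sources report the same value.
        denom = r1 * r2 + e1 * e2 / (n - 1)
        return {"stored": r1 * r2 / denom,
                "other": e1 * e2 / (n - 1) ** 2 / denom}
    # Cases 2a and 2b: the sources report different values.
    denom = r1 * e2 + e1 * r2 + e1 * e2 * (n - 2) / (n - 1)
    return {"s1_value": r1 * e2 / denom,
            "s2_value": e1 * r2 / denom,
            # With a binary attribute there is no third value.
            "other": e1 * e2 / (n - 1) / denom if n > 2 else 0.0}


# Self-Employed example: binary attribute, reliabilities 0.8 and 0.9.
conflict = value_posteriors(0.8, 0.9, 2, agree=False)
match = value_posteriors(0.8, 0.9, 2, agree=True)
```

In each case the returned probabilities, with `other` counted once per unobserved value, sum to one, which is a convenient sanity check on the reliabilities supplied.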
When the stored values of SE in the two sources are different for an alumnus, i.e., $SE^1 = a_i$ and $SE^2 = a_j$ with $a_i \neq a_j$, we have:

$$P(SE = a_i \mid SE^1 = a_i, SE^2 = a_j) = \frac{0.8 \times 0.1}{0.8 \times 0.1 + 0.9 \times 0.2} = 8/26 = 0.308,$$

$$P(SE = a_j \mid SE^1 = a_i, SE^2 = a_j) = \frac{0.9 \times 0.2}{0.8 \times 0.1 + 0.9 \times 0.2} = 18/26 = 0.692.$$

In case the stored values in the two sources are the same for a particular person, we have

$$P(SE = a_i \mid SE^1 = a_i, SE^2 = a_i) = \frac{0.8 \times 0.9}{0.8 \times 0.9 + 0.2 \times 0.1} = 72/74 = 0.973,$$

$$P(SE = a_k \mid SE^1 = a_i, SE^2 = a_i) = \frac{0.2 \times 0.1}{0.8 \times 0.9 + 0.2 \times 0.1} = 2/74 = 0.027.$$

Multi-Valued Attribute

Suppose the original alumni database also stores the current home location of the alumni, and the attribute Home_Location (HL for brevity) can be any of the 50 states in the United States. We may find that, for instance, the Home_Location for an alumnus Robert Black is stored as TX in the university database $S_1$ and as LA in the department database $S_2$. Assuming that the attribute is accurate in $S_1$ 80 percent of the time and accurate in $S_2$ 90 percent of the time, we are able to calculate the distribution of the true home location values for Robert Black as follows, with $|HL| = 50$ (i.e., there are 50 possible state values):

$$P(HL = TX \mid HL^1 = TX, HL^2 = LA) = \frac{0.8 \times 0.1}{0.8 \times 0.1 + 0.9 \times 0.2 + 0.2 \times 0.1 \times 48/49} = 0.286,$$

$$P(HL = LA \mid HL^1 = TX, HL^2 = LA) = \frac{0.9 \times 0.2}{0.8 \times 0.1 + 0.9 \times 0.2 + 0.2 \times 0.1 \times 48/49} = 0.644,$$

$$P(HL = HL_k \mid HL^1 = TX, HL^2 = LA) = \frac{0.2 \times 0.1 / 49}{0.8 \times 0.1 + 0.9 \times 0.2 + 0.2 \times 0.1 \times 48/49} = 0.00146$$

(for each $HL_k$ other than TX or LA).

On the other hand, if the Home_Location shown for Timothy Earnest in both data sources is NY, the value distribution is as follows:

$$P(HL = NY \mid HL^1 = NY, HL^2 = NY) = \frac{0.8 \times 0.9}{0.8 \times 0.9 + 0.2 \times 0.1 / 49} = 0.99943,$$

$$P(HL = HL_k \mid HL^1 = NY, HL^2 = NY) = \frac{0.2 \times 0.1 / 2401}{0.8 \times 0.9 + 0.2 \times 0.1 / 49} = 1.15627 \times 10^{-5}$$

(for each $HL_k$ other than NY).

Table 1 summarizes the Home_Location value distribution under the two different cases.

Table 1. Alumni Data

| A_ID  | FName   | LName   | Employer | Title         | Home_Location Value | Prob.                    |
|-------|---------|---------|----------|---------------|---------------------|--------------------------|
| 10001 | Robert  | Black   | Walmart  | Sales Manager | TX                  | 0.286                    |
|       |         |         |          |               | LA                  | 0.644                    |
|       |         |         |          |               | AOV*                | 0.00146                  |
| 10002 | Timothy | Earnest | GTE      | Accountant    | NY                  | 0.99943                  |
|       |         |         |          |               | AOV*                | $1.15627 \times 10^{-5}$ |

* AOV: Any Other Value. The probability for AOV reflects the probability that any other specific attribute value, except those listed separately, is true. For example, from Table 1 we know that the probability that California is the true Home_Location for Robert Black is 0.00146.

Classification of Queries and Errors

As in prior research (Dey et al. 1998a; Mendelson and Saharia 1986), we assume that all relevant queries have been identified. Based on where the attribute being examined appears in a query, we categorize the relevant queries into three classes. If the stochastic attribute(s) appear only in the selection condition of a query, we call this query a class C (Conditioning) query. If the attribute(s) appear only in the projection list of a query, we call this query a class T (Targeting) query.
We call a query a class CT query if the attribute(s) being examined appear in both the selection condition and the projection list. In the alumni example, if Home_Location is a stochastic attribute, then query Q1 is of class C, Q2 is of class T, and Q3 is of class CT.

Q1: Display ID of those alumni who live in LA.
Q2: Display Name and Home_Location of all alumni who work for GTE.
Q3: Display ID, Name, and Home_Location of all alumni who live in TX.

If the stored value of an attribute is not the true value, three types of errors can occur. A type I error occurs when an object should have been selected by a query based on the true value of an attribute, but was not selected because the stored value was different from the true value. A type II error occurs when an object that should not have been selected based on the true attribute value was selected because of the incorrectly recorded value. A misrepresentation error occurs when the value displayed for an attribute in a query output is not the true value.

The following parameters are applicable to all classes of queries:

(1) $f(q)$: Frequency of query $q$.
(2) $\alpha(q)$: Cost of a type I error for query $q$.
(3) $\beta(q)$: Cost of a type II error for query $q$.
(4) $\gamma(a, q)$: Average cost of one occurrence of misrepresenting attribute $a$ in the query output of $q$. The parameter $a$ is omitted in the single attribute problem.
(5) $\tau(q)$: Expected percentage of objects in a relation that may be selected by query $q$. For the simple query "display all alumni who live in Texas," $\tau(q)$ equals the expected percentage of alumni who live in Texas. For the complex query "display all alumni who live in Texas AND are at least 50 years old," $\tau(q)$ equals the product of the expected percentage of alumni who live in Texas and the expected percentage of alumni who are at least 50 years old, assuming that the attributes Home_Location and AGE are independent of each other.

Table 2. Cost Matrix for Three Classes of Queries

| Query Class                     | Type I Error | Type II Error | Misrepresentation Error |
|---------------------------------|--------------|---------------|-------------------------|
| (C) Conditioning                | $\alpha(q)$  | $\beta(q)$    | N/A                     |
| (T) Targeting                   | N/A          | N/A           | $\gamma(q)$             |
| (CT) Conditioning and Targeting | $\alpha(q)$  | $\beta(q)$    | $\gamma(q)$             |

The three cost parameters of a query are estimated based on the utilization of the query output. Consider a direct marketing firm that runs a query based on some given criteria to identify potential customers. In this case, the expected net profit per potential customer and the average cost of sending a promotion to each potential customer constitute the type I error cost and the type II error cost, respectively. While both the type I error cost and the type II error cost are unique for a particular query, the misrepresentation cost is specific to a query as well as to one of its attributes displayed in the query output. In the direct marketing example, if the potential customer's street address is in the query output, then the cost of misrepresenting the street address of a potential customer equals the expected net profit per potential customer times the probability that the mail would be lost or returned due to the incorrect address information.
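The classification rule and the error types that apply to each class can be written down compactly. The sketch below is only an illustration: the names `classify_query`, `RELEVANT_ERRORS`, and `tau_conjunction` are hypothetical, and `tau_conjunction` relies on the independence assumption stated for parameter (5).

```python
def classify_query(in_selection, in_projection):
    """Return 'C', 'T', or 'CT' depending on where the stochastic
    attribute appears; None if the query does not use the attribute."""
    if in_selection and in_projection:
        return "CT"
    if in_selection:
        return "C"
    if in_projection:
        return "T"
    return None


# Error types that can occur for each query class.
RELEVANT_ERRORS = {
    "C": ("type I", "type II"),
    "T": ("misrepresentation",),
    "CT": ("type I", "type II", "misrepresentation"),
}


def tau_conjunction(*fractions):
    """tau(q) for an AND of conditions on mutually independent
    attributes: the product of the individual selection fractions."""
    result = 1.0
    for fraction in fractions:
        result *= fraction
    return result


# Q1 uses Home_Location only in its condition, Q2 only in its output,
# and Q3 in both.
classes = [classify_query(True, False),
           classify_query(False, True),
           classify_query(True, True)]
```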
The three classes of queries and their relevant error types are summarized in Table 2.

Attribute Reconciliation: Single Stochastic Attribute

We start our analysis with the simplest case where there is only one stochastic attribute in a relation. We use the alumni example shown in Table 1 to illustrate how the cost-minimizing value for the attribute Home_Location can be determined for Robert Black, given the value distribution listed in Table 1. We start our analysis using a set of simple queries, follow it with some more complex queries, and finally provide the standardized procedure for the analysis.

An Example with Queries Having a Single Clause in the Condition

For the purpose of illustration, we assume that the three queries discussed in the previous section are the only queries relevant to the Home_Location attribute, i.e., only these three queries have the Home_Location attribute either in the selection condition or in the projection list. All three queries have a single clause in the condition. We first calculate the expected type I, type II, and misrepresentation error costs when different values are chosen to be stored in the merged table, and then compare the total costs to determine the best value to store.

Cost of Type I and Type II Errors. We first examine the cost of type I and type II errors when TX is the stored Home_Location value for Robert Black. Only Q1 and Q3 need to be considered for type I and type II errors. Obviously, Q1 will not select Robert Black if the stored Home_Location value for him is TX. Given that he is not selected, if Robert Black's true Home_Location is indeed LA, then a type I error occurs. The frequency of this occurrence equals the product of the frequency $f(Q1)$ and the probability that the true value is LA, denoted by $P(LA)$. On the other hand, when Q3 is processed, Robert Black will be selected. Given that Robert Black has been selected based on the stored deterministic value TX, if the true value is not TX, but LA or any other value, then a type II error occurs. The frequency of this occurrence equals the product of the frequency $f(Q3)$ and the probability that the true value is not TX, which equals $1 - P(TX)$. The above analysis is summarized in Table 3.

Table 3. Cost of Type I and Type II Errors (If TX Is Stored)

| Query   | Retrieval Criterion | Result       | If True Value Is | Type I Error Cost        | Type II Error Cost           |
|---------|---------------------|--------------|------------------|--------------------------|------------------------------|
| Q1 (C)  | LA                  | Not selected | Not LA           | 0                        | N/A                          |
|         |                     |              | LA               | $\alpha(Q1) f(Q1) P(LA)$ | N/A                          |
| Q3 (CT) | TX                  | Selected     | TX               | N/A                      | 0                            |
|         |                     |              | Not TX           | N/A                      | $\beta(Q3) f(Q3) [1 - P(TX)]$ |

By multiplying the error frequencies by the cost parameters for each query, we obtain the total of type I and type II error costs resulting from choosing TX as the stored value:

$$C_{I,II}(TX) = \alpha(Q1) f(Q1) P(LA) + \beta(Q3) f(Q3) [1 - P(TX)]. \tag{1}$$

Similarly, the type I and type II error costs resulting from choosing LA as the stored value for Robert Black (shown below) are obtained based on the analysis presented in Table 4.

$$C_{I,II}(LA) = \beta(Q1) f(Q1) [1 - P(LA)] + \alpha(Q3) f(Q3) P(TX). \tag{2}$$

Table 4. Cost of Type I and Type II Errors (If LA Is Stored)

| Query   | Retrieval Criterion | Result       | If True Value Is | Type I Error Cost        | Type II Error Cost           |
|---------|---------------------|--------------|------------------|--------------------------|------------------------------|
| Q1 (C)  | LA                  | Selected     | LA               | N/A                      | 0                            |
|         |                     |              | Not LA           | N/A                      | $\beta(Q1) f(Q1) [1 - P(LA)]$ |
| Q3 (CT) | TX                  | Not selected | Not TX           | 0                        | N/A                          |
|         |                     |              | TX               | $\alpha(Q3) f(Q3) P(TX)$ | N/A                          |

Table 5 shows the type I and type II error costs incurred if any value other than TX and LA is stored, and equation (3) shows the resulting cost expression:

$$C_{I,II}(AOV) = \alpha(Q1) f(Q1) P(LA) + \alpha(Q3) f(Q3) P(TX). \tag{3}$$

In this example, the cost analysis shown in Table 5 is also valid if NULL is chosen. Therefore, we have

$$C_{I,II}(NULL) = \alpha(Q1) f(Q1) P(LA) + \alpha(Q3) f(Q3) P(TX). \tag{4}$$
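As a worked illustration, the Table 1 probabilities for Robert Black, $P(LA) = 0.644$ and $P(TX) = 0.286$, can be substituted into expressions (1) through (4). Here $\alpha$ and $\beta$ denote the type I and type II error costs and $f$ the query frequency; they are left symbolic because no numeric values are assigned to them in the text.

```latex
\begin{align*}
C_{I,II}(\mathrm{TX}) &= 0.644\,\alpha(Q1) f(Q1) + 0.714\,\beta(Q3) f(Q3),\\
C_{I,II}(\mathrm{LA}) &= 0.356\,\beta(Q1) f(Q1) + 0.286\,\alpha(Q3) f(Q3),\\
C_{I,II}(\mathrm{AOV}) = C_{I,II}(\mathrm{NULL}) &= 0.644\,\alpha(Q1) f(Q1) + 0.286\,\alpha(Q3) f(Q3).
\end{align*}
```

The coefficients 0.714 and 0.356 are $1 - P(TX)$ and $1 - P(LA)$, computed from the rounded probabilities reported in Table 1.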

Table 5. Cost of Type I and Type II Errors (If Any Value Other than TX and LA Is Stored)

| Query   | Retrieval Criterion | Result       | If True Value Is | Type I Error Cost        | Type II Error Cost |
|---------|---------------------|--------------|------------------|--------------------------|--------------------|
| Q1 (C)  | LA                  | Not selected | Not LA           | 0                        | N/A                |
|         |                     |              | LA               | $\alpha(Q1) f(Q1) P(LA)$ | N/A                |
| Q3 (CT) | TX                  | Not selected | TX               | $\alpha(Q3) f(Q3) P(TX)$ | N/A                |
|         |                     |              | Not TX           | 0                        | N/A                |

Cost of Misrepresentation Errors. A misrepresentation error occurs when a value in a query output is not the true attribute value, and this type of error is relevant only to class T and class CT queries. In our example, Q2 is a class T query and Q3 is a class CT query. We first assume that TX is chosen to be the deterministic value for Robert Black. Therefore, whenever Robert Black is selected by a query and Home_Location is in the projection list, the value displayed will be TX. Given that TX is displayed in the query output, if the actual true value is not TX, a misrepresentation error occurs. The frequency of this occurrence equals the frequency with which Robert Black is selected by a class T or class CT query times the probability that TX is not the true value.

To calculate the frequency with which Robert Black is selected, we examine Q2 and Q3 separately. Since Q3 always selects Robert Black if TX is chosen to be the deterministic value, the frequency with which Robert Black is selected by Q3 equals the frequency of the query, $f(Q3)$. For the class T query Q2, we assume that all objects are equally likely to be selected by this query. Therefore, the frequency with which Robert Black is selected by Q2 equals $f(Q2)\tau(Q2)$. Multiplying the error frequencies by $\gamma$, the unit cost of a misrepresentation error, and summing over all class T and class CT queries, we obtain the total misrepresentation cost when TX is chosen to be the stored value as follows:

$$C_m(TX) = [\gamma(Q3) f(Q3) + \gamma(Q2) f(Q2) \tau(Q2)][1 - P(TX)]. \tag{5}$$

If LA had been chosen to be the stored value, Robert Black would never appear in the query output of Q3. The total misrepresentation cost, denoted by $C_m(LA)$, thus equals the following:

$$C_m(LA) = \gamma(Q2) f(Q2) \tau(Q2) [1 - P(LA)]. \tag{6}$$

Now consider the case when a value other than TX or LA is chosen for Robert Black. We can ignore Q3 since it will not select Robert Black. The misrepresentation cost is straightforward:

$$C_m(AOV) = \gamma(Q2) f(Q2) \tau(Q2) [1 - P(AOV)]. \tag{7}$$

Finally, we examine the misrepresentation cost if NULL is stored. For simplicity, we assume that NULL is never the true value and that the unit misrepresentation cost is the same whether NULL or any other incorrect value is stored. Then the resulting misrepresentation cost is

$$C_m(NULL) = \gamma(Q2) f(Q2) \tau(Q2). \tag{8}$$

If the cost of misrepresentation is different when NULL is stored, then we can estimate another misrepresentation parameter $\gamma'$ specifically for NULL and replace $\gamma$ in (8) by $\gamma'$. All other analyses remain the same.

Minimizing Total Error Cost. The best deterministic Home_Location value for Robert Black is chosen by minimizing the total expected error cost, obtained by summing up the cost of type I and type II errors and the cost of misrepresentation errors:

$$TC(TX) = \alpha(Q1) f(Q1) P(LA) + [\beta(Q3) f(Q3) + \gamma(Q3) f(Q3) + \gamma(Q2) f(Q2) \tau(Q2)][1 - P(TX)], \tag{9}$$

$$TC(LA) = \alpha(Q3) f(Q3) P(TX) + [\beta(Q1) f(Q1) + \gamma(Q2) f(Q2) \tau(Q2)][1 - P(LA)], \tag{10}$$

$$TC(AOV) = \alpha(Q1) f(Q1) P(LA) + \alpha(Q3) f(Q3) P(TX) + \gamma(Q2) f(Q2) \tau(Q2) [1 - P(AOV)], \tag{11}$$

$$TC(NULL) = \alpha(Q1) f(Q1) P(LA) + \alpha(Q3) f(Q3) P(TX) + \gamma(Q2) f(Q2) \tau(Q2). \tag{12}$$
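Expressions (9) through (12) can be evaluated side by side for any set of query parameters. The sketch below does so for Robert Black; the function name `total_costs`, the dictionary layout, and every numeric frequency and cost value are hypothetical and were chosen only to make the example run. Only the three probabilities come from Table 1.

```python
def total_costs(p, q1, q2, q3):
    """Total expected error cost of storing TX, LA, AOV, or NULL,
    following expressions (9)-(12).

    p:      probabilities P(TX), P(LA), P(AOV).
    q1, q3: parameters f, alpha, beta (and gamma for the CT query Q3).
    q2:     parameters f, gamma, tau of the class T query Q2.
    """
    t2 = q2["gamma"] * q2["f"] * q2["tau"]      # class T term shared by all choices
    miss_q1 = q1["alpha"] * q1["f"] * p["LA"]   # type I cost when Q1 misses him
    miss_q3 = q3["alpha"] * q3["f"] * p["TX"]   # type I cost when Q3 misses him
    return {
        "TX": miss_q1 + (q3["beta"] * q3["f"] + q3["gamma"] * q3["f"] + t2)
              * (1 - p["TX"]),
        "LA": miss_q3 + (q1["beta"] * q1["f"] + t2) * (1 - p["LA"]),
        "AOV": miss_q1 + miss_q3 + t2 * (1 - p["AOV"]),
        "NULL": miss_q1 + miss_q3 + t2,
    }


# Probabilities from Table 1; all other numbers are hypothetical.
p = {"TX": 0.286, "LA": 0.644, "AOV": 0.00146}
q1 = {"f": 10, "alpha": 5.0, "beta": 1.0}
q2 = {"f": 20, "gamma": 2.0, "tau": 0.05}
q3 = {"f": 10, "alpha": 5.0, "beta": 1.0, "gamma": 2.0}

costs = total_costs(p, q1, q2, q3)
best = min(costs, key=costs.get)                # value with the smallest total cost
```

With these illustrative parameters the minimizer is LA. Raising the type II cost `beta` of Q1 and Q3 far enough (to 100, say, with everything else unchanged) shifts the minimum to AOV, because a stored value that no query selects can never cause a type II error.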

The value that minimizes the total cost should be the one stored in the merged table. Depending on the cost parameters, any of the values can be chosen. In normal situations, TX or LA should be the likely best value. However, in cases where the cost of type II errors is significantly higher than the cost of type I errors, AOV could be the cost-minimizing option. This is due to the fact that if AOV or NULL is stored, Robert Black will not be selected by Q1 and Q3 and hence type II errors will never occur.

Additional Queries with Disjunctive Clauses in the Condition

The example discussed above involves three simple queries with a single clause in the condition. To generalize the solution, we consider three additional queries Q4, Q5, and Q6 that have disjunctive clauses in the selection condition. Among them, Q4 is of class CT and Q5 and Q6 are of class C.

Q4: Display ID, Name, and Home_Location of those alumni who live in OK or TX.
Q5: Display ID of those alumni who live in CA, NY, or TX.
Q6: Display ID, Name, and Employer of those alumni who live in IN, MN, or PA.

We use the example data for Robert Black to illustrate how the costs associated with these new queries can be determined. Assume that TX is the stored value. We first derive the misrepresentation cost since it is relatively simple. As discussed earlier, for the misrepresentation cost, we only need to consider the class T queries and those class CT queries that select Robert Black. Therefore, among the new queries, only Q4 needs to be considered for the misrepresentation cost. Based on the discussion presented above, the misrepresentation cost associated with Q4 is $[1 - P(TX)][\gamma(Q4) f(Q4)]$. The costs of type I and type II errors are summarized in Table 6.
Table 6. Cost of Type I and Type II Errors (If TX Is Chosen to Be the Deterministic Value for Robert Black)

| No.     | Retrieval Criterion | Result       | If True Value Is | Type I Error Cost                           | Type II Error Cost                                   |
|---------|---------------------|--------------|------------------|---------------------------------------------|------------------------------------------------------|
| Q4 (CT) | OK, TX              | Selected     | OK or TX         | N/A                                         | 0                                                    |
|         |                     |              | Others           | N/A                                         | $\beta(Q4) f(Q4) [1 - P(OK) - P(TX)]$                |
| Q5 (C)  | CA, NY, TX          | Selected     | CA, NY, or TX    | N/A                                         | 0                                                    |
|         |                     |              | Others           | N/A                                         | $\beta(Q5) f(Q5) [1 - P(CA) - P(NY) - P(TX)]$        |
| Q6 (C)  | IN, MN, PA          | Not selected | IN, MN, or PA    | $\alpha(Q6) f(Q6) [P(IN) + P(MN) + P(PA)]$  | N/A                                                  |
|         |                     |              | Others           | 0                                           | N/A                                                  |

Based on Table 6 and Table 3, we make the following observations:

Observation 1: Given that the chosen deterministic value is included in the retrieval criterion of a query (such as Q3, Q4, and Q5), the probability of a type II error equals the probability that a value other than those included in the retrieval criterion is the true value.

Observation 2: If the chosen deterministic value is not included in the retrieval criterion of a query (e.g., Q1 or Q6), then the probability of a type I error equals the probability that one of the values included in the query's retrieval criterion is the true value.

To determine which value is the best choice, the total cost associated with other possible values also needs to be calculated. The value that results in the smallest cost should be stored.
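For a concrete instance of the two observations, the Table 1 distribution for Robert Black (0.286 for TX, 0.644 for LA, and 0.00146 for every other state) can be substituted into the probability terms of Table 6, with TX as the stored value. Only the probability factors are evaluated; the cost and frequency parameters stay symbolic.

```latex
\begin{align*}
\text{Q4 (type II):}\quad & 1 - P(\mathrm{OK}) - P(\mathrm{TX}) = 1 - 0.00146 - 0.286 = 0.71254,\\
\text{Q5 (type II):}\quad & 1 - P(\mathrm{CA}) - P(\mathrm{NY}) - P(\mathrm{TX}) = 1 - 2(0.00146) - 0.286 = 0.71108,\\
\text{Q6 (type I):}\quad & P(\mathrm{IN}) + P(\mathrm{MN}) + P(\mathrm{PA}) = 3(0.00146) = 0.00438.
\end{align*}
```

These figures use the rounded probabilities from Table 1, so they carry the same rounding error.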

A Standardized Procedure Based on Coverage Bitmap

As we can see from the previous analyses, if the number of possible realizations of the attribute or the number of relevant queries is large, the error cost calculation can be a tedious process. To simplify the computation, we construct a query coverage bitmap as shown in Table 7. The query coverage bitmap summarizes the values included in the retrieval criterion of each query. For example, for Q5, the columns for CA, NY, and TX are marked as 1 since these three state values are included in the retrieval criterion of Q5. The last column, labeled PS, represents the probability sum, i.e., the probability that any one of the attribute values included in the query's retrieval criterion is the true attribute value.

Table 7. Coverage Bitmap

| Query   | CA | IN | LA | MN | NY | OK | PA | TX | PS                    |
|---------|----|----|----|----|----|----|----|----|-----------------------|
| Q1 (C)  | 0  | 0  | 1  | 0  | 0  | 0  | 0  | 0  | P(LA)                 |
| Q3 (CT) | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 1  | P(TX)                 |
| Q4 (CT) | 0  | 0  | 0  | 0  | 0  | 1  | 0  | 1  | P(OK) + P(TX)         |
| Q5 (C)  | 1  | 0  | 0  | 0  | 1  | 0  | 0  | 1  | P(CA) + P(NY) + P(TX) |
| Q6 (C)  | 0  | 1  | 0  | 1  | 0  | 0  | 1  | 0  | P(IN) + P(MN) + P(PA) |

Input: Probabilistic value vector V and corresponding probability vector P.
1. Find the corresponding column index numbers in the Coverage Bitmap for all components of V and keep them in vector J.
2. Let C_min = a very large number; BV = V_0.
   For all j in J (representing all probabilistic values):
   Begin
      TC = 0; C_I = 0; C_II = 0;
      C_m = (1 - P_j) * SUM over class T queries q of f(q) τ(q) γ(q).
      For each query q_i with row index i: Do
         If QCBitmap[i][j] equals 1,
            /* q_i selects the object: possible type II errors and misrepresentation errors. */
            Then increase the cost of type II errors C_II by f(q_i) β(q_i) [1 - PS(q_i)];
                 and if q_i is a class CT query, then increase the misrepresentation cost C_m by (1 - P_j) f(q_i) γ(q_i);
         Else /* q_i will not select the object: possible type I errors. */
            increase the cost of type I errors C_I by f(q_i) α(q_i) PS(q_i).
         EndIf
      End
      TC = C_I + C_II + C_m.
      If (TC < C_min) Then C_min = TC, BV = V_j.
   End
Output: The cost-minimizing value BV and the associated total cost.

Figure 1. Procedure for Determining the Best Value
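The procedure of Figure 1 translates fairly directly into code. The following is a sketch rather than the authors' implementation: each bitmap row is represented as the set of values whose cell is 1, and the function name `best_value`, the argument layout, and all frequency and cost numbers are assumptions made for illustration. The probabilities are those of Table 1.

```python
def best_value(candidates, probs, default_prob, bitmap, queries, class_t):
    """Return (cost-minimizing value, total cost).

    candidates:   values to evaluate as the stored value.
    probs:        probability for each explicitly listed value.
    default_prob: probability of any other single value (AOV).
    bitmap:       query name -> set of values in its retrieval criterion.
    queries:      query name -> f, alpha, beta, plus gamma if class CT.
    class_t:      list of class T queries with f, tau, gamma.
    """
    def p(value):
        return probs.get(value, default_prob)

    # PS column: probability that the true value satisfies each criterion.
    ps = {name: sum(p(v) for v in covered) for name, covered in bitmap.items()}
    # Class T misrepresentation cost per unit of (1 - P_j).
    t_cost = sum(q["f"] * q["tau"] * q["gamma"] for q in class_t)

    best, c_min = None, float("inf")
    for value in candidates:
        c1 = c2 = 0.0
        cm = (1 - p(value)) * t_cost
        for name, covered in bitmap.items():
            q = queries[name]
            if value in covered:          # bitmap cell is 1: object is selected
                c2 += q["f"] * q["beta"] * (1 - ps[name])
                if "gamma" in q:          # class CT query displays the value
                    cm += (1 - p(value)) * q["f"] * q["gamma"]
            else:                         # bitmap cell is 0: object is not selected
                c1 += q["f"] * q["alpha"] * ps[name]
        total = c1 + c2 + cm
        if total < c_min:
            best, c_min = value, total
    return best, c_min


# Coverage bitmap of Table 7; the parameter values below are hypothetical.
bitmap = {"Q1": {"LA"}, "Q3": {"TX"}, "Q4": {"OK", "TX"},
          "Q5": {"CA", "NY", "TX"}, "Q6": {"IN", "MN", "PA"}}
queries = {name: {"f": 10, "alpha": 5.0, "beta": 1.0} for name in bitmap}
queries["Q3"]["gamma"] = 2.0              # Q3 and Q4 are class CT
queries["Q4"]["gamma"] = 2.0
class_t = [{"f": 20, "tau": 0.05, "gamma": 2.0}]   # Q2
value, cost = best_value(["TX", "LA", "CA", "IN", "OK", "WA"],
                         {"TX": 0.286, "LA": 0.644}, 0.00146,
                         bitmap, queries, class_t)
```

Precomputing the PS column once keeps the main loop at one constant-time step per (candidate, query) pair. NULL is not handled here; under the stated simplification it would need its own probability of zero and, if desired, its own misrepresentation parameter.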

Based on the bitmap, we can automate the cost calculation and value determination process. Figure 1 shows the algorithm for determining the best deterministic value for an object with a probabilistic value vector V and a corresponding probability distribution P. The total cost associated with each chosen value is determined as follows: First, if there are class T queries, the associated misrepresentation cost is calculated. Second, the column in the query coverage bitmap that corresponds to the chosen value is identified. Third, for each query in the bitmap, the value in the cell that corresponds to the row of the query and the column of the chosen value is checked. If the value is 1, the cost of type II errors is calculated based on observation 1 discussed in the previous subsection, and the misrepresentation cost associated with this query is determined if the query is of class CT; if the value is 0, the cost of type I errors is calculated based on observation 2 in the previous subsection. Fourth, the total cost associated with the chosen value is determined by summing up all three types of costs. Finally, the best deterministic value is chosen based on the total error costs. If the number of queries is n and the number of probabilistic values is r, then the complexity of this standardized procedure is O(nr).

Discussion

For the above procedure, not all possible values need to be explicitly examined to decide which one is the best. If, in a group of candidate values, all have the same probability of being the true value, and one of the following two conditions holds: (1) none of the candidate values appears in any query, or (2) if one value appears in a query, then all other candidate values in the group also appear in the query in exactly the same manner, then the total expected cost associated with each value in the group is always the same.
In the given example, the expected cost when either CA or NY is chosen is the same; the expected cost associated with choosing IN or MN is the same; and the costs associated with all other values except TX, LA, CA, NY, IN, MN, OK and NULL are also the same.

Attribute Reconciliation: Multiple Stochastic Attributes

In the previous section, we have shown how the cost-minimizing value can be determined for a single stochastic attribute based on query parameters. In this section, we extend the analysis to multiple stochastic attributes with conflicting values from different data sources. There are two possible cases:

(1) The reliabilities for the stochastic attributes are mutually independent. In this case, the cost-minimizing value for each attribute can be determined individually without considering other attributes. The probability derivation for a single attribute shown in the second section and the value determination process discussed in the previous section can be applied.

(2) The reliabilities are dependent. In that scenario, we have to consider the combinations of attribute values and their joint probabilities. As shown earlier, we can estimate the joint probabilities for all feasible value combinations. The cost-minimizing value combination can be determined based on the total expected error cost, which includes, as in the single attribute case, the costs of type I errors, type II errors, and misrepresentation errors.

To illustrate how the cost-minimizing value combination can be determined, consider a modified alumni database example as shown in Table 8. In this example, we assume that the values of both attributes Employer and Home_Location are different for Robert Black in the two data sources. Further assume that the probabilities are as shown in Table 8 for the combination of Employer and Home_Location. The number of realizations for Employer is assumed to be 200 and the number of possible Home_Locations is again assumed to be 50. Thus, the total number of possible realizations of the composite attribute is 200 × 50 = 10,000.

Table 8. Modified Probabilistic Alumni Data

A_ID  | FName  | LName | Title         | (Employer, H_L) | Prob.
10001 | Robert | Black | Sales Manager | (Walmart, TX)   | 0.6
      |        |       |               | (Nortel, LA)    | 0.3
      |        |       |               | AOV             | 1.02 × 10^-5
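The marginal probability of a single attribute value follows from the joint distribution in Table 8 by summing over the other attribute. The following is a minimal Python sketch of that arithmetic, not part of the original procedure. It assumes that every combination other than the two named ones carries the AOV probability from Table 8; the placeholder value names (`E0`, `L0`, and so on) and the function names are hypothetical.

```python
# Joint distribution of (Employer, Home_Location) for Robert Black, per Table 8.
P_AOV = 1.02e-5  # probability of each "all other values" combination
NAMED = {("Walmart", "TX"): 0.6, ("Nortel", "LA"): 0.3}

# 200 employers and 50 locations; the unnamed ones are hypothetical placeholders.
EMPLOYERS = ["Walmart", "Nortel"] + [f"E{n}" for n in range(198)]
LOCATIONS = ["TX", "LA"] + [f"L{n}" for n in range(48)]


def joint(employer, location):
    """Joint probability that (employer, location) is the true value pair."""
    return NAMED.get((employer, location), P_AOV)


def marginal_employer(employer):
    """Sum the joint probabilities over all locations."""
    return sum(joint(employer, loc) for loc in LOCATIONS)


def marginal_location(location):
    """Sum the joint probabilities over all employers."""
    return sum(joint(emp, location) for emp in EMPLOYERS)


p_walmart = marginal_employer("Walmart")
p_tx = marginal_location("TX")
print(f"P(Walmart) = {p_walmart:.4f}, P(TX) = {p_tx:.3f}")
```

Only the two named combinations deviate from the AOV probability, so each marginal is the named probability plus the AOV probability multiplied by the number of remaining partners (49 locations for Walmart, 199 employers for TX).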

Consider the situation when the following are the only queries that have the above two attributes either in their projection lists or in their selection conditions:

Q7: Display Name, Employer, and Home_Location of those alumni who are managers.
Q8: Display ID, Name, and Home_Location of all alumni who work for Walmart.
Q9: Display ID and Name of those alumni who work for Walmart OR live in LA.
Q10: Display Name and Home_Location of those alumni who work for Nortel AND live in TX.

Among the four queries, since Q7 has both Employer and Home_Location in its project list, it is of class T with respect to both Employer and Home_Location. Similarly, Q8 is of class C with respect to Employer and of class T with respect to Home_Location; Q9 is of class C with respect to both Employer and Home_Location; and Q10 is of class C with respect to Employer and of class CT with respect to Home_Location. Among the four queries, Q9 and Q10 deserve special attention since both attributes appear in the selection condition of the two queries. The difference is that the two parts of the selection condition in Q9 are connected by the OR operator and those in Q10 are connected by the AND operator.

We illustrate how the costs of type I, type II, and misrepresentation errors associated with each query are determined for the example shown in Table 8. We first consider the case when (Walmart, TX) is the stored value for Robert Black in the merged table.

Cost of Misrepresentation Errors. For misrepresentation errors, only those class T and class CT queries with respect to either Employer or Home_Location, i.e., queries that include either Employer or Home_Location or both in their project list, need to be considered. In our example, Q7 and Q8 are the only class T queries and Q10 is the only class CT query.
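The class assignments above follow mechanically from where an attribute appears in a query. The following is a minimal Python sketch, assuming (as the usage in this section suggests) that class T means the attribute appears only in the projection list, class C only in the selection condition, and class CT in both. The dictionary layout and the function name `query_class` are hypothetical.

```python
# Each query lists the attributes in its projection list and selection condition.
QUERIES = {
    "Q7": {"project": {"Name", "Employer", "Home_Location"}, "select": {"Title"}},
    "Q8": {"project": {"ID", "Name", "Home_Location"}, "select": {"Employer"}},
    "Q9": {"project": {"ID", "Name"}, "select": {"Employer", "Home_Location"}},
    "Q10": {"project": {"Name", "Home_Location"},
            "select": {"Employer", "Home_Location"}},
}


def query_class(query, attribute):
    """Return 'T', 'C', 'CT', or None for one query and one attribute."""
    in_projection = attribute in query["project"]
    in_selection = attribute in query["select"]
    if in_projection and in_selection:
        return "CT"
    if in_projection:
        return "T"
    if in_selection:
        return "C"
    return None  # the attribute does not appear in the query at all


for name, query in QUERIES.items():
    classes = {a: query_class(query, a) for a in ("Employer", "Home_Location")}
    print(name, classes)
```

Queries whose class is T or CT for an examined attribute contribute misrepresentation costs, and those whose class is C or CT contribute type I or type II error costs.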
The misrepresentation cost associated with each query equals the product of the unit cost of misrepresentation error, the frequency of the query, the probability that the target object is selected by the query, and the marginal probability that the stored value is not the true value. If more than one examined attribute is in the projection list (e.g., Q7), the misrepresentation errors equal the sum of the misrepresentation errors computed separately for each attribute. For example, the misrepresentation errors associated with Q7, Q8, and Q10 are as follows:

$C_{m,Q7}(\text{Walmart}, \text{TX}) = \gamma(Q7, \text{Home\_Location})\, f(Q7)\, \tau(Q7)\, [1 - P(\text{TX})] + \gamma(Q7, \text{Employer})\, f(Q7)\, \tau(Q7)\, [1 - P(\text{Walmart})]$,

$C_{m,Q8}(\text{Walmart}, \text{TX}) = \gamma(Q8, \text{Home\_Location})\, f(Q8)\, [1 - P(\text{TX})]$, and

$C_{m,Q10}(\text{Walmart}, \text{TX}) = 0$.

In the above expressions, the marginal probabilities are P(TX) = 0.602 and P(Walmart) = 0.6005. The expression for $C_{m,Q8}$ does not contain $\tau(Q8)$ since this query always selects Robert Black, given that (Walmart, TX) is stored. $C_{m,Q10}$ equals zero because Q10 never selects Robert Black based on the stored values (Walmart, TX). From the above example, we can see that, for misrepresentation error costs, the solution is similar to that for the single attribute case, except that the probability of a single value is replaced by the marginal probability in the multiple attribute case.

Cost of Type I and Type II Errors. For type I and type II errors, we only need to consider class C or class CT queries with respect to Employer or Home_Location. Therefore, we can ignore Q7. The costs of type I and type II errors associated with Q8, Q9, and Q10 are summarized in Table 9. We observe that, if only one of the multiple attributes being examined is in the selection condition of a query (e.g., Q8), the same solution that we derived for a single stochastic attribute can be used.
For queries with more than one examined attribute in their selection conditions, the same rules still apply: if the object is selected based on the stored attribute values, the cost of type II errors equals the product of the unit type II error cost, the frequency of the query, and the probability that the selection condition is violated; if the object is not selected, then the cost of type I errors equals the product of the unit type I error cost, the frequency of the query, and the probability that the selection condition is satisfied. Although the probability that a selection condition is satisfied or violated is slightly more complex when multiple attributes are considered, as shown in Table 9, it can be derived without much difficulty.

To decide the cost-minimizing value for both Employer and Home_Location for Robert Black, the total expected cost associated with other value combinations also needs to be examined. Since the values that need to be separately examined for Employer include Walmart, Nortel, NULL, and any other value, and those for Home_Location include TX, LA, NULL, and any other value, there are a total of only 16 cases, instead of 10,000 potential cases, to be examined in order to decide the cost-minimizing value for both attributes.
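The 16-case enumeration can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: it covers only type I and type II error costs (misrepresentation costs are omitted), the unit costs `alpha` and `beta` and the frequencies `freq` are hypothetical placeholders set to 1, and `OTHER` stands for any single value that is not named in a query. The probabilities come from Table 8.

```python
from itertools import product

P_AOV = 1.02e-5  # per-combination "all other values" probability (Table 8)
EMP_COUNT = {"Walmart": 1, "Nortel": 1, "OTHER": 198}  # 200 employers in total
LOC_COUNT = {"TX": 1, "LA": 1, "OTHER": 48}            # 50 locations in total
NAMED = {("Walmart", "TX"): 0.6, ("Nortel", "LA"): 0.3}


def joint(emp, loc):
    """Probability mass of one (employer class, location class) cell."""
    if (emp, loc) in NAMED:
        return NAMED[(emp, loc)]
    return EMP_COUNT[emp] * LOC_COUNT[loc] * P_AOV


# Selection predicates of the class C / CT queries; the costs are placeholders.
QUERIES = {
    "Q8": {"pred": lambda e, l: e == "Walmart",
           "alpha": 1.0, "beta": 1.0, "freq": 1.0},
    "Q9": {"pred": lambda e, l: e == "Walmart" or l == "LA",
           "alpha": 1.0, "beta": 1.0, "freq": 1.0},
    "Q10": {"pred": lambda e, l: e == "Nortel" and l == "TX",
            "alpha": 1.0, "beta": 1.0, "freq": 1.0},
}


def classification_cost(stored_emp, stored_loc):
    """Expected type I plus type II cost if (stored_emp, stored_loc) is stored."""
    total = 0.0
    for q in QUERIES.values():
        p_satisfied = sum(joint(e, l)
                          for e, l in product(EMP_COUNT, LOC_COUNT)
                          if q["pred"](e, l))
        if q["pred"](stored_emp, stored_loc):
            # Object is selected: a type II error occurs if the condition is violated.
            total += q["beta"] * q["freq"] * (1.0 - p_satisfied)
        else:
            # Object is not selected: a type I error occurs if the condition holds.
            total += q["alpha"] * q["freq"] * p_satisfied
    return total


CANDIDATES = list(product(["Walmart", "Nortel", "NULL", "OTHER"],
                          ["TX", "LA", "NULL", "OTHER"]))
costs = {c: classification_cost(*c) for c in CANDIDATES}
best = min(costs, key=costs.get)
print(len(CANDIDATES), "candidates; lowest classification cost:", best)
```

Several candidates can tie on classification cost alone, because the stored location affects these three queries only through the predicates. In the full procedure the misrepresentation costs would be added before the candidates are compared.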

Table 9. Costs of Type I and Type II Errors for Two Stochastic Attributes (if (Walmart, TX) is stored)

No. | Retrieval criteria (Employer, H_L) | Result       | If true values are | Type I error cost | Type II error cost
Q8  | Employer = Walmart                 | Selected     | (Walmart, -)       | N/A | 0
    |                                    |              | Others             | N/A | $\beta(Q8)\, f(Q8)\, [1 - P(\text{Walmart})]$
Q9  | Employer = Walmart OR H_L = LA     | Selected     | (Walmart, -)       | N/A | 0
    |                                    |              | (-, LA)            | N/A | 0
    |                                    |              | Others             | N/A | $\beta(Q9)\, f(Q9)\, [1 - P(\text{Walmart}) - P(\text{LA}) + P(\text{Walmart}, \text{LA})]$
Q10 | Employer = Nortel AND H_L = TX     | Not selected | (Nortel, TX)       | $\alpha(Q10)\, f(Q10)\, P(\text{Nortel}, \text{TX})$ | N/A
    |                                    |              | Others             | 0 | N/A

Discussions and Future Research

We have shown how the cost-minimizing values can be determined for discrete attributes based on source reliability information and query information. Based on the proposed procedure, when conflicting data values for a real-world entity are encountered in the data integration process, the cost-minimizing value can be determined and stored in the merged table. Subsequently, queries can be directly executed on the merged table. We call this approach deterministic integration. Compared with probabilistic integration, i.e., storing all probabilistic values based on the probabilistic database model, the main disadvantage of the deterministic approach is the loss of potentially useful distribution information. The advantages of deterministic integration include the following: First, data storage and subsequent query processing can be efficiently handled by the existing commercial database systems. With probabilistic integration, since the probabilistic algebra is not currently supported by standard database packages, the cost of implementing such a probabilistic model could be prohibitively high. Second, the storage cost is lower with deterministic integration. This is because with the probabilistic relational model (e.g., Dey and Sarkar 1996), a new column has to be added even if only one object in the table has uncertain values for only one attribute. In addition, a row has to be added to the table for every probabilistic value associated with each object. Third, with deterministic integration, the operational performance of the resulting database is better.
This is because the computational overhead associated with a probabilistic database is avoided with a deterministic table.

The procedures we propose here are for discrete attributes only. An extension to this study is to examine stochastic attributes with continuous domains and a mixture of discrete attributes and continuous attributes. Computationally, in the multiple attributes scenario, if the number of realizations of each attribute or the number of queries being considered is large, the computational overhead could increase significantly. We are trying to identify rules or patterns that can help reduce the computational effort.

References

Batini, C., Lenzerini, M., and Navathe, S. B. "A Comparative Analysis of Methodologies for Database Schema Integration," ACM Computing Surveys (18:4), December 1986, pp. 323-364.
Dekhtyar, A., Ross, R., and Subrahmanian, V. S. "Probabilistic Temporal Databases, I: Algebra," ACM Transactions on Database Systems (26:1), March 2001, pp. 41-95.
Dey, D., Barron, T. M., and Saharia, A. N. "A Decision Model for Choosing the Optimal Level of Storage in Temporal Databases," IEEE Transactions on Knowledge and Data Engineering (10:2), February 1998a, pp. 297-309.
Dey, D., and Sarkar, S. "A Probabilistic Relational Model and Algebra," ACM Transactions on Database Systems (TODS) (21:3), September 1996, pp. 339-369.
Dey, D., Sarkar, S., and De, P. "A Probabilistic Decision Model for Entity Matching in Heterogeneous Databases," Management Science (44:10), October 1998b, pp. 1379-1395.
Hernandez, M. A., and Stolfo, S. J. "Real-World Data is Dirty: Data Cleaning and the Merge/Purge Problem," Data Mining and Knowledge Discovery (2:1), January 1998, pp. 9-37.

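Mendelson, H., and Saharia, A. N. "Incomplete Information Costs and Database Design," ACM Transactions on Database Systems (TODS) (11:2), June 1986, pp. 159-185.
Morey, R. C. "Estimating and Improving the Quality of Information in a MIS," Communications of the ACM (25:5), May 1982, pp. 337-342.
Rahm, E., and Do, H. H. "Data Cleaning: Problems and Current Approaches," IEEE Bulletin of the Technical Committee on Data Engineering (23:4), December 2000, pp. 3-13.
Ram, S., and Park, J. "Semantic Conflict Resolution Ontology (SCROL): An Ontology for Detecting and Resolving Data and Schema Level Conflicts," IEEE Transactions on Knowledge and Data Engineering (16:2), February 2004, pp. 189-202.

Appendix. Probability Derivations for One Attribute, Two Data Sources

In what follows, $A_m$ denotes the event that the true value is $a_m$, $A^1_i$ and $A^2_j$ denote the events that the first and second data sources record $a_i$ and $a_j$, respectively, and $\lvert A \rvert$ is the number of possible values. We first show some simplifications for the general expression that apply to all of the cases.

I.
$$
\begin{aligned}
P(A^1_i, A^2_j) &= \sum_m P(A^1_i, A^2_j, A_m) = \sum_m P(A^1_i, A^2_j \mid A_m)\, P(A_m) \\
&= \sum_m P(A^1_i \mid A_m)\, P(A^2_j \mid A_m)\, P(A_m) \\
&= \sum_m \frac{P(A_m \mid A^1_i)\, P(A^1_i)}{P(A_m)} \cdot \frac{P(A_m \mid A^2_j)\, P(A^2_j)}{P(A_m)} \cdot P(A_m) \\
&= \sum_m \frac{P(A_m \mid A^1_i)\, P(A_m \mid A^2_j)\, P(A^1_i)\, P(A^2_j)}{P(A_m)} \\
&= P(A^1_i)\, P(A^2_j) \sum_m \frac{P(A_m \mid A^1_i)\, P(A_m \mid A^2_j)}{P(A_m)}.
\end{aligned}
$$

II.
$$
P(A_k \mid A^1_i, A^2_j) = \frac{P(A^1_i, A^2_j \mid A_k)\, P(A_k)}{P(A^1_i, A^2_j)}.
$$

Analogous to I, we can show:
$$
P(A^1_i, A^2_j \mid A_k)\, P(A_k) = \frac{P(A_k \mid A^1_i)\, P(A_k \mid A^2_j)\, P(A^1_i)\, P(A^2_j)}{P(A_k)}.
$$

Therefore,
$$
P(A_k \mid A^1_i, A^2_j) = \frac{P(A_k \mid A^1_i)\, P(A_k \mid A^2_j)\, P(A^1_i)\, P(A^2_j)}{P(A_k)\, P(A^1_i, A^2_j)}.
$$

Substituting for $P(A^1_i, A^2_j)$ from I, we get:
$$
P(A_k \mid A^1_i, A^2_j) = \frac{P(A_k \mid A^1_i)\, P(A_k \mid A^2_j) / P(A_k)}{\sum_m P(A_m \mid A^1_i)\, P(A_m \mid A^2_j) / P(A_m)}.
$$

From our second assumption, $P(A_m) = P(A_k)$, and $P(A_m) = 1/\lvert A \rvert$ for all $m$. Therefore,
$$
P(A_k \mid A^1_i, A^2_j) = \frac{P(A_k \mid A^1_i)\, P(A_k \mid A^2_j)}{\sum_m P(A_m \mid A^1_i)\, P(A_m \mid A^2_j)}.
$$

We now show how the desired probability estimates may be obtained for each case.

Case 1a: $k = i = j$; $P(A_i \mid A^1_i, A^2_i)$

$$
P(A_i \mid A^1_i, A^2_i) = \frac{P(A_i \mid A^1_i)\, P(A_i \mid A^2_i)}{\sum_m P(A_m \mid A^1_i)\, P(A_m \mid A^2_i)}.
$$

For $m \neq i$, from assumption three, we have $P(A_m \mid A^1_i) = P(A \neq a_i \mid A^1_i) / [\lvert A \rvert - 1]$ for all $m \neq i$. Similarly, $P(A_m \mid A^2_i) = P(A \neq a_i \mid A^2_i) / [\lvert A \rvert - 1]$ for all $m \neq i$. Therefore,
$$
\begin{aligned}
\sum_m P(A_m \mid A^1_i)\, P(A_m \mid A^2_i) &= P(A_i \mid A^1_i)\, P(A_i \mid A^2_i) + \sum_{m \neq i} P(A_m \mid A^1_i)\, P(A_m \mid A^2_i) \\
&= P(A_i \mid A^1_i)\, P(A_i \mid A^2_i) + \sum_{m \neq i} \frac{P(A \neq a_i \mid A^1_i)\, P(A \neq a_i \mid A^2_i)}{[\lvert A \rvert - 1]^2} \\
&= P(A_i \mid A^1_i)\, P(A_i \mid A^2_i) + \frac{P(A \neq a_i \mid A^1_i)\, P(A \neq a_i \mid A^2_i)}{\lvert A \rvert - 1}.
\end{aligned}
$$

Hence,
$$
P(A_i \mid A^1_i, A^2_i) = \frac{P(A_i \mid A^1_i)\, P(A_i \mid A^2_i)}{P(A_i \mid A^1_i)\, P(A_i \mid A^2_i) + P(A \neq a_i \mid A^1_i)\, P(A \neq a_i \mid A^2_i) / [\lvert A \rvert - 1]}.
$$

Case 1b: $k \neq i = j$; $P(A_k \mid A^1_i, A^2_i)$

$$
P(A_k \mid A^1_i, A^2_i) = \frac{P(A_k \mid A^1_i)\, P(A_k \mid A^2_i)}{\sum_m P(A_m \mid A^1_i)\, P(A_m \mid A^2_i)}.
$$

Since $k \neq i$, from assumption three, we have $P(A_k \mid A^1_i) = P(A \neq a_i \mid A^1_i) / [\lvert A \rvert - 1]$, and $P(A_k \mid A^2_i) = P(A \neq a_i \mid A^2_i) / [\lvert A \rvert - 1]$. Therefore,
$$
P(A_k \mid A^1_i)\, P(A_k \mid A^2_i) = \frac{P(A \neq a_i \mid A^1_i)\, P(A \neq a_i \mid A^2_i)}{[\lvert A \rvert - 1]^2}.
$$

Hence,
$$
P(A_k \mid A^1_i, A^2_i) = \frac{P(A \neq a_i \mid A^1_i)\, P(A \neq a_i \mid A^2_i) / [\lvert A \rvert - 1]^2}{P(A_i \mid A^1_i)\, P(A_i \mid A^2_i) + P(A \neq a_i \mid A^1_i)\, P(A \neq a_i \mid A^2_i) / [\lvert A \rvert - 1]}.
$$

Case 2a: $k = i \neq j$; $P(A_i \mid A^1_i, A^2_j)$

$$
P(A_i \mid A^1_i, A^2_j) = \frac{P(A_i \mid A^1_i)\, P(A_i \mid A^2_j)}{\sum_m P(A_m \mid A^1_i)\, P(A_m \mid A^2_j)}.
$$

Here, we have, from assumption three, $P(A_i \mid A^2_j) = P(A \neq a_j \mid A^2_j) / [\lvert A \rvert - 1]$, and $P(A_j \mid A^1_i) = P(A \neq a_i \mid A^1_i) / [\lvert A \rvert - 1]$. Now,
$$
\begin{aligned}
\sum_m P(A_m \mid A^1_i)\, P(A_m \mid A^2_j) &= P(A_i \mid A^1_i)\, P(A_i \mid A^2_j) + P(A_j \mid A^1_i)\, P(A_j \mid A^2_j) + \sum_{m \neq i, j} P(A_m \mid A^1_i)\, P(A_m \mid A^2_j) \\
&= \frac{P(A_i \mid A^1_i)\, P(A \neq a_j \mid A^2_j)}{\lvert A \rvert - 1} + \frac{P(A \neq a_i \mid A^1_i)\, P(A_j \mid A^2_j)}{\lvert A \rvert - 1} + \sum_{m \neq i, j} \frac{P(A \neq a_i \mid A^1_i)\, P(A \neq a_j \mid A^2_j)}{[\lvert A \rvert - 1]^2} \\
&= \frac{P(A_i \mid A^1_i)\, P(A \neq a_j \mid A^2_j)}{\lvert A \rvert - 1} + \frac{P(A \neq a_i \mid A^1_i)\, P(A_j \mid A^2_j)}{\lvert A \rvert - 1} + \frac{P(A \neq a_i \mid A^1_i)\, P(A \neq a_j \mid A^2_j)\, [\lvert A \rvert - 2]}{[\lvert A \rvert - 1]^2}.
\end{aligned}
$$

Hence,
$$
P(A_i \mid A^1_i, A^2_j) = \frac{P(A_i \mid A^1_i)\, P(A \neq a_j \mid A^2_j)}{P(A_i \mid A^1_i)\, P(A \neq a_j \mid A^2_j) + P(A \neq a_i \mid A^1_i)\, P(A_j \mid A^2_j) + P(A \neq a_i \mid A^1_i)\, P(A \neq a_j \mid A^2_j)\, [\lvert A \rvert - 2] / [\lvert A \rvert - 1]}.
$$

Case 2b: $k \neq i \neq j$; $P(A_k \mid A^1_i, A^2_j)$

$$
P(A_k \mid A^1_i, A^2_j) = \frac{P(A_k \mid A^1_i)\, P(A_k \mid A^2_j)}{\sum_m P(A_m \mid A^1_i)\, P(A_m \mid A^2_j)}.
$$

The numerator is
$$
P(A_k \mid A^1_i)\, P(A_k \mid A^2_j) = \frac{P(A \neq a_i \mid A^1_i)\, P(A \neq a_j \mid A^2_j)}{[\lvert A \rvert - 1]^2}.
$$

The denominator is as shown in case 2a. Hence,
$$
P(A_k \mid A^1_i, A^2_j) = \frac{P(A \neq a_i \mid A^1_i)\, P(A \neq a_j \mid A^2_j) / [\lvert A \rvert - 1]}{P(A_i \mid A^1_i)\, P(A \neq a_j \mid A^2_j) + P(A \neq a_i \mid A^1_i)\, P(A_j \mid A^2_j) + P(A \neq a_i \mid A^1_i)\, P(A \neq a_j \mid A^2_j)\, [\lvert A \rvert - 2] / [\lvert A \rvert - 1]}.
$$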
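The four cases can be checked numerically against the general expression from part II. The following is a minimal Python sketch, assuming a uniform prior over the possible values (the second assumption) and an even spread of each source's error probability over the remaining values (the third assumption). The function names, the reliabilities `r1` and `r2` (the probability that each source's recorded value is the true value), and the domain size `n` are hypothetical names chosen for this illustration.

```python
def posterior(k, i, j, r1, r2, n):
    """P(true value is a_k | source 1 records a_i, source 2 records a_j)."""

    def given_source(m, recorded, r):
        # The recorded value is true with probability r; the remaining mass
        # 1 - r is spread evenly over the other n - 1 values.
        return r if m == recorded else (1.0 - r) / (n - 1)

    numerator = given_source(k, i, r1) * given_source(k, j, r2)
    denominator = sum(given_source(m, i, r1) * given_source(m, j, r2)
                      for m in range(n))
    return numerator / denominator


def case_1a(r1, r2, n):
    """Closed form when both sources agree and k is the agreed value."""
    wrong = (1.0 - r1) * (1.0 - r2)
    return r1 * r2 / (r1 * r2 + wrong / (n - 1))


def case_2a(r1, r2, n):
    """Closed form when the sources disagree and k is source 1's value."""
    wrong = (1.0 - r1) * (1.0 - r2)
    numerator = r1 * (1.0 - r2)
    denominator = numerator + (1.0 - r1) * r2 + wrong * (n - 2) / (n - 1)
    return numerator / denominator


n, r1, r2 = 50, 0.8, 0.7
print(posterior(0, 0, 0, r1, r2, n), case_1a(r1, r2, n))
print(posterior(0, 0, 1, r1, r2, n), case_2a(r1, r2, n))
```

Cases 1b and 2b share the denominators of cases 1a and 2a, so the same check applies to them. Summing `posterior` over all `k` for fixed `i` and `j` gives 1, which is a quick sanity check on the derivation.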
