Truth Discovery in Data Streams: A Single-Pass Probabilistic Approach

Zhou Zhao, James Cheng and Wilfred Ng
Department of Computer Science and Engineering, Hong Kong University of Science and Technology
Department of Computer Science and Engineering, The Chinese University of Hong Kong
{zhaozhou, wilfred}@cse.ust.hk, jcheng@cse.cuhk.edu.hk

ABSTRACT
Truth discovery is a long-standing problem for assessing the validity of information from various data sources that may provide different and conflicting information. With the increasing prominence of data streams arising in a wide range of applications such as weather forecast and stock price prediction, effective techniques for truth discovery in data streams are demanded. However, existing work mainly focuses on truth discovery in the context of static databases, which is not applicable in applications involving streaming data. This motivates us to develop new techniques to tackle the problem of truth discovery in data streams. In this paper, we propose a probabilistic model that transforms the problem of truth discovery over data streams into a probabilistic inference problem. We first design a streaming algorithm that infers the truth as well as source quality in real time. Then, we develop a one-pass algorithm, in which the inference of source quality is proved to be convergent and the accuracy is further improved. We conducted extensive experiments on real datasets which verify both the efficiency and accuracy of our methods for truth discovery in data streams.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications

General Terms
Algorithms, Design, Experiments

Keywords
Data Stream, Truth Discovery

1. INTRODUCTION
Truth discovery is an extensively studied topic in databases and its importance has been widely recognized by the research community [9, 6, 7, 4, 29, 25, ]. In this paper, we study the problem of Truth Discovery in data streams.
Many data stream management applications require integrating data from multiple sources in real time. For each entity, each source provides a value for it. However, the values of the entity from different sources may conflict, with some being true and others false. To provide the true value for the entity, it is vital that data stream management systems are capable of resolving such conflicts and discovering the true values. Consider a set of conflicting weather forecast values for New York at one timestamp, as shown in Figure 1(a). Truth discovery for Figure 1(a) means resolving the conflicts and finding the true weather forecast values for New York, as in Figure 1(b). For example, the value Cloudy provided by the source ACCU is false information.

(a) Conflicting values for an entity
Data Source    Value      Factor
ACCU           Cloudy     ?
Underground    Showery    ?
WFC            Showery    ?

(b) True values for an entity
Data Source    Value      Factor
ACCU           Cloudy     False
Underground    Showery    True
WFC            Showery    True

Figure 1: Truth Discovery in Weather Forecast for New York

CIKM'14, November 3-7, 2014, Shanghai, China. Copyright 2014 ACM 978-1-4503-2598-1/14/11...$15.00. http://dx.doi.org/10.1145/2661829.2661892

Previous works [3, 8, 6, 7, 0, 7, 8, 26, 27, 29, 3, 5] on truth discovery mainly focus on static databases. They defined the problem of truth discovery in the context of static databases. The definition is based on the quality of data sources and the conflicting values for the entity. A data source that often provides true values is given a high score for its accuracy. A value for an entity that is provided by accurate sources is considered more likely to be true.
Iterative methods were proposed to alternately discover the true values and estimate the accuracy of the sources.

In recent years, advances in mobile technologies have led to the proliferation of many online data-intensive applications, in which data in streaming format are collected continuously in large volume and at high speed. Effective and efficient truth discovery methods for such high-speed data streams are essential to a wide range of applications, such as weather forecast, stock price prediction, flight schedule checking, etc. Moreover, there is also a rising need to share huge amounts of data in various commercial and scientific applications. Whether or not the data is naturally in streaming format, its sheer volume makes it impractical to make multiple passes over the data for truth discovery, and it is also unrealistic to assume that the data can be loaded into main memory for truth discovery. More compelling reasons for studying data stream integration can be found in [23]. However, as discussed in [4, 3], existing methods focus on static databases, where the problem of truth discovery is solved using iterative methods. Thus, it is difficult for these methods to discover truth in data streams, since their techniques are based on iterative updates of the score of source quality and values for the data, which require the entirety of the data for the processing (i.e., the data needs to reside in main memory, or otherwise the cost of random disk access incurred will be too high).

To develop efficient and effective techniques for truth discovery in data streams, we should address the following three major challenging computational issues:

- One-Pass Nature: The streaming data from sources arrive in large quantities and at high speed. Thus, it is impractical to perform truth discovery on high-speed data streams with offline multi-pass algorithms. Instead, an efficient algorithm should read the data only once.

- Limited Memory Usage: Data streams are too voluminous to be kept in main memory, even though memory has become cheaper today. Moreover, many emerging applications require truth discovery in data streams in memory-limited environments such as mobile devices, which can only hold a limited amount of data.

- Short Response Time: Data streams such as weather forecast and stock price prediction arrive continuously, while the applications often require real-time response. Thus, truth discovery should be performed with limited processing time, i.e., the algorithm should be able to process high-speed data streams.

None of the existing methods for truth discovery have effectively addressed the above computational issues. In fact, the best existing approach that addresses these issues is probably majority vote, which simply takes the value returned by the majority of the sources. However, this method is known to be error-prone [3, 4], since it values the quality of all sources equally. In general, an effective truth discovery method should take into consideration the difference in the quality of various sources.
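The weakness of majority vote noted above can be seen in a small, hypothetical case. The sketch below is not the model proposed in this paper: the source names, claimed values and per-source weights are invented for illustration, and `weighted_vote` is only a simple stand-in for "taking source quality into account".

```python
from collections import Counter, defaultdict

def majority_vote(claims):
    """Return the value claimed by the largest number of sources.

    claims: dict mapping source name -> claimed value.
    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter(claims.values())
    return max(sorted(counts), key=lambda v: counts[v])

def weighted_vote(claims, weight):
    """Return the value with the largest total source weight.

    weight: dict mapping source name -> non-negative quality score
    (hypothetical numbers here; unknown sources default to 1.0).
    """
    scores = defaultdict(float)
    for source, value in claims.items():
        scores[value] += weight.get(source, 1.0)
    return max(sorted(scores), key=lambda v: scores[v])

# Two low-quality sources agree with each other; one reliable source disagrees.
claims = {"A": "Sunny", "B": "Sunny", "C": "Rainy"}
weight = {"A": 0.2, "B": 0.2, "C": 0.9}  # illustrative quality scores

by_majority = majority_vote(claims)          # counts every source equally
by_quality = weighted_vote(claims, weight)   # lets the reliable source dominate
```

With equal weights the two functions agree; they differ only when the quality scores are unequal, which is the situation the quality-aware model developed in this paper is designed for.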
In this work, we formulate the problem of truth discovery in data streams, and address all three computational issues stated above by proposing a generative model for truth discovery. We assume that there exists a true value for the entity among the conflicting values provided by different sources. Note that our focus is not on fuzzy value integration; hence, if a true value does not exist, there is no truth to find. The proposed generative model for the collected conflicting values from various sources is based on two fundamental factors: data uncertainty and source quality. Within our proposed model, we transform the truth discovery problem into a probabilistic inference problem. We derive the optimal solution for the inference problem and propose an iterative algorithm that converges to it. Then, we improve the iterative algorithm and design a streaming algorithm that infers the truth as well as source quality in real time. Then, we develop a one-pass algorithm, in which the inference of source quality is proved to be convergent and the accuracy is further improved. Specifically, we compute the posterior distribution of all possible values, and find the most probable one, i.e., the one with the maximum probability. Intuitively, this model best explains all the possible values reported by sources and the conflicting values in the data streams.

[Figure 2: A Conceptual View of Truth Discovery in Data Streams. Conflicting values $V = \{v_1, \ldots, v_k\}$ from data sources $s_1, s_2, \ldots, s_K$ pass through semantic mapping and conflict resolution, driven by source quality and data uncertainty, to yield the true value $v$.]

Figure 2 illustrates the important concepts and main ideas in the architecture of our truth discovery in data streams. As the sources can be heterogeneous, we first employ a semantic mapping for the values provided by various sources, such that the values for truth discovery are represented in a consistent manner. For example, we consider the meaning of the weather conditions rainy and wet to be the same in weather forecast truth discovery. We also group Partly Sunny and Mostly Cloudy, and consider them to be the same as Clear. At each time, the system collects a set of conflicting values for an entity as $V = \{v_1, v_2, \ldots, v_k\}$ from multiple data sources. Next, the system resolves the conflicts and discovers the true value $v$ in $V$ based on the current data uncertainty and source quality.
Then, the system updates the data uncertainty and source quality based on the inferred value $v$ and the conflicting values $V$.

We summarize the main contributions of our work as follows.

- Most of the existing work focuses on the problem of truth discovery in static databases. In this paper, we formulate the problem of truth discovery in data streams, and propose a new probabilistic model that resolves conflicting values arising from multiple data streams.

- We propose a novel source quality model for capturing various errors embedded in the information sources. Compared with existing source quality models that are based on a single accuracy value, our matrix model is more general and represents the quality of the sources better, since it captures the true positive rate, true negative rate, false positive rate and false negative rate.

- We adopt a new approach that converts the truth discovery problem into a probabilistic inference problem. We then transform the complex inference problem into a high-dimensional optimization problem.

- We develop a one-pass algorithm that solves the problem of truth discovery in data streams under limited main memory and short response time. We prove the convergence of source quality inference for the one-pass algorithm.

- For truth discovery from sources with time-evolving quality, we devise a streaming algorithm to adaptively infer the source quality over time.

- We evaluate the performance of our algorithms with data from real applications of truth discovery in data streams. Our results verify both the accuracy and efficiency of our one-pass algorithm and streaming algorithm for truth discovery in data streams.

Organization. The rest of the paper is organized as follows. Section 2 surveys the related work. Section 3 presents our probabilistic model and formulates the problem of truth discovery in data streams. Section 4 introduces the optimization algorithms for the proposed problem. We report the experimental results in Section 5 and conclude the paper in Section 6.

(a) Conflicting Values
Value      Source
Cloudy     ACCU
Showery    Underground
Showery    WFC

(b) Vote by Sources
Source\Value    Cloudy    Showery
ACCU            1         0
Underground     0         1
WFC             0         1

(c) Truth Value
Value    Cloudy    Showery
Truth    0         1

Figure 3: An Illustration of Probabilistic Inference in Truth Discovery

2. RELATED WORK
Truth discovery for conflicting values is a fundamental problem in databases. The main challenge of such a problem is to resolve data inconsistency []. The problem of truth discovery for static databases was first formalized by Yin et al. [29], and an iterative algorithm was proposed to jointly infer the truth values and source quality. Pasternack and Roth developed several web-link based algorithms and proposed a linear-programming based algorithm [7]. They also introduced a generalized framework that incorporates background knowledge into the truth finding process [8]. Galland et al. introduced several fix-point algorithms to predict the truth values of the facts [0]. Wang et al. proposed an EM algorithm for discovering the truth in sensor networks [24]. Yin and Tan explored semi-supervised truth discovery by utilizing the similarity between data records [30]. Dong et al. studied the source selection problem for truth discovery [9]. Kasneci et al. developed a probabilistic model for truth discovery from several knowledge bases [2]. Pal et al. tackled the problem of evolving data integration [6]. Li et al. conducted an experimental study on existing algorithms [4]. A comprehensive survey of truth discovery techniques can be found in [8].

There are works that focus on other interesting aspects of data integration which are related to the problem of truth discovery. The Q system [22, 2, 28] develops an information-need driven paradigm for data integration. Copying relationship detection in data integration was studied in [3, 7, 6, 9]. Liu et al. proposed an early-return data integration method that returns an answer once enough confidence is gained that data from the unprocessed sources are unlikely to change it [5]. Mehmet et al. addressed privacy-aware data integration [3].
Several works on data fusion in wireless sensor networks and RFID systems are also related to our problem; they discover the true location or reading from a set of observed readings from the sensors [20, 32, 33]. However, these techniques first train the sensor model on training datasets and then infer the true reading based on the trained models, which makes them unsuitable for truth discovery in data streams.

None of the above-mentioned algorithms is applicable to data streams, because the existing methods mainly require the entire dataset for processing. More recently, Zhao et al. studied the truth discovery problem using Gibbs sampling and showed that their algorithm outperforms prior methods [3]. An incremental algorithm was also proposed, which is based on the model trained on the batch databases, to discover the truth of the new data. However, their incremental algorithm assumes a training phase on a given batch dataset, which is not efficient enough, or even not possible, for processing data stream integration in many applications. Contrary to all the above-mentioned work, we develop efficient algorithms for truth finding in data streams, which satisfy the constraints of one-pass nature, short response time, and limited memory usage.

3. A PROBABILISTIC MODEL FOR TRUTH DISCOVERY IN DATA STREAMS
In this section, we present a probabilistic approach that transforms the problem of truth discovery over data streams into a probabilistic inference problem. We derive the optimal solution for the inference problem and devise an iterative method. Specifically, we calculate the posterior distribution of all possible values, and find the most probable one, i.e., the one with the maximum probability. Intuitively, the proposed model explains all the possible votes by the sources on the conflicting values. We propose a truth discovery method based on the generative process of the votes in the data streams.

We start by illustrating the general process of truth discovery in data streams. After that, we introduce some basic notions and notations in Section 3.1, and define the problem in Section 3.2. Then, we present a generative process for the vote on conflicting values in Section 3.3 and define the probabilistic model in Section 3.4.
Now, we illustrate the general process of truth discovery for the vote on conflicting values by sources using the following example.

EXAMPLE 1. Consider a set of conflicting weather forecast values for New York City at time $t$ as shown in Figure 3(a). We aim to report the correct weather conditions in real time. We first extract the weather forecast values Cloudy and Showery provided by the sources. We then record the votes for the extracted values in Figure 3(b). For example, the source ACCU votes only for Cloudy, the source Underground votes only for Showery, and the source WFC also votes for Showery. The process of truth discovery is to validate the correctness of each value provided by the sources. An example of the true weather condition is given in Figure 3(c).

3.1 Notions and Notations

3.1.1 Conflicting Values
The conflicting values for an entity at a time are a set of values provided by sources, which are exclusive. We denote the conflicting values for entity $i$ at time $t$ by $V_i^t = \{v^t_{1,i}, v^t_{2,i}, \ldots, v^t_{K,i}\}$, where $v^t_{k,i}$ is the value by source $k$ for entity $i$ at time $t$. For example, the weather forecast values for New York City at time $t$ are given in Figure 3(a), i.e., $V^t_{\text{New York}} = \{v^t_{\text{ACCU},\text{New York}}, v^t_{\text{Underground},\text{New York}}, v^t_{\text{WFC},\text{New York}}\} = \{\text{Cloudy}, \text{Showery}, \text{Showery}\}$. The value $v$ can be either literal or numeric. We denote the conflicting values for all entities at time $t$ by $V^t$ and the sequence of conflicting values at different timestamps by $V = \{V^1, V^2, \ldots, V^T\}$, where $T$ can be infinite.

3.1.2 Vote
We now consider the vote by different data sources for an entity $i$ at time $t$ as $O_i^t = \{\{o^t_{i,1,v_1}, \ldots, o^t_{i,K,v_1}\}, \ldots, \{o^t_{i,1,v_n}, \ldots, o^t_{i,K,v_n}\}\}$, where $\{o^t_{i,1,v_1}, \ldots, o^t_{i,K,v_1}\}$ is the vote of the sources on value $v_1$ for entity $i$ and $n$ is the number of possible values for entity $i$ at time $t$. For example, the vote by the three sources ACCU, Underground and WFC for the weather forecast values of New York at time $t$ in Figure 3(b) is represented by the vote set $O^t_{\text{New York}} = \{\{o^t_{\text{New York},\text{ACCU},\text{Cloudy}}, o^t_{\text{New York},\text{Underground},\text{Cloudy}}, o^t_{\text{New York},\text{WFC},\text{Cloudy}}\}, \{o^t_{\text{New York},\text{ACCU},\text{Showery}}, o^t_{\text{New York},\text{Underground},\text{Showery}}, o^t_{\text{New York},\text{WFC},\text{Showery}}\}\} = \{\{1,0,0\}, \{0,1,1\}\}$. The vote value $o$ can be either 0 or 1. We denote the vote for all entities at time $t$ by $O^t$ and the votes at different timestamps by $O = \{O^1, O^2, \ldots, O^T\}$, where $T$ can be infinite.

3.1.3 Truth Value
We denote the truth value for an entity $i$ at time $t$ as $Z_i^t = \{z^t_{i,v_1}, z^t_{i,v_2}, \ldots, z^t_{i,v_n}\}$, where $z^t_{i,v_j}$ is a validator for value $v_j \in V_i^t$ ($z^t_{i,v_j} \in \{0, 1\}$). The validator $z^t_{i,v_j} = 1$ if the value $v_j$ is the truth among the values $V_i^t$ for entity $i$ at time $t$; otherwise $z^t_{i,v_j} = 0$. An example of the truth value for the entity New York is given in Figure 3(c), i.e., $Z^t_{\text{New York}} = \{z^t_{\text{New York},\text{Cloudy}}, z^t_{\text{New York},\text{Showery}}\} = \{0, 1\}$.

3.1.4 Source Quality
Existing work [8, 7] usually models the quality of data sources using a single accuracy value. However, a single accuracy value may not explain the possible mistakes made by the source, such as false positive and false negative mistakes. Thus, we propose a more general quality model using a confusion matrix for each source $s$, denoted by $\pi^s$. The proposed quality model aims to explain the vote by the source. The quality model of source $s$ is given by

$$\pi^s = \begin{pmatrix} \pi^s_{00} & \pi^s_{01} \\ \pi^s_{10} & \pi^s_{11} \end{pmatrix}. \tag{1}$$

Consider a conflicting value set $V_i^t$ for an entity with its true value $Z_i^t$. We explain the vote $o^t_{i,s,v}$ of source $s$ on the value $v \in V_i^t$. Based on the confusion matrix of the quality model $\pi^s$, there are four cases of the vote, given by

$$\pi^s_{mn} = p_s(o^t_{i,s,v} = m \mid z^t_{i,v} = n), \quad m, n \in \{0, 1\}, \tag{2}$$

where $p_s(o^t_{i,s,v} = m \mid z^t_{i,v} = n)$ is the probability that source $s$ gives a vote $m$ given that the truth of the value is $n$. We define $\pi^s_{11}$ as the true positive rate, $\pi^s_{01}$ as the false negative rate, $\pi^s_{10}$ as the false positive rate, and $\pi^s_{00}$ as the true negative rate, where $\pi^s_{11}, \pi^s_{01}, \pi^s_{10}, \pi^s_{00} \in [0, 1]$, $\pi^s_{11} + \pi^s_{01} = 1$ and $\pi^s_{10} + \pi^s_{00} = 1$.

We now explain the vote by referring to the sources in the example of Figure 3. Suppose the true weather forecast value is Showery, as given in Figure 3(c). We then explain the votes of the sources ACCU and Underground.
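The notation of Sections 3.1.1 to 3.1.4 can be made concrete on the Figure 3 example. The following sketch is illustrative only: the function names `build_votes` and `confusion_counts` are hypothetical, and the counts it produces are raw tallies for a single timestamp, not the probabilistic estimates developed later in the paper.

```python
def build_votes(claims):
    """Turn per-source claims into binary votes o[value][source] in {0, 1}.

    claims: dict mapping source -> the single value it claims for the entity.
    """
    values = sorted(set(claims.values()))
    return {v: {s: int(claims[s] == v) for s in claims} for v in values}

def confusion_counts(votes, truth):
    """Tally, per source, how often (vote m, truth n) occurred.

    truth: dict mapping value -> validator z in {0, 1}.
    Keys follow Equation 2: (m, n) = (vote, truth), so (1, 1) is a true
    positive, (0, 1) a false negative, (1, 0) a false positive and
    (0, 0) a true negative.
    """
    counts = {}
    for value, by_source in votes.items():
        n = truth[value]
        for source, m in by_source.items():
            table = counts.setdefault(
                source, {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0})
            table[(m, n)] += 1
    return counts

# Figure 3(a): the values claimed by each source for New York.
claims = {"ACCU": "Cloudy", "Underground": "Showery", "WFC": "Showery"}
votes = build_votes(claims)                 # Figure 3(b)
truth = {"Cloudy": 0, "Showery": 1}         # Figure 3(c)
counts = confusion_counts(votes, truth)
```

Here ACCU contributes one false positive (its vote for Cloudy) and one false negative (its missing vote for Showery), while Underground contributes one true negative and one true positive.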
The probability of source ACCU voting for Cloudy is based on its false positive rate $\pi^{\text{ACCU}}_{10}$, and the probability of it not voting for Showery is based on its false negative rate $\pi^{\text{ACCU}}_{01}$. We simulate the vote by the sources based on the confusion matrix of the quality model. We can see that the source Underground casts a vote on the value Showery while the source ACCU does not. We conclude that the true positive rate $\pi^{\text{Underground}}_{11}$ is higher. On the other hand, the source ACCU makes a mistake on the true value, since its false negative rate $\pi^{\text{ACCU}}_{01}$ is high. Similarly, the source Underground does not cast a vote on the false value Cloudy, which illustrates its high true negative rate $\pi^{\text{Underground}}_{00}$. In our proposed source quality model, the accuracy of a source depends on both its true positive rate and its true negative rate. We model the quality of a set of sources $S$ by a collection of confusion matrices $\Pi = \{\pi^1, \pi^2, \ldots, \pi^{|S|}\}$.

3.2 Problem Definition
We now formulate the problem of truth discovery in data streams as follows. Given sets of conflicting values $V^1, V^2, \ldots, V^t$ provided by a set of sources $S$, we aim to validate each value of the entities in the set $V^t$ such that the following three computational issues are effectively addressed: (1) one-pass nature: the streaming collections of values can be read only once; (2) limited memory usage: only the current collection of values can be kept in main memory; and (3) short response time: the total running time should be linear in the size of the streaming collections of values, and the validation of the values in each set $V^t$ should be performed online.

3.3 A Generative Process
Given a set of sources, we illustrate the generative process for the observed vote $O$. We first denote a set of parameters $\vec{\varphi} = \{\vec{\alpha}, \vec{\beta}\}$, where $\vec{\alpha}$ contains the hyper-parameters for the confusion matrices and $\vec{\beta}$ contains the hyper-parameters for value uncertainty.

3.3.1 Generating Truth Value Z
For each entity $i$ at time $t$, its truth $Z_i^t$ consists of a set of validators $z^t_{i,v_j}$ for values $v_j \in V_i^t$. Since the value of each validator $z^t_{i,v_j}$ is binary (i.e., $z^t_{i,v_j} \in \{0, 1\}$), we assume that it is generated from the Bernoulli distribution [2].
The Bernoull dsrbuon s he mos wdely used dsrbuon for bnary random varables, whch generaes value wh success probably θ and value 0 wh falure probably θ. Thus, he probablsc generaon for valdaors z,v j s gven by z,v j Bernoull(θ v j ) (θ,v j ) z,v j ( θ,v j ) z,v j (3) where θ,v j s he pror probably of value v j o be he ruh n he value se V. The probably θ,v models he he uncerany of value v o be he ruh or no. We assume ha θ,v s generaed by a Bea dsrbuon [2]. The Bea dsrbuon generaes a connuous value θ whn an nerval [0, ] wh wo parameers β and β 0. We choose he Bea dsrbuon o generae θ,v because he Bea dsrbuon s he conjugae pror [2] of he Bernoull dsrbuon. The probablsc generaon of value uncerany θ,v wh hyperparameer β =(β,β 0) s gven by θ,v Bea( β ) Γ(β + β0) Γ(β )Γ(β 0) (θ,v) β ( θ,v) β 0 (4) where Γ s a gamma funcon [2], β s he pror ruh coun, and β 0 s he pror false coun for he values o be he ruh n he daa sreams. 3.3.2 Generang Confuson Marx Π We now show he generave process for each confuson marx π s. As saed above, he enres of he confuson marx have he propery ha π s + π s 0 =and π s 00 + π s 0 =. For brevy, we gve he generave process for rue posve rae π s and rue negave rae π s 00 of source s. We frs assume ha he rue posve rae π s s generaed from a Bea dsrbuon wh hyperparameers α and α 0 n α,gven by π s Bea( α ) Γ(α + α0) Γ(α )Γ(α 0) (πs ) α ( π s ) α 0 (5)

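where $\alpha_{1,1}$ is the prior true positive count and $\alpha_{1,0}$ is the prior false negative count for the confusion matrix $\pi^s$. We then assume that the true negative rate $\pi^s_{00}$ is also generated from a Beta distribution with hyperparameters $\alpha_{0,0}$ and $\alpha_{0,1}$ in $\vec{\alpha}$, given by

\[
\pi^s_{00} \sim \mathrm{Beta}(\alpha_{0,0}, \alpha_{0,1}) = \frac{\Gamma(\alpha_{0,1} + \alpha_{0,0})}{\Gamma(\alpha_{0,0})\,\Gamma(\alpha_{0,1})}\,(\pi^s_{00})^{\alpha_{0,0}-1}\,(1 - \pi^s_{00})^{\alpha_{0,1}-1} \tag{6}
\]

where $\alpha_{0,0}$ is the prior true negative count and $\alpha_{0,1}$ is the prior false positive count for the confusion matrix $\pi^s$.

3.3.3 Generating Vote O

We now describe the generative process of the votes made by each source. For the conflicting values of each entity, we assume that the votes by source $s$ are generated from a Bernoulli distribution based on the confusion matrix $\pi^s$ and the truth value $Z^t_i$ for entity $i$. The value of each vote by source $s$ for entity $i$, $o^t_{i,s,v}$, is also binary, and thus the Bernoulli distribution is a suitable choice for its generation. The probabilistic generation of the vote $o^t_{i,s,v}$ is given by

\[
o^t_{i,s,v} \sim \mathrm{Bernoulli}(\pi^s_{z^t_{i,v}1}) = (\pi^s_{z^t_{i,v}1})^{o^t_{i,s,v}}\,(1 - \pi^s_{z^t_{i,v}1})^{1 - o^t_{i,s,v}} \tag{7}
\]

For example, if the value $v$ is the truth in $V^t_i$ (i.e., $z^t_{i,v} = 1$), then the vote $o^t_{i,s,v}$ is generated by the true positive rate $\pi^s_{11}$ or the false negative rate $\pi^s_{10}$ of source $s$.

3.4 Model Definition

In the previous discussion, we described a generative process for the votes $O$. We now formally define a probabilistic model that represents the underlying joint distribution over the prior distribution for the truth $\Theta$, the truth value $Z$, the source quality $\Pi$ and the votes $O$. Given hyper-parameters $\phi = \{\vec{\alpha}, \vec{\beta}\}$ and a set of sources $S$, we factorize the joint distribution over $Z$, $\Theta$, $\Pi$ and $O$ as

\[
p(\Theta, Z, \Pi, O \mid S, \phi) = p(\Theta \mid \vec{\beta})\,p(Z \mid \Theta)\,p(\Pi \mid \vec{\alpha})\,p(O \mid \Pi, Z)
\]

where

\[
\begin{aligned}
p(\Theta \mid \vec{\beta}) &= \prod_{t=1}^{T}\prod_{i=1}^{N}\prod_{v \in V^t_i} p(\theta^t_{i,v} \mid \beta_1, \beta_0), \\
p(Z \mid \Theta) &= \prod_{t=1}^{T}\prod_{i=1}^{N}\prod_{v \in V^t_i} p(z^t_{i,v} \mid \theta^t_{i,v}), \\
p(\Pi \mid \vec{\alpha}) &= \prod_{s \in S} p(\pi^s_{11} \mid \alpha_{1,1}, \alpha_{1,0})\,p(\pi^s_{00} \mid \alpha_{0,1}, \alpha_{0,0}), \\
p(O \mid \Pi, Z) &= \prod_{t=1}^{T}\prod_{i=1}^{N}\prod_{s \in S}\prod_{v \in V^t_i} p(o^t_{i,s,v} \mid \pi^s, z^t_{i,v}),
\end{aligned}
\]

and the probability distributions $p(\theta^t_{i,v} \mid \vec{\beta})$, $p(z^t_{i,v} \mid \theta^t_{i,v})$, $p(\pi^s_{11} \mid \alpha_{1,1}, \alpha_{1,0})$, $p(\pi^s_{00} \mid \alpha_{0,1}, \alpha_{0,0})$ and $p(o^t_{i,s,v} \mid \pi^s, z^t_{i,v})$ are defined in Equations 3-7, respectively. For brevity, we omit the conditional part of the joint distribution $p(\Theta, Z, \Pi, O \mid S, \phi)$ and abbreviate it to $p(\Theta, Z, \Pi, O)$ in the rest of this paper.

Based on the model, the problem of truth finding for observed votes can be transformed into a standard probabilistic inference problem, namely, finding the maximum a posteriori (MAP) configuration of the truth $Z$ conditioned on $O$. That is, we want to find

\[
Z^{*} = \arg\max_{Z} p(Z \mid O) \tag{8}
\]

where $p(Z \mid O)$ is the posterior distribution of $Z$ given the votes $O$ (and $\phi$). However, it is difficult to compute the posterior distribution of $Z$,

\[
p(Z \mid O) = \iint p(\Theta, Z, \Pi \mid O)\, d\Theta\, d\Pi, \tag{9}
\]

where

\[
p(\Theta, Z, \Pi \mid O) = \frac{p(\Theta, Z, \Pi, O)}{\iint \sum_{Z} p(\Theta, Z, \Pi, O)\, d\Theta\, d\Pi}. \tag{10}
\]

This distribution is intractable to compute due to the coupling between $\Pi$ and $\Theta$. To tackle this problem, we develop an efficient and effective approximation algorithm in the next section.

4. THE OPTIMIZATION ALGORITHM

In this section, we propose algorithms to approximate the distribution $p(\Theta, Z, \Pi \mid O)$ defined in Equation 9. We first introduce a batch optimization algorithm for the proposed problem by assuming that $T$ is a fixed value. Then, we introduce two online algorithms for solving the problem of truth discovery over data streams.

4.1 Batch Optimizing Algorithm

We present a variational algorithm for discovering the truth with fixed $T$. We first restrict the variational distribution to a family of distributions that factorize as follows:

\[
q(\Theta, Z, \Pi) = \Big(\prod_{t=1}^{T}\prod_{i=1}^{N}\prod_{v \in V^t_i} q(\theta^t_{i,v})\,q(z^t_{i,v})\Big)\prod_{s \in S} q(\pi^s_{11})\,q(\pi^s_{00}).
\]

The joint probability distribution thus reduces to a product of simple distributions, which greatly reduces the computation cost. The choice of variational distributions is not arbitrary: we require each of them to be in the same family as the corresponding model distribution, which gives the following parametric form:

\[
q(\Theta, Z, \Pi \mid \gamma, \eta, \lambda) = \Big(\prod_{t=1}^{T}\prod_{i=1}^{N}\prod_{v \in V^t_i} q(\theta^t_{i,v} \mid \gamma^t_{i,v})\,q(z^t_{i,v} \mid \eta^t_{i,v})\Big)\prod_{s \in S} q(\pi^s_{11} \mid \lambda^s_1)\,q(\pi^s_{00} \mid \lambda^s_0),
\]

where

\[
q(\theta^t_{i,v} \mid \gamma^t_{i,v}) = \mathrm{Beta}(\gamma^t_{i,v}), \quad
q(z^t_{i,v} \mid \eta^t_{i,v}) = \mathrm{Bernoulli}(\eta^t_{i,v}), \quad
q(\pi^s_{11} \mid \lambda^s_1) = \mathrm{Beta}(\lambda^s_1), \quad
q(\pi^s_{00} \mid \lambda^s_0) = \mathrm{Beta}(\lambda^s_0).
\]

Here, $\gamma$, $\eta$, $\lambda_1$ and $\lambda_0$ are the variational parameters. Thus, the inference for the truth value in Equation 8 can be simplified as follows:

\[
Z^{*} = \Big[\arg\max_{Z^1} q(Z^1 \mid \eta^1),\ \arg\max_{Z^2} q(Z^2 \mid \eta^2),\ \ldots,\ \arg\max_{Z^T} q(Z^T \mid \eta^T)\Big]
      = \big[\arg\max \eta^1,\ \ldots,\ \arg\max \eta^T\big] \tag{11}
\]

The goal of the variational algorithm is to find the variational distribution that is close to the true posterior $p(\Theta, Z, \Pi \mid O)$. This is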
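equivalent to optimizing the variational parameters $\gamma$, $\eta$, $\lambda_1$ and $\lambda_0$ with respect to some distance measure $D$, given by

\[
(\gamma^{*}, \eta^{*}, \lambda_1^{*}, \lambda_0^{*}) = \arg\min_{\gamma, \eta, \lambda_1, \lambda_0} D\big(q(\gamma, \eta, \lambda_1, \lambda_0)\,\|\,p(\Theta, Z, \Pi \mid O)\big).
\]

In this work, we adopt the Kullback-Leibler (KL) divergence, which is commonly used to measure the difference between two distributions. It is defined as

\[
\mathrm{KL}(q \,\|\, p) = \iint \sum_{Z} q(\gamma, \eta, \lambda_1, \lambda_0)\,\log\frac{q(\gamma, \eta, \lambda_1, \lambda_0)}{p(\Theta, Z, \Pi \mid O)}\, d\Theta\, d\Pi,
\]

where the KL divergence is a function of the variational parameters $\gamma$, $\eta$, $\lambda_1$ and $\lambda_0$. However, directly optimizing the KL divergence is infeasible because it involves the term $p(\Theta, Z, \Pi \mid O)$, which is intractable. Instead, we solve an equivalent maximization problem, whose objective function is defined as

\[
L(q) = \iint \sum_{Z} q(\gamma, \eta, \lambda_1, \lambda_0)\,\log\frac{p(\Theta, Z, \Pi, O)}{q(\gamma, \eta, \lambda_1, \lambda_0)}\, d\Theta\, d\Pi.
\]

The equivalence between these two optimization problems can easily be seen, as their objective functions sum to a constant that does not depend on $q$:

\[
\mathrm{KL}(q \,\|\, p) + L(q) = \log p(O).
\]

In order to maximize the objective function $L(q)$, we take its derivatives with respect to the variational parameters $\gamma$, $\eta$, $\lambda_1$ and $\lambda_0$, and set these derivatives to zero:

\[
\nabla L(q) = \Big(\frac{\partial L}{\partial \eta}, \frac{\partial L}{\partial \gamma}, \frac{\partial L}{\partial \lambda_1}, \frac{\partial L}{\partial \lambda_0}\Big) = \vec{0}. \tag{12}
\]

For clarity, we put all the derivations in the Appendix. The solutions to the optimization problem are

\[
\eta^t_{i,v,j} \propto \exp\Big\{\psi(\gamma^t_{i,v,j}) - \psi\Big(\sum_{m=0}^{1}\gamma^t_{i,v,m}\Big)\Big\}\,
\exp\Big\{\sum_{s \in S}\sum_{k=0}^{1} o^t_{i,s,v,k}\Big(\psi(\lambda^s_{j,k}) - \psi\Big(\sum_{m=0}^{1}\lambda^s_{j,m}\Big)\Big)\Big\} \tag{13}
\]

\[
\gamma^t_{i,v,j} = \beta_j + \eta^t_{i,v,j} \tag{14}
\]

\[
\lambda^s_{0,j} = \alpha_{0,j} + \sum_{t=1}^{T}\sum_{i=1}^{N}\sum_{v \in V^t_i} \eta^t_{i,v,0}\, o^t_{i,s,v,j} \tag{15}
\]

\[
\lambda^s_{1,j} = \alpha_{1,j} + \sum_{t=1}^{T}\sum_{i=1}^{N}\sum_{v \in V^t_i} \eta^t_{i,v,1}\, o^t_{i,s,v,j} \tag{16}
\]

for all $s = 1, \ldots, |S|$; $t = 1, \ldots, T$; $i = 1, \ldots, N$; $j \in \{0, 1\}$. Here $\psi(\cdot)$ is the Digamma function, the logarithmic derivative of the Gamma function $\Gamma(\cdot)$, given by $\psi(x) = \frac{\partial}{\partial x}\log\Gamma(x)$.

4.2 Streaming Optimization Algorithm

In this section, we develop a streaming truth finding algorithm called StreamTF, given in Algorithm 1. The StreamTF algorithm heuristically finds the truth in one pass, with short response time and limited memory usage. Furthermore, StreamTF is also capable of truth discovery when the quality of data sources evolves. The idea of the StreamTF algorithm is based on sequential Bayesian estimation, given by

\[
p(\phi^t \mid O^1, O^2, \ldots, O^t) \propto p(O^t \mid \phi^t)\, p(\phi^{t-1} \mid O^1, O^2, \ldots, O^{t-1})
\]

where $\phi^{t-1}$ denotes the variational parameters estimated at time $t-1$.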
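This indicates that we can use the posterior $p(\phi^{t-1} \mid O^1, O^2, \ldots, O^{t-1})$ as the prior and infer the variational parameters $\phi^t$ based on the collection of votes $O^t$. Thus, we present the techniques of the StreamTF algorithm that finds the truth and estimates the source quality sequentially. We notice that Equations 15 and 16 for estimating the source quality can also be written as

\[
\lambda^s_{0,j} = \Big\{\alpha_{0,j} + \sum_{t=1}^{T-1}\sum_{i=1}^{N}\sum_{v \in V^t_i} \eta^t_{i,v,0}\, o^t_{i,s,v,j}\Big\} + \sum_{i=1}^{N}\sum_{v \in V^T_i} \eta^T_{i,v,0}\, o^T_{i,s,v,j},
\]

\[
\lambda^s_{1,j} = \Big\{\alpha_{1,j} + \sum_{t=1}^{T-1}\sum_{i=1}^{N}\sum_{v \in V^t_i} \eta^t_{i,v,1}\, o^t_{i,s,v,j}\Big\} + \sum_{i=1}^{N}\sum_{v \in V^T_i} \eta^T_{i,v,1}\, o^T_{i,s,v,j},
\]

where we can interpret the terms $\{\alpha_{0,j} + \sum_{t=1}^{T-1}\sum_{i=1}^{N}\sum_{v \in V^t_i} \eta^t_{i,v,0}\, o^t_{i,s,v,j}\}$ and $\{\alpha_{1,j} + \sum_{t=1}^{T-1}\sum_{i=1}^{N}\sum_{v \in V^t_i} \eta^t_{i,v,1}\, o^t_{i,s,v,j}\}$ as the prior parameters of the true negative rate and the true positive rate of source $s$, denoted as $(\lambda^s_{0,j})^{T-1}$ and $(\lambda^s_{1,j})^{T-1}$, respectively. Next, we consider $(\lambda^s_{0,j})^{T-1}$ and $(\lambda^s_{1,j})^{T-1}$ as the prior parameters for estimating the data truth $\eta^T$ and the data uncertainty $\gamma^T$. Then, we estimate the source quality $(\lambda^s_{0,j})^{T}$ and $(\lambda^s_{1,j})^{T}$ based on the estimated $\eta^T$ and $\gamma^T$, given by

\[
(\lambda^s_{0,j})^{T} = (\lambda^s_{0,j})^{T-1} + \sum_{i=1}^{N}\sum_{v \in V^T_i} \eta^T_{i,v,0}\, o^T_{i,s,v,j}, \tag{17}
\]

\[
(\lambda^s_{1,j})^{T} = (\lambda^s_{1,j})^{T-1} + \sum_{i=1}^{N}\sum_{v \in V^T_i} \eta^T_{i,v,1}\, o^T_{i,s,v,j}. \tag{18}
\]

The StreamTF algorithm is outlined in Algorithm 1. We now show how it addresses the three computation issues stated in Section 3.2. First, StreamTF is one-pass, since the algorithm reads the data only once. Second, StreamTF has limited memory usage, because it only keeps one collection of votes in memory at any time in the stream. Third, StreamTF has short response time, since the algorithm reports the truth online; our experiments also verify that it can process many collections of votes per second, which is in effect a real-time response.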
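The incremental updates in Equations 17 and 18 can be illustrated with a minimal Python sketch. It assumes that the indicator $o^T_{i,s,v,j}$ equals 1 exactly when source $s$ casts vote $j$ on value $v$ of entity $i$, and that $\eta^T$ has already been estimated for the current collection. The function name, the dictionary layout and the source and value labels are hypothetical choices made for illustration; they are not taken from the original algorithm description.

```python
from typing import Dict, List, Tuple

# lam[s][k][j]: running count for source s, truth state k, vote value j.
Counts = Dict[str, List[List[float]]]
# eta[(i, v)] = (eta_0, eta_1): belief that value v of entity i is false / true.
Belief = Dict[Tuple[int, str], Tuple[float, float]]
# votes[(i, s, v)] in {0, 1}: vote of source s on value v of entity i.
Votes = Dict[Tuple[int, str, str], int]


def update_source_quality(lam: Counts, eta: Belief, votes: Votes) -> Counts:
    """Turn the prior counts (lambda)^{T-1} into the posterior counts (lambda)^T
    using only the current collection of votes."""
    # Copy so that the prior counts passed in are left untouched.
    new_lam = {s: [row[:] for row in rows] for s, rows in lam.items()}
    for (i, s, v), vote in votes.items():
        eta0, eta1 = eta[(i, v)]
        j = 1 if vote else 0  # assumed encoding: o_{i,s,v,j} = 1 iff the vote equals j
        new_lam[s][0][j] += eta0  # Equation 17 (truth state 0)
        new_lam[s][1][j] += eta1  # Equation 18 (truth state 1)
    return new_lam


# Hypothetical example: one source, one entity, one candidate value.
prior = {"src_a": [[1.0, 1.0], [1.0, 1.0]]}   # counts initialised from alpha
eta_t = {(0, "on_time"): (0.2, 0.8)}          # value believed true with weight 0.8
votes_t = {(0, "src_a", "on_time"): 1}        # the source votes for the value
posterior = update_source_quality(prior, eta_t, votes_t)
```

Only the running counts and the current collection are touched, so the memory needed by this step does not grow with the number of collections already processed.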

Algorithm 1 Streaming Truth Finding (StreamTF)
Input: Observed votes $O$, variational parameters $\lambda$, a threshold $\epsilon$
Output: Variational parameters $\eta$, $\lambda$, $\gamma$
  for $t = 1, 2, \ldots$ do
    for each source $s \in S$ do
      Set true negative rate $\lambda^s_0 \leftarrow (\lambda^s_0)^{t-1}$
      Set true positive rate $\lambda^s_1 \leftarrow (\lambda^s_1)^{t-1}$
    end for
    repeat
      for entity $i : 1 \to N$ do
        for each value $v \in V^t_i$ do
          Update $\eta^t_{i,v}$ by Equation 13
        end for
      end for
      Update $\gamma^t$ by Equation 14
    until change in $\frac{1}{N}\sum_{i=1}^{N} \eta^t_{i,v} < \epsilon$
    for each source $s \in S$ do
      Update true negative rate $(\lambda^s_0)^t$ by Equation 17
      Update true positive rate $(\lambda^s_1)^t$ by Equation 18
    end for
  end for
  return $\eta$, $\lambda$, $\gamma$

4.3 One-Pass Optimization Algorithm

We can further improve the StreamTF algorithm if the size of the dataset is known. In that case we can design a one-pass algorithm that not only satisfies the three computational requirements of data stream processing stated in Section 3.2, but also stochastically maximizes the objective function in Equation 12, i.e., $L(q)$. Such a one-pass algorithm is particularly useful for processing massive static databases. We observe that the objective function $L(q)$ can be represented as a sum of $T$ functions of the variational parameters, given by

\[
L(q) = \sum_{t=1}^{T} l(O^t, \eta^t, \theta^t, \lambda_0, \lambda_1),
\]

where we consider $\eta^t$ and $\theta^t$ as local parameters of the function $l(O^t, \eta^t, \theta^t, \lambda_0, \lambda_1)$, and $\lambda_0$, $\lambda_1$ as global parameters for source quality. The challenge of this problem is the inference of the parameters $\lambda_0$, $\lambda_1$, because we can only keep one collection of votes $O^t$ at each time. To tackle this problem, we develop our one-pass algorithm based on the stochastic natural gradient algorithm [4]. We model the streaming collections of votes $O^1, O^2, \ldots, O^T$ as sampled from a uniform distribution, that is, $h(O^t) = \frac{1}{T}$. The expectation of the objective function is given by

\[
E_h[L(q)] = T\, E_h\big[l(O^t, \eta^t, \theta^t, \lambda_0, \lambda_1)\big] \tag{19}
\]

Then, we optimize Equation 19 by repeatedly sampling the collection of votes at different times, and applying the update

\[
\lambda^s_{i,j} \leftarrow \lambda^s_{i,j} + \rho_t\, T\, \frac{\partial l(O^t, \eta^t, \theta^t, \lambda_0, \lambda_1)}{\partial \lambda^s_{i,j}}
= (1 - \rho_t)\,\lambda^s_{i,j} + \rho_t\Big(\alpha_{i,j} + T \sum_{n=1}^{N}\sum_{v \in V_n} \eta^t_{n,v,i}\, o^t_{n,s,v,j}\Big) \tag{20}
\]

for all $s$, $i \in \{0, 1\}$ and $j \in \{0, 1\}$, where $\rho_t$ is the decay factor. The derivation of Equation 20 can be found in the Appendix. To guarantee the convergence of source quality, we set the decay factor to $\rho_t = (\tau + t)^{-\kappa}$, where the parameters $\kappa$ and $\tau$ control the rate at which old values of $\lambda^s_{i,j}$ are forgotten. We set $\kappa > 0.5$ and $\tau > 0$ such that $\sum_{t=1}^{\infty} \rho_t = \infty$ and $\sum_{t=1}^{\infty} (\rho_t)^2 < \infty$ (the divergence of the first sum additionally requires $\kappa \le 1$), so that the estimation of $\lambda^s_{i,j}$ can converge to a stationary point.

THEOREM 1. (Online Optimization [4]) The general greedy descent method converges if and only if its learning rates $\rho_t$ fulfill

\[
\sum_{t=1}^{\infty} (\rho_t)^2 < \infty, \qquad \sum_{t=1}^{\infty} \rho_t = \infty.
\]

The detailed proof of the above theorem can be found in [4]; we give the intuition of the convergence of $\lambda^s_{i,j}$ here. The ratio $\rho_t = (\tau + t)^{-\kappa}$ is a function of $t$ and becomes smaller as the algorithm runs for more iterations. The term $(\alpha_{i,j} + T \sum_{n=1}^{N}\sum_{v \in V_n} \eta^t_{n,v,i}\, o^t_{n,s,v,j})$ is the inference for the new $\lambda^s_{i,j}$. The update on the variational parameter $\lambda^s_{i,j}$ is the product of $(\alpha_{i,j} + T \sum_{n=1}^{N}\sum_{v \in V_n} \eta^t_{n,v,i}\, o^t_{n,s,v,j})$ and $\rho_t$, which shrinks as the algorithm iterates more. Thus, the inference of the parameter $\lambda^s_{i,j}$ converges. The PassTF algorithm is outlined in Algorithm 2. It is easy to see that the PassTF algorithm also addresses the three computation issues stated in Section 3.2, by following the same analysis given for the StreamTF algorithm at the end of Section 4.2.

Algorithm 2 One-Pass Truth Finding (PassTF)
Input: Observed votes $O$, input data size $T$, a threshold $\epsilon$
Output: Variational parameters $\eta$, $\lambda$, $\gamma$
  Define $\rho_t = (\tau + t)^{-\kappa}$
  for $t = 1 \to T$ do
    repeat
      for entity $i : 1 \to N$ do
        for $v \in V^t_i$ do
          Update $\eta^t_{i,v}$ by Equation 13
        end for
      end for
      Update $\gamma^t$ by Equation 14
    until change in $\frac{1}{N}\sum_{i=1}^{N} \eta^t_{i,v} < \epsilon$
    for each source $s \in S$ do
      Update variational parameters $\lambda^s$ by Equation 20
    end for
  end for
  return $\eta$, $\lambda$, $\gamma$

5. EXPERIMENTAL RESULTS

In this section, we evaluate the effectiveness and efficiency of our algorithms.
All the algorithms, including those we compared with in the experiments, were implemented in Java and tested on machines with Windows OS, an Intel(R) Core(TM)2 Quad CPU at 2.66GHz, and 8GB of RAM.

5.1 Datasets

We use three real datasets to evaluate the performance of our algorithms. Some statistics of the datasets are reported in Table 1.

Table 1: Statistics of Real Datasets

Datasets   #Votes O   #Sources   #Values           Entropy
                                 Avg     Dev       Avg     Dev
Flight     35k        35         2.809   .59       0.70    0.47
Weather    54k        7          2.48    2.502     0.53    0.42
Stock      2k         5          7.58    4.027     .227    0.073

Weather. We collected the weather forecast data in May 2013 for 285 US cities that have a population of at least 100,000. The weather forecast data are modeled as streams reported hourly from different sources. The sources were obtained as follows. We searched for "weather forecast" on Google and collected the deep-web sources from the top 100 returned results. Among them, we chose the sources where the weather forecast data are encoded in the URL. Then, we selected the sources that forecast the weather hourly and removed the sources whose data were copied from other sources. Finally, we obtained a set of seven sources. We also collected the historic weather data for these cities from a normal weather forecast website¹ as the ground truth for our evaluation. We use the historic weather data as the truth data, since they are recorded after the day. The weather forecast data, on the other hand, may contain mistakes, as they are based on prediction.

Flight. The flight data contains 1200 flights from three airlines (AA, UA and Continental) in one month. The sources include the official websites of the three airlines, and 32 third-party websites that provide flight information for the airlines. We consider the data provided by the official websites of the three airlines as the gold standard.

Stock. The stock data contains 1000 stocks from 55 sources over one month. We use the data provided by NASDAQ100 as the gold standard.

We obtained both the Flight data and the Stock data from the website². These two datasets are used as benchmark datasets in the experimental study in [14]. The gold standards are the same as suggested in [14]. Some sources of both the Flight data and the Stock data may copy their values from other sources. We use the well-known copy-detection method in [7, 9] to remove these sources.

5.2 Experimental Settings

For continuous values in the datasets, such as the departure time of a flight and the stock price, we transform the values into a discrete format using the notions of tolerance and bucketing (as suggested in [14]) as follows.

Tolerance. For the departure time in the flight data, we tolerate a 10-minute difference.
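For a stock price value at time $t$, we consider all the values $V^t_i$ provided by the sources and compute the mean value $\bar{V}^t_i$. The tolerance for $V^t_i$ is computed based on $\bar{V}^t_i$ and a predefined threshold $\varepsilon$, and is given by $\tau(V^t_i) = \varepsilon\,\bar{V}^t_i$, where the threshold $\varepsilon$ is set to 0.01 by default.

Bucketing. For each collection of values $V^t_i$, we group values with very small differences into a bucket. We first compute its mean value $\bar{V}^t_i$ and then put all the values into the following buckets:

\[
\ldots,\ \Big(\bar{V}^t_i - \tfrac{3\tau(V^t_i)}{2},\ \bar{V}^t_i - \tfrac{\tau(V^t_i)}{2}\Big],\ \Big(\bar{V}^t_i - \tfrac{\tau(V^t_i)}{2},\ \bar{V}^t_i + \tfrac{\tau(V^t_i)}{2}\Big],\ \Big(\bar{V}^t_i + \tfrac{\tau(V^t_i)}{2},\ \bar{V}^t_i + \tfrac{3\tau(V^t_i)}{2}\Big],\ \ldots
\]

¹ http://www.weatherforyou.com/
² http://cs.binghamton.edu/~xianli/truthfinding.htm

We now report the consistency of the datasets above. We denote the observed values and votes at time $t$ by $V^t$ and $O^t$, respectively. We use the average number of values to measure the uncertainty of the data streams, and we employ the entropy to measure the confusion of the conflicting values across the sources. Both the average number of values and the entropy are popular measurements of data consistency, and are also used in the experimental study in [14].

Average Number of Values. We denote the number of values for entity $i$ at time $t$ by $n^t_i$. For the collection of values $V = \{V^1, V^2, \ldots, V^T\}$, we define the average number of values for the data streams as

\[
\bar{V} = \frac{1}{T N}\sum_{t=1}^{T}\sum_{i=1}^{N} n^t_i.
\]

The standard deviation of $\bar{V}$ is computed by

\[
\delta(\bar{V}) = \sqrt{\frac{1}{T N}\sum_{t=1}^{T}\sum_{i=1}^{N} (n^t_i - \bar{V})^2}.
\]

Average Entropy. We denote the vote for entity $i$ from source $s \in S$ on the value $v$ at time $t$ by $o^t_{i,s,v}$, which is a binary value. For all the observed votes $O = \{O^1, O^2, \ldots, O^T\}$, we define the average entropy of the votes from the sources by

\[
\overline{En}(O) = \frac{1}{T N}\sum_{t=1}^{T}\sum_{i=1}^{N} En(O^t_i)
= -\frac{1}{T N}\sum_{t=1}^{T}\sum_{i=1}^{N}\sum_{v=1}^{n^t_i} \frac{\sum_{s \in S} o^t_{i,s,v}}{|S|}\,\log\frac{\sum_{s \in S} o^t_{i,s,v}}{|S|},
\]

and its standard deviation can be computed as

\[
\delta(\overline{En}(O)) = \sqrt{\frac{1}{T N}\sum_{t=1}^{T}\sum_{i=1}^{N} \big(En(O^t_i) - \overline{En}(O)\big)^2}.
\]

The details of the statistics of the real datasets can be found in Table 1.

We evaluate the performance of our batch algorithm and the two streaming algorithms using the above datasets. To measure the effectiveness of our methods, we define the average accuracy (AVG), minimum accuracy (MIN) and standard deviation of accuracy (DEV).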
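The tolerance and bucketing steps described earlier in this section can be sketched as follows for the stock-price case. This is a minimal illustration under stated assumptions, not the preprocessing code used in the experiments: the function name `bucket_indices` is hypothetical, bucket 0 denotes the interval centred on the mean, and a positive mean is assumed so that the tolerance is positive.

```python
import math
from typing import List


def bucket_indices(values: List[float], epsilon: float = 0.01) -> List[int]:
    """Map each value to the index k of the bucket
    (mean + (k - 1/2) * tol, mean + (k + 1/2) * tol] that contains it."""
    if not values:
        return []
    mean = sum(values) / len(values)
    tol = epsilon * mean  # tolerance tau(V) = epsilon * mean
    if tol <= 0:
        raise ValueError("tolerance must be positive (a positive mean is assumed)")
    # Buckets are right-closed, so a value lying exactly on an upper
    # boundary stays in the lower bucket; ceil gives that behaviour.
    return [math.ceil((x - mean) / tol - 0.5) for x in values]


# Hypothetical prices reported by four sources for one stock at one time.
prices = [100.0, 100.4, 101.2, 98.4]
print(bucket_indices(prices))
```

With these made-up prices the mean is 100 and the tolerance is 1, so the sketch prints `[0, 0, 1, -2]`: the first two prices share the central bucket and are treated as the same value, while the other two fall into separate buckets.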
We first partition the collection of the data streams into buckets of the same size and compute the accuracy of the algorithm for each bucket. Then, we compute the average accuracy and the minimum accuracy over all the buckets as AVG and MIN, respectively. Finally, we compute the standard deviation of accuracy from the bucket accuracies and the average accuracy as DEV. By default, we set the size of each bucket to 300. For the streaming algorithms, we also evaluate their robustness by varying the decay ratio $\kappa$ and the decay seed $\tau$.

5.3 Streaming Truth Finding

We next evaluate the performance of our streaming algorithms, StreamTF and PassTF, on the following three measures: (1) accuracy, (2) running time, and (3) robustness. Since there is no existing algorithm tackling truth finding over data streams, we use the incremental algorithm LTMinc [31] as the baseline for comparison. We use 10%-40% of the data for training the LTM model for LTMinc, denoted by LTMinc 0.1-LTMinc 0.4, respectively.

5.3.1 Accuracy of Streaming Algorithms

We now present the accuracy of the algorithms in Table 2. For all the three datasets, both StreamTF and PassTF achieve significantly higher accuracy than LTMinc. Both the average accuracy and the minimum accuracy of StreamTF and PassTF are higher than those of LTMinc.

Table 2: Accuracy of Streaming Algorithms (the best score in bold except Dev)

Method       Flight                     Weather                    Stock
             Avg      Min      Dev      Avg      Min      Dev      Avg      Min      Dev
PassTF       0.9575   0.8732   0.065    0.9426   0.8769   0.07     0.907    0.8320   0.0506
StreamTF     0.9565   0.8683   0.087    0.9426   0.848    0.0226   0.8985   0.8320   0.0580
LTMinc 0.1   0.8426   0.8280   0.005    0.8009   0.699    0.352    0.7733   0.6698   0.0699
LTMinc 0.2   0.860    0.847    0.0090   0.8092   0.2796   0.3736   0.7800   0.6650   0.0734
LTMinc 0.3   0.8572   0.8458   0.0076   0.806    0.333    0.323    0.77     0.6650   0.0758
LTMinc 0.4   0.8776   0.8664   0.0063   0.877    0.2392   0.3553   0.7837   0.7279   0.0549

[Figure 4: Running Time of Streaming Algorithms. Panels: (a) Flight, (b) Weather, (c) Stock. x-axis: #Votes; y-axis: Running Time (ms); series: PassTF, StreamTF, LTMinc 0.1, LTMinc 0.2, LTMinc 0.3, LTMinc 0.4.]

Notably, the accuracy of PassTF is higher than that of StreamTF, which can be explained as follows. Both StreamTF and PassTF sequentially find the truth and estimate the source quality over the data streams. The difference between them lies in the estimation of source quality. The StreamTF algorithm estimates the source quality online based on the sequential Bayesian estimation in Equation 17. However, estimating source quality by sequential Bayesian estimation does not optimize the likelihood objective function. In contrast, the PassTF algorithm stochastically optimizes the objective function in Equation 19 online, in order to accurately estimate the source quality using the gradient descent update in Equation 20. As a result, the performance of PassTF is better than that of StreamTF.

5.3.2 Running Time of Streaming Algorithms

We report the running time of the algorithms in Figures 4(a), 4(b) and 4(c), respectively. We sequentially pass the votes $O^t_i$ for each entity to our algorithms.
For all the datasets, StreamTF and PassTF are faster than LTMinc and are able to process data streams at high speed. It is also interesting to see that the per-collection running time of StreamTF and PassTF decreases as more data have been processed (i.e., the cumulative running time grows sub-linearly with the amount of data being processed), as shown in Figures 4(a), 4(b) and 4(c). The streaming algorithms keep estimating the confusion matrix for source quality online. They take more iterations to converge when estimating the source quality from the initial period of the data streams. As we know, the confusion matrix of the source quality is stationary. As time passes, our streaming algorithms take fewer and fewer iterations to converge. Finally, we find that the number of iterations for inferring the source quality becomes one when the estimation of the source confusion matrix reaches the stationary point. Thus, the time cost of our streaming algorithms is greatly reduced as more data are processed.

[Figure 5: Robustness of One-Pass Algorithms. Panels: (a) Decay Seed (x-axis: $\tau$), (b) Decay Ratio (x-axis: $\kappa$). y-axis: Accuracy; series: PassTF_f, min-PassTF_f, PassTF_w, min-PassTF_w, PassTF_s, min-PassTF_s.]

5.3.3 Robustness of One-Pass Algorithms

We evaluate the robustness of the PassTF algorithm by varying its parameters, the decay ratio $\kappa$ and the decay seed $\tau$, to validate its effectiveness. We measure the performance of the algorithm by the average accuracy and the minimum accuracy (indicated by adding the prefix min- to the algorithm names in the figures), respectively. We denote the PassTF algorithm on the flight, weather and stock datasets by PassTF_f, PassTF_w and PassTF_s, respectively. The results in Figure 5 show that our PassTF algorithm is robust, as it achieves consistently high accuracy for different values of the decay ratio and the decay seed. It is worth mentioning that the running time of PassTF also remains stable for different values of the decay ratio and the decay seed.

6.
CONCLUSIONS

In this paper, we studied the problem of truth discovery in data streams, which has a wide range of data stream applications such as weather forecast and flight scheduling. We proposed a probabilistic model that transforms the problem of truth discovery over data streams into a probabilistic inference problem. We first developed a streaming algorithm that discovers the truth under the constraints of one-pass nature, limited memory usage and short response time. Then, we also proposed a one-pass algorithm that stochastically optimizes the probabilistic inference of source quality, which further improves the accuracy of the streaming algorithm. As for the empirical study, we verified the effectiveness and efficiency of our algorithms using three real datasets from the data stream applications of weather forecast, flight scheduling and stock price prediction. The experimental results validate the effectiveness of our algorithms, in terms of both integration accuracy and running time.

Acknowledgments. We thank the reviewers for giving us many constructive comments, with which we have significantly improved our paper. This research is supported in part by SHIAE Grant No. 85048, MSRA Grant No. 6903555, and HKUST Grant No. FSGRF4EG3.

7. REFERENCES

[1] M. Arenas, L. Bertossi, and J. Chomicki. Consistent query answers in inconsistent databases. In PODS, pages 68-79, 1999.
[2] C. M. Bishop and N. M. Nasrabadi. Pattern recognition and machine learning, volume 1. Springer New York, 2006.
[3] L. Blanco, V. Crescenzi, P. Merialdo, and P. Papotti. Probabilistic models to reconcile complex data from inaccurate data sources. In Advanced Information Systems Engineering, pages 83-97, 2010.
[4] L. Bottou. Online learning and stochastic approximations. On-line learning in neural networks, 17:9, 1998.
[5] H. Chen, W.-S. Ku, H. Wang, and M.-T. Sun. Leveraging spatio-temporal redundancy for RFID data cleansing. In SIGMOD, pages 51-62, 2010.
[6] X. L. Dong, L. Berti-Equille, and D. Srivastava. Integrating conflicting data: the role of source dependence. PVLDB, 2(1):550-561, 2009.
[7] X. L. Dong, L. Berti-Equille, and D. Srivastava. Truth discovery and copying detection in a dynamic world. PVLDB, 2(1):562-573, 2009.
[8] X. L. Dong, A. Halevy, and C. Yu. Data integration with uncertainty. The VLDB Journal, 18(2):469-500, 2009.
[9] X. L. Dong, B. Saha, and D. Srivastava. Less is more: Selecting sources wisely for integration. PVLDB, pages 37-48, 2013.
[10] A. Galland, S. Abiteboul, A. Marian, and P. Senellart. Corroborating information from disagreeing views. In Proceedings of the ACM international conference on Web search and data mining, pages 131-140, 2010.
[11] L. Jia, H. Wang, J. Li, and H. Gao. Incremental truth discovery for information from multiple data sources. In Web-Age Information Management, pages 56-66. Springer, 2013.
[12] G. Kasneci, J. V. Gael, D. Stern, and T. Graepel. CoBayes: Bayesian knowledge corroboration with assessors of unknown areas of expertise. In Proceedings of the ACM international conference on Web search and data mining, pages 465-474, 2011.
[13] M. Kuzu, M. Kantarcioglu, A. Inan, E. Bertino, E. Durham, and B. Malin. Efficient privacy-aware record integration. In EDBT, pages 167-178. ACM, 2013.
[14] X. Li, X. L. Dong, K. Lyons, W. Meng, and D. Srivastava. Truth finding on the deep web: is the problem solved? PVLDB, pages 97-108, 2013.
[15] X. Liu, X. L. Dong, B. C. Ooi, and D. Srivastava. Online data fusion. PVLDB, 4(11), 2011.
[16] A. Pal, V. Rastogi, A. Machanavajjhala, and P. Bohannon. Information integration over time in unreliable and uncertain environments. In WWW, pages 789-798, 2012.
[17] J. Pasternack and D. Roth. Knowing what to believe (when you already know something). In Proceedings of the International Conference on Computational Linguistics, pages 877-885, 2010.
[18] J. Pasternack and D. Roth. Making better informed trust decisions with generalized fact-finding. In Proceedings of the international joint conference on Artificial Intelligence, Volume Three, pages 2324-2329, 2011.
[19] A. D. Sarma, X. L. Dong, and A. Halevy. Data integration with dependent sources. In Proceedings of the 14th International Conference on Extending Database Technology, pages 401-412. ACM, 2011.
[20] D. Smith and S. Singh. Approaches to multisensor data fusion in target tracking: A survey. Knowledge and Data Engineering, IEEE Transactions on, 18(12):1696-1710, 2006.
[21] P. P. Talukdar, Z. G. Ives, and F. Pereira. Automatically incorporating new sources in keyword search-based data integration. In SIGMOD, pages 387-398, 2010.
[22] P. P. Talukdar, M. Jacob, M. S. Mehmood, K. Crammer, Z. G. Ives, F. Pereira, and S. Guha. Learning to create data-integrating queries. PVLDB, 1(1):785-796, 2008.
[23] N. Tatbul. Streaming data integration: Challenges and opportunities. In Data Engineering Workshops (ICDEW), pages 155-158, 2010.
[24] D. Wang, T. Abdelzaher, L. Kaplan, and C. C. Aggarwal. On quantifying the accuracy of maximum likelihood estimation of participant reliability in social sensing. Urbana, 51:61801, 2011.
[25] D. Wang, L. Kaplan, H. Le, and T. Abdelzaher. On truth discovery in social sensing: a maximum likelihood estimation approach. In Proceedings of the 11th international conference on Information Processing in Sensor Networks, pages 233-244. ACM, 2012.
[26] M. Wu and A. Marian. Corroborating answers from multiple web sources. In WebDB, 2007.
[27] M. Wu and A. Marian. A framework for corroborating answers from multiple web sources. Information Systems, 36(2):431-449, 2011.
[28] Z. Yan, N. Zheng, Z. G. Ives, P. P. Talukdar, and C. Yu. Actively soliciting feedback for query answers in keyword search-based data integration. PVLDB, pages 205-216, 2013.
[29] X. Yin, J. Han, and P. S. Yu. Truth discovery with multiple conflicting information providers on the web. TKDE, 20(6):796-808, 2008.
[30] X. Yin and W. Tan. Semi-supervised truth discovery. In WWW, pages 217-226, 2011.
[31] B. Zhao, B. I. Rubinstein, J. Gemmell, and J. Han. A Bayesian approach to discovering truth from conflicting sources for data integration. PVLDB, 5(6):550-561, 2012.
[32] Z. Zhao and W. Ng. A model-based approach for RFID data stream cleansing. In CIKM, pages 862-871, 2012.
[33] Z. Zhao, D. Yan, and W. Ng. A probabilistic convex hull query tool. In Proceedings of the 15th International Conference on Extending Database Technology, pages 570-573. ACM, 2012.