Journal of Data Science 5(2007), 425-439

Comparisons of Gene Expression Indexes for Oligonucleotide Arrays

Mounir Aout
Laboratoire Génétique des Maladies Multi-factorielles-CNRS UMR8090

Abstract: High density oligonucleotide arrays have become a standard research tool to monitor the expression of thousands of genes simultaneously. Affymetrix GeneChip arrays are the most popular. They use short oligonucleotides to probe for genes in an RNA sample. However, important challenges remain in estimating expression levels from the raw hybridization intensities on the array. In this paper, we deal with the problem of estimating gene expression based on a statistical model. The present method is similar to the Li and Wong model (2001a), but makes more general assumptions. More precisely, we show how the model introduced by Li and Wong can be generalized to provide a new measure of gene expression. Moreover, we provide a comparison between these two models.

Key words: Gene expression, model-based estimation, oligonucleotide arrays.

1. Introduction

High density oligonucleotide expression arrays are now widely used in many areas of biomedical research for measurements of gene expression. In the Affymetrix system, an array contains several thousands of genes and ESTs. To probe genes, oligonucleotides of length 25 bp are used. Typically, an mRNA molecule of interest (usually related to a gene) is represented by a probe set. Every probe set consists of 10-20 probe pairs. Every probe pair is composed of a perfect match (PM) probe, a section of the mRNA molecule of interest, and a mismatch (MM) probe, which is identical to the perfect match probe except for the base in the middle (13th) position. After RNA samples are prepared, labeled and hybridized with arrays, these are scanned and images are produced and processed to obtain an intensity value for each probe. These intensities, $PM_{ij}$ and $MM_{ij}$, represent the amount of hybridization for arrays $i = 1, \ldots, I$ and probe pairs $j = 1, \ldots, J$ for any given probe set.
There has been considerable discussion over the appropriate algorithm for constructing single expression estimates based on multiple-probe hybridization data. At present, there are several analytical methods to measure such intensities. However, we will only discuss the Affymetrix Microarray Suite MAS 4.0 and MAS 5.0 (1999 and 2001) and the method of Li and Wong, LW (2001a).

MAS 4.0 uses an average over probe pairs of $PM_{ij} - MM_{ij}$, $j = 1, \ldots, J$, for each array $i = 1, \ldots, I$. This average difference (AD) is motivated by the underlying statistical model
\[ PM_{ij} - MM_{ij} = \theta_i + \epsilon_{ij}, \quad j = 1, \ldots, J. \]
The expression index on array $i$ is represented by $\theta_i$. AD is an appropriate estimate of $\theta_i$ if the error term $\epsilon_{ij}$ has equal variance for $j = 1, \ldots, J$. However, the equal variance assumption does not hold for GeneChip probe level data, since probes with larger mean intensities have larger variances; see Irizarry et al. (2003c).

The latest version of this software, MAS 5.0, computes the anti-log of a robust average of $\log(PM_{ij} - CT_{ij})$. A corresponding statistical model is
\[ \log(PM_{ij} - CT_{ij}) = \log(\theta_i) + \epsilon_{ij}, \quad j = 1, \ldots, J. \]
The basic disadvantage of this method is that nothing is learned about probe characteristics from the performance of each probe across chips. To account for the probe affinity effect, the LW method suggests that
\[ PM_{ij} - MM_{ij} = \theta_i \phi_j + \epsilon_{ij}, \quad i = 1, \ldots, I, \quad j = 1, \ldots, J, \quad \epsilon_{ij} \sim N(0, \sigma^2). \]
The probe affinity effect is represented by $\phi_j$. The main object of this paper is to generalize this model by considering separate models for PM and MM and making general assumptions on the errors.

This paper is organized as follows: The next section deals with a general model based on Li and Wong's model. We make general assumptions on the empirical variance and correlation of and between PM and MM, and estimate the parameters using maximum likelihood. Based on our analysis, we will show that our model gives an unbiased estimate of the expression index with low variance. Section 3 is concerned with a special case using PM only with non-constant variance. In addition, we compare how well these methods perform using the spike-in experiment HGU95A, described in more detail in the same section.
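As a rough illustration of two of the summaries discussed above, the following Python sketch computes the MAS 4.0-style average difference and fits the LW multiplicative model $PM_{ij} - MM_{ij} = \theta_i \phi_j + \epsilon_{ij}$ by alternating least squares. The function names, the starting values, and the scaling $\sum_j \phi_j^2 = J$ used to fix the scale of $\phi$ are assumptions made for this sketch; it is not the implementation used by Affymetrix or by Li and Wong.

```python
import numpy as np

def average_difference(pm, mm):
    """MAS 4.0-style average difference: mean over probe pairs of PM - MM.

    pm, mm: arrays of shape (I, J), arrays in rows and probe pairs in columns.
    """
    pm = np.asarray(pm, dtype=float)
    mm = np.asarray(mm, dtype=float)
    return (pm - mm).mean(axis=1)

def fit_lw_reduced(pm, mm, n_iter=50):
    """Alternating least squares for PM - MM = theta_i * phi_j + error.

    Assumption of this sketch: phi is rescaled so that sum_j phi_j^2 = J.
    """
    y = np.asarray(pm, dtype=float) - np.asarray(mm, dtype=float)
    n_probes = y.shape[1]
    phi = np.ones(n_probes)                      # hypothetical starting value
    for _ in range(n_iter):
        theta = y @ phi / (phi @ phi)            # least squares update of array effects
        phi = theta @ y / (theta @ theta)        # least squares update of probe affinities
        phi *= np.sqrt(n_probes / (phi @ phi))   # enforce the scale constraint
    theta = y @ phi / (phi @ phi)
    return theta, phi
```

When all probe affinities are equal, the two summaries coincide; otherwise the average difference implicitly weights every probe pair equally, whereas the model-based index weights probe pairs by their estimated affinity.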
2. The Full Li and Wong Model

2.1 The full model: A simple case

Following Li and Wong, the PM and MM intensities are modeled as:
\[ PM_{ij} = \nu_j + \theta_i \alpha_j + \theta_i \phi_j + \epsilon^{P}_{ij} \tag{2.1} \]
\[ MM_{ij} = \nu_j + \theta_i \alpha_j + \epsilon^{M}_{ij} \tag{2.2} \]
for $i = 1, \ldots, I$ and $j = 1, \ldots, J$, where $I$ denotes the number of samples and $J$ denotes the number of probe pairs in a probe set. $\theta_i$ is the expression index, $\nu_j$ is a non-specific cross-hybridization term, $\alpha_j$ is the rate of increase of the MM intensity and $\phi_j$ is the additional rate of increase of the PM intensity.
Figure 1: Correlation between PM and MM (histogram; x-axis: Cor(PM,MM), y-axis: Frequency)

Figure 2: Standard deviation of PM (histogram; x-axis: Stdv(PM), y-axis: Frequency)
Although this model was introduced by Li and Wong, they have only treated the reduced case, which we will call RLW:
\[ PM_{ij} - MM_{ij} = \theta_i \phi_j + \epsilon_{ij}, \quad \epsilon_{ij} \sim N(0, \sigma^2). \]
Lemon et al. (2002) use the above equations, but assume that the PM and MM values are independent, so their model describes the marginal distributions. Recently, Taib (2004) introduced a model in which it is assumed that the errors are correlated but with common variance and a constant correlation across samples. In general, these assumptions do not fit the observations, as we will see later. We therefore propose to augment the recent model so that the empirically observed correlation between PM and MM and the variances of PM and MM are allowed to change across the arrays, as is shown in Figures 1-3. More precisely, we assume that the error terms follow a bivariate normal distribution according to
\[ \begin{pmatrix} \epsilon^{P}_{ij} \\ \epsilon^{M}_{ij} \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_i^2 & \rho_i \sigma_i^2 \\ \rho_i \sigma_i^2 & \sigma_i^2 \end{pmatrix} \right), \]
where $\sigma_i^2$ is the variance and $\rho_i$ is the correlation coefficient. In the following this model will be called FLW1.

Figure 3: Standard deviation of MM (histogram; x-axis: Stdv(MM), y-axis: Frequency)
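To make the FLW1 error structure concrete, here is a minimal Python sketch that draws PM and MM intensities from (2.1)-(2.2) with array-specific variance and correlation. The function name and argument layout are hypothetical, and the correlated errors are built from two independent standard normal draws, which is one standard way to obtain a bivariate normal pair with a given correlation.

```python
import numpy as np

def simulate_flw1(theta, alpha, phi, nu, sigma, rho, rng):
    """Draw PM and MM from (2.1)-(2.2) with FLW1 errors.

    theta, sigma, rho have length I (one value per array);
    alpha, phi, nu have length J (one value per probe pair).
    """
    theta, sigma, rho = (np.asarray(a, float)[:, None] for a in (theta, sigma, rho))
    alpha, phi, nu = (np.asarray(a, float)[None, :] for a in (alpha, phi, nu))
    shape = (theta.shape[0], alpha.shape[1])
    z1 = rng.standard_normal(shape)
    z2 = rng.standard_normal(shape)
    eps_p = sigma * z1                                        # variance sigma_i^2
    eps_m = sigma * (rho * z1 + np.sqrt(1.0 - rho**2) * z2)   # same variance, correlation rho_i
    mm = nu + theta * alpha + eps_m
    pm = nu + theta * alpha + theta * phi + eps_p
    return pm, mm
```

For example, `simulate_flw1(theta, alpha, phi, nu, sigma, rho, np.random.default_rng(0))` returns two arrays of shape (I, J) whose within-array error correlation is close to the supplied `rho` when J is large.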
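2.2 The estimates

Given data $(PM_{ij}, MM_{ij})$ we can estimate the parameters of our model using maximum likelihood. It is known that the likelihood function of the bivariate normal distribution can be expressed as:
\[ L = \prod_{i,j} L(PM_{ij}, MM_{ij}, \theta_i, \alpha_j, \phi_j, \nu_j, \sigma_i, \rho_i) = \prod_{i,j} K_i \exp\left\{ -\frac{1}{2\sigma_i^2(1-\rho_i^2)} \left[ X_1^2 - 2\rho_i X_1 X_2 + X_2^2 \right] \right\} \]
where $X_1 = PM_{ij} - \nu_j - \theta_i \alpha_j - \theta_i \phi_j$, $X_2 = MM_{ij} - \nu_j - \theta_i \alpha_j$, and $K_i = 1 / (2\pi \sigma_i^2 \sqrt{1-\rho_i^2})$ is the normalizing constant. The corresponding log likelihood function is
\[ l = \sum_{i,j} \log(K_i) - \sum_{i,j} \frac{1}{2\sigma_i^2(1-\rho_i^2)} \left[ X_1^2 - 2\rho_i X_1 X_2 + X_2^2 \right]. \]
To get the estimates of the parameters we take the partial derivatives with respect to the corresponding parameters and set the resulting expressions equal to zero. Hence, we obtain:
\[ \hat{\phi}_j = \frac{\sum_i \frac{\theta_i}{\sigma_i^2(1-\rho_i^2)} \left[ (PM_{ij} - \rho_i MM_{ij}) - (1-\rho_i)(\nu_j + \theta_i \alpha_j) \right]}{\sum_i \frac{\theta_i^2}{\sigma_i^2(1-\rho_i^2)}} \]
\[ \hat{\alpha}_j = \frac{\sum_i \frac{\theta_i}{\sigma_i^2(1+\rho_i)} \left[ PM_{ij} + MM_{ij} - 2\nu_j - \theta_i \phi_j \right]}{2 \sum_i \frac{\theta_i^2}{\sigma_i^2(1+\rho_i)}} \]
\[ \hat{\nu}_j = \frac{\sum_i \frac{1}{\sigma_i^2(1+\rho_i)} \left[ (PM_{ij} - \theta_i \alpha_j - \theta_i \phi_j) + (MM_{ij} - \theta_i \alpha_j) \right]}{2 \sum_i \frac{1}{\sigma_i^2(1+\rho_i)}} \]
\[ \hat{\theta}_i = \frac{A_i + B_i}{\sum_j \left[ \phi_j^2 + 2(1-\rho_i)\alpha_j^2 + 2(1-\rho_i)\alpha_j \phi_j \right]} \]
\[ \hat{\sigma}_i^2 = \frac{\sum_j (X_1^2 - 2\rho_i X_1 X_2 + X_2^2)}{2J(1-\rho_i^2)}, \qquad \hat{\rho}_i = \frac{\sum_j X_1 X_2}{J \sigma_i^2}, \]
where
\[ A_i = \sum_j \phi_j \left[ PM_{ij} - \rho_i MM_{ij} - (1-\rho_i)\nu_j \right], \qquad B_i = (1-\rho_i) \sum_j \alpha_j \left[ PM_{ij} + MM_{ij} - 2\nu_j \right]. \]
The last two equations can be written as:
\[ \hat{\sigma}_i^2 = \frac{\sum_j (X_1^2 + X_2^2)}{2J}, \qquad \hat{\rho}_i = \frac{2 \sum_j X_1 X_2}{\sum_j (X_1^2 + X_2^2)}. \]
These formulas have to be understood as steps in an iterative procedure that will lead to final estimates. In this paper we will not be concerned with solving these equations. However, they are useful when it comes to deriving various properties. If we assume the other parameters to be known, it is easy to see that $\hat{\theta}_i$ is an unbiased estimate of $\theta_i$ since $E[\hat{\theta}_i] = \theta_i$. For the variance, we get:
\[ \mathrm{Var}(\hat{\theta}_i) = \frac{\sigma_i^2 (1-\rho_i^2)}{\sum_j \left[ \phi_j^2 + 2(1-\rho_i)\alpha_j^2 + 2(1-\rho_i)\alpha_j \phi_j \right]} \tag{2.3} \]

2.3 Comparisons between FLW1 and RLW

In this section, we give a brief description of the reduced Li and Wong model and compare the estimates obtained in each model in terms of accuracy (bias) and precision (variance). For the RLW model, we recall that:
\[ Y_{ij} := PM_{ij} - MM_{ij} = \theta_i \phi_j + \epsilon_{ij}, \qquad \sum_j \phi_j^2 = J, \qquad \epsilon_{ij} \sim N(0, \sigma^2). \]
The estimated expression index $\hat{\theta}_i$ can be obtained using maximum likelihood or least squares. Hence
\[ \hat{\theta}_i = \frac{\sum_j Y_{ij} \phi_j}{\sum_j \phi_j^2}. \]
The variance of the estimate, based on the assumptions of the RLW model, is
\[ \mathrm{Var}(\hat{\theta}_i) = \frac{\sigma^2}{J}. \]
But, based on the FLW1 assumptions, under which $\mathrm{Var}(Y_{ij}) = 2\sigma_i^2(1-\rho_i)$, one can easily show that
\[ \mathrm{Var}(\hat{\theta}_i) = \frac{2\sigma_i^2 (1-\rho_i)}{\sum_j \phi_j^2} \tag{2.4} \]
and it is easy to see that (2.3) $\le$ (2.4): for nonnegative $\alpha_j$ and $\phi_j$ the denominator of (2.3) is at least $\sum_j \phi_j^2$, and $1-\rho_i^2 \le 2(1-\rho_i)$. Given the Li and Wong model, one could choose a suitable model based on the distribution of the errors. Another important point for the selection of a convenient estimate is unbiasedness and low variance. Since we have shown that the corresponding $\hat{\theta}_i$ for our model is an unbiased estimate with low variance, and according to the comparison above, we see that the full model should be a good choice.

2.4 The full model: A general case

In this section, we propose to augment the last model to take into account the difference of the empirically observed variances between PM and MM, as shown in Figure 4.

Figure 4: Difference between standard deviation of PM and MM (histogram; x-axis: Stdv(PM) - Stdv(MM), y-axis: Frequency)

We will then assume that the error terms in (2.1) and (2.2) are distributed according to
\[ \begin{pmatrix} \epsilon^{P}_{ij} \\ \epsilon^{M}_{ij} \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_{1,i}^2 & \rho_i \sigma_{1,i} \sigma_{2,i} \\ \rho_i \sigma_{1,i} \sigma_{2,i} & \sigma_{2,i}^2 \end{pmatrix} \right), \]
where $\sigma_{1,i}^2$ and $\sigma_{2,i}^2$ are the variances and $\rho_i$ is the corresponding correlation coefficient. From now on, we will call this model the FLW2 model.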
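The effect of this general error structure on the precision of the expression index can be explored numerically. The Python sketch below assumes all parameters other than $\theta_i$ are known and computes, for a single array, the inverse Fisher information for $\theta_i$ under the bivariate model, next to the variance of the RLW estimate $\sum_j Y_{ij}\phi_j / \sum_j \phi_j^2$ under the same errors. The function names are hypothetical and the matrix form of the Fisher information is a standard result rather than a formula quoted from this paper; with $\sigma_{1,i} = \sigma_{2,i}$ the two quantities reduce to (2.3) and (2.4).

```python
import numpy as np

def var_theta_full(alpha, phi, sigma1, sigma2, rho):
    """Inverse Fisher information for theta_i on one array, other parameters known."""
    alpha = np.asarray(alpha, float)
    phi = np.asarray(phi, float)
    cov = np.array([[sigma1**2, rho * sigma1 * sigma2],
                    [rho * sigma1 * sigma2, sigma2**2]])
    prec = np.linalg.inv(cov)
    # Derivative of (E[PM_ij], E[MM_ij]) with respect to theta_i, shape (2, J).
    d = np.stack([alpha + phi, alpha])
    info = np.einsum("aj,ab,bj->", d, prec, d)
    return 1.0 / info

def var_theta_reduced(phi, sigma1, sigma2, rho):
    """Variance of the RLW estimate when PM and MM errors follow the bivariate model."""
    phi = np.asarray(phi, float)
    var_diff = sigma1**2 + sigma2**2 - 2.0 * rho * sigma1 * sigma2
    return var_diff / np.sum(phi**2)
```

Evaluating both functions on a grid of `rho`, `sigma1` and `sigma2` values gives a quick feel for how much precision the difference-based index gives up when PM and MM errors are strongly correlated or have unequal variances.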
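In this case, the likelihood function has the form
\[ L = \prod_{i,j} L(PM_{ij}, MM_{ij}, \theta_i, \alpha_j, \phi_j, \nu_j, \sigma_{1,i}, \sigma_{2,i}, \rho_i) = \prod_{i,j} K_i \exp\left\{ -\frac{1}{2(1-\rho_i^2)} \left[ \frac{X_1^2}{\sigma_{1,i}^2} - 2\rho_i \frac{X_1 X_2}{\sigma_{1,i} \sigma_{2,i}} + \frac{X_2^2}{\sigma_{2,i}^2} \right] \right\}. \]
The same computations as above lead to the maximum likelihood estimates of the parameters:
\[ \hat{\phi}_j = \frac{\sum_i \frac{\theta_i}{\sigma_{1,i}^2(1-\rho_i^2)} \left[ \left( PM_{ij} - \rho_i \frac{\sigma_{1,i}}{\sigma_{2,i}} MM_{ij} \right) - \left( 1 - \rho_i \frac{\sigma_{1,i}}{\sigma_{2,i}} \right)(\nu_j + \theta_i \alpha_j) \right]}{\sum_i \frac{\theta_i^2}{\sigma_{1,i}^2(1-\rho_i^2)}} \]
\[ \hat{\alpha}_j = \frac{\sum_i \theta_i \left[ a_i PM_{ij} + b_i MM_{ij} - \nu_j (a_i + b_i) - a_i \theta_i \phi_j \right]}{\sum_i \theta_i^2 (a_i + b_i)} \]
\[ \hat{\nu}_j = \frac{\sum_i \left[ a_i (PM_{ij} - \theta_i \alpha_j - \theta_i \phi_j) + b_i (MM_{ij} - \theta_i \alpha_j) \right]}{\sum_i (a_i + b_i)} \]
\[ \hat{\theta}_i = \frac{A_i + B_i}{\sum_j \left[ \frac{\phi_j^2}{\sigma_{1,i}^2(1-\rho_i^2)} + (a_i + b_i)\alpha_j^2 + 2 a_i \alpha_j \phi_j \right]} \]
\[ \hat{\sigma}_{1,i}^2 = \frac{\sum_j X_1^2}{J}, \qquad \hat{\sigma}_{2,i}^2 = \frac{\sum_j X_2^2}{J}, \qquad \hat{\rho}_i = \frac{\sum_j X_1 X_2}{\sqrt{\left( \sum_j X_1^2 \right) \left( \sum_j X_2^2 \right)}} \]
where
\[ A_i = \sum_j \phi_j \left[ \frac{1}{1-\rho_i^2} \left( \frac{PM_{ij}}{\sigma_{1,i}^2} - \rho_i \frac{MM_{ij}}{\sigma_{1,i} \sigma_{2,i}} \right) - a_i \nu_j \right], \qquad B_i = \sum_j \alpha_j \left[ a_i PM_{ij} + b_i MM_{ij} - (a_i + b_i)\nu_j \right] \]
and
\[ a_i = \frac{1}{\sigma_{1,i}^2 (1-\rho_i^2)} \left( 1 - \rho_i \frac{\sigma_{1,i}}{\sigma_{2,i}} \right), \qquad b_i = \frac{1}{\sigma_{2,i}^2 (1-\rho_i^2)} \left( 1 - \rho_i \frac{\sigma_{2,i}}{\sigma_{1,i}} \right). \]
Given the other parameters, it is thus easy to see that the estimate $\hat{\theta}_i$ of the expression index is unbiased. For the variance we get
\[ \mathrm{Var}(\hat{\theta}_i) = \frac{1}{\sum_j \left[ \frac{\phi_j^2}{\sigma_{1,i}^2(1-\rho_i^2)} + (a_i + b_i)\alpha_j^2 + 2 a_i \alpha_j \phi_j \right]} \tag{2.5} \]
On the other hand, the variance of $\hat{\theta}_i$ based on the RLW is
\[ \mathrm{Var}(\hat{\theta}_i) = \frac{\sigma_{1,i}^2 + \sigma_{2,i}^2 - 2\rho_i \sigma_{1,i} \sigma_{2,i}}{\sum_j \phi_j^2} \tag{2.6} \]
and it is not easy to compare these variances. For example, when $a_i \ge 0$ we have (2.5) $\le$ (2.6). In general, we use data from the spike-in studies HGU95A and HGU133 to make this comparison (see Figures 5-6), and we see that (2.5) $\le$ (2.6) for almost all data (99 per cent of data).

Figure 5: Ratio of log-variance between FLW2 and RLW - HGU133 (histogram; x-axis: log(VFLW/VRLW), y-axis: Frequency)

Figure 6: Ratio of log-variance between FLW2 and RLW - HGU95A (histogram; x-axis: log(VFLW/VRLW), y-axis: Frequency)

3. Numerical Results and Conclusions

3.1 The model based on PM only

It has been observed that some MM probes may respond poorly to the changes in the expression level of the target gene, as discussed in Li and Wong (2001b). This phenomenon raised questions on the efficiency of using MM probes, and led some investigators to calculate fold changes using only PM probes. To investigate the relative performance of PM-only using RLW and FLW2, we modified the FLW2 model to estimate gene expression levels using only PM probes, and compared it to RLW. The modified FLW2 model becomes
\[ PM_{ij} = \nu_j + \theta_i \phi_j + \epsilon_{ij}, \quad \text{where } \epsilon_{ij} \sim N(0, \sigma_i^2). \]
The same procedure as above gives:
\[ \hat{\phi}_j = \frac{\sum_i \frac{\theta_i}{\sigma_i^2} (PM_{ij} - \nu_j)}{\sum_i \frac{\theta_i^2}{\sigma_i^2}}, \qquad \hat{\nu}_j = \frac{\sum_i \frac{1}{\sigma_i^2} (PM_{ij} - \theta_i \phi_j)}{\sum_i \frac{1}{\sigma_i^2}}, \]
\[ \hat{\theta}_i = \frac{\sum_j \phi_j (PM_{ij} - \nu_j)}{\sum_j \phi_j^2}, \qquad \hat{\sigma}_i^2 = \frac{\sum_j (PM_{ij} - \theta_i \phi_j - \nu_j)^2}{J}. \]
To evaluate how this model performs, we use a spike-in study HGU95A designed by Affymetrix.

3.2 Data

The HGU95A GeneChip spike-in data set is a subset of the data used to develop and validate the MAS 5.0 algorithm. Human cRNA fragments matching 16 probe-sets on the HGU95A GeneChip were added to the hybridization mixture of the arrays at concentrations ranging from 0 to 1024 picomolar. The same hybridization mixture, obtained from a common tissue source, was used for all arrays. The cRNAs were spiked-in at a different concentration on each array (apart from replicates) arranged in a cyclic Latin square design with each concentration appearing once in each row and column. Within each experiment, only the spike-in concentrations are varied; the background is the same for all arrays. Fold change calculations are always made within experiment to ensure that only spiked-in genes will be differentially expressed. For more details see (http://www.affymetrix.com/analysis/downloadcenter.affx).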
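The PM-only estimating equations of Section 3.1 are naturally used as steps of an iterative procedure. The Python sketch below cycles through the four updates in turn; the function names, the starting values, the number of iterations and the small floor on the variances are assumptions of this sketch, not details taken from the paper. Note that the model as written does not fix the scale of $\theta_i \phi_j$ or the split between $\nu_j$ and $\theta_i \phi_j$, so fitted values, not individual parameters, are what the iteration pins down.

```python
import numpy as np

def pm_only_sweep(pm, theta, phi, nu, sigma2):
    """One pass through the PM-only estimating equations.

    pm: (I, J) intensities; theta, sigma2: length I; phi, nu: length J.
    """
    w = 1.0 / sigma2                                     # per-array weights 1 / sigma_i^2
    phi = (w * theta) @ (pm - nu) / np.sum(w * theta**2)
    nu = w @ (pm - np.outer(theta, phi)) / np.sum(w)
    theta = (pm - nu) @ phi / (phi @ phi)
    resid = pm - nu - np.outer(theta, phi)
    sigma2 = np.mean(resid**2, axis=1)
    return theta, phi, nu, sigma2

def fit_pm_only(pm, n_iter=100, floor=1e-8):
    """Iterate the sweep from a crude start; `floor` keeps the weights finite."""
    pm = np.asarray(pm, float)
    theta = pm.mean(axis=1)          # hypothetical starting values
    phi = np.ones(pm.shape[1])
    nu = np.zeros(pm.shape[1])
    sigma2 = np.ones(pm.shape[0])
    for _ in range(n_iter):
        theta, phi, nu, sigma2 = pm_only_sweep(pm, theta, phi, nu, sigma2)
        sigma2 = np.maximum(sigma2, floor)
    return theta, phi, nu, sigma2
```

A call such as `theta, phi, nu, sigma2 = fit_pm_only(pm)` on an (I, J) matrix of PM intensities for one probe set returns one expression index per array. In practice a convergence check on the change in fitted values would replace the fixed iteration count.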
3.3 Numerical results

This section is concerned with evaluating how the FLW2 based on PM-only performs. We present a numerical comparison between FLW2 and RLW using the spike-in study HGU95A GeneChip. We computed our estimates using the R environment, see Ihaka and Gentleman (1996), which can be freely obtained from (http://cran.r-project.org), and the methods for Affymetrix Oligonucleotide Arrays R package described in Irizarry et al. (2003a), which is freely available as part of the Bioconductor project http://www.bioconductor.org. We then use a benchmark for Affymetrix GeneChip expression measures developed by Cope et al. (2003), which aims to evaluate and compare summaries of Affymetrix probe level data. We submitted our data to the corresponding webtool, which is available at (http://affycomp.biostat.jhsph.edu). The results obtained are summarized in the tables below (see Tables 1-2). We got results for RLW from (http://affycomp.biostat.jhsph.edu/affy/rafajhu.edu/030519.1451/completeassessment.pdf) and results corresponding to FLW2 are given in the Affycomp-webtool report.

The score components for Table 1 are as follows:

1. Signal detect slope: Slope obtained from regressing expression values on nominal concentrations in the spike-in data.
2. Signal detect R2: R-squared obtained from regressing expression values on nominal concentrations in the spike-in data.
3. AUC (FP < 100): Area under the ROC curve up to 100 false positives.
4. AFP, call if fc > 2: Average false positives if we use fold-change > 2 as a cut-off.
5. ATP, call if fc > 2: Average true positives if we use fold-change > 2 as a cut-off.
6. IQR: Interquartile range of log ratios among genes not differentially expressed.
7. Obs-intended-fc slope: Slope obtained from regressing observed log-fold-changes against nominal log-fold-changes.
8. Obs-(low)int-fc slope: Slope obtained from regressing observed log-fold-changes against nominal log-fold-changes for genes with nominal concentrations less than or equal to 2.
9. FC = 2, AUC (FP < 100): Area under the ROC curve up to 100 false positives when comparing arrays with nominal fold changes of 2.
10. FC = 2, AFP, call if fc > 2: Average false positives if we use fold-change > 2 as a cut-off when comparing arrays where nominal fold-changes are 2.
11. FC = 2, ATP, call if fc > 2: Average true positives if we use fold-change > 2 as a cut-off when comparing arrays where nominal fold-changes are 2.

and for Table 2:

1. Median SD: Median SD across replicates.
2. null log-fc IQR: Inter-quartile range of the log-fold-changes from genes that should not change.
3. null log-fc 99.9%: 99.9% percentile of the log-fold-changes from the genes that should not change.
4. Signal detect R2: R-squared obtained from regressing expression values on nominal concentrations in the spike-in data.
5. Signal detect slope: Slope obtained from regressing expression values on nominal concentrations in the spike-in data.
6. low.slope: Slope from regression of observed log concentration versus nominal log concentration for genes with low intensities.
7. med.slope: As above but for genes with medium intensities.
8. high.slope: As above but for genes with high intensities.
9. Obs-intended-fc slope: Slope obtained from regressing observed log-fold-changes against nominal log-fold-changes.
10. Obs-(low)int-fc slope: Slope obtained from regressing observed log-fold-changes against nominal log-fold-changes for genes with nominal concentrations less than or equal to 2.
11. low AUC: Area under the ROC curve (up to 100 false positives) for genes with low intensity, standardized so that the optimum is 1.
12. med AUC: As above but for genes with medium intensities.
13. high AUC: As above but for genes with high intensities.
14. weighted avg AUC: A weighted average of the previous 3 ROC curves with weights related to the amount of data in each class (low, medium, high).

For more details we refer to Irizarry et al. (2003c).
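Several of the score components listed above are simple regression or spread summaries. The Python sketch below shows how a signal-detect slope and R-squared, and an interquartile range of null log ratios, could be computed; the function names are hypothetical, both inputs to the regression are assumed to be on the log2 scale already, and this is not the code used by the affycomp webtool.

```python
import numpy as np

def signal_detect(expression, nominal):
    """Slope and R-squared from regressing expression on nominal concentration.

    Assumption of this sketch: both inputs are already on the log2 scale.
    """
    x = np.asarray(nominal, float)
    y = np.asarray(expression, float)
    slope, intercept = np.polyfit(x, y, 1)       # ordinary least squares line
    fitted = slope * x + intercept
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return slope, 1.0 - ss_res / ss_tot

def null_iqr(log_ratios):
    """Interquartile range of log ratios for genes that should not change."""
    q25, q75 = np.percentile(log_ratios, [25, 75])
    return q75 - q25
```

A slope close to 1 with R-squared close to 1 indicates that the expression measure tracks the spiked-in concentration, while a small interquartile range indicates little spurious variation among unchanged genes, in line with the "Perfection" column of the tables.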
Table 1: Comparison results 1

                               FLW2-PMonly   RLW-PMonly   Perfection
Signal detect slope            0.480         0.533        1
Signal detect R2               0.85          0.846        1
AUC (FP < 100)                 0.783         0.674        1
AFP, call if fc > 2            7.331         36.907       0
ATP, call if fc > 2            10.78         11.47        16
IQR                            0.11          0.446        0
Obs-intended-fc slope          0.471         0.53         1
Obs-(low)int-fc slope          0.04          0.317        1
FC = 2, AUC (FP < 100)         0.460         0.167        1
FC = 2, AFP, call if fc > 2    6.81          8.64         0
FC = 2, ATP, call if fc > 2    1.000         1.50         16

Table 2: Comparison results 2

                               FLW2-PMonly   RLW-PMonly   Perfection
Median SD                      0.066         0.13         0
null log-fc IQR                0.105         0.04         0
null log-fc 99.9%              0.656         1.437        0
Signal detect R2               0.85          0.846        1
Signal detect slope            0.480         0.533        1
low.slope                      0.138         0.49         1
med.slope                      0.547         0.641        1
high.slope                     0.404         0.390        1
Obs-intended-fc slope          0.471         0.53         1
Obs-(low)int-fc slope          0.04          0.317        1
low AUC                        0.95          0.041        1
med AUC                        0.831         0.0          1
high AUC                       0.61          0.011        1
weighted avg AUC               0.47          0.079        1

4. Conclusions

We have presented a comparison between the reduced and full forms of the Li and Wong models using either the full bivariate or PM-only models. To understand the difference in the performance of calls generated by these two models, we used both theoretical and numerical criteria. To choose a model, one can compare candidates in terms of accuracy (unbiasedness or low bias) and precision (low variance). We have shown that FLW1 has less variance than RLW. Furthermore, using the spike-in study, it seems clear that FLW2 has considerably less variance than RLW. We also see that the PM-only model provides important improvements in various aspects compared to the same model based on RLW.

References

Affycomp-webtool (2005). Bioconductor expression assessment tool for Affymetrix oligonucleotide arrays (affycomp). Report.

Affymetrix (1999). Microarray Suite User Guide, Version 4.

Affymetrix (2001). Microarray Suite User Guide, Version 5.

Cope, L. M., Irizarry, R. A., Jaffee, H., Wu, Z. and Speed, T. P. (2003). A benchmark for Affymetrix GeneChip expression measures. Bioinformatics 20, 323-331.

Ihaka, R. and Gentleman, R. (1996). R: a language for data analysis and graphics. J. Comput. Graph. Stat. 5, 299-314.

Irizarry, R., Gautier, L. and Cope, L. (2003a). An R package for analyses of Affymetrix oligonucleotide arrays. In The Analysis of Gene Expression Data: Methods and Software (Edited by Parmigiani, G., Garrett, E. S., Irizarry, R. A. and Zeger, S. L.), 313-341. Springer.

Irizarry, R., Hobbs, B., Collin, F., Beazer-Barclay, Y., Antonellis, K., Scherf, U. and Speed, T. (2003c). Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4, 249-264.

Lemon, W. J., Palatini, J. J. T., Krahe, R. and Wright, F. A. (2002). Theoretical and experimental comparisons of gene expression indexes for oligonucleotide arrays. Bioinformatics 18, 1470-6.

Li, C. and Wong, W. H. (2001a). Model-based analysis of oligonucleotide arrays: Expression index computation and outliers detection. Proc. National Academy of Science 98, 31-36.

Li, C. and Wong, W. H. (2001b). Model-based analysis of oligonucleotide arrays: Model validation, design issues and standard error application. Genome Biology 2, research0032.1-0032.11.
Lockhart, D., Dong, H., Byrne, M., Follettie, M., Gallo, M., Chee, M., Mittmann, M., Wang, C., Kobayashi, M., Horton, H. and Brown, E. L. (1996). Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat. Biotechnol. 14, 1675-1680.

Srivastava, M. S. (2002). Methods of Multivariate Statistics. John Wiley.
Taib, Z. (2004). Statistical analysis of oligonucleotide microarray data. Comptes Rendus de l'Académie des Sciences 327, 175-180.

Received January 3, 2006; accepted April 3, 2006.

Mounir Aout
Department of Statistics and Data Processing
IUT de Caen (Lisieux)
11 Bd Jules Ferry
14100 Lisieux
France
m.aout@lisieux.iutcaen.unicaen.fr