Approximate Variance-Stabilizing Transformations for Gene-Expression Microarray Data

David M. Rocke, Department of Applied Science, University of California, Davis, Davis, CA 95616, dmrocke@ucdavis.edu
Blythe Durbin, Department of Statistics, University of California, Davis, Davis, CA 95616, bpdurbin@ucdavis.edu

October 19, 2002

Abstract

Motivation. A variance-stabilizing transformation for microarray data was introduced independently by Durbin et al. (2002), Huber et al. (2002), and Munson (2001), called the generalized logarithm or glog by Munson. In this paper, we derive several alternative, approximate variance-stabilizing transformations that may be easier to use in some applications.

Results. We demonstrate that the started-log and the log-linear-hybrid transformation families can produce approximate variance-stabilizing transformations for microarray data that are nearly as good as the glog transformation of Durbin et al. (2002), Huber et al. (2002), and Munson (2001). These transformations may be more convenient in some applications.

Contact. dmrocke@ucdavis.edu

Keywords. cDNA array, generalized logarithm, log-linear hybrid, microarray, normalization, started logarithm, statistical analysis, transformation.

To whom correspondence should be addressed.

1 Introduction

Many traditional statistical methodologies, such as regression or the analysis of variance, are based on the assumptions that the data are normally distributed (or at least symmetrically distributed), with constant variance not depending on the mean of the data. If these assumptions are violated, the statistician may choose either to develop some new statistical technique which accounts for the specific ways in which the data fail to comply with the assumptions, or to transform the data. Where possible, data transformation is generally the easier of these two options (see Box and Cox, 1964, and Atkinson, 1985).

Data from gene-expression microarrays, which allow measurement of the expression of thousands of genes simultaneously, can yield invaluable information about biology through statistical analysis. However, microarray data fail rather dramatically to conform to the canonical assumptions required for analysis by standard techniques. Rocke and Durbin (2001) demonstrate that the measured expression levels from microarray data can be modeled as

    y = α + µe^η + ε    (1)

where y is the measured raw expression level for a single color, α is the mean background noise, µ is the true expression level, and η and ε are normally distributed error terms with mean 0 and variances σ_η² and σ_ε², respectively. This model also works well for Affymetrix GeneChip arrays, applied either to the PM−MM data or to individual oligos. The variance of y under this model is

    Var(y) = µ²S_η² + σ_ε²,    (2)

where S_η² = e^{σ_η²}(e^{σ_η²} − 1).

In Durbin et al. (2002), Huber et al. (2002), and Munson (2001), it was shown that for a random variable z satisfying V(z) = a² + b²µ², with E(z) = µ, there is a transformation that stabilizes the variance to the first order, meaning that the variance is almost constant no matter what the mean might be. There are several equivalent ways of writing this transformation, but we will use

    f_c(z) = ln((z + √(z² + c²))/2),    (3)

where c = a/b. This transformation converges to ln(z) for large z, and is approximately linear at 0 (Durbin et al. 2002). Since this is exactly the
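As a numerical check on these formulas (a sketch in Python, not part of the original paper), the glog and its inverse are simple to implement; the following verifies the round trip and the convergence to ln(z) for large z, using the illustrative value c = a/b with the parameters quoted later in the example (a = 4,800, b = 0.227):

```python
import math

def glog(z, c):
    """Generalized logarithm (3): ln((z + sqrt(z^2 + c^2)) / 2)."""
    return math.log((z + math.sqrt(z * z + c * c)) / 2.0)

def glog_inv(w, c):
    """Inverse transformation: e^w - c^2 * e^(-w) / 4."""
    return math.exp(w) - c * c * math.exp(-w) / 4.0

c = 4800.0 / 0.227  # c = a/b, parameter values from the example below

# The glog is defined for all z (including z <= 0) and inverts exactly.
for z in (-5000.0, 0.0, 250.0, 1.0e6):
    assert abs(glog_inv(glog(z, c), c) - z) <= 1e-6 * max(1.0, abs(z))

# For large z the glog converges to ln(z).
assert abs(glog(1.0e9, c) - math.log(1.0e9)) < 1e-6
```

Note that, unlike ln(z), the transformation accepts the negative values that background correction routinely produces.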

natural logarithm when c = 0, it was called the generalized logarithm or glog transformation by Munson (2001), a terminology that we adopt. The inverse transformation is

    f_c^{−1}(w) = e^w − c²e^{−w}/4.

Both f_c and its inverse are monotonic functions, defined for all values of z and w, with derivatives of all orders. For array data, we use z = y − α or z = y − α̂, so that the random variable satisfies (exactly or approximately) V(z) = a² + b²E(z)².

2 The Started Logarithm

In some situations, it may not be convenient to use the glog transformation (3). In particular, the supposed ease of interpretation of log ratios has provided a major justification for use of the log transformation on microarray data. However, for a random variable z satisfying E(z) = µ and V(z) = a² + b²µ², the logarithmic transformation ln(z) has certain disadvantages. The delta method (i.e., propagation of errors) shows that V(ln(z)) ≈ b² + a²/µ², which goes to infinity as µ → 0. Furthermore, when µ = 0, z will frequently be non-positive, so that the transformation is not defined.

A common modification of the logarithmic transformation, designed at a minimum to avoid negative arguments, is to add a constant to all of the values before taking the logarithm. Following Tukey (1964; 1977) we call this the started logarithm; its form is

    g_c(z) = ln(z + c)

with c > 0. This transformation can, given an appropriate constant c, mitigate some of the problems with negative observations that plague the log transformation. A transformed observation g_c(z) has approximate variance function

    V(g_c(z)) = (a² + b²µ²)/(µ + c)².    (4)

This will not completely stabilize the variance of z if the variance function is (2), but we can ask for the choice of constant c which minimizes the maximum deviation from constancy. An examination of the function (4) shows that it takes the value a²/c² at µ = 0 and has an asymptote at b² as µ → ∞. We will focus on the deviation of the variance from the limiting value b².
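To make the behavior of (4) concrete, a short numeric sketch (Python; the constant c = 17,781 is the optimal value derived below, quoted here only for illustration) evaluates the delta-method variance at µ = 0, at large µ, and near the interior minimum:

```python
def startedlog_var(mu, a, b, c):
    """Delta-method variance (4) of ln(z + c) when Var(z) = a^2 + b^2*mu^2."""
    return (a * a + b * b * mu * mu) / ((mu + c) ** 2)

a, b, c = 4800.0, 0.227, 17781.0   # example values from the paper

# At mu = 0 the variance is a^2/c^2; for large mu it approaches b^2.
assert abs(startedlog_var(0.0, a, b, c) - (a / c) ** 2) < 1e-12
assert abs(startedlog_var(1.0e9, a, b, c) - b * b) < 1e-3

# The variance dips below its asymptote b^2, with an interior minimum
# near mu = a^2 / (b^2 * c).
mu_min = a * a / (b * b * c)
assert startedlog_var(mu_min, a, b, c) < b * b
```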

The derivative of (4) with respect to µ is

    [2b²µ(µ + c)² − 2(a² + b²µ²)(µ + c)] / (µ + c)⁴.    (5)

The denominator of (5) is never zero for µ ≥ 0, so any change in sign of the derivative will occur where

    2b²µ(µ + c)² − 2(a² + b²µ²)(µ + c) = 0

or µ = a²/(b²c). Note also that the derivative of the variance function at µ = 0 is −2a²/c³ < 0 (so long as c > 0), indicating that the variance decreases initially, before increasing again at µ = a²/(b²c). It is clear that the value of c that minimizes the maximum deviation of (4) from b² is the one for which the variance at 0 (namely a²/c²) is as much above b² as the variance at the minimum is below b² (see Figure 1). Since the minimum is at µ = a²/(b²c), the variance at the minimum is

    (a² + b²a⁴/(b⁴c²)) / (a²/(b²c) + c)² = a²b²/(a² + b²c²).

The condition to minimize the maximum deviation from constant variance is

    a²/c² − b² = b² − a²b²/(a² + b²c²)

or

    c = a/(2^{1/4}b).

The achieved maximum deviation is (√2 − 1)b², and the ratio of the standard deviation at 0 to the asymptotic standard deviation b is about 1.2.

We illustrate this transformation with a case from Durbin et al. (2002) in which α = 24,800, a = 4,800, and b = 0.227. Figure 1 shows the standard deviation function for the optimal started-log transformation with c = a/(2^{1/4}b) = 17,781, as well as for two other values of c. The dashed line shows the value b, which is the value that all of the transformations tend to as the expression gets large. The upper (dotted) curve is for c = 0, corresponding to the logarithm of the background-corrected data. The standard deviation approaches infinity as the estimated expression approaches 0. The lower curve (dot-dash) is for c = 24,800, corresponding to the log uncorrected intensity. Here the variance at zero and at the minimum is too
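The minimax condition can be checked numerically; this sketch (Python, using the parameter values quoted above) confirms both the closed form c = a/(2^{1/4}b) and the balance between the deviation above b² at µ = 0 and the deviation below b² at the interior minimum:

```python
a, b = 4800.0, 0.227                # values from the example above

c = a / (2 ** 0.25 * b)             # minimax constant c = a / (2^(1/4) b)
print(round(c))                     # ~17781, the value quoted in the text

# Deviation above b^2 at mu = 0 ...
above = (a / c) ** 2 - b ** 2
# ... equals the deviation below b^2 at the interior minimum (equation for
# the variance at mu = a^2/(b^2 c) derived in the text).
below = b ** 2 - a * a * b * b / (a * a + b * b * c * c)
assert abs(above - below) < 1e-12

# The minimized maximum deviation is (sqrt(2) - 1) * b^2.
assert abs(above - (2 ** 0.5 - 1) * b ** 2) < 1e-12
```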

low. The optimal choice of c = 17,781 (middle curve, solid line) has the correct balance between the two. In this case, the logarithm of the raw intensity data is not too bad. There is no guarantee that this would be true in general, since the zero of the intensity scale is rather arbitrary.

3 Log-Linear Hybrid

According to the two-component model (1), the untransformed data have approximately constant variance for µ close to 0 and approximately constant coefficient of variation for large µ. This suggests that we might use a linear transformation for small z and a log transformation for large z. Keeping this in mind, another variant of the logarithm that may be appropriate for microarray data is the log-linear hybrid transformation (Holder et al. 2001). Here we take the transformation to be ln(z) for z greater than some cutoff k, and a linear function cz + d below that cutoff. This eliminates the singularity at zero. We choose c and d so that the transformation is continuous with a continuous derivative at k. These requirements give the two equations

    ck + d = ln(k)
    c = 1/k

and thus d = ln(k) − 1. Thus, our transformation family is

    h_k(z) = z/k + ln(k) − 1,   z ≤ k
           = ln(z),             z > k.    (6)

The asymptotic delta-method variance function is given by

    V(h_k(z)) = (a² + b²µ²)/k²,   z ≤ k
              = b² + a²/µ²,       z > k.    (7)

Note that the two expressions agree at the splice point, due to the choice of c and d that makes the derivative continuous at k. It is easy to see that the choice of k that leads to the minimum deviation from constant variance is the one in which the variance at 0 is as much below b² as the variance at the splice point is above b². Thus

    b² − a²/k² = (b² + a²/k²) − b²

or

    k = √2·a/b.    (8)
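A small sketch of the family (6) (Python, illustrative only) confirms that value and slope match at the splice point, and that with k = √2·a/b from (8) the variance at zero, a²/k², is exactly half the asymptote b²:

```python
import math

def loglin_hybrid(z, k):
    """Log-linear hybrid (6): z/k + ln(k) - 1 for z <= k, ln(z) for z > k."""
    if z <= k:
        return z / k + math.log(k) - 1.0
    return math.log(z)

a, b = 4800.0, 0.227
k = math.sqrt(2.0) * a / b          # optimal splice point (8)

# Value continuity at the splice point:
assert abs(loglin_hybrid(k, k) - math.log(k)) < 1e-12

# Slope continuity: both one-sided difference quotients are ~1/k.
eps = 1e-6
left = (loglin_hybrid(k, k) - loglin_hybrid(k - eps, k)) / eps
right = (loglin_hybrid(k + eps, k) - loglin_hybrid(k, k)) / eps
assert abs(left - 1.0 / k) < 1e-7 and abs(right - 1.0 / k) < 1e-7

# With this k, the delta-method variance at 0, a^2/k^2, equals b^2/2.
assert abs((a / k) ** 2 - b * b / 2.0) < 1e-12
```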

Figure 2 shows the optimal log-linear hybrid (solid line), the optimal started log (dotted line), and the optimal glog transformation (dot-dash line). In this case, the started log has a smaller maximum deviation from constant variance, but this depends on the parameter values and can be reversed. Any of these transformations may be sufficient to stabilize the variance for practical purposes.

One can further reduce the maximum deviation from constant variance by employing both a linear segment and a started log, so that the transformation would be linear below a cutoff k and equal to ln(z + c) above that point. However, the extra complexity that this would entail would make this choice an unlikely alternative to the glog transformation of Durbin et al. (2002), Huber et al. (2002), and Munson (2001).

It should also be noted that the started log and the log-linear hybrid each correspond to a variance function. The started log will be the optimal variance-stabilizing transformation if V(z) = (E(z) + c)², and the log-linear hybrid will be optimal if the variance is constant at V(z) = k² when z < k and V(z) = E(z)² for z ≥ k. These functions will be difficult to distinguish from the variance function (2) generated by the two-component model (1), although it may be possible with large data sets. We prefer the transformation (3) corresponding to the variance function (2), because it is generated by the physically plausible model (1), but the results are likely to be similar if the parameters are chosen carefully.

4 Simulation Studies

The relative performance of each of the three transformations was tested on data simulated from the two-component model of Rocke and Durbin (2001). The parameters used were σ_η = 0.227 and σ_ε = 4800. We use the value b = σ_η = 0.227 rather than S_η = 0.236, since the logarithms of data distributed according to the two-component model have a standard deviation that tends exactly to σ_η for large µ. To the order we are working, these quantities are the same and make no practical difference for data analysis, but the difference can show up in large simulations. Data were simulated for values of µ ranging from 0 to 1,000,000 at increments of 5,000. For each value of µ, 1000 samples of size 1000 were simulated from z = µe^η + ε, where η ~ N(0, σ_η²) and ε ~ N(0, σ_ε²). The simulated data sets were transformed using each of the three transformations and used to calculate confidence intervals for the standard deviation and skewness of the transformed data. The optimal transformation within each family was used in all cases.

Figure 3 shows the standard deviation of the transformed simulated data, averaged over 1000 samples, for all three transformations. As would be expected, the glog transformation shows the most nearly constant standard deviation. The standard deviation of the data transformed using the log-linear-hybrid transformation stabilizes somewhat sooner than that using the started-log transformation, but otherwise these two transformations appear of similar quality. Graphs (not shown) of the actual and model-predicted standard deviation of simulated data transformed using each of the three transformations, averaged over 1000 samples, show that the simulated data conform closely to the theoretical values, supporting the use of the delta-method theory in this analysis.

Upon examining the standard deviation of simulated data for each of the three transformations, it appears that the glog transformation provides the most nearly constant variance of transformed data, followed by the log-linear-hybrid transformation. However, the skewness of the simulated data can also be informative, as symmetry of data is also important when applying standard statistical methodologies. Figure 4 shows the skewness of simulated data for each of the three transformations, averaged over 1000 samples. For a data set of size 1000, the skewness differs significantly from 0 at the 95% level if it is greater than 0.1518 in absolute value. The glog transformation shows significant skewness between µ = 10,000 and µ = 35,000, with a maximum skewness of −0.2475 occurring at µ = 15,000. The started-log transformation shows significant skewness for values of µ < 30,000, with a maximum skewness of −1.2254 occurring at µ = 0.
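The simulation design and the quoted 95% cutoff are easy to reproduce at a smaller scale. In the sketch below (Python; a reduced version with 5,000 draws per mean rather than the 1000 × 1000 design above), the cutoff 0.1518 is the usual large-sample bound 1.96·√(6/n) for sample skewness, and the glog-transformed standard deviations come out near b = 0.227 across the range of means:

```python
import math, random

random.seed(1)
sigma_eta, sigma_eps = 0.227, 4800.0       # parameters from Section 4
b, c = sigma_eta, sigma_eps / sigma_eta    # glog constant c = a/b

def glog(z, c):
    """Generalized logarithm (3)."""
    return math.log((z + math.sqrt(z * z + c * c)) / 2.0)

def sample_sd(xs):
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

# 95% significance cutoff for sample skewness at n = 1000:
# the large-sample standard error of skewness is ~sqrt(6/n).
cutoff = 1.96 * math.sqrt(6.0 / 1000)
assert abs(cutoff - 0.1518) < 1e-3

# Reduced simulation from the two-component model z = mu*e^eta + eps:
for mu in (0.0, 50_000.0, 500_000.0):
    zs = [mu * math.exp(random.gauss(0.0, sigma_eta))
          + random.gauss(0.0, sigma_eps) for _ in range(5000)]
    s = sample_sd([glog(z, c) for z in zs])
    assert 0.19 < s < 0.27       # roughly constant, near b = 0.227
```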
Finally, the log-linear-hybrid transformation shows significant skewness for values of µ between 35,000 and 65,000, with a maximum skewness of −0.227 occurring at µ = 45,000. The glog and log-linear-hybrid transformations appear to perform equivalently at symmetrizing the simulated data, and both do far better than the started-log transformation. Taking both variance stabilization and symmetry into account, the glog transformation appears to perform best on the simulated data, followed by the log-linear hybrid.

5 Example

Figures 5-7 show the results of applying the three transformations to the data from Durbin et al. (2002). All are much improved from the raw data

or the logarithms of the background-corrected data. Of these, the glog transformation (Figure 5) appears to have done the best job. The started log (Figure 6) has several high-variance genes at the low end that deviate more from constancy than is the case with the glog transformation. The log-linear hybrid (Figure 7) appears to have more low-variance genes near the low end (thus departing more from constancy of variance) than is the case with the variance-stabilizing transformation.

6 Conclusions

We have compared three transformation families, each optimized for stability of variance, for use with microarray data. Any of these could be usefully employed in this application, although evidence from theory and from an application suggests that the glog transformation of Durbin et al. (2002), Huber et al. (2002), and Munson (2001) is probably the best choice when it is convenient to use it.

Acknowledgements

The research reported in this paper was supported by grants from the National Science Foundation (ACI 96-19020 and DMS 98-70172) and the National Institute of Environmental Health Sciences, National Institutes of Health (P43 ES04699). The authors are grateful for helpful suggestions from three referees that improved the presentation of the paper.

References

Atkinson, A.C. (1985) Plots, Transformations, and Regression: An Introduction to Graphical Methods of Diagnostic Regression Analysis, Clarendon Press: Oxford.

Bartosiewicz, M., Trounstine, M., Barker, D., Johnston, R., and Buckpitt, A. (2000) Development of a toxicological gene array and quantitative assessment of this technology, Archives of Biochemistry and Biophysics, 376, 66-73.

Box, G.E.P., and Cox, D.R. (1964) An analysis of transformations, Journal of the Royal Statistical Society, Series B (Methodological), 26, 211-252.

Durbin, B.P., Hardin, J.S., Hawkins, D.M., and Rocke, D.M. (2002) A variance-stabilizing transformation for gene-expression microarray data, Bioinformatics, 18, S105-S110.

Hawkins, D.M. (2002) Diagnostics for conformity of paired quantitative measurements, Statistics in Medicine, 21, 1913-1935.

Holder, D., Raubertas, R.F., Pikounis, V.B., Svetnik, V., and Soper, K. (2001) Statistical analysis of high density oligonucleotide arrays: A SAFER approach, GeneLogic Workshop on Low Level Analysis of Affymetrix GeneChip Data.

Huber, W., von Heydebreck, A., Sültmann, H., Poustka, A., and Vingron, M. (2002) Variance stabilization applied to microarray data calibration and to the quantification of differential expression, Bioinformatics, 18, S96-S104.

Munson, P. (2001) A consistency test for determining the significance of gene expression changes on replicate samples and two convenient variance-stabilizing transformations, GeneLogic Workshop on Low Level Analysis of Affymetrix GeneChip Data.

Rocke, D., and Durbin, B. (2001) A model for measurement error for gene expression arrays, Journal of Computational Biology, 8, 557-569.

Tukey, J.W. (1964) On the comparative anatomy of transformations, Annals of Mathematical Statistics, 28, 602-632.

Tukey, J.W. (1977) Exploratory Data Analysis, Reading, MA: Addison-Wesley.

Figure 1. Standard deviation of the started-log transformation for three values of the constant c (optimal c; log background-corrected intensity; log intensity), with the asymptotic value b, plotted against expression.

Figure 2. Standard deviation of the optimal log-linear hybrid, the optimal started log, and the optimal glog, plotted against expression.

Figure 3. Standard deviation of simulated data for the glog, started-log, and log-linear-hybrid transformations, plotted against the mean.

Figure 4. Skewness of simulated data for the glog, started-log, and log-linear-hybrid transformations, plotted against the mean.

Figure 5. Spread vs. location for the generalized-log transformation (robust standard deviation of replicates against robust mean of replicates).

Figure 6. Spread vs. location for the started-log transformation (robust standard deviation of replicates against robust mean of replicates).

Figure 7. Spread vs. location for the log-linear-hybrid transformation (robust standard deviation of replicates against robust mean of replicates).