Environmental samples below the limits of detection: comparing regression methods to predict environmental concentrations


Daniel Smith, Elana Silver, Martha Harnly
Environmental Health Investigations Branch, California Department of Health Services, Richmond, CA

ABSTRACT

In exposure assessment studies of trace environmental contaminants, investigators often face the difficulty of dealing statistically with chemical values below the laboratory's limits of detection. For example, in a study of pesticides measured in the dust of homes near agricultural fields, the goal is to use characteristics of the household to predict pesticide concentrations, even though many of the measurements are below the detection limit. We illustrate two regression techniques that can be useful in this situation: quantile regression and tobit regression. Both are available in SAS, as PROC QUANTREG and PROC QLIM respectively, yet have not been used extensively by epidemiologists and environmental health scientists. Quantile regression predicts a given quantile, such as the 50th or 90th percentile, rather than the mean as in standard regression, thereby sidestepping uncertainties in the dependent variable at the lower end of the distribution. Tobit regression incorporates this uncertainty, analogous to censoring in survival analysis. Both techniques follow the familiar multiple regression paradigm and can be used to model the effect of several explanatory variables.

INTRODUCTION

Studies of trace environmental contaminants often challenge both the laboratory and the data analyst by having many values near the laboratory's limits of quantitation, and some below these limits. Consider the hypothetical example in the figure below. One hundred random numbers were drawn from a normal distribution with a mean of 10. The entire distribution is shown in the left panel.
If these were concentrations of an environmental compound, and the laboratory's limit of reporting were 7, then 21 of the 100 values in this case would fall below the limit (middle panel). The investigator would know the distribution of observations above 7, but only the number of observations below 7.

One way of dealing with this problem is to substitute a constant for the unknown values (right panel), such as half the detection limit or the detection limit divided by the square root of 2 (Hornung and Reed, 1990). While this method works well when there are relatively few below-detect observations, if the proportion of such values is sizable the distribution will be skewed and no longer normal. Further, the standard deviation may be over- or under-estimated, depending on where the unknown values are heaped. These problems can violate assumptions and bias results when the chemical concentration is to be used as a dependent variable in statistical analysis, such as multiple regression.

Other more complicated methods involve simulating values for the below-detects by sampling from hypothetical distributions, or imputing their values from the known covariates (Lubin et al., 2004). Such procedures take several steps and may not be familiar to the researcher.

Two simpler procedures can be used: quantile regression and tobit regression. These analyses can be accomplished in a single analytical step that follows the familiar multiple regression paradigm, and programs are available in popular statistical packages such as SAS, Stata, and R.

METHODS

To illustrate these procedures, we use data from the CHAMACOS study (Center for the Health Assessment of Mothers and Children of Salinas). House dust was sampled from 239 homes in the Salinas Valley of California, an area of intense agricultural activity, and analyzed for pesticides used in nearby fields. Residents of the homes were surveyed about the household members, including whether they were agricultural workers themselves.
The dataset also includes the amounts of pesticides applied on fields near the home, distance to the fields, and meteorological data. The goal is to predict pesticide concentrations in the house dust from these data.
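The hypothetical example from the Introduction can be approximated with a short simulation. This is only a sketch: the paper gives the mean (10) and the reporting limit (7) but not the standard deviation or a seed, so the standard deviation of 3.5 (chosen so that roughly a fifth of the draws fall below 7), the seed, and all data set and variable names are assumptions. The count of nondetects will not necessarily be 21.

```sas
/* Sketch only: sd = 3.5, the seed, and all names are assumptions */
data simdl;
   call streaminit(12345);
   lod = 7;                                 /* limit of reporting from the example */
   do i = 1 to 100;
      y_true    = rand('normal', 10, 3.5);  /* full distribution (unknown in practice) */
      nondetect = (y_true < lod);           /* 1 if below the limit, else 0 */
      if nondetect then do;
         y_half  = lod / 2;                 /* substitute half the limit */
         y_root2 = lod / sqrt(2);           /* substitute limit / sqrt(2) */
      end;
      else do;
         y_half  = y_true;
         y_root2 = y_true;
      end;
      output;
   end;
   drop i;
run;

/* Compare the true and substituted versions; mean of nondetect = proportion censored */
proc means data = simdl n mean std median;
   var y_true y_half y_root2 nondetect;
run;
```

As long as fewer than half of the values are nondetects, the three medians should agree, while the means and standard deviations will generally differ. That contrast is what motivates a rank-based approach such as quantile regression.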

As expected for a study of this type, many of the pesticide concentrations are below the laboratory's limits of quantitation. For example, diazinon had 34 of 239 samples (14%) below the limit of quantitation (which was 2 ng/g of dust), chlorpyrifos had 13% below, and chlorthal-dimethyl 6% below. DDT (which has not been used in the United States since 1972) was still detectable in some samples, but had 57% of values below the limits.

QUANTILE REGRESSION

Standard linear regression predicts the mean of the dependent variable Y for a particular value of the predictor variable X. In contrast, quantile regression predicts a given quantile of Y, such as the 50th or 90th percentile, for a given X. The 50th percentile of Y is of course the median, and is perhaps of most interest as an analog to the mean of Y. However, there may be other quantiles of interest, depending on the application. Quantile regression is considered a robust regression procedure, because a quantile such as the median depends on the ranks of the Y values, and not on specific values in the tails of the distribution. In the example above, even though 21% of the values are known only to be less than 7, the median of the entire distribution will be the same no matter what the unknown values really are.

Quantile regression in SAS is accomplished with PROC QUANTREG. QUANTREG is still experimental as of this writing, but is available for SAS 9.1 Windows on a test basis. We illustrate quantile regression by modeling the pesticide diazinon. The values below the detection limit can be represented in the data set by zeros or half the detection limit; the value given to them doesn't matter for computing quantiles, as long as they are less than the detectable values.
We use seven predictors: X3_00198 (the amount of pesticide applied in the last two weeks), tempmax_mean and precip_sum (measures of temperature and rainfall), dist (categories for the distance to the nearest field), agwk_nc (number of agricultural workers in the home), wkcloth4 (whether agricultural work clothes are worn or stored in the home), and ipov200 (whether the household income is within 200% of the poverty limit). These analyses are for illustrative purposes only, and do not show the final models used in the CHAMACOS study.

    proc quantreg;
       class dist;
       model Log10diazinon = x3_00198 tempmax_mean precip_sum dist agwk_nc wkcloth4 ipov200
             / quantile = 0.5;
    run;

Most of the QUANTREG syntax is familiar to users of SAS's regression programs. The option quantile = 0.5 specifies that we want to predict the median of Log10diazinon. We use the default procedure for calculating confidence intervals here (inverse rank); there is an option to calculate the intervals by resampling, shown later. Note that it is not necessary to take a log transformation of the diazinon concentration to induce normality, as normality is not an assumption of quantile regression. We do it here to compare results to other models.

    The QUANTREG Procedure

    Model Information
    Data Set                                      MYSAS.PEST04
    Dependent Variable                            Log10_diazinon_1   Log10 diazinon 1
    Number of Independent Variables               7
    Number of Continuous Independent Variables    6
    Number of Class Independent Variables         1
    Number of Observations                        239
    Optimization Algorithm                        Simplex
    Method for Confidence Limits                  Inv_Rank

    Number of Observations Read                   239
    Number of Observations Used                   239

    Class Level Information
    Name    Levels    Values
    DIST

QUANTREG prints out summary statistics. MAD is median absolute deviation, a robust measure of scale.

    Summary Statistics
    Variable        Q1    Median    Q3    Mean    Standard Deviation    MAD
    Log10_diazin

    Quantile and Objective Function
    Quantile                    0.5
    Objective Function
    Predicted Value at Mean

    Parameter Estimates
    Parameter       DF    Estimate    95% Confidence Limits
    Intercept
    x3_00198
    tempmax_mean
    precip_sum
    DIST
    DIST
    DIST
    agwk_nc
    WKCLOTH4
    ipov200

The interpretation of the regression coefficients is similar to other regression analyses. For example, the increase in the median Log10diazinon concentration associated with having work clothes in the home is estimated as 0.22 (95% confidence interval ). Transforming back to the original diazinon scale (10 raised to the power 0.22), this corresponds to multiplying the median diazinon concentration by 1.66 (95% confidence interval ).

The Stata program for quantile regression (called qreg) provides a goodness-of-fit measure for the model, referred to as pseudo-R². This is simply the proportional reduction in the sum of the absolute values of the residuals. The same statistic can be calculated from PROC QUANTREG by running the regression twice, once with the intercept only and once with the full model, and comparing the sums of the absolute residuals.

    proc quantreg data = mydata;
       model log10diazinon = / quantile = 0.5;   *Intercept only model;
       output out = intmodel res = resint;       *Save the data with residuals;
    run;

    proc quantreg data = intmodel;               *Bring in saved data to run full model;
       class dist;
       model Log10diazinon = x3_00198 tempmax_mean precip_sum dist agwk_nc wkcloth4 ipov200
             / quantile = 0.5;                   *Full model;
       output out = fullmodel res = resfull;     *Save these residuals too;
    run;

    data _null_;
       set fullmodel end = lastrow;              *Bring in the results of the previous run;
       sresfull + abs(resfull);                  *Sum up the absolute values of both residuals;
       sresint + abs(resint);
       if lastrow then do;
          pseudor2 = (sresint - sresfull)/sresint;
          put pseudor2 = ;
       end;
    run;
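Written out, the quantity accumulated by the data step above is the following, where the symbols are notation introduced here for illustration (y_i observed values, with the two sets of fitted medians from the intercept-only and full models):

```latex
R^{2}_{\text{pseudo}}
  = \frac{\sum_{i} \lvert y_i - \hat{y}_i^{\,\text{int}} \rvert
          - \sum_{i} \lvert y_i - \hat{y}_i^{\,\text{full}} \rvert}
         {\sum_{i} \lvert y_i - \hat{y}_i^{\,\text{int}} \rvert}
  = 1 - \frac{\sum_{i} \lvert \text{resfull}_i \rvert}
             {\sum_{i} \lvert \text{resint}_i \rvert}
```

This form applies to the median (quantile = 0.5). For other quantiles the absolute value would be replaced by the asymmetric check loss, \rho_\tau(u) = u\,(\tau - \mathbf{1}\{u < 0\}), which equals |u|/2 at \tau = 0.5, so the ratio is unchanged in the median case.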

Another common goodness-of-fit statistic is the squared correlation coefficient between the observed and the predicted values. This is easily accomplished by adding predicted = pred to the output statement of the full model above, then invoking PROC CORR:

    proc corr data = fullmodel;
       var Log10diazinon pred;
    run;

Finally, ODS graphics can provide a display of the relationship of the predictor variables to all the quantiles.

    ods rtf;
    ods graphics on;
    proc quantreg ci=resampling;
       class dist;
       model Log10_diazinon_1 = tempmax_mean precip_sum dist agwk_nc wkcloth4 ipov200
             / quantile=all plot=quantplot (wkcloth4) / unpackpanel;
    run;
    ods graphics off;
    ods rtf close;

We specify that we want the coefficient for wearing work clothes in the home plotted for all quantiles, with confidence intervals created by resampling. The plot provides the regression coefficient for WKCLOTH4 across all quantiles of Log10diazinon. Through most of the range, the coefficient is around 0.2, as we saw above in the regression on the median. The null part of the curve for quantiles less than the 15th percentile corresponds to the range where all the values were below the detection limit. Between the 15th and 20th percentiles, the regression coefficient is high, indicating that wearing or storing work clothes in the home has a greater impact on diazinon for homes with relatively lower concentrations.

A discussion of quantile regression with examples from ecology is provided in the review by Cade and Noon (2003).
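The squared-correlation step described earlier in this section could be carried through to a single number roughly as follows. This is a sketch under assumptions: mydata is the placeholder data set name used in the pseudo-R² example, and corrout is a hypothetical name for the saved correlation matrix.

```sas
/* Full model, saving both residuals and predicted medians */
proc quantreg data = mydata;
   class dist;
   model Log10diazinon = x3_00198 tempmax_mean precip_sum dist
                         agwk_nc wkcloth4 ipov200 / quantile = 0.5;
   output out = fullmodel res = resfull predicted = pred;
run;

/* Save the Pearson correlation matrix instead of printing it */
proc corr data = fullmodel outp = corrout noprint;
   var Log10diazinon pred;
run;

/* The row for pred holds its correlation with Log10diazinon; square it */
data _null_;
   set corrout;
   where _type_ = 'CORR' and upcase(_name_) = 'PRED';
   r2 = Log10diazinon ** 2;
   put 'Squared correlation (observed vs. predicted): ' r2 6.4;
run;
```

Because the predicted values are conditional medians rather than means, this statistic is best read as a descriptive summary and not as the share of variance explained in the least-squares sense.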

TOBIT REGRESSION

Tobit regression (named for Nobel prize-winning economist James Tobin) treats observations below the limit of detection as censored values lying somewhere between zero and the detection limit, and adjusts the variance accordingly (Lubin et al., 2004). This can be viewed as survival analysis with left censored data, and indeed tobit regression can be accomplished with a survival analysis program such as PROC LIFEREG:

    proc lifereg;
       class distnew;
       model (logdiazlower, logdiazupper) = x3_00198 tempmax_mean precip_sum distnew agwk_nc wkcloth4 ipov200
             / distribution = normal;
    run;

Here, the program requires us to create two variables for our pesticide concentration, logdiazlower and logdiazupper. The variables have the identical value for each observation above the detection limit. But for observations below the detection limit, logdiazlower is set to missing (.) and logdiazupper is the detection limit (on the log scale). In other words, we tell the program that for these observations, the unknown value is bracketed between nothingness and the detection limit.[1] Tobit regression assumes that the dependent variable is normally distributed (unlike quantile regression, which doesn't care about the distribution), so we specify distribution = normal, and have log transformed our dependent variable in this case. The output informs us that we have 34 left censored observations (the below detects).

    The LIFEREG Procedure

    Model Information
    Data Set                       WORK.FORTOBIT
    Dependent Variable             log10lower
    Dependent Variable             log10upper
    Number of Observations         239
    Noncensored Values             205
    Right Censored Values          0
    Left Censored Values           34
    Interval Censored Values       0
    Name of Distribution           Normal
    Log Likelihood

    Number of Observations Read    239
    Number of Observations Used    239

    Class Level Information
    Name       Levels    Values
    distnew

    Algorithm converged.

[1] Tobin's original 1958 paper begins with an ode to nothingness.
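The two bracketing variables that LIFEREG expects could be built in a data step along these lines. This is a minimal sketch: the input data set mydata, the raw concentration variable diazinon, and the convention that nondetects are stored as some value below the limit are all assumptions; only the 2 ng/g limit and the names logdiazlower and logdiazupper come from the text.

```sas
data fortobit;
   set mydata;                          /* hypothetical input data set */
   lod = 2;                             /* detection limit, ng/g of dust */
   if missing(diazinon) then do;        /* not measured: leave out of the model */
      logdiazlower = .;
      logdiazupper = .;
   end;
   else if diazinon < lod then do;      /* below detect: left censored */
      logdiazlower = .;                 /* missing lower bound */
      logdiazupper = log10(lod);        /* upper bound is the log detection limit */
   end;
   else do;                             /* detected: both bounds equal the value */
      logdiazlower = log10(diazinon);
      logdiazupper = logdiazlower;
   end;
run;
```

The explicit missing() check is there because SAS treats a missing numeric value as smaller than any number, so an unmeasured sample would otherwise be coded as a nondetect.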

The output again follows the familiar multiple regression format.

    Analysis of Parameter Estimates
    Parameter       DF    Estimate    Standard Error    95% Confidence Limits    Chi-Square    Pr > ChiSq
    Intercept
    x3_00198
    tempmax_mean
    precip_sum
    distnew
    distnew
    distnew
    agwk_nc
    WKCLOTH4
    ipov200

Comparing the estimate for the work clothes variable (WKCLOTH4), we see that we have a somewhat larger regression coefficient than with quantile regression, although we remind ourselves that quantile regression is predicting the median, and tobit regression predicts the mean, so the results are not strictly comparable.

SAS also provides a tobit regression program in SAS/ETS, called QLIM. It uses different terminology but gives the same results.

    proc qlim;
       class distnew;
       model Log10_diazinon = x3_00198 tempmax_mean precip_sum distnew agwk_nc wkcloth4 ipov200;
       endogenous Log10_diazinon ~ censored(lb = 0.30);
    run;

In the endogenous statement, we tell QLIM that the value of Log10diazinon is censored below the lower bound of 0.30 (which is log10 of the detection limit, 2 ng/g).

    The QLIM Procedure

    Summary Statistics of Continuous Responses
    Variable           Mean    Standard Error    Type        Lower Bound    Upper Bound    N Obs Lower Bound    N Obs Upper Bound
    Log10_diazinon_                              Censored

    Class Level Information
    Class      Levels    Values
    distnew

    Model Fit Summary
    Number of Endogenous Variables    1
    Endogenous Variable               Log10_diazinon_1
    Number of Observations            239
    Log Likelihood
    Maximum Absolute Gradient
    Number of Iterations              15
    AIC
    Schwarz Criterion

    Parameter Estimates
    Parameter       Estimate    Standard Error    t Value    Approx Pr > |t|
    Intercept
    x3_00198
    tempmax_mean
    precip_sum
    distnew
    distnew
    distnew
    agwk_nc
    WKCLOTH4
    ipov200

The same goodness-of-fit measures described above for quantile regression can be calculated for tobit regression, although the syntax for the output statement of QLIM differs from that of QUANTREG.

DISCUSSION

To compare results between the methods, we focus again on wearing or storing work clothes in the home, and summarize various regressions for three of the pesticides. These pesticides have different numbers of observations below detection limits. The regression coefficients predicting log pesticide concentration have been exponentiated to provide a multiplier of the mean or median pesticide concentration. We compare quantile and tobit regressions to a linear regression where values below the detection limit were replaced with half the detection limit.

For both diazinon and chlorthal-dimethyl, quantile regression on the medians provided lower point estimates than linear regression on the mean, despite the fact that a univariate examination of these variables showed medians close to the means. Quantile regression on the median of DDT is not possible because more than half the observations were below the detection limit. We note that the confidence intervals for quantile regression vary depending on the choice of the method used to calculate them. Little guidance is provided in the SAS documentation. Furthermore, we have observed (results not shown) that SAS's QUANTREG and the quantile regression programs in Stata and R all produce somewhat different confidence intervals, indicating that the programs are using different algorithms.

Generally, tobit regression provided point estimates closer to linear regression, but with varying confidence intervals depending on the proportion of below detect values (wider as the proportion of nondetects increases).
For DDT, tobit regression gives a much higher point estimate than linear regression where 57% of the DDT observations are heaped at one low value (half the detection limit). In addition, the confidence interval for DDT is much wider from tobit than from linear regression (the standard error is doubled). This reflects the true uncertainty from having so many of the observations censored, a fact that is masked in linear regression when substituting half the detection limit. The tobit regression confidence intervals from SAS's LIFEREG and QLIM also agree with those from Stata and R.

Table. Estimated increase in pesticide house dust concentration from wearing work clothes in the home versus not [1]

                                                  Diazinon         Chlorthal-dimethyl   DDT
                                                  14% < Limit of   6% < Limit of        57% < Limit of
                                                  Detection        Detection            Detection
    Linear regression
      Values below detection limit
      replaced with 1/2 limit                     2.09 ( )         ( )                  2.14 ( )
    Quantile regression on median
      Inverse Rank confidence interval (default)  1.66 ( )         1.45 ( )
      Bootstrap confidence interval (n=500)       1.66 ( )         1.45 ( )
    Tobit                                         2.15 ( )         1.55 ( )             4.79 ( )

[1] Ninety-five percent confidence limits given in parentheses.
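The comparison model in the first row of the table (linear regression with half the detection limit substituted) is not shown in the paper. A sketch of how it might be run, and how the log10 coefficients could be turned into multipliers, follows. The data set mydata, the raw variable diazinon, and the output names are assumptions, and the authors' actual comparison code may have differed. The columns LowerCL and UpperCL are the names PROC GLM is expected to give the confidence limits requested by clparm; running ods trace on would confirm the table and column names on a given installation.

```sas
/* Substitute half the detection limit (2 ng/g) for nondetects, then log transform */
data forlinear;
   set mydata;                                   /* hypothetical input data set */
   lod = 2;
   if missing(diazinon) then log10diaz_sub = .;
   else if diazinon < lod then log10diaz_sub = log10(lod / 2);
   else log10diaz_sub = log10(diazinon);
run;

/* Ordinary linear regression, saving the parameter estimates via ODS */
ods output ParameterEstimates = linest;
proc glm data = forlinear;
   class dist;
   model log10diaz_sub = x3_00198 tempmax_mean precip_sum dist
                         agwk_nc wkcloth4 ipov200 / solution clparm;
run;
quit;

/* Back-transform: a log10 coefficient b becomes a multiplier 10**b */
data multipliers;
   set linest;
   if not missing(lowercl) then do;              /* skip reference levels */
      mult    = 10 ** estimate;
      mult_lo = 10 ** lowercl;
      mult_hi = 10 ** uppercl;
   end;
   keep parameter mult mult_lo mult_hi;
run;

proc print data = multipliers;
run;
```

The same back-transformation applies to the quantile and tobit coefficients. For example, the median-regression coefficient of 0.22 for work clothes corresponds to 10**0.22, or about 1.66, as reported in the table.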

CONCLUSION

Both quantile and tobit regression can be useful for situations where some values of the dependent variable are censored or below detection limits. Quantile regression is a robust procedure that can be used to predict any quantile of interest, although the lack of a standard procedure for calculating the variance of the estimates can produce inconsistent results across different programs. Tobit regression has a longer history of application and produces consistent results across various implementations. It is limited, however, to estimating the mean. It also requires the assumption of normality of the dependent variable, as with linear regression.

REFERENCES

Cade BS, Noon BR. 2003. A gentle introduction to quantile regression for ecologists. Frontiers in Ecology and the Environment 1.

Hornung RW, Reed LD. 1990. Estimation of average concentration in the presence of nondetectable values. Applied Occupational and Environmental Hygiene 5.

Lubin JH, Colt JS, Camann D, et al. 2004. Epidemiologic evaluation of measurement data in the presence of detection limits. Environmental Health Perspectives 112.

StataCorp. Stata Statistical Software: Release 9. College Station, TX: StataCorp LP.

Tobin J. 1958. Estimation of relationships for limited dependent variables. Econometrica 26. Reprint available on-line.

ACKNOWLEDGMENTS

We thank the CHAMACOS investigators Brenda Eskenazi and Asa Bradman for use of a portion of their data for illustrating these procedures. The CHAMACOS study was jointly funded by the U.S. Environmental Protection Agency grant RD and National Institute of Environmental Health Sciences grant PO1 ES. Its contents do not necessarily represent the official views of the funding agencies.

CONTACT INFORMATION

Contact the author at:
Daniel Smith
Environmental Health Investigations Branch
California Department of Health Services
850 Marina Bay Parkway, Building P, 3rd Floor
Richmond, CA
DSmith2@dhs.ca.gov

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.
