Geostatistical Inference under Preferential Sampling


Marie Ozanne and Justin Strait
Diggle, Menezes, and Su, 2010
October 12, 2015

A simple geostatistical model

Notation:
- The underlying spatially continuous phenomenon $S(x)$, $x \in \mathbb{R}^2$, is sampled at a set of locations $x_i$, $i = 1, \ldots, n$, from the spatial region of interest $A \subset \mathbb{R}^2$
- $Y_i$ is the measurement taken at $x_i$
- $Z_i$ is the measurement error

The model:

$$Y_i = \mu + S(x_i) + Z_i, \quad i = 1, \ldots, n$$

- $\{Z_i, i = 1, \ldots, n\}$ is a set of mutually independent random variables with $E[Z_i] = 0$ and $\mathrm{Var}(Z_i) = \tau^2$ (called the nugget variance)
- Assume $E[S(x)] = 0$ for all $x$

Thinking hierarchically

Diggle et al. (1998) rewrote this simple model hierarchically, assuming Gaussian distributions:
- $S(x)$ follows a latent Gaussian stochastic process
- $Y_i \mid S(x_i) \sim N(\mu + S(x_i), \tau^2)$ are mutually independent for $i = 1, \ldots, n$

If $X = (x_1, \ldots, x_n)$, $Y = (y_1, \ldots, y_n)$, and $S(X) = \{S(x_1), \ldots, S(x_n)\}$, this model can be described by:

$$[S, Y] = [S][Y \mid S(X)] = [S][Y_1 \mid S(x_1)] \cdots [Y_n \mid S(x_n)]$$

where $[\cdot]$ denotes the distribution of the random variable. This model treats $X$ as deterministic.

What is preferential sampling?

Typically, the sampling locations $x_i$ are treated as stochastically independent of $S(x)$, the spatially continuous process: $[S, X] = [S][X]$ (this is non-preferential sampling). This means that $[S, X, Y] = [S][X][Y \mid S(X)]$, and by conditioning on $X$, standard geostatistical techniques can be used to infer properties of $S$ and $Y$.

Preferential sampling describes instances when the sampling process depends on the underlying spatial process: $[S, X] \neq [S][X]$.

Preferential sampling complicates inference!

Examples of sampling designs

1. Non-preferential, uniform designs: Sample locations come from an independent random sample from a uniform distribution on the region of interest $A$ (e.g. completely random designs, regular lattice designs).
2. Non-preferential, non-uniform designs: Sample locations are determined from an independent random sample from a non-uniform distribution on $A$.
3. Preferential designs: Sample locations are more concentrated in parts of $A$ that tend to have higher (or lower) values of the underlying process $S(x)$. Here $X, Y$ form a marked point process where the points $X$ and the marks $Y$ are dependent.

Schlather et al. (2004) developed a couple of tests for determining whether preferential sampling has occurred.

Why does preferential sampling complicate inference?

Consider the situation where $S$ and $X$ are stochastically dependent, but the measurements $Y$ are taken at a different set of locations, independent of $X$. Then the joint distribution of $S$, $X$, and $Y$ is:

$$[S, X, Y] = [S][X \mid S][Y \mid S]$$

We can integrate out $X$ to get:

$$[S, Y] = [S][Y \mid S]$$

This means inference on $S$ can be done by ignoring $X$ (as is conventional in geostatistical inference). However, if $Y$ is actually observed at $X$, then the joint distribution is:

$$[S, X, Y] = [S][X \mid S][Y \mid X, S] = [S][X \mid S][Y \mid S(X)]$$

Because the last factor depends on $X$, the locations can no longer be integrated out. Conventional methods that ignore $X$ are misleading under preferential sampling!

Shared latent process model for preferential sampling

The joint distribution of $S$, $X$, and $Y$ (from the previous slide):

$$[S, X, Y] = [S][X \mid S][Y \mid X, S] = [S][X \mid S][Y \mid S(X)]$$

with the last equality holding for typical geostatistical modeling.

1. $S$ is a stationary Gaussian process with mean 0, variance $\sigma^2$, and correlation function $\rho(u; \phi) = \mathrm{Corr}(S(x), S(x'))$ for $x, x'$ separated by distance $u$
2. Given $S$, $X$ is an inhomogeneous Poisson process with intensity $\lambda(x) = \exp(\alpha + \beta S(x))$
3. Given $S$ and $X$, $Y = (Y_1, \ldots, Y_n)$ is a set of mutually independent random variables such that $Y_i \sim N(\mu + S(x_i), \tau^2)$
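To make step 2 of the model concrete, here is a minimal sketch of drawing sampling locations on a grid. It relies on a standard property of Poisson processes: conditional on the number of points, the locations are independent with density proportional to the intensity, so a grid cell is picked with probability proportional to $\exp(\beta S_k)$ and $\alpha$ drops out. The function name `sample_locations`, the seed, and the sine-curve stand-in for $S$ are all hypothetical; the stand-in is a deterministic curve, not a simulated Gaussian process.

```python
import math
import random


def sample_locations(s_grid, n, beta, rng):
    """Draw n grid-cell indices with probability proportional to exp(beta * S_k).

    Cells are drawn with replacement, so a cell can be chosen more than once.
    """
    weights = [math.exp(beta * s) for s in s_grid]
    return rng.choices(range(len(s_grid)), weights=weights, k=n)


rng = random.Random(2015)  # fixed seed so the sketch is reproducible

# Hypothetical stand-in for a realization of S on 400 grid cells.
s_grid = [math.sin(2 * math.pi * k / 400) for k in range(400)]

n = 100
pref = sample_locations(s_grid, n, beta=2.0, rng=rng)  # preferential
unif = sample_locations(s_grid, n, beta=0.0, rng=rng)  # non-preferential

mean_pref = sum(s_grid[k] for k in pref) / n
mean_unif = sum(s_grid[k] for k in unif) / n
print(f"mean of S at preferential sites: {mean_pref:.3f}")
print(f"mean of S at uniform sites:      {mean_unif:.3f}")
```

With $\beta > 0$ the sampled sites concentrate where the surface is high, so the average of $S$ over the sampled sites overstates the average over the whole grid.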

Shared latent process model for preferential sampling

Some notes about this model:
- Unconditionally, $X$ follows a log-Gaussian Cox process (details in Møller et al. (1998))
- If we set $\beta = 0$ in $[X \mid S]$, then unconditionally, $Y$ follows a multivariate Gaussian distribution
- Ho and Stoyan (2008) considered a similar hierarchical model construction for marked point processes

Simulation experiment

Approximately simulate the stationary Gaussian process $S$ on the unit square by simulating on a finely spaced grid, and then treating $S$ as constant within each cell. Then, sample values of $Y$ according to one of 3 sampling designs:

1. Completely random (non-preferential): Use sample locations $x_i$ that are determined from an independent random sample from a uniform distribution on $A$.
2. Preferential: Generate a realization of $X$ by using $[X \mid S]$, with $\beta = 2$, and then generate $Y$ using $[Y \mid S(X)]$.
3. Clustered: Generate a realization of $X$ by using $[X \mid S]$, but then generate $Y$ at the locations $X$ using a separate, independent realization of $S$. This is non-preferential, but marginally $X$ and $Y$ share the same properties as in the preferential design.

Specifying the model for simulation

$S$ is stationary Gaussian with mean $\mu = 4$, variance $\sigma^2 = 1.5$, and correlation function from the Matérn class:

$$\rho(u; \phi, \kappa) = \{2^{\kappa - 1} \Gamma(\kappa)\}^{-1} (u/\phi)^{\kappa} K_{\kappa}(u/\phi), \quad u > 0$$

where $K_{\kappa}$ is the modified Bessel function of the second kind of order $\kappa$. For this simulation, $\phi = 0.15$ and $\kappa = 1$.

Set the nugget variance $\tau^2 = 0$ so that $y_i$ is the realized value of $S(x_i)$.
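The Matérn correlation can be evaluated numerically. The sketch below uses only the Python standard library and approximates $K_\kappa$ with the trapezoid rule applied to the integral representation $K_\nu(x) = \int_0^\infty e^{-x \cosh t} \cosh(\nu t)\, dt$. The function names, the step size `h`, and the cutoff of 700 are assumptions of this sketch; in practice a library routine such as `scipy.special.kv` would normally be used instead.

```python
import math


def bessel_k(nu, x, h=0.01):
    """Approximate K_nu(x) for x > 0 by the trapezoid rule on
    the integral of exp(-x*cosh(t)) * cosh(nu*t) over t in [0, inf)."""
    total = 0.5 * math.exp(-x)  # t = 0 term carries half weight
    t = h
    # Stop once exp(-x*cosh(t)) is negligible (below about exp(-700)).
    while x * math.cosh(t) < 700.0:
        total += math.exp(-x * math.cosh(t)) * math.cosh(nu * t)
        t += h
    return h * total


def matern(u, phi, kappa):
    """Matern correlation rho(u; phi, kappa); equals 1 at u = 0."""
    if u == 0:
        return 1.0
    z = u / phi
    return z ** kappa * bessel_k(kappa, z) / (2 ** (kappa - 1) * math.gamma(kappa))


# Simulation settings from the slide: phi = 0.15, kappa = 1.
rho_at_phi = matern(0.15, 0.15, 1.0)
for u in (0.05, 0.15, 0.30):
    print(f"rho({u:.2f}) = {matern(u, 0.15, 1.0):.4f}")
```

A useful sanity check is $\kappa = 0.5$, where the Matérn family reduces to the exponential correlation $\exp(-u/\phi)$.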

Simulation sampling location plots

Figure: Underlying process realization and sampling locations from the simulation for (a) completely random sampling, (b) preferential sampling, and (c) clustered sampling

Estimating the variogram

Theoretical variogram of the spatial process $Y(x)$:

$$V(u) = \tfrac{1}{2} \mathrm{Var}(Y(x) - Y(x'))$$

where $x$ and $x'$ are distance $u$ apart.

Empirical variogram ordinates: for $(x_i, y_i)$, $i = 1, \ldots, n$, where $x_i$ is the location and $y_i$ is the measured value at that location:

$$v_{ij} = \tfrac{1}{2} (y_i - y_j)^2$$

- Under non-preferential sampling, $v_{ij}$ is an unbiased estimate of $V(u_{ij})$, where $u_{ij}$ is the distance between $x_i$ and $x_j$
- A variogram cloud plots $v_{ij}$ against $u_{ij}$; it can be used to find an appropriate correlation function. For this simulation, simple binned estimators are used.
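The variogram cloud and a simple binned estimator can be sketched in a few lines. The function names, the equal-width binning rule, and the three-point data set are illustrative assumptions; the slides do not state how the bins were chosen.

```python
import math


def variogram_cloud(xs, ys):
    """Return (u_ij, v_ij) for every pair i < j, with v_ij = 0.5 * (y_i - y_j)^2."""
    cloud = []
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            u = math.dist(xs[i], xs[j])
            v = 0.5 * (ys[i] - ys[j]) ** 2
            cloud.append((u, v))
    return cloud


def binned_variogram(cloud, bin_width, n_bins):
    """Average the ordinates v_ij within equal-width distance bins.

    Returns (bin midpoint, mean ordinate) for each non-empty bin.
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for u, v in cloud:
        b = int(u // bin_width)
        if b < n_bins:
            sums[b] += v
            counts[b] += 1
    return [((b + 0.5) * bin_width, sums[b] / counts[b])
            for b in range(n_bins) if counts[b] > 0]


# Hypothetical toy data: three locations with measured values.
xs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
ys = [1.0, 3.0, 2.0]
cloud = variogram_cloud(xs, ys)
binned = binned_variogram(cloud, bin_width=0.5, n_bins=4)
print(binned)
```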

Empirical variograms under different sampling regimes

Over 500 replicated simulations, the pointwise bias and standard deviation of the smoothed empirical variograms were plotted (figure not included in this transcript).

- Under preferential sampling, the empirical variogram is biased and less efficient!
- The bias comes from sample locations covering a much smaller range of $S(x)$ values

Spatial prediction

Goal: Predict the value of the underlying process $S$ at a location $x_0$, given the sample $(x_i, y_i)$, $i = 1, \ldots, n$.

Typically, ordinary kriging is used to predict $S(x_0)$: it estimates the unconditional expectation of $S$ and uses plug-in estimates for the covariance parameters.

The bias and root-mean-square error (RMSE) of the kriging predictor at the point $x_0 = (0.49, 0.49)$ are calculated from the 500 simulations and used to form 95% confidence intervals for each sampling design:

Model  Parameter  Completely random   Preferential        Clustered
1      Bias       (-0.014, 0.055)     (0.951, 1.145)      (-0.048, 0.102)
1      RMSE       (0.345, 0.422)      (1.387, 1.618)      (0.758, 0.915)
2      Bias       (0.003, 0.042)      (-0.134, -0.090)    (-0.018, 0.023)
2      RMSE       (0.202, 0.228)      (0.247, 0.292)      (0.214, 0.247)

Kriging issues under preferential sampling

- For both models, the completely random and clustered sampling designs lead to approximately unbiased predictions (as expected).
- Under the Model 1 simulations, there is large, positive bias and high RMSE for preferential sampling (here, $\beta = 2$). This is because locations with high values of $S$ are oversampled.
- Under the Model 2 simulations, there is some negative bias (and slightly higher RMSE) due to preferential sampling (here, $\beta = -2$, so locations with low values of $S$ are oversampled). However, the bias and RMSE are not as drastic because:
  - the variance of the underlying process is much smaller;
  - the degree of preferentiality $|\beta|\sigma$ is lower here than for Model 1;
  - the nugget variance is non-zero for Model 2.

Fitting the shared latent process model

Data: $X$, $Y$

Likelihood for the data:

$$L(\theta) = [X, Y] = E_S\big[[X \mid S][Y \mid X, S]\big]$$

where $\theta$ consists of all parameters in the model.

- To evaluate $[X \mid S]$, the realization of $S$ at all possible locations $x \in A$ is needed; however, we can approximate $S$ (which is spatially continuous) by a set of values on a finely spaced grid, and replace the exact locations $X$ by their closest grid points.
- Let $S = \{S_0, S_1\}$, where $S_0$ represents the values of $S$ at the $n$ observed locations $x_i \in X$ and $S_1$ denotes the values of $S$ at the other $N - n$ grid points.
- Unfortunately, estimating the likelihood with a sample average over simulations $S_j$ fails when the nugget variance is 0, because simulations $S_j$ usually will not match the observed $Y$.

Evaluating the likelihood

$$
\begin{aligned}
L(\theta) &= \int [X \mid S][Y \mid X, S][S] \, dS \\
&= \int [X \mid S][Y \mid X, S] \frac{[S]}{[S \mid Y]} [S \mid Y] \, dS \\
&= \int [X \mid S] \frac{[Y \mid S_0]}{[S_0 \mid Y][S_1 \mid S_0, Y]} [S_0][S_1 \mid S_0] [S \mid Y] \, dS \\
&= \int [X \mid S] \frac{[Y \mid S_0]}{[S_0 \mid Y]} [S_0] [S \mid Y] \, dS \qquad (1)
\end{aligned}
$$

The third equality uses $[S] = [S_0][S_1 \mid S_0]$, $[S \mid Y] = [S_0 \mid Y][S_1 \mid S_0, Y]$, and $[Y \mid X, S] = [Y \mid S_0]$. The last equality uses $[S_1 \mid S_0, Y] = [S_1 \mid S_0]$. Hence:

$$L(\theta) = E_{S \mid Y}\left[ [X \mid S] \frac{[Y \mid S_0]}{[S_0 \mid Y]} [S_0] \right] \qquad (2)$$

Approximating the likelihood

A Monte Carlo approximation can be used to approximate the likelihood:

$$L_{MC}(\theta) = m^{-1} \sum_{j=1}^{m} [X \mid S_j] \frac{[Y \mid S_{0j}]}{[S_{0j} \mid Y]} [S_{0j}]$$

where the $S_j$ are simulations of $S \mid Y$.

- Antithetic pairs of realizations are used to reduce Monte Carlo variance
- To simulate from $[S \mid Y]$, we can simulate from several other unconditional distributions, and then notice that

$$S + \Sigma C' \Sigma_0^{-1} (y - \mu + Z - CS)$$

has the distribution of $S \mid Y = y$, where:
  - $S \sim \mathrm{MVN}(0, \Sigma)$, $Y \sim \mathrm{MVN}(\mu, \Sigma_0)$, $Z \sim N(0, \tau^2)$
  - $C$ is an $n \times N$ matrix which identifies the positions of the data locations within all possible prediction locations
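The following short derivation checks why this construction gives a draw from $[S \mid Y = y]$. It assumes the data model $Y = \mu + CS + Z$ with $Z$ read as a vector of independent $N(0, \tau^2)$ entries, so that $\Sigma_0 = C \Sigma C' + \tau^2 I$; that expression for $\Sigma_0$ is not stated on the slide and is an assumption of this sketch. The symbols $A$ and $W$ are introduced here for convenience.

```latex
\begin{align*}
A &= \Sigma C^\top \Sigma_0^{-1}, \qquad
W = S + A\,(y - \mu + Z - CS) = (I - AC)\,S + A\,(y - \mu) + A Z, \\
E[W] &= A\,(y - \mu) = \Sigma C^\top \Sigma_0^{-1} (y - \mu), \\
\mathrm{Var}(W) &= (I - AC)\,\Sigma\,(I - AC)^\top + \tau^2 A A^\top \\
  &= \Sigma - A C \Sigma - \Sigma C^\top A^\top
     + A\,(C \Sigma C^\top + \tau^2 I)\,A^\top \\
  &= \Sigma - \Sigma C^\top \Sigma_0^{-1} C \Sigma .
\end{align*}
```

Since $W$ is a linear combination of independent Gaussian vectors, it is Gaussian, and its mean and covariance match the usual conditional mean and covariance of $S$ given $Y = y$ under the joint Gaussian model.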

Goodness of fit

We can use K-functions to assess how well the shared latent process model under preferential sampling fits the data.

- The K-function $K(s)$ is defined by $\lambda K(s) = E[N_0(s)]$, where $N_0(s)$ is the number of points in the process within distance $s$ of a chosen origin and $\lambda$ is the expected number of points in the process per unit area.
- Under our preferential sampling model, $X$ marginally follows a log-Gaussian Cox process with intensity $\Lambda(x) = \exp(\alpha + \beta S(x))$. The corresponding K-function is:

$$K(s) = \pi s^2 + 2\pi \lambda^{-2} \int_0^s \gamma(u)\, u \, du$$

where $\gamma(u)$ is the covariance function of $\Lambda(x)$ (Diggle (2003)).

- By comparing the estimated K-function from the data to an envelope of estimates obtained from simulated realizations of the fitted model, goodness of fit can be assessed.
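As an illustration of what "the estimated K-function" involves, here is a minimal sketch of a naive estimator, $\hat{K}(s) = |A| \, \{n(n-1)\}^{-1} \sum_{i \neq j} 1(\lVert x_i - x_j \rVert \le s)$. It applies no edge correction, which a real analysis would normally add, and the function name `k_hat` and the four-point pattern are hypothetical. The slides do not say which estimator was used.

```python
import math


def k_hat(points, s, area=1.0):
    """Naive K-function estimate at distance s, without edge correction.

    Counts ordered pairs (i, j), i != j, that lie within distance s of
    each other and scales by area / (n * (n - 1)).
    """
    n = len(points)
    close_pairs = sum(
        1
        for i in range(n)
        for j in range(n)
        if i != j and math.dist(points[i], points[j]) <= s
    )
    return area * close_pairs / (n * (n - 1))


# Hypothetical toy pattern: the four corners of the unit square.
corners = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
for s in (0.5, 1.0, 1.5):
    print(f"K_hat({s}) = {k_hat(corners, s):.4f}")
```

Repeating the same calculation on patterns simulated from the fitted model gives the envelope against which the data-based estimate is compared.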

Lead biomonitoring in Galicia, Spain

Background:
- Uses lead concentration, [Pb] (µg/g dry weight), in moss samples as the measured variable
- Initial survey conducted in Spring 1995 to select the most suitable moss species and collection sites (Fernandez et al., 2000)
- Two further surveys of [Pb] in samples of Scleropodium purum:
  - October 1997: sampling conducted more intensively in subregions where large gradients in [Pb] were expected
  - July 2000: used an approximately regular lattice design; gaps arise where a different moss species was collected

Lead biomonitoring in Galicia, Spain

Summary statistics:

                      Untransformed      Log-transformed
                      1997     2000      1997     2000
Number of locations   63       132       63       132
Mean                  4.72     2.15      1.44     0.66
Standard deviation    2.21     1.18      0.48     0.43
Minimum               1.67     0.80      0.52     -0.22
Maximum               9.51     8.70      2.25     2.16

Lead biomonitoring in Galicia, Spain: standard geostatistical analysis

Assumptions:
- Standard Gaussian model with underlying signal $S(x)$
- $S(x)$ is a zero-mean stationary Gaussian process with:
  - variance $\sigma^2$
  - Matérn correlation function $\rho(u; \phi, \kappa)$
- Gaussian measurement errors, $Z_i \sim N(0, \tau^2)$

Models were fitted separately to the 1997 and 2000 data.

Lead biomonitoring in Galicia, Spain: analysis under preferential sampling (parameter estimation)

Goal: To investigate whether the 1997 sampling is preferential

- Use the Nelder-Mead simplex algorithm (Nelder and Mead, 1965) to estimate the model parameters
- Using $m = 100{,}000$ Monte Carlo samples reduced the Monte Carlo standard error to approximately 0.3
- The approximate generalized likelihood ratio test statistic for testing $\beta = 0$ was 27.7 on 1 degree of freedom ($p < 0.001$)

Lead biomonitoring in Galicia, Spain: analysis under preferential sampling (parameter estimation)

Goal: To test the hypothesis of shared values of $\sigma$, $\phi$, and $\tau$

- Fit a joint model to the 1997 and 2000 data sets, treated as preferential and non-preferential, respectively
- Fit the model with and without constraints on $\sigma$, $\phi$, and $\tau$ to get a generalized likelihood ratio test statistic of 6.2 on 3 degrees of freedom ($p = 0.102$)
- Using shared parameter values (when justified) improves estimation efficiency and results in a better identified model (Altham, 1984)
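The quoted p-values can be reproduced from the chi-squared reference distribution for the likelihood ratio statistic. The sketch below uses closed-form upper-tail probabilities for 1 and 3 degrees of freedom so that only the standard library is needed; the helper name `chi2_sf` is hypothetical and deliberately handles only those two cases.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(chi2_df > x), closed forms for df = 1 and df = 3 only."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 3:
        return (math.erfc(math.sqrt(x / 2))
                + math.sqrt(2 * x / math.pi) * math.exp(-x / 2))
    raise ValueError("this sketch only covers df = 1 and df = 3")


p_beta = chi2_sf(27.7, 1)   # test of beta = 0 in the 1997 data
p_shared = chi2_sf(6.2, 3)  # test of shared sigma, phi, tau
print(f"beta = 0:          p = {p_beta:.2e}")
print(f"shared parameters: p = {p_shared:.3f}")
```

The first test rejects $\beta = 0$ decisively, while the second gives no strong evidence against shared $\sigma$, $\phi$, and $\tau$.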

Lead biomonitoring in Galicia, Spain: analysis under preferential sampling (parameter estimation)

- Monte Carlo maximum likelihood estimates were obtained for the model with shared $\sigma$, $\phi$, and $\tau$
- The preferential sampling parameter estimate is negative, $\hat{\beta} = -2.198$; this depends on allowing two separate means
- Recall: Given $S$, $X$ is an inhomogeneous Poisson process with intensity $\lambda(x) = \exp(\alpha + \beta S(x))$

Lead biomonitoring in Galicia, Spain: analysis under preferential sampling (goodness of fit)

Goodness of fit is assessed using the statistic $T$; the resulting p-value is 0.03.

$$T = \int_0^{0.25} \frac{\{\hat{K}(s) - K(s)\}^2}{v(s)} \, ds$$
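A minimal sketch of how a statistic of this form and a Monte Carlo p-value might be computed. The slide does not define $v(s)$; the code treats it as a positive weight function, which this sketch assumes to be the variance of $\hat{K}(s)$. The trapezoid rule, the rank-based p-value convention, the function names, and all numbers below are illustrative assumptions, not values from the analysis.

```python
def gof_statistic(s, k_hat, k_model, v):
    """Trapezoid-rule approximation of the integral of
    (k_hat - k_model)^2 / v over the grid of distances s."""
    g = [(kh - km) ** 2 / vv for kh, km, vv in zip(k_hat, k_model, v)]
    return sum(0.5 * (g[i] + g[i + 1]) * (s[i + 1] - s[i])
               for i in range(len(s) - 1))


def mc_p_value(t_obs, t_sim):
    """Monte Carlo p-value: rank of the observed statistic among
    statistics computed from simulations of the fitted model."""
    return (1 + sum(t >= t_obs for t in t_sim)) / (len(t_sim) + 1)


# Hypothetical numbers on a coarse grid over [0, 0.25].
s = [0.0, 0.125, 0.25]
t_example = gof_statistic(s, [0.0, 0.15, 0.40], [0.0, 0.05, 0.20], [1.0, 1.0, 1.0])
p_example = mc_p_value(5.0, [1.0, 2.0, 3.0, 6.0])
print(t_example, p_example)
```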

Lead biomonitoring in Galicia, Spain: analysis under preferential sampling (prediction)

- Figures in the paper show predicted surfaces $\hat{T}(x) = E[T(x) \mid X, Y]$, where $T(x) = \exp\{S(x)\}$ denotes [Pb] on the untransformed scale
- Predictions based on the preferential sampling model have a much wider range over the lattice of prediction locations than those that assume non-preferential sampling (1.310 to 7.654 and 1.286 to 5.976, respectively)
- Takeaway: Recognizing the preferential sampling results in a pronounced shift in the predictive distribution

Discussion

- Conventional geostatistical models and associated statistical methods can lead to misleading inferences if the underlying data have been preferentially sampled
- This paper proposes a simple model to take preferential sampling into account and develops associated Monte Carlo methods to enable maximum likelihood estimation and likelihood ratio testing within the class of models proposed
- This method is computationally intensive: each model takes several hours to run

References

Diggle, P.J., Menezes, R., Su, T.-L., 2010. Geostatistical inference under preferential sampling. Journal of the Royal Statistical Society, Series C (Applied Statistics), 59, 191-232.
Menezes, R., 2005. Assessing spatial dependency under non-standard sampling. Ph.D. Dissertation, Universidad de Santiago de Compostela.
Pati, D., Reich, B.J., Dunson, D.B., 2011. Bayesian geostatistical modelling with informative sampling locations. Biometrika, 98, 35-48.
Gelfand, A.E., Sahu, S.K., Holland, D.M., 2012. On the effect of preferential sampling in spatial prediction. Environmetrics, 23, 565-578.
Lee, A., Szpiro, A., Kim, S.Y., Sheppard, L., 2015. Impact of preferential sampling on exposure prediction and health effect inference in the context of air pollution epidemiology. Environmetrics.