Statistical analysis and bootstrapping


Statistical analysis and bootstrapping p. 1/15

Statistical analysis and bootstrapping
Michel Bierlaire (michel.bierlaire@epfl.ch)
Transport and Mobility Laboratory

Statistical analysis and bootstrapping p. 2/15

Introduction
- The outputs of the simulator are random variables.
- Running the simulator provides one realization of these r.v.
- We have no access to the pdf or CDF of these r.v. Well... this is actually why we rely on simulation.
- How can we derive statistics about a r.v. when only realizations are known?
- How can we measure the quality of such a statistic?

Statistical analysis and bootstrapping p. 3/15

Sample mean and variance
- Consider X_1, ..., X_n independent and identically distributed (i.i.d.) r.v. with E[X_i] = µ and Var(X_i) = σ².
- The sample mean
    X̄ = (1/n) Σ_{i=1}^n X_i
  is an unbiased estimator of the population mean µ, as E[X̄] = µ.
- The sample variance
    S² = (1/(n−1)) Σ_{i=1}^n (X_i − X̄)²
  is an unbiased estimator of the population variance σ², as E[S²] = σ².
- (See proof: Ross, Chapter 7.)
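The unbiasedness claims above can be checked empirically. The sketch below (not from the slides) repeatedly draws samples and averages the two estimators; the N(3, 2²) distribution, n = 10 and 20000 repetitions are illustrative choices.

```python
import random

# Empirical check that the sample mean and the (n-1)-divisor sample
# variance are unbiased. Distribution and sizes are illustrative.
random.seed(0)
n, reps = 10, 20000

def sample_stats():
    xs = [random.gauss(3.0, 2.0) for _ in range(n)]       # mu = 3, sigma^2 = 4
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)       # divisor n - 1
    return xbar, s2

means, variances = zip(*(sample_stats() for _ in range(reps)))
print(sum(means) / reps)       # close to mu = 3
print(sum(variances) / reps)   # close to sigma^2 = 4
```

Averaging over many repetitions approximates the expectations E[X̄] and E[S²]; both land near the true µ and σ².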

Statistical analysis and bootstrapping p. 4/15

Sample mean and variance
Recursive computation:
1. Initialize X̄_0 = 0, S²_1 = 0.
2. Update the mean:
     X̄_{k+1} = X̄_k + (X_{k+1} − X̄_k)/(k+1).
3. Update the variance:
     S²_{k+1} = (1 − 1/k) S²_k + (k+1)(X̄_{k+1} − X̄_k)².
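The recursive computation above can be sketched as a single pass over the data, so nothing needs to be stored. The data vector reused here is the one from the example slide later on; the function name is ours.

```python
# One-pass (recursive) computation of the sample mean and sample variance,
# following the update formulas on the slide.
def running_stats(xs):
    xbar, s2 = 0.0, 0.0
    for k, x in enumerate(xs):            # k = 0 when processing X_1
        new_xbar = xbar + (x - xbar) / (k + 1)     # mean update
        if k >= 1:                                 # variance update needs k >= 1
            s2 = (1 - 1.0 / k) * s2 + (k + 1) * (new_xbar - xbar) ** 2
        xbar = new_xbar
    return xbar, s2

data = [0.636, -0.643, 0.183, -1.67, 0.462]
xbar, s2 = running_stats(data)
print(xbar)   # -0.2064, the sample mean
print(s2)     # the sample variance S^2
```

The update for S² is algebraically exact, so the result matches the two-pass formula S² = Σ(X_i − X̄)²/(n−1) up to rounding.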

Statistical analysis and bootstrapping p. 5/15

Mean Square Error
- Consider X_1, ..., X_n i.i.d. r.v. with CDF F.
- Consider a parameter θ(F) of the distribution (mean, quantile, mode, etc.).
- Consider θ̂(X_1, ..., X_n), an estimator of θ(F).
- The mean square error of the estimator is defined as
    MSE(F) = E_F[(θ̂(X_1, ..., X_n) − θ(F))²],
  where E_F emphasizes that the expectation is taken under the assumption that the r.v. all have distribution F.
- If F is unknown, it is not immediate to find an estimator of the MSE.

Statistical analysis and bootstrapping p. 6/15

How many draws must be used?
- Let X be a r.v. with mean θ and variance σ². We want to estimate the mean θ of the simulated distribution.
- The estimator used is the sample mean X̄. Its mean square error is
    E[(X̄ − θ)²] = σ²/n.
- The sample mean X̄ is (approximately) normally distributed with mean θ and variance σ²/n.
- So we can stop generating data when σ/√n is small. σ is approximated by the sample standard deviation S.
- By the law of large numbers, S is a reliable approximation of σ only when enough draws are used: at least 100 (say).
- See Ross p. 121 for details.
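The stopping rule above can be sketched as follows. The tolerance, the minimum of 100 draws, and the exponential test distribution are illustrative choices; the running mean and variance are updated with the recursion from the earlier slide.

```python
import random

# Keep drawing until the estimated standard error S/sqrt(n) falls below
# a tolerance, with at least 100 draws so that S itself is reliable.
def simulate_mean(draw, tol=0.01, min_draws=100):
    n, xbar, s2 = 0, 0.0, 0.0
    while True:
        x = draw()
        n += 1
        new_xbar = xbar + (x - xbar) / n
        if n >= 2:
            s2 = (1 - 1.0 / (n - 1)) * s2 + n * (new_xbar - xbar) ** 2
        xbar = new_xbar
        if n >= min_draws and (s2 / n) ** 0.5 < tol:
            return xbar, n

random.seed(1)
est, n_used = simulate_mean(lambda: random.expovariate(1.0))  # true mean is 1
print(est, n_used)
```

With σ = 1 and tol = 0.01, the rule stops after roughly 10⁴ draws, and the estimate is within a few standard errors of the true mean.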

Statistical analysis and bootstrapping p. 7/15

Mean Square Error
- Indicators other than the mean may be desired.
- Theoretical results about the MSE cannot always be derived.
- Solution: rely on simulation. Method: bootstrapping.

Statistical analysis and bootstrapping p. 8/15

Empirical distribution function
- Consider X_1, ..., X_n i.i.d. r.v. with CDF F, and a realization x_1, ..., x_n of these r.v.
- The empirical distribution function is defined as
    F_e(x) = #{i : x_i ≤ x} / n,
  that is, the proportion of values less than or equal to x.
- F_e is the CDF of a r.v. that takes each value x_i with equal probability 1/n.
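The definition above translates directly into code. A minimal sketch, reusing the data from the example slide later on:

```python
# Empirical CDF: F_e(x) = #{i : x_i <= x} / n.
def empirical_cdf(data):
    n = len(data)
    def F_e(x):
        return sum(1 for xi in data if xi <= x) / n
    return F_e

F_e = empirical_cdf([0.636, -0.643, 0.183, -1.67, 0.462])
print(F_e(-2.0))   # 0.0: no value is <= -2
print(F_e(0.2))    # 0.6: three of the five values are <= 0.2
print(F_e(1.0))    # 1.0: all values are <= 1
```

F_e is a step function that jumps by 1/n at each data point, which is exactly the CDF of the uniform distribution over {x_1, ..., x_n}.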

Statistical analysis and bootstrapping p. 9/15

Empirical CDF
[Plot: empirical CDF F_e(x) for n = 10 against the true CDF F(x).]

Statistical analysis and bootstrapping p. 10/15

Empirical CDF
[Plot: empirical CDF F_e(x) for n = 100 against the true CDF F(x).]

Statistical analysis and bootstrapping p. 11/15

Empirical CDF
[Plot: empirical CDF F_e(x) for n = 1000 against the true CDF F(x); the fit improves as n grows.]

Statistical analysis and bootstrapping p. 12/15

Mean Square Error
- We use the empirical distribution function F_e.
- We can approximate
    MSE(F) = E_F[(θ̂(X_1, ..., X_n) − θ(F))²]
  by
    MSE(F_e) = E_{F_e}[(θ̂(X_1, ..., X_n) − θ(F_e))²].
- θ(F_e) can be computed directly from the data (mean, variance, etc.).

Statistical analysis and bootstrapping p. 13/15

Mean Square Error
- We want to compute
    MSE(F_e) = E_{F_e}[(θ̂(X_1, ..., X_n) − θ(F_e))²].
- F_e is the CDF of a r.v. that can take any x_i with equal probability. Therefore,
    MSE(F_e) = (1/nⁿ) Σ_{i_1=1}^n ... Σ_{i_n=1}^n (θ̂(x_{i_1}, ..., x_{i_n}) − θ(F_e))².
- Clearly impossible to compute when n is large. Solution: simulation.
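For very small n, the nⁿ-term sum above can actually be enumerated, which makes the definition concrete. A sketch with θ taken to be the mean and a 3-point data set (so only 3³ = 27 resamples):

```python
from itertools import product

# Exact MSE(F_e) by enumerating all n^n equally likely resamples.
def exact_mse(data, theta):
    n = len(data)
    t_fe = theta(data)
    total = sum((theta(r) - t_fe) ** 2 for r in product(data, repeat=n))
    return total / n ** n

mean = lambda xs: sum(xs) / len(xs)
data = [0.636, -0.643, 0.183]
mse = exact_mse(data, mean)
print(mse)
```

For θ = mean, this exact value equals σ̂²/n, where σ̂² is the variance of the empirical distribution (divisor n), which is a handy sanity check; already at n = 10 the sum would have 10¹⁰ terms, hence the need for simulation.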

Statistical analysis and bootstrapping p. 14/15

Bootstrapping
For r = 1, ..., R:
- Draw x^r_1, ..., x^r_n from F_e, that is, draw from the data:
  1. Let s be a draw from U[0, 1].
  2. Set j = ⌊ns⌋ + 1 (so that j ∈ {1, ..., n}).
  3. Return x_j.
- Compute
    M_r = (θ̂(x^r_1, ..., x^r_n) − θ(F_e))².
Estimate of MSE(F_e), and therefore of MSE(F):
    (1/R) Σ_{r=1}^R M_r.
Typical value for R: 100.
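The bootstrap loop above can be sketched as follows, with θ taken to be the mean and the data from the example slide; the function name is ours.

```python
import random

# Bootstrap estimate of MSE(F_e): resample the data with replacement R
# times and average the squared deviations M_r.
def bootstrap_mse(data, theta, R=100, rng=random):
    n = len(data)
    t_fe = theta(data)
    total = 0.0
    for _ in range(R):
        # drawing j = floor(n*s) with s ~ U[0,1) resamples the data uniformly
        resample = [data[int(n * rng.random())] for _ in range(n)]
        total += (theta(resample) - t_fe) ** 2
    return total / R

random.seed(0)
data = [0.636, -0.643, 0.183, -1.67, 0.462]
mean = lambda xs: sum(xs) / len(xs)
mse_hat = bootstrap_mse(data, mean, R=100)
print(mse_hat)
```

For the mean, the bootstrap estimate fluctuates around Σ(x_i − x̄)²/n² ≈ 0.146 for these data; each run gives a slightly different value, which is the Monte-Carlo error of using R = 100 instead of all nⁿ resamples.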

Statistical analysis and bootstrapping p. 15/15

Bootstrap: simple example
Data: 0.636, -0.643, 0.183, -1.67, 0.462
Mean = -0.206
MSE = E[(X̄ − θ)²] ≈ S²/n = 0.1817

  r   bootstrap sample                            θ̂         M_r
  1   -0.643 -0.643 -0.643  0.462  0.462       -0.201     2.544e-05
  2   -0.643  0.183  0.636  0.636  0.636        0.2896    0.2456
  3   -1.67  -1.67   0.183  0.462  0.636       -0.411     0.04204
  4   -1.67  -0.643  0.183  0.183  0.636       -0.2617    0.003105
  5   -0.643  0.462  0.462  0.636  0.636        0.3105    0.2667
  6   -1.67  -1.67   0.183  0.183  0.183       -0.5573    0.1234
  7   -0.643  0.183  0.183  0.462  0.636        0.1642    0.137
  8   -1.67  -1.67  -0.643  0.183  0.183       -0.7225    0.2667
  9    0.183  0.462  0.462  0.636  0.636        0.4756    0.4646
 10   -0.643  0.183  0.183  0.462  0.636        0.1642    0.137

Bootstrap estimate (average of the M_r): 0.1686

Statistical analysis and bootstrapping p. 16/15

Appendix: MSE for the mean
- Consider X_1, ..., X_n i.i.d. r.v. Denote θ = E[X_i] and σ² = Var(X_i).
- Consider X̄ = Σ_{i=1}^n X_i / n. Then E[X̄] = Σ_{i=1}^n E[X_i]/n = θ.
- MSE:
    E[(X̄ − θ)²] = Var(X̄)
                 = Var(Σ_{i=1}^n X_i / n)
                 = Σ_{i=1}^n Var(X_i)/n²
                 = σ²/n.