Chapter 6 Part 3 October 21, 2008 Bootstrapping

From the internet: The bootstrap involves repeated re-estimation of a parameter using random samples with replacement from the original data. Because the sampling is with replacement, some items in the data set are selected two or more times and others are not selected at all. When this is repeated a hundred or a thousand times, we get pseudo-samples that behave similarly to the underlying distribution of the data.

Bootstrapping by Hand - Sampling with replacement

Original 5000 observations:

. sum(AFTRIG),det

                 Fast Triglycerides-BL Anti
-------------------------------------------------------------
      Percentiles      Smallest
 1%           43             27
 5%         59.5             27
10%           71             27       Obs                5000
25%           97             28       Sum of Wgt.        5000
50%          140                      Mean           169.0732
                        Largest       Std. Dev.      110.6217
75%          207            933
90%          301            936       Variance       12237.16
95%          377            982       Skewness       2.412733
99%        562.5           1000       Kurtosis       12.38013

. return list

scalars:
                  r(N) =  5000
              r(sum_w) =  5000
               r(mean) =  169.0732
               r(Var) =  12237.16167409482

[Histogram: Fasting triglycerides for the original dataset of 5000. Frequency (0 to 600) on the y-axis; AFTRIG (0 to 1000) on the x-axis, with the mean, 169.07, marked.]

I have omitted the x-axis to make it easier for you to see the very small bars on the right-hand tail. You can see that we have a very skewed distribution: skewness = 2.4 (as opposed to 0).
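The idea of a single bootstrap resample can be sketched in Python (this is not part of the original Stata handout; the synthetic right-skewed data below merely stands in for the 5,000 AFTRIG values):

```python
import random

random.seed(50)

# Synthetic stand-in for the 5,000 fasting triglyceride values:
# a right-skewed variable with a lower bound near 27, as in the handout.
population = [round(random.expovariate(1 / 110)) + 27 for _ in range(5000)]

# One bootstrap sample: draw k values WITH replacement, so some
# observations may appear two or more times and others not at all.
boot_sample = random.choices(population, k=10)
boot_mean = sum(boot_sample) / len(boot_sample)
```

Each call to `random.choices` plays the role of Stata's `bsample`: it draws a with-replacement sample from the data in memory.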

Below we have bootstrapping by hand. I have selected 4 samples, each of size 10, and obtained the mean of each set of 10.

  log: W:\WP51\Biometry\AAAABiostat1725_Fall2008\Handouts\Chapter 6\data\Chapter6Part3.log
  log type: text
  opened on: 20 Oct 2008, 16:32:28

. *dofile used is sample4setsof10.do
. use "W:\WP51\Biometry\AAAABiostat1725_Fall2008\Handouts\Chapter 6\data\samplingAHT_5000.dta"
. sum(AFTRIG)

    Variable |      Obs        Mean    Std. Dev.      Min       Max
      AFTRIG |     5000    169.0732    110.6217        27      1000

. sort ID
. set seed 50
. bsample 10
. list ID AFTRIG

     +---------------+
     |   ID   AFTRIG |
     |---------------|
  1. | 4631      269 |
  2. | 3695       74 |
  3. | 3001       73 |
  4. | 2035      131 |
  5. | 1364       80 |
     |---------------|
  6. | 4947       81 |
  7. | 2529      115 |
  8. | 2616      104 |
  9. | 3424      168 |
 10. | 4439      185 |
     +---------------+

. sum(AFTRIG)

    Variable |      Obs        Mean    Std. Dev.      Min       Max
      AFTRIG |       10         128    63.19107        73       269

. clear

. use "W:\WP51\Biometry\AAAABiostat1725_Fall2008\Handouts\Chapter 6\data\samplingAHT_5000.dta"
. sort ID
. set seed 51
. bsample 10
. list ID AFTRIG

     +---------------+
     |   ID   AFTRIG |
     |---------------|
  1. | 1230      223 |
  2. | 3650      427 |
  3. | 3376      454 |
  4. | 3686      393 |
  5. | 4816       86 |
     |---------------|
  6. | 1336      139 |
  7. | 4139      150 |
  8. | 2299       56 |
  9. |  706      111 |
 10. |  897      113 |
     +---------------+

. sum(AFTRIG)

    Variable |      Obs        Mean    Std. Dev.      Min       Max
      AFTRIG |       10       215.2    151.6412        56       454

. clear
. use "W:\WP51\Biometry\AAAABiostat1725_Fall2008\Handouts\Chapter 6\data\samplingAHT_5000.dta"
. sort ID
. set seed 52
. bsample 10
. list ID AFTRIG

     +---------------+
     |   ID   AFTRIG |
     |---------------|
  1. | 1812      146 |
  2. | 1495      184 |
  3. | 2742      119 |
  4. | 1265      103 |
  5. | 2036       91 |
     |---------------|
  6. | 1699       85 |
  7. | 4579      131 |
  8. | 1329       70 |
  9. |  510      191 |
 10. | 1511      132 |
     +---------------+

. sum(AFTRIG)

    Variable |      Obs        Mean    Std. Dev.      Min       Max
      AFTRIG |       10       125.2    40.36445        70       191

. clear

. use "W:\WP51\Biometry\AAAABiostat1725_Fall2008\Handouts\Chapter 6\data\samplingAHT_5000.dta"
. sort ID
. set seed 53
. bsample 10
. list ID AFTRIG

     +---------------+
     |   ID   AFTRIG |
     |---------------|
  1. | 4219       87 |
  2. | 3815      116 |
  3. | 2833      186 |
  4. | 3260      148 |
  5. | 2819      112 |
     |---------------|
  6. | 2055      150 |
  7. | 4161      103 |
  8. | 3753      179 |
  9. | 1522       58 |
 10. |  842      222 |
     +---------------+

. sum(AFTRIG)

    Variable |      Obs        Mean    Std. Dev.      Min       Max
      AFTRIG |       10       136.1    50.14966        58       222

. log close
  log: W:\WP51\Biometry\AAAABiostat1725_Fall2008\Handouts\Chapter 6\data\Chapter6Part3.log
  log type: text
  closed on: 20 Oct 2008, 16:32:28

So from each of the 4 sets of 10 observations we obtained an estimate of the mean value of the population of 5000 fasting triglycerides. If we had selected 1000 samples of size 10, we would have 1000 estimates of the mean of fasting triglycerides (one for each sample). You could then create a histogram of the 1000 means. This set of 1000 means is called the sampling distribution of means. If for each of the 1000 samples we had asked for the variance instead of the mean, then we would have the sampling distribution of variances, and we could obtain the mean of the 1000 variances.

Below is how you get the means for the sampling distribution of means using a single command rather than a separate command for each sample. We are assuming that the dataset samplingAHT_5000.dta represents a population of people. That is, instead of treating it like the sample that it is, we are going to treat it as though it were a population.

Results from dofile:
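The repeated-sampling idea behind Stata's `bs ..., reps() size()` can be sketched in Python (again a sketch, not the course dataset: the synthetic skewed data stands in for AFTRIG, and `sampling_distribution` is a hypothetical helper name):

```python
import random
import statistics

random.seed(50)
# Synthetic stand-in for the 5,000 AFTRIG values (an assumption).
population = [random.expovariate(1 / 110) + 27 for _ in range(5000)]

def sampling_distribution(data, reps, size):
    # One resampled mean per replication, sampling WITH replacement:
    # the collected means form the sampling distribution of means.
    return [statistics.mean(random.choices(data, k=size)) for _ in range(reps)]

means_4 = sampling_distribution(population, reps=4, size=10)      # the "by hand" version
means_1000 = sampling_distribution(population, reps=1000, size=10)
```

Asking for `statistics.variance` instead of `statistics.mean` inside the loop would give the sampling distribution of variances instead.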

. do "W:\WP51\Biometry\AAAABiostat1725_Fall2008\Handouts\Chapter 6\data\sample4setsof10singlecommand.do"
. clear
. log using "W:\WP51\Biometry\AAAABiostat1725_Fall2008\Handouts\Chapter 6\data\Chapter6Part3No2.log"
-----------------------------------------------------------------------------
  log: W:\WP51\Biometry\AAAABiostat1725_Fall2008\Handouts\Chapter 6\data\Chapter6Part3No2.log
  log type: text
  opened on: 20 Oct 2008, 18:28:17

. *dofile used is sample4setsof10singlecommand.do
. use "W:\WP51\Biometry\AAAABiostat1725_Fall2008\Handouts\Chapter 6\data\samplingAHT_5000.dta"
. sort ID
. set seed 50
. bs TGmeans = r(mean) TGvariances = r(Var), reps(4) size(10) noisily saving(TG_R4_S10): summarize AFTRIG

bootstrap: First call to summarize with data as is:

This first summarize is for all 5000 people.

    Variable |      Obs        Mean    Std. Dev.      Min       Max
      AFTRIG |     5000    169.0732    110.6217        27      1000

Warning: Since summarize is not an estimation command or does not set e(sample), bootstrap has no way to determine which observations are used in calculating the statistics and so assumes that all observations are used. This means no observations will be excluded from the resampling because of missing values or other reasons. If the assumption is not true, press Break, save the data, and drop the observations that are to be excluded. Be sure that the dataset in memory contains only the relevant data.

Bootstrap replications (4)

This is the mean of the first sample of size 10.

    Variable |      Obs        Mean    Std. Dev.      Min       Max
      AFTRIG |       10         128    63.19107        73       269

This is the mean of the second sample of size 10.

    Variable |      Obs        Mean    Std. Dev.      Min       Max
      AFTRIG |       10       229.7    262.1522        78       933

This is the mean of the third sample of size 10.

    Variable |      Obs        Mean    Std. Dev.      Min       Max
      AFTRIG |       10       162.1    77.60649        94       290

This is the mean of the fourth sample of size 10.

    Variable |      Obs        Mean    Std. Dev.      Min       Max
      AFTRIG |       10       141.5    60.25732        63       240

Bootstrap results                               Number of obs =  5000
                                                Replications  =     4

     command: summarize AFTRIG
     TGmeans: r(mean)
 TGvariances: r(Var)

             |  Observed   Bootstrap                     Normal-based
             |     Coef.   Std. Err.      z   P>|z|   [95% Conf. Interval]
-------------+------------------------------------------------------------
     TGmeans |  169.0732    45.14911   3.74   0.000    80.58257   257.5638
 TGvariances |  12237.16    32104.68   0.38   0.703   -50686.86   75161.19

. clear

The observed coefficients above are the mean (169.1) and variance (12237.16) of AFTRIG with all 5000 participants. As part of the bootstrapping routine we asked Stata to save the mean and variance for each of the 4 samples of size 10 in a data set that we called TG_R4_S10.dta. Notice below that the data set has only 4 observations because we asked for only 4 replications.

. use TG_R4_S10.dta
(bootstrap: summarize)

. des

Contains data from TG_R4_S10.dta
  obs:      4                      bootstrap: summarize
 vars:      2                      20 Oct 2008 18:28
 size:     48 (99.9% of memory free)

              storage  display     value
variable name   type   format      label    variable label
TGmeans         float  %9.0g                r(mean)
TGvariances     float  %9.0g                r(var)

Sorted by:

. list TGmeans TGvariances

     +--------------------+
     |  TGmeans   TGvari~s |
     |--------------------|
  1. |      128   3993.111 |
  2. |    229.7   68723.79 |
  3. |    162.1   6022.767 |
  4. |    141.5   3630.944 |
     +--------------------+

Notice that the only one of these means that matches the means obtained by hand is the first one, because both the by-hand version and the single-command version use the same seed (50).

Notice below that the mean of the 4 sample means is 165.3, which is not very close to 169.1, the mean of the population. We need to select more than 4 samples, and samples larger than 10, to get a good estimate of the population mean.

. sum(TGmeans),det

                          r(mean)
-------------------------------------------------------------
      Percentiles      Smallest
 1%        128            128
 5%        128          141.5
10%        128          162.1       Obs                   4
25%     134.75          229.7       Sum of Wgt.           4
50%      151.8                      Mean            165.325
                      Largest       Std. Dev.      45.14911
75%      195.9            128
90%      229.7          141.5       Variance       2038.442
95%      229.7          162.1       Skewness       .8415428
99%      229.7          229.7       Kurtosis       2.078988

. 
end of do-file

Below is a data set of 1000 samples of size 100 which we obtained from the original dataset of 5000.

Notice below that the mean of our 1000 samples of size 100 is 168.6493. Now we are getting closer to 169.0732, the mean of the original distribution of size 5000.

. clear
. use "W:\WP51\Biometry\AAAABiostatFall2008\Handouts\Chapter 6\Data\samplingAHT_5000.dta", clear
. log using "W:\WP51\Biometry\AAAABiostatFall2008\Handouts\Chapter 6\Data\classbootstrap.log"
--------
  log: classbootstrap.log
  log type: text
  opened on: 20 Oct 2008, 22:31:13

. set more off
. sort ID
. set seed 50
. bs TGmeans = r(mean) TGvariances = r(Var), reps(1000) size(100) saving(TG_R1000_S100): summarize AFTRIG
(running summarize on estimation sample)

Warning: Since summarize is not an estimation command or does not set e(sample), bootstrap assumes that all observations are used (same warning as above).

Bootstrap replications (1000)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..................................................   100
(dots continue in blocks of 50 up to 1000)
..................................................  1000

Bootstrap results                               Number of obs =  5000
                                                Replications  =  1000

     command: summarize AFTRIG
     TGmeans: r(mean)
 TGvariances: r(Var)

             |  Observed   Bootstrap                     Normal-based
             |     Coef.   Std. Err.      z   P>|z|   [95% Conf. Interval]
-------------+------------------------------------------------------------
     TGmeans |  169.0732    11.13638  15.18   0.000    147.2463   190.9001
 TGvariances |  12237.16    4060.487   3.01   0.003    4278.753   20195.57

. clear

Note that the observed coefficients above give the mean and variance of the original data set of 5000.
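The normal-based interval Stata reports is just the observed statistic plus or minus 1.96 bootstrap standard errors. A minimal sketch, again assuming synthetic data in place of AFTRIG:

```python
import random
import statistics

random.seed(50)
# Synthetic stand-in for the 5,000 AFTRIG values (an assumption).
population = [random.expovariate(1 / 110) + 27 for _ in range(5000)]

observed = statistics.mean(population)           # "Observed Coef."
# 1000 replications of size 100, one resampled mean each.
reps = [statistics.mean(random.choices(population, k=100))
        for _ in range(1000)]
boot_se = statistics.stdev(reps)                 # "Bootstrap Std. Err."

# Normal-based 95% confidence interval, as in the Stata output.
ci_low = observed - 1.96 * boot_se
ci_high = observed + 1.96 * boot_se
```

The standard deviation of the replicated means is the bootstrap standard error; the interval is built from the normal approximation, which the symmetric histogram of means below justifies.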

. use "W:\WP51\Biometry\AAAABiostatFall2008\Handouts\Chapter 6\Data\TG_R1000_S100.dta"
(bootstrap: summarize)

. des

Contains data from TG_R1000_S100.dta
  obs:  1,000                      bootstrap: summarize
 vars:      2                      11 Oct 2007 15:01
 size: 12,000 (99.9% of memory free)

              storage  display     value
variable name   type   format      label    variable label
TGmeans         float  %9.0g                r(mean)
TGvariances     float  %9.0g                r(var)

Sorted by:

. sum(TGmeans),det

                          r(mean)
-------------------------------------------------------------
      Percentiles      Smallest
 1%    145.205         139.57
 5%    151.605         142.66
10%     154.67         142.97       Obs                1000
25%    161.015         143.12       Sum of Wgt.        1000
50%     168.07                      Mean           168.6493
                      Largest       Std. Dev.      10.87832
75%    175.655         201.48
90%    182.885         202.55       Variance       118.3378
95%    187.825         202.81       Skewness         .21841
99%     196.41         205.24       Kurtosis       3.022783

Notice that we have a better estimate (168.6) of the population mean (169.1).

. list TGmeans TGvariances in 1/8

     +--------------------+
     |  TGmeans   TGvari~s |
     |--------------------|
  1. |   163.34   12052.33 |
  2. |   172.58   13293.34 |
  3. |   163.69   11420.68 |
  4. |   163.79   13522.59 |
  5. |   162.91   7851.315 |
     |--------------------|
  6. |   184.65   21049.91 |
  7. |   172.02    13014.1 |
  8. |   185.27   21222.18 |
     +--------------------+

The graph below is a histogram of the 1000 means we got above. Notice that this histogram is rather symmetric looking, and not at all like the very skewed histogram of the original variable AFTRIG.

[Histogram: Bootstrapping using samplingAHT_5000.dta, seed = 50, reps = 1000 and size = 100. Frequency (0 to 150) on the y-axis; r(mean) (140 to 200) on the x-axis.]

Now let us get more samples, and larger samples, than we have before. Below we have the output and histogram of 5000 samples of size 3000 selected from the original distribution of AFTRIG.

. use "W:\WP51\Biometry\AAAABiostatFall2008\Handouts\Chapter 6\Data\samplingAHT_5000.dta", clear
. set more off
. sort ID
. set seed 50
. bs TGmeans = r(mean) TGvariances = r(Var), reps(5000) size(3000) saving(TG_R5000_S3000): summarize AFTRIG

bootstrap: First call to summarize with data as is:

    Variable |      Obs        Mean    Std. Dev.      Min       Max
      AFTRIG |     5000    169.0732    110.6217        27      1000

Warning: Since summarize is not an estimation command or does not set e(sample), bootstrap assumes that all observations are used (same warning as above).

Bootstrap replications (5000)

Below is a partial list of the sample means.

Sample 1
    Variable |      Obs        Mean    Std. Dev.      Min       Max
      AFTRIG |     3000    164.9087    101.8332        28       900

Sample 2
      AFTRIG |     3000    170.2687    107.4036        27       933

Sample 3
      AFTRIG |     3000     167.427    108.1541        27      1000

Sample 4
      AFTRIG |     3000     166.026    105.4002        27       982

Sample 5
      AFTRIG |     3000    168.3437    107.5008        27       933

etc.

Let us look at the results of all 5000 samples of size 3000.

. use TG_R5000_S3000.dta
(bootstrap: summarize)

. des

Contains data from TG_R5000_S3000.dta
  obs:  5,000                      bootstrap: summarize
 vars:      2                      20 Oct 2008 14:11
 size: 60,000 (99.8% of memory free)

              storage  display     value
variable name   type   format      label    variable label
TGmeans         float  %9.0g                r(mean)
TGvariances     float  %9.0g                r(var)

Sorted by:

Below is the mean of the 5000 means.

. sum(TGmeans),det

                          r(mean)
-------------------------------------------------------------
      Percentiles      Smallest
 1%   164.5487       161.3803
 5%   165.7843       162.6567
10%   166.4912       162.6573       Obs                5000
25%   167.6655       162.946        Sum of Wgt.        5000
50%   169.0547                      Mean           169.0712   <- a pretty good estimate of 169.0732
                      Largest       Std. Dev.      2.022116
75%    170.418       175.5127
90%    171.635       175.5527       Variance       4.088954
95%   172.4112       175.7897       Skewness       .0816943
99%   174.0015       176.6107       Kurtosis       3.032509

. list TGmeans TGvariances in 1/5

     +---------------------+
     |  TGmeans    TGvari~s |
     |---------------------|
  1. | 164.9087       10370 |
  2. | 170.2687    11535.54 |
  3. |  167.427    11697.31 |
  4. |  166.026     11109.2 |
  5. | 168.3437    11556.43 |
     +---------------------+

Notice that these 5 means match up with the means of the 5 samples listed above. The variances are the SDs squared from the 5 samples above.

[Histogram: Bootstrapping using samplingAHT_5000.dta, seed = 50, reps = 5000 and size = 3000. Frequency (0 to 800) on the y-axis; r(mean) (160 to 180) on the x-axis, with a normal curve superimposed.]

The normal curve that I have superimposed on the histogram has the same mean (169.0712) as the distribution of the 5000 AFTRIG means. The distribution of the 5000 AFTRIG means has kurtosis 3.03 (pretty close to 3) and skewness 0.08 (pretty close to 0). So the distribution matches up pretty well with a normal distribution.

You will see that the table below now matches up with the Stata runs above. The n in the table below is the size of the samples.

[Table: for each combination of number of repetitions and sample size n, the Mean, SD, Min, Max, Skewness, and Kurtosis of the bootstrap distribution of means.]

There are a number of things to notice in the table above.

1. As the number of repetitions and the sample size get larger, the values in the column labeled Mean get closer to 169.0732 (the mean of the 5000 AFTRIG values). This is Fact 1:

       μ_X̄ = μ_X

2. As the number of repetitions and the sample size get larger, the values in the column labeled SD (i.e., the standard deviation of the distribution of means) begin to look like σ/√n, where n is the size of the samples selected. This is Fact 2:

       Var(X̄) = σ²_X̄ = σ²_X / n

   or, taking the square root of each of the terms above,

       σ_X̄ = σ_X / √n

   The SD of the sampling distribution of means is the SEM (standard error of the mean).

3. As the number of repetitions and the sample size get larger, the values in the column labeled Min get larger and those in the column labeled Max get smaller.

4. As the number of repetitions and the sample size get larger, the values in the column labeled Skewness get closer to zero and the values in the column labeled Kurtosis get closer to 3.
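Facts 1 and 2 can be checked empirically in a few lines of Python. This is a sketch: the skewed synthetic population below is an assumption standing in for AFTRIG, and the sample size and repetition count are chosen only to keep the run fast.

```python
import random
import statistics

random.seed(50)
# Synthetic stand-in for the 5,000 AFTRIG values (an assumption).
population = [random.expovariate(1 / 110) + 27 for _ in range(5000)]
mu = statistics.mean(population)      # population mean
sigma = statistics.pstdev(population) # population SD

n, reps = 400, 2000
means = [statistics.mean(random.choices(population, k=n)) for _ in range(reps)]

mean_of_means = statistics.mean(means)  # Fact 1: should be close to mu
sd_of_means = statistics.stdev(means)   # Fact 2: should be close to the SEM
sem = sigma / n ** 0.5                  # sigma / sqrt(n)
```

With enough repetitions, `mean_of_means` sits on top of `mu` and `sd_of_means` sits on top of `sem`, which is exactly what the Mean and SD columns of the table show as reps and n grow.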