Chapter 6 Part 3, October 21, 2008: Bootstrapping

From the internet: The bootstrap involves repeated re-estimation of a parameter using random samples drawn with replacement from the original data. Because the sampling is with replacement, some items in the data set are selected two or more times and others are not selected at all. When this is repeated a hundred or a thousand times, we get pseudo-samples that behave similarly to the underlying distribution of the data.

Bootstrapping by Hand - Sampling with replacement

Original 5000 observations:

. sum(AFTRIG),det

        Fast Triglycerides-BL Anti
 50%    140

. return list

scalars:
        r(N) = 5000
    r(sum_w) = 5000
     r(mean) =
      r(var) =

[Figure: histogram of fasting triglycerides for the original dataset of 5000; y-axis: Frequency]

I have omitted the x-axis to make it easier for you to see the very small bars in the right-hand tail. You can see that we have a very skewed distribution: skewness = 2.4 (as opposed to 0).
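The quoted description of sampling with replacement can be illustrated with a small sketch. This is Python rather than Stata, and the ten-item `data` list, the function name `bootstrap_sample`, and the seed are illustrative assumptions; the draws will not match Stata's `bsample`.

```python
import random
from collections import Counter

def bootstrap_sample(data, size, rng):
    """Draw `size` items from `data` with replacement."""
    return [rng.choice(data) for _ in range(size)]

# Seed echoes the handout's `set seed 50`; Python's generator differs from Stata's.
rng = random.Random(50)

# Hypothetical stand-in for ten observation IDs (not the AFTRIG data).
data = list(range(1, 11))

sample = bootstrap_sample(data, len(data), rng)
counts = Counter(sample)

# Items drawn two or more times, and items never drawn at all.
repeated = sorted(x for x, c in counts.items() if c >= 2)
never_drawn = sorted(set(data) - set(sample))

print("sample:", sample)
print("drawn two or more times:", repeated)
print("never drawn:", never_drawn)
```

Because the sample has the same size as the data, any repeated item forces at least one other item to be left out, which is the behavior the quotation describes.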

Below we have bootstrapping by hand. I have selected 4 samples, each of size 10, and obtained the mean of each set of 10.

log: W:\WP51\Biometry\AAAABiostat1725_Fall2008\Handouts\Chapter 6\data\Chapter6Part3.log
log type: text
opened on: 20 Oct 2008, 16:32:28

. *dofile used is sample4setsof10.do
. use "W:\WP51\Biometry\AAAABiostat1725_Fall2008\Handouts\Chapter 6\data\samplingAHT_5000.dta"
. sum(AFTRIG)
    AFTRIG
. sort ID
. set seed 50
. bsample 10
. list ID AFTRIG
    ID   AFTRIG
. sum(AFTRIG)
    AFTRIG
. clear

. use "W:\WP51\Biometry\AAAABiostat1725_Fall2008\Handouts\Chapter 6\data\samplingAHT_5000.dta"
. sort ID
. set seed 51
. bsample 10
. list ID AFTRIG
    ID   AFTRIG
. sum(AFTRIG)
    AFTRIG
. clear

. use "W:\WP51\Biometry\AAAABiostat1725_Fall2008\Handouts\Chapter 6\data\samplingAHT_5000.dta"
. sort ID
. set seed 52
. bsample 10
. list ID AFTRIG
    ID   AFTRIG
. sum(AFTRIG)
    AFTRIG
. clear

. use "W:\WP51\Biometry\AAAABiostat1725_Fall2008\Handouts\Chapter 6\data\samplingAHT_5000.dta"
. sort ID
. set seed 53
. bsample 10
. list ID AFTRIG
    ID   AFTRIG
. sum(AFTRIG)
    AFTRIG
. log close

log: W:\WP51\Biometry\AAAABiostat1725_Fall2008\Handouts\Chapter 6\data\Chapter6Part3.log
log type: text
closed on: 20 Oct 2008, 16:32:28

So from each of the 4 sets of 10 observations we obtained an estimate of the mean of the population of 5000 fasting triglyceride values. If we had selected 1000 samples of size 10, we would have 1000 estimates of the mean of fasting triglycerides (one for each sample). You could then create a histogram of the 1000 means. The distribution of these 1000 means is called the sampling distribution of means. If, for each of the 1000 samples, we had asked for the variance instead of the mean, then we would have the sampling distribution of variances, and we could obtain the mean of the 1000 variances.

Below is how you get the means for the sampling distribution of means using a single command rather than a separate command for each sample. We are assuming that the dataset samplingAHT_5000.dta represents a population of people. That is, instead of treating it like the sample that it is, we are going to treat it as though it is a population.

Results from dofile:
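The idea of collecting one mean and one variance per sample can be sketched as follows. This is a Python illustration, not the Stata run: the lognormal `population` is a hypothetical right-skewed stand-in for the 5000 AFTRIG values, and `bootstrap_stats` is a name invented for this example.

```python
import random
import statistics

rng = random.Random(50)  # illustrative seed; results will not match Stata's

# Hypothetical right-skewed "population" of 5000 values (NOT the AFTRIG data).
population = [rng.lognormvariate(5.0, 0.5) for _ in range(5000)]

def bootstrap_stats(pop, reps, size, rng):
    """Return the mean and the variance of each of `reps` samples of `size`,
    every sample drawn from `pop` with replacement."""
    means, variances = [], []
    for _ in range(reps):
        sample = rng.choices(pop, k=size)  # sampling with replacement
        means.append(statistics.fmean(sample))
        variances.append(statistics.variance(sample))
    return means, variances

# 1000 samples of size 10, as in the thought experiment in the text.
means, variances = bootstrap_stats(population, reps=1000, size=10, rng=rng)

print("population mean:       ", round(statistics.fmean(population), 2))
print("mean of 1000 means:    ", round(statistics.fmean(means), 2))
print("mean of 1000 variances:", round(statistics.fmean(variances), 2))
```

The two lists play the role of the sampling distribution of means and the sampling distribution of variances; a histogram of `means` would be the plot the text suggests.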

. do "W:\WP51\Biometry\AAAABiostat1725_Fall2008\Handouts\Chapter 6\data\sample4setsof10singlecommand.do"
. clear
. log using "W:\WP51\Biometry\AAAABiostat1725_Fall2008\Handouts\Chapter 6\data\Chapter6Part3No2.log"

log: W:\WP51\Biometry\AAAABiostat1725_Fall2008\Handouts\Chapter 6\data\Chapter6Part3No2.log
log type: text
opened on: 20 Oct 2008, 18:28:17

. *dofile used is sample4setsof10singlecommand.do
. use "W:\WP51\Biometry\AAAABiostat1725_Fall2008\Handouts\Chapter 6\data\samplingAHT_5000.dta"
. sort ID
. set seed 50
. bs TGmeans = r(mean) TGvariances = r(var), reps(4) size(10) noisily saving(TG_R4_S10) :summarize AFTRIG

bootstrap: First call to summarize with data as is:

This first summarize is for all 5000 people.
    AFTRIG

Warning: Since summarize is not an estimation command or does not set e(sample), bootstrap has no way to determine which observations are used in calculating the statistics and so assumes that all observations are used. This means no observations will be excluded from the resampling because of missing values or other reasons.

If the assumption is not true, press Break, save the data, and drop the observations that are to be excluded. Be sure that the dataset in memory contains only the relevant data.

Bootstrap replications (4)

This is the mean of the first sample of size 10.
    AFTRIG
This is the mean of the second sample of size 10.
    AFTRIG
This is the mean of the third sample of size 10.
    AFTRIG
This is the mean of the fourth sample of size 10.
    AFTRIG

Bootstrap results                    Number of obs = 5000
                                     Replications  = 4

command: summarize AFTRIG
TGmeans: r(mean)
TGvariances: r(var)

             | Observed   Bootstrap                    Normal-based
             | Coef.      Std. Err.    z    P>|z|    [95% Conf. Interval]
 TGmeans     |
 TGvariances |

. clear

The observed coefficients above are the mean (169.1) and the variance of AFTRIG for all 5000 participants.

As part of the bootstrapping routine we asked Stata to save the mean and variance for each of the 4 samples of size 10 in a data set that we called TG_R4_S10.dta. Notice below that the data set has only 4 observations because we asked for only 4 replications.

. use TG_R4_S10.dta
(bootstrap: summarize)

. des

Contains data from TG_R4_S10.dta
  obs:     4                           bootstrap: summarize
 vars:     2                           20 Oct :28
 size:    48 (99.9% of memory free)

               storage  display  value
variable name  type     format   label   variable label
TGmeans        float    %9.0g            r(mean)
TGvariances    float    %9.0g            r(var)

Sorted by:

. list TGmeans TGvariances
    TGmeans   TGvari~s

Notice that only the first of these means matches the means obtained by hand, because the by-hand version and the single-command version both use the same seed (50).

Notice below that the mean of the 4 sample means is not very close to 169.1, the mean of the population. We need to select more than 4 samples, and samples larger than 10, to get a good estimate of the population mean.

. sum(TGmeans),det

        r(mean)
 Obs            4
 Sum of Wgt.    4

end of do-file

Below is a data set of 1000 samples of size 100, which we obtained from the original dataset of 5000.

Notice below that the mean of the 1000 sample means (samples of size 100) is 168.6. Now we are getting closer to 169.1, the mean of the original distribution of size 5000.

. clear
. use "W:\WP51\Biometry\AAAABiostatFall2008\Handouts\Chapter 6\Data\samplingAHT_5000.dta", clear
. log using W:\WP51\Biometry\AAAABiostatFall2008\Handouts\Chapter 6\Data\classbootstrap.log

log: classbootstrap.log
log type: text
opened on: 20 Oct 2008, 22:31:13

. set more off
. sort ID
. set seed 50
. bs TGmeans = r(mean) TGvariances = r(var), reps(1000) size(100) saving(TG_R1000_S100):summarize AFTRIG
(running summarize on estimation sample)

Warning: Since summarize is not an estimation command or does not set e(sample), bootstrap has no way to determine which observations are used in calculating the statistics and so assumes that all observations are used. This means no observations will be excluded from the resampling because of missing values or other reasons.

If the assumption is not true, press Break, save the data, and drop the observations that are to be excluded. Be sure that the dataset in memory contains only the relevant data.

Bootstrap replications (1000)

Bootstrap results                    Number of obs = 5000
                                     Replications  = 1000

command: summarize AFTRIG
TGmeans: r(mean)
TGvariances: r(var)

             | Observed   Bootstrap                    Normal-based
             | Coef.      Std. Err.    z    P>|z|    [95% Conf. Interval]
 TGmeans     |
 TGvariances |

. clear

Note that the observed coefficients above give the mean and variance of the original data set of 5000.

. use "W:\WP51\Biometry\AAAABiostatFall2008\Handouts\Chapter 6\Data\TG_R1000_S100.dta"
(bootstrap: summarize)

. des

Contains data from TG_R1000_S100.dta
  obs:     1,000                       bootstrap: summarize
 vars:     2                           11 Oct :01
 size:    12,000 (99.9% of memory free)

               storage  display  value
variable name  type     format   label   variable label
TGmeans        float    %9.0g            r(mean)
TGvariances    float    %9.0g            r(var)

Sorted by:

. sum(TGmeans),det

        r(mean)

Notice that we have a better estimate (168.6) of the population mean.

. list TGmeans TGvariances in 1/
    TGmeans   TGvari~s

The graph below is a histogram of the 1000 means we got above. Notice that this histogram looks rather symmetric and not at all like the very skewed histogram of the original variable AFTRIG.

[Figure: histogram titled "Bootstrapping using samplingAHT_5000.dta, seed = 50, reps = 1000 and size = 100"; x-axis: r(mean); y-axis: Frequency]

Now let us get more samples and larger samples than we have before. Below we have the output and histogram of 5000 samples of size 3000 selected from the original distribution of AFTRIG.

. use "W:\WP51\Biometry\AAAABiostatFall2008\Handouts\Chapter 6\Data\samplingAHT_5000.dta",clear
. set more off
. sort ID
. set seed 50
. bs TGmeans = r(mean) TGvariances = r(var), reps(5000) size(3000) saving(TG_R5000_S3000):summarize AFTRIG

bootstrap: First call to summarize with data as is:
    AFTRIG

Warning: Since summarize is not an estimation command or does not set e(sample), bootstrap has no way to determine which observations are used in calculating the statistics and so assumes that all observations are used. This means no observations will be excluded from the resampling because of missing values or other reasons.

If the assumption is not true, press Break, save the data, and drop the observations that are to be excluded. Be sure that the dataset in memory contains only the relevant data.

Bootstrap replications (5000)

Below is a partial list of the sample means.

Sample 1
    AFTRIG
Sample 2
    AFTRIG
Sample 3
    AFTRIG
Sample 4
    AFTRIG
Sample 5
    AFTRIG
etc.

Let us look at the results of all 5000 samples of size 3000.

. use TG_R5000_S3000.dta
(bootstrap: summarize)

. des

Contains data from TG_R5000_S3000.dta
  obs:     5,000                       bootstrap: summarize
 vars:     2                           20 Oct :11
 size:    60,000 (99.8% of memory free)

               storage  display  value
variable name  type     format   label   variable label
TGmeans        float    %9.0g            r(mean)
TGvariances    float    %9.0g            r(var)

Sorted by:

Below is the mean of the 5000 means.

. sum(TGmeans),det

        r(mean)

The mean of the 5000 means is a pretty good estimate of the population mean.

. list TGmeans TGvariances in 1/5
    TGmeans   TGvari~s

Notice that these 5 means match up with the means of the 5 samples listed above. The variances are the squares of the SDs from the samples above.

[Figure: histogram with superimposed normal curve, titled "Bootstrapping using samplingAHT_5000.dta, seed = 50, reps = 5000 and size = 3000"; x-axis: r(mean); y-axis: Frequency]

The normal curve that I have superimposed on the histogram has the same mean as the distribution of the 5000 AFTRIG means. The distribution of the 5000 AFTRIG means has kurtosis 3.03 (pretty close to 3, the kurtosis of a normal distribution) and skewness 0.08 (pretty close to 0). So the distribution matches up pretty well with a normal distribution.

You will see that the table below now matches up with the Stata runs above. The n in the table below is the size of the samples.
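The comparison of shape can be checked numerically with a rough sketch. It is written in Python with a hypothetical lognormal `population` standing in for AFTRIG, and it uses 1000 samples of size 100 to keep the run short. The helper names `skewness` and `kurtosis` are invented here; they use plain moment ratios, under which a normal distribution has skewness 0 and kurtosis 3.

```python
import random
import statistics

def central_moment(xs, k):
    """k-th central moment of the values in xs."""
    m = statistics.fmean(xs)
    return statistics.fmean([(x - m) ** k for x in xs])

def skewness(xs):
    """Moment-ratio skewness: m3 / m2^(3/2); 0 for a symmetric distribution."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

def kurtosis(xs):
    """Moment-ratio kurtosis: m4 / m2^2; 3 for a normal distribution."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2

rng = random.Random(50)  # illustrative seed

# Hypothetical right-skewed "population" (NOT the AFTRIG data).
population = [rng.lognormvariate(5.0, 0.5) for _ in range(5000)]

# Means of 1000 samples of size 100, each drawn with replacement.
means = [statistics.fmean(rng.choices(population, k=100)) for _ in range(1000)]

print("population: skewness", round(skewness(population), 2),
      "kurtosis", round(kurtosis(population), 2))
print("means:      skewness", round(skewness(means), 2),
      "kurtosis", round(kurtosis(means), 2))
```

With larger samples, as in the 5000 samples of size 3000 above, the skewness and kurtosis of the means would be expected to sit even closer to 0 and 3.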

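[Table: summary of the sampling distributions of means by number of repetitions and sample size n, with columns labeled means, SD, σ/√n, Min, Max, skewness and kurtosis]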

There are a number of things to notice in the table above.

1. As the number of repetitions and the sample size get larger, the values in the column labeled means get closer to 169.1 (the mean of the 5000 AFTRIG values). This is Fact 1: $\mu_{\bar{X}} = \mu_X$

2. As the number of repetitions and the sample size get larger, the values in the column labeled SD (i.e. the standard deviation of the distribution of means) begin to look like the column labeled $\sigma/\sqrt{n}$. The $n$ is the size of the samples selected. This is Fact 2: $\mathrm{Var}(\bar{X}) = \sigma^2_{\bar{X}} = \frac{\mathrm{Var}(X)}{n} = \frac{\sigma^2_X}{n}$, or, taking the square root of each of the terms above, we get $\sigma_{\bar{X}} = \frac{\sigma_X}{\sqrt{n}}$. The SD of the sampling distribution of means is the SEM.

3. As the number of repetitions and the sample size get larger, the values in the column labeled Min get larger and those in the column labeled Max get smaller.

4. As the number of repetitions and the sample size get larger, the values in the column labeled skewness get closer to zero and the values in the column labeled kurtosis get closer to 3.
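Facts 1 and 2 can be checked by simulation. The sketch below is a Python illustration under stated assumptions: `population` is a hypothetical lognormal stand-in for the 5000 AFTRIG values, the sample sizes 10, 100 and 400 are arbitrary choices, and `means_of_samples` is a name invented for this example.

```python
import math
import random
import statistics

rng = random.Random(50)  # illustrative seed

# Hypothetical right-skewed "population" (NOT the AFTRIG data).
population = [rng.lognormvariate(5.0, 0.5) for _ in range(5000)]
mu = statistics.fmean(population)       # population mean
sigma = statistics.pstdev(population)   # population SD

def means_of_samples(pop, reps, size, rng):
    """Means of `reps` samples of `size`, each drawn with replacement."""
    return [statistics.fmean(rng.choices(pop, k=size)) for _ in range(reps)]

results = {}
for n in (10, 100, 400):
    means = means_of_samples(population, reps=1000, size=n, rng=rng)
    mean_of_means = statistics.fmean(means)   # Fact 1: should be near mu
    sd_of_means = statistics.stdev(means)     # Fact 2: should be near sigma / sqrt(n)
    sem = sigma / math.sqrt(n)
    results[n] = (mean_of_means, sd_of_means, sem)
    print(f"n={n:4d}  mean of means={mean_of_means:8.2f}  "
          f"SD of means={sd_of_means:6.2f}  sigma/sqrt(n)={sem:6.2f}")

print(f"population mean={mu:.2f}  population SD={sigma:.2f}")
```

The population SD (`pstdev`) is used for $\sigma$ because the 5000 values are being treated as the whole population, as the handout does.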
