Chapter 6 Part 3 October 21, 2008 Bootstrapping
From the internet: The bootstrap involves repeated re-estimation of a parameter using random samples with replacement from the original data. Because the sampling is with replacement, some items in the data set are selected two or more times and others are not selected at all. When this is repeated a hundred or a thousand times, we get pseudo-samples that behave similarly to the underlying distribution of the data.

Bootstrapping by Hand - Sampling with replacement

Original 5000 observations:

. sum(aftrig),det

                 Fast Triglycerides-BL Anti
-------------------------------------------------------------
      Percentiles      Smallest
 1%           43             27
 5%         59.5             27
10%           71             27       Obs                5000
25%           97             28       Sum of Wgt.        5000

50%          140                      Mean           169.0732
                        Largest       Std. Dev.      110.6217
75%          207            933
90%          301            936       Variance       12237.16
95%          377            982       Skewness       2.412733
99%        562.5           1000       Kurtosis       12.38013

. return list

scalars:
        r(n) = 5000
    r(sum_w) = 5000
     r(mean) = 169.0732
      r(var) = 12237.16167409482

[Figure: histogram of fasting triglycerides for the original dataset of 5000. Vertical axis: Frequency (0 to 600). Horizontal tick marks: 0, 169.07 (the mean), 250, 500, 750, 1000.]

I have omitted the x-axis to make it easier for you to see the very small bars on the right hand tail. You can see that we have a very skewed distribution: skewness = 2.4 (as opposed to 0).
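The resampling idea quoted at the top of this section can be sketched in a few lines of Python (not the Stata used in this handout). This is only an illustration under stated assumptions: the ten values in `toy` and the helper name `bootstrap_means` are made up for the example and are not taken from AFTRIG.

```python
import random
import statistics

def bootstrap_means(data, reps, size, seed):
    """Draw `reps` samples of `size` items WITH replacement; return their means."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [statistics.mean(rng.choices(data, k=size)) for _ in range(reps)]

# Hypothetical, right-skewed toy data (NOT the AFTRIG values).
toy = [60, 75, 80, 95, 110, 130, 150, 200, 320, 600]

# One resample of the same size as the data. Because sampling is with
# replacement, some values can appear more than once and others not at all.
rng = random.Random(50)
resample = rng.choices(toy, k=len(toy))
omitted = sorted(set(toy) - set(resample))
print("resample:      ", sorted(resample))
print("never selected:", omitted)

# Repeat the re-estimation many times to get a collection of means.
means = bootstrap_means(toy, reps=1000, size=len(toy), seed=50)
print("mean of toy data:           ", statistics.mean(toy))
print("mean of 1000 resample means:", round(statistics.mean(means), 2))
```

The seed plays the same role as `set seed` in Stata: it makes the random draws repeatable, although Python and Stata will not produce the same draws from the same seed.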
Below we have bootstrapping by hand. I have selected 4 samples, each of size 10, and obtained the mean of each set of 10.

      log:  W:\WP51\Biometry\AAAABiostat1725_Fall2008\Handouts\Chapter 6\data\Chapter6Part3.log
 log type:  text
opened on:  20 Oct 2008, 16:32:28

. *dofile used is sample4setsof10.do
. use "W:\WP51\Biometry\AAAABiostat1725_Fall2008\Handouts\Chapter 6\data\samplingAHT_5000.dta"
. sum(aftrig)

    Variable |   Obs       Mean   Std. Dev.    Min    Max
      AFTRIG |  5000   169.0732   110.6217      27   1000

. sort ID
. set seed 50
. bsample 10
. list ID AFTRIG

     +---------------+
     |   ID   AFTRIG |
     |---------------|
  1. | 4631      269 |
  2. | 3695       74 |
  3. | 3001       73 |
  4. | 2035      131 |
  5. | 1364       80 |
     |---------------|
  6. | 4947       81 |
  7. | 2529      115 |
  8. | 2616      104 |
  9. | 3424      168 |
 10. | 4439      185 |
     +---------------+

. sum(aftrig)

    Variable |   Obs       Mean   Std. Dev.    Min    Max
      AFTRIG |    10        128   63.19107      73    269

. clear
. use "W:\WP51\Biometry\AAAABiostat1725_Fall2008\Handouts\Chapter 6\data\samplingAHT_5000.dta"
. sort ID
. set seed 51
. bsample 10
. list ID AFTRIG

     +---------------+
     |   ID   AFTRIG |
     |---------------|
  1. | 1230      223 |
  2. | 3650      427 |
  3. | 3376      454 |
  4. | 3686      393 |
  5. | 4816       86 |
     |---------------|
  6. | 1336      139 |
  7. | 4139      150 |
  8. | 2299       56 |
  9. |  706      111 |
 10. |  897      113 |
     +---------------+

. sum(aftrig)

    Variable |   Obs       Mean   Std. Dev.    Min    Max
      AFTRIG |    10      215.2   151.6412      56    454

. clear
. use "W:\WP51\Biometry\AAAABiostat1725_Fall2008\Handouts\Chapter 6\data\samplingAHT_5000.dta"
. sort ID
. set seed 52
. bsample 10
. list ID AFTRIG

     +---------------+
     |   ID   AFTRIG |
     |---------------|
  1. | 1812      146 |
  2. | 1495      184 |
  3. | 2742      119 |
  4. | 1265      103 |
  5. | 2036       91 |
     |---------------|
  6. | 1699       85 |
  7. | 4579      131 |
  8. | 1329       70 |
  9. |  510      191 |
 10. | 1511      132 |
     +---------------+

. sum(aftrig)

    Variable |   Obs       Mean   Std. Dev.    Min    Max
      AFTRIG |    10      125.2   40.36445      70    191

. clear
. use "W:\WP51\Biometry\AAAABiostat1725_Fall2008\Handouts\Chapter 6\data\samplingAHT_5000.dta"
. sort ID
. set seed 53
. bsample 10
. list ID AFTRIG

     +---------------+
     |   ID   AFTRIG |
     |---------------|
  1. | 4219       87 |
  2. | 3815      116 |
  3. | 2833      186 |
  4. | 3260      148 |
  5. | 2819      112 |
     |---------------|
  6. | 2055      150 |
  7. | 4161      103 |
  8. | 3753      179 |
  9. | 1522       58 |
 10. |  842      222 |
     +---------------+

. sum(aftrig)

    Variable |   Obs       Mean   Std. Dev.    Min    Max
      AFTRIG |    10      136.1   50.14966      58    222

. log close
      log:  W:\WP51\Biometry\AAAABiostat1725_Fall2008\Handouts\Chapter 6\data\Chapter6Part3.log
 log type:  text
closed on:  20 Oct 2008, 16:32:28

So from each of the 4 sets of 10 observations we obtained an estimate of the mean value of the population of 5000 fasting triglycerides (128, 215.2, 125.2 and 136.1).

If we had selected 1000 samples of size 10, we would have 1000 estimates of the mean of fasting triglycerides (one for each sample). You could then create a histogram of the 1000 means. This collection of 1000 means is called the sampling distribution of means. If for each of the 1000 samples we had asked for the variance instead of the mean, then we would have the sampling distribution of variances, and we could obtain the mean of the 1000 variances.

Below is how you get the means for the sampling distribution of means using a single command rather than a separate command for each sample.

We are assuming that the dataset samplingaht_5000.dta represents a population of people. That is, instead of treating it like the sample that it is, we are going to treat it as though it is a population.

Results from dofile:
. do "W:\WP51\Biometry\AAAABiostat1725_Fall2008\Handouts\Chapter 6\data\sample4setsof10singlecommand.do"
. clear
. log using "W:\WP51\Biometry\AAAABiostat1725_Fall2008\Handouts\Chapter 6\data\Chapter6Part3No2.log"
-----------------------------------------------------------------------------
      log:  W:\WP51\Biometry\AAAABiostat1725_Fall2008\Handouts\Chapter 6\data\Chapter6Part3No2.log
 log type:  text
opened on:  20 Oct 2008, 18:28:17

. *dofile used is sample4setsof10singlecommand.do
. use "W:\WP51\Biometry\AAAABiostat1725_Fall2008\Handouts\Chapter 6\data\samplingAHT_5000.dta"
. sort ID
. set seed 50
. bs TGmeans = r(mean) TGvariances = r(var), reps(4) size(10) noisily saving(tg_r4_s10) :summarize AFTRIG

bootstrap: First call to summarize with data as is:

This first summarize is for all 5000 people.

    Variable |   Obs       Mean   Std. Dev.    Min    Max
      AFTRIG |  5000   169.0732   110.6217      27   1000

Warning: Since summarize is not an estimation command or does not set
         e(sample), bootstrap has no way to determine which observations are
         used in calculating the statistics and so assumes that all
         observations are used. This means no observations will be excluded
         from the resampling because of missing values or other reasons.

         If the assumption is not true, press Break, save the data, and drop
         the observations that are to be excluded. Be sure that the dataset
         in memory contains only the relevant data.

Bootstrap replications (4)

This is the mean of the first sample of size 10.

    Variable |   Obs       Mean   Std. Dev.    Min    Max
      AFTRIG |    10        128   63.19107      73    269

This is the mean of the second sample of size 10.
    Variable |   Obs       Mean   Std. Dev.    Min    Max
      AFTRIG |    10      229.7   262.1522      78    933

This is the mean of the third sample of size 10.

    Variable |   Obs       Mean   Std. Dev.    Min    Max
      AFTRIG |    10      162.1   77.60649      94    290

This is the mean of the fourth sample of size 10.

    Variable |   Obs       Mean   Std. Dev.    Min    Max
      AFTRIG |    10      141.5   60.25732      63    240

Bootstrap results                               Number of obs      =      5000
                                                Replications       =         4

      command:  summarize AFTRIG
      TGmeans:  r(mean)
  TGvariances:  r(var)

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     TGmeans |   169.0732   45.14911     3.74   0.000     80.58257    257.5638
 TGvariances |   12237.16   32104.68     0.38   0.703    -50686.86    75161.19
------------------------------------------------------------------------------

. clear

The observed coefficients above are the mean (169.1) and variance (12237.16) of AFTRIG with all 5000 participants.

As part of the bootstrapping routine we asked Stata to save the mean and variance for each of the 4 samples of size 10 in a data set that we called TG_R4_S10.dta. Notice below that the data set has only 4 observations because we asked for only 4 replications.

. use TG_R4_S10.dta
(bootstrap: summarize)
. des

Contains data from TG_R4_S10.dta
  obs:             4                          bootstrap: summarize
 vars:             2                          20 Oct 2008 18:28
 size:            48 (99.9% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
TGmeans         float  %9.0g                  r(mean)
TGvariances     float  %9.0g                  r(var)
-------------------------------------------------------------------------------
Sorted by:

. list TGmeans TGvariances

     +--------------------+
     | TGmeans   TGvari~s |
     |--------------------|
  1. |     128   3993.111 |
  2. |   229.7   68723.79 |
  3. |   162.1   6022.767 |
  4. |   141.5   3630.944 |
     +--------------------+

Notice that only the first of these means (128) matches a mean obtained by hand. Both versions start from the same seed (50), but the by-hand version sets a new seed (51, 52, 53) before each later sample, whereas the single command sets the seed once and keeps drawing from it.

Notice below that the mean of the 4 sample means is 165.3, which is not very close to 169.1, the mean of the population. We need to select more than 4 samples, and samples larger than 10, to get a good estimate of the population mean.

. sum(tgmeans),det

                           r(mean)
-------------------------------------------------------------
      Percentiles      Smallest
 1%          128            128
 5%          128          141.5
10%          128          162.1       Obs                   4
25%       134.75          229.7       Sum of Wgt.           4

50%        151.8                      Mean            165.325
                        Largest       Std. Dev.      45.14911
75%        195.9            128
90%        229.7          141.5       Variance       2038.442
95%        229.7          162.1       Skewness       .8415428
99%        229.7          229.7       Kurtosis       2.078988

.
end of do-file

Below is a data set of 1000 samples of size 100 which we obtained from the original dataset of 5000.
Notice below that the mean of our 1000 samples of size 100 is 168.6493. Now we are getting closer to 169.0732, the mean of the original distribution of size 5000.

. clear
. use "W:\WP51\Biometry\AAAABiostatFall2008\Handouts\Chapter 6\Data\samplingAHT_5000.dta", clear
. log using W:\WP51\Biometry\AAAABiostatFall2008\Handouts\Chapter 6\Data\classbootstrap.log
--------
      log:  classbootstrap.log
 log type:  text
opened on:  20 Oct 2008, 22:31:13

. set more off
. sort ID
. set seed 50
. bs TGmeans = r(mean) TGvariances = r(var), reps(1000) size(100) saving(tg_r1000_s100):summarize AFTRIG
(running summarize on estimation sample)

Warning: Since summarize is not an estimation command or does not set
         e(sample), bootstrap has no way to determine which observations are
         used in calculating the statistics and so assumes that all
         observations are used. This means no observations will be excluded
         from the resampling because of missing values or other reasons.

         If the assumption is not true, press Break, save the data, and drop
         the observations that are to be excluded. Be sure that the dataset
         in memory contains only the relevant data.

Bootstrap replications (1000)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
...    50
...   100
...   150
...   200
...   250
...   900
...   950
...  1000

Bootstrap results                               Number of obs      =      5000
                                                Replications       =      1000

      command:  summarize AFTRIG
      TGmeans:  r(mean)
  TGvariances:  r(var)

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     TGmeans |   169.0732   11.13638    15.18   0.000     147.2463    190.9001
 TGvariances |   12237.16   4060.487     3.01   0.003     4278.753    20195.57
------------------------------------------------------------------------------

. clear

Note that the observed coefficient above gives the mean and variance of the original data set of 5000.
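The z, P>|z| and "Normal-based" interval columns in the bootstrap results above appear to follow from the observed coefficient and the bootstrap standard error alone. The Python sketch below (an illustration, not Stata's own code) recomputes them for the 1000-replication run. The function name `normal_based_summary` is made up for this example, and the inputs are the rounded values printed in the table, so the last digit can differ slightly.

```python
from statistics import NormalDist

def normal_based_summary(coef, std_err, level=0.95):
    """Return (z, two-sided p-value, lower limit, upper limit)."""
    std_normal = NormalDist()                       # mean 0, SD 1
    z_crit = std_normal.inv_cdf(0.5 + level / 2)    # about 1.96 for 95%
    z = coef / std_err                              # tests coefficient = 0
    p = 2 * (1 - std_normal.cdf(abs(z)))
    return z, p, coef - z_crit * std_err, coef + z_crit * std_err

# Values copied from the reps(1000) size(100) table above.
z_m, p_m, lo_m, hi_m = normal_based_summary(169.0732, 11.13638)   # TGmeans
z_v, p_v, lo_v, hi_v = normal_based_summary(12237.16, 4060.487)   # TGvariances

print(f"TGmeans:     z = {z_m:.2f}, p = {p_m:.3f}, CI = [{lo_m:.4f}, {hi_m:.4f}]")
print(f"TGvariances: z = {z_v:.2f}, p = {p_v:.3f}, CI = [{lo_v:.3f}, {hi_v:.3f}]")
```

Because the interval is the coefficient plus or minus about 1.96 standard errors, it is always symmetric about the observed coefficient. That is why the 4-replication run earlier could report a negative lower limit for a variance.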
. use "W:\WP51\Biometry\AAAABiostatFall2008\Handouts\Chapter 6\Data\TG_R1000_S100.dta"
(bootstrap: summarize)

. des

Contains data from TG_R1000_S100.dta
  obs:         1,000                          bootstrap: summarize
 vars:             2                          11 Oct 2007 15:01
 size:        12,000 (99.9% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
TGmeans         float  %9.0g                  r(mean)
TGvariances     float  %9.0g                  r(var)
-------------------------------------------------------------------------------
Sorted by:

. sum(tgmeans),det

                           r(mean)
-------------------------------------------------------------
      Percentiles      Smallest
 1%      145.205         139.57
 5%      151.605         142.66
10%       154.67         142.97       Obs                1000
25%      161.015         143.12       Sum of Wgt.        1000

50%       168.07                      Mean           168.6493
                        Largest       Std. Dev.      10.87832
75%      175.655         201.48
90%      182.885         202.55       Variance       118.3378
95%      187.825         202.81       Skewness         .21841
99%       196.41         205.24       Kurtosis       3.022783

Notice that we now have a better estimate (168.6) of the population mean 169.1 than the 165.3 we got from 4 samples of size 10.

. list TGmeans TGvariances in 1/8

     +--------------------+
     | TGmeans   TGvari~s |
     |--------------------|
  1. |  163.34   12052.33 |
  2. |  172.58   13293.34 |
  3. |  163.69   11420.68 |
  4. |  163.79   13522.59 |
  5. |  162.91   7851.315 |
     |--------------------|
  6. |  184.65   21049.91 |
  7. |  172.02    13014.1 |
  8. |  185.27   21222.18 |
     +--------------------+

The graph below is a histogram of the 1000 means we got above. Notice that this histogram looks rather symmetric and not at all like the very skewed histogram of the original variable AFTRIG.
[Figure: histogram of the 1000 sample means. Title: Bootstrapping using samplingaht_5000.dta, seed = 50, reps = 1000 and size = 100. Vertical axis: Frequency (0 to 150). Horizontal axis: r(mean) (140 to 200).]

Now let us get more samples and larger samples than we have before. Below we have the output and histogram of 5000 samples of size 3000 selected from the original distribution of AFTRIG.

. use "W:\WP51\Biometry\AAAABiostatFall2008\Handouts\Chapter 6\Data\samplingAHT_5000.dta",clear
. set more off
. sort ID
. set seed 50
. bs TGmeans = r(mean) TGvariances = r(var), reps(5000) size(3000) saving(tg_r5000_s3000):summarize AFTRIG

bootstrap: First call to summarize with data as is:

    Variable |   Obs       Mean   Std. Dev.    Min    Max
      AFTRIG |  5000   169.0732   110.6217      27   1000

Warning: Since summarize is not an estimation command or does not set
         e(sample), bootstrap has no way to determine which observations are
         used in calculating the statistics and so assumes that all
         observations are used. This means no observations will be excluded
         from the resampling because of missing values or other reasons.

         If the assumption is not true, press Break, save the data, and drop
         the observations that are to be excluded. Be sure that the dataset
         in memory contains only the relevant data.

Bootstrap replications (5000)
Below is a partial list of the sample means.

                Variable |   Obs       Mean   Std. Dev.    Min    Max
    Sample 1      AFTRIG |  3000   164.9087   101.8332      28    900
    Sample 2      AFTRIG |  3000   170.2687   107.4036      27    933
    Sample 3      AFTRIG |  3000    167.427   108.1541      27   1000
    Sample 4      AFTRIG |  3000    166.026   105.4002      27    982
    Sample 5      AFTRIG |  3000   168.3437   107.5008      27    933
    etc.

Let us look at the results of all 5000 samples of size 3000.

. use TG_R5000_S3000.dta
(bootstrap: summarize)
. des

Contains data from TG_R5000_S3000.dta
  obs:         5,000                          bootstrap: summarize
 vars:             2                          20 Oct 2008 14:11
 size:        60,000 (99.8% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
TGmeans         float  %9.0g                  r(mean)
TGvariances     float  %9.0g                  r(var)
-------------------------------------------------------------------------------
Sorted by:

Below is the mean of the 5000 means.

. sum(tgmeans),det

                           r(mean)
-------------------------------------------------------------
      Percentiles      Smallest
 1%     164.5487       161.3803
 5%     165.7843       162.6567
10%     166.4912       162.6573       Obs                5000
25%     167.6655        162.946       Sum of Wgt.        5000

50%     169.0547                      Mean           169.0712
                        Largest       Std. Dev.      2.022116
75%      170.418       175.5127
90%      171.635       175.5527       Variance       4.088954
95%     172.4112       175.7897       Skewness       .0816943
99%     174.0015       176.6107       Kurtosis       3.032509

The mean, 169.0712, is a pretty good estimate of 169.0732.

. list TGmeans TGvariances in 1/5

     +---------------------+
     |  TGmeans   TGvari~s |
     |---------------------|
  1. | 164.9087      10370 |
  2. | 170.2687   11535.54 |
  3. |  167.427   11697.31 |
  4. |  166.026    11109.2 |
  5. | 168.3437   11556.43 |
     +---------------------+

Notice that these 5 means match up with the means of the 5 samples listed above. The variances are the SDs squared from the 5 samples above.
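The remark that the saved variances are the squared SDs of the five samples can be checked by hand or with a short Python sketch like the one below. All numbers are copied from the two listings above; the variable names are made up for the example. Small differences in the last digit are expected because the printed SDs are rounded.

```python
# (mean, SD) from the partial list of samples, and the saved variance
# (TGvariances) from the first five rows of TG_R5000_S3000.dta.
samples = [
    (164.9087, 101.8332, 10370.0),
    (170.2687, 107.4036, 11535.54),
    (167.427,  108.1541, 11697.31),
    (166.026,  105.4002, 11109.2),
    (168.3437, 107.5008, 11556.43),
]

differences = []
for mean, sd, saved_var in samples:
    squared_sd = sd ** 2                    # variance = SD squared
    differences.append(abs(squared_sd - saved_var))
    print(f"mean {mean:9.4f}  SD^2 = {squared_sd:10.2f}  saved variance = {saved_var:10.2f}")
```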
[Figure: histogram of the 5000 sample means with a normal curve superimposed. Title: Bootstrapping using samplingaht_5000.dta, seed = 50, reps = 5000 and size = 3000. Vertical axis: Frequency (0 to 800). Horizontal axis: r(mean) (160 to 180).]

The normal curve that I have superimposed on the histogram has the same mean (169.0712) as the distribution of the 5000 AFTRIG means. The distribution of the 5000 AFTRIG means has kurtosis 3.03 (that is pretty close to 3) and skewness 0.08 (which is pretty close to 0). So the distribution matches up pretty well with a normal distribution.

You will see that the table below now matches up with the Stata runs above. The n in the table below is the size of the samples.
There are a number of things to notice in the table above.

1. As the number of repetitions and the sample size get larger, the values in the column labeled means get closer to 169.0732 (the mean of the 5000 AFTRIG values). This is Fact 1:

   $\mu_{\bar{X}} = \mu_X$

2. As the number of repetitions and the sample size get larger, the values in the column labeled SD (i.e. the standard deviation of the distribution of means) begin to look like the column labeled $\sigma/\sqrt{n}$. The $n$ is the size of the samples selected. This is Fact 2:

   $\mathrm{Var}(\bar{X}) = \sigma^2_{\bar{X}} = \dfrac{\sigma^2_X}{n} = \dfrac{\mathrm{Var}(X)}{n}$

   or, taking the square root of each of the terms above, we get

   $\mathrm{SD}(\bar{X}) = \sigma_{\bar{X}} = \dfrac{\sigma_X}{\sqrt{n}}$

   The SD of the sampling distribution of means is the SEM.

3. As the number of repetitions and the sample size get larger, the values in the column labeled Min get larger and those in the column labeled Max get smaller.

4. As the number of repetitions and the sample size get larger, the values in the column labeled skewness get closer to zero and the values in the column labeled kurtosis get closer to 3.
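Fact 2 can be checked against the Stata runs shown earlier. The Python sketch below (an illustration, not part of the Stata do-files) takes the population SD of AFTRIG, 110.6217, as sigma and compares sigma divided by the square root of n with the SD of the saved sample means from each run. The dictionary and variable names are made up for the example.

```python
import math

SIGMA = 110.6217  # Std. Dev. of AFTRIG in the full dataset of 5000

# sample size n -> SD of the saved sample means, copied from the
# "sum(tgmeans),det" output of each run above
# (4 reps of size 10, 1000 reps of size 100, 5000 reps of size 3000).
observed_sd_of_means = {10: 45.14911, 100: 10.87832, 3000: 2.022116}

relative_errors = {}
for n, observed in observed_sd_of_means.items():
    predicted = SIGMA / math.sqrt(n)        # Fact 2: sigma / sqrt(n)
    relative_errors[n] = abs(observed - predicted) / predicted
    print(f"n = {n:4d}: sigma/sqrt(n) = {predicted:8.4f}, "
          f"observed SD of means = {observed:8.4f}, "
          f"relative difference = {relative_errors[n]:.1%}")
```

With only 4 samples of size 10 the observed SD is far from sigma/sqrt(n). With 5000 samples of size 3000 the two agree to about two decimal places.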