Lecture 12: The Bootstrap
Reading: Chapter 5
STATS 202: Data mining and analysis
October 20, 2017
Announcements

- Midterm is on Monday, Oct 30.
- Topics: chapters 1-5 and 10 of the book; everything up to and including today's lecture.
- We will post two practice exams soon.
- Closed book, no notes. All hard equations will be provided.
- SCPD students: if you haven't chosen your proctor already, you must do it ASAP. For guidelines see: http://scpd.stanford.edu/programs/courses/graduate-courses/exam-monitor-information
The learning curve and choosing k in k-fold cross-validation

[Figure: "The learning curve" — 1-Err (y-axis, 0.0 to 0.8) versus size of training set (x-axis, 0 to 200).]

Recall that as we increase k, we decrease the bias but increase the variance of the cross-validation error.

How does the test error change as we increase the size n of the training set? Consider the curve on the left:

- If n = 200, then 5-fold CV estimates the error using datasets of size (4/5) * 200 = 160: this introduces little bias, since the learning curve has essentially flattened by that point.
- If n = 50, then 5-fold CV estimates the error using datasets of size (4/5) * 50 = 40: this introduces more bias, since the curve is still changing steeply there.
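To make the fold sizes concrete, here is a minimal sketch (not from the lecture; the simulated dataset and k-NN model are arbitrary assumptions) that runs 5-fold CV at n = 50 and n = 200 and reports how many points each fit actually trains on:

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

for n in (50, 200):
    # A synthetic classification problem of size n (illustrative choice).
    X, y = make_classification(n_samples=n, random_state=0)
    # 5-fold CV: each of the 5 fits trains on 4/5 of the data.
    scores = cross_val_score(KNeighborsClassifier(), X, y, cv=5)
    print(f"n = {n:3d}: each CV fit trains on {4 * n // 5} points, "
          f"estimated error = {1 - scores.mean():.3f}")
```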
Cross-validation vs. the Bootstrap

- Cross-validation: principally used to estimate prediction error.
- The Bootstrap: principally used to estimate various measures of error or uncertainty of parameter estimates, e.g. the standard error (SE) of a parameter estimate, or confidence intervals for parameters.
- One of the most important techniques in all of Statistics.
- Widely applicable, extremely powerful, computer-intensive method.
- Popularized by Brad Efron, from Stanford.
Standard errors in linear regression

Standard error: the standard deviation (SD) of an estimate computed from a sample of size n.
Classical way to compute Standard Errors

Example: Estimate the variance of a sample x_1, x_2, ..., x_n:

    \hat{\sigma}^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2.

What is the Standard Error of \hat{\sigma}^2?

1. Assume that x_1, ..., x_n are i.i.d. normally distributed.
2. From that assumption one can derive that Var(\hat{\sigma}^2) = \frac{2\sigma^4}{n-1}, and therefore SE(\hat{\sigma}^2) = \sqrt{\frac{2}{n-1}} \, \sigma^2.
3. Problem: we typically don't know σ!
4. So assume \hat{\sigma}^2 is reasonably close to \sigma^2.
5. Then we can use the estimate \widehat{SE}(\hat{\sigma}^2) = \sqrt{\frac{2}{n-1}} \, \hat{\sigma}^2.
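As a sanity check on step 2 (my own sketch, not part of the slides), the following compares the classical formula for SE(\hat{\sigma}^2) against the SD of \hat{\sigma}^2 across many simulated normal samples:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 30, 2.0  # illustrative sample size and true SD

# Classical formula: SE(sigma^2_hat) = sqrt(2 / (n - 1)) * sigma^2.
classical_se = np.sqrt(2 / (n - 1)) * sigma**2

# Monte Carlo: compute sigma^2_hat (ddof=1 divides by n - 1) over
# many normal samples of size n, then take the SD of those estimates.
estimates = [np.var(rng.normal(0.0, sigma, size=n), ddof=1)
             for _ in range(20000)]

print("classical SE:", classical_se)               # ~1.05 for n = 30
print("simulated SE:", np.std(estimates, ddof=1))  # should be close
```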
Limitations of the classical approach

The classical approach works for certain statistics under specific modeling assumptions. However, what happens if:

- The modeling assumptions (for example, x_1, ..., x_n being normal) break down?
- The estimator does not have a simple form, and its sampling distribution cannot be derived analytically?
Example. Investing in two assets

Suppose that X and Y are the returns of two assets. These returns are observed every day: (x_1, y_1), ..., (x_n, y_n).

[Figure: scatter plot of the observed daily returns (x_i, y_i).]
Example. Investing in two assets

We have a fixed amount of money to invest, and we will invest a fraction α in X and a fraction (1 - α) in Y. Therefore, our return will be αX + (1 - α)Y.

Our goal will be to minimize the variance of our return as a function of α. One can show that the optimal α is:

    \alpha = \frac{\sigma_Y^2 - \mathrm{Cov}(X, Y)}{\sigma_X^2 + \sigma_Y^2 - 2\,\mathrm{Cov}(X, Y)}.

Proposal: Use the plug-in estimate:

    \hat{\alpha} = \frac{\hat{\sigma}_Y^2 - \widehat{\mathrm{Cov}}(X, Y)}{\hat{\sigma}_X^2 + \hat{\sigma}_Y^2 - 2\,\widehat{\mathrm{Cov}}(X, Y)}.
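A minimal sketch of this plug-in estimate (my code; the lecture gives only the formula — the helper name alpha_hat is hypothetical):

```python
import numpy as np

def alpha_hat(x, y):
    """Plug-in estimate of the variance-minimizing weight alpha."""
    c = np.cov(x, y)                    # 2x2 sample covariance matrix
    var_x, var_y, cov_xy = c[0, 0], c[1, 1], c[0, 1]
    return (var_y - cov_xy) / (var_x + var_y - 2 * cov_xy)
```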
Example. Investing in two assets

Suppose we compute the estimate α̂ = 0.6 using the samples (x_1, y_1), ..., (x_n, y_n). How sure can we be of this value? If we sampled another set of observations (x'_1, y'_1), ..., (x'_n, y'_n), would we get a wildly different α̂?

In this thought experiment, we know the actual joint distribution P(X, Y), so we can resample the n observations to our heart's content.
Resampling the data from the true distribution

[Figure: scatter plots of several datasets of size n, each resampled from the true distribution P(X, Y).]
Computing the standard error of α̂

Suppose we can sample as many datasets as we want. For each resampling of the data,

    (x_1^{(1)}, y_1^{(1)}), ..., (x_n^{(1)}, y_n^{(1)})
    (x_1^{(2)}, y_1^{(2)}), ..., (x_n^{(2)}, y_n^{(2)})
    ...

we can compute a value of the estimate: α̂^{(1)}, α̂^{(2)}, ....

The standard deviation of these values approximates the Standard Error of α̂.
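A sketch of this thought experiment (the bivariate normal P below is my assumed example — the lecture does not specify P — and alpha_hat is the hypothetical helper from the earlier sketch): draw B fresh datasets from the true distribution and take the SD of the resulting α̂ values.

```python
import numpy as np

rng = np.random.default_rng(1)
mean = [0.0, 0.0]
cov = [[1.0, 0.5],    # Var(X) = 1,    Cov(X, Y) = 0.5
       [0.5, 1.25]]   # Var(Y) = 1.25  -> true alpha = 0.75/1.25 = 0.6
n, B = 100, 1000

# One alpha_hat per fresh dataset of size n drawn from the true P.
alphas = [alpha_hat(*rng.multivariate_normal(mean, cov, size=n).T)
          for _ in range(B)]
print("SE(alpha_hat) via true-distribution resampling:",
      np.std(alphas, ddof=1))
```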
In reality, we only have one dataset of size n!

However, this dataset can be used to approximate the joint distribution P of X and Y, by forming the empirical distribution P̂(X, Y), which gives probability 1/n to each pair (x_i, y_i).

The Bootstrap: Instead of sampling new datasets from the unknown distribution P, resample from the empirical distribution P̂. Equivalently, resample the data by drawing n samples with replacement from the actual observations.
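The Bootstrap version of the same computation, as a sketch (again reusing the hypothetical alpha_hat helper; x and y are numpy arrays holding the single observed dataset): resample rows with replacement B times and take the SD of the resulting estimates.

```python
import numpy as np

def bootstrap_se_alpha(x, y, B=1000, seed=0):
    """Bootstrap estimate of SE(alpha_hat) from one dataset (x, y)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    boot = []
    for _ in range(B):
        # n draws with replacement from the observed (x_i, y_i) pairs.
        idx = rng.choice(n, size=n, replace=True)
        boot.append(alpha_hat(x[idx], y[idx]))
    return np.std(boot, ddof=1)
```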
A schematic of the Bootstrap

Original Data (Z):

    Obs    X     Y
    1      4.3   2.4
    2      2.1   1.1
    3      5.3   2.8

Bootstrap resamples (each drawn with replacement from Z):

    Z*1: obs 3, 1, 3       Z*2: obs 2, 3, 1       ...    Z*B: obs 2, 2, 1
    (5.3, 2.8)             (2.1, 1.1)                    (2.1, 1.1)
    (4.3, 2.4)             (5.3, 2.8)                    (2.1, 1.1)
    (5.3, 2.8)             (4.3, 2.4)                    (4.3, 2.4)
    -> α̂*1                 -> α̂*2                        -> α̂*B

Each resampled dataset Z*b is called a bootstrap replicate.
Comparing Bootstrap resamplings to resamplings from the true distribution

[Figure: histograms of α̂ over α, from resamplings of the true distribution ("True") and from Bootstrap resamplings ("Bootstrap"), side by side for comparison.]
Bootstrapping your favorite statistics

The bootstrap is broadly applicable and can be used to estimate the SE of a wide variety of statistics, including linear regression coefficients, model predictions f̂(x_0), principal component loadings, ...
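For instance, the same recipe applied to a simple regression slope, as a sketch under my own choices (np.polyfit stands in for any fitting routine; the function name is hypothetical):

```python
import numpy as np

def bootstrap_se_slope(x, y, B=1000, seed=0):
    """Bootstrap estimate of the SE of a simple-regression slope."""
    rng = np.random.default_rng(seed)
    n = len(x)
    slopes = []
    for _ in range(B):
        # Resample (x_i, y_i) pairs with replacement, then refit.
        idx = rng.choice(n, size=n, replace=True)
        slope, _intercept = np.polyfit(x[idx], y[idx], deg=1)
        slopes.append(slope)
    return np.std(slopes, ddof=1)
```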