Stat 139 Homework 2 Solutions, Fall 2016



Problem 1. The sum of squares of a sample of data is minimized when the sample mean, $\bar{X} = \sum X_i / n$, is used as the basis of the calculation. Define $g(c)$ as a function of $c$:

$$g(c) = \sum_{i=1}^{n} (X_i - c)^2.$$

Show that this function is minimized at the value $c = \bar{X}$.

Solution: To minimize a function, we take the first derivative (w.r.t. $c$) and set it to zero. Then we take the second derivative and check that it is positive at $\bar{x}$ (concave up):

$$g'(c) = -2\sum (x_i - c) = 0 \implies \sum c = \sum x_i \implies nc = \sum x_i \implies c = \sum x_i / n = \bar{x}$$

$$g''(c) = 2\sum 1 = 2n > 0$$

Problem 2. Let $X_1, X_2, \ldots, X_n$ be a sample of independent random variables drawn from a population with mean $\mu$ and variance $\sigma^2$. Let $\bar{X}$ be the sample average. Recall that $\sigma^2$ can be estimated by $S^2$, the usual sample variance, defined as:

$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1} = \frac{1}{n-1}\left(\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\right).$$

(a) Show that $E(X_i^2) = \sigma^2 + \mu^2$, using the fact that $\sigma^2 = E\left((X_i - \mu)^2\right)$.

Solution:

$$E(X_i^2) = E(X_i^2 - 2\mu^2 + 2\mu^2) = E(X_i^2 - 2\mu X_i + \mu^2) + E(\mu^2) = E\left((X_i - \mu)^2\right) + E(\mu^2) = \sigma^2 + \mu^2$$

Note: $E(\mu^2) = \mu E(\mu) = \mu E(X_i) = E(\mu X_i)$.

(b) Show that $E(S^2) = \sigma^2$, i.e., $S^2$ is an unbiased estimator of the population variance.

Solution:

$$E(S^2) = E\left[\frac{1}{n-1}\left(\sum X_i^2 - n\bar{X}^2\right)\right] = \frac{1}{n-1}\left(\sum E(X_i^2) - nE(\bar{X}^2)\right) = \frac{1}{n-1}\left(n(\sigma^2 + \mu^2) - n(\sigma^2/n + \mu^2)\right) = \frac{(n-1)\sigma^2}{n-1} = \sigma^2$$

Note: $E(\bar{X}^2) = \sigma^2_{\bar{X}} + \mu^2_{\bar{X}} = \sigma^2/n + \mu^2$. This is part (a) applied to $\bar{X}$, which has mean $\mu$ and, because the $X_i$ are independent, variance $\sigma^2/n$. (It does not rely on the Law of Large Numbers.)

Problem 3. Let $X_1, X_2, \ldots, X_{25}$ be i.i.d. Normal r.v.s with mean $\mu = 1$ and variance $\sigma^2 = 3^2 = 9$. Let $S^2$ be the usual variance estimate, $S^2 = \sum (X_i - \bar{X})^2/(n-1)$, and let $\hat\sigma^2$ be the estimate that uses $\mu$ in the calculation instead, $\hat\sigma^2 = \sum (X_i - \mu)^2/n$. Write a simulation in R, using a for-loop based on at least 10,000 iterations, to determine the following (be sure to include the relevant R code and output):

(a) That both estimators ($S^2$ and $\hat\sigma^2$) are unbiased.

Solution: Based on 10,000 iterations, the observed means of both estimators were within 0.01 units of the true variance of 9. We could formally test whether either mean is significantly different from 9 (based on n = 10,000 realizations), but that is overkill. Here is the relevant R code:

    > nsims=10000
    > mu=1
    > sigma=3
    > n=25
    > sigma2.hat=s2=rep(NA,nsims)
    >
    > for(i in 1:nsims){
    +   sample=rnorm(n,mean=mu,sd=sigma)
    +   xbar=mean(sample)
    +   sigma2.hat[i]=sum((sample-mu)^2)/n
    +   s2[i]=var(sample)
    + }
    > mean(sigma2.hat)
    [1]
    > mean(s2)
    [1]

(b) Provide a separate histogram for each of the two sampling distributions. Which has lower spread?

Solution: Based on the R output below, $\hat\sigma^2$ has slightly smaller spread than $S^2$ (about 3% lower standard deviation).

    > sd(sigma2.hat)
    [1]
    > sd(s2)
    [1]

(c) Which estimator is closer to the true value more often?

Solution: Based on the R output below, $\hat\sigma^2$ is as close or closer than $S^2$ about 52.4% of the time.

    > mean(abs(sigma2.hat-sigma^2)<=abs(s2-sigma^2))
    [1]

(d) Are you sure of your answers above? What could you do to be more certain?

Solution: No, the answers above are not certain, since they are based on random simulations. We could be more certain if we based this study on more iterations, or if we performed a formal test to see whether the results above are statistically significant.
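The histograms requested in Problem 3(b) and the "how certain are we" question in 3(d) can be explored with a reproducible sketch like the one below. It is not the original solution code: the seed and the names `sd.theory.hat` and `sd.theory.s2` are assumptions introduced for illustration. The theoretical values rely on the standard result that, for normal data, $(n-1)S^2/\sigma^2$ and $n\hat\sigma^2/\sigma^2$ follow chi-square distributions with $n-1$ and $n$ degrees of freedom.

```r
# Minimal sketch (assumed seed; not the original homework code)
set.seed(139)
nsims <- 10000
mu <- 1
sigma <- 3
n <- 25
sigma2.hat <- s2 <- rep(NA, nsims)

for (i in 1:nsims) {
  x <- rnorm(n, mean = mu, sd = sigma)
  sigma2.hat[i] <- sum((x - mu)^2) / n   # uses the known mean mu
  s2[i] <- var(x)                        # uses the sample mean, divides by n - 1
}

# Side-by-side histograms of the two sampling distributions (part b)
par(mfrow = c(1, 2))
hist(sigma2.hat, main = "Histogram of sigma2.hat")
hist(s2, main = "Histogram of s2")

# Theoretical standard deviations under normality:
# Var(sigma2.hat) = 2 * sigma^4 / n, Var(S^2) = 2 * sigma^4 / (n - 1)
sd.theory.hat <- sqrt(2 * sigma^4 / n)
sd.theory.s2 <- sqrt(2 * sigma^4 / (n - 1))

# Compare simulated and theoretical spreads
c(sd(sigma2.hat), sd.theory.hat)
c(sd(s2), sd.theory.s2)
```

Comparing the simulated standard deviations with the theoretical ones is one way to judge how much of the observed difference is simulation noise; rerunning with a different seed or more iterations is another.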

[Figure: Histogram of sigma2.hat; Histogram of s2]

Problem 4. The National Football League (NFL) instituted a new rule in 2016 that changed how kickoffs are returned (a touchback is placed at the 25 instead of the 20 yard line, in the hope of increasing touchbacks and reducing injuries). This problem will investigate what effect this may have on that type of play in the game based on just 1 week, n = 16 games, of data.

(a) In the entire year of 2015 (256 games), 1470 out of 2627 kickoffs were touchbacks. So far in 2016, 103 out of 165 kickoffs have been touchbacks. Perform a formal hypothesis test to determine whether the rate of touchbacks has changed from 2015 to 2016. What does this mean for whether the rule change had an effect on kickoffs?

Solution: If we treat these as two samples from two separate populations (or superpopulations), then we can perform a 2-sample z-test for proportions. Note that each observation (each individual kickoff) is a Bernoulli r.v. with parameter $p$, and thus has mean $\mu = p$ and variance $\sigma^2 = p(1-p)$, and the CLT applies to the 2 sample proportions, $\hat p$, since these are averages of individual observations:

$$H_0: p_{2016} = p_{2015} \quad \text{vs.} \quad H_A: p_{2016} \ne p_{2015}$$

$$Z = \frac{\hat p_{2016} - \hat p_{2015}}{\sqrt{\hat p_{\text{pooled}}(1 - \hat p_{\text{pooled}})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} = 1.624$$

$$p\text{-value} = 2\,(1 - P(Z \le 1.624)) \approx 0.104$$

Note: $\hat p_{\text{pooled}} = (103 + 1470)/(165 + 2627)$ is the combined rate of touchbacks. Also, this test can instead be performed as a one-sample test where only $\hat p_{2016}$ is used as an estimate and $p_{2015} = 1470/2627 = 0.5596$ is treated as the true parameter value; this difference in approach is addressed in the next part.

(b) The proportion from 2015 could be treated as a population parameter or as an estimate of some super-population (the theoretical construct that there is some mechanism producing these data). Practically speaking, give an argument why it does not matter which way you treated it in the previous part.

Solution: This does not matter since the sample size is so large (n = 2627). When an estimate is based on so much data, the estimator's variance is so small that it has almost no bearing on the result and can be treated like a constant.
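The test in part (a) and the one-sample alternative discussed in part (b) can be reproduced numerically. The following is a minimal sketch using only the counts stated in the problem; the variable names (`z.two`, `z.one`, and so on) are assumptions chosen for illustration.

```r
# Counts taken from Problem 4(a)
x.2016 <- 103;  n.2016 <- 165
x.2015 <- 1470; n.2015 <- 2627

p.2016 <- x.2016 / n.2016
p.2015 <- x.2015 / n.2015

# Two-sample z-test with a pooled proportion (both years treated as samples)
p.pooled <- (x.2016 + x.2015) / (n.2016 + n.2015)
se.two <- sqrt(p.pooled * (1 - p.pooled) * (1 / n.2016 + 1 / n.2015))
z.two <- (p.2016 - p.2015) / se.two
p.value.two <- 2 * pnorm(abs(z.two), lower.tail = FALSE)

# One-sample z-test (2015 proportion treated as a known parameter)
se.one <- sqrt(p.2015 * (1 - p.2015) / n.2016)
z.one <- (p.2016 - p.2015) / se.one
p.value.one <- 2 * pnorm(abs(z.one), lower.tail = FALSE)

round(c(z.two = z.two, p.two = p.value.two,
        z.one = z.one, p.one = p.value.one), 4)
```

The one-sample statistic comes out slightly larger because its standard error omits the (small) 1/2627 term, but neither version rejects at the 5% level. Base R's `prop.test(c(103, 1470), c(165, 2627), correct = FALSE)` should give the same two-sided p-value as the two-sample version, since its chi-square statistic is the square of this z.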

(c) Calculate a 95% confidence interval for the true proportion of kickoffs that end in touchbacks in 2016.

Solution:

$$\hat p_{2016} \pm z^* \sqrt{\frac{\hat p_{2016}(1 - \hat p_{2016})}{n}} = 0.6242 \pm 1.96\sqrt{\frac{0.6242\,(0.3758)}{165}} = (0.550,\ 0.698)$$

(d) Do the confidence interval and hypothesis test agree? How do you know?

Solution: Yes, they agree. Treating the proportion from 2015 as a population parameter (p = 0.5596), our confidence interval covers that value, and thus it should not be rejected as the true underlying proportion of kickoffs that are touchbacks in 2016.

(e) There are 32 teams in the NFL and each team essentially uses the same players on kickoffs throughout the year (the kicker is the most important player on kickoffs and he very rarely changes). How does this affect the assumptions of your inferences in parts (a) and (c)?

Solution: The observations are not independent, either within a season (due to a clustering effect by team/kicker) or between seasons (since the data are in some way paired from one season to the next).

(f) The entire 2015 season may not be the best comparison group for this study. Provide a different comparison group and/or analysis approach that may be more appropriate.

Solution: A better approach would be to compare week 1 of 2016 to week 1 of 2015. There may be changes as the season goes along (especially weather). Also, one could perform a paired test looking at the difference within teams from 2015 to 2016.

Problem 5. Use R to perform the following simulation based on 100,000 iterations to mimic the previous problem.

(a) Assume 2015 is a discrete population of kickoffs with exactly 1470 touchbacks and 1157 non-touchbacks. Sample, without replacement, 165 kickoffs and measure the proportion of kickoffs that are touchbacks within this sample. Repeat this sampling 100,000 times. Provide a histogram of the sampled proportions. What proportion of sample proportions is greater than what was actually observed in 2016?

Solution: The histogram is provided below along with the relevant R output (see R code file for simulation code). The proportion of sample proportions greater than what was observed is estimated to be about 3.5%.

(b) Now assume 2015 is an infinite population of kickoffs where p = 1470/2627 of the kickoffs are touchbacks. (Note: this is equivalent to sampling with replacement from the discrete population.) Sample 165 kickoffs from the theoretically infinite population (or sample with replacement from the finite population) and measure the proportion of kickoffs that are touchbacks within this sample. Repeat this sampling 100,000 times. Provide a histogram of the sampled proportions. What proportion of sample proportions is greater than what was actually observed in 2016?

Solution: The proportion of sample proportions greater than what was observed is estimated to be about 3.9% here, slightly larger than in part (a). Since this is higher than the respective calculation for part (a), it hints that the sampling distribution has fatter tails (higher variance), which will be discussed in the next part.
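The simulation code itself lives in a separate R file that is not reproduced here. As a hedged sketch of what parts (a) and (b) describe, the block below reuses the names `sample_props1`, `sample_props2`, and `phat_1` that appear in the R output, but it is an assumption about the approach rather than the original code: instead of calling `sample()` inside a loop, it draws the touchback counts directly from the hypergeometric distribution (sampling without replacement) and the binomial distribution (sampling with replacement). The seed is also an assumption.

```r
# Minimal sketch (assumed seed and approach; not the original R code file)
set.seed(139)
nsims <- 100000
n.touch <- 1470    # touchbacks in the 2015 population
n.non <- 1157      # non-touchbacks in the 2015 population
n.sample <- 165    # kickoffs observed so far in 2016
phat_1 <- 103 / 165  # observed 2016 proportion

# (a) Without replacement: touchback counts are hypergeometric
sample_props1 <- rhyper(nsims, m = n.touch, n = n.non, k = n.sample) / n.sample

# (b) With replacement (infinite population): touchback counts are binomial
sample_props2 <- rbinom(nsims, size = n.sample,
                        prob = n.touch / (n.touch + n.non)) / n.sample

par(mfrow = c(1, 2))
hist(sample_props1, main = "Histogram of sample_props1")
hist(sample_props2, main = "Histogram of sample_props2")

mean(sample_props1 > phat_1)  # part a
mean(sample_props2 > phat_1)  # part b
```

Drawing counts from `rhyper` and `rbinom` is distributionally equivalent to shuffling a vector of 1470 ones and 1157 zeros and taking 165 of them, and it avoids an explicit loop over 100,000 iterations.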

    > mean(sample_props1>phat_1) #part a
    [1]
    > mean(sample_props2>phat_1) #part b
    [1]

[Figure: Histogram of sample_props1; Histogram of sample_props2]

(c) How do the histograms of the two sampling procedures compare? How would the histogram change if part (a) was based on a discrete population with 1/4 as many observations? Feel free to use empirical evidence and statistics/measures to support your claim. Note: this is a different issue than seen in 4(b).

Solution: They are very similar (both approximately normal). The histogram for (b), where sampling was done with replacement, has slightly more variability than the one for sampling without replacement. This difference in variance will be exacerbated if the population size is even smaller (closer to the sample size).

    > mean(sample_props1)
    [1]
    > var(sample_props1)
    [1]
    > mean(sample_props2)
    [1]
    > var(sample_props2)
    [1]

(d) Turn your results in parts (a) and (b) into 2-sided p-values. How do these compare to the hypothesis test from the previous problem?

Solution: Technically, we should change the inequality from parts (a) and (b), since it should be "greater than or equal to" and not just "greater than". How to turn this into a 2-sided p-value then depends on whether the reference distribution (here, the histogram) is symmetric. If it is symmetric, we can just take the 1-tail probability and multiply by 2. If it is not symmetric, then we would have to be much more careful in how to calculate this (and determine extremity by distance from the null hypothesis mean). Luckily, our histogram is roughly normal, so we are OK to take the 1-tail probability and multiply by 2. Ignoring the equality, we can just double our results from parts (a) and (b) and get p-values of roughly 0.07 and 0.08. Taking into account the equality, we see the p-values are more similar to the calculations done by hand:

    > 2*mean(sample_props1>=phat_1)
    [1]
    > 2*mean(sample_props2>=phat_1)
    [1]

Note: due to the discreteness of this r.v., it makes a difference if we include the equality.
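The variance comparison in Problem 5(c) can be made concrete with the standard finite population correction. This is a sketch under stated assumptions: $N$ denotes the population size, $n = 165$ the sample size, and the quarter-size population is taken as $N \approx 657$, an assumed rounding of $2627/4$.

```latex
\mathrm{Var}(\hat p_{\text{with replacement}}) = \frac{p(1-p)}{n},
\qquad
\mathrm{Var}(\hat p_{\text{without replacement}}) = \frac{p(1-p)}{n}\cdot\frac{N-n}{N-1}

% Full 2015 population, N = 2627:
\frac{N-n}{N-1} = \frac{2627-165}{2627-1} = \frac{2462}{2626} \approx 0.938

% Quarter-size population, N \approx 657 (assumed rounding):
\frac{N-n}{N-1} = \frac{657-165}{657-1} = \frac{492}{656} = 0.75
```

Under these assumptions the without-replacement variance is about 6% smaller than the with-replacement variance for the full population, and about 25% smaller for the quarter-size population, which is the sense in which the difference is exacerbated as the population shrinks toward the sample size.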
