Chapter 6 Part 6. Confidence Intervals chi square distribution binomial distribution



Chapter 6 Part 6: Confidence Intervals, chi square distribution, binomial distribution. October 8, 2008

Brief review of what we covered last time.

In order to get a confidence interval for the population mean ($\mu$) we build a confidence interval about the sample mean. To obtain this confidence interval we used the sampling distribution of means, whose distribution is represented by the random variable $\bar{X}$. We discovered that if you know the population variance ($\sigma^2$), then $\dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ follows the N(0, 1) distribution. We started with

$$-z_{1-(\alpha/2)} < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < z_{1-(\alpha/2)}$$

(where $z_{1-(\alpha/2)}$ is the cutoff on the N(0, 1) curve that cuts off an area equal to $\alpha/2$ in the upper tail). This allowed us to get the following:

$$\Pr\left(\bar{X} - z_{1-(\alpha/2)}\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + z_{1-(\alpha/2)}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha$$

Then the $100(1-\alpha)\%$ confidence interval is

$$\left(\bar{x} - z_{1-(\alpha/2)}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{1-(\alpha/2)}\frac{\sigma}{\sqrt{n}}\right)$$

We changed from probability to confidence when we switched from $\bar{X}$ (a random variable) to $\bar{x}$ (a sample estimate of the population mean). For samples of size $n$, these confidence intervals are all the same length and are all symmetric about $\bar{x}$, the sample mean.

If we don't know the population variance, then we'll need to estimate the population's variance using the sample variance ($s^2$). We can then get that $\dfrac{\bar{X} - \mu}{s/\sqrt{n}}$ follows the $t$ distribution with degrees of freedom $= n - 1$. Using the same process as above, we get the $100(1-\alpha)\%$ confidence interval to be

$$\left(\bar{x} - t_{n-1,\,1-(\alpha/2)}\frac{s}{\sqrt{n}},\ \bar{x} + t_{n-1,\,1-(\alpha/2)}\frac{s}{\sqrt{n}}\right)$$
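As a minimal sketch of the known-variance interval reviewed above, the Python below computes $\bar{x} \pm z_{1-(\alpha/2)}\,\sigma/\sqrt{n}$. The handout itself works in Stata; Python is used here only so the example runs with a standard installation. The function name `z_interval` and the input values (mean 10, sigma 4, n = 64) are hypothetical and do not come from the handout.

```python
from math import sqrt
from statistics import NormalDist


def z_interval(xbar, sigma, n, alpha=0.05):
    """Known-variance CI: xbar -/+ z_{1-alpha/2} * sigma / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # cutoff on the N(0, 1) curve
    half = z * sigma / sqrt(n)               # half-length of the interval
    return xbar - half, xbar + half


# Hypothetical inputs, chosen only for illustration.
lo, hi = z_interval(xbar=10.0, sigma=4.0, n=64)
print(round(lo, 2), round(hi, 2))
```

Because sigma and n are fixed, every sample of this size gives an interval of the same length, centered on its own sample mean.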

The subscript for the cutoff $t_{n-1,\,1-(\alpha/2)}$ contains an $n-1$ that the $z$ didn't have. The $n-1$ indicates which $t$ distribution to use. These confidence intervals are all centered about $\bar{x}$, but they are not all of the same length (length $= 2\,t_{n-1,\,1-(\alpha/2)}\,\frac{s}{\sqrt{n}}$) because as the sample standard deviation changes, the length changes. For the same standard deviation and sample size, the confidence intervals that use the normal distribution will be shorter than the intervals using the $t$ distribution because $t_{n-1,\,1-(\alpha/2)} > z_{1-(\alpha/2)}$.

If $\alpha = 0.05$, then $\alpha/2 = 0.025$.

. di invttail(2, 0.025)
. di invnormal(0.975)

$4.30 = t_{df=2,\,0.975}$ and $1.96 = z_{0.975}$

So, for example, the $t$ distribution with 2 degrees of freedom (i.e. $n = 3$) has a cutoff for the area of 2.5% that is over twice the size of the cutoff for the 2.5% area using the N(0, 1). As the degrees of freedom get larger, the 2.5% cutoff for the $t$ distribution will get closer to 1.96.

The need for a confidence interval about the variance came up in the context of trying to find a way to measure variability, say in a quality control situation. We already know that we can estimate the variance of the population by the variance of a sample. But that is going to give us just a point estimate. We would like to be able to get a confidence interval. The confidence interval for the variance uses neither the $t$ nor the normal distribution.

What is known is that the sample variance $S^2$ multiplied by $\dfrac{n-1}{\sigma^2}$ follows the chi square ($\chi^2$) distribution with $n-1$ degrees of freedom. The graph on the right below (the chi square curve with 5 degrees of freedom) shows that we will have to calculate the two cutoffs separately rather than having something like $-z_{1-(\alpha/2)}$ and $z_{1-(\alpha/2)}$. We will associate the cutoff on the left (0.83 in the graph on the right below) with $\chi^2_{n-1,\,\alpha/2}$ and the cutoff on the right (12.83 in the graph below) with $\chi^2_{n-1,\,1-(\alpha/2)}$.

I am using $S^2$ to represent the random variable and $s^2$ as the sample estimate.

[Figure: Chi square distributions with various degrees of freedom. Density g(x) against x; the curves shown include df = 1, df = 5 and df = 10.]

So we get that

$$\Pr\left(\chi^2_{n-1,\,\alpha/2} < \frac{(n-1)S^2}{\sigma^2} < \chi^2_{n-1,\,1-(\alpha/2)}\right) = 1 - \alpha$$

You want $\sigma^2$ in the middle. The steps to accomplish that are:

1) Divide all three terms by $(n-1)S^2$. This is a positive number because $n - 1 > 0$ (a sample of size 1 is not interesting) and $S^2 > 0$ (technically the sample variance could be zero, but that would mean that all of the elements of the sample are the same, again not a very interesting case).

2) If $b > a > 0$, then $\frac{1}{a} > \frac{1}{b} > 0$. This allows you to flip each of the 3 terms so that we get $\sigma^2$ in the middle, with the result below.

$$\Pr\left(\frac{(n-1)S^2}{\chi^2_{n-1,\,1-(\alpha/2)}} < \sigma^2 < \frac{(n-1)S^2}{\chi^2_{n-1,\,\alpha/2}}\right) = 1 - \alpha$$
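The two steps above can be written out line by line. This is only a worked restatement of the algebra already described, using the same symbols; the middle line is the intermediate inequality that the text passes over.

```latex
\begin{align*}
\chi^2_{n-1,\,\alpha/2} &< \frac{(n-1)S^2}{\sigma^2} < \chi^2_{n-1,\,1-(\alpha/2)} \\[4pt]
\frac{\chi^2_{n-1,\,\alpha/2}}{(n-1)S^2} &< \frac{1}{\sigma^2} < \frac{\chi^2_{n-1,\,1-(\alpha/2)}}{(n-1)S^2}
  && \text{divide by } (n-1)S^2 > 0 \\[4pt]
\frac{(n-1)S^2}{\chi^2_{n-1,\,1-(\alpha/2)}} &< \sigma^2 < \frac{(n-1)S^2}{\chi^2_{n-1,\,\alpha/2}}
  && \text{take reciprocals, which reverses the order}
\end{align*}
```

Taking reciprocals is what moves the larger cutoff, $\chi^2_{n-1,\,1-(\alpha/2)}$, into the lower endpoint of the interval.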

So that the $100(1-\alpha)\%$ confidence interval is

$$\left(\frac{(n-1)s^2}{\chi^2_{n-1,\,1-(\alpha/2)}},\ \frac{(n-1)s^2}{\chi^2_{n-1,\,\alpha/2}}\right)$$

To get the confidence interval for the standard deviation, you just take the square root of each of the endpoints of the interval above (see Cheat Sheet).

The confidence intervals for the variance are not symmetric about the sample variance. As you pick different samples of size n, you get different sample variances and hence the intervals are also not all the same length.

Stata commands:

chi2(df,x) provides the cumulative chi-squared distribution with df degrees of freedom.
invchi2(df,p) provides the inverse of chi2(). If chi2(df,x) = p, then invchi2(df,p) = x.
chi2tail(df,x) provides the upper-tail cumulative chi-squared with df degrees of freedom.
invchi2tail(df,p) provides the inverse of chi2tail(). If chi2tail(df,x) = p, then invchi2tail(df,p) = x.

When we were mapping the square of the N(0, 1) curve onto a chi square distribution with 1 degree of freedom, we saw that the 1.96 on the N(0, 1) curve mapped into $1.96^2$, which is equal to 3.84. Because the two ends of the normal distribution both map into the upper tail of the chi square, we have that 3.84 cuts off the upper 5% of the one degree of freedom chi square distribution.

The first commands below go with the upper (right) part of the chi square distribution.

. di chi2tail(1,3.84)
(where 1 = df, 3.84 = cutoff and the area is to the right of 3.84)

. di invchi2tail(1,0.05)
(where 1 = df, 0.05 = probability = area to the right of 3.84)

The commands below go with the lower (left) part of the chi square distribution. If 3.84 cuts off 0.05 in the upper tail of a chi square with 1 degree of freedom, then it cuts off 0.95 in the lower portion of the curve.

. di chi2(1,3.84)
(where 1 = df, 3.84 = cutoff and the area is to the left of 3.84)

. di invchi2(1,0.95)
(where 1 = df, 0.95 = probability = area to the left of 3.84)

To find the 95% confidence interval for the variance of the blood lead concentration ($s^2 = 22.41$, with n = 98), we have $\alpha = 0.05$, $\alpha/2 = 0.025$ and $1 - (\alpha/2) = 0.975$. Since n = 98, df = n - 1 = 97. So we need to calculate $\chi^2_{97,\,0.025}$ and $\chi^2_{97,\,0.975}$. See the graph below.

We can consider the area to the left of each of the two cutoffs.

. di invchi2(97,0.025)
. di invchi2(97,0.975)

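Or we can use the area to the right of each of the two cutoffs.

. di invchi2tail(97,0.025)
. di invchi2tail(97,0.975)

$$\left(\frac{(n-1)s^2}{\chi^2_{97,\,0.975}},\ \frac{(n-1)s^2}{\chi^2_{97,\,0.025}}\right) = \left(\frac{97(22.41)}{\chi^2_{97,\,0.975}},\ \frac{97(22.41)}{\chi^2_{97,\,0.025}}\right)$$

. di (97*22.41)/invchi2(97,0.975)
. di (97*22.41)/invchi2(97,0.025)

So the 95% CI for 22.41 is (17.23, 30.34). The interval is not symmetric about 22.41: the lower endpoint is 5.18 below the estimate, while the upper endpoint is 7.93 above it.

. di 22.41 - 17.23
. di 30.34 - 22.41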
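The calculation above can be cross-checked outside Stata. The following is a minimal Python sketch, not the handout's method: it assumes a plain series expansion for the chi square cumulative distribution and bisection for its inverse, in place of Stata's chi2() and invchi2(). The function names `chi2_cdf`, `chi2_quantile` and `variance_ci` are hypothetical.

```python
from math import exp, lgamma, log, sqrt


def chi2_cdf(x, df):
    """P(chi square with df degrees of freedom <= x), using the series for
    the regularized lower incomplete gamma function P(df/2, x/2)."""
    if x <= 0:
        return 0.0
    a, h = df / 2.0, x / 2.0
    term = total = 1.0
    n = 0
    while term > total * 1e-16:
        n += 1
        term *= h / (a + n)
        total += term
    return min(1.0, total * exp(-h + a * log(h) - lgamma(a + 1.0)))


def chi2_quantile(p, df):
    """Invert chi2_cdf by bisection (the role invchi2(df,p) plays in Stata)."""
    lo, hi = 0.0, df + 20.0 * sqrt(2.0 * df) + 20.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def variance_ci(s2, n, alpha=0.05):
    """CI for the variance: ((n-1)s^2 / upper cutoff, (n-1)s^2 / lower cutoff)."""
    df = n - 1
    lower = df * s2 / chi2_quantile(1 - alpha / 2, df)
    upper = df * s2 / chi2_quantile(alpha / 2, df)
    return lower, upper


# Blood lead example from the handout: s^2 = 22.41, n = 98.
lb, ub = variance_ci(22.41, 98)
print(round(lb, 2), round(ub, 2))              # interval for the variance
print(round(sqrt(lb), 2), round(sqrt(ub), 2))  # interval for the standard deviation
```

Up to rounding, this should agree with the interval (17.23, 30.34) quoted above. Note that the larger cutoff goes with the lower endpoint.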

If the gold standard that the labs were trying to achieve was 15, then 22.41 is statistically different from 15 (at the 0.05 level) since 15 is not in the interval. All of the variances in the interval are considered to be statistically similar to 22.41, and 22.41 is considered to be statistically different from every variance outside the interval.

If we wanted the confidence interval for the standard deviation (4.73) of the blood lead distribution, then all we have to do is take the square root of the endpoints of the confidence interval for the variance to get the endpoints of the confidence interval for the standard deviation.

. di sqrt(17.23)
. di sqrt(30.34)

So the confidence interval for the standard deviation 4.73 is (4.15, 5.51). Notice that this interval is not symmetric about 4.73.

. di 4.73 - 4.15
. di 5.51 - 4.73

The dofile below generates the chi square density with 20 degrees of freedom.

clear
*the chi square distribution is a special case of a Gamma Distribution.
*this dofile generates a chi square density with df = 20.
*to get a graph of the density function,
*use a connected twoway graph with x = x and y = y.
set obs 100
gen x = sum(1)
*you can change the df to whatever you like. It has to be an integer.
gen df = 20
gen y = gammaden(df/2,2,0,x)
twoway (connected y x, msymbol(none)),/*
*/title(chi square distribution with 20 degrees of freedom)

Below I give you a dofile that will calculate the confidence interval for the variance. Stata does not provide you with a command that will calculate this confidence interval for you.

The dofile below generates a confidence interval for the variance. You have to change n, ssq and alpha to fit your situation.

clear
set obs 1
*n is the sample size on which the variance is calculated
*The data here is on page 7 of the Chapter 6 Part 6 handout.
gen n = 98
*ssq is the variance
gen ssq = 22.41
*alpha is such that (1 - alpha)*100 is the % for the CI
gen alpha = 0.05
gen df = n - 1
gen aover2 = alpha/2
gen oneminus = 1 - aover2
gen chi2ll = invchi2(df,oneminus)
gen chi2ul = invchi2(df,aover2)
gen LB = (df*ssq)/chi2ll
gen UB = (df*ssq)/chi2ul
* the command below says list the upper & lower bounds listed in the first record
list LB UB in 1

[Figure: a binomial distribution, with f(x) plotted against x.]

It is going to be a little tricky to get a confidence interval for p, the probability of success. You can see that we can't get cutoffs as we have done for the other distributions, because you can't find a point that cuts off exactly 2.5%, say. So let us first go back and see what we know about the binomial distribution.

Binomial Distribution

1) The binomial distribution has two parameters, p = the probability of a success on a single trial and n = the number of trials.

2) The n trials have to be independent and each of the trials has to have the same probability of success.

3) The binomial distribution is a discrete distribution which is defined for the integers 0, 1, 2, ..., n.

4) Like the chi square distribution, the binomial distribution is not symmetric (except when p = 0.5).

5) The mean of the binomial distribution is np and the variance is npq or np(1 - p).

6) For the normal distribution the point estimates (as opposed to interval estimates) for the population mean and variance are $\bar{x}$ (the sample mean) and $s^2$ (the sample variance). For the binomial distribution, what is the point estimate or sample estimate for the population parameter p (the probability of success)? First, to distinguish between the population parameter and the sample statistic, we will use a hat on the sample statistic. So $\hat{p}$ = the number of successes/the total number of trials (i.e. the proportion of successes). [Aside: You will frequently see $\bar{x}$ denoted as $\hat{\mu}$.]

7) For the normal distribution the average of the means of all of the random samples from a given population gives us the mean of that population. The average of the variances of all of the random samples gives us the variance of the population. It is also true that the average of the proportion of successes over all of the random samples is the population proportion. So $\hat{p}$ is an unbiased estimator of the population proportion p.

Example: Occupational Health (Rosner page 87)

The proportion of deaths due to lung cancer in males of a given age group in England and Wales during the reference period was 12%. Suppose that of 20 deaths that occur among male workers in this age group who have worked at least one year in a chemical plant, 5 are due to lung cancer. Is there a difference between the proportion of deaths due to lung cancer in this plant and the proportion in the general population?
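Items 5) and 6) above can be checked numerically. The Python below is a small sketch under stated assumptions: it uses n = 20 and p = 0.12 from the example, the helper name `binom_pmf` is hypothetical, and Python stands in for the handout's Stata.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for X ~ binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 20, 0.12  # population proportion from the example
pmf = [binom_pmf(k, n, p) for k in range(n + 1)]

# Mean and variance computed directly from the probability mass function.
mean = sum(k * f for k, f in enumerate(pmf))
var = sum((k - mean) ** 2 * f for k, f in enumerate(pmf))
print(round(mean, 4), round(n * p, 4))           # mean versus np
print(round(var, 4), round(n * p * (1 - p), 4))  # variance versus np(1 - p)

# Point estimate of p from the plant data: 5 lung cancer deaths out of 20.
p_hat = 5 / 20
print(p_hat)
```

The directly computed mean and variance agree with np and np(1 - p), and the sample proportion is simply successes divided by trials.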

One way of approaching this problem is to get a confidence interval about the sample estimate of the proportion of deaths due to lung cancer and see if 12% is or is not in the interval. So first we need the sample estimate to compare with the population value (i.e. 12%). Success in this case is death due to lung cancer and the number of trials n is 20. So $\hat{p} = \frac{5}{20} = 0.25$, which could also be expressed as 25%.

How do we get a confidence interval for the sample estimate 0.25? The theoretical background for the confidence interval for a binomial distribution is not as simple as it was for the confidence intervals for the sample means and variances of normal distributions. This is in part because the binomial distribution is discrete. So we can't just find, say, 0.025 in each of the tails and use the cutoffs as the endpoints for the confidence interval. We can't do this because the probability mass is only at the integers and is not necessarily in neat little batches that add up to 0.025 (see graph page 9).

Let us first use Stata to get the interval and then I'll explain what Stata is doing. I will use the immediate command because we don't have a data set; we just have the population and sample values.

Immediate command for a variable distributed as binomial:

cii #obs #succ [, ciib_options]

cii is expecting the number of observations (20 for this example) and the number of successes (5).

. cii 20 5
-- Binomial Exact --
Variable | Obs  Mean  Std. Err.  [95% Conf. Interval]
[output values not preserved in this copy]

When we use Stata to get the CI for 0.25 (5/20) we get $p_1 = 0.09$ (the left-hand endpoint of the CI) and $p_2 = 0.49$ (the right-hand endpoint of the confidence interval).

The mean listed in the Stata results above is not the mean of the binomial distribution with n = 20 and p = 0.25. In the Stata output the value listed under the mean is actually $\hat{p}$ (0.25). Had we been estimating the mean of a population, 0.25 would have been the sample mean ($\bar{x}$). So this output is sort of "one version of the output fits all".

The value called the standard error (SE) in the Stata output is $\sqrt{\hat{p}\hat{q}/n}$. So

$$\sqrt{\frac{\hat{p}\hat{q}}{n}} = \sqrt{\frac{0.25(0.75)}{20}} = 0.0968$$

This is the SD of the sampling distribution for the proportion of successes. The SD of a sampling distribution is called the standard error (this works the same way for the normal distribution).

How is a binomial confidence interval defined?

An exact $100(1-\alpha)\%$ confidence interval for the binomial parameter p that is always valid is given by $(p_1, p_2)$, where $p_1$ and $p_2$ satisfy the equations

$$\Pr(X \ge x \mid X \sim \text{binomial with parameter } p_1) = \frac{\alpha}{2} = \sum_{k=x}^{n}\binom{n}{k}p_1^{k}(1-p_1)^{n-k}$$

$$\Pr(X \le x \mid X \sim \text{binomial with parameter } p_2) = \frac{\alpha}{2} = \sum_{k=0}^{x}\binom{n}{k}p_2^{k}(1-p_2)^{n-k}$$

where n is the number of trials (20 in this case) and x is the number of successes (5 in this case).

I'll show you in a graph what is essentially happening, but first I'll show you that the values Stata came up with satisfy the definition given above.

. cii 20 5
-- Binomial Exact --
Variable | Obs  Mean  Std. Err.  [95% Conf. Interval]
[output values not preserved in this copy]

The two endpoints of the interval in this output are $p_1$ and $p_2$.
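The two defining equations can be solved numerically. The Python below is a minimal sketch, assuming simple bisection on p; it is not a description of how Stata's cii computes the interval internally. The names `binom_pmf`, `lower_tail`, `upper_tail`, `solve` and `exact_ci` are hypothetical.

```python
from math import exp, lgamma, log, log1p


def binom_pmf(k, n, p):
    """P(X = k) for X ~ binomial(n, p), computed on the log scale."""
    log_f = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
             + k * log(p) + (n - k) * log1p(-p))
    return exp(log_f)


def lower_tail(x, n, p):
    """P(X <= x)."""
    return sum(binom_pmf(k, n, p) for k in range(0, x + 1))


def upper_tail(x, n, p):
    """P(X >= x)."""
    return 1.0 - lower_tail(x - 1, n, p)


def solve(f, target, increasing):
    """Bisection on p in (0, 1) for f(p) = target, where f is monotone in p."""
    lo, hi = 1e-12, 1 - 1e-12
    for _ in range(200):
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_ci(x, n, alpha=0.05):
    """Exact interval for 0 < x < n; x = 0 and x = n are not handled here."""
    p1 = solve(lambda p: upper_tail(x, n, p), alpha / 2, increasing=True)
    p2 = solve(lambda p: lower_tail(x, n, p), alpha / 2, increasing=False)
    return p1, p2


# Occupational health example: 5 lung cancer deaths out of 20.
p1, p2 = exact_ci(5, 20)
print(round(p1, 4), round(p2, 4))
print(round(upper_tail(5, 20, p1), 4), round(lower_tail(5, 20, p2), 4))
```

The second line of output shows both tail areas, each of which should come out at alpha/2 = 0.025. The endpoints should round to the 0.09 and 0.49 quoted above.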

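For the left-hand endpoint $p_1$, the upper tail (5 or more successes) must equal $\alpha/2 = 0.025$:

$$\sum_{k=5}^{20}\binom{20}{k}p_1^{k}(1-p_1)^{20-k} = 0.025$$

In Stata, binomial(20,4,$p_1$) is the lower tail $\sum_{j=0}^{4}\binom{20}{j}p_1^{j}(1-p_1)^{20-j}$, so the upper tail is 1 - binomial(20,4,$p_1$).

Stata 9: Binomial(20,5,$p_1$) $= \sum_{j=5}^{20}\binom{20}{j}p_1^{j}(1-p_1)^{20-j}$, which gives the upper tail directly.

. di Binomial(20,5,p1)

For the right-hand endpoint $p_2$, the lower tail (5 or fewer successes) must equal 0.025:

$$\sum_{k=0}^{5}\binom{20}{k}p_2^{k}(1-p_2)^{20-k} = 0.025$$

In Stata 9 this lower tail is 1 - Binomial(20,6,$p_2$); in later versions it is binomial(20,5,$p_2$).

. di binomial(20,5,p2)

In each command, p1 and p2 stand for the endpoint values reported by cii.

Notice that what we are finding are two binomial distributions that have the same number of trials (20) and successes (5) as our sample. However, the p's are different.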

[Figure: Occupational Health (Rosner page 87): binomials that give endpoints for 95% CI for 0.25. f(x) against x, with three series: p = 0.25, p1 and p2.]

The binomial distribution with probability = $p_1$ (i.e. the circles) is such that its upper tail (i.e. from 5 up, including 5) is equal to 0.025 (or $\alpha/2$).

The binomial distribution with probability = $p_2$ (i.e. the x's) is such that its lower tail (i.e. from 5 down, including 5) is equal to 0.025.

The red triangles above give the original binomial distribution (i.e. n = 20 and successes = 5, so $\hat{p}$ = 0.25).

The green circles give the binomial with n = 20 and $p_1$ (about 0.09). Note that although $p_1$ is the left-hand endpoint of the CI, we obtain $\alpha/2$ = 0.025 from the upper tail of the green (circles) distribution (i.e. the sum is from 5 to 20). The weights at each of 5, 6, ..., 20 added together equal 0.025 (i.e. the sum of the bars). The distribution with $p_1$ is to the left of the original distribution (i.e. the red triangles).

The blue x's give the binomial with n = 20 and $p_2$ (about 0.49). Note that although $p_2$ is the right-hand endpoint of the CI, we obtain $\alpha/2$ = 0.025 from the lower tail of the blue (x's) distribution (i.e. the sum is from 0 to 5). The weights at each of 0, 1, 2, 3, 4 and 5 added together equal 0.025. The distribution with $p_2$ is to the right of the original distribution.

Notice that we know the binomials are in the correct order because $p_1 < \hat{p} < p_2$, so $20p_1 < 20\hat{p} < 20p_2$, and $20p_1$, $20\hat{p}$ and $20p_2$ are the respective means of the three distributions.

Below is the dofile used to obtain the graph above. The */ and /* marks in the graph command are a way of continuing the command onto more than one line.

clear
*binomial3.do
set obs 21
*note that fofx is the binomial density f(x) with n = 20 and p = 0.25
*fofxp1 is the binomial density with n = 20 and p = p1
*fofxp2 is the binomial density with n = 20 and p = p2
*p1 and p2 are the CI endpoints reported by cii 20 5;
*define them (for example as scalars) before running the lines below
gen x = sum(1) - 1
gen fofx = binomial(20,x,0.25) - binomial(20,x-1,0.25)
label variable fofx "p = 0.25"
gen fofxp1 = binomial(20,x,p1) - binomial(20,x-1,p1)
label variable fofxp1 "p1"
gen fofxp2 = binomial(20,x,p2) - binomial(20,x-1,p2)
label variable fofxp2 "p2"
twoway (scatter fofx x, mcolor(red) msymbol(triangle)) (scatter fofxp1 x, mcolor(mint) msymbol(circle)) /*
*/ (scatter fofxp2 x, mcolor(blue) msymbol(lgx)), ytitle(f(x)) xline(5) /*
*/ title(example 4: Original binomial (p = 0.25)) /*
*/ subtitle(with binomials that give endpoints for 95% CI) /*
*/ legend(on rows(1)) scheme(s1color) plotregion(lwidth(none))

Stata 9 version:
gen fofx = Binomial(20,x,0.25) - Binomial(20,x+1,0.25)

In order to use the non-immediate form, I created a little file with 20 observations and two variables: outcome (5 coded as 1 and 15 coded as zero) and newoutcome (5 coded as 2 and 15 coded as 1). You can see below that when I coded the variable newoutcome as 1/2 rather than 0/1, you just don't get an answer. In order to use the Stata command your variable must be coded 1 for success (here a cancer death) and 0 for failure (here a non-cancer death). The need for this coding is a Stata quirk.

. tab outcome

    outcome |  Freq.   Percent     Cum.
     not CA |     15     75.00    75.00
         CA |      5     25.00   100.00
      Total |     20    100.00

. label list
outlbl:
    0 not CA
    1 CA

. ci outcome, binomial
-- Binomial Exact --
Variable | Obs  Mean  Std. Err.  [95% Conf. Interval]
outcome  | [output values not preserved in this copy]

. gen newoutcome = outcome + 1

. tab newoutcome

 newoutcome |  Freq.   Percent     Cum.
          1 |     15     75.00    75.00
          2 |      5     25.00   100.00
      Total |     20    100.00

. ci newoutcome, binomial
-- Binomial Exact --
Variable | Obs  Mean  Std. Err.  [95% Conf. Interval]
[no interval is reported for newoutcome]

See addition on the next page.

Amendment to Chapter 6 Part 6 handout.

Below is what I hope is a clearer explanation of Problem 5.66 than was in the earlier version of the handout.

So let us go back to Rosner's Problem 5.66. In Problem 5.65 we found that for 17 year old boys the probability of having elevated diastolic BP (i.e. DBP $\ge$ 90) was 0.0105 or 1.05% (the population p or population percent). This means that if the population percent held for the sample of 2000, we would expect approximately 21 (2000*0.0105) successes (i.e. 21 boys with elevated DBP).

Twenty-five is the number of successes (boys with elevated DBP) in the sample of 2000. So $\hat{p}$ (the sample estimate of the population p) = 25/2000 = 0.0125. Below we see that the 95% CI for 0.0125 (i.e. the 95% CI about $\hat{p}$) is (0.008, 0.018).

Notice that 0.0105 (the population percentage of successes) is in the 95% confidence interval about the sample estimate ($\hat{p}$ = 0.0125) of p. This means that the sample estimate of p (0.0125) and the population value of p (0.0105) are considered to be similar. Or we might say that it is reasonable to believe that the sample of 2000 with 25 successes could have been drawn from the population with p (probability of success) = 0.0105.

. cii 2000 25, level(95)
-- Binomial Exact --
Variable | Obs  Mean  Std. Err.  [95% Conf. Interval]
[output values not preserved in this copy]

. return list
scalars:
    r(ub), r(lb), r(se)  [values not preserved in this copy]
    r(mean) = .0125
    r(N) = 2000

. di r(ub) - r(lb)
(= length of the interval)

. cii 2000 25, level(90)
-- Binomial Exact --
Variable | Obs  Mean  Std. Err.  [90% Conf. Interval]
[output values not preserved in this copy]

. di r(ub) - r(lb)

Once you know what Stata is calling the upper bound and lower bound, you do not have to use the return list command.

. cii 2000 25, level(85)
-- Binomial Exact --
Variable | Obs  Mean  Std. Err.  [85% Conf. Interval]
[output values not preserved in this copy]

. di r(ub) - r(lb)

We can see that the 85% CI has the shortest length and the 95% CI has the longest length. So the more certain you want to be, the longer the interval.
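The comparison of interval lengths can be reproduced with a short Python sketch. It assumes the exact interval is found by bisection on the binomial cumulative distribution, which is not necessarily how Stata's cii works internally, and the names `cdf`, `solve_p` and `exact_ci` are hypothetical.

```python
from math import exp, lgamma, log, log1p


def cdf(x, n, p):
    """P(X <= x) for X ~ binomial(n, p); each term is built on the log scale."""
    return sum(exp(lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
                   + k * log(p) + (n - k) * log1p(-p)) for k in range(x + 1))


def solve_p(k, n, target):
    """Find p with P(X <= k) = target; the cdf decreases as p increases."""
    lo, hi = 1e-12, 1 - 1e-12
    for _ in range(200):
        mid = (lo + hi) / 2
        if cdf(k, n, mid) > target:
            lo = mid  # cdf still too large, so p must increase
        else:
            hi = mid
    return (lo + hi) / 2


def exact_ci(x, n, level):
    """Exact interval at the given confidence level (in percent), for 0 < x < n."""
    alpha = 1 - level / 100
    # Lower endpoint: P(X >= x) = alpha/2, i.e. P(X <= x - 1) = 1 - alpha/2.
    # Upper endpoint: P(X <= x) = alpha/2.
    return solve_p(x - 1, n, 1 - alpha / 2), solve_p(x, n, alpha / 2)


lengths = {}
for level in (85, 90, 95):
    lb, ub = exact_ci(25, 2000, level)
    lengths[level] = ub - lb
    print(level, round(lb, 4), round(ub, 4), round(ub - lb, 4))
```

The printed lengths should increase from the 85% level to the 95% level, matching the conclusion above.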
