SGSB Workshop: Using Statistical Data to Make Decisions
Module 2: The Logic of Statistical Inference
Dr. Tom Ilvento, January 2006
Dr. Mugdim Pašić

Key Objectives
Understand the logic of statistical inference.
Understand the concept of a sampling distribution, a sample statistic, and a standard error.
Understand the use of a confidence interval.
Understand the basic elements of a hypothesis test.
Understand the p-value in a statistical test.

Z-scores
If we take a value of a variable, subtract the mean from it, and divide by the standard deviation, we have created a z-score:

    z = (x_i - x̄) / s_x

A z-score expresses a value as being so many standard deviations away from the mean (1 standard deviation, 2 standard deviations, or even 1.75 standard deviations).

Let's be detectives and determine if a fraud has been committed!
A wholesale furniture company had a fire in its warehouse. After determining that the fire was an accident, the company sought to recover costs by making a claim to the insurance company. The retailer had to submit data to estimate the Gross Profit Factor (GPF):

    GPF = Profit / Selling Price * 100

The retailer estimated the GPF from what was described as a random sample of 253 items sold in the past year and calculated a GPF of 50.8% (n = 253). The insurance company was suspicious of this value and, based on past experience, expected a value closer to 48%. The insurance company hired us to record all sales in the past year (N = 3,005) to calculate a population GPF.

Here are the descriptive statistics of the GPF for the population (N = 3,005):

    Mean                  48.901
    Standard Error         0.252
    Median                50.210
    Mode                  58.330
    Standard Deviation    13.829
    Sample Variance      191.244
    Kurtosis              61.061
    Skewness              -4.216
    Range                350.510
    Minimum             -202.510
    Maximum              148.000
    Sum               146947.440
    Count                   3005

Let's calculate the z-score (and its probability) for the mean level reported by the store, 50.8, treating it as a single value from this population:

    z = (50.8 - 48.9) / 13.83 = 0.137
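
A minimal Python sketch of this z-score calculation, using the population figures above:

    # z-score of the retailer's claimed GPF against the population figures
    pop_mean = 48.901      # population mean GPF
    pop_sd   = 13.829      # population standard deviation
    claim    = 50.8        # retailer's claimed GPF

    z = (claim - pop_mean) / pop_sd
    print(round(z, 3))     # about 0.137: well within one standard deviation of the mean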

[Figure: Level of approval of how President Bush is handling gas prices, September 2005; results from AP, ABC, Zogby*, and CNN polls. *Based on the combined Excellent and Good ratings.]

We expect variability from sample to sample; we call it sampling error.

Now we move toward inference
Remember, we noted that:
A parameter is a numerical descriptive measure of the population. We use Greek letters to represent it, and it is hardly ever known.
A sample statistic is a numerical descriptive measure from a sample, based on the observations in the sample. We want the sample to be derived from a random process.

Sampling Distribution
A sampling distribution is the notion of taking many samples of the same size, making an estimate from each sample (for example, a mean), and then looking at how the sample estimates are distributed. Later, we compare our one sample (with its estimate) to the theoretical sampling distribution.

Sampling Distribution Example
What if we rolled three dice, calculated the mean of the three rolls, and then repeated this many times (there are 216 possible outcomes)? What would the distribution of the many rolls look like? The resulting distribution is called a sampling distribution for the mean of the roll of three dice. (A simulation sketch follows this section.)

As an experiment, a few rolls of three dice might look like this:

    Roll of 3 dice:  5,4,1   4,4,3   5,5,2   6,1,1   6,4,2   3,3,2
    Mean:            3.33    3.67    4.00    2.67    4.00    2.67
    Median:          4       4       5       1       4       3

[Stem-and-leaf display of the a priori sampling distribution: all 216 equally likely means of three dice. Stem is the ones digit; leaf is the first decimal place.]

It is a symmetrical, bell-shaped curve; more specifically, it looks like a normal distribution.
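
We can approximate this a priori sampling distribution by simulation. A minimal Python sketch (the 10,000 replications are an assumption for illustration, rather than enumerating all 216 outcomes):

    import random
    from statistics import mean, stdev

    # Roll three dice, take the mean, and repeat many times to build an
    # approximate sampling distribution of the mean of three rolls.
    random.seed(1)
    sample_means = [mean(random.randint(1, 6) for _ in range(3)) for _ in range(10_000)]

    print(round(mean(sample_means), 2))   # close to 3.5, the mean of a single die
    print(round(stdev(sample_means), 2))  # close to 1.7078 / sqrt(3) = 0.99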

Sampling Distribution of the Sample Mean (x̄) for rolling 3 dice
Descriptive statistics for the 216 possible sample means:

    Mean                  3.50
    Standard Error        0.07
    Median                3.50
    Mode                  3.33
    Standard Deviation    0.99
    Sample Variance       0.98
    Kurtosis             -0.40
    Skewness              0.00
    Range                 5.00
    Minimum               1.00
    Maximum               6.00
    Sum                    756
    Count                  216

The standard deviation for rolling a single die is σ = 1.7078. Divide this figure by the square root of 3 (the sample size):

    1.7078 / √3 = 0.986

This is also called the standard error of x̄. (A quick check of these figures appears after this section.)

Sampling Distribution: so what?
Sample statistics are random variables; they vary from sample to sample. Sometimes the random sampling distribution is known, e.g., the normal distribution or the t-distribution. If we know what the distribution looks like, then we can calculate the probability of taking a random sample and getting our estimate of the parameter if the real value is hypothesized to be a null value.

What do we want to see?
If the sample mean is a good estimator of μ, we would expect the values of the mean to cluster around μ. We wouldn't want the cluster to sit at a point above or below μ (the estimator should not be biased). And we might say our estimator is good if the cluster of sample means around μ is tighter than the sampling distribution of some other possible estimator (minimum variance).

If we think of it as a bullseye target, we want our estimates to center around the true value. A tighter fit around the target (minimum variance) is better and preferred. A biased estimator may have a tight fit, but it consistently misses the target in a discernible way.
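
A quick Python check of the σ = 1.7078 and 0.986 figures above:

    from math import sqrt

    # Standard deviation of a single fair die, from its distribution,
    # and the standard error of the mean of three rolls.
    faces = range(1, 7)
    mu = sum(faces) / 6                                   # 3.5
    sigma = sqrt(sum((x - mu) ** 2 for x in faces) / 6)   # about 1.7078
    print(round(sigma, 4), round(sigma / sqrt(3), 3))     # 1.7078 and 0.986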

Inferences from a sample
This is the kind of pattern we would like to see: a tight fit around the population parameter.

Our sample estimator of the population mean is:

    x̄ = Σx / n

The variance of our sample estimate is given as:

    s² = Σ(x - x̄)² / (n - 1)

where n is the sample size. s² is an unbiased estimator of the population variance σ².

The standard deviation represents the average deviation around the sample mean. But we only took one sample out of an infinite number of possible samples. A reasonable question would be: what is the deviation around our estimator (i.e., the sample mean)? If we took many samples and recorded the mean of each sample, what would the spread of those means look like?

If we could take an infinite number of samples, each sample would most likely yield a different sample mean, yet each one would be a reasonable estimate of the true population mean. So, if we were able to take repeated samples, each of sample size n, what would be the standard deviation of the sample estimates?

Sampling theory specifies the variance of the sampling distribution of the mean as:

    Var(x̄) = Var(Σx / n) = σ² / n,  so  σ_x̄ = σ / √n

This is called the standard error of the mean. (A small code sketch of these formulas follows this section.)

We use two theorems to help us make inferences
In the case of the mean, we use two theorems concerning the normal distribution that help us make inferences. One depends upon the variable of interest being normally distributed; the other, the Central Limit Theorem, does not.
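
A small Python sketch of these formulas, estimating the mean and its standard error from a single sample (the data values are hypothetical, purely for illustration):

    from math import sqrt
    from statistics import mean, stdev

    sample = [52.1, 47.8, 50.3, 49.6, 51.0, 48.2, 50.9, 49.4]   # hypothetical data

    n = len(sample)
    x_bar = mean(sample)              # sample mean, x-bar = sum(x) / n
    se = stdev(sample) / sqrt(n)      # estimated standard error, s / sqrt(n)
    print(round(x_bar, 2), round(se, 3))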

For variables that are distributed normally
If repeated samples of a variable Y of size n are drawn from a normal distribution with mean μ and variance σ², the sampling distribution of the mean will be distributed normally, with mean equal to μ and variance equal to σ²/n:

    σ_x̄ = σ / √n

What this means: if we could repeatedly take random samples of size n from a normal distribution, and then take the mean of each sample, we would expect the mean of the sample means to equal the population value (μ), and the variance of the sample means to equal σ²/n.

Central Limit Theorem
If repeated samples of Y of size n are drawn from any population (regardless of whether its distribution is normal or otherwise) having mean μ and variance σ², the sampling distribution of the sample means approaches normality, with mean μ and variance σ²/n, as long as the sample size is sufficiently large.

The Central Limit Theorem is a very powerful theorem: it relaxes the assumption about the distribution of the population variable. Note that this is based on the notion that our samples are drawn on a random or probability basis, that is, each element of the population has an equal chance of being selected. (A simulation sketch follows this section.)

[Figure: Sampling distribution of the mean for different population distributions and different sample sizes.]

Comparison of the characteristics of the population, the sample, and the sampling distribution for the mean:

    Characteristic       Population                Our Sample                  Sampling Distribution
    Referred to as       Parameters                Sample statistics           Statistics
    How it is viewed     Assumed real              Observed                    Theoretical
    Mean                 μ = ΣX / N                x̄ = Σx / n                  μ_x̄ = μ
    Variance             σ² = Σ(X - μ)² / N        s² = Σ(x - x̄)² / (n - 1)    σ²_x̄ = σ² / n
    Standard deviation   σ                         s = (s²)^0.5                σ_x̄ = σ / √n

    Note: N and X are for the population; n and x are for the sample.
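
A minimal simulation sketch of the Central Limit Theorem in Python (the exponential population, the sample size of 50, and the 5,000 replications are assumptions chosen for illustration):

    import random
    from statistics import mean, stdev

    # Draw repeated samples from a clearly non-normal population (exponential,
    # mean 1.0, SD 1.0) and look at the distribution of the sample means.
    random.seed(2)
    n = 50
    means = [mean(random.expovariate(1.0) for _ in range(n)) for _ in range(5_000)]

    print(round(mean(means), 3))    # close to the population mean, 1.0
    print(round(stdev(means), 3))   # close to sigma / sqrt(n) = 1.0 / sqrt(50) = 0.141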

Sampling theory and sampling distributions help us make inferences to a population
Let's use the example of the mean to set up our discussion of a sampling distribution. Suppose we are looking at a variable, e.g., interest and fee income from a credit card customer. We think of the population as being real, or even observable, but we will work with a sample. We believe there is an average income for this population, designated as μ, and we want to take a sample to estimate μ.

So how do we use this information?
We draw a random sample. We think of our sample as one of many possible samples of size n from a population with parameters μ and σ. If the variable is distributed normally, we can use information about the sampling distribution of the mean to make inferences from the sample to the population. Even if the variable is not distributed normally, if our sample size (n) is large enough, we can assume the sampling distribution of the sample mean is distributed normally (Central Limit Theorem).

Inferences from a sample
The standard error of the mean is the standard deviation of a sampling distribution of means with population parameters equal to μ and σ². If we don't know σ², we use the unbiased sample estimate s² to estimate the sampling variance of the mean.

Here's our strategy
We use the theoretical sampling distribution to make inferences from our sample to the population. The sampling distribution of an estimator is based on repeated samples of the same sample size (denoted as n). We may never actually take repeated samples, but we can think of this happening, and of our observed sample as one of many possible samples of size n that we could have drawn from the population.

We expect that the standard deviation of the sampling distribution of the estimator (in this case the mean) will be smaller than that of the population or of the samples themselves. We expect some variability across samples, but not as much as we would find in the population. Thus the sampling error is smaller than the standard deviation of the population.

The standard error depends upon:
the size of n (as n gets larger, the SE gets smaller), and
the variance of the population variable itself, which we can think of as homogeneity.
The larger the sample size, and the more homogeneous the population, the smaller the standard error of our estimator. (A short sketch of this relationship follows this section.)
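
To make the dependence on n concrete, a small Python sketch using the GPF population standard deviation from earlier (the list of sample sizes is an arbitrary choice for illustration):

    from math import sqrt

    # How the standard error shrinks as n grows, for a fixed population SD.
    sigma = 13.83
    for n in (50, 253, 1000, 3005):
        print(n, round(sigma / sqrt(n), 3))
    # At n = 3005 this reproduces the 0.252 standard error shown in the
    # population descriptive statistics above.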

Let's return to our detective work and determine if a fraud has been committed
Recall the furniture retailer's claim: a GPF of 50.8%, based on what was described as a random sample of 253 items sold in the past year, against an expected value closer to 48%. From all sales in the past year (N = 3,005) we found a population mean of 48.901 and a population standard deviation of 13.829. Treated as a single value, the claimed mean gave a z-score of only 0.137.

But the store indicated they took a random sample of 253 items. So we should use the standard error, based on n = 253, as the denominator of our z-score:

    z = (50.8 - 48.9) / (13.83 / √253) = 1.90 / 0.869 = 2.185

Something that is more than 2 standard deviations away from the mean in a normal distribution is rare; we expect to find it less than 5% of the time. (A sketch of this calculation appears at the end of this section.)

Confidence Interval
A confidence interval is referred to as an interval estimate of a population parameter. We calculate a bound of error around the estimate, expressed as a plus-or-minus part around the mean. The plus-or-minus part is so many standard deviations around the mean (using the standard error as the measure of the standard deviation of the sampling distribution). How many standard deviations is based on the known probabilities from the normal distribution, or from the t-distribution.

Remember, we only have one sample, and thus one interval estimate. If we could draw repeated samples, 95 percent of the confidence intervals calculated on the sample means would contain the true population parameter. Our one sample interval estimate may not contain the true population parameter.

    x̄ ± z_{α/2} σ_x̄    if the population standard deviation is known
    x̄ ± z_{α/2} s_x̄    use s if the population standard deviation is not known and the sample size is sufficiently large
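
Returning to the fraud example, a minimal Python sketch of the z-score based on the standard error for n = 253, along with the upper-tail probability (this uses statistics.NormalDist, available in Python 3.8+):

    from math import sqrt
    from statistics import NormalDist

    pop_mean, pop_sd, n, claim = 48.901, 13.829, 253, 50.8

    se = pop_sd / sqrt(n)              # about 0.869
    z = (claim - pop_mean) / se        # about 2.18
    p_upper = 1 - NormalDist().cdf(z)  # P(Z >= z), roughly 0.015

    print(round(z, 3), round(p_upper, 4))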

Confidence Interval

    x̄ ± z_{α/2} s_x̄

We take our estimate and put a plus-or-minus bound around it, based on the standard error and so many standard deviations (that is the z_{α/2} part of the formula). Think of it as roughly ±2 standard deviations.

To construct a confidence interval we need a probability level we are comfortable with, i.e., how much certainty. It is also called the confidence coefficient, such as 95%. It is our confidence in being right in our inference. It contrasts with α, which is the chance of being wrong. A 95% C.I. has α = (1 - .95) = .05, or a 5% chance of being wrong; α refers to the probability of being wrong in our confidence interval.

The C.I. Formula
The larger the probability level for a C.I., the smaller the value of α (and of α/2), and the larger the z-value (or t-value) for the formula:

    Confidence level 100(1 - α)      α      α/2     z_{α/2}
    90%                             .10    .050     1.645
    95%                             .05    .025     1.96
    99%                             .01    .005     2.575
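
The z_{α/2} column of this table can be reproduced directly from the standard normal distribution; a minimal Python sketch:

    from statistics import NormalDist

    # z critical value for a given confidence level: the (1 - alpha/2) quantile.
    for conf in (0.90, 0.95, 0.99):
        alpha = 1 - conf
        z = NormalDist().inv_cdf(1 - alpha / 2)
        print(f"{conf:.0%}  alpha = {alpha:.2f}  alpha/2 = {alpha / 2:.3f}  z = {z:.3f}")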

The width of the confidence interval depends on α
For example, with a standard error of .424:

    90% C.I.: x̄ ± 1.645(.424) = x̄ ± .697
    95% C.I.: x̄ ± 1.96(.424)  = x̄ ± .832
    99% C.I.: x̄ ± 2.575(.424) = x̄ ± 1.092

For any given sample size, if you want to be more certain (that is, have a smaller α), you have to accept a wider interval.

What is the Z-value?
It is based on the standard normal distribution. A value of ±1.96 corresponds to a probability of .95 in the standard normal table. Think of 1.96 as roughly 2: we are saying that ±2 standard deviations correspond to 95% of the observations (the Empirical Rule), with the standard deviations represented by the standard error.

Most software packages will use the t-distribution instead of the z (standard normal). The t-distribution adjusts the z-value for degrees of freedom; the degrees of freedom in this case are n - 1. The t-distribution fits well with the Central Limit Theorem: as the sample size gets larger, the t-distribution more closely approximates the standard normal distribution and the z-values.

Table 3. Comparison of z-values and t-values for different sample sizes for a 95% confidence interval

    Sample size   Z-value     t-value
    10            too small   2.262
    20            too small   2.093
    30            1.96        2.045
    50            1.96        2.009
    100           1.96        1.984
    500           1.96        1.965
    1000          1.96        1.962

Modified C.I. formula using a t-value:

    x̄ ± t_{α/2, n-1 d.f.} (s / √n)

What is a Confidence Interval?
It is an interval estimate of a population parameter. The plus-or-minus part is also known as a bound of error, placed in a probability framework.
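
The t-value column of Table 3 can be reproduced with a t-distribution quantile function; a sketch in Python, assuming SciPy is available:

    from scipy.stats import t   # third-party dependency (SciPy)

    # 95% CI t-value for each sample size; it approaches the z-value 1.96 as n grows.
    for n in (10, 20, 30, 50, 100, 500, 1000):
        print(n, round(t.ppf(0.975, df=n - 1), 3))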

What is a Confidence Interval? (continued)
Not all of the confidence intervals will contain the true population parameter. We calculate the probability that the estimation process will result in an interval that contains the true value of the population mean. If we had repeated samples, most of the C.I.s would contain the population parameter, but not all of them will.

[Figure 2. 95% C.I. for 55 samples, N = 50: most, but not all, of the intervals cover the population mean.]

If you want to be more certain (e.g., a 99% C.I.), you must accept a wider interval.

[Figure 3. 99% C.I. for 55 samples, N = 50.]

Calculating a C.I. in Excel
Excel will calculate confidence intervals for us using Tools > Data Analysis > Descriptive Statistics, with the following options:
Identify the Input Range, and mark a label if needed.
Identify the Output Range.
Check Descriptive statistics.
Check and set the level of the Confidence Interval.
Excel gives the plus/minus part using a t-value. (A rough code equivalent is sketched after this section.)

An Example: Employee Salaries, n = 150
One of the Excel files available to you is some data on salaries and characteristics of employees, called Manager Ratings.xls. We will focus on the Rating and Salary variables. I have provided a histogram of salary and the Excel output from the descriptive statistics, along with a 95% confidence interval.

[Figure: Histogram of salary (frequency by salary).]
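
A rough Python equivalent of what Excel's Descriptive Statistics tool reports: the "Confidence Level (95.0%)" figure is the plus/minus part, t multiplied by s / √n. The salary values below are hypothetical stand-ins, not the Manager Ratings.xls data:

    from math import sqrt
    from statistics import mean, stdev

    salaries = [62, 71, 85, 69, 74, 58, 90, 77, 66, 80]   # hypothetical data, in thousands

    n = len(salaries)
    se = stdev(salaries) / sqrt(n)
    t_crit = 2.262                     # 95% t-value for 9 d.f. (from Table 3 above)
    print(round(mean(salaries), 2), round(t_crit * se, 2))   # mean and plus/minus part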

Excel Output

                                RATING      SALARY
    Mean                         5.901      71.633
    Standard Error               0.121       0.874
    Median                       5.800      71.000
    Mode                         5.000      76.000
    Standard Deviation           1.486      10.704
    Sample Variance              2.208     114.569
    Kurtosis                    -0.304      -0.173
    Skewness                     0.077       0.370
    Range                        7.300      55.000
    Minimum                      1.800      48.000
    Maximum                      9.100     103.000
    Coefficient of Variation    25.182      14.942
    Sum                          885.1       10745
    Count                          150         150
    Confidence Level (95.0%)     0.240       1.727

Confidence Interval Example
The calculation of the confidence interval (salary is recorded in thousands of dollars) is:

    $71.633 ± 1.9761(.874) = $71.633 ± $1.727
    95% C.I. = $69.906 to $73.360

We are 95% confident that the true mean salary is somewhere between $69,906 and $73,360.

Hypothesis Testing
An alternative to a confidence interval is to use the point estimate and conduct a hypothesis test. We compare our estimate to a hypothesized value, in the context of a sampling distribution. We are going to ask: how likely was it that we took a random sample and obtained this sample statistic if the real population value is the hypothesized value?

Rare Event Approach
Most inferences will be made using a rare event approach. We will take a sample, compare it to a hypothesized population, and see how close to or far from the sampling distribution our sample estimate lies.

Automobile Batteries
The manufacturer claims that the life of his automobile batteries is 54 months on average, with a standard deviation of 6 months. We are involved in a consumer group, and we decide to take a sample of 100 batteries and test the claim. We select 100 batteries at random, test them over time, and record the length of the battery life.

Our Sample Data
The mean battery life for our sample is:

    Mean = 52 months
    Std Dev = 4.5 months

So what do we do next? Our batteries didn't last as long on average as the manufacturer said, but it is just a sample. How can we test whether the claim is legitimate?

Our Testing Strategy
If the world works as the manufacturer says, and we had taken repeated samples of size 100, the sampling distribution would be a normal distribution with a mean equal to the population mean for battery life, i.e., 54 months, and a standard deviation of σ divided by the square root of n:

    σ_x̄ = 6 / √100 = 0.60

We want to look at our sample as being part of the theoretical sampling distribution. That is,

    x̄ ~ N(μ, σ/√n), in this case x̄ ~ N(54, 0.6)

and see how likely it is that our sample came from that distribution. In other words, how likely is it to get a sample mean of 52 from a random sample of 100 batteries when the true population mean is 54 months?

How do I do this?
I hypothesize that the true mean is 54. I calculate a z-score based on my sample value (52.0), the hypothesized mean, and the standard error (of the sampling distribution). I then look up the probability of finding a z-score equal to or less than the calculated value.

The Test
Calculate my z-score:

    z = (52 - 54) / 0.6 = -3.33

Look up the value in the standard normal table, and draw it out! A z of -3.33 corresponds to a probability of about .4996 between the mean and that point, leaving p of roughly .0004 beyond it. This is really a rare event, given the manufacturer's claim that the batteries last 54 months on average. (A sketch of the calculation appears at the end of this section.)

[Figure: sampling distribution centered at 54 months, with the sample mean of 52 months in the far lower tail.]

Basic Elements of a Hypothesis Test
Null Hypothesis
Alternative Hypothesis
Test Statistic
Assumptions
Significance Level and Conclusion
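
Returning to the battery example, a minimal Python sketch of the z-test calculation and its lower-tail probability:

    from math import sqrt
    from statistics import NormalDist

    mu0, sigma, n, x_bar = 54, 6, 100, 52   # claimed mean, claimed SD, sample size, sample mean

    se = sigma / sqrt(n)                    # 0.60
    z = (x_bar - mu0) / se                  # -3.33
    p_lower = NormalDist().cdf(z)           # P(Z <= z), about 0.0004

    print(round(z, 2), round(p_lower, 4))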

Basic Elements of a Hypothesis Test
Null Hypothesis: the hypothesis that will be accepted unless we have convincing evidence to the contrary. The null hypothesis is based on expectations of no change, nothing happening, no difference, the same old same old; it is a straw man. It is in contrast to the rare event, and in most cases we want to reject the null hypothesis. The null hypothesis is often expressed as H_o, in the following form:

    H_o: μ = $68,500

Alternative Hypothesis (or research hypothesis): we look to see if the data provide convincing evidence of its truth. The alternative hypothesis is more in line with our true expectations for the experiment or research. Our alternative hypothesis can be that our sample value is different from the hypothesized value (either higher or lower), or we can specify that we expect it to be above or below the hypothesized value. The alternative hypothesis is expressed as H_a, in one of the following forms:

    H_a: μ > $68,500    one-tailed, upper
    H_a: μ < $68,500    one-tailed, lower
    H_a: μ ≠ $68,500    two-tailed test

Test Statistic: we calculate a z-score based on the observed value from our sample and the expected value under the hypothesis, divided by the standard error from our knowledge of the sampling distribution:

    Z* = (x̄ - μ) / σ_x̄

The z-score approach tells us how many standard deviations our test statistic is away from the hypothesized value. A test statistic that is more than 2 standard deviations away from the null hypothesis is considered relatively rare.

Assumptions: there are generally some assumptions that go into any statistical test. In most cases the test assumes the sample is drawn in a random fashion and the sampling distribution is known, along with assumptions about the variability in the data and the sample size. For small samples we must assume that the population is distributed approximately normally; we can relax that assumption as the sample size gets larger, based on the Central Limit Theorem.

Significance Level and Conclusion: we place a probability around our test statistic that gives us a sense of how confident we can be in rejecting (or not rejecting) the null hypothesis. One approach is to set a critical value based on an a priori probability level, usually .1, .05, or .01. An alternative approach is to calculate a p-value for the test statistic.
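
A small Python sketch of how a p-value would be computed for each form of the alternative hypothesis (the helper function name is an arbitrary choice):

    from statistics import NormalDist

    def p_value(z, tail):
        # p-value of a z test statistic for an upper, lower, or two-tailed alternative
        nd = NormalDist()
        if tail == "upper":
            return 1 - nd.cdf(z)
        if tail == "lower":
            return nd.cdf(z)
        return 2 * (1 - nd.cdf(abs(z)))    # two-tailed

    print(round(p_value(2.185, "upper"), 4))   # the GPF z-score from earlier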

This probability level is referred to as alpha (α) and is the probability of being wrong when rejecting the null hypothesis. By this we mean that we might have a sample estimate that is more than 2 standard deviations away from the hypothesized value. While this may be a rare event, it is still a possible event; it is possible, but not probable. We could reject the null hypothesis as being wrong, but we can never be certain. We pick a level of alpha that is relatively small.

The most common way to express the significance of a test is to calculate a p-value for the test statistic. Instead of specifying α a priori, we could look at the observed significance level associated with our test statistic. This is called a p-value, and most software programs compute it for you. If the p-value is less than the alpha level you are comfortable with, you have evidence to reject the null hypothesis in favor of the alternative hypothesis.

Salary Example of a Hypothesis Test
Let's say the industry average for manager salaries is $68,500, and we want to see if our company is above the industry average (a one-tailed test, upper). We are using a sample of 150 employees, which we assume was taken randomly, to make this test. We will set up the following hypothesis test:

    H_o: μ = $68,500
    H_a: μ > $68,500

    Test statistic: t* = (71,633 - 68,500) / 874 = 3.5847

Our test statistic is over 3.5 standard deviations away from the hypothesized value: a very rare event, too rare for chance! (A sketch of this calculation follows this section.)

Salary hypothesis test output (salary in thousands of dollars):

    Data
    Null Hypothesis μ =               68.5
    Level of Significance             0.05
    Sample Size                        150
    Sample Mean                71.63333333
    Sample Standard Deviation  10.70370736

    Intermediate Calculations
    Standard Error of the Mean  0.873954046
    Degrees of Freedom                  149
    t Statistic                 3.585238088

    Upper-Tail Test
    Upper Critical Value        1.655143933
    p-Value                     0.000227888
    Reject the null hypothesis

Rules of Thumb
A t-test value greater than 2 (or less than -2) begins to show a rare event. A p-value that is less than .05 also shows a relatively rare event.
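
A minimal Python sketch reproducing the salary test statistic from the summary figures above (salaries in thousands of dollars):

    from math import sqrt

    mu0, x_bar, s, n = 68.5, 71.63333333, 10.70370736, 150

    se = s / sqrt(n)                   # about 0.874
    t_stat = (x_bar - mu0) / se        # about 3.585
    print(round(se, 4), round(t_stat, 4))
    # With 149 degrees of freedom the upper-tail p-value is about 0.0002 (see the
    # output above), so we reject the null hypothesis.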

Our Test Statistics will generally take the form:

    test statistic = (observed - expected) / standard error

Observed: an estimate based on our sample.
Expected: a hypothesized value under the null hypothesis.
Standard error: a derived value based on a sampling distribution.

Confidence Interval for a Proportion
It is easy!

    p = proportion,  q = 1 - p
    Variance = p*q
    Standard Error = √(p*q / n)

If n = 1,000, the standard error = .0158 (assuming p = .5 and q = .5), and 1.96 * .0158 = .031. This gives us the familiar ±3 percent margin of error.
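
A minimal Python sketch of the proportion example above:

    from math import sqrt

    # Standard error and 95% margin of error for a sample proportion,
    # using the conservative p = q = 0.5 and n = 1000.
    p, n = 0.5, 1000
    q = 1 - p

    se = sqrt(p * q / n)               # about 0.0158
    margin = 1.96 * se                 # about 0.031, i.e. roughly plus or minus 3 points
    print(round(se, 4), round(margin, 3))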