STAT 250
Dr. Kari Lock Morgan

Describing Data: One Quantitative Variable
SECTIONS 2.2, 2.3
One quantitative variable (2.2, 2.3)

The Big Picture
[Diagram labels: Population, Sampling, Sample, Statistical Inference, Descriptive Statistics]

Descriptive Statistics
In order to make sense of data, we need ways to summarize and visualize it.
Summarizing and visualizing variables and relationships between two variables is often known as descriptive statistics (also known as exploratory data analysis).
The type of summary statistics and visualization methods depends on the type of variable(s) being analyzed (categorical or quantitative).
Today: One quantitative variable

Question of the Day
How obese are Americans?

Obesity Trends* Among U.S. Adults
BRFSS, 1990, 2000, 2010
(*BMI ≥ 30, or about 30 lbs. overweight for a 5'4" person)
[Maps: 1990, 2000, 2010. Legend: No Data, <10%, 10%-14%, 15%-19%, 20%-24%, 25%-29%, ≥30%]
Source: Behavioral Risk Factor Surveillance System, CDC.

Obesity in America
Obesity is a HUGE problem in America.
We'll explore the question of obesity in America with two different types of data, both collected by the CDC:
Proportion of adults who are obese in each state
BMI for a random sample of Americans
Obesity by State: 2013
Source: Behavioral Risk Factor Surveillance System
http://www.cdc.gov/obesity/data/table-adults.html

Dotplot
In a dotplot, each case is represented by a dot, and dots are stacked.
A dotplot is an easy way to see each case.
Minitab: Graph -> Dotplot -> One Y -> Simple

Histogram
The height of each bar corresponds to the number of cases within that range of the variable.
Example: 5 states have an obesity rate between 33.25 and 33.75.
Minitab: Graph -> Histogram -> Simple

Shape
Symmetric
Right-Skewed (long right tail)
Left-Skewed

National Health and Nutrition Examination Survey
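The idea that a histogram bar's height is the number of cases in a range can be sketched in a few lines of Python. This is only an illustration: the function name `histogram_counts` and the obesity rates are hypothetical (they are not the CDC state data), and the bin edges are chosen to line up with the 33.25 to 33.75 bar mentioned on the slide.

```python
import math
from collections import Counter


def histogram_counts(values, start, width):
    """Count how many values fall in each bin [left, left + width)."""
    counts = Counter()
    for v in values:
        index = math.floor((v - start) / width)  # which bin v lands in
        left_edge = start + index * width        # left end of that bin
        counts[left_edge] += 1
    return dict(counts)


# Hypothetical obesity rates (percent), NOT the real state data.
rates = [29.1, 29.4, 30.2, 33.3, 33.5, 33.7]

# Bins of width 0.5 starting at 28.75, so one bin is 33.25 to 33.75.
bar_heights = histogram_counts(rates, start=28.75, width=0.5)
for left in sorted(bar_heights):
    print(f"{left:.2f} to {left + 0.5:.2f}: {bar_heights[left]} case(s)")
```

With these made-up values, three cases land in the 33.25 to 33.75 bin, so that bar would have height 3.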
BMI of Americans
The distribution of BMI for American adults is
a) Symmetric
b) Left-skewed
c) Right-skewed

Notation
The sample size, the number of cases in the sample, is denoted by $n$.
We often let $x$ or $y$ stand for any variable, and $x_1, x_2, \ldots, x_n$ represent the $n$ values of the variable $x$.
Example: $x_1 = 32.4$, $x_2 = 28.4$, $x_3 = 26.8$, ...

Mean
The mean or average of the data values is
$$\text{mean} = \frac{\text{sum of all data values}}{\text{number of data values}} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x}{n}$$
Sample mean: $\bar{x}$
Population mean: $\mu$ ("mu")
Minitab: Stat -> Basic Statistics -> Display Descriptive Statistics

Mean
The average obesity rate across the 50 states is $\mu = 28.606$.
The average BMI for Americans in this sample is $\bar{x} = 24.887$.

Median
The median, $m$, is the middle value when the data are ordered. If there is an even number of values, the median is the average of the two middle values.
The median splits the data in half.
Minitab: Stat -> Basic Statistics -> Display Descriptive Statistics
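The mean and median definitions on this page can be written out directly. The sketch below is in Python rather than Minitab, and it uses only the three values listed on the Notation slide, so its results are not the summary values reported for the full datasets. The fourth value added for the even-count case is hypothetical.

```python
import statistics


def mean(values):
    """Sum of all data values divided by the number of data values."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the ordered data; average the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Only the three values shown on the Notation slide, not the full sample.
first_three = [32.4, 28.4, 26.8]
print(mean(first_three), median(first_three))

# Even number of values: 30.0 is a hypothetical extra value for illustration.
four_values = first_three + [30.0]
print(mean(four_values), median(four_values))

# Python's standard library gives the same answers.
print(statistics.mean(first_three), statistics.median(four_values))
```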
Measures of Center
For symmetric distributions, the mean and the median will be about the same.
For skewed distributions, the mean will be pulled further in the direction of the skewness than the median.

Measures of Center
$m = 24.163$, $\bar{x} = 24.887$
The mean is pulled in the direction of skewness.

Skewness and Center
A distribution is left-skewed. Which measure of center would you expect to be higher?
a) Mean
b) Median

Outlier
An outlier is an observed value that is notably distinct from the other values in a dataset.

Resistance
A statistic is resistant if it is relatively unaffected by extreme values.
The median is resistant, while the mean is not.

                   Mean      Median
With Outlier       105.22    101.0
Without Outlier    102.56    100.5
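Resistance can be checked numerically by computing each statistic with and without an extreme value. The data in this Python sketch are hypothetical (they are not the dataset behind the table above); the point is only to show the mean shifting much more than the median.

```python
import statistics

# Hypothetical data with one extreme value (150); not the slide's dataset.
with_outlier = [98, 99, 100, 101, 102, 103, 150]
without_outlier = [v for v in with_outlier if v != 150]

mean_with = statistics.mean(with_outlier)
mean_without = statistics.mean(without_outlier)
median_with = statistics.median(with_outlier)
median_without = statistics.median(without_outlier)

# How far each statistic moves when the outlier is dropped.
mean_shift = abs(mean_with - mean_without)
median_shift = abs(median_with - median_without)

print(f"mean:   {mean_with:.2f} -> {mean_without:.2f} (shift {mean_shift:.2f})")
print(f"median: {median_with:.2f} -> {median_without:.2f} (shift {median_shift:.2f})")
```

In this made-up example the mean moves by about 7 units while the median moves by half a unit, which is the behavior the slide describes as the median being resistant.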
Outliers
When using statistics that are not resistant to outliers, stop and think about whether the outlier is a mistake.
If not, you have to decide whether the outlier is part of your population of interest or not.
Usually, for outliers that are not a mistake, it's best to run the analysis twice, once with the outlier(s) and once without, to see how much the outlier(s) are affecting the results.

Standard Deviation
The standard deviation for a quantitative variable measures the spread of the data.
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
Sample standard deviation: $s$
Population standard deviation: $\sigma$ ("sigma")
Minitab: Stat -> Basic Statistics -> Display Descriptive Statistics

Standard Deviation
The larger the standard deviation, the more variability there is in the data and the more spread out the data are.
The standard deviation gives a rough estimate of the typical distance of a data value from the mean.

Standard Deviation
[Figure: two distributions on the same horizontal scale from -15 to 15, one with $s = 1$ and one with $s = 4$]
Both of these distributions are bell-shaped.

Two Ways of Measuring Obesity
States as cases
Individual people as cases
Differences?
[Figure: two histograms, vertical axis: Frequency]

95% Rule
If a distribution of data is approximately symmetric and bell-shaped, about 95% of the data should fall within two standard deviations of the mean.
For a population, about 95% of the data will be between $\mu - 2\sigma$ and $\mu + 2\sigma$.
For a sample, about 95% of the data will be between $\bar{x} - 2s$ and $\bar{x} + 2s$.
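The sample standard deviation formula on this page (square the deviations from the mean, sum them, divide by n - 1, take the square root) can be followed step by step in code. This Python sketch uses a small hypothetical dataset and the made-up function name `sample_sd`; it then compares the result with `statistics.stdev` from Python's standard library, which also uses the n - 1 divisor.

```python
import math
import statistics


def sample_sd(values):
    """Sample standard deviation: sqrt( sum (x - x_bar)^2 / (n - 1) )."""
    n = len(values)
    x_bar = sum(values) / n
    squared_deviations = [(x - x_bar) ** 2 for x in values]
    return math.sqrt(sum(squared_deviations) / (n - 1))


# Hypothetical data for illustration; the mean is 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

s = sample_sd(data)
print(f"by hand:          s = {s:.3f}")
print(f"statistics.stdev: s = {statistics.stdev(data):.3f}")
```

The squared deviations here sum to 32, so s is the square root of 32/7, a little over 2.1.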
The 95% Rule
[Figure: two histograms, vertical axis: Frequency]

95% Rule
Give an interval that will likely contain 95% of obesity rates of states.
$$(\bar{x} - 2s,\ \bar{x} + 2s) = (28.606 - 2 \cdot 3.377,\ 28.606 + 2 \cdot 3.377) = (21.852,\ 35.360)$$
47/50 = 94%

95% Rule
Could we use the same method to get an interval that will contain 95% of BMIs of American adults?
a) Yes
b) No

The 95% Rule
[Figure: two bell-shaped distributions, one with $s = 1$ (horizontal axis from -3 to 3) and one with $s = 4$ (horizontal axis from -15 to 15)]
StatKey

The 95% Rule
The standard deviation for hours of sleep per night is closest to
a) ½
b) 1
c) 2
d) 4
e) I have no idea

z-score
The z-score for a data value, $x$, is
$$z = \frac{x - \bar{x}}{s}$$
for sample data, and
$$z = \frac{x - \mu}{\sigma}$$
for population data.
The z-score measures the number of standard deviations a value is away from the mean.
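The 95% rule interval for state obesity rates worked out on this page can be reproduced with a short Python sketch. The mean 28.606 and standard deviation 3.377 come from the slides; the function names and the small list used to illustrate counting cases inside the interval are hypothetical, since the 50 state values are not listed here.

```python
def interval_95(center, sd):
    """Two standard deviations either side of the mean (the 95% rule)."""
    return (center - 2 * sd, center + 2 * sd)


def fraction_within(values, low, high):
    """Share of values that fall inside [low, high]."""
    inside = [v for v in values if low <= v <= high]
    return len(inside) / len(values)


# Mean and standard deviation of state obesity rates, from the slides.
low, high = interval_95(28.606, 3.377)
print(f"({low:.3f}, {high:.3f})")

# Hypothetical rates, only to show how a share like 47/50 would be counted.
hypothetical_rates = [20.0, 25.0, 28.0, 30.0, 36.0]
print(fraction_within(hypothetical_rates, low, high))
```

The rule is a rough guide for symmetric, bell-shaped data, so an observed share close to but not exactly 95% is expected.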
z-score
A z-score puts values on a common scale.
A z-score is the number of standard deviations a value falls from the mean.
Challenge: For symmetric, bell-shaped distributions, 95% of all z-scores fall between what two values?

z-score
Which is better, an ACT score of 28 or a combined SAT score of 2100?
ACT: mean = 21, standard deviation = 5
SAT: mean = 1500, standard deviation = 325
Assume ACT and SAT scores have approximately bell-shaped distributions.
a) ACT score of 28
b) SAT score of 2100
c) I don't know

Other Measures of Location
Maximum = largest data value
Minimum = smallest data value
Quartiles:
$Q_1$ = median of the values below $m$.
$Q_3$ = median of the values above $m$.

Five Number Summary
Five Number Summary: Min, $Q_1$, $m$, $Q_3$, Max
About 25% of the data fall between each pair of consecutive values.
Minitab: Stat -> Basic Statistics -> Display Descriptive Statistics

Five Number Summary
> summary(study_hours)
   Min. 1st Qu.  Median 3rd Qu.    Max.
   2.00   10.00   15.00   20.00   69.00
The distribution of number of hours spent studying each week is
a) Symmetric
b) Right-skewed
c) Left-skewed
d) Impossible to tell

Percentile
The $P$th percentile is the value which is greater than $P$% of the data.
We already used z-scores to determine whether an SAT score of 2100 or an ACT score of 28 is better.
We could also have used percentiles:
ACT score of 28: 91st percentile
SAT score of 2100: 97th percentile
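The ACT versus SAT comparison can be carried out by computing each z-score from the means and standard deviations given on the slide. The Python sketch below does that; the function name `z_score` is made up for this example. The final step, converting a z-score to an approximate percentile with `statistics.NormalDist`, is an added assumption (an exactly normal model), so its output need not match the slide's percentiles to the digit.

```python
from statistics import NormalDist


def z_score(x, mean, sd):
    """Number of standard deviations x lies from the mean."""
    return (x - mean) / sd


# Means and standard deviations as given on the slide.
z_act = z_score(28, mean=21, sd=5)
z_sat = z_score(2100, mean=1500, sd=325)
print(f"ACT 28:   z = {z_act:.2f}")
print(f"SAT 2100: z = {z_sat:.2f}")

better = "SAT score of 2100" if z_sat > z_act else "ACT score of 28"
print("Further above its mean, in standard deviations:", better)

# Assumption: scores follow an exactly normal (bell-shaped) model.
standard_normal = NormalDist()
print(f"ACT 28 is above about {standard_normal.cdf(z_act):.0%} of scores")
print(f"SAT 2100 is above about {standard_normal.cdf(z_sat):.0%} of scores")
```

The SAT score has the larger z-score, which agrees with the percentile comparison above.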
Five Number Summary
Five Number Summary: Min, $Q_1$, $m$, $Q_3$, Max
About 25% of the data fall between each pair of consecutive values.
As percentiles: Min = 0th, $Q_1$ = 25th, $m$ = 50th, $Q_3$ = 75th, Max = 100th

Measures of Spread
Range = Max - Min
Interquartile Range (IQR) = $Q_3 - Q_1$

Is the range resistant to outliers?
a) Yes
b) No

Is the IQR resistant to outliers?
a) Yes
b) No

Comparing Statistics
Measures of Center:
Mean (not resistant)
Median (resistant)
Measures of Spread:
Standard deviation (not resistant)
IQR (resistant)
Range (not resistant)
Most often, we use the mean and the standard deviation, because they are calculated from all the data values and so use all the available information.

Boxplot
[Boxplot labels: $Q_1$, Median, $Q_3$, Middle 50% of data, Outlier]
Lines ("whiskers") extend from each quartile to the most extreme value that is not an outlier.
*For boxplots, outliers are defined as any point more than 1.5 IQRs beyond the quartiles (although you don't have to know that).
Minitab: Graph -> Boxplot -> One Y -> Simple

Boxplot
This boxplot shows a distribution that is
a) Symmetric
b) Left-skewed
c) Right-skewed
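The five number summary, the IQR, and the boxplot outlier rule can be sketched together in Python. The quartiles here follow the slide's definition (median of the values below and above the median); statistical software may use slightly different quartile conventions. The function names and the seven-value dataset are hypothetical, while the quartiles 10 and 20 and the maximum 69 are taken from the study-hours summary shown earlier in these slides.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), with quartiles as medians of each half."""
    ordered = sorted(values)
    n = len(ordered)
    lower_half = ordered[: n // 2]          # values below the median
    upper_half = ordered[(n + 1) // 2:]     # values above the median
    return (
        ordered[0],
        statistics.median(lower_half),
        statistics.median(ordered),
        statistics.median(upper_half),
        ordered[-1],
    )


def outlier_fences(q1, q3):
    """Boxplot rule: points more than 1.5 IQRs beyond the quartiles are outliers."""
    iqr = q3 - q1
    return (q1 - 1.5 * iqr, q3 + 1.5 * iqr)


# Hypothetical data for illustration.
summary = five_number_summary([1, 2, 3, 4, 5, 6, 7])
print(summary)

# Study hours: Q1 = 10, Q3 = 20, Max = 69 (from the summary output in the slides).
low_fence, high_fence = outlier_fences(10.0, 20.0)
max_is_outlier = 69.0 > high_fence
print(low_fence, high_fence, max_is_outlier)
```

For the study-hours data the IQR is 10, so the upper fence is 35 and the maximum of 69 would be drawn as an outlier on a boxplot.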
Summary: One Quantitative Variable
Summary Statistics
Center: mean, median
Spread: standard deviation, range, IQR
5 number summary
Percentiles
Visualization
Dotplot
Histogram
Boxplot
Other concepts
Shape: symmetric, skewed, bell-shaped
Outliers, resistance
z-scores

To Do
Read Sections 2.2 and 2.3
Do Homework 2.2, 2.3 (due Friday, 9/18)