Stat 101 Exam 1 - Embers Important Formulas and Concepts 1

Size: px

Start display at page:

Download "Stat 101 Exam 1 - Embers Important Formulas and Concepts 1"

Gary Jordan
5 years ago
Views:

1 1 Chapter Definitions Stat 101 Exam 1 - Embers Important Formulas and Concepts 1 1. Data Any collection of numbers, characters, images, or other items that provide information about something. 2. Categorical/Qualitative Variables Name categories for grouping. 3. Quantitative Variables When a variable contains measured numerical values with measurement units. 4. Identifier Variable Each record has a unique value like Student ID or SSN. 2 Chapter Definitions 1. Frequency Table Records to totals and uses the category names to label each row. 2. Relative Frequency Table Displays percentages of the values in each category. 3. Bar Chart Displays the distribution of a categorical variable, showing counts for each category next to each other for easy comparison. 4. Relative Frequency Bar Chart Same as a bar chart but displays the percentage of people in each category rather than the counts. 5. Pie Charts Shows a whole group of cases as a circle. The circle is sliced into pieces whose size is proportional to the fraction of the whole in each category. 1 This version: February 7, 2015, by Jennifer Pajda-De La O. May not include all things that could possibly be tested on. To be used as an additional reference to studying all Chapters 1-7. All definitions, formulas, and selected problems come from Intro Stats by De Veaux, Velleman and Bock, 4th edition, published by Pearson.

2 6. Contingency Table Table which shows how the individuals are distributed along each variable. 7. Marginal Distribution Row total or column total in contingency tables. 8. Conditional Distribution Show distribution of one variable for just those cases that satisfy a condition on another variable. Example: Event B given Event A occurs first. 3 Chapter Definitions 1. Distribution Slices up all the possible values of the variable into equal width bins and gives the number of values (or counts) falling into each bin. 2. Histogram Uses adjacent bars to show the distribution of a quantitative variable. Each bar shows the frequency of values falling into each bin. 3. Unimodal Histogram with one peak. 4. Bi-modal Histogram with two peaks. 5. Uniform Histogram that doesn t appear to have any mode. Bars are approximately the same height for each bin. 6. Symmetric Histogram in which the two halves on either side of the center look approximately like mirror images. 7. Skew Histogram that is not symmetric. 8. Skew Left Histogram with a long tail on the left. 9. Skew Right Histogram with a long tail on the right. 10. Boxplot Displays the 5 number summary as a central box with whiskers that etend to the nonoutlying data values.

3 3.2 Formulas 1. Median = Once the data is ordered from smallest to largest, it is the middle value in the data. Divides the histogram into 2 equal pieces. 2. Mean = Average of all of the values = x 3. Range = Max Min 4. Q1 = Median of the lower half of the data 5. Q3 = Median of the upper half of the data 6. IQR = Q3 Q1 7. Upper Fence for Boxplot = Q IQR 8. Lower Fence for Boxplot = Q1 1.5IQR 9. Variance: s 2 = (y ȳ) 2 n Standard Deviation: s = s 2 (y ȳ) 2 = n 1 4 Chapter Definitions 1. z-score Tells how many standard deviations a value is from the mean. Regardless of direction, the farther a data vlaue is from the mean the more unusual it is. 2. Standard Normal Model A Normal Model N(µ, σ) with mean 0 and standard deviation Rule In a normal model, about 68% of values fall within 1 standard deviation of the mean, about 95% fall within 2 standard deviations of the mean; about 99.7% of values fall within 3 standard deviations of the mean. 4. Normal Percentile The normal percentile corresponding to a z-score gives the percentage of values in a standard normal distribution found at that z-score or below. Compared to area under the curve. See Table Z in Appendix D of the textbook.

4 4.2 Formulas 1. z-score: z = y ȳ s = y µ σ 4.3 Properties about the area under a Normal Curve 1. The total area is 100% 2. The mean is the center of a normal curve 3. A Standard Normal Curve has mean = 0 and standard deviation = 1 4. When a normal curve is split in half from the mean, each side contains 50% of the area 5. The normal curve is symmetric 6. If a normal curve is split with 30% of the area on one side, the other side of the curve is 70% of the area 7. If a normal curve has 60% of the area in the middle, the remaining portions are a total of 40%. This 40% is allocated half to each side. So the far left has 20% of the area, the middle is 60% of the area, and the far right side has 20% of the area. Textbook Table Z Appendix D Note: These tables give the percentage to the left of the z value. 5 Chapter Definitions 1. Scatterplot Shows the relationship between 2 quantitative variables. 2. Direction of Scatterplot Positive direction means as one variable increases so does the other. Decreasing direction means the association is negative. 3. Form of Scatterplot Is it in a straight line or some other form? 4. Strength of Scatterplot Strong association if there is little scatter around the underlying relationship. 5. Outlier A point that does not fit the overall pattern seen in the scatterplot.

5 5.2 Formulas 1. Z-Scores for a Scatterplot z x = x x s x z y = y ȳ s y 2. Correlation Coefficient r = = zxzy n 1 [(x x)(y ȳ)] (n 1)s xs y 5.3 Properties of Correlation Coefficient 1. Always between -1 and 1 2. Does not matter which variable you consider as x and y. 3. Treats x and y symmetrically 4. No units 5. Not affected by changes in scale or if the variables are standardized (changing to z- score) 6. Depends only on z-scores 7. Does NOT prove causation. Only provides a relationship between 2 variables. 6 Chapter Definitions 1. Linear Model Equation of the form ŷ = b 0 + b 1 X. ŷ means estimated values for y. 2. Predicted Values Value of ŷ found for a given x-value in the data. 3. Fitted Linear Model ŷ = b 0 + b 1 X 4. Residual Differences between data values and the corresponding values predicted by the model (observed - expected) 5. R 2 Gives the fraction of variability of y accounted for by the least squares linear regression on x. It is an overall measure of how successful the regression is in linearly relating y to x.

6 6. Least Squares Criterion Specifies the unique line that minimizes the variance of the residuals or the sum of squared residuals. 6.2 Formulas 1. Residual = Observed value - Predicted value = y ŷ 2. b 1 = r (s y /s x ) 3. b 0 = ȳ b 1 x 7 Extra Information Review any and all notes and supplementary materials. It may be the case that something was accidentally omitted from this study guide. Also, review any problems that may have been discussed in class as not all example problems may have been provided here.

7 8 Example Problems 1. Use the below table to answer the following questions. Gender Eye Color Blue Green Brown Total Male Female Total (a) Construct a frequency table for Eye Color based on the data above. (b) Find the marginal distribution of gender. (c) What percentage of females have blue eyes? (d) What percentage of green eyed people are male? (e) What percentage of people are females and have green eyes? (f) What percentage of blue eyed people are female? (g) What percentage of males have brown eyes? (h) What percentage of people have brown eyes? (i) What percentage of people are males and have blue eyes? Q40 pg 40 We are investigating whether people taking antidepressants (SSRIs) might be at greater risk of bone fractures. We are given the below contingency table. SSRI no SSRI Total Fractures No Fractures Total (a) What percent of people taking SSRIs have fractures? (b) What percent of people not taking SSRIs have fractures? (c) Is the risk of bone fractures the same among people who were taking SSRIs versus those who were not? 2. In a histogram that is skewed to the right, which is larger, the mean or the median? 3. In a histogram that is skewed or has outliers, which should be reported, the mean or the median? 4. In a histogram that is skewed or has outliers, which should be reported, the IQR or the Standard Deviation? 5. In a histogram that is symmetric with no outliers, which pair of things should be reported, mean with the standard deviation, mean with the IQR, median with the standard deviation, or the median with the IQR?

8 Q23 pg 75 Here are costs of nine compact refrigerators rated very good or excellent by Consumer Reports on their website. 150, 150, 160, 180, 150, 140, 120, 130, 120 Find (a) Mean (b) Median and Quartiles (c) Range and IQR (d) What are the values of the upper fence and the lower fence? (e) Are there any outliers in this data? Why? (f) Find the variance and standard deviation. Q22 pg 103 Shown below are the histogram and summary statistics for the number of camp sites at public parks in Vermont. (a) Which statistics would you use to identify the center and spread of this distribution? Why? (b) How many parks would you classify as outliers? Explain. (c) Create a boxplot for this data.

9 6. Given mean 16 and standard deviation 3. (a) Standardize y = 9 (b) Standardize y = 21 (c) Which of the two above is most unusual? 7. Use the model N(8, 2). (a) What is the mean? (b) What is the standard deviation? (c) What is the variance? (d) Standardize y = Use a normal model with a mean of 50 and standard deviation of 5. (a) Using the model described above, draw the model showing what the Rule predicts. (b) In what interval would you expect the central 99.7% of values to be found? (c) What percent of values are above 50? (d) What percent of values are between 40 and 60? (e) What percent of values are between 40 and 50? (f) What percent of values are between 50 and 60? (g) What percent of values are between 45 and 50? (h) What percent of values are between 50 and 65? (i) What percent of values are above 60? (j) What percent of values are below 45? (k) What percent of values are between 40 and 65? (l) What percent of values are between 45 and 60? 9. Based on the Normal Model N(50, 5), answer the following questions. (a) What percent of values are above 50? (b) What percent of values are above 62? (c) What percent of values are below 39? (d) What percent of values are above 43? (e) What percent of values are below 58? (f) What percent of values are between 37 and 52? (g) What percent of values are betwen and 66?

10 10. Based on the Normal Model N(50, 5), answer the following questions. (a) What cutoff value bounds the highest 5% of values? (b) What cutoff value bounds the lowest 25% of values? (c) What cutoff value bounds the middle 70% of values? 11. Use the given data to answer the following questions. Energy Used (KWH) Price ($) mean = mean = std dev = 6.22 std dev = r = (a) Draw a scatterplot of the above data to study the association between the energy used and the price it cost. (b) What is the direction of the association? (c) What is the form of the relationship? (d) What is the strength of the relationship? (e) Are there any outliers? (f) What is the correlation coefficient? Now consider the regression line for the above data. (a) What is the slope of the regression line? (b) What is the intercept of the regression line? (c) What percent of variation is explained by this model? (d) What does the slope mean in this context? (e) What does the intercept mean in this context? (f) Write down the overall model. (g) What would you predict for the price in the case where there are 18 KWH of energy used? Is this prediction reasonable? (h) The energy company actually charges you $27.50 for 18 KWH of energy used. Is this good? How much would you save or lose compared to what you expected to pay?

11 (i) What would you predict for the price in the case where there are 35 KWH of energy used? Is this prediction reasonable? (j) If the energy used is 3 standard deviations above the mean energy used, what is the predicted standardized value for price? 12. For the below residual plots, decide if a linear model is appropriate. In the case that the linear model is not appropriate, decide which condition is violated (linearity, outlier, or equal spread). Figure 1: Residual Plots (a) Plot 1 (b) Plot 2 (c) Plot 3 (d) Plot 4 (e) Plot 6 (f) Plot 7

12 9 Example Solutions 1. Use the table to answer the questions. (a) Frequency Table: (b) Male: (c) 6 18 = 33.3%. (d) 7 9 = 77.8%. (e) 2 45 = 4.4%. (f) 6 11 = 54.5%. (g) = 55.6%. (h) = 55.6%. (i) 5 45 = 11.1%. Q40 pg 40 (a) 14/137 = 10.2% (b) 244/4871 = 5.01% = 0.6 = 60%. Female: Eye Color Count Blue 11 Green 9 Brown 25 = 0.4 = 40%. (c) Yes, the risk of bone fractures is about twice as high among the people who were taking SSRIs than among those who were not. 2. The mean is larger. 3. The median. 4. IQR. 5. Mean with the standard deviation. Q23 pg 75 (a) Mean = ȳ = 1300/9 = (b) First, order the numbers from lowest to highest. Median = middle value = 150 Q3 = 150 Q1 = 130 (c) IQR = = 20 Range = Max - Min = = 60 (d) Upper Fence = Q IQR = (20) = 180, Lower Fence = Q1 1.5IQR = (20) = 100.

13 (e) There are not outliers because Max Upper Fence and Min Lower Fence. (f) Variance = (y ȳ) 2 n 1 = ( )2 + ( ) 2 + ( ) 2 + ( ) 2 + ( ) 2 + ( = = , Std Dev = V ariance = = Q22 pg 103 (a) Median and IQR because the histogram is skewed right. (b) IQR = = 50 Upper Fence = Q IQR = (50) = = 153 Lower Fence = Q1-1.5IQR = (50) = = 47 The number of parks cannot be negative so the lower fence should be placed at 0. Since max > upper fence, there are outliers. There are probably 4-5 based on the histogram. (c) Note that the histogram below is rounded so if you use exact numbers, it may look slightly different.

14 6. Given mean 16 and standard deviation 3. (a) z = = 7 3 = (b) z = = 5 3 = (c) y = 9 is more unusual. 7. Use the model N(8, 2). (a) Mean = 8. (b) Standard deviation = 2. (c) Variance = 2 2 = 4. (d) z = = 2 2 = 1.

15 8. Use a normal model with a mean of 50 and standard deviation of 5. (a) For the Rule, we will have the following points on the graph (not shown here). µ 3σ = 50 3(5) = 35 µ 2σ = 50 2(5) = 40 µ σ = 50 5 = 45 µ = 50 µ + σ = = 55 µ + 2σ = (5) = 60 µ + 3σ = (5) = 65 (b) Between 35 and 65. (c) 50%. (d) 95%. (e) 47.5%. (f) 47.5%. (g) 34%. (h) 49.85%. (i) 2.5%. (j) 16%. (k) Between 40 and 50 is 47.5%. Between 50 and 65 is 49.85%. Between 40 and 65 is 97.35%. (l) Between 45 and 50 is 34%. Between 50 and 60 is 47.5%. Between 45 and 60 is 81.5%. 9. Based on the Normal Model N(50, 5), answer the following questions. Draw pictures to help you see what is going on. (a) 50%. (b) Step 1: Standardize. z = = 12 = 2.4. Step 2: Calculate value from a 5 5 calculator or a table. From calculator: normalcdf(2.4, 999) = Solution =0.82%. From table: We want area(z > 2.4). The table gives area(z < 2.4) = Our answer is = , which is 0.82%. (c) Step 1: Standardize. z = = 11 = 2.2. Step 2: Calculate value from a 5 5 calculator or a table. From calculator: normalcdf( 999, 2.2) = Solution = 1.39%. From table: We want area(z < 2.2). The table gives us this directly and the value is Solution = 1.39%. (d) Step 1: Standardize. z = = 7 = 1.4. Step 2: Calculate value from a 5 5 calculator or a table. From calculator: normalcdf( 1.4, 999) = Solution = 91.92$. From table: We want area(z > 1.4). The table gives area(z < 1.4) = Our answer is = , which is 91.92%.

16 (e) Step 1: Standardize. z = = 8 = 1.6. Step 2: Calculate value from a 5 5 calculator or a table. From calculator: normalcdf( 999, 1.6) = Solution = 94.52$. From table: We want area(z < 1.6). The table gives us this directly and the value is Solution = 94.52%. (f) Step 1: Standardize both values. z 1 = = 2.6. z 5 2 = = 0.4. Step 2: 5 Calculate value from a calculator or a table. From calculator: normalcdf( 2.6, 0.4) = Solution = 65.08%. From table: We want area( 2.6 < z < 0.4). The table gives area(z < 2.6) = and area(z < 0.4) = Our answer is = , which is 65.07%. The difference between the calculator answer and the table answer is because of rounding. Show your work! (g) Step 1: Standardize both values. z 1 = = z 5 2 = = 3.2. Step 2: 5 Calculate value from a calculator or a table. From calculator: normalcdf(1.45, 3.2) = Solution = 7.28%. From table: We want area(1.45 < z < 3.2). The table gives area(z < 1.45) = and area(z < 3.2) = Our answer is = , which is 7.28%. 10. Based on the Normal Model N(50, 5), answer the following questions. Draw pictures to help you see what is going on. (a) Highest 5% of values corresponds to a z-value of Find this using invnorm(0.95) on your calculator, or looking for the value on a table. Use the z-score formula to solve for the value you are looking for. z = = x = x 50 5 x = (b) Lower 25% of values corresponds to a z-value of Find this using invnorm(0.25) on your calculator, or looking for the value on a table. Use the z-score formula to solve for the value you are looking for. z = 0.67 = x = x 50 5 x = (c) We need 2 values for z in this case. Let the lower value of z be z L and the upper value of z be z R. Find these values by typing invnorm(0.15) and invnorm(0.85) on your calculator. We have z L = z R = 1.04 respectively. Solve for the two values of x. z L = 1.04 = x = x 50 x L = 44.8, z R = 1.04 = x = x 50 x R = Use the data to answer the following. (a) The scatterplot would be as below.

17 Price ($) Price ($) vs Energy Used (KWH) Energy Used (KWH) (b) Direction is positive. (c) The form of the relationship is linear. (d) The strength of the relationship is strong. (e) There are no outliers. (f) The correlation coefficient is Now consider the regression line for the above data. (a) The slope of the regression line is b 1 = r sy s x = ( ) = (b) The intercept of the regression line is b 0 = ȳ b 1 x = (9.125) = (c) The percent of variation explained by this model is R 2 = (0.9687) 2 = = 93.84%. (d) The slope means that as the energy used increases by 1 KWH, the price increases by $ (e) The intercept means that the base price for services, with no energy used, costs $ (f) The overall model is: ŷ = x. (g) You would predict the price to be (18) = $ This prediction is reasonable because we are not extrapolating from the data.

18 (h) The energy company actually charges you $27.50 for 18 KWH of energy used. This is good because we would expect to pay more than this for the energy used. You would save $2.63. Note that when you take = 2.63 you are calculating the residual. (i) You would predict the price to be (35) = $ This prediction is not reasonable and cannot be trusted because we are extrapolating. (j) The predicted standardized value for price is r(sd) = (3) = Classify each of the residual plots. (a) Linear model is not appropriate. The linearity condition is violated. (b) Linear model is appropriate. (c) Linear model is not appropriate. The outlier condition is violated. (d) Linear model is not appropriate. The equal spread condition is violated. (e) Linear model is not appropriate. The equal spread condition is violated. (f) Linear model is not appropriate. The linearity condition is violated.

Math 2311 Bekki George Office Hours: MW 11am to 12:45pm in 639 PGH Online Thursdays 4-5:30pm And by appointment

Math 2311 Bekki George Office Hours: MW 11am to 12:45pm in 639 PGH Online Thursdays 4-5:30pm And by appointment Math 2311 Bekki George bekki@math.uh.edu Office Hours: MW 11am to 12:45pm in 639 PGH Online Thursdays 4-5:30pm And by appointment Class webpage: http://www.math.uh.edu/~bekki/math2311.html Math 2311 Class