Math146 - Chapter 3 Handouts. The Greek Alphabet. Source: www.mathwords.com. Page 1 of 39

Some Miscellaneous Tips on Calculations

Examples: Round to the nearest thousandth
  0.92431   ______
  0.75693   ______

CAUTION! Do not truncate numbers!
Example: 1/6 = 0.166666...
A common mistake is to truncate this decimal, and write it as: ______
Round it off correctly (say, to three decimal places) as: ______

However, in-between zeros DO count as significant digits.
Examples: Round to three significant figures
  0.20361      ______
  0.00059254   ______

A FINAL CAUTION! Be very careful not to over-round intermediate calculations if you are going to use those numbers in a subsequent calculation. A better method is to store those values on your calculator (using the memory registers), or to do the calculation in a single command, which will probably involve a lot of parentheses ( ).
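The difference between truncating and rounding can be checked with a short sketch. Python is an assumption here (the handout itself works on a calculator), and the helper names `truncate` and `round_sig` are made up for illustration.

```python
import math

def truncate(value, places):
    """Chop off the digits after `places` decimals without rounding (the mistake)."""
    factor = 10 ** places
    return math.trunc(value * factor) / factor

def round_sig(value, sig):
    """Round to `sig` significant figures via general-format string formatting."""
    return float(f"{value:.{sig}g}")

one_sixth = 1 / 6
print("truncated:", truncate(one_sixth, 3))   # drops the trailing digits
print("rounded:  ", round(one_sixth, 3))      # rounds the last kept digit up

# Nearest thousandth
print(round(0.92431, 3), round(0.75693, 3))

# Three significant figures (leading zeros are not counted)
print(round_sig(0.20361, 3), round_sig(0.00059254, 3))
```

Keeping full precision in a variable such as `one_sixth` and rounding only when printing plays the same role as storing a value in the calculator's memory register.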

Section 3.1 Measures of Central Tendency

Descriptive measures: used to describe data sets.
Measure of center: a value at the ______ of a data set.

Three different measures of center:
1. ______
2. ______
3. ______

1. Mean
Probably the most commonly used measure of center. Same as ______.
Add the values and divide by the total number of values.
Write in symbols as: ______
where
  xi = the data values
  n = number of values in the sample
  Σ is the uppercase Greek letter sigma, the summation symbol; it means "add all this stuff up."

A parameter is a descriptive measure for a ______.
A statistic is a descriptive measure for a ______.

                         Sample statistics    Population parameters
Number of Data Values    ______               ______
Symbol for mean          ______               ______
Formula for mean         ______               ______

Round-Off Rule: Round off your final answer to ______ than is present in the original set of data.

Example: Contents of a sample of cans of regular Coke have the following weights in lbs:
  0.8192   0.8150   0.8163   0.8211   0.8181   0.8247

Advantage of using mean as the measure of center for a data set:
  Takes every ______ into account
  Much statistical inference that will be performed is based on the mean
Disadvantage:
  Can be dramatically affected by a few ______.

Example: What People Earn (see next page)
Is the mean a good measure of center for this data set?
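As a rough check on the Coke example, the "add the values and divide by the number of values" recipe can be sketched in Python. This assumes the round-off rule means one more decimal place than the data (the data have four decimals, so the answer keeps five); the variable names are illustrative only.

```python
from statistics import mean

weights_lb = [0.8192, 0.8150, 0.8163, 0.8211, 0.8181, 0.8247]

# Add the values and divide by n (statistics.mean does exactly this)
sample_mean = mean(weights_lb)

# Assumed round-off rule: one more decimal place than the original data
rounded_mean = round(sample_mean, 5)
print(rounded_mean)
```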

What People Earn*

Occupation                 Annual Salary     No.   Sorted Data
Admin. Clerk               $38,000            1    $20,000
Real estate agent          $103,100           2    $20,000
Professional golfer        $5,500,000         3    $23,500
Dogwalker                  $20,000            4    $25,000
High school counselor      $58,900            5    $26,000
Mechanical engineer        $47,900            6    $38,000
Mechanical engineer        $46,000            7    $39,500
Health-care director       $68,000            8    $40,000
Bridal salon owner         $25,000            9    $42,000
Private investigator       $210,000          10    $46,000
Part-time acupuncturist    $40,000           11    $46,000
Surgery resident           $39,500           12    $47,900
Housekeeping aide          $23,500           13    $54,600
Migrant family liaison     $26,000           14    $58,900
Court clerk                $54,600           15    $68,000
Sales manager              $180,000          16    $103,100
Deputy sheriff             $46,000           17    $180,000
Fishing guide              $42,000           18    $210,000
Radio host                 $32,000,000       19    $5,500,000
Lay pastor                 $20,000           20    $32,000,000

Σ xi = ______
Mean = ______
Median = ______

* Source: Parade (Tri-City Herald), March 2004
Note: This data is NOT randomly selected; it is just some data I happened to pick from one particular page.
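To see how strongly a few extreme salaries pull the mean, here is a small Python sketch using the sorted data above. Dropping the two largest salaries is only an illustration added here; it is not a step the handout asks for.

```python
salaries = [20000, 20000, 23500, 25000, 26000, 38000, 39500, 40000, 42000,
            46000, 46000, 47900, 54600, 58900, 68000, 103100, 180000,
            210000, 5500000, 32000000]

mean_all = sum(salaries) / len(salaries)

# How many of the 20 people earn less than the mean?
below_mean = sum(1 for s in salaries if s < mean_all)

# Illustration only: recompute the mean without the two largest salaries
trimmed = sorted(salaries)[:-2]
mean_trimmed = sum(trimmed) / len(trimmed)

print(f"mean of all 20 values:      ${mean_all:,.0f}")
print(f"values below the mean:      {below_mean} of {len(salaries)}")
print(f"mean without the top two:   ${mean_trimmed:,.0f}")
```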

2. Median
Can help overcome the disadvantage of the mean being dramatically affected by ______.
Median is physically the ______ when the original data values are sorted in increasing order.
Symbol is ______ for a median.

Procedure to find median:
  Sort data in order.
  If odd number of values, median = ______
  If even number of values, median = ______

Example: What People Earn (see previous page)
The median is ______.

Resistant measure: not sensitive to the influence of a few ______.
Because of this, the median is frequently used for particular types of data, instead of the mean.
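The procedure above can be sketched in Python as follows. The function name `median_by_hand` and the three-value list are hypothetical; the sketch assumes the usual convention that an even-sized data set uses the mean of its two middle values.

```python
def median_by_hand(values):
    data = sorted(values)                      # step 1: sort the data
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        return data[mid]                       # odd n: the single middle value
    return (data[mid - 1] + data[mid]) / 2     # even n: mean of the two middle values

salaries = [20000, 20000, 23500, 25000, 26000, 38000, 39500, 40000, 42000,
            46000, 46000, 47900, 54600, 58900, 68000, 103100, 180000,
            210000, 5500000, 32000000]

print(median_by_hand(salaries))    # What People Earn, n = 20 (even)
print(median_by_hand([3, 1, 9]))   # hypothetical odd-sized example
```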

Example: Tri-City Housing Market
Source: Tri-City Herald, January 14, 2018
Comment: median price in January 2017 was $225,000.

Example: US Household Income, 2017
Source: Tri-City Herald, 9/13/2018

Why do we use median instead of mean for data like housing prices and income?
Consider the population of:
  All home selling prices in the Tri-Cities
  All household incomes in the U.S.
The median is useful to eliminate the effect of the ______.

The ______ for data that is strongly ______ is commonly quoted instead of the ______, such as: ______

Example: TV commercial for Abreva
"Studies show 4.1 days median healing time."

3. ______
Is the value that occurs ______
Used mainly for ______, not numerical data
Can have more than one ______, if more than one value occurs with the greatest frequency.
If no value is repeated, there is ______.

Example: Class survey, people who have a tattoo or not.
  ______ people said yes
  ______ people said no
  The mode is ______.

Example: What People Earn (see previously)
  Mode: ______
  Data set is: ______
  Is the mode useful here?
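A quick way to find the most frequent value or values is sketched below in Python, using `statistics.multimode` (Python 3.8 or later). The yes/no list is hypothetical, since the class survey counts are not recorded in the handout.

```python
from statistics import multimode

salaries = [20000, 20000, 23500, 25000, 26000, 38000, 39500, 40000, 42000,
            46000, 46000, 47900, 54600, 58900, 68000, 103100, 180000,
            210000, 5500000, 32000000]

# Numerical data: every value tied for the greatest frequency is returned
salary_modes = multimode(salaries)
print(salary_modes)

# Qualitative data (hypothetical survey answers, not the real class results)
responses = ["yes", "no", "no", "yes", "no"]
print(multimode(responses))
```

Note that `multimode` returns every value when nothing repeats, so a result as long as the data set itself corresponds to "no mode" in the handout's terms.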

How to determine the most appropriate measure of center:

If there are no ______ affecting the mean, and if the distribution is fairly ______, then the mean is the most appropriate choice for measure of center, because it takes ______ of the data values into consideration.

If there ______ extreme values affecting the mean, and if the distribution is fairly ______ in either direction, then the median might be a better choice, because it is ______ to the extreme values, and provides the best typical value of the data set.

If the data consists of qualitative data, then the ______ is the only appropriate measure of center. The mode is not commonly used with ______ data. For one reason, if you have continuous data, it is very possible that your data set will not have a mode, because there are no repeated values. The only time you would use it ______ numerical set of data is if you specifically want to know what the most frequent data value is.

1. From the class survey, here are the responses for how many of the United States you have visited (combined with Math 146 Spring 2018 responses):

No. of states
  1    5    8
  1    5    8
  1    6    8
  2    6   10
  3    6   10
  3    6   10
  3    6   12
  3    6   13
  3    6   18
  3    6   18
  4    6   20
  4    6   20
  4    6   22
  4    7   28
  4    7   32
  4    7   41
  4    7   48
  5    7
  5    7
  5    7

Notice that there are a total of n = 57 data values.

  1. Make a dotplot of the data (below), and use the plot to describe the distribution of the data.
     [Dotplot axis: Number of states]
  2. Find the following values:
     Mean = ______   Median = ______   Mode = ______
  3. Which measure of center would you quote for this dataset, and why?

2. Answer and explain (briefly) your answers to the following. Note: one way to explain is by coming up with a small example data set.
  a. Is it possible that the mean might not equal any of the values in a data set?
  b. Is it possible that the median might not equal any of the values in a data set?
  c. Is it possible that the mean might be smaller than all of the values in a data set?
  d. Is it possible that the median might be smaller than all of the values in a data set?
  e. Is it possible that the mean might be larger than only one value in a data set?
  f. Is it possible that the median might be larger than only one value in a data set?

3. Think about all of the human beings alive at this moment.
  a. Which value do you think is greater: the mean age or the median age of all human beings alive at this moment? Explain your answer.
  b. Which value do you think is greater: the mean age of all human beings alive at this moment, or the mean age of all Americans alive at this moment? Explain.
  c. Estimate the median age of all human beings alive at this moment. Estimate the median age of all Americans alive at this moment.

Section 3.2 Measures of Dispersion

Dispersion is the degree to which the data are ______.

Example: Class Grades, 10 sample test grades from two sections
  Section 1: 26, 43, 57, 64, 65, 79, 82, 88, 92, 104
  Section 2: 50, 57, 66, 68, 70, 70, 72, 75, 82, 90

Different Measures of Dispersion/Variation:
1. ______
2. ______
3. ______
4. ______

1. ______
Difference between ______ value and ______ value
To calculate: range = ______
Benefit: Easy to compute
Drawback: ______ as other measures of variation because it depends only on highest and lowest values.
Example:
  Section 1: Range = ______
  Section 2: Range = ______

2. ______
Preferred measure of variation when the ______ is used as the measure of center.
Measures the variation of all sample values
Larger values of standard deviation indicate ______
Only equals 0 if all data values are the ______
Units are the same as the units of the original data
A drawback is that it is not ______, because its value can be strongly affected by a few extreme data values.
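The range for the two class sections can be checked with a short Python sketch. It assumes the usual definition, highest value minus lowest value; `data_range` is an illustrative name.

```python
section_1 = [26, 43, 57, 64, 65, 79, 82, 88, 92, 104]
section_2 = [50, 57, 66, 68, 70, 70, 72, 75, 82, 90]

def data_range(values):
    """Range = highest value minus lowest value; ignores everything in between."""
    return max(values) - min(values)

print("Section 1 range:", data_range(section_1))
print("Section 2 range:", data_range(section_2))
```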

Formula: ______
where
  s = standard deviation of a ______
  xi = ______
  x̄ = mean of values
  n = number of data values in the ______

Example: Section 1 Grades

  xi (data values)    xi - x̄    (xi - x̄)^2
        26            ______     ______
        43            ______     ______
        57             -13        169
        64              -6         36
        65              -5         25
        79               9         81
        82              12        144
        88              18        324
        92              22        484
       104              34       1156

n = 10 data values
Sum of squared deviations = Σ(xi - x̄)^2 = ______
Sample variance: ______
Sample standard deviation: ______

Same round-off rule: One ______ decimal place than the original data for the final answer.
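The table above can be reproduced step by step in Python. This is a sketch that assumes the standard sample formula, dividing the sum of squared deviations by n - 1 and then taking the square root; the variable names are illustrative.

```python
import math

grades = [26, 43, 57, 64, 65, 79, 82, 88, 92, 104]   # Section 1
n = len(grades)
x_bar = sum(grades) / n

deviations = [x - x_bar for x in grades]      # xi - x̄
squared = [d ** 2 for d in deviations]        # (xi - x̄)^2
sum_sq = sum(squared)                         # sum of squared deviations

variance = sum_sq / (n - 1)                   # sample variance (assumes n - 1)
std_dev = math.sqrt(variance)                 # sample standard deviation

for x, d, sq in zip(grades, deviations, squared):
    print(f"{x:>5} {d:>6.0f} {sq:>7.0f}")
print("sum of squares:", sum_sq)
print("variance:", round(variance, 1), " std dev:", round(std_dev, 1))
```

The rows for 57 through 104 should match the values already printed in the handout's table, which is a useful sanity check on the first two rows.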

In STATDISK: Data/Explore Data
Descriptive Statistics
  Section 1: ______
  Section 2: ______

We could tell just by looking at the graphs that Section 1 was more spread out. Now we have an actual measure of that variation. The standard deviation for Section 2 is much smaller because the values are not spread as far apart; in general, the values are closer to the ______.

                         Sample Statistics    Population Parameters
Number of Data Values    n                    N
Mean                     x̄ = Σx / n           µ = Σx / N
Standard deviation       ______               ______
Variance                 ______               ______

KEY!!! Standard deviation and variance are closely related: if you have one, you can calculate the other!
  Standard deviation = ______
  Variance = ______
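The link between the two measures can be checked numerically. The sketch below uses Python's `statistics` module on the Section 2 grades and assumes the standard relationship (standard deviation is the square root of the variance) for both the sample and population versions.

```python
import math
from statistics import variance, stdev, pvariance, pstdev

section_2 = [50, 57, 66, 68, 70, 70, 72, 75, 82, 90]

s_squared = variance(section_2)     # sample variance (divides by n - 1)
s = stdev(section_2)                # sample standard deviation
sigma_squared = pvariance(section_2)  # population variance (divides by N)
sigma = pstdev(section_2)             # population standard deviation

print(round(s, 1), round(s_squared, 1))
print(math.isclose(s, math.sqrt(s_squared)))        # std dev from variance
print(math.isclose(sigma ** 2, sigma_squared))      # variance from std dev
```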


Empirical Rule

Notice that this rule applies to data having an approximately ______ distribution. It can be used to determine the percentage of data that will lie within a certain number of standard deviations of the mean.

1. About ______ of all data values fall within 1 standard deviation of the mean.
2. About ______ of all data values fall within 2 standard deviations of the mean.
3. About ______ of all data values fall within 3 standard deviations of the mean.

Note: Can also be used assuming population parameters µ and σ.

Example: Using the Empirical Rule
Men's Pulse Data (Data Set 1, 12th ed.)

Men's pulse data from STATDISK:
  x̄ = ______
  s = ______

  x̄ - s = ______      x̄ + s = ______
  From the Empirical Rule, would expect about 68% of the data values to fall within the range of ______ to ______.
  No. of values in this range = ______

  x̄ - 2s = ______     x̄ + 2s = ______
  From the Empirical Rule, would expect about 95% of the data values to fall within the range of ______ to ______.
  No. of values in this range = ______

Pulse (bpm), sorted:
  No.  bpm     No.  bpm     No.  bpm     No.  bpm
   1   46      11   60      21   68      31   74
   2   50      12   60      22   68      32   74
   3   52      13   62      23   68      33   76
   4   54      14   62      24   68      34   78
   5   56      15   64      25   68      35   80
   6   56      16   64      26   70      36   80
   7   58      17   64      27   70      37   84
   8   58      18   66      28   70      38   86
   9   60      19   66      29   72      39   88
  10   60      20   66      30   74      40   90
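The counting in this example can be sketched in Python. The code computes the sample mean and standard deviation from the 40 pulse values and counts how many fall within one and two standard deviations of the mean; `count_within` is an illustrative helper, and the results may differ slightly from STATDISK output because of rounding.

```python
from statistics import mean, stdev

pulses = [46, 50, 52, 54, 56, 56, 58, 58, 60, 60, 60, 60, 62, 62, 64, 64, 64,
          66, 66, 66, 68, 68, 68, 68, 68, 70, 70, 70, 72, 74, 74, 74, 76, 78,
          80, 80, 84, 86, 88, 90]

x_bar = mean(pulses)
s = stdev(pulses)          # sample standard deviation

def count_within(values, center, spread, k):
    """Count values lying within k standard deviations of the center."""
    low, high = center - k * spread, center + k * spread
    return sum(1 for v in values if low <= v <= high)

within_1 = count_within(pulses, x_bar, s, 1)
within_2 = count_within(pulses, x_bar, s, 2)

for k, inside in ((1, within_1), (2, within_2)):
    share = 100 * inside / len(pulses)
    print(f"within {k} s: {inside} of {len(pulses)} values ({share:.1f}%)")
```

The observed percentages can then be compared with the 68% and 95% that the Empirical Rule predicts for a bell-shaped distribution.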

The following table reports the daily high temperatures (°F) in February 2006 for three locations.

  Feb Date   Lincoln, Neb   San Luis Obispo, CA   Sedona, AZ
      1           53                68                62
      2           59                69                64
      3           40                77                62
      4           36                68                66
      5           36                76                61
      6           44                71                61
      7           46                79                68
      8           34                85                68
      9           41                87                63
     10           39                67                66
     11           27                75                57
     12           30                81                62
     13           61                83                66
     14           68                64                63
     15           41                57                55
     16           26                57                51
     17           11                53                48
     18           14                54                48
     19           28                52                47
     20           47                58                49
     21           51                58                50
     22           53                67                55
     23           48                70                61
     24           65                67                62
     25           36                65                64
     26           53                64                67
     27           71                63                66
     28           73                62                57

1. Use the dot plots of the data below to rank the three locations in order of smallest to largest standard deviation:
   Smallest std. dev: ______
   Middle std. dev: ______
   Largest std. dev: ______
2. On the next page, calculate (by hand) the standard deviation for the temperature data from San Luis Obispo.
   Std. dev. = ______
3. Using the standard deviation for the San Luis Obispo data, estimate the standard deviation for the other two locations:
   Lincoln: ______
   Sedona: ______

Calculate the standard deviation for the San Luis Obispo data set, using the following table:

  Day   High Temp. (°F)   xi - x̄     (xi - x̄)^2
   1         68           ______      ______
   2         69           ______      ______
   3         77           ______      ______
   4         68           ______      ______
   5         76           ______      ______
   6         71             3.25       10.5625
   7         79            11.25      126.5625
   8         85            17.25      297.5625
   9         87            19.25      370.5625
  10         67            -0.75        0.5625
  11         75             7.25       52.5625
  12         81            13.25      175.5625
  13         83            15.25      232.5625
  14         64            -3.75       14.0625
  15         57           -10.75      115.5625
  16         57           -10.75      115.5625
  17         53           -14.75      217.5625
  18         54           -13.75      189.0625
  19         52           -15.75      248.0625
  20         58            -9.75       95.0625
  21         58            -9.75       95.0625
  22         67            -0.75        0.5625
  23         70             2.25        5.0625
  24         67            -0.75        0.5625
  25         65            -2.75        7.5625
  26         64            -3.75       14.0625
  27         63            -4.75       22.5625
  28         62            -5.75       33.0625

Sum = ______
Variance = ______
Standard Deviation = ______

Section 3.4 Measures of Position and Outliers

In this section, we will introduce a number of measures of position, which describe the ______ of a certain data value within the entire set of data.

z-scores
z-scores are standardized values:
  The z score equals the number of standard deviations that a given data value is above or below the ______.
  The z score is ______ if the value is greater than the mean.
  The z score is ______ if the value is less than the mean.
  Can use the z-score to identify ______.

To calculate:
  For a sample: z = ______
  For a population: z = ______
  where x = the particular data value.
Round z scores off to ______ decimal places.

The z-scores allow us to compare values from different data sets by providing a standard basis of comparison.

Example: Comparing Standardized Test Scores

Suppose a college admissions office needs to compare scores of students who take the Scholastic Aptitude Test (SAT) with those who take the American College Test (ACT). Among the college's applicants who take the SAT, scores have a mean of 1500 and a standard deviation of 240. Among the college's applicants who take the ACT, scores have a mean of 21 and a standard deviation of 6.

Mike scored 1740 on the SAT, and Packard scored 30 on the ACT. Who did relatively better on their test?

Standardize the comparison by calculating the z-scores:
  Mike: z = ______
  Packard: z = ______
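The comparison can be sketched in Python, assuming the usual definition of a z-score: the data value minus the mean, divided by the standard deviation. The function name `z_score` is illustrative.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev

mike = z_score(1740, mean=1500, std_dev=240)   # SAT score
packard = z_score(30, mean=21, std_dev=6)      # ACT score

print("Mike:   ", mike)
print("Packard:", packard)
print("Relatively better:", "Packard" if packard > mike else "Mike")
```

Because both scores are now measured in standard deviations from their own test's mean, they can be compared directly even though the SAT and ACT use very different scales.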

Identifying Outliers Using z-scores

Note that this method applies only to distributions that are fairly ______, because it is based on the Empirical Rule.

Within a particular data set, the z-score is useful for giving us some idea of the relative standing of a particular data value:
  If a data value has a z-score fairly near 0, it is ______ to the mean, a very ______ data value.
  If a data value has a z-score of less than -2, or greater than +2, that means it is very ______ from the mean, and very far away from most of the data values. That would be a less typical data value.

[Figure: number line of z scores from -3 to 3, with the outer regions labeled "unusual" and the middle region labeled "ordinary"]

Ordinary or usual values: ______
Unusual values: ______
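The cutoff described above can be sketched as a small Python helper. It treats a value as unusual only when its z-score is less than -2 or greater than +2, as stated in the text; the SAT score of 2000 is a hypothetical value added for illustration, using the SAT mean and standard deviation from the earlier example.

```python
def classify(z, cutoff=2.0):
    """Label a z-score using the 'beyond 2 standard deviations' guideline."""
    return "unusual" if abs(z) > cutoff else "ordinary"

sat_mean, sat_sd = 1500, 240

z_mike = (1740 - sat_mean) / sat_sd     # score from the SAT example
z_high = (2000 - sat_mean) / sat_sd     # hypothetical score, for illustration

print(round(z_mike, 2), classify(z_mike))
print(round(z_high, 2), classify(z_high))
```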

Percentiles and Quartiles

Percentiles
Percentiles are numbers that divide a data set into ______ equal parts, with about ______ of the data in each part.
A data set has ______ percentiles: ______
The interpretation: the kth percentile of an observation means that ______ of the observations are less than or equal to the observation.

Example: a data set with 500 values in it, sorted in ascending order.
Percentiles would divide it into 100 groups, with 5 data values in each group.
  1st 2nd 3rd 4th 5th | 6th 7th 8th 9th 10th | ... | 496th 497th 498th 499th 500th

Quartiles
Values that divide the data into four roughly equal groups.
There are three quartiles: Q1, Q2 and Q3.
Note that the data has to be sorted in ascending order.
  Q1 separates the bottom ______ of the values
  Q2 is the ______, separates bottom ______ from top ______
  Q3 separates the top ______.

Note: If the number of observations in the data set is odd, include the median when determining Q1 and Q3.

Quartiles are a ______ measure.

Example: For twelve data values (even) sorted in ascending order:
  2  5  6  10  15  17  24  27  27  28  30  31
  Q2 = median = ______
  Q1 = median of bottom half = ______
  Q3 = median of top half = ______

Example: For eleven data values (odd) sorted in ascending order:
  5  6  10  15  17  24  27  27  28  30  31
  Q2 = median = ______
  Q1 = median of bottom half = ______
  Q3 = median of top half = ______

Note that different textbooks or statistical software packages may have slightly different methods for finding the quartiles. STATDISK will always give the same results for the quartiles as the method in our textbook as long as there is an even number of data values.
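The "median of each half" method can be sketched in Python. This follows the handout's note that, for an odd number of observations, the median is included in both halves; other textbooks and software packages use different conventions, so results from library functions may not match. The function name `quartiles` is illustrative.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    Assumption taken from the handout's note: when n is odd, the median
    itself is included in both the bottom half and the top half.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    if n % 2 == 0:
        lower, upper = data[:half], data[half:]
    else:
        lower, upper = data[:half + 1], data[half:]
    return median(lower), median(data), median(upper)

twelve = [2, 5, 6, 10, 15, 17, 24, 27, 27, 28, 30, 31]
eleven = [5, 6, 10, 15, 17, 24, 27, 27, 28, 30, 31]

print(quartiles(twelve))
print(quartiles(eleven))
```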

Interquartile Range (IQR)
The interquartile range is another measure of ______.
IQR = ______
IQR represents the range of values over which ______ of the data is spread.

Checking for Outliers
Outliers: An outlier is a data value that is ______ from the other data values, an extreme observation.
An outlier could be the result of:
  An ______ (measurement, sampling, or recording)
  Just an unusually ______ observation

Checking for outliers using quartiles:
Calculate the fences, cutoff points for determining outliers:
  Lower fence = ______
  Upper fence = ______
A data value is considered an outlier if:
  It is ______ the lower fence, or
  It is ______ the upper fence.

Example: Natural Selection (source: Workshop Statistics, 4th edition)

A landmark study on the topic of natural selection was conducted by Hermon Bumpus in 1898. Bumpus gathered extensive data on house sparrows that were brought to the Anatomical Laboratory of Brown University in Providence, Rhode Island, following a particularly severe winter storm. Some of the sparrows were revived, but some sparrows perished. Bumpus analyzed his data to investigate whether or not those that survived tended to have distinctive physical characteristics related to their fitness.

The following sorted data are the total length measurements (in millimeters, from the tip of the sparrow's beak to the tip of its tail) for the 24 adult males that died and the 35 adult males that survived. (Note: I also added a column of numbers next to each data list, just to identify the data values.)

                              Sparrow Died    Sparrow Lived
  Minimum                     ______          ______
  Q1                          ______          ______
  Q2                          ______          ______
  Q3                          ______          ______
  Maximum                     ______          ______
  Interquartile Range (IQR)   ______          ______
  Lower Fence                 ______          ______
  Upper Fence                 ______          ______

Note: the first five rows that you filled out are called the: ______

  No.   Sparrow Died Length (mm)   Sparrow Lived Length (mm)
   1            156                        153
   2            158                        154
   3            160                        155
   4            160                        155
   5            160                        156
   6            161                        156
   7            161                        157
   8            161                        157
   9            161                        158
  10            161                        158
  11            162                        158
  12            162                        158
  13            162                        158
  14            162                        159
  15            162                        159
  16            162                        159
  17            163                        159
  18            163                        159
  19            164                        160
  20            165                        160
  21            165                        160
  22            165                        160
  23            166                        160
  24            166                        160
  25                                       160
  26                                       160
  27                                       160
  28                                       161
  29                                       161
  30                                       161
  31                                       161
  32                                       162
  33                                       163
  34                                       165
  35                                       166
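The summary table for the sparrow data can be checked with a Python sketch. Two assumptions are made here and are not confirmed by the handout: quartiles use the median-of-halves method with the median included in both halves when n is odd, and the fences use the common 1.5 × IQR convention. The function names are illustrative.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, Q2, Q3, max) using the median-of-halves method."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half + len(data) % 2]   # odd n: median included in both halves
    upper = data[half:]
    return data[0], median(lower), median(data), median(upper), data[-1]

def fences(q1, q3, k=1.5):
    """Lower and upper fences; k = 1.5 is the usual convention (an assumption)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

died = [156, 158, 160, 160, 160, 161, 161, 161, 161, 161, 162, 162,
        162, 162, 162, 162, 163, 163, 164, 165, 165, 165, 166, 166]
lived = [153, 154, 155, 155, 156, 156, 157, 157, 158, 158, 158, 158, 158,
         159, 159, 159, 159, 159, 160, 160, 160, 160, 160, 160, 160, 160,
         160, 161, 161, 161, 161, 162, 163, 165, 166]

results = {}
for label, lengths in (("died", died), ("lived", lived)):
    summary = five_number_summary(lengths)
    low_fence, high_fence = fences(summary[1], summary[3])
    outliers = [x for x in lengths if x < low_fence or x > high_fence]
    results[label] = (summary, (low_fence, high_fence), outliers)
    print(label, summary, (low_fence, high_fence), outliers)
```

Values flagged as outliers are the ones a modified boxplot would draw as separate points beyond the whiskers.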

Boxplots

This is a graphical display of the data based on the ______.
Many books call what we are going to make a modified boxplot, because we are going to indicate the outliers on our graph.

Sparrow died: ______
Sparrow lived: ______

Conclusions?
1. What do the boxplots reveal, as far as whether or not there appears to be a difference in lengths between the sparrows that survived and the sparrows that died?
2. What type of a study was this: observational, or designed experiment?
3. As such, can we conclude that being shorter caused the sparrows to be more likely to survive the storm?

PULSE RATES - MEN

Men's Pulse Rates, SORTED (bpm)
  No.  bpm     No.  bpm     No.  bpm     No.  bpm
   1   46      11   60      21   68      31   74
   2   50      12   60      22   68      32   74
   3   52      13   62      23   68      33   76
   4   54      14   62      24   68      34   78
   5   56      15   64      25   68      35   80
   6   56      16   64      26   70      36   80
   7   58      17   64      27   70      37   84
   8   58      18   66      28   70      38   86
   9   60      19   66      29   72      39   88
  10   60      20   66      30   74      40   90

1. Take your own pulse rate for one minute:
   Pulse = ______ beats per minute
2. Calculate your z-score.
   z = ______
   Assuming the distribution is bell-shaped, is your data value an outlier?
3. Find the quartiles (do NOT add your data value):
   Q1 = ______   Q2 = ______   Q3 = ______
4. Find the IQR (interquartile range):
   IQR = ______
5. Identify any outliers (circle them):
   Lower fence = ______   Upper fence = ______

PULSE RATES - WOMEN

Women's Pulse Rates, SORTED (bpm)
  No.  bpm     No.  bpm     No.  bpm     No.  bpm
   1   56      11   72      21   78      31   82
   2   60      12   72      22   78      32   88
   3   62      13   72      23   78      33   90
   4   62      14   72      24   78      34   90
   5   64      15   72      25   78      35   90
   6   64      16   72      26   78      36   96
   7   66      17   74      27   78      37   98
   8   68      18   74      28   80      38   98
   9   68      19   76      29   82      39  100
  10   72      20   76      30   82      40  104

1. Take your own pulse rate for one minute:
   Pulse = ______ beats per minute
2. Calculate your z-score.
   z = ______
   Assuming the distribution is bell-shaped, is your data value an outlier?
3. Find the quartiles (do NOT add your data value):
   Q1 = ______   Q2 = ______   Q3 = ______
4. Find the IQR (interquartile range):
   IQR = ______
5. Identify any outliers (circle them):
   Lower fence = ______   Upper fence = ______

  Women's Pulse Data:              Men's Pulse Data:
    mean = ______                    mean = ______
    standard deviation = ______      standard deviation = ______

Who has the relatively higher pulse rate (compared to their respective populations): a woman with a pulse rate of 100 bpm, or a man with a pulse rate of 88 bpm? Check by calculating the z-score for both.

  Women's 5-number summary:        Men's 5-number summary:
    Min = ______                     Min = ______
    Q1 = ______                      Q1 = ______
    Q2 = ______                      Q2 = ______
    Q3 = ______                      Q3 = ______
    Max = ______                     Max = ______

Boxplot of Women's Pulse Rate Data (indicating potential outliers):

Boxplot of Men's Pulse Rate Data (indicating potential outliers):

[Boxplot axis: Pulse Rate (bpm)]

What shape do the distributions appear to be, and what conclusions can you make based on the comparison of the two boxplots?
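The z-score comparison for the two pulse rates can be sketched in Python using the sorted pulse data from the two worksheets. The sketch treats each list as a sample (so it uses the sample standard deviation) and rounds z-scores to two decimal places; both choices are assumptions, since the handout leaves them blank.

```python
from statistics import mean, stdev

men = [46, 50, 52, 54, 56, 56, 58, 58, 60, 60, 60, 60, 62, 62, 64, 64, 64,
       66, 66, 66, 68, 68, 68, 68, 68, 70, 70, 70, 72, 74, 74, 74, 76, 78,
       80, 80, 84, 86, 88, 90]
women = [56, 60, 62, 62, 64, 64, 66, 68, 68, 72, 72, 72, 72, 72, 72, 72, 74,
         74, 76, 76, 78, 78, 78, 78, 78, 78, 78, 80, 82, 82, 82, 88, 90, 90,
         90, 96, 98, 98, 100, 104]

def z_score(x, data):
    """z-score of x relative to the sample mean and sample std dev of data."""
    return (x - mean(data)) / stdev(data)

z_woman = z_score(100, women)   # woman with a pulse of 100 bpm
z_man = z_score(88, men)        # man with a pulse of 88 bpm

print("women: mean", mean(women), "s", round(stdev(women), 2))
print("men:   mean", mean(men), "s", round(stdev(men), 2))
print("z (woman, 100 bpm):", round(z_woman, 2))
print("z (man, 88 bpm):   ", round(z_man, 2))
```

Whichever z-score is larger identifies the person whose pulse is higher relative to their own group, even though 100 bpm is the larger raw value.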