Handout 5: Summarizing Numerical Data STAT 100 Spring 2016

In this handout, we will consider methods that are appropriate for summarizing a single set of numerical measurements.

Definition
Numerical Data: A set of measurements that are recorded on a naturally numeric scale.

Example: Typical Household Income

The Census Bureau provides a variety of information at the county level for all counties across the U.S. in its State & County QuickFacts data sets (http://quickfacts.census.gov/qfd/download_data.html). In this handout, we will consider one numerical variable that was measured in each county: Typical Household Income. These data can be found in the file CensusData.xlsx on the course website. A portion of the data set is shown below.

Note that I have used the Filter option to work with only the counties in Minnesota. To do this, highlight the State column and then select Filter from the Data tab. A drop-down arrow should now appear in the State column. Click on this and then select the appropriate box to show only the values from Minnesota. A portion of the results is shown below.

It's difficult to gain insight into the typical household incomes across counties in Minnesota by viewing the data in the above table, so let's alternatively consider a few graphical representations which summarize these measurements. Which do you prefer, and why?

MEASURES OF LOCATION: CENTER

Suppose we wanted to summarize the location (on a number line) of the data that were measured on typical household incomes in Minnesota counties. The most common measures of location summarize the center of a data set: the mean and the median.

Definitions
Mean: The arithmetic average of all values. This is calculated by adding up all of the values and dividing by the total number of measurements.
Median: This is the middle value of a data set, after the numerical values have been put in order. If the data set contains an even number of observations, then the median is the average of the middle two observations.

Getting these Summaries in Excel: In Excel, you can calculate the mean and median using the following functions:

Summary    Excel Function
Mean       =AVERAGE( )
Median     =MEDIAN( )

Enter the following in your Excel spreadsheet to calculate these summaries:

Write the values that Excel returns in Column N on the spreadsheet shown below.
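For readers who want to check the definitions outside of Excel, here is a minimal Python sketch of the same two summaries. The income values are made up for illustration and are not taken from CensusData.xlsx; the variable names are likewise hypothetical.

```python
from statistics import mean, median

# Hypothetical incomes, invented for illustration (NOT from CensusData.xlsx).
incomes = [41000, 44500, 46000, 47500, 52000, 58000, 83000]

# Mean: add up all of the values and divide by the number of measurements.
avg = mean(incomes)          # plays the role of =AVERAGE( )

# Median: the middle value once the data are put in order.
mid = median(incomes)        # plays the role of =MEDIAN( )

# With an even number of observations, the median is the average
# of the middle two ordered values.
mid_even = median([41000, 44500, 46000, 47500])

print(avg, mid, mid_even)
```

Note that the large value 83000 pulls the mean above the median in this made-up sample, while the median depends only on the ordering of the values.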

Excel Tip

You can name a range of cells to be referenced later in a formula. Do this by highlighting the data values in a specific range (in this case, highlight the Typical Household Income values for Minnesota counties) and then giving the data range a name in the box just above the column labels. For example, I chose to name the range of data containing Typical Household Income for Minnesota counties MN_Income in the worksheet shown below.

This range can now be referenced in a formula using only its name:

The mean (or average) is the balance point in the distribution. The median is the middle value. Since there are 87 measurements in this data set, the middle value would be the 44th of the ordered measurements (43 values fall below it and 43 fall above it).

Questions:
1. Does the mean necessarily have to be a value in the data set? Explain.
2. Does the median necessarily have to be a value in the data set? Explain.
3. In Minnesota, the typical household income is highest in Scott County ($83,415). Suppose this data value was replaced by a value that was even larger (say $90,000). What effect would this have on the mean Typical Household Income across Minnesota counties? What about the median?
4. Note that in our data set, each county is represented by a single value for Typical Household Income. How do you think the U.S. Census Bureau came up with this one measurement for each county? Discuss.

MEASURES OF LOCATION: PERCENTILES

In addition to the mean and/or median, summaries called percentiles are also used to describe a set of measurements. These percentiles give us insight into the entire spectrum of data values.

Definition
Percentile: The pth percentile of a set of measurements is defined to be the point in the data set where p% of the measurements fall at or below that value.

To see how percentiles are calculated, consider the county level Typical Household Income in Minnesota counties. A graph of these values is shown below.

One way to understand percentiles is to find the percentage of observations that fall at or below a particular point in the data set. For example, note that about 3% of the counties in Minnesota have typical income levels at or below $40,000. So, the 3rd percentile of this data set is about $40,000.

Typical Household Income    Percentile
$35,000                     0%
$40,000                     3%
$45,000                     26%
$50,000                     64%
$55,000                     79%
$60,000                     87%
$65,000                     89%
$70,000                     94%
$75,000                     97%
$80,000                     98%
$85,000                     100%
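The "at or below" reading of a percentile can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name percent_at_or_below and the income values are hypothetical, and the values are not taken from CensusData.xlsx.

```python
def percent_at_or_below(values, cutoff):
    """Return the percentage of measurements that fall at or below cutoff."""
    count = sum(1 for v in values if v <= cutoff)
    return 100 * count / len(values)

# Hypothetical incomes, invented for illustration (NOT from CensusData.xlsx).
incomes = [41000, 44500, 46000, 47500, 52000, 58000, 83000]

# Four of the seven made-up values are at or below 47,500.
pct = percent_at_or_below(incomes, 47500)
print(round(pct, 1))
```

Evaluating this function at a series of cutoffs gives the kind of income-versus-percentile pairs listed in the table above.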

A cumulative distribution function (CDF) plot can be used to display all of these percentiles. To create a CDF plot on the graph below, plot the typical household income levels from the preceding table on the x-axis and their respective percentiles on the y-axis.

Next, instead of first selecting a Typical Household Income value and then calculating the percentage of data points at or below that point, we could work backwards. In other words, we could define certain percentiles and then determine the Typical Household Income level for that percentile. For example, the bottom 10% of the incomes fall at or below $43,285. So, the 10th percentile for this data set is $43,285.

Income per household    Percentile
$35,307                 0%
$43,285                 10%
$44,472                 20%
$44,820                 25% (Q1)
$45,475                 30%
$46,960                 40%
$47,959                 50% (Median)
$49,420                 60%
$51,987                 70%
$52,598                 75% (Q3)
$55,590                 80%
$66,208                 90%
$83,415                 100%

Percentiles can be calculated in Excel using the =PERCENTILE( ) function. For example, in the worksheet shown below, I entered the following formulas in Column N so that Excel would return the corresponding percentiles there.

Excel returns the percentiles as shown below:
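The "working backwards" calculation can also be sketched in Python. The sketch below assumes that the percentile is found by linear interpolation between neighboring ordered values, which is my understanding of how Excel's =PERCENTILE( ) behaves; treat that as an assumption rather than a statement about Excel's internals. The function name and the income values are hypothetical.

```python
def percentile(values, p):
    """Return the value at fraction p (0 to 1) of the ordered data,
    interpolating linearly between neighboring ordered values."""
    ordered = sorted(values)
    position = p * (len(ordered) - 1)         # zero-based fractional position
    lower = int(position)                     # index at or just below position
    upper = min(lower + 1, len(ordered) - 1)  # next index, capped at the end
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

# Hypothetical incomes, invented for illustration (NOT from CensusData.xlsx).
incomes = [41000, 44500, 46000, 47500, 52000, 58000, 83000]

q1 = percentile(incomes, 0.25)       # first quartile
med = percentile(incomes, 0.50)      # median
q3 = percentile(incomes, 0.75)       # third quartile
print(q1, med, q3)
```

The 0th and 100th percentiles from this function are simply the minimum and maximum of the data.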

Note that the CDF plot based on the table we just obtained in Excel is equivalent to the one sketched earlier.

Questions: Use the preceding table of percentiles and/or the corresponding CDF plot to answer the following questions.
1. What is the median Typical Household Income for MN counties?
2. How could you determine the median from the CDF plot? Discuss.
3. What is the minimum Typical Household Income in MN? What is the maximum? How could you identify these from the CDF plot? Discuss.
4. The CDF plot has a longer tail on the upper end than on the lower end. What does this imply about Typical Household Income across the 87 counties in MN? Discuss.
5. The CDF plot is fairly steep between $45,000 and $55,000. What does this imply about Typical Household Income across the 87 counties in MN? Discuss.

IS A MEASURE OF CENTER ENOUGH?

Note that by calculating a summary such as a mean or median for a data set, we condense information from all of the measurements down to a single value. For example, consider the Typical Household Income across counties for three different states (Minnesota, Wisconsin, and Virginia):

The following picture shows the average for each state.

Questions:
1. What differences exist in the Typical Household Income values across these three states? Discuss.
2. Suppose that your friend tries to summarize the differences across these three states using only the mean (i.e., average) from each state. Do you think that this single summary (the mean) tells the whole story well? Why or why not?

Note that instead of comparing only the means from each state, we could have also considered percentiles. Putting the CDF plots for all three states on the same graph allows us to make very rich comparisons across states.

Questions:
1. Consider the poorest people in each state. In which state do the county-level typical household incomes tend to be the lowest? Likewise, consider the richest people in each state. In which state are the county-level typical household incomes the highest?
2. Which state seems to have the most problems with income inequality? Discuss.
3. How do the income levels of MN and WI compare?
4. What is an advantage to using the CDF plot to make comparisons across states?

MEASURES OF VARIABILITY

As noted above, describing an entire data set well involves more than simply summarizing its center with a mean or a median. We should also consider the amount of variability in the data set (i.e., a measure of how different the measurements are from one another). Several quantities exist for summarizing the amount of variability. A few of them are discussed below.

Definition
Range: The difference between the largest and smallest measurements in a data set.

For example, consider the Minnesota counties Typical Household Incomes:

In Excel, you can calculate this using the =MAX( ) and =MIN( ) functions:

Write the value that Excel returns on the spreadsheet below:
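As a quick illustration of the range, here is a minimal Python sketch. The income values are made up for illustration and are not taken from CensusData.xlsx.

```python
# Hypothetical incomes, invented for illustration (NOT from CensusData.xlsx).
incomes = [41000, 44500, 46000, 47500, 52000, 58000, 83000]

# Range: largest measurement minus smallest measurement,
# the same idea as subtracting =MIN( ) from =MAX( ) in Excel.
data_range = max(incomes) - min(incomes)
print(data_range)
```

Because the range uses only the two most extreme values, changing any of the values in between leaves it unchanged.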

Definition
Mean Absolute Deviation: For each measurement, calculate how far away that measurement is from the mean of the data set. The mean absolute deviation is the average of these absolute distances.

$$\mathrm{MAD} = \frac{\sum |\text{distance to the mean}|}{n}$$

To see how this is calculated, first consider the mean for Minnesota:

Then, consider the distance between each measurement and the mean:

Next, we consider the length of each of these distances (also called the absolute value of the residuals) on the following plot. The average of these lengths is the mean absolute deviation.
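The steps in this definition can be sketched in Python as follows. The income values are made up for illustration and are not taken from CensusData.xlsx; the variable names are hypothetical.

```python
from statistics import mean

# Hypothetical incomes, invented for illustration (NOT from CensusData.xlsx).
incomes = [41000, 44500, 46000, 47500, 52000, 58000, 83000]

# Step 1: find the mean of the data set.
center = mean(incomes)

# Step 2: find the absolute distance from each measurement to the mean.
distances = [abs(value - center) for value in incomes]

# Step 3: average those absolute distances (divide by n).
mad = sum(distances) / len(distances)
print(round(mad, 2))
```

Taking the absolute value in Step 2 matters: without it, the distances above and below the mean would cancel and always sum to zero.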

In Excel, you can use the =AVEDEV( ) function to calculate the mean absolute deviation:

Definition
Standard Deviation: Like the mean absolute deviation, the standard deviation also measures the typical distance from the mean. For each value in the data set, we calculate how far away that measurement is from the mean of the data set. The standard deviation is a function of these squared distances.

$$\text{Standard Deviation} = \sqrt{\frac{\sum (\text{distance to the mean})^2}{n - 1}}$$

You can use the =STDEV( ) function to calculate this in Excel:

Write the value that Excel returns on the spreadsheet below:
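To make the formula concrete, here is a minimal Python sketch that follows it step by step and then compares the result with the standard library's statistics.stdev, which also divides by n - 1. The income values are made up for illustration and are not taken from CensusData.xlsx.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical incomes, invented for illustration (NOT from CensusData.xlsx).
incomes = [41000, 44500, 46000, 47500, 52000, 58000, 83000]

center = mean(incomes)

# Square each distance to the mean, then add the squares up.
squared_distances = [(value - center) ** 2 for value in incomes]
total = sum(squared_distances)

# Divide by n - 1 (not n) and take the square root.
sd = sqrt(total / (len(incomes) - 1))

# The library function should agree with the hand calculation.
print(round(sd, 2), round(stdev(incomes), 2))
```

Squaring gives extra weight to measurements far from the mean, so a single extreme value tends to raise the standard deviation more than it raises the mean absolute deviation.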

Use Excel to compute the following summaries for each of the three states listed below. Then, answer the questions that follow the table.

Summary               Virginia    Wisconsin    Minnesota
Mean
Minimum
25th percentile
Median
75th percentile
Maximum
Range
MAD
Standard Deviation

Questions:
1. In which state does the Typical Household Income of counties tend to be the highest? The lowest? Discuss.
2. Which state appears to have the most problems with income inequality? Which state appears to have the fewest problems with income inequality? How did you decide this?
3. Virginia's data set consists of 134 counties, while Wisconsin's consists of only 72. Your friend argues that there is less variability (and therefore less income inequality) in Wisconsin's data set simply because it has a smaller number of measurements. Why is this reasoning incorrect? What is the real reason there is less variability in Wisconsin than in Virginia?