Getting to know a data-set (how to approach data) Overview: Descriptives & Graphing

Transcription:

Overview: Descriptives & Graphing
  1. Getting to know a data set
  2. LOM & types of statistics
  3. Descriptive statistics
  4. Normal distribution
  5. Non-normal distributions
  6. Effect of skew on central tendency
  7. Principles of graphing
  8. Univariate graphical techniques

Getting to know a data-set (how to approach data)

Level of measurement & types of statistics
Image source: http://www.flickr.com/photos/peanutlen/2228077524/

Golden rule of data analysis
  A variable's level of measurement determines the type of statistics that can be used, including types of:
    descriptive statistics
    graphs
    inferential statistics

Levels of measurement and non-parametric vs. parametric
  Categorical & ordinal data DV: non-parametric (does not assume a normal distribution)
  Interval & ratio data DV: parametric (assumes a normal distribution), or non-parametric (if the distribution is non-normal)
  DVs = dependent variables

Parametric statistics
  Statistics which estimate parameters of a population, based on the normal distribution
  Univariate: mean, standard deviation, skewness, kurtosis, t-tests, ANOVAs
  Bivariate: correlation, linear regression
  Multivariate: multiple linear regression

Parametric statistics
  More powerful (more sensitive)
  More assumptions (population is normally distributed)
  Vulnerable to violations of assumptions (less robust)

Non-parametric statistics
  Statistics which do not assume sampling from a population which is normally distributed
  There are non-parametric alternatives for many parametric statistics, e.g., sign test, chi-squared, Mann-Whitney U test, Wilcoxon matched-pairs signed-ranks test.

Non-parametric statistics
  Less powerful (less sensitive)
  Fewer assumptions (do not assume a normal distribution)
  Less vulnerable to assumption violation (more robust)

Univariate descriptive statistics

What do we want to describe?
  The distributional properties of variables, based on:
    Central tendency(ies): e.g., frequencies, mode, median, mean
    Shape: e.g., skewness, kurtosis
    Spread (dispersion): min., max., range, IQR, percentiles, variance, standard deviation

Measures of central tendency
  Statistics which represent the centre of a frequency distribution:
    Mode (most frequent)
    Median (50th percentile)
    Mean (average)
  Which ones to use depends on:
    Type of data (level of measurement)
    Shape of distribution (esp. skewness)
  Reporting more than one may be appropriate.

Measures of central tendency
                 Mode / Freq. / %s   Median          Mean
  Nominal        yes                 x               x
  Ordinal        yes                 If meaningful   x
  Interval       yes                 yes             yes
  Ratio          If meaningful       yes             yes
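As a rough illustration of the three measures of central tendency listed above, the following sketch uses Python's standard `statistics` module. The scores are invented for illustration and the helper name `central_tendency` is an assumption, not part of the lecture.

```python
import statistics

def central_tendency(scores):
    """Return the mode, median and mean of a list of numeric scores."""
    return {
        "mode": statistics.mode(scores),      # most frequent score
        "median": statistics.median(scores),  # 50th percentile
        "mean": statistics.mean(scores),      # arithmetic average
    }

# Hypothetical scores with one extreme value (18)
scores = [1, 2, 2, 3, 4, 5, 18]
summary = central_tendency(scores)
print(summary)
```

With one extreme score in the data, the three measures disagree, which is one reason reporting more than one may be appropriate.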

Measures of distribution
  Measures of shape, spread, dispersion, and deviation from the central tendency
  Non-parametric:
    Min. and max.
    Range
    Percentiles
  Parametric:
    SD
    Skewness
    Kurtosis

Measures of spread / dispersion / deviation
                 Min / Max, Range    Percentile      Var / SD
  Nominal        x                   x               x
  Ordinal        yes                 If meaningful   x
  Interval       yes                 yes             yes
  Ratio          yes                 yes             yes

Descriptives for nominal data
  Nominal LOM = Labelled categories
  Descriptive statistics:
    Most frequent? (Mode, e.g., females)
    Least frequent? (e.g., males)
    Frequencies (e.g., 20 females, 10 males)
    Percentages (e.g., 67% females, 33% males)
    Cumulative percentages
    Ratios (e.g., twice as many females as males)
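The nominal descriptives above (frequencies, percentages, cumulative percentages) can be sketched with a simple frequency table. This is a minimal sketch using the 20 females and 10 males from the example; the function name `frequency_table` is hypothetical.

```python
from collections import Counter

def frequency_table(responses):
    """Return rows of (category, frequency, %, cumulative %), most frequent first."""
    counts = Counter(responses)
    total = sum(counts.values())
    table = []
    cumulative = 0.0
    for category, f in counts.most_common():
        pct = 100 * f / total
        cumulative += pct
        table.append((category, f, round(pct, 1), round(cumulative, 1)))
    return table

# The example from the slide: 20 females and 10 males
responses = ["female"] * 20 + ["male"] * 10
table = frequency_table(responses)
for row in table:
    print(row)
```

The first row of the table is the mode, and the ratio of the two frequencies (20 / 10) gives "twice as many females as males".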

Descriptives for ordinal data
  Ordinal LOM = Conveys order but not distance (e.g., ranks)
  Descriptives approach is as for nominal (frequencies, mode etc.)
  Plus percentiles (including median) may be useful

Descriptives for interval data
  Interval LOM = order and distance, but no true 0 (0 is arbitrary).
  Central tendency (mode, median, mean)
  Shape/Spread (min., max., range, SD, skewness, kurtosis)
  Interval data is discrete, but is often treated as ratio/continuous (especially for > 5 intervals)

Descriptives for ratio data
  Ratio = Numbers convey order and distance, meaningful 0 point
  As for interval, use median, mean, SD, skewness etc.
  Can also use ratios (e.g., Category A is twice as large as Category B)

Mode (Mo)
  Most common score - highest point in a frequency distribution
  A real score
  The most common response
  Suitable for all levels of data, but may not be appropriate for ratio (continuous)
  Not affected by outliers
  Check frequencies and bar graph to see whether it is an accurate and useful statistic

Frequencies (f) and percentages (%)
  # of responses in each category
  % of responses in each category
  Frequency table
  Visualise using a bar or pie chart

Median (Mdn)
  Mid-point of distribution (Quartile 2, 50th percentile)
  Not badly affected by outliers
  May not represent the central tendency in skewed data
  If the Median is useful, then consider what other percentiles may also be worth reporting

Summary: Descriptive statistics
  Level of measurement and normality determine whether data can be treated as parametric
  Describe the central tendency:
    Frequencies, Percentages
    Mode, Median, Mean
  Describe the variability:
    Min., Max., Range, Quartiles
    Standard Deviation, Variance

Four moments of a normal distribution
[Figure: four moments of a normal distribution, labelled Mean, SD, -ve Skew, +ve Skew and Kurtosis]

Four moments of a normal distribution
  Four mathematical qualities (parameters) can describe a continuous distribution which at least roughly follows a bell curve shape:
    1st = mean (central tendency)
    2nd = SD (dispersion)
    3rd = skewness (lean / tail)
    4th = kurtosis (peakedness / flatness)

Mean (1st moment)
  Average score
  Mean = Σ X / N
  For normally distributed ratio or interval (if treating it as continuous) data.
  Influenced by extreme scores (outliers)

Beware inappropriate averaging...
  With your head in an oven and your feet in ice you would feel, on average, just fine
  The majority of people have more than the average number of legs (M = 1.9999).
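The "average number of legs" remark can be checked with a small sketch. The population below is hypothetical: it assumes 10,000 people of whom exactly one has one leg, which happens to reproduce M = 1.9999.

```python
import statistics

# Hypothetical population: 9,999 people with two legs, one person with one leg
legs = [2] * 9999 + [1]

mean_legs = statistics.mean(legs)      # Mean = sum of X / N = 19999 / 10000
median_legs = statistics.median(legs)  # the middle score is unaffected by the one low value

# Proportion of people whose leg count exceeds the mean
share_above_mean = sum(1 for x in legs if x > mean_legs) / len(legs)

print(mean_legs, median_legs, share_above_mean)
```

A single extreme score pulls the mean below the value held by almost everyone, while the median stays at 2.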

Standard deviation (2nd moment)
  SD = square root of the variance:
  $SD = \sqrt{\dfrac{\sum (X - \bar{X})^2}{N - 1}}$
  For normally distributed interval or ratio data
  Affected by outliers
  Can also derive the Standard Error (SE) = SD / square root of N

Skewness (3rd moment)
  Lean of distribution
    +ve = tail to right
    -ve = tail to left
  Can be caused by an outlier, or ceiling or floor effects
  Can be accurate (e.g., cars owned per person would have a skewed distribution)

Skewness (3rd moment) (with ceiling and floor effects)
  Negative skew: ceiling effect
  Positive skew: floor effect
Image source: http://www.visualstatistics.net/visual%20statistics%20multimedia/normalization.htm
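A minimal sketch of the standard deviation and standard error calculations described above, using the N - 1 denominator. The scores and the function name `describe` are assumptions made for illustration.

```python
import math
import statistics

def describe(scores):
    """Return (mean, SD, SE) using the N - 1 denominator for the variance."""
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((x - mean) ** 2 for x in scores) / (n - 1)
    sd = math.sqrt(variance)   # SD = square root of the variance
    se = sd / math.sqrt(n)     # SE = SD / square root of N
    return mean, sd, se

# Hypothetical scores
scores = [2, 4, 4, 4, 5, 5, 7, 9]
mean, sd, se = describe(scores)
print(mean, round(sd, 3), round(se, 3))
```

The hand-written SD agrees with `statistics.stdev`, which also divides by N - 1.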

Kurtosis (4th moment)
  Flatness or peakedness of distribution
    +ve = peaked
    -ve = flattened
  By altering the X and/or Y axis, any distribution can be made to look more peaked or flat, so add a normal curve to help judge kurtosis visually.

Kurtosis (4th moment)
Image source: https://classconnection.s3.amazonaws.com/65/flashcards/2185065/jpg/kurtosis-142c1127af2178fb244.jpg

Judging severity of skewness & kurtosis
  View histogram with normal curve
  Deal with outliers
  Rule of thumb: skewness and kurtosis between -1 and 1 are generally considered sufficiently normal for meeting the assumptions of parametric inferential statistics
  Significance tests of skewness: tend to be overly sensitive (therefore avoid using)
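The rule of thumb above can be sketched as follows. This sketch assumes the simple moment-based definitions of skewness and excess kurtosis; statistical packages such as SPSS apply small-sample corrections, so their values will differ slightly. The function names and data are hypothetical.

```python
def skewness_kurtosis(scores):
    """Return (skewness, excess kurtosis) from the central moments of the scores."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((x - mean) ** 2 for x in scores) / n  # 2nd central moment
    m3 = sum((x - mean) ** 3 for x in scores) / n  # 3rd central moment
    m4 = sum((x - mean) ** 4 for x in scores) / n  # 4th central moment
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3             # 0 for a normal distribution
    return skew, excess_kurtosis

def roughly_normal(scores, limit=1.0):
    """Apply the rule of thumb: both statistics lie between -limit and +limit."""
    skew, kurt = skewness_kurtosis(scores)
    return abs(skew) < limit and abs(kurt) < limit

# Hypothetical data sets
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]
right_skewed = [1, 1, 1, 2, 2, 3, 10]

print(skewness_kurtosis(symmetric), roughly_normal(symmetric))
print(skewness_kurtosis(right_skewed), roughly_normal(right_skewed))
```

A check like this complements, rather than replaces, viewing the histogram with a normal curve.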

Areas under the normal curve
  If distribution is normal (bell-shaped - or close):
    ~68% of scores within +/- 1 SD of M
    ~95% of scores within +/- 2 SD of M
    ~99.7% of scores within +/- 3 SD of M

Areas under the normal curve
Image source: https://commons.wikimedia.org/wiki/file:empirical_rule.png

Non-normal distributions
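The areas under the normal curve quoted above (~68%, ~95%, ~99.7%) can be reproduced with `statistics.NormalDist` from the Python standard library. This is a small sketch; the helper name `share_within` is an assumption.

```python
from statistics import NormalDist

def share_within(k, mean=0.0, sd=1.0):
    """Proportion of a normal distribution lying within +/- k SD of the mean."""
    dist = NormalDist(mu=mean, sigma=sd)
    return dist.cdf(mean + k * sd) - dist.cdf(mean - k * sd)

shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within +/- {k} SD: {share:.1%}")
```

The result does not depend on the particular mean and SD chosen, only on the number of SDs from the mean.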

Types of non-normal distribution
  Modality
    Uni-modal (one peak)
    Bi-modal (two peaks)
    Multi-modal (more than two peaks)
  Skewness
    Positive (tail to right)
    Negative (tail to left)
  Kurtosis
    Platykurtic (Flat)
    Leptokurtic (Peaked)

Non-normal distributions

Histogram of people's weight
[Figure: histogram of WEIGHT (x-axis: WEIGHT, 40.0 to 110.0; y-axis: Frequency; Mean = 69.6, Std. Dev = 17.10, N = 20)]

Histogram of daily calorie intake
[Figure: histogram of daily calorie intake (N = 75)]

Histogram of fertility
[Figure: histogram of fertility]

Example normal distribution 1
[Figure: histogram of responses to "At what age do you think you will die?" (x-axis: Die, 0 to 140; y-axis: Frequency; Mean = 81.21, Std. Dev. = 18.228, N = 188)]

[Figure: bar charts of Femininity-Masculinity, "Distribution for females" and "Distribution for males" (x-axis categories: Very feminine, Fairly feminine, Androgynous, Fairly masculine, Very masculine; y-axis: Count)]

Non-normal distribution: Use non-parametric descriptive statistics
  Min. & Max.
  Range = Max. - Min.
  Percentiles
  Quartiles
    Q1
    Mdn (Q2)
    Q3
    IQR (Q3 - Q1)
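The non-parametric descriptives listed above can be computed as in the following sketch. Several quartile conventions exist; this sketch assumes the "inclusive" method of `statistics.quantiles`, so other software may report slightly different Q1 and Q3. The data and function name are hypothetical.

```python
import statistics

def nonparametric_summary(scores):
    """Return min, max, range, quartiles and IQR for a list of scores."""
    q1, mdn, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {
        "min": min(scores),
        "max": max(scores),
        "range": max(scores) - min(scores),  # Range = Max. - Min.
        "Q1": q1,
        "Mdn": mdn,                          # Q2, the 50th percentile
        "Q3": q3,
        "IQR": q3 - q1,                      # IQR = Q3 - Q1
    }

# Hypothetical scores
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]
summary = nonparametric_summary(scores)
print(summary)
```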

Effects of skew on measures of central tendency
  +vely skewed distributions: mode < median < mean
  Symmetrical (normal) distributions: mean = median = mode
  -vely skewed distributions: mean < median < mode

Effects of skew on measures of central tendency

Transformations
  Converts data using various formulae to achieve normality and allow more powerful tests
  Loses original metric
  Complicates interpretation
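The ordering of mode, median and mean under positive and negative skew can be checked on small invented data sets. In this sketch the negatively skewed scores are simply the mirror image of the positively skewed ones; all values are hypothetical.

```python
import statistics

def centre(scores):
    """Return (mode, median, mean) for a list of scores."""
    return statistics.mode(scores), statistics.median(scores), statistics.mean(scores)

# Hypothetical scores with a long tail to the right
positive_skew = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]
# Mirror image: long tail to the left
negative_skew = [16 - x for x in positive_skew]

mode_p, median_p, mean_p = centre(positive_skew)
mode_n, median_n, mean_n = centre(negative_skew)

print("positive skew:", mode_p, median_p, mean_p)  # mode < median < mean
print("negative skew:", mean_n, median_n, mode_n)  # mean < median < mode
```

The mean is pulled furthest towards the tail, the mode stays at the peak, and the median sits between them.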

Review questions
  1. If a survey question produces a floor effect, where will the mean, median and mode lie in relation to one another?
  2. Would you expect the mean # of cars owned in Australia to exceed the median?
  3. Would the mean score on an easy test exceed the median performance?

Principles of graphing
  Clear purpose
  Maximise clarity
  Minimise clutter
  Allow visual comparison

Graphs (Edward Tufte)
  Visualise data
  Reveal data
    Describe
    Explore
    Tabulate
    Decorate
  Communicate complex ideas with clarity, precision, and efficiency

Graphing steps
  1. Identify purpose of the graph (make large amounts of data coherent; present many #s in small space; encourage the eye to make comparisons)
  2. Select type of graph to use
  3. Draw and modify graph to be clear, non-distorting, and well-labelled (maximise clarity, minimise clutter; show the data; avoid distortion; reveal data at several levels/layers)

Software for data visualisation (graphing)
  1. Statistical packages, e.g., SPSS: Graphs, or via Analyses
  2. Spreadsheet packages, e.g., MS Excel
  3. Word-processors, e.g., MS Word: Insert > Object > Microsoft Graph Chart

Cleveland's hierarchy
Image source: http://www.processtrends.com/toc_data_visualization.htm

Univariate graphs
  Non-parametric, i.e., nominal or ordinal:
    Bar graph
    Pie chart
  Parametric, i.e., normally distributed interval or ratio:
    Histogram
    Stem & leaf plot
    Data plot / Error bar
    Box plot

Bar chart (Bar graph)
  Allows comparison of heights of bars
  X-axis: Collapse if too many categories
  Y-axis: Count/Frequency or % - truncation exaggerates differences
  Can add data labels (data values for each bar)
[Figure: two bar charts of Count by AREA (Sociology, Information Technology, Biology, Psychology, Anthropology), one with a truncated Y-axis starting at 9 and one with a Y-axis starting at 0]
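How much a truncated Y-axis exaggerates a difference between bars can be put into numbers. The sketch below compares the ratio of the drawn bar heights with and without truncation; the counts, the axis starting point and the function name are hypothetical and not read from the figure.

```python
def drawn_height_ratio(count_a, count_b, axis_start=0):
    """Ratio of the drawn heights of bar B to bar A when the Y-axis starts at axis_start."""
    if axis_start >= min(count_a, count_b):
        raise ValueError("axis_start must be below both counts")
    return (count_b - axis_start) / (count_a - axis_start)

# Hypothetical counts for two categories
count_a, count_b = 9, 12

honest = drawn_height_ratio(count_a, count_b)                   # Y-axis starts at 0
truncated = drawn_height_ratio(count_a, count_b, axis_start=8)  # Y-axis starts at 8

print(f"axis from 0: bar B looks {honest:.2f}x as tall as bar A")
print(f"axis from 8: bar B looks {truncated:.2f}x as tall as bar A")
```

With the axis starting at 0 the drawn ratio equals the ratio in the data; with truncation the same two counts look several times further apart.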

Pie chart
  Use a bar chart instead
  Hard to read
    Difficult to show small values
    Difficult to show small differences
    Rotation of chart and position of slices influences perception
[Figure: pie chart with slices for Biology, Anthropology, Sociology, Psychology and Information Technology]

Pie chart
  Use bar chart instead
Image source: https://priceonomics.com/how-william-cleveland-turned-data-visualization/

Histogram
  For continuous data (Likert?, Ratio)
  X-axis needs a happy medium for # of categories
  Y-axis matters (can exaggerate)
[Figure: three histograms of Participant Age (Mean = 24.0, Std. Dev = 9.16, N = 5575)]

Histogram of male & female heights
Image source: Wild, C. J., & Seber, G. A. F. (2000). Chance encounters: A first course in data analysis and inference. New York: Wiley.

Stem & leaf plot
  Use for ordinal, interval and ratio data (if rounded)
  May look confusing to unfamiliar reader

Stem & leaf plot
  Contains actual data
  Collapses tails
  Underused alternative to histogram

  Frequency   Stem &  Leaf
     7.00        1 .  &
   192.00        1 .  22223333333
   541.00        1 .  444444444444444455555555555555
   610.00        1 .  6666666666666677777777777777777777
   849.00        1 .  88888888888888888888888888899999999999999999999
   614.00        2 .  0000000000000000111111111111111111
   602.00        2 .  222222222222222233333333333333333
   447.00        2 .  4444444444444455555555555
   291.00        2 .  66666666677777777
   240.00        2 .  88888889999999
   167.00        3 .  000001111
   146.00        3 .  22223333
   153.00        3 .  44445555
   118.00        3 .  666777
    99.00        3 .  888999
   106.00        4 .  000111
    54.00        4 .  222
   339.00 Extremes    (>=43)
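A stem & leaf plot is simple to build by hand or in code, which shows why it "contains actual data". The sketch below assumes non-negative whole numbers, uses the tens digit as the stem and the units digit as the leaf, and prints one row per stem (the output above splits each stem across several rows and scales the leaves). The ages are invented.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (tens) to its sorted list of leaves (units)."""
    stems = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        stems[stem].append(leaf)
    return dict(stems)

# Hypothetical participant ages
ages = [12, 15, 18, 18, 21, 22, 24, 30, 31, 43]
plot = stem_and_leaf(ages)

for stem in sorted(plot):
    leaves = "".join(str(leaf) for leaf in plot[stem])
    print(f"{stem} | {leaves}")
```

Each original score can be read back from the plot by joining a stem with one of its leaves.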

Box plot (Box & whisker)
  Useful for interval and ratio data
  Represents min., max., median, quartiles, & outliers

Box plot (Box & whisker)
  Alternative to histogram
  Useful for screening
  Useful for comparing variables
  Can get messy - too much info
  Confusing to unfamiliar reader
[Figure: box plots of Time Management-T1 and Self-Confidence-T1 by Participant Gender (Missing, Male, Female), with outlying cases labelled by case number]

Data plot & error bar
[Figure: a data plot and an error bar]
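The quantities a box plot represents can be computed directly, which is useful for screening. The sketch below assumes the common convention of flagging scores more than 1.5 x IQR beyond the quartiles as outliers; the slides do not state which rule the plotted software used, and the data and function names are hypothetical.

```python
import statistics

def five_number_summary(scores):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles."""
    q1, mdn, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return min(scores), q1, mdn, q3, max(scores)

def outliers(scores, k=1.5):
    """Return scores lying more than k * IQR below Q1 or above Q3 (assumed rule)."""
    _, q1, _, q3, _ = five_number_summary(scores)
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - k * iqr, q3 + k * iqr
    return [x for x in scores if x < lower_fence or x > upper_fence]

# Hypothetical scores with one extreme value
scores = [1, 2, 3, 4, 5, 6, 7, 8, 30]
print(five_number_summary(scores))
print(outliers(scores))
```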

Line graph
  Alternative to histogram
  Implies continuity, e.g., time
  Can show multiple lines
[Figure: line graph of Mean (5.0 to 8.0) for OVERALL SCALES at T0, T1, T2 and T3]

Graphical integrity (part of academic integrity)

"Like good writing, good graphical displays of data communicate ideas with clarity, precision, and efficiency. Like poor writing, bad graphical displays distort or obscure the data, make it harder to understand or compare, or otherwise thwart the communicative effect which the graph should convey."
Michael Friendly, Gallery of Data Visualisation

Tufte's graphical integrity
  Some lapses intentional, some not
  Lie Factor = size of effect in graph / size of effect in data
  Misleading uses of area
  Misleading uses of perspective
  Leaving out important context
  Lack of taste and aesthetics

References
  1. Chambers, J., Cleveland, B., Kleiner, B., & Tukey, P. (1983). Graphical methods for data analysis. Boston, MA: Duxbury Press.
  2. Cleveland, W. S. (1985). The elements of graphing data. Monterey, CA: Wadsworth.
  3. Jones, G. E. (2006). How to lie with charts. Santa Monica, CA: LaPuerta.
  4. Tufte, E. R. (1983). The visual display of quantitative information. Cheshire, CT: Graphics Press.
  5. Tufte, E. R. (2001). Visualizing quantitative data. Cheshire, CT: Graphics Press.
  6. Tukey, J. (1977). Exploratory data analysis. Addison-Wesley.
  7. Wild, C. J., & Seber, G. A. F. (2000). Chance encounters: A first course in data analysis and inference. New York: Wiley.