

Summarising Data
Mark Lunt, Arthritis Research UK Epidemiology Unit, University of Manchester
17/10/2017

Today we will consider
- Different types of data
- Appropriate ways to summarise these data

Types of Data
Qualitative
- Nominal: outcome is one of several categories
- Ordinal: outcome is one of several ordered categories
Quantitative
- Discrete: can take one of a fixed set of numerical values
- Continuous: can take any numerical value

Examples of Types of Data
- Nominal: blood group; hair colour.
- Ordinal: strongly agree, agree, disagree, strongly disagree.
- Discrete: number of children.
- Continuous: birthweight.

Caveats with Data Types
- The distinction between nominal and ordinal variables can be subjective: e.g. vertebral fracture types (wedge, concavity, biconcavity, crush). One could argue that a crush is worse than a biconcavity, which is worse than a concavity, and so on, but this is not self-evident.
- The distinction between ordinal and discrete variables can be subjective: e.g. cancer staging I, II, III, IV sounds discrete, but is better treated as ordinal.
- Continuous variables are generally measured to a fixed level of precision, which makes them discrete. This is not a problem, provided there are enough levels.

Types of Variables
What type of variable is each of the following?
- Number of visits to a G.P. this year
- Marital status
- Size of tumour in cm
- Pain, rated as minimal/moderate/severe/unbearable
- Blood pressure (mm Hg)

Summarizing Qualitative Data
- Count the number of subjects in each group.
- The count is commonly referred to as the frequency.
- The proportion in each group is referred to as the relative frequency.
- The Stata command to produce a tabulation is tabulate varname

      region |      Freq.     Percent        Cum.
 ------------+-----------------------------------
      Canada |        422       22.84       22.84
         USA |        541       29.27       52.11
      Mexico |        223       12.07       64.18
      Europe |        493       26.68       90.85
        Asia |        169        9.15      100.00
 ------------+-----------------------------------
       Total |      1,848      100.00
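The tabulation above can be reproduced by hand from the five counts. The following is a minimal Python sketch, not the Stata implementation: the function name frequency_table is hypothetical, and the counts are copied from the region table on the slide.

```python
from itertools import accumulate

# Counts copied from the region tabulation on the slide.
counts = {"Canada": 422, "USA": 541, "Mexico": 223, "Europe": 493, "Asia": 169}


def frequency_table(counts):
    """Return (label, frequency, percent, cumulative percent) rows and the total."""
    total = sum(counts.values())
    running = accumulate(counts.values())  # running sum of the frequencies
    rows = []
    for (label, freq), cum in zip(counts.items(), running):
        rows.append((label, freq, 100 * freq / total, 100 * cum / total))
    return rows, total


rows, total = frequency_table(counts)
for label, freq, pct, cum_pct in rows:
    print(f"{label:<8}{freq:>8}{pct:>10.2f}{cum_pct:>10.2f}")
print(f"{'Total':<8}{total:>8}{100:>10.2f}")
```

The percent column is the relative frequency multiplied by 100, and the cumulative column is the running total of frequencies divided by the overall total.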

- Bar chart: data represented as a series of bars, height of bar proportional to frequency.
- Pie chart: data represented as a circle divided into segments, area of segment proportional to frequency.
- Pictogram: similar to a bar chart, but uses a number of pictures to represent each bar.
- The bar chart is the easiest to understand.

Bar Chart
[Figure: bar chart of frequency (0 to 600) by region: Canada, USA, Mexico, Europe, Asia]

Summarizing Quantitative Data
- Simplest method: treat as qualitative data.
- Divide observations into groups. May be unnecessary for discrete data.
- Look at the frequency distribution of these groups.
- Can use a table or a diagram.

The Histogram
- Similar to a bar chart
- Continuous, not categorical variable
- Area of bars proportional to probability of observation being in that bar
- Axis can be
  - Frequency (heights add up to n)
  - Percentage (heights add up to 100%)
  - Density (areas add up to 1)
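The three axis scalings differ only in how the bin counts are normalised. Here is a rough Python sketch under stated assumptions: the heights are made-up illustrative values, the bins are equal-width and half-open, and the function name histogram is hypothetical.

```python
def histogram(values, start, width, n_bins):
    """Bin values into n_bins equal-width bins [start + k*width, start + (k+1)*width).

    Returns bin heights on three scales: frequency, percentage and density.
    Values outside the binned range are ignored.
    """
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    n = sum(counts)
    percent = [100 * c / n for c in counts]      # heights add up to 100
    density = [c / (n * width) for c in counts]  # areas (height * width) add up to 1
    return counts, percent, density


# Hypothetical heights in cm, for illustration only.
heights = [152, 158, 161, 163, 165, 166, 168, 170, 171, 174, 178, 183]
counts, percent, density = histogram(heights, start=150, width=10, n_bins=4)
print(counts)                         # frequencies
print(sum(percent))                   # total of the percentage heights
print(sum(d * 10 for d in density))   # total area on the density scale
```

On the density scale it is the area of each bar, not its height, that corresponds to the proportion of observations in the bin.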

How Many Groups?
- Impossible to say.
- Depends on the number of observations: if individual groups are too small, results are meaningless.
- With discrete variables, exact positions of boundaries may be important.
- Tables need few groups; graphs can have more if there are sufficient numbers.
- May be decided for you in software.

Histograms
[Figure: histograms of measured height (cm), 140 to 200, by sex (female, male)]

Histogram: Effect of Wrong Number of Bins
[Figure: two histograms of x (0 to 30), one with 24 bins (default), one with 30 bins (correct)]

Bar Charts and Histograms in Stata
- histogram varname produces a histogram
- Number of bars can be set by option bin()
- Width of a bar can be set by option width()
- histogram varname, discrete produces a bar chart
- What Stata calls a bar chart is the mean of a second variable subdivided by category, rather than a frequency.
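Why the number of bins matters for a discrete variable can be seen with a small experiment. This is an illustrative Python sketch, not a reproduction of the figure: it assumes integer values 0 to 29, each observed 10 times, binned over the range 0 to 30, and the function name bin_counts is hypothetical.

```python
def bin_counts(values, low, high, n_bins):
    """Count values in n_bins equal-width bins covering [low, high)."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        counts[int((v - low) // width)] += 1
    return counts


# Assumed data: a perfectly flat discrete distribution on 0..29.
values = [v for v in range(30) for _ in range(10)]

coarse = bin_counts(values, 0, 30, 24)  # bin width 1.25: boundaries cut across integers
exact = bin_counts(values, 0, 30, 30)   # bin width 1: one integer per bin

print(coarse)
print(exact)
```

With 24 bins, some bins contain two integer values and others only one, so a flat distribution appears to have regular spikes. With 30 bins, every bar has the same height.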

Need to know:
1. What is a typical value ("location")
2. How much do the values vary ("scale")
- The simplest distribution to summarize is the normal distribution.
- Other summary statistics (skewness, kurtosis etc.) are thought of relative to the normal distribution.

The Normal Distribution
- Symmetrical
- Bell-shaped distribution
- Easiest to use mathematically
- Many variables are normally distributed
- Can be described by two numbers
  - Mean (measure of location)
  - Standard deviation (measure of variation)

Histogram & Normal Distribution
[Figure: histograms of measured height (cm), 140 to 200, by sex (female, male), with a normal curve overlaid; legend: normal nurseht]

Non-Normal Distributions
- The normal distribution is symmetric.
- Asymmetric distributions are called "skewed":
  - Positively skewed = some extremely high values (mean > median).
  - Negatively skewed = some extremely low values (mean < median).
- A distribution may have more than one "peak": bi-modal. Usually formed by mixing two different groups.
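That a normal distribution is fully described by its mean and standard deviation can be illustrated with Python's standard library. The mean of 165 and SD of 7 below are hypothetical values chosen for illustration, not figures from the height data, and share_within is a made-up helper name.

```python
from statistics import NormalDist

# Hypothetical mean and SD, for illustration only.
heights = NormalDist(mu=165.0, sigma=7.0)


def share_within(dist, k):
    """Proportion of a normal distribution within k SDs of its mean."""
    upper = dist.cdf(dist.mean + k * dist.stdev)
    lower = dist.cdf(dist.mean - k * dist.stdev)
    return upper - lower


print(round(share_within(heights, 1), 4))  # proportion within 1 SD of the mean
print(round(share_within(heights, 2), 4))  # proportion within 2 SDs of the mean
print(heights.mean == heights.median)      # symmetry: mean and median coincide
```

Because the distribution is symmetric, the mean and median are equal; for a skewed distribution they separate in the direction of the long tail.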

Non-Normal Distributions
[Figure: two histograms: y1, "Bimodal Distribution"; y2, "Positively Skewed Dist'n"]

Measures of Location
What is the value of a "typical" observation? May be:
- (Arithmetic) mean
- Median
- Other forms of mean
  - Rarely used
  - Only if data has been transformed

Arithmetic Mean
- Add them up and divide by how many there are.
- $\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Median
- Arrange in increasing order, pick the middle.
- If there is an even number of observations, take the mean of the middle two.
- Ignores the precise magnitude of most observations
- Contains less information than the mean
- May be useful if there are outliers
- Less easy to use mathematically.
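The two definitions above translate directly into code. This is a minimal Python sketch with made-up sample values. The function names are chosen for illustration.

```python
def arithmetic_mean(values):
    """Add the values up and divide by how many there are."""
    return sum(values) / len(values)


def median(values):
    """Arrange in increasing order and pick the middle value.

    With an even number of observations, take the mean of the middle two.
    """
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


even_sample = [3, 1, 4, 2]             # hypothetical data, even count
with_outlier = [1, 2, 3, 4, 100]       # hypothetical data with one outlier

print(arithmetic_mean(even_sample), median(even_sample))
print(arithmetic_mean(with_outlier), median(with_outlier))
```

In the second sample the single outlier pulls the mean well above most observations, while the median only depends on the ordering of the values.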

Mean vs. Median
- Consider this series of durations of absence from work due to sickness (in days):
  1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 5, 6, 6, 6, 6, 7, 8, 10, 10, 38, 80
- Mean = 10
- Median = 5
- Very few observations are as large as the mean: the median is more "typical".

Percentiles
- The xth percentile is the value than which x% of observations are smaller and (100 - x)% are larger.
- The median is the 50th percentile.
- Other centiles can easily be calculated, e.g. 5th, 25th etc.

Measures of Variation
How close to the "typical" value are other values?
- Range
- Inter-quartile range
- Variance

Simple Measures of Variation
Range
- (Largest measurement) - (smallest measurement)
- Depends on only two measurements
- Can only increase as you add more to the sample
Inter-quartile range
- (75th centile) - (25th centile)
- Less sensitive to extreme values
- Need fairly large numbers of observations
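The sickness-absence example can be checked with Python's statistics module. This is only a sketch: quartile definitions vary between packages, and the quartiles below use the module's default "exclusive" method, which is an assumption and need not match what Stata reports.

```python
from statistics import mean, median, quantiles

# Durations of absence from work due to sickness (days), from the slide.
days = [1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 5, 6, 6, 6, 6, 7, 8, 10, 10, 38, 80]

q1, q2, q3 = quantiles(days, n=4)      # 25th, 50th and 75th percentiles
data_range = max(days) - min(days)     # depends only on the two extreme values
iqr = q3 - q1                          # much less affected by the values 38 and 80
above_mean = sum(1 for d in days if d > mean(days))

print("mean:", mean(days), "median:", median(days))
print("range:", data_range, "IQR:", iqr)
print("observations above the mean:", above_mean, "of", len(days))
```

The range is driven entirely by the single 80-day absence, whereas the inter-quartile range describes the spread of the middle half of the data.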

Standard Deviation
- Standard deviation = $\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 / n}$
- Nearly the average difference from the mean
- Uses information from every observation
- Not robust to outliers
- Variance is easy to use mathematically
- Standard deviation is in the same units as the observations

Summary Statistics in Stata
- summarize varlist will give mean, SD, min and max
- summarize varlist, detail also gives percentiles
- tabstat or table can produce tables of summary statistics

Table 1
- Quantitative variables
  - Need a measure of location & variation
  - Normal variables: mean and SD
  - Skewed variables: median and IQR
  - Need to give units
- Qualitative variables
  - Number and % in each category

Example
  Age in years: Mean (SD)              63 (7.9)
  Spine BMD in g/cm^2: Median (IQR)    1.05 (0.78, 1.30)
  Gender: n (%)
    Male                               1537 (44)
    Female                             1924 (56)
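The standard deviation formula on the slide divides by n. A short Python sketch shows the calculation step by step, using made-up data. For comparison it also prints the n - 1 version, which is what most software reports as the sample standard deviation. The function name sd_divide_by_n is hypothetical.

```python
import math
from statistics import pstdev, stdev


def sd_divide_by_n(values):
    """Square root of the mean squared difference from the mean (divisor n)."""
    n = len(values)
    x_bar = sum(values) / n
    variance = sum((x - x_bar) ** 2 for x in values) / n
    return math.sqrt(variance)


# Hypothetical data, for illustration only.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

print(sd_divide_by_n(sample))  # divisor n, as on the slide
print(pstdev(sample))          # library equivalent of the divisor-n formula
print(stdev(sample))           # divisor n - 1, slightly larger
```

The two versions differ noticeably only for small samples. With large n the choice of divisor makes little practical difference.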

The Box and Whisker Plot
- Very efficient summary of distribution:
  - Shows median, lower and upper quartiles (25th and 75th percentiles).
  - Also shows range of "normal" values and individual "unusual" values.
- Definitions of "normal" and "unusual" differ.
- Will demonstrate skewness, not bimodality.
- Stata command: graph box varname, [by(groupname)]

Box and Whisker Plots
[Figure: box plots of a normal distribution and a positively skewed distribution]

Transforming Data
- Skewed distributions may be made symmetric by a transformation.
- Taking logs is the most common.
- Other transformations (e.g. square root, reciprocal) can be used, but can be very difficult to interpret.
- May be better to transform back to original units to present results.
- Geometric mean is the back-transformation of the mean of log-transformed data.

Further Reading
- Edward R. Tufte, The Visual Display of Quantitative Information, was the classic text on statistical graphs.
- Huge data visualisation industry now
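The statement that the geometric mean is the back-transformed mean of the logged data can be verified numerically. This is a small Python sketch with made-up, positively skewed values. The data must be strictly positive for the log transformation to work.

```python
import math
from statistics import mean, median

# Hypothetical positively skewed data, for illustration only.
values = [1, 2, 4, 8, 16, 32, 64]

logs = [math.log(v) for v in values]   # log-transform (requires values > 0)
geometric_mean = math.exp(mean(logs))  # back-transform the mean of the logs

print("arithmetic mean:", mean(values))
print("median:", median(values))
print("geometric mean:", geometric_mean)
```

On the original scale the arithmetic mean is pulled upwards by the large values, while on the log scale these data are symmetric, so the geometric mean agrees with the median (up to floating-point rounding).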