Lecture Week 4 Inspecting Data: Distributions

Introduction to Research Methods & Statistics, 2013-2014
Hemmo Smit

So next week
- No lecture & workgroups
- But: practice test online on Blackboard (BB)
- Enter data for your own research
- Practice SPSS skills with your own data

Overview
- Descriptive research
- Describing and presenting data
- Frequency distributions
- Graphical displays (1)
- Measures of central tendency and variability
- Graphical displays (2): boxplots
Read: Leary, Chapter 6; Howell, Chapter 2

Types of descriptive research
- Survey research: attitudes, lifestyles, behaviors, problems
- Demographic research: patterns of basic life events (birth, marriage, migration, death)
- Epidemiological research: occurrence of disease and death

3 types of surveys
- Cross-sectional: one-shot cross-section of the population
- Successive independent samples: changes over time, with different respondents each time! Are the samples comparable?
- Longitudinal (panel survey design): changes over time, with the same respondents measured more than once! Risk: drop-out

Describing and presenting data
3 criteria for a good description:
1) Accurate
2) Concise
3) Comprehensible
Trade-off between the criteria:
- Loss of information
- Possible distortion
Data can be presented in numerical and graphical format.
TIP: Always start with graphs.
Beware: which scale of measurement?

How to describe a distribution?
A) Overall pattern
1) Shape
- number of peaks (uni-, bi- or multi-modal)?
- symmetrical or skewed?
2) Central tendency / location: midpoint
3) Spread: a little or a lot?
B) Deviations from the pattern
- Outliers: observations that lie far from the majority
- Tails: thick or thin?

Frequency distributions: Example
How do children recall stories?
- Respondents: 25 children
- Task: tell the researcher about a movie
- Dependent variable: number of "and then" statements
(see Howell, Exercise 2.1, p. 55)

Raw data and frequency distributions
Table 1. Number of "and then" statements (raw data)
18 17 16 18 15
15 18 16 20 18
22 20 17 21 17
19 17 21 20 19
18 12 23 20 20
Table 2. Number of "and then" statements (frequency distribution)
Score     f      P
12        1   0.04
15        2   0.08
16        2   0.08
17        4   0.16
18        5   0.20
19        2   0.08
20        5   0.20
21        2   0.08
22        1   0.04
23        1   0.04
Total    25   1.00

Absolute and relative frequencies
Absolute frequency (f) = number of respondents with a given score
- Disadvantage: hard to interpret / compare
Relative frequency (P) = proportion of the total with a given score (P = f / n)
- Advantage: easy to interpret
Note: 0 ≤ P ≤ 1, and P x 100 = %
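As a rough cross-check outside SPSS, the absolute and relative frequencies of Table 2 can be recomputed from the raw scores of Table 1. This is a minimal Python sketch; the helper name `frequency_table` is made up for illustration and is not part of the course material.

```python
from collections import Counter

# Raw data from Table 1: number of "and then" statements per child (n = 25).
scores = [18, 17, 16, 18, 15, 15, 18, 16, 20, 18, 22, 20, 17,
          21, 17, 19, 17, 21, 20, 19, 18, 12, 23, 20, 20]


def frequency_table(values):
    """Return rows (score, f, P) sorted by score, with P = f / n."""
    n = len(values)
    counts = Counter(values)
    return [(score, f, f / n) for score, f in sorted(counts.items())]


table = frequency_table(scores)
for score, f, p in table:
    # One line per score: score, absolute frequency, relative frequency.
    print(f"{score:>5} {f:>3} {p:.2f}")
```

Multiplying each P by 100 gives the percentage column that SPSS reports.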

SPSS: Frequencies - Menu
Analyze > Descriptive Statistics > Frequencies

SPSS: Frequencies Dialog box

SPSS: Frequencies - Output

Grouped frequency distribution (1)
Simple frequency distributions are unclear in case of:
- a small number of participants in each category and/or
- variables with many categories
Solution: grouped frequency table
Distribute the raw data over K class intervals and make a new frequency distribution.
Make sure all intervals are:
- exhaustive and mutually exclusive
- of equal width

Grouped frequency distribution (2)
Rule 1: number of classes $K = \sqrt{n}$
Rule 2: class interval width $I$ = range / number of classes
(Range (R) = highest score - lowest score)
In our example:
- Number of intervals = $\sqrt{25}$ = 5
- Range = 23 - 12 = 11
- Interval width = 11 / 5 = 2.2, so use 2 or 3
Score     f      P
12-14     1   0.04
15-17     8   0.32
18-20    12   0.48
21-23     4   0.16
Total    25   1.00
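The two rules and the grouped table can be sketched in Python as follows. The function `grouped_frequencies` is a hypothetical helper, and the choice of width 3 starting at the lowest score follows the table on the slide (which yields 4 intervals rather than exactly 5).

```python
import math

# Raw data from Table 1 (n = 25).
scores = [18, 17, 16, 18, 15, 15, 18, 16, 20, 18, 22, 20, 17,
          21, 17, 19, 17, 21, 20, 19, 18, 12, 23, 20, 20]


def grouped_frequencies(values, width, start):
    """Count integer scores per class interval [lower, lower + width - 1]."""
    n = len(values)
    rows = []
    lower = start
    while lower <= max(values):
        upper = lower + width - 1
        f = sum(lower <= v <= upper for v in values)
        rows.append((lower, upper, f, f / n))
        lower += width
    return rows


k = round(math.sqrt(len(scores)))        # Rule 1: K = sqrt(n)
value_range = max(scores) - min(scores)  # R = highest - lowest
raw_width = value_range / k              # Rule 2: I = R / K, rounded by hand

# Assumption: width 3 and start at the minimum, as in the slide's table.
grouped = grouped_frequencies(scores, width=3, start=min(scores))
for lower, upper, f, p in grouped:
    print(f"{lower}-{upper} {f:>3} {p:.2f}")
```

The intervals are exhaustive and mutually exclusive by construction, because each new interval starts exactly one score above the previous upper limit.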

SPSS: Grouped frequency distribution (1)

SPSS: Grouped frequency distribution (2)

SPSS: Grouped frequency distribution (3)

SPSS: Grouped frequency distribution (4)

SPSS: Grouped frequency distribution (5)

Cumulative frequency distributions (1)
Class interval   Real lower limit   Real upper limit   Midpoint    f      P      F
12-14            11.5               14.5               13
15-17            14.5               17.5               16
18-20            17.5               20.5               19
21-23            20.5               23.5               22
Total
Real lower limit = lower limit - 0.5
Real upper limit = upper limit + 0.5
Midpoint = (upper limit + lower limit) / 2

Cumulative frequency distributions (2)
Class interval   Real lower limit   Real upper limit   Midpoint    f      P      F
12-14            11.5               14.5               13          1   0.04
15-17            14.5               17.5               16          8   0.32
18-20            17.5               20.5               19         12   0.48
21-23            20.5               23.5               22          4   0.16
Total                                                             25   1.00
F = cumulative relative frequency (CRF): the proportion for the interval plus all previous proportions.

Cumulative frequency distributions (3)
Class interval   Real lower limit   Real upper limit   Midpoint    f      P      F
12-14            11.5               14.5               13          1   0.04   0.04
15-17            14.5               17.5               16          8   0.32   0.36
18-20            17.5               20.5               19         12   0.48   0.84
21-23            20.5               23.5               22          4   0.16   1.00
Total                                                             25   1.00
NB. Also possible: cumulative absolute frequency
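The columns of this table can be rebuilt from the grouped frequencies with a few lines of Python. This is only a sketch: the dictionary keys are illustrative names, and the 0.5 correction assumes scores are whole numbers, as in the example.

```python
from itertools import accumulate

# Grouped frequencies from the slide: (lower limit, upper limit, f).
intervals = [(12, 14, 1), (15, 17, 8), (18, 20, 12), (21, 23, 4)]
n = sum(f for _, _, f in intervals)

# Cumulative absolute frequencies: running total of f.
cumulative_f = list(accumulate(f for _, _, f in intervals))

rows = []
for (lower, upper, f), cum in zip(intervals, cumulative_f):
    rows.append({
        "interval": f"{lower}-{upper}",
        "real_lower": lower - 0.5,        # real lower limit
        "real_upper": upper + 0.5,        # real upper limit
        "midpoint": (lower + upper) / 2,  # (upper + lower) / 2
        "f": f,
        "P": f / n,                       # relative frequency
        "F": cum / n,                     # cumulative relative frequency
    })

for row in rows:
    print(row["interval"], row["real_lower"], row["real_upper"],
          row["midpoint"], row["f"], f"{row['P']:.2f}", f"{row['F']:.2f}")
```

The `cumulative_f` list is the cumulative absolute frequency mentioned in the NB.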

Cumulative frequency distributions (4)
The cumulative relative frequency polygon graphs the probability that someone has a score of X or lower.

Graphical displays: Nominal / Ordinal
[Figure: bar charts (count by score) and pie charts, each shown for raw data (scores 2 to 9) and for grouped data (classes 2-3, 4-5, 6-7, 8-9)]

Graphical displays: Interval
- Histograms
- Stem & Leaf Display
Freq.   Stem & Leaf
1,00    Extremes (=<12,0)
2,00    15 . 00
2,00    16 . 00
4,00    17 . 0000
5,00    18 . 00000
2,00    19 . 00
5,00    20 . 00000
2,00    21 . 00
1,00    22 . 0
1,00    23 . 0
Stem width: 1
Each leaf: 1 case(s)

Histogram: symmetrical or skewed?
- Symmetrical
- Negatively skewed (long tail to the left)
- Positively skewed (long tail to the right)

SPSS: Graphs > Chart Builder / Legacy Dialogs

SPSS: Graphs > Legacy Dialogs

SPSS: Graphs > Chart Builder

Measures of central tendency
1. Mode (Mo) = most common score
2. Median (Mdn) = middle score (50th percentile)
   Median location = $\frac{N + 1}{2}$
3. Mean (M) = average
   $\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}$ or $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
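To see the three measures side by side, here is a minimal sketch using Python's standard `statistics` module on the "and then" data from the frequency distribution example. Using `multimode` rather than `mode` is a choice made here so that tied modes are all reported.

```python
import statistics

# Raw data from the "and then" example (n = 25).
scores = [18, 17, 16, 18, 15, 15, 18, 16, 20, 18, 22, 20, 17,
          21, 17, 19, 17, 21, 20, 19, 18, 12, 23, 20, 20]

n = len(scores)

# Mode: most common score(s); multimode returns every score tied for first.
modes = statistics.multimode(scores)

# Median: the score at location (N + 1) / 2 in the ordered data (N is odd here).
median_location = (n + 1) // 2
median = sorted(scores)[median_location - 1]

# Mean: sum of all scores divided by n.
mean = sum(scores) / n

print("Mo  =", modes)
print("Mdn =", median, "(location", median_location, ")")
print("M   =", mean)
```

Two scores share the highest frequency in these data, so the distribution has two modes. For an even N the median location falls between two scores and `statistics.median` averages them.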

Central tendency and skewness
Shape           Mode   Median   Mean
Positive skew   A      B        C
Symmetrical     A      A        A
Negative skew   C      B        A
(A, B and C are positions on the score axis, ordered from low to high.)

Measures of variability
1. Range (R) = highest score - lowest score
2. Interquartile range (IQR) = Q3 - Q1
3. Standard deviation (s or σ) = spread around the mean
4. Variance (s² or σ²) = spread around the mean, in squared units

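Variance and standard deviation
Score    Deviation          Squared deviation
$x_1$    $x_1 - \bar{x}$    $(x_1 - \bar{x})^2$
$x_2$    $x_2 - \bar{x}$    $(x_2 - \bar{x})^2$
$x_3$    $x_3 - \bar{x}$    $(x_3 - \bar{x})^2$
...      ...                ...
$x_n$    $x_n - \bar{x}$    $(x_n - \bar{x})^2$
Sum      $\sum x_i$         0                  $\sum (x_i - \bar{x})^2$
Variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Standard deviation: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
The standard deviation and variance are:
- only suitable as measures of spread around the mean
- not robust against outliers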
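The deviation table translates directly into a short calculation. This is a sketch in Python, applied to the "and then" scores from the frequency distribution example, that builds the deviation and squared-deviation columns and compares the result with the `statistics` module, which also divides by n - 1.

```python
import math
import statistics

# Raw data from the "and then" example (n = 25).
scores = [18, 17, 16, 18, 15, 15, 18, 16, 20, 18, 22, 20, 17,
          21, 17, 19, 17, 21, 20, 19, 18, 12, 23, 20, 20]

n = len(scores)
mean = sum(scores) / n

# Column 2: deviation of each score from the mean (these sum to 0).
deviations = [x - mean for x in scores]
# Column 3: squared deviations.
squared = [d ** 2 for d in deviations]

variance = sum(squared) / (n - 1)  # s^2
sd = math.sqrt(variance)           # s

print("sum of deviations:", round(sum(deviations), 10))
print("s^2 =", variance, "| statistics.variance:", statistics.variance(scores))
print("s   =", sd, "| statistics.stdev:", statistics.stdev(scores))
```

Because every deviation is squared, a single extreme score contributes disproportionately to the sum, which is why neither measure is robust against outliers.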

Five-number summary and boxplot
The five-number summary consists of:
- Minimum = lowest (non-outlying) score
- Q1 = 25th percentile (25% lower, 75% higher)
- Median (= Q2) = 50th percentile
- Q3 = 75th percentile
- Maximum = highest (non-outlying) score
Graphical display: boxplot

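Boxplot - Example
Data: 3 13 17 19 22 24 25 28 35 39 44 45 83 86 93
Numerical (five-number summary):
- Max = 93 (highest observed score; the highest non-outlying score is 83)
- Q3 = 45
- Mdn = 28
- Q1 = 19
- Min = 3
IQR = 45 - 19 = 26
Rule of thumb: outlier = observation that lies more than 1.5 x IQR above Q3 or below Q1.
Graphical (boxplot):
- Q1 - 1.5 x IQR = 19 - 39 = -20
- Q3 + 1.5 x IQR = 45 + 39 = 84
The scores 86 and 93 lie above 84 and are therefore flagged as outliers.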
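The numbers in this example can be reproduced with a small Python sketch. The helper `five_number_summary` is hypothetical, and it assumes quartiles are taken as the medians of the lower and upper halves of the ordered data, which matches the values on the slide. SPSS and other software may use slightly different quartile definitions.

```python
import statistics

# Data from the boxplot example (n = 15).
data = [3, 13, 17, 19, 22, 24, 25, 28, 35, 39, 44, 45, 83, 86, 93]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) of the observed scores.

    Assumption: Q1 and Q3 are the medians of the lower and upper halves;
    for odd n the middle score is left out of both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (ordered[0], statistics.median(lower), statistics.median(ordered),
            statistics.median(upper), ordered[-1])


minimum, q1, mdn, q3, maximum = five_number_summary(data)
iqr = q3 - q1

# Rule of thumb: more than 1.5 x IQR below Q1 or above Q3 counts as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("five-number summary:", minimum, q1, mdn, q3, maximum)
print("IQR:", iqr, "fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```

In the boxplot the whiskers would run to the most extreme scores inside the fences, with the outliers drawn as separate points.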

Overview
Scale of measurement    Graphical                           CT       Spread
Nominal                 Bar chart (Pie chart)               Mode     ---
Ordinal                 Boxplot                             Median   Range, IQR
Interval (and higher)   Histogram (Stem & Leaf display)     Mean     Standard deviation, Variance
CT = central tendency

What have you learned today?
- What are the various ways to represent distributions numerically?
- What are the various ways to represent distributions graphically?
- How to describe a distribution
- How to create and evaluate various numerical and graphical representations of distributions
- How to determine which numerical and graphical representation is suitable for a variable

Next week
- No lecture and workgroups
- Practice test on Blackboard
- Enter your own data
In two weeks
- Normal distribution and standard scores
- Read: Howell, Chapter 3