Statistics I Chapter 2: Analysis of univariate data
Numerical summary
Central tendency: mean, median, mode
Location: quartiles, percentiles
Spread: range, interquartile range, variance, standard deviation, coefficient of variation
Form: coefficient of asymmetry, coefficient of kurtosis
Descriptive statistics
What are they useful for?
Can we calculate them for all types of variables?
Which are the most useful in each case?
How can we use the calculator or Excel?
Measures of central tendency
The mean
The median
The mode
Central tendency: the (arithmetic) mean
The mean is the average of all the data:
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + \dots + x_n}{n}$
It is the most common measure of location.
It is the center of gravity of the data.
It can be calculated only for quantitative variables.
The mean: example
For the experience (in years) of the 46 professionals of a computer company, what is the mean?
$\bar{x} = \frac{1 + 1 + 1 + 1 + 1 + 2 + 2 + 2 + 2 + \dots + 17 + 20}{46} = 7.5$ years
How can we calculate it using the absolute frequency table? And using the relative one?

Experience, $x_i$   Absolute freq., $n_i$   Relative freq., $f_i$
1                   5                       0.109
2                   4                       0.087
3                   4                       0.087
4                   4                       0.087
5                   3                       0.065
6                   4                       0.087
7                   1                       0.022
8                   4                       0.087
10                  4                       0.087
11                  2                       0.043
12                  2                       0.043
13                  2                       0.043
14                  1                       0.022
15                  1                       0.022
16                  3                       0.065
17                  1                       0.022
20                  1                       0.022
Total               46                      1
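The two questions above can be checked with a short Python sketch. It assumes the frequency table is typed in as a dictionary (the variable names are illustrative, not part of the course material) and computes the mean once from the absolute frequencies and once from the relative ones.

```python
# Frequency table from the example: experience (years) -> number of employees.
freq = {1: 5, 2: 4, 3: 4, 4: 4, 5: 3, 6: 4, 7: 1, 8: 4, 10: 4,
        11: 2, 12: 2, 13: 2, 14: 1, 15: 1, 16: 3, 17: 1, 20: 1}

n = sum(freq.values())  # total number of observations

# Mean from absolute frequencies: sum(x_i * n_i) / n
mean_abs = sum(x * ni for x, ni in freq.items()) / n

# Mean from relative frequencies: sum(x_i * f_i), with f_i = n_i / n
mean_rel = sum(x * (ni / n) for x, ni in freq.items())

print(n, mean_abs, round(mean_rel, 6))
```

Both routes give the same value because $f_i = n_i / n$; the rounded relative frequencies printed in the table would give only an approximate result.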
The mean with grouped data
The formula is the same, but each interval is represented by its center (midpoint).
For the salary of the 46 professionals of a computer company, what is the mean?
Note: the mean salary computed from the raw data equals 17250.413
The mean: properties
Linearity: if $y = a + bx$, then $\bar{y} = a + b\bar{x}$.
If the salaries of the 46 professionals are increased by 2 %, how does the mean salary change? Afterwards, each salary is reduced by 100 dollars; what is the final mean salary?
Disadvantage: it is affected by extreme values (outliers).
Example: X: 3, 1, 5, 4, 2 and Y: 3, 1, 50, 4, 2
$\bar{x} = \frac{3 + 1 + 5 + 4 + 2}{5} = 3$
$\bar{y} = \frac{3 + 1 + 50 + 4 + 2}{5} = 12$
A single extreme value has multiplied the mean by 4!
When the data are skewed, a robust alternative measure of central tendency is more appropriate.
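As a rough illustration of both properties, the sketch below recomputes the outlier example and applies linearity to the salary question. It assumes the raw-data mean salary of 17250.413 quoted on the grouped-data slide, and reads "increased by 2 %, then reduced by 100" as $y = -100 + 1.02x$.

```python
from statistics import mean

# Outlier effect: X and Y differ in a single observation.
x = [3, 1, 5, 4, 2]
y = [3, 1, 50, 4, 2]
mean_x, mean_y = mean(x), mean(y)

# Linearity: if y = a + b*x, then mean(y) = a + b*mean(x).
a, b = -100, 1.02            # 2 % raise followed by a 100-dollar reduction
mean_salary = 17250.413      # raw-data mean quoted on the grouped-data slide
final_mean = a + b * mean_salary

# The same identity checked directly on the small sample X.
transformed = [a + b * v for v in x]
gap = abs(mean(transformed) - (a + b * mean_x))

print(mean_x, mean_y, round(final_mean, 3), gap < 1e-9)
```

The point of the linearity property is that the final mean can be obtained from the old mean alone, without touching the 46 individual salaries.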
Central tendency: the median
The median is the most central datum.
1. Order the data from smallest to largest
2. Include repetitions
3. The median is the physical centre
Example with n = 11 (odd): 1 1 1 3 3 5 5 7 8 8 9, so $M = 5$
Example with n = 10 (even): 1 1 1 3 3 5 5 7 8 8, so $M = \frac{3 + 5}{2} = 4$
Median
For the list ordered from smallest to largest, $x_{(1)}, x_{(2)}, \dots, x_{(n)}$:
$M = x_{((n+1)/2)}$ if n is odd
$M = \frac{x_{(n/2)} + x_{(n/2+1)}}{2}$ if n is even
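The odd/even rule can be sketched in a few lines of Python. The function name `median` is just an illustrative choice (the standard library offers `statistics.median` with the same behaviour); the two calls reuse the lists from this slide.

```python
def median(data):
    """Median by the odd/even rule on the ordered list x_(1), ..., x_(n)."""
    xs = sorted(data)              # order the data, keeping repetitions
    n = len(xs)
    if n == 0:
        raise ValueError("median of empty data")
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]             # x_((n+1)/2) in 1-based indexing
    # n even: average of x_(n/2) and x_(n/2+1)
    return (xs[mid - 1] + xs[mid]) / 2

m_odd = median([1, 1, 1, 3, 3, 5, 5, 7, 8, 8, 9])   # n = 11
m_even = median([1, 1, 1, 3, 3, 5, 5, 7, 8, 8])     # n = 10
print(m_odd, m_even)
```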
The median via the table of frequencies
Columns: $x_i$ = experience, $n_i$ = absolute frequency, $f_i$ = relative frequency, $N_i$ = cumulative absolute frequency, $F_i$ = cumulative relative frequency.

$x_i$  $n_i$  $f_i$   $N_i$  $F_i$
1      5      0.109   5      0.109
2      4      0.087   9      0.196
3      4      0.087   13     0.283
4      4      0.087   17     0.370
5      3      0.065   20     0.435   (< 0.5)
6      4      0.087   24     0.522   (> 0.5), so M = 6
7      1      0.022   25     0.543
8      4      0.087   29     0.630
9      0      0       29     0.630
10     4      0.087   33     0.717
11     2      0.043   35     0.761
12     2      0.043   37     0.804
13     2      0.043   39     0.848
14     1      0.022   40     0.870
15     1      0.022   41     0.891
16     3      0.065   44     0.957
17     1      0.022   45     0.978
18     0      0       45     0.978
19     0      0       45     0.978
20     1      0.022   46     1.000
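A minimal sketch of how the median can be read off the cumulative absolute frequencies $N_i$: the k-th ordered observation is the first value whose $N_i$ reaches k. The helper names are hypothetical, and the values are assumed to be listed in increasing order.

```python
from itertools import accumulate

values = [1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 20]
counts = [5, 4, 4, 4, 3, 4, 1, 4, 4, 2, 2, 2, 1, 1, 3, 1, 1]

def median_from_table(values, counts):
    """Median of discrete data given as (value, absolute frequency) pairs."""
    n = sum(counts)
    cumulative = list(accumulate(counts))      # the N_i column

    def value_at(position):
        # First value whose cumulative frequency reaches the 1-based position.
        for v, big_n in zip(values, cumulative):
            if big_n >= position:
                return v

    if n % 2 == 1:
        return value_at((n + 1) // 2)
    return (value_at(n // 2) + value_at(n // 2 + 1)) / 2

m = median_from_table(values, counts)
print(m)
```

Here n = 46 is even, and the 23rd and 24th ordered observations are both equal to 6, which agrees with the value marked in the table.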
The median: properties
Linearity: if $y = a + bx$, then $M_y = a + bM_x$.
If the salaries of the 46 professionals are increased by 2 %, how does the median salary change? Afterwards, each salary is reduced by 100 dollars; what is the final median salary?
Can we calculate the median with the education level data?
Can we calculate the median with the 0-1 position of responsibility variable?
Advantage: it is not affected by outliers.
Example: X: 3, 1, 5, 4, 2 and Y: 3, 1, 50, 4, 2
$M_x = 3$, $M_y = 3$
When the data are skewed, the median is a better measure of central tendency than the mean.
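The robustness claim can be checked side by side with the mean using Python's standard `statistics` module. This is only a sketch on the two small samples from this slide; the transformation $y = -100 + 1.02x$ is one reading of the salary question.

```python
from statistics import mean, median

x = [3, 1, 5, 4, 2]
y = [3, 1, 50, 4, 2]      # same data with one outlier

median_x, median_y = median(x), median(y)   # unchanged by the outlier
mean_x, mean_y = mean(x), mean(y)           # the mean moves a lot

# Linearity of the median under y = a + b*x.
a, b = -100, 1.02
transformed = [a + b * v for v in x]
gap = abs(median(transformed) - (a + b * median_x))

print(median_x, median_y, mean_x, mean_y, gap < 1e-9)
```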
The median and the mean for asymmetric data
Annual gross salary in 2014, Encuesta de Estructura Salarial 2014 (Wage Structure Survey 2014), I.N.E.
"The difference between the mean salary and the median salary is explained by the fact that very high salaries strongly influence the mean, even though they correspond to few workers." (From the INE press release of 28 October 2016)
Central tendency: the mode
The mode is the most frequent value.
The mode of the variable experience in the example of the 46 professionals is 1 year, with an absolute frequency of 5 employees.
The values 2, 3, 4, 6, 8 and 10 each have an absolute frequency of 4 employees.
Central tendency: the mode
Does this definition make sense with the education level data?
Does this definition make sense with the 0-1 position of responsibility variable?
Central tendency: the mode
Does this definition make sense with continuous data?
With continuous data grouped into intervals, we use the modal interval (the interval with the highest frequency).
The mode: properties
It can be calculated for both qualitative and quantitative variables. Indeed, it is the only one of the three measures (mean, median, mode) that makes sense for nominal qualitative variables.
It is not affected by outliers.
There can be no mode.
There can be more than one mode: bimodal, trimodal, multimodal.
What can this indicate?
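A small Python sketch of these properties. The first part uses the experience frequency table from the example; the second part uses a made-up list of education levels (hypothetical data, not from the course) to show a nominal variable with two modes, via `statistics.multimode` from the standard library.

```python
from statistics import multimode

# Experience (years) -> absolute frequency, from the 46-professionals example.
freq = {1: 5, 2: 4, 3: 4, 4: 4, 5: 3, 6: 4, 7: 1, 8: 4, 10: 4,
        11: 2, 12: 2, 13: 2, 14: 1, 15: 1, 16: 3, 17: 1, 20: 1}

mode_value = max(freq, key=freq.get)   # value with the highest frequency
mode_count = freq[mode_value]

# Hypothetical nominal data (illustration only): two categories tie.
education = ["primary", "secondary", "secondary", "university", "university"]
modes = multimode(education)           # all most-frequent categories

print(mode_value, mode_count, modes)
```

No arithmetic is performed on the categories, which is why the mode remains meaningful for nominal variables.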
Location measures
Quartiles
Percentiles
Location measures: quartiles and percentiles
Quartiles split the ranked data into four segments with an equal number of values per segment.
Percentiles split the ranked data into a hundred segments with an equal number of values per segment.
1. Order the data from smallest to largest
2. Include repetitions
3. Select each quartile (percentile) according to its position:
   The first quartile $Q_1$ has position $\frac{1}{4}(n + 1)$.
   The second quartile $Q_2$ (= median) has position $\frac{1}{2}(n + 1)$.
   The third quartile $Q_3$ has position $\frac{3}{4}(n + 1)$.
   The k-th percentile $P_k$ has position $\frac{k(n + 1)}{100}$, $k = 1, \dots, 99$.
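The position rule can be sketched as follows. One point is an assumption rather than something stated on the slide: when the position $k(n + 1)/100$ is not a whole number, this sketch interpolates linearly between the two neighbouring ordered values. The function name `percentile` is illustrative, and the data are the 46 experience values from the earlier example.

```python
def percentile(data, k):
    """k-th percentile using the position k(n + 1)/100 in the ordered data.

    Assumption: fractional positions are resolved by linear interpolation
    between the two neighbouring ordered observations.
    """
    xs = sorted(data)
    n = len(xs)
    position = k * (n + 1) / 100          # 1-based, possibly fractional
    position = min(max(position, 1), n)   # keep the position inside 1..n
    lower = int(position)
    fraction = position - lower
    upper = min(lower + 1, n)
    return xs[lower - 1] + fraction * (xs[upper - 1] - xs[lower - 1])

# Experience data rebuilt from the frequency table (value -> frequency).
freq = {1: 5, 2: 4, 3: 4, 4: 4, 5: 3, 6: 4, 7: 1, 8: 4, 10: 4,
        11: 2, 12: 2, 13: 2, 14: 1, 15: 1, 16: 3, 17: 1, 20: 1}
experience = [x for x, ni in freq.items() for _ in range(ni)]

q1, q2, q3 = (percentile(experience, k) for k in (25, 50, 75))
print(q1, q2, q3)
```

With n = 46 the positions are 11.75, 23.5 and 35.25. Other software may use slightly different conventions, so small discrepancies between tools are to be expected.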
Quartiles: example
Percentiles: example
Measures of spread
The range and the interquartile range
The variance and the standard deviation
The coefficient of variation
Variation: range and interquartile range (IQR)
The range is the simplest measure of variation: $R = x_{\max} - x_{\min}$
It ignores the way the data are distributed.
It is sensitive to outliers.
Example: given observations 3, 1, 5, 4, 2, $R = 5 - 1 = 4$
Example: given observations 3, 1, 5, 4, 100, $R = 100 - 1 = 99$
The interquartile range (IQR) can eliminate some outlier problems: discard the high and low observations and calculate the range of the middle 50 % of the data.
$IQR = \text{3rd quartile} - \text{1st quartile} = Q_3 - Q_1$
Variation: interquartile range and boxplot
Outliers are observations that fall
  below the value of $Q_1 - 1.5\,IQR$, or
  above the value of $Q_3 + 1.5\,IQR$.
For extreme outliers, replace 1.5 by 3 in the above definition.
[Figure: boxplot with $x_{\min} = 12$, $Q_1 = 24$, median ($Q_2$) $= 31$, $Q_3 = 42$, $x_{\max} = 58$; each of the four segments holds 25 % of the data; IQR = 42 - 24 = 18]
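As a sketch of the outlier rule, the code below computes the two fences for the boxplot values shown on this slide ($Q_1 = 24$, $Q_3 = 42$). The function names and the list of test observations are illustrative assumptions.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(data, q1, q3, k=1.5):
    """Observations strictly outside the fences."""
    lower, upper = outlier_fences(q1, q3, k)
    return [x for x in data if x < lower or x > upper]

q1, q3 = 24, 42                            # quartiles read off the boxplot
mild = outlier_fences(q1, q3)              # k = 1.5
extreme = outlier_fences(q1, q3, k=3)      # k = 3 for extreme outliers

# Hypothetical observations, only to show the rule in action.
flagged = find_outliers([12, 24, 31, 42, 58, 75, 100], q1, q3)
print(mild, extreme, flagged)
```

For these quartiles the minimum 12 and the maximum 58 lie inside the fences, so the boxplot on the slide shows no outliers.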
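Measure of variation: variance
The variance is the average of the squared deviations of the values from the mean.
Population variance:
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
Sample variance (divided by n):
$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n} = \frac{\sum_{i=1}^{n} x_i^2 - n(\bar{x})^2}{n}$ (the second form is faster to calculate)
Sample quasi-variance, or corrected sample variance (divided by n - 1):
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - n(\bar{x})^2}{n - 1}$
They are related via $\hat{\sigma}^2 = \frac{n - 1}{n} s^2$.
If a, b ($b \neq 0$) are real numbers and $y = a + bx$, then $s_y^2 = b^2 s_x^2$.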
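The definitions can be checked numerically with a short sketch. It uses the small sample 3, 1, 5, 4, 2 from the earlier examples; the function names and the constants a = 10, b = 2 are illustrative choices.

```python
def sample_variance(xs):
    """Sample variance: squared deviations from the mean, divided by n."""
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / n

def quasi_variance(xs):
    """Sample quasi-variance: squared deviations, divided by n - 1."""
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / (n - 1)

x = [3, 1, 5, 4, 2]
n = len(x)
var_n = sample_variance(x)       # divided by n
var_n1 = quasi_variance(x)       # divided by n - 1

# Relation between the two: var_n = (n - 1) / n * var_n1
relation_gap = abs(var_n - (n - 1) / n * var_n1)

# Linear transformation y = a + b*x: the shift a drops out, b is squared.
a, b = 10, 2
y = [a + b * v for v in x]
transform_gap = abs(quasi_variance(y) - b ** 2 * var_n1)

print(var_n, var_n1, relation_gap < 1e-12, transform_gap < 1e-12)
```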
Measure of variation: standard deviation (SD)
The most commonly used measure of spread
Shows variation about the mean
Population standard deviation, sample standard deviation and sample quasi-standard deviation are, respectively,
$\sigma = \sqrt{\sigma^2}$, $\hat{\sigma} = \sqrt{\hat{\sigma}^2}$, $s = \sqrt{s^2}$
Has the same units as the original data, whilst the variance is in squared units
Variance and SD are both affected by outliers
Calculating variance and standard deviation
Example: X: 11, 12, 13, 16, 16, 17, 18, 21; Y: 14, 15, 15, 15, 16, 16, 16, 17; Z: 11, 11, 11, 12, 19, 20, 20, 20
$\bar{x} = \frac{124}{8} = 15.5$, $\bar{y} = \frac{124}{8} = 15.5$, $\bar{z} = \frac{124}{8} = 15.5$
$\sum_{i=1}^{n} x_i^2 = 11^2 + 12^2 + \dots + 21^2 = 2000$
$\sum_{i=1}^{n} y_i^2 = 14^2 + 15^2 + \dots + 17^2 = 1928$
$\sum_{i=1}^{n} z_i^2 = 11^2 + 11^2 + \dots + 20^2 = 2068$
$s_x^2 = \frac{\sum_{i=1}^{n} x_i^2 - n(\bar{x})^2}{n - 1} = \frac{2000 - 8(15.5)^2}{8 - 1} = \frac{78}{7} = 11.1429$, so $s_x = 3.3381$
$s_y^2 = \frac{1928 - 8(15.5)^2}{8 - 1} = \frac{6}{7} = 0.8571$, so $s_y = 0.9258$
$s_z^2 = \frac{2068 - 8(15.5)^2}{8 - 1} = \frac{146}{7} = 20.8571$, so $s_z = 4.5670$
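The hand calculation can be reproduced with a few lines of Python. This sketch follows the shortcut formula (sum of squares minus $n\bar{x}^2$, divided by n - 1); the function name is an illustrative choice.

```python
from math import sqrt

samples = {
    "X": [11, 12, 13, 16, 16, 17, 18, 21],
    "Y": [14, 15, 15, 15, 16, 16, 16, 17],
    "Z": [11, 11, 11, 12, 19, 20, 20, 20],
}

def quasi_variance_shortcut(xs):
    """(sum of x_i^2 - n * mean^2) / (n - 1), as in the worked example."""
    n = len(xs)
    mean = sum(xs) / n
    sum_of_squares = sum(x * x for x in xs)
    return (sum_of_squares - n * mean ** 2) / (n - 1)

results = {}
for name, xs in samples.items():
    s2 = quasi_variance_shortcut(xs)
    results[name] = (sum(xs) / len(xs), s2, sqrt(s2))   # mean, s^2, s
    print(name, results[name][0], round(s2, 4), round(sqrt(s2), 4))
```

All three samples have the same mean, so the differences in $s$ come only from how the values are spread around 15.5.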
Comparing standard deviations
Example cont.: X: 11, 12, 13, 16, 16, 17, 18, 21; Y: 14, 15, 15, 15, 16, 16, 16, 17; Z: 11, 11, 11, 12, 19, 20, 20, 20
[Figure: dot plots of X, Y and Z on a common axis from 11 to 21]
X: $\bar{x} = 15.5$, $s_x = 3.3$
Y: $\bar{y} = 15.5$, $s_y = 0.9$
Z: $\bar{z} = 15.5$, $s_z = 4.6$
Measure of variation: coefficient of variation (CV)
Measures relative variation and is defined as $CV = \frac{s}{\bar{x}}$
Is a unitless number (sometimes given as a percentage)
Shows variation relative to the mean
Example:
Stock A: average price last year = 50, standard deviation = 5
Stock B: average price last year = 100, standard deviation = 5
$CV_A = \frac{5}{50} = 0.10$, $CV_B = \frac{5}{100} = 0.05$
Both stocks have the same SD, but stock B is less variable relative to its mean price.
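The stock comparison amounts to one division per stock, sketched below. The function name is illustrative, and the sketch assumes a positive mean, as in the example.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation relative to the mean (assumes mean > 0)."""
    return sd / mean

cv_a = coefficient_of_variation(5, 50)     # stock A
cv_b = coefficient_of_variation(5, 100)    # stock B

print(f"CV_A = {cv_a:.0%}, CV_B = {cv_b:.0%}")
```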
Numerical summaries and frequency tables. Standardization.
If the data are discrete, with k distinct values $x_1, \dots, x_k$ and absolute frequencies $n_1, \dots, n_k$, then
$\bar{x} = \frac{\sum_{i=1}^{k} x_i n_i}{n}$ and $s^2 = \frac{\sum_{i=1}^{k} x_i^2 n_i - n\bar{x}^2}{n - 1}$
If the data are continuous, we replace $x_i$ in the above definition by the midpoints of the class intervals.
To standardize a variable x means to calculate $\frac{x - \bar{x}}{s}$.
If you apply this formula to all observations $x_1, \dots, x_n$ and call the transformed values $z_1, \dots, z_n$, then the z's have mean zero and standard deviation one.
Standardization = finding the z-score
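Both ideas on this slide can be tried out in one sketch. It takes sample Y from the standard deviation example (14, 15, 15, 15, 16, 16, 16, 17), written as a frequency table, computes $\bar{x}$ and $s^2$ from the table, and then checks that the z-scores have mean zero and quasi-standard deviation one. Variable names are illustrative.

```python
from math import sqrt

# Sample Y as a frequency table: value -> absolute frequency.
freq = {14: 1, 15: 3, 16: 3, 17: 1}

n = sum(freq.values())
mean = sum(x * ni for x, ni in freq.items()) / n
s2 = (sum(x * x * ni for x, ni in freq.items()) - n * mean ** 2) / (n - 1)
s = sqrt(s2)

# Standardize every observation: z = (x - mean) / s.
data = [x for x, ni in freq.items() for _ in range(ni)]
z = [(x - mean) / s for x in data]

z_mean = sum(z) / n
z_sd = sqrt(sum((v - z_mean) ** 2 for v in z) / (n - 1))

print(mean, round(s2, 4), round(z_mean, 10), round(z_sd, 10))
```

The standard deviation of the z's equals one here because the same divisor (n - 1) is used both in $s$ and when measuring the spread of the z's.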
Measures of form
Fisher's coefficient of asymmetry
Fisher's coefficient of kurtosis
Empirical rule
Shape: comparing mode, mean and median
Three types of distributions:
Skewed to the left: Mean < Median < Mode ($\bar{x} < M$)
Symmetric: Mean = Median = Mode ($\bar{x} = M$)
Skewed to the right: Mode < Median < Mean ($M < \bar{x}$)
Note: the symmetric distribution in the middle panel of the figure is known as bell-shaped or normal
Measures of form: asymmetry
Fisher's coefficient of asymmetry: $\gamma_1 = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^3}{S^3}$
The data are skewed to the right (positive asymmetry) if $\gamma_1 > 0$, and skewed to the left (negative asymmetry) if $\gamma_1 < 0$.
[Figure: two histograms. Right skew: $\gamma_1 = 2.236$. Left skew: $\gamma_1 = -1.401$.]
Measures of form: kurtosis
Fisher's coefficient of kurtosis: $\gamma_2 = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^4}{S^4} - 3$
For the standard normal, $\gamma_2 = 0$.
If $\gamma_2 > 0$ the distribution is leptokurtic (sharper than the standard normal); if $\gamma_2 < 0$ it is platykurtic.
[Figure: two density plots, a leptokurtic distribution and a platykurtic distribution]
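Both Fisher coefficients can be computed directly from their definitions, as in the sketch below. One assumption should be flagged: the slides do not say whether $S$ divides by n or by n - 1, and this sketch takes $S$ to be the standard deviation with divisor n. The function name and the first sample are illustrative; the second sample is the outlier example 3, 1, 50, 4, 2 used earlier.

```python
from math import sqrt

def fisher_coefficients(xs):
    """Return (gamma_1, gamma_2): asymmetry and kurtosis coefficients.

    Assumption: S is the standard deviation computed with divisor n.
    """
    n = len(xs)
    m = sum(xs) / n
    big_s = sqrt(sum((x - m) ** 2 for x in xs) / n)
    gamma_1 = (sum((x - m) ** 3 for x in xs) / n) / big_s ** 3
    gamma_2 = (sum((x - m) ** 4 for x in xs) / n) / big_s ** 4 - 3
    return gamma_1, gamma_2

g1_sym, g2_sym = fisher_coefficients([1, 2, 3, 4, 5])       # symmetric
g1_right, g2_right = fisher_coefficients([3, 1, 50, 4, 2])  # one large value

print(round(g1_sym, 4), round(g2_sym, 4), round(g1_right, 4))
```

The symmetric sample gives $\gamma_1 = 0$, while the single large value 50 pulls $\gamma_1$ above zero, in line with the sign rule for right skew.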
Empirical rule
If the data are bell-shaped (normal), that is, symmetric and with light tails, the following rule holds approximately:
68 % of the data are in $(\bar{x} - 1s, \bar{x} + 1s)$
95 % of the data are in $(\bar{x} - 2s, \bar{x} + 2s)$
99.7 % of the data are in $(\bar{x} - 3s, \bar{x} + 3s)$
Note: this rule is also known as the 68-95-99.7 rule
Example: we know that, for a sample of 100 observations, the mean is 40 and the quasi-standard deviation is 5. Assuming that the data are bell-shaped, give the limits of an interval that captures 95 % of the observations.
95 % of the $x_i$'s are in $(\bar{x} \pm 2s) = (40 \pm 2(5)) = (30, 50)$
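The example can be reproduced with a tiny helper that returns $(\bar{x} - ks, \bar{x} + ks)$. This is only a sketch: the function name is illustrative, and the percentages are the approximate ones from the rule, valid under the bell-shape assumption.

```python
def empirical_interval(mean, s, k):
    """Interval (mean - k*s, mean + k*s) used by the 68-95-99.7 rule."""
    return (mean - k * s, mean + k * s)

# Approximate share of bell-shaped data within k standard deviations.
coverage = {1: 68.0, 2: 95.0, 3: 99.7}

mean, s = 40, 5   # values from the example
for k, pct in coverage.items():
    low, high = empirical_interval(mean, s, k)
    print(f"about {pct} % of the data in ({low}, {high})")
```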