Percentiles, STATA, Box Plots, Standardizing, and Other Transformations

Lecture 3

Reading: Sections 5.7 54

Remember, when you finish a chapter make sure not to miss the last couple of boxes: "What Can Go Wrong?" and "Ethics in Action".

Measures of Relative Standing: Percentiles

[Figure: histogram of Inflation Rate, 2011 (vertical axis: Fraction). World Bank data again, n = 174 countries, bin width = 5.]

What is the approximate median (50th percentile)? 20th percentile? 85.64th percentile?

Reading STATA Output

    . su inflation_2011, detail

                           inflation_2011
    -------------------------------------------------------------
          Percentiles      Smallest
     1%    -2.517798      -4.895247
     5%      .922363      -2.517798
    10%      2.75173        -644478       Obs                 174
    25%        32996        -833333       Sum of Wgt.         174

    50%     4.977675                      Mean           6.646499
                            Largest       Std. Dev.       6.77998
    75%       853968         26.921
    90%       123155         332422       Variance       45.96813
    95%     17.71178         477686       Skewness         3.7732
    99%       477686          53287       Kurtosis       22.85972

Median? Range? Sample size?

Lecture 3, Page 1 of 8
          Trips |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |        294       35.85       35.85
              1 |         76        9.27       45.12
              2 |         66        8.05       53.17
              3 |         58        7.07       60.24
              4 |         47        5.73       65.98
              5 |         47        5.73       71.71
              6 |         36        4.39       76.10
              7 |         30        3.66       79.76
              8 |         28        3.41       83.17
              9 |         15        1.83       85.00
             10 |          9        1.10       86.10
             11 |         16        1.95       88.05
             12 |         25        3.05       91.10
             13 |          9        1.10       92.20
             14 |          5        0.61       92.80
             15 |          9        1.10       93.90
             16 |          5        0.61       94.51
             17 |          6        0.73       95.24
             18 |          4        0.49       95.73
             19 |          1        0.12       95.85
             20 |          3        0.37       96.22
             21 |          2        0.24       96.46
             22 |          4        0.49       96.95
             23 |          1        0.12       97.07
             24 |          4        0.49       97.56
             25 |          2        0.24       97.80
             26 |          4        0.49       98.29
             27 |          2        0.24       98.54
             28 |          3        0.37       98.90
             30 |          1        0.12       99.02
             34 |          1        0.12       99.15
             35 |          1        0.12       99.27
             36 |          1        0.12       99.39
             41 |          1        0.12       99.51
             43 |          1        0.12       99.63
             44 |          1        0.12       99.76
             45 |          1        0.12       99.88
             50 |          1        0.12      100.00
    ------------+-----------------------------------
          Total |        820      100.00

What is the median? What is the 75th percentile?

[Figure: discrete histogram (bin width = 1) of Number Fishing Trips]

Reading STATA Output

    . summarize Number_of_Trips, detail;

                          Number_of_Trips
    -------------------------------------------------------------
          Percentiles      Smallest
     1%            0              0
     5%            0              0
    10%            0              0       Obs                 820
    25%            0              0       Sum of Wgt.         820

    50%            2                      Mean            4.52439
                            Largest       Std. Dev.      6.684273
    75%            6             43
    90%           12             44       Variance        44.6795
    95%           17             45       Skewness       2.717188
    99%           30             50       Kurtosis           1381

How can the 10th percentile and the 25th percentile both be zero?
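To make the link between the cumulative column and the percentiles concrete, here is a minimal Python sketch. It uses the frequencies from the fishing-trips tabulation and a simplified rule: the p-th percentile is the smallest value whose cumulative percent reaches p. The function name `percentile_from_freq` is made up for this illustration, and Stata's exact rule can differ slightly when the cumulative percent lands exactly on p.

```python
# Frequencies from the fishing-trips tabulation: trips -> number of respondents.
freq = {0: 294, 1: 76, 2: 66, 3: 58, 4: 47, 5: 47, 6: 36, 7: 30, 8: 28, 9: 15,
        10: 9, 11: 16, 12: 25, 13: 9, 14: 5, 15: 9, 16: 5, 17: 6, 18: 4, 19: 1,
        20: 3, 21: 2, 22: 4, 23: 1, 24: 4, 25: 2, 26: 4, 27: 2, 28: 3, 30: 1,
        34: 1, 35: 1, 36: 1, 41: 1, 43: 1, 44: 1, 45: 1, 50: 1}


def percentile_from_freq(freq, p):
    """Smallest value whose cumulative percent is at least p (simplified rule)."""
    n = sum(freq.values())
    running = 0
    for value in sorted(freq):
        running += freq[value]          # cumulative count up to this value
        if 100 * running / n >= p:      # cumulative percent has reached p
            return value
    raise ValueError("p must be between 0 and 100")


n = sum(freq.values())
for p in (10, 25, 50, 75, 90):
    print(f"{p}th percentile: {percentile_from_freq(freq, p)} trips")
```

Because more than a third of respondents took zero trips, every percentile up to the 35th sits on the value 0 under this rule.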
One Popular Use of Percentiles

Quartiles:
1st quartile: obs. between the 0th and 25th percentiles
2nd quartile: obs. between the 25th and 50th percentiles
3rd quartile: obs. between the 50th and 75th percentiles
4th quartile: obs. between the 75th and 100th percentiles

Quintiles: divide the variable into fifths; e.g. the top quintile includes obs. between the 80th and 100th percentiles

Deciles: divide the variable into tenths; e.g. the bottom decile includes obs. between the 0th and 10th percentiles

Note: You are responsible for knowing the meaning of these terms if they appear on a test, exam, etc.

Practice Reading and Interpreting

Alesina et al. (2001), "Why Doesn't the United States Have a European-Style Welfare State?"

What do these numbers mean? How should they be interpreted?

Interquartile Range (IQR)

Interquartile range: 75th percentile minus 25th percentile
What does it measure? The spread of the middle 50% of observations.
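As a rough illustration of these definitions, the sketch below computes quartile, quintile, and decile cut points and the IQR with Python's standard library. The data are hypothetical (not from the lecture), and `method="inclusive"` is just one of several percentile conventions, so other software may give slightly different cut points.

```python
import statistics

# Hypothetical sample, chosen only to keep the arithmetic easy to follow.
data = [2, 4, 4, 5, 6, 7, 8, 9, 12]

# Three cut points split the data into four quartile groups.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")

# IQR: 75th percentile minus 25th percentile.
iqr = q3 - q1

# Four cut points give quintiles; nine cut points give deciles.
quintile_cuts = statistics.quantiles(data, n=5, method="inclusive")
decile_cuts = statistics.quantiles(data, n=10, method="inclusive")

print("Q1, median, Q3:", q1, median, q3)
print("IQR:", iqr)
print("number of quintile / decile cut points:", len(quintile_cuts), len(decile_cuts))
```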
Boxplot of Inflation Distribution, n = 174 countries

[Figure: box plot of Inflation Rate, 2011, with labels for the Lower Adjacent Value (LAV), 25th percentile, median, 75th percentile, whisker, Upper Adjacent Value (UAV), and outside values]

UAV marks the biggest obs. within 1.5 IQRs of the 75th percentile.

[Figure: panels for three variables, x1, x2, and x3]

[Figure: second set of panels for x1, x2, and x3]
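The adjacent-value rule can be sketched in a few lines of Python. The sample below is hypothetical, the variable names are made up for this illustration, and the quartiles use the standard library's "inclusive" convention, which may not match Stata's percentile rule exactly.

```python
import statistics

# Hypothetical sample with one unusually large value.
data = [1, 3, 4, 5, 6, 7, 8, 9, 30]

q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Fences sit 1.5 IQRs beyond the 25th and 75th percentiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Adjacent values are the most extreme observations still inside the fences;
# the whiskers end there.
uav = max(x for x in data if x <= upper_fence)
lav = min(x for x in data if x >= lower_fence)

# Anything beyond the adjacent values is plotted individually as an outside value.
outside = sorted(x for x in data if x < lav or x > uav)

print("LAV:", lav, "UAV:", uav, "outside values:", outside)
```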
[Figure: third set of panels for x1, x2, and x3]

How would the box plot of the Wisconsin fishing trip data be unusual?

[Figure: discrete histogram (bin width = 1) of Number Fishing Trips]

Outliers

Outliers: extremely large or small values, different from the bulk of the data
Robust: not sensitive to outliers
Is the sample mean a robust measure of central tendency? Is the sample median robust?
However, the mean retains more information from the sample and has useful statistical properties.
Is the IQR robust? The variance?
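One way to explore the robustness questions is to change a single observation into an outlier and see which statistics move. The sketch below does this with a small hypothetical sample; the helper `describe` is invented for the illustration.

```python
import statistics


def describe(data):
    """Return mean, median, IQR (inclusive quartiles), and sample variance."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "iqr": q3 - q1,
        "variance": statistics.variance(data),
    }


clean = [2, 3, 4, 5, 6]            # hypothetical data
with_outlier = [2, 3, 4, 5, 60]    # same data, largest value replaced by an outlier

before = describe(clean)
after = describe(with_outlier)

for name in before:
    print(f"{name:>8}: {before[name]:8.2f} -> {after[name]:8.2f}")
```

In this example the median and IQR do not change at all, while the mean and especially the variance are pulled sharply by the single extreme value.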
Charitable Donors: Stats Can
http://www5.statcan.gc.ca/cansim/a5?lang=eng&id=1112&pattern=1112&searchtypebyvalue=1&p2=35

    Donors and donations                            Notes     2011
    Number of taxfilers                             4         24,841,63
    Number of donors                                2,3       5,79,7
    Percentage of donors aged 0 to 24 years         2,3,6     3
    Percentage of donors aged 25 to 34 years        2,3,6     12
    Percentage of donors aged 35 to 44 years        2,3,6     17
    Percentage of donors aged 45 to 54 years        2,3,6     23
    Percentage of donors aged 55 to 64 years        2,3,6     21
    Percentage of donors aged 65 years and over     2,3,6     25

Note 2: Charitable donor is defined as a taxfiler reporting a charitable donation amount on line 340 of the personal income tax form.

Average Age of Donors?

Section 5.7, "Grouped Data", tells how to approximate the mean & s.d. with grouped data.

    Age group        Assumed age    % of donors
    0 to 24          21             3
    25 to 34         29.5           12
    35 to 44         39.5           17
    45 to 54         49.5           23
    55 to 64         59.5           21
    65 and over      70             25

\[
\text{Mean} \approx 0.03(21) + 0.12(29.5) + 0.17(39.5) + 0.23(49.5) + 0.21(59.5) + 0.25(70) \approx 52 \text{ years}
\]

What if we use 75 years old for the last category? Then mean ≈ 53.5.
What if we use 12 years old for the first category? Then mean ≈ 52.0.

Logic of Calculation: Smaller Example

Survey a random sample of 40 A&S students, asking "How many courses are you currently taking?" A tabulation:

    num_courses |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              2 |          3        7.50        7.50
              4 |          7       17.50       25.00
              5 |         28       70.00       95.00
              6 |          2        5.00      100.00
    ------------+-----------------------------------
          Total |         40      100.00

\[
\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n}
= \frac{\sum_{i=1}^{3} 2 + \sum_{i=1}^{7} 4 + \sum_{i=1}^{28} 5 + \sum_{i=1}^{2} 6}{40}
= \frac{3 \cdot 2 + 7 \cdot 4 + 28 \cdot 5 + 2 \cdot 6}{40}
\]
\[
= 0.075 \cdot 2 + 0.175 \cdot 4 + 0.7 \cdot 5 + 0.05 \cdot 6 = 4.65
\]
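The grouped-data mean is just a weighted sum, which the following Python sketch reproduces for both examples. The function name `grouped_mean` is invented here. Note that the published donor percentages sum to 101 because of rounding; the sketch follows the slide and does not renormalize them.

```python
def grouped_mean(values, shares):
    """Approximate a mean as the sum of (representative value x share of observations)."""
    return sum(v * s for v, s in zip(values, shares))


# Donor ages: one assumed age per bracket and the share of donors in that bracket.
assumed_age = [21, 29.5, 39.5, 49.5, 59.5, 70]
share = [0.03, 0.12, 0.17, 0.23, 0.21, 0.25]   # published percentages / 100

donor_mean = grouped_mean(assumed_age, share)

# Sensitivity checks: a different assumed age for the last or first bracket.
donor_mean_75 = grouped_mean(assumed_age[:-1] + [75], share)
donor_mean_12 = grouped_mean([12] + assumed_age[1:], share)

# Course example: shares are the frequencies divided by n = 40.
courses_mean = grouped_mean([2, 4, 5, 6], [3 / 40, 7 / 40, 28 / 40, 2 / 40])

print(round(donor_mean, 3), round(donor_mean_75, 3), round(donor_mean_12, 3))
print(round(courses_mean, 2))
```

The answer is more sensitive to the assumed age for the open-ended top bracket than for the bottom bracket, simply because a quarter of donors fall in the top bracket and only 3% in the bottom one.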
Similarly for standard deviation

Using the same num_courses tabulation (values 2, 4, 5, 6 with frequencies 3, 7, 28, 2; total 40):

\[
s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}}
= \sqrt{\frac{\sum_{i=1}^{3} (2 - 4.65)^2 + \sum_{i=1}^{7} (4 - 4.65)^2 + \sum_{i=1}^{28} (5 - 4.65)^2 + \sum_{i=1}^{2} (6 - 4.65)^2}{40} \cdot \frac{40}{39}}
\]
\[
= \sqrt{\left[ 0.075 (2 - 4.65)^2 + 0.175 (4 - 4.65)^2 + 0.7 (5 - 4.65)^2 + 0.05 (6 - 4.65)^2 \right] \cdot \frac{40}{39}} = 0.89
\]

And, if you ignore the 40/39, you get 0.88 (very close to the right answer).

Standard Deviation of Age of Donors?

    Age group        Assumed age    % of donors
    0 to 24          21             3
    25 to 34         29.5           12
    35 to 44         39.5           17
    45 to 54         49.5           23
    55 to 64         59.5           21
    65 and over      70             25

\[
s^2 \approx 0.03 (21 - 52)^2 + 0.12 (29.5 - 52)^2 + 0.17 (39.5 - 52)^2 + 0.23 (49.5 - 52)^2 + 0.21 (59.5 - 52)^2 + 0.25 (70 - 52)^2 \approx 210.6 \text{ years}^2
\]
\[
\text{s.d.} \approx \sqrt{210.6} \approx 14.5 \text{ years}
\]

(The mean is shown rounded to 52; the value 210.6 comes from using the unrounded mean, 52.265.)

Standardization ("z-scores")

Standardize:
\[
z = \frac{x - \bar{X}}{s_x}
\]

z: how many s.d.'s a value is from the mean (+ if above; - if below)
Z has a mean of 0 and an s.d. of 1, and no units.

E.g.: mean inflation 6.64, s.d. 6.78; 2.91 in Canada: z = (2.91 - 6.64)/6.78 = -0.55
What does -0.55 mean?

[Figure: two histograms, n = 174 countries: Inflation Rate, 2011 in original units, and Inflation Rate, 2011 standardized (z-scores)]
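The arithmetic on these slides can be checked with a short Python sketch. The numbers come from the slides; the variable names are invented for this illustration.

```python
import math

# Course example: values and their frequencies (n = 40).
values = [2, 4, 5, 6]
counts = [3, 7, 28, 2]
n = sum(counts)

mean = sum(v * c for v, c in zip(values, counts)) / n
sum_sq_dev = sum(c * (v - mean) ** 2 for v, c in zip(values, counts))

s = math.sqrt(sum_sq_dev / (n - 1))      # sample s.d., dividing by n - 1
s_ignoring = math.sqrt(sum_sq_dev / n)   # same thing if the 40/39 factor is ignored

# Donor ages: grouped variance using the unrounded grouped mean.
assumed_age = [21, 29.5, 39.5, 49.5, 59.5, 70]
share = [0.03, 0.12, 0.17, 0.23, 0.21, 0.25]
donor_mean = sum(a * w for a, w in zip(assumed_age, share))
donor_var = sum(w * (a - donor_mean) ** 2 for a, w in zip(assumed_age, share))
donor_sd = math.sqrt(donor_var)

# z-score for Canada's inflation rate, using the mean and s.d. quoted on the slide.
z_canada = (2.91 - 6.64) / 6.78

print(round(s, 2), round(s_ignoring, 2))
print(round(donor_var, 1), round(donor_sd, 1))
print(round(z_canada, 2))
```

A z-score of about -0.55 says Canada's inflation rate was roughly half a standard deviation below the cross-country mean.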
Linear Transformations

A linear transformation can be written as Y = a + bX, where a and b are constants.

Linear transformation of X?
Y = 2 X
Y = X^2 - 1 = (X - 1)(X + 1)
Y = (X - 1)/2

Linear transformations change the scale of a variable but not the shape of the distribution.
Standardization is a linear transformation.

[Figure: three histograms (vertical axis: Fraction). World Bank data again, central gov't debt, n = 48 countries.
Gov't debt (% GDP), 2010: mean = 587, med = 494, sd = 347.6
Gov't debt (% GDP), 2005: mean = 534, med = 46.65, sd = 35.9
Change: 2005 to 2010: mean = 53, med = 5, sd = 22.58]

Change = Debt10 - Debt05
53 = 587 - 534

Linear combinations have a simple effect on the mean. But this does not work (at all) for the median or s.d.

[Figure: three histograms (vertical axis: Fraction). CIA data again, US$, PPP, 2012 est., n = 185 countries.
GDP per capita: mean = 14955, med = 91, sd = 16243
GDP per capita ($1000s): mean = 14.955, med = 9, sd = 1643.5
ln(GDP per capita): mean = 8.972, med = 916, sd = 149]

Non-linear transformations (natural log is very popular) can often transform skewed data to be more symmetric. Linear transformations (such as changing units) do not affect the shape at all.
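These claims can be illustrated with a small Python sketch. All of the data below are hypothetical (they are not the World Bank or CIA figures), and the constants `a` and `b` are arbitrary choices for the demonstration.

```python
import math
import statistics as st


def zscores(data):
    """Standardize: subtract the mean and divide by the sample s.d."""
    m, s = st.mean(data), st.stdev(data)
    return [(d - m) / s for d in data]


# Hypothetical right-skewed variable and a linear transformation Y = a + bX.
x = [1, 2, 4, 8, 16, 32, 64]
a, b = 10.0, 0.5
y = [a + b * xi for xi in x]

# Mean shifts and scales, s.d. only scales, z-scores (the shape) are unchanged.
print(st.mean(y), a + b * st.mean(x))
print(st.stdev(y), abs(b) * st.stdev(x))
print(max(abs(u - v) for u, v in zip(zscores(x), zscores(y))))

# A non-linear transformation (natural log) changes the shape:
# x is skewed (mean well above median), ln(x) is symmetric here.
log_x = [math.log(xi) for xi in x]
print(st.mean(x), st.median(x))
print(st.mean(log_x), st.median(log_x))

# Linear combination of two variables: the mean of the change equals the
# difference of the means, but the same shortcut fails for the median.
debt05 = [10, 40, 90]   # hypothetical debt levels, first year
debt10 = [50, 45, 95]   # hypothetical debt levels, later year
change = [d10 - d05 for d05, d10 in zip(debt05, debt10)]
print(st.mean(change), st.mean(debt10) - st.mean(debt05))
print(st.median(change), st.median(debt10) - st.median(debt05))
```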