Exploratory Data Analysis: Stemplot and Boxplot

STAT 74 Descriptive Statistics

Stemplots (or Stem-and-Leaf Plots)
- The leading digits are called stems.
- The final digits are called leaves.

Example: Number of hysterectomies performed by 15 male doctors:
27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20, 36, 59, 34, 28

Unordered leaves      Ordered leaves
2 | 75508             2 | 05578
3 | 31764             3 | 13467
4 | 4                 4 | 4
5 | 09                5 | 09
6 |                   6 |
7 |                   7 |
8 | 65                8 | 56

For 10 female doctors, the numbers are:
5, 7, 10, 14, 18, 19, 25, 29, 31, 33

Back-to-back stemplot
(Female)     (Male)
      75 | 0 |
    9840 | 1 |
      95 | 2 | 05578
      31 | 3 | 13467
         | 4 | 4
         | 5 | 09
         | 6 |
         | 7 |
         | 8 | 56

Example: Height data with gender (see data sheet)
First group: ?, 6?, 64, 65, 65, 65, 66, 67
Second group: 6?, 64, 69, 7?, 7?, 7?, 72, 72, 72, 72, 7?, 74, 75

[Back-to-back stemplot of the height data, with stems 6 and 7]

[Split back-to-back stemplot of the height data, with stems 6*, 6#, 7*, 7#]
* => leaves 0-4
# => leaves 5-9
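The stem-and-leaf construction can be sketched in a few lines of Python. This is a minimal illustration, assuming non-negative integers with the tens digit as the stem and the units digit as the leaf; the function name `stemplot` and the demo values are hypothetical and are not taken from the data sheet.

```python
from collections import defaultdict

def stemplot(values, ordered=True):
    """Return stemplot lines for non-negative integers.

    Assumption: stem = tens part (v // 10), leaf = units digit (v % 10).
    """
    leaves = defaultdict(list)
    for v in values:
        leaves[v // 10].append(v % 10)
    lines = []
    # Include every stem between the smallest and largest, even if empty.
    for stem in range(min(leaves), max(leaves) + 1):
        row = sorted(leaves[stem]) if ordered else leaves[stem]
        text = f"{stem} | " + "".join(str(d) for d in row)
        lines.append(text.rstrip())
    return lines

# Illustrative values only (hypothetical).
demo = [12, 15, 21, 21, 34, 47, 40]
for line in stemplot(demo):
    print(line)
```

Passing `ordered=False` keeps the leaves in data order, which corresponds to the first pass of building a stemplot by hand before the leaves are sorted.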
Order Statistics (Hogg & Tanis)

Let $x_1, x_2, \ldots, x_n$ be a random sample from a distribution of the continuous type. Let $y_1$ be the smallest of the $x_i$, $y_2$ the next smallest in order of magnitude, and so on, so that $y_n$ is the largest of the $x_i$; that is, $y_1 < y_2 < \cdots < y_n$. Then $y_i$, for $i = 1, 2, \ldots, n$, is called the $i$th order statistic of the sample.

Percentile (measure of location)

If $0 < p < 1$, the $(100p)$th sample percentile, denoted $\tilde{\pi}_p$, has approximately $np$ sample observations less than it and $n(1-p)$ sample observations greater than it. It is undefined if $p < \frac{1}{n+1}$ or $p > \frac{n}{n+1}$.

To find $\tilde{\pi}_p$:
1. Compute $(n+1)p$ [position index].
2. If $(n+1)p$ is an integer, then $\tilde{\pi}_p = y_{(n+1)p}$. If $(n+1)p$ is not an integer and is equal to $r + \frac{a}{b}$, where $r$ is an integer and $\frac{a}{b}$ is a proper fraction, then
$$\tilde{\pi}_p = y_r + \frac{a}{b}\,(y_{r+1} - y_r) = \left(1 - \frac{a}{b}\right) y_r + \frac{a}{b}\, y_{r+1}.$$

Example: [odd number of data]
?, 6?, 6?, 64, 64, 65, 65, 65, 66, 67, 69, 7?, 7?, 7?, 72, 72, 72, 72, 7?, 74, 75
$Q_1 = 64.5$, Median $= 69$, $Q_3 = 72$

To find $\tilde{\pi}_{.25} = \tilde{q}_1$ (the first quartile):
$(n+1)p = (21+1)(.25) = 5.5$, so $r = 5$ and $\frac{a}{b} = .5$
$\tilde{\pi}_{.25} = (.5)(64) + (.5)(65) = 64.5$
($\tilde{m} = 69$, $\tilde{\pi}_{.75} = 72$)

Quartiles:
- The first quartile, $Q_1$, or 25th percentile, is the median of the lower half of the list of ordered observations.
- The third quartile, $Q_3$, or 75th percentile, is the median of the upper half of the list of ordered observations.
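The position-index rule for sample percentiles can be sketched as follows. This is a minimal illustration, not a library routine: the function name `sample_percentile` and the demo values are hypothetical, and the rounding of the position index is an added guard against floating-point noise rather than part of the rule itself.

```python
import math

def sample_percentile(data, p):
    """(100p)th sample percentile from the position index (n + 1) * p."""
    y = sorted(data)              # order statistics y_1 <= ... <= y_n
    n = len(y)
    if not (1 / (n + 1) <= p <= n / (n + 1)):
        raise ValueError("undefined for p < 1/(n+1) or p > n/(n+1)")
    # Round to suppress floating-point noise in the position index.
    pos = round((n + 1) * p, 10)
    r = math.floor(pos)           # integer part r
    frac = pos - r                # fractional part a/b
    if frac == 0:
        return y[r - 1]           # y_r (lists are 0-indexed)
    # Weighted average of y_r and y_{r+1}.
    return (1 - frac) * y[r - 1] + frac * y[r]

# Illustrative values only (hypothetical).
demo = [10, 20, 30, 40, 50, 60, 70]
print(sample_percentile(demo, 0.25))  # position 2, so y_2
print(sample_percentile(demo, 0.40))  # position 3.2, between y_3 and y_4
```

With 21 ordered observations and p = .25, the same function lands on position 5.5 and averages the 5th and 6th order statistics, as in the first-quartile calculation above.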
Example: (data sheet)

[odd number of data]
?, 6?, 6?, 64, 64, 65, 65, 65, 66, 67, 69, 7?, 7?, 7?, 72, 72, 72, 72, 7?, 74, 75
$Q_1 = 64.5$, Median $= 69$, $Q_3 = 72$

[even number of data]
6, ?, 6?, 6?, 64, 64, 65, 65, 65, 66, 67, 69, 7?, 7?, 7?, 72, 72, 72, 72, 7?, 74, 75
$Q_1 = 64$, Median $= 68$, $Q_3 = 72$

Interquartile range (IQR) $= Q_3 - Q_1$
Odd number of data: IQR $= 72 - 64.5 = 7.5$
Even number of data: IQR $= 72 - 64 = 8$

The five-number summary
1. minimum value
2. $Q_1$
3. median
4. $Q_3$
5. maximum value

Example: (data sheet without outlier 6)
Min = ?, $Q_1 = 64.5$, Median $= 69$, $Q_3 = 72$, Max $= 75$

[Boxplot of the height data]

Inner and outer fences
- The inner fences are located at a distance of 1.5 IQR below $Q_1$ and at a distance of 1.5 IQR above $Q_3$.
- The outer fences are located at a distance of 3 IQR below $Q_1$ and at a distance of 3 IQR above $Q_3$.

Mild and extreme outliers
- Data values falling between the inner and outer fences are considered mild outliers.
- Data values falling outside the outer fences are considered extreme outliers.

When outliers exist, the whiskers extend to the smallest and largest data values within the inner fences.
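The fence rules translate directly into code. The sketch below is illustrative only: the function names are hypothetical, the quartiles 64.5 and 72 are the ones quoted for the height example, and the list passed to `whisker_ends` is made up. Values lying exactly on a fence are treated here as not beyond it, which is an assumption the slides do not settle.

```python
def fences(q1, q3):
    """Inner fences at 1.5 IQR and outer fences at 3 IQR beyond the quartiles."""
    iqr = q3 - q1
    return {
        "inner": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        "outer": (q1 - 3 * iqr, q3 + 3 * iqr),
    }

def classify(x, q1, q3):
    """Label x as an extreme outlier, a mild outlier, or not an outlier."""
    f = fences(q1, q3)
    lo_in, hi_in = f["inner"]
    lo_out, hi_out = f["outer"]
    if x < lo_out or x > hi_out:
        return "extreme outlier"
    if x < lo_in or x > hi_in:
        return "mild outlier"
    return "not an outlier"

def whisker_ends(values, q1, q3):
    """Smallest and largest data values within the inner fences."""
    lo_in, hi_in = fences(q1, q3)["inner"]
    inside = [v for v in values if lo_in <= v <= hi_in]
    return min(inside), max(inside)

print(fences(64.5, 72))
print(classify(50, 64.5, 72))
# Hypothetical values, used only to show where the whiskers stop.
print(whisker_ends([6, 60, 64, 70, 75], 64.5, 72))
```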
[Diagram: boxplot with the IQR box in the middle, an inner fence on each side of the box, and an outer fence beyond each inner fence]

Side-by-side Box Plot

[Side-by-side boxplots by sex]

Quantile

For data values $y_1, y_2, \ldots, y_n$ such that $y_1 \le y_2 \le \cdots \le y_n$ in a sample data set, $y_r$ is the quantile of order $r/(n+1)$.

A quantile-quantile plot, or q-q plot, plots the quantiles of one sample against the quantiles of another sample or of a distribution.

Quantile-Quantile Plot

Examine the distributions of the two samples:
Sample 1: .5, ?, 7.4, ?, 9.2, .9
Sample 2: -.92, -., -.2, .2, .62, .4

[Quantile-quantile plot: Sample 2 quantile against Sample 1 quantile, with the point (?, .2) labeled]
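For two samples of the same size, the points of a q-q plot are simply the paired order statistics, since the r-th smallest value in each sample is the quantile of order r/(n+1). A minimal sketch under that equal-size assumption follows; the function names and demo values are hypothetical.

```python
def quantile_orders(n):
    """Orders r / (n + 1) attached to the ordered values y_1, ..., y_n."""
    return [r / (n + 1) for r in range(1, n + 1)]

def qq_pairs(sample1, sample2):
    """Pair quantiles of equal order from two samples of the same size."""
    if len(sample1) != len(sample2):
        # Unequal sizes would need interpolation, which is not covered here.
        raise ValueError("this sketch assumes equal sample sizes")
    return list(zip(sorted(sample1), sorted(sample2)))

# Illustrative values only (hypothetical).
a = [3.0, 1.0, 2.0]
b = [30.0, 10.0, 20.0]
print(quantile_orders(len(a)))
print(qq_pairs(a, b))
```

Plotting the second coordinate against the first gives the q-q plot; points close to a straight line suggest that the two distributions have about the same shape.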
[Quantile-quantile plot: Sample 2 quantile against Sample 1 quantile, with the point (?, .2) labeled]

Examine Distribution with Quantile-Quantile Plot

[Two q-q plots. First panel: the two distributions are about the same. Second panel: the two distributions are not the same.]

z-percentile

Example: ?, ?, 7, ?, 9, ?
n = 6, and the 4th ordered value is the [4/(6+1)]th quantile, i.e., the .571 quantile or 57.1th percentile of the distribution of the sample.

The 57.1th percentile of a standard normal distribution is the z-score that has 57.1% of the distribution below it. Therefore, use .571 for the area (.5714 is the table entry closest to it) and look for the z-score corresponding to it, which is around .18. So the z-percentile for this .571 quantile is .18.

(sample quantile, z-percentile) => (?, .18)

Normal Quantile Plot

[Normal quantile plot: z-percentile against sample quantile, with the point (?, .18) labeled]

Examine Distribution with Normal Quantile Plot

[Two normal quantile plots. First panel: close to normal. Second panel: not normal.]

Examine Other Distributions
- Find the quantiles of the data and the corresponding percentiles of the distribution to be examined.
- Plot the ordered pairs of sample quantiles and their corresponding percentiles of that distribution.
- Examine whether the points follow a straight-line pattern.
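The table lookup for z-percentiles can be replaced by the inverse normal CDF. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later) and the r/(n+1) quantile orders; the function name `normal_quantile_pairs` and the demo sample are hypothetical.

```python
from statistics import NormalDist

def normal_quantile_pairs(sample):
    """Pairs (sample quantile, z-percentile) for a normal quantile plot.

    The r-th ordered value is treated as the quantile of order r / (n + 1).
    """
    y = sorted(sample)
    n = len(y)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [(y[r - 1], std_normal.inv_cdf(r / (n + 1)))
            for r in range(1, n + 1)]

# z-score with area 4/7 (about .571) below it, as in the worked example.
print(round(NormalDist().inv_cdf(4 / 7), 2))

# Illustrative values only (hypothetical).
for pair in normal_quantile_pairs([5.1, 4.8, 6.3]):
    print(pair)
```

To examine a different distribution, the same pairing works with that distribution's inverse CDF in place of `std_normal.inv_cdf`.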
Sample Quantile

Let $x_{(1)}$ denote the smallest sample observation, $x_{(2)}$ the 2nd smallest sample observation, and $x_{(n)}$ the largest sample observation. For $i = 1, 2, \ldots, n$, $x_{(i)}$ is the $[(i - .5)/n]$th sample quantile. (Approximation)

Example: ?, ?, 7, ?, 9, ?
n = 6, and the 4th ordered value, $x_{(4)}$, is the $[(4 - .5)/6]$th quantile of the distribution of the sample, i.e., the .583 sample quantile or 58.3th percentile.

Grouping and Displaying Categorical Data

Frequency Table and Charts

Class    Frequency    Relative Frequency
?        9            9/22 = .409 = 40.9%
?        13           13/22 = .591 = 59.1%
Total    22           100%

[Bar chart: percent by sex]
[Pie chart: 40.9% and 59.1%]

Examine Bivariate Data

Two Categorical Variables

Contingency Table

                Cancer    No cancer    Row Total
Smoker          20        30           50
Non-Smoker      5         45           50
Column Total    25        75           100

Odds of a smoker having cancer: 20/30 = 6/9
Odds of a nonsmoker having cancer: 5/45 = 1/9
Odds Ratio = (6/9)/(1/9) = 6

[Cluster bar chart: percent who have cancer and do not have cancer, by smoking status (smoke, do not smoke)]
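The odds and the odds ratio can be checked with exact fractions. This is a small sketch; the function names are hypothetical, and the counts are the ones consistent with the odds 6/9 and 5/45 quoted above (20 smokers with cancer and 30 without, 5 nonsmokers with cancer and 45 without).

```python
from fractions import Fraction

def odds(with_event, without_event):
    """Odds of the event: count with the event over count without it."""
    return Fraction(with_event, without_event)

def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table with rows (a, b) and (c, d)."""
    return odds(a, b) / odds(c, d)

smoker_odds = odds(20, 30)      # reduces to 2/3, the same as 6/9
nonsmoker_odds = odds(5, 45)    # reduces to 1/9
print(smoker_odds, nonsmoker_odds, odds_ratio(20, 30, 5, 45))
```

Using `Fraction` keeps the arithmetic exact, so the ratio comes out as a whole number rather than a rounded decimal.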
Two Quantitative Variables

Average annual temperature and the mortality index for a type of breast cancer in women in a certain region of Europe.

Data (Temperature, Mortality Index): 4 52 4 6 42 6 42 4 72 44 45 9 46 77 47 4 94 49 6 5 95 5 5 5 52 2

[Scatterplot: Mortality Index against Average Temperature]

A Categorical and a Quantitative Variable

[Side-by-side boxplots by sex]

Time Plot

[Time plot: Youngstown Homicide Rate by Year (HRATE against YEAR)]

[Time plot: National Homicide Rate by Year (National Homicide Rate against YEAR), shown on two slides]