Petra Petrovics
Descriptive Statistics
2nd seminar
DESCRIPTIVE STATISTICS
Definition: Descriptive statistics is concerned only with collecting and describing data.
Methods:
- statistical tables and graphs
- descriptive measures
Descriptive measure: a single number that provides information about a set of data.
Definition of a Population
I. Central Tendency
- mean (calculated average)
- mode, median (positional averages)
II. Percentiles, Quartiles
III. Dispersion
IV. Shape
I.1. Means
- Arithmetic mean (average)
- Geometric mean: used when the ratio of any two consecutive values is constant, e.g. compound interest rates
- Harmonic mean: used when the units of measurement differ between the numerator and the denominator, e.g. miles per hour
- Quadratic mean: e.g. the form of the standard deviation
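The four means listed above can be compared on small examples. The following is a minimal sketch, assuming Python 3.8 or later (for `statistics.geometric_mean`); the data values and the helper `quadratic_mean` are hypothetical and serve only as illustration.

```python
import math
import statistics


def quadratic_mean(values):
    """Return the quadratic mean (root mean square) of the values."""
    return math.sqrt(sum(v * v for v in values) / len(values))


# Hypothetical data, chosen only for illustration.
scores = [2, 4, 9]
growth_factors = [1.10, 1.20, 1.05]  # yearly growth factors of a deposit
speeds_mph = [30.0, 60.0]            # the same distance driven at two speeds

arithmetic = statistics.mean(scores)
geometric = statistics.geometric_mean(growth_factors)  # average yearly factor
harmonic = statistics.harmonic_mean(speeds_mph)        # average speed of the trip
quadratic = quadratic_mean([3, 4])

print(arithmetic, geometric, harmonic, quadratic)
```

In this example the harmonic mean of 30 and 60 mph is 40 mph, which is the true average speed when the same distance is covered at each speed; the arithmetic mean (45 mph) would overstate it.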
Arithmetic Mean
Typically referred to as the mean. The most common measure of central tendency. It is the only common measure in which all the values play an equal role.
Symbol: \( \bar{x} \), called X-bar
Raw data expression:
\[ \bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n} \]
Frequency distribution expression:
\[ \bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i} \]
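The raw-data and frequency-distribution expressions give the same result when the frequencies simply count repeated values. A minimal sketch, with hypothetical function names and data:

```python
def mean_raw(values):
    """Arithmetic mean of raw data: sum of the values divided by their count."""
    return sum(values) / len(values)


def mean_grouped(values, freqs):
    """Arithmetic mean of a frequency distribution: sum(f_i * x_i) / sum(f_i)."""
    return sum(f * x for x, f in zip(values, freqs)) / sum(freqs)


# Hypothetical data: the value 1 occurs once, 2 twice, 3 three times.
raw = [1, 2, 2, 3, 3, 3]
distinct_values = [1, 2, 3]
frequencies = [1, 2, 3]

mean_from_raw = mean_raw(raw)
mean_from_freq = mean_grouped(distinct_values, frequencies)
print(mean_from_raw, mean_from_freq)
```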
I.2. Median
The statistic that has an equal number of variates above and below it; the middle term.
Raw data expression: the \( \frac{n+1}{2} \)-th ranked value.
Advantage (+): independent of extreme values.
Disadvantage (-): determined only from the order of the data.
Frequency distribution expression:
\[ Me = me + \frac{\frac{n}{2} - f'_{me-1}}{f_{me}} \cdot h \]
me = lower boundary of the median class
n = total number of variates in the frequency distribution
f'_{me-1} = cumulative frequency of the class below the median class
f_{me} = frequency of the median class
h = class interval
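Both median expressions can be sketched in a few lines. The functions below are illustrative only: the names are hypothetical, the grouped version assumes classes of equal width h given by their lower boundaries, and for an even number of raw values the (n+1)/2-th position is taken as the average of the two middle values.

```python
def median_raw(values):
    """Median of raw data: the (n+1)/2-th ranked value."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    # Even n: position (n+1)/2 falls between two values, so average them.
    return (ordered[mid - 1] + ordered[mid]) / 2


def median_grouped(lower_bounds, freqs, h):
    """Median of a frequency distribution: me + (n/2 - f'_{me-1}) / f_me * h."""
    n = sum(freqs)
    if n == 0:
        raise ValueError("empty frequency distribution")
    cumulative = 0  # cumulative frequency of the classes below the current one
    for lower, f in zip(lower_bounds, freqs):
        if cumulative + f >= n / 2:  # this is the median class
            return lower + (n / 2 - cumulative) / f * h
        cumulative += f
    raise ValueError("median class not found")


# Hypothetical classes [0, 10), [10, 20), [20, 30) with frequencies 5, 10, 5.
grouped_median = median_grouped([0, 10, 20], [5, 10, 5], 10)
print(median_raw([3, 1, 2]), median_raw([4, 1, 3, 2]), grouped_median)
```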
I.3. Mode
The value that occurs most frequently; the typical value.
\[ Mo = mo + \frac{k_1}{k_1 + k_2} \cdot h \]
mo = the lower class boundary of the modal class
k_1 = the difference between the frequencies of the modal class and the previous class
k_2 = the difference between the frequencies of the modal class and the next class
h = class interval
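The grouped-data formula for the mode can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the function name is hypothetical, classes are assumed to have equal width h, a missing neighbouring class is treated as having frequency 0, and ties are resolved in favour of the first class with the highest frequency.

```python
import statistics


def mode_grouped(lower_bounds, freqs, h):
    """Mode of a frequency distribution: mo + k1 / (k1 + k2) * h."""
    # Index of the modal class (first class with the highest frequency).
    i = max(range(len(freqs)), key=lambda idx: freqs[idx])
    prev_f = freqs[i - 1] if i > 0 else 0               # assumption: 0 if no previous class
    next_f = freqs[i + 1] if i < len(freqs) - 1 else 0  # assumption: 0 if no next class
    k1 = freqs[i] - prev_f
    k2 = freqs[i] - next_f
    if k1 + k2 == 0:
        raise ValueError("all frequencies are zero")
    return lower_bounds[i] + k1 / (k1 + k2) * h


# Raw data: the most frequent value.
raw_mode = statistics.mode([1, 2, 2, 3])

# Hypothetical classes [0, 10), [10, 20), [20, 30) with frequencies 5, 10, 7.
grouped_mode = mode_grouped([0, 10, 20], [5, 10, 7], 10)
print(raw_mode, grouped_mode)
```

Here k_1 = 5 and k_2 = 3, so the estimated mode lies 5/8 of the way into the modal class.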
II. Percentiles and Quartiles
The P-th percentile of a group of numbers is the value below which lie P% (P percent) of the numbers in the group.
- Q_1 (lower quartile): the first quartile is the 25th percentile. It is the point below which lie ¼ of the data.
- Q_2 (middle quartile): the median is the value below which lie half of the data. It is the 50th percentile.
- Q_3 (upper quartile): the third quartile is the 75th percentile. It is the point below which lie 75 percent of the data.
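Quartiles and percentiles can be obtained with Python's `statistics.quantiles`. This is only a sketch: the helper `percentile` and the data are hypothetical, and the default "exclusive" method is one of several interpolation conventions, so other software may report slightly different values for the same data.

```python
import statistics


def percentile(values, p):
    """Return the p-th percentile for an integer p between 1 and 99."""
    if not 1 <= p <= 99:
        raise ValueError("p must be between 1 and 99")
    # quantiles(..., n=100) returns the 99 cut points P1..P99.
    return statistics.quantiles(values, n=100)[p - 1]


# Hypothetical data: the integers 1 to 9.
data = list(range(1, 10))

q1, q2, q3 = statistics.quantiles(data, n=4)  # lower, middle, upper quartile
print(q1, q2, q3, percentile(data, 25))
```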
III. Measures of Dispersion 1. Range 2. Interquartile Range 3. Population and Sample Standard Deviation 4. Population and Sample Variance 5. Coefficient of Variation
III.1. Range
The range of a set of observations is the difference between the largest observation and the smallest observation.
\[ R = X_{max} - X_{min} \]
III.2. IQR
Interquartile range: the difference between the third and first quartiles.
\[ IQR = Q_3 - Q_1 \]
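As a small illustration of the two formulas, the sketch below computes R and IQR for hypothetical data. It assumes the default quartile convention of `statistics.quantiles`; other conventions can give a slightly different IQR.

```python
import statistics

# Hypothetical observations, chosen only for illustration.
observations = [15, 20, 35, 40, 50, 60, 90]

# R = X_max - X_min
value_range = max(observations) - min(observations)

# IQR = Q3 - Q1
q1, _, q3 = statistics.quantiles(observations, n=4)
iqr = q3 - q1

print(value_range, iqr)
```

The range depends only on the two extreme observations, while the IQR ignores the lowest and highest quarter of the data and is therefore less sensitive to outliers.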
III.3. Standard Deviation
The standard deviation is a measure of dispersion around the mean. A low standard deviation indicates that the data points tend to be very close to the mean, whereas a high standard deviation indicates that the data are spread out over a large range of values.
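III.4. Variance
Variance of a set of observations: the average squared deviation of the data points from their mean.
Population variance:
\[ \sigma^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n} = \frac{\sum_{i=1}^{n} f_i (X_i - \bar{X})^2}{\sum_{i=1}^{n} f_i} \]
Sample variance:
\[ S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} = \frac{\sum_{i=1}^{n} f_i (X_i - \bar{X})^2}{\sum_{i=1}^{n} f_i - 1} \]
III.5. Coefficient of Variation
The measure of dispersion around the mean in %.
\[ V = \frac{\sigma}{\bar{X}} \qquad V = \frac{s}{\bar{X}} \]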
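The variance, standard deviation, and coefficient of variation formulas can be sketched together. The function names and data below are hypothetical; the `sample` flag switches between the population denominator and the sample denominator (one less than the total frequency).

```python
import math


def variance(values, freqs=None, sample=False):
    """Variance of raw data (freqs=None) or of a frequency distribution."""
    if freqs is None:
        freqs = [1] * len(values)
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n
    squared_dev = sum(f * (x - mean) ** 2 for x, f in zip(values, freqs))
    return squared_dev / (n - 1 if sample else n)


def coefficient_of_variation(values, freqs=None, sample=False):
    """Standard deviation divided by the mean, expressed in percent."""
    if freqs is None:
        freqs = [1] * len(values)
    mean = sum(f * x for x, f in zip(values, freqs)) / sum(freqs)
    return math.sqrt(variance(values, freqs, sample)) / mean * 100


# Hypothetical raw data and the same data as a frequency distribution.
raw = [2, 4, 4, 4, 5, 5, 7, 9]
distinct_values = [2, 4, 5, 7, 9]
frequencies = [1, 3, 2, 1, 1]

pop_var = variance(raw)                             # population variance
sample_var = variance(raw, sample=True)             # sample variance
grouped_var = variance(distinct_values, frequencies)
pop_std = math.sqrt(pop_var)                        # population standard deviation
cv_percent = coefficient_of_variation(raw)

print(pop_var, sample_var, pop_std, cv_percent)
```

For these data the mean is 5 and the population standard deviation is 2, so the coefficient of variation is 40%.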
IV. Measures of Shape Skewness is a measure of the degree of asymmetry of a frequency distribution. Kurtosis is a measure of the flatness (versus peakedness) of a frequency distribution.
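IV.1. Kurtosis
A measure of the extent to which observations cluster around the central point.
- Positive kurtosis: the observations cluster more and have longer tails than those in the normal distribution.
- Negative kurtosis: the observations cluster less and have shorter tails than those in the normal distribution.
For a normal distribution, the value of the kurtosis statistic is zero.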
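IV.2. Skewness
- Skewed to the right (positive skewness, long right tail): \( Mo < Me < \bar{X} \)
- Symmetric: \( Mo = Me = \bar{X} \)
- Skewed to the left (negative skewness, long left tail): \( \bar{X} < Me < Mo \)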
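The two shape measures can be illustrated with simple moment-based formulas. This is a sketch under stated assumptions: the function names and data are hypothetical, skewness is computed as m3 / m2^1.5 and kurtosis as m4 / m2^2 - 3 (so that a normal distribution gives zero), and statistical packages often apply small-sample corrections, so their output may differ slightly.

```python
import statistics


def central_moment(values, k):
    """k-th central moment: the average of (x - mean)^k."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)


def skewness(values):
    """Moment-based skewness: m3 / m2^1.5 (positive for a long right tail)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so the normal distribution scores zero."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


# Hypothetical data sets, chosen only for illustration.
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9]  # long right tail
symmetric = [1, 2, 3, 4, 5]

mo = statistics.mode(right_skewed)
me = statistics.median(right_skewed)
mean = statistics.mean(right_skewed)

print(mo, me, mean, skewness(right_skewed))
print(skewness(symmetric), excess_kurtosis(symmetric))
```

In the right-skewed example the mode is below the median, which is below the mean, and the skewness is positive; for the symmetric data the skewness is zero.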
Box Plot
The box plot is a set of five summary measures of the distribution of the data:
- the median of the data
- the lower quartile
- the upper quartile
- the smallest observation
- the largest observation
It also shows the asymmetry of the distribution.
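The five summary measures behind a box plot can be collected in one step. A minimal sketch, with a hypothetical function name and data, again assuming the default quartile convention of `statistics.quantiles`:

```python
import statistics


def five_number_summary(values):
    """Return the five measures a box plot displays, as a dictionary."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


# Hypothetical observations, chosen only for illustration.
observations = [15, 20, 35, 40, 50, 60, 90]
summary = five_number_summary(observations)
print(summary)
```

Comparing the distances max - Q3 and Q1 - min (the whiskers) gives a quick impression of asymmetry: here the upper whisker is much longer, which points to a long right tail.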
Source: Aczel [1996] Box&Whiskers
Source: Aczel [1996] Elements of Box Plot
Source: Aczel [1996]
In SPSS
SPSS:
File / Open / Data: Employee data.sav
Analyze / Descriptive Statistics / Frequencies
I. Central Tendency
Mean (\( \bar{x} \))
Mode (Mo): the value that occurs most frequently.
Median (Me): the statistic that has an equal number of variates above and below it.
II. Percentiles and quartiles
The P-th percentile of a group of numbers is the value below which lie P% (P percent) of the numbers in the group.
- Q_1 (lower quartile): the first quartile is the 25th percentile. It is the point below which lie ¼ of the data.
- Q_2 (middle quartile): the median.
- Q_3 (upper quartile): the point below which lie 75 percent of the data.
III. Measures of dispersion
Standard deviation: a measure of dispersion around the mean.
Range: the difference between the largest observation and the smallest observation.
IV. Measures of Shape
Interpretation
- 474 employees were examined (number of cases).
- The average current salary is $34,419.57.
- Half of the employees earn more than $28,275.00; the other half earn less.
- The most frequent current salary is $30,750.
- The standard deviation (the average dispersion around the mean) is $17,075.661.
- Long right tail asymmetry (a distribution with a significant positive skewness has a long right tail).
- Positive kurtosis indicates that the observations cluster more and have longer tails than those in the normal distribution.
- The lowest current salary is $15,750. The highest current salary is $135,000.
- 25% of employees have a current salary lower than $24,000, and 75% have a higher one.
- 75% of employees have a current salary lower than $37,162.50, and 25% have a higher one.
Graphs / Legacy Dialogs / Boxplot
Boxplot of the current salary categorized by the employment category
Box Plot
Figure annotations: the highest salary; the least standard deviation; Q_3, Me, Q_1.
Thanks for your attention stgpren@uni-miskolc.hu