Quantitative Introduction to Risk and Uncertainty in Business
Module 5: Hypothesis Testing Examples
M. Vidyasagar
Cecil & Ida Green Chair, The University of Texas at Dallas
Email: M.Vidyasagar@utdallas.edu
October 13, 2012
Outline Kolmogorov-Smirnov (K-S) Tests 1 Kolmogorov-Smirnov (K-S) Tests 2 3
Take the height and weight data given earlier. Compute the mean $\mu$ and standard deviation $\sigma$ of the raw data without binning. For heights we get $\mu_h = 67.9639$, $\sigma_h = 1.9318$; for weights we get $\mu_w = 127.2639$, $\sigma_w = 11.5442$. Then we plot the empirical distribution against the Gaussian distribution with these parameters. The next slides depict the results.
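As a minimal sketch of the fitting step, the following computes the mean and standard deviation of raw samples and evaluates the fitted Gaussian distribution function. The original data set is not reproduced here, so the samples are synthetic stand-ins; the function names are hypothetical, and the use of the sample (n - 1) standard deviation is an assumption, since the slides do not say which estimator was used.

```python
import math
import random
import statistics

def fit_gaussian(samples):
    """Return (mean, standard deviation) of the raw samples, without binning."""
    mu = statistics.fmean(samples)
    # Assumption: sample standard deviation (divides by n - 1).
    sigma = statistics.stdev(samples)
    return mu, sigma

def gaussian_cdf(x, mu, sigma):
    """Distribution function of a Gaussian with mean mu and std sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Synthetic stand-in for the 5000 height samples (NOT the original data).
rng = random.Random(0)
heights = [rng.gauss(67.9639, 1.9318) for _ in range(5000)]

mu_h, sigma_h = fit_gaussian(heights)
print(f"fitted mean = {mu_h:.4f}, fitted std = {sigma_h:.4f}")
print(f"fitted CDF at the mean = {gaussian_cdf(mu_h, mu_h, sigma_h):.2f}")
```

With 5000 samples the choice between dividing by n and by n - 1 changes the fitted standard deviation only in the fourth significant digit.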
Empirical vs. Fitted Height Density
Empirical vs. Fitted Height Distribution
Application of One-Sample K-S Test to Height Data
We have 5000 samples, so $n = 5000$. Let us look at the 95% confidence level, so $\delta = 0.05$. Compute the threshold
$$\theta(n, \delta) = \left( \frac{1}{2n} \log \frac{2}{\delta} \right)^{1/2},$$
where $\log$ is the natural logarithm. For the chosen values, this turns out to be $\theta = 0.0192$. So if the actual maximum difference between the empirical and fitted distribution functions for heights is larger than this threshold, then we can say with 95% confidence that the samples are not generated by a Gaussian distribution.
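The threshold is easy to check numerically. The sketch below evaluates $\theta(n, \delta)$ with the natural logarithm; the function name `ks_threshold` is a hypothetical choice for illustration.

```python
import math

def ks_threshold(n, delta):
    """K-S threshold theta(n, delta) = sqrt(log(2 / delta) / (2 n))."""
    if n <= 0 or not (0.0 < delta < 1.0):
        raise ValueError("need n > 0 and 0 < delta < 1")
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

theta = ks_threshold(5000, 0.05)
print(f"theta = {theta:.4f}")  # prints theta = 0.0192
```

Note that no data enters this computation: only the sample count and $\delta$ are needed.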
Application of One-Sample K-S Test to Height Data 2
The actual maximum difference is 0.0103. So can we accept the hypothesis that heights follow a Gaussian distribution with mean $\mu_h$ and standard deviation $\sigma_h$? The fact that the actual maximum is less than $\theta$ says only that we cannot reject the null hypothesis. So let us see if we can make a less wishy-washy statement.
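The "actual maximum difference" is the largest gap between the empirical distribution function and the fitted one. A minimal sketch of that computation follows, assuming the gap is checked just before and just after each jump of the empirical distribution function. The data are synthetic stand-ins for the height samples, and all function names are hypothetical.

```python
import math
import random
import statistics

def gaussian_cdf(x, mu, sigma):
    """Distribution function of a Gaussian with mean mu and std sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_statistic(samples, cdf):
    """Maximum absolute difference between the empirical CDF and cdf."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

# Synthetic stand-in for the height data (NOT the original samples).
rng = random.Random(1)
data = [rng.gauss(67.9639, 1.9318) for _ in range(5000)]

# Fit mean and std from the samples, then compare against the fitted CDF.
mu = statistics.fmean(data)
sigma = statistics.stdev(data)  # assumption: sample (n - 1) estimator
d = ks_statistic(data, lambda x: gaussian_cdf(x, mu, sigma))
print(f"K-S statistic = {d:.4f}")
```

The printed value depends on the synthetic draw, so it will not reproduce the 0.0103 obtained from the real data.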
Application of One-Sample K-S Test to Height Data 3
The K-S test above shows only that we cannot say with 95% confidence that the samples do not come from a Gaussian distribution. That might not be a good reason to accept the null hypothesis. So let us ask: What is the threshold for saying with 50% confidence that the samples do not come from a Gaussian distribution? To find this threshold, substitute $\delta = 0.5$ (not $\delta = 0.05$ as earlier) into
$$\theta(n, \delta) = \left( \frac{1}{2n} \log \frac{2}{\delta} \right)^{1/2}.$$
This gives $\theta = 0.0118$. Since the actual maximum, 0.0103, is less than even this number, we cannot reject the Gaussian hypothesis even at the 50% confidence level, and on that basis we accept the null hypothesis.
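Rather than trying values of $\delta$ one at a time, the threshold formula can be inverted. Setting $\theta(n, \delta) = d$ and solving gives $\delta = 2 \exp(-2 n d^2)$, so the observed difference $d$ stays below the threshold exactly when $\delta$ is smaller than this value. This inversion is not in the slides; it is only algebra on the formula above, and the function name is hypothetical.

```python
import math

def crossover_delta(n, d):
    """Value of delta at which theta(n, delta) equals the observed d.

    For any delta below this value, d < theta(n, delta) and the test
    does not reject. The result is capped at 1 because delta is a
    probability.
    """
    return min(1.0, 2.0 * math.exp(-2.0 * n * d * d))

# Height data from the slides: n = 5000, observed maximum difference 0.0103.
delta_star = crossover_delta(5000, 0.0103)
print(f"crossover delta = {delta_star:.4f}")
print("no rejection at delta = 0.05:", 0.05 < delta_star)
print("no rejection at delta = 0.50:", 0.50 < delta_star)
```

Both comparisons agree with the slide: the height data do not cross the threshold at either $\delta = 0.05$ or $\delta = 0.5$.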
Empirical vs. Fitted Weight Density
Empirical vs. Fitted Weight Distribution
Application of One-Sample K-S Test to Weight Data
We have 5000 samples, so $n = 5000$. Let us look at the 95% confidence level, so $\delta = 0.05$. Compute the threshold
$$\theta(n, \delta) = \left( \frac{1}{2n} \log \frac{2}{\delta} \right)^{1/2}.$$
For the chosen values, this turns out to be $\theta = 0.0192$ if $\delta = 0.05$ and $\theta = 0.0118$ if $\delta = 0.5$. These are the same numbers as before! Is this a coincidence? No! The threshold $\theta$ depends only on the number of samples $n$ and the parameter $\delta$, not on the data.
Application of One-Sample K-S Test to Weight Data 2
The actual maximum difference between the empirical and fitted distribution functions of weights is 0.0060, which is lower than the threshold for $\delta = 0.5$. So we can accept the hypothesis that weights follow a Gaussian distribution with mean $\mu_w$ and standard deviation $\sigma_w$.
We fitted a Gaussian to the logarithm of the prices of homes sold in the UK during June 2012. The results are shown on the next slide.
Home Prices: Empirical vs. Fitted Density
Application of One-Sample K-S Test to Home Prices Data
Recall that the K-S threshold is
$$\theta(n, \delta) = \left( \frac{1}{2n} \log \frac{2}{\delta} \right)^{1/2}.$$
There are 54675 homes sold, so $n = 54675$. If we take $\delta = 0.05$ as usual, then the threshold is $\theta = 0.0058$.
Application of K-S Test to Home Prices Data 2
The mean of the log home price is $\mu_l = 12.1175$, and the standard deviation is $\sigma_l = 0.6608$. The next slide shows the empirical and fitted distribution functions.
Home Prices: Empirical vs. Fitted Distribution
Application of K-S Test to Home Prices Data 3
Now if we compute the K-S statistic, which is the maximum difference between the empirical and fitted distribution functions, it works out to 0.0565, which is far higher than $\theta = 0.0058$. So with 95% confidence, we can reject the null hypothesis that home prices follow a log-normal (that is, log-Gaussian) distribution.
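The three one-sample K-S examples can be put side by side. The sketch below applies the same decision rule at $\delta = 0.05$ to the sample counts and observed statistics quoted in the slides; the function names and the dictionary layout are hypothetical.

```python
import math

def ks_threshold(n, delta):
    """K-S threshold theta(n, delta) = sqrt(log(2 / delta) / (2 n))."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def reject_gaussian(n, observed, delta=0.05):
    """True if the observed K-S statistic exceeds the threshold."""
    return observed > ks_threshold(n, delta)

# (number of samples, observed maximum difference), as quoted in the slides.
cases = {
    "heights": (5000, 0.0103),
    "weights": (5000, 0.0060),
    "log home prices": (54675, 0.0565),
}

for name, (n, observed) in cases.items():
    theta = ks_threshold(n, 0.05)
    verdict = "reject" if reject_gaussian(n, observed) else "cannot reject"
    print(f"{name}: D = {observed:.4f}, theta = {theta:.4f} -> {verdict}")
```

Only the home price data cross the threshold, and they do so by roughly a factor of ten.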
Example of Student t Test
Problem: From the 200 samples of height, determine whether the mean of the first 120 samples differs in a statistically significant way from that of the last 80 samples. The null hypothesis is that the two means are the same. We compute the test statistic $d_t$ and compare it against the t test threshold. If $d_t$ is larger than the threshold we reject the null hypothesis; otherwise we cannot reject it.
Example of Student t Test (Cont'd)
We have
$$\bar{x}_1 = 68.0703, \quad \bar{x}_2 = 67.7694, \quad S_1 = 3.3581, \quad S_2 = 4.3730, \quad S_{12} = 3.7957.$$
Now the test statistic is
$$d_t = \frac{\bar{x}_1 - \bar{x}_2}{S_{12} \sqrt{(1/m_1) + (1/m_2)}} = 0.5492.$$
Let $\Phi_{t,198}$ denote the distribution function of the t distribution with $m_1 + m_2 - 2 = 198$ degrees of freedom. Then the threshold is $\Phi_{t,198}^{-1}(0.95) = 1.6526$. Since the test statistic is smaller than the t test threshold, we cannot reject the null hypothesis that both means are the same.
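The arithmetic can be reproduced from the summary statistics alone. The sketch below assumes that $S_{12}$ is the usual pooled standard deviation, $\sqrt{((m_1 - 1) S_1^2 + (m_2 - 1) S_2^2) / (m_1 + m_2 - 2)}$. The slide does not state this formula, but it reproduces the quoted value 3.7957. The threshold 1.6526 is copied from the slide rather than computed, and the function names are hypothetical.

```python
import math

def pooled_std(s1, m1, s2, m2):
    """Pooled standard deviation (assumed definition of S_12)."""
    return math.sqrt(((m1 - 1) * s1**2 + (m2 - 1) * s2**2) / (m1 + m2 - 2))

def t_statistic(x1, x2, s12, m1, m2):
    """Two-sample t statistic d_t = (x1 - x2) / (s12 * sqrt(1/m1 + 1/m2))."""
    return (x1 - x2) / (s12 * math.sqrt(1.0 / m1 + 1.0 / m2))

# Summary statistics quoted in the slides.
m1, m2 = 120, 80
x1, x2 = 68.0703, 67.7694
s1, s2 = 3.3581, 4.3730

s12 = pooled_std(s1, m1, s2, m2)
d_t = t_statistic(x1, x2, s12, m1, m2)

# 0.95 quantile of the t distribution with 198 degrees of freedom,
# taken from the slide (not computed here).
threshold = 1.6526

print(f"S_12 = {s12:.4f}, d_t = {d_t:.4f}")
print("reject null hypothesis:", d_t > threshold)
```

Because the threshold is the 0.95 quantile and only $d_t$ itself (not its absolute value) is compared, this is a one-sided test as stated.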
Example
Again take the height data, with $m_1 = 100$, $m_2 = 10$. So we are testing whether the next 10 samples have the same variance as the first 100 samples. In this case
$$V_1 = 3.3413, \quad S_2 = 26.6335, \quad \frac{S_2}{V_1} = 7.9711,$$
while
$$x_l = \Phi_{\chi^2,9}^{-1}(0.05) = 3.3251, \quad x_u = \Phi_{\chi^2,9}^{-1}(0.95) = 16.9190.$$
Since the chi-squared test statistic $S_2 / V_1$ lies within the interval $[x_l, x_u]$, we accept the null hypothesis.
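The final comparison is a simple interval check. The sketch below recomputes the ratio from the rounded values on the slide and tests it against the two chi-squared quantiles, which are copied from the slide rather than computed; the function name is hypothetical.

```python
def within_interval(statistic, lower, upper):
    """True if the test statistic lies in [lower, upper]."""
    return lower <= statistic <= upper

# Values quoted in the slides.
v1 = 3.3413    # V_1, from the first 100 samples
s2 = 26.6335   # S_2, from the next 10 samples
x_l = 3.3251   # 0.05 quantile, chi-squared with 9 degrees of freedom
x_u = 16.9190  # 0.95 quantile, chi-squared with 9 degrees of freedom

statistic = s2 / v1
print(f"test statistic = {statistic:.3f}")
print("inside [x_l, x_u]:", within_interval(statistic, x_l, x_u))
```

The ratio computed from the rounded inputs can differ from the slide's 7.9711 in the last digit, which does not affect the outcome.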