STA 248 H1S Winter 2008 Assignment 1 Solutions

1. (a) Measures of location:
i. The mean, $\sum_{i=1}^{100} x_i/100$, can be made arbitrarily large if one of the $x_i$ is made arbitrarily large, since the sample size is fixed at 100.
ii. Since the median is the value in the middle of the ordered data, its value is unaffected by the largest value, so making the largest value arbitrarily large will not change the median.
iii. To calculate the 10% trimmed mean we first remove the 10 largest and 10 smallest data values, so the one value that is made arbitrarily large is discarded and the 10% trimmed mean does not change.
(b) Measures of spread:
i. The standard deviation is $\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2/(n-1)}$, where $\bar{x}$ is the mean. Making one data value arbitrarily large moves the mean away from the other, unchanged $x_i$ and leaves the altered value far from the mean, so the squared deviations, and hence the standard deviation, also become arbitrarily large.
ii. The interquartile range is the difference between the value such that 75% of the data are below it and the value such that 25% of the data are below it. Making one value arbitrarily large while keeping the rest the same will not affect it.

2. (a) Internet traffic is heaviest on Fridays and lightest on Saturdays and Sundays. Of the weekdays, it is lightest on Wednesdays.
(b) The greatest spread occurs on Fridays and the least on the weekends (Saturday and Sunday).
(c) The distributions all appear to be slightly right-skewed (the median is closer to the left end of the distribution than the right, with longer upper whiskers than lower whiskers), although there is little skew in the distributions on Saturday and Sunday. There are large outliers on Monday, Thursday, and Friday, which may be observations from a long right tail on those days.

3.
(a) R code used:
> par(mfrow=c(3,3))
> hist(rgamma(1000,.5,1/.5),main="alpha=.5, beta=.5")
> hist(rgamma(1000,.5,1/2),main="alpha=.5, beta=2")
> hist(rgamma(1000,.5,1/5),main="alpha=.5, beta=5")
> hist(rgamma(1000,2,1/.5),main="alpha=2, beta=.5")
> hist(rgamma(1000,2,1/2),main="alpha=2, beta=2")
> hist(rgamma(1000,2,1/5),main="alpha=2, beta=5")
> hist(rgamma(1000,5,1/.5),main="alpha=5, beta=.5")
> hist(rgamma(1000,5,1/2),main="alpha=5, beta=2")
> hist(rgamma(1000,5,1/5),main="alpha=5, beta=5")
Resulting plot:
[Figure: 3 x 3 grid of histograms of 1,000 Gamma draws each. Rows: alpha = .5, 2, 5; columns: beta = .5, 2, 5. Horizontal axes: rgamma(1000, alpha, 1/beta).]

The Gamma distribution is right-skewed. The degree of skewness decreases as α increases. As β increases, the values get larger: the whole distribution is stretched to the right, as the ranges of the horizontal axes show.

(b) Based on the answer to part (a), α is the shape parameter and β is the scale parameter.
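The roles of α and β in part (b) can also be checked numerically. The following is a minimal sketch, assuming the shape/scale parameterization used above (mean αβ) and the standard result that the skewness of a Gamma distribution is 2/sqrt(α). The variable names are illustrative and are not part of the original solution.

```r
# Parameter values from part (a)
alpha <- c(0.5, 2, 5)
beta  <- c(0.5, 2, 5)

# Skewness of a Gamma distribution depends only on the shape alpha
skewness <- 2 / sqrt(alpha)

# Quantiles scale linearly with beta: the median with scale beta equals
# beta times the median with scale 1 (shape held fixed at 2 here)
med_scaled <- qgamma(0.5, shape = 2, scale = beta)
med_unit   <- beta * qgamma(0.5, shape = 2, scale = 1)

print(round(skewness, 3))
print(all.equal(med_scaled, med_unit))
```

The skewness falls as α grows and does not involve β, while changing β only rescales the quantiles, which is what "shape" and "scale" mean here.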
(c) R code used:
> mean(rgamma(10,.5,1/.5))
[1] 0.1784772
> mean(rgamma(10,.5,1/.5))
[1] 0.2272639
> mean(rgamma(10,.5,1/.5))
[1] 0.4029552
> mean(rgamma(1000,.5,1/.5))
[1] 0.2466551
> mean(rgamma(1000,.5,1/.5))
[1] 0.2562755
> mean(rgamma(1000,.5,1/.5))
[1] 0.2270936
> mean(rgamma(1000000,.5,1/.5))
[1] 0.2503406
> mean(rgamma(1000000,.5,1/.5))
[1] 0.2498957
> mean(rgamma(1000000,.5,1/.5))
[1] 0.2499907

The mean of the Gamma distribution from which we are generating samples is αβ = 0.25. The estimated means from the random samples are closer to 0.25 for larger sample sizes. This is an illustration of the Law of Large Numbers.

(d) R code used:
> par(mfrow=c(2,3))
> size10samples = matrix(rgamma(1000,.5,1/.5),nrow=10)
> hist(apply(size10samples,2,mean),main="means of samples n=10")
> size10samples = matrix(rgamma(1000,.5,1/.5),nrow=10)
> hist(apply(size10samples,2,mean),main="means of samples n=10")
> size10samples = matrix(rgamma(1000,.5,1/.5),nrow=10)
> hist(apply(size10samples,2,mean),main="means of samples n=10")
> size1000samples = matrix(rgamma(100000,.5,1/.5),nrow=1000)
> hist(apply(size1000samples,2,mean),main="means n=1000")
> size1000samples = matrix(rgamma(100000,.5,1/.5),nrow=1000)
> hist(apply(size1000samples,2,mean),main="means n=1000")
> size1000samples = matrix(rgamma(100000,.5,1/.5),nrow=1000)
> hist(apply(size1000samples,2,mean),main="means n=1000")
Resulting plot:
[Figure: 2 x 3 grid of histograms of sample means. Top row: three panels titled "Means of samples n=10" (horizontal axis: apply(size10samples, 2, mean)). Bottom row: three panels titled "Means n=1000" (horizontal axis: apply(size1000samples, 2, mean)).]

For samples of size 10, the distribution of the means is right-skewed, while for samples of size 1,000 the distribution of the means is closer to symmetric and bell-shaped. This is an illustration of the Central Limit Theorem.

(e) When α = 1, $\Gamma(\alpha) = \int_0^\infty e^{-z}\,dz = 1$, so the Gamma density function, $f(x) = x^{\alpha-1} e^{-x/\beta} / (\Gamma(\alpha)\beta^{\alpha})$, simplifies to the exponential density function, $f(x) = e^{-x/\beta}/\beta$. Since the mean of the Gamma distribution is $\alpha\beta$, the mean of the exponential distribution is $\beta$. And since the variance of the Gamma distribution is $\alpha\beta^2$, the variance of the exponential distribution is $\beta^2$.
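The reduction in part (e) can be confirmed numerically. This is a small sketch rather than part of the original solution; the value of beta and the evaluation grid x are arbitrary choices made for illustration.

```r
beta <- 2                      # illustrative scale value (an assumption)
x <- seq(0.1, 10, by = 0.1)    # arbitrary evaluation grid

# Gamma(1) is the integral of exp(-z) over (0, Inf), which equals 1
gamma_one <- integrate(function(z) exp(-z), lower = 0, upper = Inf)$value

# Gamma density with shape 1 and scale beta versus
# exponential density with rate 1/beta (that is, mean beta)
dens_gamma <- dgamma(x, shape = 1, scale = beta)
dens_exp   <- dexp(x, rate = 1 / beta)

print(gamma_one)
print(all.equal(dens_gamma, dens_exp))
```

Note that dexp() is parameterized by the rate, so a mean of β corresponds to rate = 1/β, the same convention used with rgamma() in part (a).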
4. (a) R code used:
> tcpdata = read.table("packetdata_dat.txt",header=TRUE)
> timestamp=tcpdata$timestamp[tcpdata$databytes!=0]
> databytes=tcpdata$databytes[tcpdata$databytes!=0]
> length(databytes)
[1] 31656

The sample size excluding packets with 0 databytes is 31,656.

(b) The R code
> hist(databytes)
produces
[Figure: "Histogram of databytes"; horizontal axis: databytes, from 0 to 1500.]

The distribution of the number of data bytes in the packets is bimodal, with peaks at small values (0-100) and medium values (500-600) and with relatively few large values.

(c) R code and output:
> summary(databytes)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    1.0    31.0   407.0   291.7   512.0  1460.0
> summary(databytes[databytes<300])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00    5.00   27.00   57.61   51.00  299.00
> summary(databytes[databytes>=300])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  300.0   512.0   512.0   512.3   512.0  1460.0
i. The summary statistics for all values of data bytes are not useful because of the bimodal shape of the distribution; for example, they give a mean between the 2 modes, where there is relatively little data. The summary statistics for the large and small packets are more useful, as addressed further in part (ii).
ii. For small packets, the distribution of packet size is right-skewed. This is evident because the mean is larger than the median. (It is very right-skewed, as the mean is even larger than the third quartile!) This is also evident because the difference between the maximum and the median is much larger than the difference between the median and the minimum. For large packets, the distribution of packet size is dominated by the mode in the 500-600 range. This is evident in the summary statistics, as the first quartile, median, and third quartile are all very close. This distribution is much more symmetric than the distribution of the size of small packets, as the mean and median are very close.

(d) R code to calculate inter-arrival times:
> interarrivals=numeric(0)
> for (i in 1:(length(timestamp)-1))
+ interarrivals[i]=timestamp[i+1]-timestamp[i]

i. R code and output:
> hist(interarrivals)
> boxplot(interarrivals)
> title("boxplot of interarrivals")
> plot.ecdf(interarrivals,do.points=FALSE,verticals=TRUE,main=NULL)
> title("ecdf of interarrivals")
> mean(interarrivals)
[1] 0.003210678
> var(interarrivals)
[1] 2.166012e-05
and resulting graphs:
[Figures: "Histogram of interarrivals" (horizontal axis: interarrivals), "Boxplot of interarrivals", and "ECDF of interarrivals" (Fn(x) against x).]

ii. As can be seen in the histogram, the distribution of inter-arrival times is severely right-skewed. This is evident in the boxplot from the extremely short left whisker and the many points that extend beyond the right whisker.
The right-skewed shape is evident in the empirical distribution function because for small values of inter-arrival times it rises quickly, indicating that small values are more likely, but for large values it becomes very flat, indicating that large values are less likely and contribute less to the cumulative probability.
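The inter-arrival calculation in part (d) can also be written without growing a vector inside a loop. The sketch below uses a short, made-up timestamp vector (hypothetical values in arbitrary time units, not the assignment data) to show that diff() gives the same result as the loop.

```r
# Hypothetical timestamps in arbitrary time units; not taken from the assignment data
timestamp <- c(0.000, 0.002, 0.003, 0.010, 0.011, 0.045)

# Loop version, as in part (d), but with the result vector preallocated
n <- length(timestamp)
interarrivals_loop <- numeric(n - 1)
for (i in seq_len(n - 1)) {
  interarrivals_loop[i] <- timestamp[i + 1] - timestamp[i]
}

# Vectorized equivalent: successive differences
interarrivals <- diff(timestamp)

print(all.equal(interarrivals_loop, interarrivals))

# A mean well above the median is a quick numerical sign of right skew
print(mean(interarrivals) > median(interarrivals))
```

With about 31,000 packets either version runs quickly, so this is a matter of idiom rather than necessity.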
iii. This is most easily shown by marking the values on the plots. On the boxplot, the minimum is the smallest and the maximum is the largest value shown; the 1st and 3rd quartiles are the lower and upper limits of the box, and the median is the line in the centre of the box, which is very close to the lower limit of the box. On the empirical cumulative distribution function, the minimum occurs where it first becomes non-zero (very close to 0.00) and the maximum occurs where the graph reaches 1 (which is difficult to tell with any accuracy since it rises so slowly at the end). The first quartile is the value on the horizontal axis corresponding to 0.25 on the vertical axis, the third quartile is the value on the horizontal axis corresponding to 0.75 on the vertical axis, and the median is the value on the horizontal axis corresponding to 0.5 on the vertical axis.

iv. A. R code and output:
> sum(interarrivals>1/60)/length(interarrivals)
[1] 0.02192387
> 1-pexp(1/60,1/mean(interarrivals))
[1] 0.005566369

So 0.022 is the fraction of the observed inter-arrival times greater than 1/60 (one second), but the exponential distribution model predicts that this will be only 0.0056 of the observations.

B. R code to generate a histogram of randomly generated values from an exponential distribution:
> hist(rexp(length(interarrivals),1/mean(interarrivals)),
+ main="exponential sample",xlim=c(0,max(interarrivals)))
and the resulting graph:
[Figure: histogram titled "Exponential sample"; horizontal axis: rexp(length(interarrivals), 1/mean(interarrivals)), from 0.00 to 0.06.]

While both this histogram and the histogram of observed inter-arrival times from part (d) i. are right-skewed, the histogram of the exponentially distributed values drops off more quickly (with no observations greater than 0.035) than the histogram of the observed inter-arrival times.
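The model-based tail probability in part A follows directly from the exponential survival function, P(X > t) = exp(-t/β). The sketch below reproduces it from the reported sample mean, as a check on the pexp() call; the variable names are illustrative, and the numeric inputs are copied from the output above.

```r
mean_ia   <- 0.003210678   # sample mean of the inter-arrival times (reported above)
threshold <- 1 / 60        # threshold used in part A

# Exponential survival function with beta set to the sample mean
p_closed <- exp(-threshold / mean_ia)

# Same quantity through pexp(), which takes the rate 1/beta
p_pexp <- 1 - pexp(threshold, rate = 1 / mean_ia)

# How many times larger the observed fraction (0.02192387) is than the model's
ratio <- 0.02192387 / p_closed

print(p_closed)
print(ratio)
```

The observed fraction is roughly four times the exponential prediction, which matches the heavier right tail seen when the two histograms are compared in part B.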
C. R code and output:
> mean(interarrivals)
[1] 0.003210678
> var(interarrivals)
[1] 2.166012e-05

For an exponential distribution with β = 0.0032 (the mean of the inter-arrival times), the variance will be $\beta^2 = 0.0000103$. The inter-arrival times have the same mean (we generated the exponential values so that they would), but the variance is more than twice what would be expected from an exponential distribution. This is consistent with the larger values observed in the histogram of the observed inter-arrival times.
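The comparison in part C amounts to a short piece of arithmetic, sketched here with the two values reported above. The coefficient of variation is added only as an illustration (it is not part of the original solution); it equals 1 for any exponential distribution.

```r
mean_ia <- 0.003210678    # sample mean reported above
var_ia  <- 2.166012e-05   # sample variance reported above

# Variance implied by an exponential model with beta equal to the sample mean
var_exp <- mean_ia^2

# Observed variance relative to the exponential prediction
ratio <- var_ia / var_exp

# Coefficient of variation (standard deviation / mean)
cv <- sqrt(var_ia) / mean_ia

print(var_exp)
print(ratio)
print(cv)
```

A ratio a little above 2 and a coefficient of variation well above 1 both indicate more variability than the exponential model allows.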