STA 248 H1S Winter 2008 Assignment 1 Solutions


1. (a) Measures of location:

i. The mean, $\sum_{i=1}^{100} x_i / 100$, can be made arbitrarily large by making one of the $x_i$ arbitrarily large, since the sample size is fixed at 100.

ii. Since the median is the value in the middle of the ordered data, its value is unaffected by the largest value, so making that value arbitrarily large will not change the median.

iii. To calculate the 10% trimmed mean we first remove the 10 largest and 10 smallest data values, so the one value that is made arbitrarily large is discarded and the 10% trimmed mean will not change.

(b) Measures of spread:

i. The standard deviation is $\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 / (n-1)}$, where $\bar{x}$ is the mean. Since the mean can be made arbitrarily large by changing one data value while the rest of the $x_i$ remain the same, the deviations $x_i - \bar{x}$ grow without bound and the standard deviation also becomes arbitrarily large.

ii. The interquartile range is the difference between the value such that 75% of the data are below it and the value such that 25% of the data are below it. Making one value arbitrarily large while keeping the rest the same will not affect it.

2. (a) Internet traffic is heaviest on Fridays and lightest on Saturdays and Sundays. Of the weekdays, it is lightest on Wednesdays.

(b) The greatest spread occurs on Fridays and the least on the weekends (Saturday and Sunday).

(c) The distributions all appear to be slightly right-skewed (the median is closer to the left end of the distribution than the right, with longer upper whiskers than lower whiskers), although there is little skew in the distributions on Saturday and Sunday. There are large outliers on Monday, Thursday, and Friday, which may be observations from a long right tail on those days.

3. (a) R code used:

    > par(mfrow=c(3,3))
    > hist(rgamma(1000,.5,1/.5),main="alpha=.5, beta=.5")
    > hist(rgamma(1000,.5,1/2),main="alpha=.5, beta=2")
    > hist(rgamma(1000,.5,1/5),main="alpha=.5, beta=5")
    > hist(rgamma(1000,2,1/.5),main="alpha=2, beta=.5")
    > hist(rgamma(1000,2,1/2),main="alpha=2, beta=2")
    > hist(rgamma(1000,2,1/5),main="alpha=2, beta=5")
    > hist(rgamma(1000,5,1/.5),main="alpha=5, beta=.5")
    > hist(rgamma(1000,5,1/2),main="alpha=5, beta=2")
    > hist(rgamma(1000,5,1/5),main="alpha=5, beta=5")

Resulting plot:

[Figure: 3 x 3 grid of histograms. Rows: alpha = .5, 2, 5; columns: beta = .5, 2, 5. Each panel is titled "alpha=..., beta=..." and its horizontal axis is labelled with the corresponding rgamma(1000, alpha, 1/beta) call.]

The Gamma distribution is right-skewed. The degree of skewness decreases as α increases. As β increases, the values (the location) get larger.

(b) Based on the answer to part (a), α is the shape parameter and β is the scale parameter.
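The reading of the histograms in parts (a) and (b) can also be checked numerically. The following is a minimal sketch, not part of the original solution: it assumes the standard result that a Gamma distribution with shape α has skewness 2/sqrt(α), uses a larger sample size than the histograms so that the estimates are stable, and defines a hypothetical helper sample_skewness(), which is not a base R function.

```r
# Hypothetical helper (not in base R): moment-based sample skewness.
sample_skewness = function(x) {
  m = mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(248)            # arbitrary seed, for reproducibility only
n = 1e5                  # assumption: larger than the 1000 draws used above
alphas = c(0.5, 2, 5)
betas = c(0.5, 2, 5)

# Skewness for each alpha at fixed beta = 2 (third argument is rate = 1/beta).
skew_by_alpha = sapply(alphas, function(a) sample_skewness(rgamma(n, a, 1/2)))
theory_skew = 2 / sqrt(alphas)   # depends on alpha only, not on beta

# Mean for each beta at fixed alpha = 2; the Gamma mean is alpha * beta.
mean_by_beta = sapply(betas, function(b) mean(rgamma(n, 2, 1/b)))

print(rbind(alpha = alphas, sample = skew_by_alpha, theory = theory_skew))
print(rbind(beta = betas, sample = mean_by_beta, theory = 2 * betas))
```

The sample skewness should fall as α grows while the sample mean grows in proportion to β, which is what "shape" and "scale" mean here.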

(c) R code used:

    > mean(rgamma(10,.5,1/.5))
    [1] 0.1784772
    > mean(rgamma(10,.5,1/.5))
    [1] 0.2272639
    > mean(rgamma(10,.5,1/.5))
    [1] 0.4029552
    > mean(rgamma(1000,.5,1/.5))
    [1] 0.2466551
    > mean(rgamma(1000,.5,1/.5))
    [1] 0.2562755
    > mean(rgamma(1000,.5,1/.5))
    [1] 0.2270936
    > mean(rgamma(1000000,.5,1/.5))
    [1] 0.2503406
    > mean(rgamma(1000000,.5,1/.5))
    [1] 0.2498957
    > mean(rgamma(1000000,.5,1/.5))
    [1] 0.2499907

The mean for the Gamma distribution from which we are generating samples is αβ = 0.25. The estimated means from the random samples are closer to 0.25 for larger sample sizes. This is an illustration of the Law of Large Numbers.

(d) R code used:

    > par(mfrow=c(2,3))
    > size10samples = matrix(rgamma(1000,.5,1/.5),nrow=10)
    > hist(apply(size10samples,2,mean),main="Means of samples n=10")
    > size10samples = matrix(rgamma(1000,.5,1/.5),nrow=10)
    > hist(apply(size10samples,2,mean),main="Means of samples n=10")
    > size10samples = matrix(rgamma(1000,.5,1/.5),nrow=10)
    > hist(apply(size10samples,2,mean),main="Means of samples n=10")
    > size1000samples = matrix(rgamma(100000,.5,1/.5),nrow=1000)
    > hist(apply(size1000samples,2,mean),main="Means n=1000")
    > size1000samples = matrix(rgamma(100000,.5,1/.5),nrow=1000)
    > hist(apply(size1000samples,2,mean),main="Means n=1000")
    > size1000samples = matrix(rgamma(100000,.5,1/.5),nrow=1000)
    > hist(apply(size1000samples,2,mean),main="Means n=1000")

Resulting plot:

[Figure: 2 x 3 grid of histograms of sample means. Top row: three panels titled "Means of samples n=10", horizontal axis apply(size10samples, 2, mean). Bottom row: three panels titled "Means n=1000", horizontal axis apply(size1000samples, 2, mean).]

For samples of size 10, the distribution of the means is right-skewed, while for samples of size 1,000 the distribution of the means is closer to symmetric and bell-shaped. This is an illustration of the Central Limit Theorem.

(e) When α = 1, $\Gamma(\alpha) = \int_0^\infty e^{-z}\,dz = 1$, so the Gamma density function simplifies to the exponential density function. Since the mean of the Gamma distribution is $\alpha\beta$, the mean of the exponential distribution is $\beta$. And since the variance of the Gamma distribution is $\alpha\beta^2$, the variance of the exponential distribution is $\beta^2$.
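The simplification in part (e) can be written out explicitly. The sketch below assumes the shape-scale form of the Gamma density, which is consistent with the mean αβ used in part (c); the course notes may write the density in a slightly different but equivalent way.

```latex
% Gamma density with shape \alpha and scale \beta (assumed parameterization)
f(x;\alpha,\beta) = \frac{x^{\alpha-1} e^{-x/\beta}}{\Gamma(\alpha)\,\beta^{\alpha}},
\qquad x > 0 .

% Setting \alpha = 1 and using \Gamma(1) = \int_0^\infty e^{-z}\,dz = 1:
f(x;1,\beta) = \frac{x^{0} e^{-x/\beta}}{1 \cdot \beta}
             = \frac{1}{\beta}\, e^{-x/\beta},
\qquad x > 0 ,

% which is the exponential density with mean \beta. The moments follow directly:
E[X] = \alpha\beta \,\Big|_{\alpha=1} = \beta ,
\qquad
\operatorname{Var}(X) = \alpha\beta^{2} \,\Big|_{\alpha=1} = \beta^{2} .
```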

4. (a) R code used:

    > tcpdata = read.table("packetdata_dat.txt",header=TRUE)
    > timestamp=tcpdata$timestamp[tcpdata$databytes!=0]
    > databytes=tcpdata$databytes[tcpdata$databytes!=0]
    > length(databytes)
    [1] 31656

The sample size excluding packets with 0 databytes is 31,656.

(b) The R code

    > hist(databytes)

produces

[Figure: "Histogram of databytes"; horizontal axis: databytes, from 0 to 1500; vertical axis from 0 to 15000.]

The distribution of the number of data bytes in the packets is bimodal, with peaks at small values (0-100) and medium values (500-600) and with relatively few large values.

(c) R code and output:

    > summary(databytes)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
        1.0    31.0   407.0   291.7   512.0  1460.0
    > summary(databytes[databytes<300])
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
       1.00    5.00   27.00   57.61   51.00  299.00
    > summary(databytes[databytes>=300])
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      300.0   512.0   512.0   512.3   512.0  1460.0

i. The summary statistics for all values of data bytes are not useful because of the bimodal shape of the distribution; for example, they give a mean between the 2 modes, where there is relatively little data. The summary statistics for the large and small packets are more useful, as addressed further in part (ii).

ii. For small packets, the distribution of packet size is right-skewed. This is evident because the mean is larger than the median. (It is very right-skewed, as the mean is even larger than the third quartile!) It is also evident because the difference between the maximum and the median is much larger than the difference between the median and the minimum. For large packets, the distribution of packet size is dominated by the mode in the 500-600 range. This is evident in the summary statistics, as the first quartile, median, and third quartile are all very close. This distribution is much more symmetric than the distribution of the size of small packets, as the mean and median are very close.

(d) R code to calculate inter-arrival times:

    > interarrivals=numeric(0)
    > for (i in 1:(length(timestamp)-1))
    +   interarrivals[i]=timestamp[i+1]-timestamp[i]

i. R code and output:

    > hist(interarrivals)
    > boxplot(interarrivals)
    > title("Boxplot of interarrivals")
    > plot.ecdf(interarrivals,do.points=FALSE,verticals=TRUE,main=NULL)
    > title("ECDF of interarrivals")
    > mean(interarrivals)
    [1] 0.003210678
    > var(interarrivals)
    [1] 2.166012e-05

and resulting graphs:

[Figures: "Histogram of interarrivals" (horizontal axis: interarrivals); "Boxplot of interarrivals"; "ECDF of interarrivals" (horizontal axis: x, vertical axis: Fn(x)).]

ii. As can be seen in the histogram, the distribution of inter-arrival times is severely right-skewed. This is evident in the boxplot from the extremely short lower whisker and the many points that extend beyond the upper whisker. The right-skewed shape is evident in the empirical distribution function because it rises quickly for small inter-arrival times, indicating that small values are more likely, but becomes very flat for large values, indicating that large values are less likely and contribute less to the cumulative probability.
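The loop in part (d) computes successive differences of the timestamps, which base R's diff() does in a single call. The sketch below compares the two on a short made-up vector; timestamp_demo and its values are illustrative assumptions and are not taken from the packet data.

```r
# Made-up timestamps for illustration only (not the packet data).
timestamp_demo = c(0.000, 0.002, 0.003, 0.010, 0.011)

# Loop version, following the structure used in part (d).
loop_gaps = numeric(0)
for (i in 1:(length(timestamp_demo) - 1))
  loop_gaps[i] = timestamp_demo[i + 1] - timestamp_demo[i]

# Vectorized version: diff() returns x[i+1] - x[i] for each i.
diff_gaps = diff(timestamp_demo)

print(all.equal(loop_gaps, diff_gaps))
```

On the real data this would read interarrivals = diff(timestamp), which avoids growing a vector one element at a time.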

iii. This is most easily shown by marking the values on the plots. On the boxplot, the minimum is the smallest and the maximum is the largest value shown; the 1st and 3rd quartiles are the lower and upper limits of the box, and the median is the line inside the box, which lies very close to the lower limit of the box. On the empirical cumulative distribution function, the minimum occurs where the function first becomes non-zero (very close to 0.00) and the maximum occurs where the graph reaches 1 (which is difficult to read with any accuracy since the function rises so slowly at the end). The first quartile is the value on the horizontal axis corresponding to 0.25 on the vertical axis, the third quartile is the value on the horizontal axis corresponding to 0.75 on the vertical axis, and the median is the value on the horizontal axis corresponding to 0.5 on the vertical axis.

iv. A. R code and output:

    > sum(interarrivals>1/60)/length(interarrivals)
    [1] 0.02192387
    > 1-pexp(1/60,1/mean(interarrivals))
    [1] 0.005566369

So 0.022 is the fraction of the observed inter-arrival times greater than one second (the threshold 1/60 used in the code), but the exponential distribution model predicts that this will be only 0.0056 of the observations.

B. R code to generate a histogram of randomly generated values from an exponential distribution:

    > hist(rexp(length(interarrivals),1/mean(interarrivals)),
    +   main="Exponential sample",xlim=c(0,max(interarrivals)))

and the resulting graph:

[Figure: "Exponential sample"; horizontal axis: rexp(length(interarrivals), 1/mean(interarrivals)).]

While both this histogram and the histogram of observed inter-arrival times from part (d) i. are right-skewed, the histogram of the exponentially distributed values drops off more quickly (with no observations greater than 0.035) than the histogram of the observed inter-arrival times.
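The model prediction in part A can be reproduced from the reported mean alone, using the exponential tail formula P(X > t) = exp(-t/β). This is a minimal sketch that hard-codes the values printed above, since the packet data file is not available here; the variable names are illustrative.

```r
# Values copied from the R output reported in parts (d) i. and iv. A.
beta_hat = 0.003210678          # mean(interarrivals)
observed_fraction = 0.02192387  # fraction of inter-arrival times above 1/60
threshold = 1/60

# Exponential tail probability, written two equivalent ways.
tail_closed_form = exp(-threshold / beta_hat)
tail_pexp = 1 - pexp(threshold, rate = 1 / beta_hat)

# How many times more often the data exceed the threshold than the model predicts.
ratio = observed_fraction / tail_pexp

print(c(closed_form = tail_closed_form, pexp = tail_pexp, ratio = ratio))
```

The two tail computations agree, and the ratio shows that the observed data exceed the threshold several times more often than the exponential model with the same mean would suggest.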

C. R code and output:

    > mean(interarrivals)
    [1] 0.003210678
    > var(interarrivals)
    [1] 2.166012e-05

For an exponential distribution with β = 0.0032 (the mean of the inter-arrival times) the variance will be $\beta^2 = 0.0000103$. The inter-arrival times have the same mean (we generated the exponential values so that they would), but the variance is more than twice what would be expected from an exponential distribution. This is consistent with the larger values observed in the histogram of the observed inter-arrival times.
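The comparison in part C is a short calculation on the two reported numbers. The sketch below hard-codes them and also computes the coefficient of variation (standard deviation divided by mean), which equals 1 for any exponential distribution; the coefficient of variation is an added diagnostic, not something the original solution uses.

```r
# Values copied from the R output above.
mean_obs = 0.003210678
var_obs = 2.166012e-05

# Variance implied by an exponential model with the same mean: beta^2.
var_exp_model = mean_obs^2

# Ratio of observed to model variance.
variance_ratio = var_obs / var_exp_model

# Coefficient of variation; an exponential distribution has CV = 1.
cv_obs = sqrt(var_obs) / mean_obs

print(c(var_exp_model = var_exp_model,
        variance_ratio = variance_ratio,
        cv_obs = cv_obs))
```

A variance ratio of about 2.1 matches the statement that the observed variance is more than twice the exponential value, and a coefficient of variation well above 1 points the same way.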