Lab 9 Distributions and the Central Limit Theorem

Distributions: You will need to become familiar with at least 5 types of distributions in your Introductory Statistics study:

  the Normal distribution, N(μ, σ), where the mean = μ and the standard deviation = σ
  the Binomial distribution, B(n, p), where n is the number of trials and p is the probability of success on each trial
  the Uniform distribution, U(a, b), where a is the minimum value and b is the maximum value
  the Exponential distribution, exp(λ), where the density function is λ e^(-λx)
  the Poisson distribution, pois(λ), where pois(k events in interval) = λ^k e^(-λ) / k!

We have already worked with the normal distribution, and your textbook discusses the usefulness of the others.

Normal Distribution: This distribution, whose density function is shown below, is symmetrical and bell-shaped, and is completely described by 2 parameters, the mean μ and the standard deviation σ. It is a continuous distribution.

    # Distributions
    # normal N(12,3)
    curve(dnorm(x, mean=12, sd=3), xlim=c(2, 22), ylim=c(0,.15), ylab="density", main="N(12,3)")
    abline(v=12, lty=2)   # dashed vertical line at the mean

Binomial Distribution: The next distribution is the Binomial, where B(n, p) stands for the binomial with n trials, each trial having an independent probability p of being a success. The binomial is a discrete distribution.
    # binomial B(15,.2)
    heights <- dbinom(0:15, size=15, prob=.2)   # probability of each k = 0, ..., 15
    plot(0:15, heights, type="h", main="spike plot of binom(x)", xlab="k", ylab="p.m.f.")
    points(0:15, heights, pch=20, cex=1)        # dot on top of each spike

Uniform Distribution: The Uniform distribution, a continuous distribution, has the form U(a, b), where a is the minimum value and b is the maximum value of x.

    # uniform(3,12)
    curve(dunif(x, min=3, max=12), xlim=c(0,15), ylab="density", main="u(3,12)")

Poisson Distribution: The Poisson distribution is a discrete distribution, with only one parameter, λ.
Below is pois(7).

    # Poisson pois(7)
    heights <- dpois(0:15, lambda=7)       # probability of each k = 0, ..., 15
    plot(0:15, heights, type="h", main="spike plot of pois(x)", xlab="k", ylab="p.m.f.")
    points(0:15, heights, pch=20, cex=1)   # dot on top of each spike

The Poisson can sometimes look like the Normal, except that it is discrete, whereas the Normal is continuous.

Exponential Distribution: The Exponential distribution (decay function) is a continuous distribution, with only one parameter, λ, the decay parameter. It is very right skewed. Below is exp(.35).

    # exponential
    curve(dexp(x, rate=.35), xlim=c(0,25), ylab="density", main="exponential exp(.35)")
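The two formulas given in the list at the start of this lab can be checked against R's built-in dexp() and dpois() functions. The lines below are only a sketch: the evaluation points in x and the names manual_exp and manual_pois are made up for illustration and are not part of the lab's own code.

```r
# Exponential density, written out by hand as lambda * e^(-lambda * x)
lambda <- 0.35
x <- c(0, 1, 2.5, 10)            # arbitrary points chosen for this check
manual_exp <- lambda * exp(-lambda * x)
print(all.equal(manual_exp, dexp(x, rate = lambda)))

# Poisson probabilities, written out by hand as lambda^k * e^(-lambda) / k!
lam <- 7
k <- 0:15
manual_pois <- lam^k * exp(-lam) / factorial(k)
print(all.equal(manual_pois, dpois(k, lambda = lam)))

# The binomial spike heights cover every possible k, so they should add up to 1
print(all.equal(sum(dbinom(0:15, size = 15, prob = 0.2)), 1))
```

If the hand-written formulas agree with the built-in functions, each all.equal() call should report TRUE.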
Notice that for all of the discrete distributions we had to use different R coding from the usual curve() command, using plot() and points() instead to make spike plots of the values. This distinguishes them from the continuous distributions.

Homework [1]: I made 6 variables in lab9.csv (labeled sample1 through sample6) from the list below, using the rnorm(), runif(), rbinom(), rexp(), and rpois() functions. Each of these R commands samples 20 values from its respective distribution. Make histograms of sample1 through sample6 and match them up with their parent distributions in the table below. Use your detective skills and the techniques from previous labs/study to accomplish this task.

Homework [2]: Make a vector of 200 values sampled from the pois(7) distribution, find the mean of the sample, and compare it with the theoretical mean expected.

Homework [3]: Repeat [2] with a vector of 200 values sampled from the exp(.35) distribution.

Quantile-quantile plots: Previously, we superimposed density plots of normal curves on our histograms as a rough check of how normal the histogram distribution was. We have another way to compare a distribution to its normal counterpart, the QQ plot. See picture below.
In figure 3.4 we have the values at every 5th percentile of our skewed distribution on the y axis and the values at every 5th percentile of its corresponding normal distribution on the x axis. Each dot plotted is a pair of matching percentiles (normal, y distribution). The closer the tested distribution (on the y axis) is to normal, the more the dots will line up in a straight line. In our figure above, we see a large bow in the dots, indicating that the distribution on the y axis is not normally distributed.

The graphs below show, from left to right, QQ plots which result from short symmetric, average symmetric, and long symmetric distributions on the y axis.

The QQ plots below are from y distributions which are, from left to right, short skew, regular skew, and long skew. Note that the short skew distribution is also discrete. Note also that skew-right distributions tend to bow downward, with the middle dots sagging below a straight line (a U-shaped, concave-up curve), and skew-left distributions tend to bow upward (an arch-shaped, concave-down curve).

Below is a QQ plot, with the R code that made it, for a 50-element sample from the N(10,3) distribution along the y axis.
    # qq plots
    vec1 <- rnorm(n=50, mean=10, sd=3)   # 50 values sampled from N(10,3)
    qqnorm(vec1)                         # sample quantiles (y) against normal quantiles (x)
    qqline(vec1)                         # reference line for a normal sample

Using the qqline() command, we also plotted the line which the points should follow if the distribution (y axis) matches its companion normal (x axis).

Homework [4]: Take the 6 samples from lab9.csv and run qqnorm() plots, with the qqline() reference line drawn. Comment on your results.

Extra Information with this lab:

[1] I recommend creating data and doing various editing of the data in the EXCEL spreadsheet environment, saving the data as a .csv (comma delimited) or .txt (space delimited) file. You can also do editing in the RStudio environment, where you type/execute data1 <- edit(data1) in the Console window, assuming the name of your data is data1. A spreadsheet view of the data will appear, where you can edit the data and then return to the program with the edited file. (edit() returns the edited copy, so assigning the result back to data1 is what keeps your changes.) To save the result for later use, type save(data1, file="file1.rda") in your R workspace. To retrieve it later, type load("file1.rda").

[2] On scatter plots you can make at least 25 basic kinds of dots, using the pch= argument. See below for the types.
[3] You can use text() to type text within the plot and mtext() to type text within the margins. text() is positioned with the (x,y) coordinates we have used in previous labs; mtext() is positioned with side=1 (bottom), side=2 (left), side=3 (top), or side=4 (right). With mtext(), add line=4 to place the text 4 (or however many) lines out from the bottom/left/top/right of the plot, depending on what you used for the side= argument. Use col="red" (or whatever color you want) to color the text. Using cex=.7 (or any other number; the default is cex=1) adjusts the type size. Using adj=1 justifies text far right, adj=0 justifies far left, and a number between 0 and 1 justifies between the right and left; this is usually used to print text next to (left of, right of, etc.) points.

[4] ggplot2 information: graphs are made in layers, using aesthetics, aes(), geoms (the geom_ functions), and various layers of other items. The reference web page shown below gives a much more detailed treatment of these ggplot2 items.
Below are some aesthetics used with geoms in ggplot2.
A generic example is shown below: mydata is the data set, the variables are the x and y variables from the data set, and the name given to the graph in this case is mygraph.

The next example shows how to color the graphs by the category gender.

Now, we add another layer.
Next, a generic histogram.
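The generic steps described above (a named graph built from a data set and its x and y variables, coloring by gender, adding another layer, and a generic histogram) might look roughly like the sketch below. The names mydata, mygraph, and gender come from the text; the data values, the height and weight columns, and the choice of geom_smooth() as the extra layer are assumptions made only for illustration, not the code from the original figures. It assumes the ggplot2 package is installed.

```r
library(ggplot2)

# Hypothetical data standing in for "mydata"; height, weight, and the
# values themselves are made up for this sketch.
set.seed(1)
mydata <- data.frame(
  height = rnorm(40, mean = 170, sd = 10),
  weight = rnorm(40, mean = 70, sd = 12),
  gender = rep(c("F", "M"), each = 20)
)

# Data plus aesthetics, then a geom layer that draws the points
mygraph <- ggplot(mydata, aes(x = height, y = weight)) +
  geom_point()

# Same graph, with the points colored by the category gender
mygraph_color <- ggplot(mydata, aes(x = height, y = weight, color = gender)) +
  geom_point()

# Add another layer on top (a straight-line fit is assumed here)
mygraph_layers <- mygraph_color +
  geom_smooth(method = "lm", se = FALSE)

# A generic histogram of one variable
myhist <- ggplot(mydata, aes(x = height)) +
  geom_histogram(bins = 10)

print(mygraph_layers)
print(myhist)
```

Each + adds one layer to the stored graph, so a graph can be saved under a name and extended later.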
In another lab we will continue with some actual examples of code used to make various graphs and give more features of the ggplot2 package.
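Items [2] and [3] of the extra information can be tried out in base R graphics. The following is a minimal sketch: the grid layout, the name pch_codes, and the label text are made up for illustration, while pch=, text(), mtext(), side=, line=, col=, cex=, and adj= are used as described in those items.

```r
# Lay the plotting symbols out on a small grid (layout chosen only for display)
pch_codes <- 0:25
x <- pch_codes %% 9      # column position, 0 to 8
y <- pch_codes %/% 9     # row position, 0 to 2

# One point per symbol, each drawn with its own pch value
plot(x, y, pch = pch_codes, cex = 1.5,
     xlim = c(-0.5, 8.5), ylim = c(-0.5, 3),
     xlab = "", ylab = "", main = "pch symbols")

# text(): label each point inside the plot, using its (x, y) coordinates
text(x, y, labels = pch_codes, pos = 3, cex = 0.7, col = "red")

# mtext(): a note in the bottom margin, 4 lines out, justified far right
mtext("number above each dot is its pch value",
      side = 1, line = 4, adj = 1, cex = 0.7)
```

Changing side=, line=, or adj= in the mtext() call and re-running the plot is a quick way to see what each argument does.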