Statistics/BioSci 141, Fall 2006
Lab 2: Probability and Probability Distributions
October 13, 2006

1 Using random samples to estimate a probability

Suppose that you are stuck on the following problem:

(SW 3.33) If two carriers of the gene for albinism marry, each of their children has probability 1/4 of being albino. If such a couple has six children, what is the probability that
(a) none will be albino?
(b) at least one will be albino?

One way to solve this problem would be to use the material of section 3.4 or section 3.8, but say that you didn't read these sections. What do you do? One strategy is to estimate the probabilities by drawing random samples. Let's say that 0 represents non-albino and 1 represents albino, and let's create a population vector that has 3 non-albino children for every 1 albino:

> population <- c(0,0,0,1)

Each family of six children is then a random sample of size 6 drawn from this population with replacement:

> sample(population, size=6, replace=TRUE)
[1] 0 0 0 1 0 0
> sample(population, size=6, replace=TRUE)
[1] 0 0 0 0 0 1
> sample(population, size=6, replace=TRUE)
[1] 0 1 1 0 0 1

For event (a), count the samples that contain no albino child and divide by the number of samples you've taken. For event (b), count the samples that have at least one albino child and divide in the same way. The more samples you take, the more accurate your answer will be; in this case, try at least 20 samples to get a rough idea.

Question: why do we need replace=TRUE?

This strategy is called the Monte Carlo method for estimating probabilities. [To get an estimate that's close to the true probability, you need to do at least 500 repetitions, which can get tedious if you're doing it manually.]
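The manual procedure above can be automated. The following is only a sketch, assuming that each family is simulated by one call to sample() and that the calls are repeated with replicate(); the names n_families, albino_counts, p_none and p_at_least_one, the seed, and the choice of 10000 repetitions are illustrative and not part of the lab.

```r
# Sketch: automate the Monte Carlo estimate for the albinism problem.
set.seed(141)                 # arbitrary seed, only so the run is repeatable
population <- c(0, 0, 0, 1)   # 0 = non-albino, 1 = albino (probability 1/4)
n_families <- 10000           # assumed number of simulated families

# For each simulated family, draw six children and count the albino ones.
albino_counts <- replicate(
  n_families,
  sum(sample(population, size = 6, replace = TRUE))
)

# (a) proportion of families with no albino child
p_none <- mean(albino_counts == 0)
# (b) proportion of families with at least one albino child
p_at_least_one <- mean(albino_counts >= 1)

print(c(none = p_none, at_least_one = p_at_least_one))
```

The two estimates always add up to 1, because every simulated family falls into exactly one of the two events.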

2 Random sampling: named distributions

Now consider the following problem:

SW Example 4.2: Eggshell Thickness. In the commercial production of eggs, breakage is a major problem. Consequently, the thickness of the eggshell is an important variable. We approximate the thickness of any given egg to be a normal random variable with mean µ = 0.38 mm and standard deviation σ = 0.03 mm. What is the probability that an egg has thickness 0.35 mm or less?

We can again estimate this probability using random sampling. Let's draw many (say ten thousand) normal random samples with mean µ = 0.38 and standard deviation σ = 0.03:

> simulations <- rnorm(n=10000, mean=0.38, sd=0.03)
> events <- (simulations <= 0.35)
> sum(events)
[1] 1558

Out of our ten thousand random eggshells, 1558 were less than or equal to 0.35 mm thick, so we estimate that the probability is about 0.1558.

You can draw random samples from many other distributions. Some of them are: rbinom for binomial, rpois for Poisson, runif for uniform, and rexp for exponential.

Question: can you think of a better way to solve the albinism problem 3.33 using the function rbinom? Look up the help entry for rbinom and estimate the probability in 3.33 (a) using 10000 random draws, as we did in the eggshell problem. Make sure to use == for equality tests: the symbol = is for variable assignment.

3 Densities

Look again at the albinism problem. Can we calculate exactly the probability of having zero albino children? The method detailed in section 3.8 is to use the binomial distribution with 6 draws and success probability 0.25. If N is the number of albino children, then we can use the function dbinom to calculate P(N = 0):

> dbinom(0, size=6, prob=0.25)
[1] 0.1779785

Is this close to the Monte Carlo estimate you made in part 1?
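As a check on the dbinom value, the same probability can be worked out by hand: zero albino children means six independent non-albino children, so P(N = 0) = (3/4)^6, and event (b) is the complement of event (a). The short sketch below makes that arithmetic explicit; the variable names are illustrative only.

```r
# Sketch: exact probabilities for the albinism problem, two ways.
p_none_exact   <- dbinom(0, size = 6, prob = 0.25)  # P(N = 0) from the binomial density
p_none_by_hand <- (1 - 0.25)^6                      # six non-albino children in a row

# Event (b), "at least one albino", is the complement of event (a).
p_at_least_one_exact <- 1 - p_none_exact

print(c(dbinom = p_none_exact,
        by_hand = p_none_by_hand,
        at_least_one = p_at_least_one_exact))
```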
The R functions for densities are all prefixed with d: examples are dbinom, dpois, dnorm, dunif, dexp.

4 Cumulative distribution functions

How about the eggshell problem: how do we calculate the exact probability of finding an eggshell that's 0.35 mm or thinner? We use the cumulative distribution function (cdf). Let T be a random variable denoting the thickness of a particular egg's shell. The pnorm function gives us P(T <= 0.35):

> pnorm(0.35, mean=0.38, sd=0.03)
[1] 0.1586553

The R functions for cdfs are all prefixed with p (for probability): examples are pbinom, ppois, pnorm, punif, pexp.

Problem: we can model the count of earthworms in a one-square-meter plot of soil as a Poisson random variable with mean parameter λ = 130 earthworms. What is the probability of finding 70 or fewer earthworms in a given plot of soil? How about 110 or more?

5 Plotting densities and cdfs

In section 4, we calculated the probability that an eggshell was 0.35 mm thick or less by using the pnorm function. A more visual strategy is to plot the random variable's density function and infer probabilities from the graph.

5.1 Method 1: random histogram

We can draw random samples as we did in sections 1 and 2, and then draw the histogram of the random samples (here we treat the eggshells example):

> eggshells <- rnorm(1000, mean=0.38, sd=0.03)
> hist(eggshells)

Exercise. Marshall et al. (1990) studied the opening and closing of the nicotinic receptor of frog muscle. They modeled the number of times that the receptor opens per millisecond as an exponential random variable with rate parameter rate = 1.18 openings/ms. Draw 1000 random opening times according to this model and plot the histogram. From this diagram, can you roughly estimate the probability that a receptor opens two or fewer times per millisecond? Between 2 and 4 times?

5.2 Method 2: plotting the exact density

We can also graph the random variable's density function, which is an exact version of the histograms above. The eggshell example:

> range <- seq(from=0.25, to=0.51, by=0.001)
> dens <- dnorm(x=range, mean=0.38, sd=0.03)
> plot(x=range, y=dens, xlab="eggshell thickness",
+ main="Density function for eggshell thickness",
+ type="l")

Approximately what percentage of the area of the graph is to the left of 0.35?

Exercise. Can you draw the density function for the nicotinic receptor example?

Question. What does graphing the histogram or density give you that the probability calculation in section 4
does not? What does the probability calculation give you that the density does not?
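
One way to see how the two plotting methods in section 5 relate is to draw them on the same axes. The sketch below is an illustration under assumptions rather than part of the lab: it rescales the histogram to density units with freq=FALSE, overlays the exact normal density, and marks 0.35 with a dashed line. The names grid and frac_left, the seed, and breaks=30 are arbitrary choices.

```r
# Sketch: overlay the exact normal density on a histogram of simulated eggshells.
set.seed(141)  # arbitrary seed, only so the run is repeatable
eggshells <- rnorm(1000, mean = 0.38, sd = 0.03)

# freq = FALSE puts the histogram on the density scale (total bar area is 1),
# so it can be compared directly with dnorm().
hist(eggshells, freq = FALSE, breaks = 30,
     xlab = "eggshell thickness (mm)",
     main = "Simulated eggshells and exact density")

grid <- seq(from = 0.25, to = 0.51, by = 0.001)
dens <- dnorm(grid, mean = 0.38, sd = 0.03)
lines(grid, dens, lwd = 2)   # exact density curve
abline(v = 0.35, lty = 2)    # threshold from the eggshell problem

# Fraction of simulated eggshells at or below 0.35 mm, for comparison
# with the area to the left of the dashed line.
frac_left <- mean(eggshells <= 0.35)
print(frac_left)
```

The printed fraction should be in the neighborhood of the pnorm value from section 4, with some random variation from run to run if the seed is changed.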

6 Probability calculations with the normal curve

Exercise. Graph the density function of the eggshell thickness again. Say we want the probability that an eggshell is between 0.40 and 0.45 mm thick.
1. How do you estimate this probability from the density function?
2. (Thought question: you don't actually have to do it in R.) How would you estimate this probability using random sampling?
3. How could you use pnorm to calculate this probability exactly?
4. (Thought question) If you don't have a computer available, you may have only a piece of paper that contains the quantiles of N(0, 1). How would you do this calculation using the standardized scale method described in section 4.3?

Exercise. Use the pnorm function to determine the area under the normal curve within one standard deviation of the mean.

7 Reading files from disk

We've received many questions about reading files from disk, and from Excel. Download the following file from the web onto your Desktop: www.stanford.edu/~gtchang/lentil.xls

Open the file in Excel, then save it as a comma-separated values (CSV) file so that R can read it (File, then Save As, then select the CSV file type in the "save as type" dropdown box). Load it into R:

lentil <- read.table("C:/Documents and Settings/stat141/Desktop/lentil.csv", header=TRUE, sep=",")

8 Assessing normality

The lentils data that you just loaded represent the growth rate, in cm per day, for a sample of 47 lentil plants (SW Example 4.9 on p. 138). Are these data normally distributed?

Question: what's a good way to tell if the data are normally distributed?
One way to test normality is to see if the quantiles of the data match the corresponding quantiles of a normal random variable with the same mean and standard deviation. The function qnorm returns quantiles of a normal random variable with given mean and standard deviation. You can get quantiles for many other distributions: qbinom is for binomial, qpois is for Poisson, qunif is for uniform, and qexp is for exponential.

> mean(lentil$growth)
[1] 0.502766
> sd(lentil$growth)
[1] 0.4398602
> lentilquantiles <- quantile(lentil$growth, probs=seq(from=0.1,to=0.9,by=0.1))
> normalquantiles <- qnorm(seq(from=0.1, to=0.9, by=0.1), mean=0.502766, sd=0.4397602)
> lentilquantiles
  10%   20%   30%   40%   50%   60%   70%   80%   90%
0.112 0.130 0.240 0.350 0.400 0.460 0.500 0.670 1.200
> normalquantiles
[1] -0.06080937 0.13265448 0.27215553 0.39135403 0.50276600 0.61417797 0.73337647 0.87287752 1.0

The quantiles look fairly different, so the lentils data are probably not normal. However, it's difficult to tell for sure just by looking at the numbers. Let's visualize the quantiles by plotting them against each other:

plot(normalquantiles,lentilquantiles)

They deviate quite a bit from the line y = x, so the lentils data are definitely not normal. This plot is a simplified Q-Q plot (SW section 4.4). R has a function for drawing nice Q-Q plots: use qqnorm(lentil$growth), and notice the similarities and differences with the simplified plot that we graphed manually.

Question: How close is the binomial(n = 100, p = 0.5) distribution to normality? How about the exponential(rate = 5) distribution? How would you use a Q-Q plot to answer this question?
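
One possible way to approach the last question is to simulate draws from each distribution and pass them to qqnorm, adding a reference line with qqline. This is a sketch under assumptions, not the lab's intended answer: the sample size of 1000, the seed, the side-by-side layout, and the names binom_draws and exp_draws are illustrative, and the exponential rate is taken as written in the question.

```r
# Sketch: normal Q-Q plots for simulated binomial and exponential draws.
set.seed(141)  # arbitrary seed, only so the run is repeatable
binom_draws <- rbinom(1000, size = 100, prob = 0.5)
exp_draws   <- rexp(1000, rate = 5)   # rate as written in the question

old_par <- par(mfrow = c(1, 2))       # two plots side by side

qqnorm(binom_draws, main = "binomial(n = 100, p = 0.5)")
qqline(binom_draws)                   # line through the first and third quartiles

qqnorm(exp_draws, main = "exponential")
qqline(exp_draws)

par(old_par)                          # restore the previous plot layout
```

Points that hug the reference line suggest a shape close to normal, while systematic curvature away from the line suggests skewness or heavy tails.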