CALIFORNIA INSTITUTE OF TECHNOLOGY
Ma 3, Introduction to Probability and Statistics, Winter 2015
KC Border
Assignment 3
Due Monday, January 25 by 4:00 p.m. at 253 Sloan

Instructions: When asked for a probability or an expectation, give both a formula and an explanation of why you used that formula, and also give a numerical value when available. When asked to plot something, use informative labels (even if handwritten) so the TA knows what you are plotting, attach a copy of the plot, and, if appropriate, the commands that produced it.

Exercise 1 (30 pts)
Is it possible to have three random variables X, Y, and Z, where X and Y are stochastically independent, Y and Z are stochastically independent, and X and Z are stochastically independent, but the set {X, Y, Z} of random variables is not stochastically independent? Explain why your answer is correct.

Exercise 2 (Problem 3.3.15 in Pitman) (20 pts)
Let X and Y be independent random variables. Show that Var(X − Y) = Var(X + Y).

Exercise 3 (The Standard Normal Distribution)
The standard normal density is given by

    f(z) = (1/√(2π)) e^(−z²/2).

The cumulative distribution function for the standard normal is denoted Φ. That is,

    Φ(t) = (1/√(2π)) ∫_{−∞}^{t} e^(−z²/2) dz.
There is no closed-form expression for this in terms of elementary functions, but there are some decent approximations. Most statistics books have tables of selected values of this cdf, but nowadays it is built in to languages such as R and Mathematica.

In Mathematica (since version 8), to find Φ(t), you evaluate CDF[NormalDistribution[0, 1], t]. The value of the density at z is given by PDF[NormalDistribution[0, 1], z]. Mathematica also lets you find the probability mass function and cdf of a Binomial(n, p) variable with PDF[BinomialDistribution[n, p], k] and CDF[BinomialDistribution[n, p], k].

In R, to find Φ(t), you evaluate pnorm(t), or more completely, pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) if you don't trust the defaults. The value of the density at z is given by dnorm(z). R also lets you find the probability mass function and cdf of a Binomial(n, p) variable with dbinom(k, n, p) and pbinom(t, n, p).

To get the corresponding functions for a N(µ, σ²) normal random variable with expectation µ and variance σ², replace the 0 by µ and the 1 by σ (not σ²!) in the commands above.

Note: You may want to look at Section 4.3 in Larsen–Marx [1] and/or Sections 2.2–2.3 in Pitman [2].

1. (20 pts for answering yes) Do you have some appropriate software installed? (E.g., R, Matlab, NumPy, Mathematica, Octave, or something else useful for statistical calculations and plotting. Excel does not count.)

2. (10 pts) Use the program of your choice to make a table of values of Φ(t) for t = −3, −2, −1, 0, 1, 1.96, 2, 2.58, 3, 4, 5, 6.

3. (5 pts) If Z is a standard normal random variable, what are Prob(|Z| ⩽ 1.96) and Prob(|Z| ⩽ 2.58)?

Exercise 4 (Cf. Problem 3.3.26 in Pitman) (25 pts)
Use Jensen's Inequality (Lecture 6) to show that for a random variable X with finite mean µ,

    std. dev.(X) ⩾ E|X − µ|,

with equality if and only if X is degenerate.

Exercise 5 (25 pts)
There are n balls numbered 1, ..., n and n bins numbered 1, ..., n.
The balls are put into the bins at random, one per bin. What is the expected number of balls put in the matching bin? Explain your reasoning. (Hint: Let E_i be the event that ball i is in bin i. Use indicator functions.)

Exercise 6 (Exploring some data) (40 pts)
In lecture I suggested that examining the empirical distribution function was a good way to look at data. Let's compare it to using histograms.
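Returning for a moment to Exercise 3: if you end up using Python (one of the acceptable tools listed above), no statistics package is needed for the table, since the standard library's math.erf gives Φ(t) = (1 + erf(t/√2))/2. This is only an illustrative sketch, not the required R or Mathematica commands:

```python
from math import erf, sqrt

def Phi(t):
    """Standard normal cdf via the error function: Phi(t) = (1 + erf(t/sqrt(2)))/2."""
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

# Table of Phi(t) for the values requested in Exercise 3, part 2.
for t in [-3, -2, -1, 0, 1, 1.96, 2, 2.58, 3, 4, 5, 6]:
    print(f"Phi({t}) = {Phi(t):.6f}")

# Exercise 3, part 3: two-sided probabilities, Prob(|Z| <= c) = 2*Phi(c) - 1.
print(2 * Phi(1.96) - 1)  # approximately 0.95
print(2 * Phi(2.58) - 1)  # approximately 0.99
```

The identity Φ(t) = (1 + erf(t/√2))/2 is exact, so this agrees with pnorm(t) in R to within floating-point error.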
At the beginning of the term you flipped coins. This generated a long string of 0s and 1s. A segment of this string can be interpreted as a binary number, and by dividing this by the appropriate power of two, it can be interpreted as a number between 0 and 1. Moreover, if the coin tosses are independent and Heads and Tails are equally likely, then these numbers should be i.i.d. with an approximately uniform distribution. We are going to subject this to an eyeball test, which is one of the first things you should always do with data.

I have taken the liberty of chopping the coin toss data from this year and three previous years into 3,464 strings of length 32, and converting them into numbers between 0 and 1. You can download these results from http://www.math.caltech.edu/%7E2015-16/2term/ma003/Data/Random32.txt. Or you can do it yourself from the raw data at http://www.math.caltech.edu/%7e2015-16/2term/ma003/data/flips.txt

Using the program/language of your choice, do the following. (I give hints for R and Mathematica below.)

1. What is the expected value of a Uniform[0,1] random variable? What is its standard deviation?

2. What is the average of the numbers in your sample? What is the sample standard deviation? (The sample standard deviation is gotten by squaring the deviation of each sample value from the sample mean, summing them, dividing by (sample size − 1), and then taking the square root.)

3. Plot a histogram of these numbers, using the default settings. Then plot a histogram using bins of length 0.02.

4. Now plot a cumulative histogram or the empirical cumulative distribution function. (In Mathematica, this is just an option of the Histogram command, and in R use the ecdf command.)

5. Which method makes it easier to check by eye whether the data appear to be uniform?

If you don't have a preference, there is a lot to be said for learning the R statistical programming language.
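Items 1 and 2 above can be sketched in Python with only the standard library. The data values below are a hypothetical stand-in; for the assignment, read the actual values from Random32.txt instead:

```python
import statistics
from math import sqrt

# Item 1: a Uniform[0,1] variable has mean 1/2 and standard deviation 1/sqrt(12).
print(0.5, 1 / sqrt(12))  # sd is approximately 0.2887

# Item 2: sample mean and sample standard deviation.
# statistics.stdev divides by (n - 1), matching the definition above.
x = [0.12, 0.47, 0.58, 0.83, 0.91, 0.30]  # stand-in for the Random32 data
m = statistics.mean(x)
s = statistics.stdev(x)

# The same computation spelled out step by step, as in the parenthetical:
n = len(x)
s_by_hand = sqrt(sum((xi - m) ** 2 for xi in x) / (n - 1))

print(m, s, s_by_hand)
```

If the coin flips are fair and independent, the sample mean and sample standard deviation of the real data should come out close to 0.5 and 0.2887.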
It is used widely on campus, and it looks like it will be around for a while. It is also free and runs on the major operating systems. You can get it at http://www.r-project.org. But if you are familiar with something else, go ahead. Even Excel can probably handle this assignment, but future ones may be trickier.

Hint: Badly documented sample R code. Warning: I am not an R programmer, and I am sure there are better ways to do things. Most of what I know I got by Googling various questions. Also, typing ?command will bring up help on the command command.
First, use setwd("your_data_pathname") to change your working directory to the folder where the data file is. (Or be prepared to use a full path name.) You can use getwd() and list.files() to check that you are in the right place.

Read the data from the file into an array. Check the length; it should be 3464 for the file Random32. (# is a comment character.)

a = as.matrix(read.table("Random32.txt"))  # the as.matrix is important!
length(a)

Now try a default histogram:

hist(a)

Now try a histogram with bins of size 0.02. Also, instead of actual counts, use relative frequencies (density):

bins = seq(0.0, 1.0, by = 0.02)
hist(a, breaks = bins, freq = FALSE)  # freq = FALSE uses relative frequencies

Now let's examine the empirical cdf:

c = ecdf(a)
plot(c)

How do you save these plots? Well, on my Mac, I just click on the graphics window and hit Save, and it saves the graphic as a pdf. But here is a better way. Say you want to save the plot above to a png file named Hist.png. Here you go:

png("Hist.png")  # open the file for writing
plot(c)          # plot to the file
dev.off()        # close the file. This is crucial.

To save to a pdf file, use pdf("Hist.pdf") for the first line. I found this at http://wiki.stdout.org/rcookbook/graphs/output%20to%20a%20file/.

Hint: Undocumented sample Mathematica code:

SetDirectory["Your path goes here"]
a = Flatten[ Import["Random32", "Table"] ];
g = Histogram[a]
Export["File name 1.pdf", g]
g = Histogram[a, {0, 1, 0.02}]
Export["File name 2.pdf", g]
g = Histogram[a, {0, 1, 0.02}, "CDF"]
Export["File name 3.pdf", g]

Exercise 7 (10 pts)
How much time did you spend on the previous exercises?

Exercise 8 (Optional Exercise) (50 pts)
There are n balls numbered 1, ..., n and n bins numbered 1, ..., n. The balls are put into the bins at random, one per bin. For each k = 0, ..., n, what is the probability that exactly k balls are put in the matching bin? Explain your reasoning. (Hint: Let E_i be the event that ball i is in bin i. Use indicator functions.)
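As a sanity check on your answers to Exercises 5 and 8 (not a substitute for the indicator-function argument the hints ask for), the matching problem is easy to simulate. A Python sketch, with an arbitrarily chosen n = 10 and 20,000 trials:

```python
import random
from collections import Counter

def match_counts(n, trials, seed=0):
    """Place n numbered balls into n numbered bins at random, one per bin
    (i.e., draw a uniform random permutation), and record how many balls
    land in the bin with the matching number, for each trial."""
    rng = random.Random(seed)
    counts = []
    for _ in range(trials):
        perm = list(range(n))
        rng.shuffle(perm)
        counts.append(sum(1 for i, b in enumerate(perm) if i == b))
    return counts

counts = match_counts(n=10, trials=20000)

# Exercise 5: the average number of matches should be near 1,
# since E[sum of indicators] = n * (1/n) = 1.
print(sum(counts) / len(counts))

# Exercise 8: empirical distribution of the number of matches.
freq = Counter(counts)
for k in sorted(freq):
    print(k, freq[k] / len(counts))
```

Note that exactly n − 1 matches never occurs: if n − 1 balls are in their own bins, the last one must be too.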
References

[1] R. J. Larsen and M. L. Marx. 2012. An introduction to mathematical statistics and its applications, 5th ed. Boston: Prentice Hall.

[2] J. Pitman. 1993. Probability. New York, Berlin, and Heidelberg: Springer.