STAT 825 Notes

Random Number Generation

What if R/Splus/SAS doesn't have a function to randomly generate data from a particular distribution? Although R, Splus, SAS, and other packages can generate data from many different distributions, certainly most of the common ones, you may need to generate data from a distribution for which there is no existing function. The following tools will aid you in generating data from some of these other distributions. The methods described here can be thought of as direct methods. We will also discuss some indirect methods a bit later.

Additional comment: If you find yourself writing a Fortran or C++ program one day, you may only be able to generate uniform random variates. Some of the techniques below will be necessary in generating variates from even the most common distributions. (Note that routines exist for many of these distributions. You just need to know where to look.)

Transformations:

As you may recall from Theory I and II, many common distributions are related to one another through transformations. For example,

- If X ~ N(0, 1) and Y = X^2, then Y ~ χ²(1).
- Let X ~ N(0, 1) and Y ~ χ²(r), where X and Y are independent, and let V = X / sqrt(Y/r). Then V ~ t(r).
- Let X_1 and X_2 be independent Exp(θ), where θ is the mean, and let Y = X_1 - X_2. Then Y ~ Double Exponential(0, θ), where 0 is the value of the location parameter and θ is the scale parameter.
- If X ~ Gamma(α, β) and Y = 1/X, then Y ~ Inverted Gamma(α, β).
- If X ~ Exp(θ) and Y = X^(1/γ), then Y ~ Weibull(γ, θ).

An important transformation is the probability integral transformation: Let X be a random variable with a continuous cdf F_X(x) and define the random variable Y as Y = F_X(X). Then Y is uniformly distributed on (0, 1), i.e., Y ~ Uniform(0, 1).

The probability integral transformation implies the following: If Y is a continuous random variable with cdf F_Y, then the random variable F_Y^{-1}(U), where U ~ Uniform(0, 1), has distribution F_Y.
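As a quick illustration, the double exponential transformation above can be checked empirically with a large simulated sample (a minimal sketch; the sample size, seed, and θ value are arbitrary choices):

```r
set.seed(1)
theta <- 2
n <- 100000
# X1, X2 independent Exp(theta); note rexp() is parameterized by rate = 1/mean
x1 <- rexp(n, rate = 1/theta)
x2 <- rexp(n, rate = 1/theta)
y <- x1 - x2  # should be Double Exponential(0, theta)
# Sanity checks: this distribution has mean 0 and variance 2*theta^2
c(mean(y), var(y))
```

With n this large, the sample mean should be near 0 and the sample variance near 2θ² = 8.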
Example: Recall that if Y ~ Exp(θ), then the cdf of Y is F_Y(y) = 1 - exp(-y/θ). The inverse of F_Y is F_Y^{-1}(t) = -θ log(1 - t).
Therefore, if U ~ Uniform(0, 1), then F_Y^{-1}(U) = -θ log(1 - U) ~ Exp(θ).

What are the implications of the probability integral transformation? You can generate variates from any distribution provided that you can generate variates from a uniform distribution on (0, 1), the cdf of the target distribution can be written in closed form, and the cdf of the target distribution is invertible.

Note: The probability integral transformation technique for generating random variates based on uniform random variates is called the inverse cdf method by some.

Example continued: An algorithm for generating n variates from an Exp(θ) distribution is:

1. Generate u_1, ..., u_n from a Uniform(0, 1) distribution.
2. Let y_i = -θ log(1 - u_i).

In Splus/R, this can be carried out as follows. Suppose n = 1000 and θ = 3.

> n<-1000
> theta<-3
> u<-runif(n,0,1)
> y<-(-theta)*log(1-u)

A histogram of y should show the characteristic shape of the exponential pdf. Also, note that if U ~ Unif(0, 1), then 1 - U ~ Unif(0, 1). Therefore, Y = -θ log(U) ~ Exp(θ) also. Therefore, the last line above could be replaced by

> y<-(-theta)*log(u)

while still randomly generating exponential variates.

The relationship between the exponential distribution and other distributions allows the quick generation of many random variables. For example, if the U_j are iid Uniform(0, 1) random variables, what are the distributions of the following?

X = -2 Σ_{j=1}^{r} log(U_j)

Y = -b Σ_{j=1}^{a} log(U_j)

Z = ( Σ_{j=1}^{a} log(U_j) ) / ( Σ_{j=1}^{a+b} log(U_j) )

Discrete Distributions:

Many discrete distributions can also be generated using cdfs. For example, let Y be a
discrete random variable taking on values a_1 < a_2 < ... < a_k with cdf F_Y, and let U be a Uniform(0, 1) random variable. Then we can write

P[F_Y(a_j) < U ≤ F_Y(a_{j+1})] = F_Y(a_{j+1}) - F_Y(a_j) = P(Y = a_{j+1}).

Implementation of the above to generate discrete random variates is quite straightforward and can be summarized as follows. To generate y_1, ..., y_n from a discrete distribution with cdf F_Y(y),

1. Generate u_1, ..., u_n from Uniform(0, 1).
2. If F_Y(a_j) < u_i ≤ F_Y(a_{j+1}), set y_i = a_{j+1}.

We define a_0 = -∞ and F_Y(a_0) = 0.

Example: Suppose we want to generate a random sample of size 1000 from the following discrete distribution:

a        1     2     3     4     5
p_Y(a)  1/15  2/15  3/15  4/15  5/15

Note that F_Y is given by

a        1     2     3     4      5
F_Y(a)  1/15  3/15  6/15  10/15   1

The following Splus/R code will do the job (discrete.prog).

n<-1000
a<-1:5
p<-a/15
cdf<-c(0,cumsum(p))
u<-runif(n,0,1)
ind<-matrix(0,nrow=5,ncol=n)
for(i in 1:length(a)) ind[i,]<-ifelse((u>cdf[i])&(u<=cdf[i+1]),1,0)
y<-as.vector(a%*%ind)

Using table(y) should confirm that we have generated data from the above distribution.

The above algorithm can be used to generate variates from a binomial distribution also. (How would you do this?) In addition, the algorithm will also work if the support of the discrete random variable is infinite, as for the Poisson and negative binomial. Although, theoretically, this could require a large number of evaluations, in practice this does not happen because there are simple and clever ways of speeding up the algorithm. For example, instead of checking each y_i in the order 1, 2, ..., it can be much faster to start checking y_i's near the mean. (See Stochastic Simulation by B.D. Ripley, 1987, Section 3.3 and Exercise 5.55 for more information.) Other algorithms also exist for generating discrete random variates from
uniform random variates. (See Random Number Generation and Monte Carlo Methods by J.E. Gentle, 2003, Section 4.1.)

Multivariate Distributions:

The probability integral transformation (or inverse cdf) method does not apply directly to a multivariate distribution. However, the probability integral transformation can be applied to the marginal and conditional univariate distributions to generate multivariate random variates. Suppose that the cdf of the multivariate random variable (random vector) (X_1, X_2, ..., X_d) is decomposed as

F_{X_1,X_2,...,X_d}(x_1, x_2, ..., x_d) = F_{X_1}(x_1) F_{X_2|X_1}(x_2|x_1) ··· F_{X_d|X_1,...,X_{d-1}}(x_d|x_1, x_2, ..., x_{d-1}).

Also, suppose that all of the marginal and conditional cdfs above are invertible. Then the probability integral transformation can be applied sequentially by using independent Uniform(0, 1) random variates u_1, ..., u_d:

x_1 = F_{X_1}^{-1}(u_1)
x_2 = F_{X_2|X_1}^{-1}(u_2)
...
x_d = F_{X_d|X_1,X_2,...,X_{d-1}}^{-1}(u_d)

The modifications of the probability integral transformation approach for discrete random variables above can be applied here also if necessary.

Example: Recall that R has a built-in function for generating random variates from a multinomial distribution but Splus does not. We can combine some of the techniques above to generate multinomial variates from uniform variates. To give this example some context, suppose that 15% of all college students are engineering majors and 20% are business majors. (The remaining 65% of college students have majors in other areas.) Suppose that we want to sample n = 100 college students and count the number, X_1, who major in engineering and the number, X_2, who major in business. Then (X_1, X_2) ~ Multinomial(n = 100, p_1 = 0.15, p_2 = 0.20) (more specifically, a trinomial distribution).

In general, if (X_1, X_2) ~ Multinomial(n, p_1, p_2) with joint pmf p_{X_1,X_2}(x_1, x_2), then

p_{X_1,X_2}(x_1, x_2) = p_{X_1}(x_1) p_{X_2|X_1}(x_2|x_1)

with

X_1 ~ Binomial(n, p_1)

and

X_2 | X_1 = x_1 ~ Binomial(n - x_1, p_2/(1 - p_1)).

Therefore, (X_1, X_2) can be produced by sampling from binomial distributions.
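As a quick sketch of this decomposition using R's built-in binomial generator (illustrative only; the seed is an arbitrary choice, and R's rmultinom() would also do this directly):

```r
set.seed(825)  # arbitrary seed for reproducibility
n <- 100; p1 <- 0.15; p2 <- 0.20
x1 <- rbinom(1, n, p1)                  # X1 ~ Binomial(n, p1)
x2 <- rbinom(1, n - x1, p2 / (1 - p1))  # X2 | X1 = x1 ~ Binomial(n - x1, p2/(1 - p1))
c(x1, x2, n - x1 - x2)                  # counts in the three categories
```

The three counts are nonnegative and sum to n, as a trinomial variate must.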
(Note: Since many packages have functions/procedures for generating binomial variates, we could use the above information to generate our multinomial random variate. However, to illustrate how the probability integral transformation can be used, we will continue one step further.)

Now we are faced with the issue of generating binomial random variates from uniform random variates. The earlier approach (under Discrete Distributions) could be used, but I will illustrate another approach here. Consider U ~ Uniform(0, 1) and 0 < p < 1. Let

Y = 1 if U ≤ p, and Y = 0 otherwise.

Then Y ~ Binomial(1, p) (or Y ~ Bernoulli(p)). Also, note that if Y_1, ..., Y_n are iid Bernoulli(p), then Σ_{i=1}^{n} Y_i ~ Binomial(n, p). Putting this all together, we can generate a binomial random variate from uniform random variates.

Algorithm for generating (x_1, x_2) from a Trinomial(n, p_1, p_2) distribution:

1. Generate u_1, ..., u_n from a Uniform(0, 1) distribution.
2. x_1 = Σ_{i=1}^{n} I(u_i ≤ p_1) generates x_1 from a Binomial(n, p_1) distribution.
3. Generate u_1, ..., u_{n-x_1} from a Uniform(0, 1) distribution.
4. x_2 = Σ_{i=1}^{n-x_1} I(u_i ≤ p_2/(1 - p_1)) gets x_2 | x_1 ~ Binomial(n - x_1, p_2/(1 - p_1)).
5. (x_1, x_2) is a trinomial variate.
6. Repeat the previous steps m times for a random sample of size m.

The following R function (stored in the file multinom.func) will generate a single variate from a trinomial distribution by implementing the above algorithm. Notice that the function requires you to specify n, p_1, and p_2.

rtrinom.ex<-function(n,p1,p2){
# This function generates one random trinomial variate.
# n = number of individuals sampled
# p1 = probability of being in the first group
# p2 = probability of being in the second group
u1<-runif(n,0,1)
x1<-sum(ifelse(u1<=p1,1,0))
u2<-runif(n-x1,0,1)
x2<-sum(ifelse(u2<=p2/(1-p1),1,0))
c(x1,x2)
}

(We could also choose to return (x_1, x_2, n - x_1 - x_2).) Executing this function for the above example yields
> source("multinom.func")
> rtrinom.ex(100,0.15,0.20)
[1] 18 21

Therefore, in this sample of 100 college students, 18 were engineering majors and 21 were business majors.

Other Methods:

In many cases, the cdf of a random variable cannot be written in closed form or the cdf is not invertible. In these cases, other options must be explored. These include other types of generation methods (other algorithms) and indirect methods. An example of the former is the Box-Muller algorithm.

Box-Muller Algorithm: Generate U_1 and U_2, two independent Uniform(0, 1) random variables. Then

X_1 = sqrt(-2 log(U_1)) cos(2π U_2)
X_2 = sqrt(-2 log(U_1)) sin(2π U_2)

are iid Normal(0, 1) random variables.

Unfortunately, solutions such as the Box-Muller algorithm are not plentiful. Moreover, they take advantage of the specific structure of certain distributions and are, thus, less applicable as general strategies. For the most part, the generation of other continuous distributions is best accomplished through indirect methods. Indirect methods include, but are not limited to, the Accept/Reject Algorithm, the Ratio-of-Uniforms Method, the Metropolis-Hastings Algorithm, and Gibbs Sampling. Time permitting, we will discuss some of these at the end of the semester.

Finite Mixture Distributions

Consider a random variable X with pdf (or pmf) of the following form

f(x) = w_1 f_1(x) + w_2 f_2(x) + ··· + w_k f_k(x)

where f_1(x), ..., f_k(x) are pdfs (or pmfs); w_j ≥ 0, j = 1, ..., k; and Σ_{j=1}^{k} w_j = 1. The random variable X is said to have a (finite) mixture distribution. Examples of finite mixture distributions include the following.

X is said to have a contaminated normal distribution if f_1(x) is the standard normal pdf and f_2(x) is a normal pdf with variance greater than one. Typically, w_1 = 1 - ε is close to one while w_2 = ε is close to zero; some common values are ε = 0.01, 0.05, 0.10.
Contaminated normal distributions are useful if you need to generate data from a distribution that is pretty much normal but contains some outliers or has slightly thicker tails.

Finite mixtures are often used when subpopulations are known to exist. As a simple example, suppose X represents heights of adults. The population of adults consists of both men and women. Heights of men might be modelled with one normal distribution while heights of women might be modelled with another normal distribution.
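The heights example might be sketched as a two-component normal mixture (an illustration only; the equal mixing weight and the mean/sd values in inches are invented for the sketch, not estimates):

```r
set.seed(1)
n <- 1000
# Hypothetical parameters: men ~ N(70, 3^2), women ~ N(64.5, 2.8^2)
man <- rbinom(n, 1, 0.5)  # subpopulation indicator
height <- ifelse(man == 1, rnorm(n, 70, 3), rnorm(n, 64.5, 2.8))
```

A histogram of height shows the combined population rather than either subpopulation alone.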
A zero-inflated Poisson distribution is another example. In this case, the bulk of the data follows a Poisson distribution but there are many extra zeroes.

Note that the component distributions need not be normal. In addition, the components need not be from the same family of distributions, though they often are.

How would you generate n variates from a finite mixture distribution? Suppose that we can already generate variates from each of the component distributions. Also, suppose for simplicity that k = 2, and let w = w_1 and 1 - w = w_2. Thus, we want to generate n random variates from

f(x) = w f_1(x) + (1 - w) f_2(x)

where 0 < w < 1. One way of forming the above mixture distribution is to consider a conditional pdf/pmf of a similar form:

f(x|y) = y f_1(x) + (1 - y) f_2(x)

where y is a realization of a Bernoulli random variable with probability of success w, i.e., Y ~ Bernoulli(w). Then the marginal pdf/pmf of X is

f_X(x) = Σ_{y=0}^{1} f_{X,Y}(x, y)
       = Σ_{y=0}^{1} f(x|y) P(Y = y)
       = f(x|0) P(Y = 0) + f(x|1) P(Y = 1)
       = w f_1(x) + (1 - w) f_2(x),

which is the desired mixture distribution.

Algorithm for generating data from a mixture distribution:

1. Generate n variates from f_1(x): x_1^(1), ..., x_n^(1).
2. Generate n variates from f_2(x): x_1^(2), ..., x_n^(2).
3. Generate n variates from Bernoulli(w): y_1, ..., y_n.
4. Let z_i = y_i x_i^(1) + (1 - y_i) x_i^(2), i = 1, ..., n.

Then z_1, ..., z_n is a sample from f(x) = w f_1(x) + (1 - w) f_2(x).

Example: Suppose we want to generate n random variates from a contaminated normal distribution with f_1(x) = φ(x) and f_2(x) = (1/σ) φ(x/σ), where φ represents the standard normal pdf. (Therefore, the first component is Normal(0, 1) and the second component is Normal(0, σ²).) The following R function will do the trick:
contam.norm<-function(n,eps,sigma){
# This function generates n random variates from a contaminated
# normal distribution with contamination proportion eps and
# contamination variance sigma^2.
x1<-rnorm(n,0,1)
x2<-rnorm(n,0,sigma)
y<-rbinom(n,1,1-eps)
x<-y*x1+(1-y)*x2
x
}

Execute the function for n = 1000, eps = 0.1, and sigma = 3:

> x<-contam.norm(1000,0.1,3)

A histogram of x will be bell-shaped with longer tails than a standard normal distribution. Alternatively, the following R function could be used:

contam.norm2<-function(n,eps,sigma){
# This function generates n random variates from a contaminated
# normal distribution with contamination proportion eps and
# contamination variance sigma^2.
x<-rnorm(n,0,1)
y<-rbinom(n,1,1-eps)
x<-y*x+(1-y)*sigma*x
x
}

How does this differ from the first?

Note: When we run simulations, we will typically need to take several samples of the same size from the same distribution. We could do this in a for loop, but there are more efficient ways. Can you think of any? Suppose you will need 10 samples of size 5 from a standard normal distribution. In simulation studies, these 10 samples should be independent. Therefore, one can simply generate 10 × 5 = 50 standard normal variates and fill a matrix so that each sample is one column of the matrix.

> set.seed(10)
> y<-matrix(rnorm(50),nrow=5,ncol=10)
> y
           [,1]       [,2]       [,3]        [,4]       [,5]       [,6]
[1,]  0.01874617  0.3897943  1.1017795  0.08934727 -0.5963106 -0.3736616
[2,] -0.18425254 -1.2080762  0.7557815 -0.95494386 -2.1852868 -0.6875554
[3,] -1.37133055 -0.3636760 -0.2382336 -0.19515038 -0.6748659 -0.8721588
[4,] -0.59916772 -1.6266727  0.9874447  0.92552126 -2.1190612 -0.1017610
[5,]  0.29454513 -0.2564784  0.7413901  0.48297852 -1.2651980 -0.2537805
            [,7]       [,8]       [,9]       [,10]
[1,] -1.85374045 -1.4355144  1.0865514 -0.02881534
[2,] -0.07794607  0.3620872 -0.7625449  0.23252515
[3,]  0.96856634 -1.7590868 -0.8286625 -0.30120868
[4,]  0.18492596 -0.3245440  0.8344739 -0.67761458
[5,] -1.37994358 -0.6515630 -0.9676520  0.65522764

The replicate() function is useful for repeating a procedure many times.

> set.seed(10)
> z<-replicate(10,rnorm(5))
> z
           [,1]       [,2]       [,3]        [,4]       [,5]       [,6]
[1,]  0.01874617  0.3897943  1.1017795  0.08934727 -0.5963106 -0.3736616
[2,] -0.18425254 -1.2080762  0.7557815 -0.95494386 -2.1852868 -0.6875554
[3,] -1.37133055 -0.3636760 -0.2382336 -0.19515038 -0.6748659 -0.8721588
[4,] -0.59916772 -1.6266727  0.9874447  0.92552126 -2.1190612 -0.1017610
[5,]  0.29454513 -0.2564784  0.7413901  0.48297852 -1.2651980 -0.2537805
            [,7]       [,8]       [,9]       [,10]
[1,] -1.85374045 -1.4355144  1.0865514 -0.02881534
[2,] -0.07794607  0.3620872 -0.7625449  0.23252515
[3,]  0.96856634 -1.7590868 -0.8286625 -0.30120868
[4,]  0.18492596 -0.3245440  0.8344739 -0.67761458
[5,] -1.37994358 -0.6515630 -0.9676520  0.65522764
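Once the samples are stored column-wise, a statistic can be computed for every sample at once with apply() (a small sketch of the idea; the choice of the sample mean as the statistic is just for illustration):

```r
set.seed(10)
y <- matrix(rnorm(50), nrow = 5, ncol = 10)  # 10 samples of size 5, one per column
xbar <- apply(y, 2, mean)  # one sample mean per column, i.e., per sample
length(xbar)  # 10 means, one for each simulated sample
```

For column means specifically, colMeans(y) gives the same result more efficiently; apply() generalizes to any statistic, e.g., apply(y, 2, median).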