STAT 825 Notes Random Number Generation

What if R/Splus/SAS doesn't have a function to randomly generate data from a particular distribution? Although R, Splus, SAS and other packages can generate data from many different distributions, certainly most of the common ones, you may need to generate data from a distribution for which there is no existing function. The following tools will aid you in generating data from some of these other distributions. The methods described here can be thought of as direct methods. We will also discuss some indirect methods a bit later.

Additional comment: If you find yourself writing a Fortran or C++ program one day, you may only be able to generate uniform random variates. Some of the techniques below will be necessary in generating variates from even the most common distributions. (Note that routines exist for many of these distributions. You just need to know where to look.)

Transformations:

As you may recall from Theory I and II, many common distributions are related to one another through transformations. For example (a short R sketch of two of these follows the list):

- If $X \sim N(0, 1)$ and $Y = X^2$, then $Y \sim \chi^2(1)$.

- Let $X \sim N(0, 1)$ and $Y \sim \chi^2(r)$, where $X$ and $Y$ are independent, and let $V = X/\sqrt{Y/r}$. Then $V \sim t(r)$.

- Let $X_1$ and $X_2$ be independent $\mathrm{Exp}(\theta)$, where $\theta$ is the mean, and let $Y = X_1 - X_2$. Then $Y \sim \text{Double Exponential}(0, \theta)$, where 0 is the value of the location parameter and $\theta$ is the scale parameter.

- If $X \sim \mathrm{Gamma}(\alpha, \beta)$ and $Y = 1/X$, then $Y \sim \text{Inverted Gamma}(\alpha, \beta)$.

- If $X \sim \mathrm{Exp}(\theta)$ and $Y = X^{1/\gamma}$, then $Y \sim \mathrm{Weibull}(\gamma, \theta)$.
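For instance, here is a minimal R sketch of the double exponential and Weibull transformations above (the values of n, theta, and gamma are arbitrary choices for illustration):

n <- 1000
theta <- 2                      # scale parameter (the exponential mean)
gamma <- 1.5                    # Weibull shape parameter

# Double Exponential(0, theta) as a difference of independent exponentials.
# Note that rexp() is parameterized by the rate, which is 1/mean.
x1 <- rexp(n, rate = 1/theta)
x2 <- rexp(n, rate = 1/theta)
y.dexp <- x1 - x2

# Weibull(gamma, theta) as a power of an exponential
x <- rexp(n, rate = 1/theta)
y.weib <- x^(1/gamma)

Histograms of y.dexp and y.weib should show the characteristic shapes of the two densities.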

An important transformation is the probability integral transformation: Let $X$ be a random variable with a continuous cdf $F_X(x)$ and define the random variable $Y$ as $Y = F_X(X)$. Then $Y$ is uniformly distributed on $(0, 1)$; that is, $Y \sim \mathrm{Uniform}(0, 1)$.

The probability integral transformation implies the following: If $Y$ is a continuous random variable with cdf $F_Y$, then the random variable $F_Y^{-1}(U)$, where $U \sim \mathrm{Uniform}(0, 1)$, has distribution $F_Y$.

Example: Recall that if $Y \sim \mathrm{Exp}(\theta)$, then the cdf of $Y$ is $F_Y(y) = 1 - \exp(-y/\theta)$. The inverse of $F_Y$ is $F_Y^{-1}(t) = -\theta \log(1 - t)$. Therefore, if $U \sim \mathrm{Uniform}(0, 1)$, then $F_Y^{-1}(U) = -\theta \log(1 - U) \sim \mathrm{Exp}(\theta)$.

What are the implications of the probability integral transformation? You can generate variates from any distribution provided that you can generate variates from a uniform distribution on (0, 1), the cdf of the target distribution can be written in closed form, and the cdf of the target distribution is invertible.

Note: The probability integral transformation technique for generating random variates based on uniform random variates is called the inverse cdf method by some.

Example continued: An algorithm for generating n variates from an $\mathrm{Exp}(\theta)$ distribution is:

1. Generate $u_1, \ldots, u_n$ from a Uniform(0, 1) distribution.
2. Let $y_i = -\theta \log(1 - u_i)$.

In Splus/R, this can be carried out as follows. Suppose n = 1000 and θ = 3.

> n<-1000
> theta<-3
> u<-runif(n,0,1)
> y<-(-theta)*log(1-u)

A histogram of y should show the characteristic shape of the exponential pdf. Also, note that if $U \sim \mathrm{Unif}(0, 1)$, then $1 - U \sim \mathrm{Unif}(0, 1)$. Therefore, $Y = -\theta \log(U) \sim \mathrm{Exp}(\theta)$ also. Therefore, the last line above could be replaced by

> y<-(-theta)*log(u)

still randomly generating exponential variates.

The relationship between the exponential distribution and other distributions allows the quick generation of many random variables. For example, if the $U_j$ are iid Uniform(0, 1) random variables, what are the distributions of the following?

$$X = -2 \sum_{j=1}^{r} \log(U_j), \qquad Y = -b \sum_{j=1}^{a} \log(U_j), \qquad Z = \frac{\sum_{j=1}^{a} \log(U_j)}{\sum_{j=1}^{a+b} \log(U_j)}$$
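One way to investigate questions like these empirically is to generate the variates and compare them to a conjectured distribution with a QQ-plot. A sketch for X (checking the conjecture that X follows a chi-square distribution with 2r degrees of freedom; the values of r and m are arbitrary):

r <- 4
m <- 5000
u <- matrix(runif(m*r), nrow=m, ncol=r)   # m rows, each holding u_1,...,u_r
x <- -2*rowSums(log(u))                   # m realizations of X
qqplot(qchisq(ppoints(m), df=2*r), x,
       xlab="Chi-square(2r) quantiles", ylab="Sample quantiles")
abline(0, 1)

Points falling near the line support the conjecture; Y and Z can be checked the same way against candidate distributions.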

Discrete Distributions:

Many discrete distributions can also be generated using cdfs. For example, let Y be a discrete random variable taking on values $a_1 < a_2 < \cdots < a_k$ with cdf $F_Y$, and let U be a Uniform(0, 1) random variable. Then we can write

$$P[F_Y(a_j) < U \le F_Y(a_{j+1})] = F_Y(a_{j+1}) - F_Y(a_j) = P(Y = a_{j+1}).$$

Implementation of the above to generate discrete random variates is quite straightforward and can be summarized as follows. To generate $y_1, \ldots, y_n$ from a discrete distribution with cdf $F_Y(y)$:

1. Generate $u_1, \ldots, u_n$ from Uniform(0, 1).
2. If $F_Y(a_j) < u_i \le F_Y(a_{j+1})$, set $y_i = a_{j+1}$.

We define $a_0 = -\infty$ and $F_Y(a_0) = 0$.

Example: Suppose we want to generate a random sample of size 1000 from the following discrete distribution:

    a        1     2     3     4     5
    p_Y(a)  1/15  2/15  3/15  4/15  5/15

Note that F_Y is given by

    a        1     2     3     4      5
    F_Y(a)  1/15  3/15  6/15  10/15   1

The following Splus/R code will do the job (discrete.prog).

n<-1000
a<-1:5
p<-a/15
cdf<-c(0,cumsum(p))
u<-runif(n,0,1)
ind<-matrix(0,nrow=5,ncol=n)
for(i in 1:length(a)) ind[i,]<-ifelse((u>cdf[i])&(u<=cdf[i+1]),1,0)
y<-as.vector(a%*%ind)

Using table(y) should confirm that we have generated data from the above distribution.

The above algorithm can be used to generate variates from a binomial distribution also. (How would you do this? One possible sketch is given below.) In addition, the algorithm will also work if the support of the discrete random variable is infinite, like the Poisson and negative binomial. Although, theoretically, this could require a large number of evaluations, in practice this does not happen because there are simple and clever ways of speeding up the algorithm. For example, instead of checking each y_i in the order 1, 2, ..., it can be much faster to start checking y_i's near the mean. (See Stochastic Simulation by B.D. Ripley, 1987, Section 3.3 and Exercise 5.55 for more information.)
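One possible answer to the binomial question above (a sketch, not part of the original notes): build the binomial cdf with cumsum(dbinom(...)) and let findInterval() do the table look-up in a single vectorized step.

n <- 1000
size <- 10    # number of trials
prob <- 0.3   # success probability
cdf <- cumsum(dbinom(0:size, size, prob))  # F_Y(0), F_Y(1), ..., F_Y(size)
u <- runif(n, 0, 1)
# For each u_i, findInterval() counts how many cdf values are <= u_i,
# which (with probability one) is exactly the y with F_Y(y-1) < u_i <= F_Y(y).
y <- findInterval(u, cdf)

Comparing table(y) with n*dbinom(0:size, size, prob) provides a quick check.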

Other algorithms also exist for generating discrete random variates from uniform random variates. (See Random Number Generation and Monte Carlo Methods by J.E. Gentle, 2003, Section 4.1.)

Multivariate Distributions:

The probability integral transformation (or inverse cdf) method does not apply to a multivariate distribution. However, the probability integral transformation can be applied to the marginal and conditional univariate distributions to generate multivariate random variates. Suppose that the cdf of the multivariate random variable (random vector) $(X_1, X_2, \ldots, X_d)$ is decomposed as

$$F_{X_1, X_2, \ldots, X_d}(x_1, x_2, \ldots, x_d) = F_{X_1}(x_1) \, F_{X_2|X_1}(x_2|x_1) \cdots F_{X_d|X_1,\ldots,X_{d-1}}(x_d|x_1, x_2, \ldots, x_{d-1}).$$

Also, suppose that all of the marginal and conditional cdfs above are invertible. Then the probability integral transformation can be applied sequentially by using independent Uniform(0, 1) random variates $u_1, \ldots, u_d$:

$$x_1 = F_{X_1}^{-1}(u_1)$$
$$x_2 = F_{X_2|X_1}^{-1}(u_2)$$
$$\vdots$$
$$x_d = F_{X_d|X_1, X_2, \ldots, X_{d-1}}^{-1}(u_d)$$

The modifications of the probability integral transformation approach for discrete random variables above can be applied here also if necessary.

Example: Recall that R has a built-in function for generating random variates from a multinomial distribution but Splus does not. We can combine some of the techniques above to generate multinomial variates from uniform variates. To give this example some context, suppose that 15% of all college students are engineering majors and 20% are business majors. (The remaining 65% of college students have majors in other areas.) Suppose that we want to sample n = 100 college students and count the number, $X_1$, who major in engineering and the number, $X_2$, who major in business. Then $(X_1, X_2) \sim \mathrm{Multinomial}(n = 100, p_1 = 0.15, p_2 = 0.20)$ (more specifically, a trinomial distribution).

In general, if $(X_1, X_2) \sim \mathrm{Multinomial}(n, p_1, p_2)$ with joint pmf $p_{X_1,X_2}(x_1, x_2)$, then

$$p_{X_1,X_2}(x_1, x_2) = p_{X_1}(x_1) \, p_{X_2|X_1}(x_2|x_1)$$

where

$$X_1 \sim \mathrm{Binomial}(n, p_1) \quad \text{and} \quad X_2 \mid X_1 = x_1 \sim \mathrm{Binomial}\left(n - x_1, \frac{p_2}{1 - p_1}\right).$$

Therefore, $(X_1, X_2)$ can be produced by sampling from binomial distributions.
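For instance, with R's built-in binomial generator the decomposition above takes just two lines (a sketch):

n <- 100; p1 <- 0.15; p2 <- 0.20
x1 <- rbinom(1, n, p1)                  # X1 ~ Binomial(n, p1)
x2 <- rbinom(1, n - x1, p2/(1 - p1))    # X2 | X1 = x1 ~ Binomial(n - x1, p2/(1 - p1))
c(x1, x2)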

(Note: Since many packages have functions/procedures for generating binomial variates, we could use the above information to generate our multinomial random variate. However, to illustrate how the probability integral transformation can be used, we will continue one step further.)

Now, we are faced with the issue of generating binomial random variates from uniform random variates. The earlier approach (under Discrete Distributions) could be used, but I will illustrate another approach here. Consider $U \sim \mathrm{Uniform}(0, 1)$ and $0 < p < 1$. Let

$$Y = \begin{cases} 1 & U \le p \\ 0 & \text{otherwise.} \end{cases}$$

Then $Y \sim \mathrm{Binomial}(1, p)$ (or $Y \sim \mathrm{Bernoulli}(p)$). Also, note that if $Y_1, \ldots, Y_n$ are iid Bernoulli(p), then $\sum_{i=1}^{n} Y_i \sim \mathrm{Binomial}(n, p)$. Putting this all together, we can generate a binomial random variate from uniform random variates.

Algorithm for generating $(x_1, x_2)$ from a Trinomial(n, p_1, p_2) distribution:

1. Generate $u_1, \ldots, u_n$ from a Uniform(0, 1) distribution.
2. $x_1 = \sum_{i=1}^{n} I(u_i \le p_1)$ generates $x_1$ from a Binomial(n, p_1) distribution.
3. Generate $u_1, \ldots, u_{n-x_1}$ from a Uniform(0, 1) distribution.
4. $x_2 = \sum_{i=1}^{n-x_1} I(u_i \le p_2/(1 - p_1))$ gets $x_2 \mid x_1 \sim \mathrm{Binomial}(n - x_1, p_2/(1 - p_1))$.
5. $(x_1, x_2)$ is a trinomial variate.
6. Repeat the previous steps m times for a random sample of size m.

The following R function (stored in the file multinom.func) will generate a single variate from a trinomial distribution by implementing the above algorithm. Notice that the function requires you to specify n, p_1, and p_2.

rtrinom.ex<-function(n,p1,p2){
  # This function generates one random trinomial variate.
  # n = number of individuals sampled
  # p1 = probability of being in the first group
  # p2 = probability of being in the second group
  u1<-runif(n,0,1)
  x1<-sum(ifelse(u1<=p1,1,0))
  u2<-runif(n-x1,0,1)
  x2<-sum(ifelse(u2<=p2/(1-p1),1,0))
  c(x1,x2)
}

(We could also choose to return $(x_1, x_2, n - x_1 - x_2)$.)

Executing this function for the above example yields

> source("multinom.func")
> rtrinom.ex(100,0.15,0.20)
[1] 18 21

Therefore, in this sample of 100 college students, 18 were engineering majors and 21 were business majors.

Other Methods:

In many cases, the cdf of a random variable cannot be written in closed form or the cdf is not invertible. In these cases, other options must be explored. These include other types of generation methods (other algorithms) and indirect methods. An example of the former is the Box-Muller algorithm.

Box-Muller Algorithm: Generate $U_1$ and $U_2$, two independent Uniform(0, 1) random variables. Then

$$X_1 = \sqrt{-2\log(U_1)} \, \cos(2\pi U_2)$$
$$X_2 = \sqrt{-2\log(U_1)} \, \sin(2\pi U_2)$$

are iid Normal(0, 1) random variables.
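In R, the algorithm is only a few lines (a minimal sketch, using nothing beyond base R):

n <- 1000
u1 <- runif(n,0,1)
u2 <- runif(n,0,1)
x1 <- sqrt(-2*log(u1))*cos(2*pi*u2)
x2 <- sqrt(-2*log(u1))*sin(2*pi*u2)
z <- c(x1,x2)    # 2n iid Normal(0, 1) variates

A normal QQ-plot of z (qqnorm(z); qqline(z)) should be close to a straight line.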

Unfortunately, solutions such as the Box-Muller algorithm are not plentiful. Moreover, they take advantage of the specific structure of certain distributions and are, thus, less applicable as general strategies. For the most part, the generation of other continuous distributions is best accomplished through indirect methods. Indirect methods include, but are not limited to, the Accept/Reject Algorithm, the Ratio-of-Uniforms Method, the Metropolis-Hastings Algorithm, and Gibbs Sampling. Time permitting, we will discuss some of these at the end of the semester.

Finite Mixture Distributions

Consider a random variable X with pdf (or pmf) of the following form:

$$f(x) = w_1 f_1(x) + w_2 f_2(x) + \cdots + w_k f_k(x)$$

where $f_1(x), \ldots, f_k(x)$ are pdfs (or pmfs); $w_j \ge 0$, $j = 1, \ldots, k$; and $\sum_{j=1}^{k} w_j = 1$. The random variable X is said to have a (finite) mixture distribution. Examples of finite mixture distributions include the following.

X is said to have a contaminated normal distribution if $f_1(x)$ is the standard normal pdf and $f_2(x)$ is a normal pdf with variance greater than one. Typically, $w_1 = 1 - \varepsilon$ is close to one while $w_2 = \varepsilon$ is close to zero; some common values are $\varepsilon = 0.01, 0.05, 0.10$. Contaminated normal distributions are useful if you need to generate data from a distribution that is pretty much normal but contains some outliers or has slightly thicker tails.

Finite mixtures are often used when subpopulations are known to exist. As a simple example, suppose X represents heights of adults. The population of adults consists of both men and women. Heights of men might be modelled with one normal distribution while heights of women might be modelled with another normal distribution.

A zero-inflated Poisson distribution is another example. In this case, the bulk of the data follows a Poisson distribution but there are many extra zeroes.

Note that the distributions need not be normal. In addition, the distributions need not be from the same family of distributions, but they often are.

How would you generate n variates from a finite mixture distribution? Suppose that we can already generate variates from each of the component distributions. Also, suppose for simplicity that k = 2, and let $w = w_1$ and $1 - w = w_2$. Thus, we want to generate n random variates from

$$f(x) = w f_1(x) + (1 - w) f_2(x), \qquad 0 < w < 1.$$

One way of forming the above mixture distribution is to consider a conditional pdf/pmf of a similar form:

$$f(x|y) = y f_1(x) + (1 - y) f_2(x)$$

where y is a realization of a Bernoulli random variable with probability of success w, i.e. $Y \sim \mathrm{Bernoulli}(w)$. Then the marginal pdf/pmf of X is

$$f_X(x) = \sum_{y=0}^{1} f_{X,Y}(x, y) = \sum_{y=0}^{1} f(x|y) \, P(Y = y) = f(x|0)P(Y = 0) + f(x|1)P(Y = 1) = w f_1(x) + (1 - w) f_2(x),$$

which is the desired mixture distribution.

Algorithm for generating data from a mixture distribution:

1. Generate n variates from $f_1(x)$: $x_1^{(1)}, \ldots, x_n^{(1)}$.
2. Generate n variates from $f_2(x)$: $x_1^{(2)}, \ldots, x_n^{(2)}$.
3. Generate n variates from Bernoulli(w): $y_1, \ldots, y_n$.
4. Let $z_i = y_i x_i^{(1)} + (1 - y_i) x_i^{(2)}$, $i = 1, \ldots, n$.

Then $z_1, \ldots, z_n$ is a sample from $f(x) = w f_1(x) + (1 - w) f_2(x)$.
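For instance, the zero-inflated Poisson mentioned earlier fits this recipe directly, taking $f_1$ to be a point mass at zero and $f_2$ the Poisson pmf. A minimal sketch (the values of w and lambda are arbitrary choices for illustration):

n <- 1000
w <- 0.25                      # mixture weight on the extra zeroes (f1)
lambda <- 4                    # Poisson mean for f2
y <- rbinom(n,1,w)             # y_i = 1 selects the zero component
x <- (1-y)*rpois(n,lambda)     # step 4 above, with x^(1) identically zero

A table(x) should show many more zeroes than a plain Poisson(lambda) sample would give.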

Example: Suppose we want to generate n random variates from a contaminated normal distribution with $f_1(x) = \phi(x)$ and $f_2(x) = (1/\sigma)\phi(x/\sigma)$, where $\phi$ represents the standard normal pdf. (Therefore, the first component is Normal(0, 1) and the second component is Normal(0, σ²).) The following R function will do the trick:

contam.norm<-function(n,eps,sigma){
  # This function generates n random variates from a contaminated
  # normal distribution with contamination proportion eps and
  # contamination variance sigma^2.
  x1<-rnorm(n,0,1)
  x2<-rnorm(n,0,sigma)
  y<-rbinom(n,1,1-eps)
  x<-y*x1+(1-y)*x2
  x
}

Execute the function for n = 1000, eps = 0.1 and sigma = 3:

> x<-contam.norm(1000,0.1,3)

A histogram of x will be bell-shaped with longer tails than a standard normal distribution.

Alternatively, the following R function could be used:

contam.norm2<-function(n,eps,sigma){
  # This function generates n random variates from a contaminated
  # normal distribution with contamination proportion eps and
  # contamination variance sigma^2.
  x<-rnorm(n,0,1)
  y<-rbinom(n,1,1-eps)
  x<-y*x+(1-y)*sigma*x
  x
}

How does this differ from the first?

Note: When we run simulations, we will typically need to take several samples of the same size from the same distribution. We could do this in a for loop, but there are more efficient ways. Can you think of any?

Suppose you will need 10 samples of size 5 from a standard normal distribution. In simulation studies, these 10 samples should be independent. Therefore, one can simply generate 10 × 5 = 50 standard normal variates and fill in a matrix so that each sample is one column of the matrix.

> set.seed(10)
> y<-matrix(rnorm(50),nrow=5,ncol=10)
> y
            [,1]       [,2]       [,3]        [,4]       [,5]       [,6]
[1,]  0.01874617  0.3897943  1.1017795  0.08934727 -0.5963106 -0.3736616
[2,] -0.18425254 -1.2080762  0.7557815 -0.95494386 -2.1852868 -0.6875554
[3,] -1.37133055 -0.3636760 -0.2382336 -0.19515038 -0.6748659 -0.8721588
[4,] -0.59916772 -1.6266727  0.9874447  0.92552126 -2.1190612 -0.1017610
[5,]  0.29454513 -0.2564784  0.7413901  0.48297852 -1.2651980 -0.2537805

            [,7]       [,8]       [,9]      [,10]
[1,] -1.85374045 -1.4355144  1.0865514 -0.02881534
[2,] -0.07794607  0.3620872 -0.7625449  0.23252515
[3,]  0.96856634 -1.7590868 -0.8286625 -0.30120868
[4,]  0.18492596 -0.3245440  0.8344739 -0.67761458
[5,] -1.37994358 -0.6515630 -0.9676520  0.65522764

The replicate() function is useful for repeating a procedure many times.

> set.seed(10)
> z<-replicate(10,rnorm(5))
> z
            [,1]       [,2]       [,3]        [,4]       [,5]       [,6]
[1,]  0.01874617  0.3897943  1.1017795  0.08934727 -0.5963106 -0.3736616
[2,] -0.18425254 -1.2080762  0.7557815 -0.95494386 -2.1852868 -0.6875554
[3,] -1.37133055 -0.3636760 -0.2382336 -0.19515038 -0.6748659 -0.8721588
[4,] -0.59916772 -1.6266727  0.9874447  0.92552126 -2.1190612 -0.1017610
[5,]  0.29454513 -0.2564784  0.7413901  0.48297852 -1.2651980 -0.2537805
            [,7]       [,8]       [,9]      [,10]
[1,] -1.85374045 -1.4355144  1.0865514 -0.02881534
[2,] -0.07794607  0.3620872 -0.7625449  0.23252515
[3,]  0.96856634 -1.7590868 -0.8286625 -0.30120868
[4,]  0.18492596 -0.3245440  0.8344739 -0.67761458
[5,] -1.37994358 -0.6515630 -0.9676520  0.65522764
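Once the samples are stored as the columns of a matrix, a statistic can be computed for every sample at once, without an explicit loop. For example, the 10 sample means (a usage sketch):

> apply(z, 2, mean)   # one mean per column, i.e. per sample
> colMeans(z)         # equivalent, and faster for means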