Using R to Create Synthetic Discrete Response Regression Models


Arizona State University, From the SelectedWorks of Joseph M. Hilbe
July 3, 2011
Using R to Create Synthetic Discrete Response Regression Models
Joseph Hilbe, Arizona State University
Available at: https://works.bepress.com/joseph_hilbe/3/

Using R to Create Synthetic Discrete Response Regression Models

Joseph M. Hilbe
Arizona State University, and Jet Propulsion Laboratory, California Institute of Technology
3 July, 2011 (revised)
Hilbe@asu.edu

2009, Joseph M. Hilbe, all rights reserved. Do not distribute without prior written permission of the author. Appropriate citation is requested if this material is used in preparation of an article or book.

The use of synthetic data and synthetic models has played an important role in evaluating model fit and bias. Synthetic models and tests are now found in many of the major texts dealing with statistical models. R is particularly adept at letting the user create synthetic data, providing the opportunity to engineer nearly every aspect of the data and model as the statistician requires. In my texts, Negative Binomial Regression (2007, Cambridge University Press) and Logistic Regression Models (2009, Chapman & Hall/CRC), I present the reader with a number of synthetic models, displaying how missing predictors, needed interactions, incorrectly specified links, and so forth can affect both model parameter estimates and dispersion statistics. The examples in these two books demonstrate how to distinguish apparent from real overdispersion in count and grouped binomial models.

In this article I demonstrate how to construct synthetic data sets for various popular discrete response regression models. The same methods may be used to create data specific to a wide variety of alternative models. Specifically, I provide code for creating synthetic data sets for given types of binomial, Poisson, negative binomial, ordered slopes and proportional odds, and multinomial models. All code is based on the R pseudo-random number generators runif(), rpois(), rgamma(), and rbinom(); the pnorm() function is used for probit models. I also provide code for developing a synthetic NB-C, or canonical negative binomial, model, even though there is as yet no R command for its estimation. I intend to develop such a command in a forthcoming article. The NB-C code is based on the rnbinom() function.

Note that the model predictors are all based on random uniform data. The results would be the same if we used normal variates, or even binary and factor variates. NB1 and NB2 negative binomial models are based on a Poisson-gamma mixture; the code clearly shows how this occurs. The logic of the coding can be used to develop a wide range of other types of synthetic models.

For each set of model code it is easy for users to amend the default values, selecting their own number of model observations, seed values, number of predictors and their values, and, for negative binomial models, a specific value for the ancillary parameter. Categorical response models are coded such that the number of predictors and their values may be selected, as well as cut point values for ordered response models and the number of levels for multinomial models. Poisson, negative binomial, and binomial models may also employ an offset.
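The count and binomial scripts below assume that the two uniform predictors, x1 and x2, already exist in the workspace. A minimal setup sketch, run once per session, might look as follows; the 50,000-observation count and the seed value shown are arbitrary illustrative choices, not fixed by the algorithms:

==========================================================================
# setup.r -- create the uniform predictors assumed by the scripts below
set.seed(32478)       # optional; fixes the pseudo-random stream
nobs <- 50000         # number of observations used throughout
x1 <- runif(nobs)     # first pseudo-random uniform predictor
x2 <- runif(nobs)     # second pseudo-random uniform predictor
==========================================================================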

I divide this article into three sections. First, I discuss the creation of synthetic count response models: specifically, Poisson, NB2, NB1, and NB-C. Second, I develop code for binomial models, which include both Bernoulli (binary) and binomial (grouped) logit and probit models. The third section provides an overview of creating synthetic proportional slopes models, including the proportional odds model and the ordered probit, as well as code for constructing a synthetic multinomial logit model. Statisticians should find it relatively easy to adjust the provided default code to construct synthetic data for a wide variety of alternative discrete response models. Synthetic continuous response models are based on the same logic as the code developed here.

It should be noted that each run of these synthetic models will differ from the others unless a seed is specified. In addition, it is possible to apply a Monte Carlo simulation algorithm to each of these models in order to determine the mean parameter and intercept values over many simulation runs; a sketch of such a loop follows the first Poisson example below. I have performed Monte Carlo simulations on each model discussed in this article, with the result that the user-specified values are nearly identical to the resultant mean values. Some algorithms appear to yield more variability in parameter estimates than others, but over a thousand or more Monte Carlo runs the mean values are stable.

Each coding algorithm may be used as a batch file, either submitted as a whole or a line at a time. As previously indicated, the user may adjust various aspects of an algorithm to suit their requirements. Each algorithm's name begins with the characters syn, followed by the type of model and any extra capabilities. For example, the code to create the basic synthetic Poisson model is called syn.poisson.r, and the algorithm adding an offset is called syn.poissono.r. For proportional odds and multinomial models, the number of cuts (ordered models) or levels (multinomial) is indicated at the end of the name, e.g. syn.ologit4.r.

Specific directions for running the algorithms:
1: Open File -> Open script.
2: Select the file name from the directory where the files are kept.
3: Using the left mouse button, select all lines of the file (or the lines desired).
4: Click once on the right mouse button; choose Run line or selection.
5: Left click on the mouse button.
6: The synthetic model results are displayed in the R console.

1: SYNTHETIC COUNT MODELS

I shall first create a simple Poisson model using the R function rpois(). This is the basic, or paradigm, count model. The synthetic model is given two pseudo-random uniform predictors, x1 and x2. Random normal predictors are also commonly used when developing synthetic models. The values assigned to the two predictors and the intercept are, respectively, x1 = 0.75, x2 = -1.25, and 2. Note that the final two lines of each algorithm are the regression model and the summary output.

The first line below, yielding xb, is the linear predictor, which is passed to the inverse link function in the following line. The inverse link function is used in generalized linear models to calculate the fitted or predicted value. The fit is then employed in the rpois() function to produce the synthetic Poisson response variate.

SYNTHETIC POISSON MODEL
==========================================================================
# syn.poisson.r // Joseph Hilbe 10Apr2009
xb <- 2 + .75*x1 - 1.25*x2                 # linear predictor
exb <- exp(xb)                             # inverse link to predicted fit
py <- rpois(50000, exb)                    # generate random Poisson variates
jhpoi <- glm(py ~ x1 + x2, family=poisson)
summary(jhpoi)
==========================================================================

glm(formula = py ~ x1 + x2, family = poisson)

Deviance Residuals:
     Min       1Q   Median       3Q      Max
-4.34669 -0.75621 -0.05794  0.61327  4.07484

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  2.008057   0.004694   427.8   <2e-16 ***
x1           0.747337   0.006292   118.8   <2e-16 ***
x2          -1.263599   0.006439  -196.2   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 106437 on 49999 degrees of freedom
Residual deviance:  52133 on 49997 degrees of freedom
AIC: 227207

Number of Fisher Scoring iterations: 4

The resultant parameter estimates closely approximate the user-defined values. A seed was not specified for the example model. If we wish to generate the same random numbers, and the exact same model, a seed may be set before the first line; the function is set.seed(#), e.g., set.seed(32478). Unfortunately the Pearson statistic is not displayed with the glm() function output; rather, the deviance is used as the basis of the dispersion. Although this is the traditional manner of calculating the dispersion statistic, it is not correct. See Hilbe (2007, 2009).
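As mentioned earlier, a Monte Carlo loop may be wrapped around any of these generators to confirm that the mean parameter estimates match the specified values. A minimal sketch for the Poisson algorithm follows; the 100-replication count and the object name mcres are illustrative choices, not part of the original scripts:

==========================================================================
# mc.poisson.r -- Monte Carlo check of the synthetic Poisson model (sketch)
mcres <- matrix(0, nrow=100, ncol=3)       # stores coefficients of each run
for (i in 1:100) {
    x1 <- runif(50000)
    x2 <- runif(50000)
    xb <- 2 + .75*x1 - 1.25*x2
    py <- rpois(50000, exp(xb))
    mcres[i, ] <- coef(glm(py ~ x1 + x2, family=poisson))
}
colMeans(mcres)   # means should be close to 2, .75, -1.25
==========================================================================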

Poisson models are commonly parameterized as rate models. As such they employ an offset, which reflects the area or time over which the count response is generated. Since the natural log is the canonical link of the Poisson model, the offset must be logged prior to entry into the estimating algorithm.

A synthetic offset may be randomly generated, or may be specified by the user. For this example I create an area offset having values increasing by 100 for each 10,000 observations in the 50,000-observation data set. The first 10,000 observations have an offset value of 100, the second 10,000 a value of 200, and so on until the final 10,000 observations, which have a value of 500. This value is calculated on the first line of program code below. Note that the log-offset is added to the linear predictor.

POISSON WITH RATE PARAMETERIZATION
===========================================================================
# syn.poissono.r // Joseph Hilbe 10Apr2009
off <- rep(1:5, each=10000)*100            # create the offset as defined
loff <- log(off)                           # log the offset
xb <- 2 + .75*x1 - 1.25*x2 + loff          # linear predictor
exb <- exp(xb)                             # inverse link
py <- rpois(50000, exb)                    # create random Poisson variates
poir <- glm(py ~ x1 + x2 + offset(loff), family=poisson)
summary(poir)
============================================================================

glm(formula = py ~ x1 + x2 + offset(loff), family = poisson)

Deviance Residuals:
      Min        1Q    Median        3Q       Max
-4.378369 -0.682454 -0.001613  0.679505  4.444123

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  2.0002055  0.0002715    7367   <2e-16 ***
x1           0.7501224  0.0003612    2077   <2e-16 ***
x2          -1.2507453  0.0003705   -3375   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 16323076 on 49999 degrees of freedom
Residual deviance:    50368 on 49997 degrees of freedom
AIC: 507939

Number of Fisher Scoring iterations: 3

A display of the offset table is shown as:

> table(off)
off
  100   200   300   400   500
10000 10000 10000 10000 10000

Note that the resultant parameter estimates are nearly identical to the specified values. We expect that the parameter estimates of the model with an offset, or rate-parameterized model, will closely approximate those of the standard model. However, we also expect that the standard errors will differ. Moreover, if the rate model were modeled without declaring an offset, we would notice a greatly inflated intercept value.
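To see the inflated intercept directly, one may refit the rate data while omitting the offset declaration. A brief sketch, using the objects generated in the script above (the name poinooff is mine, chosen for illustration):

===========================================================================
# Refit the rate data without declaring the offset (sketch)
poinooff <- glm(py ~ x1 + x2, family=poisson)
coef(poinooff)   # the intercept absorbs the ignored log-offset and is
                 # greatly inflated relative to the specified value of 2
===========================================================================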

A Poisson model having a Pearson-based dispersion greater than 1.0 indicates possible overdispersion. Typically the deviance-based dispersion is similar to the Pearson, but not always. In any case, two types of generalized linear model are generally employed in R when there is evidence of overdispersion in a Poisson model: quasipoisson and negative binomial. The negative binomial used in both the glm.nb() and glm() functions is commonly referred to as the quadratic negative binomial, or NB2, model. It is parameterized with a variance function of μ + αμ². The other common parameterization of the negative binomial is termed the linear negative binomial, or NB1. The NB1 model is parameterized as μ + αμ, or as μ(1+α). The quasipoisson model is simply the Poisson model with standard errors scaled by the Pearson dispersion. Since the standard errors are not based directly on the Hessian matrix derived from the log-likelihood function, the term quasi-likelihood is used for the type of model estimated. Quasipoisson is the name given to Poisson models with scaled standard errors.

The NB2 parameterization of the negative binomial can be generated as a Poisson-gamma mixture model, with a gamma scale parameter of 1. We use this method to create a synthetic NB2 variate, and a corresponding synthetic NB2 model. The negative binomial random number generator in R, rnbinom(), is not appropriate for dealing with NB2 data; it is, however, appropriate for generating NB-C data, as we shall discuss later. In fact, both NB2 and NB1 synthetic models are generated as a mixture of Poisson and gamma distributions. The algorithm below is used to create a synthetic NB2 model. Note how the mixture of distributions occurs. The same parameters are used here as for the above Poisson models. We assign a value of .5 to the NB2 heterogeneity or overdispersion parameter, also termed an ancillary parameter. The glm.nb() function, which is part of the MASS library, is used to model the synthetic data.

NEGATIVE BINOMIAL (NB2) SYNTHETIC MODEL
=======================================================================
# syn.nb2.r // Joseph Hilbe 10Apr2009
library(MASS)                            # provides glm.nb()
xb <- 2 + .75*x1 - 1.25*x2               # linear predictor / parameter values
a <- .5                                  # assign value to ancillary parameter
exb <- exp(xb)                           # Poisson predicted value
xg <- rgamma(50000, shape=a, rate=a)     # generate gamma variates given alpha
xbg <- exb*xg                            # mix Poisson and gamma variates
nby <- rpois(50000, xbg)                 # generate NB2 variates
jhnb2 <- glm.nb(nby ~ x1 + x2)           # model NB2
summary(jhnb2)
=======================================================================

glm.nb(formula = nby ~ x1 + x2, init.theta = 0.503321424020909, link = log)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.8687 -1.4161 -0.4982  0.2087  4.3393

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  2.00385    0.01729  115.92   <2e-16 ***
x1           0.76716    0.02287   33.54   <2e-16 ***
x2          -1.27986    0.02292  -55.83   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(0.5033) family taken to be 1)

    Null deviance: 59153 on 49999 degrees of freedom
Residual deviance: 54957 on 49997 degrees of freedom
AIC: 274887

Number of Fisher Scoring iterations: 1

              Theta:  0.50332
          Std. Err.:  0.00374
 2 x log-likelihood:  -274879.37000

The values of the parameter estimates and ancillary parameter closely approximate the user-specified values. The estimated value of alpha is .50332. Another run will produce different values, centered about the specified values. There is typically more variability in simulated NB variates than in Poisson variates, owing to the interaction of the ancillary parameter.

The glm() command may also be used to estimate NB2 parameter estimates. However, the value of the ancillary parameter must be entered into the function as a constant; it is not estimated, as it is with the glm.nb() function. The new line and model results are displayed below.

NB2 MODEL WITH THETA (ALPHA) ENTERED AS A CONSTANT
===================================================================
mynb2c <- glm(nby ~ x1 + x2, family=negative.binomial(theta=.5))
summary(mynb2c)
===================================================================

glm(formula = nby ~ x1 + x2, family = negative.binomial(theta = 0.5))

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.8642 -1.4135 -0.4967  0.2081  4.3255

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.00386    0.01725  116.19   <2e-16 ***
x1           0.76715    0.02282   33.62   <2e-16 ***
x2          -1.27986    0.02287  -55.96   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(0.5) family taken to be 0.989393)

    Null deviance: 58862 on 49999 degrees of freedom
Residual deviance: 54691 on 49997 degrees of freedom
AIC: 274886

Number of Fisher Scoring iterations: 6

We may employ an offset in the NB2 algorithm in the same manner as we did for the Poisson. Since the mean of the Poisson and of the NB2 are both exp(xb), we may use the same method. The synthetic NB2 data and model with offset are in the syn.nb2o.r file.

The linear negative binomial model, NB1, is also based on a Poisson-gamma mixture distribution. As discussed earlier, the NB1 heterogeneity or ancillary parameter is typically referred to as δ, not α. Converting the NB2 algorithm to NB1 entails defining idelta as the inverse of the value of delta, the desired value of the model ancillary parameter, multiplied by the fitted value, exb; idelta is then given to the rgamma() function as both the shape and the rate. All else is the same as in the NB2 algorithm. The resultant synthetic data can be modeled using the ml.nb1 function in the COUNT package, which is located on CRAN. The COUNT package was first published in September 2010 as a source for data, functions, and scripts associated with Hilbe (2011). The code below is stored in the syn.nb1.r file.

NEGATIVE BINOMIAL (NB1) SYNTHETIC MODEL
===================================================================
# syn.nb1.r // Joseph Hilbe 10Apr2009
library(COUNT)
xb <- 2 + .75*x1 - 1.25*x2               # linear predictor; parameter values
delta <- .5                              # value assigned to delta
exb <- exp(xb)
idelta <- (1/delta)*exb                  # product of 1/delta and mu
xg <- rgamma(50000, shape=idelta, rate=idelta)
xbg <- exb*xg
nb1y <- rpois(50000, xbg)
nb1data <- data.frame(nb1y, x1, x2)
nb1 <- ml.nb1(nb1y ~ x1 + x2, data=nb1data)
nb1
===================================================================

              Estimate          SE          Z        LCL        UCL
(Intercept)  2.0049903 0.005733333  349.70763  1.9937530  2.0162276
x1           0.7367309 0.007626869   96.59677  0.7217823  0.7516796
x2          -1.2401928 0.007792276 -159.15668 -1.2554657 -1.2249199
alpha        0.5015621 0.009831187   51.01745  0.4822930  0.5208312
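The NB1 variance relation, μ(1 + δ), is easy to verify empirically by holding the linear predictor fixed. A brief sketch, with the fixed value xb = 2 and the 100,000 draws chosen purely for illustration:

===================================================================
# Check the NB1 variance function, mu*(1 + delta), at a fixed mean (sketch)
delta <- .5
mu <- exp(2)                               # fixed linear predictor xb = 2
idelta <- (1/delta)*mu
xg <- rgamma(100000, shape=idelta, rate=idelta)
y <- rpois(100000, mu*xg)
mean(y)     # approximately mu = 7.39
var(y)      # approximately mu*(1 + delta) = 11.08
===================================================================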

The canonical negative binomial (NB-C), however, must be constructed in an entirely different manner from NB2, NB1, or Poisson. NB-C is based directly on the negative binomial PDF. R's rnbinom() function may be used to construct NB-C data, but not NB1 or NB2 data. Other options, such as offsets, are entered into the NB-C algorithm in the same manner as they were for the Poisson, NB2, and NB1. The ml.nbc function in the COUNT package can be used to estimate NB-C models.

The NB-C inverse link differs from that of the Poisson, NB1, and NB2 models. It may be expressed as 1/((exp(-xb)-1)*α). Unlike the other parameterizations, α is part of the link and inverse link functions. The probability is given as 1/(1+αμ). See Hilbe (2011) for details. The algorithm for creating NB-C data is given below, and is stored in syn.nbc.r. Values of p outside the range 0 to 1 are dropped from estimation, although this is usually quite rare. The resultant model still estimates the specified values of the parameter estimates; only the number of observations differs from that assigned by the user.

SYNTHETIC CANONICAL NEGATIVE BINOMIAL (NB-C) MODEL

rnbinom(n, size, prob, mu), where size is the ancillary parameter, prob is the probability p, and mu is the mean. Only p or mu may be put into the function, not both.

==========================================================================
# syn.nbc.r // Joseph Hilbe 10Apr2009; revised 2/2011, adding ml.nbc
library(COUNT)
a <- 1.15                          # value of alpha: 1.15
xb <- 1.25*x1 + .1*x2 - 1.5        # x1=1.25; x2=0.1; _cons=-1.5
mu <- 1/((exp(-xb)-1)*a)
p <- 1/(1+a*mu)
r <- 1/a
nbcy <- rnbinom(50000, size=r, prob=p)
nbcdata <- data.frame(nbcy, x1, x2)
nbc <- ml.nbc(nbcy ~ x1 + x2, data=nbcdata)
nbc
==========================================================================

               Estimate         SE         Z         LCL        UCL
(Intercept) -1.50669922 0.01767738 -85.23319 -1.54134688 -1.4720516
x1           1.26157472 0.01576360  80.03086  1.23067806  1.2924714
x2           0.08397626 0.00903996   9.28945  0.06625794  0.1016946
alpha        1.12340448 0.01725717  65.09786  1.08958044  1.1572285

The specified and estimated values are nearly identical. Note that the log-likelihood function used for NB-C estimation is given as (see Hilbe, 2011):

LL(NB-C) = Σ { y(xb) + (1/α) ln(1 - exp(xb)) + lnΓ(y + 1/α) - lnΓ(y + 1) - lnΓ(1/α) }

I should also mention that exact regression models are based on the canonical form of the model, e.g. the binomial logit and Poisson log models. If an exact negative binomial regression model is to be created, it must be based on the NB-C parameterization, not NB2.
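Since base R has no NB-C family, the log-likelihood above can also be coded directly and handed to a general optimizer. A minimal sketch, assuming the nbcy, x1, and x2 objects generated above; the function name nbc.ll, the starting values, and the use of optim() are my illustrative choices, not the method of the COUNT package:

==========================================================================
# nbc.ll: NB-C log-likelihood at coefficients b and log(alpha) (sketch)
nbc.ll <- function(par, y, X) {
    b  <- par[1:ncol(X)]                  # regression coefficients
    a  <- exp(par[ncol(X)+1])             # alpha, kept positive via exp()
    xb <- drop(X %*% b)                   # canonical linear predictor
    if (any(xb >= 0)) return(-1e10)       # canonical link requires xb < 0
    sum(y*xb + (1/a)*log(1 - exp(xb)) +
        lgamma(y + 1/a) - lgamma(y + 1) - lgamma(1/a))
}
X <- cbind(1, x1, x2)
start <- c(-1.5, 1, 0, 0)                 # intercept, b1, b2, log(alpha)
fit <- optim(start, nbc.ll, y=nbcy, X=X,
             control=list(fnscale=-1, maxit=2000))   # maximize
fit$par                                   # coefficients, then log(alpha)
==========================================================================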

The zero-inflated Poisson model is a mixture model in which zeros arise from both a count model component and a binary model component. We give an example of such a mixture model with Poisson as the count component and binary logistic regression as the binary component. See Hilbe (2011) for details on the model and its interpretation. Note that the Poisson component is constructed first, with the data stored in a data frame. The zeroinfl function in the pscl package is used to estimate the model. The values of the two components are established as:

POISSON  : intercept = 2;   x1 = 0.75; x2 = -1.25
LOGISTIC : intercept = 0.2; x1 = 0.9;  x2 = 0.1

ZERO-INFLATED POISSON
=========================================================================
# Zero-Inflated Poisson  syn.zip.r  J Hilbe 6Jun2011
library(pscl)
nobs <- 50000
x1 <- runif(nobs)
x2 <- runif(nobs)
# POISSON
xb <- 2 + .75*x1 - 1.25*x2
exb <- exp(xb)
poy <- rpois(nobs, exb)
pdata <- data.frame(poy, x1, x2)
# LOGISTIC
pi <- 1/(1+exp(-(.9*x1 + .1*x2 + .2)))
pdata$bern <- runif(nobs) > pi
zy <- pdata$bern * poy
zip <- zeroinfl(zy ~ x1 + x2 | x1 + x2, dist="poisson", data=pdata)
summary(zip)
=========================================================================

zeroinfl(formula = zy ~ x1 + x2 | x1 + x2, data = pdata, dist = "poisson")

Pearson residuals:
    Min      1Q  Median      3Q     Max
-0.7908 -0.6328 -0.5657  0.6396  5.5661

Count model coefficients (poisson with log link):
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  2.000417   0.007976  250.80   <2e-16 ***
x1           0.745379   0.011221   66.43   <2e-16 ***
x2          -1.250595   0.011780 -106.16   <2e-16 ***

Zero-inflation model coefficients (binomial with logit link):
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.25101    0.02504  10.024   <2e-16 ***
x1           0.83844    0.03378  24.820   <2e-16 ***
x2           0.09824    0.03375   2.911   0.0036 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Number of iterations in BFGS optimization: 11
Log-likelihood: -6.756e+04 on 6 Df
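A quick sanity check on the generated response is to compare the observed fraction of zeros with the fraction implied by the mixture, π + (1 - π)exp(-μ). A sketch, using the pi, exb, and zy objects created in the ZIP script above:

=========================================================================
# Compare observed and implied zero fractions for the ZIP data (sketch)
mean(zy == 0)                      # observed proportion of zeros
mean(pi + (1 - pi)*exp(-exb))      # implied by the Poisson-logit mixture
=========================================================================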

The zero-inflated negative binomial is built on the same logic as the ZIP model just discussed. The same intercepts and coefficients are given as for the ZIP model.

ZERO-INFLATED NEGATIVE BINOMIAL
==================================================================
# ZERO-INFLATED NEGATIVE BINOMIAL  syn.zinb.r
# J Hilbe 6Jun2011
library(pscl)
nobs <- 50000
x1 <- runif(nobs)
x2 <- runif(nobs)
xb <- 2 + .75*x1 - 1.25*x2
a <- .5
exb <- exp(xb)
xg <- rgamma(nobs, shape=a, rate=a)
xbg <- exb*xg
nby <- rpois(nobs, xbg)
nbdata <- data.frame(nby, x1, x2)
pi <- 1/(1+exp(-(.9*x1 + .1*x2 + .2)))
nbdata$bern <- runif(nobs) > pi
zy <- nbdata$bern * nby
zinb <- zeroinfl(zy ~ x1 + x2 | x1 + x2, dist="negbin", data=nbdata)
summary(zinb)
===================================================================

zeroinfl(formula = zy ~ x1 + x2 | x1 + x2, data = nbdata, dist = "negbin")

Pearson residuals:
    Min      1Q  Median      3Q     Max
-0.4096 -0.3593 -0.3218 -0.2855 20.6937

Count model coefficients (negbin with log link):
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  1.98477    0.03392   58.51   <2e-16 ***
x1           0.79218    0.04433   17.87   <2e-16 ***
x2          -1.22246    0.04458  -27.43   <2e-16 ***
Log(theta)  -0.67919    0.03281  -20.70   <2e-16 ***

Zero-inflation model coefficients (binomial with logit link):
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.20921    0.03864   5.415 6.15e-08 ***
x1           0.96134    0.04525  21.245  < 2e-16 ***
x2           0.04372    0.04436   0.986    0.324
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Theta = 0.507
Number of iterations in BFGS optimization: 25
Log-likelihood: -6.283e+04 on 7 Df

We established that the value of the negative binomial ancillary parameter is to be .5. Remember that theta is defined in R's glm and glm.nb functions as 1/alpha. In the zeroinfl function, however, the authors have parameterized it as is standard in other software applications, and as parameterized in the COUNT package's ml.nb2, ml.nb1, and ml.nbc functions.

Next we look at the hurdle models, namely the Poisson-logit and negative binomial-logit hurdle models. For this class of model the count component is restructured as a zero-truncated model, and the binary component specifies 1 for any response value above zero. Zero counts are therefore found only in the binary component. This is unlike the zero-inflated models, which mix zeros from the two components.

POISSON-LOGIT HURDLE MODEL
==================================================================
# syn.logitpoisson.hurdle.r
# R - SYNTHETIC LOGIT-POISSON HURDLE SCRIPT
# J. Hilbe 6 June 2011
library(pscl)
nobs <- 50000
# Generate predictors, design matrix
x1 <- runif(nobs)
x2 <- runif(nobs)
xb <- 2 + .75*x1 - 1.25*x2
# Construct Poisson responses
exb <- exp(xb)
poy <- rpois(nobs, exb)
pdata <- data.frame(poy, x1, x2)
# Construct filter
pi <- 1/(1+exp(-(.9*x1 + .1*x2 + .2)))
pdata$bern <- runif(nobs) > pi
# Remove all response zeros
pdata <- subset(pdata, poy > 0)
# Add structural zeros
pdata$poy[pdata$bern] <- 0
# Model synthetic logit-Poisson hurdle data
hlpoi <- hurdle(poy ~ x1 + x2, dist="poisson", zero.dist="binomial",
                link="logit", data=pdata)
summary(hlpoi)
=========================================================

hurdle(formula = poy ~ x1 + x2, data = pdata, dist = "poisson",
    zero.dist = "binomial", link = "logit")

Pearson residuals:
     Min       1Q   Median       3Q      Max
-1.55022 -1.00822  0.06045  0.74654  5.47058

Count model coefficients (truncated poisson with log link):
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  1.995620   0.006003  332.43   <2e-16 ***
x1           0.761713   0.007895   96.49   <2e-16 ***
x2          -1.254314   0.007987 -157.05   <2e-16 ***

Zero hurdle model coefficients (binomial with logit link):
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.20448    0.02486   8.226   <2e-16 ***
x1           0.90717    0.03366  26.954   <2e-16 ***
x2           0.06832    0.03338   2.047   0.0407 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Number of iterations in BFGS optimization: 9
Log-likelihood: -1.053e+05 on 6 Df

The values of the component coefficients and intercepts are quite close. We next construct a negative binomial-logit, or NB2-logit, hurdle model. It is based on the same logic as the ZINB model but, instead of mixing the distributions, it separates them, as we did in the Poisson-logit hurdle model.

NEGATIVE BINOMIAL-LOGIT HURDLE MODEL
===========================================================================
# syn.lnb2.hurdle.r : J Hilbe 6Jun2011
library(pscl)
nobs <- 50000
x1 <- runif(nobs)
x2 <- runif(nobs)
xb <- 2 + .75*x1 - 1.25*x2
a <- .5
exb <- exp(xb)
xg <- rgamma(nobs, shape=a, rate=a)
xbg <- exb*xg
nby <- rpois(nobs, xbg)
nbdata <- data.frame(nby, x1, x2)
pi <- 1/(1+exp(-(.9*x1 + .1*x2 + .2)))
nbdata$bern <- runif(nobs) > pi
nbdata <- subset(nbdata, nby > 0)
nbdata$nby[nbdata$bern] <- 0
hlnb2 <- hurdle(nby ~ x1 + x2, dist="negbin", zero.dist="binomial",
                link="logit", data=nbdata)
summary(hlnb2)
============================================================================

hurdle(formula = nby ~ x1 + x2, data = nbdata, dist = "negbin",
    zero.dist = "binomial", link = "logit")

Pearson residuals:
    Min      1Q  Median      3Q     Max
-0.7351 -0.6151 -0.4055  0.2136 12.2085

Count model coefficients (truncated negbin with log link):
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  2.01068    0.02513   80.00   <2e-16 ***
x1           0.78575    0.03087   25.45   <2e-16 ***
x2          -1.24280    0.03052  -40.72   <2e-16 ***
Log(theta)  -0.65281    0.02230  -29.27   <2e-16 ***

Zero hurdle model coefficients (binomial with logit link):
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.19189    0.02909   6.597 4.21e-11 ***
x1           0.89671    0.03936  22.784  < 2e-16 ***
x2           0.12046    0.03931   3.065  0.00218 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Theta: count = 0.5206
Number of iterations in BFGS optimization: 17
Log-likelihood: -9.529e+04 on 7 Df

The closeness of the specified and estimated coefficients is not as good as with the Poisson-logit hurdle, but this is largely because we are observing only one of a host of possible synthetic models. Using a Monte Carlo method to repeat and summarize the runs would show much closer values.

For our final synthetic count model we look at the Poisson-Poisson finite mixture model. The code comes from Hilbe (2011), which contains a good discussion of the model.

POISSON-POISSON FINITE MIXTURE MODEL
=============================================================================
# syn.ppfm.r  Synthetic Poisson-Poisson Finite Mixture model
# Table 13.2: Hilbe, Negative Binomial Regression, 2 ed, Cambridge Univ Press
library(flexmix)
nobs <- 50000
x1 <- runif(nobs)
x2 <- runif(nobs)
xb1 <- 1 + 0.25*x1 - 0.75*x2
xb2 <- 2 + 0.75*x1 - 1.25*x2
exb1 <- exp(xb1)
exb2 <- exp(xb2)
py1 <- rpois(nobs, exb1)                 # component with the xb1 parameters
py2 <- rpois(nobs, exb2)                 # component with the xb2 parameters
poixpoi <- py2
poixpoi <- ifelse(runif(nobs) > .9, py1, poixpoi)   # 10% py1, 90% py2
pxp <- flexmix(poixpoi ~ x1 + x2, k=2, model=FLXMRglm(family="poisson"))
summary(pxp)
parameters(pxp, component=1, model=1)
parameters(pxp, component=2, model=1)
===========================================================================

flexmix(formula = poixpoi ~ x1 + x2, k = 2, model = FLXMRglm(family = "poisson"))

       prior  size post>0  ratio
Comp.1 0.889 47236  50000 0.9447    # proportion of component 1 = .889
Comp.2 0.111  2764  46308 0.0597    # proportion of component 2 = .111

'log Lik.' -117663.1 (df=7)
AIC: 235340.2   BIC: 235401.9

> parameters(pxp, component=1, model=1)
                     Comp.1
coef.(Intercept)  2.0042292
coef.x1           0.7498374
coef.x2          -1.2490310

> parameters(pxp, component=2, model=1)
                     Comp.2
coef.(Intercept)  1.0561926
coef.x1           0.3204433
coef.x2          -0.8295100

Note that the proportions and coefficients are given in the reverse order from that specified in the synthetic model. The first component is defined as having 10% of the sample, whereas the second component has 90%; this is coded on the second poixpoi line. I have indicated the proportions in the output above. It is not at all difficult to modify the above code to employ Poisson-NB2 mixtures, NB2-NB2 mixtures, or any other combination we have been discussing.

2: SYNTHETIC BINOMIAL MODELS

Synthetic binomial models are constructed in the same manner as synthetic Poisson data and models. The key lines are those that generate pseudo-random variates: a line creating the linear predictor with user-defined parameters, a line employing the inverse link function to generate μ, the fitted probability of 1, and a line using μ to generate pseudo-random variates appropriate to the distribution.

A Bernoulli distribution consists entirely of binary values, 1/0. Here y is binary and is considered to be the response variable, explained by the values of x1 and x2. Data such as these are typically modeled using a logistic regression. A probit, loglog, or complementary loglog model may also be used to model the data.

   y x1 x2
1: 1  1  1
2: 0  1  1
3: 1  0  1
4: 1  1  0
5: 1  0  1
6: 0  0  1

The above data may be grouped by covariate patterns. The covariates here are, of course, x1 and x2. With y now the number of successes, i.e. a count of 1's, and m the number of observations having the same covariate pattern, the above data may be grouped as:

   y m x1 x2
1: 1 2  1  1
2: 2 3  0  1
3: 1 1  1  0

The distribution of y/m is binomial: y is a count of observations having a value of y=1 for a specific covariate pattern, and m is the number of observations having that same covariate pattern. One can see that the Bernoulli distribution is a subset of the binomial, i.e. a binomial distribution where m=1. In actuality, a logistic regression models the top data as if there were no m, regardless of the number of separate covariate patterns. Grouped logistic, or binomial-logit, regression assumes appropriate values of y and m. In R, grouped data such as the above may be modeled as a logistic regression using the glm() function; however, the binomial numerator and denominator terms must be bound together. I shall demonstrate how this is accomplished.

We shall first address the binary, or Bernoulli, logistic model. This parameterization of the binomial is usually referred to as logistic regression. We specify the following parameter values: x1 = .75, x2 = -1.25, and intercept = 2.

SYNTHETIC BERNOULLI-LOGIT DATA
=================================================================
# syn.logit.r  Joseph Hilbe 10Apr2009
xb <- 2 + .75*x1 - 1.25*x2
exb <- 1/(1+exp(-xb))
by <- rbinom(50000, size=1, prob=exb)
lry <- glm(by ~ x1 + x2, family=binomial(link="logit"))
summary(lry)
==================================================================

glm(formula = by ~ x1 + x2, family = binomial(link = "logit"))

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-2.3491  0.4370  0.5318  0.6308  0.8783

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  1.98808    0.03404   58.40   <2e-16 ***
x1           0.73256    0.04337   16.89   <2e-16 ***
x2          -1.23873    0.04394  -28.19   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 43517 on 49999 degrees of freedom
Residual deviance: 42406 on 49997 degrees of freedom
AIC: 42412

Number of Fisher Scoring iterations: 4

A synthetic probit model may be constructed by changing the inverse link function from 1/(1+exp(-xb)) to pnorm(xb). Complementary log-log and log-log models may be developed by inserting their own inverse links. Of course, the link must be declared appropriately when modeling with the glm() function.

BERNOULLI PROBIT REGRESSION
===========================================================
# syn.probit.r // Joseph Hilbe 10Apr2009
xb <- 2 + .75*x1 - 1.25*x2
exb <- pnorm(xb)
by <- rbinom(50000, size=1, prob=exb)
pry <- glm(by ~ x1 + x2, family=binomial(link="probit"))
summary(pry)
===========================================================

glm(formula = by ~ x1 + x2, family = binomial(link = "probit"))

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-3.2766  0.1836  0.2706  0.3807  0.7095

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  1.96540    0.02817   69.76   <2e-16 ***
x1           0.76039    0.03485   21.82   <2e-16 ***
x2          -1.20583    0.03659  -32.96   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 20864 on 49999 degrees of freedom
Residual deviance: 19166 on 49997 degrees of freedom
AIC: 19172

Number of Fisher Scoring iterations: 6

The code for constructing a synthetic binomial logit model is a bit more complex. A binomial denominator must be defined and appropriately inserted into the algorithm.

It must also be decided whether the binomial denominator is a fixed value, or is itself random within certain defined constraints. In the algorithm below I have specified the same values for the binomial denominator as previously employed for the Poisson offset; that is, I am using fixed values for the denominator. Pseudo-random values for the denominator may be created in a variety of ways. Below is code to generate a pseudo-random denominator that is divided into ten groups of 5,000 observations:

> library(ggplot2)
> re <- 100*runif(50000)    # 50,000 pseudo-random values on (0, 100)
> d <- cut_number(re, n=10)

BINOMIAL OR GROUPED LOGISTIC REGRESSION
===========================================================================
# syn.bin_logit.r // Joseph Hilbe 10Apr2009
x1 <- runif(50000)                       # predictor & n of observations
x2 <- runif(50000)                       # 2nd predictor
d <- rep(1:5, each=10000)*100            # denominator
xb <- 2 + .75*x1 - 1.25*x2               # predictor values
exb <- 1/(1+exp(-xb))
by <- rbinom(50000, size=d, prob=exb)
dby <- d - by
gby <- glm(cbind(by,dby) ~ x1 + x2, family=binomial(link="logit"))
summary(gby)
===========================================================================

glm(formula = cbind(by, dby) ~ x1 + x2, family = binomial(link = "logit"))

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-3.8789 -0.6541  0.0245  0.6986  4.1580

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  1.999247   0.001976  1011.7   <2e-16 ***
x1           0.750260   0.002513   298.5   <2e-16 ***
x2          -1.248831   0.002553  -489.1   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 385895 on 49999 degrees of freedom
Residual deviance:  50343 on 49997 degrees of freedom
AIC: 315288

Number of Fisher Scoring iterations: 4

The specified parameters and estimates are extremely close in value. Probit, loglog, and complementary loglog binomial models may easily be constructed by changing the inverse link function, as sketched below.
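For example, a complementary log-log version of the grouped model replaces the logit inverse link with 1 - exp(-exp(xb)). A minimal sketch, reusing x1, x2, and d from the script above; the smaller parameter values are my choice, made so that the fitted probabilities are not all near 1:

===========================================================================
# Grouped complementary log-log model (sketch)
xb <- -1 + .5*x1 - .75*x2                # smaller values keep mu below 1
mu <- 1 - exp(-exp(xb))                  # cloglog inverse link
by <- rbinom(50000, size=d, prob=mu)
gcll <- glm(cbind(by, d - by) ~ x1 + x2, family=binomial(link="cloglog"))
summary(gcll)
===========================================================================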

3: SYNTHETIC CATEGORICAL RESPONSE MODELS

I have previously discussed in detail the creation of synthetic ordered logit, or proportional odds, data in Hilbe (2009), and refer the reader to that source for a more thorough examination of the subject. Multinomial logit data are also examined in the same source. Because of the complexity of the model, the generated data are a bit more variable than with synthetic logit, Poisson, or negative binomial models. However, Monte Carlo simulation (not shown) shows that the mean values closely approximate the user-supplied parameters and cut points. I display code for generating synthetic ordered logit data below. The model is also commonly referred to as the proportional odds model.

PROPORTIONAL ODDS MODEL

With four levels having cut points at 2, 3, and 4, and predictors x1 = .75 and x2 = -1.25, we have:

4 LEVELS
========================================================================
# syn.ologit4.r // Joseph Hilbe 10Apr2009
library(MASS)                              # provides polr()
err <- runif(50000)
y <- .75*x1 - 1.25*x2 + log(err/(1-err))   # latent predictor values
ys <- rep(1, length(y))                    # start with all level 1 (y <= 2)
ys <- ifelse(y>2 & y<=3, 2, ys)
ys <- ifelse(y>3 & y<=4, 3, ys)
ys <- ifelse(y>4, 4, ys)
ologit <- polr(factor(ys) ~ x1 + x2, method="logistic")
summary(ologit)
========================================================================

Re-fitting to get Hessian

polr(formula = factor(ys) ~ x1 + x2, method = "logistic")

Coefficients:
        Value Std. Error   t value
x1  0.7152066 0.05243098  13.64092
x2 -1.3236267 0.05378750 -24.60844

Intercepts:
    Value  Std. Error t value
1|2 1.9631 0.0388     50.6324
2|3 2.9656 0.0424     69.8647
3|4 3.8963 0.0504     77.3550

Residual Deviance: 41177.99
AIC: 41187.99
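An ordered probit variant, mentioned in the section overview, needs only two changes: the logistic error term log(err/(1-err)) becomes a standard normal quantile, qnorm(err), and polr is called with method="probit". A sketch under the same cut points; the file and object names are mine, and the estimates should again approximate .75 and -1.25:

========================================================================
# syn.oprobit4.r -- ordered probit analogue of syn.ologit4.r (sketch)
library(MASS)
err <- runif(50000)
y <- .75*x1 - 1.25*x2 + qnorm(err)         # standard normal error term
ys <- rep(1, length(y))                    # start with all level 1 (y <= 2)
ys <- ifelse(y>2 & y<=3, 2, ys)
ys <- ifelse(y>3 & y<=4, 3, ys)
ys <- ifelse(y>4, 4, ys)
oprobit <- polr(factor(ys) ~ x1 + x2, method="probit")
summary(oprobit)
========================================================================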

To demonstrate the changes that need to be made when adding another level, I show a five-level proportional odds model. The five levels have cuts at .8, 1.6, 2.4, and 3.2. The same two parameter estimates are assigned.

5 LEVELS
=============================================================
# syn.ologit5.r // Joseph Hilbe 10Apr2009
err <- runif(50000)
y <- .75*x1 - 1.25*x2 + log(err/(1-err))
ys <- rep(1, length(y))                    # start with all level 1 (y <= .8)
ys <- ifelse(y>.8 & y<=1.6, 2, ys)
ys <- ifelse(y>1.6 & y<=2.4, 3, ys)
ys <- ifelse(y>2.4 & y<=3.2, 4, ys)
ys <- ifelse(y>3.2, 5, ys)
ologit <- polr(factor(ys) ~ x1 + x2, method="logistic")
summary(ologit)
===============================================================

Re-fitting to get Hessian

polr(formula = factor(ys) ~ x1 + x2, method = "logistic")

Coefficients:
        Value Std. Error   t value
x1  0.6984393 0.03510575  19.89530
x2 -1.2877099 0.03570239 -36.06789

Intercepts:
    Value  Std. Error t value
1|2 0.7447 0.0261     28.5525
2|3 1.5511 0.0271     57.2335
3|4 2.3625 0.0296     79.8757
4|5 3.1529 0.0345     91.4819

Residual Deviance: 89935.65
AIC: 89947.65

Finally we turn to the synthetic multinomial logit model. The construction is easy to follow, and easily expandable. I shall first show the code for developing a synthetic multinomial model with two predictors and three levels. The nnet library must be loaded to access the multinom() function, which is a standard method for modeling multinomial data. Two sets of predictor values are assigned by the user; the default values are given in the code. The first level is assigned the role of reference.

Level 2 predictors: x1 = .4,  x2 = -.5,  intercept = 1
Level 3 predictors: x1 = -.3, x2 = .25,  intercept = 2

SYNTHETIC MULTINOMIAL LOGIT DATA AND MODEL
====================================================================
# syn.multinom3.r // Joseph Hilbe 10Apr2009
library(nnet)
denom <- 1 + exp(.4*x1 - .5*x2 + 1) + exp(-.3*x1 + .25*x2 + 2)
p1 <- 1/denom
p2 <- exp(.4*x1 - .5*x2 + 1)/denom
p3 <- exp(-.3*x1 + .25*x2 + 2)/denom
u <- runif(50000)
y <- rep(1, length(u))                     # start with all level 1 (u <= p1)
p12 <- p1 + p2
y <- ifelse(u>p1 & u<=p12, 2, y)
y <- ifelse(u>p12, 3, y)
mlogit <- multinom(y ~ x1 + x2)
summary(mlogit)
===============================================================

# weights: 12 (6 variable)
initial value 54930.614433
iter 10 value 41252.560111
final value 41250.742397
converged

> summary(mlogit)
multinom(formula = y ~ x1 + x2)

Coefficients:
  (Intercept)         x1         x2
2    1.011825  0.4170857 -0.4733716
3    2.004441 -0.2956572  0.2803842

Std. Errors:
  (Intercept)         x1         x2
2  0.04617221 0.06057211 0.06067404
3  0.04202969 0.05510793 0.05519603

Residual Deviance: 82501.48
AIC: 82513.48

Expanding the model to three predictors and four levels can be done employing the same logic as for the smaller model. We add a fourth level of predictor values, recalling that level 1 is the reference level. The predictor values are now:

Level 2 predictors: x1 = .4,   x2 = -.5,  x3 = -.2,  intercept = 1
Level 3 predictors: x1 = -.3,  x2 = .25,  x3 = -.3,  intercept = 2
Level 4 predictors: x1 = -.25, x2 = .1,   x3 = .15,  intercept = 2.5

================================================================================
# syn.multinom4.r // Joseph Hilbe 10Apr2009
library(nnet)
x3 <- runif(50000)
denom <- 1 + exp(.4*x1 - .5*x2 - .2*x3 + 1) + exp(-.3*x1 + .25*x2 - .3*x3 + 2) +
         exp(-.25*x1 + .1*x2 + .15*x3 + 2.5)
p1 <- 1/denom
p2 <- exp(.4*x1 - .5*x2 - .2*x3 + 1)/denom
p3 <- exp(-.3*x1 + .25*x2 - .3*x3 + 2)/denom
p4 <- exp(-.25*x1 + .1*x2 + .15*x3 + 2.5)/denom
u <- runif(50000)
y <- rep(1, length(u))                     # start with all level 1 (u <= p1)
p12 <- p1 + p2
y <- ifelse(u>p1 & u<=p12, 2, y)
p13 <- p1 + p2 + p3
y <- ifelse(u>p12 & u<=p13, 3, y)
y <- ifelse(u>p13, 4, y)
mlogit <- multinom(y ~ x1 + x2 + x3)
summary(mlogit)
=========================================================================

multinom(formula = y ~ x1 + x2 + x3)

Coefficients:
  (Intercept)         x1         x2          x3
2   0.9864786  0.3763757 -0.3706099 -0.05273498
3   1.9362067 -0.2790776  0.3627176  0.06091044
4   2.4010267 -0.2585436  0.2875483  0.02457017

Std. Errors:
  (Intercept)         x1         x2         x3
2  0.07771902 0.08598288 0.08637727 0.08559650
3  0.07062911 0.07802746 0.07835065 0.07781311
4  0.06913382 0.07642240 0.07674092 0.07621202

Residual Deviance: 110009.1
AIC: 110033.1

SUMMARY REMARKS

Synthetic data may be used with substantial efficacy in the evaluation of statistical models. In this article I have presented algorithm code that can be used to create a number of different types of synthetic discrete response models. The default code may be amended to employ different predictor values, as well as various numbers of predictors. The algorithms may also be extended to the generation of other types of synthetic models. I advocate using synthetic models of this sort to better understand the models we apply to real data.

R is particularly capable of creating synthetic data, and is an excellent simulation environment.

With computers gaining in memory and speed, it is possible to construct far more complex synthetic data than the examples we have discussed in this article. The models described here may be considered the foundation for a wide variety of others, including fixed, random, and mixed effects models, and GEE models, among others. I trust that this effort will encourage others to construct extensions of the models given here, making them available to the general statistical community. Two recent sources that exploit simulation and synthetic data in the analysis of complex statistical models are Gelman and Hill (2007), with respect to the understanding of random and mixed effects models, and Hardin and Hilbe (2003), for the analysis of GEE methodology.

References:

Gelman, A. and J. Hill (2007), Data Analysis Using Regression and Multilevel/Hierarchical Models, Cambridge: Cambridge University Press.

Hardin, J.W. and J.M. Hilbe (2003), Generalized Estimating Equations, Boca Raton, FL: Chapman & Hall/CRC.

Hardin, J.W. and J.M. Hilbe (2007), Generalized Linear Models and Extensions, second edition, College Station, TX: Stata Press.

Hilbe, J.M. (2007), Negative Binomial Regression, Cambridge: Cambridge University Press.

Hilbe, J.M. (2009), Logistic Regression Models, Boca Raton, FL: Chapman & Hall/CRC.

Hilbe, J.M. (2011), Negative Binomial Regression, second edition, Cambridge: Cambridge University Press.