Using R to Create Synthetic Discrete Response Regression Models

Arizona State University
From the SelectedWorks of Joseph M Hilbe
July 3, 2011

Using R to Create Synthetic Discrete Response Regression Models
Joseph Hilbe, Arizona State University

Available at:

Using R to Create Synthetic Discrete Response Regression Models

Joseph M. Hilbe
Arizona State University, and Jet Propulsion Laboratory, California Institute of Technology
3 July, 2011 (revised)
Hilbe@asu.edu

2009, Joseph M Hilbe, all rights reserved. Do not distribute without prior written permission of author. Appropriate citation is requested if used in preparation of an article or book.

The use of synthetic data and synthetic models has played an important role in evaluating model fit and bias. Synthetic models and tests are now found in many of the major texts dealing with statistical models. R is particularly well suited to creating synthetic data, and allows nearly every aspect of the data and model to be engineered as needed by the statistician.

In my texts, Negative Binomial Regression (2007, Cambridge University Press) and Logistic Regression Models (2009, Chapman & Hall/CRC), I present the reader with a number of synthetic models, displaying how missing predictors, needed interactions, incorrectly specified links, and so forth can affect both model parameter estimates and dispersion statistics. The examples in these two books demonstrate how to distinguish apparent from real overdispersion in count and grouped binomial models.

In this article I demonstrate how to construct synthetic data sets for various popular discrete response regression models. The same methods may be used to create data specific to a wide variety of alternative models. Specifically, I provide code for creating synthetic data sets for given types of binomial, Poisson, negative binomial, ordered slopes and proportional odds, and multinomial models. All code is based on the R pseudo-random number generators runif(), rpois(), rgamma(), and rbinom(). The pnorm() function is used for probit models. I also provide code for developing a synthetic NB-C, or canonical negative binomial, model, even though there is not as yet an R command for its estimation.
I intend to develop this command in a forthcoming article. The NB-C code is based on the rnbinom() function.

Note that model predictors are all based on random uniform data. The results would be the same if we used normal variates, or even binary and factor variates. NB1 and NB2 negative binomial models are based on a Poisson-gamma mixture; the code clearly shows how this occurs. The logic of the coding can be used to develop a wide range of other types of synthetic models.

For each set of model code it is easy for users to amend the default values, selecting their own number of model observations, seed values, number of predictors and their values, and, for negative binomial models, a specific value for the ancillary parameter. Categorical response models are coded such that the number of predictors and their values may be selected, as well as cut point values for ordered response models and the number of levels for multinomial models. Poisson, negative binomial, and binomial models may also employ an offset.
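The recipe shared by the algorithms in this article (predictors, linear predictor, inverse link, random response, model) can be sketched as a small function. This is only an illustration: the helper name syn_poisson_sketch and its arguments are hypothetical and are not part of the author's script files, and the default parameter values are simply illustrative choices. It shows how the number of observations, the seed, and the parameter values can all be amended in one place.

```r
# Hypothetical helper, not one of the article's script files.
# Steps: predictors -> linear predictor -> inverse link -> response -> model.
syn_poisson_sketch <- function(nobs = 50000, beta = c(2, 0.75, -1.25), seed = NULL) {
  if (!is.null(seed)) set.seed(seed)              # fix the seed for a repeatable run
  x1 <- runif(nobs)                               # random uniform predictors
  x2 <- runif(nobs)
  xb <- beta[1] + beta[2] * x1 + beta[3] * x2     # linear predictor
  exb <- exp(xb)                                  # inverse (log) link
  py <- rpois(nobs, exb)                          # synthetic Poisson response
  glm(py ~ x1 + x2, family = poisson)             # fitted model
}

fit <- syn_poisson_sketch(nobs = 20000, seed = 32478)
est <- unname(coef(fit))
print(round(est, 3))
```

Calling the function with a different nobs, beta, or seed changes the synthetic model without editing the body of the script.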

I divide this article into three sections. First, I discuss the creation of synthetic count response models: specifically, Poisson, NB2, NB1, and NB-C. Second, I develop code for binomial models, which include both Bernoulli (binary) and binomial (grouped) logit and probit models. A third section provides an overview of creating synthetic proportional slopes models, including the proportional odds model and the ordered probit, as well as code for constructing a synthetic multinomial logit model. Statisticians should find it relatively easy to adjust the provided default code to construct synthetic data for a wide variety of alternative discrete response models. Synthetic continuous response models are based on the same logic as the code developed here.

It should be noted that each run of these synthetic models will differ from the others unless a seed is specified. In addition, it is possible to apply a Monte Carlo simulation algorithm to each of these models in order to determine the mean parameter and intercept values over many simulation runs. I have performed Monte Carlo simulations on each model discussed in this article, with the result that the user specified values are nearly identical to the resultant mean values. Some algorithms appear to yield more variability in parameter estimates than others, but over a thousand or more Monte Carlo runs the mean values are stable.

Each coding algorithm may be used as a batch file, either submitted as a whole file or submitted a line at a time. As previously indicated, the user may adjust various aspects of the algorithm to suit their requirements. Each algorithm name begins with the characters syn, followed by the type of model and any extra capabilities. For example, the code to create the basic synthetic Poisson model is called syn.poisson.r. The algorithm for adding an offset is called syn.poissono.r.
For proportional odds and multinomial models, the number of cuts (ordered models) or levels (multinomial) is indicated at the end of the name, e.g. syn.ologit4.r.

Specific directions for running the algorithms:
1: Open File -> Open script
2: Select the file name from the directory where the files are kept
3: Using the left mouse button, select all lines of the file (or the lines desired)
4: Click once on the right mouse button to bring up the menu entry Run line or selection
5: Left click on that entry
6: Synthetic model results are displayed in the R console

1: SYNTHETIC COUNT MODELS

I shall first create a simple Poisson model using the R function rpois(). This is the basic, or paradigm, count model. The synthetic model is given two pseudo-random uniform predictors, x1 and x2. Random normal predictors are also commonly used when developing synthetic models. The values assigned to the two predictors and the intercept are, respectively, x1 = 0.75, x2 = -1.25, and 2. Note that the final two lines of each algorithm are the regression model and summary output. The fourth line below, yielding xb, is the linear predictor, which is given to the inverse link function in the following line. The inverse link function is used in generalized linear models to calculate the fitted or predicted value. The fit is then employed in the rpois() function to produce the synthetic Poisson response variate.

SYNTHETIC POISSON MODEL
==========================================================================
# syn.poisson.r // Joseph Hilbe 10Apr2009
x1 <- runif(50000)                           # first predictor & n of observations
x2 <- runif(50000)                           # second predictor
xb <- 2 + .75*x1 - 1.25*x2                   # linear predictor
exb <- exp(xb)                               # inverse link to predicted fit
py <- rpois(50000, exb)                      # generates random Poisson variates
jhpoi <- glm(py ~ x1 + x2, family=poisson)
summary(jhpoi)
===========================================================================
glm(formula = py ~ x1 + x2, family = poisson)

Coefficients:
(Intercept)   <2e-16 ***
x1            <2e-16 ***
x2            <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

Number of Fisher Scoring iterations: 4

The resultant parameter estimates closely approximate the user defined values. A seed was not specified for the example model. If we wish to generate the same random numbers, and the exact same model, a seed may be given after the initial line. The function is set.seed(#); e.g., set.seed(32478).

Unfortunately the Pearson statistic is not displayed with the glm() function output; rather, the deviance is used as the basis of the dispersion. Although this is the traditional manner of calculating the dispersion statistic, it is not correct. See Hilbe (2007, 2009).

Poisson models are commonly parameterized as rate models. As such they employ an offset, which reflects the area or time over which the count response is generated. Since the natural log is the canonical link of the Poisson model, the offset must be logged prior to entry into the estimating algorithm.
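Since the Pearson-based dispersion is not part of the glm() summary, it can be computed by hand from the fitted model. The following is a minimal sketch under the same settings as the Poisson example; the seed is the one mentioned in the text, and the object names pearson_disp and deviance_disp are chosen here only for illustration.

```r
set.seed(32478)                                  # seed value used as an example in the text
x1 <- runif(50000)
x2 <- runif(50000)
py <- rpois(50000, exp(2 + 0.75*x1 - 1.25*x2))   # synthetic Poisson response
poi <- glm(py ~ x1 + x2, family = poisson)

# Pearson chi-square statistic divided by the residual degrees of freedom
pearson_chi2 <- sum(residuals(poi, type = "pearson")^2)
pearson_disp <- pearson_chi2 / poi$df.residual

# Deviance-based counterpart, shown for comparison
deviance_disp <- deviance(poi) / poi$df.residual

print(c(pearson = pearson_disp, deviance = deviance_disp))
```

For data that are truly Poisson, as here, both statistics should be close to 1.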

A synthetic offset may be randomly generated, or may be specified by the user. For this example I create an area offset whose value increases by 100 with each block of 10,000 observations in the 50,000 observation data set. The first 10,000 observations have an offset value of 100, the second 10,000 a value of 200, and so on until the final 10,000 observations, which have a value of 500. This value is calculated on the fourth line of program code. Note that the log-offset is added to the linear predictor in line 6.

POISSON WITH RATE PARAMETERIZATION
===========================================================================
# syn.poissono.r // Joseph Hilbe 10Apr2009
x1 <- runif(50000)                           # first predictor & n of observations
x2 <- runif(50000)                           # second predictor
off <- rep(1:5, each=10000, times=1)*100     # creates offset as defined
loff <- log(off)                             # log the offset
xb <- 2 + .75*x1 - 1.25*x2 + loff            # linear predictor
exb <- exp(xb)                               # inverse link
py <- rpois(50000, exb)                      # creates random Poisson variates
poir <- glm(py ~ x1 + x2 + offset(loff), family=poisson)
summary(poir)
============================================================================
glm(formula = py ~ x1 + x2 + offset(loff), family = poisson)

Coefficients:
(Intercept)   <2e-16 ***
x1            <2e-16 ***
x2            <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

Number of Fisher Scoring iterations: 3

A display of the offset table is shown as:

> table(off)
off
  100   200   300   400   500
10000 10000 10000 10000 10000

Note that the resultant parameter estimates are nearly identical to the specified values. We expect the parameter estimates of the model with an offset, or rate parameterized model, to closely approximate those of the standard model. However, we also expect that the standard errors will differ. Moreover, if the rate model were modeled without declaring an offset, we would notice a greatly inflated intercept value.

A Poisson model having a Pearson-based dispersion greater than 1.0 indicates possible overdispersion. Typically the deviance-based dispersion is similar to the Pearson, but not always. In any case, in R, two types of generalized linear model are generally employed when there is evidence of overdispersion in a Poisson model: quasipoisson and negative binomial. The negative binomial used in both the glm.nb() and glm() functions is commonly referred to as the quadratic negative binomial, or NB2, model. It is parameterized with a variance function of μ + αμ². The other common parameterization of the negative binomial is termed the linear negative binomial, or NB1. The NB1 model is parameterized with a variance of μ + αμ, or μ(1+α).

The quasipoisson model is simply the Poisson model with standard errors scaled by the Pearson dispersion. Since the standard errors are not based directly on the Hessian matrix derived from the log-likelihood function, the term quasi-likelihood is used for the type of model estimated. Quasipoisson is the name given to Poisson models with scaled standard errors.

The NB2 parameterization of the negative binomial can be generated as a Poisson-gamma mixture model, with a gamma scale parameter of 1. We use this method to create a synthetic NB2 variate, and a corresponding synthetic NB2 model. The negative binomial random number generator in R, rnbinom(), is not appropriate for dealing with NB2 data. The rnbinom() function is, however, appropriate for generating NB-C data, as we shall discuss later.
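The two variance functions quoted above follow directly from the mixture construction. As a brief sketch, write ν for a gamma variate with mean 1 that multiplies the Poisson mean (the symbol ν is introduced here only for illustration and does not appear in the original text), and use the law of total variance:

```latex
\begin{aligned}
y \mid \nu &\sim \text{Poisson}(\mu\nu), \qquad E(\nu) = 1,\\[4pt]
E(y) &= E\big[E(y \mid \nu)\big] = \mu,\\[4pt]
\operatorname{Var}(y) &= E\big[\operatorname{Var}(y \mid \nu)\big] + \operatorname{Var}\big[E(y \mid \nu)\big]
                       = \mu + \mu^{2}\operatorname{Var}(\nu),\\[4pt]
\text{NB2:}\quad \operatorname{Var}(\nu) = \alpha
   &\;\Longrightarrow\; \operatorname{Var}(y) = \mu + \alpha\mu^{2},\\[4pt]
\text{NB1:}\quad \operatorname{Var}(\nu) = \alpha/\mu
   &\;\Longrightarrow\; \operatorname{Var}(y) = \mu + \alpha\mu = \mu(1+\alpha).
\end{aligned}
```

The only difference between the two parameterizations is therefore whether the variance of the gamma multiplier is constant or shrinks with the mean.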
In fact, both NB2 and NB1 synthetic models are generated as a mixture of Poisson and gamma distributions. The algorithm below is used to create a synthetic NB2 model. Note how the mixture of distributions occurs. The same parameters are used here as for the above Poisson models. We assign a value of .5 to the NB2 heterogeneity or overdispersion parameter, also termed an ancillary parameter. The glm.nb() function, which is part of the MASS library, is used to model the synthetic data.

NEGATIVE BINOMIAL (NB2) SYNTHETIC MODEL
=======================================================================
# syn.nb2.r // Joseph Hilbe 10Apr2009
library(MASS)                                # provides glm.nb()
x1 <- runif(50000)                           # first predictor & n of observations
x2 <- runif(50000)                           # second predictor
xb <- 2 + .75*x1 - 1.25*x2                   # linear predictor / parameter values
a <- .5                                      # assign value to ancillary parameter
ia <- 1/a                                    # invert the ancillary parameter
exb <- exp(xb)                               # Poisson predicted value
xg <- rgamma(50000, shape = a, scale = ia)   # gamma variates with mean 1
xbg <- exb*xg                                # mix Poisson and gamma variates
nby <- rpois(50000, xbg)                     # generate NB2 variates
jhnb2 <- glm.nb(nby ~ x1 + x2)               # model NB2
summary(jhnb2)
=======================================================================

glm.nb(formula = nby ~ x1 + x2, init.theta = , link = log)

Coefficients:
(Intercept)   <2e-16 ***
x1            <2e-16 ***
x2            <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(0.5033) family taken to be 1)

Number of Fisher Scoring iterations: 1

The values of the parameter estimates and ancillary parameter closely approximate the user specified values. The output reports the estimated ancillary parameter as 0.5033. Another run will produce different values, centered about the specified values. There is typically more variability in simulated NB variates than in Poisson variates because of the ancillary parameter.

The glm() command may also be used to obtain NB2 parameter estimates. However, the value of the ancillary parameter must be entered into the function as a constant. It is not estimated, as it is with the glm.nb() function. The new line and model results are displayed below.

NB2 MODEL WITH THETA (ALPHA) ENTERED AS A CONSTANT
===================================================================
mynb2c <- glm(nby ~ x1 + x2, family=negative.binomial(theta=.5))
summary(mynb2c)
===================================================================
glm(formula = nby ~ x1 + x2, family = negative.binomial(theta = 0.5))

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   <2e-16 ***
x1            <2e-16 ***
x2            <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(0.5) family taken to be )

Number of Fisher Scoring iterations: 6

We may incorporate an offset into the NB2 algorithm in the same manner as we did for the Poisson. Since the mean of both the Poisson and NB2 models is exp(xb), we may use the same method. The synthetic NB2 data and model with offset is in the syn.nb2o.r file.

The linear negative binomial model, NB1, is also based on a Poisson-gamma mixture distribution. As discussed earlier, the NB1 heterogeneity or ancillary parameter is typically referred to as δ, not α. Converting the NB2 algorithm to NB1 entails defining idelta as the inverse of the value of delta, the desired value of the model ancillary parameter, multiplied by the fitted value, exb. idelta and 1/idelta are the terms given to the rgamma() function. All else is the same as in the NB2 algorithm. The resultant synthetic data can be modeled using the ml.nb1 function in the COUNT package, which is located on CRAN. The COUNT package was first published in September 2010 as a source for data, functions, and scripts associated with Hilbe (2011).
The code below is stored in the syn.nb1.r file.

NEGATIVE BINOMIAL (NB1) SYNTHETIC MODEL
===================================================================
library(COUNT)                               # provides ml.nb1()
x1 <- runif(50000)                           # first predictor & n of observations
x2 <- runif(50000)                           # second predictor
xb <- 2 + .75*x1 - 1.25*x2                   # linear predictor; parameter values
delta <- .5                                  # value assigned to delta
exb <- exp(xb)
idelta <- (1/delta)*exb                      # product of theta and mu
xg <- rgamma(50000, shape = idelta, scale = 1/idelta)
xbg <- exb*xg
nb1y <- rpois(50000, xbg)
nb1data <- data.frame(nb1y, x1, x2)
nb1 <- ml.nb1(nb1y ~ x1 + x2, data=nb1data)
nb1
===================================================================
              Estimate  SE  Z  LCL  UCL
(Intercept)
x1
x2
alpha

The canonical negative binomial (NB-C), however, must be constructed in an entirely different manner from NB2, NB1, or Poisson. NB-C is based directly on the negative binomial PDF.

R's rnbinom() function may be used to construct NB-C data, but not NB1 or NB2. Other options, such as offsets, are entered into the NB-C algorithm in the same manner as they were for Poisson, NB2, and NB1. The ml.nbc function in the COUNT package can be used to estimate NB-C models.

The NB-C inverse link differs from that of the Poisson, NB1, and NB2 models. It may be expressed as 1/((exp(-xb)-1)*α). Unlike the other parameterizations, α is part of the link and inverse link functions. The probability is given as 1/(1+αμ). See Hilbe (2011) for details.

The algorithm for creating NB-C data is given below, and is stored in syn.nbc.r. Values of p outside the range 0 to 1 are dropped from estimation, although this is usually quite rare. The resultant model still estimates the specified values for the parameter estimates; only the number of observations differs from that assigned by the user.

SYNTHETIC CANONICAL NEGATIVE BINOMIAL (NB-C) MODEL

The generator is called as rnbinom(n, size, prob, mu), where size is the ancillary parameter, prob is the probability, p, and mu is the mean. Only prob or mu may be supplied to the function, not both.
==========================================================================
# syn.nbc.r // Joseph Hilbe 10Apr2009; revised 2/2011, adding ml.nbc.
library(COUNT)                               # provides ml.nbc()
x1 <- runif(50000)                           # first predictor & n of observations
x2 <- runif(50000)                           # second predictor
a <- 1.15                                    # value of alpha: 1.15
xb <- 1.25*x1 + .1*x2 - 1.5                  # x1=1.25; x2=0.1; _cons= -1.5
mu <- 1/((exp(-xb)-1)*a)                     # NB-C inverse link
p <- 1/(1+a*mu)                              # probability
r <- 1/a
nbcy <- rnbinom(50000, size=r, prob=p)
nbcdata <- data.frame(nbcy, x1, x2)
nbc <- ml.nbc(nbcy ~ x1 + x2, data=nbcdata)
nbc
==========================================================================
              Estimate  SE  Z  LCL  UCL
(Intercept)
x1
x2
alpha

The specified and estimated values are nearly identical. Note that the log-likelihood function used for NB-C estimation is given as (see Hilbe, 2011):

$$\mathcal{L}_{\text{NB-C}} = \sum \left\{ y\,(xb) + \frac{1}{\alpha}\ln\bigl(1-\exp(xb)\bigr) + \ln\Gamma\!\left(y+\frac{1}{\alpha}\right) - \ln\Gamma(y+1) - \ln\Gamma\!\left(\frac{1}{\alpha}\right) \right\}$$

I should also mention that exact regression models are based on the canonical form of the model, e.g. binomial logit and Poisson log models. If an exact negative binomial regression model is to be created, it must be based on the NB-C parameterization, not NB2.
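The relationships used in the NB-C algorithm can be checked numerically without fitting a model. The sketch below uses the same parameter values as syn.nbc.r; the seed and the names link_gap and mean_gap are illustrative assumptions. It verifies that log(1 - p) reproduces the linear predictor, which is the canonical link, and that the simulated counts have a mean close to mu.

```r
set.seed(1)                                  # arbitrary seed, for a repeatable run
nobs <- 50000
x1 <- runif(nobs)
x2 <- runif(nobs)
a  <- 1.15                                   # alpha, as in syn.nbc.r
xb <- 1.25*x1 + 0.1*x2 - 1.5                 # negative for all x1, x2 in (0,1), so 0 < p < 1
mu <- 1/((exp(-xb) - 1)*a)                   # NB-C inverse link
p  <- 1/(1 + a*mu)                           # probability passed to rnbinom()
nbcy <- rnbinom(nobs, size = 1/a, prob = p)  # NB-C variates

# Canonical link: log(1 - p) should equal xb up to rounding error
link_gap <- max(abs(log(1 - p) - xb))

# The average simulated count should be close to the average of mu
mean_gap <- mean(nbcy) - mean(mu)

print(c(link_gap = link_gap, mean_gap = mean_gap))
```

Because xb stays below zero for these coefficients, no value of p falls outside the range 0 to 1 in this particular configuration.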

The zero-inflated Poisson model is a mixture model in which a count model component and a binary model component overlap, so that zero counts can arise from either one. We give an example of such a mixture model with Poisson for the count component and binary logistic regression for the binary component. See Hilbe (2011) for details on the model and its interpretation. Note that the Poisson component is constructed first, with the data stored as a data frame. The zeroinfl function in the pscl package is used to estimate the model. The values of the two components are established as:

POISSON  : intercept=2;   x1=0.75; x2=-1.25
LOGISTIC : intercept=0.2; x1=0.9;  x2=0.1

ZERO-INFLATED POISSON
=========================================================================
# Zero-Inflated Poisson syn.zip.r J Hilbe 6Jun2011
library(pscl)
nobs <- 50000
x1 <- runif(nobs)
x2 <- runif(nobs)
# POISSON
xb <- 2 + .75*x1 - 1.25*x2
exb <- exp(xb)
poy <- rpois(nobs, exb)
pdata <- data.frame(poy, x1, x2)
# LOGISTIC
pi <- 1/(1+exp(-(.9*x1 + .1*x2 + .2)))
pdata$bern <- runif(nobs) > pi
zy <- pdata$bern * poy
zip <- zeroinfl(zy ~ x1 + x2 | x1 + x2, dist="poisson", data=pdata)
summary(zip)
=========================================================================
zeroinfl(formula = zy ~ x1 + x2 | x1 + x2, data = pdata, dist = "poisson")

Count model coefficients (poisson with log link):
(Intercept)   <2e-16 ***
x1            <2e-16 ***
x2            <2e-16 ***

Zero-inflation model coefficients (binomial with logit link):
(Intercept)   <2e-16 ***
x1            <2e-16 ***
x2            **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Number of iterations in BFGS optimization: 11
Log-likelihood: e+04 on 6 Df
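The mixing of zeros from the two components can be seen without fitting the model. The sketch below rebuilds the ZIP response with base R only and compares the observed share of zeros with the share implied by the two components; the seed and the names pz, obs_zero, exp_zero, and pois_zero are illustrative assumptions (pz is used in place of pi to avoid masking R's built-in constant).

```r
set.seed(2011)                                   # arbitrary seed, for a repeatable run
nobs <- 50000
x1 <- runif(nobs)
x2 <- runif(nobs)
exb <- exp(2 + 0.75*x1 - 1.25*x2)                # Poisson component mean
pz  <- 1/(1 + exp(-(0.2 + 0.9*x1 + 0.1*x2)))     # probability of a structural zero
poy <- rpois(nobs, exb)                          # Poisson counts
zy  <- ifelse(runif(nobs) > pz, poy, 0)          # keep the count with probability 1 - pz

obs_zero  <- mean(zy == 0)                       # observed share of zeros
exp_zero  <- mean(pz + (1 - pz)*exp(-exb))       # structural zeros plus Poisson zeros
pois_zero <- mean(exp(-exb))                     # zeros expected from Poisson alone

print(round(c(observed = obs_zero, expected = exp_zero, poisson_only = pois_zero), 4))
```

The observed share of zeros should track the mixture expression closely and lie far above what the Poisson component alone would produce.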

The zero-inflated negative binomial is built on the same logic as the ZIP model just discussed. The same intercepts and coefficients are used as for the ZIP model.

ZERO-INFLATED NEGATIVE BINOMIAL
==================================================================
# ZERO INFLATED NEGATIVE BINOMIAL syn.zinb.r
# J Hilbe 6Jun2011
library(pscl)
nobs <- 50000
x1 <- runif(nobs)
x2 <- runif(nobs)
xb <- 2 + .75*x1 - 1.25*x2
a <- .5
ia <- 1/a
exb <- exp(xb)
xg <- rgamma(nobs, shape = a, scale = ia)
xbg <- exb*xg
nby <- rpois(nobs, xbg)
nbdata <- data.frame(nby, x1, x2)
pi <- 1/(1+exp(-(.9*x1 + .1*x2 + .2)))
nbdata$bern <- runif(nobs) > pi
zy <- nbdata$bern * nby
zinb <- zeroinfl(zy ~ x1 + x2 | x1 + x2, dist="negbin", data=nbdata)
summary(zinb)
===================================================================
zeroinfl(formula = zy ~ x1 + x2 | x1 + x2, data = nbdata, dist = "negbin")

Count model coefficients (negbin with log link):
(Intercept)   <2e-16 ***
x1            <2e-16 ***
x2            <2e-16 ***
Log(theta)    <2e-16 ***

Zero-inflation model coefficients (binomial with logit link):
(Intercept)   e-08 ***
x1            < 2e-16 ***
x2
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Theta =
Number of iterations in BFGS optimization: 25
Log-likelihood: e+04 on 7 Df

We established that the value of the negative binomial ancillary parameter is to be .5. Remember that theta is defined in R's glm and glm.nb functions as 1/alpha. In the zeroinfl function, however, the authors have parameterized it as is standard in other software applications, and as parameterized in the COUNT package ml.nb2, ml.nb1, and ml.nbc functions.

Next we look at the hurdle models, namely the Poisson-logit and negative binomial-logit hurdle models. For this class of model, the count component is restructured as a zero-truncated model, and the binary component specifies 1 as any response value above zero. Zero counts are therefore found only in the binary component. This is unlike the zero-inflated models, which mix zeros from both components.

POISSON-LOGIT HURDLE MODEL
==================================================================
# syn.logitpoisson.hurdle.r
# R - SYNTHETIC LOGIT-POISSON HURDLE SCRIPT
# J. Hilbe 6 June 2011
library(pscl)
nobs <- 50000
# Generate predictors, design matrix
x1 <- runif(nobs)
x2 <- runif(nobs)
xb <- 2 + .75*x1 - 1.25*x2
# Construct Poisson responses
exb <- exp(xb)
poy <- rpois(nobs, exb)
pdata <- data.frame(poy, x1, x2)
# Construct filter
pi <- 1/(1+exp(-(.9*x1 + .1*x2 + .2)))
pdata$bern <- runif(nobs) > pi
# Remove all response zeros.
pdata <- subset(pdata, poy > 0)
# Add structural zeros
pdata$poy[pdata$bern] <- 0
# Model Synthetic Logit-Poisson Hurdle data
hlpoi <- hurdle(poy ~ x1 + x2, dist = "poisson", zero.dist = "binomial", link = "logit", data = pdata)
summary(hlpoi)
==================================================================
hurdle(formula = poy ~ x1 + x2, data = pdata, dist = "poisson", zero.dist = "binomial", link = "logit")

Count model coefficients (truncated poisson with log link):
(Intercept)   <2e-16 ***
x1            <2e-16 ***
x2            <2e-16 ***

Zero hurdle model coefficients (binomial with logit link):
(Intercept)   <2e-16 ***
x1            <2e-16 ***
x2            *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Number of iterations in BFGS optimization: 9
Log-likelihood: e+05 on 6 Df

The values of the component coefficients and intercepts are quite close to those specified. We next construct a negative binomial-logit, or NB2-logit, hurdle model. It is based on the same logic as the ZINB model, but instead of mixing the distributions, it separates them as we did in the Poisson-logit hurdle model.

NEGATIVE BINOMIAL-LOGIT HURDLE MODEL
===========================================================================
# syn.lnb2.hurdle.r : J Hilbe 6Jun2011
library(pscl)
nobs <- 50000
x1 <- runif(nobs)
x2 <- runif(nobs)
xb <- 2 + .75*x1 - 1.25*x2
a <- .5
ia <- 1/a
exb <- exp(xb)
xg <- rgamma(nobs, shape = a, scale = ia)
xbg <- exb*xg
nby <- rpois(nobs, xbg)
nbdata <- data.frame(nby, x1, x2)
pi <- 1/(1+exp(-(.9*x1 + .1*x2 + .2)))
nbdata$bern <- runif(nobs) > pi
nbdata <- subset(nbdata, nby > 0)
nbdata$nby[nbdata$bern] <- 0
hlnb2 <- hurdle(nby ~ x1 + x2, dist="negbin", zero.dist="binomial", link="logit", data=nbdata)
summary(hlnb2)
============================================================================

hurdle(formula = nby ~ x1 + x2, data = nbdata, dist = "negbin", zero.dist = "binomial", link = "logit")

Count model coefficients (truncated negbin with log link):
(Intercept)   <2e-16 ***
x1            <2e-16 ***
x2            <2e-16 ***
Log(theta)    <2e-16 ***

Zero hurdle model coefficients (binomial with logit link):
(Intercept)   e-11 ***
x1            < 2e-16 ***
x2            **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Theta: count =
Number of iterations in BFGS optimization: 17
Log-likelihood: e+04 on 7 Df

The specified and estimated coefficients are not as close as with the Poisson-logit hurdle model, but this is largely because we are observing only one of a host of possible synthetic models. Using a Monte Carlo method to repeat and summarize the runs would show much closer values.

For our final synthetic count model we look at the Poisson-Poisson finite mixture model. The code comes from Hilbe (2011), which has a good discussion of the model.

POISSON-POISSON FINITE MIXTURE MODEL
=============================================================================
# syn.ppfm.r Synthetic Poisson-Poisson Finite Mixture model
# Table 13.2: Hilbe, Negative Binomial Regression, 2 ed, Cambridge Univ Press
library(flexmix)
nobs <
x1 <- runif(nobs)
x2 <- runif(nobs)
xb1 < *x1-0.75*x2
xb2 < *x1-1.25*x2
exb1 <- exp(xb2)
exb2 <- exp(xb1)
py1 <- rpois(nobs, exb2)
py2 <- rpois(nobs, exb1)
poixpoi <- py2
poixpoi <- ifelse(runif(nobs) > .9, py1, poixpoi)
pxp <- flexmix(poixpoi ~ x1 + x2, k=2, model=FLXMRglm(family="poisson"))
summary(pxp)
parameters(pxp, component=1, model=1)
parameters(pxp, component=2, model=1)
===========================================================================

flexmix(formula = poixpoi ~ x1 + x2, k = 2, model = FLXMRglm(family = "poisson"))

        prior size post>0 ratio
Comp.1  # proportion comp
Comp.2  # proportion comp

'log Lik.' (df=7)
AIC:  BIC:

> parameters(pxp, component=1, model=1)
                  Comp.1
coef.(Intercept)
coef.x1
coef.x2
> parameters(pxp, component=2, model=1)
                  Comp.2
coef.(Intercept)
coef.x1
coef.x2

Note that the proportions and coefficients are given in the reverse of the order specified in the synthetic model. The first component is defined as having 10% of the sample, whereas the second component has 90%. This is coded on the second poixpoi line. I have indicated the proportions in the output above. It is not at all difficult to modify the above code to employ Poisson-NB2 mixtures, NB2-NB2 mixtures, or any other combination we have been discussing.

2: SYNTHETIC BINOMIAL MODELS

Synthetic binomial models are constructed in the same manner as synthetic Poisson data and models. The key lines are those that generate the pseudo-random predictors, a line creating the linear predictor with user defined parameters, a line employing the inverse link function to generate μ, the fitted probability of 1, and a line using μ to generate pseudo-random variates appropriate to the distribution.

A Bernoulli distribution consists entirely of binary values, 1/0. y is binary and is considered here to be the response variable, which is explained by the values of x1 and x2. Data such as these are typically modeled using logistic regression. A probit, loglog, or complementary loglog model may also be used to model the data.

y x1 x2 1: : : : : : 0 0 1

The above data may be grouped by covariate patterns. The covariates here are, of course, x1 and x2. With y now the number of successes, i.e. a count of 1's, and m the number of observations having the same covariate pattern, the above data may be grouped as:

y m x1 x2 1: : :

The distribution of y/m is binomial. y is a count of the observations having a value of y=1 for a specific covariate pattern, and m is the number of observations having the same covariate pattern. One can see that the Bernoulli distribution is a special case of the binomial, i.e. a binomial distribution where m=1. In actuality, a logistic regression models the top data as if there were no m, regardless of the number of separate covariate patterns. Grouped logistic, or binomial-logit, regression assumes appropriate values of y and m. In R, grouped data such as the above may be modeled as a logistic regression using the glm() function; however, the binomial numerator and denominator terms must be bound together. I shall demonstrate how this is accomplished.

We shall first address the binary, or Bernoulli, logistic model. This parameterization of the binomial is usually referred to as logistic regression. We specify the following parameter values: x1=.75, x2=-1.25, and _c=2.

SYNTHETIC BERNOULLI-LOGIT DATA
=================================================================
# syn.logit.r Joseph Hilbe 10Apr2009
x1 <- runif(50000)                           # first predictor & n of observations
x2 <- runif(50000)                           # second predictor
xb <- 2 + .75*x1 - 1.25*x2                   # linear predictor
exb <- 1/(1+exp(-xb))                        # inverse logit link
by <- rbinom(50000, size = 1, prob = exb)    # Bernoulli variates
lry <- glm(by ~ x1 + x2, family=binomial(link="logit"))
summary(lry)
==================================================================
glm(formula = by ~ x1 + x2, family = binomial(link = "logit"))

Coefficients:
(Intercept)   <2e-16 ***
x1            <2e-16 ***
x2            <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Number of Fisher Scoring iterations: 4

A synthetic probit model may be constructed by changing the inverse link function from 1/(1+exp(-xb)) to pnorm(xb). Complementary log-log and log-log models may be developed by inserting their own inverse links. Of course, the link must be declared appropriately when modeling with the glm() function.

BERNOULLI PROBIT REGRESSION
===========================================================
# syn.probit.r // Joseph Hilbe 10Apr2009
x1 <- runif(50000)                           # first predictor & n of observations
x2 <- runif(50000)                           # second predictor
xb <- 2 + .75*x1 - 1.25*x2                   # linear predictor
exb <- pnorm(xb)                             # inverse probit link
by <- rbinom(50000, size = 1, prob = exb)
pry <- glm(by ~ x1 + x2, family=binomial(link="probit"))
summary(pry)
===========================================================
glm(formula = by ~ x1 + x2, family = binomial(link = "probit"))

Coefficients:
(Intercept)   <2e-16 ***
x1            <2e-16 ***
x2            <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Number of Fisher Scoring iterations: 6

The code for constructing a synthetic binomial logit model is a bit more complex. A binomial denominator must be defined and appropriately inserted into the algorithm. It must also be decided whether the binomial denominator is a fixed value, or is itself random within certain defined constraints. In the algorithm below I have specified the same values for the binomial denominator as previously employed for the Poisson offset. Therefore, I am using fixed values for the denominator.

Pseudo-random values for the denominator may be created in a variety of ways. Below is code to generate a pseudo-random denominator that is divided into ten groups of 5000 observations.

> library(ggplot2)
> re <- 100*runif(50000)       # or runif() if observations already defined.
> d <- cut_number(re, n=10)

BINOMIAL OR GROUPED LOGISTIC REGRESSION
===========================================================================
# syn.bin_logit.r // Joseph Hilbe 10Apr2009
x1 <- runif(50000)                           # predictor & n of observations
x2 <- runif(50000)                           # 2nd predictor
d <- rep(1:5, each=10000, times=1)*100       # denominator
xb <- 2 + .75*x1 - 1.25*x2                   # predictor values
exb <- 1/(1+exp(-xb))
by <- rbinom(50000, size = d, prob = exb)
dby <- d - by
gby <- glm(cbind(by,dby) ~ x1 + x2, family=binomial(link="logit"))
summary(gby)
===========================================================================
glm(formula = cbind(by, dby) ~ x1 + x2, family = binomial(link = "logit"))

Coefficients:
(Intercept)   <2e-16 ***
x1            <2e-16 ***
x2            <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Number of Fisher Scoring iterations: 4

The specified parameters and estimates are extremely close in value. Probit, loglog, and complementary loglog binomial models may easily be constructed by changing the inverse link function.
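As one instance of changing the inverse link, a grouped complementary log-log model can be sketched as follows. This is an illustration rather than one of the author's scripts: the seed, the object names, and in particular the intercept of -1 are assumptions made here (a smaller intercept than 2 keeps the fitted probabilities away from 1 under this link). The denominators are the same fixed values used above.

```r
set.seed(4321)                               # arbitrary seed, for a repeatable run
nobs <- 50000
x1 <- runif(nobs)
x2 <- runif(nobs)
d  <- rep(1:5, each = 10000) * 100           # fixed binomial denominators, as above
xb <- -1 + 0.75*x1 - 1.25*x2                 # intercept of -1 is an illustrative assumption
mu <- 1 - exp(-exp(xb))                      # complementary log-log inverse link
by <- rbinom(nobs, size = d, prob = mu)      # grouped binomial numerator
dby <- d - by                                # failures

cll <- glm(cbind(by, dby) ~ x1 + x2, family = binomial(link = "cloglog"))
est <- unname(coef(cll))
print(round(est, 3))
```

Only two things change relative to the grouped logit script: the line that computes the fitted probability and the link declared in the binomial family.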

3: SYNTHETIC CATEGORICAL RESPONSE MODELS

I have previously discussed in detail the creation of synthetic ordered logit, or proportional odds, data in Hilbe (2009), and refer to that source for a more thorough examination of the subject. Multinomial logit data are also examined in the same source. Because of the complexity of the model, the generated data are a bit more variable than with synthetic logit, Poisson, or negative binomial models. However, Monte Carlo simulation (not shown) shows that the mean values closely approximate the user-supplied parameters and cut points.

I display code for generating synthetic ordered logit data below. The model is also commonly referred to as the proportional odds model.

PROPORTIONAL ODDS MODEL

With four levels having cut points at 2, 3, and 4, and coefficients of .75 on x1 and -1.25 on x2, we have

4 LEVELS
========================================================================
# syn.ologit4.r // Joseph Hilbe 10Apr2009
library(MASS)                              # provides polr()
x1 <- runif(50000)                         # predictors; omit if already in the workspace
x2 <- runif(50000)
err <- runif(50000)
y <- .75*x1 - 1.25*x2 + log(err/(1-err))   # linear predictor plus logistic error
ys <- rep(1, length(y))                    # start with all level 1 (y <= 2)
ys <- ifelse(y>2 & y<=3, 2, ys)
ys <- ifelse(y>3 & y<=4, 3, ys)
ys <- ifelse(y>4, 4, ys)
ologit <- polr(factor(ys) ~ x1 + x2, method="logistic")
summary(ologit)
========================================================================
Re-fitting to get Hessian

polr(formula = factor(ys) ~ x1 + x2, method = "logistic")
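The Monte Carlo check mentioned above is not shown in the article. A minimal sketch of how such a check could be run is given here, assuming the same coefficients (.75, -1.25) and cut points (2, 3, 4) as the four-level model. The number of replications reps, the reduced sample size nobs, the seed, and the helper one_fit() are all assumptions made for illustration.

```r
# Sketch only: Monte Carlo check of the synthetic proportional odds model.
# reps, nobs, the seed, and one_fit() are illustrative assumptions.
library(MASS)                      # provides polr()

set.seed(4321)
nobs <- 2000                       # smaller than 50000 so the loop runs quickly
reps <- 50                         # number of Monte Carlo replications
cuts <- c(2, 3, 4)                 # cut points of the four-level model

one_fit <- function() {
  x1  <- runif(nobs)
  x2  <- runif(nobs)
  err <- runif(nobs)
  y   <- 0.75 * x1 - 1.25 * x2 + log(err / (1 - err))
  # intervals (-Inf,2], (2,3], (3,4], (4,Inf) coded as levels 1 to 4
  ys  <- cut(y, breaks = c(-Inf, cuts, Inf), labels = FALSE)
  fit <- polr(factor(ys) ~ x1 + x2, method = "logistic")
  c(coef(fit), fit$zeta)           # slopes followed by estimated cut points
}

sims <- replicate(reps, one_fit()) # one column per replication
mc_means <- rowMeans(sims)
round(mc_means, 3)
```

The first two entries of mc_means can be compared with .75 and -1.25, and the last three with the cut points 2, 3, and 4. Raising nobs or reps tightens the agreement at the cost of run time.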

To demonstrate the changes that need to be made when adding another level, I show a five-level proportional odds model. The five levels have cuts at .8, 1.6, 2.4, and 3.2. The same two coefficients are assigned.

5 LEVELS
=============================================================
# syn.ologit5.r // Joseph Hilbe 10Apr2009
library(MASS)                              # provides polr()
x1 <- runif(50000)                         # predictors; omit if already in the workspace
x2 <- runif(50000)
err <- runif(50000)
y <- .75*x1 - 1.25*x2 + log(err/(1-err))   # linear predictor plus logistic error
ys <- rep(1, length(y))                    # start with all level 1 (y <= .8)
ys <- ifelse(y>.8 & y<=1.6, 2, ys)
ys <- ifelse(y>1.6 & y<=2.4, 3, ys)
ys <- ifelse(y>2.4 & y<=3.2, 4, ys)
ys <- ifelse(y>3.2, 5, ys)
ologit <- polr(factor(ys) ~ x1 + x2, method="logistic")
summary(ologit)
=============================================================
Re-fitting to get Hessian

polr(formula = factor(ys) ~ x1 + x2, method = "logistic")

Finally we turn to the synthetic multinomial logit model. The construction is easy to follow and easily extended. I shall first show the code for developing a synthetic multinomial model with two predictors and three levels. The nnet library must be loaded to access the multinom() function, which is a standard tool for fitting multinomial models. Two sets of coefficients are assigned by the user. The default values are given in the code. The first level is assigned the role of reference.

Level 2 predictors: x1 = .4, x2 = -.5, intercept = 1
Level 3 predictors: x1 = -.3, x2 = .25, intercept = 2

SYNTHETIC MULTINOMIAL LOGIT DATA AND MODEL
====================================================================
# syn.multinom3.r // Joseph Hilbe 10Apr2009
library(nnet)
x1 <- runif(50000)               # predictors; omit if already in the workspace
x2 <- runif(50000)
denom <- 1 + exp(.4*x1 - .5*x2 + 1) + exp(-.3*x1 + .25*x2 + 2)
p1 <- 1/denom                    # reference level
p2 <- exp(.4*x1 - .5*x2 + 1)/denom
p3 <- exp(-.3*x1 + .25*x2 + 2)/denom
u <- runif(50000)
y <- rep(1, length(u))           # start with all level 1 (u <= p1)
p12 <- p1 + p2
y <- ifelse(u>p1 & u<=p12, 2, y)
y <- ifelse(u>p12, 3, y)
mlogit <- multinom(y ~ x1 + x2)
summary(mlogit)
====================================================================
# weights: 12 (6 variable)
converged

> summary(mlogit)
multinom(formula = y ~ x1 + x2)

Expanding the model to three predictors and four levels can be done employing the same logic as for the smaller model. We add a fourth level of coefficients, recalling that level 1 is the reference level. The predictor values are now:

Level 2 predictors: x1 = .4, x2 = -.5, x3 = -.2, intercept = 1
Level 3 predictors: x1 = -.3, x2 = .25, x3 = -.3, intercept = 2
Level 4 predictors: x1 = -.25, x2 = .1, x3 = .15, intercept = 2.5

================================================================================
# syn.multinom4.r // Joseph Hilbe 10Apr2009
library(nnet)
x3 <- runif(50000)               # third predictor; x1 and x2 as defined earlier
denom <- 1 + exp(.4*x1 - .5*x2 - .2*x3 + 1) +
             exp(-.3*x1 + .25*x2 - .3*x3 + 2) +
             exp(-.25*x1 + .1*x2 + .15*x3 + 2.5)
p1 <- 1/denom                    # reference level
# each numerator must match its term in denom so that p1 + p2 + p3 + p4 = 1
p2 <- exp(.4*x1 - .5*x2 - .2*x3 + 1)/denom
p3 <- exp(-.3*x1 + .25*x2 - .3*x3 + 2)/denom
p4 <- exp(-.25*x1 + .1*x2 + .15*x3 + 2.5)/denom
u <- runif(50000)
y <- rep(1, length(u))           # start with all level 1 (u <= p1)
p12 <- p1 + p2
y <- ifelse(u>p1 & u<=p12, 2, y)
p13 <- p1 + p2 + p3
y <- ifelse(u>p12 & u<=p13, 3, y)
y <- ifelse(u>p13, 4, y)
mlogit <- multinom(y ~ x1 + x2 + x3)
summary(mlogit)
================================================================================
multinom(formula = y ~ x1 + x2 + x3)

SUMMARY REMARKS

Synthetic data may be used with substantial efficacy for the evaluation of statistical models. In this article I have presented code that can be used to create a number of different types of synthetic discrete response models. The default code may be amended to employ different predictor values, as well as different numbers of predictors. The algorithms may also be extended to the generation of other types of synthetic models.

I advocate using synthetic models of this sort to better understand the models we apply to real data. R is particularly capable of creating synthetic data, and is an excellent simulation environment. With computers gaining in memory and speed, it is possible to construct far

more complex synthetic data than the ones we have discussed in this article. The ones we describe may be considered as the foundation for a wide variety of other models, including fixed, random, and mixed effects models and GEE models, among others. I trust that this effort will encourage others to construct extensions of the models given here, making them available to the general statistical community.

Two recent sources that exploit simulation and synthetic data in the analysis of complex statistical models are Gelman and Hill (2007), with respect to the understanding of random and mixed effects models, and Hardin and Hilbe (2003), for the analysis of GEE methodology.

References:

Gelman, A. and J. Hill (2007), Data Analysis Using Regression and Multilevel/Hierarchical Models, Cambridge: Cambridge University Press.
Hardin, J.W. and J.M. Hilbe (2003), Generalized Estimating Equations, Boca Raton, FL: Chapman & Hall/CRC.
Hardin, J.W. and J.M. Hilbe (2007), Generalized Linear Models and Extensions, second edition, College Station, TX: Stata Press.
Hilbe, J.M. (2011), Negative Binomial Regression, second edition, Cambridge: Cambridge University Press.
Hilbe, J.M. (2009), Logistic Regression Models, Boca Raton, FL: Chapman & Hall/CRC.


The data definition file provided by the authors is reproduced below: Obs: 1500 home sales in Stockton, CA from Oct 1, 1996 to Nov 30, 1998 Economics 312 Sample Project Report Jeffrey Parker Introduction This project is based on Exercise 2.12 on page 81 of the Hill, Griffiths, and Lim text. It examines how the sale price of houses in Stockton,

More information

Gov 2001: Section 5. I. A Normal Example II. Uncertainty. Gov Spring 2010

Gov 2001: Section 5. I. A Normal Example II. Uncertainty. Gov Spring 2010 Gov 2001: Section 5 I. A Normal Example II. Uncertainty Gov 2001 Spring 2010 A roadmap We started by introducing the concept of likelihood in the simplest univariate context one observation, one variable.

More information

Calculating the Probabilities of Member Engagement

Calculating the Probabilities of Member Engagement Calculating the Probabilities of Member Engagement by Larry J. Seibert, Ph.D. Binary logistic regression is a regression technique that is used to calculate the probability of an outcome when there are

More information

Quantitative Techniques Term 2

Quantitative Techniques Term 2 Quantitative Techniques Term 2 Laboratory 7 2 March 2006 Overview The objective of this lab is to: Estimate a cost function for a panel of firms; Calculate returns to scale; Introduce the command cluster

More information

Model 0: We start with a linear regression model: log Y t = β 0 + β 1 (t 1980) + ε, with ε N(0,

Model 0: We start with a linear regression model: log Y t = β 0 + β 1 (t 1980) + ε, with ε N(0, Stat 534: Fall 2017. Introduction to the BUGS language and rjags Installation: download and install JAGS. You will find the executables on Sourceforge. You must have JAGS installed prior to installing

More information

Catherine De Vries, Spyros Kosmidis & Andreas Murr

Catherine De Vries, Spyros Kosmidis & Andreas Murr APPLIED STATISTICS FOR POLITICAL SCIENTISTS WEEK 8: DEPENDENT CATEGORICAL VARIABLES II Catherine De Vries, Spyros Kosmidis & Andreas Murr Topic: Logistic regression. Predicted probabilities. STATA commands

More information

West Coast Stata Users Group Meeting, October 25, 2007

West Coast Stata Users Group Meeting, October 25, 2007 Estimating Heterogeneous Choice Models with Stata Richard Williams, Notre Dame Sociology, rwilliam@nd.edu oglm support page: http://www.nd.edu/~rwilliam/oglm/index.html West Coast Stata Users Group Meeting,

More information

Non-Inferiority Tests for the Difference Between Two Proportions

Non-Inferiority Tests for the Difference Between Two Proportions Chapter 0 Non-Inferiority Tests for the Difference Between Two Proportions Introduction This module provides power analysis and sample size calculation for non-inferiority tests of the difference in twosample

More information

Using New SAS 9.4 Features for Cumulative Logit Models with Partial Proportional Odds Paul J. Hilliard, Educational Testing Service (ETS)

Using New SAS 9.4 Features for Cumulative Logit Models with Partial Proportional Odds Paul J. Hilliard, Educational Testing Service (ETS) Using New SAS 9.4 Features for Cumulative Logit Models with Partial Proportional Odds Using New SAS 9.4 Features for Cumulative Logit Models with Partial Proportional Odds INTRODUCTION Multicategory Logit

More information

Longitudinal Logistic Regression: Breastfeeding of Nepalese Children

Longitudinal Logistic Regression: Breastfeeding of Nepalese Children Longitudinal Logistic Regression: Breastfeeding of Nepalese Children Scientific Question Determine whether the breastfeeding of Nepalese children varies with child age and/or sex of child. Data: Nepal

More information

AIC = Log likelihood = BIC =

AIC = Log likelihood = BIC = - log: /mnt/ide1/home/sschulh1/apc/apc_examplelog log type: text opened on: 21 Jul 2006, 18:08:20 *replicate table 5 and cols 7-9 of table 3 in Yang, Fu and Land (2004) *Stata can maximize GLM objective

More information

Lecture 10: Alternatives to OLS with limited dependent variables, part 1. PEA vs APE Logit/Probit

Lecture 10: Alternatives to OLS with limited dependent variables, part 1. PEA vs APE Logit/Probit Lecture 10: Alternatives to OLS with limited dependent variables, part 1 PEA vs APE Logit/Probit PEA vs APE PEA: partial effect at the average The effect of some x on y for a hypothetical case with sample

More information

Lapse Modeling for the Post-Level Period

Lapse Modeling for the Post-Level Period Lapse Modeling for the Post-Level Period A Practical Application of Predictive Modeling JANUARY 2015 SPONSORED BY Committee on Finance Research PREPARED BY Richard Xu, FSA, Ph.D. Dihui Lai, Ph.D. Minyu

More information

COMPLEMENTARITY ANALYSIS IN MULTINOMIAL

COMPLEMENTARITY ANALYSIS IN MULTINOMIAL 1 / 25 COMPLEMENTARITY ANALYSIS IN MULTINOMIAL MODELS: THE GENTZKOW COMMAND Yunrong Li & Ricardo Mora SWUFE & UC3M Madrid, Oct 2017 2 / 25 Outline 1 Getzkow (2007) 2 Case Study: social vs. internet interactions

More information

XLSTAT TIP SHEET FOR BUSINESS STATISTICS CENGAGE LEARNING

XLSTAT TIP SHEET FOR BUSINESS STATISTICS CENGAGE LEARNING XLSTAT TIP SHEET FOR BUSINESS STATISTICS CENGAGE LEARNING INTRODUCTION XLSTAT makes accessible to anyone a powerful, complete and user-friendly data analysis and statistical solution. Accessibility to

More information

9. Logit and Probit Models For Dichotomous Data

9. Logit and Probit Models For Dichotomous Data Sociology 740 John Fox Lecture Notes 9. Logit and Probit Models For Dichotomous Data Copyright 2014 by John Fox Logit and Probit Models for Dichotomous Responses 1 1. Goals: I To show how models similar

More information

MCMC Package Example (Version 0.5-1)

MCMC Package Example (Version 0.5-1) MCMC Package Example (Version 0.5-1) Charles J. Geyer September 16, 2005 1 The Problem This is an example of using the mcmc package in R. The problem comes from a take-home question on a (take-home) PhD

More information

Module 4 Bivariate Regressions

Module 4 Bivariate Regressions AGRODEP Stata Training April 2013 Module 4 Bivariate Regressions Manuel Barron 1 and Pia Basurto 2 1 University of California, Berkeley, Department of Agricultural and Resource Economics 2 University of

More information

Posterior Inference. , where should we start? Consider the following computational procedure: 1. draw samples. 2. convert. 3. compute properties

Posterior Inference. , where should we start? Consider the following computational procedure: 1. draw samples. 2. convert. 3. compute properties Posterior Inference Example. Consider a binomial model where we have a posterior distribution for the probability term, θ. Suppose we want to make inferences about the log-odds γ = log ( θ 1 θ), where

More information

Multiple regression - a brief introduction

Multiple regression - a brief introduction Multiple regression - a brief introduction Multiple regression is an extension to regular (simple) regression. Instead of one X, we now have several. Suppose, for example, that you are trying to predict

More information

Comparing Odds Ratios and Marginal Effects from Logistic Regression and Linear Probability Models

Comparing Odds Ratios and Marginal Effects from Logistic Regression and Linear Probability Models Western Kentucky University From the SelectedWorks of Matt Bogard Spring March 11, 2016 Comparing Odds Ratios and Marginal Effects from Logistic Regression and Linear Probability Models Matt Bogard Available

More information

An Empirical Study on Default Factors for US Sub-prime Residential Loans

An Empirical Study on Default Factors for US Sub-prime Residential Loans An Empirical Study on Default Factors for US Sub-prime Residential Loans Kai-Jiun Chang, Ph.D. Candidate, National Taiwan University, Taiwan ABSTRACT This research aims to identify the loan characteristics

More information

SAS/STAT 14.1 User s Guide. The HPFMM Procedure

SAS/STAT 14.1 User s Guide. The HPFMM Procedure SAS/STAT 14.1 User s Guide The HPFMM Procedure This document is an individual chapter from SAS/STAT 14.1 User s Guide. The correct bibliographic citation for this manual is as follows: SAS Institute Inc.

More information

Description Quick start Menu Syntax Options Remarks and examples Stored results Methods and formulas Acknowledgment References Also see

Description Quick start Menu Syntax Options Remarks and examples Stored results Methods and formulas Acknowledgment References Also see Title stata.com tssmooth shwinters Holt Winters seasonal smoothing Description Quick start Menu Syntax Options Remarks and examples Stored results Methods and formulas Acknowledgment References Also see

More information

Statistics 175 Applied Statistics Generalized Linear Models Jianqing Fan

Statistics 175 Applied Statistics Generalized Linear Models Jianqing Fan Statistics 175 Applied Statistics Generalized Linear Models Jianqing Fan Example 1 (Kyhposis data): (The data set kyphosis consists of measurements on 81 children following corrective spinal surgery. Variable

More information