Using R to Create Synthetic Discrete Response Regression Models
Arizona State University
From the SelectedWorks of Joseph M Hilbe
July 3, 2011
Joseph M. Hilbe
Arizona State University, and Jet Propulsion Laboratory, California Institute of Technology
3 July, 2011 (revised)
Hilbe@asu.edu

Copyright 2009, Joseph M Hilbe, all rights reserved. Do not distribute without prior written permission of author. Appropriate citation is requested if used in preparation of an article or book.

The use of synthetic data and synthetic models has played an important role in evaluating model fit and bias. Synthetic models and tests are now found in many of the major texts dealing with statistical models. R is particularly adept in allowing the user to create synthetic data, and provides the opportunity of having almost every aspect of the data and model engineered as needed by the statistician.

In my texts, Negative Binomial Regression (2007, Cambridge University Press) and Logistic Regression Models (2009, Chapman & Hall/CRC), I present the reader with a number of synthetic models, displaying how missing predictors, needed interactions, incorrectly specified links, and so forth can affect both model parameter estimates and dispersion statistics. The examples in these two books demonstrate how to distinguish apparent from real overdispersion in count and grouped binomial models.

In this article I demonstrate how to construct synthetic data sets for various popular discrete response regression models. The same methods may be used to create data specific to a wide variety of alternative models. Specifically, I provide code for creating synthetic data sets for given types of binomial, Poisson, negative binomial, proportional slopes (including proportional odds), and multinomial models. All code is based on the R pseudo-random number generators runif(), rpois(), rgamma(), and rbinom(). The pnorm() function is used for probit models. I also provide code for developing a synthetic NB-C, or canonical negative binomial model, even though there is not as yet an R command for its estimation.
I intend to develop this command in a forthcoming article. The NB-C code is based on the rnbinom() function.

Note that model predictors are all based on random uniform data. The results would be the same if we used normal variates, or even binary and factor variates. NB1 and NB2 negative binomial models are based on a Poisson-gamma mixture; the code clearly shows how this occurs. The logic of the coding can be used to develop a wide range of other types of synthetic models.

For each set of model code it is easy for users to amend the default values, selecting their own number of model observations, seed values, number of predictors and their values, and, for negative binomial models, a specific value for the ancillary parameter. Categorical response models are coded such that the number of predictors and their values may be selected, as well as cut point values for ordered response models and the number of levels for multinomial models. Poisson, negative binomial and binomial models may also employ an offset.
I divide this article into three sections. First, I shall discuss creation of synthetic count response models, specifically Poisson, NB2, NB1, and NB-C. Second, I develop code for binomial models, which include both Bernoulli or binary and binomial or grouped logit and probit models. A third section provides an overview for creating synthetic proportional slopes models, including the proportional odds model and the ordered probit, as well as code for constructing a synthetic multinomial logit model. Statisticians should find it relatively easy to adjust the provided default code to construct synthetic data for a wide variety of alternative discrete response models. Synthetic continuous response models are also based on the same logic as the code developed here.

It should be noted that each run of these synthetic models will differ one from the other unless a seed is specified. In addition, it is possible to apply a Monte Carlo simulation algorithm to each of these models in order to determine the mean parameter and intercept values over many simulation runs. I have performed Monte Carlo simulations on each model discussed in this article, with the result that the user specified values are nearly identical to the resultant mean values. Some algorithms appear to yield more variability in parameter estimates than others, but upon a thousand or more Monte Carlo runs the mean values are stable.

Each coding algorithm may be used as a batch file, either submitted as a whole file or submitted a line at a time. As previously indicated, the user may specify various aspects of the algorithm to suit their requirements. Each algorithm name begins with the characters syn, followed by the type of model and any extra capabilities. For example, code to create the basic synthetic Poisson model is called syn.poisson.r. The algorithm for adding an offset is called syn.poissono.r.
For proportional odds and multinomial models, the number of cuts (ordered models) or levels (multinomial) is indicated at the end of the name, e.g. syn.ologit4.r.

Specific directions for running the algorithms:
1: Open File -> Open script
2: Select the file name from the directory where the files are kept
3: Using the left mouse button, select all lines of the file (or the lines desired)
4: Click once on the right mouse button, Run line or selection
5: Left click on the mouse button.
6: Synthetic model results are displayed in the R console.

1: SYNTHETIC COUNT MODELS

I shall first create a simple Poisson model using the R function rpois(). This is the basic or paradigm count model. The synthetic model is given two pseudo-random uniform predictors, x1 and x2. Random normal predictors are also commonly used when developing synthetic models. The values assigned to the two predictors and the intercept are, respectively, x1 = 0.75, x2 = -1.25, and 2. Note that the final two lines of each algorithm are the regression model and summary output. The fourth line below, yielding xb, is the linear predictor, which is given to the inverse
link function in the following line. The inverse link function is used in generalized linear models to calculate the fitted or predicted value. The fit is then employed in the rpois() function to produce the synthetic Poisson response variate.

SYNTHETIC POISSON MODEL
==========================================================================
# syn.poisson.r // Joseph Hilbe 10Apr2009
x1 <- runif(50000)                  # first uniform predictor; 50,000 observations
x2 <- runif(50000)                  # second uniform predictor
xb <- 2 + .75*x1 - 1.25*x2          # linear predictor
exb <- exp(xb)                      # inverse link to predicted fit
py <- rpois(50000, exb)             # generates random Poisson variates
jhpoi <- glm(py ~ x1 + x2, family=poisson)
summary(jhpoi)
===========================================================================
glm(formula = py ~ x1 + x2, family = poisson)

Coefficients:
(Intercept)   <2e-16 ***
x1            <2e-16 ***
x2            <2e-16 ***

(Dispersion parameter for poisson family taken to be 1)
Number of Fisher Scoring iterations: 4

The resultant parameter estimates closely approximate the user defined values. A seed was not specified for the example model. If we wish to generate the same random numbers, and the exact same model, a seed may be given after the initial line. The function is set.seed(#); e.g., set.seed(32478).

Unfortunately the Pearson statistic is not displayed with the glm() function output, but rather the deviance is used as the basis of the dispersion. Although this is the traditional manner used for calculating the dispersion statistic, it is not correct. See Hilbe (2007, 2009).

Poisson models are commonly parameterized as rate models. As such they employ an offset, which reflects the area or time over which the count response is generated. Since the natural log is the canonical link of the Poisson model, the offset must be logged prior to entry into the estimating algorithm.
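Returning to the dispersion statistic mentioned above: although summary() of a glm() fit does not print the Pearson statistic, it can be computed from the fitted object. The following is a minimal sketch under the same specification as the Poisson example; the object names fit, pearson_disp and deviance_disp are illustrative and do not come from the original scripts.

```r
# Sketch: Pearson-based dispersion for a synthetic Poisson model.
set.seed(32478)                       # seed value used as an example in the text
nobs <- 50000
x1 <- runif(nobs)
x2 <- runif(nobs)
xb <- 2 + 0.75*x1 - 1.25*x2           # linear predictor
py <- rpois(nobs, exp(xb))            # synthetic Poisson response
fit <- glm(py ~ x1 + x2, family = poisson)

# Pearson chi-square divided by the residual degrees of freedom
pearson_disp <- sum(residuals(fit, type = "pearson")^2) / fit$df.residual

# Deviance-based analogue, shown only for comparison
deviance_disp <- fit$deviance / fit$df.residual

print(c(pearson = pearson_disp, deviance = deviance_disp))
```

For data generated from a true Poisson model both values should lie close to 1; values well above 1 would point to possible overdispersion.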
A synthetic offset may be randomly generated, or may be specified by the user. For this example I will create an area offset having increasing values of 100 for each 10,000 observations in the 50,000 observation data set. The first 10,000 observations have an offset value of 100, the second 10,000 a value of 200, and so on until the final 10,000 observations, which have a value of 500. This value is calculated on the fourth line of program code. Note that the log-offset is added to the linear predictor in line 6.

POISSON WITH RATE PARAMETERIZATION
===========================================================================
# syn.poissono.r // Joseph Hilbe 10Apr2009
x1 <- runif(50000)
x2 <- runif(50000)
off <- rep(1:5, each=10000, times=1)*100    # creates offset as defined
loff <- log(off)                            # log the offset
xb <- 2 + .75*x1 - 1.25*x2 + loff           # linear predictor
exb <- exp(xb)                              # inverse link
py <- rpois(50000, exb)                     # creates random Poisson variates
poir <- glm(py ~ x1 + x2 + offset(loff), family=poisson)
summary(poir)
============================================================================
glm(formula = py ~ x1 + x2 + offset(loff), family = poisson)

Coefficients:
(Intercept)   <2e-16 ***
x1            <2e-16 ***
x2            <2e-16 ***

(Dispersion parameter for poisson family taken to be 1)
Number of Fisher Scoring iterations: 3

A display of the offset table is shown as:

> table(off)
off
  100   200   300   400   500
10000 10000 10000 10000 10000
Note that the resultant parameter estimates are nearly identical to the specified values. We do expect that the parameter estimates of the model with an offset, or rate parameterized model, will closely approximate those of the standard model. However, we also expect that the standard errors will differ. Moreover, if the rate model were modeled without declaring an offset, we would notice a greatly inflated intercept value.

A Poisson model having a Pearson-based dispersion greater than 1.0 indicates possible overdispersion. Typically the deviance-based dispersion is similar to the Pearson, but not always. In any case, in R, two types of generalized linear model are generally employed when there is evidence of overdispersion in a Poisson model: quasipoisson and negative binomial. The negative binomial used in both the glm.nb() and glm() functions is commonly referred to as the quadratic negative binomial, or NB2 model. It is parameterized with a variance function of μ + αμ². The other common parameterization of negative binomial is termed the linear negative binomial, or NB1. The NB1 model is parameterized as μ + αμ, or as μ(1+α).

The quasipoisson model is simply the Poisson model with standard errors scaled by the Pearson dispersion. Since the standard errors are not based directly on the Hessian matrix derived from the log-likelihood function, the term quasi-likelihood is used for the type of model estimated. Quasipoisson is the name given to Poisson models with scaled standard errors.

The NB2 parameterization of the negative binomial can be generated as a Poisson-gamma mixture model, with a gamma scale parameter of 1. We use this method to create a synthetic NB2 variate, and a corresponding synthetic NB2 model. The negative binomial random number generator in R, rnbinom(), is not appropriate for dealing with NB2 data. The rnbinom() function is, however, appropriate for generating NB-C data, as we shall discuss later.
In fact, both NB2 and NB1 synthetic models are generated as a mixture of Poisson and gamma distributions. The algorithm below is used to create a synthetic NB2 model. Note how the mixture of distributions occurs. The same parameters are used here as for the above Poisson models. We assign a value of .5 to the NB2 heterogeneity or overdispersion parameter, also termed an ancillary parameter. The glm.nb() function, which is part of the MASS library, is used to model the synthetic data.

NEGATIVE BINOMIAL (NB2) SYNTHETIC MODEL
=======================================================================
# syn.nb2.r // Joseph Hilbe 10Apr2009
library(MASS)                        # provides glm.nb()
x1 <- runif(50000)
x2 <- runif(50000)
xb <- 2 + .75*x1 - 1.25*x2           # linear predictor / parameter values
a <- .5                              # assign value to ancillary parameter
ia <- 1/a                            # invert the ancillary parameter
exb <- exp(xb)                       # Poisson predicted value
xg <- rgamma(50000, shape = a, rate = a)   # gamma variates, mean 1 (rate = a implies scale = ia)
xbg <- exb*xg                        # mix Poisson and gamma variates
nby <- rpois(50000, xbg)             # generate NB2 variates
jhnb2 <- glm.nb(nby ~ x1 + x2)       # model NB2
summary(jhnb2)
=======================================================================
glm.nb(formula = nby ~ x1 + x2, init.theta = , link = log)

Coefficients:
(Intercept)   <2e-16 ***
x1            <2e-16 ***
x2            <2e-16 ***

(Dispersion parameter for Negative Binomial(0.5033) family taken to be 1)
Number of Fisher Scoring iterations: 1

The values of the parameter estimates and ancillary parameter closely approximate the user specified values. The estimated value of the ancillary parameter is reported by glm.nb() as theta, 0.5033 in the dispersion line above. Another run will produce different values, centered about the specified values. There is typically more variability in simulated NB variates than for Poisson due to the effect of the ancillary parameter.

The glm() command may also be used to estimate NB2 parameter estimates. However, the value of the ancillary parameter must be entered into the function as a constant. It is not estimated as with the glm.nb() function. The new line and model results are displayed below.

NB2 MODEL WITH THETA (ALPHA) ENTERED AS A CONSTANT
===================================================================
mynb2c <- glm(nby ~ x1 + x2, family=negative.binomial(theta=.5))
summary(mynb2c)
===================================================================
glm(formula = nby ~ x1 + x2, family = negative.binomial(theta = 0.5))

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)                               <2e-16 ***
x1                                        <2e-16 ***
x2                                        <2e-16 ***

Number of Fisher Scoring iterations: 6

We may employ an offset in the NB2 algorithm in the same manner as we did for the Poisson. Since the mean of the Poisson and NB2 are both exp(xb), we may use the same method. The synthetic NB2 data and model with offset is in the syn.nb2o.r file.

The linear negative binomial model, NB1, is also based on a Poisson-gamma mixture distribution. As discussed earlier, the NB1 heterogeneity or ancillary parameter is typically referred to as δ, not α. Converting the NB2 algorithm to NB1 entails defining idelta as the inverse of the value of delta, the desired value of the model ancillary parameter, multiplied by the fitted value, exb. idelta is then given to the rgamma() function as both the shape and the rate (equivalently, a scale of 1/idelta). All else is the same as the NB2 algorithm. The resultant synthetic data can be modeled using the ml.nb1 function in the COUNT package, which is located on CRAN. The COUNT package was first published in September 2010 as a source for data, functions and scripts associated with Hilbe (2011).
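The two Poisson-gamma constructions can be checked empirically by holding the mean fixed and comparing the sample variance with the implied variance function. The sketch below assumes a single fixed mean, mu = 5, chosen only for illustration. Under R's shape/rate convention, a gamma variate with shape = rate = a has mean 1 and variance 1/a, so the NB2-style mixture has variance mu + mu²/a. In that construction a therefore plays the role of theta as reported by glm.nb(). The NB1-style mixture has variance mu(1 + delta).

```r
# Sketch: empirical check of the Poisson-gamma mixtures at a fixed mean.
set.seed(1)
n  <- 200000
mu <- 5                                   # fixed mean (illustrative value, not from the article)

# NB2-style mixture: gamma shape = rate = a
a <- 0.5
y_nb2 <- rpois(n, mu * rgamma(n, shape = a, rate = a))

# NB1-style mixture: gamma shape = rate = mu/delta
delta  <- 0.5
idelta <- mu / delta
y_nb1 <- rpois(n, mu * rgamma(n, shape = idelta, rate = idelta))

nb2_check <- c(sample_mean = mean(y_nb2), sample_var = var(y_nb2),
               implied_var = mu + mu^2 / a)
nb1_check <- c(sample_mean = mean(y_nb1), sample_var = var(y_nb1),
               implied_var = mu * (1 + delta))
print(round(nb2_check, 2))
print(round(nb1_check, 2))
```

Naming the shape and rate arguments explicitly avoids supplying both a rate and a scale to rgamma(), which R either warns about or rejects.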
The code below is stored in the syn.nb1.r file.

NEGATIVE BINOMIAL (NB1) SYNTHETIC MODEL
===================================================================
library(COUNT)                       # provides ml.nb1()
x1 <- runif(50000)
x2 <- runif(50000)
xb <- 2 + .75*x1 - 1.25*x2           # linear predictor; parameter values
delta <- .5                          # value assigned to delta
exb <- exp(xb)
idelta <- (1/delta)*exb              # product of theta and mu
xg <- rgamma(50000, shape = idelta, rate = idelta)   # gamma variates with mean 1
xbg <- exb*xg
nb1y <- rpois(50000, xbg)
nb1data <- data.frame(nb1y, x1, x2)
nb1 <- ml.nb1(nb1y ~ x1 + x2, data=nb1data)
nb1
===================================================================
             Estimate   SE   Z   LCL   UCL
(Intercept)
x1
x2
alpha

The canonical negative binomial (NB-C), however, must be constructed in an entirely different manner from NB2, NB1, or Poisson. NB-C is based directly on the negative binomial PDF.
R's rnbinom() function may be used to construct NB-C data, but not NB1 or NB2. Other options such as offsets are entered into the NB-C algorithm in the same manner as they were for Poisson, NB2 and NB1. The ml.nbc function in the COUNT package can be used to estimate NB-C models.

The NB-C inverse link differs from that of Poisson, NB1, and NB2 models. It may be expressed as: 1/((exp(-xb)-1)*α). Unlike the other parameterizations, α is part of the link and inverse link functions. The probability is given as 1/(1+αμ). See Hilbe (2011) for details.

The algorithm for creating NB-C data is given below, and is stored in syn.nbc.r. Values of p outside the range 0 to 1 are dropped from estimation, although this is usually quite rare. The resultant model still estimates the specified values for the parameter estimates; only the number of observations differs from that assigned by the user.

SYNTHETIC CANONICAL NEGATIVE BINOMIAL (NB-C) MODEL
rnbinom(n, size, prob, mu), where size is the ancillary parameter, prob is the probability, p, and mu = mu. Only p or mu may be put into the function, not both.
==========================================================================
# syn.nbc.r // Joseph Hilbe 10Apr2009; revised 2/2011, adding ml.nbc.
library(COUNT)                       # provides ml.nbc()
x1 <- runif(50000)
x2 <- runif(50000)
a <- 1.15                            # value of alpha: 1.15
xb <- 1.25*x1 + .1*x2 - 1.5          # x1=1.25; x2=0.1; _cons= -1.5
mu <- 1/((exp(-xb)-1)*a)             # NB-C inverse link
p <- 1/(1+a*mu)
r <- 1/a
nbcy <- rnbinom(50000, size=r, prob = p)
nbcdata <- data.frame(nbcy, x1, x2)
nbc <- ml.nbc(nbcy ~ x1 + x2, data=nbcdata)
nbc
==========================================================================
             Estimate   SE   Z   LCL   UCL
(Intercept)
x1
x2
alpha

The specified and estimated values are nearly identical. Note that the log-likelihood function used for NB-C estimation is given as (see Hilbe, 2011):

LL(NB-C) = Σ { y(xb) + (1/α)ln(1 − exp(xb)) + lnΓ(y + 1/α) − lnΓ(y + 1) − lnΓ(1/α) }

I should also mention that exact regression models are based on the canonical form of the model, e.g. binomial logit and Poisson log models.
If an exact negative binomial regression model is to be created, it must be based on the NB-C parameterization, not NB2.
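The NB-C log-likelihood can be checked against R's own negative binomial density. With size = 1/α and prob = 1/(1+αμ), the canonical link gives prob = 1 − exp(xb), so the expression Σ{y(xb) + (1/α)ln(1 − exp(xb)) + lnΓ(y + 1/α) − lnΓ(y + 1) − lnΓ(1/α)} should equal the summed log density from dnbinom(). The following is a minimal sketch; the helper name nbc_loglik is hypothetical and is not part of the COUNT package, and the smaller sample size is chosen only to keep the check quick.

```r
# Sketch: NB-C log-likelihood written out directly and compared with dnbinom().
set.seed(2011)
n  <- 1000                            # small sample for a quick check (assumption)
x1 <- runif(n)
x2 <- runif(n)
a  <- 1.15                            # alpha, as in the NB-C example
xb <- 1.25*x1 + 0.1*x2 - 1.5          # negative for all x1, x2 in (0, 1)
mu <- 1/((exp(-xb) - 1)*a)            # NB-C inverse link
p  <- 1/(1 + a*mu)                    # equals 1 - exp(xb)
y  <- rnbinom(n, size = 1/a, prob = p)

# Hypothetical helper implementing the log-likelihood term by term
nbc_loglik <- function(y, xb, a) {
  sum(y*xb + (1/a)*log(1 - exp(xb)) +
      lgamma(y + 1/a) - lgamma(y + 1) - lgamma(1/a))
}

ll_formula <- nbc_loglik(y, xb, a)
ll_dnbinom <- sum(dnbinom(y, size = 1/a, prob = p, log = TRUE))
print(c(formula = ll_formula, dnbinom = ll_dnbinom))
```

The two values should agree to numerical precision, which also confirms that the linear predictor must be negative for p to lie between 0 and 1.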
The zero-inflated Poisson model is a mixture model where 0 and 1 counts overlap between a count model component and a binary model component. We will give an example of such a mixture model with Poisson for the count component and binary logistic regression as the binary component. See Hilbe (2011) for details on the model and interpretation. Note that the Poisson component is constructed first, with the data stored as a data frame. The zeroinfl function in the pscl package is used to model the data. The values of the two components are established as:

POISSON : intercept=2; x1=0.75; x2=-1.25
LOGISTIC : intercept=0.2; x1=0.9; x2=0.1

ZERO-INFLATED POISSON
=========================================================================
# Zero-Inflated Poisson syn.zip.r J Hilbe 6Jun2011
library(pscl)
nobs <- 50000                        # value lost in the source; 50000 assumed to match the earlier models
x1 <- runif(nobs)
x2 <- runif(nobs)
# POISSON
xb <- 2 + .75*x1 - 1.25*x2
exb <- exp(xb)
poy <- rpois(nobs, exb)
pdata <- data.frame(poy, x1, x2)
# LOGISTIC: probability of a structural zero
pi <- 1/(1+exp(-(.9*x1 + .1*x2 + .2)))
pdata$bern <- runif(nobs) > pi
zy <- pdata$bern * poy
zip <- zeroinfl(zy ~ x1 + x2 | x1 + x2, dist="poisson", data=pdata)
summary(zip)
=========================================================================
zeroinfl(formula = zy ~ x1 + x2 | x1 + x2, data = pdata, dist = "poisson")

Count model coefficients (poisson with log link):
(Intercept)   <2e-16 ***
x1            <2e-16 ***
x2            <2e-16 ***

Zero-inflation model coefficients (binomial with logit link):
(Intercept)   <2e-16 ***
x1            <2e-16 ***
x2                   **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Number of iterations in BFGS optimization: 11
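One quick sanity check on synthetic zero-inflated data is to compare the observed share of zeros with the share implied by the two components. A zero arises either structurally, with probability π, or from the Poisson count itself, with probability (1 − π)exp(−μ). The sketch below uses the component values given above; the names pz, keep, expected_zero and observed_zero are illustrative, and the sample size of 50,000 is an assumption.

```r
# Sketch: expected versus observed proportion of zeros in synthetic ZIP data.
set.seed(6)
nobs <- 50000                                   # assumed sample size
x1 <- runif(nobs)
x2 <- runif(nobs)
mu <- exp(2 + 0.75*x1 - 1.25*x2)                # Poisson component mean
pz <- 1/(1 + exp(-(0.2 + 0.9*x1 + 0.1*x2)))     # probability of a structural zero
poy  <- rpois(nobs, mu)
keep <- runif(nobs) > pz                        # TRUE where the Poisson count is kept
zy   <- keep * poy                              # zero-inflated response

expected_zero <- mean(pz + (1 - pz)*exp(-mu))   # implied zero probability, averaged over x
observed_zero <- mean(zy == 0)
poisson_zero  <- mean(poy == 0)                 # zeros from the Poisson component alone

print(round(c(expected = expected_zero, observed = observed_zero,
              poisson_only = poisson_zero), 4))
```

The observed and expected proportions should be close, and both should be far larger than the proportion of zeros produced by the Poisson component alone.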
The zero-inflated negative binomial is built on the same logic as the ZIP model just discussed. The same intercepts and coefficients are given as were for the ZIP model.

ZERO-INFLATED NEGATIVE BINOMIAL
==================================================================
# ZERO INFLATED NEGATIVE BINOMIAL syn.zinb.r
# J Hilbe 6Jun2011
library(pscl)
nobs <- 50000                        # value lost in the source; 50000 assumed to match the earlier models
x1 <- runif(nobs)
x2 <- runif(nobs)
xb <- 2 + .75*x1 - 1.25*x2
a <- .5
ia <- 1/a
exb <- exp(xb)
xg <- rgamma(nobs, shape = a, rate = a)   # rate = a implies scale = ia
xbg <- exb*xg
nby <- rpois(nobs, xbg)
nbdata <- data.frame(nby, x1, x2)
pi <- 1/(1+exp(-(.9*x1 + .1*x2 + .2)))
nbdata$bern <- runif(nobs) > pi
zy <- nbdata$bern * nby
zinb <- zeroinfl(zy ~ x1 + x2 | x1 + x2, dist="negbin", data=nbdata)
summary(zinb)
===================================================================
zeroinfl(formula = zy ~ x1 + x2 | x1 + x2, data = nbdata, dist = "negbin")

Count model coefficients (negbin with log link):
(Intercept)   <2e-16 ***
x1            <2e-16 ***
x2            <2e-16 ***
Log(theta)    <2e-16 ***

Zero-inflation model coefficients (binomial with logit link):
(Intercept)    e-08 ***
x1           < 2e-16 ***
x2
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Number of iterations in BFGS optimization: 25
We established that the value of the negative binomial ancillary parameter is to be .5. Remember that theta is defined in R's glm and glm.nb functions as 1/alpha. In the zeroinfl function, however, the authors have parameterized it as is standard in other software applications, and as parameterized in the COUNT package ml.nb2, ml.nb1, and ml.nbc functions.

Next we look at the hurdle models, namely the Poisson-logit and negative binomial-logit hurdle models. For this class of model, the count component is restructured as a zero-truncated model, and the binary component specifies 1 as any response value above zero. Zero counts are therefore found only in the binary component. This is unlike the zero-inflated models, which mix the numbers.

POISSON-LOGIT HURDLE MODEL
==================================================================
# syn.logitpoisson.hurdle.r
# R - SYNTHETIC LOGIT-POISSON HURDLE SCRIPT
# J. Hilbe 6 June 2011
library(pscl)
nobs <- 50000                        # value lost in the source; 50000 assumed to match the earlier models
# Generate predictors, design matrix
x1 <- runif(nobs)
x2 <- runif(nobs)
xb <- 2 + .75*x1 - 1.25*x2           # intercept and x1 value assumed equal to the ZIP example
# Construct Poisson responses
exb <- exp(xb)
poy <- rpois(nobs, exb)
pdata <- data.frame(poy, x1, x2)
# Construct filter
pi <- 1/(1+exp(-(.9*x1 + .1*x2 + .2)))
pdata$bern <- runif(nobs) > pi
# Remove all response zeros.
pdata <- subset(pdata, poy > 0)
# Add structural zeros
pdata$poy[pdata$bern] <- 0
# Model Synthetic Logit-Poisson Hurdle data
hlpoi <- hurdle(poy ~ x1 + x2, dist = "poisson", zero.dist = "binomial", link = "logit", data = pdata)
summary(hlpoi)
=========================================================
hurdle(formula = poy ~ x1 + x2, data = pdata, dist = "poisson", zero.dist = "binomial", link = "logit")
Count model coefficients (truncated poisson with log link):
(Intercept)   <2e-16 ***
x1            <2e-16 ***
x2            <2e-16 ***

Zero hurdle model coefficients (binomial with logit link):
(Intercept)   <2e-16 ***
x1            <2e-16 ***
x2                   *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Number of iterations in BFGS optimization: 9

The values of the component coefficients and intercepts are quite close.

We next construct a negative binomial-logit, or NB2-logit, hurdle model. It is based on the same logic as the ZINB model, but instead of mixing the distributions, it separates them as we did in the Poisson-logit hurdle model.

NEGATIVE BINOMIAL-LOGIT HURDLE MODEL
===========================================================================
# syn.lnb2.hurdle.r : J Hilbe 6Jun2011
library(pscl)
nobs <- 50000                        # value lost in the source; 50000 assumed to match the earlier models
x1 <- runif(nobs)
x2 <- runif(nobs)
xb <- 2 + .75*x1 - 1.25*x2           # intercept and x1 value assumed equal to the ZINB example
a <- .5
ia <- 1/a
exb <- exp(xb)
xg <- rgamma(nobs, shape = a, rate = a)   # rate = a implies scale = ia
xbg <- exb*xg
nby <- rpois(nobs, xbg)
nbdata <- data.frame(nby, x1, x2)
pi <- 1/(1+exp(-(.9*x1 + .1*x2 + .2)))
nbdata$bern <- runif(nobs) > pi
nbdata <- subset(nbdata, nby > 0)    # remove all response zeros
nbdata$nby[nbdata$bern] <- 0         # add structural zeros
hlnb2 <- hurdle(nby ~ x1 + x2, dist="negbin", zero.dist="binomial", link="logit", data=nbdata)
summary(hlnb2)
============================================================================
hurdle(formula = nby ~ x1 + x2, data = nbdata, dist = "negbin", zero.dist = "binomial", link = "logit")

Count model coefficients (truncated negbin with log link):
(Intercept)   <2e-16 ***
x1            <2e-16 ***
x2            <2e-16 ***
Log(theta)    <2e-16 ***

Zero hurdle model coefficients (binomial with logit link):
(Intercept)    e-11 ***
x1           < 2e-16 ***
x2                   **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Number of iterations in BFGS optimization: 17

The closeness of the specified and estimated coefficients is not as good as with the Poisson-logit hurdle, but this is largely due to the fact that we are only observing one of a host of possible synthetic models. Using a Monte Carlo method to repeat and summarize would show much closer values.

For our final synthetic count model we shall look at the Poisson-Poisson finite mixture model. The code comes from Hilbe (2011), which has a good discussion of the model.

POISSON-POISSON FINITE MIXTURE MODEL
=============================================================================
# syn.ppfm.r Synthetic Poisson-Poisson Finite Mixture model
# Table 13.2: Hilbe, Negative Binomial Regression, 2 ed, Cambridge Univ Press
library(flexmix)
nobs <- 50000                        # value lost in the source; 50000 assumed to match the earlier models
x1 <- runif(nobs)
x2 <- runif(nobs)
# The intercept and x1 coefficient of the next two lines are lost in the source
xb1 < *x1-0.75*x2
xb2 < *x1-1.25*x2
exb1 <- exp(xb1)
exb2 <- exp(xb2)
py1 <- rpois(nobs, exb1)
py2 <- rpois(nobs, exb2)
poixpoi <- py2
poixpoi <- ifelse(runif(nobs) > .9, py1, poixpoi)
pxp <- flexmix(poixpoi ~ x1 + x2, k=2, model=FLXMRglm(family="poisson"))
summary(pxp)
parameters(pxp, component=1, model=1)
parameters(pxp, component=2, model=1)
===========================================================================
flexmix(formula = poixpoi ~ x1 + x2, k = 2, model = FLXMRglm(family = "poisson"))

        prior  size  post>0  ratio
Comp.1
Comp.2

'log Lik.' (df=7)

> parameters(pxp, component=1, model=1)
                   Comp.1
coef.(Intercept)
coef.x1
coef.x2

> parameters(pxp, component=2, model=1)
                   Comp.2
coef.(Intercept)
coef.x1
coef.x2

Note that the proportions and coefficients are given in the reverse order from that specified in the synthetic model. The first component is defined as having 10% of the sample, whereas the second component has 90%. This is coded on the second poixpoi line. I have indicated the proportions in the output above.

It is not at all difficult to modify the above code to employ Poisson-NB2 mixtures, NB2-NB2 mixtures, or any other combination we have been discussing.

2: SYNTHETIC BINOMIAL MODELS

Synthetic binomial models are constructed in the same manner as synthetic Poisson data and models. The key lines are those that generate pseudo-random variates, a line creating the linear predictor with user defined parameters, a line employing the inverse link function to generate μ, the fitted probability of 1, and a line using μ to generate pseudo-random variates appropriate to the distribution.

A Bernoulli distribution consists entirely of binary values, 1/0. y is binary and is considered here to be the response variable which is explained by the values of x1 and x2. Data such as this are typically modeled using a logistic regression. A probit, loglog, or complementary loglog model may also be used to model the data.

[Example data: individual binary observations with columns y, x1, x2]
The above data may be grouped by covariate patterns. The covariates here are, of course, x1 and x2. With y now the number of successes, i.e. a count of 1s, and m the number of observations having the same covariate pattern, the above data may be grouped as:

[Example data: the same observations grouped by covariate pattern, with columns y, m, x1, x2]

The distribution of y/m is binomial. y is a count of observations having a value of y=1 for a specific covariate pattern, and m is the number of observations having the same covariate pattern. One can see that the Bernoulli distribution is a subset of the binomial, i.e. a binomial distribution where m=1. In actuality, a logistic regression models the top data as if there were no m, regardless of the number of separate covariate patterns. Grouped logistic, or binomial-logit, regression assumes appropriate values of y and m.

In R, grouped data such as the above may be modeled as a logistic regression using the glm() function; however, the binomial numerator and denominator terms must be bound. I shall demonstrate how this is accomplished.

We shall first address the binary, or Bernoulli, logistic model. This parameterization of the binomial is usually referred to as logistic regression. We specify the following parameter values: x1=.75, x2=-1.25, and _c=2.

SYNTHETIC BERNOULLI-LOGIT DATA
=================================================================
# syn.logit.r Joseph Hilbe 10Apr2009
x1 <- runif(50000)
x2 <- runif(50000)
xb <- 2 + .75*x1 - 1.25*x2
exb <- 1/(1+exp(-xb))                # inverse logit link
by <- rbinom(50000, size = 1, prob = exb)
lry <- glm(by ~ x1 + x2, family=binomial(link="logit"))
summary(lry)
==================================================================
glm(formula = by ~ x1 + x2, family = binomial(link = "logit"))

Coefficients:
(Intercept)   <2e-16 ***
x1            <2e-16 ***
x2            <2e-16 ***
(Dispersion parameter for binomial family taken to be 1)
Number of Fisher Scoring iterations: 4

A synthetic probit model may be constructed by changing the inverse link function from 1/(1+exp(-xb)) to pnorm(xb). Complementary log-log and log-log models may be developed by inserting their own inverse links. Of course, the link must be declared appropriately when modeling with the glm() function.

BERNOULLI PROBIT REGRESSION
===========================================================
# syn.probit.r // Joseph Hilbe 10Apr2009
x1 <- runif(50000)
x2 <- runif(50000)
xb <- 2 + .75*x1 - 1.25*x2
exb <- pnorm(xb)                     # inverse probit link
by <- rbinom(50000, size = 1, prob = exb)
pry <- glm(by ~ x1 + x2, family=binomial(link="probit"))
summary(pry)
===========================================================
glm(formula = by ~ x1 + x2, family = binomial(link = "probit"))

Coefficients:
(Intercept)   <2e-16 ***
x1            <2e-16 ***
x2            <2e-16 ***

(Dispersion parameter for binomial family taken to be 1)
Number of Fisher Scoring iterations: 6

The code for constructing a synthetic binomial logit model is a bit more complex. A binomial denominator must be defined and appropriately inserted into the algorithm. It must also be
decided if the binomial denominator is a fixed value, or is itself random within certain defined constraints. In the algorithm below I have specified the same values for the binomial denominator as previously employed for the Poisson offset. Therefore, I am using fixed values for the denominator.

Pseudo-random values for the denominator may be created in a variety of ways. Below is code to generate a pseudo-random denominator that is divided into ten groups of 5000 observations.

> library(ggplot2)
> re <- 100*runif(50000)      # or runif() if observations already defined.
> d <- cut_number(re, n=10)

BINOMIAL OR GROUPED LOGISTIC REGRESSION
===========================================================================
# syn.bin_logit.r // Joseph Hilbe 10Apr2009
x1 <- runif(50000)                          # predictor & n of observations
x2 <- runif(50000)                          # 2nd predictor
d <- rep(1:5, each=10000, times=1)*100      # denominator
xb <- 2 + .75*x1 - 1.25*x2                  # predictor values
exb <- 1/(1+exp(-xb))
by <- rbinom(50000, size = d, prob = exb)   # number of successes
dby <- d - by                               # number of failures
gby <- glm(cbind(by,dby) ~ x1 + x2, family=binomial(link="logit"))
summary(gby)
===========================================================================
glm(formula = cbind(by, dby) ~ x1 + x2, family = binomial(link = "logit"))

Coefficients:
(Intercept)   <2e-16 ***
x1            <2e-16 ***
x2            <2e-16 ***

(Dispersion parameter for binomial family taken to be 1)
Number of Fisher Scoring iterations: 4

The specified parameters and estimates are extremely close in value. Probit, loglog, and complementary loglog binomial models may easily be constructed by changing the inverse link function.
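As an illustration of swapping the inverse link, the sketch below generates grouped binomial data under a complementary log-log link, whose inverse is 1 − exp(−exp(xb)), and fits it with glm(). The intercept of 0.5 is an assumption made for this example so that the fitted probabilities do not crowd against 1; it is not a value taken from the article. The object names mu and cll are likewise illustrative.

```r
# Sketch: synthetic grouped binomial data with a complementary log-log link.
set.seed(10)
nobs <- 50000
x1 <- runif(nobs)
x2 <- runif(nobs)
d  <- rep(1:5, each = 10000)*100           # fixed binomial denominators, as in the logit example
xb <- 0.5 + 0.75*x1 - 1.25*x2              # intercept 0.5 is an illustrative assumption
mu <- 1 - exp(-exp(xb))                    # inverse complementary log-log link
by  <- rbinom(nobs, size = d, prob = mu)   # number of successes
dby <- d - by                              # number of failures
cll <- glm(cbind(by, dby) ~ x1 + x2, family = binomial(link = "cloglog"))
print(round(coef(cll), 3))
```

Only two things change relative to the logit version: the line computing the fitted probability and the link declared in binomial().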
3: SYNTHETIC CATEGORICAL RESPONSE MODELS

I have previously discussed in detail the creation of synthetic ordered logit, or proportional odds, data in Hilbe (2009), and refer the reader to that source for a more thorough examination of the subject. Multinomial logit data are also examined in the same source. Because of the complexity of the model, the generated data are a bit more variable than with synthetic logit, Poisson, or negative binomial models. However, Monte Carlo simulation (not shown) shows that the mean values closely approximate the user-supplied parameters and cut points. I display code for generating synthetic ordered logit data below.

PROPORTIONAL ODDS MODEL

With four levels having cut points at 2, 3, and 4, and predictor coefficients x1 = .75 and x2 = -1.25, we have

4 LEVELS
========================================================================
# syn.ologit4.r // Joseph Hilbe 10Apr2009
library(MASS)                              # provides polr()
err <- runif(50000)
y <- .75*x1 - 1.25*x2 + log(err/(1-err))   # predictor values plus logistic error
ys <- rep(1, length(y))                    # start with all level 1 (y <= 2)
ys <- ifelse(y>2 & y<=3, 2, ys)
ys <- ifelse(y>3 & y<=4, 3, ys)
ys <- ifelse(y>4, 4, ys)
ologit <- polr(factor(ys) ~ x1 + x2, method="logistic")
summary(ologit)
========================================================================
Re-fitting to get Hessian

polr(formula = factor(ys) ~ x1 + x2, method = "logistic")

Coefficients:
Value Std. Error t value
x1 [...]
x2 [...]

Intercepts:
Value Std. Error t value
[...]

Residual Deviance: [...]
AIC: [...]
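The Monte Carlo check mentioned above is not shown in the article; the following is only a minimal sketch of how such a check might look, not the author's simulation. The helper name sim_ologit, the replicate size of 5000, the 20 replicates, and the seed are assumptions chosen to keep the run short.

```r
# Sketch: small Monte Carlo check of the four-level proportional odds generator.
# Assumed for illustration: n = 5000 per replicate, 20 replicates, the seed.
library(MASS)                                  # provides polr()
set.seed(2009)

sim_ologit <- function(n, beta = c(0.75, -1.25), cuts = c(2, 3, 4)) {
  x1  <- runif(n)
  x2  <- runif(n)
  err <- runif(n)
  y   <- beta[1] * x1 + beta[2] * x2 + log(err / (1 - err))  # latent response
  ys  <- findInterval(y, cuts, left.open = TRUE) + 1          # levels 1..4
  fit <- polr(factor(ys) ~ x1 + x2, method = "logistic")
  c(coef(fit), fit$zeta)                       # slopes, then cut points
}

est     <- replicate(20, sim_ologit(5000))     # one column per replicate
mc_mean <- rowMeans(est)                       # average estimate per parameter
round(mc_mean, 3)
```

Here findInterval() with left.open = TRUE reproduces the chain of ifelse() calls (a value equal to a cut point stays in the lower level), so adding a level only requires extending the cuts vector.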
To demonstrate the changes that need to be made when adding another level, I show a five-level proportional odds model. The five levels have cuts at .8, 1.6, 2.4, and 3.2. The same two parameter values are assigned.

5 LEVELS
=============================================================
# syn.ologit5.r // Joseph Hilbe 10Apr2009
library(MASS)                              # provides polr()
err <- runif(50000)
y <- .75*x1 - 1.25*x2 + log(err/(1-err))   # predictor values plus logistic error
ys <- rep(1, length(y))                    # start with all level 1 (y <= .8)
ys <- ifelse(y>.8 & y<=1.6, 2, ys)
ys <- ifelse(y>1.6 & y<=2.4, 3, ys)
ys <- ifelse(y>2.4 & y<=3.2, 4, ys)
ys <- ifelse(y>3.2, 5, ys)
ologit <- polr(factor(ys) ~ x1 + x2, method="logistic")
summary(ologit)
===============================================================
Re-fitting to get Hessian

polr(formula = factor(ys) ~ x1 + x2, method = "logistic")

Coefficients:
Value Std. Error t value
x1 [...]
x2 [...]

Intercepts:
Value Std. Error t value
[...]

Residual Deviance: [...]
AIC: [...]

Finally we turn to the synthetic multinomial logit model. The construction is easy to follow, and easily expandable. I shall first show the code for developing a synthetic multinomial model with two predictors and three levels. The nnet package must be loaded to access the multinom() function, a standard tool for fitting multinomial models. Two sets of predictor values are assigned by the user. The default values are given in the code. The first level is assigned the role of reference.

Level 2 predictors: x1 = .4, x2 = -.5, intercept = 1
Level 3 predictors: x1 = -.3, x2 = .25, intercept = 2
SYNTHETIC MULTINOMIAL LOGIT DATA AND MODEL
====================================================================
# syn.multinom3.r // Joseph Hilbe 10Apr2009
library(nnet)
denom <- 1 + exp(.4*x1 - .5*x2 + 1) + exp(-.3*x1 + .25*x2 + 2)
p1 <- 1/denom                              # reference level
p2 <- exp(.4*x1 - .5*x2 + 1)/denom
p3 <- exp(-.3*x1 + .25*x2 + 2)/denom
u <- runif(50000)
y <- rep(1, length(u))                     # start with all level 1 (u <= p1)
p12 <- p1 + p2
y <- ifelse(u>p1 & u<=p12, 2, y)
y <- ifelse(u>p12, 3, y)
mlogit <- multinom(y ~ x1 + x2)
summary(mlogit)
===============================================================
# weights: 12 (6 variable)
initial value [...]
iter 10 value [...]
final value [...]
converged
> summary(mlogit)
multinom(formula = y ~ x1 + x2)

Coefficients:
(Intercept) x1 x2
[...]

Std. Errors:
(Intercept) x1 x2
[...]

Residual Deviance: [...]
AIC: [...]

Expanding the model to three predictors and four levels can be done employing the same logic as for the smaller model. We add a fourth level of predictors, recalling that level 1 is the reference level. The predictor values are now:

Level 2 predictors: x1 = .4, x2 = -.5, x3 = -.2, intercept = 1
Level 3 predictors: x1 = -.3, x2 = .25, x3 = -.3, intercept = 2
Level 4 predictors: x1 = -.25, x2 = .1, x3 = .15, intercept = 2.5
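The parameter lists above can also be held in a coefficient matrix, which avoids writing one expression per level. What follows is a hedged sketch rather than the article's script: the helper name syn_multinom and the seed are assumptions, and the response is drawn by comparing a uniform variate with the cumulative category probabilities, the same device the level-by-level code uses.

```r
# Sketch: multinomial logit generator driven by a coefficient matrix.
# Assumed for illustration: the helper name syn_multinom and the seed.
library(nnet)                                   # provides multinom()
set.seed(1234)

syn_multinom <- function(X, B) {
  # X: n x p predictor matrix; B: (levels - 1) x (p + 1) matrix whose rows
  # hold the intercept and slopes for levels 2, 3, ...; level 1 is reference.
  eta  <- cbind(0, cbind(1, X) %*% t(B))        # linear predictors, level 1 = 0
  prob <- exp(eta) / rowSums(exp(eta))          # category probabilities
  cum  <- t(apply(prob, 1, cumsum))             # cumulative probabilities
  u    <- runif(nrow(X))
  1 + rowSums(u > cum[, -ncol(cum), drop = FALSE])  # count thresholds exceeded
}

n <- 50000
X <- cbind(x1 = runif(n), x2 = runif(n), x3 = runif(n))
B <- rbind(c(1.0,  0.40, -0.50, -0.20),         # level 2
           c(2.0, -0.30,  0.25, -0.30),         # level 3
           c(2.5, -0.25,  0.10,  0.15))         # level 4
y    <- syn_multinom(X, B)
mfit <- multinom(factor(y) ~ X, trace = FALSE)
round(coef(mfit), 2)                            # rows: levels 2-4
```

Adding a level or a predictor then means adding a row or a column to B, and because every probability is computed from the same matrix, the numerators and the denominator cannot drift apart.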
================================================================================
# syn.multinom4.r // Joseph Hilbe 10Apr2009
library(nnet)
x3 <- runif(50000)
denom <- 1 + exp(.4*x1 - .5*x2 - .2*x3 + 1) +
             exp(-.3*x1 + .25*x2 - .3*x3 + 2) +
             exp(-.25*x1 + .1*x2 + .15*x3 + 2.5)
p1 <- 1/denom                              # reference level
p2 <- exp(.4*x1 - .5*x2 - .2*x3 + 1)/denom
p3 <- exp(-.3*x1 + .25*x2 - .3*x3 + 2)/denom
p4 <- exp(-.25*x1 + .1*x2 + .15*x3 + 2.5)/denom
u <- runif(50000)
y <- rep(1, length(u))                     # start with all level 1 (u <= p1)
p12 <- p1 + p2
y <- ifelse(u>p1 & u<=p12, 2, y)
p13 <- p1 + p2 + p3
y <- ifelse(u>p12 & u<=p13, 3, y)
y <- ifelse(u>p13, 4, y)
mlogit <- multinom(y ~ x1 + x2 + x3)
summary(mlogit)
=========================================================================
multinom(formula = y ~ x1 + x2 + x3)

Coefficients:
(Intercept) x1 x2 x3
[...]

Std. Errors:
(Intercept) x1 x2 x3
[...]

Residual Deviance: [...]
AIC: [...]

SUMMARY REMARKS

Synthetic data may be used with substantial efficacy for the evaluation of statistical models. In this article I have presented code that can be used to create a number of different types of synthetic discrete response models. The default code may be amended to employ different parameter values, as well as different numbers of predictors. The algorithms may also be extended to the generation of other types of synthetic models. I advocate using synthetic models of this sort to better understand the models we apply to real data. R is particularly capable of creating synthetic data, and is an excellent simulation environment. With computers gaining in memory and speed, it is possible to construct far
more complex synthetic data than those discussed in this article. The models described here may be considered a foundation for a wide variety of other models, including fixed, random, and mixed effects models and GEE models, among others. I trust that this effort will encourage others to construct extensions of the models given here, making them available to the general statistical community. Two recent sources that exploit simulation and synthetic data in the analysis of complex statistical models are Gelman and Hill (2007), with respect to the understanding of random and mixed effects models, and Hardin and Hilbe (2003), for the analysis of GEE methodology.

References:

Gelman, A. and J. Hill (2007), Data Analysis Using Regression and Multilevel/Hierarchical Models, Cambridge: Cambridge University Press.
Hardin, J.W. and J.M. Hilbe (2003), Generalized Estimating Equations, Boca Raton, FL: Chapman & Hall/CRC.
Hardin, J.W. and J.M. Hilbe (2007), Generalized Linear Models and Extensions, second edition, College Station, TX: Stata Press.
Hilbe, J.M. (2011), Negative Binomial Regression, second edition, Cambridge: Cambridge University Press.
Hilbe, J.M. (2009), Logistic Regression Models, Boca Raton, FL: Chapman & Hall/CRC.
More informationComparing Odds Ratios and Marginal Effects from Logistic Regression and Linear Probability Models
Western Kentucky University From the SelectedWorks of Matt Bogard Spring March 11, 2016 Comparing Odds Ratios and Marginal Effects from Logistic Regression and Linear Probability Models Matt Bogard Available
More informationAn Empirical Study on Default Factors for US Sub-prime Residential Loans
An Empirical Study on Default Factors for US Sub-prime Residential Loans Kai-Jiun Chang, Ph.D. Candidate, National Taiwan University, Taiwan ABSTRACT This research aims to identify the loan characteristics
More informationSAS/STAT 14.1 User s Guide. The HPFMM Procedure
SAS/STAT 14.1 User s Guide The HPFMM Procedure This document is an individual chapter from SAS/STAT 14.1 User s Guide. The correct bibliographic citation for this manual is as follows: SAS Institute Inc.
More informationDescription Quick start Menu Syntax Options Remarks and examples Stored results Methods and formulas Acknowledgment References Also see
Title stata.com tssmooth shwinters Holt Winters seasonal smoothing Description Quick start Menu Syntax Options Remarks and examples Stored results Methods and formulas Acknowledgment References Also see
More informationStatistics 175 Applied Statistics Generalized Linear Models Jianqing Fan
Statistics 175 Applied Statistics Generalized Linear Models Jianqing Fan Example 1 (Kyhposis data): (The data set kyphosis consists of measurements on 81 children following corrective spinal surgery. Variable
More information