Estimation Parameters and Modelling Zero Inflated Negative Binomial


CAUCHY - JURNAL MATEMATIKA MURNI DAN APLIKASI
Volume 4(3) (2016), Pages 115-119

Cindy Cahyaning Astuti (1), Angga Dwi Mulyanto (2)
(1) Muhammadiyah University of Sidoarjo, Sidoarjo, Indonesia
(2) Alpha Research, Malang, Indonesia
Email: cindy.cahyaning@umsida.ac.id, angga.dwi.m@gmail.com

ABSTRACT

A regression model relating predictor variables to a Poisson-distributed response variable is called a Poisson regression model. Because Poisson regression requires the mean and the variance to be equal, it is not appropriate for overdispersed data. Poisson regression can be used to analyze count data, but it cannot handle an excess of zero values in the response variable. An alternative model that is better suited to overdispersed data and can handle excess zeros in the response variable is the Zero Inflated Negative Binomial (ZINB) model. In this research, ZINB is applied to cases of Tetanus Neonatorum in East Java. The aims are to examine the likelihood function, to form an algorithm for estimating the ZINB parameters, and to apply the ZINB model to Tetanus Neonatorum cases in East Java. The parameters are estimated by Maximum Likelihood Estimation (MLE), and the likelihood function is maximized using the Expectation Maximization (EM) algorithm. Tests of the ZINB regression model showed that the predictor variables with a significant effect in the partial test for the negative binomial model are the percentage of K4 visits by pregnant mothers and the percentage of mothers giving birth assisted by health workers, while the predictor variable with a significant effect in the partial test for the zero inflation model is the percentage of neonate visits.
Keywords: Overdispersion, Tetanus Neonatorum, Zero Inflation, Zero Inflated Negative Binomial (ZINB)

Submitted: 20 August 2016. Reviewed: 31 October 2016. Accepted: 29 November 2016.
DOI: http://dx.doi.org/10.18860/ca.v4i3.3656

INTRODUCTION

Regression analysis is used to determine the relationship between one or more response variables (Y) and one or more predictor variables (X). The classical linear model assumes that the response variable follows a normal distribution, but in practice the response variable often does not. To overcome this, the classical linear model was extended to the Generalized Linear Model (GLM) [1]. The GLM assumes that the response variable follows a distribution from the exponential family, which is a more general assumption. Research data often have a response variable that follows a Poisson distribution, and the regression analysis used for this kind of data is Poisson regression. The Poisson regression model is commonly used to analyze count data and assumes Y ~ Poisson(μ). A key assumption of Poisson regression is that the variance equals the mean, a condition called equidispersion.

Count data often contain more than 50 percent zero values in the response variable (zero inflation) [2]. Such a large proportion of zeros can affect the accuracy of inference. Poisson regression can be used to analyze count data, but it cannot resolve the problem of excess zeros. When the response variable has many zero observations, the Zero Inflated Poisson (ZIP) regression model can be used instead [3]. However, if there are many zero observations and overdispersion also occurs, ZIP regression is not appropriate. Overdispersion is the condition in which the variance of the response is greater than its mean, so that the Poisson assumption is violated. If count data show both zero inflation and overdispersion, the Zero Inflated Generalized Poisson (ZIGP) regression model can be used [2]. Another alternative for modelling data with many zero observations and overdispersion is the Zero Inflated Negative Binomial (ZINB) regression model. The ZINB model is formed from a Poisson-Gamma mixture distribution [4]. It is suitable for such data because it does not require the variance to equal the mean, and it has a dispersion parameter, commonly denoted by κ (kappa), that describes the variation of the data.

The purpose of this research is to examine the form of the likelihood, to estimate the parameters of the ZINB model, and to apply the ZINB model to Tetanus Neonatorum cases.

METHODS

This research used secondary data from the East Java Health Profile 2012 [5]. The units of observation were the 38 districts/cities of East Java province, comprising 29 districts and 9 cities. The response variable (Y) is the number of Tetanus Neonatorum cases in each district/city, and four predictor variables (X) were used. The operational definitions of the variables are as follows:

a. Response variable (Y): number of cases of Tetanus Neonatorum
b. Predictor variables (X):
   1. Percentage of K4 visits by pregnant mothers (X1)
   2. Percentage of Tetanus Toxoid (TT) immunization in pregnant women (X2)
   3. Percentage of mothers giving birth assisted by health workers (X3)
   4. Percentage of neonate visits (X4)

The analysis proceeded as follows:

a. Identify the probability function of the ZINB model.
b. Determine the likelihood function of the ZINB model from that probability function.
c. Develop an algorithm for parameter estimation based on the likelihood function. The parameters of the ZINB model were estimated by the MLE method, solved using the EM algorithm.
d. Fit the ZINB model to Tetanus Neonatorum cases in East Java province.
e. Test the significance of the model parameters with a simultaneous test and partial tests. The simultaneous test uses the G statistic, and the partial tests use the t statistic.

RESULTS AND DISCUSSION

The parameters of the ZINB model were estimated by Maximum Likelihood Estimation (MLE), with the Expectation Maximization (EM) algorithm used to maximize the likelihood function. The probability function of the ZINB model is defined as:

\[
P(Y_i = y_i) =
\begin{cases}
\pi_i + (1 - \pi_i)\left(\dfrac{1}{1 + \kappa\mu_i}\right)^{1/\kappa}, & \text{for } y_i = 0 \\[1ex]
(1 - \pi_i)\,\dfrac{\Gamma\left(y_i + \frac{1}{\kappa}\right)}{\Gamma\left(\frac{1}{\kappa}\right)\, y_i!}\left(\dfrac{1}{1 + \kappa\mu_i}\right)^{1/\kappa}\left(\dfrac{\kappa\mu_i}{1 + \kappa\mu_i}\right)^{y_i}, & \text{for } y_i > 0
\end{cases}
\]
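As an illustration only (the original analysis was carried out in SAS), the probability function above can be evaluated with a few lines of Python. This is a minimal sketch: the function name `zinb_pmf` and the example parameter values are assumptions made here, not part of the paper, and the inputs are assumed to satisfy μ > 0, κ > 0 and 0 ≤ π < 1.

```python
import math


def zinb_pmf(y: int, mu: float, kappa: float, pi: float) -> float:
    """Evaluate P(Y = y) for the ZINB probability function.

    mu    : mean of the negative binomial state (assumed > 0)
    kappa : dispersion parameter (assumed > 0)
    pi    : probability of the zero state (assumed in [0, 1))
    """
    r = 1.0 / kappa                   # 1 / kappa
    p0 = 1.0 / (1.0 + kappa * mu)     # 1 / (1 + kappa * mu)
    # Negative binomial part, computed on the log scale for stability.
    log_nb = (
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(p0)
        + y * math.log(1.0 - p0)      # log(kappa*mu / (1 + kappa*mu))
    )
    nb = math.exp(log_nb)
    if y == 0:
        return pi + (1.0 - pi) * nb   # zero state plus NB zeros
    return (1.0 - pi) * nb            # positive counts come from NB only


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to exercise the function.
    probs = [zinb_pmf(y, mu=2.0, kappa=0.5, pi=0.3) for y in range(200)]
    print("P(Y=0) =", probs[0])
    print("sum of probabilities over y = 0..199:", sum(probs))
```

Summing the probabilities over a long range of y is a quick sanity check that the two branches of the function together define a proper distribution.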

The EM algorithm consists of two stages, an expectation stage and a maximization stage. The probability function of the ZINB model has two conditions, y_i = 0 and y_i > 0, and the response variable likewise arises from two states, the zero state and the negative binomial state. To describe which state each y_i comes from, the latent variable Z_i is defined as:

\[
Z_i =
\begin{cases}
1, & \text{if } y_i \text{ comes from the zero state} \\
0, & \text{if } y_i \text{ comes from the negative binomial state}
\end{cases}
\]

The ZINB regression model consists of two sub-models.

Model for the negative binomial mean μ_i:

\[
\ln \mu_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}, \quad i = 1, 2, \ldots, n \text{ and } j = 1, 2, \ldots, p
\]

Model for the zero inflation probability π_i:

\[
\operatorname{logit} \pi_i = \gamma_0 + \sum_{j=1}^{p} \gamma_j x_{ij}, \quad i = 1, 2, \ldots, n \text{ and } j = 1, 2, \ldots, p
\]

The EM algorithm is an alternative method for maximizing a likelihood function when the data contain latent variables, here the newly defined variable Z_i. The expectation stage computes the expectation of the log-likelihood function. The maximization stage then finds the parameter estimates that maximize the expected log-likelihood obtained in the expectation stage.

The parameters of the ZINB model for Tetanus Neonatorum cases in East Java province were estimated and tested using SAS software. The results are shown in Table 1.

Table 1. Parameter Estimates and Parameter Tests of ZINB

Parameter   Estimate   SE       t value   Pr > |t|
β0          -5.847     3.602    -1.623    0.105
β1          -0.145     0.055    -2.644    0.008 *
β2          -0.006     0.010    -0.599    0.549
β3           0.233     0.101     2.295    0.022 *
β4          -0.023     0.067    -0.339    0.735
γ0          11.325    13.409     0.845    0.398
γ1           0.223     0.169     1.316    0.188
γ2          -0.296     0.179    -1.653    0.098
γ3           0.835     0.503     1.660    0.096
γ4          -1.078     0.539    -2.000    0.045 *
Test statistic G = 581.24
* p-value less than α = 0.05

The simultaneous parameter test is based on the G statistic. The value of G is 581.24, which is greater than χ²(0.05; 8) = 15.507. This shows that the predictor variables X1, X2, X3 and X4 simultaneously have a significant effect on the response variable.

The partial parameter tests are based on the t statistic. According to Table 1, two predictor variables in the negative binomial state model and one predictor variable in the zero inflation state model have an absolute t value greater than or equal to t(α/2; 37) = 2.00 and a p-value less than α (0.05). The predictor variables with a significant partial effect in the negative binomial state model are K4 visits by pregnant mothers (X1) and mothers giving birth assisted by health workers (X3), while the predictor variable with a significant partial effect in the zero inflation state model is the percentage of neonate visits (X4). The fitted ZINB model is therefore:

a. Negative binomial state model for μ:

\[
\hat{\mu} = \exp(-5.847 - 0.145 X_1 - 0.006 X_2 + 0.233 X_3 - 0.023 X_4)
\]

b. Zero inflation state model for π:

\[
\hat{\pi} = \frac{\exp(11.325 + 0.223 X_1 - 0.296 X_2 + 0.835 X_3 - 1.078 X_4)}{1 + \exp(11.325 + 0.223 X_1 - 0.296 X_2 + 0.835 X_3 - 1.078 X_4)}
\]

Coefficients that are not significant are retained in both the negative binomial state model and the zero inflation state model, because the intention is to show the contribution of each predictor variable to the response variable. The coefficients of the zero inflation model for π̂ are interpreted as follows:

1. Each additional 1 percent of K4 visits by pregnant mothers (X1) increases the chance of Tetanus Neonatorum cases by a factor of exp(0.223) = 1.250 relative to the original number of cases, holding the other variables constant.
2. Each additional 1 percent of Tetanus Toxoid (TT) immunization in pregnant women (X2) decreases the chance of Tetanus Neonatorum cases by a factor of exp(0.296) = 1.344 relative to the original number of cases, holding the other variables constant.
3. Each additional 1 percent of mothers giving birth assisted by health workers (X3) increases the chance of Tetanus Neonatorum cases by a factor of exp(0.835) = 2.305 relative to the original number of cases, holding the other variables constant.
4. Each additional 1 percent of neonate visits (X4) decreases the chance of Tetanus Neonatorum cases by a factor of exp(1.078) = 2.939 relative to the original number of cases, holding the other variables constant.
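The fitted equations and the multiplicative factors quoted above can be reproduced numerically. The sketch below is illustrative only: the coefficients are copied from Table 1, while the function names and the predictor values for the example district (90 percent on every variable) are hypothetical and were chosen purely to show the calculation.

```python
import math

# Coefficients copied from Table 1 (intercept first, then X1..X4).
BETA = [-5.847, -0.145, -0.006, 0.233, -0.023]    # negative binomial state
GAMMA = [11.325, 0.223, -0.296, 0.835, -1.078]    # zero inflation state


def linear_predictor(coefs, x):
    """Intercept plus the sum of coefficient * predictor value."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], x))


def fitted_mu(x):
    """Fitted negative binomial mean: exp of the linear predictor."""
    return math.exp(linear_predictor(BETA, x))


def fitted_pi(x):
    """Fitted zero inflation probability: inverse logit of the linear predictor."""
    eta = linear_predictor(GAMMA, x)
    return 1.0 / (1.0 + math.exp(-eta))


def zero_inflation_factors():
    """exp(|gamma_j|) for X1..X4, the factors used in the interpretation."""
    return [math.exp(abs(g)) for g in GAMMA[1:]]


if __name__ == "__main__":
    district = (90.0, 90.0, 90.0, 90.0)   # hypothetical percentages for X1..X4
    print("fitted mu:", fitted_mu(district))
    print("fitted pi:", fitted_pi(district))
    print("factors:", [round(f, 3) for f in zero_inflation_factors()])
```

Writing the inverse logit as 1 / (1 + exp(-η)) is algebraically the same as exp(η) / (1 + exp(η)) in the fitted equation for π.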
Turning to the discussion, some regression coefficients in the negative binomial state model and the zero inflation state model have signs contrary to theory. These are the percentage of mothers giving birth assisted by health workers (X3) in the negative binomial state model, and the percentage of K4 visits by pregnant mothers (X1) and the percentage of mothers giving birth assisted by health workers (X3) in the zero inflation state model. Signs contrary to theory are probably caused by multicollinearity. They may also be caused by the pattern of the data, in which these predictor variables are positively correlated with the response variable. In a subsequent study, multicollinearity among the predictor variables could be addressed using Principal Component Analysis (PCA).

CONCLUSION

Based on the results, the parameters of the Zero Inflated Negative Binomial (ZINB) model were estimated by Maximum Likelihood Estimation (MLE), with the Expectation Maximization (EM) algorithm used to maximize the likelihood function. In the parameter tests, the predictor variables with a significant effect on the number of Tetanus Neonatorum cases in the negative binomial state model are K4 visits by pregnant mothers (X1) and mothers giving birth assisted by health workers (X3), while in the zero inflation state model the predictor variable with a significant effect is the percentage of neonate visits (X4).
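To make the expectation stage of the EM algorithm more concrete, the sketch below computes the expected value of the latent variable Z_i given the data and the current parameter values. This is the usual E-step for zero-inflated models, written here under stated assumptions rather than taken from the authors' SAS implementation: the function name `e_step` and the example inputs are hypothetical, and μ_i, π_i and κ are assumed to be the current estimates from the previous iteration.

```python
def e_step(y, mu, pi, kappa):
    """Return E[Z_i | y_i] for each observation.

    y     : observed counts
    mu    : current negative binomial means mu_i
    pi    : current zero state probabilities pi_i
    kappa : current dispersion parameter (assumed > 0)
    """
    z_hat = []
    for y_i, mu_i, pi_i in zip(y, mu, pi):
        if y_i > 0:
            # A positive count can only come from the negative binomial state.
            z_hat.append(0.0)
        else:
            # Probability that the negative binomial state produces a zero.
            nb_zero = (1.0 + kappa * mu_i) ** (-1.0 / kappa)
            # Posterior probability that this zero comes from the zero state.
            z_hat.append(pi_i / (pi_i + (1.0 - pi_i) * nb_zero))
    return z_hat


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to exercise the function.
    print(e_step(y=[0, 3, 0], mu=[2.0, 2.0, 0.5], pi=[0.3, 0.3, 0.6], kappa=0.5))
```

In the maximization stage these expected values typically act as weights: the zero inflation parameters γ are updated by a logistic regression on the expected Z_i, and the parameters β and κ by a negative binomial fit weighted by 1 minus the expected Z_i. The two stages are repeated until the estimates stabilize.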

REFERENCES

[1] A. Agresti, Categorical Data Analysis, New York: John Wiley and Sons, Inc., 2002.
[2] F. Famoye and K. P. Singh, "Zero-Inflated Generalized Poisson Regression Model with an Application to Domestic Violence Data," Journal of Data Science, pp. 117-130, 2006.
[3] D. Lambert, "Zero-Inflated Poisson Regression, with an Application to Defects in Manufacturing," Technometrics, vol. 34, no. 1, 1992.
[4] J. M. Hilbe, Negative Binomial Regression, New York: Cambridge University Press, 2011.
[5] Dinas Kesehatan Provinsi Jawa Timur, Profil Kesehatan Provinsi Jawa Timur Tahun 2012, Surabaya: Dinas Kesehatan Provinsi Jawa Timur, 2013.