Modeling of Claim Counts with k-fold Cross-validation


Alicja Wolny-Dominiak ¹

Abstract

In the ratemaking process, a ranking that takes into account the number of claims generated by a policy in a given insurance period may be helpful. For example, such a ranking makes it possible to assign a newly concluded insurance policy to the appropriate tariff group. For this purpose, in this paper we analyze models applicable to count variables. In the first part of the paper we present the classical Poisson regression and a modified regression model for data with a large number of zeros in the count variable, which is a common situation in insurance data. In the second part we extend the classical Poisson regression by adding a random effect. The goal is to avoid the unrealistic assumption that in every class all insurance policies have the same expected number of claims. In the last part of the paper we propose to use k-fold cross-validation to identify the factors that influence the number of insurance claims the most. Then, having estimated the parameters of the Poisson distribution, we create the ranking of policies using the estimated parameters of the model that gives the smallest cross-validation mean squared error. In the paper we use a real-world data set taken from the literature. All computations were carried out in R, a free software environment.

Keywords: automobile insurance, ratemaking, risk classification, GLM, HGLM, ZIP models

1 Introduction

Every person applying for an insurance policy is assigned to a class that is homogeneous in terms of the tariff system. One of the criteria used for assigning an individual to a certain class is the number of claims. Modelling the number of claims in a given insurance portfolio is therefore a very important task for insurance companies. In the paper we propose a simple procedure for creating a ranking of insurance policies and for classifying them according to the number of claims. It allows a preliminary classification of a new policy into a group with an adequate premium level.

A very common choice of method for modelling the number of claims is a regression model based on the Poisson distribution (Poisson regression), which is a special case of a Generalized Linear Model (GLM). However, insurance portfolios have a very specific characteristic: for many policies no claims are observed in the insurance history for a given period. This means that the data contain many zeros and, as a consequence, Poisson regression may not give satisfactory results. Therefore, when creating the ranking, the GLM, the ZIP (zero-inflated Poisson) model and the model with a random effect were considered. The ranking procedure used k-fold cross-validation, and the ranking was then discretized according to the parameter λ. We build many different models and then use 10-fold cross-validation to recognize which rating variables have an impact on the presence of zeros in the policy portfolios. The data for the illustrative example have been taken from the literature [6]. All computations were conducted in R, the free software environment. The procedures for building a model with a random effect and for cross-validation were written in the R language.

¹ University of Economics in Katowice, Faculty of Economics, Department of Statistical and Mathematical Methods in Economics, ul. Bogucicka 14, Katowice, alicjawolny-dominiak@uekatowicepl

2 Modelling the number of claims in automobile insurance with the use of a cross-validation procedure

The linear regression models are used for creating a ranking of insurance policies according to the number of claims.

Generalized Linear Models (GLM)

In these models we assume that the number of claims is a dependent variable $Y$ that follows the Poisson distribution and depends on a certain system of predictors [1]:

$$P(Y_i = y_i) = \frac{e^{-\lambda_i}\,\lambda_i^{y_i}}{y_i!}, \quad i = 1,\dots,n, \tag{1}$$

where $Y_i$ is the number of claims for the $i$th insured person, $y_1,\dots,y_n$ are independent and have equal variances, and the average number of claims is equal to the variance. The parameter $\lambda_i$ is the expected number of claims and it depends on predictors $X_j$, $j = 1,\dots,k$, that describe the insured individual or vehicle, e.g. sex, age, engine capacity. The logarithm is used as a link function:

$$\log\lambda_i = \sum_{j=1}^{k} \beta_{ji} X_{ji}. \tag{2}$$

Thus the expected value equals:

$$\lambda_i = \mu_i = e^{\sum_{j=1}^{k} \beta_{ji} x_{ji}}. \tag{3}$$

We can see that for every linear combination of predictors the expected number of claims is always positive. The parameter $\lambda_i$ is adjusted with the use of the exposure to risk $d_i$ for the $i$th policy. This factor shows what part of the analysed period was covered by a given policy:

$$\lambda_i = d_i\, e^{\sum_{j=1}^{k} \beta_{ji} x_{ji}}. \tag{4}$$

When creating the ranking we used $\min \lambda_i$ as a criterion.

The independence assumption in the above model may not be fulfilled. In that case the solution is to use a mixed model and introduce a random effect $\nu$.

Hierarchical Generalized Linear Models (HGLM)

In the case of automobile insurance data, the region or the vehicle model can be treated as a random effect $\nu$. The hierarchical generalized linear model (HGLM) with the variable $y \mid u$ following the Poisson distribution has the form:

$$\mu = E(y \mid u) = e^{X\beta + \nu}, \quad \mathrm{Var}(y \mid u) = \phi\, v(\mu), \quad \nu = \log u, \tag{5}$$

where $\beta = [\beta_1,\dots,\beta_I]$, $u = [u_1,\dots,u_K]$ and $X$ is the model matrix. The distribution of the random effect may belong to the exponential dispersion family of distributions, e.g. the gamma distribution with parameter $\alpha$:

$$E(u) = \psi, \quad \mathrm{Var}(u) = \alpha\, v(\psi). \tag{6}$$

The structural parameters of the model have the following interpretation:
- the parameter $\beta_i$, $i = 1,\dots,I$, measures the influence of the $i$th predictor on the number of claims (which is equal for every category),
- the parameter $u_k$, $k = 1,\dots,K$, measures the risk level for every category (which is different for every category).
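Equations (1) to (4) can be illustrated with a short R sketch. It is not the paper's code: the data are simulated, and the names `policies`, `age_class`, `exposure` and `true_rate` are hypothetical. The exposure $d_i$ enters the Poisson regression as an offset, `offset(log(exposure))`, which is one common way of fitting a model of the form (4) with `glm`.

```r
# Minimal sketch on simulated data; all names and rates below are hypothetical.
set.seed(1)
n <- 500
policies <- data.frame(
  age_class = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  exposure  = runif(n, min = 0.2, max = 1)  # d_i: part of the period covered
)

# Assumed claim frequencies per class, used only to simulate the counts
true_rate <- c(A = 0.30, B = 0.15, C = 0.05)
policies$claims <- rpois(
  n, lambda = policies$exposure * true_rate[as.character(policies$age_class)]
)

# Poisson regression with log link; the offset log(d_i) gives equation (4)
fit <- glm(claims ~ age_class + offset(log(exposure)),
           family = poisson(link = "log"), data = policies)

# Expected number of claims per policy: lambda_i = d_i * exp(x_i' beta)
policies$lambda <- predict(fit, type = "response")

# The same quantity computed by hand from the estimated coefficients
X <- model.matrix(~ age_class, data = policies)
lambda_by_hand <- policies$exposure * exp(drop(X %*% coef(fit)))

# Ranking by the min-lambda criterion (rank 1 = smallest expected claim count)
policies$rank <- rank(policies$lambda, ties.method = "min")
head(policies[order(policies$rank), ])
```

Because of the exponential in (3), every fitted $\lambda_i$ is positive whatever the signs of the coefficients, which is the property noted in the text.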

Zero-inflated generalized linear model

Another model used for modelling the number of claims is the ZIP model, in which the count response variable has many zero values. This is exactly the case when modelling the number of claims. When analysing different risk portfolios it can be noticed that for many policies no claim is observed, and if claims occur their number is one, two or three, and very rarely more. In the ZIP model the independent variables $Y_i$ take the value zero, $Y_i \sim 0$, with probability $\omega_i$, or values from the Poisson distribution, $Y_i \sim \mathrm{Pois}(\lambda_i)$, with probability $1 - \omega_i$. It can be written in the form [5]:

$$P(Y_i = y_i) = \begin{cases} \omega_i + (1 - \omega_i)\, e^{-\lambda_i}, & \text{if } y_i = 0, \\ (1 - \omega_i)\, \dfrac{e^{-\lambda_i}\,\lambda_i^{y_i}}{y_i!}, & \text{if } y_i > 0, \end{cases} \quad i = 1,\dots,n. \tag{7}$$

Thus in the ZIP model we have two parameters: $\lambda_i$ and $\omega_i$. Both parameters, as in the case of Poisson regression, are linked with predictor variables through the following link functions:

$$\log\left(\frac{\omega_i}{1 - \omega_i}\right) = \sum_{j=1}^{t} \gamma_{ji} Z_{ji}, \qquad \log\lambda_i = \sum_{j=1}^{k} \beta_{ji} X_{ji}, \tag{8}$$

where $Z_1,\dots,Z_t$ are the explanatory variables for the first equation and $X_1,\dots,X_k$ for the second one. The expected value and variance of the number of claims for the $i$th policy in the ZIP model are, respectively:

$$E(Y_i) = \lambda_i (1 - \omega_i), \qquad D^2(Y_i) = (1 - \omega_i)(\lambda_i + \omega_i \lambda_i^2). \tag{9}$$

Similarly to the Poisson regression case, in the ZIP model we assume that the average number of claims equals the variance. When excessive dispersion occurs, the solution is to use the negative binomial distribution [4].

In order to unify the process of comparing the presented models, the choice of the model for the number of claims and the choice of the combination of predictor variables responsible for generating zeros in the policies have been supported by statistical learning methods. In general, in these methods we assume we are given a training data set $D = \{(x_i, y_i),\ i = 1,\dots,N\}$, where $x_i, y_i \in \mathbb{R}$. Moreover, we assume that the data are iid (independent and identically distributed) and have been drawn from a population with a multidimensional distribution defined by an unknown density function:

$$p(x, y) = p(x)\, p(y \mid x). \tag{10}$$

The task is to search a given set of functions $H = \{f(x, \theta) : \theta \in \Omega\}$, where $\theta$ is a vector of model parameters, and to find the best element. Using the model $f(x, \theta) \in H$, which is always a simplified equivalent of the analysed phenomenon, we accept some errors that are simply the consequence of taking theoretical values instead of real values of the response variable. These errors (for a given observation) are measured by so-called loss functions $L(y, f(x, \theta))$. In the concept of statistical learning, the risk functional is considered. It measures the overall loss, i.e. the sum of errors over all possible observations. One of the methods of estimating the value of the risk functional is the cross-validation method (CV) [2]. In this paper we use the 10-fold cross-validation algorithm, i.e.:

a) randomly divide the training set into $k = 10$ approximately equally sized parts ($n$ is the size of the training set, $m_l$ is the size of the $l$th subset, $l = 1,\dots,10$),
b) build a model 10 times using 9 of the 10 parts ($n - m_l$ observations), treating the excluded observations as the validation set,
c) calculate 10 times the value of the mean squared error $\mathrm{MSE}_l = \dfrac{\sum (y - \hat{\mu}_l)^2}{m_l}$ using the validation set,
d) estimate the cross-validation error: $cv = \sum_{l=1}^{10} \dfrac{m_l}{n}\, \mathrm{MSE}_l$.

The model with the smallest $cv$ value is selected.
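Steps a) to d) can be sketched in R as follows. This is an illustrative sketch under stated assumptions, not the procedure written for the paper: the data are simulated, the helper `cv_poisson` and the variables `d`, `x`, `y` and `fold` are hypothetical names, and a plain Poisson GLM stands in for whichever model is being validated. The same fold assignment is reused for both candidate models so that their $cv$ values are comparable.

```r
# Minimal sketch of 10-fold cross-validation; all names are hypothetical.
set.seed(123)
n <- 400
d <- data.frame(x = factor(sample(c("A", "B"), n, replace = TRUE)))
d$y <- rpois(n, lambda = ifelse(d$x == "A", 0.1, 0.4))

# a) randomly divide the data into k = 10 parts of roughly equal size
k <- 10
fold <- sample(rep(seq_len(k), length.out = n))

cv_poisson <- function(formula, data, fold) {
  n <- nrow(data)
  response <- all.vars(formula)[1]
  cv <- 0
  for (l in sort(unique(fold))) {
    valid <- data[fold == l, , drop = FALSE]   # m_l observations
    train <- data[fold != l, , drop = FALSE]   # n - m_l observations
    # b) fit the model without the l-th part
    fit <- glm(formula, family = poisson(link = "log"), data = train)
    # c) mean squared error on the l-th part
    mu_hat <- predict(fit, newdata = valid, type = "response")
    mse_l <- mean((valid[[response]] - mu_hat)^2)
    # d) weighted sum: cv = sum over l of (m_l / n) * MSE_l
    cv <- cv + nrow(valid) / n * mse_l
  }
  cv
}

cv_full <- cv_poisson(y ~ x, d, fold)
cv_null <- cv_poisson(y ~ 1, d, fold)
c(full = cv_full, null = cv_null)
```

The candidate with the smaller value would be selected; comparing different sets of predictors in this way is how the rating variables are screened.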

3 Procedure of creating the ranking of property insurance policies and classification of these policies

The procedure of building a ranking of policies using the linear models presented in the previous part of the paper may be formulated in a few steps:

STEP 1: Estimating the parameter λ for every policy in the portfolio using three different models: the generalized linear model, the hierarchical generalized linear model and the zero-inflated generalized linear model.
STEP 2: Applying the cross-validation procedure to every model from Step 1.
STEP 3: Choosing the model with the smallest cv error.
STEP 4: Creating the ranking of insurance policies for every combination of predictor variables X i, using min λ as a criterion.
STEP 5: Discretizing the ranking according to the values of the parameter λ and thus obtaining an insurance risk classification, which makes it possible to classify a new policy into a group with an adequate premium level.

Based on the estimated parameter λ for the chosen model, we create the ranking and conduct discretization in order to obtain different classes of insurance risk. Discretization means dividing the ordered set of values of a given continuous variable into a finite number of disjoint intervals. Labels can be assigned to these intervals, e.g. high insurance risk level, neutral to risk, etc. The problem is how to determine the cut points. These cut points should separate the objects from different risk classes in the best possible way. There are two main approaches to discretization: agglomerative and divisive. The first one starts with every single empirical value of the continuous variable belonging to a different interval, and then neighbouring intervals are merged iteratively until the maximum value of a measure of the homogeneity of the subsets is reached. The second approach starts with one big interval covering all empirical values of the continuous variable, which is iteratively divided using previously determined cut points.

4 Empirical example

In order to illustrate the process of creating the ranking and discretizing it, the necessary procedures were implemented in the R environment. An automobile insurance data set including information about the number of claims has been used for the computations. The following variables form the data set and have been considered in the model:

1. Driverage: age of the insured person (driver).
2. Region: classes from 1 to 7.
3. MCclass: classes from 1 to 7. These classes were created based on the EV coefficient defined as
   EV = (engine capacity in kW × 100) / (vehicle weight in kg + 75),
   where 75 kg is the average weight of a driver.
4. Vehage: age of the vehicle.
5. Numclaims: number of claims (the sum within the class).

Dependences between the predictor variables influencing the number of claims are presented in Figure 1 in Appendix A.
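The EV coefficient behind the MCclass variable is simple arithmetic, shown here as a small R sketch. The function name `ev_ratio`, the data frame `vehicles` and its values are hypothetical and serve only as a worked example. The boundaries that map EV values to the seven MCclass classes are not given in the text, so they are not reproduced.

```r
# EV = engine capacity in kW * 100 / (vehicle weight in kg + 75),
# where 75 kg stands for the average weight of a driver.
ev_ratio <- function(power_kw, weight_kg, driver_kg = 75) {
  power_kw * 100 / (weight_kg + driver_kg)
}

# Hypothetical vehicles, for illustration only
vehicles <- data.frame(power_kw = c(11, 35, 74), weight_kg = c(125, 180, 205))
vehicles$EV <- ev_ratio(vehicles$power_kw, vehicles$weight_kg)
vehicles
```

For the first hypothetical vehicle this gives 11 × 100 / (125 + 75) = 5.5.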

Procedure for creating the ranking

STEP 1: We model the number of claims with the use of the three types of models presented above.

Model 1: GLM for the variable Numclaims assuming the Poisson distribution.

R Code

    data(dataset)
    # Poisson regression of the claim count on the four rating factors
    glmformula = Numclaims ~ Driverage + Region + MCclass + Vehage
    glmmodel1 = glm(glmformula, family = poisson(link = "log"), data = dataset)
    summary(glmmodel1)

    Parameter     β_i    Standard error    e^{β_i}
    Intercept
    DriverageA    0                        1
    DriverageB
    DriverageC
    DriverageD
    DriverageE
    DriverageF
    DriverageG
    RegionA       0                        1
    RegionB
    RegionC
    RegionD
    RegionE
    RegionF
    RegionG
    MCclassA      0                        1
    MCclassB
    MCclassC
    MCclassD
    MCclassE
    MCclassF
    MCclassG
    VehageA       0                        1
    VehageB
    VehageC
    VehageD

Table 1: Parameters for Model 1

The following categories were chosen as reference categories: DriverageA, RegionA, MCclassA, VehageA.

Model 2: HGLM of the Poisson-gamma type for the variable Numclaims, assuming the Poisson distribution and treating the variable Region as a random effect with the gamma distribution.

R Code

    data(dataset)
    modelpoissongamma = function(x = x, Z = Z, Y = Y, datasetletters = datasetletters, glmformula = Numclaims ~ Driverage + Region + MCclass + Vehage)

    Parameter     β_i    Standard error    e^{β_i}
    Intercept
    DriverageA    0                        1
    DriverageB
    DriverageC
    DriverageD
    DriverageE
    DriverageF
    DriverageG
    MCclassA      0                        1
    MCclassB
    MCclassC
    MCclassD
    MCclassE
    MCclassF
    MCclassG
    VehageA       0                        1
    VehageB
    VehageC
    VehageD

Table 2: Parameters for Model 2, fixed effects

    Parameter     β_i    Standard error    e^{β_i}
    RegionA
    RegionB
    RegionC
    RegionD
    RegionE
    RegionF
    RegionG

Table 3: Parameters for Model 2, random effect Region
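The $e^{\beta_i}$ columns of the parameter tables can be read as multiplicative factors. The short derivation below starts from equation (5) and assumes 0/1 dummy coding $x_i$ of the categories, which is consistent with the reference categories having $\beta_i = 0$ and $e^{\beta_i} = 1$. The symbol $\beta_0$ for the intercept is notation introduced here for illustration.

```latex
\begin{aligned}
\mu &= e^{X\beta + \nu} = e^{\nu}\, e^{X\beta}, \qquad \nu = \log u, \\
    &= u_k \cdot e^{\beta_0} \prod_{i=1}^{I} \left(e^{\beta_i}\right)^{x_i},
       \qquad x_i \in \{0, 1\}.
\end{aligned}
```

Under these assumptions, the expected number of claims for a policy in region $k$ is the baseline $e^{\beta_0}$ multiplied by $e^{\beta_i}$ for each non-reference category the policy belongs to and by the region factor $u_k$.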

Model 3: ZIP model taking into account a large number of zero values of the variable Numclaims.

R Code

    library(pscl)  # provides zeroinfl()
    data(dataset)
    # count part: the four rating factors; zero-inflation part: intercept only
    ZIPmodel3 = zeroinfl(formula = Numclaims ~ Driverage + Region + MCclass + Vehage | 1, data = dataset)
    summary(ZIPmodel3)

The function zeroinfl is from the library {pscl}.

    Parameter     β_i    Standard error    e^{β_i}
    Intercept
    DriverageA    0                        1
    DriverageB
    DriverageC
    DriverageD
    DriverageE
    DriverageF
    DriverageG
    RegionA       0                        1
    RegionB
    RegionC
    RegionD
    RegionE
    RegionF
    RegionG
    MCclassA      0                        1
    MCclassB
    MCclassC
    MCclassD
    MCclassE
    MCclassF
    MCclassG
    VehageA       0                        1
    VehageB
    VehageC
    VehageD

Table 4: Parameters for Model 3

The probability that the variable Numclaims takes the value zero equals 82%.

STEP 2: The ten-fold cross-validation procedure was applied to every model from Step 1, giving the corresponding cv errors.

Model 1

    MSE on one of 10 validation parts in CV method:
    MSE on one of 10 validation parts in CV method:
    MSE on one of 10 validation parts in CV method:
    MSE on one of 10 validation parts in CV method:
    MSE on one of 10 validation parts in CV method:
    MSE on one of 10 validation parts in CV method:
    MSE on one of 10 validation parts in CV method:
    MSE on one of 10 validation parts in CV method:
    MSE on one of 10 validation parts in CV method:
    MSE on one of 10 validation parts in CV method:
    MSECV for the model of the form: Numclaims ~ Driverage + Region + MCclass + Vehage
    MSECV for the model equals:
    *************************************************************************

Model 2

    MSE on one of 10 validation parts in CV method:
    MSE on one of 10 validation parts in CV method:
    MSE on one of 10 validation parts in CV method:
    MSE on one of 10 validation parts in CV method:
    MSE on one of 10 validation parts in CV method:
    MSE on one of 10 validation parts in CV method:
    MSE on one of 10 validation parts in CV method:
    MSE on one of 10 validation parts in CV method: 1977
    MSE on one of 10 validation parts in CV method:
    MSE on one of 10 validation parts in CV method: 1679
    MSECV for the model equals:
    *************************************************************************

Model 3

    MSE on one of 10 validation parts in CV method:
    MSE on one of 10 validation parts in CV method:
    MSE on one of 10 validation parts in CV method:
    MSE on one of 10 validation parts in CV method:
    MSE on one of 10 validation parts in CV method:
    MSE on one of 10 validation parts in CV method:
    MSE on one of 10 validation parts in CV method:
    MSE on one of 10 validation parts in CV method:
    MSE on one of 10 validation parts in CV method:
    MSE on one of 10 validation parts in CV method:
    MSECV for the model of the form: Numclaims ~ Driverage + Region + MCclass + Vehage | 1
    MSECV for the model equals:
    *************************************************************************

STEP 3: The smallest value of MSE cv was obtained for Model 3, i.e. for the zero-inflated generalized linear model. Thus this model was used in the subsequent steps of creating the ranking.

STEP 4/STEP 5: The number of combinations of different empirical values of the predictor variables X i equals 1372. After discretization, every combination was assigned a label representing a risk class: from 10 (the lowest risk of a claim occurring) to 1 (the highest risk of a claim occurring). The first five combinations of categories in the ranking are presented in Table 5 for illustration.
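Steps 4 and 5 can be sketched in R as follows. The sketch makes two loud assumptions: the λ values are simulated rather than taken from Model 3, and the cut points are deciles (equal-frequency intervals), since the text does not state which cut points were used. The names `lambda`, `ranking`, `cuts` and `risk_class` are hypothetical.

```r
# Minimal sketch of ranking and discretization; values are simulated.
set.seed(42)
# Hypothetical expected claim counts for 1372 combinations of categories
lambda <- rgamma(1372, shape = 2, rate = 20)

# STEP 4: ranking by the min-lambda criterion (smallest lambda first)
ranking <- data.frame(combination = seq_along(lambda), lambda = lambda)
ranking <- ranking[order(ranking$lambda), ]

# STEP 5: discretization into 10 risk classes.
# Assumption: decile cut points; the paper does not report its cut points.
cuts <- quantile(ranking$lambda, probs = seq(0, 1, by = 0.1))
ranking$risk_class <- cut(ranking$lambda, breaks = cuts,
                          labels = 10:1, include.lowest = TRUE)

head(ranking, 5)           # top of the ranking: class 10, the lowest risk
table(ranking$risk_class)  # number of combinations per class
```

With the labels running from 10 down to 1, the interval with the smallest λ values receives class 10 and the interval with the largest values receives class 1, matching the labelling described above.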

    Driverage     Region     MCclass     Vehage     λ     Risk class
    DriverageG    RegionG    MCclassG    VehageD
    DriverageG    RegionE    MCclassG    VehageD
    DriverageG    RegionG    MCclassG    VehageC
    DriverageG    RegionG    MCclassD    VehageD
    DriverageG    RegionG    MCclassA    VehageD

Table 5: Part of the ranking and classification based on Model 3

5 Summary

The procedure proposed in the paper for recognizing risk classes in insurance policy portfolios makes it possible to differentiate policies with no claims observed in the insurance history. The minimum-λ criterion used in the classification makes the risk classes and the associated premiums fairer for individuals applying for an insurance policy. The main disadvantage of the ZIP model, which turned out to be the best in terms of the cv error criterion, is that within every risk class the policies have an equal expected number of claims, which is an unrealistic assumption. A solution to this issue may be to use a mixed Poisson model and introduce a random effect that would differentiate policies (ZIP regression with a random effect). However, estimating this type of model is computationally very demanding, which discourages its use in real-world applications.

References

[1] Denuit, M., Marechal, X., Pitrebois, S., and Walhin, J.: Actuarial Modelling of Claim Counts. John Wiley & Sons Ltd, 2007.
[2] Gatnar, E.: Ensemble Approach in Classification and Regression (in Polish). Wydawnictwo Naukowe PWN, Warszawa, 2008.
[3] Gatnar, E., and Walesiak, M.: Qualitative and Symbolic Data Analysis with R (in Polish). Wydawnictwo C. H. Beck, Warszawa, 2011.
[4] Kopczewska, K., Kopczewski, T., and Wójcik, P.: Quantitative Methods in R. Economic and Financial Applications (in Polish). CeDeWu.pl Wydawnictwa Fachowe, Warszawa, 2009.
[5] Lambert, D.: Zero-Inflated Poisson Regression, with an Application to Defects in Manufacturing. Technometrics, 18, (1992).
[6] Ohlsson, E., and Johansson, B.: Non-Life Insurance Pricing with Generalized Linear Models. Springer-Verlag, Berlin, 2010.

Appendix A

R Code

    library(vcd)  # pairs() method for contingency tables; also attaches grid (gpar)
    dataset = read.csv2(file = "e:/dataset.csv")
    attach(dataset)
    # four-way contingency table of the rating factors
    tabkontyng = table(driverage, region, mcclass, vehage)
    diag = list(var_offset = 1.1, rot = 0, fill = grey.colors, gp_vartext = gpar(fontsize = 10), gp_leveltext = gpar(fontsize = 10))
    pairs(tabkontyng, diag_panel_args = diag, highlighting = 2)
    detach(dataset)

[Figure 1 panels: Driver_age (A to G), Region (A to G), MC_class (A to G), Veh_age (A to D)]

Figure 1: Dependences between predictor variables influencing the number of claims
