Flexible modeling of frequency-severity data


Faculty of Science

Flexible modeling of frequency-severity data

Olivier Vermassen

Master dissertation submitted to obtain the degree of Master of Statistical Data Analysis

Promotor: Prof. Dr. Christophe Ley
Co-promotor: Dr. Robin van Oirbeek

Statistics
Academic year

The author and the promotor give permission to consult this master dissertation and to copy it, or parts of it, for personal use. Any other use falls under the restrictions of copyright, in particular the obligation to explicitly mention the source when using results of this master dissertation.

Olivier Vermassen
August 27, 2017

Foreword

There are two main risk drivers in general insurance, namely claim frequency and claim severity. Many insurance models make restrictive distributional assumptions on these random variables and even assume them to be independent. As the title of this master thesis suggests, we will explore flexible models for frequency and severity data and relax the commonly made assumptions. A case study is presented with non-confidential data from an anonymous private insurer.

A special thanks goes to my co-promotor Dr. Robin van Oirbeek for introducing me to this topic and for his valuable remarks. In addition, I would like to thank Prof. Dr. Christophe Ley for his guidance and enthusiasm throughout this project.


Table of Contents

1 Introduction
2 Methods
  2.1 The classical tarification strategy
  2.2 Univariate modeling of random variables
  2.3 Modeling dependence with copula models
  2.4 A more flexible pricing strategy
  2.5 Model validation
3 Case study
  3.1 Exploratory Data Analysis
  3.2 Claim Frequency Analysis
  3.3 Claim Severity Analysis
  3.4 Dependence analysis
4 Conclusion
References


Abstract

Claim frequency and claim severity are the two main risk drivers in general insurance. Many insurance models make restrictive distributional assumptions on these random variables and even assume them to be independent. The premium charged to a customer is given by the product of the expected claim frequency and the expected claim severity. This master thesis explores flexible models for frequency and severity data and relaxes the commonly made assumptions. A motor insurance case study is provided to illustrate the proposed strategy.

Insurance data is semi-continuous: most customers did not incur a loss, resulting in a mixture of a degenerate distribution at 0 and a positive-continuous part. We start by fitting a binomial Generalized Additive Model to the claim probability component, i.e. modeling the probability that a customer will incur a loss. The positive-continuous part is split into two components, namely claim severity and conditional claim frequency. The conditional claim frequency models the actual number of claims a customer will have, i.e. 1, 2, 3, ... claims. We continue by fitting Generalized Additive Models for Location, Scale and Shape to these random variables. Special care is taken for large losses by using Extreme Value Theory. Finally, dependence between the claim severity and the conditional claim frequency is modeled by means of a Gaussian copula model. We find that our approach provides a better fit to the data, but does not improve predictive performance over the classical approach.

KEY WORDS: General insurance pricing; Generalized Additive Models for Location, Scale and Shape (GAMLSS); Extreme Value Theory (EVT); Dependence modeling


1 Introduction

Risk is "a condition in which there is a possibility of an adverse deviation from a desired outcome that is expected or hoped for" [34]. Possibility refers to the fact that we are dealing with probabilities, and adverse deviation implies that there is a potential loss. For instance, drivers have a positive probability of being involved in a car crash and, if the risk is realized, a potential cost is associated with it. How can members of society deal with such risks? The most well-known solution is insurance (see Figure 1.1). In principle, insurance is about exchanging uncertainty for certainty. The customer pays a fixed premium and obtains benefits in return, conditional on the occurrence of a pre-specified event that induces a financial loss. The financial losses insurers are confronted with are spread among a large community that comprises the whole customer base of the insurer. In sum, an individual can transfer his risk to a group, and potential losses are shared among the whole group. These principles are known as risk transfer and risk sharing, respectively. Risk sharing is related to the Law of Large Numbers, as by "...expanding the number of risks in a pool, the average loss (or gain) in an expanding pool of risks eventually becomes certain (or predictable)" [23, p. 7].

[Figure 1.1: Principle of insurance. A new customer exchanges uncertainty for certainty: he pays a premium to the insurer and receives benefits in return, while the insurer spreads the losses over the community.]

The insurer faces two potential problems with the construction of a community. Anti-selection implies that customers have more information about the risk than the insurer and is thus a form of information asymmetry. Moral hazard refers to customers behaving differently precisely because they have taken out insurance. How can the insurer deal with these problems? The most feasible solution for anti-selection is searching for rating variables that explain the risk of a customer. This is also known as segmentation and implies that two distinct customers will be offered two distinct prices. Insurance data is in general reliable because fraud

would lead to a loss of coverage. Furthermore, the data available is usually only correlated with the risk and not causal. For instance, data on swiftness of reflexes cannot be obtained. Moral hazard can be addressed by adding a time dimension to the modeling strategy: customers that behaved better than expected last year will receive a discount in the current year, and vice versa.

Non-life insurance is characterized by an inverted product cycle, which implies that premiums are received before any costs have to be paid. The premium in non-life insurance generally corresponds to an insurance contract that provides one-year coverage. Randomness comes into play as the insurer knows neither how many claims a customer will have in the following year, nor how large the claims will be. The former is known as claim frequency and the latter as claim severity. A tariff is constructed as the product of claim frequency and claim severity. The statistical challenge lies in the estimation of both expected frequency and severity from historic data.

Why does insurance work? It is based on the Weak Law of Large Numbers (WLLN). For independent and identically distributed random variables X_1, ..., X_n, n in N, with E[X_i] = mu < infinity, we have for all epsilon > 0:

$$P\left[\left|\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\right| > \varepsilon\right] \xrightarrow{\,n \to \infty\,} 0.$$

In insurance, n corresponds to the number of policies in a portfolio and X_i is the total cost related to policy i. It is usual to assume that policies are independent. The WLLN tells us that we can use the expected value as a premium, as long as the portfolio is large enough. In practice, the expected values of the claim frequency and severity are modelled by means of a Poisson and a gamma distribution, respectively. Furthermore, both random variables are assumed to be independent.

The concept of this thesis is flexible modelling and aims at relaxing the rather restrictive assumptions that are made in practice. For that reason, the main subject of this thesis is twofold:

1. investigate which other distributions/methods are appropriate for modelling the frequency and severity components;
2. investigate how dependence can be modelled between the frequency and severity components.

Having both allows constructing a bivariate density. We will determine with a case study to what extent these rather restrictive assumptions lead to over- or underestimation of the risk.
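As a quick illustration of the WLLN at work, the following minimal R sketch (all parameter values are hypothetical: a Poisson claim count with lambda = 0.15 and a gamma severity with mean 4,000 EUR) simulates a growing portfolio of independent compound losses and tracks the running average, which settles around E[N] * E[Y] = 600:

```r
# Minimal sketch, hypothetical parameters: compound Poisson-gamma losses.
set.seed(1)
lambda <- 0.15; shape <- 2; rate <- 1 / 2000   # E[N] = 0.15, E[Y] = 4000
n <- 10^5
# Total loss per policy: sum of a Poisson number of gamma claim amounts.
total_loss <- sapply(rpois(n, lambda), function(k) sum(rgamma(k, shape, rate)))
running_mean <- cumsum(total_loss) / seq_len(n)
plot(running_mean, type = "l", log = "x",
     xlab = "Number of policies n", ylab = "Average loss per policy")
abline(h = lambda * shape / rate, lty = 2)     # true pure premium = 600
```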

2 Methods

2.1 The classical tarification strategy

A key principle in non-life ratemaking is cost-based pricing. The costs related to customer i in a future year are random and can be decomposed into two components:

1. Claim frequency N_i;
2. Claim severity Y_{i,j}, where j refers to the j-th payment.

Every customer has a random aggregate loss L_i = sum_{j=1}^{N_i} Y_{i,j}, where L_i = 0 if N_i = 0. The expected value of this random sum equals [35, p. 25]:

$$
\begin{aligned}
E[L_i] &= E\left[E\left[\sum_{j=1}^{N_i} Y_{i,j} \,\middle|\, N_i\right]\right] \\
&= E\left[\sum_{j=1}^{N_i} E[Y_{i,j} \mid N_i]\right] \\
&= E\left[\sum_{j=1}^{N_i} E[Y_{i,j}]\right] \\
&= E\left[N_i\, E[Y_{i,1}]\right] \\
&= E[N_i]\, E[Y_{i,1}],
\end{aligned}
$$

under the assumption that Y_{i,j} is independent of N_i and that the Y_{i,j} are independent and identically distributed over j. E[L_i] will be charged to customer i as a premium, and relies upon the WLLN as explained in Section 1. Note that we ignore expenses, taxes and profit margins that would be added on top of the premium in practice. In the remainder of the master thesis, we will focus on the average claim severity Y_i = L_i / N_i. This set-up is known as frequency-severity modeling, since one component models the event of a claim, whilst the other component focuses on the claim amount, conditional on there being at least one claim. A covariate vector x_i will be used to segment the tariff and charge customer i a premium in line with his risk. Practitioners almost always assume that

N_i ~ Poi(lambda_i E_i) and Y_i ~ Ga(alpha_i, gamma_i), where Poi and Ga denote the Poisson and gamma distribution, respectively. Although insurance coverage is provided for one year, data records are broken down by year for accounting reasons. Hence, the logarithm of the exposure E_i, i.e. the proportion of the year insured, is added as an offset. The expected values of the random variables N_i and Y_i are modeled by means of a linear structure and a logarithmic link function:

$$\lambda_i = e^{\beta' x_i + \log E_i}, \qquad \frac{\alpha_i}{\gamma_i} = e^{\eta' x_i},$$

with beta the parameter vector for claim frequency and eta the parameter vector for claim severity. Note that this pricing strategy allows a different covariate vector for claim frequency and claim severity. The premium P_i is given by the multiplicative tariff under the assumption of independence,

$$P_i = \lambda_i \cdot \frac{\alpha_i}{\gamma_i},$$

and will be charged to the i-th customer. As explained in Section 1, the expected value of one customer is not sufficient to cover the claims suffered by one person, yet the premium income of the whole customer base should be able to cover all claims suffered, due to the Weak Law of Large Numbers. The remainder of Chapter 2 will focus on more flexible modeling strategies, including Extreme Value Theory (EVT), Generalized Additive Models for Location, Scale and Shape, and dependence modeling via copulas.
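As a hedged sketch of how this classical strategy is typically implemented in R, the following fits the two GLMs and forms the multiplicative premium; the data frame `policies` and the rating factors `x1`, `x2` are hypothetical placeholders, not variables from the case study:

```r
# Classical tariff sketch: Poisson frequency GLM with log-exposure offset,
# gamma severity GLM with log link, multiplicative premium under independence.
freq_fit <- glm(n_claims ~ x1 + x2 + offset(log(exposure)),
                family = poisson(link = "log"), data = policies)
sev_fit  <- glm(avg_severity ~ x1 + x2,
                family = Gamma(link = "log"),
                data = subset(policies, n_claims > 0))   # severity defined only if claims occur

# Pure premium P_i = E[N_i] * E[Y_i] per policy.
policies$premium <- predict(freq_fit, newdata = policies, type = "response") *
                    predict(sev_fit,  newdata = policies, type = "response")
```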

2.2 Univariate modeling of random variables

Generalized Linear Model

Generalized Linear Models (GLM) are omnipresent in actuarial science, as they allow one to understand the effects covariates have on a response variable Y_i (see for instance [28] and [19]). A GLM allows a strictly monotone transformation of the mean to depend on covariates in a linear manner. Moreover, GLMs have been developed for members of the exponential family:

$$f(y_i; \theta_i, \phi) = \exp\left(\frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi)\right),$$

where theta_i is the location parameter, phi the dispersion parameter, and a(.), b(.) and c(.) known functions. GLMs relax the normality assumption of the standard linear regression model. Well-known distributions that are part of the exponential family are the Poisson, gamma, normal and binomial distributions. More distributions can be obtained by considering a transformation of the response variable Y_i. For instance, a logarithmic transformation of the response variable allows working with a lognormal distribution. The real-valued functions a(.), b(.) and c(.) determine the distribution, and the moments can be written as:

$$E[Y_i] = b'(\theta_i), \qquad \operatorname{Var}(Y_i) = b''(\theta_i)\, a(\phi).$$

A GLM consists of three components. Firstly, the random component Y_i has a density belonging to the exponential family. Secondly, the systematic component eta_i = beta' x_i corresponds to the linear predictor; the covariate vector x_i of dimension p contains the covariate information for observation i and beta is the parameter vector (beta_0, beta_1, ..., beta_p). Finally, there is a link function g(.) that connects the expected value E[Y_i] = mu_i to the linear predictor: g(mu_i) = eta_i. The next section introduces a flexible alternative to the GLM.

Often of interest in actuarial modeling is a Poisson regression in order to model claim counts (see Section 2.1). A Poisson random variable N_i ~ Poi(lambda_i) has probability mass function:

$$P(N_i = k_i) = \frac{\lambda_i^{k_i} e^{-\lambda_i}}{k_i!} = e^{\,k_i \log \lambda_i - \lambda_i - \log(k_i!)},$$

and clearly belongs to the exponential family with a(phi) = 1, b(theta_i) = e^{theta_i} = lambda_i and c(k_i, phi) = -log(k_i!). We have that lambda_i > 0, hence a link function g: R^+ -> R is needed in order to avoid

generating negative predictions. For Poisson regression models, g is often chosen to be the logarithmic link function. Parameter estimates can be found by using maximum likelihood estimation, i.e. maximization of the log-likelihood. The log-likelihood for a member of the exponential family is:

$$\ell(\beta, \phi; y) = \sum_{i=1}^{n} \left[\frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi)\right],$$

and will be maximized over beta, resulting in parameter estimates beta-hat. Using that b'(theta_i) = mu_i, so that theta_i = (b')^{-1}(mu_i), the log-likelihood can be parameterized in terms of the mean mu_i = g^{-1}(beta' x_i):

$$\ell(\beta, \phi; y) = \sum_{i=1}^{n} \left[\frac{y_i (b')^{-1}(\mu_i) - b\!\left((b')^{-1}(\mu_i)\right)}{a(\phi)} + c(y_i, \phi)\right].$$

The deviance is defined as:

$$-2\left[\ell(\hat{\beta}, \phi; y) - \ell(y, \phi; y)\right],$$

with l(y, phi; y) the log-likelihood of the saturated model, where mu-hat_i = y_i. The deviance essentially measures the closeness of the observed and predicted values: the smaller the deviance, the better the fit. For a Poisson regression model, the deviance can be shown to equal

$$2 \sum_{i=1}^{n} \left[\, y_i \log\left(\frac{y_i}{\hat{\mu}_i}\right) - (y_i - \hat{\mu}_i) \right]$$

and can be used for hypothesis testing or out-of-sample validation. Whenever confronted with competing models, an information criterion can be defined that penalizes the likelihood:

$$-2\,\ell(\beta, \phi; y) + c \cdot (\text{degrees of freedom}).$$

The Akaike Information Criterion (AIC) corresponds to c = 2 and the Bayesian Information Criterion (BIC) to c = ln n, where n denotes the sample size.
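As a small worked check (assuming a fitted Poisson GLM object `freq_fit` as in the earlier sketch), the Poisson deviance formula above can be evaluated directly and compared with the value reported by glm():

```r
# Manual Poisson deviance versus glm()'s value; the 0 * log(0) term is 0 by convention.
y  <- freq_fit$y
mu <- fitted(freq_fit)
dev_manual <- 2 * sum(ifelse(y == 0, 0, y * log(y / mu)) - (y - mu))
all.equal(dev_manual, deviance(freq_fit))   # TRUE up to numerical error
AIC(freq_fit)                               # -2 * loglik + 2 * (degrees of freedom)
```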

Generalized Additive Model for Location, Scale and Shape

An extension of the GLM is the Generalized Additive Model (GAM), which allows for smooth covariate effects. The Generalized Additive Model for Location, Scale and Shape (GAMLSS) is even more flexible and has both the GLM and the GAM as special cases. GAMLSS was initially developed by Stasinopoulos & Rigby [33]. It has not received a great deal of attention in insurance, but has for instance been used in Denuit et al. [21] and Antonio et al. [1] to model insurance data. A major feature of GAMLSS is that it relaxes the distributional assumption on the response, since it supports a wide array of flexible parametric distributions. Moreover, the distributional family encompasses all parametric distributions that are part of the exponential family. Some examples are provided in Tables 2.1 and 2.2 for discrete and continuous distributions, respectively.

A GAMLSS allows modeling up to four parameters, where modeling implies that the parameter values can depend upon covariate effects. Following the notation of Stasinopoulos & Rigby [33], a parametric linear GAMLSS assumes that the response Y_i has a density f(y_i | theta^i), with theta^i = (theta_{1i}, theta_{2i}, theta_{3i}, theta_{4i}). Note that the vector theta^i can be of lower dimension than four, depending on the distributional assumption made for the response. By convention, the four parameters are the location mu_i, scale sigma_i, skewness nu_i and kurtosis tau_i, respectively. For instance, the one-parameter exponential distribution will only have the first parameter, with theta_{1i} = mu_i, while the Box-Cox t distribution will have four parameters with theta_{1i} = mu_i, theta_{2i} = sigma_i, theta_{3i} = nu_i and theta_{4i} = tau_i. Note that the location parameter in the Box-Cox t distribution corresponds to the median. Every model parameter can depend on covariates. The systematic component of the GAMLSS is:

$$g_k(\theta_{ki}) = \beta_k' x_i \quad \text{for } k = 1, 2, 3, 4,$$

with g_k(.) a monotonic link function that connects the systematic component beta_k' x_i with the parameter theta_{ki}, beta_k the parameter vector and x_i the covariate information for observation i. GAMLSS can be further extended to include additive terms, non-linear parametric terms etc., for which we refer the reader to the original paper [33, p. 2-4]. For instance, smoothing techniques can be used, which move us to semi-parametric GAMLSS [33, p. 3]:

$$g_k(\theta_{ki}) = \beta_k' x_i + \sum_{j=1}^{r} f_{jk}(x_{jk}) \quad \text{for } k = 1, 2, 3, 4,$$

with r the number of covariates modeled by means of a smooth function f, such as a cubic

spline, fractional polynomial or other. Linear parametric GAMLSS can be fit by using maximum likelihood. Often of interest is a truncated distribution that is cut off at deterministic points delta_L, delta_R with delta_L < delta_R. This can also be fit by maximum likelihood, but the likelihood contribution of each observation y_i will be

$$\frac{f(y_i \mid \theta^i)}{P(\delta_L < Y_i < \delta_R \mid \theta^i)}$$

instead of f(y_i | theta^i). In other words, it corresponds to a conditional density that has all probability mass between the truncation points. In contrast, models that include non-linear terms or non-parametric smoothing functions are estimated by a penalized likelihood function [33, p. 4].

Distribution                  Number of parameters
Poisson                       1
Negative Binomial             2
Zero-inflated Poisson         2
Zero-adjusted Poisson         2
Poisson-inverse Gaussian      2
Sichel                        3

Table 2.1: Some discrete distributions belonging to the GAMLSS family.

Distribution                  Number of parameters
Exponential                   1
Gamma                         2
Inverse Gaussian              2
Lognormal                     2
Weibull                       2
Generalized Gamma             3
Generalized Inverse Gaussian  3
Box-Cox t                     4
Generalized Beta type 2       4

Table 2.2: Some continuous distributions belonging to the GAMLSS family with support on R^+.
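A minimal sketch of fitting such models with the gamlss package in R follows; the data frame `dat`, its columns `y` and `x`, and the truncation point `u` are hypothetical. The right-truncated family is generated with gen.trun from the gamlss.tr package:

```r
# Hedged GAMLSS sketch: gamma response whose location and scale both depend
# on a covariate through penalized splines pb().
library(gamlss)
fit <- gamlss(y ~ pb(x), sigma.formula = ~ pb(x), family = GA, data = dat)

# Right-truncated version of the same family (as needed for the body of a
# spliced severity model, see the splicing section below); gen.trun creates
# the family GAtr truncated at the assumed point u.
library(gamlss.tr)
gen.trun(par = u, family = "GA", type = "right")
fit_body <- gamlss(y ~ pb(x), family = GAtr, data = subset(dat, y <= u))
```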

Extreme value theory

Insurance data is often right-skewed and heavy-tailed. Classical statistical analysis focuses on the expected value of a random variable and is dominated by the normal distribution, a symmetric and light-tailed distribution. For instance, for a sum S_n = X_1 + ... + X_n, the Central Limit Theorem (CLT) is concerned with finding constants a_n > 0 and b_n such that

$$Y_n := \frac{S_n - b_n}{a_n} \xrightarrow{\,n \to \infty\,} Y,$$

and a normal distribution typically appears in the limit. In contrast, Extreme Value Theory (EVT) focuses on the limiting distribution of the sample maximum X_{n,n} = max(X_1, ..., X_n). The Fisher-Tippett theorem, the EVT equivalent of the CLT, dictates that for suitable choices of a_n > 0 and b_n, the limiting distribution of the normalized sample maximum,

$$P\left(\frac{X_{n,n} - b_n}{a_n} \le x\right) \xrightarrow{\,n \to \infty\,} G(x),$$

is the extreme value distribution:

$$G_\xi(x) = e^{-(1+\xi x)^{-1/\xi}},$$

with Extreme Value Index xi in R and 1 + xi x > 0. The sign of this parameter determines the domain of attraction of the distribution G. The Fréchet-Pareto domain corresponds to xi > 0 and implies that the distribution is heavy-tailed, which, as stated before, is often observed in insurance data. According to the Pickands-Balkema-de Haan theorem, data follows a Generalized Pareto Distribution (GPD) from a high enough threshold u onwards (see [20] and [4]). The GPD has the following cumulative distribution function:

$$G(x; \xi, \sigma) =
\begin{cases}
1 - \left(1 + \dfrac{\xi x}{\sigma}\right)^{-1/\xi} & \text{if } \xi \ne 0, \\[2mm]
1 - e^{-x/\sigma} & \text{if } \xi = 0,
\end{cases}$$

with scale parameter sigma > 0, shape parameter xi (the Extreme Value Index) and conditions x >= 0 if xi >= 0 and 0 <= x <= -sigma/xi when xi < 0. A location parameter mu can be added as well: G(x - mu; xi, sigma). These parameters can be found by using maximum likelihood.

There is no definitive approach within EVT to estimate the threshold u. McNeil [24, p. 132] does not discuss any statistical procedure to find the threshold, but already hints that there is a bias-variance trade-off. If the threshold is too high, only few data points remain to fit the model, leading to high variance. In contrast, if the threshold is too low, the Pickands-

Balkema-de Haan theorem might not hold and bias will occur as a result.

Which methods can be used to find the threshold u? Scarrott et al. [30] give an overview of both traditional and novel threshold estimation techniques. The traditional approaches are often graphical(1), inspired by EVT, require investigating the data, and allow for the inclusion of expert opinion. In contrast, the novel approaches focus on automation and uncertainty quantification. For instance, there exist likelihood-based approaches as well as clustering approaches with mixture models. Still, the novel approaches are not necessarily consistent with the theorems put forward by EVT. Which techniques are being used in the applied insurance literature? Antonio et al. [1], Reynkens et al. [29] and Ganegoda et al. [16] rely on an extreme value analysis with the related graphical tools. The first choose the threshold where the Hill estimator, an estimator of xi, enters a stable region. The others study the mean excess plot in order to decide upon the optimal threshold u; in fact, the mean excess plot is linearly increasing for Pareto-type behavior. Similarly, this master thesis will perform a graphical EVT analysis and decide upon the threshold before the fitting procedure.

The most well-known estimator for xi is the Hill estimator H_{k,n}:

$$H_{k,n} = \frac{1}{k} \sum_{j=1}^{k} \log X_{n-j+1,n} - \log X_{n-k,n},$$

with k the number of largest observations used (its choice plays the role of the threshold), n the sample size and X_{i,n} the i-th order statistic. A threshold stability plot can be created by plotting the Hill estimator as a function of the threshold. Yet another graphical approach is the mean excess plot. For a random variable X, the mean excess function is

$$e(t) = E[X - t \mid X > t],$$

and can be estimated as follows:

$$\hat{e}_n(t) = \frac{\sum_{i=1}^{n} x_i\, I_{(t,\infty)}(x_i)}{\sum_{i=1}^{n} I_{(t,\infty)}(x_i)} - t,$$

where I_{(t,infinity)}(x_i) = 1 if x_i > t and 0 otherwise. The mean excess plot of the exponential distribution is constant, and therefore interpretation in terms of tails is either heavier than exponential or lighter than exponential. Pareto-type behavior corresponds to a linearly increasing mean excess plot [8].

(1) Some researchers have proposed automated threshold selection techniques consistent with EVT. For instance, Pickands [20, p. 120] has proposed to choose as threshold u the value that minimizes the absolute difference between the empirical distribution and the fitted GPD for values greater than u. Another example is Beirlant et al. [5], who propose as threshold the value that minimizes the asymptotic mean squared error (AMSE) of an estimator of the extreme value index. The AMSE can be decomposed into the asymptotic squared bias and the asymptotic variance.
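The graphical analysis described above can be sketched in R with the evir package; the vector `losses` of positive claim amounts and the threshold value read off the plots are hypothetical:

```r
# Hedged sketch of graphical threshold selection and GPD fitting with evir.
library(evir)
hill(losses, option = "xi")    # Hill plot: look for a stable region of the xi estimate
meplot(losses)                 # mean excess plot: a linear increase suggests a Pareto-type tail
u <- 20000                     # assumed threshold, chosen from the plots
gpd_fit <- gpd(losses, threshold = u)   # ML fit of the GPD to the exceedances of u
gpd_fit$par.ests               # estimates of xi (shape) and beta (scale)
```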

Splicing for claim severity

We motivated in the previous section the necessity to isolate large losses and to model them separately from the attritional losses, where attritional refers to small to medium-sized losses. This section explains the technique of splicing, sometimes referred to as composite modeling, or the body-tail approach, which joins the models for attritional and large losses into a single probabilistic model (see [22]). This technique was also used by Pigeon et al. [27], with a lognormal and a Pareto distribution for the attritional and large losses, respectively. Ganegoda et al. [16] perform a splicing analysis as well, in the context of operational loss modelling, but allow the distribution for attritional losses to come from the GAMLSS family. Similarly, Antonio et al. [1], in the context of reserving, use a GAMLSS distribution for the attritional losses and a GPD for the large losses. In contrast, Bakar et al. [3] model loss data using spliced models and consider a Burr distribution for the tail. It was found that spliced models provide a significantly better fit. However, a caveat of their research is that the distribution for attritional losses is restricted to the Weibull distribution. Also, the threshold was considered as a model parameter, whilst imposing continuity and differentiability restrictions at the threshold; the latter is not consistent with EVT. Several other researchers have used K-component mixture distributions. Miljkovic & Grün [25] found such a mixture distribution to provide a better fit than the composite models used in Bakar et al. [3]. Reynkens et al. [29] use an even more flexible approach, as the number of components K in the mixture distribution is considered to be random.

In this section, we continue with the notation introduced in Section 2.1, namely Y_i = L_i / N_i, a random variable that denotes the average claim severity of customer i in a particular year. The severity density f_{Y_i}(y) is split up into two separate density functions f_1 and f_2 by the splicing technique [22] (we do not impose continuity nor differentiability at the splicing point, as this would reduce the number of parameters and thus the flexibility):

$$f_{Y_i}(y) =
\begin{cases}
q_1 f_1(y), & y \in (0, u], \\
q_2 f_2(y), & y \in (u, \infty),
\end{cases}$$

where q_1, q_2 >= 0 and q_1 + q_2 = 1. Here, f_1(y) and f_2(y) are well-defined densities on the intervals (0, u] and (u, infinity), respectively. In practice, one will often use truncated versions of existing probability distributions. The weights q_1 and q_2 ensure that the density integrates to 1:

$$\int_0^{\infty} f_{Y_i}(y)\,dy = q_1 \int_0^{u} f_1(y)\,dy + q_2 \int_u^{\infty} f_2(y)\,dy = 1.$$

We refer to the first component as the body and the second component as the tail. The body

component models the data up to the extreme value threshold u. This data can be modelled by a right-truncated GAMLSS (see the GAMLSS section above for more information about GAMLSS and truncation). The tail component will be modeled by means of a Generalized Pareto Distribution, as motivated in the section on extreme value theory. The upper limit was chosen to be +infinity, but could be a policy limit or some maximal probable loss as well.
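To make the splicing construction concrete, here is a minimal base-R sketch of a spliced density with a right-truncated gamma body and a GPD tail; all parameter values are hypothetical:

```r
# Hedged sketch: spliced density with weight q1 on a gamma body truncated to
# (0, u] and weight 1 - q1 on a GPD tail above u (xi > 0 assumed).
dsplice <- function(y, u, q1, shape, rate, xi, sigma) {
  body <- ifelse(y > 0 & y <= u,
                 dgamma(y, shape = shape, rate = rate) /
                   pgamma(u, shape = shape, rate = rate),   # renormalized on (0, u]
                 0)
  tail <- ifelse(y > u,
                 (1 / sigma) * (1 + xi * (y - u) / sigma)^(-1 / xi - 1),  # GPD density
                 0)
  q1 * body + (1 - q1) * tail
}

# Sanity check: the spliced density should integrate to one.
integrate(dsplice, 0, Inf, u = 10000, q1 = 0.95,
          shape = 2, rate = 1 / 2000, xi = 0.3, sigma = 5000)
```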

2.3 Modeling dependence with copula models

General introduction to copulas

This section gives an introduction to dependence modeling with copulas. Since the interest lies in jointly modeling two random variables, we limit ourselves to the bivariate case. Consider continuous random variables X_1 and X_2 that are described separately by the cumulative distribution functions F_{X_1} and F_{X_2}, respectively. X = (X_1, X_2) is a bivariate random vector that describes the joint behavior of these random variables. The bivariate cumulative distribution function F_X of X is defined as

$$F_X(x) = P(X_1 \le x_1, X_2 \le x_2),$$

where x = (x_1, x_2) in R^2. In general, there exist three extreme dependence structures, namely independence, comonotonicity and counter-monotonicity. Independence between X_1 and X_2 implies that the realized value of X_1 is not influenced by the realization of X_2, and vice versa. The random variables X_1 and X_2 are independent if and only if F_X(x) satisfies:

$$F_X(x) = F_{X_1}(x_1) \cdot F_{X_2}(x_2), \quad \forall (x_1, x_2) \in \mathbb{R}^2.$$

A comonotonic dependence structure implies that all risk factors are driven by the same source of randomness. For a comonotonic random vector, we have that

$$(X_1, X_2) \overset{d}{=} \left(F_{X_1}^{-1}(U), F_{X_2}^{-1}(U)\right),$$

where U ~ Unif(0, 1) and "=_d" denotes equality in distribution. The joint cumulative distribution function under comonotonic dependence can be derived by making use of the quantile transform theorem, X_i =_d F_{X_i}^{-1}(U). We obtain that:

$$
\begin{aligned}
F_X(x) &= P(X_1 \le x_1, X_2 \le x_2) \\
&= P\!\left(F_{X_1}^{-1}(U) \le x_1,\; F_{X_2}^{-1}(U) \le x_2\right) \\
&= P\!\left(U \le F_{X_1}(x_1),\; U \le F_{X_2}(x_2)\right) \\
&= \min\{F_{X_1}(x_1), F_{X_2}(x_2)\}, \quad \forall (x_1, x_2) \in \mathbb{R}^2.
\end{aligned}
$$

See for instance Dhaene & Goovaerts [9], Dhaene et al. [11] and Dhaene et al. [12] for more information about comonotonicity. Counter-monotonicity implies that the random variables are driven by the same source of randomness, but move in opposite directions. The random variables X_1 and X_2 are counter-monotonic if and only if:

$$(X_1, X_2) \overset{d}{=} \left(F_{X_1}^{-1}(U), F_{X_2}^{-1}(1 - U)\right).$$

The joint cumulative distribution function can again be derived by making use of the quantile transform theorem:

$$
\begin{aligned}
F_X(x) &= P(X_1 \le x_1, X_2 \le x_2) \\
&= P\!\left(F_{X_1}^{-1}(U) \le x_1,\; F_{X_2}^{-1}(1-U) \le x_2\right) \\
&= P\!\left(U \le F_{X_1}(x_1),\; 1 - U \le F_{X_2}(x_2)\right) \\
&= \max\{F_{X_1}(x_1) + F_{X_2}(x_2) - 1,\, 0\}, \quad \forall (x_1, x_2) \in \mathbb{R}^2.
\end{aligned}
$$

Note that it is not straightforward to generalize counter-monotonicity to dimensions larger than 2. A Fréchet space R_2(F_{X_1}, F_{X_2}), consisting of all bivariate random vectors Y with fixed univariate distribution functions F_{X_1} and F_{X_2}, is used to study dependence structures:

$$\mathcal{R}_2(F_{X_1}, F_{X_2}) = \{Y \mid Y_1 \sim F_{X_1},\; Y_2 \sim F_{X_2}\}.$$

Any joint distribution F_Y with Y in R_2(F_{X_1}, F_{X_2}) can describe the random vector X. Note that the Fréchet upper and lower bounds are given by the distribution functions with a comonotonic and a counter-monotonic dependence structure, respectively. A copula is a function that allows one to determine the dependence structure separately from the univariate distribution functions. The latter is in our interest, as insurers usually have a model available for both claim frequency and claim severity. According to Sklar's theorem [32], for continuous distribution functions F_{X_1} and F_{X_2} there exists a unique copula C: [0,1] x [0,1] -> [0,1] such that

$$F_X(x_1, x_2) = C\!\left(F_{X_1}(x_1), F_{X_2}(x_2)\right), \quad \forall (x_1, x_2) \in \mathbb{R}^2,$$

with F_X in R_2(F_{X_1}, F_{X_2}). A copula satisfies several properties:

1. lim_{u_i -> 0} C(u_1, u_2) = 0 for i = 1, 2;
2. lim_{u_1 -> 1} C(u_1, u_2) = u_2 and lim_{u_2 -> 1} C(u_1, u_2) = u_1;
3. C is supermodular, so that the inequality C(v_1, v_2) - C(u_1, v_2) - C(v_1, u_2) + C(u_1, u_2) >= 0 holds for any u_1 <= v_1 and u_2 <= v_2.

The first property implies that the joint probability equals 0 if one of the events has a zero probability of occurring. In contrast, the joint probability reduces to one event's probability of occurring if the other event will happen for sure. Finally, supermodularity implies that the joint probability should not decrease if the probabilities of both events are non-decreasing. We refer the reader to [15], [26] and [10] for more information about copulas.

Copula families

The most well-known dependence structure is linear dependence and corresponds to the Gaussian copula with correlation parameter rho. Note that the univariate distributions do not need to follow a Gaussian distribution. The strength of dependence increases with |rho|: rho = -1 corresponds to counter-monotonicity, rho = 0 to independence and rho = 1 to comonotonicity. A generalization of the Gaussian copula is the Student copula, with an additional degrees-of-freedom parameter nu. Note that rho = 0 only corresponds to independence as nu -> infinity, in which case the Student copula is equivalent to the Gaussian copula. Both the Gaussian and Student copulas are elliptical copulas. An Archimedean copula with a strict generator phi is a copula of the form

$$C(u_1, u_2) = \varphi^{-1}\!\left(\varphi(u_1) + \varphi(u_2)\right),$$

where phi satisfies: phi(0) = +infinity; phi(1) = 0; and phi is a continuous, strictly decreasing and convex function mapping [0, 1] onto [0, infinity]. The Frank, Clayton and Gumbel copulas satisfy these conditions and are thus part of the Archimedean copula family (see for instance [15] for an overview). Next to parametric copula families, there are non-parametric and empirical estimators as well. The empirical copula is simply the bivariate empirical distribution function of the transformed random variables F_{X_1}(x_1) and F_{X_2}(x_2).
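A hedged sketch of fitting a Gaussian copula to pseudo-observations with the copula package in R follows; the vectors `n` and `y` are hypothetical stand-ins for the two variables of interest:

```r
# Hedged sketch: Gaussian copula fit on pseudo-observations (rank transforms).
library(copula)
u_mat <- pobs(cbind(n, y))                       # ranks rescaled to (0, 1)
fit <- fitCopula(normalCopula(dim = 2), u_mat, method = "mpl")
coef(fit)                                         # estimated correlation parameter rho

# Empirical copula evaluated at the pseudo-observations, for a
# non-parametric comparison with the fitted parametric copula.
C.n(u_mat, X = u_mat)
```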

Applications in insurance

This section discusses the few research papers available that have relaxed the independence assumption between claim frequency and severity. Czado et al. [18] have performed a study on spatial (Bayesian) modelling for both frequency and severity data. Dependence was induced by using the number of claims as a covariate in the severity model. Garrido et al. [17] followed a similar strategy to relax the independence assumption. Although it was found to improve the results, it cannot be used in practice as the number of claims is unknown for new customers; in fact, the latter is the most important source of randomness in pricing insurance contracts. Czado et al. [6] link a Poisson and a gamma regression with a Gaussian copula. It was found that the copula approach did not outperform the independence model for their dataset. Czado et al. [7] generalize this approach to the class of Archimedean copulas. It was found that these copulas are more appropriate than the Gaussian one. A drawback of their study is the use of a gamma regression, which, in the case of heavy-tailed data, could lead to misspecification of the percentiles. In addition, no out-of-sample validation was performed, which is increasingly important as insurers try to predict claims in future years.

The research of Shi et al. [31] is closest to the work that this master thesis will present. They work in a hurdle framework where the hurdle component is the probability of having at least one claim, and the conditional component comprises the number of claims and their respective sizes. They do not limit the univariate distributions to the GLM family: claim counts are modeled with a truncated negative binomial regression and the claim sizes are assumed to follow a generalized gamma distribution. The independence assumption is relaxed by either the copula approach or by using the number of claims as a covariate in the severity component. The copula approach was found to outperform the independence model on a hold-out sample.

2.4 A more flexible pricing strategy

We have seen that in the classical approach the premium for customer i consists of the product of the claim frequency and severity components, i.e. P_i = E[Y_i | x_i] * E[N_i | x_i]. We now want to relax the independence assumption between claim frequency and claim severity. The main problem is that many customers did not experience a claim, whilst claim severity is only defined if there is at least one accident. Hence, this motivates working in a conditional framework, as explained below (see also Shi et al. [31] for a similar approach).

Let I_i be an indicator random variable that takes the value 1 if there is at least one claim (also called claim incidence) and 0 otherwise. N_i is the claim frequency and has support S_{N_i} = {0, 1, 2, 3, ...} (the support S_X of a random variable X with probability density function f is the set of values for which f(x) > 0). Y_i = Y_i | I_i = 1, with support S_{Y_i} = (0, infinity), is the claim severity and is conditional on I_i = 1 by convention. We decompose N_i by conditioning on I_i:

$$P(N_i = k \mid x_i) = P(N_i = k \mid I_i = 1, x_i)\, P(I_i = 1 \mid x_i) + P(N_i = k \mid I_i = 0, x_i)\, P(I_i = 0 \mid x_i).$$

Hence, we have that

$$P(N_i = k \mid x_i) =
\begin{cases}
P(I_i = 0 \mid x_i) & \text{if } k = 0, \\
P(N_i = k \mid I_i = 1, x_i)\, P(I_i = 1 \mid x_i) & \text{if } k \ge 1.
\end{cases}$$

This result shows that we can decompose the claim frequency into two components, namely the probability of having a claim and the conditional claim frequency. Note that S_{N_i | I_i = 1} = {1, 2, 3, ...}, which implies it is truncated at 0. It is now feasible to define a dependence structure between Y_i and N_i | I_i = 1. The pricing strategy corresponding to the flexible approach is as follows:

$$P_i = E[N_i Y_i \mid x_i] = E[N_i Y_i \mid I_i = 1, x_i]\; P(I_i = 1 \mid x_i),$$

since E[N_i Y_i | I_i = 0, x_i] = 0. The advantages of this model over the traditional approach are twofold. Firstly, it is possible to consider a different set of covariates for claim incidence and for the claim frequency conditional on at least one claim. Secondly, Y_i and N_i are allowed to be dependent, so that

$$E[Y_i N_i \mid x_i] \ne E[Y_i \mid x_i]\, E[N_i \mid x_i].$$

The claim probability P(I_i = 1 | x_i) = p(x_i) can be modeled by means of a binomial GAM. According to Baetschmann [2], the complementary log-log link can be used to deal with varying

time exposure E_i:

$$p(x_i) = 1 - e^{-e^{\beta' x_i + \sum_{j=1}^{r} f_j(x_{ij}) + \log E_i}}.$$

Note that, in contrast to the Poisson case, the exposure is proportional to the risk in a non-linear fashion. Furthermore, it is assumed that

$$N_i \mid I_i = 1, x_i \sim G\!\left(\mu_i^f, \sigma_i^f, \nu_i^f, \tau_i^f\right), \qquad Y_i \mid I_i = 1, x_i \sim H\!\left(\mu_i^s, \sigma_i^s, \nu_i^s, \tau_i^s\right),$$

where the superscript f refers to frequency and s to severity. The distribution functions G and H are distributions existing within the GAMLSS framework, and every parameter can potentially be modelled as a function of the covariate vector x_i. Moreover, Y_i | I_i = 1, x_i can be modeled with a truncated GAMLSS model for the attritional losses and a GPD for the tail, as outlined in the splicing section above. In addition, one can model a probability q(x_i) of having either an attritional or a large loss. Finally, a copula is used to link the conditional claim frequency and the claim size component:

$$C\!\left(F_{N_i \mid I_i = 1, x_i}(k_i),\; F_{Y_i \mid I_i = 1, x_i}(y_i)\right).$$

A comprehensive summary of the model structure is provided in Figure 2.1 for customer i with covariate information x_i.

[Figure 2.1: A more flexible pricing strategy that relaxes the classical assumption of independence. Tree diagram: customer i has claim incidence with probability p(x_i) and no claim incidence with probability 1 - p(x_i); given incidence, the conditional claim frequency N_i | I_i = 1, x_i and the severity Y_i | I_i = 1, x_i are modeled, with the severity split into a body (probability q(x_i)) and a tail (probability 1 - q(x_i)).]

The contribution of this approach over the existing literature is threefold. Firstly, we investigate whether more predictive accuracy can be obtained by using GAMLSS in a hurdle framework; the latter could be due either to a more appropriate distributional assumption or to the risks for claim incidence and the conditional claim frequency being different. Secondly, large losses are explicitly accounted for by using EVT and the splicing technique. Finally, this master thesis does not restrict dependence models to parametric copulas.
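One way to evaluate the flexible premium in practice is by Monte Carlo. The following hedged sketch simulates dependent (conditional frequency, severity) pairs from the Gaussian copula and multiplies E[N * Y | I = 1] by the claim probability; all fitted quantities (`rho_hat`, `p_hat`, and the quantile functions `q_freq`, `q_sev`) are assumed to be available from earlier fits:

```r
# Hedged sketch: flexible premium P_i = p(x_i) * E[N_i * Y_i | I_i = 1]
# for one customer, via simulation from a fitted Gaussian copula.
library(copula)
m <- 10^5
uv <- rCopula(m, normalCopula(param = rho_hat, dim = 2))  # dependent uniforms
n_sim <- q_freq(uv[, 1])   # assumed quantile function of N_i | I_i = 1 (support 1, 2, ...)
y_sim <- q_sev(uv[, 2])    # assumed quantile function of the severity model
premium <- p_hat * mean(n_sim * y_sim)
```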

2.5 Model validation

Insurance data is difficult to validate due to its semi-continuous nature. In fact, most customers did not incur a loss, resulting in a mixture of a degenerate distribution at 0 and a positive-continuous part. As a result, any frequency-severity model is difficult to validate with standard performance metrics such as the Root Mean Square Error. Frees et al. [14] introduce the Lorenz curve, a well-known concept in economics, to insurance modeling. Observations y_i are ordered by their riskiness, measured by a relativity R. A straightforward choice would be the predicted premium, as it indicates how risky customers are relative to each other. A plot can then be generated of the points (F_P-hat(s), F_L-hat(s)), with

$$\hat{F}_P(s) = \frac{\sum_{i=1}^{n} P_i\, I(R_i \le s)}{\sum_{i=1}^{n} P_i}, \qquad
\hat{F}_L(s) = \frac{\sum_{i=1}^{n} y_i\, I(R_i \le s)}{\sum_{i=1}^{n} y_i},$$

where I(R_i <= s) = 1 if R_i <= s and 0 otherwise. The x-axis is the population of interest P and the y-axis the distribution of incurred losses L. An example of the Lorenz curve is displayed in Figure 2.2. The population of interest is the exposure of the policyholders' insurance policies. The observations y_i are ordered by the predictions. As a result, the more concave the Lorenz curve, the better the risks are classified. For instance, the point (20%, 35%), marked by the red lines in the figure, implies that 20% of the risks the insurer was exposed to correspond to 35% of the incurred losses.

[Figure 2.2: Example of an ordered Lorenz curve and Gini coefficient, plotting incurred losses against the population. The dotted line represents the 45-degree line, and the solid black line the Lorenz curve.]

The related Gini coefficient is twice the area between the ordered Lorenz curve and the 45-degree line. This metric has for instance been used by Frees et al. [13] and Shi et al. [31] to validate pricing models.
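A minimal sketch of the ordered Lorenz curve and Gini coefficient in base R follows; the vectors `premium`, `loss` and `relativity` are hypothetical and of equal length:

```r
# Hedged sketch: ordered Lorenz curve and (approximate) Gini coefficient.
ordered_lorenz <- function(premium, loss, relativity) {
  ord <- order(relativity)                     # sort policies by riskiness
  FP <- cumsum(premium[ord]) / sum(premium)    # premium (population) distribution
  FL <- cumsum(loss[ord]) / sum(loss)          # incurred-loss distribution
  # Twice the area between the 45-degree line and the curve (rectangle rule).
  gini <- 2 * sum((FP - FL) * diff(c(0, FP)))
  list(FP = FP, FL = FL, gini = gini)
}

res <- ordered_lorenz(premium, loss, relativity = premium)
plot(res$FP, res$FL, type = "l", xlab = "Population", ylab = "Incurred losses")
abline(0, 1, lty = 2)                          # 45-degree line
```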

3 Case study

3.1 Exploratory Data Analysis

We have available a non-confidential dataset of 159,911 records from an anonymous private insurer. The available variables are listed in Table 3.1, and we visualize the distributions of the claim count Clm_Count, the exposure TLength and the claim severity AvPayment = TotPayment / Clm_Count in Figure 3.1. Most policyholders did not file a claim (145,719 or 91.1%), those who do file usually have only 1 claim (12,874 or 8.05%) and the remaining ones file two or more claims (1,354 or 0.85%). The overall claim frequency, i.e. Clm_Count per unit of TLength, equals 15.47%.

Variable     Type         Description
Clm_Count    Integer      Number of claims filed by the policyholder.
TLength      Numeric      The fraction of the year during which the policyholder was exposed to the risk.
TotPayment   Numeric      The total amount claimed by the policyholder (in EUR).
AgeInsured   Integer      Age of the policyholder in years.
SexInsured   Categorical  Gender of the policyholder: Male (M) or Female (F).
PrivateCar   Binary       1 if it is a car for private use, 0 otherwise.
Experience   Integer      Experience of the policyholder in years.
VAge         Continuous   Age of the vehicle in years.
Cover_C      Binary       1 if the policyholder has this particular cover, 0 otherwise.
NCD_0        Binary       1 if the no-claims history equals 0, 0 otherwise.

Table 3.1: Overview of the variables.

[Figure 3.1: Distribution of the claim count Clm_Count, the exposure TLength and a kernel density estimate of the average payment AvPayment = TotPayment / Clm_Count.]

A large proportion of policyholders have an exposure equal to 1. The remaining peaks are due to the underwriting process of the insurer, which happens at the end of the month. For instance, if a policyholder aged 26 subscribes an insurance contract on 15 August 2016, the contract will end on 31 July 2017 and potentially be renewed on 1 August 2017. The example presented in Table 3.2 shows how the data is structured and how variables are aged. In fact, variables are usually aged at the renewal date and observations are broken down by accounting year.

Observation  Accounting year  Renewal year  Period                   Exposure  Age
1            2016             2016          15/08/2016 - 31/12/2016  0.38      26
2            2017             2016          01/01/2017 - 31/07/2017  0.58      26
3            2017             2017          01/08/2017 - 31/12/2017  0.42      27
4            2018             2017          01/01/2018 - 31/07/2018  0.58      27

Table 3.2: Example of data structuring for the policyholder described above.

The claim severity, of which the density is visualized in Figure 3.1, is on average 4,438 EUR, whilst the median equals 2,575 EUR, indicating the right skewness of the data. It is clear that most policyholders file small claims, as only 3.16% of claims exceed 20,000 EUR. Finally, Figure 3.2 illustrates how the distribution of AvPayment evolves with the number of claims filed. There seems to be an increasing trend, which might indicate that some dependence is present between claim frequency and claim severity.

[Figure 3.2: Kernel density estimate of the severity distribution by claim count.]

Figure 3.3 illustrates how the risk factors are distributed (right axis) and how they are related to the response variable (left axis). The response variables are the claim probability, conditional

claim frequency and claim severity for the left, middle and right panels, respectively.

[Figure 3.3: Exploratory Data Analysis. For each risk factor (AgeInsured, SexInsured, Experience, VAge, PrivateCar, NCD_0 and Cover_C), the crosses indicate the observed average of the response per level (left axis) and the bars the exposure per level (right axis). The response variables are the claim probability, conditional claim frequency and claim severity for the left, middle and right panels, respectively.]

3.2 Claim Frequency Analysis

This section focuses on modeling the claim frequency Clm_Count per unit of exposure TLength and provides a comparative analysis between the classical and the hurdle approach. The classical approach models Clm_Count directly with a Poisson error distribution, using the logarithm of the exposure as an offset. In contrast, the hurdle approach decomposes Clm_Count into the following two components:

1. the claim probability I_{Clm_Count >= 1}, taking the value 1 if there is at least one claim, and 0 otherwise;
2. the conditional claim frequency Clm_Count | Clm_Count >= 1.

The claim probability is modeled with a binomial GAM (fitted with the gam function from the mgcv package in R) with complementary log-log link, whilst using the logarithm of the exposure as an offset. Similar to the classical approach, the conditional claim frequency component is modeled with a Poisson distribution and the logarithm of the exposure as an offset. Rather than using a truncated Poisson regression model, we use the following rule to model the conditional claim frequency:

$$E[\text{Clm\_Count} \mid \text{Clm\_Count} \ge 1] = 1 + E[\text{Clm\_Count} - 1 \mid \text{Clm\_Count} \ge 1].$$

For all three models, categorical variables are coded as dummy variables and the penalty for the smooth effects is estimated with generalized cross-validation. The optimal model structure is determined by an exhaustive search over all possible combinations (the model could be further enriched by searching for interaction effects, yet this would over-complicate the analysis without adding value given the techniques we aim to demonstrate), and by using 5-fold cross-validation. On the one hand, the AIC is calculated on the training folds; on the other hand, the deviance between the observed and predicted values is calculated on the test folds. More weight is given to out-of-sample performance, as our aim is predicting the behavior of new clients. The following model structure, corresponding to the classical approach, attained the lowest AIC on all training folds:

$$\log E[\text{Clm\_Count}] = \beta_0 + \beta_1 \text{SexInsured} + \beta_2 \text{PrivateCar} + \beta_3 \text{Cover\_C} + \beta_4 \text{NCD\_0} + f_1(\text{AgeInsured}) + f_2(\text{Experience}) + f_3(\text{VAge}) + \log(\text{TLength}).$$

In addition, this structure also minimized the average Poisson deviance over the test folds.
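As a hedged sketch (assuming a data frame `dat` holding the variables of Table 3.1), the classical frequency model and the claim probability component can be fitted with the gam function of the mgcv package, the tool named above; gam's default smoothness selection corresponds to the generalized cross-validation criterion used here:

```r
# Hedged sketch of the classical and hurdle-probability fits with mgcv.
library(mgcv)
freq_fit <- gam(Clm_Count ~ SexInsured + PrivateCar + Cover_C + NCD_0 +
                  s(AgeInsured) + s(Experience) + s(VAge) + offset(log(TLength)),
                family = poisson(link = "log"), data = dat)
prob_fit <- gam(I(Clm_Count >= 1) ~ SexInsured + PrivateCar + Cover_C + NCD_0 +
                  s(AgeInsured) + s(Experience) + s(VAge) + offset(log(TLength)),
                family = binomial(link = "cloglog"), data = dat)
```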

The exact same results were found for the claim probability component with the binomial deviance. The model structure is as follows:

$$\log\!\left(-\log\!\left(1 - E[I_{\text{Clm\_Count} \ge 1}]\right)\right) = \alpha_0 + \alpha_1 \text{SexInsured} + \alpha_2 \text{PrivateCar} + \alpha_3 \text{Cover\_C} + \alpha_4 \text{NCD\_0} + g_1(\text{AgeInsured}) + g_2(\text{Experience}) + g_3(\text{VAge}) + \log(\text{TLength}).$$

The optimal model structure for the conditional claim frequency on the training data included the variables SexInsured, Cover_C, NCD_0 and smooth effects for AgeInsured and VAge; this structure attained the lowest AIC on 3 out of 5 training folds. Yet, the same model without a smooth effect for VAge attained the lowest average deviance on the test folds. Since more weight is given to out-of-sample performance, we continue with the following model structure:

$$\log E[\text{Clm\_Count} - 1 \mid \text{Clm\_Count} \ge 1] = \gamma_0 + \gamma_1 \text{SexInsured} + \gamma_2 \text{Cover\_C} + \gamma_3 \text{NCD\_0} + h_1(\text{AgeInsured}) + \log(\text{TLength}).$$
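The conditional claim frequency rule stated earlier can be implemented by shifting the response, as in the following sketch (again assuming the hypothetical data frame `dat`):

```r
# Hedged sketch: E[N | N >= 1] = 1 + E[N - 1 | N >= 1], so fit a Poisson GAM
# to Clm_Count - 1 on the policies with at least one claim and add 1 back.
cond_dat <- subset(dat, Clm_Count >= 1)
cond_fit <- gam(I(Clm_Count - 1) ~ SexInsured + Cover_C + NCD_0 +
                  s(AgeInsured) + offset(log(TLength)),
                family = poisson(link = "log"), data = cond_dat)
cond_pred <- 1 + predict(cond_fit, newdata = cond_dat, type = "response")
```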

[Figure 3.4: Fitted smooth effects for AgeInsured, Experience and VAge with 95% confidence intervals, for the claim frequency, claim probability and conditional claim frequency models.]

Figure 3.4 displays the fitted smooth functions with 95% confidence intervals: f_1-hat, f_2-hat and f_3-hat correspond to the claim frequency model, g_1-hat, g_2-hat and g_3-hat to the claim probability model, and h_1-hat to the conditional claim frequency model. There appear to be no tremendous differences in the fitted smooth effects of the three models. The fitted smooth effect of AgeInsured in each of the three models shows that young drivers are riskier. This effect stabilizes around the age of 40, after which the so-called accident hump occurs, i.e. parents with children that start driving and file claims under their parents' name. The effect for the old ages is less clear due to scarceness of data, as the widening confidence intervals illustrate. Next, the smooth effect for Experience is monotonically decreasing and indicates that more experienced drivers have fewer claims on average. Finally, the effect of VAge is less intuitive due to the two peaks. It is known that older vehicles are less safe, but this does not explain the decrease from year 7 onwards.

[Table 3.3: Parameter estimates, standard errors, t statistics and p values for the parametric terms of the claim frequency model (Intercept, SexInsured - Male, PrivateCar, NCD_0, Cover_C), the claim probability model (same terms) and the conditional claim frequency model (Intercept, SexInsured - Male, NCD_0, Cover_C). All listed terms have p < 0.01, except SexInsured - Male in the conditional claim frequency model.]

Next, we discuss the parametric model terms as displayed in Table 3.3. It is clear that male ...


Subject CS2A Risk Modelling and Survival Analysis Core Principles

Subject CS2A Risk Modelling and Survival Analysis Core Principles ` Subject CS2A Risk Modelling and Survival Analysis Core Principles Syllabus for the 2019 exams 1 June 2018 Copyright in this Core Reading is the property of the Institute and Faculty of Actuaries who

More information

An Improved Skewness Measure

An Improved Skewness Measure An Improved Skewness Measure Richard A. Groeneveld Professor Emeritus, Department of Statistics Iowa State University ragroeneveld@valley.net Glen Meeden School of Statistics University of Minnesota Minneapolis,

More information

Continuous random variables

Continuous random variables Continuous random variables probability density function (f(x)) the probability distribution function of a continuous random variable (analogous to the probability mass function for a discrete random variable),

More information

Modelling catastrophic risk in international equity markets: An extreme value approach. JOHN COTTER University College Dublin

Modelling catastrophic risk in international equity markets: An extreme value approach. JOHN COTTER University College Dublin Modelling catastrophic risk in international equity markets: An extreme value approach JOHN COTTER University College Dublin Abstract: This letter uses the Block Maxima Extreme Value approach to quantify

More information

Copulas? What copulas? R. Chicheportiche & J.P. Bouchaud, CFM

Copulas? What copulas? R. Chicheportiche & J.P. Bouchaud, CFM Copulas? What copulas? R. Chicheportiche & J.P. Bouchaud, CFM Multivariate linear correlations Standard tool in risk management/portfolio optimisation: the covariance matrix R ij = r i r j Find the portfolio

More information

MEASURING PORTFOLIO RISKS USING CONDITIONAL COPULA-AR-GARCH MODEL

MEASURING PORTFOLIO RISKS USING CONDITIONAL COPULA-AR-GARCH MODEL MEASURING PORTFOLIO RISKS USING CONDITIONAL COPULA-AR-GARCH MODEL Isariya Suttakulpiboon MSc in Risk Management and Insurance Georgia State University, 30303 Atlanta, Georgia Email: suttakul.i@gmail.com,

More information

Modeling Co-movements and Tail Dependency in the International Stock Market via Copulae

Modeling Co-movements and Tail Dependency in the International Stock Market via Copulae Modeling Co-movements and Tail Dependency in the International Stock Market via Copulae Katja Ignatieva, Eckhard Platen Bachelier Finance Society World Congress 22-26 June 2010, Toronto K. Ignatieva, E.

More information

Supplemental Online Appendix to Han and Hong, Understanding In-House Transactions in the Real Estate Brokerage Industry

Supplemental Online Appendix to Han and Hong, Understanding In-House Transactions in the Real Estate Brokerage Industry Supplemental Online Appendix to Han and Hong, Understanding In-House Transactions in the Real Estate Brokerage Industry Appendix A: An Agent-Intermediated Search Model Our motivating theoretical framework

More information

Lecture 5: Fundamentals of Statistical Analysis and Distributions Derived from Normal Distributions

Lecture 5: Fundamentals of Statistical Analysis and Distributions Derived from Normal Distributions Lecture 5: Fundamentals of Statistical Analysis and Distributions Derived from Normal Distributions ELE 525: Random Processes in Information Systems Hisashi Kobayashi Department of Electrical Engineering

More information

AN EXTREME VALUE APPROACH TO PRICING CREDIT RISK

AN EXTREME VALUE APPROACH TO PRICING CREDIT RISK AN EXTREME VALUE APPROACH TO PRICING CREDIT RISK SOFIA LANDIN Master s thesis 2018:E69 Faculty of Engineering Centre for Mathematical Sciences Mathematical Statistics CENTRUM SCIENTIARUM MATHEMATICARUM

More information

Intro to GLM Day 2: GLM and Maximum Likelihood

Intro to GLM Day 2: GLM and Maximum Likelihood Intro to GLM Day 2: GLM and Maximum Likelihood Federico Vegetti Central European University ECPR Summer School in Methods and Techniques 1 / 32 Generalized Linear Modeling 3 steps of GLM 1. Specify the

More information

Paper Series of Risk Management in Financial Institutions

Paper Series of Risk Management in Financial Institutions - December, 007 Paper Series of Risk Management in Financial Institutions The Effect of the Choice of the Loss Severity Distribution and the Parameter Estimation Method on Operational Risk Measurement*

More information

Log-Robust Portfolio Management

Log-Robust Portfolio Management Log-Robust Portfolio Management Dr. Aurélie Thiele Lehigh University Joint work with Elcin Cetinkaya and Ban Kawas Research partially supported by the National Science Foundation Grant CMMI-0757983 Dr.

More information

Exam STAM Practice Exam #1

Exam STAM Practice Exam #1 !!!! Exam STAM Practice Exam #1 These practice exams should be used during the month prior to your exam. This practice exam contains 20 questions, of equal value, corresponding to about a 2 hour exam.

More information

UNIVERSITY OF OSLO. Please make sure that your copy of the problem set is complete before you attempt to answer anything.

UNIVERSITY OF OSLO. Please make sure that your copy of the problem set is complete before you attempt to answer anything. UNIVERSITY OF OSLO Faculty of Mathematics and Natural Sciences Examination in: STK4540 Non-Life Insurance Mathematics Day of examination: Wednesday, December 4th, 2013 Examination hours: 14.30 17.30 This

More information

Contents Part I Descriptive Statistics 1 Introduction and Framework Population, Sample, and Observations Variables Quali

Contents Part I Descriptive Statistics 1 Introduction and Framework Population, Sample, and Observations Variables Quali Part I Descriptive Statistics 1 Introduction and Framework... 3 1.1 Population, Sample, and Observations... 3 1.2 Variables.... 4 1.2.1 Qualitative and Quantitative Variables.... 5 1.2.2 Discrete and Continuous

More information

3.4 Copula approach for modeling default dependency. Two aspects of modeling the default times of several obligors

3.4 Copula approach for modeling default dependency. Two aspects of modeling the default times of several obligors 3.4 Copula approach for modeling default dependency Two aspects of modeling the default times of several obligors 1. Default dynamics of a single obligor. 2. Model the dependence structure of defaults

More information

IEOR E4602: Quantitative Risk Management

IEOR E4602: Quantitative Risk Management IEOR E4602: Quantitative Risk Management Basic Concepts and Techniques of Risk Management Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com

More information

Measuring Financial Risk using Extreme Value Theory: evidence from Pakistan

Measuring Financial Risk using Extreme Value Theory: evidence from Pakistan Measuring Financial Risk using Extreme Value Theory: evidence from Pakistan Dr. Abdul Qayyum and Faisal Nawaz Abstract The purpose of the paper is to show some methods of extreme value theory through analysis

More information

Mongolia s TOP-20 Index Risk Analysis, Pt. 3

Mongolia s TOP-20 Index Risk Analysis, Pt. 3 Mongolia s TOP-20 Index Risk Analysis, Pt. 3 Federico M. Massari March 12, 2017 In the third part of our risk report on TOP-20 Index, Mongolia s main stock market indicator, we focus on modelling the right

More information

CHOICE THEORY, UTILITY FUNCTIONS AND RISK AVERSION

CHOICE THEORY, UTILITY FUNCTIONS AND RISK AVERSION CHOICE THEORY, UTILITY FUNCTIONS AND RISK AVERSION Szabolcs Sebestyén szabolcs.sebestyen@iscte.pt Master in Finance INVESTMENTS Sebestyén (ISCTE-IUL) Choice Theory Investments 1 / 65 Outline 1 An Introduction

More information

Fitting financial time series returns distributions: a mixture normality approach

Fitting financial time series returns distributions: a mixture normality approach Fitting financial time series returns distributions: a mixture normality approach Riccardo Bramante and Diego Zappa * Abstract Value at Risk has emerged as a useful tool to risk management. A relevant

More information

Capital Allocation Principles

Capital Allocation Principles Capital Allocation Principles Maochao Xu Department of Mathematics Illinois State University mxu2@ilstu.edu Capital Dhaene, et al., 2011, Journal of Risk and Insurance The level of the capital held by

More information

Statistical Tables Compiled by Alan J. Terry

Statistical Tables Compiled by Alan J. Terry Statistical Tables Compiled by Alan J. Terry School of Science and Sport University of the West of Scotland Paisley, Scotland Contents Table 1: Cumulative binomial probabilities Page 1 Table 2: Cumulative

More information

UPDATED IAA EDUCATION SYLLABUS

UPDATED IAA EDUCATION SYLLABUS II. UPDATED IAA EDUCATION SYLLABUS A. Supporting Learning Areas 1. STATISTICS Aim: To enable students to apply core statistical techniques to actuarial applications in insurance, pensions and emerging

More information

Risk Measurement in Credit Portfolio Models

Risk Measurement in Credit Portfolio Models 9 th DGVFM Scientific Day 30 April 2010 1 Risk Measurement in Credit Portfolio Models 9 th DGVFM Scientific Day 30 April 2010 9 th DGVFM Scientific Day 30 April 2010 2 Quantitative Risk Management Profit

More information

Computational Statistics Handbook with MATLAB

Computational Statistics Handbook with MATLAB «H Computer Science and Data Analysis Series Computational Statistics Handbook with MATLAB Second Edition Wendy L. Martinez The Office of Naval Research Arlington, Virginia, U.S.A. Angel R. Martinez Naval

More information

Week 2 Quantitative Analysis of Financial Markets Hypothesis Testing and Confidence Intervals

Week 2 Quantitative Analysis of Financial Markets Hypothesis Testing and Confidence Intervals Week 2 Quantitative Analysis of Financial Markets Hypothesis Testing and Confidence Intervals Christopher Ting http://www.mysmu.edu/faculty/christophert/ Christopher Ting : christopherting@smu.edu.sg :

More information

PIVOTAL QUANTILE ESTIMATES IN VAR CALCULATIONS. Peter Schaller, Bank Austria Creditanstalt (BA-CA) Wien,

PIVOTAL QUANTILE ESTIMATES IN VAR CALCULATIONS. Peter Schaller, Bank Austria Creditanstalt (BA-CA) Wien, PIVOTAL QUANTILE ESTIMATES IN VAR CALCULATIONS Peter Schaller, Bank Austria Creditanstalt (BA-CA) Wien, peter@ca-risc.co.at c Peter Schaller, BA-CA, Strategic Riskmanagement 1 Contents Some aspects of

More information

Probability and Statistics

Probability and Statistics Kristel Van Steen, PhD 2 Montefiore Institute - Systems and Modeling GIGA - Bioinformatics ULg kristel.vansteen@ulg.ac.be CHAPTER 3: PARAMETRIC FAMILIES OF UNIVARIATE DISTRIBUTIONS 1 Why do we need distributions?

More information

CHAPTER II LITERATURE STUDY

CHAPTER II LITERATURE STUDY CHAPTER II LITERATURE STUDY 2.1. Risk Management Monetary crisis that strike Indonesia during 1998 and 1999 has caused bad impact to numerous government s and commercial s bank. Most of those banks eventually

More information

M249 Diagnostic Quiz

M249 Diagnostic Quiz THE OPEN UNIVERSITY Faculty of Mathematics and Computing M249 Diagnostic Quiz Prepared by the Course Team [Press to begin] c 2005, 2006 The Open University Last Revision Date: May 19, 2006 Version 4.2

More information

Lecture 2. Probability Distributions Theophanis Tsandilas

Lecture 2. Probability Distributions Theophanis Tsandilas Lecture 2 Probability Distributions Theophanis Tsandilas Comment on measures of dispersion Why do common measures of dispersion (variance and standard deviation) use sums of squares: nx (x i ˆµ) 2 i=1

More information

2. Copula Methods Background

2. Copula Methods Background 1. Introduction Stock futures markets provide a channel for stock holders potentially transfer risks. Effectiveness of such a hedging strategy relies heavily on the accuracy of hedge ratio estimation.

More information

High-Frequency Data Analysis and Market Microstructure [Tsay (2005), chapter 5]

High-Frequency Data Analysis and Market Microstructure [Tsay (2005), chapter 5] 1 High-Frequency Data Analysis and Market Microstructure [Tsay (2005), chapter 5] High-frequency data have some unique characteristics that do not appear in lower frequencies. At this class we have: Nonsynchronous

More information

Contents. An Overview of Statistical Applications CHAPTER 1. Contents (ix) Preface... (vii)

Contents. An Overview of Statistical Applications CHAPTER 1. Contents (ix) Preface... (vii) Contents (ix) Contents Preface... (vii) CHAPTER 1 An Overview of Statistical Applications 1.1 Introduction... 1 1. Probability Functions and Statistics... 1..1 Discrete versus Continuous Functions... 1..

More information

[D7] PROBABILITY DISTRIBUTION OF OUTSTANDING LIABILITY FROM INDIVIDUAL PAYMENTS DATA Contributed by T S Wright

[D7] PROBABILITY DISTRIBUTION OF OUTSTANDING LIABILITY FROM INDIVIDUAL PAYMENTS DATA Contributed by T S Wright Faculty and Institute of Actuaries Claims Reserving Manual v.2 (09/1997) Section D7 [D7] PROBABILITY DISTRIBUTION OF OUTSTANDING LIABILITY FROM INDIVIDUAL PAYMENTS DATA Contributed by T S Wright 1. Introduction

More information

Optimal Allocation of Policy Limits and Deductibles

Optimal Allocation of Policy Limits and Deductibles Optimal Allocation of Policy Limits and Deductibles Ka Chun Cheung Email: kccheung@math.ucalgary.ca Tel: +1-403-2108697 Fax: +1-403-2825150 Department of Mathematics and Statistics, University of Calgary,

More information

Probability Weighted Moments. Andrew Smith

Probability Weighted Moments. Andrew Smith Probability Weighted Moments Andrew Smith andrewdsmith8@deloitte.co.uk 28 November 2014 Introduction If I asked you to summarise a data set, or fit a distribution You d probably calculate the mean and

More information

Statistical Analysis of Life Insurance Policy Termination and Survivorship

Statistical Analysis of Life Insurance Policy Termination and Survivorship Statistical Analysis of Life Insurance Policy Termination and Survivorship Emiliano A. Valdez, PhD, FSA Michigan State University joint work with J. Vadiveloo and U. Dias Sunway University, Malaysia Kuala

More information

the display, exploration and transformation of the data are demonstrated and biases typically encountered are highlighted.

the display, exploration and transformation of the data are demonstrated and biases typically encountered are highlighted. 1 Insurance data Generalized linear modeling is a methodology for modeling relationships between variables. It generalizes the classical normal linear model, by relaxing some of its restrictive assumptions,

More information

Can we use kernel smoothing to estimate Value at Risk and Tail Value at Risk?

Can we use kernel smoothing to estimate Value at Risk and Tail Value at Risk? Can we use kernel smoothing to estimate Value at Risk and Tail Value at Risk? Ramon Alemany, Catalina Bolancé and Montserrat Guillén Riskcenter - IREA Universitat de Barcelona http://www.ub.edu/riskcenter

More information

Measuring Risk Dependencies in the Solvency II-Framework. Robert Danilo Molinari Tristan Nguyen WHL Graduate School of Business and Economics

Measuring Risk Dependencies in the Solvency II-Framework. Robert Danilo Molinari Tristan Nguyen WHL Graduate School of Business and Economics Measuring Risk Dependencies in the Solvency II-Framework Robert Danilo Molinari Tristan Nguyen WHL Graduate School of Business and Economics 1 Overview 1. Introduction 2. Dependency ratios 3. Copulas 4.

More information

Monotone, Convex and Extrema

Monotone, Convex and Extrema Monotone Functions Function f is called monotonically increasing, if Chapter 8 Monotone, Convex and Extrema x x 2 f (x ) f (x 2 ) It is called strictly monotonically increasing, if f (x 2) f (x ) x < x

More information

Chapter 4: Commonly Used Distributions. Statistics for Engineers and Scientists Fourth Edition William Navidi

Chapter 4: Commonly Used Distributions. Statistics for Engineers and Scientists Fourth Edition William Navidi Chapter 4: Commonly Used Distributions Statistics for Engineers and Scientists Fourth Edition William Navidi 2014 by Education. This is proprietary material solely for authorized instructor use. Not authorized

More information

درس هفتم یادگیري ماشین. (Machine Learning) دانشگاه فردوسی مشهد دانشکده مهندسی رضا منصفی

درس هفتم یادگیري ماشین. (Machine Learning) دانشگاه فردوسی مشهد دانشکده مهندسی رضا منصفی یادگیري ماشین توزیع هاي نمونه و تخمین نقطه اي پارامترها Sampling Distributions and Point Estimation of Parameter (Machine Learning) دانشگاه فردوسی مشهد دانشکده مهندسی رضا منصفی درس هفتم 1 Outline Introduction

More information

KURTOSIS OF THE LOGISTIC-EXPONENTIAL SURVIVAL DISTRIBUTION

KURTOSIS OF THE LOGISTIC-EXPONENTIAL SURVIVAL DISTRIBUTION KURTOSIS OF THE LOGISTIC-EXPONENTIAL SURVIVAL DISTRIBUTION Paul J. van Staden Department of Statistics University of Pretoria Pretoria, 0002, South Africa paul.vanstaden@up.ac.za http://www.up.ac.za/pauljvanstaden

More information

discussion Papers Some Flexible Parametric Models for Partially Adaptive Estimators of Econometric Models

discussion Papers Some Flexible Parametric Models for Partially Adaptive Estimators of Econometric Models discussion Papers Discussion Paper 2007-13 March 26, 2007 Some Flexible Parametric Models for Partially Adaptive Estimators of Econometric Models Christian B. Hansen Graduate School of Business at the

More information

Operational Risk Aggregation

Operational Risk Aggregation Operational Risk Aggregation Professor Carol Alexander Chair of Risk Management and Director of Research, ISMA Centre, University of Reading, UK. Loss model approaches are currently a focus of operational

More information

Session 5. Predictive Modeling in Life Insurance

Session 5. Predictive Modeling in Life Insurance SOA Predictive Analytics Seminar Hong Kong 29 Aug. 2018 Hong Kong Session 5 Predictive Modeling in Life Insurance Jingyi Zhang, Ph.D Predictive Modeling in Life Insurance JINGYI ZHANG PhD Scientist Global

More information

Bloomberg. Portfolio Value-at-Risk. Sridhar Gollamudi & Bryan Weber. September 22, Version 1.0

Bloomberg. Portfolio Value-at-Risk. Sridhar Gollamudi & Bryan Weber. September 22, Version 1.0 Portfolio Value-at-Risk Sridhar Gollamudi & Bryan Weber September 22, 2011 Version 1.0 Table of Contents 1 Portfolio Value-at-Risk 2 2 Fundamental Factor Models 3 3 Valuation methodology 5 3.1 Linear factor

More information

Institute of Actuaries of India Subject CT6 Statistical Methods

Institute of Actuaries of India Subject CT6 Statistical Methods Institute of Actuaries of India Subject CT6 Statistical Methods For 2014 Examinations Aim The aim of the Statistical Methods subject is to provide a further grounding in mathematical and statistical techniques

More information

Logit Models for Binary Data

Logit Models for Binary Data Chapter 3 Logit Models for Binary Data We now turn our attention to regression models for dichotomous data, including logistic regression and probit analysis These models are appropriate when the response

More information

On Some Test Statistics for Testing the Population Skewness and Kurtosis: An Empirical Study

On Some Test Statistics for Testing the Population Skewness and Kurtosis: An Empirical Study Florida International University FIU Digital Commons FIU Electronic Theses and Dissertations University Graduate School 8-26-2016 On Some Test Statistics for Testing the Population Skewness and Kurtosis:

More information

Cambridge University Press Risk Modelling in General Insurance: From Principles to Practice Roger J. Gray and Susan M.

Cambridge University Press Risk Modelling in General Insurance: From Principles to Practice Roger J. Gray and Susan M. adjustment coefficient, 272 and Cramér Lundberg approximation, 302 existence, 279 and Lundberg s inequality, 272 numerical methods for, 303 properties, 272 and reinsurance (case study), 348 statistical

More information

Dynamic Replication of Non-Maturing Assets and Liabilities

Dynamic Replication of Non-Maturing Assets and Liabilities Dynamic Replication of Non-Maturing Assets and Liabilities Michael Schürle Institute for Operations Research and Computational Finance, University of St. Gallen, Bodanstr. 6, CH-9000 St. Gallen, Switzerland

More information

Lecture 6: Non Normal Distributions

Lecture 6: Non Normal Distributions Lecture 6: Non Normal Distributions and their Uses in GARCH Modelling Prof. Massimo Guidolin 20192 Financial Econometrics Spring 2015 Overview Non-normalities in (standardized) residuals from asset return

More information

Characterization of the Optimum

Characterization of the Optimum ECO 317 Economics of Uncertainty Fall Term 2009 Notes for lectures 5. Portfolio Allocation with One Riskless, One Risky Asset Characterization of the Optimum Consider a risk-averse, expected-utility-maximizing

More information

The Bernoulli distribution

The Bernoulli distribution This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this

More information

Operational Risk: Evidence, Estimates and Extreme Values from Austria

Operational Risk: Evidence, Estimates and Extreme Values from Austria Operational Risk: Evidence, Estimates and Extreme Values from Austria Stefan Kerbl OeNB / ECB 3 rd EBA Policy Research Workshop, London 25 th November 2014 Motivation Operational Risk as the exotic risk

More information

Labor Economics Field Exam Spring 2014

Labor Economics Field Exam Spring 2014 Labor Economics Field Exam Spring 2014 Instructions You have 4 hours to complete this exam. This is a closed book examination. No written materials are allowed. You can use a calculator. THE EXAM IS COMPOSED

More information

Chapter 5. Statistical inference for Parametric Models

Chapter 5. Statistical inference for Parametric Models Chapter 5. Statistical inference for Parametric Models Outline Overview Parameter estimation Method of moments How good are method of moments estimates? Interval estimation Statistical Inference for Parametric

More information

An Information Based Methodology for the Change Point Problem Under the Non-central Skew t Distribution with Applications.

An Information Based Methodology for the Change Point Problem Under the Non-central Skew t Distribution with Applications. An Information Based Methodology for the Change Point Problem Under the Non-central Skew t Distribution with Applications. Joint with Prof. W. Ning & Prof. A. K. Gupta. Department of Mathematics and Statistics

More information

2.1 Random variable, density function, enumerative density function and distribution function

2.1 Random variable, density function, enumerative density function and distribution function Risk Theory I Prof. Dr. Christian Hipp Chair for Science of Insurance, University of Karlsruhe (TH Karlsruhe) Contents 1 Introduction 1.1 Overview on the insurance industry 1.1.1 Insurance in Benin 1.1.2

More information

Some Characteristics of Data

Some Characteristics of Data Some Characteristics of Data Not all data is the same, and depending on some characteristics of a particular dataset, there are some limitations as to what can and cannot be done with that data. Some key

More information

P VaR0.01 (X) > 2 VaR 0.01 (X). (10 p) Problem 4

P VaR0.01 (X) > 2 VaR 0.01 (X). (10 p) Problem 4 KTH Mathematics Examination in SF2980 Risk Management, December 13, 2012, 8:00 13:00. Examiner : Filip indskog, tel. 790 7217, e-mail: lindskog@kth.se Allowed technical aids and literature : a calculator,

More information

FE670 Algorithmic Trading Strategies. Stevens Institute of Technology

FE670 Algorithmic Trading Strategies. Stevens Institute of Technology FE670 Algorithmic Trading Strategies Lecture 4. Cross-Sectional Models and Trading Strategies Steve Yang Stevens Institute of Technology 09/26/2013 Outline 1 Cross-Sectional Methods for Evaluation of Factor

More information