Backtesting Stochastic Mortality Models: An Ex-Post Evaluation of Multi-Period-Ahead Density Forecasts

Centre for Risk & Insurance Studies: enhancing the understanding of risk and insurance

CRIS Discussion Paper Series 2008.IV

Kevin Dowd*, Andrew J.G. Cairns#, David Blake, Guy D. Coughlan, David Epstein, Marwa Khalaf-Allah

September 8, 2008

Abstract

This study sets out a backtesting framework applicable to the multi-period-ahead forecasts from stochastic mortality models and uses it to evaluate the forecasting performance of six different stochastic mortality models applied to English & Welsh male mortality data. The models considered are: Lee-Carter's 1992 one-factor model; a version of Renshaw-Haberman's 2006 extension of the Lee-Carter model to allow for a cohort effect; the age-period-cohort model of Currie (2006), which is a simplified version of Renshaw-Haberman; Cairns, Blake and Dowd's 2006 two-factor model; and two generalised versions of the latter with an added cohort effect. For the data set used herein, the results from applying this methodology suggest that the models perform adequately by most backtests, and that there is little difference between the performances of five of the models. The remaining model, however, shows forecast instability. The study also finds that density forecasts that allow for uncertainty in the parameters of the mortality model are more plausible than forecasts that do not allow for such uncertainty.

Key words: backtesting, forecasting performance, mortality models

* Centre for Risk & Insurance Studies, Nottingham University Business School, Jubilee Campus, Nottingham, NG8 1BB, United Kingdom. Corresponding author: email at Kevin.Dowd@nottingham.ac.uk. The authors thank Lixia Loh and Liang Zhao for excellent research assistance.

# Maxwell Institute for Mathematical Sciences, and Actuarial Mathematics and Statistics, Heriot-Watt University, Edinburgh, EH14 4AS, United Kingdom.

Pensions Institute, Cass Business School, 106 Bunhill Row, London, EC1Y 8TZ, United Kingdom.

Pension ALM Group, JPMorgan Chase Bank, 125 London Wall, London EC2Y 5AJ, United Kingdom.

1. Introduction

A recent paper, Cairns et al. (2007), examined the empirical fits of eight different stochastic mortality models, variously labelled M1 to M8. Seven of these models were further analysed in a second study, Cairns et al. (2008), which evaluated the ex ante plausibility of the models' probability density forecasts.¹ Amongst other findings, that study found that one of these models, M8, a version of the Cairns-Blake-Dowd (CBD) model (Cairns et al., 2006) with an allowance for a cohort effect, generated implausible forecasts on US data, and consequently this model was also dropped from further consideration. A third study, Dowd et al. (2008), then examined the goodness of fit of the remaining six models by analysing the statistical properties of their various residual series. The six models that were examined are: M1, the Lee-Carter model (Lee and Carter, 1992); M2B, a version of Renshaw and Haberman's cohort-effect generalisation of the Lee-Carter model (Renshaw and Haberman, 2006); M3B, a version of Currie's age-period-cohort model (Currie, 2006), which is a simplified version of M2B;² M5, the CBD model; and M6 and M7, two alternative cohort-effect generalisations of the CBD model. Details of these models' specifications are given in Appendix A.

It is quite possible for a model to provide a good in-sample fit to historical data and produce forecasts that appear to be plausible ex ante, but still produce poor ex-post forecasts, that is, forecasts that differ significantly from subsequently realised outcomes. A good model should therefore produce forecasts that perform well out-of-sample when evaluated using appropriate forecast evaluation or backtesting methods, as well as provide good fits to the historical data and plausible forecasts ex ante. The primary purpose of the present paper is, accordingly, to set out a backtesting framework that can be used to evaluate the ex-post forecasting performance of a mortality model.

¹ The model omitted from this second study was the P-splines model of Currie et al. (2004) and Currie (2006). This model was dropped in part because of its poor performance relative to the other models when assessed by the Bayes Information Criterion, and in part because it cannot be used to project stochastic mortality outcomes.

² M2B and M3B are the versions of M2 and M3 that assume an ARIMA(1,1,0) process for the cohort effect (see Cairns et al., 2008).

A secondary purpose of the paper is to illustrate this backtesting framework by applying it to the six models listed above. The backtesting framework is applied to each model over various forecast horizons using a particular data set, namely, LifeMetrics data for the mortality rates of English & Welsh males³ for ages varying from 60 to 89 and spanning the years 1971 to 2006.

The backtesting of mortality models is still in its infancy. Most studies that assess mortality forecasts focus on ex-post forecast errors (see, e.g., Keilman (1997, 1998), National Research Council (2000), or Koissi et al. (2006)). This is limited in so far as it ignores all information contained in the probability density forecasts, except for the information reflected in the mean forecast or 'best estimate'. A more sophisticated treatment, and a good indicator of the current state of best practice in the evaluation of mortality models, is provided by Lee and Miller (2001). They evaluated the performance of the Lee-Carter (1992) model by examining the behaviour of forecast errors (comparing MAD, RMSE, etc.) and plots of percentile error distributions, although they did not report any formal test results based on these latter plots. However, they also reported plots showing mortality prediction intervals and subsequently realised mortality rates, and the frequencies with which realised mortality observations fell within the prediction intervals. These provide a more formal (i.e., probabilistic) sense of model performance. More recently, CMI (2006) included backtesting evaluations of the P-spline model and CMI (2007) included backtesting evaluations of M1 and M2. Their evaluations were based on plots of realised outcomes against forecasts, where the forecasts included both projections of central values and projections of prediction intervals, which also give some probabilistic sense of model performance.

The backtesting framework used in this paper can best be understood if we outline the following key steps:

1. We begin by selecting the metric of interest, namely the forecasted variable that is the focus of the backtest. Possible metrics include the mortality rate, life expectancy, future survival rates, and the prices of annuities and other life-contingent financial instruments. Different metrics are relevant for different purposes: for example, in evaluating the effectiveness of a hedge of longevity or mortality risk, the relevant metric is the monetary value of the underlying exposure. In this paper, we focus on the mortality rate itself, but, in principle, backtests could be conducted on any of these other metrics as well.

2. We select the historical lookback window which is used to estimate the parameters of each model for any given year: thus, if we wish to estimate the parameters for year t and we use a lookback window of length n, then we are estimating the parameters for year t using observations from years t-n+1 to t. In this paper, we use a fixed-length⁴ lookback window of 10 years.⁵

3. We select the horizon (i.e., the lookforward window) over which we will make our forecasts, based on the estimated parameters of the model. In the present study, we focus on relatively long-horizon forecasts, because it is with the accuracy of these forecasts that pension plans are principally concerned, but which also pose the greatest modelling challenges.

4. Given the above, we decide on the backtest to be implemented and specify what constitutes a 'pass' or 'fail' result. Note that we use the term 'backtest' here to refer to any method of evaluating forecasts against subsequently realised outcomes. Backtests in this sense might involve the use of plots whose goodness of fit is interpreted informally, as well as formal statistical tests of predictions generated under the null hypothesis that a model produces adequate forecasts.

The above framework for backtesting stochastic mortality models is a very general one.

³ See Coughlan et al. (2007) and www.lifemetrics.com for the data and a description of LifeMetrics. The original source of the data was the UK Office for National Statistics.

⁴ The use of a fixed-length lookback window simplifies the underlying statistics and means that our results are equally responsive to the 'news' in any given year's observations. By contrast, an expanding lookback window is more difficult to handle and also means that, as time goes by, the news contained in the most recent observations receives less weight, relative to the expanding number of older observations in the lookback sample.

⁵ We choose a 10-year lookback window because that is the window emerging as the standard amongst market practitioners. However, we acknowledge that a lookback window of 10 years is not necessarily optimal from a forecasting perspective. The choice of lookback window inevitably involves a tradeoff: a long window is less responsive than a short window to changes in the trend of mortality rates, but is less influenced by random fluctuations in mortality rates than a short window. The choice of lookback window is therefore to some extent subjective.
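The window arithmetic in steps 2 and 3 can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical and not part of the paper, and the only inputs taken from the text are the data years (1971 to 2006) and the 10-year lookback length.

```python
def lookback_window(t, n):
    """Years t-n+1, ..., t used to estimate the model for stepping-off year t."""
    if n < 1:
        raise ValueError("lookback length n must be at least 1")
    return list(range(t - n + 1, t + 1))


def lookforward_window(t, horizon):
    """Forecast years t+1, ..., t+horizon."""
    return list(range(t + 1, t + horizon + 1))


def stepping_off_years(first_data_year, last_data_year, n):
    """Every year t for which a full n-year lookback window is available."""
    return list(range(first_data_year + n - 1, last_data_year + 1))


# Data years and window length as stated in the text (1971-2006, n = 10).
years = stepping_off_years(1971, 2006, 10)
print(years[0], lookback_window(years[0], 10))
print(lookforward_window(1980, 26)[-1])
```

With these inputs the earliest stepping-off year for which a full window exists is 1980, which is consistent with the first forecasts discussed in the backtests that follow.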

Within this broad framework, we implement the following four backtest procedures for each model, noting that the metric (mortality rate) and lookback windows (10 years) are the same in each case:

• Contracting horizon backtests: First, we evaluate the convergence properties of the forecast mortality rate to the actual mortality rate at a specified future date. We chose 2006 as the forecast date and we examine forecasts of that year's mortality rate, where the forecasts are made at different dates in the past. The first forecast was made with a model estimated from 10 years of historical observations up to 1980, the second with the model estimated from 10 years of historical observations up to 1981, and so forth, up to 2006. In other words, we are examining forecasts with a fixed end date and a contracting horizon from 26 years ahead down to one year ahead. The backtest procedure then involves a graphical comparison of the evolution of the forecasts towards the actual mortality rate for 2006, and intuitively good forecasts should converge in a fairly steady manner towards the realised value for 2006.

• Expanding horizon backtests: Second, we consider the accuracy of forecasts of mortality rates over expanding horizons from a common fixed start date, or 'stepping-off' year. For example, the start date might be 1980, and the forecasts might be for one up to 26 years ahead, that is, for 1981 up to 2006. The backtest procedure again involves a graphical comparison of forecasts against realised outcomes, and the goodness of the forecasts can be assessed in terms of their closeness to the realised outcomes.

• Rolling fixed-length horizon backtests: Third, we consider the accuracy of forecasts over fixed-length horizons as the stepping-off date moves sequentially forward through time. This involves examining plots of mortality prediction intervals for some fixed-length horizon (e.g., 15 years) rolling forward over time with subsequently realised outcomes superimposed on them.

• Mortality probability density forecast tests: Fourth, we carry out formal hypothesis tests in which we use each model as of one or more start dates to simulate the forecasted mortality probability density at the end of some horizon period, and we then compare the realised mortality rate(s) against this forecasted probability density (or densities). The forecast passes the test if each realised rate lies within the more central region of the forecast probability density and the forecast fails if it lies too far out in either tail to be plausible given the forecasted probability density.⁶

For each of the four classes of test, we examine two types of mortality forecast. The first of these are forecasts in which it is assumed that the parameters of the mortality models are known with certainty. The second are forecasts in which we make allowance for the fact that the parameters of the models are only estimated, i.e., their true values are unknown. We would regard the latter as more plausible, and evidence from other studies suggests that allowing for parameter uncertainty can make a considerable difference to estimates of quantifiable uncertainty (see, e.g., Cairns et al. (2006), Dowd et al. (2006) or Blake et al. (2008)). For convenience we label these the parameter certain (PC) and parameter uncertain (PU) cases respectively.⁷ More details of the latter case are given in Appendix B.

This paper is organised as follows. Section 2 considers the contracting horizon backtests, section 3 considers the expanding horizon backtests, section 4 considers the rolling fixed-length backtests for an illustrative horizon of 15 years, and section 5 considers the mortality probability forecast tests. Section 6 concludes.

⁶ In principle, we could also perform further backtests based on the Probability Integral Transformation (PIT) series that are predicted to be standard uniform under the null of forecast adequacy, which we might do, e.g., by tests of the Kolmogorov-Smirnov family (e.g., Kolmogorov-Smirnov test, Kuiper tests, Lilliefors tests, etc.). We could also put the PIT series through an inverse standard normal distribution function as suggested by Berkowitz (2001) and then test these series for the predictions of standard normality, e.g., a zero mean, a unit standard deviation, a Jarque-Bera test of the skewness and kurtosis predictions of normality, etc. We carried out such tests but do not report them here because the PIT and Berkowitz series are not predicted to be iid in a multi-period ahead forecasting context (and more to the point, in many cases iid is clearly rejected), and the absence of iid undermines the validity of standard tests (e.g., such as the Kolmogorov-Smirnov test). A potential solution is offered by Dowd (2007), who suggests that the Berkowitz series should be a moving average of iid standard normal series whose order is determined by the properties of the year-ahead forecast, but this is only a conjecture. An alternative possibility is to fit some ad hoc process to the Berkowitz series (e.g., such as an ARMA type process, as suggested by Dowd, 2008). We have not followed up these suggestions in the present study.

⁷ To be more precise, the PU versions of the model use a Bayesian approach to take account of the uncertainty in the parameters driving the period and, where applicable, the cohort effects. There is no need to take account of uncertainty in the age effects present, because we can rely on the law of large numbers to give us fairly precise estimates of these latter effects, i.e., so the estimation errors in the age effects will be negligible.

2. Contracting Horizon Backtests: Examining the Convergence of Forecasts through Time

The first kind of backtest in our framework examines the consistency of forecasts for a fixed future year (in our example below, the year 2006) made in earlier years. For a well-behaved model we would expect consecutive forecasts to converge towards the realised outcome as the date the forecasts are made (the stepping-off date) approaches the forecast year.

Figures 1, 2 and 3 show plots of forecasts of the 2006 mortality rate for 65-year-old, 75-year-old and 85-year-old males, respectively, made in years 1980, 1981, ..., 2006. The central lines in each plot are the relevant model's median forecast of the 2006 mortality rate based on 5,000 simulation paths, and the dotted lines on either side are estimates of the model's 90% prediction interval or risk bounds (i.e., 5th and 95th percentiles). The starred point in each chart is the realised mortality rate for that age in 2006.

These plots show the following patterns:

• For models M1, M3B, M5, M6 and M7, the forecasts tend to decline in a stable way over time towards a value close to the realised value and the prediction intervals narrow over time.⁸

• For these same models, the PU prediction intervals are considerably wider than the PC prediction intervals, but there is negligible difference between the PC and PU median plots.

• For M2B, the forecasts are sometimes unstable, exhibiting major spikes, and there is also typically less difference between the PC and PU forecasts than is the case with the other models.

Note that we have only evaluated the convergence of the models for one end date and one data set. This is insufficient to draw general conclusions about the forecasting capabilities of the various models. However, the instability observed in M2 (or at least our variant of M2, M2B) is an issue.

⁸ It is noteworthy, however, that models M1 and M5 both produce projected mortality rates for the mid-60s that are systematically below the observed mortality rate in 2006. This appears to be due to a significant cohort effect in the data that the other models are picking up, and this would explain why the realised mortality rates for these models for the age-65 charts in Figure 1 are a little out of line with forecasts from the periods running up to 2006.

Figure 1: Forecasts of the 2006 Mortality Rate from 1980 Onwards: Males aged 65

[Six panels: Models M1, M2B, M3B, M5, M6 and M7. Horizontal axis: stepping-off year.]

Notes: Forecasts based on estimates using English & Welsh male mortality data for ages 60-89 and a rolling 10-year historical window. The stepping-off year is the final year in the rolling window. The fitted model is then used to estimate the median and 90% prediction interval for both parameter-certain forecasts (given by the dashed lines) and parameter-uncertain cases (given by the continuous lines). The realised mortality rate for 2006 is denoted by *. Based on 5,000 simulation trials.
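The median and the 90% prediction interval plotted for each stepping-off year are summaries of a set of simulated outcomes, and the pass/fail idea behind the density forecast tests is a comparison of the realised rate with that simulated distribution. The following Python sketch illustrates both under stated assumptions: the function names are hypothetical, the simulated rates are toy lognormal draws rather than output from any of the six models, and the percentile interpolation rule (the "inclusive" method of the standard library) is an arbitrary choice, not one specified in the paper.

```python
import math
import random
import statistics


def prediction_interval(simulated, lower_pct=5, upper_pct=95):
    """Return (lower bound, median, upper bound) of the simulated rates."""
    # quantiles(..., n=100) returns the 99 cut points for percentiles 1..99.
    cuts = statistics.quantiles(simulated, n=100, method="inclusive")
    return cuts[lower_pct - 1], statistics.median(simulated), cuts[upper_pct - 1]


def empirical_pit(simulated, realised):
    """Share of simulated rates at or below the realised rate."""
    return sum(x <= realised for x in simulated) / len(simulated)


def within_interval(simulated, realised, lower_pct=5, upper_pct=95):
    """True if the realised rate lies inside the central prediction interval."""
    lower, _, upper = prediction_interval(simulated, lower_pct, upper_pct)
    return lower <= realised <= upper


# Toy stand-in for 5,000 simulated mortality rates; NOT model output.
rng = random.Random(0)
simulated = [0.02 * math.exp(rng.gauss(0.0, 0.2)) for _ in range(5000)]
lower, median, upper = prediction_interval(simulated)
print(f"median={median:.4f}, 90% interval=({lower:.4f}, {upper:.4f})")
print(within_interval(simulated, realised=0.019), empirical_pit(simulated, 0.019))
```

The value returned by `empirical_pit` is the empirical counterpart of the PIT mentioned in footnote 6; as that footnote notes, a series of such values is not expected to be iid in a multi-period-ahead setting, so it should not be fed into standard tests without further thought.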

Figure 2: Forecasts of the 2006 Mortality Rate from 1980 Onwards: Males aged 75

[Six panels: Models M1, M2B, M3B, M5, M6 and M7. Horizontal axis: stepping-off year.]

Notes: As per Figure 1.

Figure 3: Forecasts of the 2006 Mortality Rate from 1980 Onwards: Males aged 85

[Six panels: Models M1, M2B, M3B, M5, M6 and M7. Horizontal axis: stepping-off year.]

Notes: As per Figure 1.
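The convergence assessment in Section 2 is graphical, but the notion of a 'spike' in a sequence of median forecasts can be made concrete with a simple numerical screen. The sketch below is an assumption-laden illustration and not the authors' procedure: the function name, the 25% tolerance and the example figures are all hypothetical.

```python
def flag_spikes(median_by_year, tolerance=0.25):
    """Stepping-off years whose median forecast moves by more than `tolerance`
    (as a proportion) relative to the previous stepping-off year."""
    years = sorted(median_by_year)
    flagged = []
    for previous, current in zip(years, years[1:]):
        change = abs(median_by_year[current] / median_by_year[previous] - 1.0)
        if change > tolerance:
            flagged.append(current)
    return flagged


# Hypothetical median forecasts of one target-year rate, NOT taken from the paper.
steady = {1980: 0.030, 1981: 0.029, 1982: 0.028, 1983: 0.027}
spiky = {1980: 0.030, 1981: 0.029, 1982: 0.045, 1983: 0.028}
print(flag_spikes(steady), flag_spikes(spiky))
```

A screen of this kind flags both the jump into a spike and the jump back out of it, so a single aberrant forecast typically produces two flagged years.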

3. Expanding Horizon Backtests

In the second class of backtest, we consider the accuracy of forecasts over increasing horizons against the realised outcomes for those horizons. Accuracy is reflected in the degree of consistency between the outcome and the prediction interval associated with each forecast. These backtests are best evaluated graphically using charts of mortality prediction intervals.

Figure 4 shows the mortality prediction intervals for age 65 for forecasts starting in 1980 (with model parameters estimated using data from the preceding 10 years). The charts show the 90% prediction intervals (or 'risk bounds') as dashed lines for the PC forecasts and continuous lines for the PU forecasts. Roughly speaking, if a model is adequate, we can be 90% confident of any given outcome occurring between the dashed risk bounds if we believe the PC forecasts, and we can be 90% confident of any given outcome occurring between the continuous risk bounds if we believe the PU forecasts.

The prediction intervals all show the same basic shape: they fan out somewhat over time around a gradually decreasing trend and show a little more uncertainty on the upper side than on the lower side. But the most striking finding is that the PU risk bounds are considerably wider than their PC equivalents, indicating that the PU forecasts are more plausible (in the sense of having higher likelihoods) than the PC ones.

The charts also show the median mortality forecasts as dotted lines for the PC forecasts and continuous lines for the PU forecasts, and we see the median projections falling as the forecast horizon lengthens. As in the earlier Figures, there are virtually no differences between the PC- and PU-based median projections.

Superimposed on the charts are the realised outcomes indicated by stars. For the most part, the paths of realised outcomes tend to be below the median projection and move downwards over time relative to the forecasts. This suggests that, viewed from 1980, most forecasts are biased upwards (i.e., they under-estimate the downward trend of future mortality rates) and the bias tends to increase with the length of the forecast horizon. However, the size and significance of this bias varies considerably across different models and between PC and PU forecasts.

Figure 4: Mortality Prediction-Interval Charts from 1980: Males aged 65

[Six panels: Models M1, M2B, M3B, M5, M6 and M7, each annotated with PC and PU exceedance counts [xl, xm, xu, n]. Horizontal axis: year.]

Notes: Forecasts based on estimates using English & Welsh male mortality data for ages 60-89 and years 1971-1980. The dashed lines refer to the forecast medians and bounds of the 90% prediction interval for the parameter-certain (PC) forecasts and continuous lines are their equivalents for the parameter-uncertain (PU) forecasts. The realised mortality rates are denoted by *. For each of these cases, xl and xm are the numbers of realised rates below the lower 5% and 50% prediction bounds, xu is the number of realised mortality rates above the upper 5% bound and n is the number of forecasts including that for the starting point of the forecasts. Based on 5,000 simulation trials.

Figure 4 also shows quadruples in the form [xl, xm, xu, n], where n is the number of mortality forecasts, xl is the number of lower exceedances or the number of realised outcomes out of n falling below the lower risk bound, xm is the number of observations falling below the projected median and xu is the number of upper exceedances or observations falling above the upper risk bound.⁹ These statistics provide useful indicators of the adequacy of model forecasts. If the forecasts are adequate, for any given PC or PU case, we would expect xl and xu each to be about 5% of n and we would expect xm to be about 50% of n. Too many exceedances, on the other hand, would suggest that the relevant bounds were incorrectly forecasted: for example, too many realisations above the upper risk bound would suggest that the upper risk bound is too low; too many observations below the median bound would suggest that the forecasts are biased upwards; and too many observations below the lower risk bound would suggest that the lower risk bound is too high.

⁹ Note though that the positions of the individual observations within the prediction intervals are not independent. If, for example, the 1990 observation is low, then the 1995 observation is also likely to be low.

As a robustness check, Figure 5 shows the comparable chart for 65-year-old males starting from year 1990 (with model parameters estimated using data for years 1981-1990), with horizons now going out 16 years ahead to 2006. Figures 6-9 show the corresponding results for ages 75 and 85 for both PC and PU forecasts. These Figures suggest the following:

• The models forecast decreases in median mortality rates, forecast risk bounds that fan out over the forecast horizon, and in most cases considered show a bias that increases with the forecast horizon.

• The PU forecasts perform better than the PC forecasts.

• Forecast performance tends to improve with higher ages.

• For the sample periods considered, the forecasts are fairly robust to the choice of sample period.
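The quadruple [xl, xm, xu, n] is a straightforward count over the forecast horizon. A minimal Python sketch follows; the function name and the example numbers are hypothetical and are not the values reported in the Figures.

```python
def exceedance_counts(realised, lower, median, upper):
    """Return [xl, xm, xu, n] for one model and one PC or PU forecast set.

    xl: realised rates below the lower risk bound
    xm: realised rates below the forecast median
    xu: realised rates above the upper risk bound
    n : number of forecasts
    """
    n = len(realised)
    if not (len(lower) == len(median) == len(upper) == n):
        raise ValueError("all series must cover the same forecast years")
    xl = sum(r < lo for r, lo in zip(realised, lower))
    xm = sum(r < m for r, m in zip(realised, median))
    xu = sum(r > u for r, u in zip(realised, upper))
    return [xl, xm, xu, n]


# Hypothetical four-year example, NOT data from the paper.
realised = [0.030, 0.028, 0.024, 0.020]
lower = [0.029, 0.027, 0.025, 0.023]
median = [0.031, 0.030, 0.029, 0.028]
upper = [0.033, 0.033, 0.033, 0.033]
print(exceedance_counts(realised, lower, median, upper))
```

In this made-up example every realised rate lies below the median and the later ones fall below the lower bound, which is the pattern the text associates with forecasts that are biased upwards.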

Figure 5: Moraliy Predicion-Inerval Chars from 99: Males aged 65.5 PC: [xl, xm, xu, n] = [, 6,, 7].4.3.2 Males aged 65: Model M PU: [xl, xm, xu, n] = [, 6,, 7]. 99 992 994 996 998 2 22 24 26.5 PC: [xl, xm, xu, n] = [7, 6,, 7].4.3.2 Males aged 65: Model M3B PU: [xl, xm, xu, n] = [, 6,, 7]. 99 992 994 996 998 2 22 24 26.6.5.4.3.2 Males aged 65: Model M6 PU: [xl, xm, xu, n] = [2, 6,, 7] PC: [xl, xm, xu, n] = [4, 6,, 7]. 99 992 994 996 998 2 22 24 26 Year.5 PC: [xl, xm, xu, n] = [3, 6,, 7].4.3.2 Males aged 65: Model M2B PU: [xl, xm, xu, n] = [, 6,, 7]. 99 992 994 996 998 2 22 24 26.5 PC: [xl, xm, xu, n] = [7, 7,, 7].4.3.2 Males aged 65: Model M5 PU: [xl, xm, xu, n] = [5, 7,, 7]. 99 992 994 996 998 2 22 24 26.6.5.4.3.2 Males aged 65: Model M7 PU: [xl, xm, xu, n] = [, 4,, 7] PC: [xl, xm, xu, n] = [, 4,, 7]. 99 992 994 996 998 2 22 24 26 Year Noes: As per Figure 4, bu using daa over years 98-99. Figure 6: Moraliy Predicion-Inerval Chars from 98: Males aged 75. Males aged 75: Model M PU: [xl, xm, xu, n] = [, 27,, 27]. Males aged 75: Model M2B PU: [xl, xm, xu, n] = [, 27,, 27].8.6.4 PC: [xl, xm, xu, n] = [2, 27,, 27].8.6.4 PC: [xl, xm, xu, n] = [3, 27,, 27] 98 985 99 995 2 25 Males aged 75: Model M3B. PU: [xl, xm, xu, n] = [, 27,, 27] 98 985 99 995 2 25 Males aged 75: Model M5. PU: [xl, xm, xu, n] = [, 25,, 27].8.6.4 PC: [xl, xm, xu, n] = [8, 27,, 27].8.6.4 PC: [xl, xm, xu, n] = [7, 25,, 27] 98 985 99 995 2 25 Males aged 75: Model M6. PU: [xl, xm, xu, n] = [, 27,, 27] 98 985 99 995 2 25 Males aged 75: Model M7. PU: [xl, xm, xu, n] = [, 27,, 27].8.6.4 PC: [xl, xm, xu, n] = [8, 27,, 27].8.6.4 PC: [xl, xm, xu, n] = [9, 27,, 27] 98 985 99 995 2 25 Year 98 985 99 995 2 25 Year Noes: As per Figure 4. 3

Figure 7: Mortality Prediction-Interval Charts from 1990: Males aged 75
[Chart: six panels, one per model (M1, M2B, M3B, M5, M6, M7), each reporting the PC and PU exceedance counts [xl, xm, xu, n]; x-axis: Year.]
Notes: As per Figure 5.

Figure 8: Mortality Prediction-Interval Charts from 1980: Males aged 85
[Chart: six panels, one per model (M1, M2B, M3B, M5, M6, M7), each reporting the PC and PU exceedance counts [xl, xm, xu, n]; x-axis: Year.]
Notes: As per Figure 4.

Figure 9: Mortality Prediction-Interval Charts from 1990: Males aged 85
[Chart: six panels, one per model (M1, M2B, M3B, M5, M6, M7), each reporting the PC and PU exceedance counts [xl, xm, xu, n]; x-axis: Year.]
Notes: As per Figure 5.

It is also useful to investigate further the performance of the different models across PC and PU forecasts. To help do so, we can examine Table 1, which summarises the models' average scores (i.e., xl, xm and xu) from Figures 4-9, where the averages are taken for each model across all these Figures. The first six rows give the average percentages of outcomes that are exceedances (i.e., xl/n, xm/n and xu/n). The next six rows give the excess exceedances in absolute value form, that is, |xl/n - 0.05|, |xm/n - 0.5| and |xu/n - 0.05|. These are expected to be close to zero under the null hypothesis: they show how the actual exceedances compare against expectations under the null. The final six rows give the models' rankings by how close the actual exceedances are to expectations.
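The exceedance shares and their absolute excesses described above can be sketched as follows. This is a minimal illustration rather than the code used for the paper: the function name `exceedance_summary` and the input series are hypothetical, invented only to show the arithmetic.

```python
import numpy as np

def exceedance_summary(realised, lower, median, upper):
    """Count exceedances and compare their shares with the expected 5%/50%/5%."""
    realised = np.asarray(realised, dtype=float)
    n = realised.size
    xl = int(np.sum(realised < np.asarray(lower)))    # below the lower risk bound
    xm = int(np.sum(realised < np.asarray(median)))   # below the median bound
    xu = int(np.sum(realised > np.asarray(upper)))    # above the upper risk bound
    shares = {"xl/n": xl / n, "xm/n": xm / n, "xu/n": xu / n}
    expected = {"xl/n": 0.05, "xm/n": 0.50, "xu/n": 0.05}
    abs_excess = {k: abs(shares[k] - expected[k]) for k in shares}
    return {"counts": (xl, xm, xu, n), "shares": shares, "abs_excess": abs_excess}

# Hypothetical realised rates and forecast bounds (not data from the paper).
realised = [0.030, 0.028, 0.027, 0.024]
lower = [0.029, 0.027, 0.026, 0.025]
median = [0.031, 0.030, 0.029, 0.028]
upper = [0.033, 0.033, 0.033, 0.033]

summary = exceedance_summary(realised, lower, median, upper)
print(summary["counts"])      # counts in the [xl, xm, xu, n] order used in the Figures
print(summary["abs_excess"])  # distance of each share from its expected value
```

In this toy input every realised rate lies below the median bound, which is the pattern of upward forecast bias that the median exceedances are designed to reveal.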

Table 1: Exceedances Results for Expanding Horizon Backtests

                Parameters Certain           Parameters Uncertain
Model        xl/n     xm/n     xu/n        xl/n     xm/n     xu/n

(a) Exceedances
M1          27.3%    81.8%     1.5%        3.0%    81.8%     1.5%
M2B         33.3%    64.4%     1.5%       15.9%    65.9%     1.5%
M3B         24.2%    81.8%     2.3%        2.3%    81.8%     2.3%
M5          36.4%    81.8%     1.5%       13.6%    81.8%     1.5%
M6          30.3%    76.5%     2.3%       11.4%    76.5%     2.3%
M7          27.3%    78.0%     2.3%        3.0%    78.0%     2.3%

(b) Absolute values of excess exceedances: |xl/n - 0.05|, |xm/n - 0.5|, |xu/n - 0.05|
M1          22.3%    31.8%     3.5%        2.0%    31.8%     3.5%
M2B         28.3%    14.4%     3.5%       10.9%    15.9%     3.5%
M3B         19.2%    31.8%     2.7%        2.7%    31.8%     2.7%
M5          31.4%    31.8%     3.5%        8.6%    31.8%     3.5%
M6          25.3%    26.5%     2.7%        6.4%    26.5%     2.7%
M7          22.3%    28.0%     2.7%        2.0%    28.0%     2.7%

(c) Ranking by absolute values of excess exceedances (upper exceedances not ranked)
M1             =2       =4                   =1       =4
M2B             5        1                    6        1
M3B             1       =4                    3       =4
M5              6       =4                    5       =4
M6              4        2                    4        2
M7             =2        3                   =1        3

Notes: Based on the exceedance results in Figures 4-9. xl, xm and xu are the numbers of observations below the lower risk bound, below the median bound and above the upper risk bound respectively, and n is the number of forecasts.

The main points to note are:

Lower exceedances: The PC models in general have far too many lower exceedances. By contrast, the corresponding PU lower exceedances are much closer to expectations, and especially so for models M1 and M7 (which at 2% rank equal first by this criterion) and M3B, which follows close behind at 3%. These findings confirm our claim that the PU forecasts are more plausible than the PC forecasts.

Median exceedances: There are negligible differences between the PC and PU median exceedance results, and the proportions of median exceedances are much higher than expected. This confirms that there is, in general, a tendency for the forecasts to be biased upwards. The top performing models by this criterion are M2B (at 15.9% for the PU case), M6 (at 26.5%) and M7 (at 28%). The remaining models all score a little higher at 31.8%.

Upper exceedances: Finally, there are very few upper exceedances. In every case there are only either 0 or 1 upper exceedances, so the upper exceedances results are close to expectations and there are no notable differences across the models or across PC and PU forecasts. Accordingly, there is no point trying to rank the models' upper-exceedances performance.

4. Rolling Fixed-Length Horizon Backtests

In the third class of backtests, we consider the accuracy of forecasts over fixed horizon periods as these roll forward through time. Once again, accuracy is reflected in the degree of consistency between the collective set of realised outcomes and the prediction intervals associated with the forecasts. We examine the case of an illustrative 15-year forecast horizon and, from this point on, we restrict ourselves to backtesting the performance of the PU forecasts only.

Figures 10-15 show the rolling prediction intervals for each of the six models in turn. Each plot shows the rolling prediction intervals and realised mortality outcomes for ages 65, 75 and 85 over the years 1995 to 2006, where the forecasts are obtained using data from 15 years before. Figure 10 gives the rolling prediction intervals for model M1, Figure 11 gives those for M2B, and so forth. To facilitate comparison, the y-axes in all the charts have the same range on a logarithmic scale.

The most striking feature of these charts is the instability in the forecasts of M2B in Figure 11. As with Figures 1-3, this model generates occasional sharp and seemingly implausible spikes in the forecasts, which again suggests some degree of instability in the forecasts of this model. The other charts indicate that there is not much to choose between the other five models. The models all show periods in which the realised rates are within the prediction intervals for all ages, but especially for higher ages (i.e., age 85). They also show periods in which the realised rates have drifted below the prediction intervals, particularly for age 65.
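The timing of the rolling fixed-horizon backtest can be made concrete with a short sketch. It assumes, following the notes to Figure 10, a rolling 10-year estimation window whose first instance is 1971-1980, combined with the 15-year horizon used in this section; the function name `rolling_schedule` and its parameters are hypothetical.

```python
def rolling_schedule(first_window_start=1971, window=10, horizon=15, last_target=2006):
    """List (window start, window end, forecast target year) for each rolling backtest."""
    schedule = []
    start = first_window_start
    while True:
        end = start + window - 1   # last year of the estimation window
        target = end + horizon     # year whose mortality rate is forecast
        if target > last_target:   # no realised outcome to compare against
            break
        schedule.append((start, end, target))
        start += 1                 # roll the window forward by one year
    return schedule

schedule = rolling_schedule()
for start, end, target in schedule:
    print(f"estimate on {start}-{end} -> forecast {target}")
```

Under these assumptions the first window (1971-1980) targets 1995 and the last (1982-1991) targets 2006, which gives one forecast per year over the 1995 to 2006 range plotted in the Figures.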

Figure 10: Rolling 15-Year Ahead Prediction Intervals: M1
[Chart: rolling prediction intervals and realised mortality rates for ages 65, 75 and 85, with the exceedance counts [xl, xm, xu, n] reported for each age; x-axis: Year (1995-2006); y-axis: logarithmic scale.]
Notes: Model estimates based on English & Welsh male mortality data for ages 60-89 and a rolling 10-year historical window starting from 1971-1980. The continuous, short-dashed and long-dashed lines are the parameter-uncertain (PU) forecast medians and bounds of the 90% prediction interval for ages 65, 75 and 85, respectively. The realised mortality rates are denoted by *. For each age, xl and xm are the numbers of realised rates below the lower 5% and 50% prediction bounds, xu is the number of realised mortality rates above the upper 5% bound and n is the number of forecasts including that for the starting point of the forecasts. Based on 5,000 simulation trials.

Figure 11: Rolling 15-Year Ahead Prediction Intervals: M2B
[Chart: rolling prediction intervals and realised mortality rates for ages 65, 75 and 85, with the exceedance counts [xl, xm, xu, n] reported for each age; x-axis: Year (1995-2006); y-axis: logarithmic scale.]
Notes: As per Figure 10.

Figure 12: Rolling 15-Year Ahead Prediction Intervals: M3B
[Chart: rolling prediction intervals and realised mortality rates for ages 65, 75 and 85, with the exceedance counts [xl, xm, xu, n] reported for each age; x-axis: Year (1995-2006); y-axis: logarithmic scale.]
Notes: As per Figure 10.

Figure 13: Rolling 15-Year Ahead Prediction Intervals: M5
[Chart: rolling prediction intervals and realised mortality rates for ages 65, 75 and 85, with the exceedance counts [xl, xm, xu, n] reported for each age; x-axis: Year (1995-2006); y-axis: logarithmic scale.]
Notes: As per Figure 10.

Figure 14: Rolling 15-Year Ahead Prediction Intervals: M6
[Chart: rolling prediction intervals and realised mortality rates for ages 65, 75 and 85, with the exceedance counts [xl, xm, xu, n] reported for each age; x-axis: Year (1995-2006); y-axis: logarithmic scale.]
Notes: As per Figure 10.

Figure 15: Rolling 15-Year Ahead Prediction Intervals: M7
[Chart: rolling prediction intervals and realised mortality rates for ages 65, 75 and 85, with the exceedance counts [xl, xm, xu, n] reported for each age; x-axis: Year (1995-2006); y-axis: logarithmic scale.]
Notes: As per Figure 10.

Table 2 presents a summary of the exceedance results reported in Figures 10-15. The Table leaves out upper exceedances because there were none observed in any of these Figures. We find that:

Lower exceedances: All models exhibit excess lower exceedances, but the degree of excess is very low (0.6%) for M1 and M3B, 6.1% for M7, and 20% or a little over for the remaining models.

Median exceedances: All models show high excess median exceedances. The best performing models by this criterion are M6 (27.8%) and M2B (30.6%). M3B, M5 and M7 then come in joint third at 38.9%, and M1 comes in last at 41.7%.
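The rankings with ties (for example "=3" for joint third) follow what is usually called standard competition ranking. The sketch below applies it to the excess median exceedances discussed above; the function name `rank_with_ties` is hypothetical, and the input figures are our reading of Table 2(b), in percent, rather than values recomputed from the underlying data.

```python
def rank_with_ties(scores):
    """Rank models by score (smaller is better); tied models share a rank marked '='."""
    ordered = sorted(scores.values())
    ranks = {}
    for model, score in scores.items():
        rank = ordered.index(score) + 1      # position of the first equal score
        tied = ordered.count(score) > 1      # more than one model on this score
        ranks[model] = f"={rank}" if tied else str(rank)
    return ranks

# Excess median exceedances in percent, as read from Table 2(b).
excess_median = {"M1": 41.7, "M2B": 30.6, "M3B": 38.9, "M5": 38.9, "M6": 27.8, "M7": 38.9}

ranks = rank_with_ties(excess_median)
print(ranks)
```

With three models tied on 38.9%, the next rank awarded is 6 rather than 4, which is why M1 appears as sixth rather than fourth.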

Table 2: Exceedances Results for Rolling Fixed-Horizon Backtests: Parameters Uncertain

Model        xl/n     xm/n

(a) Exceedances
M1           5.6%    91.7%
M2B         25.0%    80.6%
M3B          5.6%    88.9%
M5          25.0%    88.9%
M6          27.8%    77.8%
M7          11.1%    88.9%

(b) Excess exceedances: xl/n - 0.05, xm/n - 0.5
M1           0.6%    41.7%
M2B         20.0%    30.6%
M3B          0.6%    38.9%
M5          20.0%    38.9%
M6          22.8%    27.8%
M7           6.1%    38.9%

(c) Ranking by excess exceedances
M1             =1        6
M2B            =4        2
M3B            =1       =3
M5             =4       =3
M6              6        1
M7              3       =3

Notes: Based on the exceedance results in Figures 10-15. xl and xm are the numbers of observations below the lower risk bound and below the median bound, and n is the number of forecasts.

5. Mortality Probability Density Forecast Backtests: Backtests based on Statistical Hypothesis Tests

The fourth and final class of backtests involves formal hypothesis tests based on comparisons of realised outcomes against forecasts of the relevant probability densities. To elaborate, suppose we wish to use data up to and including 1980 to evaluate the forecasted probability density function of the mortality rate of, say, 65-year olds in 2006, involving a forecast horizon of 26 years ahead. A simple way to implement such a test is to use each model to forecast the cumulative distribution function (CDF) for the mortality rate in 2006, as of 1980, and compare the realised mortality rate for 2006 with its forecasted distribution.

To carry out this backtest for any given model, we use the data from 1971 to 1980 to make a forecast of the CDF of the mortality rate of 65-year olds in 2006. The curves shown in Figure 16 are the CDFs for each of the six models. The null hypothesis is that the realised mortality rate for 65-year-old males in 2006 is consistent with the forecasted CDF, so we superimpose on these CDF curves the realised mortality rate for this age group in 2006. We then determine the p-value associated with the null hypothesis, which is obtained by taking the value of the CDF where the vertical line drawn up from the realised mortality rate on the x-axis crosses the CDF curve. A model passes the test if the p-value is large enough [12] (i.e., the realised rate is not too far out into the tails of the forecasted distribution) and otherwise fails it.

For the backtests illustrated in Figure 16, we get estimated p-values of 5.9%, about 2%, 7.4%, 4.9%, 5.2% and 6.5% for models M1, M2B, M3B, M5, M6 and M7, respectively. [13] These results tell us that the null hypothesis that M1 generates adequate forecasts over a horizon of 26 years has an estimated p-value of 5.9%, and so M1 would pass a test based on a conventional significance level of 5%. By the same test results, the forecasts of M2B and M5 fail at the 5% but pass at the 1% significance level, whereas those of the other models pass at both significance levels. [14]

[10] Since we are testing forecasts of future mortality rates, we can also regard the tests considered here as similar to tests of q-forward forecasts using the different models; q-forwards are mortality rate forward contracts (see Coughlan et al., 2007).
[11] The reader will recall that we are now dealing with PU forecasts only.
[12] This statement does however assume that we are dealing with CDF-values on the left-hand side of the Figure. Strictly speaking, the forecasts would also fail if the realised mortality rate were associated with a CDF value that is in the upper tail of the forecasted distribution. This raises the issue of 1-sided versus 2-sided tests, which is dealt with further in note 14 below.
[13] We should be careful to resist the temptation to use the estimated p-values to rank models. The p-values from different models are not comparable because the alternative hypotheses are not comparable across the models. We can therefore only use the p-values to test each model on a stand-alone basis.
[14] Where the realised values fall into the left-hand tails, the reported p-values are those associated with 1-sided tests of the null. Of course, we may wish to work with 2-sided tests instead, and we should in principle also take account of the possibility that we might get fail results because realised mortality outcomes fall into the extreme right-hand side of the forecasted density. These extensions are fairly obvious and there is no need to dwell on them further here.

Figure 16: Bootstrapped P-Values of Realised Mortality Outcomes: Males aged 65, 1980 Start, Horizon = 26 Years Ahead
[Chart: six panels, one per model (M1, M2B, M3B, M5, M6, M7), each showing the forecast CDF under the null with a vertical line at the realised mortality rate and the associated p-value.]
Note: Forecasts based on models estimated using English & Welsh male mortality data for ages 60-89 and years 1971-1980 assuming parameter uncertainty. Each panel shows the bootstrapped forecast CDF of the mortality rate in 2006, where the forecast is based on the density forecast made as if in 1980. Each black vertical line gives the realised mortality rate in 2006 and its associated p-value in terms of the forecast CDF. Based on 5,000 simulation trials.

Now suppose we are interested in carrying out a test of the models' 25-year-ahead mortality density forecasts. There are now two cases to consider: we can start in 1980 and carry out the test for a forecast of the 2005 mortality density, or we can start in 1981 and carry out the test for a forecast of the 2006 mortality density. [15] These are illustrated in Figures 17 and 18 below. Pulling the results from these three Figures together, models M1, M3B and M7 always pass tests at conventional significance levels, and models M2B, M5 and M6 fail at least once at the 5% significance level.

[15] In principle, both these tests should give much the same answer under the null hypothesis and there is nothing to choose between them other than the somewhat extraneous consideration (for present purposes) that the latter test uses slightly later data.
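Reading a p-value off a simulated forecast distribution can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the function name `cdf_p_values` and the simulated sample are hypothetical, the lower-tail p-value is the empirical CDF evaluated at the realised rate, and the 2-sided version uses the common convention of doubling the smaller tail probability.

```python
import numpy as np

def cdf_p_values(simulated_rates, realised_rate):
    """Return the lower-tail (1-sided) and a 2-sided p-value for a realised rate."""
    sims = np.asarray(simulated_rates, dtype=float)
    cdf_value = float(np.mean(sims <= realised_rate))    # empirical CDF at the outcome
    one_sided = cdf_value                                 # left-tail test, as in the text
    two_sided = 2.0 * min(cdf_value, 1.0 - cdf_value)     # assumed 2-sided convention
    return one_sided, two_sided

def passes(p_value, significance=0.05):
    """A model passes when the p-value is not below the chosen significance level."""
    return p_value >= significance

# Hypothetical simulated rates on an even grid (0.001, 0.002, ..., 0.100).
sims = np.arange(1, 101) / 1000.0
one_sided, two_sided = cdf_p_values(sims, realised_rate=0.0055)
print(one_sided, two_sided, passes(one_sided), passes(one_sided, 0.01))
```

In practice the simulated rates would be the bootstrapped forecast draws for the target year, and the realised rate the observed mortality rate for that year.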

Figure 17: Bootstrapped P-Values of Realised Mortality Outcomes: Males aged 65, 1980 Start, Horizon = 25 Years Ahead
[Chart: six panels, one per model (M1, M2B, M3B, M5, M6, M7), each showing the forecast CDF under the null with a vertical line at the realised mortality rate and the associated p-value.]
Note: Forecasts based on models estimated using English & Welsh male mortality data for ages 60-89 and years 1971-1980 assuming parameter uncertainty. Each panel shows the bootstrapped forecast CDF of the mortality rate in 2005, where the forecast is based on the density forecast made as if in 1980. Each black vertical line gives the realised mortality rate in 2005 and its associated p-value in terms of the forecast CDF. Based on 5,000 simulation trials.

Figure 18: Bootstrapped P-Values of Realised Mortality Outcomes: Males aged 65, 1981 Start, Horizon = 25 Years Ahead
[Chart: six panels, one per model (M1, M2B, M3B, M5, M6, M7), each showing the forecast CDF under the null with a vertical line at the realised mortality rate and the associated p-value.]
Note: Forecasts based on estimates using English & Welsh male mortality data for ages 60-89 and years 1972-1981 assuming parameter uncertainty. Each panel shows the bootstrapped forecast CDF of the mortality rate in 2006, where the forecast is based on the density forecast made as if in 1981. Each black vertical line gives the realised mortality rate in 2006 and its associated p-value in terms of the forecast CDF. Based on 5,000 simulation trials.

Along the same lines, if we wish to test the models' 24-year-ahead density forecasts, we have three choices: start in 1980 and forecast the density for year 2004, start in 1981 and forecast the density for year 2005, and start in 1982 and forecast the density for year 2006. [16] We can carry on in the same way for forecasts 23 years ahead, 22 years ahead, and so on, down to forecasts 1 year ahead. By the time we get to 1-year-ahead forecasts, we would have 26 different possibilities (i.e., start in 1980, start in 1981, etc., up to start in 2005).

[16] Again, in principle each of these tests should yield similar results under the null.

So, for any given model, we can construct a sample of 26 different estimates of the p-value of the null associated with 1-year-ahead forecasts, a sample of 25 different estimates of the p-value of the null associated with 2-year-ahead forecasts, and so forth. We can then construct plots of estimated p-values for any given horizon or set of horizons. Some examples are given in Figure 19. This Figure shows plots of the p-values for 65-year-olds associated with forecast horizons of 5, 10 and 15 years. The p-values are fairly variable, but our best estimate of each p-value is the mean: our best estimate of the p-value associated with 5-year-ahead forecasts is the mean of the 22 sample observations of the estimated 5-year-ahead p-value, our best estimate of the p-value associated with 10-year-ahead forecasts is the mean of the 17 sample observations of the estimated 10-year-ahead p-value, and so forth. The Figure also reports these mean estimates. [17] Figures 20 and 21 give the corresponding results for 75-year olds and 85-year olds.

In most cases considered, the average p-values decline as the forecast horizon lengthens, indicating that forecast performance in these cases usually deteriorates as the horizon lengthens. For ages 75 and 85, we find that no model fails even at the 5% significance level; for age 65, however, we find three failures (out of a possible 18) at the 5% level: M2B, M5 and M6 fail for h = 15. Thus, out of a total of 3 × 18 = 54 possible failures, we find no fails at the 1% significance level and only three at the 5% significance level. [18]

[17] We also repeat our earlier warning from section 3 just before Figure 4: the outcomes are not independent, so we cannot treat alternative estimates of the p-values for any given model and horizon h as independent. However, this does not invalidate our claim that the best estimate of any of these p-values is the mean, as reported in Figures 19-21.
[18] With so few fails to report there is little point in summarising the results in a Table as we did with the results from the previous two sections.
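The sample sizes quoted above follow from simple counting, which the sketch below reproduces. It assumes start years from 1980 onwards and realised outcomes available up to 2006, as in the text; the function name `n_start_years` is hypothetical.

```python
def n_start_years(horizon, first_start=1980, last_outcome=2006):
    """Number of start years whose horizon-ahead target year has a realised outcome."""
    last_start = last_outcome - horizon        # latest start year that can be tested
    return max(0, last_start - first_start + 1)

# Horizons mentioned in the text.
counts = {h: n_start_years(h) for h in (1, 2, 5, 10, 15, 24, 25, 26)}
print(counts)

# Model/horizon/age combinations that could fail in Figures 19-21:
# 6 models x 3 horizons (5, 10, 15 years) for each of 3 ages.
possible_per_age = 6 * 3
possible_total = 3 * possible_per_age
print(possible_per_age, possible_total)
```

The mean p-value for a given horizon is then the average over that many estimates, for example 22 estimates for the 5-year horizon and 17 for the 10-year horizon.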

Figure 19: Long-Horizon P-Values of Realised Future Mortality Rates: Males aged 65
[Chart: six panels, one per model (M1, M2B, M3B, M5, M6, M7), plotting p-values against starting year and reporting the average p-value for forecasts 5, 10 and 15 years ahead; x-axis: Starting year; y-axis: P-value.]
Notes: Forecasts based on estimates using English & Welsh male mortality data for ages 60-89 and a rolling 10-year window assuming parameter uncertainty. The three line styles refer to p-values over horizons of h = 5, 10 and 15 years respectively. Based on 5,000 simulation trials.

Figure 20: Long-Horizon P-Values of Realised Future Mortality Rates: Males aged 75
[Chart: six panels, one per model (M1, M2B, M3B, M5, M6, M7), plotting p-values against starting year and reporting the average p-value for forecasts 5, 10 and 15 years ahead; x-axis: Starting year; y-axis: P-value.]
Notes: As per Figure 19.

Figure 21: Long-Horizon P-Values of Realised Future Mortality Rates: Males aged 85
[Chart: six panels, one per model (M1, M2B, M3B, M5, M6, M7), plotting p-values against starting year and reporting the average p-value for forecasts 5, 10 and 15 years ahead; x-axis: Starting year; y-axis: P-value.]
Notes: As per Figure 19.

6. Conclusions

The purposes of this paper are: (i) to set out a backtesting framework that can be used to evaluate the ex-post forecasting performance of stochastic mortality models; and (ii) to illustrate this framework by evaluating the forecasting performance of a number of mortality models calibrated under a particular data set.

The backtesting framework presented here is based on the idea that forecast distributions should be compared against subsequently realised mortality outcomes: if the realised outcomes are compatible with their forecasted distributions, then this would suggest that the forecasts and the models that generated them are good ones; and if the forecast distributions and realised outcomes are incompatible, this would suggest that the forecasts and models are poor.
We discussed four different classes of backtest building on this general idea: (i) backtests based on the convergence of forecasts through time towards the mortality rate(s) in a given year; (ii) backtests based on the accuracy of forecasts over multiple horizons; (iii) backtests based on the accuracy of forecasts over rolling fixed-length horizons; and (iv) backtests based on formal hypothesis tests that involve comparisons of realised outcomes against forecasts of the relevant densities over specified horizons.

As far as the individual models are concerned, we find that models M1, M3B, M5, M6 and M7 perform well most of the time and there is relatively little to choose between these models. Of the Lee-Carter class of models examined here, it is difficult to choose between M1 and M3B; of the CBD models (i.e., M5, M6 and M7), however, we would suggest that the evidence presented here points to M7 being the best performer on this dataset, and to it being difficult to choose between M5 and M6. Model M2B repeatedly shows evidence of considerable instability. However, we should make clear that we have examined a particular version of M2 in this study and so cannot rule out the possibility that other specifications of, or extensions to, M2 might resolve the stability problem identified both here and elsewhere (see, e.g., Cairns et al. (2008) and Dowd et al. (2008)).

We would emphasise that these results are obtained using one particular data set, that is, data for English & Welsh males over limited sample periods. Accordingly, we make no claim for how these models might perform over other data sets or sample periods.

We would also make two other comments of a more general nature. If one looks over Figures 4-15, it is clear that, in most but not all cases, and in varying degrees depending on the model and whether we take account of parameter uncertainty or not:

(i) There are fewer upper exceedances than predicted under the null;
(ii) There are usually more median exceedances (realised outcomes below the median) than predicted; and
(iii) There are usually many more lower exceedances than predicted, especially for the parameter-certain forecasts.