Session 178: Statistics for Health Actuaries
October 14, 2015
Moderator: Ian G. Duncan, FSA, FCA, FCIA, FIA, MAAA
Presenter: Joan C. Barrett, FSA, MAAA
Today's Agenda
- Basic Statistics
- A Quick Look at Regression Analysis
Page 2
Basic Statistics
The Statistical Triad Estimation Prediction Hypothesis Testing Page 4
A Few Basic Formulas
Mean: symbol µ = E(X); formula Σ xᵢ f(xᵢ); Excel formula AVERAGE
Variance: symbol σ² = Var(X); formula Σ (xᵢ - µ)² f(xᵢ) = E(X²) - E²(X); Excel formula VAR.P
Standard deviation: symbol σ; formula √Var(X); Excel formula STDEV.P
Page 5
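These formulas are easy to check outside Excel. A minimal Python sketch (not part of the original deck) of the mean and variance of a small discrete distribution:

```python
# Illustrative sketch: mean and variance of a discrete distribution,
# matching the formulas on the slide.

def mean(values, probs):
    # µ = E(X) = Σ x_i f(x_i)
    return sum(x * f for x, f in zip(values, probs))

def variance(values, probs):
    # σ² = Σ (x_i - µ)² f(x_i) = E(X²) - E²(X)
    mu = mean(values, probs)
    return sum((x - mu) ** 2 * f for x, f in zip(values, probs))

# A hypothetical two-point distribution (1 = admit, 0 = no admit)
xs = [1, 0]
fs = [0.06, 0.94]
mu = mean(xs, fs)        # 0.06
var = variance(xs, fs)   # 0.0564
sigma = var ** 0.5
```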
Claims Frequencies
Intuitive concept: Bernoulli = flip a coin once; Binomial (N = 1,000) = flip a coin N times
Mean (µ): Bernoulli p, the probability of success; Binomial N × p
Variance (σ²): Bernoulli p × (1 - p); Binomial N × p × (1 - p)
Hospital admits (p = 6.0%): Bernoulli mean 6.0%, variance 5.6%; Binomial mean 60, variance 56
Any claims (p = 30%): Bernoulli mean 30%, variance 21%; Binomial mean 300, variance 210
Page 6
Sample Calculations (Bernoulli)
Step | Success | Failure | Combined
Value of x | 1 | 0 | N/A
Probability of x | 6.0% | 94.0% | 100%
Mean = µ = weighted average | 6.0% | 0.0% | 6.0%
x - µ | 94.0% | -6.0% | N/A
(x - µ)² | 88.4% | 0.4% | N/A
Variance = σ² = weighted sum of squares | 5.3% | 0.3% | 5.6%
Page 7
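The Bernoulli and binomial figures above follow directly from the closed-form moments. A quick sketch (not from the deck) reproducing the hospital-admits column:

```python
# Bernoulli and binomial moments for hospital admits, p = 6%.

def bernoulli_moments(p):
    # mean = p, variance = p(1 - p)
    return p, p * (1 - p)

def binomial_moments(n, p):
    # mean = Np, variance = Np(1 - p)
    return n * p, n * p * (1 - p)

b_mean, b_var = bernoulli_moments(0.06)       # 0.06, 0.0564
n_mean, n_var = binomial_moments(1000, 0.06)  # 60, 56.4
```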
Additional Formulas for Weighted Averages
Group | Weight = c | % Admits = E(X) | Var(X) | c² Var(X)
Children | 33.0% | 3.0% | 2.9% | 0.3%
Women < 40 | 20.0% | 5.0% | 4.8% | 0.2%
Women 40+ | 20.0% | 10.0% | 9.0% | 0.4%
Men < 40 | 13.0% | 5.0% | 4.8% | 0.1%
Men 40+ | 14.0% | 10.0% | 9.0% | 0.2%
Combined | 100.0% | 6.0% | | 1.1%
Key formulas:
E(X + Y) = E(X) + E(Y)
E(cX) = cE(X)
Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)
Var(X + Y) = Var(X) + Var(Y), if X and Y are independent
Var(cX) = c² Var(X)
Page 8
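A sketch (not in the deck) of how the combined row follows from the key formulas, treating the groups as independent and applying E(cX) = cE(X) and Var(cX) = c² Var(X) to the table's hypothetical weights and admit rates:

```python
# Weighted-average mean and variance across demographic groups.
# Each group is Bernoulli with its own admit rate; Var(X) = p(1 - p).

groups = {
    "Children":   (0.33, 0.03),   # (weight c, admit rate p)
    "Women < 40": (0.20, 0.05),
    "Women 40+":  (0.20, 0.10),
    "Men < 40":   (0.13, 0.05),
    "Men 40+":    (0.14, 0.10),
}

combined_mean = sum(c * p for c, p in groups.values())             # ≈ 6.0%
combined_var = sum(c ** 2 * p * (1 - p) for c, p in groups.values())  # ≈ 1.1%
```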
Normal Approximation to the Binomial
A binomial distribution is approximately normal if N > 30, N × p > 5, and N × p × (1 - p) > 5
Mean = N × p; Variance = N × p × (1 - p)
Page 9
The Standard Normal Curve
[Figure: the probability distribution f(x) and the cumulative function F(x) of the standard normal, plotted from -3.9 to 3.9]
Standard normal: Mean = 0, Variance = 1
To convert any normal distribution to standard normal: Z = (x - µ)/σ
Slide 10
Excel Functions for the Normal Distribution
NORM.DIST(x, mean, stddev, logical) works for any normal distribution:
- logical = TRUE returns the cumulative function: input an x-axis value, returns the y-axis value F(x). For the standard normal, F(0) = 50.0% and F(2) = 97.7%. Use this to determine confidence limits.
- logical = FALSE returns the bell-shaped curve: input an x-axis value, returns the y-axis value f(x). For the standard normal, f(0) = 39.9% and f(2) = 5.4%. Use this to graph bell-shaped curves.
NORM.INV(probability, mean, stddev) inverts the cumulative function: input a probability, returns the x-axis value. For the standard normal, NORM.INV(50.0%, 0, 1) = 0.0 and NORM.INV(97.7%, 0, 1) = 2.0.
Slide 11
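For readers working outside Excel, the standard library's `statistics.NormalDist` gives the same three operations. A sketch (not part of the deck):

```python
# Python analogue of the Excel normal-distribution functions.
from statistics import NormalDist

std = NormalDist(mu=0, sigma=1)

cum_at_2 = std.cdf(2)          # ≈ 0.977, like NORM.DIST(2, 0, 1, TRUE)
density_at_0 = std.pdf(0)      # ≈ 0.399, like NORM.DIST(0, 0, 1, FALSE)
x_at_977 = std.inv_cdf(0.977)  # ≈ 2.0,   like NORM.INV(0.977, 0, 1)
```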
Sensitivity Analysis Same shape, just shifted Same center, different shape Slide 12
Key Numbers to Remember
[Figure: standard normal probability distribution with -1 and +1 standard deviation marked]
The standard normal curve is symmetrical; we tend to look at variation around the mean.
Range around the mean | Probability
±1 standard deviation | 68.3%
±2 standard deviations | 95.4%
±3 standard deviations | 99.7%
±1.96 standard deviations | 95.0%
±2.58 standard deviations | 99.0%
Slide 13
Central Limit Theorem
Suppose we have a sample of n independent draws, X₁, X₂, …, Xₙ, from any distribution.
Then we can define a new random variable, the sample mean: x̄ = (X₁ + X₂ + … + Xₙ)/n
For large n, the sample mean is approximately normally distributed with mean = µ (the population mean) and variance σ²/n.
Which means Z = (x̄ - µ)/(σ/√n) is approximately a standard normal variable.
Slide 14
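A quick simulation sketch of the theorem (illustrative, not from the deck): even for a very non-normal distribution like a Bernoulli admit indicator with p = 0.06, the means of repeated samples cluster around µ with variance close to σ²/n.

```python
# Simulate many sample means of n Bernoulli(p) draws and check that
# their distribution has mean ≈ p and variance ≈ p(1-p)/n.
import random
from statistics import mean, pvariance

random.seed(0)
p, n, trials = 0.06, 1000, 2000

sample_means = [
    mean(1 if random.random() < p else 0 for _ in range(n))
    for _ in range(trials)
]

observed_mean = mean(sample_means)      # ≈ p = 0.06
observed_var = pvariance(sample_means)  # ≈ p(1-p)/n = 0.0000564
```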
Our Major Concern
[Figure: a distribution annotated with the question: does our data say we are here, when we want to be there?]
Will we be here next year? Are results stable year over year?
Do I need a margin? How much?
Slide 15
The Trick Is to Narrow the Curve
[Figure: sampling distributions of the mean for n = 10 and n = 50; the larger sample gives a narrower curve]
The sample variance is σ²/n.
Major trade-off: the more homogeneous the group, the smaller the sigma, but the smaller the n.
Slide 16
Confidence Interval for IP Admits
Members | N = 1,000 | N = 100,000 | N = 1,000,000
Probability of admit | 6.0% | 6.0% | 6.0%
Expected admits | 60 | 6,000 | 60,000
Variance = N × p × (1 - p) = σ² | 56.4 | 5,640 | 56,400
Standard deviation = σ | 7.5 | 75.1 | 237.5
Multiplier at 95% confidence level | ±1.96 | ±1.96 | ±1.96
Confidence interval | ±14.7 | ±147.2 | ±465.5
Confidence limit as % of mean | ±24.5% | ±2.5% | ±0.8%
If a population has 1,000,000 members, then there is a 95% chance that any sample will be within ±0.8% of the true mean.
Slide 17
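A sketch (not in the deck) reproducing the N = 1,000 column of the table above:

```python
# 95% confidence interval for binomial admit counts, N = 1,000, p = 6%.
from statistics import NormalDist

n, p = 1000, 0.06
expected = n * p                        # 60 expected admits
sigma = (n * p * (1 - p)) ** 0.5        # ≈ 7.5
z = NormalDist().inv_cdf(0.975)         # ≈ 1.96 at the 95% level
half_width = z * sigma                  # ≈ ±14.7 admits
as_pct_of_mean = half_width / expected  # ≈ ±24.5%
```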
The Standard for Full Credibility
The standard can be expressed in terms of a confidence interval. Example: how many observations do I need to be 95% sure that my data is within ±1% of the true mean?
Use the logic on the previous slide, but solve for N.
In our example, full credibility requires roughly 1 million members, or 60,000 admits.
Slide 18
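A "solve for N" sketch (not from the deck): the half-width as a percent of the mean is z·√((1 - p)/(N·p)), so inverting for N gives the membership needed for a target precision. With the slide's p = 6%, a ±0.8% target lands near the "roughly 1 million members" quoted above.

```python
# Members needed so the confidence-interval half-width is r% of the mean.
from statistics import NormalDist

def members_needed(p, r, confidence=0.95):
    # Solve z * sqrt((1 - p) / (N * p)) = r for N.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z ** 2 * (1 - p) / (p * r ** 2)

n_1pct = members_needed(0.06, 0.01)    # ≈ 600,000 members for ±1%
n_08pct = members_needed(0.06, 0.008)  # ≈ 940,000 for the ±0.8% on slide 17
```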
Hypothesis Testing Overview
Null hypothesis: µ = µ*. In words: the population mean is µ*; our current assumption, µ*, is still correct.
Alternate hypothesis: µ ≠ µ*. In words: the population mean is not µ*; we need to change our assumptions.
If the hypothesis is true: accepting it is correct; rejecting it is a Type I error.
If the hypothesis is false: accepting it is a Type II error; rejecting it is correct.
Standard practice is to err on the side of avoiding Type I error: accept the hypothesis unless there is clear indication to the contrary.
Slide 19
Hypothesis Testing: Where Does Your Test Statistic Fall?
[Figure: standard normal curve at the 95% confidence level; accept in the middle 95.0%, reject in the 2.5% tail on each side]
The curve represents what the distribution will look like if the null hypothesis is true.
Where does your test statistic fall?
Slide 20
Hypothesis Testing
Members | N = 1,000 | N = 100,000 | N = 1,000,000
µ* = expected admits | 60 | 6,000 | 60,000
σ = standard deviation | 7.5 | 75.1 | 237.5
X = actual admits (3% higher than expected) | 61.8 | 6,180 | 61,800
Z = test statistic = (X - µ*)/σ | 0.24 | 2.40 | 7.58
Critical value at 95% | ±1.96 | ±1.96 | ±1.96
Accept/Reject | Accept | Reject | Reject
In each case, the actual admits are 3% higher than expected, but we accept the null hypothesis if we only have 1,000 members and reject it if we have 100,000 or more.
Slide 21
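A sketch (not from the deck) of the N = 100,000 column, including the two-sided p-value that the next slide's Z.TEST discussion refers to:

```python
# Z test for the N = 100,000 block: 6,180 actual vs. 6,000 expected admits.
from statistics import NormalDist

mu_star = 6000                  # expected admits under the null hypothesis
sigma = 5640 ** 0.5             # ≈ 75.1
actual = 6180                   # 3% higher than expected

z = (actual - mu_star) / sigma                 # ≈ 2.40
p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # ≈ 0.017, below 5%
reject = abs(z) > 1.96                         # reject at the 95% level
```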
p-value
Basically, the probability of a Type I error: the probability that your sample, or a more extreme one, will show a statistically significant difference even when the null hypothesis is true.
The lower the p-value the better; it should be less than 1 - confidence level (5% at the 95% confidence level).
Considered the gold standard for statistically significant differences. But the p-value is based on one sample from one population:
- What if the next sample shows there is no difference?
- What happens if you use a similar but not identical population that shows no difference?
This has been controversial since the early days of statistics.
Recommendation: routinely check the p-value using Z.TEST in Excel, and reconcile differences.
Slide 22
But How Do We Know What µ and σ Are?
We are going to have to estimate µ and σ, but we need some criteria first:
- Consistent estimator: tends to converge on the true value as the sample size becomes larger
- Maximum likelihood estimator: the parameter value under which the probability of observing the data is maximized
- Unbiased estimator: its expected value is equal to the true value
Slide 23
Some Rules of Thumb
Use the sample mean to estimate the population mean: x̄ = Σ xᵢ / n
If n > 30, use the sample variance: s² = Σ (xᵢ - x̄)² / (n - 1)
In Excel:
STDEV.P returns the population standard deviation (divides by n)
STDEV.S returns the sample standard deviation (divides by n - 1)
Slide 24
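The same distinction exists in Python's standard library; a small sketch (not in the deck):

```python
# Population vs. sample standard deviation, mirroring STDEV.P / STDEV.S.
from statistics import pstdev, stdev

data = [3, 5, 4, 7, 6]  # a tiny hypothetical sample

population_sd = pstdev(data)  # divides by n,     like STDEV.P
sample_sd = stdev(data)       # divides by n - 1, like STDEV.S
# The sample version is always slightly larger, correcting the bias
# from estimating the mean from the same data.
```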
Chi-Square Distribution: Sampling
Χ² = Z₁² + Z₂² + … + Zₙ², where Z₁, Z₂, etc. are independent standard normal variables
A chi-square variable with n degrees of freedom has E(Χₙ²) = n and Var(Χₙ²) = 2n; when the Zᵢ are computed from a sample using the sample mean, the statistic has n - 1 degrees of freedom
Σ (A - E)²/E is approximately chi-square with n - k - 1 degrees of freedom, where k = number of parameters to be estimated
Slide 25
The t Distribution for Small Samples
Define a new distribution: T = Z / √(Y/n)
where Z is a standard normal variable and Y is a chi-square variable with n degrees of freedom
Example: t = (x̄ - µ*)/(s/√n)
Note 1: µ* is the hypothetical population mean, usually the current assumption
Note 2: t has n - 1 degrees of freedom
Slide 26
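A sketch (not from the deck) of the example statistic t = (x̄ - µ*)/(s/√n) on a small hypothetical sample; the critical value would then come from a t table or Excel's T.INV with n - 1 degrees of freedom.

```python
# t statistic for a small sample against a hypothetical assumption µ*.
from statistics import mean, stdev

sample = [4.1, 3.8, 4.5, 4.0, 3.6, 4.2]  # hypothetical observations
mu_star = 3.5                             # current assumption

n = len(sample)
x_bar = mean(sample)
s = stdev(sample)                         # sample std dev (divides by n - 1)
t = (x_bar - mu_star) / (s / n ** 0.5)    # compare to T.INV at n - 1 df
```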
t-Distribution Examples
T.DIST(x, df, logical) returns:
x | degrees of freedom | TRUE (cumulative) | FALSE (density)
-3 | 10 | 0.7% | 1.1%
0 | 10 | 50.0% | 38.9%
3 | 10 | 99.3% | 1.1%
-3 | 20 | 0.4% | 0.8%
0 | 20 | 50.0% | 39.4%
3 | 20 | 99.6% | 0.8%
T.INV(probability, df) returns:
probability | degrees of freedom | x
0.7% | 10 | -3
50.0% | 10 | 0
99.3% | 10 | 3
0.4% | 20 | -3
50.0% | 20 | 0
99.6% | 20 | 3
Slide 27
Other Uses of Chi-Square Distributions
Σ (A - E)²/E is approximately chi-square with n - k - 1 degrees of freedom, where k = number of parameters to be estimated
The chi-square test can be used to test independence between two distributions
A hypothesis test can indicate whether there is a real difference between two distributions
Slide 28
Sample Probability Distribution: IP Length of Stay
Range | i | ALOS xᵢ | Probability f(xᵢ) | Cumulative F(xᵢ) | Variance (xᵢ - µ)²
Exactly 1 day | 1 | 1.0 | 20.9% | 20.9% | 9.0
Exactly 2 days | 2 | 2.0 | 29.9% | 50.8% | 4.0
Exactly 3 days | 3 | 3.0 | 18.9% | 69.8% | 1.0
Exactly 4 days | 4 | 4.0 | 10.2% | 80.0% | 0.0
Exactly 5 days | 5 | 5.0 | 5.4% | 85.4% | 1.0
Exactly 6 days | 6 | 6.0 | 3.3% | 88.7% | 4.0
Exactly 7 days | 7 | 7.0 | 2.4% | 91.0% | 9.0
Exactly 8 days | 8 | 8.0 | 1.7% | 92.7% | 16.0
Exactly 9 days | 9 | 9.0 | 1.2% | 93.8% | 24.9
10+ days | 10 | 22.0 | 6.1% | 100.0% | 323.8
Sum/Sumproduct | | 4.0 | 100.0% | | 24.1
Slide 29
Chi-Square Test
Range | i | LOS | Expected distribution | Expected admits | Actual LOS | Actual admits | Χ²
Exactly 1 day | 1 | 1 | 20.9% | 20.9 | 1.0 | 18.0 | 0.41
Exactly 2 days | 2 | 2 | 29.9% | 29.9 | 2.0 | 32.0 | 0.14
Exactly 3 days | 3 | 3 | 18.9% | 18.9 | 3.0 | 20.0 | 0.06
Exactly 4 days | 4 | 4 | 10.2% | 10.2 | 4.0 | 10.0 | 0.00
Exactly 5 days | 5 | 5 | 5.4% | 5.4 | 5.0 | 5.0 | 0.04
Exactly 6 days | 6 | 6 | 3.3% | 3.3 | 6.0 | 3.0 | 0.02
Exactly 7 days | 7 | 7 | 2.4% | 2.4 | 7.0 | 2.0 | 0.06
Exactly 8 days | 8 | 8 | 1.7% | 1.7 | 8.0 | 2.0 | 0.07
Exactly 9 days | 9 | 9 | 1.2% | 1.2 | 9.0 | 1.0 | 0.02
10+ days | 10 | 22 | 6.1% | 6.1 | 24.0 | 7.0 | 0.12
Sum/Sumproduct | | 4.0 | 100.0% | 100.0 | 4.3 | 100.0 | 0.95
Chi-square statistic = 0.95
CHISQ.TEST(actual range, expected range) = 99.95%
Slide 30
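A sketch (not in the deck) reproducing the chi-square statistic Σ(A - E)²/E from the table's expected and actual admit counts:

```python
# Chi-square goodness-of-fit statistic for the length-of-stay buckets.

expected = [20.9, 29.9, 18.9, 10.2, 5.4, 3.3, 2.4, 1.7, 1.2, 6.1]
actual   = [18.0, 32.0, 20.0, 10.0, 5.0, 3.0, 2.0, 2.0, 1.0, 7.0]

chi_square = sum((a - e) ** 2 / e for a, e in zip(actual, expected))
# ≈ 0.96 with these rounded inputs; the slide shows 0.95
```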
Regression Analysis
Residual = Observed Value - Predicted Value
Observation i | Independent variable xᵢ | Predicted value ŷᵢ | Observed value yᵢ | Residual eᵢ
1 | 0.5 | 5.2 | 6.0 | 0.8
2 | 1.0 | 6.1 | 5.0 | (1.1)
3 | 1.5 | 7.0 | 8.0 | 1.0
4 | 2.0 | 7.9 | 6.0 | (1.9)
5 | 2.5 | 8.8 | 10.0 | 1.2
There is a curve which represents the true underlying values.
The residuals are values of a random variable ϵᵢ.
The residual is basically the difference between the dot and the line.
Slide 32
Underlying Assumptions
yᵢ = β₀ + β₁xᵢ + ϵᵢ
x₁, x₂, …, xₙ are non-stochastic variables
E(ϵᵢ) = 0 and Var(ϵᵢ) = σ²
The ϵᵢ's are independent random variables
Note: β₀, β₁ and σ² are the true unknown values. We are going to have to estimate these values based on a specific data set.
Slide 33
Analysis of Variance (ANOVA): Total Sum of Squares (TSS)
Observation i | Observed value yᵢ | Overall mean ȳ | Δ = yᵢ - ȳ | Δ²
1 | 6.0 | 7.0 | (1.0) | 1.0
2 | 5.0 | 7.0 | (2.0) | 4.0
3 | 8.0 | 7.0 | 1.0 | 1.0
4 | 6.0 | 7.0 | (1.0) | 1.0
5 | 10.0 | 7.0 | 3.0 | 9.0
Total | | | | 16.0
The purpose of ANOVA is to understand how much of the variance is accounted for by the curve.
The starting point is calculating the total variance from the overall mean (the red line).
Slide 34
Regression Sum of Squares (RSS): Predicted Value - Overall Mean
Observation i | Predicted value ŷᵢ | Overall mean ȳ | Δ | Δ²
1 | 5.2 | 7.0 | (1.8) | 3.2
2 | 6.1 | 7.0 | (0.9) | 0.8
3 | 7.0 | 7.0 | - | -
4 | 7.9 | 7.0 | 0.9 | 0.8
5 | 8.8 | 7.0 | 1.8 | 3.2
Total | | | | 8.1
How much of the variance is explained by the fact that we have a curve?
We are looking at the difference between the blue line and the red line.
Slide 35
Error Sum of Squares (ERSS): Residuals
[Figure: a simple regression example, observed y values vs. x values with the fitted line]
Observation i | Observed value yᵢ | Predicted value ŷᵢ | Δ = eᵢ | Δ²
1 | 6.0 | 5.2 | 0.8 | 0.6
2 | 5.0 | 6.1 | (1.1) | 1.2
3 | 8.0 | 7.0 | 1.0 | 1.0
4 | 6.0 | 7.9 | (1.9) | 3.6
5 | 10.0 | 8.8 | 1.2 | 1.4
Total | | | | 7.9
ERSS measures the unexplained variance.
Slide 36
TSS = RSS + ERSS
Sum of squares | Abbreviation | Description of Δ | Value
Total (total variance) | TSS | Observed values - overall mean | 16.0
Regression (explained variance) | RSS | Predicted values - overall mean | 8.1
Error (unexplained variance) | ERSS | Observed values - predicted values | 7.9
Total variance = explained variance + unexplained variance
R² = explained variance / total variance = % of total variance explained by the regression
R² = 8.1/16 = 51%
Slide 37
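A sketch (not from the deck) of the decomposition using the five observations from the preceding slides:

```python
# TSS = RSS + ERSS and R² for the worked five-point example.
observed  = [6.0, 5.0, 8.0, 6.0, 10.0]
predicted = [5.2, 6.1, 7.0, 7.9, 8.8]
y_bar = sum(observed) / len(observed)   # overall mean = 7.0

tss  = sum((y - y_bar) ** 2 for y in observed)                  # 16.0
rss  = sum((p - y_bar) ** 2 for p in predicted)                 # ≈ 8.1
erss = sum((y - p) ** 2 for y, p in zip(observed, predicted))   # ≈ 7.9

r_squared = rss / tss   # ≈ 0.51
```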
Why Are Non-Stochastic Values Important?
Stochastic variable → non-stochastic variable:
- Member lives in Zip 999 → area-adjusted PMPM
- Member is a 42-year-old male → age-sex adjusted PMPM
- Member took a health risk assessment → % taking a health risk assessment
Stochastic variables introduce variance not accounted for in the standard analysis of variance.
We may be ignoring factors important in determining the value; for example, the incentive for taking a health risk assessment may not be the same for each group.
Is this variance always material?
Slide 38
The Bad News
Health care costs are not normally distributed, so generalized linear models may have to be used.
Excel does not handle generalized linear models.
You can still use other methods, but be careful about disclaimers.
Slide 39
Criteria for Estimating β₀, β₁ and σ
We are going to use the same criteria that we used to estimate µ and σ in Stats 101:
- Consistent estimator: tends to converge on the true value as the sample size becomes larger
- Maximum likelihood estimator: the parameter value under which the probability of observing the data is maximized
- Unbiased estimator: its expected value is equal to the true value
Slide 40
Least Squares Estimate
Basic premise: find the values b₀ and b₁ which minimize the sum of the squared distances from each data point to the theoretical line: Σ (yᵢ - (b₀ + b₁xᵢ))²
Take the first derivative and solve for the values.
The results are consistent, maximum likelihood, and unbiased estimators.
Slide 41
A Good Candidate for Simple Linear Regression
Statistic | Value
Intercept | 4.2
Slope | 1.1
R² | 73%
The variance appears to be normal: ~2/3 of the residuals fall within 1 standard deviation of the mean, and ~1/3 fall between 1 and 2 standard deviations.
The line is not too flat, and R² is high.
Slide 42
Excel Formulas for Key Values
Intercept: =INTERCEPT(known_ys, known_xs)
Slope: =SLOPE(known_ys, known_xs)
R²: =RSQ(known_ys, known_xs)
Input values:
x | y
1.0 | 2.08
1.5 | 5.75
2.0 | 6.66
2.5 | 4.96
3.0 | 8.25
3.5 | 7.54
4.0 | 12.07
4.5 | 9.90
5.0 | 9.96
5.5 | 13.24
6.0 | 12.06
6.5 | 9.89
7.0 | 14.66
7.5 | 14.03
8.0 | 11.17
8.5 | 17.49
9.0 | 15.28
9.5 | 13.55
10.0 | 12.57
10.5 | 14.59
Slide 43
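A sketch (not in the deck) of what INTERCEPT, SLOPE and RSQ compute, applied to the input values above via the closed-form least-squares solution:

```python
# Closed-form simple linear regression on the slide's 20 data points.
xs = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5,
      6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5, 10.0, 10.5]
ys = [2.08, 5.75, 6.66, 4.96, 8.25, 7.54, 12.07, 9.90, 9.96, 13.24,
      12.06, 9.89, 14.66, 14.03, 11.17, 17.49, 15.28, 13.55, 12.57, 14.59]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)

slope = s_xy / s_xx                    # ≈ 1.1, like SLOPE
intercept = y_bar - slope * x_bar      # ≈ 4.2, like INTERCEPT
r_squared = s_xy ** 2 / (s_xx * s_yy)  # ≈ 0.73, like RSQ
```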
Excel has a Data Analysis add-in: Data → Data Analysis. It requires a one-time set-up. Slide 44
Data Analysis has several options Choose regression option Slide 45
Minimum input: Known y s, known x s, output placement Slide 46
Sample Output
SUMMARY OUTPUT
Regression statistics (key values):
Multiple R | 0.857
R Square | 0.735
Adjusted R Square | 0.720
Standard Error | 2.091
Observations | 20
ANOVA:
Source | df | SS | MS | F | Significance F
Regression | 1 | 218.06 | 218.06 | 49.88 | 0.00
Residual | 18 | 78.69 | 4.37 | |
Total | 19 | 296.75 | | |
Coefficient test data:
Term | Coefficient | Standard error | t Stat | P-value | Lower 95% | Upper 95%
Intercept | 4.2 | 1.04 | 4.03 | 0.00 | 2.01 | 6.39
X Variable 1 | 1.1 | 0.16 | 7.06 | 0.00 | 0.80 | 1.49
Additional output is available if requested in the dialogue box.
Slide 47
Anscombe's Quartet: Data Set 1 vs Data Set 2. What is the expected difference in slope, intercept and R²? Would you rely on the curve for data set 1? For data set 2? Slide 48
Anscombe's Quartet: Data Set 1 vs Data Set 3. Would you rely on the curve for data set 3? Slide 49
Anscombe's Quartet: Data Set 1 vs Data Set 4. Would you rely on the curve for data set 4? Slide 50
Example: How Well Does Risk Score Predict Cost?
[Figure: area-adjusted PMPM ($0 to $1,200) vs. risk score (0 to 6), with fitted line]
INTERCEPT = $35.18; SLOPE = $183.86; CORREL = 62%; AVERAGE = 184.39
Used a retrospective risk score; random sample of males aged 42 from a large group; N = 28; divided raw PMPM by the area factor.
Slide 51
What is the Value of R 2? Slide 52
The Basics
yᵢ = β₀ + β₁x₁ + … + βₙxₙ + ϵᵢ
ϵᵢ is a value of the residual random variable described earlier
Slide 53
Why Multiple Linear Regression?
- Shape of the curve/polynomial: yᵢ = β₀ + β₁x + β₂x² + ϵᵢ
- Additional explanatory variables: age + gender probably explains costs better than age alone
- Control for confounding factors ("all other things being equal"); example: control for area
Slide 54
Are Your Independent Variables Dependent on Each Other?
In most cases, find the independent variable that best explains overall variance:
- Test each independent variable one at a time
- Also test combinations of variables
Test interdependence by doing analytics comparing just the variables in question:
- How well does age-gender predict risk score?
- How well does risk score predict age-gender?
Slide 55
Categorical Variables
Categorical: separate into groups even if the variable is not numeric per se.
Gender: 1 = female, 0 = male
Alternately: Gender1 = 1 if female, 0 if male; Gender2 = 0 if female, 1 if male
Think in terms of the marginal impact of each variable: the expected value of yᵢ goes up by βᵢ with each unit change in xᵢ.
Slide 56
Where Do You Go From Here?
Pick a resource: Barron's Business Statistics; anything by Jed Frees
Practice, practice, practice:
- Start with the basics (chi-square, confidence limits, hypothesis testing)
- Move to regression analysis; use adjusted PMPMs etc. to get yourself started
- Make sure you can analyze and explain results
Move to advanced analytics:
- GLM for health care costs
- Trend methods
- Evaluation (probit, propensity analysis, etc.)
- Disease progression (survival models)
Slide 57
Q&A and Wrap-Up