Session 178 TS, Stats for Health Actuaries. Moderator: Ian G. Duncan, FSA, FCA, FCIA, FIA, MAAA. Presenter: Joan C. Barrett, FSA, MAAA


Session 178 Statistics for Health Actuaries October 14, 2015 Presented by: Joan C. Barrett, FSA, MAAA Ian Duncan, FSA, FIA, FCIA, FCA, MAAA

Today's Agenda
- Basic Statistics
- A Quick Look at Regression Analysis
Page 2

Basic Statistics

The Statistical Triad
- Estimation
- Prediction
- Hypothesis Testing
Page 4

A Few Basic Formulas

Measure              Symbol          Formula                                     Excel Formula
Mean                 µ = E(X)        Σ x_i f(x_i)                                AVERAGE
Variance             σ^2 = Var(X)    Σ (x_i - µ)^2 f(x_i) = E(X^2) - [E(X)]^2    VAR.P
Standard Deviation   σ               √Var(X)                                     STDEV.P
Page 5

Claims Frequencies

                     Bernoulli                        Binomial (N = 1,000)
Intuitive Concept    Flip a coin, once                Flip a coin, N times
Mean (µ)             p, the probability of success    N x p
Variance (σ^2)       p x (1 - p)                      N x p x (1 - p)

Hospital Admits
  Mean               6.0%                             60
  Variance           5.6%                             56
Any claims
  Mean               30%                              300
  Variance           21%                              210
Page 6

Sample Calculations (Bernoulli)

Variable          Step                               Success   Failure   Combined
Mean = µ          Value of x                         1         0         N/A
                  Probability of x                   6.0%      94.0%     100%
                  Mean = µ = Weighted Average        6.0%      0.0%      6.0%
Variance = σ^2    x - µ                              94.0%     -6.0%     N/A
                  (x - µ)^2                          88.4%     0.4%      N/A
                  Variance = σ^2 = sum of squares    5.3%      0.3%      5.6%
Page 7
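The calculation steps in the table above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the function names `bernoulli_moments` and `binomial_moments` are hypothetical.

```python
def bernoulli_moments(p):
    """Return (mean, variance) of a single success/failure trial."""
    outcomes = [(1, p), (0, 1 - p)]  # (value of x, probability of x)
    mean = sum(x * prob for x, prob in outcomes)
    # Probability-weighted sum of squared deviations from the mean
    variance = sum((x - mean) ** 2 * prob for x, prob in outcomes)
    return mean, variance


def binomial_moments(n, p):
    """Return (mean, variance) of the number of successes in n independent trials."""
    mean, variance = bernoulli_moments(p)
    return n * mean, n * variance


admit_mean, admit_var = bernoulli_moments(0.06)
group_mean, group_var = binomial_moments(1000, 0.06)
print(f"Bernoulli: mean = {admit_mean:.1%}, variance = {admit_var:.1%}")
print(f"Binomial (N = 1,000): mean = {group_mean:.0f}, variance = {group_var:.1f}")
```

With p = 6.0% this should reproduce the 6.0% / 5.6% Bernoulli figures and the 60 / 56 binomial figures shown on the Claims Frequencies slide.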

Additional Formulas for Weighted Averages

Group         Weight = c   % Admits = E(X)   Var(X)   c^2 Var(X)
Children      33.0%        3.0%              2.9%     0.3%
Women < 40    20.0%        5.0%              4.8%     0.2%
Women 40+     20.0%        10.0%             9.0%     0.4%
Men < 40      13.0%        5.0%              4.8%     0.1%
Men 40+       14.0%        10.0%             9.0%     0.2%
Combined      100.0%       6.0%                       1.1%

Key Formulas
- E(X + Y) = E(X) + E(Y)
- E(cX) = c E(X)
- Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)
- Var(X + Y) = Var(X) + Var(Y), if X and Y are independent
- Var(cX) = c^2 Var(X)
Page 8
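As a rough check of the Combined row, the key formulas can be applied directly. The sketch below assumes each group's admit indicator is a Bernoulli variable, so Var(X) = p x (1 - p), and that the groups are independent, so the covariance terms drop out; the variable names are illustrative only.

```python
# (group, weight c, probability of admit) taken from the table above
segments = [
    ("Children", 0.33, 0.03),
    ("Women < 40", 0.20, 0.05),
    ("Women 40+", 0.20, 0.10),
    ("Men < 40", 0.13, 0.05),
    ("Men 40+", 0.14, 0.10),
]

# E(sum of c*X) = sum of c*E(X)
combined_mean = sum(c * p for _, c, p in segments)

# Var(sum of c*X) = sum of c^2 * Var(X) when the X's are independent
combined_var = sum(c ** 2 * p * (1 - p) for _, c, p in segments)

print(f"Combined mean     = {combined_mean:.1%}")
print(f"Combined variance = {combined_var:.1%}")
```

The results should round to the 6.0% and 1.1% shown in the Combined row.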

Normal Approximation to Binomial
A binomial distribution is approximately normal if
- N > 30
- N x p > 5
- N x p x (1 - p) > 5
Mean = N x p
Variance = N x p x (1 - p)
Page 9
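The three conditions are easy to wrap in a helper. A minimal sketch, with the hypothetical name `normal_approximation_ok`:

```python
def normal_approximation_ok(n, p):
    """Apply the three rule-of-thumb conditions listed above."""
    return n > 30 and n * p > 5 and n * p * (1 - p) > 5


print(normal_approximation_ok(1000, 0.06))  # 60 expected admits
print(normal_approximation_ok(50, 0.06))    # only 3 expected admits
```

A group of 1,000 members at a 6% admit rate passes all three checks, while a group of 50 does not, because N x p is only 3.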

The Standard Normal Curve
[Figure: two charts of the standard normal for x from -3.9 to 3.9, the probability distribution f(x) and the cumulative function F(x)]
Standard Normal: Mean = 0, Variance = 1
To convert any normal distribution to standard normal: Z = (x - µ) / σ
Slide 10

Excel Functions for Normal Distribution

Function / Description                  logical = TRUE   logical = FALSE
NORMDIST(x, mean, stddev, logical)
  Curve                                 Cumulative       Bell-shaped
  Any normal distribution               Yes              Yes
  Input                                 x-axis           x-axis
  Returns                               y-axis           y-axis
  Returns for x = 0                     50.0%            39.9%
  Returns for x = 2                     97.7%            5.4%
NORM.INV(probability, mean, stdev)
  Input                                 y-axis           N/A
  Returns                               x-axis           N/A
  Returns for probability = 50.0%       0.0              N/A
  Returns for probability = 97.7%       2.0              N/A

Notes
- Use NORMDIST with logical = FALSE to graph bell-shaped curves
- Use NORM.INV to determine confidence limits
- The example values assume the standard normal
Slide 11
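For readers working outside Excel, Python's standard library offers rough equivalents through `statistics.NormalDist` (Python 3.8+). The mapping below is an assumption about how the functions correspond, not something stated in the slides: `cdf` plays the role of NORMDIST with TRUE, `pdf` of NORMDIST with FALSE, and `inv_cdf` of NORM.INV.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

for x in (0.0, 2.0):
    cumulative = std_normal.cdf(x)  # like NORMDIST(x, 0, 1, TRUE)
    density = std_normal.pdf(x)     # like NORMDIST(x, 0, 1, FALSE)
    print(f"x = {x}: cumulative = {cumulative:.1%}, density = {density:.1%}")

for probability in (0.50, 0.977):
    x = std_normal.inv_cdf(probability)  # like NORM.INV(probability, 0, 1)
    print(f"probability = {probability:.1%}: x = {x:.1f}")
```

The printed values should match the example rows in the table above.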

Sensitivity Analysis
[Figure: comparison of normal curves. Same shape, just shifted; same center, different shape]
Slide 12

Key Numbers To Remember
[Figure: standard normal probability distribution f(x) for x from -3.9 to 3.9, with -1 and +1 standard deviation marked]
- The standard normal curve is symmetrical
- We tend to look at variation around the mean

Range around the Mean            Probability
+/- 1 standard deviation         68.3%
+/- 2 standard deviations        95.4%
+/- 3 standard deviations        99.7%
+/- 1.96 standard deviations     95.0%
+/- 2.58 standard deviations     99.0%
Slide 13
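These probabilities follow from the symmetry of the curve: the chance of falling within k standard deviations of the mean is F(k) - F(-k). A short sketch using `statistics.NormalDist`; the helper name `prob_within` is illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def prob_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3, 1.96, 2.58):
    print(f"+/- {k} standard deviations: {prob_within(k):.1%}")
```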

Central Limit Theorem
- Suppose we have a sample of n independent draws, X_1, X_2, ..., X_n, from any distribution with mean µ and finite variance σ^2
- Then we can define a new random variable, the sample mean: x̄ = (X_1 + X_2 + ... + X_n) / n
- For large n, the sample mean is approximately normally distributed with mean = µ (the population mean) and variance σ^2/n
- Which means Z is approximately a standard normal variable, where Z = (x̄ - µ) / (σ / √n)
Slide 14
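A small simulation can make the theorem concrete. The sketch below is illustrative only: it draws from an exponential distribution with mean 1 and variance 1 (a deliberately skewed, non-normal choice made for this example), and checks that the sample means cluster around µ with variance close to σ^2/n. The seed, sample size and number of trials are arbitrary assumptions.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

n = 30         # draws per sample
trials = 5000  # number of samples

# Each entry is the mean of n independent exponential draws (mu = 1, sigma^2 = 1)
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(trials)
]

mean_of_means = statistics.fmean(sample_means)
variance_of_means = statistics.pvariance(sample_means)

print(f"Mean of sample means:     {mean_of_means:.3f} (theory: 1.000)")
print(f"Variance of sample means: {variance_of_means:.4f} (theory: {1 / n:.4f})")
```

Plotting a histogram of `sample_means` would show a roughly bell-shaped curve even though the underlying draws are heavily skewed.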

Our major concern
[Figure: distribution curve with callouts]
- When we want to be here?
- Does our data say we are here?
- Will we be here next year?
Are results stable year over year?
Do I need a margin, and if so, how much?
Slide 15

The trick is to narrow the curve
[Figure: probability curves for n = 10 and n = 50, x from -1.5 to 1.5]
- The variance of the sample mean is σ^2/n
- Major trade-off: the more homogeneous the group, the smaller the sigma, but also the smaller the n
Slide 16

Confidence Interval for IP Admits

Members                                N = 1,000   N = 100,000   N = 1,000,000
Probability of Admit                   6.0%        6.0%          6.0%
Expected Admits                        60          6,000         60,000
Variance = N x p x (1 - p) = σ^2       56.4        5,640         56,400
Standard Deviation = σ                 7.5         75.1          237.5
Multiplier at 95% Confidence Level     +/-1.96     +/-1.96       +/-1.96
Confidence Interval                    +/-14.7     +/-147.2      +/-465.5
Confidence Limit as % of Mean          +/-24.5%    +/-2.5%       +/-0.8%

If a population has 1,000,000 members, then there is a 95% chance that any sample will be within +/-0.8% of the true mean
Slide 17
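The rows of this table can be reproduced with the binomial formulas and the normal approximation. A minimal sketch; the function name `admit_confidence_interval` and its return layout are assumptions made for illustration.

```python
import math


def admit_confidence_interval(members, p=0.06, z=1.96):
    """Return (expected admits, standard deviation, half-width, half-width as % of mean)."""
    expected = members * p
    std_dev = math.sqrt(members * p * (1 - p))  # binomial standard deviation
    half_width = z * std_dev                    # 95% confidence multiplier by default
    return expected, std_dev, half_width, half_width / expected


for members in (1_000, 100_000, 1_000_000):
    expected, std_dev, half_width, pct = admit_confidence_interval(members)
    print(f"N = {members:>9,}: expected = {expected:,.0f}, sigma = {std_dev:.1f}, "
          f"interval = +/-{half_width:.1f} (+/-{pct:.1%})")
```

Note how the interval as a percentage of the mean shrinks with the square root of N: a thousandfold increase in membership narrows it by a factor of about 32.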

The Standard for Full Credibility
- The standard can be expressed in terms of a confidence interval
- Example: How many observations do I need to be 95% sure that my data is within +/-1% of the true mean?
- In our example, full credibility requires roughly 1 million members or 60,000 admits
- Use the logic on the previous slide, but solve for N
Slide 18
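"Solving for N" can be sketched as follows. Setting z x √(N p (1 - p)) / (N p) equal to the tolerance k and rearranging gives N = (z / k)^2 x (1 - p) / p. The function name `members_for_full_credibility` is hypothetical, and the sketch relies on the same binomial and normal-approximation assumptions as the previous slide.

```python
import math


def members_for_full_credibility(p=0.06, tolerance=0.01, z=1.96):
    """Smallest N for which the 95% interval is within +/-tolerance of the mean."""
    return math.ceil((z / tolerance) ** 2 * (1 - p) / p)


n_members = members_for_full_credibility()
n_admits = n_members * 0.06
print(f"Members needed: {n_members:,}")
print(f"Expected admits: {n_admits:,.0f}")
```

Under these assumptions a strict +/-1% tolerance works out to about 600,000 members (about 36,000 admits); the 1 million members and 60,000 admits quoted on the slide correspond to the tighter +/-0.8% interval shown in the confidence interval table.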

Hypothesis Testing Overview

Description     Null Hypothesis                       Alternate Hypothesis
Mathematical    µ = µ*                                µ ≠ µ*
In Words        The population mean is µ*             The population mean is not µ*
                Our current assumption, µ*, is        We need to change our assumptions
                still correct

Hypothesis      Accept the Hypothesis     Reject the Hypothesis
True            Correct                   Type I error
False           Type II error             Correct

Standard practice is to err on the side of avoiding Type I error: accept the hypothesis unless there is a clear indication to the contrary
Slide 19

Hypothesis Testing: where does your test statistic fall at the 95% confidence level?
[Figure: bell curve with a central 95.0% Accept region and a 2.5% Reject region in each tail]
- The curve represents what the distribution will look like if the null hypothesis is true
- Where does your test statistic fall?
Slide 20

Hypothesis Testing

Members                                         N = 1,000   N = 100,000   N = 1,000,000
µ* = Expected Admits                            60          6,000         60,000
σ = Standard Deviation                          7.5         75.1          237.5
X = Actual Admits (3% higher than expected)     61.8        6,180         61,800
Z = Test Statistic = (X - µ*)/σ                 0.24        2.4           7.58
Multiplier at 95% Confidence Level              +/-1.96     +/-1.96       +/-1.96
Accept/Reject                                   Accept      Reject        Reject

In each case, the actual admits are 3% higher than expected, but we accept the null hypothesis if we only have 1,000 members and reject it if we have 100,000 or more.
Slide 21
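The accept/reject decision in the table can be sketched as a small function. The name `admits_z_test` and its parameters are illustrative; the logic simply compares the test statistic with the +/-1.96 multiplier.

```python
import math


def admits_z_test(members, actual_ratio=1.03, p=0.06, critical=1.96):
    """Return (z, decision) for actual admits equal to actual_ratio x expected."""
    expected = members * p                      # mu*, the current assumption
    std_dev = math.sqrt(members * p * (1 - p))  # sigma under the null hypothesis
    actual = expected * actual_ratio
    z = (actual - expected) / std_dev
    decision = "Reject" if abs(z) > critical else "Accept"
    return z, decision


for members in (1_000, 100_000, 1_000_000):
    z, decision = admits_z_test(members)
    print(f"N = {members:>9,}: Z = {z:.2f} -> {decision}")
```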

p-value
- The probability, when the null hypothesis is true, of drawing a sample at least as extreme as yours; loosely, the risk of a Type I error if you reject the null hypothesis on this evidence
- The lower the p-value the better: it should be less than 1 - confidence level (5% at the 95% confidence level)
- Considered the gold standard for statistically significant differences
- But the p-value is based on one sample from one population:
  - What if the next sample shows there is no difference?
  - What happens if you use a similar but not identical population that shows no difference?
- This has been controversial since the early days of statistics
Recommendation
- Routinely check the p-value using Z.TEST in Excel
- Reconcile differences
Slide 22
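For a z statistic, the two-sided p-value can be computed from the standard normal cumulative function. A minimal sketch with the hypothetical helper `two_sided_p_value`; note that Excel's Z.TEST returns a one-tailed value, so the figures are not directly comparable without doubling.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """Probability of a standard normal value at least as extreme as z in either tail."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Test statistics from the hypothesis testing table for 1,000 and 100,000 members
for z in (0.24, 2.40):
    p = two_sided_p_value(z)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"Z = {z:.2f}: p-value = {p:.3f} ({verdict} at the 95% level)")
```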

But how do we know what µ and σ are?
We are going to have to estimate µ and σ, but we need some criteria first:
- Consistent estimator: tends to converge on the true value as the sample size becomes larger
- Maximum likelihood estimator: the parameter value under which the probability of observing the sample actually drawn is maximized
- Unbiased estimator: expected value is equal to the true value
Slide 23

Some rules of thumb
- Use the sample mean to estimate the population mean: x̄ = Σ x_i / n
- If n > 30, use the sample variance: s^2 = Σ (x̄ - x_i)^2 / (n - 1)
- In Excel
  - STDEV.P returns the population standard deviation (divides by n)
  - STDEV.S returns the sample standard deviation (divides by n - 1)
Slide 24
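The distinction between dividing by n and by n - 1 also appears in Python's `statistics` module, where `pstdev` is the counterpart of STDEV.P and `stdev` of STDEV.S. The five observations below are the observed values from the regression example later in this deck, used here purely for illustration.

```python
import statistics

observations = [6.0, 5.0, 8.0, 6.0, 10.0]

sample_mean = statistics.fmean(observations)
population_variance = statistics.pvariance(observations)  # divides by n
sample_variance = statistics.variance(observations)       # divides by n - 1

print(f"Sample mean:                   {sample_mean:.1f}")
print(f"Population variance (n):       {population_variance:.1f}")
print(f"Sample variance (n - 1):       {sample_variance:.1f}")
print(f"Population standard deviation: {statistics.pstdev(observations):.3f}")
print(f"Sample standard deviation:     {statistics.stdev(observations):.3f}")
```

With only five observations the two variances differ noticeably (3.2 versus 4.0); the gap closes as n grows.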

Chi-Square Distribution: Sampling
Χ^2 = Z_1^2 + Z_2^2 + ... + Z_n^2
- Where Z_1, Z_2, etc. are independent standard normal variables
- A sum of n such terms has n degrees of freedom, with E(Χ^2) = n and Var(Χ^2) = 2n
- In sampling, where deviations are measured from the sample mean, it has n - 1 degrees of freedom
- Σ (A - E)^2 / E is approximately Chi-square with n - k - 1 degrees of freedom, where k = number of parameters to be estimated
Slide 25

The t Distribution for small samples
Define a new distribution T = Z / √(Y/n), where
- Z is a standard normal distribution
- Y is a Chi-square distribution with n degrees of freedom
Example: t = (x̄ - µ*) / (s / √n)
Note 1: µ* is the hypothetical population mean, usually the current assumption
Note 2: t has n - 1 degrees of freedom
Slide 26
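The example statistic can be computed directly from a sample. In the sketch below the sample values and the assumed mean µ* = 6.0 are made up for illustration, and `t_statistic` is a hypothetical helper name.

```python
import math
import statistics


def t_statistic(sample, mu_star):
    """Return (t, degrees of freedom) for testing the sample mean against mu_star."""
    n = len(sample)
    x_bar = statistics.fmean(sample)
    s = statistics.stdev(sample)  # sample standard deviation, divides by n - 1
    t = (x_bar - mu_star) / (s / math.sqrt(n))
    return t, n - 1


sample = [6.0, 5.0, 8.0, 6.0, 10.0]  # illustrative observations
t, degrees_of_freedom = t_statistic(sample, mu_star=6.0)
print(f"t = {t:.3f} with {degrees_of_freedom} degrees of freedom")
```

The resulting t would then be compared with a critical value from the t distribution (for example via T.INV in Excel) rather than with 1.96.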

t-distribution Examples

T.DIST(x, df, logical) returns
x     Degrees of Freedom   logical = TRUE   logical = FALSE
-3    10                   0.7%             1.1%
0     10                   50.0%            38.9%
3     10                   99.3%            1.1%
-3    20                   0.4%             0.8%
0     20                   50.0%            39.4%
3     20                   99.6%            0.8%

T.INV(probability, df) returns
Probability   Degrees of Freedom   x
0.7%          10                   -3
50.0%         10                   0
99.3%         10                   3
0.4%          20                   -3
50.0%         20                   0
99.6%         20                   3
Slide 27

Other Uses of Chi-Square Distributions
- Σ (A - E)^2 / E is approximately Chi-square with n - k - 1 degrees of freedom, where k = number of parameters to be estimated
- The chi-square test can be used to test independence between two distributions
- Can run a hypothesis test to indicate whether there is a real difference between two distributions
Slide 28

Sample Probability Distribution: IP Length of Stay

Range             i    ALOS x_i   Probability f(x_i)   Cumulative F(x_i)   Variance (x_i - µ)^2
Exactly 1 day     1    1.0        20.9%                20.9%               9.0
Exactly 2 days    2    2.0        29.9%                50.8%               4.0
Exactly 3 days    3    3.0        18.9%                69.8%               1.0
Exactly 4 days    4    4.0        10.2%                80.0%               0.0
Exactly 5 days    5    5.0        5.4%                 85.4%               1.0
Exactly 6 days    6    6.0        3.3%                 88.7%               4.0
Exactly 7 days    7    7.0        2.4%                 91.0%               9.0
Exactly 8 days    8    8.0        1.7%                 92.7%               16.0
Exactly 9 days    9    9.0        1.2%                 93.8%               24.9
10+ Days          10   22.0       6.1%                 100.0%              323.8
Sum/Sumproduct         4.0        100.0%                                   24.1
Slide 29
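The Sum/Sumproduct row applies the basic mean and variance formulas to a discrete distribution. A short sketch using the rounded probabilities from the table, so the results differ slightly from the slide, which presumably used unrounded values:

```python
# (average length of stay x_i, probability f(x_i)) from the table above
los_distribution = [
    (1.0, 0.209), (2.0, 0.299), (3.0, 0.189), (4.0, 0.102), (5.0, 0.054),
    (6.0, 0.033), (7.0, 0.024), (8.0, 0.017), (9.0, 0.012), (22.0, 0.061),
]

total_probability = sum(f for _, f in los_distribution)
mean_los = sum(x * f for x, f in los_distribution)
variance_los = sum((x - mean_los) ** 2 * f for x, f in los_distribution)

print(f"Total probability: {total_probability:.1%}")
print(f"Mean LOS:          {mean_los:.1f}")
print(f"Variance:          {variance_los:.1f}")
```

Most of the variance comes from the 10+ day bucket: its squared deviation of roughly 324 dominates even at a 6.1% probability.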

Chi-Square Test

Range             i    Expected LOS   Expected Distribution   Expected Admits   Actual LOS   Actual Admits   Χ^2
Exactly 1 day     1    1              20.9%                   20.9              1.0          18.0            0.41
Exactly 2 days    2    2              29.9%                   29.9              2.0          32.0            0.14
Exactly 3 days    3    3              18.9%                   18.9              3.0          20.0            0.06
Exactly 4 days    4    4              10.2%                   10.2              4.0          10.0            0.00
Exactly 5 days    5    5              5.4%                    5.4               5.0          5.0             0.04
Exactly 6 days    6    6              3.3%                    3.3               6.0          3.0             0.02
Exactly 7 days    7    7              2.4%                    2.4               7.0          2.0             0.06
Exactly 8 days    8    8              1.7%                    1.7               8.0          2.0             0.07
Exactly 9 days    9    9              1.2%                    1.2               9.0          1.0             0.02
10+ Days          10   22             6.1%                    6.1               24.0         7.0             0.12
Sum/Sumproduct         4.0            100.0%                  100.0             4.3          100.0           0.95

Chi-square Statistic = 0.95
CHISQ.TEST(actual range, expected range) = 99.95%
Slide 30
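The chi-square statistic in the last column is Σ (A - E)^2 / E over the ten ranges. A minimal sketch using the rounded expected admits from the table; because of that rounding the total may differ from the slide's 0.95 in the second decimal place.

```python
expected_admits = [20.9, 29.9, 18.9, 10.2, 5.4, 3.3, 2.4, 1.7, 1.2, 6.1]
actual_admits = [18.0, 32.0, 20.0, 10.0, 5.0, 3.0, 2.0, 2.0, 1.0, 7.0]

# Contribution of each range: (actual - expected)^2 / expected
contributions = [
    (a - e) ** 2 / e for a, e in zip(actual_admits, expected_admits)
]
chi_square = sum(contributions)

for i, c in enumerate(contributions, start=1):
    print(f"Range {i:>2}: {c:.2f}")
print(f"Chi-square statistic: {chi_square:.2f}")
```

A statistic this small relative to the number of ranges is why the Excel test returns a probability close to 100%: the actual admits are very close to the expected distribution.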

Regression Analysis

Residual: Observed Value - Predicted Value

Observation Number   Independent Variable   Predicted Value   Observed Value   Residual
i                    x                      ŷ                 y_i              e_i
1                    0.5                    5.2               6.0              0.8
2                    1.0                    6.1               5.0              (1.1)
3                    1.5                    7.0               8.0              1.0
4                    2.0                    7.9               6.0              (1.9)
5                    2.5                    8.8               10.0             1.2

- There is a curve that represents the true underlying values
- Residuals are values of a random variable ϵ_i
- The residual is basically the difference between the dot and the line
Slide 32

Underlying Assumptions
y_i = β_0 + β_1 x_i + ϵ_i
- x_1, x_2, ..., x_n are non-stochastic variables
- E(ϵ_i) = 0 and Var(ϵ_i) = σ^2
- The ϵ_i's are independent random variables
Note: β_0, β_1 and σ^2 are the true unknown values. We are going to have to estimate these values based on a specific data set
Slide 33

Analysis of Variance (ANOVA): Total Sum of Squares (TSS)

Observation Number   Observed Value   Overall Mean   Δ           Δ^2
i                    y_i              ȳ              y_i - ȳ
1                    6.0              7.0            (1.0)       1.0
2                    5.0              7.0            (2.0)       4.0
3                    8.0              7.0            1.0         1.0
4                    6.0              7.0            (1.0)       1.0
5                    10.0             7.0            3.0         9.0
Total                                                            16.0

- The purpose of ANOVA is to understand how much of the variance is accounted for by the curve
- The starting point is calculating the total variance from the overall mean (the red line)
Slide 34

Regression Sum of Squares (RSS): Predicted Value - Overall Mean

Observation Number   Predicted Value   Overall Mean   Δ          Δ^2
i                    ŷ                 ȳ              ŷ - ȳ
1                    5.2               7.0            (1.8)      3.2
2                    6.1               7.0            (0.9)      0.8
3                    7.0               7.0            -          -
4                    7.9               7.0            0.9        0.8
5                    8.8               7.0            1.8        3.2
Total                                                            8.1

- How much of the variance is explained by the fact that we have a curve?
- Looking at the difference between the blue line and the red line
Slide 35

Error Sum of Squares (ERSS): Residuals
[Figure: A Simple Regression Example, y values plotted against x values]

Observation Number   Observed Value   Predicted Value   Δ = e_i      Δ^2
i                    y_i              ŷ                 y_i - ŷ
1                    6.0              5.2               0.8          0.6
2                    5.0              6.1               (1.1)        1.2
3                    8.0              7.0               1.0          1.0
4                    6.0              7.9               (1.9)        3.6
5                    10.0             8.8               1.2          1.4
Total                                                                7.9

Measures unexplained variance
Slide 36

TSS = RSS + ERSS

Sum of Squares                    Abv    Description of Δ                     Value
Total (Total Variance)            TSS    Observed Values - Overall Mean       16.0
Regression (Explained Variance)   RSS    Predicted Values - Overall Mean      8.1
Error (Unexplained Variance)      ERSS   Observed Values - Predicted Values   7.9

- Total Variance = Explained Variance + Unexplained Variance
- R^2 = Explained Variance / Total Variance = % of Total Variance Explained by Regression
- R^2 = 8.1/16 = 51%
Slide 37
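The whole decomposition can be reproduced from the five data points. The sketch below fits the line with the standard closed-form least squares formulas (an assumption about how the slide's predicted values were obtained, although the numbers agree) and then computes TSS, RSS, ERSS and R^2.

```python
xs = [0.5, 1.0, 1.5, 2.0, 2.5]
ys = [6.0, 5.0, 8.0, 6.0, 10.0]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Closed-form least squares estimates for a straight line
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
intercept = y_bar - slope * x_bar
predicted = [intercept + slope * x for x in xs]

tss = sum((y - y_bar) ** 2 for y in ys)                      # observed - overall mean
rss = sum((p - y_bar) ** 2 for p in predicted)               # predicted - overall mean
erss = sum((y - p) ** 2 for y, p in zip(ys, predicted))      # observed - predicted
r_squared = rss / tss

print(f"Line: y = {intercept:.1f} + {slope:.1f} x")
print(f"TSS = {tss:.1f}, RSS = {rss:.1f}, ERSS = {erss:.1f}")
print(f"R^2 = {r_squared:.0%}")
```

The identity TSS = RSS + ERSS holds exactly only when the line is the least squares fit, which is one reason the fitting step is included here.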

Why are non-stochastic values important?

Stochastic Variable                    Non-Stochastic Variable
Member lives in Zip 999                Area-adjusted PMPM
Member is 42, Male                     Age-sex adjusted PMPM
Member took health risk assessment     % taking health risk assessment

- Stochastic variables introduce variance not accounted for in standard analysis of variance
- May be ignoring factors important in determining the value
  - Incentive for taking health risk assessment may not be the same for each group
- Is this variance always material?
Slide 38

The Bad News
- Health care costs are not normally distributed, so generalized linear models may have to be used
- Excel does not handle generalized linear models
- Can still use other methods, but be careful about disclaimers
Slide 39

Criteria for estimating β_0, β_1 and σ
We are going to use the same criteria that we used to estimate µ and σ in Stats 101:
- Consistent estimator: tends to converge on the true value as the sample size becomes larger
- Maximum likelihood estimator: the parameter value under which the probability of observing the sample actually drawn is maximized
- Unbiased estimator: expected value is equal to the true value
Slide 40

Least Squares Estimate
- Basic premise: find the values b_0 and b_1 which minimize the sum of the squares of the distances from each data point to the theoretical line: Σ (y_i - (b_0 + b_1 x_i))^2
- Take the first derivative with respect to b_0 and b_1, set each to zero, and solve for the values
- Results are consistent, maximum likelihood and unbiased estimators
Slide 41

A Good Candidate for Simple Linear Regression

Statistic    Value
Intercept    4.2
Slope        1.1
R^2          73%

- Variance appears to be normal
  - ~2/3 fall within 1 standard deviation of the mean
  - ~1/3 fall between 1 and 2 standard deviations
- Line is not too flat
- High R^2
Slide 42

Excel Formulas For Key Values
- Intercept: =INTERCEPT(known y's, known x's)
- Slope: =SLOPE(known y's, known x's)
- R^2: =RSQ(known y's, known x's)

Input Values
x       y
1.0     2.08
1.5     5.75
2.0     6.66
2.5     4.96
3.0     8.25
3.5     7.54
4.0     12.07
4.5     9.90
5.0     9.96
5.5     13.24
6.0     12.06
6.5     9.89
7.0     14.66
7.5     14.03
8.0     11.17
8.5     17.49
9.0     15.28
9.5     13.55
10.0    12.57
10.5    14.59
Slide 43
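The same three values can be obtained outside Excel with the closed-form least squares formulas. A sketch under that assumption; `fit_line` is a hypothetical helper, and its three return values are intended to correspond to INTERCEPT, SLOPE and RSQ.

```python
def fit_line(xs, ys):
    """Return (intercept, slope, r_squared) for a least squares straight line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r_squared = sxy ** 2 / (sxx * syy)  # squared correlation coefficient
    return intercept, slope, r_squared


xs = [1.0 + 0.5 * i for i in range(20)]  # 1.0, 1.5, ..., 10.5
ys = [2.08, 5.75, 6.66, 4.96, 8.25, 7.54, 12.07, 9.90, 9.96, 13.24,
      12.06, 9.89, 14.66, 14.03, 11.17, 17.49, 15.28, 13.55, 12.57, 14.59]

intercept, slope, r_squared = fit_line(xs, ys)
print(f"Intercept = {intercept:.1f}, Slope = {slope:.1f}, R^2 = {r_squared:.0%}")
```

On the input values above this should agree with the intercept of 4.2, slope of 1.1 and R^2 of 73% quoted on the preceding slide.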

Excel has a data analysis add-in
- Data > Data Analysis
- Requires a one-time set-up
Slide 44

Data Analysis has several options Choose regression option Slide 45

Minimum input: known y's, known x's, output placement
Slide 46

Sample Output

SUMMARY OUTPUT

Regression Statistics (key values)
Multiple R           0.857
R Square             0.735
Adjusted R Square    0.720
Standard Error       2.091
Observations         20

ANOVA
              df    SS        MS        F        Significance F
Regression    1     218.06    218.06    49.88    0.00
Residual      18    78.69     4.37
Total         19    296.75

Coefficient test data
                Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept       4.2            1.04             4.03     0.00      2.01        6.39
X Variable 1    1.1            0.16             7.06     0.00      0.80        1.49

Additional output available if requested in dialogue box
Slide 47
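Several of the key values in this output can be derived from the two sums of squares alone, which is a useful consistency check when reading a regression report. The sketch below uses the standard textbook definitions of these statistics and the figures from the ANOVA block; the variable names are illustrative.

```python
import math

observations = 20
regressors = 1
ss_regression = 218.06
ss_residual = 78.69

ss_total = ss_regression + ss_residual
df_regression = regressors
df_residual = observations - regressors - 1

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

f_statistic = ms_regression / ms_residual
r_square = ss_regression / ss_total
adjusted_r_square = 1 - ms_residual / (ss_total / (observations - 1))
standard_error = math.sqrt(ms_residual)

print(f"F = {f_statistic:.2f}")
print(f"R Square = {r_square:.3f}, Adjusted R Square = {adjusted_r_square:.3f}")
print(f"Standard Error = {standard_error:.3f}")
```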

Anscombe's Quartet: Data Set 1 vs Data Set 2
[Figure: scatter plots of Data Set 1 and Data Set 2]
- What is the expected difference in slope, intercept and R^2?
- Would you rely on the curve for data set 1? For data set 2?
Slide 48

Anscombe's Quartet: Data Set 1 vs Data Set 3
[Figure: scatter plots of Data Set 1 and Data Set 3]
- Would you rely on the curve for data set 3?
Slide 49

Anscombe's Quartet: Data Set 1 vs Data Set 4
[Figure: scatter plots of Data Set 1 and Data Set 4]
- Would you rely on the curve for data set 4?
Slide 50

Example: How well does risk score predict cost?
[Figure: scatter plot of Area Adjusted PMPM ($0 to $1,200) against Risk Score (0 to 6.0)]

Function     Value
INTERCEPT    $35.18
SLOPE        $183.86
CORREL       62%
AVERAGE      184.39

- Used retro risk score
- Random sample of males aged 42 from a large group
- N = 28
- Divided raw PMPM by area factor
Slide 51

What is the Value of R^2?
Slide 52

The Basics
y_i = β_0 + β_1 x_1 + ... + β_n x_n + ϵ_i
ϵ_i is a value of the residual random variable described earlier
Slide 53

Why Multiple Linear Regression
- Shape of the curve/polynomial
  - y_i = β_0 + β_1 x + β_2 x^2 + ϵ_i
- Additional explanatory variables
  - Age + Gender probably explains costs better than age alone
- Control for confounding factors
  - All other things being equal
  - Example: Control for area
Slide 54

Are your independent variables dependent on each other?
- In most cases, find the independent variable that best explains overall variance
  - Test each independent variable one at a time
  - Also, combinations of variables
- Test interdependence by doing analytics comparing just the variables in question
  - How well does age-gender predict risk score?
  - How well does risk score predict age-gender?
Slide 55

Categorical Variables
- Categorical: separate into groups even if the variable is not numeric per se
  - Gender: 1 = female, 0 = male
  - Alternately
    - Gender1 = 1 if female, 0 if male
    - Gender2 = 0 if female, 1 if male
- Think in terms of the marginal impact of each variable: the expected value of y_i goes up β_i with each unit change in x_i
Slide 56

Where Do You Go From Here?
- Pick a resource
  - Barron's Business Statistics
  - Anything by Jed Frees
- Practice, practice, practice
  - Start with basics (chi-square, confidence limits, hypothesis testing)
  - Move to regression analysis: use adjusted PMPMs etc. to get yourself started
  - Make sure you can analyze and explain results
- Move to advanced analytics
  - GLM for health care costs
  - Trend methods
  - Evaluation (probit, propensity analysis, etc.)
  - Disease progression (survival models)
Slide 57

Q&A and Wrap-Up