Semiparametric Modeling, Penalized Splines, and Mixed Models

Semi 1 Semiparametric Modeling, Penalized Splines, and Mixed Models
David Ruppert, Cornell University
http://www.orie.cornell.edu/~davidr
January 24
Joint work with Babette Brumback, Ray Carroll, Brent Coull, Ciprian Crainiceanu, Matt Wand, Yan Yu, and others

Semi 2 Example (data from Hastie and James; this analysis in RWC)
[Figure: spinal bone mineral density versus age (years)]

Semi 3 Possible Model
$\mathrm{SBMD}_{i,j}$ is the spinal bone mineral density of the $i$th subject at age $\mathrm{age}_{i,j}$
$$\mathrm{SBMD}_{i,j} = U_i + m(\mathrm{age}_{i,j}) + \epsilon_{i,j}, \qquad i = 1, \dots, m = 23, \quad j = 1, \dots, n_i$$
$U_i$ is the random intercept for subject $i$
$\{U_i\}$ are assumed iid $N(0, \sigma_u^2)$

Semi 4 Underlying philosophy
1. minimalist statistics: keep it as simple as possible
2. build on classical parametric statistics
3. modular methodology

Semi 5 Reference
Semiparametric Regression by Ruppert, Wand, and Carroll (2003)
Lots of examples from biostatistics

Semi 6 Recent Example
April 17, 2003: Canfield et al. (2003), Intellectual impairment and blood lead
longitudinal (mixed model)
nine covariates (modelled linearly)
effect of lead modelled as a spline (semiparametric model)
disturbing conclusion

Semi 7
[Figure: IQ versus lead (microgram/deciliter), showing the quadratic and spline fits]
Thanks to Rich Canfield for data and estimates

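Semi 8 Semiparametric regression
Partial linear or partial spline model:
$$Y_i = W_i^T \beta_W + m(X_i) + \epsilon_i$$
$$m(X_i) = \mathbf{X}_i^T \beta_X + B^T(X_i)\, b$$
$$B^T(x) = \bigl( B_1(x) \ \cdots \ B_K(x) \bigr)$$
E.g.,
$$\mathbf{X}_i^T = \bigl( X_i \ \cdots \ X_i^p \bigr), \qquad B^T(x) = \bigl\{ (x - \kappa_1)_+^p \ \cdots \ (x - \kappa_K)_+^p \bigr\}$$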

Semi 9 Example
$$m(x) = \beta_0 + \beta_1 x + b_1 (x - \kappa_1)_+ + \cdots + b_K (x - \kappa_K)_+$$
slope jumps by $b_k$ at $\kappa_k$
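The slope-jump property of the linear spline can be checked numerically. The following is a minimal sketch in Python with NumPy; the function name `linear_spline`, the knot locations, and the coefficient values are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def linear_spline(x, beta0, beta1, b, knots):
    """Evaluate m(x) = beta0 + beta1*x + sum_k b_k * (x - kappa_k)_+."""
    x = np.asarray(x, dtype=float)
    # Plus functions (x - kappa_k)_+ = max(x - kappa_k, 0), one column per knot.
    plus = np.maximum(x[:, None] - np.asarray(knots, dtype=float)[None, :], 0.0)
    return beta0 + beta1 * x + plus @ np.asarray(b, dtype=float)

knots = [1.0, 2.0]    # hypothetical knot locations kappa_1, kappa_2
b = [0.5, -2.0]       # hypothetical slope jumps b_1, b_2
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
m = linear_spline(x, beta0=1.0, beta1=1.0, b=b, knots=knots)

# Slopes on each interval: 1 before kappa_1, 1 + b_1 between the knots,
# and 1 + b_1 + b_2 after kappa_2.
slopes = np.diff(m) / np.diff(x)
```

With these values the slope is 1 to the left of the first knot, 1.5 between the knots, and -0.5 to the right of the second knot, so it changes by exactly $b_k$ at each $\kappa_k$.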

Semi 10 Linear plus function
[Figure: a linear plus function ("plus fn") and its derivative]

Semi 11 Fitting LIDAR data with plus functions
[Figure: log ratio versus range]

Semi 12 Generalization
$$m(x) = \beta_0 + \beta_1 x + \cdots + \beta_p x^p + b_1 (x - \kappa_1)_+^p + \cdots + b_K (x - \kappa_K)_+^p$$
$p$th derivative jumps by $p!\, b_k$ at $\kappa_k$
first $p - 1$ derivatives are continuous

Semi 13 Quadratic plus function
[Figure: a quadratic plus function ("plus fn") with its derivative and 2nd derivative]

Semi 14 Ordinary Least Squares
[Figure: panels showing the raw data and ordinary least squares fits as the number of knots increases]

Semi 15 Penalized least-squares
Minimize
$$\sum_{i=1}^{n} \bigl\{ Y_i - \bigl( W_i^T \beta_W + \mathbf{X}_i^T \beta_X + B^T(X_i)\, b \bigr) \bigr\}^2 + \lambda\, b^T D b$$
E.g., $D = I$

Semi 16 Penalized Least Squares
[Figure: panels showing the raw data and penalized least squares fits as the number of knots increases]

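Semi 17 Ridge Regression
From previous slide:
$$\sum_{i=1}^{n} \bigl\{ Y_i - \bigl( W_i^T \beta_W + \mathbf{X}_i^T \beta_X + B^T(X_i)\, b \bigr) \bigr\}^2 + \lambda\, b^T D b$$
Let $\mathcal{X}$ have rows $\bigl( W_i^T \ \ \mathbf{X}_i^T \ \ B^T(X_i) \bigr)$. Then
$$\begin{pmatrix} \hat\beta_W \\ \hat\beta_X \\ \hat b \end{pmatrix} = \bigl\{ \mathcal{X}^T \mathcal{X} + \lambda\, \mathrm{blockdiag}(0, 0, D) \bigr\}^{-1} \mathcal{X}^T Y$$
Also a BLUP in a mixed model and an empirical Bayes estimator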
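The ridge-type solution can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: it assumes $D = I$, no additional linear covariates $W$, a degree $p \ge 1$, equally spaced knots, and simulated data. The function name `penalized_spline_fit` is hypothetical.

```python
import numpy as np

def penalized_spline_fit(x, y, knots, lam, p=1):
    """Ridge-type solution {C^T C + lam * blockdiag(0, D)}^{-1} C^T y with D = I.

    Assumes p >= 1 and no extra linear covariates W.
    """
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    X = np.vander(x, p + 1, increasing=True)                 # columns 1, x, ..., x^p
    B = np.maximum(x[:, None] - knots[None, :], 0.0) ** p    # (x - kappa_k)_+^p
    C = np.hstack([X, B])
    # Penalize only the spline coefficients b; the polynomial part is unpenalized.
    penalty = np.diag(np.r_[np.zeros(p + 1), np.ones(len(knots))])
    coef = np.linalg.solve(C.T @ C + lam * penalty, C.T @ y)
    return coef, C @ coef

# Simulated data (illustrative only).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)
knots = np.linspace(0.0, 1.0, 22)[1:-1]   # 20 interior knots

coef_rough, fit_rough = penalized_spline_fit(x, y, knots, lam=1e-4)
coef_smooth, fit_smooth = penalized_spline_fit(x, y, knots, lam=10.0)
```

Increasing `lam` shrinks the spline coefficients $b$ toward zero and pulls the fit toward the unpenalized polynomial, at the cost of a larger residual sum of squares.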

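Semi 18 Linear Mixed Models
$$Y = X\beta + Zb + \varepsilon$$
where $b$ is $N(0, \sigma_b^2 \Sigma_b)$
$X\beta$ are the fixed effects and $Zb$ are the random effects
Henderson's equations:
$$\begin{pmatrix} \hat\beta \\ \hat b \end{pmatrix} = \begin{pmatrix} X^T X & X^T Z \\ Z^T X & Z^T Z + \lambda \Sigma_b^{-1} \end{pmatrix}^{-1} \begin{pmatrix} X^T Y \\ Z^T Y \end{pmatrix}, \qquad \lambda = \frac{\sigma_\varepsilon^2}{\sigma_b^2}$$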
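As a numerical check, Henderson's equations can be compared with the direct route of generalized least squares for $\beta$ followed by the BLUP of $b$. The sketch below uses simulated matrices and treats the variance components as known; all numerical values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 30, 2, 5
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # fixed-effects design
Z = rng.normal(size=(n, q))                              # random-effects design
y = rng.normal(size=n)
sigma2_eps, sigma2_b = 1.5, 0.5                          # assumed known here
Sigma_b = np.diag(np.linspace(0.5, 2.0, q))
lam = sigma2_eps / sigma2_b

# Henderson's mixed model equations.
lhs = np.block([[X.T @ X, X.T @ Z],
                [Z.T @ X, Z.T @ Z + lam * np.linalg.inv(Sigma_b)]])
rhs = np.concatenate([X.T @ y, Z.T @ y])
sol = np.linalg.solve(lhs, rhs)
beta_h, b_h = sol[:p], sol[p:]

# Direct route: GLS for beta, then the BLUP for b, using
# V = Cov(Y) = sigma2_eps * I + sigma2_b * Z Sigma_b Z^T.
V = sigma2_eps * np.eye(n) + sigma2_b * Z @ Sigma_b @ Z.T
V_inv = np.linalg.inv(V)
beta_gls = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ y)
b_blup = sigma2_b * Sigma_b @ Z.T @ V_inv @ (y - X @ beta_gls)
```

The two routes agree to numerical precision. Henderson's form is the more convenient one here because it only requires solving a system of size $p + q$ rather than inverting the $n \times n$ matrix $V$.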

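Semi 19 From previous slides:
Let $\mathcal{X}$ have rows $\bigl( W_i^T \ \ \mathbf{X}_i^T \ \ B^T(X_i) \bigr)$. Then
$$\begin{pmatrix} \hat\beta_W \\ \hat\beta_X \\ \hat b \end{pmatrix} = \bigl\{ \mathcal{X}^T \mathcal{X} + \lambda\, \mathrm{blockdiag}(0, 0, D) \bigr\}^{-1} \mathcal{X}^T Y$$
Linear mixed model:
$$\begin{pmatrix} \hat\beta \\ \hat b \end{pmatrix} = \begin{pmatrix} X^T X & X^T Z \\ Z^T X & Z^T Z + \lambda \Sigma_b^{-1} \end{pmatrix}^{-1} \begin{pmatrix} X^T Y \\ Z^T Y \end{pmatrix} = \bigl\{ ( X \ \ Z )^T ( X \ \ Z ) + \lambda\, \mathrm{blockdiag}(0, \Sigma_b^{-1}) \bigr\}^{-1} ( X \ \ Z )^T Y$$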

Semi 20 Selecting λ
1. cross-validation (CV)
2. generalized cross-validation (GCV)
3. ML or REML in mixed model framework
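Option 2 can be sketched as a grid search. The code below uses one common form of the criterion, $\mathrm{GCV}(\lambda) = \{\mathrm{RSS}(\lambda)/n\} / \{1 - \mathrm{tr}(S_\lambda)/n\}^2$, where $S_\lambda$ is the smoother matrix. The grid of $\lambda$ values, the simulated data, and the function name `gcv_scores` are assumptions for illustration.

```python
import numpy as np

def gcv_scores(C, y, penalty, lambdas):
    """Return GCV(lam) and df_fit(lam) = tr(S_lam) for each lam in lambdas."""
    n = len(y)
    scores, dfs = [], []
    for lam in lambdas:
        # Smoother matrix S_lam maps y to the fitted values.
        S = C @ np.linalg.solve(C.T @ C + lam * penalty, C.T)
        resid = y - S @ y
        df = np.trace(S)
        scores.append((resid @ resid / n) / (1.0 - df / n) ** 2)
        dfs.append(df)
    return np.array(scores), np.array(dfs)

# Simulated data and a linear truncated-line basis (illustrative only).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)
knots = np.linspace(0.0, 1.0, 22)[1:-1]
C = np.column_stack([np.ones_like(x), x,
                     np.maximum(x[:, None] - knots[None, :], 0.0)])
penalty = np.diag(np.r_[np.zeros(2), np.ones(len(knots))])

lambdas = 10.0 ** np.arange(-6, 5)
scores, dfs = gcv_scores(C, y, penalty, lambdas)
lam_gcv = lambdas[np.argmin(scores)]
```

The degrees of freedom `dfs` fall from roughly the number of basis functions toward 2 (the unpenalized linear part) as $\lambda$ grows. In practice a finer grid or a one-dimensional optimizer would replace the coarse grid used here.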

Semi 21 Selecting the Number of Knots
[Figure: (a) SpaHet, j = 3, typical data set, with the true function and the full-search fit; (b) MASE comparisons: relative MASE versus K for fixed nknots, myopic, and full search; frequency of the selected number of knots (coded); comparison of ASE for two values of K]

Semi 22
[Figure: (a) SpaHetLS, j = 3, with the true function and the full-search fit; (b) MASE comparisons: relative MASE versus K for fixed nknots, myopic, and full search; frequency of the selected number of knots (coded); comparison of ASE for two values of K]

Semi 23
[Figure: MSE, variance, and bias versus df fit (λ), with the optimal value marked; quadratic spline]

Semi 24 Return to spinal bone mineral density study
[Figure: spinal bone mineral density versus age (years)]
$$\mathrm{SBMD}_{i,j} = U_i + m(\mathrm{age}_{i,j}) + \epsilon_{i,j}, \qquad i = 1, \dots, m = 23, \quad j = 1, \dots, n_i$$

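Semi 25
$$X = \begin{pmatrix} 1 & \mathrm{age}_{11} \\ \vdots & \vdots \\ 1 & \mathrm{age}_{1n_1} \\ \vdots & \vdots \\ 1 & \mathrm{age}_{m1} \\ \vdots & \vdots \\ 1 & \mathrm{age}_{mn_m} \end{pmatrix}$$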

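Semi 26
$$Z = \begin{pmatrix}
1 & \cdots & 0 & (\mathrm{age}_{11} - \kappa_1)_+ & \cdots & (\mathrm{age}_{11} - \kappa_K)_+ \\
\vdots & \ddots & \vdots & \vdots & & \vdots \\
1 & \cdots & 0 & (\mathrm{age}_{1n_1} - \kappa_1)_+ & \cdots & (\mathrm{age}_{1n_1} - \kappa_K)_+ \\
\vdots & \ddots & \vdots & \vdots & & \vdots \\
0 & \cdots & 1 & (\mathrm{age}_{m1} - \kappa_1)_+ & \cdots & (\mathrm{age}_{m1} - \kappa_K)_+ \\
\vdots & \ddots & \vdots & \vdots & & \vdots \\
0 & \cdots & 1 & (\mathrm{age}_{mn_m} - \kappa_1)_+ & \cdots & (\mathrm{age}_{mn_m} - \kappa_K)_+
\end{pmatrix}$$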

Semi 27
$$u = \bigl( U_1, \dots, U_m, b_1, \dots, b_K \bigr)^T$$
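The matrices $X$ and $Z$ for the random-intercept spline model can be assembled directly from subject labels and ages. The following is a minimal sketch; the function name `design_matrices`, the toy data, and the knot locations are hypothetical and are not the study data.

```python
import numpy as np

def design_matrices(subject, age, knots):
    """Build X = [1, age] and Z = [subject indicators | (age - kappa_k)_+]."""
    subject = np.asarray(subject)
    age = np.asarray(age, dtype=float)
    knots = np.asarray(knots, dtype=float)
    X = np.column_stack([np.ones_like(age), age])
    # One indicator column per subject; these columns multiply U_1, ..., U_m.
    ids, idx = np.unique(subject, return_inverse=True)
    indicators = np.eye(len(ids))[idx]
    # Linear plus functions in age; these columns multiply b_1, ..., b_K.
    plus = np.maximum(age[:, None] - knots[None, :], 0.0)
    Z = np.hstack([indicators, plus])
    return X, Z

# Toy data: 3 subjects with 2, 3, and 1 visits (hypothetical values).
subject = [0, 0, 1, 1, 1, 2]
age = [10.0, 12.0, 11.0, 15.0, 20.0, 18.0]
knots = [12.0, 16.0]
X, Z = design_matrices(subject, age, knots)
```

Each row of `Z` has a single 1 in the subject block followed by the plus-function values, which matches the ordering of $u = (U_1, \dots, U_m, b_1, \dots, b_K)^T$. In a full analysis the two blocks would receive separate variance components.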

Semi 28
[Figure: spinal bone mineral density versus age (years)]
Variability bars on $m$ and estimated density of $U_i$

Semi 29 Broken down by ethnicity
[Figure: spinal bone mineral density versus age (years), in separate panels for Hispanic, White, Asian, and Black subjects]

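Semi 30 Model with ethnicity effects
$$\mathrm{SBMD}_{ij} = U_i + m(\mathrm{age}_{ij}) + \beta_1 \mathrm{black}_i + \beta_2 \mathrm{hispanic}_i + \beta_3 \mathrm{white}_i + \varepsilon_{ij}, \qquad 1 \le j \le n_i, \quad 1 \le i \le m$$
Asian is the reference group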

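Semi 31 Only requires an expansion of the fixed effects by adding the columns
$$\begin{pmatrix}
\mathrm{black}_1 & \mathrm{hispanic}_1 & \mathrm{white}_1 \\
\vdots & \vdots & \vdots \\
\mathrm{black}_1 & \mathrm{hispanic}_1 & \mathrm{white}_1 \\
\vdots & \vdots & \vdots \\
\mathrm{black}_m & \mathrm{hispanic}_m & \mathrm{white}_m \\
\vdots & \vdots & \vdots \\
\mathrm{black}_m & \mathrm{hispanic}_m & \mathrm{white}_m
\end{pmatrix}$$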

Semi 32
[Figure: contrast with Asian subjects for Black, Hispanic, and White subjects]

Semi 33 In this model, the age-effect curves for the four ethnic groups are parallel.
Could we model them as non-parallel?
This might be problematic in this example because of the small values of the $n_i$.
But the methodology should be useful in other contexts.

Semi 34 Add interactions between age and black, hispanic, and white. These are fixed effects.
Then add interactions between black, hispanic, white, and asian and the linear plus functions in age. These are mean-zero random effects with their own variance component.
This variance component controls the amount of shrinkage of the ethnicity-specific curves to the overall effect.

Semi 35 Penalized Splines and Additive Models
Additive model:
$$Y_i = m_1(X_{1,i}) + \cdots + m_P(X_{P,i}) + \epsilon_i$$

Semi 36 Bivariate additive spline model
$$Y_i = \beta_0 + \beta_{x,1} X_i + b_{x,1} (X_i - \kappa_{x,1})_+ + \cdots + b_{x,K_x} (X_i - \kappa_{x,K_x})_+ + \beta_{z,1} Z_i + b_{z,1} (Z_i - \kappa_{z,1})_+ + \cdots + b_{z,K_z} (Z_i - \kappa_{z,K_z})_+ + \epsilon_i$$
no need for backfitting
computation very rapid
no identifiability issues
inference is simple
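Because both components share a single intercept and enter one design matrix, the bivariate additive model can be fitted with one penalized least-squares solve instead of backfitting. The sketch below uses simulated data, equally spaced knots, and fixed smoothing parameters `lam_x` and `lam_z`; all of these are illustrative assumptions.

```python
import numpy as np

def plus_basis(v, knots):
    """Linear plus functions (v - kappa_k)_+, one column per knot."""
    return np.maximum(v[:, None] - knots[None, :], 0.0)

# Simulated data (illustrative only).
rng = np.random.default_rng(2)
n = 300
x = rng.uniform(0.0, 1.0, n)
z = rng.uniform(0.0, 1.0, n)
y = np.sin(2 * np.pi * x) + (z - 0.5) ** 2 + rng.normal(scale=0.2, size=n)

kx = np.linspace(0.0, 1.0, 12)[1:-1]   # 10 knots for the x component
kz = np.linspace(0.0, 1.0, 12)[1:-1]   # 10 knots for the z component

# One design matrix: a single intercept, two linear terms, two sets of plus functions.
C = np.column_stack([np.ones(n), x, z, plus_basis(x, kx), plus_basis(z, kz)])

# Separate smoothing parameters for the two components; the intercept and
# linear terms are left unpenalized.
lam_x, lam_z = 0.1, 0.1
penalty = np.diag(np.r_[np.zeros(3),
                        lam_x * np.ones(len(kx)),
                        lam_z * np.ones(len(kz))])
coef = np.linalg.solve(C.T @ C + penalty, C.T @ y)
fitted = C @ coef
```

In the mixed-model formulation the two smoothing parameters would correspond to two variance components, estimated for example by REML rather than fixed in advance.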

Semi 37 Bayesian methods
The linear mixed model is half-Bayesian
The random effects have a prior
The parameters without a prior are:
  fixed effects: give them diffuse normal priors
  variance components: give them diffuse inverse gamma priors

Semi 38 Bayesian methods
Can be easily implemented in WinBUGS or programmed in, say, MATLAB
Allows Bayes rather than empirical Bayes inference
Uncertainty due to smoothing parameter selection is taken into account

Semi 39 The Bias-Variance Trade-off and Confidence Bands
[Figure: four panels of log ratio versus range, one for each of four values of lambda]

Semi 40 How does one adjust confidence intervals for bias?
undersmooth so variance dominates and bias can be safely ignored

Semi 41
[Figure: MSE, variance, and bias versus log(λ), with the optimal value marked]

Semi 42 Adjustment for bias, continued
estimate bias by a higher order method and subtract off bias (essentially the same as above)
Wahba/Nychka Bayesian intervals
  bias is random so adds to posterior variance
  interval is widened but there is no offset

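Semi 43 Wahba/Nychka Bayesian Intervals
$$y = X\beta + Zu + \varepsilon, \qquad \mathrm{Cov}\begin{bmatrix} u \\ \varepsilon \end{bmatrix} = \begin{bmatrix} \sigma_u^2 I & 0 \\ 0 & \sigma_\varepsilon^2 I \end{bmatrix}, \qquad C = ( X \ \ Z )$$
$\tilde\beta$ and $\tilde u$ are BLUPs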

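Semi 44
$$\mathrm{Cov}\left( \begin{bmatrix} \tilde\beta \\ \tilde u \end{bmatrix} \,\Big|\, u \right) = \sigma_\varepsilon^2 \Bigl( C^T C + \frac{\sigma_\varepsilon^2}{\sigma_u^2} D \Bigr)^{-1} C^T C \Bigl( C^T C + \frac{\sigma_\varepsilon^2}{\sigma_u^2} D \Bigr)^{-1}$$
(Frequentist variance. Ignores bias.)
$$\mathrm{Cov}\left( \begin{bmatrix} \tilde\beta \\ \tilde u - u \end{bmatrix} \right) = \sigma_\varepsilon^2 \Bigl( C^T C + \frac{\sigma_\varepsilon^2}{\sigma_u^2} D \Bigr)^{-1}$$
(Bayesian posterior variance. Takes bias into account.)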
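The two covariance formulas can be compared numerically. The sketch below assumes a linear truncated-line basis, takes $D = \mathrm{blockdiag}(0, I)$ (an assumption about the slide's $D$), and treats the variance components as known; the data and values are simulated for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, K = 100, 10
x = np.sort(rng.uniform(0.0, 1.0, n))
knots = np.linspace(0.0, 1.0, K + 2)[1:-1]
X = np.column_stack([np.ones(n), x])
Z = np.maximum(x[:, None] - knots[None, :], 0.0)
C = np.hstack([X, Z])
D = np.diag(np.r_[np.zeros(2), np.ones(K)])   # assumed form of D
sigma2_eps, sigma2_u = 0.25, 1.0               # assumed known here

A_inv = np.linalg.inv(C.T @ C + (sigma2_eps / sigma2_u) * D)
cov_freq = sigma2_eps * A_inv @ C.T @ C @ A_inv   # conditional on u, ignores bias
cov_bayes = sigma2_eps * A_inv                     # accounts for random bias

# Pointwise standard errors of the fitted curve: sqrt of diag(C Cov C^T).
se_freq = np.sqrt(np.einsum("ij,jk,ik->i", C, cov_freq, C))
se_bayes = np.sqrt(np.einsum("ij,jk,ik->i", C, cov_bayes, C))
```

The difference `cov_bayes - cov_freq` equals $\sigma_\varepsilon^2 \lambda A^{-1} D A^{-1}$ with $A = C^T C + \lambda D$ and $\lambda = \sigma_\varepsilon^2 / \sigma_u^2$, which is positive semidefinite. The Bayesian intervals are therefore never narrower than the frequentist ones, consistent with the interval being widened without an offset.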

Semi 45
[Figure: strontium ratio versus age (million years)]

Semi 46 Effect of measurement error
[Figure: y versus x plus error]
$W = X + \text{error}$ and $\mathrm{Var}(X) = \mathrm{Var}(\text{error})$
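The size of the distortion is easiest to see in the simpler linear case, where regressing on $W$ instead of $X$ multiplies the slope by $\mathrm{Var}(X) / \{\mathrm{Var}(X) + \mathrm{Var}(\text{error})\}$, which is 1/2 when the two variances are equal. The simulation below is a sketch of that linear special case with made-up parameter values; it is not the nonparametric example shown on the slide.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
x = rng.normal(size=n)                        # true covariate, Var(X) = 1
w = x + rng.normal(size=n)                    # observed covariate, Var(error) = 1
y = 2.0 * x + rng.normal(scale=0.5, size=n)   # hypothetical true slope of 2

slope_true = np.polyfit(x, y, 1)[0]    # regression on the true covariate
slope_naive = np.polyfit(w, y, 1)[0]   # regression on the error-prone covariate
```

Here `slope_true` is close to 2 while `slope_naive` is close to 1, half the true slope. In a nonparametric fit the same mechanism flattens and smears the features of the estimated curve.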

Semi 47 Correction for measurement error
Relatively little research in this area
Fan and Truong (1993): deconvolution kernels
  first work
  inefficient in finite-sample studies
  no inference
  strictly for 1-dimensional smoothing
Carroll, Maca, Ruppert: functional SIMEX methods and structural spline methods
  more efficient than Fan and Truong

Semi 48 Berry, Carroll, and Ruppert (JASA, 2002)
fully Bayesian
smoothing or penalized splines
rather efficient in finite-sample studies
inference available
scales up
semiparametric inference is easy
structural

Semi 49 Berry, Carroll, and Ruppert
starts with mixed-model spline formulation but fully Bayesian
conjugate priors
true covariates are iid normal, but surprisingly robust
normal measurement error
in Gibbs, only sampling of true (unknown) covariates requires a Hastings-Metropolis step

Semi 50 Effect of measurement error
[Figure: y versus x plus error]
$W = X + \text{error}$ and $\mathrm{Var}(X) = \mathrm{Var}(\text{error})$

Semi 51 Correction for measurement error
[Figure: Solid: true. Dotted: uncorrected. Dashed: corrected.]

Semi 52 Measurement Error, continued
Ganguli, Staudenmayer, Wand:
  EM maximum likelihood estimation in BCR model
  Works about as well as the fully Bayesian approach
  Extension to additive models

Semi 53 Generalized Regression
Extension to non-Gaussian responses is conceptually easy
Get a GLMM
However, GLMMs are not trivial
Can use:
  Monte Carlo EM
  or MCMC

Semi 54 Single-Index Models
$$Y_i = g(X_i^T \theta) + Z_i^T \beta + \epsilon_i$$
Yu and Ruppert (2002, JASA)
Let
$$g(x) = \gamma_0 + \gamma_1 x + \cdots + \gamma_p x^p + c_1 (x - \kappa_1)_+^p + \cdots + c_K (x - \kappa_K)_+^p$$
Becomes a nonlinear regression model
$$Y_i = m(X_i, Z_i, \theta, \beta, \gamma, c) + \epsilon_i$$
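The mean function $m(X_i, Z_i, \theta, \beta, \gamma, c)$ is straightforward to evaluate once the index $X_i^T \theta$ is formed, which is what makes the model an ordinary nonlinear regression. The sketch below only evaluates the mean function; the function name `single_index_mean` and all numerical values are hypothetical, and estimation (for example by penalized nonlinear least squares) is not shown.

```python
import numpy as np

def single_index_mean(X, Z, theta, beta, gamma, c, knots):
    """Evaluate g(X theta) + Z beta with g a degree-p penalized-spline form.

    Assumes p = len(gamma) - 1 >= 1.
    """
    p = len(gamma) - 1
    index = X @ theta                                           # single index X_i^T theta
    poly = np.vander(index, p + 1, increasing=True) @ gamma     # gamma_0 + ... + gamma_p * index^p
    plus = np.maximum(index[:, None] - knots[None, :], 0.0) ** p
    return poly + plus @ c + Z @ beta

# Small worked example with hypothetical values.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Z = np.array([[1.0], [2.0], [3.0]])
theta = np.array([0.6, 0.8])     # unit length, a common identifiability constraint
beta = np.array([0.5])
gamma = np.array([1.0, 2.0])     # p = 1
c = np.array([3.0])
knots = np.array([1.0])
mean = single_index_mean(X, Z, theta, beta, gamma, c, knots)
```

The model is nonlinear only through $\theta$: for fixed $\theta$ it reduces to the penalized linear spline fit of the earlier slides, which suggests profiling or alternating schemes as one possible way to organize the computation.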