University of Iceland
School of Engineering and Sciences
Department of Industrial Engineering, Mechanical Engineering and Computer Science

IÐN106F Industrial Statistics II - Bayesian Data Analysis
Fall 2009
Take-home Exam

This is an open-book exam.

Assigned: Friday November 27th 2009 at 16:00.
Due: Monday November 30th 2009 before 10:00.

There are five problems in total, each weighing 20%. Within a problem, each scenario weighs the same. Note that the number of scenarios differs between problems. It is assumed that students have access to Matlab, R or S+, but other computer programs can of course be used. Programs written for Matlab, R, S+ or other software should be listed in the appendix.
1. (20%) The number of live births in Iceland from 1991 to 2008 can be found in the file birth_vs_total.txt (column 2), along with the year (column 1) and the total number of Icelanders in thousands ($\times 10^3$) (column 3). Denote the number of births by $y_t$, $t = 1, \ldots, T$, $T = 18$, where $t = 1$ denotes the year 1991 and $t = 18$ denotes the year 2008. The total number of Icelanders is denoted by $x_t$, $t = 1, \ldots, T$. Assume a Poisson model for live births in Iceland of the form

$$y_t \sim \text{Poisson}(x_t \theta), \quad t = 1, \ldots, T,$$

where $\theta$ is an unknown parameter representing the rate of birth per thousand Icelanders. Let $p(\theta)$ denote a noninformative prior for $\theta$. Here a gamma distribution is assumed for $\theta$, that is,

$$p(\theta) = \text{Gamma}(\theta \mid \alpha = 0.001, \beta = 0.001).$$

(a) Evaluate the posterior distribution of $\theta$.

(b) Evaluate the adequacy of the model by computing a Bayesian p-value for the following discrepancy measure

$$T(y, \theta) = T(y) = \frac{1}{T} \sum_{t=1}^{T} (y_t - x_t \hat{\theta})^2, \quad \text{where} \quad \hat{\theta} = \frac{\sum_{t=1}^{T} y_t}{\sum_{t=1}^{T} x_t}.$$

Is the proposed Poisson model adequate? If not, discuss what extensions could be made.
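The mechanics of Problem 1 can be sketched in Python. This is a minimal illustration, not a reference solution: it assumes the standard conjugate result that a Gamma($\alpha$, $\beta$) prior with a Poisson($x_t \theta$) likelihood gives a Gamma($\alpha + \sum y_t$, $\beta + \sum x_t$) posterior, and it estimates the Bayesian p-value as the share of replicated data sets whose discrepancy is at least the observed one. The function names and the simple Poisson sampler are hypothetical helpers, and the sampler is only practical for small means, so the real birth counts would call for a library sampler.

```python
import random


def gamma_poisson_posterior(y, x, alpha=0.001, beta=0.001):
    """Return (shape, rate) of the Gamma posterior for theta."""
    return alpha + sum(y), beta + sum(x)


def discrepancy(y, x):
    """T(y) = (1/T) * sum_t (y_t - x_t * theta_hat)^2, theta_hat = sum(y)/sum(x)."""
    theta_hat = sum(y) / sum(x)
    return sum((yt - xt * theta_hat) ** 2 for yt, xt in zip(y, x)) / len(y)


def poisson_draw(rng, mean):
    """Count arrivals of a rate-`mean` Poisson process in [0, 1].

    Exact but O(mean) per draw, so intended for small means only.
    """
    count = 0
    total = rng.expovariate(mean)
    while total <= 1.0:
        count += 1
        total += rng.expovariate(mean)
    return count


def bayesian_p_value(y, x, n_rep=2000, seed=1):
    """Estimate Pr(T(y_rep) >= T(y) | y) by posterior predictive simulation."""
    rng = random.Random(seed)
    shape, rate = gamma_poisson_posterior(y, x)
    t_obs = discrepancy(y, x)
    exceed = 0
    for _ in range(n_rep):
        # random.gammavariate takes (shape, scale), so scale = 1 / rate.
        theta = rng.gammavariate(shape, 1.0 / rate)
        y_rep = [poisson_draw(rng, xt * theta) for xt in x]
        if discrepancy(y_rep, x) >= t_obs:
            exceed += 1
    return exceed / n_rep
```

With made-up toy inputs such as `x = [2.0, 2.5, 3.0, 3.5]` and `y = [9, 11, 16, 14]`, `bayesian_p_value(y, x)` returns a value between 0 and 1; values close to 0 or 1 would suggest that the Poisson model does not reproduce the spread seen in the data.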
2. (20%) The data set in the file precip_rvk_1_2.txt contains measurements of the annual maximum daily precipitation (mm per 24 hours, from 9:00 AM to 9:00 AM the next day) recorded at two nearby stations in Reykjavík over the years 1926 to 1985.

(a) Plot a normal probability plot of the logarithm of the data for each station and evaluate the lognormal assumption for the data.

(b) Assume the data from each station follow a lognormal distribution, that is, the logarithm of the data follows a normal distribution with mean $\mu_i$ and variance $\sigma_i^2$, $i = 1, 2$. Use the noninformative prior $p(\mu_i, \sigma_i^2) \propto \sigma_i^{-2}$, $i = 1, 2$, and draw a sample of size $L = 10^4$ after burn-in within each of four chains from the joint posterior distribution of $\mu_i$ and $\sigma_i^2$, $i = 1, 2$, by using the Gibbs sampler. Based on this sample compute the posterior mean, standard deviation, and the 2.5%, 25%, 50%, 75% and 97.5% percentiles for $\mu_i$ and $\sigma_i$, $i = 1, 2$.

(c) Assume the data from both stations follow the same lognormal distribution with mean $\mu_0$ and variance $\sigma_0^2$. Use the noninformative prior $p(\mu_0, \sigma_0^2) \propto \sigma_0^{-2}$, and draw a sample of size $L = 10^4$ after burn-in within each of four chains from the joint posterior distribution of $\mu_0$ and $\sigma_0^2$ by using the Gibbs sampler. Based on this sample compute the posterior mean, standard deviation, and the 2.5%, 25%, 50%, 75% and 97.5% percentiles for $\mu_0$ and $\sigma_0$.

(d) Assume the data from each station follow a lognormal distribution with mean $\mu_i$, $i = 1, 2$, and a common variance $\sigma_0^2$. Use the noninformative prior $p(\mu_1, \mu_2, \sigma_0^2) \propto \sigma_0^{-2}$ and draw a sample of size $L = 10^4$ after burn-in within each of four chains from the joint posterior distribution of $\mu_i$, $i = 1, 2$, and $\sigma_0^2$ by using the Gibbs sampler. Based on this sample compute the posterior mean, standard deviation, and the 2.5%, 25%, 50%, 75% and 97.5% percentiles for $\mu_i$, $i = 1, 2$, and $\sigma_0$.

(e) Compute DIC for the three models introduced in (b), (c) and (d). Which of these three models should be preferred according to DIC?
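A single-chain Gibbs sampler for the one-station normal model on the log scale can be sketched as follows. The sketch assumes the prior is proportional to $\sigma^{-2}$, in which case the full conditionals are $\mu \mid \sigma^2, y \sim N(\bar{y}, \sigma^2/n)$ and $\sigma^2 \mid \mu, y \sim \text{Inv-Gamma}(n/2, \tfrac{1}{2}\sum_i (y_i - \mu)^2)$. The function names, the burn-in length and the percentile rule are illustrative choices; the exam asks for four chains, which would mean calling the sampler four times with different seeds and starting values.

```python
import math
import random


def gibbs_normal(log_y, n_draws=10_000, burn_in=1_000, seed=0):
    """Gibbs sampler for (mu, sigma) under the prior p(mu, sigma^2) ∝ 1/sigma^2."""
    rng = random.Random(seed)
    n = len(log_y)
    y_bar = sum(log_y) / n
    mu = y_bar  # starting value (an arbitrary but reasonable choice)
    mu_draws, sigma_draws = [], []
    for it in range(burn_in + n_draws):
        # sigma^2 | mu, y ~ Inv-Gamma(n/2, S/2) with S = sum (y_i - mu)^2.
        # gammavariate takes (shape, scale); scale 2/S corresponds to rate S/2.
        s = sum((v - mu) ** 2 for v in log_y)
        sigma2 = 1.0 / rng.gammavariate(n / 2.0, 2.0 / s)
        # mu | sigma^2, y ~ N(y_bar, sigma^2 / n)
        mu = rng.gauss(y_bar, math.sqrt(sigma2 / n))
        if it >= burn_in:
            mu_draws.append(mu)
            sigma_draws.append(math.sqrt(sigma2))
    return mu_draws, sigma_draws


def summarize(draws):
    """Posterior mean, standard deviation and selected percentiles of the draws."""
    d = sorted(draws)
    n = len(d)
    mean = sum(d) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in d) / (n - 1))
    pct = {p: d[min(n - 1, int(p / 100 * n))] for p in (2.5, 25, 50, 75, 97.5)}
    return mean, sd, pct
```

Calling `summarize` on each returned list gives the quantities requested in (b) and (c); the model in (d) needs one extra conditional draw because the two means share a single variance.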
3. (20%) The median values of owner-occupied homes were collected for each of 506 areas of Boston along with thirteen other variables for the purpose of predicting housing values in other areas of Boston. The dependent variable is

$y_i$ = log(the median value of owner-occupied homes in area $i$ (in $1000)).

The explanatory variables are

$x_{i,2}$ = per capita crime rate by town in area $i$
$x_{i,3}$ = percentage of residential land zoned for lots over 25,000 sq.ft. in area $i$
$x_{i,4}$ = percentage of non-retail business acres per town in area $i$
$x_{i,5}$ = Charles River dummy variable in area $i$ (= 1 if tract bounds river; 0 otherwise)
$x_{i,6}$ = nitric oxides concentration in area $i$ (parts per 10 million)
$x_{i,7}$ = average number of rooms per dwelling in area $i$
$x_{i,8}$ = percentage of owner-occupied units built prior to 1940 in area $i$
$x_{i,9}$ = weighted distances to five Boston employment centres in area $i$
$x_{i,10}$ = index of accessibility to radial highways in area $i$
$x_{i,11}$ = full-value property-tax rate per $10,000 in area $i$
$x_{i,12}$ = pupil-teacher ratio by town in area $i$
$x_{i,13}$ = $1000(B - 0.63)^2$ where $B$ is the proportion of blacks by town in area $i$
$x_{i,14}$ = percentage of the population with lower status in area $i$
$x_{i,15}$ = $(x_{i,3} - \bar{x}_3)^2$
$x_{i,16}$ = $(x_{i,14} - \bar{x}_{14})^2$

where $\bar{x}_3$ and $\bar{x}_{14}$ are the sample means of $x_3$ and $x_{14}$, respectively.
The file boston_housing.data contains this data set with columns $(x_2, x_3, \ldots, x_{14}, z)$. Note that for this analysis the logarithm of $z$ is needed ($y = \log(z)$) and the variables $x_{15}$ and $x_{16}$ need to be created from $x_3$ and $x_{14}$. The following linear model is proposed

$$E(y_i \mid \beta, \sigma^2, X) = \sum_{j=1}^{16} \beta_j x_{ij}, \quad \text{for } i = 1, \ldots, 506,$$

with $x_{i1} = 1$ for all $i$. We further assume that the $y$'s are independent, normally distributed and have equal variance, that is, $\text{var}(y_i \mid \beta, \sigma^2, X) = \sigma^2$ for all $i$.

(a) Plot $y$ versus $x_2, x_3, \ldots, x_{16}$, a total of 15 figures. Which explanatory variables show a clear relationship with $y$, and which don't?

(b) Create a table with the 15 best models according to DIC, where the table has columns showing DIC, the total number of parameters, and which explanatory variables are in the model. Based on the table and your knowledge of the problem, select one model for these data. This model will be used below. Use the Matlab routine dic_normal_models.m to compute all possible models; see the course's web page. To support your decision, compute point estimates (posterior mean) and 95% marginal posterior intervals for the parameters $\beta$ and $\sigma$ in the full model, that is, the model which uses all the variables. Use $L = 10000$.

(c) Draw a normal probability plot of the standardized residuals. Do the standardized residuals appear to follow a normal distribution? Plot the standardized residuals versus the predicted $y$, that is $X\hat{\beta}$, and also versus each of the fifteen explanatory variables. Does the variance appear to be constant when plotted against these variables?

(d) Compute point estimates (posterior mean) and 95% marginal posterior intervals for the parameters in the final model selected in (b), that is, $\beta$ and $\sigma$. Use $L = 10000$.
(e) Interpret the parameters in the model, that is, explain the effect of each explanatory variable on the median value of owner-occupied homes. Take into account the log-transformation (what is the expected value of $z = \exp(y)$ and how does it change when $x_i$ is increased by one unit?).

(f) Compute a prediction and a 95% posterior predictive interval for an area in Boston that has the following explanatory variables: $x_2 = 4.3$, $x_3 = 41$, $x_4 = 8.4$, $x_5 = 0$, $x_6 = 0.71$, $x_7 = 6.7$, $x_8 = 39$, $x_9 = 2.1$, $x_{10} = 7$, $x_{11} = 383$, $x_{12} = 17.8$, $x_{13} = 350$, $x_{14} = 15$. Take into account the log-transformation.
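The role of the log-transformation in (e) and (f) can be illustrated with a short Python sketch. It assumes that posterior draws of $\beta$ and $\sigma$ are already available (the argument names `beta_draws` and `sigma_draws` are hypothetical) and that `x_new` starts with 1 for the intercept and includes any constructed quadratic terms. For each draw it simulates $y_{new} \sim N(x_{new}\beta, \sigma^2)$ and exponentiates, so the interval is stated on the original scale of $z$. The helper `percent_change` uses the fact that, given $\beta$ and $\sigma$, $E(z \mid x) = \exp(x\beta + \sigma^2/2)$, so a one-unit increase in a variable that enters only linearly multiplies $E(z)$ by $\exp(\beta_j)$.

```python
import math
import random


def predict_z(beta_draws, sigma_draws, x_new, seed=0):
    """Return (median, 2.5%, 97.5%) of posterior predictive draws of z = exp(y_new)."""
    rng = random.Random(seed)
    z = []
    for beta, sigma in zip(beta_draws, sigma_draws):
        mean_y = sum(b * x for b, x in zip(beta, x_new))
        # Draw y_new on the log scale, then transform back to the z scale.
        z.append(math.exp(rng.gauss(mean_y, sigma)))
    z.sort()
    n = len(z)

    def quantile(p):
        return z[min(n - 1, int(p * n))]

    return quantile(0.5), quantile(0.025), quantile(0.975)


def percent_change(beta_j):
    """Percent change in E(z) per unit increase of a purely linear term.

    Not valid as-is for x_3 and x_14, which also enter through x_15 and x_16.
    """
    return 100.0 * (math.exp(beta_j) - 1.0)
```

The median is used here as the point prediction; the posterior predictive mean of $z$ is an alternative and is larger because of the $\sigma^2/2$ term.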
4. (20%) The sum of exponentially distributed random variables follows the Erlang distribution. Given that $v_j$ follows an exponential distribution with mean $1/\theta$, $\theta > 0$, $j = 1, \ldots, r$, then

$$y = \sum_{j=1}^{r} v_j$$

follows an Erlang distribution with parameters $r$ and $\theta$. The parameter $r$ is an integer which is usually known, while $\theta$ is usually unknown. The Erlang distribution is a special case of the gamma distribution with $\alpha = r$ and $\beta = \theta$. The density of a random variable $y$ that follows the Erlang distribution is given by

$$p(y \mid r, \theta) = \frac{\theta^r y^{r-1} e^{-\theta y}}{(r-1)!}, \quad y > 0,$$

and the mean and variance are $E(y) = r/\theta$ and $\text{var}(y) = r/\theta^2$, respectively.

(a) Assume $n$ independent observations, $y_i$, $i = 1, \ldots, n$, follow an Erlang distribution with $r = 20$ and some unknown $\theta$. Assume that the prior distribution is a gamma distribution with parameters $\alpha = 0.001$ and $\beta = 0.001$, that is,

$$p(\theta) = \text{Gamma}(\theta \mid \alpha = 0.001, \beta = 0.001).$$

Find the general form of the posterior distribution of $\theta$.

(b) Evaluate the numerical values of the parameters of the posterior distribution using the data set in erlang.dat, which contains $n = 100$ observed values.

(c) Assume that it is not known whether $r$ is equal to 18, 19, 20, 21 or 22. Further, assume that

$$\Pr(r = s) = \frac{1}{5}, \quad s = 18, 19, 20, 21, 22.$$

Compute the probability

$$\Pr(r = s \mid y),$$
for $s = 18, 19, 20, 21, 22$, where $y$ is a vector containing all the data. Which of these five values is most likely? (Hint: $\Pr(r = s \mid y)$ can be evaluated analytically via integration of a gamma density. When computing $\Pr(r = s \mid y)$, it is better to take the logarithm first, compute its value and then exponentiate.)
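The hint about working on the log scale can be sketched in Python. The sketch assumes that integrating $\theta$ out of the Erlang likelihood times the Gamma($\alpha$, $\beta$) prior gives, for a fixed $r$, the log marginal likelihood $(r-1)\sum_i \log y_i - n \log (r-1)! + \alpha \log \beta - \log \Gamma(\alpha) + \log \Gamma(\alpha + nr) - (\alpha + nr)\log(\beta + \sum_i y_i)$, and that the posterior of $\theta$ for known $r$ is Gamma($\alpha + nr$, $\beta + \sum_i y_i$). The function names are hypothetical. Because the prior on $r$ is uniform, the posterior probabilities are the normalized marginal likelihoods; subtracting the maximum before exponentiating avoids numerical underflow.

```python
import math


def theta_posterior(y, r, alpha=0.001, beta=0.001):
    """Return (shape, rate) of the Gamma posterior of theta for a known r."""
    return alpha + len(y) * r, beta + sum(y)


def log_marginal(y, r, alpha=0.001, beta=0.001):
    """log p(y | r) with theta integrated out against the Gamma prior."""
    n = len(y)
    shape, rate = theta_posterior(y, r, alpha, beta)
    return ((r - 1) * sum(math.log(v) for v in y)
            - n * math.lgamma(r)              # lgamma(r) = log((r - 1)!)
            + alpha * math.log(beta) - math.lgamma(alpha)
            + math.lgamma(shape) - shape * math.log(rate))


def posterior_r(y, candidates=(18, 19, 20, 21, 22)):
    """Pr(r = s | y) under a uniform prior over the candidate values."""
    logs = [log_marginal(y, s) for s in candidates]
    top = max(logs)
    weights = [math.exp(v - top) for v in logs]  # stable: largest weight is 1
    total = sum(weights)
    return {s: w / total for s, w in zip(candidates, weights)}
```

Passing the 100 values from erlang.dat as `y` to `posterior_r` would give the five probabilities asked for in (c), and `theta_posterior(y, 20)` the parameters asked for in (b).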
5. (20%) The binomial distribution is usually parameterized with $n$ and $\theta$, and the sampling distribution of $y$ is given by

$$p(y \mid n, \theta) = \text{Bin}(y \mid n, \theta) = \binom{n}{y} \theta^y (1 - \theta)^{n-y}, \quad y \in \{0, 1, \ldots, n\}.$$

The binomial distribution can also be parameterized with $\kappa = \log(\theta) - \log(1 - \theta)$, and $\theta$ and $(1 - \theta)$ as functions of $\kappa$ are

$$\theta = \frac{e^{\kappa}}{1 + e^{\kappa}}, \quad \text{and} \quad 1 - \theta = \frac{1}{1 + e^{\kappa}}.$$

In that case the sampling distribution of $y$ is given by

$$p(y \mid n, \kappa) = \text{Bin}(y \mid n, \kappa) = \binom{n}{y} \left(\frac{e^{\kappa}}{1 + e^{\kappa}}\right)^{y} \left(\frac{1}{1 + e^{\kappa}}\right)^{n-y} = \binom{n}{y} \frac{e^{\kappa y}}{(1 + e^{\kappa})^{n}}.$$

If the uniform distribution is assumed as a prior distribution for $\theta$, then the posterior distribution of $\theta$ is a beta distribution with $\alpha = y + 1$ and $\beta = n - y + 1$. If $\theta$ is transformed to $\kappa$, then the posterior density of $\kappa$ is given by

$$p(\kappa \mid y) = \frac{(n + 1)!}{y!\,(n - y)!}\, e^{\kappa(y+1)} (1 + e^{\kappa})^{-(n+2)}.$$

(a) Find the normal approximation to the posterior density of $\kappa$.

(b) Draw the exact posterior density of $\kappa$ and the approximated normal posterior density of $\kappa$ on the same graph when $n = 30$ and $y = 21$.
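One way to set up the comparison in Problem 5 is sketched below in Python. It assumes the normal approximation is centred at the posterior mode with variance equal to the inverse of the negative second derivative of $\log p(\kappa \mid y)$ at the mode. Setting the derivative $(y+1) - (n+2)e^{\kappa}/(1+e^{\kappa})$ to zero gives $\hat{\kappa} = \log\{(y+1)/(n-y+1)\}$, and the curvature gives the variance $(n+2)/\{(y+1)(n-y+1)\}$. The function names are hypothetical, and plotting is left out so the sketch needs only the standard library.

```python
import math


def normal_approx(n, y):
    """Mode and standard deviation of the normal approximation for kappa."""
    mode = math.log((y + 1) / (n - y + 1))
    var = (n + 2) / ((y + 1) * (n - y + 1))
    return mode, math.sqrt(var)


def exact_density(kappa, n, y):
    """Exact posterior density p(kappa | y), evaluated on the log scale first."""
    log_c = math.lgamma(n + 2) - math.lgamma(y + 1) - math.lgamma(n - y + 1)
    log_p = log_c + kappa * (y + 1) - (n + 2) * math.log1p(math.exp(kappa))
    return math.exp(log_p)


def normal_density(kappa, mean, sd):
    """Density of N(mean, sd^2) at kappa."""
    z = (kappa - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


if __name__ == "__main__":
    mode, sd = normal_approx(30, 21)
    for k in (-0.5, 0.0, 0.5, 1.0, 1.5, 2.0):
        print(k, exact_density(k, 30, 21), normal_density(k, mode, sd))
```

Evaluating both density functions on a fine grid of $\kappa$ values and passing the results to any plotting tool gives the graph requested in (b).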