FINANCIAL ECONOMETRICS AND EMPIRICAL FINANCE MODULE 2


MSc. Finance/CLEFIN 2017/2018 Edition
FINANCIAL ECONOMETRICS AND EMPIRICAL FINANCE - MODULE 2
Midterm Exam Solutions, June 2018
Time Allowed: 1 hour and 15 minutes

Please answer all the questions by writing your answers in the spaces provided. There are two optional questions (4 and 5). No additional papers will be collected and therefore they will not be marked. You always need to carefully justify your answers and show your work. The exam is closed book, closed notes. No calculators are useful or permitted. You can withdraw until 10 minutes before the due time.

Question 1.A (10 points)
Describe in detail, also with reference to the examples that have been provided in the lectures, the four stylized facts that have motivated the development of models of conditional heteroskedasticity. With reference to two independent Gaussian GARCH(1,1) models, one estimated for US stock returns and the other for US Treasury note returns, discuss whether and how these two models could fully account for the stylized facts that you have listed and discussed.

As for the Gaussian GARCH(1,1) case, as commented in the slides and lecture notes, because it implies volatility clustering (by construction, as past squared shocks explain the current conditional variance), this model automatically inflates the tails of the implied unconditional distribution of returns. Moreover, because shocks enter raised to the power of 2, so that small shocks are pushed towards zero while shocks larger than 1 are magnified, a GARCH(1,1) model is also likely to inflate the probability mass just around the (conditional) mean of the data, because of spells of time characterized by small shocks, small variance, small returns, small shocks, and so on. When both the extreme tails and the center of the distribution are inflated vs. (say) a Gaussian benchmark, we speak of leptokurtosis, i.e., the first stylized fact. However, by construction a GARCH(1,1) model is symmetric and cannot capture the third stylized fact. Moreover, without further extensions, a GARCH model cannot even pick up time variation in higher-order moments beyond the second. Finally, two independent GARCH processes, precisely because they are independent, have no hope of capturing co-movements in volatility, either separately from time variation in covariances or jointly with such covariances.
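To make the mechanism concrete, here is a minimal Python sketch (not part of the original solutions; the parameter values are made up for illustration) that simulates a Gaussian GARCH(1,1) with persistence 0.98 and checks that the simulated returns are leptokurtic and that their squares are autocorrelated, i.e., that volatility clusters:

    import numpy as np

    rng = np.random.default_rng(0)
    T = 100_000
    omega, alpha, beta = 0.05, 0.08, 0.90        # illustrative values, alpha + beta = 0.98
    r = np.empty(T)
    sig2 = omega / (1.0 - alpha - beta)          # start from the unconditional variance
    for t in range(T):
        eps = np.sqrt(sig2) * rng.standard_normal()
        r[t] = eps                               # zero conditional mean for simplicity
        sig2 = omega + alpha * eps**2 + beta * sig2

    z = (r - r.mean()) / r.std()
    print("kurtosis:", np.mean(z**4))            # > 3: fatter tails than the Gaussian
    print("acf(1) of r^2:", np.corrcoef(r[:-1]**2, r[1:]**2)[0, 1])  # > 0: clustering

With these values the theoretical unconditional kurtosis is 3(1 - 0.98^2)/(1 - 0.98^2 - 2(0.08)^2) ≈ 4.4, even though each conditional distribution is exactly normal.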

Question 1.B (3 points)
Mlado Vizov, an analyst at Peeled & Head Ass., has just made a simple mathematical observation concerning a W-period rolling window variance estimator, σ²_{W,t}, namely that

σ²_{W,t} = (1/W) Σ_{i=1}^{W} ε²_{t-i} = (1/W) ε²_{t-1} + ((W-1)/W) (1/(W-1)) Σ_{i=2}^{W} ε²_{t-i} = (1/W) ε²_{t-1} + ((W-1)/W) σ²_{W-1,t-1}.

Therefore he claims that a rolling window variance estimator is just a special case of a RiskMetrics model, under the restriction that the two terms on the right-hand side are multiplied by the coefficients 1/W and (W-1)/W, which sum to one just as (1-λ) and λ do. Do you agree with his claim? Carefully explain your reasoning.

As you know, a RiskMetrics model is simply written as

σ²_t = (1-λ) ε²_{t-1} + λ σ²_{t-1}.

However, note that the variance processes on the left- and right-hand sides of the RiskMetrics recursion are one and the same: on the right we just have one lag of the process on the left. In the case pointed out by Mlado, the process on the left, a W-observation rolling window variance estimator σ²_{W,t}, is instead structurally different from the process on the right, σ²_{W-1,t-1}, a (W-1)-observation rolling window variance estimator, which simply uses less data. Therefore we can say that there is no restriction on RiskMetrics that can take us to a rolling window variance process, and as a result Mlado is wrong.
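A minimal numerical sketch (not in the original solutions; the shocks are simulated placeholders) of why the identity holds and yet does not make the rolling window a restricted RiskMetrics scheme:

    import numpy as np

    rng = np.random.default_rng(1)
    eps = rng.standard_normal(500)            # placeholder shock series
    W, t = 20, 300

    rw_W   = np.mean(eps[t - W:t] ** 2)       # sigma^2_{W,t}: average of the last W squared shocks
    rw_Wm1 = np.mean(eps[t - W:t - 1] ** 2)   # sigma^2_{W-1,t-1}: the previous W-1 squared shocks

    # Mlado's identity does hold ...
    assert np.isclose(rw_W, eps[t - 1] ** 2 / W + (W - 1) / W * rw_Wm1)

    # ... but it links two structurally different processes (window W vs. window W-1),
    # whereas RiskMetrics relates one and the same process to its own lag:
    lam, sig2 = 0.94, eps[:100].var()
    for s in range(100, t):
        sig2 = (1 - lam) * eps[s] ** 2 + lam * sig2
    print(rw_W, sig2)   # no choice of lambda turns the EWMA recursion into the rolling window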

Question 1.C (4 points)
Ms. Nanny Ogombo is due to give a presentation on the process followed by the conditional variance of the log-price returns on the 3-month futures on Natural Gas, traded on the NYMEX. The audience is composed of two types of customers: option pricers and risk managers. In the hurry to leave her home, Nanny has forgotten her most recent GARCH(1,1) estimates of the parameters of the process. However, Nanny remembers for a fact that, for the process she had estimated, α + β = 0.98. Therefore she decides to roughly allocate the estimated persistence to α and β as follows: α = 0.1 and β = 0.88, reasoning that this will not count for very much, as only overall persistence matters. Knowing that in fact the true values of the estimates of α and β had been instead 0.04 and 0.94, do you agree with her decision? In particular, will selecting α = 0.1 > 0.04 and β = 0.88 < 0.94 distort, and in what ways, the volatility forecasts and decisions taken by option pricers vs. risk managers? Make sure to carefully justify your answer.

Nanny is making a mistake because we know that, for a given persistence index, the exact values taken by α and β make a strong difference for the practical applications of GARCH models (see also the slides from the lectures). Note first that the unconditional variance is determined by the implied persistence alone, which is the same however a given estimated persistence is allocated between α and β, so the long-run variance level is not distorted. If one underestimates α and overestimates β, one obtains an implied conditional variance process that is excessively smooth, with less spiky dynamics and a typical level of predicted variance that is higher than it should be; this would make a risk manager overcautious, setting aside amounts of cushion or regulatory capital that are most of the time excessive, while an option pricer would face a process for which a simpler, constant variance model such as Black-Scholes is perceived to be generally not so bad and possibly to lead to acceptable errors, in spite of the time-varying variance implied by the process. Nanny's choice goes in the opposite direction: by overestimating α (0.1 instead of 0.04) and underestimating β (0.88 instead of 0.94), she obtains a process that is excessively spiky, over-reacting to the most recent shocks, and whose typical (median) level of predicted variance in calm spells is lower than it should be. A risk manager would then tend to hold cushions or regulatory capital that are insufficient most of the time and would be whipsawed right after large shocks, while an option pricer would perceive volatility as far more time-varying than it actually is, so that a constant variance model such as Black-Scholes would look less adequate than warranted. In practice, such a selection may then be punished by the subsequent behavior of the underlying security price.
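The effect of re-allocating a fixed persistence between α and β can be checked numerically. The following Python sketch (not in the original solutions; ω is set arbitrarily so that both parameterizations share an unconditional variance of 1) filters the same simulated shocks through the true recursion and through Nanny's:

    import numpy as np

    rng = np.random.default_rng(2)
    T, omega = 200_000, 0.02                 # omega = 1 - 0.98 gives unconditional variance 1
    a_true, b_true   = 0.04, 0.94            # true allocation of the 0.98 persistence
    a_nanny, b_nanny = 0.10, 0.88            # Nanny's allocation of the same persistence

    s2_t, s2_n = 1.0, 1.0
    path_t, path_n = np.empty(T), np.empty(T)
    for t in range(T):
        path_t[t], path_n[t] = s2_t, s2_n
        e = np.sqrt(s2_t) * rng.standard_normal()         # data come from the *true* model
        s2_t = omega + a_true * e**2 + b_true * s2_t
        s2_n = omega + a_nanny * e**2 + b_nanny * s2_n    # Nanny filters the same shocks

    for name, p in (("true ", path_t), ("Nanny", path_n)):
        print(name, "mean %.3f  median %.3f  std %.3f" % (p.mean(), np.median(p), p.std()))

Both variance paths share the same mean, but Nanny's exhibits a visibly larger standard deviation (spikier dynamics) and a lower median (a lower typical level in calm spells).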

Question 2.A (11 points)
Discuss why and how you would proceed to specify an asymmetric GARCH model in which the standardized shocks are drawn from a Student-t distribution. What types of potential asymmetric GARCH models would you consider, and what are their pros and cons? How would you proceed to test that, given some selection of a framework, the resulting estimated model for the conditional mean and conditional variance, say

y_t = μ + σ_t z_t,   z_t ~ IID t(0, ν),

is correctly specified?

As discussed in general terms, the correct specification would be tested by checking that the estimated standardized residuals, ẑ_t = (y_t - μ̂)/σ̂_t, are in fact IID t(0, ν). I have copied the corresponding slide just below. [Slides not reproduced in the transcription.]
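In the same spirit, here is a minimal Python sketch of two of the usual diagnostic checks (not in the original solutions; the residual series is a simulated placeholder standing in for the ẑ_t of a fitted model):

    import numpy as np
    from scipy import stats

    def ljung_box(x, lags=10):
        # Ljung-Box Q statistic and chi-square p-value for "no autocorrelation up to lags"
        T, xc = len(x), x - x.mean()
        acf = np.array([np.sum(xc[k:] * xc[:-k]) for k in range(1, lags + 1)]) / np.sum(xc**2)
        q = T * (T + 2) * np.sum(acf**2 / (T - np.arange(1, lags + 1)))
        return q, stats.chi2.sf(q, lags)

    rng = np.random.default_rng(3)
    z_hat = rng.standard_t(df=8, size=2000)   # placeholder for (y_t - mu_hat) / sigma_hat_t

    print("LB on z    (leftover mean dependence): Q=%.1f, p=%.3f" % ljung_box(z_hat))
    print("LB on z^2  (leftover ARCH effects):    Q=%.1f, p=%.3f" % ljung_box(z_hat**2))

A complete battery would also compare the empirical density of ẑ_t with the fitted t(0, ν) benchmark (kernel density and quantile-quantile plots), as done in Question 3.C below.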


Question 2.B (3 points)
Giorgio Gutierro, an analyst at Black Eagler Inc., has just estimated two Student-t GARCH-type models on de-meaned monthly Italian excess equity and 10-year Treasury returns (denoted x_{1,t} and x_{2,t}, respectively), obtaining (p-values are in parentheses):

x_{1,t} = exp{ -0.134 + 0.164 [ |z_{1,t-1}| - E|z_{1,t-1}| ] - 0.114 z_{1,t-1} + 0.905 ln σ²_{1,t-1} }^{1/2} z_{1,t},   z_{1,t} ~ IID t(0, 11.23)

x_{2,t} = σ_{2,t} z_{2,t},   σ²_{2,t} = 0.047 + 0.073 x²_{2,t-1} - 0.044 I_{x_{2,t-1}<0} x²_{2,t-1} + 0.872 σ²_{2,t-1},   z_{2,t} ~ IID t(0, 13.08),

where the indicator variable I_{x_{2,t-1}<0} has the same meaning seen in the lectures. Giorgio then reports that both Italian stock and bond excess returns contain evidence of leverage effects, i.e., that negative news increase predicted variance more than positive (non-negative) news do. Do you trust Giorgio's conclusions? Make sure to clearly justify your answer.

You will need to recognize that the model estimated for excess stock returns is an EGARCH(1,1). When z_{1,t-1} < 0 (so that |z_{1,t-1}| = -z_{1,t-1}), the news impact term under the square root, 0.164[|z_{1,t-1}| - E|z_{1,t-1}|] - 0.114 z_{1,t-1}, equals 0.278|z_{1,t-1}| - 0.164 E|z_{1,t-1}|, while when z_{1,t-1} ≥ 0 the same term becomes 0.050 z_{1,t-1} - 0.164 E|z_{1,t-1}|. Because the former exceeds the latter for shocks of the same absolute size, Giorgio is correct: when shocks are negative, volatility increases more than when shocks are non-negative. In the case of bond returns, we are facing a threshold GARCH(1,1) model: when shocks are negative, the model is

σ²_{2,t} = 0.047 + (0.073 - 0.044) x²_{2,t-1} + 0.872 σ²_{2,t-1} = 0.047 + 0.029 x²_{2,t-1} + 0.872 σ²_{2,t-1},

while when they are non-negative, the model becomes

σ²_{2,t} = 0.047 + 0.073 x²_{2,t-1} + 0.872 σ²_{2,t-1},

and visibly news increase variance more in the latter case than in the former: clearly, this is the opposite of leverage effects. Therefore, on the second account, Giorgio is incorrect.
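The two news impact terms can be compared with a few lines of Python (not in the original solutions; E|z| is evaluated under a standard normal for simplicity, whereas the fitted t(0, 11.23) would give a slightly different constant that does not affect the comparison):

    import numpy as np

    E_ABS_Z = np.sqrt(2 / np.pi)           # E|z| under a standard normal (illustrative)

    def egarch_term(z):
        # stock model: 0.164 (|z| - E|z|) - 0.114 z
        return 0.164 * (abs(z) - E_ABS_Z) - 0.114 * z

    def tgarch_term(x):
        # bond model: (0.073 - 0.044 * 1{x < 0}) x^2
        return (0.073 - 0.044 * (x < 0)) * x**2

    print("EGARCH: z=-1 ->", round(egarch_term(-1.0), 3), "  z=+1 ->", round(egarch_term(1.0), 3))
    print("TGARCH: x=-1 ->", round(tgarch_term(-1.0), 3), "  x=+1 ->", round(tgarch_term(1.0), 3))
    # EGARCH: response to z=-1 exceeds response to z=+1  => leverage (stocks)
    # TGARCH: response to x=-1 is below response to x=+1 => opposite of leverage (bonds)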

Question 2.C (3 points)
Mr. Milenko Petic, a quant researcher at Toastercity & Ass., is comparing two alternative estimates of a GARCH-in-mean model for Japanese stock returns, both obtained from an identical sample of data. The first set of parameter estimates has been obtained by assuming a Student-t distribution for the standardized errors and by jointly maximizing the log-likelihood function with respect to all 6 coefficients, and it is as follows:

r_t = 0.008 + 0.805 σ²_t + ε_t,   ε_t = σ_t z_t
σ²_t = 0.016 + 0.081 ε²_{t-1} + 0.897 σ²_{t-1},   z_t ~ IID t(0, 9.806).

The second set of parameter estimates has been obtained by assuming a Student-t distribution for the standardized errors but in two steps: first, the GARCH model has been estimated for the returns as

r_t = 0.009 + ε_t,   σ²_t = 0.019 + 0.078 ε²_{t-1} + 0.901 σ²_{t-1},   z_t ~ IID t(0, 9.537);

next, given the estimated time series of predicted variances obtained from the GARCH model, σ̂²_t, the conditional mean function is estimated by simple OLS as

r_t = 0.010 + 0.785 σ̂²_t + e_t,   σ̂²_t = 0.019 + 0.078 ε²_{t-1} + 0.901 σ̂²_{t-1}.

Milenko claims that the two estimators are equally valid, i.e., that their asymptotic properties are identical. Do you agree with his claim? What are the pros and cons of the two alternative estimation methods that have been described? Make sure to clearly justify your answer.

Milenko is incorrect: the second estimator is obtained through a two-step process, but because it fails to assume normal standardized residuals, it cannot be rationalized as an instance of QMLE; and because estimation has been split into two steps, its statistical properties are simply unknown. The first estimator may instead be an MLE which, at least asymptotically, would be consistent, asymptotically normal, and the most efficient estimator, but such properties would require the correct specification of everything in the model, i.e., the conditional mean (the GARCH-in-mean structure), the conditional variance, as well as the Student-t distribution for the shocks. In case any misspecification had occurred, the estimator would lose all of the claimed asymptotic properties. As a result, the first estimator also appears to be risky. To hedge such risk, Milenko should subject the standardized residuals from the model to a carefully designed battery of diagnostic checks.

Question 3.A (9 points)
Describe the theoretical justifications as well as the practical implementations of tests of the forecasting validity of a conditional heteroskedasticity model based on the linear regression

ε²_{t+1} = a + b σ²_{t+1|t} + e_{t+1},

where e_{t+1} is a white noise shock and σ²_{t+1|t} are the one-step-ahead conditional variance forecasts derived from a given model. How would you estimate this linear model? Under what circumstances will the null that the model yields unbiased and efficient forecasts be rejected? Discuss whether you would also use the regression R-square to assess the validity of the variance model. Make sure to clearly justify your answers.
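A minimal Python sketch of this forecast-evaluation (Mincer-Zarnowitz-style) regression (not in the original solutions; the data are simulated so that the forecasts are unbiased and efficient by construction, and plain OLS standard errors are used for simplicity, although heteroskedasticity-robust ones would be advisable with such data):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    sigma2 = rng.gamma(2.0, 0.5, size=1000)             # stand-in for one-step variance forecasts
    eps2 = sigma2 * rng.standard_normal(1000) ** 2      # realized squared residuals

    X = np.column_stack([np.ones_like(sigma2), sigma2])
    coef, *_ = np.linalg.lstsq(X, eps2, rcond=None)     # OLS of eps^2 on a constant and sigma^2
    u = eps2 - X @ coef
    dof = len(eps2) - 2
    cov = (u @ u / dof) * np.linalg.inv(X.T @ X)

    t_a = coef[0] / np.sqrt(cov[0, 0])                  # H0: a = 0 (unbiasedness)
    t_b = (coef[1] - 1.0) / np.sqrt(cov[1, 1])          # H0: b = 1 (efficiency)
    d = coef - np.array([0.0, 1.0])
    F = d @ np.linalg.inv(cov) @ d / 2                  # joint Wald/F test of a = 0, b = 1
    print("a=%.3f (t=%.2f)  b=%.3f (t=%.2f)  F=%.2f (p=%.3f)"
          % (coef[0], t_a, coef[1], t_b, F, stats.f.sf(F, 2, dof)))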

Question 3.B (4.5 points)
Mindy Kauskakas, an independent researcher, has estimated two models to predict the one-day-ahead variance of US aggregate excess stock returns. For the two models, call them A and B, Mindy has obtained the following results from a regression of squared residuals (from an MA(1) model that has been pre-specified using Box-Jenkins analysis) on the variance predictions, σ²_{A,t+1|t} and σ²_{B,t+1|t} (estimated standard errors are reported in parentheses):

ε²_{t+1} = 0.434 + 0.549 σ²_{A,t+1|t} + e_{t+1},   R² = 0.029
          (0.043)  (0.027)

ε²_{t+1} = 0.106 + 1.146 σ²_{B,t+1|t} + e_{t+1},   R² = 0.159
          (0.038)  (0.023)

Scatter plots of the squared residuals vs. the variance predictions, with a regression line superimposed, are as follows: [Scatter plots not reproduced in the transcription.]

Which of the two models, if any, can be considered to be a valid prediction tool? Apart from the estimates of the two sets of regression coefficients, do the two plots reveal additional information to back up your judgement on the relative merits of the two models? Make sure to clearly justify your answers.

Both models should be rejected because they provide biased and inefficient (i.e., not moving one-to-one with the target values) forecasts. In particular, in the case of model A we have:

t_{a=0} = 0.434/0.043 ≈ 10.0 > 2  =>  reject the null of a = 0
t_{b=1} = (0.549 - 1)/0.027 ≈ -16.6,  |t| > 2  =>  reject the null of b = 1.

Clearly, in this case testing the joint null of a = 0 and b = 1 using an F-test will also lead to a rejection. Moreover, with all its limitations, the R-square is in this case largely disappointing. The left panel of the figure also shows an interesting phenomenon: in a non-negligible fraction of the sample, recorded variance is large and exceeds 20 (careful, this is not a percentage!) while the model predicts a variance of almost zero, which is a reason for concern; in a few cases, we also record the opposite pattern: the recorded squared error is small, below 1-2, but model A returns predictions that exceed 5 or even 10 (see the green circles in the copy of the figures below). Of course, these regularities contribute to a rather small regression R-square. In the case of model B, we have:

t_{a=0} = 0.106/0.038 ≈ 2.8 > 2  =>  reject the null of a = 0
t_{b=1} = (1.146 - 1)/0.023 ≈ 6.4 > 2  =>  reject the null of b = 1.

Clearly, in this case too, testing the joint null of a = 0 and b = 1 using an F-test will lead to a rejection. However, the R-square of this regression is not as low and disappointing as the one we have obtained for model A, and it corresponds to almost the maximum one may hope to get with this type of data. This is qualitatively confirmed by the rightmost plot of the figure, in which high squared residuals are always matched by non-zero, substantial variance predictions: when variance turns out to be high, the model will have forecast that. However, the second type of bias remains visible and actually gets even stronger (see the orange circle): in a considerable fraction of the sample, the recorded squared error is small, below 1-2, but model B returns predictions that exceed 5 or even 10, which implies that a fraction of the time model B predicts a high variance that fails to materialize in the data. [Annotated copies of the figures not reproduced in the transcription.]

Question 3.C (2.5 points)
In the case of model B, Mindy then proceeds to look for ways to improve the model and its predictive performance. She obtains the following evidence: [Diagnostic plots not reproduced in the transcription.]

Keep in mind that model B has been estimated assuming that the standardized shocks are drawn from a Student-t distribution, which justifies the selection of the benchmark in the kernel density plot (third plot going clockwise). What is your advice to Mindy as to ways to improve the predictive power of model B? Make sure to clearly justify your answer.

In fact, it all looks rather good apart from one piece of evidence: the kernel density comparison and especially the quantile-quantile plot reveal that the Student-t inflates the tails of the predicted density excessively, given the tail thickness expressed by the data (also because the estimated number of degrees of freedom, less than 8, appears to be really small). On the contrary, there is no evidence that any residual ARCH structure is left in the data, or of asymmetries that are not captured (see the kernel plot), even though further tests for asymmetries using the LM principle or news impact curves might be explored. Finally, note that one piece of evidence is rather redundant and unhelpful: the histogram provides the background to test for normality, but there is no presumption here that the data may come from a Gaussian distribution.

Question 4 (4 points)
Consider the following QML estimates for a Gaussian GARCH(1,1)-DCC(1,1) model with correlation targeting, applied to the bivariate vector time series of bitcoin and ripple cryptocurrency returns (p-values are in parentheses):

r_{1,t} = 0.229 + ε_{1,t},   σ²_{1,t} = 0.289 + 0.151 ε²_{1,t-1} + 0.848 σ²_{1,t-1}
r_{2,t} = 0.299 + ε_{2,t},   σ²_{2,t} = 0.074 + 0.450 ε²_{2,t-1} + 0.548 σ²_{2,t-1}
q_{12,t} = (1 - 0.104 - 0.872) 0.311 + 0.104 z_{1,t-1} z_{2,t-1} + 0.872 q_{12,t-1},

where ε_{i,t} is each return minus its estimated constant mean and z_{i,t} = ε_{i,t}/σ_{i,t}. Is the process for the auxiliary variable q_{12,t} stationary? Can you compute the implied, estimated value of the constant in the process for q_{12,t}, given that correlation targeting has been applied? What guarantees that the implied DCC correlation

ρ_{12,t} = q_{12,t} / (q_{11,t} q_{22,t})^{1/2}

always falls in the interval [-1, +1]? Finally, illustrate the difference between the two processes

σ²_{1,t} = 0.289 + 0.151 ε²_{1,t-1} + 0.848 σ²_{1,t-1}
q_{11,t} = (1 - 0.104 - 0.872) + 0.104 z²_{1,t-1} + 0.872 q_{11,t-1}.

Will they return identical values? What is the meaning of them being different, and what does that imply?

Yes, the process for the auxiliary variable q_{12,t} is stationary because 0.104 + 0.872 = 0.976 < 1. Yes, it is easy to compute the implied, estimated value of the constant in the process for q_{12,t}, given that correlation targeting has been applied: ω_{12} = (1 - 0.104 - 0.872) × 0.311 ≈ 0.0075. What guarantees that the implied DCC correlation ρ_{12,t} = q_{12,t}/(q_{11,t} q_{22,t})^{1/2} falls in the interval [-1, +1] is the very functional form adopted: all the auxiliary variables obey the same recursion, q_{ij,t} = (1 - 0.104 - 0.872) q̄_{ij} + 0.104 z_{i,t-1} z_{j,t-1} + 0.872 q_{ij,t-1}, so that the implied matrix Q_t is positive semi-definite at each point in time and q_{12,t} can never exceed (q_{11,t} q_{22,t})^{1/2} in absolute value. This is in fact the meaning of, and the interest in, the auxiliary variable; see the posted lecture notes for a proof. The two processes

σ²_{1,t} = 0.289 + 0.151 ε²_{1,t-1} + 0.848 σ²_{1,t-1}
q_{11,t} = (1 - 0.104 - 0.872) + 0.104 z²_{1,t-1} + 0.872 q_{11,t-1}

are different and will not return identical values. This is clear from the fact that

q_{11,t} = (1 - 0.104 - 0.872) + 0.104 ε²_{1,t-1}/σ²_{1,t-1} + 0.872 q_{11,t-1},

so that, even if we set q_{11,t-1} = σ²_{1,t-1}, q_{11,t} = 0.024 + 0.104 ε²_{1,t-1}/σ²_{1,t-1} + 0.872 σ²_{1,t-1} will in general fail to equal σ²_{1,t} = 0.289 + 0.151 ε²_{1,t-1} + 0.848 σ²_{1,t-1}. These processes are different, and they must be kept so, to ensure that ρ_{12,t} = q_{12,t}/(q_{11,t} q_{22,t})^{1/2} yields predicted correlations in [-1, +1]. As discussed in the lectures, this is the inconsistency of the DCC approach: it is as if two different notions of conditional variance were used by the same model contemporaneously. Moreover, in this specific case, note that while the two individual GARCH models have been estimated without imposing variance targeting, correlation targeting has been applied in the DCC model, and this generates additional inconsistencies.
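The mechanics can be verified with a short Python simulation (not in the original solutions; the standardized shocks are placeholder draws, so the exercise illustrates only the recursions, not the actual fit):

    import numpy as np

    rng = np.random.default_rng(5)
    T = 10_000
    a, b, rho_bar = 0.104, 0.872, 0.311      # DCC estimates and targeted correlation from above
    om1, a1, b1 = 0.289, 0.151, 0.848        # bitcoin GARCH(1,1) estimates from above

    z1, z2 = rng.standard_normal(T), rng.standard_normal(T)
    s2, q11, q22, q12 = om1 / (1 - a1 - b1), 1.0, 1.0, rho_bar
    rho = np.empty(T)
    gap = np.empty(T)
    for t in range(T):
        rho[t] = q12 / np.sqrt(q11 * q22)    # the DCC conditional correlation
        gap[t] = q11 - s2                    # two coexisting "variances" for the same asset
        e1 = np.sqrt(s2) * z1[t]
        s2  = om1 + a1 * e1**2 + b1 * s2
        q11 = (1 - a - b) * 1.0     + a * z1[t]**2      + b * q11
        q22 = (1 - a - b) * 1.0     + a * z2[t]**2      + b * q22
        q12 = (1 - a - b) * rho_bar + a * z1[t] * z2[t] + b * q12

    print("rho_t stays in [-1, 1]:", (rho >= -1).all() and (rho <= 1).all())
    print("max |q11 - sigma^2|:", np.abs(gap).max())    # strictly positive: different processes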

Question 5 (4 points)
After having explained to which kind of reactions to the problem of structural instability in econometric relationships it is possible to assign Markov regime switching models, proceed to illustrate the structure and statistical features of a bivariate MSIH(2,0) model, i.e., a model with no VAR components but in which the covariance matrix is regime switching. Will an MSIH(2,0) induce unconditional skewness and kurtosis in the implied predictive density obtained from the model estimates?

The MSIH(2,0) model requested is a special case (with two regimes instead of three) of the example discussed in the lectures. [Slides not reproduced in the transcription.] The comments on the statistical features may require that you provide some details along the following lines: [Slides not reproduced in the transcription.]
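On the last point, the answer is yes: a Markov mixture of Gaussian regimes with different means and covariance matrices generates unconditional skewness and excess kurtosis even though each regime is normal. A univariate Python sketch of the mechanism (not in the original solutions; the transition matrix and the regime parameters are made up for illustration):

    import numpy as np

    rng = np.random.default_rng(6)
    T = 200_000
    P = np.array([[0.98, 0.02],              # made-up transition matrix: row i gives
                  [0.05, 0.95]])             # Pr(S_{t+1} = j | S_t = i)
    mu    = np.array([0.8, -1.0])            # regime-dependent means (illustrative)
    sigma = np.array([1.0,  3.0])            # regime-dependent volatilities (illustrative)

    s, y = 0, np.empty(T)
    for t in range(T):
        y[t] = mu[s] + sigma[s] * rng.standard_normal()
        s = rng.choice(2, p=P[s])            # draw the next regime from row s of P

    z = (y - y.mean()) / y.std()
    print("skewness: %+.2f  (0 under a Gaussian)" % np.mean(z**3))
    print("kurtosis: %.2f   (3 under a Gaussian)" % np.mean(z**4))

Differences in the regime means generate the skewness, while differences in the regime variances generate the excess kurtosis; the same logic applies component by component in the bivariate MSIH(2,0) case.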