Nonparametric Estimation of a Hedonic Price Function

Daniel J. Henderson, Subal C. Kumbhakar, and Christopher F. Parmeter
Department of Economics
State University of New York at Binghamton
February 23, 2005

Abstract

In this paper we attempt to replicate the results of an article (Anglin and Gençay 1996) published in this journal which applied semiparametric procedures to estimate a hedonic price function. To relax additional restrictive assumptions, we also employ a fully nonparametric model that captures nonlinearity in both continuous and categorical variables. We find that the nonparametric procedure gives more intuitive and meaningful results.

Keywords: Nonparametric, Generalized Kernel Estimation, Hedonic Price
JEL Classification No.: C13, C14, D40

Daniel J. Henderson, Department of Economics, State University of New York, Binghamton, NY 13902-6000, (607) 777-4480, Fax: (607) 777-2681, e-mail: djhender@binghamton.edu.
Subal C. Kumbhakar, Department of Economics, State University of New York, Binghamton, NY 13902-6000, (607) 777-4762, Fax: (607) 777-2681, e-mail: kkar@binghamton.edu.
Christopher F. Parmeter, Department of Economics, State University of New York, Binghamton, NY 13902-6000, Fax: (607) 777-2681, e-mail: cparmet1@binghamton.edu.
1 Introduction

In this paper we present replication results of the hedonic price functions estimated in the housing market study by Anglin and Gençay (1996), hereafter AG. We also report results that can be viewed as robustness checks of their semiparametric model.

In hedonic price models it is argued that the value of a good (a house in the present case) depends on the amounts of the attributes it contains. Thus, its price will be a function of the attributes/characteristics $z_1, \ldots, z_l$. Implicit prices of the characteristics can be computed from the partial derivatives of the price function with respect to the level of the characteristics. Since these derivatives may depend on the levels of these characteristics, the choice of functional form in empirical analysis is quite important. To relax some of the traditional functional form assumptions, AG used a semiparametric approach in which the hedonic price function is specified as (equation 11 in AG)

$$\ln P_i = \beta_1 \mathrm{DRV}_i + \beta_2 \mathrm{REC}_i + \beta_3 \mathrm{FFIN}_i + \beta_4 \mathrm{GHW}_i + \beta_5 \mathrm{CA}_i + \beta_6 \mathrm{GAR}_i + \beta_7 \mathrm{REG}_i + q[\mathrm{LOT}_i, \mathrm{BDMS}_i, \mathrm{FB}_i, \mathrm{STY}_i] + \varepsilon_i, \quad i = 1, \ldots, n, \tag{1}$$

where $P_i$ is the price of house $i$; DRV, REC, FFIN, GHW, CA, and REG are dummy variables for driveway, recreational room, finished basement, gas water heating, central air conditioning, and preferred neighbourhood; GAR, BDMS, FB, and STY are the number of garages, bedrooms, full bathrooms, and stories; and LOT is the lot size (in square feet). As a comparison, they considered the following parametric benchmark model

$$\ln P_i = \beta_1 \mathrm{DRV}_i + \beta_2 \mathrm{REC}_i + \beta_3 \mathrm{FFIN}_i + \beta_4 \mathrm{GHW}_i + \beta_5 \mathrm{CA}_i + \beta_6 \mathrm{GAR}_i + \beta_7 \mathrm{REG}_i + \gamma_1 \ln(\mathrm{LOT}_i) + \gamma_2 \ln(\mathrm{BDMS}_i) + \gamma_3 \ln(\mathrm{FB}_i) + \gamma_4 \ln(\mathrm{STY}_i) + u_i. \tag{2}$$

We reproduced the results of this parametric model, reported in Table II of their paper. The OLS results for the other parametric models (viz., extensions of model (2) above) and the findings from the corresponding specification tests were also reproduced without any difficulty.
2 Semiparametric Estimation

In a general setting, we can write equation (1) as

$$y_i = x_{1i}'\beta + m(x_{2i}) + \varepsilon_i, \quad i = 1, \ldots, n. \tag{3}$$

Semiparametric models of the above form were studied by Robinson (1988) and Stock (1989) to take advantage of a known parametric relationship and an unknown functional relationship in economic models. In this setup, $y_i$ is the dependent variable, $x_{1i}$ is a $q$-dimensional vector of discrete contrasts, $\beta$ is a $q$-dimensional vector of contrast effects, $x_{2i}$ is a $k$-dimensional vector of continuous variables that affect the dependent variable through the unknown function $m(\cdot)$, $\varepsilon_i$ is a stochastic error that accounts for inherent randomness in the model, and $n$ is the sample size available to the researcher. Robinson (1988) noted that the additional information given by the linear portion of (1) leads to $\sqrt{n}$-consistent estimation of the linear parameters (when correctly specified).

While AG were concerned primarily with the linear coefficients and the out-of-sample prediction of the estimates, we thought it would be interesting to also study the unknown function and its derivatives. The derivatives with respect to the variables in the nonlinear function are often of equal or greater interest to the researcher. In other words, it is natural to ask what the implicit price is of lot size, of one extra bedroom, of one extra bathroom, and so on. Specifically, we follow AG's methodology to estimate the linear coefficients, and then use local linear least squares to obtain estimates of $m(x_2)$ and its derivatives.

Our estimation of (1) gave us results similar to those of the AG study, but there were subtle differences that may be attributed to computational differences and the accuracy of computer precision (note that the conclusions of the paper do not change).
Specifically, we were not able to obtain the same bandwidth via the method used in their paper, and we were not able to match the linear estimates given in their Table V (using either their specified bandwidth or our calculated bandwidth). Our semiparametric estimates of the linear coefficients (along with the associated standard errors) are provided in Table 1. It is worth mentioning that the OLS estimates (reported in Table II of AG) are not much different from the semiparametric results. Thus, if the main objective is to estimate the linear parameters, the OLS results might be as good as the semiparametric results. Here we argue that the semiparametric models are more flexible as far as the estimation of the implicit prices (derivatives) of the variables in the unknown function is concerned.
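The estimation of the linear coefficients in a model of the form (3) can be illustrated with a double-residual estimator in the spirit of Robinson (1988). The following is a minimal sketch only: the Gaussian product kernel, the fixed bandwidth `h`, the function names, and the synthetic data are all assumptions made for illustration, and none of them reproduces AG's kernel, bandwidth, or data.

```python
import numpy as np

def smoother_matrix(X2, h):
    """Row-normalised Gaussian product-kernel weights (Nadaraya-Watson smoother)."""
    diff = (X2[:, None, :] - X2[None, :, :]) / h      # (n, n, k) scaled differences
    w = np.exp(-0.5 * np.sum(diff ** 2, axis=2))      # (n, n) kernel weights
    return w / w.sum(axis=1, keepdims=True)

def robinson_beta(y, X1, X2, h):
    """Double-residual estimate of beta in y = X1 @ beta + m(X2) + e."""
    S = smoother_matrix(X2, h)
    y_res = y - S @ y          # y minus its kernel estimate of E[y | X2]
    X1_res = X1 - S @ X1       # each linear regressor minus its estimate of E[X1 | X2]
    beta, *_ = np.linalg.lstsq(X1_res, y_res, rcond=None)
    return beta

# Synthetic illustration (hypothetical data, not the AG housing sample).
rng = np.random.default_rng(0)
n = 400
X1 = rng.integers(0, 2, size=(n, 2)).astype(float)   # two dummy regressors
X2 = rng.uniform(0.0, 3.0, size=(n, 1))              # one continuous regressor
beta_true = np.array([0.15, 0.08])
y = X1 @ beta_true + np.sin(X2[:, 0]) + rng.normal(0.0, 0.05, size=n)

beta_hat = robinson_beta(y, X1, X2, h=0.2)
print(beta_hat)   # should lie close to beta_true
```

Partialling the kernel-estimated conditional means out of both the dependent variable and the linear regressors removes the unknown function before the least squares step, which is why the linear coefficients can be recovered without specifying $m(\cdot)$.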
Once we obtained the linear coefficients, we estimated the unknown function from $\tilde{y}_i = y_i - x_{1i}'\hat{\beta} = m(x_{2i}) + \varepsilon_i$ using local linear least squares, employing the same kernel function and bandwidth used by AG. The results for the local linear least squares estimation of the semiparametric model are found in Table 2. The table reports the median coefficient with respect to each variable (along with its bootstrapped standard error).

One striking feature of the table is the coefficient on lot size. Our results suggest that the lot size variable (LOT) has a small, but significant, impact on the log price of a house. However, this coefficient is deceiving since the dependent variable is the natural logarithm of price and the lot size of the property is in level form. Given that the mean values of LOT and price are 5150 square feet and 68112, respectively, the average implicit price of a square foot of a lot is (0.00007 × 68112 =) 4.77 dollars. For the parametric model the average implicit price (evaluated at the mean) is (0.303 × (68112/5150) =) 4.01 dollars, which is not much different from the semiparametric estimate.

One might, however, ask why the lot size variable appears in natural logarithm in the parametric model while it appears in level form in the semiparametric model. One would expect (for the purpose of comparison) that LOT would be included in the unknown function in logarithmic form.^1 Further, it is not quite clear to the present authors why three of the four discrete (what we refer to as ordered categorical) variables were included in the unknown function and the fourth (GAR) was not. In fact, given that the unknown function is traditionally composed of continuous variables, we are unsure as to why any of the four discrete variables would show up inside this function in the first place.
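The two implicit prices of lot size quoted above follow from the chain rule: with $\ln P$ as the dependent variable, a coefficient $b$ on LOT in levels implies $\partial P / \partial \mathrm{LOT} = b \cdot P$, while a coefficient $\gamma$ on $\ln(\mathrm{LOT})$ implies $\partial P / \partial \mathrm{LOT} = \gamma \cdot P / \mathrm{LOT}$. The short check below uses only the figures reported in the text; the variable names are illustrative.

```python
# Sample means reported in the text.
mean_price = 68112.0   # mean house price (dollars)
mean_lot = 5150.0      # mean lot size (square feet)

# Semiparametric model: LOT enters in levels, ln(P) is the dependent variable.
b_lot_level = 0.00007                      # median derivative of ln(P) w.r.t. LOT (Table 2)
sp_implicit_price = b_lot_level * mean_price

# Parametric model: LOT enters in logs, so the coefficient is an elasticity.
g_lot_log = 0.303                          # coefficient on ln(LOT) quoted in the text
ols_implicit_price = g_lot_log * mean_price / mean_lot

print(round(sp_implicit_price, 2), round(ols_implicit_price, 2))
```

Both figures are evaluated at the sample means, so they summarise an average house rather than any particular observation.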
The inclusion of discrete variables in the unknown function is especially mysterious since AG wished to compare the semiparametric model to a benchmark parametric model where three of their discrete variables were included in logarithmic form,^2 see their equation (9). AG did estimate several variations of their benchmark parametric model (including a log-level model), but the out-of-sample results of their semiparametric model were compared with their benchmark parametric model, which had LOT, BDMS, FB, and STY in logarithmic form. While there are reasons to exclude shift variables from an unconstrained, unknown function (see Pagan and Ullah 1999, p. 198), we follow Rosen's (1974) suggestion and consider a fully nonlinear specification of the hedonic price equation. This setup leads us to estimate the model fully nonparametrically.

^1 We also estimated the semiparametric model in (1) with lot size measured in logs. Although the median coefficient on ln(LOT) is 0.292, the implicit price does not change significantly, nor do the coefficients on BDMS, FB, or STY.
^2 GAR cannot be included in logarithmic form because there are zero values for some of the observations.

3 Nonparametric Estimation

Although the techniques used in AG were cutting edge at the time the paper was written, recent advances allow us to estimate their model fully nonparametrically. Previously, in the presence of categorical regressors, authors were often forced to use semiparametric techniques (e.g., see Robinson 1988 and Stock 1989). However, Li and Racine (2004) and Racine and Li (2004) developed a method to smooth both ordered and unordered categorical data in a nonparametric kernel regression. This is especially important here because the idea is to check the robustness/appropriateness of the results from the parametric and semiparametric models.

To estimate the hedonic price function fully nonparametrically, we utilize Li-Racine Generalized Kernel Estimation. To begin, consider the nonparametric regression model

$$y_i = m(x_i) + \varepsilon_i, \quad i = 1, \ldots, n, \tag{4}$$

where $y_i$ is the dependent variable (in our case, $\ln(P)$) for observation $i$. Further, $m$ is the unknown smooth hedonic price function with argument $x_i = [x_i^c, x_i^u, x_i^o]$, where $x_i^c$ is a vector of continuous regressors (in our case, a single continuous regressor, ln(LOT)), $x_i^u$ is a vector of regressors that assume unordered discrete values (DRV, REC, FFIN, GHW, CA, REG), $x_i^o$ is a vector of regressors that assume ordered discrete values (GAR, BDMS, FB, STY), $\varepsilon$ is an additive error, and $n$ is the number of observations.

It is well known that the choice of bandwidths is the most salient factor when performing nonparametric estimation. Although there exist many automatic selection methods, we utilize Hurvich, Simonoff, and Tsai's (1998) Expected Kullback-Leibler ($AIC_c$) criterion. This method chooses smoothing parameters using an improved version of a criterion based on the Akaike Information Criterion.
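The idea of smoothing continuous and categorical regressors jointly, with the smoothing parameter chosen by $AIC_c$, can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in the paper: it uses a local-constant (Nadaraya-Watson) smoother instead of local linear least squares, a Gaussian kernel for the continuous regressor, simple categorical kernels of the kind discussed in Racine and Li (2004), fixed categorical smoothing parameters, a grid search over the continuous bandwidth only, and synthetic data. All function and variable names are hypothetical.

```python
import numpy as np

def mixed_kernel_weights(xc, xu, xo, h, lam_u, lam_o):
    """n x n product-kernel weights for one continuous, one unordered, one ordered regressor."""
    kc = np.exp(-0.5 * ((xc[:, None] - xc[None, :]) / h) ** 2)   # Gaussian kernel
    ku = np.where(xu[:, None] == xu[None, :], 1.0, lam_u)        # 1 if same category, else lam_u
    ko = lam_o ** np.abs(xo[:, None] - xo[None, :])              # decays with ordered distance
    return kc * ku * ko

def hat_matrix(W):
    """Local-constant smoother matrix H, so that fitted values are H @ y."""
    return W / W.sum(axis=1, keepdims=True)

def aicc(y, H):
    """Improved AIC of Hurvich, Simonoff, and Tsai (1998) for a linear smoother."""
    n = len(y)
    resid = y - H @ y
    sigma2 = resid @ resid / n
    tr = np.trace(H)
    return np.log(sigma2) + (1.0 + tr / n) / (1.0 - (tr + 2.0) / n)

# Synthetic illustration (hypothetical data).
rng = np.random.default_rng(1)
n = 200
xc = rng.uniform(0.0, 1.0, n)      # continuous regressor
xu = rng.integers(0, 2, n)         # unordered dummy
xo = rng.integers(0, 3, n)         # ordered count
y = np.sin(2 * np.pi * xc) + 0.2 * xu + 0.1 * xo + rng.normal(0.0, 0.1, n)

grid = [0.02, 0.05, 0.1, 0.2, 0.5]
scores = [aicc(y, hat_matrix(mixed_kernel_weights(xc, xu, xo, h, 0.3, 0.3))) for h in grid]
best_h = grid[int(np.argmin(scores))]
print(best_h)
```

A categorical smoothing parameter of zero reduces the estimator to splitting the sample by cell, while a value of one for the unordered kernel ignores that regressor entirely; intermediate values borrow information across cells. In a full implementation the continuous and categorical smoothing parameters would be chosen jointly.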
$AIC_c$ has been shown to perform well in small samples and avoids the tendency to undersmooth that often arises with other approaches such as Least-Squares Cross-Validation.^3

The results for the local linear least squares estimation of the nonparametric model can be found in Table 3.^4 The table reports the mean coefficient with respect to each variable (along with its bootstrapped standard error), as well as the coefficients at the 25th, 50th, and 75th percentiles (labelled Q1, Q2, and Q3). The first thing to notice is the set of mean and quartile values of the coefficients on the discrete variables. Besides the insignificant REC = 1 (which was found to be significant in both the parametric and semiparametric procedures) and BDMS = 2 (perhaps due to the fact that there were only two houses in the sample with a single bedroom), the coefficients vary significantly over the quartiles. This variation suggests that the dummy variable approach is not appropriate. In other words, assuming a constant coefficient for these variables across the entire sample is incorrect. Next, as compared to the semiparametric approach, the nonparametric approach gives smaller estimates for the unordered categorical variables.

Another benefit of the generalized kernel estimation procedure is that we can now analyze changes across ordered categorical variables without assuming a linear shift. For instance, the coefficient on GAR = 1 shows the counterfactual increase in the log price of a particular house when the number of garage spaces is increased from zero to one, ceteris paribus. Similarly, GAR = 2 would show the counterfactual increase in the log price of a particular house when the number of garage spaces is increased from zero to two, ceteris paribus. If the linear structure is appropriate, one would expect the coefficient on GAR ≥ 2 (this category is grouped because there are very few houses in the sample with a three-car garage) to be at least twice that of GAR = 1. This is not the case. The mean coefficient goes from 0.026 on GAR = 1 to 0.029 on GAR ≥ 2. In other words, having a one-car garage significantly increases the log price of a home, but the effect of an upgrade from a one-car to a two-car garage on the log price of a home is minimal.

^3 See Li and Racine (2004) and Racine and Li (2004) for further details on the procedure and bandwidth choice.
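The garage comparison can be checked directly from the mean coefficients reported in Table 3. The snippet below is a small arithmetic illustration; the variable names are ours.

```python
# Mean coefficients on the garage categories, as reported in Table 3.
gar_1 = 0.026         # zero garages -> one garage
gar_2_plus = 0.029    # zero garages -> two or more garages

# A linear shift in GAR would require at least twice the one-garage effect.
linear_lower_bound = 2 * gar_1
incremental_effect = gar_2_plus - gar_1   # one garage -> two or more garages

print(round(linear_lower_bound, 3), round(incremental_effect, 3))
```

The incremental effect of moving from one garage space to two or more is roughly a tenth of the effect of the first space, which is the sense in which a single linear coefficient on GAR is too restrictive.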
Finally, the coefficient on ln(LOT) is positive and significant at each quartile. Each of these results suggests that the nonparametric procedure is more appropriate for this particular data set.

To pursue the idea of comparing results across models further, we compute the implicit price of lot size (which we consider to be an important variable in a home's price) for the 5 houses which have the median lot size of 4600 square feet. We compare two parametric models (equation 2 in the present paper, which corresponds to equation 9 in AG, and equation 2 in the present paper where the variables BDMS, FB, and STY are not logged) to the semiparametric and nonparametric models. The results are reported in Table 4. For the first house, the implicit price of lot size is almost the same across all four models. This is not the case for the other homes. For the second house (which is the most expensive in this group), the implicit price of lot size is the highest across all models. The price difference for the other houses across models is quite high. In three out of five cases, the implicit price derived from the nonparametric model is the highest. To provide a basis for comparison for these implicit prices, we also computed the average price per square foot of lot size for each of these houses. The results show that the average price per square foot is much higher than the implicit price.

Finally, we note that the parametric model is a special case of the semiparametric model. Further, the semiparametric model is a special case of the nonparametric model. When the results differ, it is often argued that the more restrictive approach is inappropriate. Thus, it might be suggested that the OLS and semiparametric results are biased because they fail to take all nonlinearities of the model into account.

^4 All bandwidths were calculated using N c.

4 Conclusion

In this paper we attempted to replicate the results of Anglin and Gençay (1996). Although we were able to exactly replicate their parametric results, we were unable to obtain identical results for their semiparametric procedure. Although our results differed, they did so only slightly, and the conclusions of the model stayed the same. We therefore assume that the differences are most likely due to differences in programming software. We enhanced their findings by also estimating the unknown function and its derivatives. Further, we extended their model by using advances in the literature which allow us to smooth both continuous and categorical data. In addition to being able to smooth discrete data, our preferred model employed new techniques for bandwidth estimation.
Our results showed that the semiparametric model is too restrictive and that the use of a fully nonparametric model gives more intuitive and meaningful results.
References

[1] Anglin, P. M., and R. Gençay (1996). Semiparametric Estimation of a Hedonic Price Function, Journal of Applied Econometrics, 11, 633-48.
[2] Hurvich, C. M., J. S. Simonoff, and C.-L. Tsai (1998). Smoothing Parameter Selection in Nonparametric Regression Using an Improved Akaike Information Criterion, Journal of the Royal Statistical Society, Series B, 60, 271-93.
[3] Li, Q., and J. Racine (2004). Cross-Validated Local Linear Nonparametric Regression, Statistica Sinica, 14, 485-512.
[4] N c, Nonparametric software by Jeff Racine (http://www.economics.mcmaster.ca/racine/).
[5] Pagan, A., and A. Ullah (1999). Nonparametric Econometrics, Cambridge, Cambridge University Press.
[6] Racine, J., and Q. Li (2004). Nonparametric Estimation of Regression Functions with Both Categorical and Continuous Data, Journal of Econometrics, 119, 99-130.
[7] Robinson, P. M. (1988). Root-N-Consistent Semiparametric Regression, Econometrica, 56, 931-54.
[8] Rosen, S. (1974). Hedonic Prices and Implicit Markets: Product Differentiation in Pure Competition, Journal of Political Economy, 82, 34-55.
[9] Stock, J. (1989). Nonparametric Policy Analysis, Journal of the American Statistical Association, 84, 567-75.
Table 1
Hedonic Price Function: Comparison of Linear Coefficients from the Semiparametric Specification^5

Variable    AG (1996)    HKP (2005)
DRV           0.147        0.117
             (0.048)      (0.028)
REC           0.078        0.077
             (0.028)      (0.027)
FFIN          0.097        0.098
             (0.023)      (0.023)
GHW           0.191        0.173
             (0.045)      (0.045)
CA            0.158        0.154
             (0.022)      (0.022)
GAR           0.064        0.056
             (0.013)      (0.012)
REG           0.124        0.120
             (0.025)      (0.025)

^5 The natural log of price is the dependent variable in each regression. These are the estimates from AG's equation (11) and our equation (1). Both sets of estimates were calculated using the kernel function and bandwidth suggested by AG. Standard errors of the estimates are given in parentheses beneath each estimate.
Table 2
Hedonic Price Function: Local Linear Least Squares Results of Derivatives of the Unknown Function^6

Variable    Median
LOT          0.00007
            (0.00003)
BDMS         0.0512
            (0.0451)
FB           0.1718
            (0.1441)
STY          0.0781
            (0.0997)

^6 The natural log of price is the dependent variable in the regression. Standard errors for the local linear least squares procedure (listed in parentheses beneath each estimate) were calculated with 199 bootstrap replications using the same kernel function and bandwidth employed by AG. However, these results must be viewed with caution because the matrix of kernel values was near singular for several of the bootstrap iterations (this problem is alleviated if, e.g., we arbitrarily increase the bandwidth, which results in relatively minor changes in the coefficients).
Table 3
Hedonic Price Function: Generalized Kernel Estimation^7

Variable     Mean      Q1        Q2        Q3
DRV = 1      0.051     0.025     0.043     0.075
            (0.009)   (0.002)   (0.009)   (0.011)
REC = 1      0.000     0.000     0.000     0.000
            (0.003)   (0.003)   (0.003)   (0.003)
FFIN = 1     0.113     0.076     0.155     0.286
            (0.028)   (0.012)   (0.028)   (0.036)
GHW = 1      0.186     0.075     0.155     0.286
            (0.028)   (0.012)   (0.012)   (0.035)
CA = 1       0.142     0.104     0.138     0.174
            (0.021)   (0.020)   (0.021)   (0.034)
REG = 1      0.078     0.055     0.081     0.109
            (0.006)   (0.001)   (0.013)   (0.013)
BDMS = 2     0.014     0.023     0.016     0.012
            (0.008)   (0.011)   (0.008)   (0.008)
BDMS = 3     0.031     0.014     0.030     0.046
            (0.008)   (0.007)   (0.008)   (0.008)
BDMS = 4     0.045     0.013     0.037     0.067
            (0.005)   (0.007)   (0.008)   (0.007)
BDMS ≥ 5     0.075     0.036     0.058     0.107
            (0.013)   (0.008)   (0.018)   (0.012)
FB = 2       0.156     0.103     0.150     0.202
            (0.011)   (0.014)   (0.022)   (0.041)
FB ≥ 3       0.294     0.231     0.307     0.361
            (0.029)   (0.029)   (0.029)   (0.025)
STY = 2      0.061     0.030     0.055     0.084
            (0.004)   (0.003)   (0.004)   (0.004)
STY = 3      0.127     0.093     0.128     0.166
            (0.002)   (0.004)   (0.002)   (0.008)
STY = 4      0.197     0.167     0.185     0.250
            (0.012)   (0.008)   (0.009)   (0.008)
GAR = 1      0.026     0.005     0.024     0.046
            (0.002)   (0.003)   (0.003)   (0.003)
GAR ≥ 2      0.029     0.011     0.030     0.041
            (0.003)   (0.002)   (0.003)   (0.003)
ln(LOT)      0.404     0.320     0.390     0.473
            (0.077)   (0.069)   (0.078)   (0.069)

^7 The natural logarithm of price is used as the dependent variable in the regression. Q1, Q2, and Q3 refer to the first, second, and third quartile, respectively. AICc is used for bandwidth selection. Bootstrapped standard errors (199 replications) are listed in parentheses beneath each estimate.
Table 4
Implicit Price of Lot Size^8

House    OLS(II)    OLS(III)    SP        NP        Price     Price/Lot
1        2.926      2.832       2.887     2.820     43000     9.348
2        8.642      8.365       11.146    15.227    127000    27.609
3        3.402      3.293       5.085     4.437     50000     10.870
4        4.083      3.952       5.266     8.266     60000     13.043
5        5.137      4.973       6.042     8.859     75500     16.413

^8 Each house in the table has the median lot size of 4600 square feet. Each coefficient is the implicit price of lot size (e.g., $\partial P / \partial \mathrm{LOT} = \beta(\ln(\mathrm{LOT})) \times P / \mathrm{LOT}$). OLS(II) refers to the results using the estimation procedure from Table II of AG, whereas OLS(III) refers to the results using the estimation procedure from Table III of AG. SP refers to the semiparametric results using the estimation procedure from Table 2 in our paper, while NP refers to the nonparametric results using the estimation procedure from Table 3 in our paper.
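The Price/Lot column of Table 4, and the log-lot coefficient implied by each nonparametric implicit price, can be recovered from the formula in the table note. The sketch below uses only the figures in Table 4; the implied elasticities are a back-of-the-envelope inversion of that formula and are not values reported by the authors.

```python
# Figures taken from Table 4.
median_lot = 4600                                   # square feet
prices = [43000, 127000, 50000, 60000, 75500]       # house prices
np_implicit = [2.820, 15.227, 4.437, 8.266, 8.859]  # nonparametric implicit prices

# Average price per square foot of lot (the Price/Lot column).
price_per_lot = [round(p / median_lot, 3) for p in prices]

# Invert dP/dLOT = beta * P / LOT to back out the implied coefficient on ln(LOT).
implied_elasticity = [ip / (p / median_lot) for ip, p in zip(np_implicit, prices)]

print(price_per_lot)
print([round(e, 3) for e in implied_elasticity])
```

Because all five houses share the same lot size, any difference in the implied coefficient across houses reflects variation in the estimated derivative itself, which a single OLS coefficient on ln(LOT) cannot capture.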