Resampling techniques to determine direction of effects in linear regression models

Wolfgang Wiedermann, Michael Hagmann, Michael Kossmeier, & Alexander von Eye
University of Vienna, Department of Psychology

Corresponding Author: Wolfgang Wiedermann, University of Vienna, Unit of Research Methods, Liebiggasse 5, A-1010 Vienna, Austria. E-mail: wolfgang.wiedermann@univie.ac.at

Acknowledgement: The authors are indebted to Rainer W. Alexandrowicz, Ingrid Koller, and Anna P. Nutt.

Abstract

Previous studies have shown that, in the context of linear regression analysis, the cube of the Pearson correlation coefficient can be expressed by the ratio of the third moment of the response variable to the third moment of the explanatory variable (Dodge & Rousson, 2001). This relation implies that the skewness of the response variable is always smaller than the skewness of the explanatory variable, and directional dependency can be determined based on the third moments of variables. The current study extends the concept of directional dependency and focuses on distributional properties of the residuals of two competing linear regression models. It is shown that the residual skewness of the mis-specified regression model is larger than the residual skewness of the true regression model. Based on this result, three significance tests are developed that can be used to determine the direction of dependence in non-normally distributed samples. A Monte Carlo simulation experiment is performed to analyze robustness and power properties of the proposed tests under various degrees of correlation, sample sizes, and population distributions. Additionally, an empirical example is provided which underlines important assumptions of the proposed resampling procedures. Recommendations are given for making decisions concerning the direction of effects based on the three significance tests.

Keywords: direction of effects, directional dependence, permutation, bootstrap, significance test

Concepts of correlation and linear regression are widely used in observational research. Although the two concepts are closely related, important differences exist. A correlation coefficient is usually defined by a symmetric formula, which reflects the fact that the two variables share the same role (i.e., no statement about causation is made). In contrast, regression analysis requires the definition of an explanatory (independent) variable and a response (dependent) variable. Thus, regression analysis requires a theoretical derivation of the causal relation between the variables of interest, which predefines the direction of a potentially observable effect. In many applications the direction of effects is obvious because reversing the direction seems implausible. For example, when observing a negative correlation between age and alcohol intake, it is more plausible to assume that increasing age causes a reduction of alcohol consumption due to a maturing-out effect (e.g., Muthén & Muthén, 2000) than to assume that alcohol intake rejuvenates individuals. Thus, in a linear regression model it is more plausible to regress alcohol intake on age than vice versa.

However, particularly in observational research, examples exist where causal inference is more challenging because of (at least) two alternatives to causation: reverse causation and confounding as alternative explanations for the association between an exposure (X) and an outcome (Y). Reverse causation means that both directions are conceivable to explain the observed effect (i.e., X causes Y vs. Y causes X). Confounding means that a third variable (Z) influences the causal relationship between X and Y (McGue, Osler, & Christensen, 2010; McNamee, 2003). Consider, for example, cannabis use and the occurrence of schizophrenia. Here, it might be less clear whether cannabis use causes schizophrenia (causation) or whether patients with schizophrenia diagnoses are more likely to use cannabis (reverse causation; Arseneault et al., 2004; Degenhardt et al., 2009). In addition, the association between cannabis consumption and schizophrenia might also be confounded by additional risk factors such as familial functioning or medical history (Smit, Bolier & Cuijpers, 2004).

Standard linear regression or correlational analyses do not allow one to derive conclusions concerning the direction of effects. This can be demonstrated by considering that the Pearson correlation of two continuous standardized variables (X and Y) is mathematically identical both to the regression slope of X when Y is regressed on X and to the regression slope of Y when X is regressed on Y (see, for example, Rodgers & Nicewander, 1988; von Eye & DeShon, 2012a). Recently, Dodge and Rousson (2001) discussed asymmetric properties of the correlation coefficient (see also Dodge & Rousson, 2000; Dodge & Yadegari, 2010; Muddapur, 2003; von Eye & DeShon, 2008) and have shown that the cube of the correlation coefficient can be expressed as the ratio of the standardized third central moments of the response and the explanatory variable under the assumption that the explanatory variable is not normally distributed, an assumption which seems reasonable for data obtained in the social sciences (Micceri, 1989). This asymmetric property implies that statements about the directional dependence can be made based on indicators of skewness. Von Eye and DeShon (2008) and Dodge and Yadegari (2010) have extended this idea to the case of the fourth and fifth order moments.

The aim of the present study is to propose statistical inference methods for deciding which regression model seems more appropriate to describe the data generating process (i.e., Y is regressed on X versus X is regressed on Y). The article is structured as follows. First, we give a brief introduction to the concept of directional dependence. Second, instead of focusing on the third moment of the explanatory and the response variable, we focus on properties of the third moments of the residuals in a standard linear regression setting. We show that, under certain assumptions, the skewness of model residuals can serve as a basis to determine the direction of effects. Third, based on this proposition, three significance tests are suggested for the comparison of two competing regression models. Fourth, results of a Monte Carlo simulation are presented to assess the accuracy of the proposed tests. Fifth, we present an empirical example to illustrate important characteristics of the proposed tests. Finally, practical recommendations, limitations, and potential future research directions are discussed.

The concept of directional dependence

Let X and Y be two continuous variables. Then the two linear regression models

$$Y = a_{YX} + b_{YX} X + \varepsilon_{YX} \tag{1}$$

and

$$X = a_{XY} + b_{XY} Y + \varepsilon_{XY} \tag{2}$$

can be estimated to represent the data generating process. Here, $a_{YX}$ and $a_{XY}$ refer to the intercepts of the models, $b_{YX}$ and $b_{XY}$ refer to the slope parameters, and $\varepsilon_{YX}$ and $\varepsilon_{XY}$ denote the model residuals. Generally, the residuals are assumed to be normally distributed (with an expected value of zero; $E[\varepsilon] = 0$), homoscedastic, and independent of the explanatory variable. Without restricting generality and for simplicity, we further assume in the following that both variables X and Y are skewed and positively correlated. Let

$$\rho = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} \tag{3}$$

be the Pearson correlation coefficient, where $\operatorname{cov}(X, Y)$ is the covariance of X and Y and $\sigma_X$ and $\sigma_Y$ are the standard deviations of X and Y, respectively. In addition, let

$$\gamma_X = \frac{E\left[(X - E[X])^3\right]}{\sigma_X^3} \tag{4}$$

and

$$\gamma_Y = \frac{E\left[(Y - E[Y])^3\right]}{\sigma_Y^3} \tag{5}$$

be the skewness of X and Y, i.e., the standardized third central moments of the variables. Dodge and Rousson (2000, 2001) as well as Dodge and Yadegari (2010) have shown that the following relationship between the third moments of the variables holds for the linear regression model given in Equation (1):

$$\gamma_Y = \rho^3 \gamma_X + (1 - \rho^2)^{3/2} \gamma_{\varepsilon_{YX}}, \tag{6}$$

where $\gamma_{\varepsilon_{YX}}$ refers to the skewness of the residuals when regressing Y on X. Furthermore, assuming symmetrically distributed residuals ($\gamma_{\varepsilon_{YX}} = 0$) and an asymmetric distribution for X ($\gamma_X \neq 0$), it follows that $\rho^3 = \gamma_Y / \gamma_X$. Because the range of the correlation is $-1 \leq \rho \leq 1$, this relation implies that the skewness of the response variable will always be smaller (in absolute value) than the skewness of the explanatory variable. Based on this implication, the following decisions about the direction of effects can be derived: (1) if X is skewed and Y is less skewed than X, then X is the explanatory variable and Y is the response, (2) if Y is skewed and X is less skewed than Y, then Y is the explanatory variable and X is the response, and (3) if X and Y are equally skewed, then the direction of the effect cannot be determined (cf. von Eye & DeShon, 2012a).

In addition to these descriptive guidelines, von Eye and DeShon (2008, 2012a) and Pornprasertmanit and Little (2012) propose using significance tests to determine directional dependence. The basic idea given in von Eye and DeShon (2008, 2012a) is to use D'Agostino's normality test (D'Agostino, 1971) to assess whether X and/or Y significantly deviate from normality. Pornprasertmanit and Little (2012) suggest using bootstrapping techniques to evaluate the higher moments of X and Y.
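The relation between the cubed correlation and the two skewness values can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes a gamma-distributed X with shape 1 (skewness 2), a standard normal error, and a slope chosen so that ρ = 0.5; the function names `skewness`, `pearson`, and `simulate` are illustrative.

```python
import math
import random


def skewness(values):
    """Standardized third central moment (moment estimator)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate(n=100_000, slope=1 / math.sqrt(3), seed=1):
    """Assumed design: X ~ Gamma(shape=1, scale=1), Y = slope * X + N(0, 1)."""
    rng = random.Random(seed)
    x = [rng.gammavariate(1.0, 1.0) for _ in range(n)]
    y = [slope * xi + rng.gauss(0.0, 1.0) for xi in x]
    return x, y


x, y = simulate()
rho = pearson(x, y)
# With symmetric errors, both quantities should be close to each other.
print("rho^3 * skewness(X):", rho ** 3 * skewness(x))
print("skewness(Y):        ", skewness(y))
```

With symmetric errors the two printed values should roughly agree, and the skewness of Y should be clearly smaller than that of X.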

Some distributional properties of the residuals

We can use the concept of directional dependence as a starting point for the following considerations: Instead of focusing on distributional properties of X and Y, we focus on distributional properties of the residuals $\varepsilon_{YX}$ (from regressing Y on X) and $\varepsilon_{XY}$ (from regressing X on Y). Von Eye and DeShon (2012a) emphasize that if a symmetric error variable (e.g., normally distributed noise) is added to a skewed explanatory variable (X), the resulting variable Y will be less skewed than X. As discussed in the previous section, directional dependency implies that (under the true model) the skewness of the response variable will always be less than the skewness of the explanatory variable. It follows that the residuals of the mis-specified model ($\varepsilon_{XY}$) will be more asymmetric than the residuals of the correctly specified model ($\varepsilon_{YX}$) and that the skewness of the residuals ($\gamma_{\varepsilon_{YX}}$ and $\gamma_{\varepsilon_{XY}}$) can be used to decide between the two models.

We now describe the first three moments of the residuals of the mis-specified model ($\varepsilon_{XY}$) in detail. For simplicity and without loss of generality we set the intercepts in both models ($a_{YX}$ and $a_{XY}$) to zero. Considering the relationship

$$\rho^2 = b_{YX} b_{XY} \tag{7}$$

and inserting (1) in (2), we obtain the following model for the error $\varepsilon_{XY}$:

$$\varepsilon_{XY} = (1 - \rho^2) X - b_{XY} \varepsilon_{YX}. \tag{8}$$

Assuming that X and $\varepsilon_{YX}$ are independent, the following relation holds for the variances in Equation (8):

$$\sigma^2_{\varepsilon_{XY}} = (1 - \rho^2)^2 \sigma^2_X + b_{XY}^2 \sigma^2_{\varepsilon_{YX}}. \tag{9}$$

In the same fashion we insert the third central moments in Equation (8) and arrive at

$$E\left[(\varepsilon_{XY} - E[\varepsilon_{XY}])^3\right] = (1 - \rho^2)^3 E\left[(X - E[X])^3\right] - b_{XY}^3 E\left[(\varepsilon_{YX} - E[\varepsilon_{YX}])^3\right]. \tag{10}$$

Now, assuming a symmetric distribution for $\varepsilon_{YX}$ (i.e., the true model) in (10), we obtain the following implications:

1) The skewness of X and the skewness of $\varepsilon_{XY}$ have the same sign.
2) Assuming that the remaining terms are fixed, the skewness of $\varepsilon_{XY}$ increases with increasing skewness of X.
3) Assuming that the remaining terms are fixed, the skewness of $\varepsilon_{XY}$ decreases with increasing correlation ρ.

From these implications, we can conclude that the skewness of model residuals can be used as a valuable indicator to determine the direction of effects. In the following section, we describe three significance tests for evaluating the symmetry of model residuals.

Significance tests for deciding on the direction of effects

Pornprasertmanit and Little (2012) suggested using bootstrapping techniques (see e.g., Efron & Tibshirani, 1993) to decide whether $\gamma_X$ and $\gamma_Y$ deviate from expectancy, and whether the difference in skewness estimates deviates from expectancy. Many previous studies have demonstrated that the application of resampling techniques controls Type I error rates and leads to improved power in various settings of statistical inference (Keselman et al., 2008; Wilcox, Keselman, & Kowalchuk, 1998; Wilcox, 2005). In the following section we introduce three resampling approaches to assess whether the observed skewness of the residuals deviates significantly from zero.
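Before turning to the tests, the claim that the mis-specified model yields more strongly skewed residuals can be illustrated with a small simulation. This is a hedged sketch under assumed settings (gamma-distributed X with skewness 2, standard normal error, ρ = 0.5); `ols_residuals` is an illustrative helper, not part of the original article.

```python
import math
import random


def skewness(values):
    """Standardized third central moment (moment estimator)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def ols_residuals(x, y):
    """Residuals from the least-squares regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [(b - my) - slope * (a - mx) for a, b in zip(x, y)]


rng = random.Random(7)
n = 20_000
x = [rng.gammavariate(1.0, 1.0) for _ in range(n)]          # skewness 2
y = [xi / math.sqrt(3) + rng.gauss(0.0, 1.0) for xi in x]   # rho = 0.5

skew_true = skewness(ols_residuals(x, y))    # Y on X: data-generating model
skew_wrong = skewness(ols_residuals(y, x))   # X on Y: mis-specified model
print("residual skewness, Y on X:", skew_true)
print("residual skewness, X on Y:", skew_wrong)
```

Under these assumptions the residuals of the Y-on-X model should be close to symmetric, while those of the X-on-Y model inherit a positive skewness from X.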

Nonparametric bootstrap test. Previous studies, which investigated the accuracy of bootstrap strategies in the linear regression context, have mainly focused on robust inference on regression coefficients (Afifi et al., 2007; Wilcox, 2003, 2005; Wu, 1986). We suggest the application of a nonparametric bootstrap to construct a bootstrapped cumulative distribution function (bcdf) of the skewness estimates of model residuals. In detail, bootstrap samples of size n (i.e., the number of observations) are randomly drawn with replacement out of the observed model residuals. Let $\hat{\gamma}^*$ be the skewness estimate based on a bootstrap sample. This process is repeated m times and the resulting bcdf consists of m bootstrap skewness estimates. In order to test the null hypothesis $H_0: \gamma_{\varepsilon} = 0$, let $p^*$ be the probability that the bootstrap estimate is greater than the theoretical value of zero, i.e., $p^* = P(\hat{\gamma}^* > 0)$. Then $p^*$ can be estimated as the proportion of bootstrap estimates for which $\hat{\gamma}^*$ exceeds zero,

$$\hat{p}^* = \frac{1}{m} \sum_{i=1}^{m} I(\hat{\gamma}^*_i > 0),$$

where $I(\cdot)$ is an indicator function which takes value 1 if its argument is true, and 0 otherwise. Under the null hypothesis, $\hat{p}^*$ approaches a uniform distribution as n and m increase (Wilcox, 2005) and the null hypothesis is rejected if $\hat{p}^*$ is smaller than the nominal significance level α.

Parametric bootstrap test. Second, instead of a nonparametric bootstrap, a parametric approach can be applied as well. Algorithmically, model residuals are used to first estimate the variance of the error distribution ($\hat{\sigma}^2_{\varepsilon}$). Next, m samples of size n are drawn from a normal distribution with zero expectancy and the estimated variance. Again, for each bootstrap sample the skewness $\hat{\gamma}^*$ is estimated. The proportion of bootstrap samples for which $\hat{\gamma}^*$ exceeds the original skewness estimate $\hat{\gamma}_{\varepsilon}$ can be used as an estimate for the probability of $\hat{\gamma}_{\varepsilon}$ given the assumption of a symmetric error distribution:

$$\hat{p} = \frac{1}{m} \sum_{i=1}^{m} I(\hat{\gamma}^*_i > \hat{\gamma}_{\varepsilon}),$$

where $I(\cdot)$ is an indicator function which takes value 1 if its argument is true, and 0 otherwise. If $\hat{p}$ is smaller than a chosen nominal significance level (α), the hypothesis of a symmetric error is rejected. The proposed approach is not restricted to the normal distribution. Any symmetric distribution with an expected value of zero could be considered as well. However, the assumption of normally distributed errors is commonly made in the context of linear regression. Thus, we decided to consider this special case.

Permutation test. Third, we suggest applying principles of permutation (Fisher, 1935; Neyman, 1923; Pitman, 1937) to assess the direction of effects. Permutation techniques have successfully been used to evaluate the impact of explanatory variables in linear regression settings (Anderson, 2001; Freedman & Lane, 1983; ter Braak, 1992). We propose the following procedure: Under the true model we assume a symmetric error distribution ($\gamma_{\varepsilon} = 0$) with an expected value of zero. In other words, for n residuals we would expect n/2 residuals to be less than zero and n/2 to be greater than zero. Given the assumption of exchangeability under $H_0$, the signs of the residuals can be randomly shuffled. First, permutations of the absolute values of residuals are generated, so that n/2 residuals are assigned to the group with negative sign and the remaining n/2 residuals are assigned to the group with positive sign. Second, for each permutation sample the skewness $\hat{\gamma}^*$ is estimated. This step is repeated m times. The proportion of permutations where $\hat{\gamma}^*$ exceeds the originally observed skewness of the residuals can again be used as an estimate for the probability of $\hat{\gamma}_{\varepsilon}$ under the assumption of a symmetrically distributed error,

$$\hat{p} = \frac{1}{m} \sum_{i=1}^{m} I(\hat{\gamma}^*_i > \hat{\gamma}_{\varepsilon}).$$

If $\hat{p}$ is smaller than the chosen significance level α, the null hypothesis of symmetry is rejected. Again, $I(\cdot)$ is an indicator function which takes value 1 if its argument is true, and 0 otherwise. It is important to note that the proposed test is only asymptotically equivalent to an exact permutation procedure (complete enumeration). For large sample sizes it is computationally infeasible to establish the null distribution using all possible permutations (Dwass, 1957). The current approach of randomly sampling permutations is also known as Monte Carlo sampling (Hope, 1968; Nichols & Holmer, 2001). However, we use the term permutation test to avoid confusion with the Monte Carlo simulation experiment described below.

Based on these resampling tests we suggest the following steps in deciding about the direction of causation:

1. Decide whether X or Y is the explanatory variable based on theoretical considerations and estimate the parameters of the corresponding regression model (e.g., Model 1: $Y = a_{YX} + b_{YX} X + \varepsilon_{YX}$).
2. Examine distributional properties of the Model 1 residuals ($\varepsilon_{YX}$) using the resampling tests (i.e., test whether the null hypothesis of residual symmetry, $H_0: \gamma_{\varepsilon_{YX}} = 0$, can be retained).
3. Estimate the parameters of the competing regression model (Model 2: $X = a_{XY} + b_{XY} Y + \varepsilon_{XY}$) to obtain the corresponding residuals.
4. Examine distributional properties of the Model 2 residuals ($\varepsilon_{XY}$) using the resampling tests (i.e., test whether the null hypothesis of residual symmetry, $H_0: \gamma_{\varepsilon_{XY}} = 0$, can be retained).
5. If $H_0: \gamma_{\varepsilon_{YX}} = 0$ can be retained and $H_0: \gamma_{\varepsilon_{XY}} = 0$ can be rejected, Y is the response variable and X is the explanatory variable. If $H_0: \gamma_{\varepsilon_{YX}} = 0$ can be rejected and $H_0: \gamma_{\varepsilon_{XY}} = 0$ can be retained, X is the response variable and Y is the explanatory variable. If both null hypotheses are rejected (or both are retained), no decision concerning the direction of the effect can be made based on the methodology proposed in this article.

To evaluate the performance of the proposed resampling tests for the identification of the regression model of the true data generating process, a Monte Carlo simulation experiment was conducted, which is explained in detail in the following section.
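The three resampling tests and the five decision steps above can be sketched in a few lines. This is an illustrative Python sketch, not the authors' R implementation. Two choices are assumptions made here: the nonparametric bootstrap proportion is converted to a two-sided p-value via 2·min(p*, 1 − p*), and the parametric and permutation tests compare absolute skewness values so that residuals skewed in either direction are handled. All function names are hypothetical.

```python
import math
import random


def skewness(values):
    """Moment estimator of skewness; returns 0.0 for a constant sample."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0


def ols_residuals(x, y):
    """Residuals from the least-squares regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [(b - my) - slope * (a - mx) for a, b in zip(x, y)]


def nonparametric_bootstrap_p(resid, m=2000, rng=random):
    """Resample residuals with replacement; proportion of skewness values > 0."""
    n = len(resid)
    above = sum(skewness(rng.choices(resid, k=n)) > 0 for _ in range(m))
    p_star = above / m
    return 2 * min(p_star, 1 - p_star)  # two-sided conversion (assumption)


def parametric_bootstrap_p(resid, m=2000, rng=random):
    """Draw normal samples with the estimated residual variance."""
    n = len(resid)
    mean = sum(resid) / n
    sd = math.sqrt(sum((r - mean) ** 2 for r in resid) / n)
    observed = abs(skewness(resid))  # absolute value: assumption
    hits = 0
    for _ in range(m):
        sample = [rng.gauss(0.0, sd) for _ in range(n)]
        hits += abs(skewness(sample)) >= observed
    return hits / m


def permutation_p(resid, m=2000, rng=random):
    """Randomly reassign signs: half negative, half positive."""
    n = len(resid)
    magnitudes = [abs(r) for r in resid]
    signs = [-1.0] * (n // 2) + [1.0] * (n - n // 2)
    observed = abs(skewness(resid))  # absolute value: assumption
    hits = 0
    for _ in range(m):
        rng.shuffle(signs)
        sample = [s * a for s, a in zip(signs, magnitudes)]
        hits += abs(skewness(sample)) >= observed
    return hits / m


def decide_direction(x, y, test=permutation_p, alpha=0.05, **kwargs):
    """Steps 1 to 5: test residual symmetry in both competing models."""
    p_yx = test(ols_residuals(x, y), **kwargs)  # Model 1: Y on X
    p_xy = test(ols_residuals(y, x), **kwargs)  # Model 2: X on Y
    if p_yx >= alpha and p_xy < alpha:
        return "X -> Y"
    if p_yx < alpha and p_xy >= alpha:
        return "Y -> X"
    return "undecided"


# Usage with simulated data (assumed design: skewed X, normal error).
rng = random.Random(42)
x = [rng.gammavariate(1.0, 1.0) for _ in range(200)]
y = [0.5 * xi + rng.gauss(0.0, 1.0) for xi in x]
print(decide_direction(x, y, test=parametric_bootstrap_p, m=500, rng=rng))
```

Passing the test as a function argument keeps the decision rule identical across the three procedures, so their conclusions on the same data can be compared directly.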

Methods

To assess the performance of the proposed methods, a Monte Carlo simulation was conducted using the R statistical environment (R Core Team, 2013). The following simulation parameters were varied: skewness of the explanatory variable ($\gamma_X$), magnitude of correlation (ρ), sample size (n), and type of error distribution. These experimental factors are described in the following paragraphs.

Skewness of the explanatory variable. The explanatory variable X followed a gamma distribution with pre-specified shape and scale parameters c and d, respectively. The scale parameter was fixed at d = 1 throughout the simulation study. The shape parameter c, which is related to the skewness of X through $\gamma_X = 2 / \sqrt{c}$ (see, for example, Evans, Hastings, & Peacock, 2000), was chosen to obtain the desired skewness values of $\gamma_X$ = 0.5, 1, 2, 3, and 4.

Magnitude of correlation. To ensure that the predictor (X) and the response variable (Y) show the desired magnitude of correlation, Y was generated according to Equation (1), i.e., $Y = b_{YX} X + \varepsilon_{YX}$ constitutes the true underlying model throughout the study. The slope parameter (b) was obtained by

$$b = \frac{\rho \, \sigma_{\varepsilon}}{\sigma_X \sqrt{1 - \rho^2}}$$

and chosen to generate the desired correlations of ρ = 0 to 0.9 in steps of 0.1. Here, $\sigma^2_{\varepsilon}$ denotes the residual variance, which was set to 1.

Distribution of the residuals. The parametric bootstrap test relies on the assumption of normally distributed residuals. To analyze the behavior of the procedures when this assumption is violated, we considered two cases for the distribution of the residuals. In case A, the residuals followed a standard normal distribution N(0, 1), which is congruent with the distributional assumption of the parametric test. In case B, to mimic a symmetrically distributed error which violates the assumption of normality, residuals were sampled from a Laplace distribution with zero mean and unit variance L(0, 1). Laplace distributed variates are expected to show a skewness of 0 and a kurtosis of 6 (Evans, Hastings, & Peacock, 2000).

Number of observations. Sample sizes (n) were 50, 100, 150, 200, 250, and 500.

For each pair of variables X and Y, two linear regression models were estimated according to Equations (1) and (2). In the first model Y was regressed on X, which corresponds to the true data generating process (i.e., the true model). In the second model X was regressed on Y, which contradicts the true underlying model. Next, the residuals of the competing models ($\varepsilon_{YX}$ and $\varepsilon_{XY}$) were evaluated using the parametric bootstrap test, the nonparametric bootstrap test, and the permutation test. The number of resamplings for each test was set to m = 2000 to obtain reliable estimates for the significance of skewness deviations. For each experimental cell of the 5 (skewness of X) × 10 (magnitude of correlation) × 6 (sample size) × 2 (distribution of the residuals) design, 1000 repetitions were realized. The rates of false rejections of the true model (i.e., $H_0: \gamma_{\varepsilon_{YX}} = 0$ is rejected) and the rates of correctly rejecting the false model (i.e., $H_0: \gamma_{\varepsilon_{XY}} = 0$ is rejected) were chosen as descriptive measures for each combination of simulation parameters. We expected the first rate (the Type I error rate) to be near the chosen nominal significance level (α), which was set at α = .05. The second rate corresponds to the power of the procedure to identify the true regression model. To evaluate the robustness of Type I error rates, a 20 % robustness criterion was chosen. Thus, given α = .05, a test was considered robust if the empirical Type I error rates do not exceed the interval of 4–6 %.

To evaluate the influence of simulation parameters on the test performances, standard ANOVA techniques were applied. Instead of a classical test statistic (such as a t- or F-value) the proposed tests provide p-values based on the resampling procedures. Thus, logit-transformed p-values (i.e., $\ln[p / (1 - p)]$) were used as the dependent variable to avoid floor/ceiling effects due to the bounded interval of [0, 1].¹ The implemented R program is freely available upon request. The source codes for the proposed tests as well as an example of application are given in the appendix.

¹ We also performed logistic regressions using the statistical decision of a test (i.e., a binary indicator whether the null hypothesis is accepted or rejected) as the dependent variable. No noteworthy differences resulted between the results of the logistic regression and the ANOVA models. Results of the logistic regression can be obtained from the authors upon request.

Results

Case A: Normally-distributed residuals

Type I Error

Figure 1 gives the Type I error rates (i.e., rejecting the null hypothesis of residual symmetry of the correctly specified regression model) of the three procedures as a function of the correlation (ρ) and the skewness of X. Overall, the Type I error rates do not systematically vary by magnitude of correlation and level of skewness. The permutation test tends to suggest overly liberal decisions for all simulated scenarios. That is, the Type I error rates are constantly above the robustness boundary of 6 %. As expected, the parametric bootstrap test performs best in protecting the nominal significance level. In the majority of cases the Type I error rates of the nonparametric bootstrap test also lie within the selected robustness interval of 4–6 %. Figure 2 shows the effect of sample size on empirical Type I error rates of the three tests. Non-robustness of the permutation test is even more pronounced for small sample sizes. Type I error rates of the bootstrap procedures lie within the robustness interval and, again, the parametric bootstrap test outperforms the nonparametric bootstrap test.
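The data-generating design described in the Methods section can be sketched as follows. This is an illustrative Python sketch rather than the authors' R program; it assumes a residual variance of 1, uses the standard gamma relation skewness = 2/√c, and generates Laplace variates as the difference of two exponential variates. The function names are hypothetical.

```python
import math
import random


def gamma_shape_for_skewness(skew):
    """Shape c of a gamma distribution with skewness 2 / sqrt(c)."""
    return 4.0 / skew ** 2


def slope_for_correlation(rho, shape, error_var=1.0):
    """Slope b giving correlation rho when Var(X) = shape (scale d = 1)."""
    return rho * math.sqrt(error_var / (shape * (1.0 - rho ** 2)))


def laplace_unit_variance(rng=random):
    """Laplace(0, s) with s = 1/sqrt(2), i.e. variance 2 * s**2 = 1."""
    scale = 1.0 / math.sqrt(2.0)
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def simulate_pair(n, skew, rho, error="normal", rng=random):
    """One sample from the true model Y = b * X + error."""
    shape = gamma_shape_for_skewness(skew)
    b = slope_for_correlation(rho, shape)
    x = [rng.gammavariate(shape, 1.0) for _ in range(n)]
    if error == "normal":
        e = [rng.gauss(0.0, 1.0) for _ in range(n)]      # case A
    else:
        e = [laplace_unit_variance(rng) for _ in range(n)]  # case B
    y = [b * xi + ei for xi, ei in zip(x, e)]
    return x, y


# One experimental cell: skewness 2, rho = 0.5, n = 100, Laplace errors.
x, y = simulate_pair(100, skew=2.0, rho=0.5, error="laplace",
                     rng=random.Random(2013))
print(len(x), len(y))
```

Fixing the error variance and solving for the slope keeps the error distribution identical across cells, so that only the skewness of X and the correlation change.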

Figure 1: Type I error rates for $H_0: \gamma_{\varepsilon_{YX}} = 0$ as a function of skewness and correlation (simulating normally distributed residuals).

Figure 2: Type I error rates for $H_0: \gamma_{\varepsilon_{YX}} = 0$ as a function of sample size (simulating normally distributed residuals).

To analyze the sensitivity of the tests to the simulation factors we employed ANOVAs using the logit-transformed p-values of each test as the dependent variable. The skewness of X ($\gamma_X$), magnitude of correlation (ρ), and sample size (n) were selected as independent variables. According to the simulation setup each of the 5 × 10 × 6 = 300 experimental cells contained 1000 observations. Because of the large number of observations in each cell, we will focus on partial η² values instead of p-values. Details of the ANOVA model are given in Table 1. Overall, R² values as a measure of model fit are close to zero for all ANOVAs. In addition, partial η² estimates suggest that the experimental factors of the simulation do not affect the distribution of logit values and, thus, explain close to nothing of the variation in logits (all partial η² ≤ 0.001). The null hypothesis of the ANOVA states that observations across groups are randomly sampled values of the same underlying population, which is reflected in the equality of means. The current results suggest that average logit values of each test do not systematically vary across experimental conditions. In other words, the test statistics are not affected by the simulation factors. For the correctly specified regression model we conclude that the tests reject the null hypothesis of residual symmetry only by chance, at a rate according to the nominal significance level, regardless of sample size, skewness level of X, and magnitude of correlation.
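The two quantities used in these ANOVAs can be written down compactly. The sketch below is illustrative: the article does not state how resampling p-values of exactly 0 or 1 were handled before the logit transformation, so the clipping to [1/(2m), 1 − 1/(2m)] is an assumption made here; partial η² is computed from its usual definition as SS_effect / (SS_effect + SS_error).

```python
import math


def logit(p, m=2000):
    """ln(p / (1 - p)) of a resampling p-value based on m resamples.

    Clipping to [1/(2m), 1 - 1/(2m)] is an assumption to keep the
    transformation finite when p is exactly 0 or 1.
    """
    eps = 1.0 / (2 * m)
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def partial_eta_squared(ss_effect, ss_error):
    """Share of effect-plus-error variation attributable to the effect."""
    return ss_effect / (ss_effect + ss_error)


print(logit(0.05), logit(0.5), logit(0.95))
```

Smaller logit values correspond to smaller p-values, which is why decreasing average logits indicate increasing rejection rates.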

Table 1: ANOVA results for the Type I error simulation with normally distributed error terms.
Columns: Source | df | Type III Sum of Squares | Mean Squares | F-value | p-value | partial η²
Blocks (R² reported per block): parametric bootstrap; nonparametric bootstrap; permutation
Rows within each block: Correlation; Skewness; Correlation × Skewness; Sample Size; Correlation × Sample Size; Skewness × Sample Size; Correlation × Sample Size × Skewness

Power

Next, we consider the power of the proposed procedures (i.e., the rejection rates of the null hypothesis of residual symmetry of the mis-specified model). Figure 3 summarizes the power properties of the parametric bootstrap, the nonparametric bootstrap, and the permutation test. Overall, the power of all tests increases with the skewness of X and the number of observations (n). In addition, for a fixed level of skewness the power of the tests decreases with the magnitude of correlation between X and Y (Figure 4 shows this effect more explicitly). For highly skewed explanatory variables and highly correlated variables the parametric bootstrap test outperforms the nonparametric bootstrap and the permutation test. For more symmetric explanatory variables, the three significance tests turn out to be equally powerful.

Again, to further explore the sensitivity of the three tests, ANOVAs were performed using the logit-transformed p-values of the corresponding significance tests as the dependent variable. Results are summarized in Table 2. Overall, model fit estimates varied from R² = 0.73 to 0.77 depending on the significance test. The largest effects can be observed for the magnitude of correlation (partial η² values range from 0.59 to 0.63) and the level of skewness of X (all partial η² values > 0.4). Partial effect size estimates for sample size varied between 0.17 and [value missing]. Average logit values decrease with the skewness of X and the number of observations (n). Due to the relation logit(p) = ln[p / (1 − p)], smaller logit values represent smaller resampling p-values and, thus, more power. Conversely, average logit values increase with the magnitude of correlation, which results in higher resampling p-values and less power.

In addition, the two-way interactions correlation × skewness, correlation × sample size, and skewness × sample size are statistically meaningful (partial η² range from 0.02 to 0.12). For a fixed level of correlation, average logit values decrease with the skewness of X (i.e., the power increases). However, with increasing magnitude of correlation the average logit values increase, i.e., an increasing magnitude of correlation decreases the power of the tests. Similarly, average logit values decrease with increasing sample size for a fixed level of ρ. In other words, all tests are more powerful for larger samples. However, for a fixed number of observations the average logit values again increase as a function of ρ (i.e., the power of the tests is reduced for highly correlated variables). The meaningful two-way interaction skewness × sample size suggests that the systematic decrease in average logits is even more pronounced for large sample sizes.

Figure 3: Observed power for $H_0: \gamma_{\varepsilon_{XY}} = 0$ as a function of sample size, correlation, and skewness (simulating normally distributed residuals). Circles represent the nonparametric bootstrap, triangles the parametric bootstrap, and squares the permutation test.

Figure 4: Observed power for $H_0: \gamma_{\varepsilon_{XY}} = 0$ as a function of correlation (simulating normally distributed residuals).

Finally, ANOVAs reveal significant three-way interactions (correlation × sample size × skewness) for all significance tests (partial η² = [value missing]). For a given fixed level of skewness the power of each test decreases with increasing correlation between X and Y. An increase in the skewness of X or an increase in sample size counters this pattern and can keep the power of the test at a sufficient level even for higher correlations (Figure 3). This pattern is about the same for each test and can be expected from the result given in Equation (10).
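Why this pattern follows from Equation (10) can be made explicit with a short derivation. The steps below are a sketch under the assumptions stated in the text (zero intercepts, X independent of the true-model error, symmetric true-model error); the notation $\mu_3(\cdot)$ for the third central moment and the closed form in the last line are introduced here for illustration and do not appear in the original.

```latex
\begin{align*}
\varepsilon_{XY} &= (1-\rho^2)\,X - b_{XY}\,\varepsilon_{YX}
  && \text{(mis-specified residual)} \\
\mu_3(\varepsilon_{XY}) &= (1-\rho^2)^3\,\mu_3(X)
  && \text{(since } \mu_3(\varepsilon_{YX}) = 0\text{)} \\
\sigma^2_{\varepsilon_{XY}} &= (1-\rho^2)^2\sigma_X^2 + b_{XY}^2\,\sigma^2_{\varepsilon_{YX}}
  = (1-\rho^2)\,\sigma_X^2
  && \text{(using } b_{XY} = \rho\,\sigma_X/\sigma_Y,\ \sigma^2_{\varepsilon_{YX}} = (1-\rho^2)\,\sigma_Y^2\text{)} \\
\gamma_{\varepsilon_{XY}} &= \frac{(1-\rho^2)^3\,\mu_3(X)}{(1-\rho^2)^{3/2}\,\sigma_X^3}
  = (1-\rho^2)^{3/2}\,\gamma_X
\end{align*}
```

For example, with $\gamma_X = 2$ the residual skewness of the mis-specified model is about 1.30 at ρ = 0.5 but only about 0.17 at ρ = 0.9, so a high correlation leaves little asymmetry to detect unless X is strongly skewed or the sample is large.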

Table 2: ANOVA results for the power simulation with normally distributed error terms.
Columns: Source | df | Type III Sum of Squares | Mean Squares | F-value | p-value | partial η²
Blocks (R² reported per block): parametric bootstrap; nonparametric bootstrap; permutation
Rows within each block: Correlation; Skewness; Correlation × Skewness; Sample Size; Correlation × Sample Size; Skewness × Sample Size; Correlation × Sample Size × Skewness
Each p-value entry is reported with a "<" sign.

Case B: Laplace-distributed residuals

Type I Error

So far, we focused on cases where the assumption of normality of residuals is fulfilled. The parametric bootstrap test relies on this assumption. Now, we focus on scenarios where residuals are drawn from a symmetric but non-normal distribution (i.e., Laplace-distributed residuals). Here, we expect the nonparametric bootstrap test and the permutation test to show better robustness and power properties. Figure 5 gives the Type I error rates (i.e., rejection rates of the null hypothesis of residual symmetry of the correctly specified model) for the three tests as a function of correlation (ρ) and skewness ($\gamma_X$). In a strict sense, all tests fail to meet the robustness criterion of 4–6 %. However, departures from the 6 % boundary are small for the nonparametric bootstrap test and the permutation test. As expected, non-robustness is more pronounced for the parametric bootstrap test (Type I error rates are > 24 % for all scenarios). Overall, the magnitude of Type I errors does not depend on the skewness of X or the magnitude of correlation between X and Y.

Figure 5: Type I error rates for $H_0: \gamma_{\varepsilon_{YX}} = 0$ as a function of skewness and correlation (simulating Laplace distributed residuals).

Figure 6 gives the Type I error rates as a function of sample size. The performance of the permutation test improves with increasing sample size and meets the robustness criterion for n > 150. In contrast, non-robustness of the parametric bootstrap test tends to increase even further for larger sample sizes.

Figure 6: Type I error rates for $H_0: \gamma_{\varepsilon_{YX}} = 0$ as a function of sample size (simulating Laplace distributed residuals).

Table 3 gives the ANOVA results, again using the logit-transformed p-values as the dependent variable. R² values as well as partial η² estimates were generally close to zero in all models. This again suggests that the factors of the simulation do not affect average logits of the parametric bootstrap, the nonparametric bootstrap, and the permutation test.
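The 20 % robustness criterion applied to the empirical Type I error rates can be expressed as a small helper. This is an illustrative sketch; the function names are hypothetical, and the p-values in the usage line are made-up placeholders rather than simulation results.

```python
def rejection_rate(p_values, alpha=0.05):
    """Proportion of resampling p-values below the nominal level."""
    return sum(p < alpha for p in p_values) / len(p_values)


def robustness_interval(alpha=0.05, tolerance=0.20):
    """Interval alpha * (1 -/+ tolerance); 4 % to 6 % for alpha = .05."""
    return alpha * (1.0 - tolerance), alpha * (1.0 + tolerance)


def is_robust(rate, alpha=0.05, tolerance=0.20):
    """True if the empirical Type I error rate lies inside the interval."""
    lower, upper = robustness_interval(alpha, tolerance)
    return lower <= rate <= upper


# Placeholder p-values from a hypothetical null simulation cell.
example = [0.01, 0.20, 0.03, 0.50]
print(rejection_rate(example), is_robust(rejection_rate(example)))
```

A rate of 24 %, as reported for the parametric bootstrap test under Laplace errors, falls far outside this interval, whereas rates slightly above 6 % fail the criterion only narrowly.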

Table 3: ANOVA results for the Type I error simulation with Laplace distributed error terms.
Columns: Source, df, Type III Sum of Squares, Mean Squares, F-value, p-value, partial η².
Sources, listed separately for the parametric bootstrap, the nonparametric bootstrap, and the permutation test: Correlation, Skewness, Correlation × Skewness, Sample Size, Correlation × Sample Size, Skewness × Sample Size, Correlation × Sample Size × Skewness.

Power

Finally, we ask whether power functions of the proposed tests are affected by non-normally distributed residuals. Again, the power of the tests constitutes the rejection rate of the null hypothesis of residual symmetry of the mis-specified model. Figure 7 shows the probability of rejecting the null hypothesis of symmetry of residuals as a function of sample size, correlation, and skewness of the explanatory variable. Results of the Type I error rates suggest that the three tests are unable to protect the nominal significance level according to the 20 % robustness criterion. In particular, the parametric bootstrap test shows empirical Type I error rates larger than 24 % (instead of 5 %). Thus, the power functions of the three procedures are not comparable in the usual sense. In this case, Zhang and Boos (1994) suggest the comparison of adjusted power estimates. Following this suggestion, we estimated the adjusted nominal significance level (i.e., the critical p-value), which corresponds to the 95th quantile of the empirically observed distribution of resampling p-values for all three tests under the null hypothesis of indistinguishable regression models. Figure 7 shows the power using this adjusted level (instead of 5 %) as the nominal significance level for the considered sample sizes, magnitudes of correlation, and skewness of the explanatory variable.

Generally, patterns do not differ from those observed in the case of normally distributed residuals. Again, the power increases with the skewness of the explanatory variable and the number of observations. Adjusted power estimates are inversely related to the magnitude of correlation, i.e., the three significance tests lose power with increasing ρ (see also Figure 8). Both the permutation test and the nonparametric bootstrap test outperform the parametric bootstrap test, as expected. This power superiority is more pronounced for more symmetric explanatory variables.

ANOVAs were again performed using the logit-transformed p-values as the dependent variable to further explore the sensitivity of the three tests. Results are summarized in Table 4. Overall, model fit estimates varied from R² = 0.64 to […]. Again, the strongest effects were observed for the magnitude of correlation between X and Y (partial η² ranging from 0.45 to 0.63) and the skewness of the explanatory variable X (all partial η² > 0.35). Effect size estimates for the sample size factor varied from 0.11 to […]. The average logit values decrease with increasing skewness and increasing sample size.
Conversely, average logit values increase with the underlying correlation between X and Y. In addition, small effect sizes were observed for the two-way interactions correlation × skewness (partial η² = […]), correlation × sample size (partial η² = […]), and skewness × sample size (all partial η² ~ 0.03). Again, average logits decrease as the skewness of X and the number of observations increase. However, for a fixed degree of skewness or a fixed number of observations, average logits increase as a function of the magnitude of correlation, which means that the tests lose power with increasing correlation. Conversely, the power of the tests increases with increasing skewness and sample size. The effect size estimates for the three-way interaction correlation × sample size × skewness varied from 0.05 to […]. Again, larger sample sizes together with highly skewed explanatory variables restore the power even for highly correlated data.
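The idea of size-adjusted power discussed in this section can be sketched in a few lines. This is a hedged illustration, not the procedure used in the study: it assumes that the adjusted critical p-value is the cutoff below which 5 % of the null p-values fall, so that the rejection rate under the null equals the nominal level. How this cutoff maps onto the quantile wording in the text is an assumption, and the names `empirical_quantile` and `adjusted_power` are hypothetical.

```python
def empirical_quantile(values, q):
    """Empirical quantile using linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] + frac * (ordered[hi] - ordered[lo])


def adjusted_power(null_p_values, alt_p_values, nominal=0.05):
    """Return (critical p-value, adjusted power).

    null_p_values: p-values simulated under the null hypothesis.
    alt_p_values:  p-values simulated under the alternative.
    Assumption: the critical p-value is the `nominal` quantile of the
    null p-values, so a liberal test receives a stricter cutoff.
    """
    critical = empirical_quantile(null_p_values, nominal)
    rejections = sum(p <= critical for p in alt_p_values)
    return critical, rejections / len(alt_p_values)


# Made-up p-values for illustration only.
null_p = [i / 100 for i in range(1, 101)]
alt_p = [0.001, 0.01, 0.03, 0.2, 0.5]
print(adjusted_power(null_p, alt_p))
```

A test with inflated Type I error rates produces too many small p-values under the null, so its adjusted cutoff falls below 5 % and its power is judged against that stricter level.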

Figure 7: Adjusted power as a function of sample size, correlation, and skewness (simulating Laplace distributed residuals). Circles represent the nonparametric bootstrap, triangles the parametric bootstrap, and squares the permutation test.

Figure 8: Adjusted power as a function of correlation (simulating Laplace distributed residuals).

Comparison of statistical decisions

Applying the proposed significance tests in practice involves statistical decisions concerning the skewness estimate of the residuals of one model (e.g., Y is regressed on X) and statistical inference on the skewness estimate of the residuals of the competing model (i.e., X is regressed on Y). Throughout the study, the first model constitutes the true data-generating process. In this case we would expect 1) that the null hypothesis for the true model is retained and 2) that the corresponding null hypothesis for the mis-specified model is rejected. Thus, in the following section we are interested in the behavior of the tests considering the combined statistical decisions for both null hypotheses. In particular, we focus on the power of the tests in terms of model selection (i.e., retaining the null hypothesis for the true model and simultaneously rejecting the null hypothesis for the mis-specified model), and we focus on those outcomes that do not enable researchers to draw conclusions concerning the direction of the observed effect, i.e., both null hypotheses are retained or both null hypotheses are rejected.

Table 4: ANOVA results for the power simulation with Laplace distributed error terms.
Columns: Factor, df, Type III Sum of Squares, Mean Squares, F-value, p-value, partial η².
Factors, listed separately for the parametric bootstrap, the nonparametric bootstrap, and the permutation test: Correlation, Skewness, Correlation × Skewness, Sample Size, Correlation × Sample Size, Skewness × Sample Size, Correlation × Sample Size × Skewness.

Figure 9 shows the power of the three tests in terms of identifying the true model as a function of sample size, magnitude of correlation, and skewness of X, simulating normally distributed residuals. Each line represents the proportion of correct model selections based on the combined inference concerning both null hypotheses. The power to identify the true model again increases with sample size and skewness of X. Again, power decreases with the magnitude of the correlation. However, high correlations together with high skewness and larger sample sizes conserve the power of the procedure. Generally, the parametric bootstrap test shows a power advantage for highly skewed predictors and highly correlated variables. For example, consider the case of n = 200, ρ = 0.8, and skewness of X = 4: Here, the parametric bootstrap test is able to select the correct model in 93.2 % of the simulated samples, whereas the nonparametric bootstrap test and the permutation test identify the correct model in 83.3 % and 80.4 % of the simulated samples, respectively.

Figure 10 shows the (unadjusted) power functions in terms of model selection, simulating Laplace distributed residuals. Overall, power increases with sample size and skewness of X and decreases with the magnitude of correlation. In the majority of the simulated scenarios, the parametric bootstrap test is less powerful than the nonparametric bootstrap and permutation tests. A reversed picture can be observed for highly correlated variables (see ρ > 0.7). Here, the parametric bootstrap test seems to be more powerful than the two competitors. However, this power advantage must be interpreted with caution, for two reasons: First, Figures 5 and 6 reveal that the Type I error rates of the parametric test are far above the nominal significance level. In other words, statistical decisions concerning residual symmetry of the correctly specified regression model are too liberal, which also biases statistical decisions in terms of model selection.
Second, the comparison of the unadjusted power functions of the three tests for Laplace distributed residuals suggests that the power loss of the parametric test is less pronounced for highly correlated variables. This scenario also conserves the power in terms of combined inference. In line with suggestions of Zhang and Boos (1994), unadjusted power functions are not shown here (these results can be obtained from the first author upon request).

Next, we focus on those outcomes where no distinct decision about causation can be made. Tables 5 and 6 give the relative frequencies for those cases where both significance tests lead to a statistically significant result. Findings are presented for the sample sizes n = 50, 150, 250, and 500 and ρ = 0, 0.2, 0.4, 0.6, and 0.8. Results of the remaining experimental conditions are very similar to the presented findings. For normally distributed residuals (see Table 5), the percentages of rejecting both null hypotheses across all simulated samples are 3.62 %, 4.00 %, and 4.87 % for the parametric bootstrap, the nonparametric bootstrap, and the permutation test, respectively. In general, these percentages decrease with increasing correlation between X and Y and increase with sample size n and skewness of X.

Table 6 shows the results for Laplace distributed residuals. Overall, the percentages of combined rejection of zero skewness are 4.55 % and 3.88 % for the nonparametric bootstrap test and the permutation test, respectively. In contrast, due to the heavily inflated Type I error rates of the parametric bootstrap test, we observe a high percentage of combined rejections of […] %. Again, for all tests the percentages of combined rejections decrease with increasing correlation. Conversely, an increase of combined rejections can be observed as a function of the skewness of X. For the parametric bootstrap and the nonparametric bootstrap we also observe an increase of combined rejections for larger sample sizes. In contrast, the percentages of combined rejections of the permutation test slightly decline with increasing sample size.
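The combined decision logic described in this section can be summarized as a small classification routine. This is a minimal sketch with hypothetical names (`classify`, `decision_rates`) and made-up p-values; the label "reversed selection" for the remaining fourth outcome is an assumption and is not a term used in the text.

```python
from collections import Counter

OUTCOMES = ("correct selection", "both rejected", "both retained", "reversed selection")


def classify(p_true, p_mis, alpha=0.05):
    """Classify one sample from the p-values of the two symmetry tests.

    p_true: p-value for the residuals of the true model (Y regressed on X).
    p_mis:  p-value for the residuals of the mis-specified model (X regressed on Y).
    """
    reject_true = p_true <= alpha
    reject_mis = p_mis <= alpha
    if not reject_true and reject_mis:
        return "correct selection"   # retain for true model, reject for mis-specified
    if reject_true and reject_mis:
        return "both rejected"       # no decision about direction possible
    if not reject_true and not reject_mis:
        return "both retained"       # combined Type II error
    return "reversed selection"      # remaining case (assumed label)


def decision_rates(pairs, alpha=0.05):
    """Relative frequency of each outcome over (p_true, p_mis) pairs."""
    counts = Counter(classify(pt, pm, alpha) for pt, pm in pairs)
    return {outcome: counts[outcome] / len(pairs) for outcome in OUTCOMES}


# Made-up p-value pairs for illustration only.
pairs = [(0.40, 0.01), (0.02, 0.03), (0.30, 0.20), (0.01, 0.50)]
print(decision_rates(pairs))
```

In this reading, the rate of "correct selection" corresponds to the power in terms of model selection, while "both rejected" and "both retained" are the two inconclusive outcomes tabulated in this section.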

Figure 9: Power in terms of model selection (i.e., retaining the null hypothesis for the true model and rejecting the null hypothesis for the mis-specified model) as a function of sample size, correlation, and skewness (simulating normally distributed residuals). Circles represent the nonparametric bootstrap, triangles the parametric bootstrap, and squares the permutation test.

Figure 10: Power in terms of model selection (i.e., retaining the null hypothesis for the true model and rejecting the null hypothesis for the mis-specified model) as a function of sample size, correlation, and skewness (simulating Laplace distributed residuals). Circles represent the nonparametric bootstrap, triangles the parametric bootstrap, and squares the permutation test.
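The model selection results in Figures 9 and 10 rest on the property that the residuals of the mis-specified model are more skewed than those of the true model. The following small simulation is an illustrative sketch of that property only, not a reproduction of the study design: the exponential predictor, the slope of 0.6, the sample size, and the seed are all assumptions chosen for the example.

```python
import random


def skewness(values):
    """Sample skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def ols_residuals(response, predictor):
    """Residuals of a simple least-squares regression of response on predictor."""
    n = len(response)
    mx = sum(predictor) / n
    my = sum(response) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(predictor, response))
    sxx = sum((x - mx) ** 2 for x in predictor)
    slope = sxy / sxx
    intercept = my - slope * mx
    return [y - intercept - slope * x for x, y in zip(predictor, response)]


rng = random.Random(1)                               # fixed seed (assumed)
n = 5000
x = [rng.expovariate(1.0) for _ in range(n)]         # skewed predictor (skewness 2)
y = [0.6 * xi + rng.gauss(0.0, 1.0) for xi in x]     # true model with normal errors

skew_true = skewness(ols_residuals(y, x))            # Y regressed on X
skew_mis = skewness(ols_residuals(x, y))             # X regressed on Y
print(skew_true, skew_mis)
```

With a skewed predictor and symmetric errors, the residual skewness of the regression of Y on X stays close to zero, whereas the regression of X on Y leaves clearly skewed residuals.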

Table 5: Relative frequencies of events where the skewness estimates of both error terms significantly deviate from zero (simulating normally distributed error terms).
Columns: parametric bootstrap test, nonparametric bootstrap test, and permutation test, each broken down by sample size n. Rows: correlation ρ within each level of skewness of X.

Table 6: Relative frequencies of events where the skewness estimates of both error terms significantly deviate from zero (simulating Laplace distributed error terms).
Columns: parametric bootstrap test, nonparametric bootstrap test, and permutation test, each broken down by sample size n. Rows: correlation ρ within each level of skewness of X.

Next, we are interested in the second scenario where no distinct decision about the direction of the effect can be made, i.e., the case in which both significance tests suggest retaining the null hypotheses of zero skewness of residuals. These rates can be interpreted as the combined Type II error rates. Table 7 shows the combined Type II error rates for the case of normally distributed residuals. Across all generated samples, the percentages of a combined Type II error are 20.45 %, […] %, and […] % for the parametric bootstrap, the nonparametric bootstrap, and the permutation test, respectively (which is in line with the

