Clustered standard errors and hypothesis tests in fixed effects models

James E. Pustejovsky

UPDATED April 09, 2024 to use current syntax for constraints argument in clubSandwich::Wald_test().

I’ve recently been working with my colleague Beth Tipton on methods for cluster-robust variance estimation in the context of some common econometric models, focusing in particular on fixed effects models for panel data—or what statisticians would call “longitudinal data” or “repeated measures.” We have a new working paper, which you can find here.

The importance of using CRVE (i.e., “clustered standard errors”) in panel models is now widely recognized. Less widely recognized, perhaps, is the fact that standard methods for constructing hypothesis tests and confidence intervals based on CRVE can perform quite poorly in when you have only a limited number of independent clusters. What’s worse, it can be hard to determine what counts as a large-enough sample to trust standard CRVE methods, because the finite-sample behavior of the variance estimators and test statistics depends on the configuration of the covariates, not just the total sample size. For example, suppose you have state-level panel data from 50 states across 15 years and are trying to estimate the effect of some policy using difference-in-differences. If only 5 or 6 states have variation in the policy variable over time, then you’re almost certainly in small-sample territory. And the sample size issues can be subtler than this, too, as I’ll show below.

One solution to this problem is to use bias-reduced linearization (BRL), which was proposed by Bell and McCaffrey (2002) and has recently begun to receive attention from econometricians (e.g., Cameron & Miller, 2015; Imbens & Kolesar, 2015). The idea of BRL is to correct the bias of standard CRVE based on a working model, and then to use a degrees-of-freedom correction for Wald tests based on the bias-reduced CRVE. That may seem silly (after all, the whole point of CRVE is to avoid making distributional assumptions about the errors in your model), but it turns out that the correction can help quite a bit, even when the working model is wrong. The degrees-of-freedom correction is based on a standard Satterthwaite-type approximation, and also relies on the working model. There’s now quite a bit of evidence (which we review in the working paper) that BRL performs well even in samples with a small number of clusters.

In the working paper, we make two contributions to all this:

One problem with Bell and McCaffrey’s original formulation of BRL is that it does not work in some very common models for panel data, such as state-by-year panels that include fixed effects for each state and each year (Angrist and Pischke, 2009, point out this issue in their chapter on “non-standard standard error issues”). We propose a generalization of BRL that works even in models with arbitrary sets of fixed effects. We also address how to calculate the correction when the regression is fit using the “within” estimator, after absorbing the fixed effects.
We propose a method for testing hypotheses that involve multiple parameter constraints (which, in classical linear regression, you would test with an F statistic). The method involves approximating the distribution of the cluster-robust Wald statistic using Hotelling’s T-squared distribution (a multiple of an F distribution), where the denominator degrees of freedom are estimated based on the working model. For one-parameter constraints, the test reduces to a t-test with Satterthwaite degrees of freedom, and so it is a natural extension of the existing BRL methods.

The paper explains all this in greater detail, and also reports a fairly extensive simulation study that we designed to emuluate the types of covariates and study designs encountered in micro-economic applications. We’ve also got an R package that implements our methods (plus some other variants of CRVE, which I’ll explain some other time) in a fairly streamlined way. Here’s an example of how to use the package to do inference for a fixed effects panel data model.

Effects of changing the minimum legal drinking age

Carpenter and Dobkin (2011) analyzed the effects of changes in the minimum legal drinking age on rates of motor vehicle fatalies among 18-20 year olds, using state-level panel data from the National Highway Traffic Administration’s Fatal Accident Reporting System. In their new textbook, Angrist and Pischke (2014) developed a stylized example based on Carpenter and Dobkin’s work. I’ll use Angrist and Pischke’s data and follow their analysis, just because their data are easily available.

The outcome is the incidence of deaths in motor vehicle crashes among 18-20 year-olds (per 100,000 residents), for each state plus the District of Columbia, over the period 1970 to 1983. Tthere were several changes in the minimum legal drinking age during this time period, with variability in the timing of changes across states. Angrist and Pischke (following Carpenter and Dobkin) use a difference-in-differences strategy to estimate the effects of lowering the minimum legal drinking age from 21 to 18. A basic specification is

\[y_{it} = \alpha_i + \beta_t + \gamma r_{it} + \epsilon_{it},\]

for \(i\) = 1,…,51 and \(t\) = 1970,…,1983. In this model, \(\alpha_i\) is a state-specific fixed effect, \(\beta_t\) is a year-specific fixed effect, \(r_{it}\) is the proportion of 18-20 year-olds in state \(i\) in year \(t\) who are legally allowed to drink, and \(\gamma\) captures the effect of shifting the minimum legal drinking age from 21 to 18. Following Angrist and Pischke’s analysis, I’ll estimate this model both by (unweighted) OLs and by weighted least squares with weights corresponding to population size in a given state and year.

Unweighted OLS

The following code does some simple data-munging and the estimates the model by OLS:

Code

# get data from Angrist & Pischke's website
library(foreign)
deaths <- read.dta("http://masteringmetrics.com/wp-content/uploads/2015/01/deaths.dta", convert.factors=FALSE)

# subset for 18-20 year-olds, deaths in motor vehicle accidents
MVA_deaths <- subset(deaths, agegr==2 & dtype==2 & year <= 1983, select = c(-dtype, -agegr))

# fit by OLS
lm_unweighted <- lm(mrate ~ 0 + legal + factor(state) + factor(year), data = MVA_deaths)

The coef_test function from clubSandwich can then be used to test the hypothesis that changing the minimum legal drinking age has no effect on motor vehicle deaths in this cohort (i.e., \(H_0: \gamma = 0\)). The usual way to test this is to cluster the standard errors by state, calculate the robust Wald statistic, and compare that to a standard normal reference distribution. The code and results are as follows:

Code

# devtools::install_github("jepusto/clubSandwich") # install the clubSandwich package
library(clubSandwich)
coef_test(lm_unweighted, vcov = "CR1", cluster = MVA_deaths$state, test = "z")["legal",]

 Coef. Estimate   SE t-stat d.f. (z) p-val (z) Sig.
 legal     7.59 2.38   3.19      Inf   0.00143   **

Our work argues shows that a better approach would be to use the bias-reduced linearization CRVE, together with Satterthwaite degrees of freedom. In the package, the BRL adjustment is called “CR2” because it is directly analogous to the HC2 correction used in heteroskedasticity-robust variance estimation. When applied to an OLS model estimated by lm, the default working model is an identity matrix, which amounts to the “working” assumption that the errors are all uncorrelated and homoskedastic. Here’s how to apply this approach in the example:

Code

coef_test(lm_unweighted, vcov = "CR2", cluster = MVA_deaths$state, test = "Satterthwaite")["legal",]

 Coef. Estimate   SE t-stat d.f. (Satt) p-val (Satt) Sig.
 legal     7.59 2.43   3.12        25.7      0.00442   **

The Satterthwaite degrees of freedom will be different for each coefficient in the model, and so the coef_test function reports them right alongside the standard error. In this case, the degrees of freedom are about half of what you might expect, given that there are 51 clusters. The p-value for the CR2+Satterthwaite test is about twice as large as the p-value based on the standard Wald test. But of course, the coefficient is still statistically significant at conventional levels, and so the inference doesn’t change.

Unweighted “within” estimation

The plm package in R provides another way to estimate the same model. It is convenient because it absorbs the state and year fixed effects before estimating the effect of legal. The clubSandwich package works with fitted plm models too:

Code

library(plm)
plm_unweighted <- plm(mrate ~ legal, data = MVA_deaths, 
                      effect = "twoways", index = c("state","year"))
coef_test(plm_unweighted, vcov = "CR1S", cluster = "individual", test = "z")

 Coef. Estimate   SE t-stat d.f. (z) p-val (z) Sig.
 legal     7.59 2.38   3.19      Inf   0.00143   **

Code

coef_test(plm_unweighted, vcov = "CR2", cluster = "individual", test = "Satterthwaite")

 Coef. Estimate   SE t-stat d.f. (Satt) p-val (Satt) Sig.
 legal     7.59 2.43   3.12        25.7      0.00442   **

For the standard approach, I’ve used the variant of the correction factor implemented in Stata (called CR1S in the clubSandwich package), but this makes very little difference in the standard error or the p-value. For the test based on CR2, the degrees of freedom are slightly different than the results based on the fitted lm model, but the p-values agree to four decimals. The differences in degrees of freedom are due to numerical imprecision in the calculations.

Population-weighted estimation

The difference between the standard method and the new method are not terribly exciting in the above example. However, things change quite a bit if the model is estimated using population weights. As far as I know, plm does not handle weighted least squares, and so I go back to fitting in lm with dummies for all the fixed effects.

Code

lm_weighted <- lm(mrate ~ 0 + legal + factor(state) + factor(year), 
                  weights = pop, data = MVA_deaths)
coef_test(lm_weighted, vcov = "CR1", cluster = MVA_deaths$state, test = "z")["legal",]

 Coef. Estimate   SE t-stat d.f. (z) p-val (z) Sig.
 legal      7.5 2.16   3.47      Inf    <0.001  ***

Code

coef_test(lm_weighted, vcov = "CR2", cluster = MVA_deaths$state, test = "Satterthwaite")["legal",]

 Coef. Estimate  SE t-stat d.f. (Satt) p-val (Satt) Sig.
 legal      7.5 2.3   3.27        8.65       0.0103    *

Using population weights slightly reduces the point estimate of the effect, while also slightly increasing its precision. If you were following the standard approach, you would probably be happy with the weighted estimates and wouldn’t think about it any further. However, our recommended approach—using the CR2 variance estimator and Satterthwaite correction—produces a p-value that is an order of magnitude larger (though still significant at the conventional 5% level). The degrees of freedom are just 8.6—drastically smaller than would be expected based on the number of clusters.

Even with weights, the coef_test function uses an “independent, homoskedastic” working model as a default for lm objects. In the present example, the outcome is a standardized rate and so a better assumption might be that the error variances are inversely proportional to population size. The following code uses this alternate working model:

Code

coef_test(lm_weighted, vcov = "CR2", 
          cluster = MVA_deaths$state, target = 1 / MVA_deaths$pop, 
          test = "Satterthwaite")["legal",]

 Coef. Estimate  SE t-stat d.f. (Satt) p-val (Satt) Sig.
 legal      7.5 2.2   3.41          13      0.00467   **

The new working model leads to slightly smaller standard errors and a couple of additional degrees of freedom, though we remain in small-sample territory.

Robust Hausman test

CRVE is also used in specification tests, as in the Hausman-type test for endogeneity of unobserved effects. Suppose that the model includes an additional control for the beer taxation rate in state \(i\) at time \(t\), denoted \(s_{it}\). The (unweighted) fixed effects model is then

\[y_{it} = \alpha_i + \beta_t + \gamma_1 r_{it} + \gamma_2 s_{it} + \epsilon_{it},\]

and the estimated effects are as follows:

Code

lm_FE <- lm(mrate ~ 0 + legal + beertaxa + factor(state) + factor(year), data = MVA_deaths)
coef_test(lm_FE, vcov = "CR2", cluster = MVA_deaths$state, test = "Satterthwaite")[c("legal","beertaxa"),]

    Coef. Estimate   SE t-stat d.f. (Satt) p-val (Satt) Sig.
    legal     7.59 2.51  3.019       24.58      0.00583   **
 beertaxa     3.82 5.27  0.725        5.77      0.49663

If the unobserved effects \(\alpha_1,...,\alpha_{51}\) are uncorrelated with the regressors, then a more efficient way to estimate \(\gamma_1,\gamma_2\) is by weighted least squares, with weights based on a random effects model. However, if the unobserved effects covary with \(\mathbf{r}_i, \mathbf{s}_i\), then the random-effects estimates will be biased.

We can test for whether endogeneity is a problem by including group-centered covariates as additional regressors. Let \(\tilde{r}_{it} = r_{it} - \frac{1}{T}\sum_t r_{it}\), with \(\tilde{s}_{it}\) defined analogously. Now estimate the regression

\[y_{it} = \beta_t + \gamma_1 r_{it} + \gamma_2 s_{it} + \delta_1 \tilde{r}_{it} + \delta_2 \tilde{s}_{it} + \epsilon_{it},\]

which does not include state fixed effects. The parameters \(\delta_1,\delta_2\) represent the differences between the random effects and fixed effects estimands of \(\gamma_1, \gamma_2\). If these are both zero, then the random effects estimator is unbiased. Thus, the joint test for \(H_0: \delta_1 = \delta_2 = 0\) amounts to a test for non-endogeneity of the unobserved effects.

For efficiency, we should estimate this using weighted least squares, but OLS will work too:

Code

MVA_deaths <- within(MVA_deaths, {
  legal_cent <- legal - tapply(legal, state, mean)[factor(state)]
  beer_cent <- beertaxa - tapply(beertaxa, state, mean)[factor(state)]
})

lm_Hausman <- lm(mrate ~ 0 + legal + beertaxa + legal_cent + beer_cent + factor(year), data = MVA_deaths)
coef_test(lm_Hausman, vcov = "CR2", cluster = MVA_deaths$state, test = "Satterthwaite")[1:4,]

      Coef. Estimate   SE  t-stat d.f. (Satt) p-val (Satt) Sig.
      legal   -9.180 7.62 -1.2042       24.94       0.2398     
   beertaxa    3.395 9.40  0.3613        6.44       0.7295     
 legal_cent   16.768 8.53  1.9665       33.98       0.0575    .
  beer_cent    0.424 9.25  0.0458        5.86       0.9650

To conduct a joint test on the centered covariates, we can use the Wald_test function. The usual way to test this hypothesis would be to use the CR1 variance estimator to calculate the robust Wald statistic, then use a \(\chi^2_2\) reference distribution (or equivalently, compare a re-scaled Wald statistic to an \(F(2,\infty)\) distribution). The Wald_test function reports the latter version:

Code

Wald_test(
  lm_Hausman, 
  constraints = constrain_zero(c("legal_cent","beer_cent")), 
  vcov = "CR1", 
  cluster = MVA_deaths$state, 
  test = "chi-sq"
)

   test Fstat df_num df_denom  p_val sig
 chi-sq  2.93      2      Inf 0.0534   .

The test is just shy of significance at the 5% level. If we instead use the CR2 variance estimator and our newly proposed approximate F-test (which is the default in Wald_test), then we get:

Code

Wald_test(
  lm_Hausman, 
  constraints = constrain_zero(c("legal_cent","beer_cent")), 
  vcov = "CR2", 
  cluster = MVA_deaths$state
)

 test Fstat df_num df_denom p_val sig
  HTZ  2.57      2     12.4 0.117

The low degrees of freedom of the test indicate that we’re definitely in small-sample territory and should not trust the asymptotic \(\chi^2\) approximation.

References

Angrist, J. D., & Pischke, J.-S. (2009). Mostly harmless econometrics: An empiricist’s companion. Princeton, NJ: Princeton University Press.
Angrist, J. D. and Pischke, J.-S. (2014). Mastering ’metrics: The Path from Cause to Effect. Princeton, NJ: Princeton University Press.
Bell, R. M., & McCaffrey, D. F. (2002). Bias reduction in standard errors for linear regression with multi-stage samples. Survey Methodology, 28(2), 169-181.
Cameron, A. C., & Miller, D. L. (2015). A practitioner’s guide to cluster-robust inference. URL: http://cameron.econ.ucdavis.edu/research/Cameron_Miller_JHR_2015_February.pdf
Carpenter, C., & Dobkin, C. (2011). The minimum legal drinking age and public health. Journal of Economic Perspectives, 25(2), 133-156. doi:10.1257/jep.25.2.133
Imbens, G. W., & Kolesar, M. (2015). Robust standard errors in small samples: Some practical advice. URL: https://www.princeton.edu/~mkolesar/papers/small-robust.pdf

Back to top