A handmade clubSandwich for multi-site trials

James E. Pustejovsky

I’m just back from the Society for Research on Educational Effectiveness meetings, where I presented work on small-sample corrections for cluster-robust variance estimators in two-stage least squares models, which I’ve implemented in the clubSandwich R package. Here’s my presentation. So I had “clubSandwich” estimators on the brain when a colleague asked me about whether the methods were implemented in SAS.

The short answer is “no.”

The moderately longer answer is “not unless we can find funding to pay someone who knows how to program properly in SAS.” However, for the specific model that my colleague was interested in, it turns out that the small-sample corrections implemented in clubSandwich can be expressed in closed form, and they’re simple enough that they could easily be hand-calculated. I’ll sketch out the calculations in the remainder of this post.

A multi-site trial

Consider a multi-site trial conducted across $J$ sites, which we take as a sample from a larger super-population of sites. Each site consists of $n_{j}$ units, of which $p_{j} n_{j}$ are randomized to treatment and the remainder $(1 - p_{j}) n_{j}$ are randomized to control. For each unit $i$ in each site $j$ , we have an outcome $y_{i j}$ and a treatment indicator $t_{i j}$ .

A conventional approach to estimating the overall average impact in this setting is to use a model with a treatment indicator and fixed effects for each site: $y_{i j} = β_{j} + δ t_{i j} + e_{i j}$ and then to cluster the standard errors by site. Clustering by site makes sense here if (and only if) we’re interested in generalizing to the super-population of sites.

Let ${\hat{δ}}_{j}$ denote the impact estimate from site $j$ , calculated as the difference in means between treated and untreated units at site $j$ : ${\hat{δ}}_{j} = \frac{1}{n_{j} p_{j}} (\sum_{i = 1}^{n_{j}} t_{i j} y_{i j}) - \frac{1}{n_{j} (1 - p_{j})} (\sum_{i = 1}^{n_{j}} (1 - t_{i j}) y_{i j}) .$ for $j = 1, . ., J$ . The overall impact estimate here is a precision-weighted average of the site-specific impacts: $\hat{δ} = \frac{1}{W} \sum_{j = 1}^{J} w_{j} {\hat{δ}}_{j},$ where $w_{j} = n_{j} p_{j} (1 - p_{j})$ and $W = \sum_{j} w_{j}$ .

Sandwich estimators

The conventional clustered variance estimator (or sandwich estimator) for $\hat{δ}$ is a simple function of the (weighted) sample variance of the site-specific effects. It can be calculated directly as: $V^{C R 0} = \frac{1}{W^{2}} \sum_{j = 1}^{J} w_{j}^{2} {({\hat{δ}}_{j} - \hat{δ})}^{2} .$ Under a conventional random effects model for the $δ_{j}$ s, this estimator has a downward bias in finite samples.

The clubSandwich variance estimator here uses an estimator for the sample variance of site-specific effects that is unbiased under a certain working model. It is only slightly more complicated to calculate: $V^{C R 2} = \frac{1}{W^{2}} \sum_{j = 1}^{J} \frac{w_{j}^{2} {({\hat{δ}}_{j} - \hat{δ})}^{2}}{1 - w_{j} / W} .$

The other difference between conventional methods and the clubSandwich approach is in the reference distribution used to calculate hypothesis tests and confidence intervals. The conventional approach uses a standard normal reference distribution (i.e., a z-test) that is asymptotically justified. The clubSandwich approach uses a $t$ reference distribution, with degrees of freedom estimated using a Satterthwaite approximation. In the present context, the degrees of freedom are a little bit ugly but still not hard to calculate: $d f = {[\sum_{j = 1}^{J} \frac{w_{j}^{2}}{(W - w_{j})^{2}} - \frac{2}{W} \sum_{j = 1}^{J} \frac{w_{j}^{3}}{(W - w_{j})^{2}} + \frac{1}{W^{2}} {(\sum_{j = 1}^{J} \frac{w_{j}^{2}}{W - w_{j}})}^{2}]}^{- 1} .$

In the special case that all sites are of the same size and use a constant treatment allocation, the weights become equal. The clubSandwich variance estimator then reduces to $V^{C R 2} = \frac{S_{δ}^{2}}{J} where S_{δ}^{2} = \frac{1}{J - 1} \sum_{j = 1}^{J} {({\hat{δ}}_{j} - \hat{δ})}^{2},$ and the degrees of freedom reduce to simply $d f = J - 1$ .

Tennessee STAR

Here is a worked example of the calculations (using R of course, because my SAS programming skills atrophied years ago). I’ll use data from the famous Tennessee STAR class size experiment, which was a multi-site trial in which students were randomized to small or regular-sized kindergarten classes within each of several dozen schools. To make the small-sample issues more pronounced, I’ll limit the sample to urban schools and look at impacts of small class-size on reading and math scores at the end of kindergarten. STAR was actually a three-arm trial—the third arm being a regular-sized class but with an additional teacher aide. For simplicity (and following convention), I’ll collapse the teacher-aide condition and the regular-sized class condition into a single arm and also limit the sample to students with complete outcome data on both tests.

Code

library(tidyverse)
data(STAR, package = "AER")

STAR_urban <-
  STAR %>%
  filter(
    # limit to urban/inner city schools
    schoolk %in% c("urban","inner-city"),
    # limit to complete outcome data
    !is.na(readk), !is.na(mathk)
  ) %>%
  droplevels() %>%
  # collapse control conditions
  mutate(stark = fct_collapse(stark, regular = c("regular","regular+aide"))) %>%
  select(schoolidk, stark, readk, mathk)

STAR_summary <- 
  STAR_urban %>%
  count(schoolidk)

After these exclusions, the data include a total of 1810 students from 23 schools, ranging in size from 34 to 134 students.

For starters, let’s get the average impacts using a seeming unrelated regression specification, with both conventional and clubSandwich standard errors.

Code

library(clubSandwich)
STAR_fit <- lm(cbind(readk, mathk) ~ 0 + schoolidk + stark, data = STAR_urban)

# conventional SEs
CR0 <- 
  coef_test(STAR_fit, vcov = "CR0", 
            cluster = STAR_urban$schoolidk, 
            test = "z",
            coefs = c("readk:starksmall","mathk:starksmall"))

CR0

            Coef. Estimate   SE t-stat d.f. (z) p-val (z) Sig.
 readk:starksmall     6.16 2.73   2.25      Inf    0.0241    *
 mathk:starksmall    12.13 4.79   2.53      Inf    0.0113    *

Code

# clubSandwich SEs
CR2 <- 
  coef_test(STAR_fit, vcov = "CR2", 
            cluster = STAR_urban$schoolidk, 
            coefs = c("readk:starksmall","mathk:starksmall"))

CR2

            Coef. Estimate   SE t-stat d.f. (Satt) p-val (Satt) Sig.
 readk:starksmall     6.16 2.81   2.19          19       0.0409    *
 mathk:starksmall    12.13 4.92   2.47          19       0.0234    *

Now I’ll do it “by hand”—or rather, with a bit of dplyr:

Code

# summary statistics by site

school_summaries <- 
  STAR_urban %>%
  group_by(schoolidk, stark) %>%
  summarise(
    # means by arm and site
    readk = mean(readk),
    mathk = mean(mathk),
    n_arm = n()
  ) %>%
  summarise(
    # impact estimates by site
    readk = diff(readk),
    mathk = diff(mathk),
    n = sum(n_arm),
    p = n_arm[stark=="small"] / n
  ) %>%
  mutate(w = n * p * (1 - p))

# overall impacts

school_summaries %>%
  gather("subject","impact_j", readk, mathk) %>%
  group_by(subject) %>%
  summarise(
    impact = weighted.mean(impact_j, w = w),
    SE_CR0 = sqrt(sum(w^2 * (impact_j - impact)^2) / sum(w)^2),
    SE_CR2 = sqrt(sum(w^2 * (impact_j - impact)^2 / (1 - w / sum(w))) / sum(w)^2),
    df_CR2 = 1 / (sum(w^2 / (sum(w) - w)^2) - 
                    2 * sum(w^3 / (sum(w) - w)^2) / sum(w) + 
                    sum(w^2 / (sum(w) - w))^2 / sum(w)^2)
  ) %>%
  knitr::kable(digits = 2)

subject	impact	SE_CR0	SE_CR2	df_CR2
mathk	12.13	4.79	4.92	18.99
readk	6.16	2.73	2.81	18.99

The CR0 and CR2 standard errors match the results from coef_test, as do the Satterthwaite degrees of freedom. Note that the degrees of freedom are equal to 19 in this case, a bit less than $J - 1 = 22$ due to variation in the weight assigned to each school.

Other weights

Some analysts might not like the approach of using precision-weighted average of the site-specific impacts, as I’ve examined here. Instead, one might choose to weight the site-specific effects by the site-specific sample sizes, or to use some sort of random effects weighting that allows for random heterogeneity across sites. The formulas given above for conventional and clubSandwich clustered variance estimators apply directly to other weighting schemes too. Just substitute your favorite weights in place of $w_{j}$ . When doing so, the clubSandwich estimator will be exactly unbiased under the assumption that your preferred weighting scheme corresponds to inverse-variance weighting, and the Satterthwaite degrees of freedom approximation will be derived under the same model.

--- title: A handmade clubSandwich for multi-site trials date: '2019-03-09' categories: - sandwiches - robust variance estimation - econometrics - weighting code-tools: true --- I'm just back from the [Society for Research on Educational Effectiveness](https://www.sree.org/assets/conferences/2019s/program.pdf) meetings, where I presented work on small-sample corrections for cluster-robust variance estimators in two-stage least squares models, which I've implemented in the [`clubSandwich`](/software/clubSandwich/) R package. [Here's my presentation](/files/SREE-2019-2SLS-CRVE.pdf). So I had "clubSandwich" estimators on the brain when a colleague asked me about whether the methods were implemented in SAS. The short answer is "no." The moderately longer answer is "not unless we can find funding to pay someone who knows how to program properly in SAS." However, for the specific model that my colleague was interested in, it turns out that the small-sample corrections implemented in clubSandwich can be expressed in closed form, and they're simple enough that they could easily be hand-calculated. I'll sketch out the calculations in the remainder of this post. ## A multi-site trial Consider a multi-site trial conducted across $J$ sites, which we take as a sample from a larger super-population of sites. Each site consists of $n_j$ units, of which $p_j n_j$ are randomized to treatment and the remainder $(1 - p_j) n_j$ are randomized to control. For each unit $i$ in each site $j$, we have an outcome $y_{ij}$ and a treatment indicator $t_{ij}$. A conventional approach to estimating the overall average impact in this setting is to use a model with a treatment indicator and fixed effects for each site: $$ y_{ij} = \beta_j + \delta t_{ij} + e_{ij} $$ and then to cluster the standard errors by site. Clustering by site makes sense here if (and only if) we're interested in generalizing to the super-population of sites. Let $\hat\delta_j$ denote the impact estimate from site $j$, calculated as the difference in means between treated and untreated units at site $j$: $$ \hat\delta_j = \frac{1}{n_j p_j} \left(\sum_{i=1}^{n_j} t_{ij} y_{ij}\right) - \frac{1}{n_j (1 - p_j)} \left(\sum_{i=1}^{n_j} (1 - t_{ij}) y_{ij}\right). $$ for $j = 1,..,J$. The overall impact estimate here is a precision-weighted average of the site-specific impacts: $$ \hat\delta = \frac{1}{W} \sum_{j=1}^J w_j \hat\delta_j, $$ where $w_j = n_j p_j (1 - p_j)$ and $W = \sum_j w_j$. ## Sandwich estimators The conventional clustered variance estimator (or sandwich estimator) for $\hat\delta$ is a simple function of the (weighted) sample variance of the site-specific effects. It can be calculated directly as: $$ V^{CR0} = \frac{1}{W^2} \sum_{j=1}^J w_j^2 \left(\hat\delta_j - \hat\delta\right)^2. $$ Under a conventional random effects model for the $\delta_j$s, this estimator has a downward bias in finite samples. The clubSandwich variance estimator here uses an estimator for the sample variance of site-specific effects that is unbiased under a certain working model. It is only slightly more complicated to calculate: $$ V^{CR2} = \frac{1}{W^2} \sum_{j=1}^J \frac{w_j^2 \left(\hat\delta_j - \hat\delta\right)^2}{1 - w_j / W}. $$ The other difference between conventional methods and the clubSandwich approach is in the reference distribution used to calculate hypothesis tests and confidence intervals. The conventional approach uses a standard normal reference distribution (i.e., a z-test) that is asymptotically justified. The clubSandwich approach uses a $t$ reference distribution, with degrees of freedom estimated using a Satterthwaite approximation. In the present context, the degrees of freedom are a little bit ugly but still not hard to calculate: $$ df = \left[\sum_{j=1}^J \frac{w_j^2}{(W - w_j)^2} - \frac{2}{W}\sum_{j=1}^J \frac{w_j^3}{(W - w_j)^2} + \frac{1}{W^2} \left(\sum_{j=1}^J \frac{w_j^2}{W - w_j} \right)^2 \right]^{-1}. $$ In the special case that all sites are of the same size and use a constant treatment allocation, the weights become equal. The clubSandwich variance estimator then reduces to $$ V^{CR2} = \frac{S_\delta^2}{J} \qquad \text{where} \qquad S_\delta^2 = \frac{1}{J - 1}\sum_{j=1}^J \left(\hat\delta_j - \hat\delta\right)^2, $$ and the degrees of freedom reduce to simply $df = J - 1$. ## Tennessee STAR Here is a worked example of the calculations (using R of course, because my SAS programming skills atrophied years ago). I'll use data from the famous Tennessee STAR class size experiment, which was a multi-site trial in which students were randomized to small or regular-sized kindergarten classes within each of several dozen schools. To make the small-sample issues more pronounced, I'll limit the sample to urban schools and look at impacts of small class-size on reading and math scores at the end of kindergarten. STAR was actually a three-arm trial---the third arm being a regular-sized class but with an additional teacher aide. For simplicity (and following convention), I'll collapse the teacher-aide condition and the regular-sized class condition into a single arm and also limit the sample to students with complete outcome data on both tests. ```{r, message = FALSE} library(tidyverse) data(STAR, package = "AER") STAR_urban <- STAR %>% filter( # limit to urban/inner city schools schoolk %in% c("urban","inner-city"), # limit to complete outcome data !is.na(readk), !is.na(mathk) ) %>% droplevels() %>% # collapse control conditions mutate(stark = fct_collapse(stark, regular = c("regular","regular+aide"))) %>% select(schoolidk, stark, readk, mathk) STAR_summary <- STAR_urban %>% count(schoolidk) ``` After these exclusions, the data include a total of `{r} sum(STAR_summary$n)` students from `{r} nrow(STAR_summary)` schools, ranging in size from `{r} min(STAR_summary$n)` to `{r} max(STAR_summary$n)` students. For starters, let's get the average impacts using a seeming unrelated regression specification, with both conventional and clubSandwich standard errors. ```{r} library(clubSandwich) STAR_fit <- lm(cbind(readk, mathk) ~ 0 + schoolidk + stark, data = STAR_urban) # conventional SEs CR0 <- coef_test(STAR_fit, vcov = "CR0", cluster = STAR_urban$schoolidk, test = "z", coefs = c("readk:starksmall","mathk:starksmall")) CR0 # clubSandwich SEs CR2 <- coef_test(STAR_fit, vcov = "CR2", cluster = STAR_urban$schoolidk, coefs = c("readk:starksmall","mathk:starksmall")) CR2 ``` Now I'll do it "by hand"---or rather, with a bit of `dplyr`: ```{r} # summary statistics by site school_summaries <- STAR_urban %>% group_by(schoolidk, stark) %>% summarise( # means by arm and site readk = mean(readk), mathk = mean(mathk), n_arm = n() ) %>% summarise( # impact estimates by site readk = diff(readk), mathk = diff(mathk), n = sum(n_arm), p = n_arm[stark=="small"] / n ) %>% mutate(w = n * p * (1 - p)) # overall impacts school_summaries %>% gather("subject","impact_j", readk, mathk) %>% group_by(subject) %>% summarise( impact = weighted.mean(impact_j, w = w), SE_CR0 = sqrt(sum(w^2 * (impact_j - impact)^2) / sum(w)^2), SE_CR2 = sqrt(sum(w^2 * (impact_j - impact)^2 / (1 - w / sum(w))) / sum(w)^2), df_CR2 = 1 / (sum(w^2 / (sum(w) - w)^2) - 2 * sum(w^3 / (sum(w) - w)^2) / sum(w) + sum(w^2 / (sum(w) - w))^2 / sum(w)^2) ) %>% knitr::kable(digits = 2) ``` The CR0 and CR2 standard errors match the results from `coef_test`, as do the Satterthwaite degrees of freedom. Note that the degrees of freedom are equal to `{r} round(CR2$df[1], 1)` in this case, a bit less than $J - 1 = 22$ due to variation in the weight assigned to each school. ## Other weights Some analysts might not like the approach of using precision-weighted average of the site-specific impacts, as I've examined here. Instead, one might choose to weight the site-specific effects by the site-specific sample sizes, or to use some sort of random effects weighting that allows for random heterogeneity across sites. The formulas given above for conventional and clubSandwich clustered variance estimators apply directly to other weighting schemes too. Just substitute your favorite weights in place of $w_j$. When doing so, the clubSandwich estimator will be exactly unbiased under the assumption that your preferred weighting scheme corresponds to inverse-variance weighting, and the Satterthwaite degrees of freedom approximation will be derived under the same model.