Approximating the distribution of cluster-robust Wald statistics

robust variance estimation
distribution theory
Author

James E. Pustejovsky

Published

March 24, 2024

In Tipton and Pustejovsky (2015), we examined several different small-sample approximations for cluster-robust Wald test statistics, which are like F statistics but based on cluster-robust variance estimators. These statistics are, frankly, kind of weird and awkward to work with, and the approximations that we examined were far from perfect. In this post, I will look in detail at the robust Wald statistic for a simple but common scenario: a one-way ANOVA problem with clusters of dependent observations.

Meta-ANOVA

Consider a setup where clusters can be classified into one of C categories, with each cluster of observations falling into a single category. Let μ=[μc]c=1C denote the means of these categories. Suppose we have an estimator of those means μ^=[μ^c]c=1C and a corresponding cluster-robust variance estimator VR=c=1CVcR. Note that VR is diagonal because the estimators for each category are independent. Assume that the robust variance estimator is unbiased so E(VcR)=Var(μ^c)=ψc for c=1,...,C. Let Ψ=c=1Cψc.

Suppose that we want to test the null hypothesis that that means of the categories are all equal, H0:μ1=μ2==μC. We can express this null using a q×C contrast matrix C=[1q Iq], where q=C1. The null hypothesis is then Cμ=0q. The corresponding cluster-robust Wald statistic is Q=μ^C(CVRC)1Cμ^. Under the null hypothesis, the distribution of Q will converge to a χq2 as the number of clusters in each category grows large. However, with a limited number of clusters in some of the categories, this approximate reference distribution is not very accurate and tests based on it can have wildly inflated type I error rates.

Small-sample approximation

In the paper, we considered several different ways of approximating the distribution of Q that work at smaller sample sizes. One class of approaches to approximating the sampling distribution of Q is to use a Hotelling’s T2 distribution with degrees of freedom η. Given the degrees of freedom, Hotelling’s T2 is a multiple of an F distribution: ηq+1ηqQF(q,ηq+1). The question is then how to determine η.

Several of the approaches that we considered are based on representing the Q statistic as Q=zD1z, where Ω=CΨC, z=Ω1/2Cμ^c, G=Ω1/2C, and D=GVRG. The various approaches we considered involve different ways of approximating the sampling distribution of D.

Zhang’s approximation

One of the approximations involves finding degrees of freedom η by following a strategy suggested by Zhang (2012, 2013). These degrees of freedom are given by ηZ=q(q+1)s=1qt=1qVar(dst), where dst is the entry in row s, column t of D. To find ηZ, we can compute the denominator using general formulas given in the paper. However, with a bit of analysis we can find a much simpler expression for the special case of one-way ANOVA.

A bit of math

Before going further, it’s useful to observe that D is invariant to linear transformations of C. In particular, an equivalent way to write the null hypothesis is as H0:Ψ1/2C=0q, where Ψ=c=2Cψc is the diagonal of the true sampling variances of categories 2 through C, omitting the first category. Thus, let me redefine Ω=Ψ1/2CΨCΨ1/2, z=Ω1/2Ψ1/2Cμ^c, and G=Ω1/2Ψ1/2C. This transformation of the constraint matrix will make it possible to find a closed-form expression for Ω1/2.

Deriving Zhang’s degrees of freedom

Now, observe that Ω=Ψ1/2CΨCΨ1/2=Ψ1/2(Ψ+ψ11q1q)Ψ1/2=Iq+ψ1ff, where f=Ψ1/21q=[ψc1/2]c=2C. From the Woodbury identity, Ω1=I1Wff, where W=c=1C1ψc.

Fasi, Higham, and Liu (2023) provide formulas for pth roots of low-rank updates to scaled identity matrices. Their results provide a neat closed-form expression for Ω1/2. From their Equation (1.9), Ω1/2=Iqκ ff, where κ=ψ1Wψ1+W. Further, we can write the q×C matrix G as G=Ω1/2Ψ1/2C=(Iqκ ff)Ψ1/2[1q, Iq]=[κ(Wψ11)ψ1ψ1f,(Iqκ ff)Ψ1/2], with entries given by gsc={κ(Wψ11)ψ1ψ1ψs+1ifc=1I(s+1=c)ψcκψcψs+1ifc>1. Because D=GVRG and VR is diagonal, we can write the entries of D as dst=c=1CgscgtcVcR. And because the variance estimators for each category are independent, Var(dst)=c=1Cgsc2gtc2Var(VcR). In prior work, we derived expressions for the Satterthwaite degrees of freedom for variances of average effect sizes, and the same formulas can be applied here with the category-specific VcR. Let me write νc=2[E(VcR)]2/Var(VcR) for the degrees of freedom corresponding to category c. Then Var(dst)=2c=1Cgsc2gtc2ψc2νc. We can use this to obtain an expression for Zhang’s approximate degrees of freedom: q(q+1)ηZ1=s=1qt=1qVar(dst)=2s=1qt=1qc=1Cgsc2gtc2ψc2νc=2c=1Cψc2νc(s=1qgsc2)2. Now, all we need to do is simplify… s=1qgs12=s=1q(κ(Wψ11)ψ1)2ψ12ψs+1=(κ(Wψ11)ψ1)2ψ12c=2C1ψs+1=(κ(Wψ11)ψ1)2ψ12(Wψ11)ψ1=...a bunch of tedious algebra...=1ψ12(ψ11W) and, for c=2,...,C, s=1qgsc2=s=1q(I(s+1=c)ψcκψcψs+1)2=1ψc2κψc2+κ2ψc2s=1q1ψs+1=1ψc2κψc2+κ2ψc2(Wψ11)ψ1=...a bunch of tedious algebra...=1ψc2(ψc1W) Thus, q(q+1)ηZ1=2c=1Cψc2νc(s=1qgsc2)2=2c=1C1νcψc2(ψc1W)2=2c=1C1νc(11ψcW)2 or, rearranging, ηZ=C(C1)2c=1C1νc(11ψcW)2. It’s a surprisingly clean formula! Once these degrees of freedom are calculated, the degrees of freedom for the reference F distribution would be q and ηZq+1.

Other approximations

In the paper, we also considered two other degrees of freedom approximations, which involve not only the variances of dst but also the covariances between entries. In principle, one could follow similar algebra to get expressions for these other degrees of freedom as well. However, our simulations indicated that the other degrees of freedom approximations tend to be overly conservative and produce type-I error rates way below the nominal level (essentially, hardly ever rejecting the null) and less accurate than HTZ. So, there’s not much reason to work through them unless you find algebra enjoyable for its own sake.

What’s the right non-central distribution?

A further question about this cluster-robust Wald statistic is how to approximate its sampling distribution under specific alternative hypotheses. In other words, given a vector of means μ1,...,μC where the null does not hold, plus some information to determine ψc and νc for c=1,...,C, how could we approximate the distribution of Q? We need something like a non-central Hotelling’s T2 distribution…

Back to top