The Woodbury identity

A life-hack for analyzing hierarchical models

hierarchical models
matrix algebra
Author

James E. Pustejovsky

Published

December 4, 2020

As in many parts of life, statistics is full of little bits of knowledge that are useful if you happen to know them, but which hardly anybody ever bothers to mention. You would think, if something is so useful, perhaps your professors would spend a fair bit of time explaining it to you. But maybe the stuff seems trivial, obvious, or simple to them, so they don’t bother.

One example of this is Excel keyboard shortcuts. In a previous life, I was an Excel jockey so I learned all the keyboard shortcuts, such as how to move the cursor to the last cell in a continuous block of entries (ctrl + an arrow key). Whenever I do this while sharing a screen in a meeting, someone is invariably astounded and wants to know what dark sorcery I’m conjuring. It’s a simple trick, but a useful one—especially if you’re working with a really large dataset with thousands of rows. But it’s also something that there’s no reason to expect anyone to figure out on their own, and that no stats or quant methods professor is going to spend class time demonstrating.

Let me explain another, slightly more involved example, involving one of my favorite pieces of matrix algebra. There’s a thing called the Woodbury identity, also known as the Sherman-Morrison-Woodbury identity, that is a little life hack for inverting certain types of matrices. It has a Wikipedia page, which I have visited many times. It is a very handy bit of math, if you happen to be a statistics student working with hierarchical models (such as meta-analytic models). I’ll give a statement of the identity, then explain a bit about the connection to hierarchical models.

The Woodbury identity

Say that you’ve got four matrices: an $n \times n$ matrix $A$, a $k \times k$ matrix $C$, an $n \times k$ matrix $U$, and a $k \times n$ matrix $V$. Assume that $A$ and $C$ are invertible. The Woodbury identity tells you how to get the inverse of a certain combination of these matrices: $$(A + UCV)^{-1} = A^{-1} - A^{-1} U \left(C^{-1} + V A^{-1} U\right)^{-1} V A^{-1}.$$ Admit it, you’re impressed. “Dude! Mind. Blown.” you’re probably saying to yourself right now.
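
If you’d like to convince yourself numerically, here is a minimal sketch in Python with numpy (the matrices and dimensions are arbitrary, made-up illustrations) that checks the identity against a direct inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 2
A = np.diag(rng.uniform(1.0, 2.0, size=n))   # invertible n x n matrix
C = np.diag(rng.uniform(1.0, 2.0, size=k))   # invertible k x k matrix
U = rng.normal(size=(n, k))                  # n x k
V = rng.normal(size=(k, n))                  # k x n

# Inverse computed directly
direct = np.linalg.inv(A + U @ C @ V)

# Inverse computed via the Woodbury identity
A_inv = np.linalg.inv(A)
C_inv = np.linalg.inv(C)
woodbury = A_inv - A_inv @ U @ np.linalg.inv(C_inv + V @ A_inv @ U) @ V @ A_inv

print(np.allclose(direct, woodbury))  # True
```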

Or perhaps you’re still a touch skeptical that this formula is worth knowing. Let me explain the connection to hierarchical models.

Hierarchical models

Hierarchical linear models are a mainstay of statistical analysis in many, many areas of application, including education research, where we often deal with data collected on individuals (students, teachers) nested within larger aggregate units (like schools). In meta-analysis, these models come up if we’re dealing with samples that have more than one relevant outcome, so that we have multiple effect size estimates nested within a given sample or study.

Suppose we have a hierarchical structure with $J$ clusters, where cluster $j$ has $n_j$ individual observations. A quite general way of expressing a hierarchical model for such a data structure is $$Y_j = X_j \beta + Z_j \eta_j + \epsilon_j,$$ for $j = 1,...,J$, where, for cluster $j$:

  • $Y_j$ is an $n_j \times 1$ vector of outcomes,
  • $X_j$ is an $n_j \times p$ design matrix for the fixed effects,
  • $\beta$ is a $p \times 1$ vector of fixed effect coefficients,
  • $Z_j$ is an $n_j \times q$ design matrix for the random effects,
  • $\eta_j$ is a $q \times 1$ vector of random effects, and
  • $\epsilon_j$ is an $n_j \times 1$ vector of level-1 errors.

In this model, we assume that the random effects have mean zero and unknown variance-covariance matrix $T$, often assumed to be an unstructured, symmetric, and invertible matrix; we assume that the level-1 errors are also mean zero with variance-covariance matrix $\Sigma_j$; and we assume that $\eta_j$ is independent of $\epsilon_j$. In many instances, we might assume that the entries of $\epsilon_j$ are all independent, so $\Sigma_j$ will be a multiple of an identity matrix, $\Sigma_j = \sigma^2 I_j$. In other instances (such as models for longitudinal data), $\Sigma_j$ might be a patterned matrix that includes off-diagonal terms, such as an auto-regressive structure.

What is the marginal variance of $Y_j | X_j$ in this model? In other words, if we combine the variance due to the random effects and the variance of the level-1 errors, what do we get? We get $$\text{Var}(Y_j | X_j) = V_j = Z_j T Z_j' + \Sigma_j,$$ a matrix that, if you reverse the terms, looks like $$V_j = \Sigma_j + Z_j T Z_j',$$ a simple form of the combination of matrices on the left-hand side of the Woodbury identity. Thus, the identity tells us how we can invert this matrix.

But why would we care about inverting this variance-covariance matrix, you might ask? One good reason is that the fixed effect coefficients in the hierarchical model are estimated by weighted least squares, where the weight matrices are the inverse of an estimate of $V_j$. Thus, to understand how the weights in a hierarchical model work, it’s quite useful to be able to invert $V_j$. Another good (related) reason is that the sampling variance of the fixed effect estimates is approximately $$\text{Var}(\hat\beta) \approx \left(\sum_{j=1}^J X_j' V_j^{-1} X_j\right)^{-1}$$ (it would be exact if we knew the parameters of $V_j$ with certainty). So if we want to understand the precision of $\hat\beta$ or the power of a hypothesis test involving $\hat\beta$, then we won’t be able to get very far without inverting $V_j$.

Directly applying the identity, we get $$V_j^{-1} = \Sigma_j^{-1} - \Sigma_j^{-1} Z_j \left(T^{-1} + Z_j' \Sigma_j^{-1} Z_j\right)^{-1} Z_j' \Sigma_j^{-1}.$$ This expression looks like a bit of a mess, I’ll admit, but it can be useful. Things simplify quite a bit if $\Sigma_j$ has a form that is easy to invert (like a multiple of an identity matrix) and if the dimension of the random effects $q$ is small. Under these conditions, $\Sigma_j^{-1}$ is easy to work with, $T^{-1}$ is manageable because it has small dimensions, and $Z_j' \Sigma_j^{-1} Z_j$ becomes manageable because it also has small dimensions ($q \times q$, in both cases).
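
Here is a small numerical sketch of this (with made-up dimensions and variance components, and assuming $\Sigma_j = \sigma^2 I_j$), comparing the Woodbury-based inverse to a brute-force inverse of $V_j$:

```python
# Invert V_j = Sigma_j + Z_j T Z_j' via Woodbury, taking A = Sigma_j, U = Z_j, C = T, V = Z_j'.
import numpy as np

rng = np.random.default_rng(1)
n_j, q = 8, 2
Z_j = np.column_stack([np.ones(n_j), rng.normal(size=n_j)])  # n_j x q random-effects design
T = np.array([[0.5, 0.1],
              [0.1, 0.3]])                                   # q x q random-effects covariance
sigma2 = 1.5
Sigma_j = sigma2 * np.eye(n_j)                               # level-1 error covariance

V_j = Sigma_j + Z_j @ T @ Z_j.T

Sigma_inv = np.eye(n_j) / sigma2
middle = np.linalg.inv(np.linalg.inv(T) + Z_j.T @ Sigma_inv @ Z_j)  # only a q x q inverse needed
V_j_inv = Sigma_inv - Sigma_inv @ Z_j @ middle @ Z_j.T @ Sigma_inv

print(np.allclose(V_j_inv, np.linalg.inv(V_j)))  # True
```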

Random intercepts

As an example, consider a very simple model that includes only random intercepts, so $Z_j = \mathbf{1}_j$, an $n_j \times 1$ vector with every entry equal to 1, and $T$ is simply $\tau^2$, the variance of the random intercepts. For simplicity, let’s also assume that the level-1 errors are independent, so $\Sigma_j = \sigma^2 I_j$ and $\Sigma_j^{-1} = \sigma^{-2} I_j$. Applying the Woodbury identity,
$$
\begin{aligned}
V_j^{-1} &= \Sigma_j^{-1} - \Sigma_j^{-1} \mathbf{1}_j \left(T^{-1} + \mathbf{1}_j' \Sigma_j^{-1} \mathbf{1}_j\right)^{-1} \mathbf{1}_j' \Sigma_j^{-1} \\
&= \sigma^{-2} I_j - \sigma^{-4} \mathbf{1}_j \left(\tau^{-2} + \sigma^{-2} \mathbf{1}_j' \mathbf{1}_j\right)^{-1} \mathbf{1}_j' \\
&= \sigma^{-2} I_j - \sigma^{-4} \left(\tau^{-2} + \sigma^{-2} n_j\right)^{-1} \mathbf{1}_j \mathbf{1}_j' \\
&= \sigma^{-2}\left(I_j - \frac{\tau^2}{\sigma^2 + n_j \tau^2} \mathbf{1}_j \mathbf{1}_j'\right).
\end{aligned}
$$
Try checking this for yourself by carrying through the matrix algebra for $V_j V_j^{-1}$, which should come out equal to $I_j$.
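
If you’d rather let the computer do the checking, here is a sketch (with arbitrary illustrative values of $\tau^2$, $\sigma^2$, and $n_j$) that verifies the closed-form inverse:

```python
import numpy as np

tau2, sigma2, n_j = 0.4, 1.0, 5
ones = np.ones(n_j)
V_j = sigma2 * np.eye(n_j) + tau2 * np.outer(ones, ones)

# Closed-form inverse from the Woodbury identity
V_j_inv = (np.eye(n_j) - tau2 / (sigma2 + n_j * tau2) * np.outer(ones, ones)) / sigma2

print(np.allclose(V_j @ V_j_inv, np.eye(n_j)))  # True
```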

Now suppose that the design matrix is also quite simple, consisting of just an intercept term, $X_j = \mathbf{1}_j$, so that the fixed effect $\beta$ is simply a population mean. How precise is the estimate of the population mean from this hierarchical model? Well, the sampling variance of the estimator $\hat\beta$ is approximately
$$
\begin{aligned}
\text{Var}(\hat\beta) &\approx \left(\sum_{j=1}^J \mathbf{1}_j' V_j^{-1} \mathbf{1}_j\right)^{-1} \\
&= \left(\sigma^{-2} \sum_{j=1}^J \mathbf{1}_j'\left(I_j - \frac{\tau^2}{\sigma^2 + n_j \tau^2} \mathbf{1}_j \mathbf{1}_j'\right)\mathbf{1}_j\right)^{-1} \\
&= \left(\sigma^{-2} \sum_{j=1}^J n_j\left(1 - \frac{n_j \tau^2}{\sigma^2 + n_j \tau^2}\right)\right)^{-1} \\
&= \left(\sigma^{-2} \sum_{j=1}^J \frac{n_j \sigma^2}{\sigma^2 + n_j \tau^2}\right)^{-1} \\
&= \left(\sum_{j=1}^J \frac{n_j}{\sigma^2 + n_j \tau^2}\right)^{-1} \\
&= \left(\sigma^2 + \tau^2\right)\left(\sum_{j=1}^J \frac{n_j}{1 + (n_j - 1)\rho}\right)^{-1},
\end{aligned}
$$
where $\rho = \tau^2 / (\tau^2 + \sigma^2)$ is the intra-class correlation. Squint at this expression for a bit and you can see how the ICC influences the variance. If $\rho$ is near zero, then the sampling variance will be close to $(\sigma^2 + \tau^2)/N$ (where $N = \sum_j n_j$ is the total sample size), which is what you would get if you treated every observation as independent. If $\rho$ is near 1, then the sampling variance ends up being nearly $(\sigma^2 + \tau^2)/J$, which is what you would get if you treated every cluster as a single observation. For intermediate ICCs, the sample size from cluster $j$ (in the numerator of the fraction inside the summation) gets cut down to size accordingly.
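
A quick numerical sketch of the final expression (with invented cluster sizes) shows the two limiting behaviors:

```python
import numpy as np

def var_beta_hat(n, tau2, sigma2):
    """Approximate Var(beta-hat) = (sigma^2 + tau^2) / sum_j [ n_j / (1 + (n_j - 1) * rho) ]."""
    rho = tau2 / (tau2 + sigma2)
    return (sigma2 + tau2) / np.sum(n / (1 + (n - 1) * rho))

n = np.array([10, 20, 30, 40])  # cluster sizes: N = 100 observations in J = 4 clusters

print(var_beta_hat(n, tau2=0.001, sigma2=0.999))  # ~ 0.01, close to (sigma^2 + tau^2) / N
print(var_beta_hat(n, tau2=0.999, sigma2=0.001))  # ~ 0.25, close to (sigma^2 + tau^2) / J
```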

The estimator of the population mean is a weighted average of the outcomes. Specifically, $$\hat\beta = \left(\sum_{j=1}^J \mathbf{1}_j' \hat{V}_j^{-1} \mathbf{1}_j\right)^{-1} \sum_{j=1}^J \mathbf{1}_j' \hat{V}_j^{-1} Y_j,$$ where $\hat{V}_j$ is an estimator of $V_j$. If you carry through the matrix algebra, you’ll find that $$\hat\beta = \left(\sum_{j=1}^J \frac{n_j}{\sigma^2 + n_j \tau^2}\right)^{-1} \sum_{j=1}^J \frac{\mathbf{1}_j' Y_j}{\sigma^2 + n_j \tau^2} = \frac{1}{W} \sum_{j=1}^J \sum_{i=1}^{n_j} w_j\, y_{ij},$$ where $w_j = \frac{1}{1 + (n_j - 1)\rho}$ and $W = \sum_{j=1}^J n_j w_j$. From this, we can see that the weight of a given observation depends on the ICC and the size of the cluster. If the ICC is low, then the weights will all be close to 1. For higher ICCs, observations in smaller clusters get proportionately more weight than observations in larger clusters.
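
Here is a sketch (using simulated outcomes and arbitrary variance components, treated as known) confirming that the weighted-average form matches the GLS matrix formula:

```python
import numpy as np

rng = np.random.default_rng(2)
tau2, sigma2 = 0.4, 1.0
rho = tau2 / (tau2 + sigma2)
n = np.array([3, 5, 8])  # illustrative cluster sizes

num, denom = 0.0, 0.0    # pieces of the GLS formula
wy_sum, w_sum = 0.0, 0.0 # pieces of the weighted-average formula
for n_j in n:
    y_j = rng.normal(size=n_j)
    ones = np.ones(n_j)
    V_j = sigma2 * np.eye(n_j) + tau2 * np.outer(ones, ones)
    V_j_inv = np.linalg.inv(V_j)
    num += ones @ V_j_inv @ y_j
    denom += ones @ V_j_inv @ ones
    # equivalent observation weights
    w_j = 1 / (1 + (n_j - 1) * rho)
    wy_sum += w_j * y_j.sum()
    w_sum += n_j * w_j

print(np.isclose(num / denom, wy_sum / w_sum))  # True
```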

A meta-analysis example

In a previous post on multi-variate meta-analysis, I examined how weighting works in some multi-variate meta-analysis models, where you have multiple effect size estimates nested within a study. Let $T_{ij}$ denote effect size estimate $i$ in study $j$, for $i = 1,...,n_j$ and $j = 1,...,J$. The first model I considered in the previous post was $$T_{ij} = \mu + \eta_j + \nu_{ij} + e_{ij},$$ where $\text{Var}(\eta_j) = \tau^2$, $\text{Var}(\nu_{ij}) = \omega^2$, $\text{Var}(e_{ij}) = V_j$, treated as known, and $\text{cor}(e_{hj}, e_{ij}) = \rho$ for some specified value of $\rho$. This model makes the simplifying assumptions that the effect sizes within a given study all have the same sampling variance, $V_j$, and that there is a single correlation between pairs of outcomes from the same study that is constant across all pairs of outcomes and across all studies.

You can write this model in matrix form as $$T_j = \mu \mathbf{1}_j + \eta_j \mathbf{1}_j + \nu_j + e_j,$$ where $\text{Var}(\nu_j) = \omega^2 I_j$ and $\text{Var}(e_j) = V_j\left[\rho \mathbf{1}_j \mathbf{1}_j' + (1 - \rho) I_j\right]$. It follows that $$\text{Var}(T_j) = \left(\tau^2 + V_j \rho\right) \mathbf{1}_j \mathbf{1}_j' + \left[\omega^2 + V_j (1 - \rho)\right] I_j.$$ The Woodbury identity comes in handy here again, if we want to examine the weights implied by this model or the sampling variance of the overall average effect size estimator. I’ll leave it as an exercise to find an expression for the weight assigned to effect size $T_{ij}$ under this model. You could also try finding an expression for the variance of the overall average effect size estimator $\hat\mu$, based on inverse-variance weighting, when the model is correctly specified.
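
As a hint, here is a sketch (with illustrative parameter values) that builds $\text{Var}(T_j)$ and inverts it using the same closed form as in the random-intercepts example, with $\tau^2 + V_j \rho$ and $\omega^2 + V_j(1 - \rho)$ playing the roles of $\tau^2$ and $\sigma^2$ (the substitution noted in the footnotes):

```python
import numpy as np

tau2, omega2, V_j, rho, n_j = 0.05, 0.02, 0.10, 0.6, 4  # made-up values
ones = np.ones(n_j)

Var_Tj = (tau2 + V_j * rho) * np.outer(ones, ones) + (omega2 + V_j * (1 - rho)) * np.eye(n_j)

# Random-intercepts closed form with t2 and s2 in place of tau^2 and sigma^2
t2, s2 = tau2 + V_j * rho, omega2 + V_j * (1 - rho)
Var_Tj_inv = (np.eye(n_j) - t2 / (s2 + n_j * t2) * np.outer(ones, ones)) / s2

print(np.allclose(Var_Tj @ Var_Tj_inv, np.eye(n_j)))  # True
```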

Another meta-analysis example

In the previous post, I also covered weighting in a bit more general model, where the sampling variances and correlations are no longer quite so constrained. As before, we have $$T_j = \mu \mathbf{1}_j + \eta_j \mathbf{1}_j + \nu_j + e_j,$$ where $\text{Var}(\eta_j) = \tau^2$ and $\text{Var}(\nu_j) = \omega^2 I_j$. But now let $\text{Var}(e_j) = \Sigma_j$ for some arbitrary, symmetric, invertible matrix $\Sigma_j$. The marginal variance of $T_j$ is therefore $$\text{Var}(T_j) = \tau^2 \mathbf{1}_j \mathbf{1}_j' + \omega^2 I_j + \Sigma_j.$$ Let $S_j = \left(\omega^2 I_j + \Sigma_j\right)^{-1}$. Try applying the Woodbury identity to invert $\text{Var}(T_j)$ in terms of $\tau^2$, $n_j$, and $S_j$. Then see if you can derive the weight assigned to effect $i$ in study $j$ under this model. See the previous post for the solution.
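
Here is a numerical sketch of that application (with an arbitrary symmetric, positive-definite $\Sigma_j$ and made-up variance components), taking $A = \omega^2 I_j + \Sigma_j$ (so $A^{-1} = S_j$), $U = \mathbf{1}_j$, $C = \tau^2$, and $V = \mathbf{1}_j'$ in the Woodbury identity:

```python
import numpy as np

rng = np.random.default_rng(3)
tau2, omega2, n_j = 0.05, 0.02, 4
B = rng.normal(size=(n_j, n_j))
Sigma_j = B @ B.T / n_j + 0.1 * np.eye(n_j)   # arbitrary symmetric positive-definite matrix
ones = np.ones(n_j)

Var_Tj = tau2 * np.outer(ones, ones) + omega2 * np.eye(n_j) + Sigma_j

# Woodbury: inv(A + 1 * tau2 * 1') = S_j - S_j 1 (1/tau2 + 1' S_j 1)^{-1} 1' S_j
S_j = np.linalg.inv(omega2 * np.eye(n_j) + Sigma_j)
S_ones = S_j @ ones
Var_Tj_inv = S_j - np.outer(S_ones, S_ones) / (1 / tau2 + ones @ S_ones)

print(np.allclose(Var_Tj_inv, np.linalg.inv(Var_Tj)))  # True
```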


Footnotes

  1. This model is what we call the “correlated-and-hierarchical effects model” in my paper (with Beth Tipton) on extending working models for robust variance estimation.

  2. Or squint hard at the formula for the variance of $T_j$, and you’ll see that it has the same form as the random intercepts model in the previous example. Just replace the $\tau^2$ in that model with $\tau^2 + V_j \rho$ and replace the $\sigma^2$ in that model with $\omega^2 + V_j (1 - \rho)$.

  3. See the previous post for the answer.

  4. In the previous post, I expressed the weights in terms of $s_{ij}$, the sum of the entries in row $i$ of the $S_j$ matrix. In vector form, $\mathbf{s}_j = (s_{1j} \ s_{2j} \ \cdots \ s_{n_j j})' = S_j \mathbf{1}_j$.