Simulating correlated standardized mean differences for meta-analysis

James E. Pustejovsky

As I’ve discussed in previous posts, meta-analyses in psychology, education, and other areas often include studies that contribute multiple, statistically dependent effect size estimates. I’m interested in methods for meta-analyzing and meta-regressing effect sizes from data structures like this, and studying this sort of thing often entails conducting Monte Carlo simulations. Monte Carlo simulations involve generating artificial data—in this case, a set of studies, each of which has one or more dependent effect size estimates—that follows a certain distributional model, applying different analytic methods to the artificial data, and then repeating the process a bunch of times. Because we know the true parameters that govern the data-generating process, we can evaluate the performance of the analytic methods in terms of bias, accuracy, hypothesis test calibration and power, confidence interval coverage, and the like.

In this post, I’ll discuss two alternative methods to simulate meta-analytic datasets that include studies with multiple, dependent effect size estimates: simulating individual participant-level data or simulating summary statistics. I’ll focus on the case of the standardized mean difference (SMD) because it is so common in meta-analyses of intervention studies. For simplicity, I’ll assume that the effect sizes all come from simple, two-group comparisons (without any covariate adjustment or anything like that) and that the individual observations are multi-variate normally distributed within each group. Our goal will be to simulate a set of \(K\) studies, where study \(k\) is based on measuring \(J_k\) outcomes on a sample of \(N_k\) participants, all for \(k = 1,...,K\). Let \(\boldsymbol\delta_k = (\delta_{1k} \cdots \delta_{J_k k})'\) be the \(J_k \times 1\) vector of true standardized mean differences for study \(k\). I’ll assume that we know these true effect size parameters for all \(K\) studies, so that I can avoid committing to any particular form of random effects model.

Simulating individual participant-level data

The most direct way to simulate this sort of effect size data is to generate outcome data for every artificial participant in every artificial study. Let \(\mathbf{Y}_{ik}^T\) be the \(J_k \times 1\) vector of outcomes for treatment group participant \(i\) in study \(k\), and let \(\mathbf{Y}_{ik}^C\) be the \(J_k \times 1\) vector outcomes for control group participant \(i\) in study \(k\), for \(i=1,...,N_k / 2\) and \(k = 1,...,K\). Assuming multi-variate normality of the outcomes, we can generate these outcome vectors as \[ \mathbf{Y}_{ik}^T \sim N\left(\boldsymbol\delta_k, \boldsymbol\Psi_k\right) \qquad \text{and}\qquad \mathbf{Y}_{ik}^C \sim N\left(\mathbf{0}, \boldsymbol\Psi_k\right), \] where \(\boldsymbol\Psi_k\) is the population correlation matrix of the outcomes in study \(k\). Note that I am setting the mean outcomes of the control group participants to zero and also specifying that the outcomes all have unit variance within each group. After simulating data based on these distributions, the effect size estimates for each outcome can be calculated directly, following standard formulas.

Here’s what this approach looks like in code. It is helpful to simplify things by focusing on simulating just a single study with multiple, correlated effect sizes. Focusing first on just the input parameters, a function might look like the following:

Code

r_SMDs_raw <- function(delta, J, N, Psi) {
  # stuff
  return(ES_data)  
}

In the above function skeleton, delta would be the true effect size parameter \(\boldsymbol\delta_k\), J would be the number of effect sizes to generate \((J_k)\), N is the total number of participants \((N_k)\), and Psi is a matrix of correlations between the outcomes \((\Psi_k)\). From these parameters, we’ll generate raw data, calculate effect size estimates and standard errors, and return the results in a little dataset.

To make the function a little bit easier to use, I’m going overload the Psi argument so that it can be a single number, indicating a common correlation between the outcomes. Thus, instead of having to feed in a \(J_k \times J_k\) matrix, you can specify a single correlation \(r_k\), and the function will assume that all of the outcomes are equicorrelated. In code, the logic is:

Code

if (!is.matrix(Psi)) Psi <- Psi + diag(1 - Psi, nrow = J)

Here’s the function with the innards:

Code

r_SMDs_raw <- function(delta, J, N, Psi) {

  require(mvtnorm) # for simulating multi-variate normal data
  
  # create Psi matrix assuming equicorrelation
  if (!is.matrix(Psi)) Psi <- Psi + diag(1 - Psi, nrow = J)
  
  # generate control group summary statistics
  Y_C <- rmvnorm(n = N / 2, mean = rep(0, J), sigma = Psi)
  ybar_C <- colMeans(Y_C)
  sd_C <- apply(Y_C, 2, sd)
  
  # generate treatment group summary statistics
  delta <- rep(delta, length.out = J)
  Y_T <- rmvnorm(n = N / 2, mean = delta, sigma = Psi)
  ybar_T <- colMeans(Y_T)
  sd_T <- apply(Y_T, 2, sd)

  # calculate Cohen's d
  sd_pool <- sqrt((sd_C^2 + sd_T^2) / 2)
  ES <- (ybar_T - ybar_C) / sd_pool
  
  # calculate SE of d
  SE <- sqrt(4 / N + ES^2 / (2 * (N - 2)))

  data.frame(ES = ES, SE = SE, N = N)

}

In action:

Code

delta <- rnorm(4, mean = 0.2, sd = 0.1)
r_SMDs_raw(delta = delta, J = 4, N = 40, Psi = 0.6)

         ES        SE  N
1 0.3855613 0.3193055 40
2 0.5944490 0.3234959 40
3 0.4130571 0.3197576 40
4 0.9150846 0.3331939 40

Or if you’d rather specify the full \(\Psi_k\) matrix yourself:

Code

Psi_k <- 0.6 + diag(0.4, nrow = 4)
Psi_k

     [,1] [,2] [,3] [,4]
[1,]  1.0  0.6  0.6  0.6
[2,]  0.6  1.0  0.6  0.6
[3,]  0.6  0.6  1.0  0.6
[4,]  0.6  0.6  0.6  1.0

Code

r_SMDs_raw(delta = delta, J = 4, N = 40, Psi = Psi_k)

         ES        SE  N
1 0.5366188 0.3221629 40
2 0.5439397 0.3223244 40
3 0.1063544 0.3164630 40
4 0.7696820 0.3283213 40

Exercises

The function above is serviceable but quite basic. I can think of several additional features that one might like to have for use in research simulations, but I’m feeling both cheeky and lazy at the moment, so I’ll leave them for you, dear reader. Here are some suggested exercises:

Add an argument to the function, Hedges_g = TRUE, which controls where the simulated effect size is Hedges’ \(g\) or Cohen’s \(d\). If it is Hedges’ g, make sure that the standard error is corrected too.
Add an argument to the function, p_val = TRUE, which allows the user to control whether or not to return \(p\)-values from the test of mean differences for each outcome. Note that the p-values should be for a test of the raw mean differences between groups, rather than a test of the effect size \(\delta_{jk} = 0\).
Add an argument to the function, corr_mat = FALSE, which controls whether the function returns just the simulated effect sizes and SEs or both the simulated effect sizes and the full sampling variance-covariance matrix of the effect sizes. See here for the relevant formulas.

Simulating summary statistics

Another approach to simulating SMDs is to sample from the distribution of the summary statistics used in calculating the effect size. This approach should simplify the code, at the cost of having to use a bit of distribution theory. Let \(\mathbf{\bar{y}}_{Tk}\) and \(\mathbf{\bar{y}}_{Ck}\) be the \(J_k \times 1\) vectors of sample means for the treatment and control groups, respectively. Let \(\mathbf{S}_k\) be the \(J_k \times J_k\) sample covariance matrix of the outcomes, pooled across the treatment and control groups. Again assuming multi-variate normality, and following the same notation as above: \[ \mathbf{\bar{y}}_{Ck} \sim N\left(\mathbf{0}, \frac{2}{N_k} \boldsymbol\Psi_k\right), \qquad \mathbf{\bar{y}}_{Tk} \sim N\left(\boldsymbol\delta_k, \frac{2}{N_k} \boldsymbol\Psi_k\right), \] and \[ \left(\mathbf{\bar{y}}_{Tk} - \mathbf{\bar{y}}_{Ck}\right) \sim N\left(\boldsymbol\delta_k, \frac{4}{N_k} \boldsymbol\Psi_k\right). \] This shows how we could directly simulate the numerator of the standardized mean difference.

A further bit of distribution theory says that the pooled sample covariance matrix follows a multiple of a Wishart distribution with \(N_k - 2\) degrees of freedom and scale matrix \(\Psi_k\): \[ (N_k - 2) \mathbf{S}_k \sim Wishart\left(N_k - 2, \Psi_k \right). \] Thus, to simulate the denominators of the SMD estimates, we can simulate a single Wishart matrix, pull out the diagonal entries, divide by \(N_k - 2\), and take the square root. In all, we draw a single \(J_k \times 1\) observation from a multi-variate normal distribution and a single \(J_k \times J_k\) observation from a Wishart distribution. In contrast, the raw data approach requires simulating \(N_k\) observations from a multi-variate normal distribution, then calculating \(4 J_k\) summary statistics (M and SD for each group on each outcome).

Exercises

Once again, I’ll leave it to you, dear reader, to do the fun programming bits:

Create a modified version of the function r_SMDs_raw that simulates summary statistics instead of raw data (Call it r_SMDs_stats).
Use the microbenchmark package (or your preferred benchmarking tool) to compare the computational efficiency of both versions of the function.
Check your work! Verify that both versions of the function generate the same distributions if the same parameters are used as input.

Which approach is better?

Like many things in research, there’s no clearly superior method here. The advantage of the summary statistics approach is computational efficiency. It should generally be faster than the raw data approach, and if you need to generate 10,000 meta-analysis each with 80 studies in them, the computational savings might add up. On the other hand, computational efficiency isn’t everything.

I see two potential advantages of the raw data approach. First is interpretability: simulating raw data is likely easier to understand. It feels tangible and familiar, harkening back to those bygone days we spent learning ANOVA, whereas the summary statistics approach requires a bit of distribution theory to follow (bookmark this blog post!). Second is extensibility: it is relatively straightforward to extend the approach to use other distributional models for the raw dat (perhaps you want to look at outcomes that follow a multi-variate \(t\) distribution?) or more complicated estimators of the SMD (difference-in-differences? covariate-adjusted? cluster-randomized trial?). To use the summary statistics approach in more complicated scenarios, you’d have to work out the sampling distributions for yourself, or locate the right reference.

Of course, there’s also no need to choose between these two approaches. As I’m trying to hint at in Exercise 6, it’s actually useful to write both. Then, you can use the (potentially slower) raw data version to verify that the summary statistics version is correct.

Simulating full meta-analyses

So far we’ve got a data-generating function that simulates a single study’s worth of effect size estimates. To study meta-analytic methods, we’ll need to build out the function to simulate multiple studies. To do so, I think it’s useful to use the technique of mapping, as implemented in the purrr package’s map_* functions. The idea here is to first generate a “menu” of study-specific parameters for each of \(K\) studies, then apply the r_SMDs function to each parameter set.

Let’s consider how to do this for a simple random effects model, where the true effect size parameter is constant within each study (i.e., \(\boldsymbol\delta_k = (\delta_k \cdots \delta_k)'\)), and in a model without covariates. We’ll need to generate a true effect for each study, along with a sample size, an outcome dimension, and a correlation between outcomes. For the true effects, I’ll assume that \[ \delta_k \sim N(\mu, \tau^2), \] \[ J_k \sim 2 + Poisson(3), \] \[ N_k \sim 20 + 2 \times Poisson(10), \] and \[ r_k \sim Beta\left(\rho \nu, (1 - \rho)\nu\right), \] where \(\rho = \text{E}(r_k)\) and \(\nu > 0\) controls the variability of \(r_k\) across studies, with smaller \(\nu\) corresponding to more variable correlations. Specifically, \(\text{Var}(r_k) = \rho (1 - \rho) / (1 + \nu)\). These distributions are just made up, without any particular justification.

Here’s what these distributional models look like in R code:

Code

K <- 6
mu <- 0.2
tau <- 0.05
J_mean <- 5
N_mean <- 45
rho <- 0.6
nu <- 39

study_data <- 
  data.frame(
    delta = rnorm(K, mean = mu, sd = tau),
    J = 2 + rpois(K, J_mean - 2),
    N = 20 + 2 * rpois(K, (N_mean - 20) / 2),
    Psi = rbeta(K, rho * nu, (1 - rho) * nu)
  )

study_data

      delta J  N       Psi
1 0.2092696 2 38 0.6668007
2 0.2299119 5 42 0.4487449
3 0.0664768 3 54 0.7200542
4 0.1609013 6 34 0.6149950
5 0.2180899 6 54 0.7553845
6 0.1299188 3 58 0.7155828

Once we have the “menu” of study-level characteristics, it’s just a matter of mapping the parameters to the data-generating function. One way to do this is with pmap_df:

Code

library(purrr)
meta_data <- pmap_df(study_data, r_SMDs_raw, .id = "study")
meta_data

   study          ES        SE  N
1      1 -0.12575818 0.3247812 38
2      1 -0.24412399 0.3257160 38
3      2  0.26556471 0.3100317 42
4      2  0.95331328 0.3264938 42
5      2  0.78835910 0.3209470 42
6      2  0.12122926 0.3089042 42
7      2  0.05945635 0.3086783 42
8      3 -0.16817702 0.2726647 54
9      3 -0.28555997 0.2736022 54
10     3  0.10045972 0.2723437 54
11     4 -0.55943777 0.3500532 34
12     4 -0.31311385 0.3452230 34
13     4 -0.29413648 0.3449621 34
14     4 -0.25214567 0.3444422 34
15     4 -0.36021385 0.3459400 34
16     4 -0.53997094 0.3495752 34
17     5 -0.30208044 0.2737727 54
18     5  0.04982397 0.2722094 54
19     5  0.04273584 0.2721978 54
20     5 -0.15579886 0.2725940 54
21     5 -0.24978710 0.2732655 54
22     5 -0.25431832 0.2733056 54
23     6 -0.32438779 0.2643956 58
24     6 -0.14951874 0.2629926 58
25     6 -0.36681929 0.2648904 58

Code

table(meta_data$study)


1 2 3 4 5 6 
2 5 3 6 6 3

Putting it all together into a function, we have

Code

r_meta <- function(K, mu, tau, J_mean, N_mean, rho, nu) {
  require(purrr)
  
  study_data <- 
    data.frame(
      delta = rnorm(K, mean = mu, sd = tau),
      J = 2 + rpois(K, J_mean - 2),
      N = 20 + 2 * rpois(K, (N_mean - 20) / 2),
      Psi = rbeta(K, rho * nu, (1 - rho) * nu)
    )
  
  pmap_df(study_data, r_SMDs_raw, .id = "study")
}

Exercises

Modify r_meta so that it uses r_SMDs_stats.
Add options to r_meta for Hedges_g, p_val = TRUE, and corr_mat = FALSE and ensure that these get passed along to the r_SMDs function.
One way to check that the r_meta function is working properly is to generate a very large meta-analytic dataset, then to verify that the generated distributions align with expectations. Here’s a very large meta-analytic dataset:
Code
```
meta_data <- r_meta(
  100000, 
  mu = 0.2, tau = 0.05, 
  J_mean = 5, N_mean = 40, 
  rho = 0.6, nu = 39
)
```
Compare the distribution of the simulated dataset against what you would expect to get based on the input parameters.
Modify the r_meta function so that \(J_k\) and \(N_k\) are correlated, according to \[ \begin{align} J_k &\sim 2 + Poisson(\mu_J - 2) \\ N_k &\sim 20 + 2 \times Poisson\left(\frac{1}{2}(\mu_N - 20) + \alpha (J_k - \mu_J) \right) \end{align} \] for user-specified values of \(\mu_J\) (the average number of outcomes per study), \(\mu_N\) (the average total sample size per study), and \(\alpha\), which controls the degree of dependence between \(J_k\) and \(N_k\).

A challenge

The meta-analytic model that we’re using here is quite simple—simplistic, even—and for some simulation studies, something more complex might be needed. For example, we might need to generate data from a model that includes within-study random effects, as in: \[ \delta_{jk} = \mu + u_k + v_{jk}, \quad \text{where}\quad u_k \sim N(0, \tau^2), \quad v_{jk} \sim N(0, \omega^2). \] Even more complex would be to simulate from a multi-level meta-regression model \[ \delta_{jk} = \mathbf{x}_{jk} \boldsymbol\beta + u_k + v_{jk}, \quad \text{where}\quad u_k \sim N(0, \tau^2), \quad v_{jk} \sim N(0, \omega^2), \] where \(\mathbf{x}_{jk}\) is a \(1 \times p\) row-vector of covariates describing outcome \(j\) in study \(k\) and \(\boldsymbol\beta\) is a \(p \times 1\) vector of meta-regression coefficients. In past work, I’ve done this by writing a data-generating function that takes a fixed design matrix \(\mathbf{X} = \left(\mathbf{x}_{11}' \cdots \mathbf{x}_{J_K K}'\right)'\) as an input argument, along with \(\boldsymbol\beta\). The design matrix would also include an identifier for each unique study. There are surely better (simpler, easier to follow) ways to implement the multi-level meta-regression model. I’ll once again leave it to you to work out an approach.

Back to top