A quirk of nlme::varIdent

Wherein I document how to control the reference level in models that use nlme::varIdent() for specifying a variance structure.
Rstats
programming
hierarchical models
nlme
Author

James E. Pustejovsky

Published

December 28, 2024

The nlme package is a venerable, now quite old package for fitting hierarchical models in R. Compared to the better-known lme4 package, nlme provides greater flexibility for specifying dependence structures at the finest-grain level of the model, such as allowing for heteroskedastic errors or correlated errors that follow an auto-regressive structure within each higher-level unit.1 Due to its quite advanced age (the first entry in the changelog is from December 23, 1999!), it is perhaps not surprising that the package has a number of what one might call quirks—odd, counter-intuitive behaviors that are unlike the behavior of other core model-fitting functions in R. A few of my past posts have documented some of these quirks. Here’s another one.

1 nlme is a minor obsession for me because Several packages that I’ve worked on provide formal enhancements to or are at least very closely connected to it. The clubSandwich package provides cluster-robust standard errors for fitted models, and it supports models fitted with nlme::lme() and nlme::gls(). The lmeInfo package provides analytic derivatives for models fitted with nlme::lme() and nlme::gls(). The scdhlm package for calculating effect sizes from single-case designs relies heavily on nlme::lme() and on lmeInfo::g_mlm().

varIdent()

In nlme’s syntax, varIdent() is used to specify a model in which the variance of the lowest-level error term differs depending on a categorical covariate. I’ll call this a “heteroskedastic error” model. Say that we have repeated measures (indexed by \(h\)) nested within units (indexed by \(i\)), so that \(y_{hi}\) is measurement \(h\) on unit \(i\) for \(h = 1,...,H_i\) and \(i=1,...,N\). Let \(c_{hi} \in \{1,...,C\}\) denote the category of observation \(hi\), and suppose that the last category serves as a reference level. Say we have some model with predictors \(\mathbf{x}_{hi}\) and potentially also with random effects terms described by \(\mathbf{z}_{hi}\): \[ y_{hi} = \mathbf{x}_{hi} \boldsymbol\beta + \mathbf{z}_{hi} \mathbf{u}_i + e_{hi} \] where \[ \text{Var}(e_{hi}) = \sigma^2 \times \exp\left( \sum_{j=1}^{C-1} \lambda_j I(c_{hi} = j)\right). \] The coefficients \(\lambda_1,...,\lambda_{C-1}\) capture how the variance of errors in category \(j\) differ from the variance of errors in category \(C\). The question is, how can one control the reference level (i.e., choose which category is \(C\)) when fitting this model?

Let me give a few examples of models with this feature using Orthodont, one of the standard datasets included in the nlme package. Below I do a bit of tidying up on the factors in this dataset.

library(tidyverse)
library(nlme)
data("Orthodont")

Orthodont <- 
  Orthodont %>%
  as.data.frame() %>%
  mutate(
    age = as.integer(age),
    Subject = factor(Subject, levels = sort(levels(Subject)), ordered = FALSE),
    Sex_rev = fct_rev(Sex),
    Sex3 = factor(if_else(
      Subject %in% c("M11","M12","M13","M14","M15","M16"),
      "Non-binary", 
      Sex), levels = c("Female","Male","Non-binary")
    )
  )

Here is a simple example of a heteroskedastic error model, without any random effects or any other bells or whistles.

gls1 <- gls(
  distance ~ age + Sex, 
  weights = varIdent(form = ~ 1 | Sex),
  data = Orthodont
)
summary(gls1)
Generalized least squares fit by REML
  Model: distance ~ age + Sex 
  Data: Orthodont 
       AIC      BIC    logLik
  494.3251 507.5949 -242.1626

Variance function:
 Structure: Different standard deviations per stratum
 Formula: ~1 | Sex 
 Parameter estimates:
     Male    Female 
1.0000000 0.9378976 

Coefficients:
                Value Std.Error   t-value p-value
(Intercept) 17.811621 1.1119255 16.018718       0
age          0.650648 0.0975569  6.669421       0
SexFemale   -2.321023 0.4396048 -5.279794       0

 Correlation: 
          (Intr) age   
age       -0.965       
SexFemale -0.173  0.000

Standardized residuals:
        Min          Q1         Med          Q3         Max 
-2.58304984 -0.65117352 -0.02923653  0.51805238  2.30992386 

Residual standard error: 2.329342 
Degrees of freedom: 108 total; 105 residual

The reference level of the variance structure is Male.2

2 For ease of exposition, let me make a little helper function that pulls out the reference level of the variance structure.

ref_level <- function(mod) {
  attr(mod$modelStruct$varStruct, "groupNames")[1]
}
ref_level(gls1)
[1] "Male"

Here’s another gls model, this time allowing the errors to be correlated within subject:

gls2 <- update(
  gls1, 
  correlation = corCompSymm(form = ~ 1 | Subject)
)
ref_level(gls2)
[1] "Female"

The reference category has now changed to Female. The same thing happens if I fit a heteroskedastic error model using lme():

lme1 <- lme(
  distance ~ age + Sex, 
  random = ~ 1 | Subject, 
  weights = varIdent(form = ~ 1 | Sex),
  data = Orthodont
)
ref_level(lme1)
[1] "Female"

It would be nice to be able to control the reference level.3

3 I’m not the only one seeking to do this. Someone on the R-SIG-mixed-models listserv had the same query in July of 2013, but none of the responses provide a solution.

What doesn’t work

The usual way that one would specify reference levels is to set a contrast for the factor:

contrasts(Orthodont$Sex) <- contr.treatment(2, base = 1)

But this does not seem to have any effect on the selected reference level of the variance structures:

update(gls1, data = Orthodont) |> ref_level()
[1] "Male"
update(gls2, data = Orthodont) |> ref_level()
[1] "Female"
update(lme1, data = Orthodont) |> ref_level()
[1] "Female"

Nor does using a factor with the levels in reverse order:

update(gls1, weights = varIdent(form = ~ 1 | Sex_rev)) |> ref_level()
[1] "Male"
update(gls2, weights = varIdent(form = ~ 1 | Sex_rev)) |> ref_level()
[1] "Female"
update(lme1, weights = varIdent(form = ~ 1 | Sex_rev)) |> ref_level()
[1] "Female"

Emma Knight suggested on R-SIG-mixed-models that the reference level is selected based on the category with the largest number of observations (which is Male). That could be what determines the reference level in gls1 (though it clearly can’t be so for gls2 or lme). I’ve created another factor called Sex3 where the largest category is Female. If this theory is true, then the reference level of gls1 should change to Female:

update(gls1, weights = varIdent(form = ~ 1 | Sex3)) |> ref_level()
[1] "Male"

Nothing changes with the other models either:

update(gls2, weights = varIdent(form = ~ 1 | Sex3)) |> ref_level()
[1] "Female"
update(lme1, weights = varIdent(form = ~ 1 | Sex3)) |> ref_level()
[1] "Female"

What about the order of the data? There is a bug report documenting that varIdent() depends on the sort order of the categorical grouping variable when used in a gls() call.4 Sorting the data in descending order of Sex3 does affect the reference level for gls1:

4 This StackOverflow question notes the same issue and links to a patch from Ben Bolker.

Orthodont_sort <- arrange(Orthodont, Sex3)
Orthodont_rev <- arrange(Orthodont, desc(Sex3))
head(Orthodont_sort$Sex, 4)
[1] Female Female Female Female
attr(,"contrasts")
       2
Male   0
Female 1
Levels: Male Female
head(Orthodont_rev$Sex3, 4)
[1] Non-binary Non-binary Non-binary Non-binary
Levels: Female Male Non-binary
update(
  gls1, 
  data = Orthodont_sort
) |> 
  ref_level()
[1] "Female"
update(
  gls1, 
  weights = varIdent(form = ~ 1 | Sex3),
  data = Orthodont_rev
) |> 
  ref_level()
[1] "Non-binary"

However, this doesn’t seem to affect gls2 or lme1 at all:

update(
  gls2, 
  data = Orthodont_sort
) |> 
  ref_level()
[1] "Female"
update(
  gls2, 
  weights = varIdent(form = ~ 1 | Sex3), 
  data = Orthodont_rev
) |> 
  ref_level()
[1] "Female"
update(
  lme1, 
  data = Orthodont_sort
) |> 
  ref_level()
[1] "Female"
update(
  lme1, 
  weights = varIdent(form = ~ 1 | Sex3), 
  data = Orthodont_rev
) |> 
  ref_level()
[1] "Female"

Very curious.

How to control the reference level

I posted about this issue to R-SIG-mixed-models with some of the above notes. Sebastian Meyer provided a very helpful response, noting that

lme() internally reorders the data by the grouping factor(s) before initialization, so the order of that factor (‘Subject’ in your example…) will indirectly determine the reference level of varIdent(), regardless of how your data or strata levels are ordered originally.

Similar re-ordering behavior occurs in gls() when the model includes a grouping structure, as it does in gls2 because of the corCompSymm() correlation structure. This insight provides a key for controlling the reference level: for a given categorical variable, the reference level will be taken as the level of the first observation in the first level of the grouping factor. Thus, for models with grouping structure, one can set the reference level to Male by re-ordering the levels of Subject so that a unit with Sex = 'Male' occurs first.

Orthodont$Subject2 <- fct_relevel(Orthodont$Subject, "M01")
table(Orthodont$Subject2) |> head()

M01 F01 F02 F03 F04 F05 
  4   4   4   4   4   4 
update(
  gls2, 
  correlation = corCompSymm(form = ~ 1 | Subject2)
) |> 
  ref_level()
[1] "Male"
update(
  lme1, 
  random = ~ 1 | Subject2
) |> 
  ref_level()
[1] "Male"

For models without grouping structure, the reference level is indeed controlled by sort order, as if there were only a single group. To control the reference level, one would need to sort the data accordingly. For example, to make Female the reference level in gls1, the first observation in the data needs to have Sex = 'Female'. This is so in Orthodont_sort:

update(gls1, data = Orthodont_sort) |> ref_level()
[1] "Female"

What if the categorical variable varies within the grouping factor? For instance, say that we want to fit a model with heteroskedastic errors varying by age of the subject, which varies from 8 to 14 across the repeated measures of each participant. Here’s a basic heteroskedastic error gls():

gls3 <- gls(
  distance ~ age + Sex, 
  weights = varIdent(form = ~ 1 | age),
  data = Orthodont
)

And here is one with a correlation structure:

gls4 <- update(
  gls3, 
  correlation = corCompSymm(form = ~ 1 | Subject)
)

In both cases, the reference level is the youngest age:

c(ref_level(gls3), ref_level(gls4))
[1] "8" "8"

To set this to something else in the first model, just re-arrange the first few rows:

Orthodont2 <- Orthodont[c(4,1,2,3,5:108),]

update(
  gls3, 
  data = Orthodont2
) |> 
  ref_level()
[1] "14"

However, this doesn’t work for gls45 because the first level of the grouping factor is not the first one to appear in the data.6 But we could sort the data by the grouping factor and then by age so that the desired reference level appears first:

5 See

update(gls4, data = Orthodont2) |> ref_level()
[1] "8"

6 The first level of Subject corresponds to rows 65 through 68:

which(levels(Orthodont2$Subject)[1] == Orthodont2$Subject)  
[1] 65 66 67 68
Orthodont2_sort <- 
  Orthodont %>%
  arrange(Subject, desc(age))

head(Orthodont2_sort)
   distance age Subject    Sex Sex_rev   Sex3 Subject2
68     23.0  14     F01 Female  Female Female      F01
67     21.5  12     F01 Female  Female Female      F01
66     20.0  10     F01 Female  Female Female      F01
65     21.0   8     F01 Female  Female Female      F01
72     25.5  14     F02 Female  Female Female      F02
71     24.0  12     F02 Female  Female Female      F02
update(gls4, data = Orthodont2_sort) |> ref_level()
[1] "14"

It’s not elegant, nor particularly intuitive, but it seems to work.

Setting a value

Sebastian also suggested another technique for directly controlling the reference level of the variance structure:

It seems that only if there are more than two strata, the reference level for varIdent() can be chosen via a named initial parameter ‘value’, leaving out the desired reference level, but I haven’t tested if this works as intended with both gls() and lme(). It would then make sense to support such a choice also in the case of only two strata, so for a single parameter.

Setting the argument value of varIdent() specifies initialization values for the variance structure. Sebastian’s suggestion is to specify values for every level of the categorical variable except for the desired reference level. Let me give this a try with the Sex3 variable, which has three distinct levels.

gls_M <- update(
  gls1, 
  weights = varIdent(form = ~ 1 | Sex3)
)

gls_F <- update(
  gls1, 
  weights = varIdent(
    value = c(`Non-binary` = 0.01, Male = 0.01), 
    form = ~ 1 | Sex3
  )
)

gls_NB <- update(
  gls1, 
  weights = varIdent(
    value = c(Female = 0.01, Male = 0.01), 
    form = ~ 1 | Sex3
  )
)

coef(gls_M$modelStruct$varStruct, unconstrained = FALSE)
Non-binary     Female 
 0.7438755  0.8608531 
coef(gls_F$modelStruct$varStruct, unconstrained = FALSE)
Non-binary       Male 
  0.864114   1.161639 
coef(gls_NB$modelStruct$varStruct, unconstrained = FALSE)
  Female     Male 
1.157254 1.344311 

Indeed, this works for models with variance structures but no correlation structures or random effects. Likewise for gls models that include grouping structure:

update(
  gls2, 
  weights = varIdent(form = ~ 1 | Sex3)
) |>
  ref_level()
[1] "Female"
update(
  gls2, 
  weights = varIdent(
    value = c(`Non-binary` = 0.01, Female = 0.01), 
    form = ~ 1 | Sex3
  )
) |>
  ref_level()
[1] "Male"
update(
  gls2, 
  weights = varIdent(
    value = c(Female = 0.01, Male = 0.01), 
    form = ~ 1 | Sex3
  )
) |> 
  ref_level()
[1] "Non-binary"

And for lme models with random effects:

update(lme1, weights = varIdent(form = ~ 1 | Sex3)) |>
  ref_level()
[1] "Female"
update(
  lme1, 
  weights = varIdent(
    value = c(`Non-binary` = 0.01, Female = 0.01), 
    form = ~ 1 | Sex3
  )
) |>
  ref_level()
[1] "Male"
update(
  lme1, 
  weights = varIdent(
    value = c(Female = 0.01, Male = 0.01), 
    form = ~ 1 | Sex3
  )
) |> 
  ref_level()
[1] "Non-binary"

However, as Sebastian noted, it does not work for categories with only two levels (so only one non-reference level):

update(
  gls1, 
  weights = varIdent(
    value = c(Male = 0.01), 
    form = ~ 1 | Sex
  )
) |>
  ref_level()
[1] "Male"
update(
  gls2, 
  weights = varIdent(
    value = c(Female = 0.01), 
    form = ~ 1 | Sex
  )
) |>
  ref_level()
[1] "Female"
update(
  lme1, 
  weights = varIdent(
    value = c(Female = 0.01), 
    form = ~ 1 | Sex
  )
) |>
  ref_level()
[1] "Female"

Thus, this is only a partial solution. I agree with Sebastian that it would be useful to support user-specified levels within the value argument, even for factors with just one non-reference level.

Colophon

sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.4.1 (2024-06-14 ucrt)
 os       Windows 11 x64 (build 22631)
 system   x86_64, mingw32
 ui       RTerm
 language (EN)
 collate  English_United States.utf8
 ctype    English_United States.utf8
 tz       America/Chicago
 date     2024-12-28
 pandoc   3.2 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 ! package     * version date (UTC) lib source
 P cli           3.6.3   2024-06-21 [?] RSPM
 P colorspace    2.1-1   2024-07-26 [?] RSPM
 P digest        0.6.37  2024-08-19 [?] RSPM (R 4.4.0)
 P dplyr       * 1.1.4   2023-11-17 [?] RSPM
 P evaluate      0.23    2023-11-01 [?] RSPM (R 4.4.0)
 P fansi         1.0.6   2023-12-08 [?] RSPM
 P fastmap       1.1.1   2023-02-24 [?] RSPM (R 4.4.0)
 P forcats     * 1.0.0   2023-01-29 [?] RSPM
 P generics      0.1.3   2022-07-05 [?] RSPM
 P ggplot2     * 3.5.1   2024-04-23 [?] RSPM
 P glue          1.8.0   2024-09-30 [?] RSPM (R 4.4.0)
 P gtable        0.3.6   2024-10-25 [?] RSPM (R 4.4.0)
 P hms           1.1.3   2023-03-21 [?] RSPM
 P htmltools     0.5.7   2023-11-03 [?] RSPM (R 4.4.1)
 P htmlwidgets   1.6.4   2023-12-06 [?] RSPM
 P jsonlite      1.8.8   2023-12-04 [?] RSPM
 P knitr         1.47    2024-05-29 [?] RSPM (R 4.4.0)
 P lattice       0.22-5  2023-10-24 [?] RSPM (R 4.4.1)
 P lifecycle     1.0.4   2023-11-07 [?] RSPM
 P lubridate   * 1.9.3   2023-09-27 [?] RSPM
 P magrittr      2.0.3   2022-03-30 [?] RSPM
 P munsell       0.5.1   2024-04-01 [?] RSPM
 P nlme        * 3.1-166 2024-08-14 [?] RSPM
 P pillar        1.9.0   2023-03-22 [?] RSPM
 P pkgconfig     2.0.3   2019-09-22 [?] RSPM
 P purrr       * 1.0.2   2023-08-10 [?] RSPM
 P R6            2.5.1   2021-08-19 [?] RSPM
 P readr       * 2.1.5   2024-01-10 [?] RSPM
   renv          1.0.5   2024-02-29 [1] RSPM (R 4.4.0)
 P rlang         1.1.4   2024-06-04 [?] RSPM
 P rmarkdown     2.27    2024-05-17 [?] RSPM
 P rstudioapi    0.17.1  2024-10-22 [?] RSPM (R 4.4.0)
 P scales        1.3.0   2023-11-28 [?] RSPM
 P sessioninfo   1.2.2   2021-12-06 [?] RSPM (R 4.4.0)
 P stringi       1.8.4   2024-05-06 [?] RSPM
 P stringr     * 1.5.1   2023-11-14 [?] RSPM
 P tibble      * 3.2.1   2023-03-20 [?] RSPM
 P tidyr       * 1.3.1   2024-01-24 [?] RSPM
 P tidyselect    1.2.1   2024-03-11 [?] RSPM
 P tidyverse   * 2.0.0   2023-02-22 [?] RSPM
 P timechange    0.3.0   2024-01-18 [?] RSPM
 P tzdb          0.4.0   2023-05-12 [?] RSPM
 P utf8          1.2.4   2023-10-22 [?] RSPM
 P vctrs         0.6.5   2023-12-01 [?] RSPM
   withr         3.0.2   2024-10-28 [1] RSPM (R 4.4.0)
 P xfun          0.45    2024-06-16 [?] RSPM (R 4.4.0)
 P yaml          2.3.8   2023-12-11 [?] RSPM (R 4.4.0)

 [1] C:/Users/jamespustejovsky/Documents/jepusto-quarto/renv/library/R-4.4/x86_64-w64-mingw32
 [2] C:/Users/jamespustejovsky/AppData/Local/R/cache/R/renv/sandbox/R-4.4/x86_64-w64-mingw32/e0da0d43

 P ── Loaded and on-disk path mismatch.

──────────────────────────────────────────────────────────────────────────────
Back to top