Performing cross-validation on linear mixed-effects models and hierarchical generalized additive models
model_logocv.Rd
model_logocv performs leave-one-group-out cross-validation on a linear mixed-effects model (class lmerMod or lmerTest) or a hierarchical generalized additive model (class gam) with a single random grouping factor.
Arguments
- model
Model object for cross-validation. Supported classes are "lmerMod" (from lme4::lmer), "lmerTest" (from lmerTest::lmer) and "gam" (from mgcv::gam).
- data
Data frame or tibble containing data used for model calibration. None of the variables used in 'model' can contain NAs.
- group
Name of variable in data used as a random grouping factor in model.
- control
Optional control settings (e.g. from lme4::lmerControl) used when refitting lmer models, to help prevent convergence issues. Default = NULL.
Value
A list of two elements: (i) the RMSE, and (ii) a data frame or tibble containing the input dataset plus an additional column (pred_cv) of cross-validation predicted values.
Details
Leave-one-group-out cross-validation is a re-sampling procedure that can be used to evaluate a model's predictive performance for new levels of a random grouping factor. If that grouping factor is Site, then the function evaluates the model's ability to predict the response at new sites not in the calibration dataset.
One group (site) is omitted to form a separate test set, the model is fitted to the data for the remaining groups (the training set), and the re-fitted model is then used to predict the response variable for the test set. (Note that these predictions use only the fixed terms in the model; random effects are excluded when predicting for new sites.) This process is repeated for each group so that a prediction is generated for every observation in the full dataset. The overall performance of the model is measured by the root mean square error (RMSE), which quantifies the difference between the observed and predicted values. Comparing RMSE values for competing models can help guide model selection.
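In outline, the procedure is equivalent to the following sketch for an lmer model (illustrative only, not the function's internal code; the data, object and column names are taken from the sleepstudy example below):

library(lme4)
fit <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
dat <- sleepstudy
dat$pred_cv <- NA_real_
for (g in unique(dat$Subject)) {
  train <- dat[dat$Subject != g, ]  # training set: all other groups
  test  <- dat[dat$Subject == g, ]  # test set: the held-out group
  refit <- update(fit, data = train)
  # re.form = NA gives fixed-effects-only predictions, as for an unseen group
  dat$pred_cv[dat$Subject == g] <- predict(refit, newdata = test, re.form = NA)
}
rmse <- sqrt(mean((dat$Reaction - dat$pred_cv)^2))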
When refitting the model during cross-validation, all arguments to lmer() and gam() take default values, with the exception of (i) 'REML' (which inherits from the original lmer model object), and (ii) 'control' (which can be set via the 'control' argument for lmer models only).
Not recommended for use on more complex models with multiple (crossed or nested) random grouping factors.
Examples
library(lme4)
library(mgcv)
## Example 1: Cross-validation on linear mixed-effects model
# model1 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
# out1 <- model_logocv(model = model1, data = sleepstudy, group = "Subject")
# out1[[1]] # RMSE
# out1[[2]] # predicted values from cross-validation
# if the model has convergence issues, try a different optimizer
# my_control <- lmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 10000))
# model1b <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy, control = my_control)
# model_logocv(model = model1b, data = sleepstudy, group = "Subject", control = my_control)
## Example 2: Cross-validation on hierarchical generalised additive model
# model2 <- gam(Reaction ~ s(Days) + s(Subject, bs = "re") + s(Days, Subject, bs = "re"), data = sleepstudy)
# model_logocv(model = model2, data = sleepstudy, group = "Subject")
# compare alternative models
# model2b <- gam(Reaction ~ s(Days) + s(Subject, bs = "re"), data = sleepstudy)
# model2c <- gam(Reaction ~ s(Subject, bs = "re"), data = sleepstudy)
# out2 <- model_logocv(model = model2, data = sleepstudy, group = "Subject")
# out2b <- model_logocv(model = model2b, data = sleepstudy, group = "Subject")
# out2c <- model_logocv(model = model2c, data = sleepstudy, group = "Subject")
# out2[[1]]; out2b[[1]]; out2c[[1]]
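## The second element of the returned list can be used to inspect the
## cross-validation predictions directly; a sketch assuming out2 from Example 2:
# pred <- out2[[2]]
# plot(pred$Reaction, pred$pred_cv, xlab = "Observed", ylab = "Predicted (LOGO CV)")
# abline(0, 1)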