
model_cv performs repeated, stratified k-fold cross-validation on a linear mixed-effects model (class: lmerMod or lmerModLmerTest) or a hierarchical generalized additive model (class: gam) with a single random grouping factor.

Usage

model_cv(model, data, group, k = 5, r = 1, control = NULL)

Arguments

model

Model object for cross-validation. Supported classes are "lmerMod" (from lme4::lmer), "lmerModLmerTest" (from lmerTest::lmer), and "gam" (from mgcv::gam).

data

Data frame or tibble containing data used for model calibration. None of the variables used in 'model' can contain NAs.

group

Name of the variable in data used as the random grouping factor in model, supplied as a character string (e.g. "Subject").

k

Number of folds. Default = 5.

r

Number of repeats. Default = 1.

control

Optional control settings (e.g. from lmerControl()) used when refitting, to prevent convergence issues; applies to lmer models only. Default = NULL.

Value

A list of two elements: (i) the RMSE, and (ii) a data frame or tibble containing the input dataset plus r additional columns of cross-validated predictions, one per repeat (named pred_cv#, where # is the repeat number).

Details

k-fold cross-validation is a re-sampling procedure that can be used to evaluate a model’s predictive performance for groups in the calibration dataset. If the grouping factor is Site, then the function evaluates the model’s ability to predict the response at sites in the calibration dataset.

The data are randomly split into k folds. One fold is withheld to form a test set, the model is refitted to the data from the remaining folds (the training set), and the refitted model is then used to predict the response variable for the test set. This is repeated k times, so that a prediction is generated for every observation in the full dataset. If required, the whole process can then be repeated r times to provide a more precise estimate of model performance, albeit at greater computational cost. The overall performance of the model is measured by the root mean square error (RMSE), which quantifies the difference between the observed and predicted values; lower values indicate better predictive performance. When r > 1, the RMSE is calculated using the predicted values across all repeats. Comparing RMSE values for competing models can guide model selection.
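
For illustration, here is a minimal sketch of a single repeat of (unstratified) k-fold cross-validation for an lmer model, computing the RMSE by hand; model_cv automates this loop, with the stratification described below. The fold assignment and object names (folds, pred_cv, fit_i) are illustrative, not part of the package.

library(lme4)

k <- 5
folds <- sample(rep(seq_len(k), length.out = nrow(sleepstudy)))
pred_cv <- numeric(nrow(sleepstudy))

for (i in seq_len(k)) {
  test <- folds == i
  # refit to the training folds, then predict the withheld fold
  fit_i <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy[!test, ])
  # allow.new.levels guards against a group level absent from the training folds
  pred_cv[test] <- predict(fit_i, newdata = sleepstudy[test, ],
                           allow.new.levels = TRUE)
}

sqrt(mean((sleepstudy$Reaction - pred_cv)^2))  # RMSE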

Because lmer and gam models include a random grouping factor (e.g. Site), the cross-validation procedure is stratified to ensure that the observations for each group are split as evenly as possible among the k folds (see the sketch below).
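
As a rough sketch, group-stratified folds can be built by assigning fold labels within each level of the grouping factor; make_stratified_folds is a hypothetical helper written for this illustration, not an exported function of the package.

# assign fold labels within each group, so that each group's observations
# are spread as evenly as possible across the k folds
make_stratified_folds <- function(group, k) {
  folds <- integer(length(group))
  for (g in unique(group)) {
    idx <- which(group == g)
    folds[idx] <- sample(rep(seq_len(k), length.out = length(idx)))
  }
  folds
}

folds <- make_stratified_folds(sleepstudy$Subject, k = 5)
table(sleepstudy$Subject, folds)  # each Subject's 10 rows spread over 5 folds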

There is no hard rule for choosing k, but values of 5 or 10 have been demonstrated empirically to provide a good trade-off between bias and variance in a model’s estimated performance. r takes a default value of 1, but can be increased to yield a more precise estimate of the RMSE.

When refitting the model during cross-validation, all arguments to lmer() and gam() take their default values, with the exception of (i) 'REML' (which is inherited from the original lmer model object), and (ii) 'control' (which can be set via the 'control' argument, for lmer models only).
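
As a hedged sketch of what such a refit might look like internally (refit_lmer and train are hypothetical names for this illustration; formula() and lme4::isREML() recover the original model's formula and REML setting):

refit_lmer <- function(model, train, control = NULL) {
  if (is.null(control)) control <- lme4::lmerControl()
  # reuse the original formula; inherit the REML setting from 'model'
  lme4::lmer(formula(model), data = train,
             REML = lme4::isREML(model), control = control)
}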

This function is not recommended for use on more complex models with multiple (crossed or nested) random grouping factors.

Examples

library(lme4)
#> Loading required package: Matrix
library(mgcv)
#> Loading required package: nlme
#> 
#> Attaching package: 'nlme'
#> The following object is masked from 'package:lme4':
#> 
#>     lmList
#> The following object is masked from 'package:dplyr':
#> 
#>     collapse
#> This is mgcv 1.9-0. For overview type 'help("mgcv-package")'.

## Example 1: Cross-validation on linear mixed-effects model
# model1 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
# out1 <- model_cv(model = model1, data = sleepstudy, group = "Subject", k = 5, r = 1)
# out1[[1]] # RMSE
# out1[[2]] # predicted values from cross-validation

# more precise estimate of RMSE by increasing r
# model_cv(model = model1, data = sleepstudy, group = "Subject", k = 5, r = 10)

# convergence issues, so try different optimizer
# my_control <- lmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 10000))
# model1b <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy, control = my_control)
# model_cv(model = model1b, data = sleepstudy, group = "Subject", k = 5, r = 1, control = my_control)

## Example 2: Cross-validation on hierarchical generalised additive model
# model2 <- gam(Reaction ~ s(Days) + s(Subject, bs = "re") + s(Days, Subject, bs = "re"), data = sleepstudy)
# model_cv(model = model2, data = sleepstudy, group = "Subject", k = 10, r = 1)

# compare alternative models
# model2b <- gam(Reaction ~ s(Days) + s(Subject, bs = "re"), data = sleepstudy)
# model2c <- gam(Reaction ~ s(Subject, bs = "re"), data = sleepstudy)
# out2 <- model_cv(model = model2, data = sleepstudy, group = "Subject", k = 10, r = 1)
# out2b <- model_cv(model = model2b, data = sleepstudy, group = "Subject", k = 10, r = 1)
# out2c <- model_cv(model = model2c, data = sleepstudy, group = "Subject", k = 10, r = 1)
# out2[[1]]; out2b[[1]]; out2c[[1]]