Skip to contents

Introduction

This case study illustrates a typical HE Toolkit workflow using a dataset first analysed by Dunbar et al. (2010) as part of the Environment Agency-funded DRIED-UP2 study. The study used family-level macroinvertebrate LIFE scores and a mix of gauged and modelled mean daily flow data to explore how flow history and channel morphology interact to influence macroinvertebrate communities in upland rivers.

This re-analysis using the HE Toolkit has several differences from Dunbar et al. (2010) as it focuses on a reduced set of 68 macroinvertebrate sampling locations in England, uses LIFE O:E ratios rather than LIFE as the response variable, and uses updated macroinvertebrate and flow data up to 2015.

Two different modelling approaches are applied in this case study: a linear mixed-effect model, and a Generalised Additive Model (GAM). Linear models assume that the explanatory variables are linearly related to the response, whereas GAMs allow relationships between the explanatory variables and the response to be described by smooth curves or linear terms (Wood, 2017). This case study will illustrate how the same explanatory and response variables can be modeled using the two different approaches.

As this case study is intended to illustrate the application of the HE Toolkit, rather than be a comprehensive analysis in its own right, some of the analytical steps have been simplified for brevity.

This case study is optimised for viewing in Chrome.

Set-up

Install and load the HE Toolkit:

#install.packages("remotes")
library(remotes)
# Conditionally install hetoolkit from github
if ("hetoolkit" %in% installed.packages() == FALSE) {
  remotes::install_github("APEM-LTD/hetoolkit")
}
library(hetoolkit)

Install and load the other packages necessary for this case study:

if(!require(pacman)) install.packages("pacman")
pacman::p_load(knitr, DT)
pacman::p_load(lmerTest, gamm4) # for modelling 
pacman::p_load(geojsonio, leaflet, rmapshaper, sf) # for maps
pacman::p_load(remotes, insight, RCurl, htmlTable, GGally, imputeTS, patchwork, ggfortify, plyr, rnrfa)

## Set ggplot theme
theme_set(theme_bw()) 

Metadata file

We start by importing a “metadata” file with the following columns:

  • biol_site_id - a list of BIOSYS site IDs for the 68 macroinvertebrate sampling locations;
  • rhs_survey_id - IDs of paired River Habitat Surveys (RHS) (note: these are survey IDs, not site IDs);
  • flow_site_id - IDs of paired sites for which gauged or modelled flows are available (note that some flow sites are paired with more than one macroinvertebrate site, so there are only 56 unique flow site IDs); and
  • flow_input - a vector specifying where to source the flow data for each site (gauged flows from the National River Flow Archive “NRFA” or Hydrology Data Explorer “HDE”, or modelled flows from a local “RDS” file).
# Import the metadata file containing site id's
cs2_metadata <- readxl::read_excel("data/CaseStudy2/cs2_metadata.xlsx", col_types = "text")

# Get site lists, for use with functions
cs2_rhssurvey <- cs2_metadata$rhs_survey_id
cs2_biolsites <- cs2_metadata$biol_site_id
cs2_flowsites <- cs2_metadata$flow_site_id
cs2_flowinputs <- cs2_metadata$flow_input

Look at the metadata table:

Several standardised column names are used throughout the hetoolkit package and throughout this case study and its associated datasets. These include:

  • biol_site_id = macroinvertebrate sampling site ids.
  • flow_site_id = flow gauging station ids.
  • flow = flow data, as downloaded using the import_flow function
  • rhs_survey_id = River Habitat Survey (RHS) ids (survey id, not site id, in case multiple surveys have been undertaken at a site).

River Habitat Survey data

The import_rhs function allows the user to download River Habitat Survey (RHS) data from Open Data.

The function either:

Data can be optionally filtered by survey ID.

Downloaded raw data files (in .zip format) will be automatically removed from the working directory following the completed execution of the function.

Below, we use our list cs2_rhssurveys to filter the downloaded data.

To enable the RHS data to be joined to the biology and flow data at a later stage, it is necessary to rename the Survey ID column to rhs_survey_id.

# Import RHS data 
cs2_rhs <- hetoolkit::import_rhs(surveys = cs2_rhssurvey)

Clean RHS data

Variables can be renamed, modified and dropped from the data set if unwanted. Below we rename the variable Survey ID to rhs_survey_id to match the standardised column names used throughout the hetoolkit case studies.

# Rename survey.id to rhs_survey_id to facilitate later data joining.
cs2_rhs <- cs2_rhs %>% 
  dplyr::rename(rhs_survey_id = Survey.ID, HMSRBB = Hms.Rsctned.Bnk.Bed.Sub.Score)

# use select function to keep variables of interest
cs2_rhs <- cs2_rhs %>%
  dplyr::select(rhs_survey_id, HMSRBB, HQA)

# Divide HMSRBB by 100, to avoid scaling issues in models later 
cs2_rhs$HMSRBB <- cs2_rhs$HMSRBB/100

View the RHS data:

Environmental data

Import environment data

The import_env function allows the user to download environmental base data from the Environment Agency’s Ecology and Fish Data Explorer.

The function either:

Data can be optionally filtered by site ID.

When saving, the name of rds file is hard-wired to: INV_OPEN_DATA_SITES_ALL.rds.

If saving prior to filtering, the name of the filtered rds file is hard-wired to: INV_OPEN_DATA_SITE_F.rds.

Below, we use our list biolsites to filter the data from EDE.

# Import environment data from the EDE 
cs2_env_data <- hetoolkit::import_env(sites = cs2_biolsites)

Clean Environment data

# We have two sites where a value of 0 in the bed substrate composition has be read in as an NA. Lets change that to 0 to avoid NA's: 
cs2_env_data[38, "SILT_CLAY"] <- 0

# We have 15 sites with no river bed composition data at all, lets remove those sites 
cs2_env_data <- cs2_env_data %>%
  dplyr::filter(!is.na(PEBBLES_GRAVEL) & !is.na(SAND) & !is.na(SILT_CLAY) & !is.na(BOULDERS_COBBLES))

# Load in additional environment data for these sites
additional_env_data <- readRDS("data/CaseStudy2/additional_env_data.RDS")

# Join env data sets together: 
cs2_env_data <- rbind(cs2_env_data, additional_env_data)

# Force conductivity to a numeric data type
cs2_env_data$CONDUCTIVITY <- as.numeric(cs2_env_data$CONDUCTIVITY)

View the environmental data:

Map sites

To make a map of the biology sites, we use the environmental base data downloaded from the EDE using import_env, this gives us the biol_site NGR’s. The NGR’s are translated to full latitude / longitude and matched back to the env_data.

Finally we use mapview to plot the points indicating the sample sites. The points are labelled with the biology sample site ID, the EA area code and the catcchment.

# Convert national grid ref (NGR) to full lat / long from env_data (from import_env function)
# WGS84 is lat/long. 
temp.eastnorths <- osg_parse(cs2_env_data$NGR_10_FIG, coord_system = "WGS84") %>% as_tibble()

## Match to back to env data to give details on map
cs2_env_data_map <- cbind(cs2_env_data, temp.eastnorths) %>%
  dplyr::select(AGENCY_AREA, WATER_BODY, CATCHMENT, WATERBODY_TYPE, biol_site_id, lat, lon) %>%
  dplyr::mutate(label = paste0("<b>", as.character(biol_site_id), "</b><br/>", AGENCY_AREA, "<br/>", CATCHMENT))
## Create map
leaflet() %>%
  leaflet::addTiles() %>%
  leaflet::addMarkers(lng = cs2_env_data_map$lon, lat = cs2_env_data_map$lat, popup = cs2_env_data_map$label,
            options = popupOptions(closeButton = FALSE))

Assess site similarity

Prior to building a hydro-ecological model, it is important identify any sites that could potentially be outliers because they are physically dissimilar to the others sites in the dataset.

The plot_sitepca function performs a Principal Components Analysis (PCA) which reduces a set of site-level environmental variables down to two uncorrelated ‘principal components’ and plots these components as a two-dimensional scatter plot. Sites that are closer together have more similar environmental characteristics. The proportion of the total environmental variation explained by each component is indicated on the axis labels. The first principal component (PC1) is plotted on the horizontal axis, as it represents the most dominant source of variation in the dataset. The environmental gradients represented by the two axes can be interpreted by reference to the arrows (eigenvectors), which show the direction and strength of correlation between the principal components and the individual environmental variables.

hetoolkit::plot_sitepca(data = cs2_env_data, 
                        vars = c("ALTITUDE", "SLOPE", "WIDTH", "DEPTH", "BOULDERS_COBBLES", "PEBBLES_GRAVEL", "SAND", "SILT_CLAY"),
                        eigenvectors = TRUE, 
                        label_by = "biol_site_id") 

Biology data

Import data

The import_inv function imports macroinvertebrate sampling data from the Environment Agency’s Ecology and Fish Data Explorer. The data can either be downloaded from https://environment.data.gov.uk/ecology-fish/downloads/INV_OPEN_DATA.zip or read in from a local .csv or .rds file. The data can be optionally filtered by site ID and sample date.

Below, we use our list cs2_biolsites to filter the data from the EDE. We download data for the period from 1995 (pre-1995 data was excluded due to uncertain QA procedures in place at this time) to 2021.

# Import biology data from the EDE
cs2_biol_data <- hetoolkit::import_inv(source = "parquet", sites = cs2_biolsites, start_date = "1995-01-01", end_date = "2021-12-31") 

# Set biol_site_id to character 
cs2_biol_data$biol_site_id <- as.character(cs2_biol_data$biol_site_id) 

Let’s view the imported data:

Calculate expected scores for macroinvertebrate indices

The predict_indices function mirrors the functionality of the RICT2 model, by using the environmental data downloaded from the EDE to generate expected scores under minimally impacted reference conditions.

To run the predict_indices function, the ALTITUDE, SLOPE, BOULDERS_COBBLES, PEBBLES_GRAVEL, SAND and SILT_CLAY fields must all be complete:

# check for missing data
cs2_env_data %>%
  dplyr::select(ALTITUDE, SLOPE, BOULDERS_COBBLES, PEBBLES_GRAVEL, SAND, SILT_CLAY) %>%
  summarise_all(~sum(is.na(.)))
## # A tibble: 1 × 6
##   ALTITUDE SLOPE BOULDERS_COBBLES PEBBLES_GRAVEL  SAND SILT_CLAY
##      <int> <int>            <int>          <int> <int>     <int>
## 1        0     0                0              0     0         0

Now we can calculate the expected scores using the predict_indices function from the hetoolkit. We will calculate all of the indices and then filter these down to the indices we want to use for this analysis.

# Run the predict function
cs2_predict_data <- hetoolkit::predict_indices(env_data = cs2_env_data, file_format = "EDE")

# Select the columns we want to keep - for this case study we only want to keep the family-level LIFE index
cs2_predict_data <- cs2_predict_data %>%
  dplyr::select(c("biol_site_id", "SEASON", "TL3_LIFE_Fam_DistFam"))

# Rename and re-code the season column 
cs2_predict_data <- cs2_predict_data %>%
  dplyr::rename(Season = SEASON) %>%
  dplyr::mutate(Season = case_when(Season == 1 ~ "Spring",
                                   Season == 2 ~ "Summer",
                                   Season == 3 ~ "Autumn",
                                   Season == 4 ~ "Winter"))

# Select the environmental variables we want to keep 
cs2_env_data <- cs2_env_data %>% 
  dplyr::select(biol_site_id, NGR_10_FIG, ALTITUDE, SLOPE, WIDTH, DEPTH, BOULDERS_COBBLES, PEBBLES_GRAVEL, SAND, SILT_CLAY, ALKALINITY, EASTING, NORTHING)

View the expected scores:

Join Biology and Environmental data with expected indices

Prior to joining the data, it is advisable to average results for any replicate samples collected from the same site in the same month.

# Average LIFE scores from same site, year and season
cs2_biol_data <- cs2_biol_data %>%
  dplyr::group_by(biol_site_id, Year, Season, Month) %>%
  dplyr::summarise(LIFE_F_O = mean(LIFE_FAMILY_INDEX),
                   date = mean(SAMPLE_DATE)) %>%
  dplyr::ungroup()

Calculate O:E ratios

Now we join the observed and expected LIFE scores into a single data frame so that O:E (Observed/Expected) ratios can be calculated.

# Join predicted indices to the invertebrate data by id and season into a new data frame called cs2_biol_all
cs2_biol_all <- dplyr::left_join(cs2_biol_data, cs2_predict_data, by = c("biol_site_id", "Season")) 

# Join the environmental data for later use
cs2_biol_all <-  dplyr::left_join(cs2_biol_all, cs2_env_data, by = "biol_site_id")

# Calculate LIFE O:E scores
cs2_biol_all <- cs2_biol_all %>%
  mutate(LIFE_F_E = TL3_LIFE_Fam_DistFam,
         LIFE_F_OE = LIFE_F_O / LIFE_F_E)
  
cs2_biol_all <- subset(cs2_biol_all, select = c(biol_site_id, Year, Season, Month, date, LIFE_F_OE))
cs2_biol_all$biol_site_id <- as.character(cs2_biol_all$biol_site_id)

Exploratory plots

Let’s have a closer look at the biology data. The following plots show:

  • The number of macroinvertebrate samples per site, by season;
  • The monthly distribution of macroinvertebrate samples; and
  • The distribution of the response variable (LIFE F O:E).
# Number of MI samples by site and season
ggplot(cs2_biol_all, aes(x = factor(biol_site_id, levels = rev(levels(factor(biol_site_id)))), fill = Season)) +
  geom_bar()+
  coord_flip()+
  xlab("Site") +
  ylab("Number of MI samples")

# Monthly distribution of samples 
cs2_biol_all %>%
  ggplot(aes(x = Month)) +
  geom_bar(fill = 'Blue')

# Histogram of LIFE O:E ratios 
ggplot(data = cs2_biol_all, aes(x = LIFE_F_OE)) + 
  geom_histogram(color="black", fill="grey") +
  xlab("LIFE F OE") 

Split data by season

For this case study, the biology data needs to be split into two data frames, Spring and Autumn. The spring samples will be joined to flow statistics for the preceding six month Winter period, and the Autumn samples will be joined to flow statistics for the preceding six month Summer period.

# create cs2_spring_biol
cs2_spring_biol <- cs2_biol_all[cs2_biol_all$Season == "Spring",]

# create cs2_autumn_biol
cs2_autumn_biol <- cs2_biol_all[cs2_biol_all$Season == "Autumn",]

Flow data

Import data

import_flow is a high-level function that calls import_nrfa, import_hde and import_flowfiles to import data for a user-defined list of sites.

Below, we start by importing mean daily flow data from the National River Flow Archive (NRFA) and the Hydrology Data (HDE) for a subset of 22 macroinvertebrate sites.

# Create a new data set only containing sites with flow data to import
cs2_flowinputs <- cs2_metadata[cs2_metadata$flow_input != "RDS",]

# Note that some gauging stations are paired to more than one macroinvertebrate site, so there are 16 unique gauging stations in total
unique(cs2_flowinputs$flow_site_id)

# Import flow data 
cs2_flow1 <- hetoolkit::import_flow(sites = cs2_flowinputs$flow_site_id, 
                                    inputs = cs2_flowinputs$flow_input, 
                                    start_date = "1992-01-01", end_date = "2015-12-31")

# Remove unwanted 'input' column (so that modelled flow data can be appended - see below)
cs2_flow1 <- cs2_flow1 %>%
  dplyr::select(-input)

For the remaining sites we manually load in modelled flow data from a local RDS file:

cs2_flow2 <- readRDS("data/CaseStudy2/cs2_flow2.rds")

Now we append the modelled flows to the gauged flows to give a single data frame:

# Combine cs2_flow1 and cs2_flow2 into one data frame
cs2_flow_data <- rbind(cs2_flow1, cs2_flow2)

Explore gaps in flow data

The plot_heatmap function is designed to visualise and summarise gaps in time series data. It plots time series data for multiple sites as a tiled heat map, and optionally produces tubular summaries of data completeness by time period and site. Although designed for application with flow time series data, it can be applied to any type of numerical data, with or without a time dimension.

In the following example plot_heatmap is used to explore gaps in the time series of mean daily flows, and then summarise how complete the data are in each calendar month.

# Generate a heatmap of mean daily flows
cs2_heatmap <- hetoolkit::plot_heatmap(data = cs2_flow_data, x = "date", y = "flow_site_id", fill = "flow", dual = TRUE)

View a table of completeness statistics from each site:

Impute missing data

The impute_flow function is designed to infill missing records in daily flow time series for one or more sites (gauging stations) using either an interpolation or an equipercentile method. Imputation of missing flow data can improve the later estimation of flow statistics using the calc_flowstats function and aid visualisation of hydro-ecological relationships using the plot_hev function.

Here we use the equipercentile method and automatically select donor sites to infill gaps in mean daily flow data. The donor site to be used for each target site can be specified by the user via the donors argument (see ?impute_flow for more detail).

# Impute flows with equipercentile method
cs2_flow_data <- hetoolkit::impute_flow(cs2_flow_data,
                                        site_col = "flow_site_id",
                                        date_col = "date",
                                        flow_col = "flow",
                                        method = "equipercentile")

Calculate flow statistics

Dunbar et al. (2010) explored the how flows in the preceding six month period influence spring and autumn LIFE O:E ratios. Flow statistics were calculated for winter (October-March) and summer (April-September) periods, and these two sets of statistics were each z-score standardised to indicate whether flows were above or below normal for the time of year. To mirror that analysis, we run the calc_flowstats function twice - first for winter and then for summer - so that the winter and summer flow statistics are standardised separately.

calc_flowstats takes a time series of measured or modelled flows and uses a user-defined moving window to calculate a suite of long-term and time-varying flow statistics for one or more sites (stations).

The function uses the win_start, win_width and win_step arguments to define a moving window, which divides the flow time series into a sequence of time periods. These time periods may be contiguous, non-contiguous or overlapping.

The sequence of time periods continues up to and including the present date, even when this extends beyond the period covered by the input flow dataset, as this facilitates the subsequent joining of flow statistics and ecology data by the join_he function.

It is primarily designed to work with mean daily flows (e.g. as produced by import_flow), but can also be applied to time series data on a longer (e.g. monthly) time step. Regardless, the data should be regularly spaced and the same time step should be used for all sites.

The function also requires site, date and flow data columns to be specified in order to run. The default values for these arguments, “flow_site_id”, “date” and “flow” respectively, match the column headers from the outputs of the import_flow and impute_flow functions. The defaults allow the data from these functions to be passed without editing into calc_flowstats, however the names can be changed as normal to match the data supplied.

This is an updated version of the calc_flowstats function. The first iteration of the function, which employs fixed 6-monthly spring and autumn flow periods, remains available using calc_flowstats_old.

Note that calc_flowstats returns a list of two data frames. The first contains a suite of time-varying flow statistics for every time period at every site, and is extracted using [[1]]. See ?calc_flowstats for definitions of these statistics. The second data frame contains long-term flow statistics and can be extracted using [[2]].

# Calculate flow statistics for summer periods only 
cs2_summer_flowstats <- hetoolkit::calc_flowstats(cs2_flow_data, site_col = "flow_site_id",
                                                  date_col = "date",
                                                  flow_col = "flow",
                                                  win_start = "1993-04-01",
                                                  win_width = "6 months",
                                                  win_step = "12 month")

# Extract time-varying flow statistics from calc_flowstats output
cs2_summer_flowstats1 <- cs2_summer_flowstats[[1]]

# Remove rows where flowstats are NA (post 2015-12-31)
cs2_summer_flowstats1 <- cs2_summer_flowstats1[!is.na(cs2_summer_flowstats1$Q95z),]

# Select variables we want to keep 
cs2_summer_flowstats1 <- cs2_summer_flowstats1 %>%
  dplyr::select(flow_site_id, win_no, start_date, end_date, prop_missing, mean, Q10z, Q95z)

# Save a copy as a RDS file for quick importing
saveRDS(cs2_summer_flowstats1, file = "summer_flowstats1.RDS")

# Calculate flow statistics for winter period only 
cs2_winter_flowstats <- hetoolkit::calc_flowstats(cs2_flow_data, site_col = "flow_site_id",
                                                  date_col = "date",
                                                  flow_col = "flow",
                                                  win_start = "1993-10-01",
                                                  win_width = "6 months",
                                                  win_step = "12 months")

# Extract time-varying flow statistics from calc_flowstats output
cs2_winter_flowstats1 <- cs2_winter_flowstats[[1]]

# Remove rows where flowstats are NA (post 2015-12-31)
cs2_winter_flowstats1 <- cs2_winter_flowstats1[!is.na(cs2_winter_flowstats1$Q95z),]

# Join_he requires unique window id's for each site, as we have ran calc_flowstats twice we add 0.5 to the winter flowstats win_no
cs2_winter_flowstats1$win_no <- cs2_winter_flowstats1$win_no + 0.5

# Select the variables we want to keep
cs2_winter_flowstats1 <- cs2_winter_flowstats1 %>%
  dplyr::select(flow_site_id, win_no, start_date, end_date, prop_missing, mean, Q10z, Q95z)
 
# Save a copy as a RDS file for quick importing
saveRDS(cs2_winter_flowstats1, file = "winter_flowstats1.RDS")

Join hydrological and ecological data

The join_he function links biology data with time-varying flow statistics for one or more antecedent (lagged) time periods (as calculated by the calc_flowstats function) to create a combined data set for hydro-ecology modelling. The function provides two alternative methods for linking biology samples to flow statistics for antecedent time periods: * Method = “A” (default), lag 0 is defined for each biology sample as the most recently finished time period; * Method = “B”, lag 0 is defined as the most recently started flow time period.

To describe the antecedent flow conditions prior to each biology sample, the time periods are labelled relative to the date of the biology sample, with lag 0 representing either the most recently finished (method = “A”) or most recently started (method = “B”) flow time period. The time period immediately prior to the Lag 0 time period is the Lag 1 period, and the period immediately prior to that is the Lag 2 period, and so on. The function allows the user to select which antecedent (lagged) time periods the biology data are joined to, e.g. lag = c(0,1,2).

The function also includes an option to specify the “join_type”:

  • “add_flows” (default) produces a data set of biological metrics (response variables) and flow statistics (predictor variables) for hydro-ecological modelling.

  • “add_biol” produces a time series of flow statistics with associated biological metrics which can be used to assess the coverage of historical flow conditions using the plot_rngflows function.

Below we use the join_he function with method B to join the spring biology data to the most recent winter flow statistics (sw), and the autumn biology data to the most recent summer flow statistics (as). We use both “add_flows” to create a calibration dataset for hydro-ecological modelling (called HE1) and “add_biol” to create a dataset for visualisation (called HE2).

# Join Summer flow stats to Autumn biology data to using "add_flows"
cs2_as_HE1 <- hetoolkit::join_he(cs2_autumn_biol, 
                                 flow_stats = cs2_summer_flowstats1, 
                                 mapping = cs2_metadata, 
                                 lags = 0, 
                                 method = "B", 
                                 join_type = "add_flows")

# Join Autumn biology data to Summer flow stats using "add_biol" 
cs2_as_HE2 <- hetoolkit::join_he(cs2_autumn_biol, 
                                 flow_stats = cs2_summer_flowstats1, 
                                 mapping = cs2_metadata, 
                                 lags = 0,
                                 method = "B", 
                                 join_type = "add_biol")

# Join Winter flow stats to Spring biology data using "add_flows"
cs2_sw_HE1 <- hetoolkit::join_he(cs2_spring_biol, 
                                 flow_stats = cs2_winter_flowstats1, 
                                 mapping = cs2_metadata, 
                                 lags = 0, 
                                 method = "B", 
                                 join_type = "add_flows")

# Join Spring biology data to Winter flow stats using "add_biol"
cs2_sw_HE2 <- hetoolkit::join_he(cs2_spring_biol, 
                                 flow_stats = cs2_winter_flowstats1, 
                                 mapping = cs2_metadata, 
                                 lags = 0, 
                                 method = "B", 
                                 join_type = "add_biol")

And now bind the seasonal data back together, and join the RHS data to the HE1 (model calibration) dataset:

cs2_HE1 <- rbind(cs2_as_HE1, cs2_sw_HE1) %>%
  dplyr::left_join(cs2_rhs, by = "rhs_survey_id") 

# Remove any rows where LIFE_F_OE scores are NA 
cs2_HE1 <- cs2_HE1[!is.na(cs2_HE1$LIFE_F_OE), ]

cs2_HE2 <- rbind(cs2_as_HE2, cs2_sw_HE2)

Exploratory plots

Assess coverage of historical flow conditions

A hydro-ecological model has a greater chance of revealing relationships between biology metrics and flow variables if the biology samples span a wide range of flow conditions. Good coverage of historical flow conditions by the biology sample data also means that the calibrated model is likely to be better at predicting biological responses under high and low flow scenarios.

The plot_rngflows function generates a scatterplot for two flow variables and overlays two convex hulls: one showing the full range of flow conditions experienced historically, and a second convex hull showing the range of flow conditions with associated biology samples. This visualisation helps identify to what extent the available biology data span the full range of flow conditions experienced historically.

In the following example, the Q95z and Q10z flow statistics have been selected as measures of low and high flows within each 6 month antecedent period. The first plot shows all of the data from all of the sites, and the second plot is faceted to show the data separately for each site individually. For the dataset as a whole, the biology samples appear to provide excellent coverage of historical flow conditions.

hetoolkit::plot_rngflows(data = cs2_HE2, flow_stats = c("Q95z_lag0","Q10z_lag0"), biol_metric = "LIFE_F_OE", wrap_by = NULL)

# And group by site id
hetoolkit::plot_rngflows(data = cs2_HE2, flow_stats = c("Q95z_lag0","Q10z_lag0"), biol_metric = "LIFE_F_OE", wrap_by = "biol_site_id", label = "Year")

HEV plots

The plot_hev function generates, for one site of interest, a time series plot of biology sample data and flow summary statistics, known as as a hydroecological validation (HEV) plot. HEV plots provide a visual assessment of trends in the historical data, and an initial guide to possible relationships that could be explored and quantified using a hydroecological model.

The data set created by the join_he function can be used to produce a simple HEV plot:

cs2_HE1 %>%
  filter(biol_site_id == "53819") %>%
  dplyr::select(date, Q95z_lag0, Q10z_lag0, LIFE_F_OE) %>%
  hetoolkit::plot_hev(date_col = "date",
                      flow_stat = c("Q95z_lag0", "Q10z_lag0"),
                      biol_metric = "LIFE_F_OE")

A smarter-looking HEV plot can be created using a little manual data manipulation; this creates a “sky-scraper” plot that illustrates more clearly the time period over which the flow statistics are calculated.

# Join together summer and winter flow statistics
cs2_flowstats1 <- rbind(cs2_summer_flowstats1, cs2_winter_flowstats1)

# Create two-column table mapping biology sites to flow sites 
cs2_mapping <- data.frame(cs2_metadata[,c("biol_site_id", "flow_site_id")])

# Filter biology and flow data sets to sites of interest 
cs2_biol_data_hev <- dplyr::filter(cs2_biol_all, biol_site_id %in% unique(cs2_mapping$biol_site_id))
cs2_flow_data_hev <- dplyr::filter(cs2_flowstats1, flow_site_id %in% unique(cs2_mapping$flow_site_id))

# Create a complete daily time series for each biology site 
cs2_hev_data <- expand.grid(
  biol_site_id = unique(cs2_biol_data_hev$biol_site_id),
  date = seq.Date(as.Date("1993-01-01"), as.Date("2015-12-31"), by = "day"),
  stringAsFactor = FALSE)

# Create Month and Year columns on which to join the 6-monthly flow statistics 
cs2_hev_data <- cs2_hev_data %>%
  mutate(Month = lubridate::month(date),
         Year = lubridate::year(date))

# Join in biology data 
cs2_hev_data <- cs2_hev_data %>%
  left_join(cs2_biol_data_hev, by = c("biol_site_id", "date", "Year", "Month"))

# Join in flow statistics
cs2_hev_data <- join_he(biol_data = cs2_hev_data, flow_stats = cs2_flow_data_hev, mapping = cs2_mapping, method = "A", join_type = "add_biol")

Using this new cs2_hev_data dataset, the following example produces a HEV plot for the LIFE_F_OE response variable for one site of interest. However, if you have multiple response variables, you can use multiplot = TRUE to produce multiple HEV plots. Notice the difference in this HEV plot compared to the previous one.

cs2_hev_data %>%
  filter(biol_site_id == "53819") %>%
  dplyr::select(date, Q95z_lag0, Q10z_lag0, LIFE_F_OE) %>%
  hetoolkit::plot_hev(date_col = "date",
                      flow_stat = c("Q95z_lag0", "Q10z_lag0"),
                      biol_metric = "LIFE_F_OE")

Modelling

Linear Model

Model calibration

Once the biology and flow data have been processed and assembled into a single data frame, spatial and temporal variation in a biological metric of interest can be modelled as a function of one or more flow summary statistics to reveal potential hydroecological relationships. For datasets comprising multiple sites, linear mixed-effects models fitted using the lmer function from the lme4 package have the ability to describe a ‘global’ relationship that applies across all sites, whilst also quantifying the degree of variability in this relationship from site to site.

Following Dunbar et al. (2010), we model variation in LIFE (expressed as an O:E ratio rather than as an observed score to better control for natural variation in invertebrate community composition) as a function of: high (Q10z_lag0) and low (Q95z_lag0) flows in the preceding six-month winter or summer period; Season (spring or autumn); Habitat Quality Assessment (HQA) score, representing instream habitat quality; Habitat Modification Score (HMSRBB) representing the extent and severity of morphological channel alteration; and time (Year, centered to 2005 = 0, termed Year_cen) to account for underlying long-term trends. Selected interaction terms were also included to explore if and how the effect of flow history varies with season, HMSRBB and HQA (see Dunbar et al. (2010) for details).

To reflect the hierarchical structure of the data (samples nested in sites), the models include a random intercept to account for site-to-site variation in mean LIFE O:E. Following Dunbar et al. (2010), we also include a random slope to account for site-specific time trends.

To illustrate the challenge of selecting a preferred model, we fit four competing models of varying complexity and use cross-validation to compare their predictive performance.

# First center the year variable (which aids model convergence)
cs2_HE1$Year_cen <- cs2_HE1$Year - 2005

# And make Spring the base season
cs2_HE1$Season <- factor(cs2_HE1$Season, levels = c("Spring", "Autumn"))

Candidate Models:

Model 1
# Model 1 (Model A in Table 2 of Dunbar et al. 2010)
cs2_mod1 <- lme4::lmer(LIFE_F_OE ~ 
                         Q10z_lag0 +
                         Season + 
                         Q95z_lag0 +                        
                         HQA +
                         Year_cen + 
                         HMSRBB +
                         Q10z_lag0:Season +                 
                         Q95z_lag0:HQA + 
                         #Q95z_lag0:HMSRBB +
                         (Year_cen | biol_site_id), 
                       data = cs2_HE1,
                       control = lmerControl(optimizer="bobyqa", optCtrl = list(maxfun = 10000)))
summary(cs2_mod1)
## Linear mixed model fit by REML ['lmerMod']
## Formula: LIFE_F_OE ~ Q10z_lag0 + Season + Q95z_lag0 + HQA + Year_cen +  
##     HMSRBB + Q10z_lag0:Season + Q95z_lag0:HQA + (Year_cen | biol_site_id)
##    Data: cs2_HE1
## Control: lmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 10000))
## 
## REML criterion at convergence: -5654.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.3601 -0.6234 -0.0211  0.6245  4.0125 
## 
## Random effects:
##  Groups       Name        Variance  Std.Dev. Corr
##  biol_site_id (Intercept) 1.287e-03 0.035872     
##               Year_cen    2.640e-06 0.001625 0.18
##  Residual                 9.533e-04 0.030876     
## Number of obs: 1467, groups:  biol_site_id, 67
## 
## Fixed effects:
##                          Estimate Std. Error t value
## (Intercept)             9.764e-01  2.996e-02  32.589
## Q10z_lag0               1.787e-03  1.259e-03   1.419
## SeasonAutumn           -1.732e-03  1.631e-03  -1.062
## Q95z_lag0               1.652e-02  5.131e-03   3.219
## HQA                     6.623e-04  5.444e-04   1.216
## Year_cen                1.059e-03  2.656e-04   3.988
## HMSRBB                 -1.472e-03  7.499e-04  -1.963
## Q10z_lag0:SeasonAutumn  1.971e-03  1.800e-03   1.095
## Q95z_lag0:HQA          -2.261e-04  9.609e-05  -2.353
## 
## Correlation of Fixed Effects:
##             (Intr) Q10z_0 SsnAtm Q95z_0 HQA    Yer_cn HMSRBB Q10_0:
## Q10z_lag0    0.013                                                 
## SeasonAutmn -0.025 -0.005                                          
## Q95z_lag0    0.026 -0.087 -0.027                                   
## HQA         -0.984 -0.011 -0.002 -0.026                            
## Year_cen     0.068 -0.030  0.004 -0.020 -0.036                     
## HMSRBB      -0.595 -0.009  0.000 -0.004  0.526 -0.018              
## Q10z_lg0:SA -0.014 -0.637  0.084 -0.027  0.011  0.004  0.010       
## Q95z_l0:HQA -0.025  0.054  0.017 -0.981  0.026  0.016  0.004 -0.008
Model 2
# Model 2 (Model B in Table 2 of Dunbar et al. 2010)
cs2_mod2 <- lme4::lmer(LIFE_F_OE ~ 
                         Q10z_lag0 +
                         Season + 
                         Q95z_lag0 +                          
                         HQA +
                         Year_cen + 
                         HMSRBB +
                         Q10z_lag0:Season +                          
                         #Q95z_lag0:HQA + 
                         Q95z_lag0:HMSRBB +
                         (Year_cen | biol_site_id), 
                       data = cs2_HE1,
                       control = lmerControl(optimizer="bobyqa", optCtrl = list(maxfun = 10000)))
summary(cs2_mod2)
## Linear mixed model fit by REML ['lmerMod']
## Formula: LIFE_F_OE ~ Q10z_lag0 + Season + Q95z_lag0 + HQA + Year_cen +  
##     HMSRBB + Q10z_lag0:Season + Q95z_lag0:HMSRBB + (Year_cen |  
##     biol_site_id)
##    Data: cs2_HE1
## Control: lmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 10000))
## 
## REML criterion at convergence: -5650.9
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.3802 -0.6323 -0.0233  0.6256  4.0217 
## 
## Random effects:
##  Groups       Name        Variance  Std.Dev. Corr
##  biol_site_id (Intercept) 1.287e-03 0.035878     
##               Year_cen    2.712e-06 0.001647 0.17
##  Residual                 9.555e-04 0.030911     
## Number of obs: 1467, groups:  biol_site_id, 67
## 
## Fixed effects:
##                          Estimate Std. Error t value
## (Intercept)             0.9750963  0.0300084  32.494
## Q10z_lag0               0.0019424  0.0012586   1.543
## SeasonAutumn           -0.0016731  0.0016330  -1.025
## Q95z_lag0               0.0041032  0.0011280   3.638
## HQA                     0.0006851  0.0005452   1.257
## Year_cen                0.0010626  0.0002683   3.960
## HMSRBB                 -0.0014513  0.0007513  -1.932
## Q10z_lag0:SeasonAutumn  0.0019120  0.0018026   1.061
## Q95z_lag0:HMSRBB        0.0001463  0.0001333   1.097
## 
## Correlation of Fixed Effects:
##             (Intr) Q10z_0 SsnAtm Q95z_0 HQA    Yer_cn HMSRBB Q10_0:
## Q10z_lag0    0.015                                                 
## SeasonAutmn -0.025 -0.006                                          
## Q95z_lag0    0.008 -0.153 -0.045                                   
## HQA         -0.984 -0.012 -0.002 -0.003                            
## Year_cen     0.069 -0.030  0.004 -0.005 -0.037                     
## HMSRBB      -0.595 -0.009  0.000 -0.012  0.526 -0.019              
## Q10z_lg0:SA -0.014 -0.637  0.084 -0.154  0.011  0.004  0.010       
## Q95_0:HMSRB -0.005 -0.008 -0.004 -0.458  0.002 -0.033  0.021 -0.011
Model 3
# Model 3 (no interactions)
cs2_mod3 <- lme4::lmer(LIFE_F_OE ~ 
                         Q10z_lag0 +
                         Season + 
                         Q95z_lag0 +                          
                         HQA +
                         Year_cen + 
                         HMSRBB +
                         #Q10z_lag0:Season +                          
                         #Q95z_lag0:HQA + 
                         #Q95z_lag0:HMSRBB +
                         (Year_cen | biol_site_id), 
                       data = cs2_HE1,
                       control = lmerControl(optimizer="bobyqa", optCtrl = list(maxfun = 10000)))
summary(cs2_mod3)
## Linear mixed model fit by REML ['lmerMod']
## Formula: LIFE_F_OE ~ Q10z_lag0 + Season + Q95z_lag0 + HQA + Year_cen +  
##     HMSRBB + (Year_cen | biol_site_id)
##    Data: cs2_HE1
## Control: lmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 10000))
## 
## REML criterion at convergence: -5675.3
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.4485 -0.6337 -0.0169  0.6230  4.0265 
## 
## Random effects:
##  Groups       Name        Variance  Std.Dev. Corr
##  biol_site_id (Intercept) 1.292e-03 0.035940     
##               Year_cen    2.772e-06 0.001665 0.17
##  Residual                 9.551e-04 0.030905     
## Number of obs: 1467, groups:  biol_site_id, 67
## 
## Fixed effects:
##                Estimate Std. Error t value
## (Intercept)   0.9756205  0.0300374  32.480
## Q10z_lag0     0.0028179  0.0009700   2.905
## SeasonAutumn -0.0018128  0.0016269  -1.114
## Q95z_lag0     0.0048603  0.0009862   4.928
## HQA           0.0006796  0.0005458   1.245
## Year_cen      0.0010733  0.0002703   3.972
## HMSRBB       -0.0014763  0.0007519  -1.963
## 
## Correlation of Fixed Effects:
##             (Intr) Q10z_0 SsnAtm Q95z_0 HQA    Yer_cn
## Q10z_lag0    0.007                                   
## SeasonAutmn -0.024  0.062                            
## Q95z_lag0    0.003 -0.383 -0.038                     
## HQA         -0.984 -0.007 -0.003  0.000              
## Year_cen     0.069 -0.036  0.003 -0.022 -0.037       
## HMSRBB      -0.595 -0.003 -0.001  0.000  0.526 -0.018
Model 4
# Model 4 (no interactions and no random slope)
cs2_mod4 <- lme4::lmer(LIFE_F_OE ~ 
                         Q10z_lag0 +
                         Season + 
                         Q95z_lag0 +                          
                         HQA +
                         Year_cen + 
                         HMSRBB +
                         #Q10z_lag0:Season +                          
                         #Q95z_lag0:HQA + 
                         #Q95z_lag0:HMSRBB +
                         (1 | biol_site_id), 
                       data = cs2_HE1,
                       control = lmerControl(optimizer="bobyqa", optCtrl = list(maxfun = 10000)))
summary(cs2_mod4)
## Linear mixed model fit by REML ['lmerMod']
## Formula: LIFE_F_OE ~ Q10z_lag0 + Season + Q95z_lag0 + HQA + Year_cen +  
##     HMSRBB + (1 | biol_site_id)
##    Data: cs2_HE1
## Control: lmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 10000))
## 
## REML criterion at convergence: -5617.1
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.9218 -0.6263 -0.0226  0.6255  3.8326 
## 
## Random effects:
##  Groups       Name        Variance Std.Dev.
##  biol_site_id (Intercept) 0.001321 0.03635 
##  Residual                 0.001045 0.03233 
## Number of obs: 1467, groups:  biol_site_id, 67
## 
## Fixed effects:
##                Estimate Std. Error t value
## (Intercept)   0.9808000  0.0305782  32.075
## Q10z_lag0     0.0021135  0.0009900   2.135
## SeasonAutumn -0.0019872  0.0016991  -1.170
## Q95z_lag0     0.0052440  0.0010117   5.183
## HQA           0.0005446  0.0005567   0.978
## Year_cen      0.0007905  0.0001274   6.205
## HMSRBB       -0.0015364  0.0007683  -2.000
## 
## Correlation of Fixed Effects:
##             (Intr) Q10z_0 SsnAtm Q95z_0 HQA    Yer_cn
## Q10z_lag0    0.003                                   
## SeasonAutmn -0.025  0.060                            
## Q95z_lag0    0.015 -0.381 -0.036                     
## HQA         -0.984 -0.002 -0.003 -0.010              
## Year_cen     0.039 -0.045 -0.004  0.012 -0.037       
## HMSRBB      -0.593  0.001  0.000 -0.009  0.523 -0.005


To aid model selection, the hetoolkit package includes two functions for performing cross-validation.

model_cv performs repeated, stratified k-fold cross-validation, which can be used to evaluate a model’s ability to predict the response at sites in the calibration dataset. Alternatively, model_logocv performs leave-one-group-out cross-validation, which can be used to evaluate a model’s ability to predict the response at sites not in the calibration dataset. Both functions can be applied to a linear mixed-effects model (class: lmerMod) or a hierarchical generalized additive model (class: gam) model with a single random grouping factor. See ?model_cv and ?model_logocv for details.

model_cv and model_logocv measure the performance of the model under different situations, and so will not necessarily agree on which is the best model. If the priority is to be able to predict the response at sites in the calibration dataset, then use model_cv; if the priority is to be able to predict the response at sites not in the calibration dataset, then use model_logocv.

Here we use model_cv to compare the four candidate models fitted above. Predictive performance is measured using the Root Mean Square Error (RMSE); lower values indicate better predictions. r is set at 10 to give stable RMSE estimates and control is used to help prevent model convergence issues.

my_control = lmerControl(optimizer="bobyqa", optCtrl = list(maxfun = 10000))

# Model 1
cs2_cv_mod1 <- hetoolkit::model_cv(model = cs2_mod1,
                               data = cs2_HE1,
                               group = "biol_site_id",
                               k = 5,
                               r = 10, 
                               control = my_control)

# Model 2
cs2_cv_mod2 <- hetoolkit::model_cv(model = cs2_mod2,
                               data = cs2_HE1,
                               group = "biol_site_id",
                               k = 5,
                               r = 10,
                               control = my_control)

# Model 3
cs2_cv_mod3 <- hetoolkit::model_cv(model = cs2_mod3,
                               data = cs2_HE1,
                               group = "biol_site_id",
                               k = 5,
                               r = 10, 
                               control = my_control)

# Model 4
cs2_cv_mod4 <- hetoolkit::model_cv(model = cs2_mod1,
                               data = cs2_HE1,
                               group = "biol_site_id",
                               k = 5,
                               r = 10,
                               control = my_control)

Model 1 has the lowest RMSE, so is the preferred model.

# Let's compare the RMSE results 
cs2_cv_mod1[[1]]
## [1] 0.03233307
cs2_cv_mod2[[1]]
## [1] 0.03235574
cs2_cv_mod3[[1]]
## [1] 0.0323448
cs2_cv_mod4[[1]]
## [1] 0.03232345

Model interpretation

## Linear mixed model fit by REML ['lmerMod']
## Formula: LIFE_F_OE ~ Q10z_lag0 + Season + Q95z_lag0 + HQA + Year_cen +  
##     HMSRBB + Q10z_lag0:Season + Q95z_lag0:HQA + (Year_cen | biol_site_id)
##    Data: cs2_HE1
## Control: lmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 10000))
## 
## REML criterion at convergence: -5654.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.3601 -0.6234 -0.0211  0.6245  4.0125 
## 
## Random effects:
##  Groups       Name        Variance  Std.Dev. Corr
##  biol_site_id (Intercept) 1.287e-03 0.035872     
##               Year_cen    2.640e-06 0.001625 0.18
##  Residual                 9.533e-04 0.030876     
## Number of obs: 1467, groups:  biol_site_id, 67
## 
## Fixed effects:
##                          Estimate Std. Error t value
## (Intercept)             9.764e-01  2.996e-02  32.589
## Q10z_lag0               1.787e-03  1.259e-03   1.419
## SeasonAutumn           -1.732e-03  1.631e-03  -1.062
## Q95z_lag0               1.652e-02  5.131e-03   3.219
## HQA                     6.623e-04  5.444e-04   1.216
## Year_cen                1.059e-03  2.656e-04   3.988
## HMSRBB                 -1.472e-03  7.499e-04  -1.963
## Q10z_lag0:SeasonAutumn  1.971e-03  1.800e-03   1.095
## Q95z_lag0:HQA          -2.261e-04  9.609e-05  -2.353
## 
## Correlation of Fixed Effects:
##             (Intr) Q10z_0 SsnAtm Q95z_0 HQA    Yer_cn HMSRBB Q10_0:
## Q10z_lag0    0.013                                                 
## SeasonAutmn -0.025 -0.005                                          
## Q95z_lag0    0.026 -0.087 -0.027                                   
## HQA         -0.984 -0.011 -0.002 -0.026                            
## Year_cen     0.068 -0.030  0.004 -0.020 -0.036                     
## HMSRBB      -0.595 -0.009  0.000 -0.004  0.526 -0.018              
## Q10z_lg0:SA -0.014 -0.637  0.084 -0.027  0.011  0.004  0.010       
## Q95z_l0:HQA -0.025  0.054  0.017 -0.981  0.026  0.016  0.004 -0.008

The preferred model with the lowest RMSE (model 1) included the fixed predictors Q10z_lag0, Q95z_lag0, Season, HQA, Year_cen, and HMSRBB, plus two interaction terms, while the random effects included intercept and Year_cen varying by site. The diagnostic plots (see below) indicate that all of these effects were approximately linear.

On average, LIFE O:E was lower in autumn than spring, increased with habitat diversity (HQA score) and decreased with the extent of habitat modification (HMSRBB score). There was also an underlying long-term increasing trend across all sites (Year_cen) of ca 0.03. After controlling for these influences, LIFE O:E was positively associated with the Q95z flow in the preceding six-month winter or summer period and to a lesser extent with the Q10z flow. The effect of Q10z was stronger in autumn than in spring (Q10z * season interaction), and the effect of Q95z decreased with the HQA score (Q95z * HQA interaction), indicating that macroinvertebrate communities at sites with more diverse habitat were less sensitive to antecedent low flows.

The random effects estimates showed a relatively large variance in the intercept among the 68 sites, indicating a high degree of unexplained spatial variation, but only a small variance in the Year_cen effect, indicating a similar long-term trend across all sites. The random intercepts and slopes were only weakly (r = 0.16) correlated.

In spite of the use of a smaller dataset (68 vs 87 sites) and LIFE O:E ratios in preference to LIFE scores, these results are broadly consistent with those reported by Dunbar et al. (2010).

Model diagnostics

The diag_lmer function produces a variety of diagnostic plots for a mixed-effects regression (lmer) model. Below we create model diagnostic plots for Model 1 as it had the lowest RMSE score.

# generate diagnostic plots 
cs2_diag_plots <- diag_lmer(
 model = cs2_mod1, 
 data = cs2_HE1, 
 facet_by = NULL)

These include:

  1. Fitted vs observed values. This plot shows the relationship between the fitted (predicted) and observed (true) values of the response variable. If the model fits the data well, the points should be clustered around the diagonal 1:1 line, which represents a perfect fit (fitted=observed). Here we can see the model provides a reasonably good fit to the data, but with a slight tendency to under-predict high LIFE O:E ratios and over-predict low LIFE O:E ratios (which is very common in regression modelling).

  1. Normal probability plot. This is a graphical technique for verifying the assumption of a normally distributed error structure. The model residuals are plotted against a theoretical normal distribution in such a way that the points should form an approximate straight line. Here we can see that the residuals conform very closely to a normal distribution, with only small deviations at the extremities.

  1. Residuals vs fitted. This plot checks the assumption of homogeneous variances, which states that the error variance should be constant regardless of what value the response takes. Here we can see that the residuals have a similar amount of vertical scatter at all fitted values of the response, so this assumption is verified.

  1. Histogram of residuals. This is an alternative to plot 2 for verifying the assumption of normally distributed errors. Here we can see that the probability distribution of the model residuals (shown in pink) is very similar to a perfect normal distribution.

  1. Residuals vs model fixed predictors. To verify the assumption of linearity, the model’s residuals are plotted against each of the fixed predictors in turn. If the relationship between the response and each predictor is truly linear, then there should be no trend in the residuals. Here we can see that the linear (red) line fitted to the residuals is horizontal, indicating that the strength of the linear effects has not been over-or under-estimated. Also in most cases the loess-smoothed (blue) line is close to horizontal; there is a small departure from linearity for Q95z_lag0, Q10z_lag0 and Year_cen, but for Q95z_lag0 and Q10z_lag0 this is due to a small number of unusually low/high data points and in all cases the deviations are small, so in this example a linear model appears to provide a reasonable representation of these relationships.

Other useful diagnostic plots for lmer models include:

  • Caterpillar plots, which plot the random intercepts and slopes for each site as deviations from the global intercept and slope. Sites coloured blue are those where, all else being equal, the intercept or slope is higher than average, and red sites have lower-than-average intercepts and slopes. This confirms that there is reasonably large (+/- 0.1) variation in LIFE O:E ratios from site to site that cannot be explained by the other, fixed effects in the model. The random variance for Year_cen is much smaller, but this still translates into important differences in time trends among sites (see below for a clearer visualisation).
sjPlot::plot_model(model = cs2_mod1, 
                   type = "re", 
                   facet.grid = FALSE, 
                   free.scale = FALSE, 
                   title = NULL, 
                   vline.color = "darkgrey") + 
  ylim(-0.15,0.15)

  • Partial effects plots, for visualising the fixed effects that are common to all sites; these show the modelled effect of each predictor in turn when other continuous predictors are held constant at their mean values and factors are fixed at their reference level (i.e. spring for Season). This confirms the positive associations with low flows (Q95z) and habitat quality (HQA) and the negative effect of bed and bank resectioning (HMSRBB).
sjPlot::plot_model(model = cs2_mod1, 
                   type = "pred",
                   pred.type = "fe",
                   title = NULL, terms= "Q95z_lag0")

sjPlot::plot_model(model = cs2_mod1, 
                   type = "pred",
                   pred.type = "fe",
                   title = NULL, terms= "HQA")

sjPlot::plot_model(model = cs2_mod1, 
                   type = "pred",
                   pred.type = "fe",
                   title = NULL, terms= "HMSRBB")

  • Interaction plots, showing how the effect of one predictor depends upon the value of another predictor. Here we can see that the effect of high flows (Q10z_lag0) is marginally stronger in autumn than in spring, and that sites with poorer habitat quality (low HQA scores) are more sensitive to low flows (Q95z_lag0).
sjPlot::plot_model(model = cs2_mod1,
                   type = "pred",
                   terms = c("Q10z_lag0", "Season"))

sjPlot::plot_model(model = cs2_mod1,
                   type = "pred", 
                   terms = c("Q95z_lag0", "HQA [35, 50, 65]"))

  • Finally, the mixed-effects estimates of the time trend at each site can be extracted and plotted. For reference, the vertical dashed blue line is the global slope (i.e. the average time trend of 0.001 per year - see the output from summary() above). This shows that, after accounting for interannual variation in flow, the majority of sites display a trend of increasing LIFE O:E ratios over time, but that a minority of sites have a decreasing trend.
cs2_mod.coef <- coef(cs2_mod1)[[1]]

cs2_mod.coef$biol_site_id <- as.character(row.names(cs2_mod.coef))

ggplot(cs2_mod.coef, aes(x = biol_site_id, y = Year_cen)) +
  geom_point(colour = "orange")+
  coord_flip() + 
  geom_hline(yintercept = 0, linetype = "dashed", colour = "black") + 
  geom_hline(yintercept = 0.0011, linetype = "dashed", colour = "blue")

Generalised Additive Model

In addition to a linear model, we also demonstrate how you may apply a generalised additive model for analysis of hydroecological data.

Hierarchical Generalised Additive Models (HGAMs), implemented using the mgcv library were used to model spatial and temporal variation in LIFE O:E. HGAMs allow relationships between the explanatory variables and the response to be described by smooth curves or linear terms (Wood, 2017), whilst also incorporating terms to represent unexplained site-to-site variation (Pedersen et al., 2019). By flexibly describing non-linear relationships nonparametrically, without making a priori assumptions about the form of the relationship, HGAMs can be a suitable alternative to more simple, less flexible linear models.

Linear models 1 and 2 were re-fitted using the gam() function from mgcv. The same explanatory variables and interaction terms were tested, but incorporated into the gam() model structure. The mixture of continuous and categorical predictors was included within the full models as fixed effects, with continuous predictors fitted as smoothed terms (using the s() notation).

To model interactions of continuous variables, the functions, te() and ti(), can be used. The function te() is applicable when variables are not on the same scale, and when interactions include main effects. The function ti() is used for modelling interaction surfaces that do not include the main effects, under the assumption that they will be included separately. In models 1 and 2, below, interactions between two smoothed terms (i.e Q95z_lag0 * HQA, or Q95z_lag0 * HMSRBB) were therefore specified using the notation ti(), so the main effects could be evaluated separately from the interaction term.

In addition, a ‘factor-smooth’ term (Pedersen et al. 2019) was used to model random site-specific intercepts and time trends; the inclusion of a ‘factor-smooth’ term allowed the temporal trend to vary from site to site, and is specified in place of the random effects included within the linear model structures.

Candidate Models:

Model 1
# model 1 (includes interactions)
cs2_gam1 <- mgcv::gam(LIFE_F_OE ~ 
                     Season +
                     s(Year_cen) +
                     s(Q10z_lag0) +
                     s(Q95z_lag0) +
                     s(Q10z_lag0, by = Season, m = 1) +
                     s(HQA) +
                     s(HMSRBB) +
                     ti(Q95z_lag0, HQA) +
                     #ti(Q95z_lag0, HMSRBB) +
                     s(Year_cen, biol_site_id, bs = "fs"),
                    data = cs2_HE1,  
                    #select = TRUE,
                    method = 'REML')

#saveRDS(cs2_gam1, "data/cs2_gam1.RDS")
summary(cs2_gam1)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## LIFE_F_OE ~ Season + s(Year_cen) + s(Q10z_lag0) + s(Q95z_lag0) + 
##     s(Q10z_lag0, by = Season, m = 1) + s(HQA) + s(HMSRBB) + ti(Q95z_lag0, 
##     HQA) + s(Year_cen, biol_site_id, bs = "fs")
## 
## Parametric coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.004595   0.004370 229.882   <2e-16 ***
## SeasonAutumn -0.001616   0.001589  -1.017    0.309    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                 edf  Ref.df      F  p-value    
## s(Year_cen)               3.896e+00   4.785  7.748 1.70e-06 ***
## s(Q10z_lag0)              1.642e+00   2.010  2.727   0.0668 .  
## s(Q95z_lag0)              1.001e+00   1.002 15.486 8.73e-05 ***
## s(Q10z_lag0):SeasonSpring 6.938e-04   8.000  0.000   0.9904    
## s(Q10z_lag0):SeasonAutumn 2.868e+00   8.000  0.769   0.0343 *  
## s(HQA)                    1.000e+00   1.000  0.070   0.7916    
## s(HMSRBB)                 3.228e+00   3.282  3.180   0.0121 *  
## ti(Q95z_lag0,HQA)         1.002e+00   1.004  3.746   0.0530 .  
## s(Year_cen,biol_site_id)  1.205e+02 553.000  3.288  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.637   Deviance explained = 67.1%
## -REML = -2863.8  Scale est. = 0.00089614  n = 1467
Model 2
# model 2 (includes interactions)
cs2_gam2 <- mgcv::gam(LIFE_F_OE ~ 
                     Season +
                     s(Year_cen) +
                     s(Q10z_lag0) +
                     s(Q95z_lag0) +
                     s(Q10z_lag0, by = Season, m = 1) +
                     s(HQA) +
                     s(HMSRBB) +
                     #ti(Q95z_lag0, HQA) +
                     ti(Q95z_lag0, HMSRBB) +
                     s(Year_cen, biol_site_id, bs = "fs"),
                    data = cs2_HE1,  
                    #select = TRUE,
                    method = 'REML')

#saveRDS(cs2_gam2, "data/cs2_gam2.RDS")
summary(cs2_gam2)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## LIFE_F_OE ~ Season + s(Year_cen) + s(Q10z_lag0) + s(Q95z_lag0) + 
##     s(Q10z_lag0, by = Season, m = 1) + s(HQA) + s(HMSRBB) + ti(Q95z_lag0, 
##     HMSRBB) + s(Year_cen, biol_site_id, bs = "fs")
## 
## Parametric coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.004460   0.004375 229.572   <2e-16 ***
## SeasonAutumn -0.001511   0.001590  -0.951    0.342    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                 edf  Ref.df      F  p-value    
## s(Year_cen)               3.969e+00   4.870  7.914 1.10e-06 ***
## s(Q10z_lag0)              1.666e+00   2.042  2.776   0.0643 .  
## s(Q95z_lag0)              1.003e+00   1.006 15.422 8.96e-05 ***
## s(Q10z_lag0):SeasonSpring 9.427e-04   8.000  0.000   0.9830    
## s(Q10z_lag0):SeasonAutumn 2.881e+00   8.000  0.773   0.0338 *  
## s(HQA)                    1.001e+00   1.001  0.064   0.8001    
## s(HMSRBB)                 3.203e+00   3.257  3.222   0.0117 *  
## ti(Q95z_lag0,HMSRBB)      2.377e+00   3.304  0.978   0.4627    
## s(Year_cen,biol_site_id)  1.199e+02 554.000  3.258  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.637   Deviance explained = 67.1%
## -REML = -2862.9  Scale est. = 0.00089688  n = 1467


Similar to the linear models, we use model_cv to compare the two candidate GAMs fitted above. model_cv performs repeated, stratified k-fold cross-validation, which can be used to evaluate a model’s ability to predict the response at sites in the calibration dataset. Predictive performance is measured using the Root Mean Square Error (RMSE); lower values indicate better predictions. r is set at 10 to give stable RMSE estimates and control is used to help prevent model convergence issues.

my_control = lmerControl(optimizer="bobyqa", optCtrl = list(maxfun = 10000))

# Model 1
cs2_cv_gam1 <- hetoolkit::model_cv(model = cs2_gam1,
                               data = cs2_HE1,
                               group = "biol_site_id",
                               k = 5,
                               r = 10, 
                               control = my_control)

# Model 2
cs2_cv_gam2 <- hetoolkit::model_cv(model = cs2_gam2,
                               data = cs2_HE1,
                               group = "biol_site_id",
                               k = 5,
                               r = 10,
                               control = my_control)

RMSE is very similar for both models, and is only marginally lower for model 1.

# Let's compare the RMSE results 
cs2_cv_gam1[[1]]
## [1] 0.03212151
cs2_cv_gam2[[1]]
## [1] 0.03214353

In this instance, we also compared the Akaike Information Criterion (AIC) of model 1 and model 2. AIC is a relative measure of how well a model fits a set of data. A lower AIC value indicates a better-fitting model. Model 2 has a notably lower AIC, compared to model 1.

Model 2 was therefore chosen as the final model.

# Let's compare the AIC 
AIC(cs2_gam1, cs2_gam2)
##                df       AIC
## cs2_gam1 148.3127 -5978.682
## cs2_gam2 152.5935 -5969.910

Model interpretation

The final model (model 2) included the fixed predictors Q10z_lag0, Q95z_lag0, Season, HQA, Year_cen, and HMSRBB, plus two interaction terms, and a factor-smooth term that allowed the temporal trend to vary from site-to-site.

On average, LIFE O:E was lower in autumn than spring. LIFE O:E decreased with the extent of habitat modification (HMSRBB score), with the steepest decrease in LIFE O:E observed when HMSRBB was greater than 20, and habitat quality (HQA) demonstrated a weak positive relationship with LIFE O:E. There was also an underlying long-term increasing trend across all sites (Year_cen); however, in contrast to the linear model, the HGAM found that LIFE O:E scores peaked in ~2005, and decreased slightly after this - although more recent temporal change (~2010-onwards) in LIFE O:E is associated with a higher degree of uncertainty.

After controlling for these influences, LIFE O:E was positively associated with the Q95z flow in the preceding six month winter or summer period and to a lesser extent with the Q10z flow, similar to the linear model. While Q95z demonstrated a linear relationship with LIFE O:E, Q10z demonstrated a slightly non-linear relationship, and LIFE O:E gradually plateaued at the highest antecedent flows; however, at the highest antecedent flows, the flow-ecology relationship was associated with a higher degree of uncertainty. The effect of Q10z was stronger in autumn than in spring (s(Q10z_lag0, by = season) interaction).

The effect of Q95z was strongest at sites with a high HMSRBB score (ti(Q95z_lag0, HMSRBB) interaction), indicating that macroinvertebrate communities at sites with more modified habitat were more sensitive to antecedent low flows, while sites with more natural habitat demonstrated lesser sensitivity to change in antecedent flow. At modified sites, LIFE O:E was positively associated with the Q95z flow in the preceding six months, whereas at sites with a lesser degree of modification, LIFE O:E remained relatively unchanged regardless of antecedent flow conditions. However, the magnitude of change in LIFE O:E associated with the interaction term (ti(Q95z_lag0, HMSRBB)) was weaker when compared to the individual fixed effects (s(Q95z_lag0) and s(HMSRBB)); with LIFE O:E showing a maximum change of ~0.01 for the interaction term, and a maximum change of ~0.03 and ~0.1 for the Q95z and HMSRBB main effects, respectively.

The factor-smooth estimates showed a relatively large variance in the intercept among the 68 sites, indicating a high degree of spatial variation.

viz.gam <- mgcViz::getViz(cs2_gam2)
print(plot(viz.gam, allTerms = TRUE), pages = 1)

summary(cs2_gam2)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## LIFE_F_OE ~ Season + s(Year_cen) + s(Q10z_lag0) + s(Q95z_lag0) + 
##     s(Q10z_lag0, by = Season, m = 1) + s(HQA) + s(HMSRBB) + ti(Q95z_lag0, 
##     HMSRBB) + s(Year_cen, biol_site_id, bs = "fs")
## 
## Parametric coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.004460   0.004375 229.572   <2e-16 ***
## SeasonAutumn -0.001511   0.001590  -0.951    0.342    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                 edf  Ref.df      F  p-value    
## s(Year_cen)               3.969e+00   4.870  7.914 1.10e-06 ***
## s(Q10z_lag0)              1.666e+00   2.042  2.776   0.0643 .  
## s(Q95z_lag0)              1.003e+00   1.006 15.422 8.96e-05 ***
## s(Q10z_lag0):SeasonSpring 9.427e-04   8.000  0.000   0.9830    
## s(Q10z_lag0):SeasonAutumn 2.881e+00   8.000  0.773   0.0338 *  
## s(HQA)                    1.001e+00   1.001  0.064   0.8001    
## s(HMSRBB)                 3.203e+00   3.257  3.222   0.0117 *  
## ti(Q95z_lag0,HMSRBB)      2.377e+00   3.304  0.978   0.4627    
## s(Year_cen,biol_site_id)  1.199e+02 554.000  3.258  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.637   Deviance explained = 67.1%
## -REML = -2862.9  Scale est. = 0.00089688  n = 1467

Model diagnostics

We can run gam.check(), which produces some diagnostic information about the fitting procedure and results. The default is to produce 4 residual plots, some information about the convergence of the smoothness selection optimisation, and to run diagnostic tests of whether the basis dimension choices are adequate. Below we run gam.check() for Model 2, as it had the lowest RMSE score, to produce four diagnostic plots.

These include:

  1. Q-Q Plot. This plots two sets of quantiles against one another. If the residuals fall along a roughly straight line at a 45-degree angle, then the residuals are roughly normally distributed. We can see in our Q-Q plot below that the residuals tend to deviate from the 45-degree line on the tail ends, which could be an indication that they’re not normally distributed.

  2. Residuals vs linear pred. This plot checks the assumption of homogeneous variances, which states that the error variance should be constant regardless of what value the response takes. Here we can see that the residuals generally have a similar amount of vertical scatter at all fitted values of the response, with some deviation at lower values of the linear predictor.

  3. Histogram of residuals. This is an alternative to plot 2 for verifying the assumption of normally distributed errors. Here we can see that the probability distribution of the model residuals is reasonably similar to a normal distribution.

  4. Response vs fitted values. This plot shows the relationship between the fitted (predicted) and observed (true) values of the response variable. If the model fits the data well, the points should be clustered around the diagonal 1:1 line, which represents a perfect fit (fitted=observed). Here we can see the model provides a reasonably good fit to the data, but with the weakest predictive power at low LIFE O:E values (observed values).

gam.check(cs2_gam2)

## 
## Method: REML   Optimizer: outer newton
## full convergence after 10 iterations.
## Gradient range [-0.0005207262,0.001209649]
## (score -2862.883 & scale 0.0008968777).
## Hessian positive definite, eigenvalue range [0.0002308591,731.3442].
## Model rank =  749 / 749 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                                 k'      edf k-index p-value    
## s(Year_cen)               9.00e+00 3.97e+00    0.92  <2e-16 ***
## s(Q10z_lag0)              9.00e+00 1.67e+00    0.95   0.030 *  
## s(Q95z_lag0)              9.00e+00 1.00e+00    0.97   0.095 .  
## s(Q10z_lag0):SeasonSpring 8.00e+00 9.43e-04    0.95   0.045 *  
## s(Q10z_lag0):SeasonAutumn 8.00e+00 2.88e+00    0.95   0.045 *  
## s(HQA)                    9.00e+00 1.00e+00    0.99   0.320    
## s(HMSRBB)                 9.00e+00 3.20e+00    0.98   0.125    
## ti(Q95z_lag0,HMSRBB)      1.60e+01 2.38e+00    0.98   0.255    
## s(Year_cen,biol_site_id)  6.70e+02 1.20e+02    0.92  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In addition to the diagnostic plots, gam.check() returns a check of the basis dimensions (k) of smoothed terms that were included within model 2. This can also be returned by running k.check(). For model 2 above, the basis dimensions (k) are set at the default values of mgcv::gam().

Above, the gam.check() report returned low p-values (and k-index <1) for the Q10_lag0 and Year_cen variables, and their interactions. These results indicate that k may be too low, especially where the EDF (effective degrees of freedom) are close to their maximum possible value (k’), meaning that the wiggliness of the model may be overly constrained by the default k, and the model could fit the data better (with greater wiggliness) if k were to be increased. In other words, the trade off between smoothness and wiggliness is not balanced. The model was first refitted with larger k values (k = 20) for Q10_lag0 and Year_cen variables (below).

Re-running k.check() on the adjusted model demonstrated some increase in p-values, but these remained reasonably low for the Q10z_lag0 and Year_cen variables. However, applying k=20 only marginally increased the EDF for the Q10z_lag0 and Year_cen variables. A second adjusted model, applying k=30 for Q10z_lag0 and k=25 for Year_cen, was also trialed and, again, only marginal increases in the EDF were observed. To balance computational time, and model wiggliness, a k value of 20 was therefore applied for Q10z_lag0 and Year_cen variables. Using this model, we can see that the EDF are no longer close to k’.

In case it is not clear from the above, note that k, k-index and k’ denote (somewhat confusingly) slightly different things. See Pedersen et al. (2019; https://doi.org/10.7717/peerj.6876) or Wood (2017; https://www.taylorfrancis.com/books/mono/10.1201/9781315370279/generalized-additive-models-simon-wood) for further information on this and other aspects of GAMs.

Adjust k-values:

Model 2_k20
cs2_gam2_k20 <- mgcv::gam(LIFE_F_OE ~ 
                     Season +
                     s(Year_cen, k = 20) +
                     s(Q10z_lag0, k = 20) +
                     s(Q95z_lag0) +
                     s(Q10z_lag0, by = Season, m = 1, k = 20) +
                     s(HQA) +
                     s(HMSRBB) +
                     #ti(Q95z_lag0, HQA) +
                     ti(Q95z_lag0, HMSRBB) +
                     s(Year_cen, biol_site_id, bs = "fs"),
                    data = cs2_HE1,  
                    #select = TRUE,
                    method = 'REML')


#saveRDS(cs2_mod2_k20, "vignettes/data/cs2_mod2_k20.RDS")
mgcv::k.check(cs2_gam2_k20)
##                            k'          edf   k-index p-value
## s(Year_cen)                19   4.10974374 0.9234596  0.0000
## s(Q10z_lag0)               19   1.76416286 0.9546517  0.0250
## s(Q95z_lag0)                9   1.00347102 0.9664289  0.0775
## s(Q10z_lag0):SeasonSpring  19   0.00213214 0.9546517  0.0325
## s(Q10z_lag0):SeasonAutumn  19   3.05661203 0.9546517  0.0425
## s(HQA)                      9   1.00024406 0.9864765  0.3250
## s(HMSRBB)                   9   3.20435675 0.9759677  0.1625
## ti(Q95z_lag0,HMSRBB)       16   2.40441894 0.9903202  0.3325
## s(Year_cen,biol_site_id)  670 119.84391853 0.9234596  0.0000
Model 2_k30
cs2_gam2_k50 <- mgcv::gam(LIFE_F_OE ~ 
                     Season +
                     s(Year_cen, k = 25) +
                     s(Q10z_lag0, k = 30) +
                     s(Q95z_lag0) +
                     s(Q10z_lag0, by = Season, m = 1, k = 30) +
                     s(HQA) +
                     s(HMSRBB) +
                     #ti(Q95z_lag0, HQA) +
                     ti(Q95z_lag0, HMSRBB) +
                     s(Year_cen, biol_site_id, bs = "fs"),
                    data = cs2_HE1,  
                    #select = TRUE,
                    method = 'REML')

#saveRDS(cs2_mod2_k50, "vignettes/data/cs2_mod2_k50.RDS")
##                            k'          edf   k-index p-value
## s(Year_cen)                24   4.11051603 0.9232838  0.0000
## s(Q10z_lag0)               29   1.79627379 0.9546158  0.0350
## s(Q95z_lag0)                9   1.00151997 0.9663879  0.1100
## s(Q10z_lag0):SeasonSpring  29   0.00303473 0.9546158  0.0350
## s(Q10z_lag0):SeasonAutumn  29   3.04890844 0.9546158  0.0375
## s(HQA)                      9   1.00006668 0.9865325  0.2700
## s(HMSRBB)                   9   3.20396115 0.9760166  0.2100
## ti(Q95z_lag0,HMSRBB)       16   2.40761255 0.9922870  0.3325
## s(Year_cen,biol_site_id)  670 119.83432482 0.9232838  0.0000