Vignette

Installation

if (!require(pacman)) {
    install.packages("pacman")
}
pacman::p_load(insight, RCurl, writexl, htmlTable, devtools, roxygen2, ggnewscale,
    visreg, formatR, reshape, glmmTMB, remotes, merTools, GGally, imputeTS, sf, geojsonio,
    leaflet, plyr, rnrfa)

# Conditionally install hetoolkit from github install.packages('remotes')
library(remotes)
# Conditionally install hetoolkit from github
if ("hetoolkit" %in% installed.packages() == FALSE) {
    remotes::install_github("APEM-LTD/hetoolkit")
}
library(hetoolkit)

Introduction

The hetoolkit package comprises a collection of 21 functions for assembling, processing, visualising and modelling hydro-ecological data. These are:

import_nrfa for importing flow data from the National River Flow Archive (NRFA);
import_hde for importing flow data from the Environment Agency (EA) Hydrology Data Explorer (HDE);
import_flowfiles for importing flow data from local files;
import_flow for importing flow data from a mix of the above sources;
impute_flow for infilling missing records in daily flow time series for one or more sites (gauging stations) using either an interpolation or an equipercentile method.
import_inv for importing macroinvertebrate sampling data from the EA Ecology and Fish Data Explorer;
import_env for importing environmental base data from the EA Ecology and Fish Data Explorer;
import_rhs for importing River Habitat Survey (RHS) data from the EA’s Open Data portal;
predict_indices for calculating expected scores for macroinvertebrate indices using the RICT model (FBA 2020);
calc_flowstats and calc_rfrstats for calculating summary statistics describing historical flow conditions;
join_he for joining the above datasets;
plot_heatmap for visualising and summarising gaps in time series data;
plot_hev and shiny_hev for producing time series plots of biology and flow data;
plot_sitepca for summarising environmental characteristics of biological sampling sites;
plot_rngflows for Visualising the range of flow conditions experienced historically at a site;
model_cv and model_logocv for performing cross-validation on linear mixed-effects models and hierarchical generalized additive models;
diag_lmer for generating a variety of diagnostic plots for a mixed-effects regression (lmer) model; and
plot_predictions for visualising the time series predictions from a hydro-ecological model.

This vignette illustrates a typical workflow using a selection of 20 macroinvertebrate sampling sites from the Environment Agency’s National Drought Monitoring Network (NDMN).

Although the package has been developed with macroinvertebrate data in mind, the functions can be used with any kind of biological sampling data.

Meta-data file

To link together disparate datasets requires a look-up table of site ids. In this example, we load a table with four columns:

biol_site_id = id of biology (in this case macroinvertebrate) sampling site.
flow_site_id = id of paired flow gauging station.
flow_input = vector specifying where to source the flow data for station (either National River Flow Archive “NRFA”, Hydrology Data Explorer “HDE”, or local files “FLOWFILES”).
rhs_survey_id = id of paired River Habitat Survey (RHS) (survey id, not not site id, in case multiple surveys have been undertaken at a site).

# load master file
data("master_file")

# make all columns character vectors
master_file$biol_site_id <- as.character(master_file$biol_site_id)
master_file$rhs_survey_id <- as.character(master_file$rhs_survey_id)

# filter master file for selected sites of interest
master_data <- master_file %>%
    filter(biol_site_id %in% c("34310", "34343", "34352", "55287", "55395", "55417",
        "55673", "55824", "55897", "56065", "56226", "54637", "54769", "54801", "80998",
        "54827", "77216"))

# view data
master_data

## # A tibble: 17 × 4
##    biol_site_id flow_site_id flow_input rhs_survey_id
##    <chr>        <chr>        <chr>      <chr>        
##  1 34310        2859TH       HDE        39880        
##  2 34343        4790TH       HDE        39881        
##  3 34352        5420TH       HDE        39797        
##  4 77216        F3105        HDE        39621        
##  5 55287        029003       HDE        38660        
##  6 55395        029001       HDE        38669        
##  7 55417        030017       HDE        38670        
##  8 55673        032806       HDE        39205        
##  9 55824        031031       HDE        39707        
## 10 55897        033044       HDE        39798        
## 11 56065        033051       HDE        39310        
## 12 56226        033022       HDE        39839        
## 13 54637        037005       HDE        39560        
## 14 54769        035004       HDE        38952        
## 15 54801        034003       HDE        39612        
## 16 80998        034206       HDE        39891        
## 17 54827        035002       HDE        39574

# get site lists, for use with functions
biolsites <- master_data$biol_site_id
flowsites <- master_data$flow_site_id
flowinputs <- master_data$flow_input
rhssurveys <- master_data$rhs_survey_id

Standardised Column Names

A number of standardised column names are used throughout the hetoolkit package, and throughout this vignette and its associated datasets. These include:

biol_site_id = macroinvertebrate sampling site ids.
flow_site_id = flow gauging station ids.
rhs_survey_id = River Habitat Survey (RHS) ids (survey id, not not site id, in case multiple surveys have been undertaken at a site).
flow = flow data, as downloaded using the import_flow function

Prepare biology, ENV and RHS data

Import biology data

The import_inv function imports macroinvertebrate sampling data from the Environment Agency’s Ecology and Fish Data Explorer. The data can either be downloaded from https://environment.data.gov.uk/ecology-fish/downloads/INV_OPEN_DATA.zip or read in from a local .csv or .rds file. The data can be optionally filtered by site ID and sample date.

Below, we use our list biolsites to filter the data from EDE.

# Import biology data from EDE
biol_data <- import_inv(source = "parquet", sites = biolsites, start_date = "2010-01-01",
    end_date = "2020-12-31")

# view biol_data
biol_data

## # A tibble: 454 × 58
##    biol_site_id SAMPLE_ID SAMPLE_VERSION REPLICATE_CODE SAMPLE_DATE SAMPLE_TYPE
##    <chr>        <chr>              <int> <chr>          <date>      <fct>      
##  1 56065        605567                 1 NA             2010-03-16  SP         
##  2 56065        611126                 1 NA             2010-07-28  SP         
##  3 56065        614704                 1 NA             2010-10-12  SP         
##  4 56065        621146                 1 NA             2011-03-29  SP         
##  5 56065        625474                 1 NA             2011-06-14  SP         
##  6 56065        627625                 1 NA             2011-08-06  SP         
##  7 56065        630894                 1 NA             2011-10-12  SP         
##  8 56065        633498                 1 NA             2011-12-09  SP         
##  9 56065        634677                 1 NA             2012-02-08  SP         
## 10 56065        635806                 1 NA             2012-04-12  SP         
## # ℹ 444 more rows
## # ℹ 52 more variables: SAMPLE_TYPE_DESCRIPTION <fct>, SAMPLE_METHOD <fct>,
## #   SAMPLE_METHOD_DESCRIPTION <fct>, SAMPLE_REASON <fct>, ANALYSIS_ID <int>,
## #   DATE_OF_ANALYSIS <date>, ANALYSIS_TYPE <fct>,
## #   ANALYSIS_TYPE_DESCRIPTION <fct>, ANALYSIS_METHOD <fct>,
## #   ANALYSIS_METHOD_DESCRIPTION <fct>, BMWP_N_TAXA <int>, BMWP_TOTAL <int>,
## #   BMWP_ASPT <dbl>, CCI_N_TAXA <int>, CCI_CS_TOTAL <int>, CCI_ASPT <dbl>, …

Optional: Join additional biology data

If the user has additional biology data in a separate Excel file, it is possible to append this to the EDE download. The additional data must have the same column names as the EDE download file.

# bind 2 biology data sets - one from EDE and one local file

# drop any unwanted variables/columns from the EDE download file
drops_bio <- c("SAMPLE_VERSION", "REPLICATE_CODE", "SAMPLE_TYPE", "SAMPLE_METHOD",
    "ANALYSIS_TYPE", "ANALYSIS_METHOD", "IS_THIRD_PARTY_DATA", "WATERBODY_TYPE")

# drop unwanted variables
biol_data2 <- biol_data[, !(names(biol_data) %in% drops_bio)]

# read in additional biology data in csv format
biol_data_excel <- read.csv("Data/biol_data_join.csv")

# format columns
biol_data_excel <- biol_data_excel %>%
    dplyr::mutate(biol_site_id = as.character(biol_site_id))

# convert to tibble format
biol_data_excel <- as_tibble(biol_data_excel)

# bind datasets
biol_data_final <- rbind(biol_data2, biol_data_excel)

Import environmental data

The import_env function allows the user to download environmental base data from the Environment Agency’s Ecology and Fish Data Explorer.

The function either:

downloads environmental data data from https://environment.data.gov.uk/ecology-fish/downloads/INV_OPEN_DATA.zip
or imports it from a local .csv or .rds file

Data can be optionally filtered by site ID.

When saving, the name of rds file is hard-wired to: INV_OPEN_DATA_SITES_ALL.rds.

If saving prior to filtering, the name of the filtered rds file is hard-wired to: INV_OPEN_DATA_SITE_F.rds.

Below, we use our list biolsites to filter the data from EDE.

# Import biology data from EDE
env_data <- import_env(sites = biolsites)

# view env_data
env_data

## # A tibble: 17 × 34
##    AGENCY_AREA    REPORTING_AREA CATCHMENT WATERBODY_TYPE WATERBODY_TYPE_DESCR…¹
##    <chr>          <chr>          <chr>     <chr>          <chr>                 
##  1 ANGLIAN - NOR… LINCOLNSHIRE … EAST LIN… WBRV           RIVER: Natural/semi-n…
##  2 ANGLIAN - CEN… EAST ANGLIA -… IVEL      WBRV           RIVER: Natural/semi-n…
##  3 ANGLIAN - EAS… EAST ANGLIA -… WAVENEY … WBRV           RIVER: Natural/semi-n…
##  4 ANGLIAN - CEN… EAST ANGLIA -… CAM (2)   WBRV           RIVER: Natural/semi-n…
##  5 ANGLIAN - EAS… EAST ANGLIA -… BURE      WBRV           RIVER: Natural/semi-n…
##  6 THAMES - NORT… HERTFORDSHIRE… RODING    WBRV           RIVER: Natural/semi-n…
##  7 ANGLIAN - CEN… EAST ANGLIA -… LT OUSE,… WBRV           RIVER: Natural/semi-n…
##  8 NORTH EAST - … YORKSHIRE      HULL      WBRV           RIVER: Natural/semi-n…
##  9 ANGLIAN - EAS… EAST ANGLIA -… DEBEN     WBRV           RIVER: Natural/semi-n…
## 10 ANGLIAN - NOR… LINCOLNSHIRE … UPPER NE… WBRV           RIVER: Natural/semi-n…
## 11 ANGLIAN - EAS… EAST ANGLIA -… COLNE (A… WBRV           RIVER: Natural/semi-n…
## 12 ANGLIAN - NOR… LINCOLNSHIRE … EAST LIN… WBRV           RIVER: Natural/semi-n…
## 13 ANGLIAN - NOR… LINCOLNSHIRE … WELLAND   WBRV           RIVER: Natural/semi-n…
## 14 THAMES - NORT… HERTFORDSHIRE… MIMRAM    WBRV           RIVER: Natural/semi-n…
## 15 THAMES - NORT… HERTFORDSHIRE… COLNE (T… WBRV           RIVER: Natural/semi-n…
## 16 ANGLIAN - NOR… LINCOLNSHIRE … CENTRAL … WBRV           RIVER: Natural/semi-n…
## 17 ANGLIAN - EAS… EAST ANGLIA -… ALDE AND… WBRV           RIVER: Natural/semi-n…
## # ℹ abbreviated name: ¹WATERBODY_TYPE_DESCRIPTION
## # ℹ 29 more variables: WATER_BODY <chr>, biol_site_id <chr>,
## #   SITE_VERSION <int>, NGR_PREFIX <chr>, EASTING <chr>, NORTHING <chr>,
## #   NGR_10_FIG <chr>, FULL_EASTING <int>, FULL_NORTHING <int>,
## #   WFD_WATERBODY_ID <chr>, ALTITUDE <dbl>, SLOPE <dbl>,
## #   DIST_FROM_SOURCE <dbl>, DISCHARGE <dbl>, WIDTH <dbl>, DEPTH <dbl>,
## #   BOULDERS_COBBLES <dbl>, PEBBLES_GRAVEL <dbl>, SAND <dbl>, …

Optional: Map biology sites

Get data

We use the environmental base data that we have downloaded from the Ecology and Fish Data Explorer using import_env, this gives us their NGRs. We translate the NGRs to full latitude / longitude (WGS84) and match this back to the env_data so we have information to include in the plot.

## Convert national grid ref (NGR) to full lat / long from env_data (from import_env function)
## WGS84 is lat/long. 
temp.eastnorths <- osg_parse(env_data$NGR_10_FIG, coord_system = "WGS84") %>% as_tibble()

## match to back to env data to give details on map
env_data_map <- cbind(env_data, temp.eastnorths) %>%
  dplyr::select(AGENCY_AREA, WATER_BODY, CATCHMENT, WATERBODY_TYPE, biol_site_id, lat, lon) %>%
  dplyr::mutate(label = paste0("<b>", as.character(biol_site_id), "</b><br/>", AGENCY_AREA, "<br/>", CATCHMENT))

Create the map

Finally we use leaflet to plot the and points indicating the sample sites. The points are labelled with the biology sample site ID, EA area code and the catchment.

## Create map
leaflet() %>%
  leaflet::addTiles() %>%
  leaflet::addMarkers(lng = env_data_map$lon, lat = env_data_map$lat, popup = env_data_map$label,
            options = popupOptions(closeButton = FALSE))

Optional: Join additional environmental data

Any additional environmental data in a separate Excel file can be appended to the EDE download. The additional data must have the same column names as the EDE download file.

# # bind 2 env data sets - one from EDE and one local file # drop any unwanted
# variables/columns from the download file drops_env <- c('AGENCY_AREA',
# 'REPORTING_AREA', 'CATCHMENT', 'WATERBODY_TYPE',
# 'WATERBODY_TYPE_DESCRIPTION', 'SITE_VERSION', 'NGR_10_FIG', 'FULL_EASTING',
# 'FULL_NORTHING', 'BASE_DATA_DATE', 'MIN_SAMPLE_DATE', 'MAX_SAMPLE_DATE',
# 'COUNT_OF_SAMPLES', 'ECN_SITE', 'INV', 'ECN_SITE_INV') # drop unwanted
# variables env_data <- env_data[ , !(names(env_data) %in% drops_env)] #
# read-in environmental data in excel format env_data_excel <-
# read_excel('data/Env_Additonal_Sites.xlsx') # format columns env_data_excel
# <- env_data_excel %>% dplyr::mutate( biol_site_id =
# as.character(biol_site_id), WATER_BODY = as.character(WATER_BODY), NGR_PREFIX
# = as.character(NGR_PREFIX), EASTING = as.character(EASTING), NORTHING =
# as.character(NORTHING), WFD_WATERBODY_ID = as.character(WFD_WATERBODY_ID)) #
# convert to tibble format env_data_excel <- as_tibble(env_data_excel) # join
# datasets env_data <- rbind(env_data, env_data_excel)

Import RHS data

The import_rhs function allows the user to download River Habitat Survey (RHS) data from Open Data.

The function either:

downloads RHS data from Open Data, https://environment.data.gov.uk/portalstg/sharing/rest/content/items/b82d3ef3750d49f6917fff02b9341d68/data
or imports it from a local xlsx or rds file.

Data can be optionally filtered by survey ID.

Downloaded raw data files (in .zip format) will be automatically removed from the working directory following completed execution of the function.

Below, we use our list rhssurveys to filter the downloaded data.

To enable the RHS data to be joined to the biology and flow data at a later stage, it is necessary to rename the Survey ID column to rhs_survey_id.

# import RHS data from Open Data
rhs_data <- import_rhs(surveys = rhssurveys)

# rename Survey.ID as rhs_survey_id
rhs_data <- rhs_data %>%
    dplyr::rename(rhs_survey_id = Survey.ID)

If required, unwanted variables can be dropped from the dataset.

# select Columns to keep
rhs_keep <- c("rhs_survey_id", "Hms.Poaching.Sub.Score", "Hms.Rsctned.Bnk.Bed.Sub.Score",
    "HMS.Score")

# select columns of interest
rhs_data <- rhs_data[, (names(rhs_data) %in% rhs_keep)]

# view rhs_data
rhs_data

## # A tibble: 17 × 4
##    rhs_survey_id HMS.Score Hms.Poaching.Sub.Score Hms.Rsctned.Bnk.Bed.Sub.Score
##    <chr>             <dbl>                  <dbl>                         <dbl>
##  1 38660              1305                      0                            80
##  2 38669              3175                      0                          2000
##  3 38670               525                     10                             0
##  4 38952              3140                      0                          2280
##  5 39205              3420                      0                          2520
##  6 39310              3810                      0                          2200
##  7 39560               240                      0                             0
##  8 39574               780                      0                           760
##  9 39612              2600                     50                          2000
## 10 39621               250                      0                             0
## 11 39707              3040                      0                          2720
## 12 39797              3120                      0                          2080
## 13 39798              3570                      0                          2520
## 14 39839              2855                      0                          2440
## 15 39880              1740                      0                           240
## 16 39881              1055                     60                             0
## 17 39891              3170                     10                          2760

Predict expected indices

The predict_indices function mirrors the functionality of the RICT model available on the MS Azure platform (https://gallery.azure.ai/Experiment/RICT-package-2). Specifically, it uses the environmental data downloaded from EDE to generate expected scores under minimally impacted reference conditions for 80 indices, plus probabilities for RIVPACS end-groups. No classification is undertaken.

Data validation, transformation (conversion) and predictions are done within predict_indices using functionality predefined in the AquaMetrics RICT package (https://github.com/aquaMetrics/rict); specifically, the ‘rict_predict’ function from the aquaMetrics RICT package is applied.

The predict_indices function can accept environmental data formatted in either of two formats, specified using the ‘file_format’ argument: * “EDE” - environmental data is formatted as downloaded from the EA’s Ecology Data Explorer * “RICT” - environmental data is in the RICT template format.

To run, the following columns containing substrate composition data must not contain NAs, and must total to 100 for every site:

BOULDERS_COBBLES
PEBBLES_GRAVEL
SAND
SILT_CLAY

The code below demonstrates how to summarise whether these columns contain NAs, and how to replace NAs as required. Note, this code will only run successfully if using data in the “EDE” file format.

# check substrate variables for presence of NAs
env_data %>%
    dplyr::select(BOULDERS_COBBLES, PEBBLES_GRAVEL, SAND, SILT_CLAY) %>%
    summarise_all(~sum(is.na(.)))

## # A tibble: 1 × 4
##   BOULDERS_COBBLES PEBBLES_GRAVEL  SAND SILT_CLAY
##              <int>          <int> <int>     <int>
## 1                0              0     0         0

# replace NAs, if required
env_data$BOULDERS_COBBLES[is.na(env_data$BOULDERS_COBBLES)] <- 0

Now that the substrate data is complete, the expected scores can be generated. Here, we opt to calculate all available indices - we can then filter these down at a later stage.

This is an updated version of the predict_indices function. The first iteration of the function remains available, using preict_indices_old.

# run predictions
predict_data <- predict_indices(env_data = env_data, file_format = "EDE", all_indices = TRUE)

##   ROW  SITE YEAR FAIL
## 1  17 54769 2025  ---
## 2   3 80998 2025  ---
##                                                                                WARNING
## 1  You provided LONGITUDE: 1.4717273941385, max value used to train model: 1.3560119. 
## 2 You provided LONGITUDE: 1.35613296461629, max value used to train model: 1.3560119. 
##   REPLACEMENT
## 1         ---
## 2         ---

# view predict_data
predict_data

## # A tibble: 51 × 156
##    biol_site_id LATITUDE LONGITUDE LOG.ALTITUDE LOG.DISTANCE.FROM.SO…¹ LOG.WIDTH
##    <chr>           <dbl>     <dbl>        <dbl>                  <dbl>     <dbl>
##  1 55395            53.5    -0.114        1.18                    1.30     0.477
##  2 56226            52.0    -0.267        1.54                    1.21     0.568
##  3 80998            52.4     1.36         1.04                    1.54     0.954
##  4 56065            52.1     0.191        1.54                    1.31     0.740
##  5 54801            52.8     1.25         1.18                    1.33     0.799
##  6 34352            51.7     0.258        1.61                    1.46     0.602
##  7 55897            52.4     0.877        1.18                    1.47     0.954
##  8 77216            54.0    -0.379        0.699                   1.17     1.08 
##  9 54827            52.2     1.33         1                       1.32     0.633
## 10 55673            52.2    -0.934        1.78                    1.50     1.10 
## # ℹ 41 more rows
## # ℹ abbreviated name: ¹LOG.DISTANCE.FROM.SOURCE
## # ℹ 150 more variables: LOG.DEPTH <dbl>, MEAN.SUBSTRATUM <dbl>,
## #   DISCHARGE.CATEGORY <dbl>, ALKALINITY <dbl>, LOG.ALKALINITY <dbl>,
## #   LOG.SLOPE <dbl>, MEAN.AIR.TEMP <dbl>, AIR.TEMP.RANGE <dbl>, p1 <dbl>,
## #   p2 <dbl>, p3 <dbl>, p4 <dbl>, p5 <dbl>, p6 <dbl>, p7 <dbl>, p8 <dbl>,
## #   p9 <dbl>, p10 <dbl>, p11 <dbl>, p12 <dbl>, p13 <dbl>, p14 <dbl>, …

# drop unwanted variables
keeps <- c("biol_site_id", "SEASON", "TL2_WHPT_ASPT_AbW_DistFam", "TL2_WHPT_NTAXA_AbW_DistFam",
    "TL3_LIFE_Fam_DistFam", "TL3_PSI_Fam")

predict_data <- predict_data[, (names(predict_data) %in% keeps)]

predict_data <- predict_data %>%
    dplyr::rename(Season = SEASON) %>%
    dplyr::mutate(Season = case_when(Season == 1 ~ "Spring", Season == 2 ~ "Summer",
        Season == 3 ~ "Autumn"))

Join Biology Data, Environmental Data, and Expected Indices

Prior to joining the biology data with other datasets, it is advisable to remove any replicate or duplicate samples collected from a site within the same year and season.

# remove replicate samples
biol_data <- biol_data %>%
    distinct(biol_site_id, Year, Season, .keep_all = TRUE)

The following code demonstrates how to join the expected biology metric scores from predict_indices to the observed biology metric scores, using the common fields biol_site_id and Season.

# join expected and observed biology metric scores, by biol_site_id and Season
biol_all <- dplyr::left_join(biol_data, predict_data, by = c("biol_site_id", "Season"))

Next the environmental base data are joined using the common field biol_site_id.

# join ENV data, by biol_site_id
biol_all <- dplyr::left_join(biol_all, env_data, by = "biol_site_id")

Calculate O:E ratios

Now that the observed and expected biology metric scores are joined into a single data frame, O:E (Observed/Expected) ratios can be calculated.

# O:E calcs
biol_all <- biol_all %>%
    mutate(WHPT_ASPT_O = WHPT_ASPT, WHPT_ASPT_E = TL2_WHPT_ASPT_AbW_DistFam, WHPT_ASPT_OE = WHPT_ASPT_O/WHPT_ASPT_E,
        WHPT_NTAXA_O = WHPT_N_TAXA, WHPT_NTAXA_E = TL2_WHPT_NTAXA_AbW_DistFam, WHPT_NTAXA_OE = WHPT_NTAXA_O/WHPT_NTAXA_E,
        LIFE_F_O = LIFE_FAMILY_INDEX, LIFE_F_E = TL3_LIFE_Fam_DistFam, LIFE_F_OE = LIFE_F_O/LIFE_F_E,
        PSI_O = PSI_FAMILY_SCORE, PSI_E = TL3_PSI_Fam, PSI_OE = PSI_O/PSI_E, date = SAMPLE_DATE)

Prepare flow data

View Flow Data Source

The data frame nrfa.hde provides a joined dataset which illustrates which stations we believe are available on the NRFA, and which stations are available on Hydrology Data Explorer. The dataset was created in early April 2021, and may become out of date over time.

# view nrfa.hde
nrfa.hde

## # A tibble: 1,366 × 20
##    wiski.id open  Flow_Site_Name    on.nrfa on.hde  id_N name_N catchment.area_N
##    <chr>    <lgl> <chr>             <lgl>   <lgl>  <dbl> <chr>             <dbl>
##  1 021031   FALSE Till at Etal      TRUE    FALSE  21031 Till …            648  
##  2 021032   FALSE Glen at Kirknewt… TRUE    FALSE  21032 Glen …            199. 
##  3 021038   TRUE  Till at Heaton M… TRUE    FALSE  21035 Till …            656. 
##  4 022001   TRUE  Coquet at Morwick TRUE    TRUE   22001 Coque…            570. 
##  5 022002   FALSE Coquet at Bygate  TRUE    FALSE  22002 Coque…             59.5
##  6 022003   TRUE  Usway Burn at Sh… TRUE    TRUE   22003 Usway…             21.4
##  7 022004   FALSE Aln at Hawkhill   TRUE    FALSE  22004 Aln a…            205  
##  8 022006   TRUE  Blyth at Hartfor… TRUE    TRUE   22006 Blyth…            269. 
##  9 022007   TRUE  Wansbeck at Mitf… TRUE    TRUE   22007 Wansb…            287. 
## 10 NA       FALSE Alwin at Clennell TRUE    FALSE  22008 Alwin…             27.7
## # ℹ 1,356 more rows
## # ℹ 12 more variables: opened_N <chr>, closed_N <chr>,
## #   measuring.authority.id_N <chr>, measuring.authority.station.id_N <chr>,
## #   wiskiID_H <chr>, label_H <chr>, stationReference_H <chr>,
## #   riverName_H <chr>, easting_H <dbl>, northing_H <dbl>, col_H <chr>,
## #   value_H <chr>

Import flow data

import_flow is a high-level function that calls import_nrfa, import_hde and import_flowfiles to import data for a user-defined list of sites.

Below, we use the lists flowsites and flowinputs to specify sites for import.

flow_data <- import_flow(sites = flowsites, inputs = flowinputs, start_date = "2010-01-01",
    end_date = "2019-12-31", dir = "data/wiski", skip_num = 21, col_order = c(1,
        2, 3))

# view flow_data
flow_data

## # A tibble: 62,084 × 5
##    input flow_site_id date        flow quality
##    <chr> <chr>        <date>     <dbl> <chr>  
##  1 HDE   029001       2010-01-01 0.426 Good   
##  2 HDE   029001       2010-01-02 0.418 Good   
##  3 HDE   029001       2010-01-03 0.416 Good   
##  4 HDE   029001       2010-01-04 0.419 Good   
##  5 HDE   029001       2010-01-05 0.426 Good   
##  6 HDE   029001       2010-01-06 0.422 Good   
##  7 HDE   029001       2010-01-07 0.416 Good   
##  8 HDE   029001       2010-01-08 0.406 Good   
##  9 HDE   029001       2010-01-09 0.401 Good   
## 10 HDE   029001       2010-01-10 0.54  Good   
## # ℹ 62,074 more rows

Explore data gaps

The plot_heatmap function is designed to visualise and summarise gaps in time series data.

It plots time series data for multiple sites as a tiled heatmap, and optionally produces tabular summaries of data completeness by time period and site. Although designed for application with flow time series data, it can be applied to any type of numerical data, with or without a time dimension.

In the following example, plot_heatmap is used to explore gaps in the time series of mean daily flows, and then to summarise how complete the data is in each month.

# generate a heatmap of mean daily flows
a <- plot_heatmap(data = flow_data, x = "date", y = "flow_site_id", fill = "flow",
    dual = FALSE)
# view heatmap
gridExtra::grid.arrange(a[[1]])

# view table of completeness statistics for each site
a[[3]]

## # A tibble: 17 × 7
##    y      missing total prop_missing number_of_gaps smallest_gap biggest_gap
##    <chr>    <int> <int>        <dbl>          <int>        <int>       <int>
##  1 029001       0  3652     0                    NA           NA          NA
##  2 029003       0  3652     0                    NA           NA          NA
##  3 030017       0  3652     0                    NA           NA          NA
##  4 031031       4  3652     0.00110               2            1           3
##  5 032806       1  3652     0.000274              1            1           1
##  6 033022       0  3652     0                    NA           NA          NA
##  7 033044       0  3652     0                    NA           NA          NA
##  8 033051       0  3652     0                    NA           NA          NA
##  9 034003       0  3652     0                    NA           NA          NA
## 10 034206       0  3652     0                    NA           NA          NA
## 11 035002       0  3652     0                    NA           NA          NA
## 12 035004       0  3652     0                    NA           NA          NA
## 13 037005       0  3652     0                    NA           NA          NA
## 14 2859TH       0  3652     0                    NA           NA          NA
## 15 4790TH       0  3652     0                    NA           NA          NA
## 16 5420TH      11  3652     0.00301               1           11          11
## 17 F3105       62  3652     0.0170                1           62          62

# uses flow_data, as produced using import_flow()
flow_data$month <- lubridate::month(flow_data$date)
flow_data$year <- lubridate::year(flow_data$date)
temp1 <- flow_data %>%
    dplyr::group_by(month, year, flow_site_id) %>%
    dplyr::summarise(across(.cols = "flow", list(mean = ~mean(flow, na.rm = TRUE),
        missing = ~length(which(is.na(flow))), total = ~length(flow), perc_missing = ~(length(which(is.na(flow)))/length(flow)) *
            100), .names = "{.fn}"))
temp1$yy_mm <- paste(temp1$year, temp1$month, sep = "_")

# heatmap of monthly mean flows.
a <- plot_heatmap(data = temp1, x = "yy_mm", y = "flow_site_id", fill = "mean", dual = FALSE)
# view heatmap
gridExtra::grid.arrange(a[[1]])

# heatmap of monthly percentage completeness of the daily flow data
a <- plot_heatmap(data = temp1, x = "yy_mm", y = "flow_site_id", fill = "perc_missing")
# view heatmap
gridExtra::grid.arrange(a[[1]])

Impute flow gaps

The impute_flow function is designed to infill missing records in daily flow time series for one or more sites (gauging stations) using either an interpolation or an equipercentile method. Imputation of missing flow data can improve the later estimation of flow statistics using the calc_flowstats() function and aid the visualisation of hydro-ecological relationships using the plot_hev() function.

Note, it is advisable to consult a hydrologist when doing any flow imputation, particularly if your data contains extensive/lengthy gaps.

The most straight-forward interpolation can be completed using a ‘linear’ or ‘exponential’ method.

The output dataset is in long format, similar to that produced by import_flow. The output dataset contains an ‘imputed’ colunm; this is a flag indicating whether each flow value is original (0) or imputed (1). Additionally, this function plots up observed and imputed flow values for each site; enabling the user to validate the imputation.

Flow interpolation using the ‘linear’ method is demonstrated below:

flow_data_imputed <- impute_flow(flow_data, site_col = "flow_site_id", date_col = "date",
    flow_col = "flow", method = "linear")

If applying the ‘equipercentile’ method, an optional ‘donor’ argument can be specified. The ‘donor’ argument must be specified in data frame format with at least two columns: the first a list of flow sites requiring imputation, and the second a list of paired donor sites. The function assumes that data for all flow sites specified within the ‘donor’ data table, including paired donor sites, are contained within the ‘flow_data’ supplied.

If flow data for paired donor stations are not already included within the ‘flow_data’ dataframe, then it will be necessary to import the donor flow data, and then append these to the existing flow data (which can be done using the bind_rows() function). For brevity here we use an existing site for donor data.

# create 'donor' data frame for site with relatively large data gap (F3105)
donor_mapping <- data.frame(station = c("F3105"), donor_station = c("029001"))

# get a list of donor sites
donorsites <- as.character(unique(donor_mapping$donor_station))

# impute flow data using the equipercentile method
flow_data_imputed <- impute_flow(flow_data, site_col = "flow_site_id", date_col = "date",
    flow_col = "flow", method = "equipercentile", donor = donor_mapping)

Calculate flow statistics

calc_flowstats takes a time series of measured or modelled flows and uses a user-defined moving window to calculate a suite of long-term and time-varying flow statistics flow statistics for one or more sites (stations).

The function uses the win_start, win_width and win_step arguments to define a moving window, which divides the flow time series into a sequence of time periods. These time periods may be contiguous, non-contiguous or overlapping.

The sequence of time periods continues up to and including the present date, even when this extends beyond the period covered by the input flow dataset, as this facilitates the subsequent joining of flow statistics and ecology data by the join_he function.

It is primarily designed to work with mean daily flows (e.g. as produced by import_flow), but can also be applied to time series data on a longer (e.g. monthly) time step.

Regardless, the data should be regularly spaced and the same time step should be used for all sites.

The function also requires site, date and flow data columns to be specified in order to run. The default values for these arguments, “flow_site_id”, “date” and “flow” respectively, match the column headers from the outputs of the import_flow and impute_flow functions. The defaults allow the data from these functions to be passed without editing into calc_flowstats, however the names can be changed as normal to match the data supplied.

This is an updated version of the calc_flowstats function. The first iteration of the function, which employs fixed 6-monthly spring and autumn flow periods, remains available using calc_flowstats_old.

# uses flow_data, as produced using import_flow() will also accept interpolated
# flow data (flow_data_imputed), as produced using impute_flow()

# set any negative flow readings to NA
flow_data$flow[flow_data$flow <= 0] <- NA

# calculate flow statistics
flow_stats <- calc_flowstats(data = flow_data, site_col = "flow_site_id", date_col = "date",
    flow_col = "flow", win_width = "6 months", win_step = "6 months")

calc_flowstats returns a list of two data frames. The first contains a suite of time-varying flow statistics for every 6 month winter/summer period at every site. See ?calc_flowstats for definitions of these statistics.

# view time-varying flow statistics
head(flow_stats[[1]])

##   flow_site_id win_no start_date   end_date n_data n_missing n_total
## 1       029001      1 1995-04-01 1995-09-30      0       183     183
## 2       029001      2 1995-10-01 1996-03-31      0       183     183
## 3       029001      3 1996-04-01 1996-09-30      0       183     183
## 4       029001      4 1996-10-01 1997-03-31      0       182     182
## 5       029001      5 1997-04-01 1997-09-30      0       183     183
## 6       029001      6 1997-10-01 1998-03-31      0       182     182
##   prop_missing mean sd Q5 Q10 Q20 Q25 Q30 Q50 Q70 Q75 Q80 Q90 Q95 Q99 Q5z Q10z
## 1            1   NA NA NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA   NA
## 2            1   NA NA NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA   NA
## 3            1   NA NA NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA   NA
## 4            1   NA NA NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA   NA
## 5            1   NA NA NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA   NA
## 6            1   NA NA NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA   NA
##   Q20z Q25z Q30z Q50z Q70z Q75z Q80z Q90z Q95z Q99z dry_n dry_e dry_start
## 1   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA        NA
## 2   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA        NA
## 3   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA        NA
## 4   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA        NA
## 5   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA        NA
## 6   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA        NA
##   dry_end dry_mid low_n low_e low_start low_end low_mid low_magnitude
## 1      NA      NA    NA    NA        NA      NA      NA            NA
## 2      NA      NA    NA    NA        NA      NA      NA            NA
## 3      NA      NA    NA    NA        NA      NA      NA            NA
## 4      NA      NA    NA    NA        NA      NA      NA            NA
## 5      NA      NA    NA    NA        NA      NA      NA            NA
## 6      NA      NA    NA    NA        NA      NA      NA            NA
##   low_severity high_n high_e high_start high_end high_mid e_above3xq50
## 1           NA     NA     NA         NA       NA       NA           NA
## 2           NA     NA     NA         NA       NA       NA           NA
## 3           NA     NA     NA         NA       NA       NA           NA
## 4           NA     NA     NA         NA       NA       NA           NA
## 5           NA     NA     NA         NA       NA       NA           NA
## 6           NA     NA     NA         NA       NA       NA           NA
##   e_above5xq50 e_above7xq50 volume vol_z min min_z min_doy min_7day min_7day_z
## 1           NA           NA     NA    NA  NA    NA      NA       NA         NA
## 2           NA           NA     NA    NA  NA    NA      NA       NA         NA
## 3           NA           NA     NA    NA  NA    NA      NA       NA         NA
## 4           NA           NA     NA    NA  NA    NA      NA       NA         NA
## 5           NA           NA     NA    NA  NA    NA      NA       NA         NA
## 6           NA           NA     NA    NA  NA    NA      NA       NA         NA
##   min_7day_doy min_30day min_30day_z min_30day_doy max max_z max_doy
## 1           NA        NA          NA            NA  NA    NA      NA
## 2           NA        NA          NA            NA  NA    NA      NA
## 3           NA        NA          NA            NA  NA    NA      NA
## 4           NA        NA          NA            NA  NA    NA      NA
## 5           NA        NA          NA            NA  NA    NA      NA
## 6           NA        NA          NA            NA  NA    NA      NA

The second data table contains long-term flow statistics. The data are arranged in long format, with the following columns:

flow_site_id (a unique site id);
season (S = summer, W = winter, A = all);
parameter (base flow index (bfi), flow duration curve percentiles (p1 to p99), and seasonal means and standard deviations for Q5, Q10, Q20, Q25, Q30, Q50, Q70, Q75, Q80, Q90, Q95 and Q99); and
value (calculated statistic).

# view long-term flow statistics
flow_stats[[2]]

## # A tibble: 2,261 × 5
##    flow_site_id start_date end_date   parameter value
##    <chr>        <date>     <date>     <chr>     <dbl>
##  1 029001       2010-01-01 2019-12-31 max        1.18
##  2 029003       2010-01-01 2019-12-31 max        1.58
##  3 030017       2010-01-01 2019-12-31 max        1.70
##  4 031031       2010-01-01 2019-12-31 max       15.9 
##  5 032806       2010-01-01 2019-12-31 max        7.83
##  6 033022       2010-01-01 2019-12-31 max       11.8 
##  7 033044       2010-01-01 2019-12-31 max        5.48
##  8 033051       2010-01-01 2019-12-31 max        3.46
##  9 034003       2010-01-01 2019-12-31 max        3.44
## 10 034206       2010-01-01 2019-12-31 max       12.5 
## # ℹ 2,251 more rows

Join biology and flow data

The join_he function links biology data with time-varying flow statistics for one or more antecedent (lagged) time periods (as calculated by the calc_flowstats function) to create a combined dataset for hydro-ecological modelling.

The function allows to user to select a method for linking biology samples to flow statistics for antecedent time periods: * using method = “A” (default), lag 0 is defined for each biology sample as the most recently finished flow time period; * using method = “B”, lag 0 is defined as the most recently started flow time period. To describe the antecedent flow conditions prior to each biology sample, the time periods are labelled relative to the date of the biology sample, with lag 0 representing either the most recently finished (method = “A”) or most recently started (method = “B”) flow time period. The time period immediately prior to the Lag 0 time period is the Lag 1 period, and the period immediately prior to that is the Lag 2 period, and so on. The function allows the user to select which antecedent (lagged) time periods the biology data is joined to, e.g. lag = c(0,1,2).

The function also includes an option to specify the ‘join_type’: * “add_flows” (default) produces a dataset of biological metrics (response variables) and flow statistics (predictor variables) for hydro-ecological modelling. * “add_biol” produces a time series of flow statistics with associated biological metrics which can be used to assess the coverage of historical flow conditions using the plot_rngflows function.

This is an updated version of the join_he function. The first iteration of the function remains available, using join_he_old.

We will join the data for the pairs of biology and flow sites specified in master_data.

# get output from calc_flowstats
flowstats_1 <- flow_stats[[1]]

# create two-column table mapping biology sites to flow sites
mapping <- master_data[, c("biol_site_id", "flow_site_id")]

# join flow statistics to biology data

join_data <- join_he(biol_data = biol_all, flow_stats = flowstats_1, mapping = mapping,
    lags = c(0, 1), method = "A", join_type = "add_flows")

# view join_data
join_data

## # A tibble: 414 × 242
##    biol_site_id SAMPLE_ID SAMPLE_VERSION REPLICATE_CODE SAMPLE_DATE SAMPLE_TYPE
##    <chr>        <chr>              <int> <chr>          <date>      <fct>      
##  1 56065        605567                 1 NA             2010-03-16  SP         
##  2 56065        611126                 1 NA             2010-07-28  SP         
##  3 56065        614704                 1 NA             2010-10-12  SP         
##  4 56065        621146                 1 NA             2011-03-29  SP         
##  5 56065        625474                 1 NA             2011-06-14  SP         
##  6 56065        630894                 1 NA             2011-10-12  SP         
##  7 56065        633498                 1 NA             2011-12-09  SP         
##  8 56065        634677                 1 NA             2012-02-08  SP         
##  9 56065        635806                 1 NA             2012-04-12  SP         
## 10 56065        638557                 1 NA             2012-06-27  SP         
## # ℹ 404 more rows
## # ℹ 236 more variables: SAMPLE_TYPE_DESCRIPTION <fct>, SAMPLE_METHOD <fct>,
## #   SAMPLE_METHOD_DESCRIPTION <fct>, SAMPLE_REASON <fct>, ANALYSIS_ID <int>,
## #   DATE_OF_ANALYSIS <date>, ANALYSIS_TYPE <fct>,
## #   ANALYSIS_TYPE_DESCRIPTION <fct>, ANALYSIS_METHOD <fct>,
## #   ANALYSIS_METHOD_DESCRIPTION <fct>, BMWP_N_TAXA <int>, BMWP_TOTAL <int>,
## #   BMWP_ASPT <dbl>, CCI_N_TAXA <int>, CCI_CS_TOTAL <int>, CCI_ASPT <dbl>, …

Join RHS data

If required, the RHS data can also be joined manually to the combined biology and flow dataset.

# create mapping rhs_survey_id <- as.character(rhssurveys) flow_site_id <-
# as.character(flowsites) mapping <- data.frame(flow_site_id, rhs_survey_id)

# create two-column table mapping biology sites to RHS surveys
mapping <- master_data[, c("biol_site_id", "rhs_survey_id")]

# Join RHS data to join_data
all_data <- dplyr::left_join(join_data, mapping, by = "biol_site_id")
all_data <- dplyr::left_join(all_data, rhs_data, by = "rhs_survey_id")

Exploratory data anlaysis

Pairwise correlations

Plotting pairwise correlations among alternative biological metrics can be useful when deciding which metric(s) to analysis. In this example, the ggpairs function from the GGally package is used to explore the distribution and correlations of four commonly used macroinvertebrate metrics.

GGally::ggpairs(all_data, columns=c("LIFE_F_OE", "WHPT_ASPT_OE", "WHPT_NTAXA_OE", "PSI_OE"), 
                         upper = list(continuous = GGally::wrap("cor")),
                         diag = list(continuous = "densityDiag"),
                         lower = list(continuous = GGally::wrap("points")))

Similarly, a ggpairs plot can also be used to visualise relationships between response and predictor variables, and to identify co-linearity among predictor variables.

GGally::ggpairs(all_data, columns=c("LIFE_F_OE", "Q95z_lag0", "Q10z_lag0"), 
                         upper = list(continuous = GGally::wrap("cor")),
                         diag = list(continuous = "densityDiag"),
                         lower = list(continuous = GGally::wrap("points")))

Assess site similarity

Prior to building a hydro-ecological model, it is important identify any sites that could potentially be outliers because they are physically, chemically or geographically dissimilar to the others sites in the dataset. The plot-sitepca function performs a Principal Components Analysis (PCA) which reduces a set of site-level environmental variables down to two uncorrelated ‘principal components’ and plots these components as a two-dimensional scatter plot. Sites that are closer together have more similar environmental characteristics. The proportion of the total environmental variation explained by each component is indicated on the axis labels. The environmental gradients represented by the two axes can be interpreted by reference to the arrows (eigenvectors), which show the direction and strength of correlation between the principal components and the individual environmental variables.

plot_sitepca(data = env_data, vars = c("ALTITUDE", "SLOPE", "WIDTH", "DEPTH", "BOULDERS_COBBLES", "PEBBLES_GRAVEL", "SILT_CLAY"), eigenvectors = TRUE, label_by = "biol_site_id")

In this example, the first component (x-axis) can be interpreted as a longitudinal gradient from wider, deeper rivers (left) to smaller, steeper streams (right), and the second component (y-axis) can be interpreted a gradient in substrate composition from finer pebbles/gravels (top) to coarser boulder/cobbles (bottom). On the basis of these results, we may wish to consider whether sites such as 77216 and 55824 are suitable for inclusion in the model, or whether they are sufficiently different to warrant being excluded.

Assess coverage of historical flow conditions

A hydro-ecological model has a greater chance of revealing relationships between biology metrics and flow variables if the biology samples span a wide range of flow conditions. Good coverage of historical flow conditions by the biology sample data also means that the calibrated model is likely to be better at predicting biological responses under high and low flow scenarios.

The plot_rngflows function generates a scatterplot for two flow variables and overlays two convex hulls: one showing the full range of flow conditions experienced historically, and a second convex hull showing the range of flow conditions with associated biology samples. This visualisation helps identify to what extent the available biology data span the full range of full range of flow conditions experienced historically.

In the following example, the Q95z and Q10z flow statistics have been selected as measures of low and high flows within each six month period. The first plot shows the data for all 20 sites; the second plot is faceted to show the data separately for each site individually. In this dataset, biology samples are present for almost every six month period, and so provide excellent coverage of historical flow conditions.

all.combinations <- expand.grid(biol_site_id = unique(biol_data$biol_site_id), Year = min(biol_data$Year):max(biol_data$Year), Season = c("Spring", "Autumn"), stringsAsFactors = FALSE)
dim(all.combinations)

## [1] 374   3

biol_data1 <- all.combinations %>%
  left_join(biol_all)


mapping <- master_data[, c("biol_site_id", "flow_site_id")]

join_data <- join_he(biol_data = biol_data1, flow_stats = flowstats_1, mapping = mapping, lags = c(0, 1), method = "A", join_type = "add_biol")

plot_rngflows(data = join_data, flow_stats = c("Q95z_lag0", "Q10z_lag0"), biol_metric = "LIFE_F_OE", wrap_by = NULL, label = "Year")

plot_rngflows(data = join_data, flow_stats = c("Q95z_lag0", "Q10z_lag0"), biol_metric = "LIFE_F_OE", wrap_by = "biol_site_id", label = "Year")

HEV plots

The plot_hev function generates, for one site of interest, a time series plot of biology sample data and flow summary statistics, often referred to by the EA as a hydro-ecological validation (HEV) plot. HEV plots provide a visual assessment of trends in the historical data, and an initial guide to possible relationships that could be explored and quantified within a hydro-ecological model.

Whilst the dataset outputted by the join_he function can be used to produce a HEV plot, a better solution is achieved by via a combination of manual processing and the add biol functionality of join_he.

# get output from calc_flowstats
flowstats_1 <- flow_stats[[1]]

# create two-column table mapping biology sites to flow sites
mapping <- data.frame(master_data[, c("biol_site_id", "flow_site_id")])

# filter biology and flow datasets to sites of interest
biol_data_hev <- dplyr::filter(biol_all, biol_site_id %in% unique(mapping$biol_site_id))
flow_data_hev <- dplyr::filter(flowstats_1, flow_site_id %in% unique(mapping$flow_site_id))

# create a complete daily time series for each biology site
hev_data <- expand.grid(
  biol_site_id = unique(biol_data_hev$biol_site_id), 
  date = seq.Date(as.Date("2011-01-01"), as.Date("2019-12-31"), by="day"), 
  stringsAsFactors = FALSE)

# create season and water_year columns on which to join the six-monthly flow statistics
hev_data$Month <- lubridate::month(hev_data$date)
hev_data$Year <- lubridate::year(hev_data$date)

# join biology data
hev_data <- hev_data %>%
  left_join(biol_data_hev, by = c("biol_site_id", "date", "Year"))

# join flow data
hev_data <- join_he(biol_data = hev_data, flow_stats = flow_data_hev, mapping = mapping, method = "B", join_type = "add_biol")

Using this new hev_data dataset, the following example uses multiplot = TRUE to produce HEV plots for four macroinvertebrate metrics at one site.

plot_hev(data = subset(hev_data, biol_site_id == "34352"), 
         date_col = "date", 
         flow_stat = c("Q95z_lag0", "Q10z_lag0"), 
         biol_metric = c("LIFE_F_OE", "WHPT_ASPT_OE", "WHPT_NTAXA_OE", "PSI_OE"), 
         multiplot = TRUE)

The save argument can be used to export the HEV plot as a .png file. The name of the png file is fixed, however, so to generate a separate png file for each site requires a loop.

bsi <- mapping$biol_site_id[1:5]
#for(i in mapping$biol_site_id){
for(i in bsi){
  
  hev_plot <- plot_hev(data = subset(hev_data, biol_site_id == i), 
         date_col = "date", 
         flow_stat = c("Q95z_lag0", "Q10z_lag0"), 
         biol_metric = c("LIFE_F_OE"), 
         clr_by = "biol_site_id")
  
  ggsave(hev_plot, device = "png", path = "Plots/", filename = paste0(i,"_hevplot.png"))
  
}

The example above shows the standard HEV plot, using non-overlapping 6-month (winter and summer) windows. By changing the inputs to the calc_flowstats function, HEV plots with different window widths can be also be plotted. The example below shows the same data but displayed with 12-month windows. The call to calc_flowstats used for this is:

# calculate flow statistics
flow_stats <- calc_flowstats(data = flow_data,
                           site_col = "flow_site_id",
                           date_col = "date",
                           flow_col = "flow",
                           win_width = "6 months",
                           win_step = "6 months")

To interactively examine time series relationships for different metrics and sites, use the shiny_hev function, which launches a HEV plot within a shiny app.

Model calibration

Model selection

Once the biology and flow data has been assembled and explored, spatial and temporal variation in a biological metric of interest can be modelled as a function of one or more flow summary statistics to reveal potential hydro-ecological relationships. For datasets comprising multiple sites, linear mixed-effects models have the ability to describe a ‘global’ relationship that applies across all sites, whilst also quantifying the degree of variability in this relationship from site to site.

The following example illustrates a mixed-effects model fitted to autumn macroinvertebrate data using the lmer function from the lme4 package. The model includes three variables representing high and low flows in different antecedent periods (Q95z, Q10z), and Q95z also has a random (site-specific) slope.

# filter data by season
autumn_data <- all_data %>% filter(Season == "Autumn")

#filter data to eliminate NAs (necessary for cross-validation later on)
autumn_data <- autumn_data %>% filter(is.na(LIFE_F_OE) == FALSE)
autumn_data <- autumn_data %>% filter(is.na(biol_site_id) == FALSE)
autumn_data <- autumn_data %>% filter(is.na(Q95z_lag0) == FALSE)
autumn_data <- autumn_data %>% filter(is.na(Q10z_lag0) == FALSE)
autumn_data <- autumn_data %>% filter(is.na(Q95z_lag1) == FALSE)

model_full <- lme4::lmer(LIFE_F_OE ~ Q95z_lag0 + Q10z_lag0 + Q95z_lag1 + (Q95z_lag0 | biol_site_id), data = autumn_data, REML = FALSE)

summary(model_full)

## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: LIFE_F_OE ~ Q95z_lag0 + Q10z_lag0 + Q95z_lag1 + (Q95z_lag0 |  
##     biol_site_id)
##    Data: autumn_data
## 
##      AIC      BIC   logLik deviance df.resid 
##   -538.9   -514.1    277.5   -554.9      156 
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -3.15766 -0.56189 -0.01551  0.54874  2.78413 
## 
## Random effects:
##  Groups       Name        Variance  Std.Dev. Corr 
##  biol_site_id (Intercept) 4.560e-03 0.067529      
##               Q95z_lag0   2.245e-05 0.004738 -1.00
##  Residual                 1.380e-03 0.037154      
## Number of obs: 164, groups:  biol_site_id, 17
## 
## Fixed effects:
##               Estimate Std. Error t value
## (Intercept)  1.025e+00  1.666e-02  61.542
## Q95z_lag0   -1.306e-03  5.761e-03  -0.227
## Q10z_lag0    4.982e-03  4.595e-03   1.084
## Q95z_lag1   -1.276e-06  2.701e-03   0.000
## 
## Correlation of Fixed Effects:
##           (Intr) Q95z_0 Q10z_0
## Q95z_lag0 -0.146              
## Q10z_lag0 -0.036 -0.701       
## Q95z_lag1 -0.022 -0.095  0.221
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')

To aid model selection, the hetoolkit package includes two functions for performing cross-validation.

model_cv performs repeated, stratified k-fold cross-validation, which can be used to evaluate a model’s ability to predict the response at sites in the calibration dataset. Alternatively, model_logocv performs leave-one-group-out cross-validation, which can be used to evaluate a model’s ability to predict the response at sites not in the calibration dataset. Both functions can be applied to linear mixed-effects model (class: lmerMod) or hierarchical generalized additive model (class: gam) model with a single random grouping factor. See ?model_cv and ?model_logocv for details.

model_cv and model_logocv measure the performance of the model under different situations, and so will not necessarily agree which is the best model. If the priority is to be able to predict the response at sites in the calibration dataset, then use model_cv; if the priority is to be able to predict the response at sites not in the calibration dataset, then use model_logocv.

In the following example, the process of simplifying the full model is guided by model_cv, which measures predictive performance using the Root Mean Square Error (RMSE); lower values indicate better predictions. r is set at 20 to give stable RMSE estimates and control is used to help prevent model convergence issues. Backward elimination of random-effect terms is followed by backward elimination of fixed-effect terms, until the RMSE cannot be reduced any further. model_3 is the final, preferred model.

# get RMSE estimate
cv_full[1]

## $RMSE
## [1] 0.04097054

# drop random slope
model_1 <- model_full <- lme4::lmer(LIFE_F_OE ~ Q95z_lag0 + Q10z_lag0 + Q95z_lag1 + (1 | biol_site_id), data = autumn_data, REML = FALSE)

cv_1 <- model_cv(model = model_1, data = autumn_data, group = "biol_site_id", k = 5, r = 20, control = my_control)

# model_1 has a lower RMSE than model_full, so drop Q95z slope
cv_1[1]

## $RMSE
## [1] 0.04023993

# drop Q95zLS1 term
model_2 <- model_full <- lme4::lmer(LIFE_F_OE ~ Q95z_lag0 + Q10z_lag0 + (1 | biol_site_id), data = autumn_data, REML = FALSE)

cv_2 <- model_cv(model = model_2, data = autumn_data, group = "biol_site_id", k = 5, r = 20, control = my_control)

# model_2 has a lower RMSE than model_full, so drop Q95zLS1
cv_2[1]

## $RMSE
## [1] 0.03964198

# drop Q10z term
model_3 <- model_full <- lme4::lmer(LIFE_F_OE ~ Q95z_lag0 + (1 | biol_site_id), data = autumn_data, REML = FALSE)

cv_3 <- model_cv(model = model_3, data = autumn_data, group = "biol_site_id", k = 5, r = 20, control = my_control)

# model_3 has a lower RMSE than model_2, so drop Q10z
cv_3[1]

## $RMSE
## [1] 0.03951175

# no further model simplification possible; model_3 is the final model.

Model diagnostics

The diag_lmer function produces a variety of diagnostic plots for a mixed-effects regression (lmer) model.

# generate diagnostic plots
diag_plots <- diag_lmer(model = model_3, data = autumn_data, facet_by = NULL)

These include:

Fitted vs observed values.

Normal probability plot.

Residuals vs fitted.

Histogram of residuals.

Residuals vs model fixed predictors. This displays a linear model (red) for each fixed predictor vs the model’s residuals. The blue line is a loess-smoothed line. The main purpose of this plot is to check whether or not the relationship between residuals and a predictor is linear.

Alternatively, the plots can be faceted by site, to explore model fit and residuals site-by-site. In this example, the sites are arranged in descending order by the mean of the fitted values.

# generate diagnostic plots
diag_plots2 <- diag_lmer(model = model_3, 
                         data = autumn_data, 
                         facet_by = "biol_site_id", 
                         order_by = "mean", 
                         order = "descending", 
                         ncol = 4, 
                         scales = "fixed")

diag_plots2[[1]]

diag_plots2[[2]]

diag_plots2[[3]]

# boxplot instead of histogram of residuals:
diag_plots2[[4]]

# same as un-faceted option:
# diag_plots2[[5]]

Other useful diagnostic plots for lmer models include:

caterpillar plots, for visualising the random intercepts (and, where included, slopes) for each site:

sjPlot::plot_model(model = model_3, 
                   type = "re", 
                   facet.grid = FALSE, 
                   free.scale = FALSE, 
                   title = NULL, 
                   vline.color = "darkgrey") + 
  ylim(-0.2,0.2)

partial regression plot for a fixed effect, showing the ‘global’ relationship:

visreg::visreg(model_3, "Q95z_lag0", ylab = "LIFE_F_OE")

partial regression plot for a random effect, showing the site-specific relationship (here only random intercepts were included so the slope is consistent among sites):

visreg::visreg(model_3, "Q95z_lag0", by = "biol_site_id", ylab = "LIFE_F_OE")

Predictions

The plot_predictions function produces a time series plot of predictions from a hydro-ecological model, alongside observed biology and flow data. The plot is faceted by site, as specified by the site_col argument. The function generates a ggplot object, which can optionally be saved as a .png file named “Predictions_Plot.png”.

# Generate prediction intervals
pi <- merTools::predictInterval(merMod = model_3, 
                                level = 0.95, 
                                n.sims = 1000,
                                stat = "mean",
                                type="linear.prediction",
                                include.resid.var = TRUE)

# bind to model calibration dataset
autumn_data <- cbind(autumn_data, pi)

# produce predictions plot
plot_predictions(data = autumn_data,
                  biol_metric = "LIFE_F_OE",
                  time_col = "Year",
                  site_col = "biol_site_id",
                  flow_stat = c("Q95z_lag0", "Q10z_lag0"),
                  pred_col = c("fit", "lwr", "upr"),
                  ncol = 4)

Environment Agency / APEM Ltd