1 Setup

Install the required libraries (only those that are not installed yet):

installed <- rownames(installed.packages())
required <- c("readr", "dplyr", "magrittr", "here", "rgbif", "digest")
if (!all(required %in% installed)) {
  install.packages(required[!required %in% installed])
}

Load libraries:

library(readr)          # To work with text files
library(dplyr)          # To manipulate tabular data
library(magrittr)       # To use %<>% pipes
library(here)           # To find files
library(rgbif)          # To use GBIF services
library(digest)         # To generate hashes

2 Read source data

Create a data frame input_data from the source data:

input_data <- readr::read_csv(file = here("data", "raw", "input_taxa.csv"))

## Rows: 60 Columns: 1
## ── Column specification ──────────────────
## Delimiter: ","
## chr (1): scientific_name
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Preview data:

input_data %>% head(n = 5)

3 Process source data

3.1 Scientific names

Use the GBIF nameparser to retrieve nomenclatural information for the scientific names in the checklist:

parsed_names <- input_data %>%
  distinct(scientific_name) %>%
  pull() %>%
  rgbif::parsenames() # An rgbif function

Show scientific names with potential nomenclatural issues, i.e. names that are not of type SCIENTIFIC or that could not be fully parsed. Note: these names are not necessarily incorrect.

parsed_names %>%
  select(scientificname, type, parsed, parsedpartially, rankmarker) %>%
  filter(!(type == "SCIENTIFIC" & parsed == "TRUE" & parsedpartially == "FALSE"))
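The filter logic can be tried out without calling the GBIF service. The following is a minimal sketch on a hand-made data frame: the names and values in toy_parsed are invented for illustration and are not actual parser output, and the parsed and parsedpartially columns are assumed to be logical.

```r
library(dplyr)

# Invented stand-in for the parser output (not real rgbif results)
toy_parsed <- tibble(
  scientificname = c("Name a", "Name b", "Name c"),
  type = c("SCIENTIFIC", "INFORMAL", "SCIENTIFIC"),
  parsed = c(TRUE, TRUE, TRUE),
  parsedpartially = c(FALSE, FALSE, TRUE)
)

# Keep every row that is NOT a fully parsed scientific name
flagged <- toy_parsed %>%
  filter(!(type == "SCIENTIFIC" & parsed & !parsedpartially))

flagged$scientificname
```

Here "Name b" is flagged because of its type and "Name c" because it was only partially parsed, while "Name a" passes all three checks and is left out.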

3.2 Taxon ranks

The nameparser function also provides information about the rank of the taxon (in rankmarker). Here we join this information with our checklist. Cleaning these ranks will be done in the Taxon Core mapping:

input_data %<>% left_join(
  parsed_names %>%
    select(scientificname, rankmarker),
  by = c("scientific_name" = "scientificname")
)

3.3 Taxon IDs

To link taxa with information in the extension(s), each taxon needs a unique and relatively stable taxonID. Here we create one in the form of dataset_shortname:taxon:hash, where hash is a unique code based on the scientific name and kingdom (it will remain the same as long as the scientific name and kingdom remain the same):

vdigest <- Vectorize(digest) # Vectorize digest function to work with vectors
input_data %<>% mutate(taxon_id = paste(
  "dvw-target-list",
  "taxon",
  vdigest(paste(scientific_name, "Plantae"), algo = "md5"),
  sep = ":"
))
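The stability of these identifiers can be checked on a few values. The sketch below wraps the same idea in a helper called make_taxon_id, which is a hypothetical name not used elsewhere in this workflow; the species names are illustrative and not taken from the source data.

```r
library(digest)

# Hypothetical helper: builds dataset_shortname:taxon:hash identifiers
make_taxon_id <- function(scientific_name, kingdom = "Plantae") {
  # Hash each "name kingdom" string separately with md5
  hashes <- vapply(
    paste(scientific_name, kingdom),
    digest,
    character(1),
    algo = "md5",
    USE.NAMES = FALSE
  )
  paste("dvw-target-list", "taxon", hashes, sep = ":")
}

ids <- make_taxon_id(
  c("Elodea nuttallii", "Elodea nuttallii", "Elodea canadensis")
)

ids[1] == ids[2] # Same name and kingdom give the same ID
ids[1] == ids[3] # A different name gives a different ID
```

Because the hash depends only on the name and kingdom, correcting the spelling of a scientific name changes its taxonID, which is worth keeping in mind when the checklist is updated.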

4 Taxon core

We start by creating a copy of the input data:

taxon <- input_data

4.1 Term mapping

Map the data to Darwin Core Taxon.

Start with record-level terms, which contain metadata about the dataset (generally the same for all records).

4.1.1 language

taxon %<>% mutate(dwc_language = "en")

4.1.2 license

taxon %<>% mutate(dwc_license = "http://creativecommons.org/publicdomain/zero/1.0/")

4.1.3 rightsHolder

taxon %<>% mutate(dwc_rightsHolder = "INBO") # e.g. "INBO"

4.1.4 datasetID

taxon %<>% mutate(dwc_datasetID = "https://doi.org/10.15468/52b8h9") # added after first publication

4.1.5 institutionCode

taxon %<>% mutate(dwc_institutionCode = "INBO")

4.1.6 datasetName

taxon %<>% mutate(dwc_datasetName = "De Vlaamse Waterweg target species list")

The following terms contain information about the taxon:

4.1.7 taxonID

taxon %<>% mutate(dwc_taxonID = taxon_id)

4.1.8 scientificName

taxon %<>% mutate(dwc_scientificName = scientific_name)

4.1.9 kingdom

taxon %<>% mutate(dwc_kingdom = "Plantae")

4.1.10 taxonRank

Inspect values:

taxon %>%
  group_by(rankmarker) %>%
  count()

Map values by recoding to the GBIF rank vocabulary:

taxon %<>% mutate(dwc_taxonRank = dplyr::case_match(rankmarker,
  "sp." ~ "species"
))

Inspect mapped values:

taxon %>%
  group_by(rankmarker, dwc_taxonRank) %>%
  count()
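In the mapping above, any rank marker that is not listed in case_match() ends up as NA, which is why the mapped values are inspected afterwards. The following sketch shows that behaviour on a small vector; the markers "subsp." and "var." and their mappings are assumptions added for illustration and do not necessarily occur in this checklist.

```r
library(dplyr)

# Illustrative rank markers; "f." is deliberately left unmapped
rank_markers <- c("sp.", "subsp.", "var.", "f.")

mapped <- dplyr::case_match(
  rank_markers,
  "sp." ~ "species",
  "subsp." ~ "subspecies",
  "var." ~ "variety"
)

mapped

# Markers that still need a mapping
rank_markers[is.na(mapped)]
```

If new rank markers appear in a later version of the source data, a check like the last line makes them visible before the data are published.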

4.2 Post-processing

Only keep the Darwin Core columns:

taxon %<>% select(starts_with("dwc_"))

Drop the dwc_ prefix:

colnames(taxon) <- sub(
  x = colnames(taxon),
  pattern = "^dwc_", # Anchor the pattern so only the leading prefix is removed
  replacement = ""
)