Install required libraries (only if the libraries have not been installed before):
installed <- rownames(installed.packages())
required <- c("readr", "dplyr", "magrittr", "here", "rgbif", "digest")
if (!all(required %in% installed)) {
  install.packages(required[!required %in% installed])
}
Load libraries:
library(readr) # To work with text files
library(dplyr) # To manipulate tabular data
library(magrittr) # To use %<>% pipes
library(here) # To find files
library(rgbif) # To use GBIF services
library(digest) # To generate hashes
Create a data frame input_data from the source data:
input_data <- readr::read_csv(file = here("data", "raw", "input_taxa.csv"))
## Rows: 60 Columns: 1
## ── Column specification ──────────────────
## Delimiter: ","
## chr (1): scientific_name
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Preview data:
input_data %>% head(n = 5)
Use the GBIF nameparser to retrieve nomenclatural information for the scientific names in the checklist:
parsed_names <- input_data %>%
  distinct(scientific_name) %>%
  pull() %>%
  rgbif::parsenames() # An rgbif function
Show scientific names with nomenclatural issues, i.e. not of type = SCIENTIFIC or that could not be fully parsed. Note: these are not necessarily incorrect.
parsed_names %>%
  select(scientificname, type, parsed, parsedpartially, rankmarker) %>%
  filter(!(type == "SCIENTIFIC" & parsed == "TRUE" & parsedpartially == "FALSE"))
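The filter logic can be tried out on a small stand-in table. This is a minimal sketch: parsed_example and its values are made up for illustration and are not real nameparser output, and the parsed and parsedpartially columns are assumed to be logical.

```r
library(dplyr)

# Made-up stand-in for the nameparser output (real output has more columns)
parsed_example <- tibble(
  scientificname  = c("Name a", "Name b", "Name c"),
  type            = c("SCIENTIFIC", "HYBRID", "SCIENTIFIC"),
  parsed          = c(TRUE, TRUE, TRUE),
  parsedpartially = c(FALSE, FALSE, TRUE)
)

# Keep only rows that are NOT fully parsed scientific names
flagged <- parsed_example %>%
  filter(!(type == "SCIENTIFIC" & parsed == "TRUE" & parsedpartially == "FALSE"))

flagged
```

Here "Name a" passes all three conditions and is dropped, while the other two rows are kept for review.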
The nameparser function also provides information about the rank of the taxon (in rankmarker). Here we join this information with our checklist. Cleaning these ranks will be done in the Taxon Core mapping:
input_data %<>% left_join(
  parsed_names %>%
    select(scientificname, rankmarker),
  by = c("scientific_name" = "scientificname")
)
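A left join keeps every checklist row, so names without a parser result end up with an empty rankmarker. The sketch below shows one way to check for such rows. The tables checklist and parsed and their values are hypothetical.

```r
library(dplyr)

# Hypothetical checklist and parser output; "Taxon c" has no parser result
checklist <- tibble(scientific_name = c("Taxon a", "Taxon b", "Taxon c"))
parsed <- tibble(
  scientificname = c("Taxon a", "Taxon b"),
  rankmarker     = c("sp.", "sp.")
)

# Join on columns with different names in the two tables
joined <- checklist %>%
  left_join(parsed, by = c("scientific_name" = "scientificname"))

# Rows that did not find a match keep their name but get NA as rankmarker
unmatched <- joined %>% filter(is.na(rankmarker))
unmatched
```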
To link taxa with information in the extension(s), each taxon needs a unique and relatively stable taxonID. Here we create one in the form of dataset_shortname:taxon:hash, where hash is a unique code based on scientific name and kingdom (that will remain the same as long as scientific name and kingdom remain the same):
vdigest <- Vectorize(digest) # Vectorize digest function to work with vectors
input_data %<>% mutate(taxon_id = paste(
  "dvw-target-list",
  "taxon",
  vdigest(paste(scientific_name, "Plantae"), algo = "md5"),
  sep = ":"
))
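The stability and uniqueness of these identifiers can be checked on a few names. This is a sketch only: make_taxon_id() is a hypothetical helper that is not part of this script, it uses vapply() instead of Vectorize(), and the example names are for illustration.

```r
library(digest)

# Hypothetical helper: build IDs of the form dataset_shortname:taxon:hash
make_taxon_id <- function(scientific_name,
                          kingdom = "Plantae",
                          dataset_shortname = "dvw-target-list") {
  # One MD5 hash per "scientific name + kingdom" string
  hashes <- vapply(
    paste(scientific_name, kingdom),
    digest,
    character(1),
    algo = "md5",
    USE.NAMES = FALSE
  )
  paste(dataset_shortname, "taxon", hashes, sep = ":")
}

example_names <- c("Fallopia japonica", "Impatiens glandulifera")
ids <- make_taxon_id(example_names)

# Same input gives the same IDs on every run
identical(ids, make_taxon_id(example_names))

# No two taxa share an ID (anyDuplicated() returns 0 if there are none)
anyDuplicated(ids)
```

Note that digest() serializes the R object before hashing by default, so the hash is not the plain MD5 of the text string. This does not affect stability, as long as the IDs keep being generated in the same way.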
We start by creating a copy of the input data:
taxon <- input_data
Map the data to Darwin Core Taxon.
Start with record-level terms which contain metadata about the dataset (which is generally the same for all records).
taxon %<>% mutate(dwc_language = "en")
taxon %<>% mutate(dwc_license = "http://creativecommons.org/publicdomain/zero/1.0/")
taxon %<>% mutate(dwc_rightsHolder = "INBO") # e.g. "INBO"
taxon %<>% mutate(dwc_datasetID = "https://doi.org/10.15468/52b8h9") # added after first publication
taxon %<>% mutate(dwc_institutionCode = "INBO")
taxon %<>% mutate(dwc_datasetName = "De Vlaamse Waterweg target species list")
The following terms contain information about the taxon:
taxon %<>% mutate(dwc_taxonID = taxon_id)
taxon %<>% mutate(dwc_scientificName = scientific_name)
taxon %<>% mutate(dwc_kingdom = "Plantae")
Inspect the values of rankmarker:
taxon %>%
  group_by(rankmarker) %>%
  count()
Map values by recoding to the GBIF rank vocabulary:
taxon %<>% mutate(dwc_taxonRank = dplyr::case_match(
  rankmarker,
  "sp." ~ "species"
))
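Values that are not listed in case_match() become NA, which is why the mapped values are inspected afterwards. The sketch below shows this on a made-up vector, assuming dplyr 1.1.0 or later. The markers "subsp." and "var." are hypothetical additions for illustration and are not claimed to occur in this checklist.

```r
library(dplyr)

# Made-up rank markers, including one that is not mapped and one missing value
rank_markers <- c("sp.", "subsp.", "var.", "f.", NA)

mapped <- case_match(
  rank_markers,
  "sp."    ~ "species",
  "subsp." ~ "subspecies",
  "var."   ~ "variety"
)

# "f." and NA are not listed above, so both end up as NA
mapped
```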
Inspect mapped values:
taxon %>%
  group_by(rankmarker, dwc_taxonRank) %>%
  count()
Only keep the Darwin Core columns:
taxon %<>% select(starts_with("dwc_"))
Drop the dwc_ prefix:
colnames(taxon) <- gsub(
  x = colnames(taxon),
  pattern = "^dwc_", # Only match the prefix at the start of the name
  replacement = ""
)
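A small illustration of the prefix removal on made-up column names. Anchoring the pattern with ^ limits the replacement to the start of a name, so a name that happens to contain dwc_ elsewhere is left alone. The vector example_cols is hypothetical.

```r
# Made-up column names; the last one contains "dwc_" but not as a prefix
example_cols <- c("dwc_taxonID", "dwc_scientificName", "my_dwc_note")

stripped <- gsub(pattern = "^dwc_", replacement = "", x = example_cols)
stripped
```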
Preview data:
taxon %>% head()
Save to CSV:
write_csv(taxon, here("data", "processed", "taxon.csv"), na = "")
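With na = "", missing values are written as empty fields rather than the text NA. A minimal sketch that writes a made-up table to a temporary file and reads back the raw lines (example_taxon and its values are hypothetical):

```r
library(readr)

# Made-up table with one missing taxonRank
example_taxon <- data.frame(
  taxonID   = c("a", "b"),
  taxonRank = c("species", NA)
)

# Write to a temporary file instead of data/processed
tmp <- tempfile(fileext = ".csv")
write_csv(example_taxon, tmp, na = "")

# Inspect the raw file content: the missing value is an empty field
csv_lines <- readLines(tmp)
csv_lines
```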