This document describes how we map the checklist data to Darwin Core. The source file for this document can be found here.
Load libraries:
library(tidyverse) # To do data science
library(tidylog) # To provide feedback on dplyr functions
library(magrittr) # To use %<>% pipes
library(here) # To find files
library(janitor) # To clean input data
library(digest) # To generate hashes
The source data is maintained in this Google Spreadsheet.
Read the relevant worksheets (published as csv):
input_distribution <- read_csv("")
input_taxa <- read_csv("")
input_regions <- read_csv("")
input_references <- read_csv("")
Sort the source files (to maintain some consistency between updates of the dataset):
input_distribution %<>% arrange(scientific_name, region_code)
input_taxa %<>% arrange(scientific_name)
input_regions %<>% arrange(region_code)
input_references %<>% arrange(region_code, citation)
Copy the source data to the repository to keep track of changes:
write_csv(input_distribution, here("data", "raw", "distribution.csv"), na = "")
write_csv(input_taxa, here("data", "raw", "taxa.csv"), na = "")
write_csv(input_regions, here("data", "raw", "regions.csv"), na = "")
write_csv(input_references, here("data", "raw", "references.csv"), na = "")
The first 4 files will be used for the Darwin Core mapping, while rlc is needed for the calculation in wrl_values.Rmd.
The input_references worksheet contains references to species lists and red lists per region. We can include those as the source for distributions, but unfortunately not as full references in a reference extension, as that would require them to be associated (and repeated) with taxa.
Group references to create a concatenated list of unique references per region_code, e.g. AD = Đug (2013) | Koren and Kulijer (2016) | Lelo (2016). References that act as both red list and species list reference will only be listed once:
grouped_references <-
input_references %>%
arrange(region_code, citation_type, citation) %>%
group_by(region_code) %>%
summarize(reference = paste(unique(citation), collapse = " | "))
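As a minimal sketch of how this grouping behaves, the same pipeline can be run on a small toy table. The region codes and citations below are hypothetical and only serve to illustrate the deduplication:

```r
library(dplyr)

# Hypothetical toy data: "Author A (2001)" acts as both red list and species list reference
refs <- tibble(
  region_code = c("XX", "XX", "XX", "YY"),
  citation_type = c("red list", "species list", "species list", "red list"),
  citation = c("Author A (2001)", "Author A (2001)", "Author B (2010)", "Author C (2015)")
)

grouped <- refs %>%
  arrange(region_code, citation_type, citation) %>%
  group_by(region_code) %>%
  # unique() keeps the first occurrence of each citation, so duplicates are listed once
  summarize(reference = paste(unique(citation), collapse = " | "))

grouped
```

Note that the order inside the concatenated string follows the sort on citation_type first and citation second, so it is not necessarily alphabetical by citation.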
Start from distributions and join with taxon, region and grouped references information (which should result in the same number of rows as distributions):
input_data <-
input_distribution %>%
left_join(input_taxa, by = "scientific_name") %>%
left_join(input_regions, by = "region_code") %>%
left_join(grouped_references, by = "region_code")
nrow(input_data) == nrow(input_distribution)
## [1] TRUE
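The row count check matters because a left join adds rows whenever a key occurs more than once in the lookup table. A small sketch with hypothetical data (names and families are made up) shows the effect:

```r
library(dplyr)

# Hypothetical distribution rows and taxon lookup
dist <- tibble(scientific_name = c("Aa bb", "Cc dd"), region_code = c("XX", "XX"))
taxa <- tibble(scientific_name = c("Aa bb", "Cc dd"), family = c("F1", "F2"))

# Same lookup, but with "Aa bb" accidentally listed twice
taxa_dup <- bind_rows(taxa, tibble(scientific_name = "Aa bb", family = "F1"))

joined <- left_join(dist, taxa, by = "scientific_name")
joined_dup <- left_join(dist, taxa_dup, by = "scientific_name")

nrow(joined) == nrow(dist)     # unique keys: row count is preserved
nrow(joined_dup) == nrow(dist) # duplicated key: the join adds a row
```

If the check returns FALSE, looking for duplicated scientific_name or region_code values in the lookup tables is a reasonable first step.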
Clean data somewhat:
input_data %<>%
remove_empty("rows") %>% # Remove empty rows
clean_names() # Have sensible (lowercase) column names
Sort on scientific name and region code (to maintain some consistency between updates of the dataset):
input_data %<>% arrange(scientific_name, region_code)
Information regarding scientific names, authorship, classification and phylogenetic order in input_taxa is derived from Wiemers et al. (2018), which is an “updated checklist of the European Butterflies” with 496 species.
All names on the regional checklists (scientific_name_regional in input_distribution) were matched by Dimitri Brosens to their accepted name (scientific_name) using the GBIF Backbone Taxonomy and verified by Dirk Maes (e.g. Lycaeides argyrognomon to the accepted name Plebejus argyrognomon). A limited number of scientific_name_regional values could not be matched and/or are not accepted by Wiemers et al. (2018):
input_data %>%
filter( %>% # Is a field from input_taxa and thus assesses there was a match
select(scientific_name_regional, scientific_name, region_code, comments) %>%
We restrict the checklist to scientific names included in Wiemers et al. (2018) (i.e. those included in input_taxa, see issue 16):
input_data %<>% filter(! # Remove distributions/names without match
To link taxa with information in the extension(s), each taxon needs a unique and relatively stable taxonID. Here we create one in the form of dataset_shortname:taxon:hash, where hash is a unique code based on the scientific name (it remains the same as long as the scientific name remains the same):
vdigest <- Vectorize(digest) # Vectorize digest function to work with vectors
input_data %<>% mutate(taxon_id = paste(
  "dataset_shortname", # Placeholder for the dataset shortname
  "taxon",
  vdigest(paste(scientific_name), algo = "md5"),
  sep = ":"
))
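The following sketch illustrates why a hash of the scientific name gives a stable identifier: the same name always yields the same md5 hash, and different names yield different ones. The prefix "dataset_shortname" is a placeholder, not the actual shortname of this dataset:

```r
library(digest)

vdigest <- Vectorize(digest) # Apply digest() element-wise

names_in <- c("Plebejus argyrognomon", "Papilio machaon")

# Vectorize() returns a named vector; unname() drops the names
hashes <- unname(vdigest(names_in, algo = "md5"))

# "dataset_shortname" is a hypothetical placeholder prefix
ids <- paste("dataset_shortname", "taxon", hashes, sep = ":")

# Hashing the same name again gives the same value
identical(hashes[1], digest("Plebejus argyrognomon", algo = "md5"))
```

Any change to the scientific name, including whitespace or capitalisation, results in a different hash and therefore a different taxonID.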
Show the number of species and distributions:
input_data %>%
  group_by(family) %>%
  summarize(
    `# taxa` = n_distinct(taxon_id),
    `# distributions` = n()
  )
Preview data:
input_data %>% head()
Create a dataframe with unique taxa only (retaining the first of multiple distribution rows):
taxon <- input_data %>% distinct(taxon_id, .keep_all = TRUE)
Map the data to Darwin Core Taxon:
taxon %<>% mutate(dwc_language = "en")
taxon %<>% mutate(dwc_license = "")
taxon %<>% mutate(dwc_rightsHolder = "INBO")
taxon %<>% mutate(dwc_accessRights = "")
taxon %<>% mutate(dwc_datasetID = "")
taxon %<>% mutate(dwc_institutionCode = "INBO")
taxon %<>% mutate(dwc_datasetName = "National checklists and red lists for European butterflies")
taxon %<>% mutate(dwc_taxonID = taxon_id)
taxon %<>% mutate(dwc_scientificName = scientific_name)
taxon %<>% mutate(dwc_kingdom = "Animalia")
taxon %<>% mutate(dwc_phylum = "Arthropoda")
taxon %<>% mutate(dwc_class = "Insecta")
taxon %<>% mutate(dwc_order = "Lepidoptera")
taxon %<>% mutate(dwc_family = family)
taxon %<>% mutate(dwc_genus = genus)
taxon %<>% mutate(dwc_specificEpithet = specific_epithet)
taxon %<>% mutate(dwc_taxonRank = "species")
taxon %<>% mutate(dwc_nomenclaturalCode = "ICZN")
Create a dataframe with all data:
distribution <- input_data
Map the data to Species Distribution:
distribution %<>% mutate(dwc_taxonID = taxon_id)
Map values:
distribution %<>% mutate(dwc_locationID = case_when(
# Europe: not WGSRPD:1 as that does not include Azores and Canary islands
region_code == "EUR" ~ "",
# European union: EU is reserved, but not an official ISO code
region_code == "EU27" ~ "",
# Azores, Canary Islands, Madeira: use marine regions codes
# The island groups have ISO_3166-2 subdivisions codes, but for consistency we use marine regions for all
region_code == "MA_AZ" ~ "", # ISO_3166:PT-20
region_code == "MA_AZ_Corvo" ~ "",
region_code == "MA_AZ_Faial" ~ "",
region_code == "MA_AZ_Flores" ~ "",
region_code == "MA_AZ_Graciosa" ~ "",
region_code == "MA_AZ_Pico" ~ "",
region_code == "MA_AZ_Santa Maria" ~ "",
region_code == "MA_AZ_Sao Jorge" ~ "",
region_code == "MA_AZ_Sao Miguel" ~ "",
region_code == "MA_AZ_Terceira" ~ "",
region_code == "MA_CA" ~ "", # ISO_3166:ES-CN
region_code == "MA_CA_El Hierro" ~ "",
region_code == "MA_CA_Fuerteventura" ~ "",
region_code == "MA_CA_Gran Canaria" ~ "",
region_code == "MA_CA_La Gomera" ~ "",
region_code == "MA_CA_La Palma" ~ "",
region_code == "MA_CA_Lanzarote" ~ "",
region_code == "MA_CA_Tenerife" ~ "",
region_code == "MA_MA" ~ "", # ISO_3166:PT-30
region_code == "MA_MA_Madeira" ~ "",
region_code == "MA_MA_Porto Santo" ~ "",
# All other countries: ISO_3166 code
TRUE ~ paste0("ISO_3166:", region_code)
))
Map values:
distribution %<>% mutate(dwc_locality = case_when(
# Use country name if not a region ~ country_name,
# Use e.g. "Azores, Corvo" for island groups
str_detect(region_code, "MA_AZ_") ~ paste("Azores", region_name, sep = ", "),
str_detect(region_code, "MA_MA_") ~ paste("Madeira", region_name, sep = ", "),
str_detect(region_code, "MA_CA_") ~ paste("Canary Islands", region_name, sep = ", "),
# Use region name for rest
TRUE ~ region_name
))
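case_when() evaluates its conditions in order and uses the first one that matches, with TRUE acting as the fallback. A minimal sketch on hypothetical rows shows how the island patterns (with trailing underscore) leave the island group itself to the fallback:

```r
library(dplyr)
library(stringr)

# Hypothetical subset of regions
regions <- tibble(
  region_code = c("MA_AZ_Corvo", "MA_CA_Tenerife", "MA_AZ"),
  region_name = c("Corvo", "Tenerife", "Azores")
)

regions <- regions %>% mutate(locality = case_when(
  str_detect(region_code, "MA_AZ_") ~ paste("Azores", region_name, sep = ", "),
  str_detect(region_code, "MA_CA_") ~ paste("Canary Islands", region_name, sep = ", "),
  # "MA_AZ" has no trailing underscore, so it ends up here
  TRUE ~ region_name
))

regions$locality
```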
Map values:
distribution %<>% mutate(dwc_countryCode = country_code)
Inspect mapped values:
distribution %>%
  distinct(
    region_code, region_name, country_code, country_name,
    dwc_locationID, dwc_locality, dwc_countryCode
  )
Inspect values:
distribution %>%
  group_by(status) %>%
  count()
Map values (see Wiemers et al. 2018, Materials & Methods). For migrants we decided to map to our own term migrant, as this information is too important to be lumped into irregular (with alternative term vagrant) and cannot be considered just present or absent (see issue #18):
distribution %<>% mutate(dwc_occurrenceStatus = recode(status,
"A" = "absent",
"Ex" = "absent", # Regionally extinct, is indicated in threatStatus "RE"
"Excluded" = "excluded",
"I" = "irregular", # Irregular vagrant
"M" = "migrant", # Regular migrant
"P" = "present",
"P?" = "doubtful", # Possibly present
"P(I)" = "irregular", # Probably only present as an immigrant
.default = ""
))
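As a small sketch of how recode() treats values that are not listed, the example below uses a subset of the mapping on a hypothetical vector. Unlisted values fall back to .default, while missing values stay NA because no .missing argument is supplied:

```r
library(dplyr)

# Hypothetical status values, including one unexpected code and one NA
status <- c("P", "Ex", "M", "P?", "unexpected", NA)

mapped <- recode(status,
  "A" = "absent",
  "Ex" = "absent",
  "M" = "migrant",
  "P" = "present",
  "P?" = "doubtful",
  .default = "" # Anything not listed above becomes an empty string
)

mapped
```

Inspecting the combinations of original and mapped values afterwards helps to spot codes that silently ended up as an empty string.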
Inspect mapped values:
distribution %>%
  group_by(status, dwc_occurrenceStatus) %>%
  count()
Inspect values:
distribution %>%
  group_by(rlc) %>%
  count()
Map values:
distribution %<>% mutate(dwc_threatStatus = recode(rlc,
# Official IUCN terms:
"RE" = "RE", # Regionally Extinct
"CR" = "CR", # Critically Endangered
"EN" = "EN", # Endangered
"VU" = "VU", # Vulnerable
"NT" = "NT", # Near Threatened
"LC" = "LC", # Least Concern
"DD" = "DD", # Data Deficient
"NtA" = "NA", # Not Applicable
"NE" = "NE", # Not Evaluated
# Unofficial terms
"NRLA" = "", # No red list available
"R" = "Rare", # Used by Germany
"Unknown" = "unknown", # not an official IUCN term, but included to be explicit
"LC/NE" = "NE", # Used by Poland
.default = ""
))
Inspect mapped values:
distribution %>%
  group_by(rlc, dwc_threatStatus) %>%
  count()
distribution %<>% mutate(dwc_source = reference)
distribution %<>% mutate(dwc_occurrenceRemarks = case_when(
scientific_name_regional != scientific_name ~ paste0("In source as '", scientific_name_regional, "'")
))
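Because this case_when() has no fallback condition, rows where the regional name equals the accepted name get NA, which is written as an empty cell when saving with na = "". A minimal sketch using the name pair mentioned earlier:

```r
library(dplyr)

# Toy rows: one regional synonym, one name identical to the accepted name
names_tbl <- tibble(
  scientific_name_regional = c("Lycaeides argyrognomon", "Plebejus argyrognomon"),
  scientific_name = c("Plebejus argyrognomon", "Plebejus argyrognomon")
)

names_tbl <- names_tbl %>% mutate(remarks = case_when(
  scientific_name_regional != scientific_name ~
    paste0("In source as '", scientific_name_regional, "'")
  # No fallback: non-matching rows become NA
))

names_tbl$remarks
```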
Create a dataframe with unique taxa only (retaining the first of multiple distribution rows):
vernacular_names <- input_data %>% distinct(taxon_id, .keep_all = TRUE)
Remove taxa where the vernacular name contains “suggestion”:
vernacular_names %<>% filter(!str_detect(english_name, "suggestion"))
Map the data to Vernacular Names:
vernacular_names %<>% mutate(dwc_taxonID = taxon_id)
vernacular_names %<>% mutate(dwc_vernacularName = english_name)
vernacular_names %<>% mutate(dwc_language = "en")
Only keep the Darwin Core columns:
taxon %<>% select(starts_with("dwc_"))
distribution %<>% select(starts_with("dwc_"))
vernacular_names %<>% select(starts_with("dwc_"))
Drop the dwc_ prefix:
colnames(taxon) <- str_remove(colnames(taxon), "dwc_")
colnames(distribution) <- str_remove(colnames(distribution), "dwc_")
colnames(vernacular_names) <- str_remove(colnames(vernacular_names), "dwc_")
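str_remove() drops the first match of the pattern wherever it occurs in the string. As a hedged variant (not the approach used above), anchoring the pattern with ^ restricts the removal to a leading prefix, which is slightly safer if a column name ever contains dwc_ elsewhere. The column names below are illustrative:

```r
library(stringr)

# Illustrative column names; the last one is hypothetical
cols <- c("dwc_taxonID", "dwc_scientificName", "my_dwc_field")

# "^" anchors the match to the start of the string
stripped <- str_remove(cols, "^dwc_")

stripped
```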
Preview taxon core:
taxon %>% head()
Preview distribution extension:
distribution %>% head()
Preview vernacular names extension:
vernacular_names %>% head()
Save to CSV:
write_csv(taxon, here("data", "processed", "taxon.csv"), na = "")
write_csv(distribution, here("data", "processed", "distribution.csv"), na = "")
write_csv(vernacular_names, here("data", "processed", "vernacularname.csv"), na = "")