This document describes how we map the checklist data to Darwin Core. The source file for this document can be found here.
Load libraries:
library(tidyverse) # To do data science
library(tidylog) # To provide feedback on dplyr functions
library(magrittr) # To use %<>% pipes
library(here) # To find files
library(janitor) # To clean input data
library(digest) # To generate hashes
The source data is maintained in this Google Spreadsheet.
Read the relevant worksheets (published as csv):
input_distribution <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSrKN8_XbQ4Vo-VrtUiqAtS5im3QxkHBNSJHPSLmrkM5C1PIC7DOg-oboRcEJZtWp_qsi802YRlRp8C/pub?gid=979140000&single=true&output=csv")
input_taxa <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSrKN8_XbQ4Vo-VrtUiqAtS5im3QxkHBNSJHPSLmrkM5C1PIC7DOg-oboRcEJZtWp_qsi802YRlRp8C/pub?gid=1559651428&single=true&output=csv")
input_regions <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSrKN8_XbQ4Vo-VrtUiqAtS5im3QxkHBNSJHPSLmrkM5C1PIC7DOg-oboRcEJZtWp_qsi802YRlRp8C/pub?gid=2076261682&single=true&output=csv")
input_references <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSrKN8_XbQ4Vo-VrtUiqAtS5im3QxkHBNSJHPSLmrkM5C1PIC7DOg-oboRcEJZtWp_qsi802YRlRp8C/pub?gid=1083424499&single=true&output=csv")
Sort the source files (to maintain some consistency between updates of the dataset):
input_distribution %<>% arrange(scientific_name, region_code)
input_taxa %<>% arrange(scientific_name)
input_regions %<>% arrange(region_code)
input_references %<>% arrange(region_code, citation)
Copy the source data to the repository to keep track of changes:
write_csv(input_distribution, here("data", "raw", "distribution.csv"), na = "")
write_csv(input_taxa, here("data", "raw", "taxa.csv"), na = "")
write_csv(input_regions, here("data", "raw", "regions.csv"), na = "")
write_csv(input_references, here("data", "raw", "references.csv"), na = "")
The first 4 files will be used for the Darwin Core mapping, while rlc is needed for the calculation in wrl_values.Rmd.
input_references contains references to the species lists and red lists per region. We can include those as the source for distributions, but unfortunately not as full references in a reference extension, as that would require them to be associated (and repeated) with taxa.
Group references to create a concatenated list of unique references per region_code, e.g. AD = Đug (2013) | Koren and Kulijer (2016) | Lelo (2016). References that act as both red list and species list reference are listed only once:
grouped_references <-
input_references %>%
arrange(region_code, citation_type, citation) %>%
group_by(region_code) %>%
summarize(reference = paste(unique(citation), collapse = " | "))
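The grouping step above can be tried on a small table. This is a minimal sketch: the rows and the name toy_references are made up and only mimic the shape of input_references.

```r
library(dplyr)

# Hypothetical rows mimicking input_references (not real data)
toy_references <- tibble(
  region_code   = c("AD", "AD", "AD", "BE"),
  citation_type = c("red list", "species list", "species list", "red list"),
  citation      = c("Author A (2013)", "Author A (2013)", "Author B (2016)", "Author C (2011)")
)

toy_grouped <-
  toy_references %>%
  arrange(region_code, citation_type, citation) %>%
  group_by(region_code) %>%
  # unique() drops a citation that serves as both red list and species list
  summarize(reference = paste(unique(citation), collapse = " | "))

toy_grouped
```

"Author A (2013)" is both a red list and a species list reference for AD in this toy table, so it appears only once in the concatenated string.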
Start from distributions and join with taxon, region and grouped references information (which should result in the same number of rows as distributions):
input_data <-
input_distribution %>%
left_join(input_taxa, by = "scientific_name") %>%
left_join(input_regions, by = "region_code") %>%
left_join(grouped_references, by = "region_code")
nrow(input_data) == nrow(input_distribution)
## [1] TRUE
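The row count check only passes when the join keys are unique in the lookup tables. The sketch below shows what goes wrong otherwise; the toy tables (toy_distribution, toy_taxa) are hypothetical and contain a deliberately duplicated key.

```r
library(dplyr)

# Hypothetical data: two distribution rows and a lookup with a duplicated key
toy_distribution <- tibble(
  scientific_name = c("Aglais io", "Pieris rapae"),
  region_code     = c("BE", "BE")
)
toy_taxa <- tibble(
  scientific_name = c("Aglais io", "Aglais io", "Pieris rapae"),
  family          = c("Nymphalidae", "Nymphalidae", "Pieridae")
)

# Keys that occur more than once in the lookup will multiply rows in the join
duplicated_keys <- toy_taxa %>% count(scientific_name) %>% filter(n > 1)

toy_joined <- left_join(toy_distribution, toy_taxa, by = "scientific_name")
nrow(toy_joined) == nrow(toy_distribution) # FALSE here: "Aglais io" matches twice
```

Listing the duplicated keys first makes it easier to find the offending rows in the source spreadsheet when the check returns FALSE.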
Clean data somewhat:
input_data %<>%
remove_empty("rows") %>% # Remove empty rows
clean_names() # Have sensible (lowercase) column names
Sort on scientific name and region code (to maintain some consistency between updates of the dataset):
input_data %<>% arrange(scientific_name, region_code)
Information regarding scientific names, authorship, classification and phylogenetic order in input_taxa is derived from Wiemers et al. (2018), which is an “updated checklist of the European Butterflies” with 496 species.
All names on the regional checklists (scientific_name_regional in input_distribution) were matched by Dimitri Brosens to their accepted name (scientific_name) using the GBIF Backbone Taxonomy and verified by Dirk Maes (e.g. Lycaeides argyrognomon to the accepted Plebejus argyrognomon). A limited number of scientific_name_regional values could not be matched and/or are not accepted by Wiemers et al. (2018):
input_data %>%
filter(is.na(family)) %>% # family comes from input_taxa, so NA means there was no match
select(scientific_name_regional, scientific_name, region_code, comments) %>%
arrange(scientific_name_regional)
We restrict the checklist to scientific names included in Wiemers et al. (2018) (i.e. those included in input_taxa, see issue 16):
input_data %<>% filter(!is.na(family)) # Remove distributions/names without match
To link taxa with information in the extension(s), each taxon needs a unique and relatively stable taxonID. Here we create one in the form of dataset_shortname:taxon:hash, where hash is an MD5 hash based on the scientific name (so it remains the same as long as the scientific name remains the same):
vdigest <- Vectorize(digest) # Vectorize digest function to work with vectors
input_data %<>% mutate(taxon_id = paste(
"eurobutt-checklist",
"taxon",
vdigest(paste(scientific_name), algo = "md5"),
sep = ":"
))
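The stability of the identifier can be checked with a small sketch. The helper make_taxon_id is hypothetical (it is not part of this script) and uses vapply() instead of Vectorize(), with the same digest() call per name.

```r
library(digest)

# Hypothetical helper wrapping the same digest() call used above
make_taxon_id <- function(scientific_name) {
  hashes <- vapply(
    scientific_name, digest, character(1),
    algo = "md5", USE.NAMES = FALSE
  )
  paste("eurobutt-checklist", "taxon", hashes, sep = ":")
}

ids <- make_taxon_id(c("Plebejus argyrognomon", "Plebejus argyrognomon", "Aglais io"))
ids[1] == ids[2] # same name, same identifier
ids[1] == ids[3] # different name, different identifier
```

Because the hash depends only on the name string, any change to a scientific name (even a corrected typo) produces a new taxonID.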
Show the number of species and distributions:
input_data %>%
group_by(family) %>%
summarize(
`# taxa` = n_distinct(taxon_id),
`# distributions` = n()
) %>%
adorn_totals("row")
Preview data:
input_data %>% head()
Create a dataframe with unique taxa only (retaining the first of multiple distribution rows):
taxon <- input_data %>% distinct(taxon_id, .keep_all = TRUE)
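As a small illustration of what distinct() with .keep_all = TRUE retains, consider this sketch on a made-up table (toy_rows and its values are hypothetical):

```r
library(dplyr)

# Hypothetical distribution rows: taxon t1 occurs in two regions
toy_rows <- tibble(
  taxon_id    = c("t1", "t1", "t2"),
  region_code = c("BE", "NL", "BE")
)

# One row per taxon_id; the other columns come from the first row encountered
toy_taxa_unique <- toy_rows %>% distinct(taxon_id, .keep_all = TRUE)
toy_taxa_unique
```

The region-specific columns of the retained row are arbitrary for a taxon, which is harmless here because only taxon-level columns are mapped to the Taxon core.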
Map the data to Darwin Core Taxon:
taxon %<>% mutate(dwc_language = "en")
taxon %<>% mutate(dwc_license = "http://creativecommons.org/publicdomain/zero/1.0/")
taxon %<>% mutate(dwc_rightsHolder = "INBO")
taxon %<>% mutate(dwc_accessRights = "https://www.inbo.be/en/norms-data-use")
taxon %<>% mutate(dwc_datasetID = "https://doi.org/10.15468/ye7whj")
taxon %<>% mutate(dwc_institutionCode = "INBO")
taxon %<>% mutate(dwc_datasetName = "National checklists and red lists for European butterflies")
taxon %<>% mutate(dwc_taxonID = taxon_id)
taxon %<>% mutate(dwc_scientificName = scientific_name)
taxon %<>% mutate(dwc_kingdom = "Animalia")
taxon %<>% mutate(dwc_phylum = "Arthropoda")
taxon %<>% mutate(dwc_class = "Insecta")
taxon %<>% mutate(dwc_order = "Lepidoptera")
taxon %<>% mutate(dwc_family = family)
taxon %<>% mutate(dwc_genus = genus)
taxon %<>% mutate(dwc_specificEpithet = specific_epithet)
taxon %<>% mutate(dwc_taxonRank = "species")
taxon %<>% mutate(dwc_nomenclaturalCode = "ICZN")
Create a dataframe with all data:
distribution <- input_data
Map the data to Species Distribution:
distribution %<>% mutate(dwc_taxonID = taxon_id)
Map values:
distribution %<>% mutate(dwc_locationID = case_when(
# Europe: not WGSRPD:1 as that does not include the Azores and Canary Islands
# https://en.wikipedia.org/wiki/World_Geographical_Scheme_for_Recording_Plant_Distributions#1_Europe
region_code == "EUR" ~ "",
# European Union: EU is reserved, but not an official ISO code
# https://en.wikipedia.org/wiki/ISO_3166-1#Criteria_for_inclusion
region_code == "EU27" ~ "",
# Azores, Canary Islands, Madeira: use marine regions codes
# The island groups have ISO_3166-2 subdivision codes, but for consistency we use marine regions for all
region_code == "MA_AZ" ~ "http://marineregions.org/mrgid/2454", # ISO_3166:PT-20
region_code == "MA_AZ_Corvo" ~ "http://marineregions.org/mrgid/2462",
region_code == "MA_AZ_Faial" ~ "http://marineregions.org/mrgid/2458",
region_code == "MA_AZ_Flores" ~ "http://marineregions.org/mrgid/2461",
region_code == "MA_AZ_Graciosa" ~ "http://marineregions.org/mrgid/2463",
region_code == "MA_AZ_Pico" ~ "http://marineregions.org/mrgid/2460",
region_code == "MA_AZ_Santa Maria" ~ "http://marineregions.org/mrgid/2459",
region_code == "MA_AZ_Sao Jorge" ~ "http://marineregions.org/mrgid/2455",
region_code == "MA_AZ_Sao Miguel" ~ "http://marineregions.org/mrgid/2456",
region_code == "MA_AZ_Terceira" ~ "http://marineregions.org/mrgid/2457",
region_code == "MA_CA" ~ "http://marineregions.org/mrgid/3743", # ISO_3166:ES-CN
region_code == "MA_CA_El Hierro" ~ "http://marineregions.org/mrgid/3747",
region_code == "MA_CA_Fuerteventura" ~ "http://marineregions.org/mrgid/3757",
region_code == "MA_CA_Gran Canaria" ~ "http://marineregions.org/mrgid/3746",
region_code == "MA_CA_La Gomera" ~ "http://marineregions.org/mrgid/3744",
region_code == "MA_CA_La Palma" ~ "http://marineregions.org/mrgid/3748",
region_code == "MA_CA_Lanzarote" ~ "http://marineregions.org/mrgid/3755",
region_code == "MA_CA_Tenerife" ~ "http://marineregions.org/mrgid/3756",
region_code == "MA_MA" ~ "http://marineregions.org/mrgid/4955", # ISO_3166:PT-30
region_code == "MA_MA_Madeira" ~ "http://marineregions.org/mrgid/4956",
region_code == "MA_MA_Porto Santo" ~ "http://marineregions.org/mrgid/4958",
# All other countries: ISO_3166 code
is.na(region_name) ~ paste0("ISO_3166:", region_code)
))
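The mapping above has no catch-all branch, so a region_code that matches none of the conditions ends up as NA. A minimal sketch on made-up rows shows how such unmapped codes could be listed; the table toy_regions, its region_name values and the code "XX_Unmapped" are hypothetical.

```r
library(dplyr)

# Hypothetical regions; "XX_Unmapped" stands for a code missing from the mapping
toy_regions <- tibble(
  region_code = c("EUR", "MA_AZ", "BE", "XX_Unmapped"),
  region_name = c("Europe", "Azores", NA, "Some region")
) %>%
  mutate(location_id = case_when(
    region_code == "EUR" ~ "",
    region_code == "MA_AZ" ~ "http://marineregions.org/mrgid/2454",
    is.na(region_name) ~ paste0("ISO_3166:", region_code)
  ))

# Rows where no condition matched
unmapped <- toy_regions %>% filter(is.na(location_id))
unmapped
```

An empty result from the last step would indicate that every region code received a (possibly empty) locationID.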
Map values:
distribution %<>% mutate(dwc_locality = case_when(
# Use country name if not a region
is.na(region_name) ~ country_name,
# Use e.g. "Azores, Corvo" for island groups
str_detect(region_code, "MA_AZ_") ~ paste("Azores", region_name, sep = ", "),
str_detect(region_code, "MA_MA_") ~ paste("Madeira", region_name, sep = ", "),
str_detect(region_code, "MA_CA_") ~ paste("Canary Islands", region_name, sep = ", "),
# Use region name for rest
TRUE ~ region_name
))
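The order of the conditions matters here: the pattern "MA_AZ_" (with trailing underscore) matches the individual islands but not the island group "MA_AZ" itself, which falls through to the final branch. A sketch on hypothetical rows (toy_localities and its values are made up for illustration):

```r
library(dplyr)
library(stringr)

# Hypothetical rows: a country, an island group and a single island
toy_localities <- tibble(
  region_code  = c("BE", "MA_AZ", "MA_AZ_Corvo"),
  region_name  = c(NA, "Azores", "Corvo"),
  country_name = c("Belgium", "Portugal", "Portugal")
) %>%
  mutate(locality = case_when(
    is.na(region_name) ~ country_name,
    str_detect(region_code, "MA_AZ_") ~ paste("Azores", region_name, sep = ", "),
    TRUE ~ region_name
  ))

toy_localities$locality
```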
Map values:
distribution %<>% mutate(dwc_countryCode = country_code)
Inspect mapped values:
distribution %>%
group_by(
region_code, region_name, country_code, country_name,
dwc_locationID, dwc_locality, dwc_countryCode
) %>%
count()
Inspect values:
distribution %>%
group_by(status) %>%
count()
Map values (see Wiemers et al. 2018, Materials & Methods). For migrants we decided to map to our own term migrant, as this information is too important to be lumped into irregular (with alternative term vagrant) and cannot be considered just present or absent (see issue #18).
distribution %<>% mutate(dwc_occurrenceStatus = recode(status,
"A" = "absent",
"Ex" = "absent", # Regionally extinct, is indicated in threatStatus "RE"
"Excluded" = "excluded",
"I" = "irregular", # Irregular vagrant
"M" = "migrant", # Regular migrant
"P" = "present",
"P?" = "doubtful", # Possibly present
"P(I)" = "irregular", # Probably only present as an immigrant
.default = ""
))
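The effect of .default = "" can be seen in a short sketch: any status code not listed in the mapping becomes an empty string rather than being passed through unchanged. The vector toy_status and the code "X" are made up for illustration.

```r
library(dplyr)

# Hypothetical status codes; "X" does not occur in the mapping
toy_status <- c("P", "Ex", "M", "P?", "X")

toy_mapped <- recode(toy_status,
  "P" = "present",
  "Ex" = "absent",
  "M" = "migrant",
  "P?" = "doubtful",
  .default = ""
)
toy_mapped
```

Inspecting the mapped values next to the original ones, as done in the following step, is what reveals such silently emptied codes.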
Inspect mapped values:
distribution %>%
group_by(status, dwc_occurrenceStatus) %>%
count()
Inspect values:
distribution %>%
group_by(rlc) %>%
count()
Map values:
distribution %<>% mutate(dwc_threatStatus = recode(rlc,
# Official IUCN terms: https://www.iucnredlist.org/about/regional
"RE" = "RE", # Regionally Extinct
"CR" = "CR", # Critically Endangered
"EN" = "EN", # Endangered
"VU" = "VU", # Vulnerable
"NT" = "NT", # Near Threatened
"LC" = "LC", # Least Concern
"DD" = "DD", # Data Deficient
"NtA" = "NA", # Not Applicable
"NE" = "NE", # Not Evaluated
# Unofficial terms
"NRLA" = "", # No red list available
"R" = "Rare", # Used by Germany
"Unknown" = "unknown", # Not an official IUCN term, but included to be explicit
"LC/NE" = "NE", # Used by Poland
.default = ""
))
Inspect mapped values:
distribution %>%
group_by(rlc, dwc_threatStatus) %>%
count()
distribution %<>% mutate(dwc_source = reference)
distribution %<>% mutate(dwc_occurrenceRemarks = case_when(
scientific_name_regional != scientific_name ~ paste0("In source as '", scientific_name_regional, "'")
))
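Since the case_when() above has a single condition and no fallback, rows where the regional name equals the accepted name get NA as remark (written as an empty value by write_csv(..., na = "")). A minimal sketch with a hypothetical two-row table, reusing the example names mentioned earlier:

```r
library(dplyr)

# Hypothetical rows: one regional name differs from the accepted name, one does not
toy_names <- tibble(
  scientific_name_regional = c("Lycaeides argyrognomon", "Plebejus argyrognomon"),
  scientific_name          = c("Plebejus argyrognomon", "Plebejus argyrognomon")
) %>%
  mutate(remarks = case_when(
    scientific_name_regional != scientific_name ~
      paste0("In source as '", scientific_name_regional, "'")
  ))

toy_names$remarks
```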
Create a dataframe with unique taxa only (retaining the first of multiple distribution rows):
vernacular_names <- input_data %>% distinct(taxon_id, .keep_all = TRUE)
Remove taxa where the vernacular name contains “suggestion”:
vernacular_names %<>% filter(!str_detect(english_name, "suggestion"))
Map the data to Vernacular Names:
vernacular_names %<>% mutate(dwc_taxonID = taxon_id)
vernacular_names %<>% mutate(dwc_vernacularName = english_name)
vernacular_names %<>% mutate(dwc_language = "en")
Only keep the Darwin Core columns:
taxon %<>% select(starts_with("dwc_"))
distribution %<>% select(starts_with("dwc_"))
vernacular_names %<>% select(starts_with("dwc_"))
Drop the dwc_ prefix:
colnames(taxon) <- str_remove(colnames(taxon), "^dwc_") # Only strip the leading prefix
colnames(distribution) <- str_remove(colnames(distribution), "^dwc_")
colnames(vernacular_names) <- str_remove(colnames(vernacular_names), "^dwc_")
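The two steps above (select the dwc_ columns, then drop the prefix) could also be written as a single pipe with rename_with(). This is only a sketch of an alternative on a made-up table (toy), not what the script does:

```r
library(dplyr)
library(stringr)

# Hypothetical table with two mapped columns and one source column
toy <- tibble(dwc_taxonID = "a", dwc_language = "en", status = "P")

toy_dwc <- toy %>%
  select(starts_with("dwc_")) %>%
  rename_with(~ str_remove(.x, "^dwc_")) # strip only the leading prefix

colnames(toy_dwc)
```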
Preview taxon core:
taxon %>% head()
Preview distribution extension:
distribution %>% head()
Preview vernacular names extension:
vernacular_names %>% head()
Save to CSV:
write_csv(taxon, here("data", "processed", "taxon.csv"), na = "")
write_csv(distribution, here("data", "processed", "distribution.csv"), na = "")
write_csv(vernacular_names, here("data", "processed", "vernacularname.csv"), na = "")