This document describes how we map the checklist data to Darwin Core. The source file for this document can be found here.
Load libraries:
library(tidyverse) # To do data science
library(tidylog) # To provide feedback on dplyr functions
library(magrittr) # To use %<>% pipes
library(here) # To find files
library(janitor) # To clean input data
library(digest) # To generate hashes
The source data is maintained in this Google Spreadsheet.
Read the relevant worksheets (published as csv):
input_distribution <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSrKN8_XbQ4Vo-VrtUiqAtS5im3QxkHBNSJHPSLmrkM5C1PIC7DOg-oboRcEJZtWp_qsi802YRlRp8C/pub?gid=979140000&single=true&output=csv")
input_taxa <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSrKN8_XbQ4Vo-VrtUiqAtS5im3QxkHBNSJHPSLmrkM5C1PIC7DOg-oboRcEJZtWp_qsi802YRlRp8C/pub?gid=1559651428&single=true&output=csv")
input_regions <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSrKN8_XbQ4Vo-VrtUiqAtS5im3QxkHBNSJHPSLmrkM5C1PIC7DOg-oboRcEJZtWp_qsi802YRlRp8C/pub?gid=2076261682&single=true&output=csv")
input_references <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSrKN8_XbQ4Vo-VrtUiqAtS5im3QxkHBNSJHPSLmrkM5C1PIC7DOg-oboRcEJZtWp_qsi802YRlRp8C/pub?gid=1083424499&single=true&output=csv")
Sort the source files (to maintain some consistency between updates of the dataset):
input_distribution %<>% arrange(scientific_name, region_code)
input_taxa %<>% arrange(scientific_name)
input_regions %<>% arrange(region_code)
input_references %<>% arrange(region_code, citation)
Copy the source data to the repository to keep track of changes:
write_csv(input_distribution, here("data", "raw", "distribution.csv"), na = "")
write_csv(input_taxa, here("data", "raw", "taxa.csv"), na = "")
write_csv(input_regions, here("data", "raw", "regions.csv"), na = "")
write_csv(input_references, here("data", "raw", "references.csv"), na = "")
The first 4 files will be used for the Darwin Core mapping, while rlc is needed for the calculation in wrl_values.Rmd.
The input_references table contains references to the species lists and red lists per region. We can include those as source for distributions, but unfortunately not as full references in a reference extension, as that would require them to be associated (and repeated) with taxa.
Group references to create a concatenated list of unique references per region_code, e.g. AD = Đug (2013) | Koren and Kulijer (2016) | Lelo (2016). References that act as both red list and species list reference will only be listed once:
grouped_references <-
input_references %>%
arrange(region_code, citation_type, citation) %>%
group_by(region_code) %>%
summarize(reference = paste(unique(citation), collapse = " | "))
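As a minimal sketch of what this grouping does, the same steps can be run on a small made-up table. The name toy_references and its placeholder citations are assumptions for illustration only, not real data:

```r
library(dplyr)

# Hypothetical toy data with placeholder citations (not from the spreadsheet)
toy_references <- data.frame(
  region_code = c("AA", "AA", "AA", "BB"),
  citation_type = c("red list", "species list", "species list", "red list"),
  citation = c("Author B (2002)", "Author B (2002)", "Author A (2001)", "Author C (2003)"),
  stringsAsFactors = FALSE
)

toy_grouped <-
  toy_references %>%
  arrange(region_code, citation_type, citation) %>%
  group_by(region_code) %>%
  # unique() keeps the first occurrence, so a citation used for both
  # list types appears only once
  summarize(reference = paste(unique(citation), collapse = " | "))

toy_grouped
```

Because the rows are sorted on citation_type before citation, the concatenated list is ordered by list type first and is not necessarily alphabetical.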
Start from distributions and join with taxon, region and grouped references information (which should result in the same number of rows as distributions):
input_data <-
input_distribution %>%
left_join(input_taxa, by = "scientific_name") %>%
left_join(input_regions, by = "region_code") %>%
left_join(grouped_references, by = "region_code")
nrow(input_data) == nrow(input_distribution)
## [1] TRUE
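The row count only stays the same if each lookup table has at most one row per join key. A minimal sketch of that reasoning on made-up data (all toy_* names and values are hypothetical):

```r
library(dplyr)

# Hypothetical toy tables, not the real input
toy_distribution <- data.frame(
  scientific_name = c("Species a", "Species a", "Species b"),
  region_code = c("AA", "BB", "AA"),
  stringsAsFactors = FALSE
)
toy_taxa <- data.frame(
  scientific_name = c("Species a", "Species b"),
  family = c("Family x", "Family y"),
  stringsAsFactors = FALSE
)

# A lookup table with unique keys cannot multiply rows in a left join
keys_unique <- anyDuplicated(toy_taxa$scientific_name) == 0

toy_joined <- left_join(toy_distribution, toy_taxa, by = "scientific_name")
same_rows <- nrow(toy_joined) == nrow(toy_distribution)

c(keys_unique = keys_unique, same_rows = same_rows)
```

If the row check above ever returned FALSE, looking for duplicated keys in the joined tables in this way would be one place to start.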
Clean data somewhat:
input_data %<>%
remove_empty("rows") %>% # Remove empty rows
clean_names() # Have sensible (lowercase) column names
Sort on scientific name and region code (to maintain some consistency between updates of the dataset):
input_data %<>% arrange(scientific_name, region_code)
Information regarding scientific names, authorship, classification and phylogenetic order in input_taxa is derived from Wiemers et al. (2018), which is an “updated checklist of the European Butterflies” with 496 species.
All names on the regional checklists (scientific_name_regional in input_distribution) were matched by Dimitri Brosens to their accepted name (scientific_name) using the GBIF Backbone Taxonomy and verified by Dirk Maes (e.g. Lycaeides argyrognomon to official Plebejus argyrognomon). A limited number of scientific_name_regional could not be matched and/or are not accepted by Wiemers et al. (2018):
input_data %>%
filter(is.na(family)) %>% # family comes from input_taxa, so NA means no taxon match
select(scientific_name_regional, scientific_name, region_code, comments) %>%
arrange(scientific_name_regional)
We restrict the checklist to scientific names included in Wiemers et al. (2018) (i.e. those included in input_taxa, see issue 16):
input_data %<>% filter(!is.na(family)) # Remove distributions/names without match
To link taxa with information in the extension(s), each taxon needs a unique and relatively stable taxonID. Here we create one in the form of dataset_shortname:taxon:hash, where hash is a unique code based on the scientific name (that will remain the same as long as the scientific name remains the same):
vdigest <- Vectorize(digest) # Vectorize digest function to work with vectors
input_data %<>% mutate(taxon_id = paste(
"eurobutt-checklist",
"taxon",
vdigest(paste(scientific_name), algo = "md5"),
sep = ":"
))
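A minimal sketch of the stability property described above, using made-up names (toy_names is hypothetical). Note that digest() serializes its input by default, so the hash is not the MD5 of the raw string, but it is still deterministic for a given name:

```r
library(digest)

vdigest <- Vectorize(digest) # Same helper as in the mapping code

toy_names <- c("Species a", "Species b", "Species a") # Hypothetical names
toy_ids <- paste(
  "eurobutt-checklist",
  "taxon",
  vdigest(toy_names, algo = "md5"),
  sep = ":"
)

# Identical names give identical identifiers, different names differ
same_id <- toy_ids[1] == toy_ids[3]
different_id <- toy_ids[1] != toy_ids[2]

c(same_id = same_id, different_id = different_id)
```

A consequence of this design is that any change to a scientific name, even a spelling correction, yields a new taxonID.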
Show the number of species and distributions:
input_data %>%
group_by(family) %>%
summarize(
`# taxa` = n_distinct(taxon_id),
`# distributions` = n()
) %>%
adorn_totals("row")
Preview data:
input_data %>% head()
Create a dataframe with unique taxa only (retaining the first of multiple distribution rows):
taxon <- input_data %>% distinct(taxon_id, .keep_all = TRUE)
Map the data to Darwin Core Taxon:
taxon %<>% mutate(dwc_language = "en")
taxon %<>% mutate(dwc_license = "http://creativecommons.org/publicdomain/zero/1.0/")
taxon %<>% mutate(dwc_rightsHolder = "INBO")
taxon %<>% mutate(dwc_accessRights = "https://www.inbo.be/en/norms-data-use")
taxon %<>% mutate(dwc_datasetID = "https://doi.org/10.15468/ye7whj")
taxon %<>% mutate(dwc_institutionCode = "INBO")
taxon %<>% mutate(dwc_datasetName = "National checklists and red lists for European butterflies")
taxon %<>% mutate(dwc_taxonID = taxon_id)
taxon %<>% mutate(dwc_scientificName = scientific_name)
taxon %<>% mutate(dwc_kingdom = "Animalia")
taxon %<>% mutate(dwc_phylum = "Arthropoda")
taxon %<>% mutate(dwc_class = "Insecta")
taxon %<>% mutate(dwc_order = "Lepidoptera")
taxon %<>% mutate(dwc_family = family)
taxon %<>% mutate(dwc_genus = genus)
taxon %<>% mutate(dwc_specificEpithet = specific_epithet)
taxon %<>% mutate(dwc_taxonRank = "species")
taxon %<>% mutate(dwc_nomenclaturalCode = "ICZN")
Create a dataframe with all data:
distribution <- input_data
Map the data to Species Distribution:
distribution %<>% mutate(dwc_taxonID = taxon_id)
Map values:
distribution %<>% mutate(dwc_locationID = case_when(
# Europe: not WGSRPD:1 as that does not include Azores and Canary islands
# https://en.wikipedia.org/wiki/World_Geographical_Scheme_for_Recording_Plant_Distributions#1_Europe
region_code == "EUR" ~ "",
# European union: EU is reserved, but not an official ISO code
# https://en.wikipedia.org/wiki/ISO_3166-1#Criteria_for_inclusion
region_code == "EU27" ~ "",
# Azores, Canary Islands, Madeira: use marine regions codes
# The island groups have ISO_3166-2 subdivisions codes, but for consistency we use marine regions for all
region_code == "MA_AZ" ~ "http://marineregions.org/mrgid/2454", # ISO_3166:PT-20
region_code == "MA_AZ_Corvo" ~ "http://marineregions.org/mrgid/2462",
region_code == "MA_AZ_Faial" ~ "http://marineregions.org/mrgid/2458",
region_code == "MA_AZ_Flores" ~ "http://marineregions.org/mrgid/2461",
region_code == "MA_AZ_Graciosa" ~ "http://marineregions.org/mrgid/2463",
region_code == "MA_AZ_Pico" ~ "http://marineregions.org/mrgid/2460",
region_code == "MA_AZ_Santa Maria" ~ "http://marineregions.org/mrgid/2459",
region_code == "MA_AZ_Sao Jorge" ~ "http://marineregions.org/mrgid/2455",
region_code == "MA_AZ_Sao Miguel" ~ "http://marineregions.org/mrgid/2456",
region_code == "MA_AZ_Terceira" ~ "http://marineregions.org/mrgid/2457",
region_code == "MA_CA" ~ "http://marineregions.org/mrgid/3743", # ISO_3166:ES-CN
region_code == "MA_CA_El Hierro" ~ "http://marineregions.org/mrgid/3747",
region_code == "MA_CA_Fuerteventura" ~ "http://marineregions.org/mrgid/3757",
region_code == "MA_CA_Gran Canaria" ~ "http://marineregions.org/mrgid/3746",
region_code == "MA_CA_La Gomera" ~ "http://marineregions.org/mrgid/3744",
region_code == "MA_CA_La Palma" ~ "http://marineregions.org/mrgid/3748",
region_code == "MA_CA_Lanzarote" ~ "http://marineregions.org/mrgid/3755",
region_code == "MA_CA_Tenerife" ~ "http://marineregions.org/mrgid/3756",
region_code == "MA_MA" ~ "http://marineregions.org/mrgid/4955", # ISO_3166:PT-30
region_code == "MA_MA_Madeira" ~ "http://marineregions.org/mrgid/4956",
region_code == "MA_MA_Porto Santo" ~ "http://marineregions.org/mrgid/4958",
# All other countries: ISO_3166 code
is.na(region_name) ~ paste0("ISO_3166:", region_code)
))
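Because this case_when() has no catch-all condition, a region that has a region_name but is not listed explicitly ends up with NA. A minimal sketch of how such rows could be flagged, on made-up data (toy_regions and the code XX_Region are hypothetical):

```r
library(dplyr)

# Hypothetical toy data: the last row matches none of the conditions
toy_regions <- data.frame(
  region_code = c("EUR", "MA_AZ", "BE", "XX_Region"),
  region_name = c("Europe", "Azores", NA, "Hypothetical region"),
  stringsAsFactors = FALSE
)

toy_mapped <- toy_regions %>% mutate(location_id = case_when(
  region_code == "EUR" ~ "",
  region_code == "MA_AZ" ~ "http://marineregions.org/mrgid/2454",
  is.na(region_name) ~ paste0("ISO_3166:", region_code)
))

# Region codes that fell through every condition
unmapped <- toy_mapped %>%
  filter(is.na(location_id)) %>%
  pull(region_code)

unmapped
```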
Map values:
distribution %<>% mutate(dwc_locality = case_when(
# Use country name if not a region
is.na(region_name) ~ country_name,
# Use e.g. "Azores, Corvo" for island groups
str_detect(region_code, "MA_AZ_") ~ paste("Azores", region_name, sep = ", "),
str_detect(region_code, "MA_MA_") ~ paste("Madeira", region_name, sep = ", "),
str_detect(region_code, "MA_CA_") ~ paste("Canary Islands", region_name, sep = ", "),
# Use region name for rest
TRUE ~ region_name
))
Map values:
distribution %<>% mutate(dwc_countryCode = country_code)
Inspect mapped values:
distribution %>%
group_by(
region_code, region_name, country_code, country_name,
dwc_locationID, dwc_locality, dwc_countryCode
) %>%
count()
Inspect values:
distribution %>%
group_by(status) %>%
count()
Map values (see Wiemers et al. 2018, Materials & Methods). For migrants we decided to map to our own term migrant, as this information is too important to be lumped into irregular (with alternative term vagrant) and cannot be considered just present or absent (see issue #18).
distribution %<>% mutate(dwc_occurrenceStatus = recode(status,
"A" = "absent",
"Ex" = "absent", # Regionally extinct, is indicated in threatStatus "RE"
"Excluded" = "excluded",
"I" = "irregular", # Irregular vagrant
"M" = "migrant", # Regular migrant
"P" = "present",
"P?" = "doubtful", # Possibly present
"P(I)" = "irregular", # Probably only present as an immigrant
.default = ""
))
Inspect mapped values:
distribution %>%
group_by(status, dwc_occurrenceStatus) %>%
count()
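Since .default = "" silently turns any status code that is not listed into an empty string, it can be useful to look for those explicitly. A minimal sketch on made-up values (toy_status and the code "P??" are hypothetical):

```r
library(dplyr)

toy_status <- c("P", "M", "Ex", "P??") # "P??" stands in for an unexpected code

toy_occurrence <- recode(toy_status,
  "P" = "present",
  "M" = "migrant",
  "Ex" = "absent",
  .default = ""
)

# Source codes that were not covered by the mapping
unexpected <- toy_status[toy_occurrence == ""]

unexpected
```

The same pattern would apply to the threatStatus mapping, with the caveat that there some codes are mapped to an empty string on purpose.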
Inspect values:
distribution %>%
group_by(rlc) %>%
count()
Map values:
distribution %<>% mutate(dwc_threatStatus = recode(rlc,
# Official IUCN terms: https://www.iucnredlist.org/about/regional
"RE" = "RE", # Regionally Extinct
"CR" = "CR", # Critically Endangered
"EN" = "EN", # Endangered
"VU" = "VU", # Vulnerable
"NT" = "NT", # Near Threatened
"LC" = "LC", # Least Concern
"DD" = "DD", # Data Deficient
"NtA" = "NA", # Not Applicable
"NE" = "NE", # Not Evaluated
# Unofficial terms
"NRLA" = "", # No red list available
"R" = "Rare", # Used by Germany
"Unknown" = "unknown", # not an official IUCN term, but included to be explicit
"LC/NE" = "NE", # Used by Poland
.default = ""
))
Inspect mapped values:
distribution %>%
group_by(rlc, dwc_threatStatus) %>%
count()
distribution %<>% mutate(dwc_source = reference)
distribution %<>% mutate(dwc_occurrenceRemarks = case_when(
scientific_name_regional != scientific_name ~ paste0("In source as '", scientific_name_regional, "'")
))
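A minimal sketch of how this case_when() behaves, using the name pair mentioned earlier in a made-up table (toy_names is hypothetical): rows where the regional name equals the accepted name match no condition and get NA, which write_csv() with na = "" then writes as an empty value.

```r
library(dplyr)

toy_names <- data.frame(
  scientific_name_regional = c("Lycaeides argyrognomon", "Plebejus argyrognomon"),
  scientific_name = c("Plebejus argyrognomon", "Plebejus argyrognomon"),
  stringsAsFactors = FALSE
)

toy_remarks <- toy_names %>% mutate(remarks = case_when(
  scientific_name_regional != scientific_name ~
    paste0("In source as '", scientific_name_regional, "'")
))

toy_remarks$remarks
```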
Create a dataframe with unique taxa only (retaining the first of multiple distribution rows):
vernacular_names <- input_data %>% distinct(taxon_id, .keep_all = TRUE)
Remove taxa where the vernacular name contains “suggestion”:
vernacular_names %<>% filter(!str_detect(english_name, "suggestion"))
Map the data to Vernacular Names:
vernacular_names %<>% mutate(dwc_taxonID = taxon_id)
vernacular_names %<>% mutate(dwc_vernacularName = english_name)
vernacular_names %<>% mutate(dwc_language = "en")
Only keep the Darwin Core columns:
taxon %<>% select(starts_with("dwc_"))
distribution %<>% select(starts_with("dwc_"))
vernacular_names %<>% select(starts_with("dwc_"))
Drop the dwc_ prefix:
colnames(taxon) <- str_remove(colnames(taxon), "dwc_")
colnames(distribution) <- str_remove(colnames(distribution), "dwc_")
colnames(vernacular_names) <- str_remove(colnames(vernacular_names), "dwc_")
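str_remove() removes the first match of the pattern wherever it occurs in the string. All columns here start with dwc_, so the result is the same, but a slightly stricter variant could anchor the pattern to the start of the name. A minimal sketch on made-up column names (toy_columns is hypothetical):

```r
library(stringr)

toy_columns <- c("dwc_taxonID", "dwc_scientificName")

# "^" anchors the pattern, so only a leading "dwc_" is removed
toy_renamed <- str_remove(toy_columns, "^dwc_")

toy_renamed
```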
Preview taxon core:
taxon %>% head()
Preview distribution extension:
distribution %>% head()
Preview vernacular names extension:
vernacular_names %>% head()
Save to CSV:
write_csv(taxon, here("data", "processed", "taxon.csv"), na = "")
write_csv(distribution, here("data", "processed", "distribution.csv"), na = "")
write_csv(vernacular_names, here("data", "processed", "vernacularname.csv"), na = "")