This document describes how we map the checklist data to Darwin Core. The source file for this document can be found here.

Load libraries:

library(tidyverse)      # To do data science
library(tidylog)        # To provide feedback on dplyr functions
library(magrittr)       # To use %<>% pipes
library(here)           # To find files
library(janitor)        # To clean input data
library(digest)         # To generate hashes

1 Read source data

The source data is maintained in this Google Spreadsheet.

Read the relevant worksheets (published as csv):

input_distribution <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSrKN8_XbQ4Vo-VrtUiqAtS5im3QxkHBNSJHPSLmrkM5C1PIC7DOg-oboRcEJZtWp_qsi802YRlRp8C/pub?gid=979140000&single=true&output=csv")
input_taxa <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSrKN8_XbQ4Vo-VrtUiqAtS5im3QxkHBNSJHPSLmrkM5C1PIC7DOg-oboRcEJZtWp_qsi802YRlRp8C/pub?gid=1559651428&single=true&output=csv")
input_regions <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSrKN8_XbQ4Vo-VrtUiqAtS5im3QxkHBNSJHPSLmrkM5C1PIC7DOg-oboRcEJZtWp_qsi802YRlRp8C/pub?gid=2076261682&single=true&output=csv")
input_references <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSrKN8_XbQ4Vo-VrtUiqAtS5im3QxkHBNSJHPSLmrkM5C1PIC7DOg-oboRcEJZtWp_qsi802YRlRp8C/pub?gid=1083424499&single=true&output=csv")

Sort the source files (to maintain some consistency between updates of the dataset):

input_distribution %<>% arrange(scientific_name, region_code)
input_taxa %<>% arrange(scientific_name)
input_regions %<>% arrange(region_code)
input_references %<>% arrange(region_code, citation)

Copy the source data to the repository to keep track of changes:

write_csv(input_distribution, here("data", "raw", "distribution.csv"), na = "")
write_csv(input_taxa, here("data", "raw", "taxa.csv"), na = "")
write_csv(input_regions, here("data", "raw", "regions.csv"), na = "")
write_csv(input_references, here("data", "raw", "references.csv"), na = "")

The first 4 files will be used for the Darwin Core mapping, while rlc is needed for the calculation in wrl_values.Rmd.

2 Preprocessing

2.1 Join spreadsheets

The input_references contain references of species/red lists per region. We can include those as source for distributions, but unfortunately not as full references in a reference extension, as that would require them to be associated (and repeated) with taxa.

Group references to create a concatenated list of unique references per region_code, e.g. AD = Đug (2013) | Koren and Kulijer (2016) | Lelo (2016). References that act as both red and species list reference, will only be listed once:

grouped_references <-
  input_references %>%
  arrange(region_code, citation_type, citation) %>%
  group_by(region_code) %>%
  summarize(reference = paste(unique(citation), collapse = " | "))

Start from distributions and join with taxon, region and grouped references information (which should result in the same number of rows as distributions):

input_data <-
  input_distribution %>%
  left_join(input_taxa, on = "scientific_name") %>%
  left_join(input_regions, on = "region_code") %>%
  left_join(grouped_references, on = "region_code")
nrow(input_data) == nrow(input_distribution)
## [1] TRUE

2.2 Tidy data

Clean data somewhat:

input_data %<>%
  remove_empty("rows") %>%    # Remove empty rows
  clean_names()               # Have sensible (lowercase) column names

Sort on scientific name and region code (to maintain some consistency between updates of the dataset):

input_data %<>% arrange(scientific_name, region_code)

2.3 Scientific names

Information regarding scientific names, authorship, classification and phylogenetic order in input_taxa is derived from Wiemers et al. (2018), which is an “updated checklist of the European Butterflies” with 496 species.

All names on the regional checklists (scientific_name_regional in input_distribution) were matched by Dimitri Brosens to their accepted name (scientific_name) using the GBIF Backbone Taxonomy and verified by Dirk Maes (e.g. Lycaeides argyrognomon to official Plebejus argyrognomon). A limited number of scientific_name_regional could not be matched and/or are not accepted by Wiemers et al. (2018):

input_data %>%
  filter(is.na(family)) %>% # Is a field from input_taxa and thus assesses there was a match
  select(scientific_name_regional, scientific_name, region_code, comments) %>%
  arrange(scientific_name_regional)

We restrict the checklist to scientific names included in Wiemers et al. (2018) (i.e. those included in input_taxa, see issue 16):

input_data %<>% filter(!is.na(family)) # Remove distributions/names without match

2.4 Taxon IDs

To link taxa with information in the extension(s), each taxon needs a unique and relatively stable taxonID. Here we create one in the form of dataset_shortname:taxon:hash, where hash is unique code based on scientific name (that will remain the same as long as scientific name remains the same):

vdigest <- Vectorize(digest) # Vectorize digest function to work with vectors
input_data %<>% mutate(taxon_id = paste(
  "eurobutt-checklist",
  "taxon",
  vdigest(paste(scientific_name), algo = "md5"),
  sep = ":"
))

2.5 Preview data

Show the number of species and distributions:

input_data %>%
  group_by(family) %>%
  summarize(
    `# taxa` = n_distinct(taxon_id),
    `# distributions` = n()
  ) %>%
  adorn_totals("row")

Preview data:

input_data %>% head()

3 Darwin Core mapping

3.1 Taxon core

Create a dataframe with unique taxa only (retaining the first of multiple distribution rows):

taxon <- input_data %>% distinct(taxon_id, .keep_all = TRUE)

Map the data to Darwin Core Taxon:

3.1.1 language

taxon %<>% mutate(dwc_language = "en")

3.1.2 license

taxon %<>% mutate(dwc_license = "http://creativecommons.org/publicdomain/zero/1.0/")

3.1.3 rightsHolder

taxon %<>% mutate(dwc_rightsHolder = "INBO")

3.1.4 accessRights

taxon %<>% mutate(dwc_accessRights = "https://www.inbo.be/en/norms-data-use")

3.1.5 datasetID

taxon  %<>% mutate(dwc_datasetID = "https://doi.org/10.15468/ye7whj")

3.1.6 institutionCode

taxon %<>% mutate(dwc_institutionCode = "INBO")

3.1.7 datasetName

taxon %<>% mutate(dwc_datasetName = "National checklists and red lists for European butterflies")

3.1.8 taxonID

taxon %<>% mutate(dwc_taxonID = taxon_id)

3.1.9 scientificName

taxon %<>% mutate(dwc_scientificName = scientific_name)

3.1.10 kingdom

taxon %<>% mutate(dwc_kingdom = "Animalia")

3.1.11 phylum

taxon %<>% mutate(dwc_phylum = "Arthropoda")

3.1.12 class

taxon %<>% mutate(dwc_class = "Insecta")

3.1.13 order

taxon %<>% mutate(dwc_order = "Lepidoptera")

3.1.14 family

taxon %<>% mutate(dwc_family = family)

3.1.15 genus

taxon %<>% mutate(dwc_genus = genus)

3.1.16 specificEpithet

taxon %<>% mutate(dwc_specificEpithet = specific_epithet)

3.1.17 taxonRank

taxon %<>% mutate(dwc_taxonRank = "species")

3.1.18 scientificNameAuthorship

taxon %<>% mutate(dwc_scientificNameAuthorship = authorship)

3.1.19 nomenclaturalCode

taxon %<>% mutate(dwc_nomenclaturalCode = "ICZN")

3.2 Distribution extension

Create a dataframe with all data:

distribution <- input_data

Map the data to Species Distribution:

3.2.1 taxonID

distribution %<>% mutate(dwc_taxonID = taxon_id) 

3.2.2 locationID

Map values:

distribution %<>% mutate(dwc_locationID = case_when(
  # Europe: not WGSRPD:1 as that does not include Azores and Canary islands
  # https://en.wikipedia.org/wiki/World_Geographical_Scheme_for_Recording_Plant_Distributions#1_Europe
  region_code == "EUR" ~ "",
  
  # European union: EU is reserved, but not an official ISO code
  # https://en.wikipedia.org/wiki/ISO_3166-1#Criteria_for_inclusion
  region_code == "EU27" ~ "",
  
  # Azores, Canary Islands, Madeira: use marine regions codes
  # The island groups have ISO_3166-2 subdivisions codes, but for consistency we use marine regions for all
  region_code == "MA_AZ" ~ "http://marineregions.org/mrgid/2454", # ISO_3166:PT-20
  region_code == "MA_AZ_Corvo" ~ "http://marineregions.org/mrgid/2462",
  region_code == "MA_AZ_Faial" ~ "http://marineregions.org/mrgid/2458",
  region_code == "MA_AZ_Flores" ~ "http://marineregions.org/mrgid/2461",
  region_code == "MA_AZ_Graciosa" ~ "http://marineregions.org/mrgid/2463",
  region_code == "MA_AZ_Pico" ~ "http://marineregions.org/mrgid/2460",
  region_code == "MA_AZ_Santa Maria" ~ "http://marineregions.org/mrgid/2459",
  region_code == "MA_AZ_Sao Jorge" ~ "http://marineregions.org/mrgid/2455",
  region_code == "MA_AZ_Sao Miguel" ~ "http://marineregions.org/mrgid/2456",
  region_code == "MA_AZ_Terceira" ~ "http://marineregions.org/mrgid/2457",
  
  region_code == "MA_CA" ~ "http://marineregions.org/mrgid/3743", # ISO_3166:ES-CN
  region_code == "MA_CA_El Hierro" ~ "http://marineregions.org/mrgid/3747",
  region_code == "MA_CA_Fuerteventura" ~ "http://marineregions.org/mrgid/3757",
  region_code == "MA_CA_Gran Canaria" ~ "http://marineregions.org/mrgid/3746",
  region_code == "MA_CA_La Gomera" ~ "http://marineregions.org/mrgid/3744",
  region_code == "MA_CA_La Palma" ~ "http://marineregions.org/mrgid/3748",
  region_code == "MA_CA_Lanzarote" ~ "http://marineregions.org/mrgid/3755",
  region_code == "MA_CA_Tenerife" ~ "http://marineregions.org/mrgid/3756",

  region_code == "MA_MA" ~ "http://marineregions.org/mrgid/4955", # ISO_3166:PT-30
  region_code == "MA_MA_Madeira" ~ "http://marineregions.org/mrgid/4956",
  region_code == "MA_MA_Porto Santo" ~ "http://marineregions.org/mrgid/4958",
  
  # All other countries: ISO_3166 code
  is.na(region_name) ~ paste0("ISO_3166:", region_code)
))

3.2.3 locality

Map values:

distribution %<>% mutate(dwc_locality = case_when(
  # Use country name if not a region  
  is.na(region_name) ~ country_name,
  
  # Use e.g. "Azores, Corvo" for island groups
  str_detect(region_code, "MA_AZ_") ~ paste("Azores", region_name, sep = ", "),
  str_detect(region_code, "MA_MA_") ~ paste("Madeira", region_name, sep = ", "),
  str_detect(region_code, "MA_CA_") ~ paste("Canary Islands", region_name, sep = ", "),
  
  # Use region name for rest
  TRUE ~ region_name
))

3.2.4 countryCode

Map values:

distribution %<>% mutate(dwc_countryCode = country_code)

Inspect mapped values:

distribution %>%
  group_by(
    region_code, region_name, country_code, country_name,
    dwc_locationID, dwc_locality, dwc_countryCode
  ) %>%
  count()

3.2.5 occurrenceStatus

Inspect values:

distribution %>%
  group_by(status) %>%
  count()

Map values (see Wiemers et al. 2018, Materials & Methods). For migrants we decided to map to own term migrant as this information is too important to be lumped into irregular (with alternative term vagrant) and cannot be considered just present or absent (see issue #18).

distribution %<>% mutate(dwc_occurrenceStatus = recode(status,
  "A" = "absent",
  "Ex" = "absent", # Regionally extinct, is indicated in threatStatus "RE"
  "Excluded" = "excluded",
  "I" = "irregular", # Irregular vagrant
  "M" = "migrant", # Regular migrant
  "P" = "present",
  "P?" = "doubtful", # Possibly present
  "P(I)" = "irregular", # Probably only present as an immigrant
  .default = ""
))

Inspect mapped values:

distribution %>%
  group_by(status, dwc_occurrenceStatus) %>%
  count()

3.2.6 threatStatus

Inspect values:

distribution %>%
  group_by(rlc) %>%
  count()

Map values:

distribution %<>% mutate(dwc_threatStatus = recode(rlc,
  # Official IUCN terms: https://www.iucnredlist.org/about/regional

  "RE" = "RE", # Regionally Extinct
  "CR" = "CR", # Critically Endangered
  "EN" = "EN", # Endangered
  "VU" = "VU", # Vulnerable
  "NT" = "NT", # Near Threatened
  "LC" = "LC", # Least Concern
  "DD" = "DD", # Data Deficient
  "NtA" = "NA", # Not Applicable
  "NE" = "NE", # Not Evaluated
  
  # Unofficial terms
  "NRLA" = "", # No red list available
  "R" = "Rare", # Used by Germany
  "Unknown" = "unknown", # not an official IUCN term, but included to be explicit
  "LC/NE" = "NE", # Used by Poland
  .default = ""
))

Inspect mapped values:

distribution %>%
  group_by(rlc, dwc_threatStatus) %>%
  count()

3.2.7 source

distribution %<>% mutate(dwc_source = reference)

3.2.8 occurrenceRemarks

distribution %<>% mutate(dwc_occurrenceRemarks = case_when(
  scientific_name_regional != scientific_name ~ paste0("In source as '", scientific_name_regional, "'")
))

4 Vernacular names extension

Create a dataframe with unique taxa only (retaining the first of multiple distribution rows):

vernacular_names <- input_data %>% distinct(taxon_id, .keep_all = TRUE)

Remove taxa where the vernacular name contains “suggestion”:

vernacular_names %<>% filter(!str_detect(english_name, "suggestion"))

Map the data to Vernacular Names:

4.0.1 taxonID

vernacular_names %<>% mutate(dwc_taxonID = taxon_id) 

4.0.2 vernacularName

vernacular_names %<>% mutate(dwc_vernacularName = english_name)

4.0.3 language

vernacular_names %<>% mutate(dwc_language = "en") 

5 Post-processing

Only keep the Darwin Core columns:

taxon %<>% select(starts_with("dwc_"))
distribution %<>% select(starts_with("dwc_"))
vernacular_names %<>% select(starts_with("dwc_"))

Drop the dwc_ prefix:

colnames(taxon) <- str_remove(colnames(taxon), "dwc_")
colnames(distribution) <- str_remove(colnames(distribution), "dwc_")
colnames(vernacular_names) <- str_remove(colnames(vernacular_names), "dwc_")

Preview taxon core:

taxon %>% head()

Preview distribution extension:

distribution %>% head()

Preview vernacular names extension:

vernacular_names %>% head()

Save to CSV:

write_csv(taxon, here("data", "processed", "taxon.csv"), na = "")
write_csv(distribution, here("data", "processed", "distribution.csv"), na = "")
write_csv(vernacular_names, here("data", "processed", "vernacularname.csv"), na = "")