Make a Movebank dataset “frictionless”

Frictionless Data is an open-source framework designed to remove common barriers to reading and understanding data. By transforming a Movebank dataset into a “Frictionless Data Package” (Walsh and Pollock, 2017), we create a set of files that is better documented and easier to read programmatically than the individual files downloaded from Movebank. It is also a necessary step before transforming to Darwin Core with write_dwc(), because it standardizes file and field names.

Here we build a Frictionless Data Package by starting from a directory containing CSV data files in Movebank format (reference data and GPS data), and adding a datapackage.json file which provides persistent human- and machine-readable definitions of the contents of the CSV files. Let’s try that on an existing dataset, published in the Movebank Data Repository:

Griffin L (2014) Data from: Forecasting spring from afar? Timing of migration and predictability of phenology along different migration routes of an avian herbivore [Svalbard data]. Movebank Data Repository. https://doi.org/10.5441/001/1.5k6b1364

It consists of two CSV files, reference data and GPS data, which we reference by URL:

reference_data <- "https://datarepository.movebank.org/server/api/core/bitstreams/a6e123b0-7588-40da-8f06-73559bb3ff6b/content"
gps_data <- "https://datarepository.movebank.org/server/api/core/bitstreams/df28a80e-e0c4-49fb-aa87-76ceb2d2b76f/content"

And its DOI:

doi <- "https://doi.org/10.5441/001/1.5k6b1364" # Don't use a http://dx.doi URL and exclude "www."
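
The comment above asks for the DOI in its https://doi.org/ form. As a minimal sketch, assuming you receive DOIs in mixed notations, a small helper can normalise them first. The function normalize_doi() is hypothetical and not part of movepub or frictionless:

```r
# Hypothetical helper (not part of movepub): rewrite a DOI to https://doi.org/<doi>
normalize_doi <- function(x) {
  x <- trimws(x)
  # Strip a leading "doi:" or any http(s)://(www.)(dx.)doi.org/ prefix
  prefix <- "^(doi:[[:space:]]*|https?://(www\\.)?(dx\\.)?doi\\.org/)"
  x <- sub(prefix, "", x, ignore.case = TRUE)
  paste0("https://doi.org/", x)
}

normalize_doi("http://dx.doi.org/10.5441/001/1.5k6b1364")
#> [1] "https://doi.org/10.5441/001/1.5k6b1364"
```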

Let’s bundle that into a Frictionless Data Package:

package <-
  frictionless::create_package() %>%
  append(c(id = doi), after = 0) %>%
  movepub::add_resource("reference-data", reference_data) %>%
  movepub::add_resource("gps", gps_data)

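Here’s what we did: frictionless::create_package() starts an empty Data Package; append() adds the DOI as the package id, placed first (after = 0); and each movepub::add_resource() call adds a named resource ("reference-data", "gps") that points to the corresponding CSV file and documents its fields.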
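
One detail in the pipeline above is append(..., after = 0), which places id before all other properties. The sketch below shows that behaviour with base R only. The list mock_package is a stand-in with illustrative property names, not the real object returned by create_package():

```r
# Stand-in for a package list; the property names are illustrative only
mock_package <- list(resources = list(), directory = ".")

# after = 0 inserts the new element before all existing ones
mock_package <- append(
  mock_package,
  c(id = "https://doi.org/10.5441/001/1.5k6b1364"),
  after = 0
)

names(mock_package)
#> [1] "id"        "resources" "directory"
```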

Here’s an example of how a field is documented:

package$resources[[1]]$schema$fields[[2]]
#> $name
#> [1] "animal-id"
#> 
#> $title
#> [1] "animal ID"
#> 
#> $description
#> [1] "An individual identifier for the animal, provided by the data owner. Values are unique within the study. If the data owner does not provide an Animal ID, an internal Movebank animal identifier is sometimes shown. Example: 'TUSC_CV5'; Units: none; Entity described: individual"
#> 
#> $type
#> [1] "string"
#> 
#> $format
#> [1] "default"
#> 
#> $`skos:exactMatch`
#> [1] "http://vocab.nerc.ac.uk/collection/MVB/current/MVB000016/3/"
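
Because the schema is a plain nested list, you can summarise all fields at once instead of inspecting them one by one. The following is a minimal sketch on a mock schema with the same shape as package$resources[[1]]$schema. Apart from animal-id, the field names and types are illustrative assumptions:

```r
# Mock schema shaped like package$resources[[1]]$schema (values are illustrative)
schema <- list(fields = list(
  list(name = "tag-id", title = "tag ID", type = "string"),
  list(name = "animal-id", title = "animal ID", type = "string"),
  list(name = "deploy-on-date", title = "deploy on timestamp", type = "datetime")
))

# Collect one row per field
fields <- data.frame(
  name = vapply(schema$fields, function(f) f$name, character(1)),
  type = vapply(schema$fields, function(f) f$type, character(1)),
  stringsAsFactors = FALSE
)
fields
```

With the real package, replacing schema by package$resources[[1]]$schema should give the same kind of overview.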

package can now be used to transform to Darwin Core (in the next step) or saved as a datapackage.json file for other uses:

frictionless::write_package(package, "data/my_dataset")
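
To see what write_package() leaves on disk without downloading anything, here is a small round trip in a temporary directory. It is a sketch that assumes the frictionless package is installed. The data frame demo_df and the directory name are made up for the illustration:

```r
library(frictionless)

# Small stand-in table; column names and values are illustrative
demo_df <- data.frame(
  `animal-id` = c("A1", "A2"),
  `tag-id` = c("T1", "T2"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Build a package with one data frame resource and write it to disk
demo_dir <- file.path(tempdir(), "my_dataset_demo")
demo_package <- create_package()
demo_package <- add_resource(demo_package, "reference-data", data = demo_df)
write_package(demo_package, demo_dir)

# Read it back from the datapackage.json that was written
reloaded <- read_package(file.path(demo_dir, "datapackage.json"))
resource_names <- vapply(reloaded$resources, function(r) r$name, character(1))
reference <- read_resource(reloaded, "reference-data")
```

After this, demo_dir contains a datapackage.json file next to the CSV data, and resource_names lists the resources that can be read with read_resource().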

Transform a Movebank dataset to Darwin Core

A Movebank dataset can be converted to Darwin Core using write_dwc(). Let’s try it out with a small dataset.

O_ASSEN is a bird GPS tracking study and dataset, available on Movebank and deposited on Zenodo.

write_dwc() requires the dataset to be structured as a Frictionless Data Package (recognizable by the presence of a datapackage.json file). That is the case for O_ASSEN on Zenodo, meaning it can be read with the frictionless R package.

Let’s define two local directories:

dir_source <- "data/o_assen/source" # Local directory for the source dataset
dir_dwc    <- "data/o_assen/dwc"    # Local directory for the Darwin Core dataset

And download the dataset from Zenodo to the local directory. Using a local package avoids having to download the data again when you encounter an issue:

frictionless::read_package("https://zenodo.org/records/10053903/files/datapackage.json") %>%
  # Remove the large acceleration resource we won't use (and thus won't download)
  frictionless::remove_resource("acceleration") %>%
  frictionless::write_package(dir_source)
#> Please make sure you have the right to access data from this Data Package for your intended use.
#> Follow applicable norms or requirements to credit the dataset and its authors.
#> For more information, see https://doi.org/10.5281/zenodo.10053903
#> Downloading file from https://zenodo.org/records/10053903/files/O_ASSEN-reference-data.csv
#> Downloading file from https://zenodo.org/records/10053903/files/O_ASSEN-gps-2018.csv.gz
#> Downloading file from https://zenodo.org/records/10053903/files/O_ASSEN-gps-2019.csv.gz
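
If you rerun a script like this often, you may want to skip the download when a local copy already exists. The sketch below shows one way to do that with a hypothetical helper, fetch_once(), which is not part of frictionless or movepub. The "download" here is a stand-in that writes a small file:

```r
# Hypothetical helper: run `fetch()` only when `path` does not exist yet
fetch_once <- function(path, fetch) {
  if (file.exists(path)) {
    message("Using existing local copy: ", path)
    return(invisible(FALSE))
  }
  fetch()
  invisible(TRUE)
}

# Demonstration with a stand-in download that writes an empty descriptor
demo_json <- file.path(tempfile("o_assen_"), "datapackage.json")
fake_download <- function() {
  dir.create(dirname(demo_json), recursive = TRUE)
  writeLines("{}", demo_json)
}

first <- fetch_once(demo_json, fake_download)  # file is missing: runs fake_download()
second <- fetch_once(demo_json, fake_download) # file exists: skipped
```

In the workflow above, path would be file.path(dir_source, "datapackage.json") and fetch a function wrapping the read_package(), remove_resource() and write_package() calls.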

We then create a package variable pointing to the local dataset:

package <- frictionless::read_package(file.path(dir_source, "datapackage.json"))
#> Please make sure you have the right to access data from this Data Package for your intended use.
#> Follow applicable norms or requirements to credit the dataset and its authors.
#> For more information, see https://doi.org/10.5281/zenodo.10053903

That covers the data. The metadata are derived from DataCite using the dataset DOI. For O_ASSEN, the DOI is already stored in the package metadata:

package$id
#> [1] "https://doi.org/10.5281/zenodo.10053903"

DataCite metadata do not include a contact person or rights holder, so we need to set those:

contact <- person(
  given = "Peter",
  family = "Desmet",
  email = "peter.desmet@inbo.be",
  comment = c(ORCID = "0000-0002-8442-8025")
)
rights_holder <- "Vogelwerkgroep Assen"
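
The contact is a regular person object from base R, so its parts can be checked before they are written to EML (where they appear as givenName, surName, electronicMailAddress and userId). A minimal sketch of such a check, which is an illustration and not something write_dwc() is known to require:

```r
contact <- person(
  given = "Peter",
  family = "Desmet",
  email = "peter.desmet@inbo.be",
  comment = c(ORCID = "0000-0002-8442-8025")
)

# Extract the ORCID from the named comment vector
orcid <- contact$comment[["ORCID"]]

# Basic sanity checks: an email is present and the ORCID has the expected pattern
has_email <- length(contact$email) == 1 && nzchar(contact$email)
valid_orcid <- grepl("^[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]$", orcid)
```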

We now have everything to convert the dataset to Darwin Core and EML:

movepub::write_dwc(
  package = package,
  doi = package$id,
  directory = dir_dwc,
  contact = contact,
  rights_holder = rights_holder
)
#> 
#> ── Reading data ──
#> 
#>  Taxa found in reference data and their WoRMS AphiaID:
#> Haematopus ostralegus: 147436
#> 
#> ── Transforming data to Darwin Core ──
#> 
#> ── Writing files ──
#> 
#> • data/o_assen/dwc/eml.xml
#> • data/o_assen/dwc/dwc_occurrence.csv
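
In a script, it can be useful to confirm that both files were written before moving on. The helper below, check_dwc_output(), is hypothetical. It is shown on a temporary directory that deliberately contains only one of the two files:

```r
# Hypothetical helper: return the expected files that are missing or empty
check_dwc_output <- function(directory,
                             files = c("eml.xml", "dwc_occurrence.csv")) {
  paths <- file.path(directory, files)
  ok <- file.exists(paths) & file.size(paths) > 0
  files[!ok]
}

# Demonstration: a directory with only eml.xml
demo_dir <- tempfile("dwc_")
dir.create(demo_dir)
writeLines("<eml/>", file.path(demo_dir, "eml.xml"))

missing_files <- check_dwc_output(demo_dir)
missing_files
#> [1] "dwc_occurrence.csv"
```

For the dataset above, check_dwc_output(dir_dwc) should return an empty character vector.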

The resulting eml.xml file includes the metadata:

EML::read_eml(file.path(dir_dwc, "eml.xml"))
#> dataset:
#>   title: O_ASSEN - Eurasian oystercatchers (Haematopus ostralegus, Haematopodidae)
#>     breeding in Assen (the Netherlands) [subsampled representation]
#>   metadataProvider:
#>     individualName:
#>       givenName: Peter
#>       surName: Desmet
#>     electronicMailAddress: peter.desmet@inbo.be
#>     userId:
#>       directory: https://orcid.org/
#>       userId: 0000-0002-8442-8025
#>   pubDate: '2023-10-30'
#>   abstract:
#>     para:
#>     - <![CDATA[<span></span>This animal tracking dataset is derived from Dijkstra
#>       et al. (2023, <a href="https://doi.org/10.5281/zenodo.10053903">https://doi.org/10.5281/zenodo.10053903</a>),
#>       a deposit of Movebank study <a href="https://www.movebank.org/cms/webapp?gwt_fragment=page=studies,path=study1605797471">1605797471</a>.
#>       Data have been standardized to Darwin Core using the <a href="https://inbo.github.io/movepub/">movepub</a>
#>       R package and are downsampled to the first GPS position per hour. The original
#>       dataset description follows.]]>
#>     - |-
#>       O_ASSEN - Eurasian oystercatchers (Haematopus ostralegus, Haematopodidae) breeding in Assen (the Netherlands) is a bird tracking dataset published by the Vogelwerkgroep Assen, Netherlands Institute of Ecology (NIOO-KNAW), Sovon, Radboud University, the University of Amsterdam and the Research Institute for Nature and Forest (INBO). It contains animal tracking data collected for the study O_ASSEN using trackers developed by the University of Amsterdam Bird Tracking System (UvA-BiTS, http://www.uva-bits.nl). The study was operational from 2018 to 2019. In total 6 individuals of Eurasian oystercatchers (Haematopus ostralegus) have been tagged as a breeding bird in the city of Assen (the Netherlands), mainly to study space use of oystercatchers breeding in urban areas. Data are uploaded from the UvA-BiTS database to Movebank and from there archived on Zenodo (see https://github.com/inbo/bird-tracking). No new data are expected.
#> 
#>       See van der Kolk et al. (2022, https://doi.org/10.3897/zookeys.1123.90623) for a more detailed description of this dataset.
#> 
#>       Files
#> 
#>       Data in this package are exported from Movebank study 1605797471. Fields in the data follow the Movebank Attribute Dictionary and are described in datapackage.json. Files are structured as a Frictionless Data Package. You can access all data in R via https://zenodo.org/records/10053903/files/datapackage.json using frictionless.
#> 
#> 
#> 
#>       datapackage.json: technical description of the data files.
#> 
#>       O_ASSEN-reference-data.csv: reference data about the animals, tags and deployments.
#> 
#>       O_ASSEN-gps-yyyy.csv.gz: GPS data recorded by the tags, grouped by year.
#> 
#>       O_ASSEN-acceleration-yyyy.csv.gz: acceleration data recorded by the tags, grouped by year.
#> 
#> 
#>       Acknowledgements
#> 
#>       These data were collected by Bert Dijkstra and Rinus Dillerop from Vogelwerkgroep Assen, in collaboration with the Netherlands Institute of Ecology (NIOO-KNAW), Sovon, Radboud University and the University of Amsterdam (UvA). Funding was provided by the Prins Bernard Cultuurfonds Drenthe, municipality of Assen, IJsvogelfonds (from Birdlife Netherlands and Nationale Postcodeloterij) and the Waterleiding Maatschappij Drenthe. The dataset was published with funding from Stichting NLBIF - Netherlands Biodiversity Information Facility.
#>     - This version adds alt-project-id to the reference-data and references the latest
#>       Movebank Attribute Dictionary.
#>   keywordSet:
#>     keywordThesaurus: n/a
#>     keyword:
#>     - animal movement
#>     - animal tracking
#>     - gps tracking
#>     - accelerometer
#>     - altitude
#>     - temperature
#>     - biologging
#>     - birds
#>     - UvA-BiTS
#>     - Movebank
#>     - frictionlessdata
#>   intellectualRights:
#>     para: cc0-1.0
#>   distribution:
#>     scope: document
#>     online:
#>       url:
#>         function: information
#>         url: https://www.movebank.org/cms/webapp?gwt_fragment=page=studies,path=study1605797471
#>   contact:
#>     individualName:
#>       givenName: Peter
#>       surName: Desmet
#>     electronicMailAddress: peter.desmet@inbo.be
#>     userId:
#>       directory: https://orcid.org/
#>       userId: 0000-0002-8442-8025
#>   alternateIdentifier:
#>   - https://doi.org/10.5281/zenodo.10053903
#>   - https://www.movebank.org/cms/webapp?gwt_fragment=page=studies,path=study1605797471
#>   creator:
#>   - individualName:
#>       givenName: Bert
#>       surName: Dijkstra
#>   - individualName:
#>       givenName: Rinus
#>       surName: Dillerop
#>   - individualName:
#>       givenName: Kees
#>       surName: Oosterbeek
#>   - individualName:
#>       givenName: Willem
#>       surName: Bouten
#>     userId:
#>       directory: https://orcid.org/
#>       userId: 0000-0002-5250-8872
#>   - individualName:
#>       givenName: Peter
#>       surName: Desmet
#>     userId:
#>       directory: https://orcid.org/
#>       userId: 0000-0002-8442-8025
#>   - individualName:
#>       givenName: Henk-Jan
#>       surName: van der Kolk
#>     userId:
#>       directory: https://orcid.org/
#>       userId: 0000-0002-8023-379X
#>   - individualName:
#>       givenName: Bruno J.
#>       surName: Ens
#>     userId:
#>       directory: https://orcid.org/
#>       userId: 0000-0002-4659-4807
#> packageId: f415e8bd-e5ed-4453-87f7-e5405ae102a5
#> schemaLocation: https://eml.ecoinformatics.org/eml-2.2.0 https://eml.ecoinformatics.org/eml-2.2.0/eml.xsd
#> system: uuid

The resulting dwc_occurrence.csv contains the Darwin Core data, created by transforming the package data. Some of the record-level terms at the beginning are set based on DataCite metadata and the provided rights_holder:

readr::read_csv(file.path(dir_dwc, "dwc_occurrence.csv"), show_col_types = FALSE)
#> # A tibble: 5,827 × 32
#>    type  license           rightsHolder datasetID institutionCode collectionCode
#>    <chr> <chr>             <chr>        <chr>     <chr>           <chr>         
#>  1 Event https://creative… Vogelwerkgr… https://… MPIAB           Movebank      
#>  2 Event https://creative… Vogelwerkgr… https://… MPIAB           Movebank      
#>  3 Event https://creative… Vogelwerkgr… https://… MPIAB           Movebank      
#>  4 Event https://creative… Vogelwerkgr… https://… MPIAB           Movebank      
#>  5 Event https://creative… Vogelwerkgr… https://… MPIAB           Movebank      
#>  6 Event https://creative… Vogelwerkgr… https://… MPIAB           Movebank      
#>  7 Event https://creative… Vogelwerkgr… https://… MPIAB           Movebank      
#>  8 Event https://creative… Vogelwerkgr… https://… MPIAB           Movebank      
#>  9 Event https://creative… Vogelwerkgr… https://… MPIAB           Movebank      
#> 10 Event https://creative… Vogelwerkgr… https://… MPIAB           Movebank      
#> # ℹ 5,817 more rows
#> # ℹ 26 more variables: datasetName <chr>, basisOfRecord <chr>,
#> #   dataGeneralizations <chr>, occurrenceID <chr>, sex <chr>, lifeStage <chr>,
#> #   reproductiveCondition <lgl>, occurrenceStatus <chr>, organismID <dbl>,
#> #   organismName <lgl>, eventID <chr>, parentEventID <chr>, eventType <chr>,
#> #   eventDate <dttm>, samplingProtocol <chr>, eventRemarks <chr>,
#> #   minimumElevationInMeters <dbl>, maximumElevationInMeters <dbl>, …
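
The EML abstract above states that the data "are downsampled to the first GPS position per hour". To make that rule concrete, here is a base R sketch that keeps the first position per individual per hour. It is an illustration on made-up data and not movepub's actual implementation:

```r
# Made-up GPS positions for two individuals
gps <- data.frame(
  individual = c("A", "A", "A", "B", "B"),
  timestamp = as.POSIXct(
    c("2018-06-01 10:05:00", "2018-06-01 10:40:00", "2018-06-01 11:02:00",
      "2018-06-01 10:15:00", "2018-06-01 10:16:00"),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

# Sort so that the earliest position within each hour comes first
gps <- gps[order(gps$individual, gps$timestamp), ]

# Label every position with its hour, then keep the first per individual and hour
hour <- format(gps$timestamp, "%Y-%m-%d %H", tz = "UTC")
first_per_hour <- gps[!duplicated(paste(gps$individual, hour)), ]

nrow(first_per_hour)
#> [1] 3
```

Individual A keeps its 10:05 and 11:02 positions, and individual B keeps its 10:15 position.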

Both files can be uploaded to a GBIF IPT for publication. The dataset will use the DOI of the source dataset. See the O_ASSEN example on an IPT and at GBIF.