Introduction

For a practical recipe to set up n2khab_data, go to Getting started.

See vignette("v022_example") if you’d like to be guided by a hands-on example.

Distribution of the data sources used by n2khab functions

Apart from several textual datasets provided directly with this package, the other N2KHAB data sources 1 consist of binary or large data. These are made available through cloud-based infrastructure and are preserved for the future at least via Zenodo (see below).2 An overview of data distribution pathways is given here.

More about Zenodo

Zenodo is a scientific repository funded by the European Commission and hosted at CERN.

  • We prefer Zenodo for its straightforward approach to preserving data sources for the long term – needed for reproducibility – while providing a stable DOI link for each version and for the record as a whole.
  • Managing the N2KHAB data sources at Zenodo allowed us to apply a uniform and pure representation of each data source: one data version corresponds to one data set in one file format (not zipped), and we add no other files (e.g. metadata files such as PDFs, alternative file formats, etc.). The file names in these Zenodo records follow the codes of the data sources used in the n2khab package.
  • Zenodo storage fits nicely with an internationalized approach to reproducible N2KHAB workflows in R, as its website is in English.

Local data storage

Data sources evolve, and hence, data source versions succeed one another. To ease reproducibility of analytical workflows, this package assumes locally stored data sources.

The n2khab functions, aimed at reading these data and returning them in R in a standardized way, always provide arguments to specify the file’s name and location – so you can in fact freely choose these. However, to ease collaboration in scripting, it is highly recommended to follow the standard locations and filenames below (see: Getting started). Moreover, the functions assume these conventions by default in order to make your life easier!
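As an illustration of these defaults, the sketch below reads a processed data source without specifying any file or path, assuming the n2khab package is installed and the habitatmap_stdized data source sits at its standard location (the structure of the returned object is indicative only; see the function’s documentation):

```r
library(n2khab)

# With n2khab_data organised as recommended, no file or path argument is
# needed: the function locates the data source at its standard location
# (n2khab_data/20_processed/habitatmap_stdized).
habmap <- read_habitatmap_stdized()

# Inspect the returned object at a glance:
str(habmap, max.level = 1)
```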

There is a major distinction between:

  • raw data (Zenodo-link), to be stored in a folder n2khab_data/10_raw;
  • processed data (Zenodo-link), to be stored in a folder n2khab_data/20_processed. These data sources have been derived from the raw data sources, but are distributed on their own because of the time-consuming or intricate calculations needed to reproduce them.

You can reproduce the processed data sources from a shell script on GitHub, but it will take hours.

As you can see, when storing these binary or large data, we avoid using a folder named data:

  • the n2khab_data name is a better fit when the folder does not sit inside one project or repository (see further) but instead serves several projects / repositories.
  • within a project or repository, the specific name keeps it separate from a project-specific data folder with locally generated or extra needed input data, part or all of which is to be version-controlled, and which may use its own substructure. n2khab_data should always be ignored by version control systems.
  • the more distinctive name makes it easier for the n2khab functions to automatically detect the right location.

Getting started for your (collaborative) workflow

Mind that, if you store the n2khab_data folder inside a version controlled repository (e.g. using git), it must be ignored by version control!

  1. Decide where you want to store the n2khab_data folder:

    • from the viewpoint of several projects / several git repositories, when these need the same data source versions, the location may be at a high level in your file system. A convenient approach is to use the folder which holds the different project folders / repositories.
    • from the viewpoint of one project / repository: the n2khab_data folder can be put inside the project / repository folder. This approach has the advantage that you can store versions of data sources different from those in another repository (where you also have an n2khab_data folder).

    For the functions to succeed in finding the n2khab_data folder in each collaborator’s file system, make sure that the folder is present either in the working directory of your R scripts or in a path one up to ten levels above this working directory. By default, the functions search for the folder in that order and use the first n2khab_data folder encountered. (Otherwise, you would need to actively set the path to the data folder with the path argument in each function call.)
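When the folder cannot be found automatically, its location can be passed explicitly. A hedged sketch, using read_habitatmap_stdized() and a hypothetical file path (the exact argument name may differ between functions; check each function’s documentation):

```r
library(n2khab)

# Hypothetical explicit location: point the read function at a data folder
# outside the automatic search range (working directory up to 10 levels up).
habmap <- read_habitatmap_stdized(
  file = "/data/shared/n2khab_data/20_processed/habitatmap_stdized"
)
```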

  2. From your working directory, use fileman_folders() to specify the desired location (using the function’s arguments). It checks for the existence of the folders n2khab_data, n2khab_data/10_raw and n2khab_data/20_processed and creates them if they don’t exist.

fileman_folders(root = "rproj")
#> Created <clipped_path_prefix>/n2khab_data
#> Created subfolder 10_raw
#> Created subfolder 20_processed
#> [1] "<clipped_path_prefix>/n2khab_data"
  3. From the cloud storage (links: raw data | processed data), download the respective data files of a data source. You can also use the function download_zenodo() to do that, using the DOI of each data source version. For each data source, put its file(s) in an appropriate subfolder below either n2khab_data/10_raw or n2khab_data/20_processed (depending on the data source). Use the data source’s default name for the subfolder. You get a list of the data source names with XXX. These names are version-agnostic! The names of the n2khab ‘read’ functions and their documentation make clear which data sources you will need.

    Below is an example of correctly organised N2KHAB data folders:

n2khab_data
    ├── 10_raw
    │     ├── habitatmap            -> contains habitatmap.shp, habitatmap.dbf etc.
    │     ├── soilmap
    │     └── GRTSmaster_habitats
    └── 20_processed
          ├── habitatmap_stdized
          └── GRTSmh_diffres
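The download step above can be sketched with download_zenodo(); the DOI below is a placeholder, to be replaced by the versioned DOI of the data source you actually need:

```r
library(n2khab)

# Placeholder DOI: substitute the DOI of the data source version you need.
# The file(s) land in the subfolder named after the data source.
download_zenodo(doi = "10.5281/zenodo.xxxxxxx",
                path = "n2khab_data/10_raw/habitatmap")
```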

  1. N2KHAB data sources are a list of public, standard data sources, important to analytical workflows concerning Natura 2000 (n2k) habitats (hab) in Flanders. They are in a public repository in order to be easily findable and to be preserved in a durable way.↩︎

  2. This also means that several previously published (open) data sources have been publicly redistributed at Zenodo.↩︎