vignettes/v020_datastorage.Rmd
For a practical recipe to set up n2khab_data, go to Getting started.
See vignette("v022_example") if you’d like to be guided by a hands-on example.
n2khab functions
Apart from several textual datasets provided directly with this package, other N2KHAB data sources [1] are binary or large data. Those are made available through cloud-based infrastructure and are preserved for the future at least via Zenodo (see below). [2] An overview of data distribution pathways is given here.
Zenodo is a scientific repository funded by the European Commission and hosted at CERN.
n2khab package.
Data sources evolve, and hence data source versions succeed one another. To ease reproducibility of analytical workflows, this package assumes locally stored data sources.
The n2khab functions, which read these data and return them in R in a standardized way, always provide arguments to specify the file’s name and location, so you can in fact freely choose these. However, to ease collaboration in scripting, it is highly recommended to follow the standard locations and filenames below (see: Getting started). Moreover, the functions assume these conventions by default in order to make your life easier!
There is a major distinction between raw data sources, stored in n2khab_data/10_raw, and processed data sources, stored in n2khab_data/20_processed. The processed data sources have been derived from the raw data sources, but are distributed on their own because of the time-consuming or intricate calculations needed to reproduce them. You can reproduce the processed data sources from a shell script on GitHub, but it will take hours.
These binary or large data sources are to be stored in a dedicated directory n2khab_data on your system. Don’t use this special directory for adding other data. It can reside inside one project or repository, but it can also serve several projects / repositories; see further. n2khab_data should always be ignored by version control systems.
Mind that, if you store the n2khab_data directory inside a version-controlled repository (e.g. using git), it must be ignored by version control!
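
For a git repository, one way to make sure n2khab_data is ignored is to add a rule to the repository's .gitignore file. The sketch below does this with base R only; the helper name ignore_n2khab_data() and the path repo_dir are hypothetical and used for illustration, and the demonstration runs in a temporary directory rather than a real repository.

```r
# Append an ignore rule for n2khab_data to a repository's .gitignore,
# unless that rule is already present.
# `ignore_n2khab_data` and `repo_dir` are illustrative names (assumptions).
ignore_n2khab_data <- function(repo_dir, rule = "n2khab_data/") {
  gitignore <- file.path(repo_dir, ".gitignore")
  existing <- if (file.exists(gitignore)) {
    readLines(gitignore, warn = FALSE)
  } else {
    character(0)
  }
  if (!rule %in% trimws(existing)) {
    writeLines(c(existing, rule), gitignore)
  }
  invisible(gitignore)
}

# Demonstration in a throwaway directory standing in for a repository
repo_dir <- file.path(tempdir(), "my_project")
dir.create(repo_dir, showWarnings = FALSE)
ignore_n2khab_data(repo_dir)
readLines(file.path(repo_dir, ".gitignore"))
```

The rule n2khab_data/ (with trailing slash and no leading slash) matches a directory of that name at any level of the repository.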
Decide where you want to store the n2khab_data directory:
The n2khab_data directory can be put inside the project / repository directory. This approach has the advantage that you can store versions of data sources different from those in another repository (where you also have an n2khab_data directory).
For the functions to succeed in finding the n2khab_data directory in each collaborator’s file system, make sure that the directory is present either in the working directory of your R scripts or in a path at some level above this working directory. By default, the functions search for the directory in that order and use the first n2khab_data directory they encounter.
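
To make the search order concrete, here is a minimal base R sketch of an upward directory search. It is an illustration of the described behaviour, not the package's actual implementation; the function name find_data_dir() and the demonstration paths are assumptions.

```r
# Look for a directory called `name` in `start`, then in each parent
# directory in turn, and return the first one found (NA if none).
# Illustrative only: not the n2khab implementation.
find_data_dir <- function(start = getwd(), name = "n2khab_data") {
  current <- normalizePath(start, winslash = "/", mustWork = TRUE)
  repeat {
    candidate <- file.path(current, name)
    if (dir.exists(candidate)) {
      return(candidate)
    }
    parent <- dirname(current)
    if (identical(parent, current)) {
      return(NA_character_)  # reached the file system root
    }
    current <- parent
  }
}

# Demonstration: n2khab_data sits two levels above the working directory
project <- file.path(tempdir(), "project_a")
dir.create(file.path(project, "n2khab_data"),
           recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project, "src", "analysis"),
           recursive = TRUE, showWarnings = FALSE)
found <- find_data_dir(file.path(project, "src", "analysis"))
found
```

Because the search stops at the first match, a project-level n2khab_data takes precedence over one located higher up in the file system.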
Alternatively, you can set an environment variable N2KHAB_DATA_PATH or option n2khab_data_path to enforce a specific directory on your system that all n2khab functions will use (do that outside the files you collaborate on and share; see n2khab_options()).
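
As a sketch of how these two settings can be made from R, the example below sets both for the current session, using base R only. The directory shared_dir is a hypothetical location created in a temporary folder. Which of the two settings takes precedence is not stated here, so consult the documentation of n2khab_options() before combining them.

```r
# Hypothetical location of an n2khab_data directory shared by several projects
shared_dir <- file.path(tempdir(), "shared", "n2khab_data")
dir.create(shared_dir, recursive = TRUE, showWarnings = FALSE)

# Variant 1: environment variable, valid for the current R session
Sys.setenv(N2KHAB_DATA_PATH = shared_dir)

# Variant 2: R option, valid for the current R session
options(n2khab_data_path = shared_dir)

# Inspect what has been set
Sys.getenv("N2KHAB_DATA_PATH")
getOption("n2khab_data_path")
```

To make a setting persistent without touching shared scripts, the environment variable can go in your user-level .Renviron file, or the options() call in your user-level .Rprofile file; both are standard R startup files.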
From your working directory, use fileman_folders() to specify the desired location (using the function’s arguments). It will check the existence of the directories n2khab_data, n2khab_data/10_raw and n2khab_data/20_processed and create them if they don’t exist.
fileman_folders(root = "rproj")
#> Created <clipped_path_prefix>/n2khab_data
#> Created subfolder 10_raw
#> Created subfolder 20_processed
#> [1] "<clipped_path_prefix>/n2khab_data"
From the cloud storage (links: raw data | processed data), download the respective data files of a data source. You can also use the function download_zenodo() to do that, using the DOI of each data source version. For each data source, put its file(s) in an appropriate subdirectory below either n2khab_data/10_raw or n2khab_data/20_processed (depending on the data source). Use the data source’s default name for the subdirectory. You get a list of the data source names with XXX. These names are version-agnostic! The names of the n2khab ‘read’ functions and their documentation make clear which data sources you will need.
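
The placement step can be sketched as follows, using base R to create the subdirectory for one raw data source. The directory data_root is a stand-in inside a temporary folder; in practice it would be your real n2khab_data directory. The commented download_zenodo() call is not run here: its argument names (doi, path) are assumed, and the DOI is left as a placeholder to be filled in with the DOI of the version you want.

```r
# Stand-in for your real n2khab_data directory (assumption for this demo)
data_root <- file.path(tempdir(), "n2khab_data")

# Subdirectory named after the data source, below 10_raw
source_dir <- file.path(data_root, "10_raw", "habitatmap")
dir.create(source_dir, recursive = TRUE, showWarnings = FALSE)

# Not run: with the n2khab package installed, the files of a chosen
# data source version could then be downloaded into that subdirectory.
# Argument names are assumed; see ?download_zenodo.
# n2khab::download_zenodo(doi = "<DOI of the data source version>",
#                         path = source_dir)

# Show the resulting directory structure
list.dirs(data_root, full.names = FALSE, recursive = TRUE)
```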
Below is an example of correctly organised N2KHAB data directories:
n2khab_data
├── 10_raw
│   ├── habitatmap          -> contains habitatmap.shp, habitatmap.dbf etc.
│   ├── soilmap
│   └── GRTSmaster_habitats
└── 20_processed
    ├── habitatmap_stdized
    └── GRTSmh_diffres
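
A quick way to check a local setup against such a layout is to list which expected subdirectories are absent. The base R sketch below does this; the helper missing_sources() is hypothetical (it is not part of n2khab), and the demonstration builds a deliberately incomplete layout in a temporary folder.

```r
# Return the expected data source directories that do not exist yet.
# `missing_sources` is an illustrative helper, not an n2khab function.
missing_sources <- function(data_root,
                            raw = character(0),
                            processed = character(0)) {
  expected <- c(
    file.path(data_root, "10_raw", raw),
    file.path(data_root, "20_processed", processed)
  )
  expected[!dir.exists(expected)]
}

# Demonstration: build a partial layout, leaving out 'soilmap'
demo_root <- file.path(tempdir(), "layout_demo", "n2khab_data")
dir.create(file.path(demo_root, "10_raw", "habitatmap"),
           recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(demo_root, "20_processed", "habitatmap_stdized"),
           recursive = TRUE, showWarnings = FALSE)

absent <- missing_sources(
  demo_root,
  raw = c("habitatmap", "soilmap"),
  processed = "habitatmap_stdized"
)
basename(absent)
```

The check only looks at directory names; it does not verify that the files inside belong to a particular data source version.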
[1] N2KHAB data sources are a list of public, standard data sources, important to analytical workflows concerning Natura 2000 (n2k) habitats (hab) in Flanders. They are in a public repository in order to be easily findable and to be preserved in a durable way.
[2] This also means that several previously published (open) data sources have been publicly redistributed at Zenodo.