vignettes/v020_datastorage.Rmd
For a practical recipe to set up n2khab_data, go to Getting started. See vignette("v022_example") if you'd like to be guided by a hands-on example.
n2khab functions
Apart from several textual datasets, provided directly with this package, other N2KHAB data sources [1] are binary or large data. Those are made available through cloud-based infrastructure, preserved for the future at least via Zenodo (see below). [2] An overview of data distribution pathways is given here.
Zenodo is a scientific repository funded by the European Commission and hosted at CERN.
n2khab package.
Data sources evolve, and hence, data source versions succeed one another. To ease reproducibility of analytical workflows, this package assumes locally stored data sources.
The n2khab functions, aimed at reading these data and returning them in R in a standardized way, always provide arguments to specify the file's name and location, so you can in fact choose these freely. However, to ease collaboration in scripting, it is highly recommended to follow the standard locations and filenames below (see: Getting started). Moreover, the functions assume these conventions by default in order to make your life easier!
There is a major distinction between:
- raw data, to be stored in n2khab_data/10_raw;
- processed data, to be stored in n2khab_data/20_processed. These data sources have been derived from the raw data sources, but are distributed on their own because of the time-consuming or intricate calculations needed to reproduce them.
You can reproduce the processed data sources from a shell script on GitHub, but it will take hours.
As you see, when storing these binary or large data, we avoid using a folder named data:
- The n2khab_data name is a better fit when the folder does not sit inside one project or repository (see further) but instead serves several projects / repositories.
- A project or repository may still have its own data folder with locally generated or extra needed input data, part or all of which is to be version-controlled, and which may use its own substructure. In contrast, n2khab_data should always be ignored by version control systems.
- A more distinctive name makes it easier for the n2khab functions to automatically detect the right location.
Mind that, if you store the n2khab_data folder inside a version-controlled repository (e.g. using git), it must be ignored by version control!
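One way to keep the folder out of version control is sketched here in base R, assuming git is used and the repository root is known. The helper name ignore_n2khab_data() is hypothetical and not part of n2khab; it only appends an ignore pattern to .gitignore when that pattern is not yet there.

```r
# Hypothetical helper (not part of n2khab): make sure git ignores n2khab_data.
ignore_n2khab_data <- function(repo_root = ".") {
  gitignore <- file.path(repo_root, ".gitignore")
  pattern <- "n2khab_data/"
  # Read the current ignore rules, if the file already exists.
  existing <- if (file.exists(gitignore)) {
    readLines(gitignore, warn = FALSE)
  } else {
    character(0)
  }
  # Append the pattern only once, so repeated calls are harmless.
  if (!pattern %in% trimws(existing)) {
    writeLines(c(existing, pattern), gitignore)
  }
  invisible(gitignore)
}
```

Calling ignore_n2khab_data() from the repository root would be enough in this sketch; editing .gitignore by hand achieves the same.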
Decide where you want to store the n2khab_data folder:
- The n2khab_data folder can be put inside the project / repository folder. This approach has the advantage that you can store versions of data sources different from those in another repository (where you also have an n2khab_data folder).
For the functions to succeed in finding the n2khab_data folder in each collaborator's file system, make sure that the folder is present either in the working directory of your R scripts or in a path 1 up to 10 levels above this working directory. By default, the functions search for the folder in that order and use the first n2khab_data folder encountered. (Otherwise, you would need to actively set the path to the data folder with the path argument in each function call.)
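The search order described above can be illustrated with a minimal base R sketch. It is not the package's own code: the helper find_data_folder() and its arguments are hypothetical, and it simply checks the starting directory first and then climbs up to 10 parent directories, returning the first match.

```r
# Hypothetical helper (not the package's own code): look for a folder in the
# start directory first, then in 1 up to `levels` directories above it.
find_data_folder <- function(name = "n2khab_data", start = ".", levels = 10) {
  dir <- normalizePath(start, winslash = "/", mustWork = TRUE)
  for (i in 0:levels) {
    candidate <- file.path(dir, name)
    if (dir.exists(candidate)) {
      return(candidate)  # first folder encountered wins
    }
    parent <- dirname(dir)
    if (identical(parent, dir)) break  # reached the file system root
    dir <- parent
  }
  stop("No '", name, "' folder found in '", start,
       "' or up to ", levels, " levels above it.")
}
```

With such a search, scripts in nested subfolders of a project resolve to the same n2khab_data folder without hard-coded paths.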
From your working directory, use fileman_folders() to specify the desired location (using the function's arguments). It will check the existence of the folders n2khab_data, n2khab_data/10_raw and n2khab_data/20_processed and create them if they don't exist.
fileman_folders(root = "rproj")
#> Created <clipped_path_prefix>/n2khab_data
#> Created subfolder 10_raw
#> Created subfolder 20_processed
#> [1] "<clipped_path_prefix>/n2khab_data"
From the cloud storage (links: raw data | processed data), download the respective data files of a data source. You can also use the function download_zenodo() to do that, using the DOI of each data source version. For each data source, put its file(s) in an appropriate subfolder below either n2khab_data/10_raw or n2khab_data/20_processed (depending on the data source). Use the data source's default name for the subfolder. You get a list of the data source names with XXX. These names are version-agnostic! The names of the n2khab 'read' functions and their documentation make clear which data sources you will need.
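As a minimal sketch of this step, the base R code below creates the subfolder for one data source and returns its path. The helper datasource_dir() is hypothetical (not part of n2khab), and a temporary folder stands in for the real n2khab_data location. The commented-out download call is not run: its DOI is a placeholder and the argument names doi and path are assumed, so check the function's documentation before use.

```r
# Hypothetical helper (not part of n2khab): create the subfolder of a data
# source below 10_raw or 20_processed and return its path.
datasource_dir <- function(datasource,
                           type = c("10_raw", "20_processed"),
                           data_root = "n2khab_data") {
  type <- match.arg(type)
  target <- file.path(data_root, type, datasource)
  dir.create(target, recursive = TRUE, showWarnings = FALSE)
  target
}

# Example with a temporary data root instead of a real project folder.
data_root <- file.path(tempdir(), "n2khab_data")
target <- datasource_dir("habitatmap", type = "10_raw", data_root = data_root)

# The files of the data source could then be downloaded into `target`
# (not run; placeholder DOI, assumed argument names):
# n2khab::download_zenodo(doi = "<DOI of the data source version>", path = target)
```

Because the subfolder name carries no version, replacing a data source version means replacing the files inside the same subfolder.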
Below is an example of correctly organised N2KHAB data folders:
n2khab_data
├── 10_raw
│   ├── habitatmap          -> contains habitatmap.shp, habitatmap.dbf etc.
│   ├── soilmap
│   └── GRTSmaster_habitats
└── 20_processed
    ├── habitatmap_stdized
    └── GRTSmh_diffres
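A quick check of such a layout can be sketched in base R. The helper check_layout() is hypothetical and not part of n2khab. The list of expected folders is taken from the example tree above, and the demo uses a temporary folder in which only some of the data sources are present.

```r
# Hypothetical helper (not part of n2khab): report which expected data source
# folders are present below a given n2khab_data folder.
check_layout <- function(data_root, expected) {
  data.frame(
    folder = expected,
    present = dir.exists(file.path(data_root, expected)),
    stringsAsFactors = FALSE
  )
}

# Data source folders from the example tree.
expected <- c(
  "10_raw/habitatmap",
  "10_raw/soilmap",
  "10_raw/GRTSmaster_habitats",
  "20_processed/habitatmap_stdized",
  "20_processed/GRTSmh_diffres"
)

# Demo: a temporary data root that holds only two of the five folders.
demo_root <- tempfile("n2khab_data")
dir.create(file.path(demo_root, "10_raw", "habitatmap"), recursive = TRUE)
dir.create(file.path(demo_root, "20_processed", "GRTSmh_diffres"), recursive = TRUE)
layout <- check_layout(demo_root, expected)
print(layout)
```

Rows with present equal to FALSE point at data sources that still need to be downloaded into their subfolder.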
[1] N2KHAB data sources are a list of public, standard data sources, important to analytical workflows concerning Natura 2000 (n2k) habitats (hab) in Flanders. They are in a public repository in order to be easily findable and to be preserved in a durable way.
[2] This also means that several previously published (open) data sources have been publicly redistributed at Zenodo.