We developed n2kanalysis
to analyse data from species
monitoring schemes.
The n2kanalysis
package defines a framework for the
analysis. We recommend to define the actual analysis in a dedicated
package per monitoring scheme. Two examples are
watervogelanalysis::prepare_analysis()
and
abvanalysis::prepare_analysis()
.
The general workflow within n2kanalysis
is as
follows.
Often the source of the data is not under control of the data analyst. Hence the data can change in the source during or after the analysis. Therefore we strongly recommend to import the relevant data from the source and store it in an “analysis database”. Use only this database when analysing the data. This offers a stable starting point for the analysis. In case the source changes, you only need to update the import script.
The analysis database should be under some kind of version control.
An n2kModel
is a generic S4 class which contains all
required information to fit the model. The machine on which you create
the object needs access to the analysis database. The machine that fit
the model in the objects only needs access to the object itself.
Splitting multiple analyses over different machines is trivial.
The basic option is to fit the n2kModel
object in
memory. A better option is to fit it from the location where you stored
the object. fit_model()
will update the stored object,
making it available for archiving. In case you need to fit several
interdependent objects, you can specify the objects in an
n2kManifest
object. Such object lists all
n2kModel
objects to fit and the order in which you want to
fit them. The order matters when an n2kModel
object depends
on the output of a different n2kModel
object.
The fitted n2kModel
contains the model. Hence the user
can extract the required information. n2kanalysis
offers a
get_result()
function to return all relevant parameters as
a n2kResult
object. A n2kResult
object
contains both the parameters and the metadata. You can combine multiple
n2kResult
objects with combine_result()
into a
single n2kResult
object and still distinguish the output
from the different analyses. get_result()
also return
potential anomalies for n2k_inla()
models.
Every n2kModel
object has a so-called file fingerprint
and status fingerprint.
The file fingerprint is a SHA-1 hash based on
everything which will remain stable during the lifetime of the object.
Think of the data, model type, model formula, species group, … Creating
the object generates the hash based on the information at the time of
creation. We check the validity of the file fingerprint before fitting
the model or retrieving results. You can use validObject()
to do the validation manually. The probability that two
n2kModel
objects with different stable information get the
same file fingerprint is very small. Therefore we can reference the
n2kModel
object by its file fingerprint and use it as the
filename. You can retrieve this hash with
get_file_fingerprint()
.
The status fingerprint is based on the file fingerprint and
everything that can change during the lifetime of the object. The most
obvious element is the model. You can retrieve this hash with
get_status_fingerprint()
.
Reporting both the file fingerprint and status fingerprint along side
the results uniquely identifies a specific version of the
n2kModel
object.
The n2kModel
object contains all required information to
fit the model. The only exception is the case where the object needs
information from one or more parent objects. Then you must provide
access to those objects as well.
Use store_model()
to store the objects. The
base
argument is either the path of a directory on a file
system or an S3 bucket. The
project
argument defines a sub directory at the root of
base
. Moving the analyses to a different file system is as
simple as copying everything in the project
folder to the
other file system. For S3 buckets you only need to provide the machine
access to the S3 bucket, no need to copy the files manually.
fit_model()
will read the object from the S3 bucket, fit
the model and update the object in the S3 bucket.
We mentioned because that generating the object generates as file
fingerprint which only depends on the stable element. Use
store_model(overwrite = FALSE)
when you generate objects.
When nothing changed, the file fingerprint doesn’t change and we keep
the existing object which we might have fit earlier. Suppose you need to
change the definition of some child objects. That will alter the file
fingerprint and thus generate new objects. However the parent objects
remain as is.
Every object has a status
. By default
fit_model()
only handles object with status “new” or
“waiting”.
fit_model()
fits the object.fit_model()
updates the information of the
parent objects and updates the status accordingly.fit_model()
successfully fitted the
object.Refitting objects with status “error” or “converged” only makes sense
after upgrade n2kanalysis
.
Because we need to transfer the n2kModel
object between
machines, it is important to add the necessary metadata to the object.
The metadata contains both the file and status fingerprints.
The model type is a more generic description of the model, whereas the formula is the actual formula used for this specific analysis. The more generic model type allows to group results among species or location, while the formula still allow to use a different formula for some objects. Suppose that the generic model uses both year and month a covariates. When a species is only present during a single month, then it doesn’t make sense to include month as a covariate for this species.
seed
sets the seed to make the analysis
reproducible.
Other important information in the metadata are the scheme id, the
species group id and the location group id. The scheme id refers to the
monitoring scheme that delivered the data. The species group id refers
to the list of species handled in this analysis. When you analyse a
single species, then you need to create a species group containing only
this species. Similarly, the location group id refers to a list of
locations. The management of the species and location groups is outside
of the scope of n2kanalysis
. We recommend three tables:
speciesgroup
, species
and
speciesgroup_species
. speciesgroup
and
species
should contain at least an id
and a
description
field. speciesgroup_species
should
contain at least a speciesgroup_id
and a
species_id
field. Hence a species can be part of one or
more species groups. Use a similar structure for
locationgroup
, location
and
locationgroup_location
. result_datasource_id
refers to the database when you store the definitions of
scheme_id
, species_group_id
and
location_group_id
.
Document the available time period with
first_imported_year
and last_imported_year
.
duration
and last_analysed_year
refer to the
time span for this analysis. The default duration
is the
difference between the last and first imported year. The default
last_analysed_year
is the last_imported_year
.
Use different values in case the analyse covers only a subset of the
full dataset. Suppose the full dataset spans from 1992 to 2023. If you
want to analyse the period 2011 - 2020, you set
duration = 10
and
last_analysed_year = 2020
.
analysis_date
refers to the moment when you imported the
source data. You can use the time-stamp of the commit in case you placed
the data under version control with git.
git2rdata::recent_commit()
is a handy function to find the
most recent commit that changed a specific file.
The metadata contains one or more “analysis versions”. An “analysis version” is the list of packages including their version number loading when creating or fitting the model. Fitting the object on a different machine with different R packages than the one you generated the object will result in a second “analysis version”. The goal is to document the software used during the process.
Store the scripts to generate the objects in a dedicated package.
Load the package when running the script. such a workflow documents the
code to create the objects. Have a look at abvanalysis
and watervogelanalysis
for some real examples. Both packages contain the code to: 1. Import the
data from the source. 1. Create the analysis objects. 1. Get the results
from the analysis objects. 1. Display the results in a report or
website.
Optionally you can define one or more parent analyses for the object. Use this when you need to combine the analyses of different species into a multi-species index. Or when you want to analyse trends with multiple imputation. The imputation model is the parent object. The aggregated imputed data is the child object. The analysis of the aggregated imputed data is the grandchild object.
n2k_import()
A dummy model to document the import step. This object allows to use it a parent object for the subsequent models.
n2k_inla()
Fits models with integrated nested Laplacian approximation (INLA).
Setting an imputation_size
larger than 0 (default) performs
multiple imputation on the missing values using
multimput::impute()
.
In case of multiple imputation you can define a minimum
covariate. This covariate hold the minimal value to impute. Use this in
case of incomplete samples. For example when you count the number of
birds in a location and you sampled only a part of the location. The
observed count it thus a lower bound of the true value. Set the response
variable to missing and the minimum variable to the observed count.
Another option is the specify an extra
dataset of
observations. We intended this option for rare observations that might
distort the imputation model. For instance due to few observation events
at a location, only a few non-zero observations at a location or a few
extremely high observations at a location. Exclude the observations from
such location for the imputation model and add them to
extra
. You won’t get imputations for the missing
observation in extra
. But we use the observed numbers in
extra
in the subsequent aggregation.
n2k_inla_comparison()
Compares multiple n2k_inla()
objects based on the
Wantanabe Akaike Information Criterion (WAIC).
n2k_composite()
Combines several n2kModel
objects into a single
analysis. For example to combine the results of individual species into
a composite species index. The extractor
function should
extract the mean and variance of the parameters. The mean is the fitted
object is the mean of the means of every parameters over the models. The
standard error is square root of the average variance of every parameter
over the models. Hence we assume independence of the parameters between
the models.
n2k_hurdle_imputed()
Combines two n2k_inla()
objects with multiple
imputation. The presence
model is an
n2k_inla()
binomial model that describes the presence of
the species. The count
model is an n2k_inla()
zero-truncated Poisson or zero-truncated negative binomial model that
describes the non-zero counts of the species. INLA provides type 0 and
type 1 zero-inflated models. The type 0 model defines the likelihood as
a point mass at 0 and a zero-truncated distribution.
\[Prob(y) = p × 1_{[y=0]} + (1 − p) × \mathcal{P(y | y > 0)}\]
The type 1 model defines the likelihood as a combination of a point at zero and a distribution which can produce zero values too.
\[Prob(y) = p × 1_{[y=0]} + (1 − p) × \mathcal{P(y)}\]
The trick to get a zero-truncated distribution is to use a type 0
zero-inflated distribution and fix the probability of the point mass at
zero to a very small value. Below is the required INLA setting. Note
that the value is expressed on the log scale. -11
on the
log scale is equivalent to \(p =
0.0000167\).
control.family = list(
list(hyper = list(theta2 = list(initial = -11, fixed = TRUE)))
)
n2k_aggregate()
The aggregation step after multiple imputation.
fit_model()
will call
multimpute::aggregate_impute()
. Use the join
argument to select a subset of locations. For example when you ran an
imputation model at the country level and you want to aggregate over a
region within the country. Note that you can use the same impute model
for different aggregation, for example one imputation model at the
country level and aggregations for multiple regions.
n2k_modelimputed()
Fitting a model an (aggregated) imputed dataset.
fit_model()
will call
multimpute::model_impute()
. You can use filter
to subset the data after the imputation. This is a useful feature when
modelling an aggregation on a smaller subset which might result in
leading zero’s in the dataset. You can provide a custom filter function
to remove those observations from the dataset. The function handles
empty datasets after applying the filter.
The most basic way to fit the models is to loop over all files. This is suboptimal when some models depend on other models. Since the file name equals the file fingerprint, the order of the file names has no link with the optimal order to fit the models.
We solved this problem by introducing the n2k_manifest()
object. The manifest is simply a dataframe with the file fingerprint of
the model and the file fingerprint of the parent model. This information
is available after generating the models. Hence you can generate all
n2kModel
objects and store the link between the objects in
the manifest. When you need multiple parent objects, add one row of
every parent.
Applying fit_model()
on an n2k_manifest()
fit the models in an optimal order. The initial list of model consist of
the models without parents. Then it handles the children of the initial
models. Then the grandchildren and so on. This order guarantees that all
parent models are fitted before fitting a model.
Another relevant use of a manifest is that is bundles a set of models. For example all models for a given monitoring scheme in a specific year. The manifest has is own file fingerprint, useful to reference a specific manifest.
n2kanalysis
allows to maximise the reproducibility by
making it easier to run the analysis in a Docker container. Use
store_manifest_yaml()
to store the manifest and specify the
Docker image plus optional extra dependencies to install.
manifest_yaml_to_bash()
creates a bash script that fits the
models one by one in the specified Docker container.
Using Docker improves the portability of the analysis as well. When you stored the object in an S3 bucket, you only need to copy the script to the other machine and run it. Or copy both the script and the objects in case you store everything on a file system.
manifest_yaml_to_bash()
offers the option to shut down
the machine at the end of the script. This avoids the cost of running an
idle server. You can also have it to split the analysis over multiple
machines. No fancy parallel computing is involved. Suppose you specified
split = 2
. Then you simply get two scripts, one with the
“odd” analyses and one with the “even” analysis.