Concept

We developed n2kanalysis to analyse data from species monitoring schemes.

  1. A first requirement is to make the analysis reproducible, traceable and auditable. Should someone challenge the results of the analysis, we must be able to demonstrate which analysis was run with what data.
  2. A second requirement is to make the analysis portable. Portable in the sense that we can move analyses to other machines to run them. The feature is most relevant when you have a lot of analyses that take a lot of time.
  3. A third requirement is to make it as efficient as possible. Only run the new analyses. Don’t rerun existing analyses. Note that any change in data or metadata results in a new analysis.

The n2kanalysis package defines a framework for the analysis. We recommend to define the actual analysis in a dedicated package per monitoring scheme. Two examples are watervogelanalysis::prepare_analysis() and abvanalysis::prepare_analysis().

Workflow

The general workflow within n2kanalysis is as follows.

  1. Import the relevant data from the source.
  2. Generate the analysis objects from the imported data.
  3. Fit the analysis objects.
  4. Extract the results from the analysis objects.

Import the relevant data

Often the source of the data is not under control of the data analyst. Hence the data can change in the source during or after the analysis. Therefore we strongly recommend to import the relevant data from the source and store it in an “analysis database”. Use only this database when analysing the data. This offers a stable starting point for the analysis. In case the source changes, you only need to update the import script.

The analysis database should be under some kind of version control.

Generate the analysis objects

An n2kModel is a generic S4 class which contains all required information to fit the model. The machine on which you create the object needs access to the analysis database. The machine that fit the model in the objects only needs access to the object itself. Splitting multiple analyses over different machines is trivial.

Fit the models

The basic option is to fit the n2kModel object in memory. A better option is to fit it from the location where you stored the object. fit_model() will update the stored object, making it available for archiving. In case you need to fit several interdependent objects, you can specify the objects in an n2kManifest object. Such object lists all n2kModel objects to fit and the order in which you want to fit them. The order matters when an n2kModel object depends on the output of a different n2kModel object.

Extract the results

The fitted n2kModel contains the model. Hence the user can extract the required information. n2kanalysis offers a get_result() function to return all relevant parameters as a n2kResult object. A n2kResult object contains both the parameters and the metadata. You can combine multiple n2kResult objects with combine_result() into a single n2kResult object and still distinguish the output from the different analyses. get_result() also return potential anomalies for n2k_inla() models.

Traceable and auditable

Every n2kModel object has a so-called file fingerprint and status fingerprint.

The file fingerprint is a SHA-1 hash based on everything which will remain stable during the lifetime of the object. Think of the data, model type, model formula, species group, … Creating the object generates the hash based on the information at the time of creation. We check the validity of the file fingerprint before fitting the model or retrieving results. You can use validObject() to do the validation manually. The probability that two n2kModel objects with different stable information get the same file fingerprint is very small. Therefore we can reference the n2kModel object by its file fingerprint and use it as the filename. You can retrieve this hash with get_file_fingerprint().

The status fingerprint is based on the file fingerprint and everything that can change during the lifetime of the object. The most obvious element is the model. You can retrieve this hash with get_status_fingerprint().

Reporting both the file fingerprint and status fingerprint along side the results uniquely identifies a specific version of the n2kModel object.

Portable

The n2kModel object contains all required information to fit the model. The only exception is the case where the object needs information from one or more parent objects. Then you must provide access to those objects as well.

Use store_model() to store the objects. The base argument is either the path of a directory on a file system or an S3 bucket. The project argument defines a sub directory at the root of base. Moving the analyses to a different file system is as simple as copying everything in the project folder to the other file system. For S3 buckets you only need to provide the machine access to the S3 bucket, no need to copy the files manually. fit_model() will read the object from the S3 bucket, fit the model and update the object in the S3 bucket.

Efficient

We mentioned because that generating the object generates as file fingerprint which only depends on the stable element. Use store_model(overwrite = FALSE) when you generate objects. When nothing changed, the file fingerprint doesn’t change and we keep the existing object which we might have fit earlier. Suppose you need to change the definition of some child objects. That will alter the file fingerprint and thus generate new objects. However the parent objects remain as is.

Every object has a status. By default fit_model() only handles object with status “new” or “waiting”.

  1. “new”: the object is ready for fitting. When required, the object from the parent object is available in the child object. fit_model() fits the object.
  2. “waiting”: the object is missing information from at least one parent object. fit_model() updates the information of the parent objects and updates the status accordingly.
  3. “converged”: fit_model() successfully fitted the object.
  4. “error”: Something when wrong when fitting the model. Or one of the parents has the status “error”.

Refitting objects with status “error” or “converged” only makes sense after upgrade n2kanalysis.

Metadata

Because we need to transfer the n2kModel object between machines, it is important to add the necessary metadata to the object. The metadata contains both the file and status fingerprints.

Definition of the model

The model type is a more generic description of the model, whereas the formula is the actual formula used for this specific analysis. The more generic model type allows to group results among species or location, while the formula still allow to use a different formula for some objects. Suppose that the generic model uses both year and month a covariates. When a species is only present during a single month, then it doesn’t make sense to include month as a covariate for this species.

seed sets the seed to make the analysis reproducible.

Scheme, species and location

Other important information in the metadata are the scheme id, the species group id and the location group id. The scheme id refers to the monitoring scheme that delivered the data. The species group id refers to the list of species handled in this analysis. When you analyse a single species, then you need to create a species group containing only this species. Similarly, the location group id refers to a list of locations. The management of the species and location groups is outside of the scope of n2kanalysis. We recommend three tables: speciesgroup, species and speciesgroup_species. speciesgroup and species should contain at least an id and a description field. speciesgroup_species should contain at least a speciesgroup_id and a species_id field. Hence a species can be part of one or more species groups. Use a similar structure for locationgroup, location and locationgroup_location. result_datasource_id refers to the database when you store the definitions of scheme_id, species_group_id and location_group_id.

Time-stamp

Document the available time period with first_imported_year and last_imported_year. duration and last_analysed_year refer to the time span for this analysis. The default duration is the difference between the last and first imported year. The default last_analysed_year is the last_imported_year. Use different values in case the analyse covers only a subset of the full dataset. Suppose the full dataset spans from 1992 to 2023. If you want to analyse the period 2011 - 2020, you set duration = 10 and last_analysed_year = 2020.

analysis_date refers to the moment when you imported the source data. You can use the time-stamp of the commit in case you placed the data under version control with git. git2rdata::recent_commit() is a handy function to find the most recent commit that changed a specific file.

Used packages

The metadata contains one or more “analysis versions”. An “analysis version” is the list of packages including their version number loading when creating or fitting the model. Fitting the object on a different machine with different R packages than the one you generated the object will result in a second “analysis version”. The goal is to document the software used during the process.

Store the scripts to generate the objects in a dedicated package. Load the package when running the script. such a workflow documents the code to create the objects. Have a look at abvanalysis and watervogelanalysis for some real examples. Both packages contain the code to: 1. Import the data from the source. 1. Create the analysis objects. 1. Get the results from the analysis objects. 1. Display the results in a report or website.

Linked analyses

Optionally you can define one or more parent analyses for the object. Use this when you need to combine the analyses of different species into a multi-species index. Or when you want to analyse trends with multiple imputation. The imputation model is the parent object. The aggregated imputed data is the child object. The analysis of the aggregated imputed data is the grandchild object.

Available models

n2k_import()

A dummy model to document the import step. This object allows to use it a parent object for the subsequent models.

n2k_inla()

Fits models with integrated nested Laplacian approximation (INLA). Setting an imputation_size larger than 0 (default) performs multiple imputation on the missing values using multimput::impute().

In case of multiple imputation you can define a minimum covariate. This covariate hold the minimal value to impute. Use this in case of incomplete samples. For example when you count the number of birds in a location and you sampled only a part of the location. The observed count it thus a lower bound of the true value. Set the response variable to missing and the minimum variable to the observed count.

Another option is the specify an extra dataset of observations. We intended this option for rare observations that might distort the imputation model. For instance due to few observation events at a location, only a few non-zero observations at a location or a few extremely high observations at a location. Exclude the observations from such location for the imputation model and add them to extra. You won’t get imputations for the missing observation in extra. But we use the observed numbers in extra in the subsequent aggregation.

n2k_inla_comparison()

Compares multiple n2k_inla() objects based on the Wantanabe Akaike Information Criterion (WAIC).

n2k_composite()

Combines several n2kModel objects into a single analysis. For example to combine the results of individual species into a composite species index. The extractor function should extract the mean and variance of the parameters. The mean is the fitted object is the mean of the means of every parameters over the models. The standard error is square root of the average variance of every parameter over the models. Hence we assume independence of the parameters between the models.

n2k_hurdle_imputed()

Combines two n2k_inla() objects with multiple imputation. The presence model is an n2k_inla() binomial model that describes the presence of the species. The count model is an n2k_inla() zero-truncated Poisson or zero-truncated negative binomial model that describes the non-zero counts of the species. INLA provides type 0 and type 1 zero-inflated models. The type 0 model defines the likelihood as a point mass at 0 and a zero-truncated distribution.

\[Prob(y) = p × 1_{[y=0]} + (1 − p) × \mathcal{P(y | y > 0)}\]

The type 1 model defines the likelihood as a combination of a point at zero and a distribution which can produce zero values too.

\[Prob(y) = p × 1_{[y=0]} + (1 − p) × \mathcal{P(y)}\]

The trick to get a zero-truncated distribution is to use a type 0 zero-inflated distribution and fix the probability of the point mass at zero to a very small value. Below is the required INLA setting. Note that the value is expressed on the log scale. -11 on the log scale is equivalent to \(p = 0.0000167\).

control.family = list(
  list(hyper = list(theta2 = list(initial = -11, fixed = TRUE)))
)

n2k_aggregate()

The aggregation step after multiple imputation. fit_model() will call multimpute::aggregate_impute(). Use the join argument to select a subset of locations. For example when you ran an imputation model at the country level and you want to aggregate over a region within the country. Note that you can use the same impute model for different aggregation, for example one imputation model at the country level and aggregations for multiple regions.

n2k_modelimputed()

Fitting a model an (aggregated) imputed dataset. fit_model() will call multimpute::model_impute(). You can use filter to subset the data after the imputation. This is a useful feature when modelling an aggregation on a smaller subset which might result in leading zero’s in the dataset. You can provide a custom filter function to remove those observations from the dataset. The function handles empty datasets after applying the filter.

Manifest

The most basic way to fit the models is to loop over all files. This is suboptimal when some models depend on other models. Since the file name equals the file fingerprint, the order of the file names has no link with the optimal order to fit the models.

We solved this problem by introducing the n2k_manifest() object. The manifest is simply a dataframe with the file fingerprint of the model and the file fingerprint of the parent model. This information is available after generating the models. Hence you can generate all n2kModel objects and store the link between the objects in the manifest. When you need multiple parent objects, add one row of every parent.

Applying fit_model() on an n2k_manifest() fit the models in an optimal order. The initial list of model consist of the models without parents. Then it handles the children of the initial models. Then the grandchildren and so on. This order guarantees that all parent models are fitted before fitting a model.

Another relevant use of a manifest is that is bundles a set of models. For example all models for a given monitoring scheme in a specific year. The manifest has is own file fingerprint, useful to reference a specific manifest.

Docker

n2kanalysis allows to maximise the reproducibility by making it easier to run the analysis in a Docker container. Use store_manifest_yaml() to store the manifest and specify the Docker image plus optional extra dependencies to install. manifest_yaml_to_bash() creates a bash script that fits the models one by one in the specified Docker container.

Using Docker improves the portability of the analysis as well. When you stored the object in an S3 bucket, you only need to copy the script to the other machine and run it. Or copy both the script and the objects in case you store everything on a file system.

manifest_yaml_to_bash() offers the option to shut down the machine at the end of the script. This avoids the cost of running an idle server. You can also have it to split the analysis over multiple machines. No fancy parallel computing is involved. Suppose you specified split = 2. Then you simply get two scripts, one with the “odd” analyses and one with the “even” analysis.