# Tutorial ## Writing and loading whip specification files Whip specifications in text format are expressed in [YAML](https://en.wikipedia.org/wiki/YAML) language, It uses both Python-style indentation to indicate nesting, and a more compact format that uses `[]` for lists and `{}` for key-value maps, making YAML 1.2 a superset of JSON. The whip syntax is explained in more detail on the [whip](https://github.com/inbo/whip) repository, explaining the available specifications. Consider the following data set used in a project, with 3 columns: * a `eventDate` column, representing the date of occurrence * a `individualCount` column with the counts of individuals seen on the date * a `country` column, defining the country of the observation An subset of the data could look like this: eventDate | individualCount | country -----------|-----------------|--------- 2018-01-03 | 5 | BA 2018-04-02 | 20 | NL 2016-07-06 | 3300 | BE 2017-03-02 | 2 | BE 1018-01-08 | 1 | NL In the current project, we do know the following about the data: * The project was running from 2016 until 2018, so date values should be in this range * The project was happening in Belgium and The Netherlandsa and country need to be either `BE` or `NL` * Individual counts can not be higher than 100 and should be at least 1 * Empty values are not allowed (default according to whip specifications) We can express these rules as [whip specifications](https://github.com/inbo/whip): ```yaml country: allowed: [BE, NL] eventDate: dateformat: '%Y-%m-%d' mindate: 2016-01-01 maxdate: 2018-12-31 individualCount: numberformat: x # needs to be an integer value min: 1 max: 100 ``` *(Notice the possibility to include comments)* These specifications can be saved to a yaml-file (use the extention `.yaml`), e.g. `observations_example.yaml` and parsed into Python using the yaml-package: ```python import yaml with open("observations_example.yaml") as whip_specs_file: specifications = yaml.load(whip_specs_file) ``` Similar to loading the file, one can also write the specifications directly in a Python script: ```python import yaml whip_specs = """ country: allowed: [BE, NL] eventDate: dateformat: '%Y-%m-%d' mindate: 2016-01-01 maxdate: 2018-12-31 individualCount: numberformat: x # needs to be an integer value min: 1 max: 100 """ specifications = yaml.load(whip_specs) ``` or directly the Python-object itself: ```python import datetime specifications = {'individualCount': {'min': 1, 'max': 100, 'numberformat': 'x'}, 'eventDate': {'dateformat': '%Y-%m-%d', 'maxdate': datetime.date(2018, 12, 31), 'mindate': datetime.date(2016, 1, 1)}, 'country': {'allowed': ['BE', 'NL']}} ``` *(Notice that the dates are coerced to `datetime.date` objects)* Using one of these approaches, the `specifications` can be used by `pywhip` to control incoming data sets. ## Whip a data set Pywhip supports a number of data input formats that can be used to apply the whip specifications. ### CSV files Applying whip specifications to a CSV file is supported by the function `whip_csv`, which requires a data file, the whip specifications and the delimiter of the CSV file (`","` for CSV, `"\t"` for TSV,...) ```python import yaml from pywhip import whip_csv with open("observations_example.yaml") as whip_specs_file: specifications = yaml.load(whip_specs_file) observations_whip = whip_csv("observations_data.csv", specifications, delimiter=',') ``` When the data set is according to the provided whip specifications, a Hooray message prints to stdout: ``` Hooray, your data set is according to the guidelines! ``` If not, a message alerts you to check the errors: ``` Your dataset does not comply with the specifications, use get_report() for more detailed information.' ``` Default reports are provided as `json` or `html`. It is advised to use a context to store store the output as file (e.g. called `report_observations.html`): ```python with open("report_observations.html", "w") as index_page: index_page.write(observations_whip.get_report('html')) ``` By which an [HTML version](report_observations.html) of the error report is generated. Similar, a json version of the report can be provided (and returned or saved to file): ```python import json with open('report_observations.json', 'w') as json_report: json.dump(observations_whip.get_report(), json_report) ``` Which provides a file version of the [json report](report_observations.json). **Remark:** In some occasions it is useful to not directly validate the entire data set. In that case, use the `maxentries` parameter to define the number of lines to validate: ```python observations_whip = whip_csv("observations_data.csv", specifications, delimiter=',', maxentries=50) ``` ### Darwin Core Archive To validate the core file of a [Darwin Core Archive](https://en.wikipedia.org/wiki/Darwin_Core_Archive) zipped data set, the packages relies on the [Darwin Core Reader](https://python-dwca-reader.readthedocs.io/en/latest/) package. To directly apply the specifications on a Darwin Core Archive, use the `whip_dwca` function: ```python import yaml from pywhip import whip_dwca with open("observations_example.yaml") as whip_specs_file: specifications = yaml.load(whip_specs_file) observations_whip = whip_dwca("observations_data.csv", specifications) ``` Reporting functionalities are the same as the csv-version. ## Whip CSV files from the command line To apply pywhip for data set validation outside Python, use the command line interface providing direct application of pywhip on a CSV data set. By installing the package, the `whip_csv` command will be available from the command line. To read the documentation: ```bash whip_csv --help ``` As an example, to whip the data set `observations_data.csv` with a comma as delimiter using the whip specifications defined in the `observations_example.yaml` file and printing the output to an `index.html` as an HTML page: ```bash whip_csv observations_data.csv observations_example.yaml index.html --delimiter ',' ```