API reference

High level functions

pywhip.pywhip.whip_csv(csv_file, specifications, delimiter, maxentries=None)

Whip a CSV-like file

Validate a CSV file, using the CSV reading and iterator capabilities of the Python standard library.

Parameters
csv_file : str

Filename of the CSV file to validate.

specifications : dict

Valid whip specifications dictionary (schema).

delimiter : str

A one-character string used to separate fields, e.g. ','.

maxentries : int

Maximum number of records to validate from the file; useful for a quick check on the first subset of the data.

Returns
whip_it : pywhip.pywhip.Whip

Whip validator class instance, containing the errors and reporting capabilities.
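
As a rough illustration of the reading step described above, the sketch below uses only the standard library to iterate over delimited rows and stop after a maximum number of records. It is not pywhip's actual code: the helper name iter_csv_rows and the in-memory sample data are hypothetical, and the validation step itself is left out.

```python
import csv
import io
from itertools import islice


def iter_csv_rows(handle, delimiter=",", maxentries=None):
    """Yield (row_number, row_dict) pairs, stopping after maxentries rows.

    Hypothetical helper for illustration; not part of pywhip.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    # islice with a stop value of None iterates over every row
    for row_number, row in enumerate(islice(reader, maxentries), start=1):
        yield row_number, row


# In-memory data standing in for a CSV file on disk
sample = io.StringIO("id;country\n1;BE\n2;NL\n3;FR\n")
rows = list(iter_csv_rows(sample, delimiter=";", maxentries=2))
print(rows)
```

Each yielded row is a mapping from field header to data value, which matches the kind of document a validator would receive one line at a time.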

pywhip.pywhip.whip_dwca(dwca_zip, specifications, maxentries=None)

Whip a Darwin Core Archive

Validate the core file of a Darwin Core Archive zipped data set, using the DwCAReader reading and iterator capabilities.

Parameters
dwca_zip : str

Filename of the zipped Darwin Core Archive.

specifications : dict

Valid whip specifications dictionary (schema).

maxentries : int

Maximum number of records to validate from the Archive; useful for a quick check on the first subset of the data.

Returns
whip_it : pywhip.pywhip.Whip

Whip validator class instance, containing the errors and reporting capabilities.

Document validation

class pywhip.pywhip.Whip(schema, sample_size=10)

Whip document validation class

Validates (multiple row) documents against a whip specification schema using the high-level functions whip_... and creates a validation report with the get_report() method.

Attributes
sample_size : int

Number of value-examples to use in reporting

schema : dict

Whip specification schema, consisting of field : constraint combinations

validation : pywhip.validators.DwcaValidator

A DwcaValidator class instance.

_report : dict

Base report container to collect document errors. Errors are collected in the ['results']['specified_fields'] values, having a SpecificationErrorHandler for each field-specification combination.

Parameters
schema : dict

Whip specification schema, consisting of field : constraint combinations.

sample_size : int

For each of the field-rules combinations, the (top) number of data value samples/examples to include in the report.

create_html()

Build html using template

Returns
str

get_report(format='json')

Collect errors into reporting format (json/html)

Converts the logged errors into a json or html style report.

Parameters
format : json | html

Output format of the report.

Returns
str

Specification handling

The DwcaValidator is the underlying engine to handle the validation of incoming values against the whip specifications. It extends the existing Cerberus Validator class.

class pywhip.validators.DwcaValidator(*args, **kwargs)

Validates any mapping against specifications defined in a validation-schema

In the context of pywhip, a mapping is generally a single line of data, with the field names (data headers) as keys and the data values of that particular line as values.

Notes

This class subclasses Validator and adds pywhip specific _validate_<specification> methods.

The whip specifications are a combination of cerberus native specifications and pywhip custom ones:

  • directly available by cerberus

    minlength, maxlength, regex

  • cerberus specifications overwritten by pywhip

    allowed, empty, min, max

  • pywhip specific specification functions

    numberformat, dateformat, mindate, maxdate, stringformat

  • pywhip specific specification environments:

    delimitedValues, if

Each _validate_<specification> assumes the following input arguments:

  • constraint:

    The constraint provided in the whip specification, i.e. the right hand side of the colon in the whip specifications. In the implementation, the input parameter can be named differently to clarify the role of the constraint in the validation function.

  • field:

    The name of the field, i.e. the left hand side of the colon in the whip specifications which corresponds to the field header name in the data.

  • value:

    A single data value for which the whip specification needs to be tested using the provided constraint.
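
The dispatch to _validate_<specification> methods with (constraint, field, value) arguments can be sketched with a toy class. This is a simplified stand-in written for illustration; it is neither the Cerberus Validator nor the DwcaValidator, and the class name MiniValidator, its _error helper, and the error messages are all assumptions.

```python
class MiniValidator:
    """Toy validator showing the method-per-specification pattern.

    NOT the Cerberus or pywhip implementation; names are hypothetical.
    """

    def __init__(self, schema):
        self.schema = schema
        self.errors = {}

    def _validate_minlength(self, constraint, field, value):
        if len(value) < constraint:
            self._error(field, f"min length is {constraint}")

    def _validate_allowed(self, allowed_values, field, value):
        # The constraint argument is renamed here to clarify its role
        if value not in allowed_values:
            self._error(field, f"unallowed value {value}")

    def _error(self, field, message):
        self.errors.setdefault(field, []).append(message)

    def validate(self, document):
        """Check one mapping (a single line of data) against the schema."""
        self.errors = {}
        for field, rules in self.schema.items():
            if field not in document:
                continue
            for specification, constraint in rules.items():
                # Look up the method matching the specification name
                method = getattr(self, f"_validate_{specification}")
                method(constraint, field, document[field])
        return not self.errors


validator = MiniValidator({"country": {"allowed": ["BE", "NL"], "minlength": 2}})
print(validator.validate({"country": "BE"}))
print(validator.validate({"country": "X"}), validator.errors)
```

Adding a new specification then amounts to adding one more _validate_<specification> method with the same three arguments.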

To validate the schema input itself, cerberus validation rules can be added to the docstring.

Extends the handling of Cerberus Validator

The following alterations are made:

  • allow_unknown is set to True by default

  • Initiation requires a schema

  • By default, all fields without an empty specification get an empty: False specification. As such, empty strings are not allowed by default, according to the whip specifications.

Parameters
allow_unknown : boolean

If False, only terms with specifications are allowed as input. As unknown fields are reported by pywhip after validation, the default value is True.
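
The default empty: False behaviour can be pictured as a small schema preprocessing step. The function below is a minimal sketch under that assumption; the name add_default_empty is hypothetical and the actual DwcaValidator may apply the default differently.

```python
import copy


def add_default_empty(schema):
    """Return a copy of schema where fields lacking 'empty' get empty: False.

    Hypothetical helper for illustration; not part of pywhip.
    """
    completed = copy.deepcopy(schema)
    for rules in completed.values():
        # Only fill in the default; an explicit 'empty' setting is kept
        rules.setdefault("empty", False)
    return completed


schema = {
    "sex": {"allowed": ["male", "female"]},
    "remarks": {"empty": True},
}
completed = add_default_empty(schema)
print(completed)
```

Working on a deep copy keeps the user-provided schema untouched, so the same specifications can be reused elsewhere.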

class pywhip.validators.WhipErrorHandler(tree=None)

Class to store custom error message handling

The WhipErrorHandler updates the BasicErrorHandler with custom messages for pywhip specific specifications. Each of the messages updates the message of a specification error, using the unique code attributed in the ErrorDefinition setup.

The message is a descriptive message about the error and can optionally use the following variables:

  • value

    This refers to the individual data value of the document, use {value}

  • constraint

    This refers to the constraint provided by the whip specification (the right hand side of the colon), use {constraint}
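
How the {value} and {constraint} variables end up in a message can be illustrated with plain string formatting. The error codes and message texts below are invented for the example and do not correspond to the codes pywhip registers.

```python
# Hypothetical error codes mapped to message templates
MESSAGES = {
    0x100: "value '{value}' does not match the date format {constraint}",
    0x101: "value '{value}' is before the minimum date {constraint}",
}


def render_message(code, value, constraint):
    """Fill the template for an error code with the data value and constraint."""
    return MESSAGES[code].format(value=value, constraint=constraint)


text = render_message(0x100, "07241981", "%Y-%m-%d")
print(text)
```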

Optionally initialize a new instance.

Reporter Objects

class pywhip.reporters.SpecificationErrorHandler(constraint)

Class handler for field-rule entity reporting

Notes

The SpecificationErrorHandler class is basically an enriched dictionary (using mapping), directly building on top of a defaultdict with the (wrong) values as keys and a set as values to add (unique) rows for which that value occurs.

Attributes
constraint : str

The constraint linked to the specification (field-rule combination), expressed as string

_samples : defaultdict(set)

Dictionary with wrong data values as keys and the corresponding row identifiers as values.

build_error_report(total_rows_count, top_n)

Convert defaultdict to regular dict for json reporting

Parameters
total_rows_count : int

Total number of rows in the current document, also used to calculate the number of passed rows

top_n : int

Number of samples (ordered on the number of rows) to retain for reporting purposes

Notes

build_error_report() combines the information contained by the _samples attribute, for example:

{ ("07241981", "string format ...") : [2, 3, 5, 6],
  ("value", "message as provided by error") : [1, 2, 6,]
}

together with the other attributes into a json-style report:

{"constraint": "%Y-%m-%d, %Y-%m, %Y",
 "failed_rows": 23,
 "passed_rows": 3,
 "samples": {
    "07241981": {
        "failed_rows": 4,
        "first_row": 2,
        "message": "string format ..."
    },
    "value": {
        "failed_rows": n_rows,
        "first_row": minimum of row identifiers,
        "message": "message as provided by error"
        }
    }
}
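
The conversion described in these notes can be approximated with a standalone function. This is a sketch, not the method's source: it takes the samples and constraint as explicit arguments instead of reading instance attributes, and it assumes that failed rows are counted as the union of all row identifiers.

```python
from collections import defaultdict


def build_error_report(samples, constraint, total_rows_count, top_n):
    """Turn (value, message) -> set-of-rows samples into a json-style dict.

    Standalone approximation for illustration; not pywhip's code.
    """
    failed = set()
    for rows in samples.values():
        failed.update(rows)
    # Order samples by the number of failing rows, largest first
    ranked = sorted(samples.items(), key=lambda item: len(item[1]), reverse=True)
    return {
        "constraint": constraint,
        "failed_rows": len(failed),
        "passed_rows": total_rows_count - len(failed),
        "samples": {
            value: {
                "failed_rows": len(rows),
                "first_row": min(rows),
                "message": message,
            }
            for (value, message), rows in ranked[:top_n]
        },
    }


samples = defaultdict(set)
for row in (2, 3, 5, 6):
    samples[("07241981", "string format ...")].add(row)
for row in (1, 7):
    samples[("1981", "another message")].add(row)

report = build_error_report(samples, "%Y-%m-%d", total_rows_count=10, top_n=1)
print(report)
```

With top_n=1 only the most frequent wrong value is kept as a sample, while the failed and passed row counts still cover all recorded errors.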
exception pywhip.reporters.WhipReportException

Raised when the reporting of the errors contains errors