2 Import and preprocess INBO color ring data

2.1 Extract data from INBO kleurring database

2.1.1 Connection to INBO database

Get connection settings from config.yml (not included to the reposiory) and connect to database:

meeuwen <- config::get("meeuwen")
conn <- dbConnect(odbc::odbc(),
  driver = meeuwen$driver,
  server = meeuwen$server,
  database = meeuwen$database,
  port = meeuwen$port,
  trusted_connection = meeuwen$trusted_connection
)

2.1.2 Extract user data

First, we extract data about users from INBO kleurring database:

users <- dbGetQuery(conn, "SELECT * FROM dbo.tblWaarnemer") %>%
  as_tibble()

2.1.3 Extract color ring data

birds <- dbGetQuery(conn, "SELECT * FROM dbo.tblKleurring") %>%
  as_tibble()

We extract the color table in order to know the meaning of the color abbreviations in fields RingKleurCode and InscriptieKleurCode in birds.

color_table <- dbGetQuery(conn, "SELECT * FROM dbo.cdeKleur") %>%
  as_tibble()
color_table

We also extract the ring position (left or right leg) and the inscription orientation:

ring_position_table <- dbGetQuery(conn, "SELECT * FROM dbo.cdeRingPlaats") %>%
  as_tibble()
ring_position_table

2.1.4 Extract observation data

INBO’s observation data contain a text type field: Opmerking. Text type is deprecated and an error will be returned if we perform the standard SQL query "SELECT * FROM dbo.tblWaarneming". So, we need an ad-hoc query:

obs <- dbGetQuery(
  conn,
  "SELECT Nummer,
          Datum,
          EuringCode,
          LeeftijdCode,
          KleurringNummer,
          KleurringPlaats,
          MetaalringNummer,
          MetaalringPlaats,
          PlaatsGemeente,
          PlaatsToponym,
          PlaatsToponymDetail,
          Convert(nvarchar(4000),Opmerking) as Opmerking,
          WaarnemerNummer,
          PlaatsLengtegraadGraden,
          PlaatsLengtegraadMinuten,
          PlaatsLengtegraadSeconden,
          PlaatsBreedtegraadGraden,
          PlaatsBreedtegraadMinuten,
          PlaatsBreedtegraadSeconden,
          PlaatsLengtegraadRichtingCode,
          PlaatsBreedtegraadRichtingCode,
          PlaatsLandCode,
          MetaalringLandCode,
          BevestigingDatum,
          PlaatsProvincie,
          AanmaakDatum,
          WijzigDatum
  FROM dbo.tblWaarneming"
) %>% as_tibble()

Table tblWaarnemingAktie is also important because contains informations about the actions taken at each observation:

obs_actions <- dbGetQuery(conn, "SELECT * FROM dbo.tblWaarnemingAktie")

The action codes are described in table dbo.cdeAktie:

actions_meaning <- dbGetQuery(conn, "SELECT * FROM dbo.cdeAktie")
actions_meaning

Close connection to server:

dbDisconnect(conn)

2.1.5 Import control data

From UVABIT repository, we copied a text file containing the mapping of GPS ids:

map_gps_path <- here::here("data", "input", "map_gps_id_to_color_ring.tsv")
map_gps <-
  read_tsv(map_gps_path) %>%
  mutate(sovon_bird_notes = "uvabits_gps_tag.")
## Parsed with column specification:
## cols(
##   gps_id = col_character(),
##   color_ring = col_character()
## )
## mutate: new variable 'sovon_bird_notes' (character) with one unique value and 0% NA
map_gps %>% head()

This file will be used later in this same document.

3 Pre-processing

3.1 Preview raw data

For privacy reasons users data cannot be shown. Users data refer to the following fields:

colnames(users)
##  [1] "Nummer"         "Familienaam"    "Voornaam"       "Adres"         
##  [5] "Postcode"       "Gemeente"       "Email"          "Telefoon"      
##  [9] "Wachtwoord"     "Gebruikersnaam" "LandCode"       "TelefoonMobiel"
## [13] "TelefoonWerk"

Preview birds data:

birds %>% head()

Preview observations data:

obs %>% head()

Preview observation actions data:

obs_actions %>% head()

3.2 Check primary key consistency

The primary key of each table shuold never be left empty and should contain unique values.

3.2.1 Users

Primary key: column Nummer. Users with empty values:

users %>%
  filter(is.na(Nummer))
## filter: removed all rows (100%)

Users with not unique values of Nummer:

users %>%
  group_by(Nummer) %>%
  count() %>%
  filter(n > 1)
## group_by: one grouping variable (Nummer)
## count: now 2,039 rows and 2 columns, one group variable remaining (Nummer)
## filter (grouped): removed all rows (100%)

3.2.2 Birds

Primary key: column Nummer. Birds with empty values:

birds %>%
  filter(is.na(Nummer))
## filter: removed all rows (100%)

Birds with not unique values of Nummer:

birds %>%
  group_by(Nummer) %>%
  count() %>%
  filter(n > 1)
## group_by: one grouping variable (Nummer)
## count: now 11,309 rows and 2 columns, one group variable remaining (Nummer)
## filter (grouped): removed all rows (100%)

3.2.3 Observations

Primary key: column Nummer. Observations with empty values:

obs %>%
  filter(is.na(Nummer))
## filter: removed all rows (100%)

Observations with not unique values of Nummer:

obs %>%
  group_by(Nummer) %>%
  count() %>%
  filter(n > 1)
## group_by: one grouping variable (Nummer)
## count: now 151,349 rows and 2 columns, one group variable remaining (Nummer)
## filter (grouped): removed all rows (100%)

3.2.4 Actions

Primary key: column Nummer. Actions with empty values:

obs_actions %>%
  filter(is.na(Nummer))
## filter: removed all rows (100%)

Observations with not unique values of Nummer:

obs_actions %>%
  group_by(Nummer) %>%
  count() %>%
  filter(n > 1)
## group_by: one grouping variable (Nummer)
## count: now 162,010 rows and 2 columns, one group variable remaining (Nummer)
## filter (grouped): removed all rows (100%)

3.3 Remove hard returns

3.3.1 Remove hard returns in users

Remove hard returns \r\n in users:

users <-
  users %>%
  mutate_if(is.character, ~ str_replace_all(., "[\\r\\n]", ""))
## mutate_if: changed one value (<1%) of 'Adres' (0 new NA)
##            changed one value (<1%) of 'Email' (0 new NA)

3.3.2 Remove hard returns in birds

Remove hard returns \r\n in birds:

birds <-
  birds %>%
  mutate_if(is.character, ~ str_replace_all(., "[\\r\\n]", ""))
## mutate_if: no changes

3.3.3 Remove hard returns in observations

Remove hard returns \r\n in observations:

obs <-
  obs %>%
  mutate_if(is.character, ~ str_replace_all(., "[\\r\\n]", ""))
## mutate_if: changed one value (<1%) of 'PlaatsToponymDetail' (0 new NA)
##            changed 1,030 values (1%) of 'Opmerking' (0 new NA)

3.4 Check spaces in birds

No spaces should be present in any column of birds:

map_dfr(birds, ~ mutate(birds, space_detect = str_detect(., pattern = " "))) %>%
  filter(space_detect == TRUE)
## mutate: new variable 'space_detect' (logical) with one unique value and 0% NA
## mutate: new variable 'space_detect' (logical) with one unique value and 0% NA
## mutate: new variable 'space_detect' (logical) with one unique value and 0% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and 9% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and 1% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and 1% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and <1% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and <1% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and <1% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and 9% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and 6% NA
## filter: removed all rows (100%)

Remove all spaces:

birds <- birds %>%
  mutate_all(list(~ str_replace_all(., " ", "")))
## mutate_all: no changes

3.5 Check duplicates WaarnemingNummer-AktieCode

The action acronym is contained in column AktieCode. Observations and their actions are linked via columns WaarnemingNummer (in actions) and Nummer (in obs).

No duplicates WaarnemingNummer-AktieCode should exist:

obs_actions %>%
  group_by(WaarnemingNummer, AktieCode) %>%
  add_tally() %>%
  ungroup() %>%
  filter(n > 1)
## group_by: 2 grouping variables (WaarnemingNummer, AktieCode)
## add_tally (grouped): new variable 'n' (integer) with one unique value and 0% NA
## ungroup: no grouping variables
## filter: removed all rows (100%)

Otherwise they should be removed:

obs_actions <-
  obs_actions %>%
  distinct(WaarnemingNummer, AktieCode, .keep_all = TRUE)
## distinct: no rows removed

3.6 Merge observations and actions

In order to ease the mapping of birds and observations, we first remove columns will be never used:

obs_actions <-
  obs_actions %>%
  select(-c(
    Nummer,
    AanmaakDatum,
    WijzigDatum
  ))
## select: dropped 3 variables (Nummer, AanmaakDatum, WijzigDatum)

Then we spread action codes to columns:

obs_actions <-
  obs_actions %>%
  pivot_wider(names_from = AktieCode, values_from = AktieCode)
## pivot_wider: reorganized (AktieCode) into (rngme, rngkl, veld, dood, vang, …) [was 162010x2, now 151349x13]

The following combinations occur:

combinations_actions <- 
  obs_actions %>%
  select(-WaarnemingNummer) %>%
  distinct()
## select: dropped one variable (WaarnemingNummer)
## distinct: removed 151,319 rows (>99%), 30 rows remaining
combinations_actions

Actions present in database:

acts <- names(combinations_actions)
acts
##  [1] "rngme" "rngkl" "veld"  "dood"  "vang"  "klgev" "br"    "vangl" "klweg"
## [10] "me"    "meweg" "ziek"

Actions described in actions_meaning never used:

actions_meaning %>%
  filter(!Code %in% acts)
## filter: removed 12 rows (92%), one row remaining

We add action code columns to observations in order to have a unique dataframe for observations:

obs_and_acts <-
  obs %>%
  left_join(obs_actions, by = c("Nummer" = "WaarnemingNummer"))
## left_join: added 12 columns (rngme, rngkl, veld, dood, vang, …)
##            > rows only in x         0
##            > rows only in y  (      0)
##            > matched rows     151,349
##            >                 =========
##            > rows total       151,349

Preview:

head(obs_and_acts, n = 100)

3.7 Check spaces in obs_and_acts

No spaces should be present in character columns of obs_and_acts except for columns related to place description (Plaats*), datums (*Datum) and notes (Opmerking):

map_dfr(obs_and_acts %>%
  select_if(is.character) %>%
  select(-c(
    starts_with("Plaats"),
    Opmerking,
    ends_with("Datum")
  )), ~
mutate(obs_and_acts, space_detect = str_detect(., pattern = " "))) %>%
  filter(space_detect == TRUE) %>%
  arrange(Nummer)
## select_if: dropped 12 variables (Nummer, Datum, WaarnemerNummer, PlaatsLengtegraadGraden, PlaatsLengtegraadMinuten, …)
## select: dropped 8 variables (PlaatsGemeente, PlaatsToponym, PlaatsToponymDetail, Opmerking, PlaatsLengtegraadRichtingCode, …)
## mutate: new variable 'space_detect' (logical) with 2 unique values and 28% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and 40% NA
## mutate: new variable 'space_detect' (logical) with one unique value and 0% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and 89% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and 99% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and 99% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and 99% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and 93% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and 93% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and 13% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and >99% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and >99% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and >99% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and 96% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and >99% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and >99% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and >99% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and >99% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and >99% NA
## filter: removed all rows (100%)

Remove spaces if present:

obs_and_acts <-
  obs_and_acts %>%
  mutate(across(
    !starts_with("Plaats") &
      !ends_with("Datum") & 
      !one_of("Opmerking"),
    ~ str_replace_all(., " ", "")
  ))
## mutate: converted 'Nummer' from integer to character (0 new NA)
##         converted 'WaarnemerNummer' from integer to character (0 new NA)

3.8 Check presence action code

All observations should also have an action code:

no_actions <-
  obs_and_acts %>%
  filter_at(vars_select(names(obs_and_acts), one_of(acts)), all_vars(is.na(.)))
## filter_at: removed all rows (100%)
no_actions

Observation without it will be removed:

obs_and_acts <-
  obs_and_acts %>%
  filter(!Nummer %in% no_actions$Nummer)
## filter: no rows removed

3.9 Check combinations of actions with klgev or megev

klgev should never occur with other active actions like vang, vangl or action dood. Is this true?

acts_with_klgev <-
  obs_and_acts %>%
  filter(klgev == "klgev") %>%
  select(all_of(acts)) %>%
  distinct() %>%
  select_if(~ sum(!is.na(.)) > 0) %>%
  names()
## filter: removed 151,311 rows (>99%), 38 rows remaining
## select: dropped 27 variables (Nummer, Datum, EuringCode, LeeftijdCode, KleurringNummer, …)
## distinct: removed 37 rows (97%), one row remaining
## select_if: dropped 11 variables (rngme, rngkl, veld, dood, vang, …)
acts_with_klgev <- acts_with_klgev[acts_with_klgev != "klgev"]
length(acts_with_klgev) == 0
## [1] TRUE

Same for megev:

if ("megev" %in% acts) {
  acts_with_megev <-
    obs_and_acts %>%
    filter(megev == "megev") %>%
    select(all_of(acts)) %>%
    distinct() %>%
    select_if(~ sum(!is.na(.)) > 0) %>%
    names()
  acts_with_megev <- acts_with_megev[acts_with_megev != "megev"]
  length(acts_with_megev) == 0
} else {
  message("No 'megev' actions found.")
}
## No 'megev' actions found.

3.10 Remove data labelled as FOUT

There are observations judged as error. They are marked byKleurringNummmer equal to FOUT:

obs_and_acts %>%
  filter(KleurringNummer == "FOUT")
## filter: removed 150,958 rows (>99%), 391 rows remaining

The value FOUT is also present in birds:

birds %>%
  filter(Nummer == "FOUT")
## filter: removed 11,308 rows (>99%), one row remaining

These data are related to observations judged not correct (FOUT: error) by the INBO experts and administrator of the database.

Thus, at the moment, we remove these data (discussion about this issue here) from observations:

obs_and_acts <-
  obs_and_acts %>%
  filter(KleurringNummer != "FOUT")
## filter: removed 391 rows (<1%), 150,958 rows remaining

and from birds:

birds <-
  birds %>% filter(Nummer != "FOUT")
## filter: removed one row (<1%), 11,308 rows remaining

3.11 Remove EAYT

The bird with color ring EAYT has been added after the final data export for SOVON, Oct 8th, and will be removed from birds:

birds <-
  birds %>%
  filter(Nummer != "EAYT")
## filter: removed one row (<1%), 11,307 rows remaining

and observations:

obs_and_acts <-
  obs_and_acts %>%
  filter(KleurringNummer != "EAYT")
## filter: removed 2 rows (<1%), 150,956 rows remaining

3.12 Check color rings: uppercase

Bird color rings codes are uppercase. Exceptions:

birds %>%
  filter(Nummer != toupper(Nummer) |
    NummerNieuw != toupper(NummerNieuw) |
    NummerDesc != toupper(NummerDesc))
## filter: removed all rows (100%)

are converted to uppercase:

birds <- birds %>%
  mutate(
    Nummer = toupper(Nummer),
    NummerNieuw = toupper(NummerNieuw),
    NummerDesc = toupper(NummerDesc)
  )
## mutate: no changes

The same holds true for KleurringNummer in obs_and_acts. Exceptions:

obs_and_acts %>%
  filter(KleurringNummer != toupper(KleurringNummer)) %>%
  distinct(KleurringNummer)
## filter: removed all rows (100%)
## distinct: no rows removed

are transformed to uppercase:

obs_and_acts <- obs_and_acts %>%
  mutate(KleurringNummer = toupper(KleurringNummer))
## mutate: no changes

3.13 Check keys KleurringNummer (obs) - Nummer (birds)

All values in KleurringNummer should be present in birds$Nummer. Exceptions:

obs_and_acts %>%
  filter(!KleurringNummer %in% birds$Nummer) %>%
  distinct(KleurringNummer)
## filter: removed all rows (100%)
## distinct: no rows removed

We remove them:

KleurringNummer_remove <-
  obs_and_acts %>%
  filter(!KleurringNummer %in% birds$Nummer) %>%
  distinct(KleurringNummer) %>%
  pull()
## filter: removed all rows (100%)
## distinct: no rows removed
obs_and_acts <-
  obs_and_acts %>%
  filter(!KleurringNummer %in% KleurringNummer_remove)
## filter: no rows removed

Finally, we search for birds (Nummer in birds) not linked to any observation (KleurringNummer in obs_and_acts):

birds %>% filter(!Nummer %in% obs_and_acts$KleurringNummer)
## filter: removed all rows (100%)

We remove them:

birds <-
  birds %>%
  filter(Nummer %in% obs_and_acts$KleurringNummer)
## filter: no rows removed

3.14 Check duplicates in date of applying rings

Detect duplicates in date of applying (the very first) color ring (action code rngkl):

duplicates_ringing_rngkl <-
  obs_and_acts %>%
  filter(!is.na(rngkl)) %>%
  group_by(KleurringNummer, Datum, rngkl) %>%
  add_tally() %>%
  ungroup() %>%
  filter(n > 1)
## filter: removed 139,650 rows (93%), 11,306 rows remaining
## group_by: 3 grouping variables (KleurringNummer, Datum, rngkl)
## add_tally (grouped): new variable 'n' (integer) with one unique value and 0% NA
## ungroup: no grouping variables
## filter: removed all rows (100%)
duplicates_ringing_rngkl

and metal ring (action code rngme):

duplicates_ringing_rngme <-
  obs_and_acts %>%
  filter(!is.na(rngme)) %>%
  group_by(KleurringNummer, Datum, rngme) %>%
  add_tally() %>%
  ungroup() %>%
  filter(n > 1)
## filter: removed 139,793 rows (93%), 11,163 rows remaining
## group_by: 3 grouping variables (KleurringNummer, Datum, rngme)
## add_tally (grouped): new variable 'n' (integer) with one unique value and 0% NA
## ungroup: no grouping variables
## filter: removed all rows (100%)
duplicates_ringing_rngme

3.15 Add rngme if not present

Some metal rings are not linked to an observation with action rngme, so we have no date for them:

no_rngme <-
  birds %>%
  filter(!is.na(MetaalringNummer)) %>%
  left_join(obs_and_acts %>%
    filter(!is.na(rngme)) %>%
    select(KleurringNummer, Datum),
  by = c("Nummer" = "KleurringNummer")
  ) %>%
  filter(is.na(Datum)) %>%
  select(Nummer, MetaalringNummer) %>%
  rename(KleurringNummer = Nummer) %>%
  arrange(KleurringNummer)
## filter: removed 31 rows (<1%), 11,276 rows remaining
## filter: removed 139,793 rows (93%), 11,163 rows remaining
## select: dropped 37 variables (Nummer, EuringCode, LeeftijdCode, KleurringPlaats, MetaalringNummer, …)
## left_join: added one column (Datum)
##            > rows only in x      144
##            > rows only in y  (    31)
##            > matched rows     11,132
##            >                 ========
##            > rows total       11,276
## filter: removed 11,132 rows (99%), 144 rows remaining
## select: dropped 10 variables (NummerNieuw, NummerDesc, Plaats, RingKleurCode, InscriptieKleurCode, …)
## rename: renamed one variable (KleurringNummer)
no_rngme

Experts are sure that these metal rings have been applied while applying the color ring (see comment on issue #33). So, we can add action rngme to observations coupled to action rngkl for these rings:

obs_and_acts <-
  obs_and_acts %>%
  left_join(no_rngme %>%
    select(KleurringNummer) %>%
    mutate(add_rngme = TRUE),
  by = "KleurringNummer"
  ) %>%
  mutate(rngme = case_when(
    rngkl == "rngkl" & add_rngme == TRUE ~ "rngme",
    TRUE ~ rngme
  )) %>%
  select(-add_rngme)
## select: dropped one variable (MetaalringNummer)
## mutate: new variable 'add_rngme' (logical) with one unique value and 0% NA
## left_join: added one column (add_rngme)
##            > rows only in x   150,312
##            > rows only in y  (      0)
##            > matched rows         644
##            >                 =========
##            > rows total       150,956
## mutate: changed 144 values (<1%) of 'rngme' (144 fewer NA)
## select: dropped one variable (add_rngme)

Be sure this solution provides a date for all metal rings:

obs_and_acts %>%
  filter(is.na(rngme) | rngme == "rngme") %>%
  nrow() == nrow(obs_and_acts)
## filter: no rows removed
## [1] TRUE

3.16 Check age while applying rings

For each bird, observations with same date should refer to same bird age. The only exceptions allowed are those from veld observations as the age in this case is estimated by observer and can be different from the real one.
Exceptions, afer removing field observations:

exceptions_age <-
  obs_and_acts %>%
  # age is present
  filter(!is.na(LeeftijdCode)) %>%
  # exclude field observations
  filter(is.na(veld)) %>%
  distinct(KleurringNummer, Datum, LeeftijdCode, WaarnemerNummer) %>%
  group_by(KleurringNummer, Datum, WaarnemerNummer) %>%
  count() %>%
  filter(n > 1) %>%
  left_join(obs_and_acts %>%
    filter(!is.na(LeeftijdCode)) %>%
    select(
      Nummer, KleurringNummer, Datum, WaarnemerNummer,
      LeeftijdCode, one_of(actions_meaning$Code)
    )) %>%
  select(-n) %>%
  select(
    Nummer, KleurringNummer, Datum,
    LeeftijdCode, WaarnemerNummer, one_of(actions_meaning$Code)
  )
## filter: removed 60,680 rows (40%), 90,276 rows remaining
## filter: removed 72,908 rows (81%), 17,368 rows remaining
## distinct: removed 2 rows (<1%), 17,366 rows remaining
## group_by: 3 grouping variables (KleurringNummer, Datum, WaarnemerNummer)
## count: now 17,366 rows and 4 columns, 3 group variables remaining (KleurringNummer, Datum, WaarnemerNummer)
## filter (grouped): removed all rows (100%)
## filter: removed 60,680 rows (40%), 90,276 rows remaining
## Warning: Unknown columns: `megev`
## select: dropped 22 variables (EuringCode, KleurringPlaats, MetaalringNummer, MetaalringPlaats, PlaatsGemeente, …)
## Joining, by = c("KleurringNummer", "Datum", "WaarnemerNummer")
## left_join: added 14 columns (Nummer, LeeftijdCode, rngkl, rngme, klgev, …)
##            > rows only in x        0
##            > rows only in y  (90,276)
##            > matched rows          0
##            >                 ========
##            > rows total            0
## select: dropped one variable (n)
## Warning: Unknown columns: `megev`
## select: columns reordered (Nummer, KleurringNummer, Datum, LeeftijdCode, WaarnemerNummer, …)
exceptions_age

3.17 Check bird sex inconsistencies

All birds should have a sex, which could be one of M (mannetje), V (vrouwtje), O (onbekend):

distinct(birds, GeslachtCode)
## distinct: removed 11,304 rows (>99%), 3 rows remaining

Exceptions:

birds %>%
  filter(is.na(GeslachtCode)) %>%
  distinct(Nummer, NummerNieuw, NummerDesc, GeslachtCode)
## filter: removed all rows (100%)
## distinct: no rows removed

For all birds a one-to-one relation bird - sex should hold true:

birds %>%
  distinct(Nummer, GeslachtCode) %>%
  nrow() == nrow(birds)
## distinct: no rows removed
## [1] TRUE

3.18 Handle information about metal ring number

There are columns containing metal ring related information in birds:

names(birds)[which(str_detect(names(birds), "Metaalring"))]
## [1] "MetaalringNummer"   "MetaalringPlaats"   "MetaalringLandCode"

and obs_and_acts:

names(obs_and_acts)[which(str_detect(names(obs_and_acts), "Metaalring"))]
## [1] "MetaalringNummer"   "MetaalringPlaats"   "MetaalringLandCode"

This is not only redundant: inconsistencies are detected. Some examples:

birds %>%
  filter(Nummer %in% c("AAAA", "AAAR", "AAAT", "AAAW")) %>%
  rename("metal_ring_from_birds" = "MetaalringNummer") %>%
  left_join(obs_and_acts %>%
    rename("metal_ring_from_obs" = "MetaalringNummer"),
  by = c("Nummer" = "KleurringNummer")
  ) %>%
  distinct(Nummer, metal_ring_from_birds, metal_ring_from_obs)
## filter: removed 11,303 rows (>99%), 4 rows remaining
## rename: renamed one variable (metal_ring_from_birds)
## rename: renamed one variable (metal_ring_from_obs)
## left_join: added 41 columns (EuringCode.x, MetaalringPlaats.x, MetaalringLandCode.x, Nummer.y, Datum, …)
##            > rows only in x         0
##            > rows only in y  (150,596)
##            > matched rows         360    (includes duplicates)
##            >                 =========
##            > rows total           360
## distinct: removed 353 rows (98%), 7 rows remaining

Based on experts’ knowledge only the metal ring information in birds is correct. We rename the columns related to metal ring information from obs_and_acts by adding suffix _obs:

obs_and_acts <-
  obs_and_acts %>%
  rename_at(
    vars(starts_with("Metaalring")),
    ~ paste0(., "_obs")
  )
## rename_at: renamed 3 variables (MetaalringNummer_obs, MetaalringPlaats_obs, MetaalringLandCode_obs)

3.19 Solve GPS tracker IDs

Some birds have a suspect Nummer which is formed by letters GPS or GP followed by some numbers. These are a kind of GPS IDs which should not be in columns related to color rings:

birds %>%
  filter(str_detect(Nummer, pattern = "(^(GP)\\d+)|(^(GPS)\\d+)")) %>%
  select(Nummer)
## filter: removed 11,167 rows (99%), 140 rows remaining
## select: dropped 10 variables (NummerNieuw, NummerDesc, Plaats, RingKleurCode, InscriptieKleurCode, …)

The right mapping is saved in map_gps:

map_gps %>% head()

We import the color rings in new column first_Nummer. For all other rings will be first_nummer = Nummer as they are the very first rings:

birds <-
  birds %>%
  # create column first_Nummer with right color rings for the GPS or GP** rings
  left_join(map_gps, by = c("Nummer" = "gps_id")) %>%
  # set first_Numer equal to Nummer in all other cases
  mutate(first_Nummer = ifelse(is.na(color_ring),
    Nummer,
    color_ring
  )) %>%
  select(Nummer, first_Nummer, everything())
## left_join: added 2 columns (color_ring, sovon_bird_notes)
##            > rows only in x   11,167
##            > rows only in y  (     1)
##            > matched rows        140
##            >                 ========
##            > rows total       11,307
## mutate: new variable 'first_Nummer' (character) with 11,307 unique values and 0% NA
## select: columns reordered (Nummer, first_Nummer, NummerNieuw, NummerDesc, Plaats, …)

Check whether not corrected rings (GP*** or GPS***) are still present:

birds %>%
  filter(str_detect(first_Nummer, pattern = "(^(GP)\\d+)|(^(GPS)\\d+)")) %>%
  select(first_Nummer, NummerNieuw)
## filter: removed all rows (100%)
## select: dropped 12 variables (Nummer, NummerDesc, Plaats, RingKleurCode, InscriptieKleurCode, …)

Mapping example: the bird with Nummer equal to GP104 is mapped as follows:

birds %>%
  filter(Nummer == "GP104") %>%
  select(Nummer, first_Nummer, NummerNieuw, sovon_bird_notes)
## filter: removed 11,306 rows (>99%), one row remaining
## select: dropped 10 variables (NummerDesc, Plaats, RingKleurCode, InscriptieKleurCode, EuringCode, …)

Column first_Nummer should contain unique elements. Check for exceptions:

birds %>%
  group_by(first_Nummer) %>%
  add_tally() %>%
  ungroup() %>%
  filter(n > 1) %>%
  select(-n)
## group_by: one grouping variable (first_Nummer)
## add_tally (grouped): new variable 'n' (integer) with one unique value and 0% NA
## ungroup: no grouping variables
## filter: removed all rows (100%)
## select: dropped one variable (n)

Exceptions, if exist, are due to double entry, one with the original color ring, a second one with GPS tracker ID. Such exceptions, if present, should be handled separately while mapping bird_shorthand in document 3_birds.Rmd. Another reason could be mapping in map_gps as result of reringing of birds with GPS. Notice that map_gps should map to the very first ring.

3.20 Check inconsistencies between users and observers

We have to be sure that all observations have a valid observator reference, i.e. an ID contained in column Nummer of users.

Observations without an observator reference:

obs_and_acts %>%
  filter(is.na(WaarnemerNummer))
## filter: removed all rows (100%)

All observator references in obs_and_acts should be present in users:

observers_ids <- unique(
  obs_and_acts %>%
    filter(!is.na(WaarnemerNummer)) %>%
    distinct(WaarnemerNummer) %>%
    pull()
)
## filter: no rows removed
## distinct: removed 148,946 rows (99%), 2,010 rows remaining
all(observers_ids %in% users$Nummer)
## [1] TRUE

3.21 Remove data of common shelduck

Data related to common shelduck, Tadorna tadorna, are not updated: INOB experts suggest to remove them from the mapping. More details here. Euring scheme: 01730. We retrieve the values of field Nummer from birds related to this species:

common_shellduck_nummer <-
  birds %>%
  filter(EuringCode == "01730") %>%
  pull(Nummer)
## filter: removed 11,227 rows (99%), 80 rows remaining

Number of birds to remove:

length(common_shellduck_nummer)
## [1] 80

We remove data from birds:

birds <- birds %>%
  filter(EuringCode != "01730" | is.na(EuringCode))
## filter: removed 80 rows (1%), 11,227 rows remaining

and from obs_and_acts:

obs_and_acts <-
  obs_and_acts %>%
  filter(!KleurringNummer %in% common_shellduck_nummer | is.na(KleurringNummer))
## filter: removed 110 rows (<1%), 150,846 rows remaining

3.22 Remove rings BGAH, BRAB and EAU

The rings BGAH, BRAB and EAUshould be removed, as explained in # 89-issuesomment-481195862 and # 102-issuecomment-604939290. No information about EURING code can be found:

birds %>%
  filter(Nummer %in% c("BGAH", "BRAB", "EAU"))
## filter: removed 11,224 rows (>99%), 3 rows remaining

We also remove the ring EAU as it is not valid and the INBO experts agree on removing it (see issue #102).

We remove these rings from birds:

birds <-
  birds %>%
  filter(!Nummer %in% c("BGAH", "BRAB", "EAU"))
## filter: removed 3 rows (<1%), 11,224 rows remaining

Observations related to these three rings:

obs_and_acts %>%
  filter(KleurringNummer %in% c("BGAH", "BRAB", "EAU"))
## filter: removed 150,843 rows (>99%), 3 rows remaining

will be also removed:

obs_and_acts <-
  obs_and_acts %>%
  filter(!KleurringNummer %in% c("BGAH", "BRAB", "EAU"))
## filter: removed 3 rows (<1%), 150,843 rows remaining

3.23 Check date of dood

Observations with action dood should be the very last for each observation sequence containing this action. Exceptions:

obs_after_death <-
  obs_and_acts %>%
  filter(dood == "dood") %>%
  select(KleurringNummer, Datum) %>%
  rename(dood_datum = Datum) %>%
  left_join(obs_and_acts, by = "KleurringNummer") %>%
  filter(dood_datum < Datum) %>%
  select(Nummer, KleurringNummer, Datum, dood_datum, acts, everything()) %>%
  arrange(KleurringNummer, Datum)
## filter: removed 150,113 rows (>99%), 730 rows remaining
## select: dropped 37 variables (Nummer, EuringCode, LeeftijdCode, KleurringPlaats, MetaalringNummer_obs, …)
## rename: renamed one variable (dood_datum)
## left_join: added 38 columns (Nummer, Datum, EuringCode, LeeftijdCode, KleurringPlaats, …)
##            > rows only in x         0
##            > rows only in y  (143,800)
##            > matched rows       7,043    (includes duplicates)
##            >                 =========
##            > rows total         7,043
## filter: removed all rows (100%)
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(acts)` instead of `acts` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## select: columns reordered (Nummer, KleurringNummer, Datum, dood_datum, rngme, …)
obs_after_death

If present, observations after dood are removed:

obs_and_acts <-
  obs_and_acts %>%
  filter(!Nummer %in% obs_after_death$Nummer)
## filter: no rows removed

Finally, we perform a last check: a ring should not be linked to two observations, one with action dood and one with action klgev:

obs_and_acts %>%
  filter(dood == "dood" | klgev == "klgev") %>%
  group_by(KleurringNummer) %>%
  add_tally() %>%
  ungroup() %>%
  filter(n > 1)
## filter: removed 150,075 rows (99%), 768 rows remaining
## group_by: one grouping variable (KleurringNummer)
## add_tally (grouped): new variable 'n' (integer) with one unique value and 0% NA
## ungroup: no grouping variables
## filter: removed all rows (100%)

3.24 Save preprocessed data

After data cleaning we save the data as TSVs in .data/interim:

write_tsv(users, path = here::here("data", "interim", "users.tsv"), na = "")
write_tsv(birds, path = here::here("data", "interim", "birds.tsv"), na = "")
write_tsv(
  obs_and_acts,
  path = here::here("data", "interim", "obs_and_actions.tsv"),
  na = ""
)

We save the tables containing action codes, colors and ring places in ./data/input as it could be useful in following mapping steps:

# Save action codes
write_tsv(
  actions_meaning,
  path = here::here("data", "input", "actions_meaning.tsv"),
  na = ""
)

# Save colors
write_tsv(
  color_table,
  path = here::here("data", "input", "color_table.tsv"),
  na = ""
)

# Save ring position
write_tsv(
  ring_position_table,
  path = here::here("data", "input", "ring_position_table.tsv"),
  na = ""
)