2 Import and preprocess INBO color ring data
2.1 Extract data from INBO kleurring database
2.1.1 Connection to INBO database
Get connection settings from config.yml
(not included to the reposiory) and connect to database:
2.1.2 Extract user data
First, we extract data about users from INBO kleurring database:
2.1.3 Extract color ring data
We extract the color table in order to know the meaning of the color abbreviations in fields RingKleurCode
and InscriptieKleurCode
in birds
.
We also extract the ring position (left or right leg) and the inscription orientation:
ring_position_table <- dbGetQuery(conn, "SELECT * FROM dbo.cdeRingPlaats") %>%
as_tibble()
ring_position_table
2.1.4 Extract observation data
INBO’s observation data contain a text type field: Opmerking
. Text type is deprecated and an error will be returned if we perform the standard SQL query "SELECT * FROM dbo.tblWaarneming"
. So, we need an ad-hoc query:
obs <- dbGetQuery(
conn,
"SELECT Nummer,
Datum,
EuringCode,
LeeftijdCode,
KleurringNummer,
KleurringPlaats,
MetaalringNummer,
MetaalringPlaats,
PlaatsGemeente,
PlaatsToponym,
PlaatsToponymDetail,
Convert(nvarchar(4000),Opmerking) as Opmerking,
WaarnemerNummer,
PlaatsLengtegraadGraden,
PlaatsLengtegraadMinuten,
PlaatsLengtegraadSeconden,
PlaatsBreedtegraadGraden,
PlaatsBreedtegraadMinuten,
PlaatsBreedtegraadSeconden,
PlaatsLengtegraadRichtingCode,
PlaatsBreedtegraadRichtingCode,
PlaatsLandCode,
MetaalringLandCode,
BevestigingDatum,
PlaatsProvincie,
AanmaakDatum,
WijzigDatum
FROM dbo.tblWaarneming"
) %>% as_tibble()
Table tblWaarnemingAktie
is also important because contains informations about the actions taken at each observation:
The action codes are described in table dbo.cdeAktie
:
Close connection to server:
2.1.5 Import control data
From UVABIT repository, we copied a text file containing the mapping of GPS ids:
map_gps_path <- here::here("data", "input", "map_gps_id_to_color_ring.tsv")
map_gps <-
read_tsv(map_gps_path) %>%
mutate(sovon_bird_notes = "uvabits_gps_tag.")
## Parsed with column specification:
## cols(
## gps_id = col_character(),
## color_ring = col_character()
## )
## mutate: new variable 'sovon_bird_notes' (character) with one unique value and 0% NA
This file will be used later in this same document.
3 Pre-processing
3.1 Preview raw data
For privacy reasons users data cannot be shown. Users data refer to the following fields:
## [1] "Nummer" "Familienaam" "Voornaam" "Adres"
## [5] "Postcode" "Gemeente" "Email" "Telefoon"
## [9] "Wachtwoord" "Gebruikersnaam" "LandCode" "TelefoonMobiel"
## [13] "TelefoonWerk"
Preview birds data:
Preview observations data:
Preview observation actions data:
3.2 Check primary key consistency
The primary key of each table shuold never be left empty and should contain unique values.
3.2.1 Users
Primary key: column Nummer
. Users with empty values:
## filter: removed all rows (100%)
Users with not unique values of Nummer
:
## group_by: one grouping variable (Nummer)
## count: now 2,039 rows and 2 columns, one group variable remaining (Nummer)
## filter (grouped): removed all rows (100%)
3.2.2 Birds
Primary key: column Nummer
. Birds with empty values:
## filter: removed all rows (100%)
Birds with not unique values of Nummer
:
## group_by: one grouping variable (Nummer)
## count: now 11,309 rows and 2 columns, one group variable remaining (Nummer)
## filter (grouped): removed all rows (100%)
3.2.3 Observations
Primary key: column Nummer
. Observations with empty values:
## filter: removed all rows (100%)
Observations with not unique values of Nummer
:
## group_by: one grouping variable (Nummer)
## count: now 151,349 rows and 2 columns, one group variable remaining (Nummer)
## filter (grouped): removed all rows (100%)
3.2.4 Actions
Primary key: column Nummer
. Actions with empty values:
## filter: removed all rows (100%)
Observations with not unique values of Nummer
:
## group_by: one grouping variable (Nummer)
## count: now 162,010 rows and 2 columns, one group variable remaining (Nummer)
## filter (grouped): removed all rows (100%)
3.3 Remove hard returns
3.3.1 Remove hard returns in users
Remove hard returns \r\n
in users
:
## mutate_if: changed one value (<1%) of 'Adres' (0 new NA)
## changed one value (<1%) of 'Email' (0 new NA)
3.3.2 Remove hard returns in birds
Remove hard returns \r\n
in birds
:
## mutate_if: no changes
3.4 Check spaces in birds
No spaces should be present in any column of birds
:
map_dfr(birds, ~ mutate(birds, space_detect = str_detect(., pattern = " "))) %>%
filter(space_detect == TRUE)
## mutate: new variable 'space_detect' (logical) with one unique value and 0% NA
## mutate: new variable 'space_detect' (logical) with one unique value and 0% NA
## mutate: new variable 'space_detect' (logical) with one unique value and 0% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and 9% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and 1% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and 1% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and <1% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and <1% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and <1% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and 9% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and 6% NA
## filter: removed all rows (100%)
Remove all spaces:
## mutate_all: no changes
3.5 Check duplicates WaarnemingNummer
-AktieCode
The action acronym is contained in column AktieCode
. Observations and their actions are linked via columns WaarnemingNummer
(in actions
) and Nummer
(in obs
).
No duplicates WaarnemingNummer
-AktieCode
should exist:
obs_actions %>%
group_by(WaarnemingNummer, AktieCode) %>%
add_tally() %>%
ungroup() %>%
filter(n > 1)
## group_by: 2 grouping variables (WaarnemingNummer, AktieCode)
## add_tally (grouped): new variable 'n' (integer) with one unique value and 0% NA
## ungroup: no grouping variables
## filter: removed all rows (100%)
Otherwise they should be removed:
## distinct: no rows removed
3.6 Merge observations and actions
In order to ease the mapping of birds and observations, we first remove columns will be never used:
## select: dropped 3 variables (Nummer, AanmaakDatum, WijzigDatum)
Then we spread action codes to columns:
## pivot_wider: reorganized (AktieCode) into (rngme, rngkl, veld, dood, vang, …) [was 162010x2, now 151349x13]
The following combinations occur:
## select: dropped one variable (WaarnemingNummer)
## distinct: removed 151,319 rows (>99%), 30 rows remaining
Actions present in database:
## [1] "rngme" "rngkl" "veld" "dood" "vang" "klgev" "br" "vangl" "klweg"
## [10] "me" "meweg" "ziek"
Actions described in actions_meaning
never used:
## filter: removed 12 rows (92%), one row remaining
We add action code columns to observations in order to have a unique dataframe for observations:
## left_join: added 12 columns (rngme, rngkl, veld, dood, vang, …)
## > rows only in x 0
## > rows only in y ( 0)
## > matched rows 151,349
## > =========
## > rows total 151,349
Preview:
3.7 Check spaces in obs_and_acts
No spaces should be present in character columns of obs_and_acts
except for columns related to place description (Plaats*
), datums (*Datum
) and notes (Opmerking
):
map_dfr(obs_and_acts %>%
select_if(is.character) %>%
select(-c(
starts_with("Plaats"),
Opmerking,
ends_with("Datum")
)), ~
mutate(obs_and_acts, space_detect = str_detect(., pattern = " "))) %>%
filter(space_detect == TRUE) %>%
arrange(Nummer)
## select_if: dropped 12 variables (Nummer, Datum, WaarnemerNummer, PlaatsLengtegraadGraden, PlaatsLengtegraadMinuten, …)
## select: dropped 8 variables (PlaatsGemeente, PlaatsToponym, PlaatsToponymDetail, Opmerking, PlaatsLengtegraadRichtingCode, …)
## mutate: new variable 'space_detect' (logical) with 2 unique values and 28% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and 40% NA
## mutate: new variable 'space_detect' (logical) with one unique value and 0% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and 89% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and 99% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and 99% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and 99% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and 93% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and 93% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and 13% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and >99% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and >99% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and >99% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and 96% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and >99% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and >99% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and >99% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and >99% NA
## mutate: new variable 'space_detect' (logical) with 2 unique values and >99% NA
## filter: removed all rows (100%)
Remove spaces if present:
obs_and_acts <-
obs_and_acts %>%
mutate(across(
!starts_with("Plaats") &
!ends_with("Datum") &
!one_of("Opmerking"),
~ str_replace_all(., " ", "")
))
## mutate: converted 'Nummer' from integer to character (0 new NA)
## converted 'WaarnemerNummer' from integer to character (0 new NA)
3.8 Check presence action code
All observations should also have an action code:
no_actions <-
obs_and_acts %>%
filter_at(vars_select(names(obs_and_acts), one_of(acts)), all_vars(is.na(.)))
## filter_at: removed all rows (100%)
Observation without it will be removed:
## filter: no rows removed
3.9 Check combinations of actions with klgev
or megev
klgev
should never occur with other active actions like vang
, vangl
or action dood
. Is this true?
acts_with_klgev <-
obs_and_acts %>%
filter(klgev == "klgev") %>%
select(all_of(acts)) %>%
distinct() %>%
select_if(~ sum(!is.na(.)) > 0) %>%
names()
## filter: removed 151,311 rows (>99%), 38 rows remaining
## select: dropped 27 variables (Nummer, Datum, EuringCode, LeeftijdCode, KleurringNummer, …)
## distinct: removed 37 rows (97%), one row remaining
## select_if: dropped 11 variables (rngme, rngkl, veld, dood, vang, …)
## [1] TRUE
Same for megev
:
if ("megev" %in% acts) {
acts_with_megev <-
obs_and_acts %>%
filter(megev == "megev") %>%
select(all_of(acts)) %>%
distinct() %>%
select_if(~ sum(!is.na(.)) > 0) %>%
names()
acts_with_megev <- acts_with_megev[acts_with_megev != "megev"]
length(acts_with_megev) == 0
} else {
message("No 'megev' actions found.")
}
## No 'megev' actions found.
3.10 Remove data labelled as FOUT
There are observations judged as error. They are marked byKleurringNummmer
equal to FOUT
:
## filter: removed 150,958 rows (>99%), 391 rows remaining
The value FOUT
is also present in birds
:
## filter: removed 11,308 rows (>99%), one row remaining
These data are related to observations judged not correct (FOUT
: error) by the INBO experts and administrator of the database.
Thus, at the moment, we remove these data (discussion about this issue here) from observations:
## filter: removed 391 rows (<1%), 150,958 rows remaining
and from birds:
## filter: removed one row (<1%), 11,308 rows remaining
3.11 Remove EAYT
The bird with color ring EAYT
has been added after the final data export for SOVON, Oct 8th, and will be removed from birds:
## filter: removed one row (<1%), 11,307 rows remaining
and observations:
## filter: removed 2 rows (<1%), 150,956 rows remaining
3.12 Check color rings: uppercase
Bird color rings codes are uppercase. Exceptions:
birds %>%
filter(Nummer != toupper(Nummer) |
NummerNieuw != toupper(NummerNieuw) |
NummerDesc != toupper(NummerDesc))
## filter: removed all rows (100%)
are converted to uppercase:
birds <- birds %>%
mutate(
Nummer = toupper(Nummer),
NummerNieuw = toupper(NummerNieuw),
NummerDesc = toupper(NummerDesc)
)
## mutate: no changes
The same holds true for KleurringNummer
in obs_and_acts
. Exceptions:
## filter: removed all rows (100%)
## distinct: no rows removed
are transformed to uppercase:
## mutate: no changes
3.13 Check keys KleurringNummer
(obs) - Nummer
(birds)
All values in KleurringNummer
should be present in birds$Nummer
. Exceptions:
## filter: removed all rows (100%)
## distinct: no rows removed
We remove them:
KleurringNummer_remove <-
obs_and_acts %>%
filter(!KleurringNummer %in% birds$Nummer) %>%
distinct(KleurringNummer) %>%
pull()
## filter: removed all rows (100%)
## distinct: no rows removed
## filter: no rows removed
Finally, we search for birds (Nummer
in birds
) not linked to any observation (KleurringNummer
in obs_and_acts
):
## filter: removed all rows (100%)
We remove them:
## filter: no rows removed
3.14 Check duplicates in date of applying rings
Detect duplicates in date of applying (the very first) color ring (action code rngkl
):
duplicates_ringing_rngkl <-
obs_and_acts %>%
filter(!is.na(rngkl)) %>%
group_by(KleurringNummer, Datum, rngkl) %>%
add_tally() %>%
ungroup() %>%
filter(n > 1)
## filter: removed 139,650 rows (93%), 11,306 rows remaining
## group_by: 3 grouping variables (KleurringNummer, Datum, rngkl)
## add_tally (grouped): new variable 'n' (integer) with one unique value and 0% NA
## ungroup: no grouping variables
## filter: removed all rows (100%)
and metal ring (action code rngme
):
duplicates_ringing_rngme <-
obs_and_acts %>%
filter(!is.na(rngme)) %>%
group_by(KleurringNummer, Datum, rngme) %>%
add_tally() %>%
ungroup() %>%
filter(n > 1)
## filter: removed 139,793 rows (93%), 11,163 rows remaining
## group_by: 3 grouping variables (KleurringNummer, Datum, rngme)
## add_tally (grouped): new variable 'n' (integer) with one unique value and 0% NA
## ungroup: no grouping variables
## filter: removed all rows (100%)
3.15 Add rngme
if not present
Some metal rings are not linked to an observation with action rngme
, so we have no date for them:
no_rngme <-
birds %>%
filter(!is.na(MetaalringNummer)) %>%
left_join(obs_and_acts %>%
filter(!is.na(rngme)) %>%
select(KleurringNummer, Datum),
by = c("Nummer" = "KleurringNummer")
) %>%
filter(is.na(Datum)) %>%
select(Nummer, MetaalringNummer) %>%
rename(KleurringNummer = Nummer) %>%
arrange(KleurringNummer)
## filter: removed 31 rows (<1%), 11,276 rows remaining
## filter: removed 139,793 rows (93%), 11,163 rows remaining
## select: dropped 37 variables (Nummer, EuringCode, LeeftijdCode, KleurringPlaats, MetaalringNummer, …)
## left_join: added one column (Datum)
## > rows only in x 144
## > rows only in y ( 31)
## > matched rows 11,132
## > ========
## > rows total 11,276
## filter: removed 11,132 rows (99%), 144 rows remaining
## select: dropped 10 variables (NummerNieuw, NummerDesc, Plaats, RingKleurCode, InscriptieKleurCode, …)
## rename: renamed one variable (KleurringNummer)
Experts are sure that these metal rings have been applied while applying the color ring (see comment on issue #33). So, we can add action rngme
to observations coupled to action rngkl
for these rings:
obs_and_acts <-
obs_and_acts %>%
left_join(no_rngme %>%
select(KleurringNummer) %>%
mutate(add_rngme = TRUE),
by = "KleurringNummer"
) %>%
mutate(rngme = case_when(
rngkl == "rngkl" & add_rngme == TRUE ~ "rngme",
TRUE ~ rngme
)) %>%
select(-add_rngme)
## select: dropped one variable (MetaalringNummer)
## mutate: new variable 'add_rngme' (logical) with one unique value and 0% NA
## left_join: added one column (add_rngme)
## > rows only in x 150,312
## > rows only in y ( 0)
## > matched rows 644
## > =========
## > rows total 150,956
## mutate: changed 144 values (<1%) of 'rngme' (144 fewer NA)
## select: dropped one variable (add_rngme)
Be sure this solution provides a date for all metal rings:
## filter: no rows removed
## [1] TRUE
3.16 Check age while applying rings
For each bird, observations with same date should refer to same bird age. The only exceptions allowed are those from veld
observations as the age in this case is estimated by observer and can be different from the real one.
Exceptions, afer removing field observations:
exceptions_age <-
obs_and_acts %>%
# age is present
filter(!is.na(LeeftijdCode)) %>%
# exclude field observations
filter(is.na(veld)) %>%
distinct(KleurringNummer, Datum, LeeftijdCode, WaarnemerNummer) %>%
group_by(KleurringNummer, Datum, WaarnemerNummer) %>%
count() %>%
filter(n > 1) %>%
left_join(obs_and_acts %>%
filter(!is.na(LeeftijdCode)) %>%
select(
Nummer, KleurringNummer, Datum, WaarnemerNummer,
LeeftijdCode, one_of(actions_meaning$Code)
)) %>%
select(-n) %>%
select(
Nummer, KleurringNummer, Datum,
LeeftijdCode, WaarnemerNummer, one_of(actions_meaning$Code)
)
## filter: removed 60,680 rows (40%), 90,276 rows remaining
## filter: removed 72,908 rows (81%), 17,368 rows remaining
## distinct: removed 2 rows (<1%), 17,366 rows remaining
## group_by: 3 grouping variables (KleurringNummer, Datum, WaarnemerNummer)
## count: now 17,366 rows and 4 columns, 3 group variables remaining (KleurringNummer, Datum, WaarnemerNummer)
## filter (grouped): removed all rows (100%)
## filter: removed 60,680 rows (40%), 90,276 rows remaining
## Warning: Unknown columns: `megev`
## select: dropped 22 variables (EuringCode, KleurringPlaats, MetaalringNummer, MetaalringPlaats, PlaatsGemeente, …)
## Joining, by = c("KleurringNummer", "Datum", "WaarnemerNummer")
## left_join: added 14 columns (Nummer, LeeftijdCode, rngkl, rngme, klgev, …)
## > rows only in x 0
## > rows only in y (90,276)
## > matched rows 0
## > ========
## > rows total 0
## select: dropped one variable (n)
## Warning: Unknown columns: `megev`
## select: columns reordered (Nummer, KleurringNummer, Datum, LeeftijdCode, WaarnemerNummer, …)
3.17 Check bird sex inconsistencies
All birds should have a sex, which could be one of M
(mannetje), V
(vrouwtje), O
(onbekend):
## distinct: removed 11,304 rows (>99%), 3 rows remaining
Exceptions:
## filter: removed all rows (100%)
## distinct: no rows removed
For all birds a one-to-one relation bird - sex should hold true:
## distinct: no rows removed
## [1] TRUE
3.18 Handle information about metal ring number
There are columns containing metal ring related information in birds
:
## [1] "MetaalringNummer" "MetaalringPlaats" "MetaalringLandCode"
and obs_and_acts
:
## [1] "MetaalringNummer" "MetaalringPlaats" "MetaalringLandCode"
This is not only redundant: inconsistencies are detected. Some examples:
birds %>%
filter(Nummer %in% c("AAAA", "AAAR", "AAAT", "AAAW")) %>%
rename("metal_ring_from_birds" = "MetaalringNummer") %>%
left_join(obs_and_acts %>%
rename("metal_ring_from_obs" = "MetaalringNummer"),
by = c("Nummer" = "KleurringNummer")
) %>%
distinct(Nummer, metal_ring_from_birds, metal_ring_from_obs)
## filter: removed 11,303 rows (>99%), 4 rows remaining
## rename: renamed one variable (metal_ring_from_birds)
## rename: renamed one variable (metal_ring_from_obs)
## left_join: added 41 columns (EuringCode.x, MetaalringPlaats.x, MetaalringLandCode.x, Nummer.y, Datum, …)
## > rows only in x 0
## > rows only in y (150,596)
## > matched rows 360 (includes duplicates)
## > =========
## > rows total 360
## distinct: removed 353 rows (98%), 7 rows remaining
Based on experts’ knowledge only the metal ring information in birds
is correct. We rename the columns related to metal ring information from obs_and_acts
by adding suffix _obs
:
## rename_at: renamed 3 variables (MetaalringNummer_obs, MetaalringPlaats_obs, MetaalringLandCode_obs)
3.19 Solve GPS tracker IDs
Some birds have a suspect Nummer
which is formed by letters GPS
or GP
followed by some numbers. These are a kind of GPS IDs which should not be in columns related to color rings:
## filter: removed 11,167 rows (99%), 140 rows remaining
## select: dropped 10 variables (NummerNieuw, NummerDesc, Plaats, RingKleurCode, InscriptieKleurCode, …)
The right mapping is saved in map_gps
:
We import the color rings in new column first_Nummer
. For all other rings will be first_nummer = Nummer
as they are the very first rings:
birds <-
birds %>%
# create column first_Nummer with right color rings for the GPS or GP** rings
left_join(map_gps, by = c("Nummer" = "gps_id")) %>%
# set first_Numer equal to Nummer in all other cases
mutate(first_Nummer = ifelse(is.na(color_ring),
Nummer,
color_ring
)) %>%
select(Nummer, first_Nummer, everything())
## left_join: added 2 columns (color_ring, sovon_bird_notes)
## > rows only in x 11,167
## > rows only in y ( 1)
## > matched rows 140
## > ========
## > rows total 11,307
## mutate: new variable 'first_Nummer' (character) with 11,307 unique values and 0% NA
## select: columns reordered (Nummer, first_Nummer, NummerNieuw, NummerDesc, Plaats, …)
Check whether not corrected rings (GP***
or GPS***
) are still present:
birds %>%
filter(str_detect(first_Nummer, pattern = "(^(GP)\\d+)|(^(GPS)\\d+)")) %>%
select(first_Nummer, NummerNieuw)
## filter: removed all rows (100%)
## select: dropped 12 variables (Nummer, NummerDesc, Plaats, RingKleurCode, InscriptieKleurCode, …)
Mapping example: the bird with Nummer
equal to GP104
is mapped as follows:
## filter: removed 11,306 rows (>99%), one row remaining
## select: dropped 10 variables (NummerDesc, Plaats, RingKleurCode, InscriptieKleurCode, EuringCode, …)
Column first_Nummer
should contain unique elements. Check for exceptions:
## group_by: one grouping variable (first_Nummer)
## add_tally (grouped): new variable 'n' (integer) with one unique value and 0% NA
## ungroup: no grouping variables
## filter: removed all rows (100%)
## select: dropped one variable (n)
Exceptions, if exist, are due to double entry, one with the original color ring, a second one with GPS tracker ID. Such exceptions, if present, should be handled separately while mapping bird_shorthand
in document 3_birds.Rmd
. Another reason could be mapping in map_gps
as result of reringing of birds with GPS. Notice that map_gps
should map to the very first ring.
3.20 Check inconsistencies between users and observers
We have to be sure that all observations have a valid observator reference, i.e. an ID contained in column Nummer
of users
.
Observations without an observator reference:
## filter: removed all rows (100%)
All observator references in obs_and_acts
should be present in users
:
observers_ids <- unique(
obs_and_acts %>%
filter(!is.na(WaarnemerNummer)) %>%
distinct(WaarnemerNummer) %>%
pull()
)
## filter: no rows removed
## distinct: removed 148,946 rows (99%), 2,010 rows remaining
## [1] TRUE
3.21 Remove data of common shelduck
Data related to common shelduck, Tadorna tadorna, are not updated: INOB experts suggest to remove them from the mapping. More details here. Euring scheme: 01730. We retrieve the values of field Nummer
from birds
related to this species:
## filter: removed 11,227 rows (99%), 80 rows remaining
Number of birds to remove:
## [1] 80
We remove data from birds
:
## filter: removed 80 rows (1%), 11,227 rows remaining
and from obs_and_acts
:
obs_and_acts <-
obs_and_acts %>%
filter(!KleurringNummer %in% common_shellduck_nummer | is.na(KleurringNummer))
## filter: removed 110 rows (<1%), 150,846 rows remaining
3.22 Remove rings BGAH
, BRAB
and EAU
The rings BGAH
, BRAB
and EAU
should be removed, as explained in # 89-issuesomment-481195862 and # 102-issuecomment-604939290. No information about EURING code can be found:
## filter: removed 11,224 rows (>99%), 3 rows remaining
We also remove the ring EAU
as it is not valid and the INBO experts agree on removing it (see issue #102).
We remove these rings from birds
:
## filter: removed 3 rows (<1%), 11,224 rows remaining
Observations related to these three rings:
## filter: removed 150,843 rows (>99%), 3 rows remaining
will be also removed:
## filter: removed 3 rows (<1%), 150,843 rows remaining
3.23 Check date of dood
Observations with action dood
should be the very last for each observation sequence containing this action. Exceptions:
obs_after_death <-
obs_and_acts %>%
filter(dood == "dood") %>%
select(KleurringNummer, Datum) %>%
rename(dood_datum = Datum) %>%
left_join(obs_and_acts, by = "KleurringNummer") %>%
filter(dood_datum < Datum) %>%
select(Nummer, KleurringNummer, Datum, dood_datum, acts, everything()) %>%
arrange(KleurringNummer, Datum)
## filter: removed 150,113 rows (>99%), 730 rows remaining
## select: dropped 37 variables (Nummer, EuringCode, LeeftijdCode, KleurringPlaats, MetaalringNummer_obs, …)
## rename: renamed one variable (dood_datum)
## left_join: added 38 columns (Nummer, Datum, EuringCode, LeeftijdCode, KleurringPlaats, …)
## > rows only in x 0
## > rows only in y (143,800)
## > matched rows 7,043 (includes duplicates)
## > =========
## > rows total 7,043
## filter: removed all rows (100%)
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(acts)` instead of `acts` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## select: columns reordered (Nummer, KleurringNummer, Datum, dood_datum, rngme, …)
If present, observations after dood
are removed:
## filter: no rows removed
Finally, we perform a last check: a ring should not be linked to two observations, one with action dood
and one with action klgev
:
obs_and_acts %>%
filter(dood == "dood" | klgev == "klgev") %>%
group_by(KleurringNummer) %>%
add_tally() %>%
ungroup() %>%
filter(n > 1)
## filter: removed 150,075 rows (99%), 768 rows remaining
## group_by: one grouping variable (KleurringNummer)
## add_tally (grouped): new variable 'n' (integer) with one unique value and 0% NA
## ungroup: no grouping variables
## filter: removed all rows (100%)
3.24 Save preprocessed data
After data cleaning we save the data as TSVs in .data/interim
:
write_tsv(users, path = here::here("data", "interim", "users.tsv"), na = "")
write_tsv(birds, path = here::here("data", "interim", "birds.tsv"), na = "")
write_tsv(
obs_and_acts,
path = here::here("data", "interim", "obs_and_actions.tsv"),
na = ""
)
We save the tables containing action codes, colors and ring places in ./data/input
as it could be useful in following mapping steps:
# Save action codes
write_tsv(
actions_meaning,
path = here::here("data", "input", "actions_meaning.tsv"),
na = ""
)
# Save colors
write_tsv(
color_table,
path = here::here("data", "input", "color_table.tsv"),
na = ""
)
# Save ring position
write_tsv(
ring_position_table,
path = here::here("data", "input", "ring_position_table.tsv"),
na = ""
)