R/selectlocs.R
selectlocs_xg3.Rd
Select locations that comply with user-specified conditions, from a dataset as returned by get_xg3, or from a list with the outputs of eval_xg3_avail and eval_xg3_series.

Conditions can be specified for each of the summary statistics returned by eval_xg3_avail and eval_xg3_series.
selectlocs_xg3( data, xg3_type = NULL, max_gap = NULL, min_dur = NULL, conditions, verbose = TRUE, list = FALSE )
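data | Either an object returned by get_xg3, or a list with the outputs of eval_xg3_avail and eval_xg3_series, named "avail" and "ser" respectively. |
---|---|
xg3_type | Only relevant when data is an object formatted as returned by get_xg3. Specifies the XG3 types (HG3, LG3 and/or VG3) for which statistics are computed. |
max_gap | A positive integer (can be zero). It is part of what the user defines as 'an XG3 series': the maximum allowed time gap between two consecutive XG3 values in a series, expressed as the number of years without XG3 value. |
min_dur | A strictly positive integer. It is part of what the user defines as 'an XG3 series': the minimum required duration of an XG3 series, i.e. the time (expressed as years) from the first to the last year of the XG3 series. |
conditions | A dataframe. See the dedicated section below. |
verbose | Logical. If TRUE, at least the locations that are completely dropped when joining the statistics with the conditions dataframe are reported. |
list | Logical. If FALSE, a tibble with the selected locations is returned. If TRUE, a list of tibbles is returned that extends this end-result with intermediate results. |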
If list = FALSE: a tibble with one column loc_code that provides the locations selected by the conditions.

If list = TRUE: a list of tibbles that extends the previous end-result with intermediate results. Elements nrs. 2 and 3 below are only given when at least one XG3 availability condition was given, nrs. 4, 5 and 6 only when at least one XG3 series condition was given, and nr. 7 is only returned when both types of condition were given.

All list elements are named:
1. combined_result_filtered: the end-result, same as given by list = FALSE.
2. result_avail: the test result of each computed and tested availability statistic for each location and XG3 variable: 'condition met' (cond_met) is TRUE or FALSE.
3. combined_result_avail: aggregation of result_avail per location. Specific columns:
   - cond_met_avail is TRUE if all availability conditions for that location were TRUE, and is FALSE in all other cases.
   - pct_cond_met_avail is the percentage of 'met' availability conditions per location.
4. result_series: the test result of each computed and tested series statistic for each location and XG3 series: 'condition met' (cond_met) is TRUE or FALSE.
5. combined_result_series_xg3var: aggregation of result_series per location and XG3 variable. Two consecutive aggregation steps are involved here:
   1. per XG3 series: are all series conditions met?
   2. per XG3 variable: is there at least one series where all series conditions are met?

   Specific columns:
   - all_ser_cond_met_xg3var is the answer to question 2 (TRUE/FALSE).
   - avg_pct_cond_met_nonpassed_series is the average percentage (for a location and XG3 variable) of 'met' series conditions in the series where not all conditions were met. (Note that the same percentage is 100 for series where all conditions are met, leading to all_ser_cond_met_xg3var = TRUE at the level of location and XG3 variable.)
6. combined_result_series: aggregation of combined_result_series_xg3var per location. Specific columns:
   - cond_met_series is TRUE if all XG3 variables (that have series and on which series conditions were imposed) were TRUE in the previous aggregation (all_ser_cond_met_xg3var = TRUE), and is FALSE in all other cases.
   - pct_xg3vars_passed_ser is the percentage of a location's XG3 variables (that have series and on which series conditions were imposed) that were TRUE in the previous aggregation (all_ser_cond_met_xg3var = TRUE).
   - avg_pct_cond_met_in_nonpassed_series is the average of avg_pct_cond_met_nonpassed_series from the previous aggregation step, over the involved XG3 variables at the location.
7. combined_result: the inner join (on loc_code) between combined_result_avail and combined_result_series. Locations that were dropped in either evaluation because of missing information are dropped here too, because the function is to return locations for which all conditions hold and hence could be verified. The last column, all_cond_met, requires both cond_met_avail = TRUE and cond_met_series = TRUE to result in TRUE for a location.

   Notes:
   - locations with all_cond_met = FALSE are not discarded in this object;
   - filtering locations with all_cond_met = TRUE will result in exactly the locations given by combined_result_filtered, see list element 1;
   - the object is only returned when both availability and series condition(s) were given (at least one of each family). In the other cases, you can directly look at combined_result_avail or combined_result_series, from which combined_result_filtered is derived.
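
The per-location aggregation of availability results and the final inner join (list elements 3 and 7) can be sketched with dplyr on toy data. This is only an illustration under assumptions, not the package's implementation: the input values are invented, and only the column names loc_code, cond_met, cond_met_avail, pct_cond_met_avail, cond_met_series and all_cond_met are taken from the description above.

```r
library(dplyr)

# Toy stand-in for result_avail: one row per location, XG3 variable and statistic
result_avail <- tribble(
  ~loc_code, ~xg3_variable, ~statistic, ~cond_met,
  "A",       "lg3_lcl",     "nryears",  TRUE,
  "A",       "hg3_lcl",     "nryears",  TRUE,
  "B",       "lg3_lcl",     "nryears",  TRUE,
  "B",       "hg3_lcl",     "nryears",  FALSE
)

# Aggregate per location, as described for combined_result_avail
combined_result_avail <- result_avail %>%
  group_by(loc_code) %>%
  summarise(
    cond_met_avail = all(cond_met),
    pct_cond_met_avail = 100 * mean(cond_met)
  )

# Toy stand-in for combined_result_series
# (location "C" has no availability result at all)
combined_result_series <- tribble(
  ~loc_code, ~cond_met_series,
  "A",       TRUE,
  "B",       TRUE,
  "C",       TRUE
)

# Inner join: locations missing from either side are dropped
combined_result <- combined_result_avail %>%
  inner_join(combined_result_series, by = "loc_code") %>%
  mutate(all_cond_met = cond_met_avail & cond_met_series)

# Filtering on all_cond_met gives the equivalent of combined_result_filtered
combined_result_filtered <- combined_result %>%
  filter(all_cond_met) %>%
  select(loc_code)
```

In this toy case location "C" disappears in the join, location "B" is kept in combined_result with all_cond_met = FALSE, and only location "A" remains after filtering.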
selectlocs_xg3() separately runs eval_xg3_avail() and eval_xg3_series() on the input (data) if the latter conforms to the output of get_xg3. See the documentation of eval_xg3_avail and eval_xg3_series to learn more about how an 'XG3 variable' and an 'XG3 series' are defined, and about the available summary statistics.
Each condition for evaluation and selection of locations is specific to an XG3 variable, which can also be the level 'combined'. Hence, the result will depend both on the XG3 types (HG3, LG3 and/or VG3) for which statistics have been computed (specified by xg3_type), and on the conditions, specified by conditions. See the dedicated section on the conditions dataframe.
Only locations are returned:
- which have all XG3 variables, implied by xg3_type and present in conditions, available in data (in other words, all conditions must be testable);
- for which all conditions are met.
As the conditions imposed by the conditions dataframe are always evaluated as a required combination of conditions ('and'), the user must make different calls to selectlocs_xg3() if different sets of conditions are to be allowed ('or').
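
One way to obtain an 'or' between two sets of conditions is to combine the results of two separate calls afterwards. The sketch below assumes that sel_1 and sel_2 are the outputs of two selectlocs_xg3() calls with list = FALSE; both names and their toy values are hypothetical, since running the function itself requires data from get_xg3.

```r
library(dplyr)

# Hypothetical outputs of two separate selectlocs_xg3() calls (list = FALSE),
# each made with a different conditions dataframe; toy values for illustration
sel_1 <- tibble(loc_code = c("A", "B"))
sel_2 <- tibble(loc_code = c("B", "C"))

# 'or': locations selected by at least one of the two sets of conditions
sel_either <- union(sel_1, sel_2)

# For comparison, locations selected by both sets of conditions
sel_both <- intersect(sel_1, sel_2)
```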
Regarding conditions that evaluate XG3 series, it is taken into account that one location can have multiple series for the same XG3 variable. When the user provides one or more conditions for the series of a specific XG3 variable, the condition(s) are regarded as fulfilled ('condition met') when at least one series is present of that XG3 variable for which all those conditions are met.
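
The 'at least one series meets all conditions' rule can be sketched as two grouped summaries. This is an illustration on invented data, not the package's code; in particular, the column name series for the series identifier is an assumption.

```r
library(dplyr)

# Toy stand-in for result_series: one row per location, XG3 variable,
# series and tested statistic ('series' is an assumed column name)
result_series <- tribble(
  ~loc_code, ~xg3_variable, ~series, ~statistic,     ~cond_met,
  "A",       "lg3_lcl",     "s1",    "ser_length",   TRUE,
  "A",       "lg3_lcl",     "s1",    "ser_lastyear", FALSE,
  "A",       "lg3_lcl",     "s2",    "ser_length",   TRUE,
  "A",       "lg3_lcl",     "s2",    "ser_lastyear", TRUE,
  "B",       "lg3_lcl",     "s1",    "ser_length",   FALSE,
  "B",       "lg3_lcl",     "s1",    "ser_lastyear", TRUE
)

# Step 1, per XG3 series: are all series conditions met?
per_series <- result_series %>%
  group_by(loc_code, xg3_variable, series) %>%
  summarise(all_met = all(cond_met), .groups = "drop")

# Step 2, per XG3 variable: did at least one series pass step 1?
per_xg3var <- per_series %>%
  group_by(loc_code, xg3_variable) %>%
  summarise(all_ser_cond_met_xg3var = any(all_met), .groups = "drop")
```

Location "A" passes because its second series meets both conditions, even though its first series does not; location "B" fails because its only series misses one condition.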
selectlocs_xg3() joins the long-formatted results of eval_xg3_avail and eval_xg3_series with the conditions dataframe in order to evaluate the conditions. Often, this join in itself already leads to dropping specific combinations of loc_code and xg3_variable. At least the locations that are completely dropped in this step are reported when verbose = TRUE.
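
Why the join can drop combinations of loc_code and xg3_variable is easiest to see on a small example. The sketch below assumes an inner join on xg3_variable and statistic and uses invented data; the column name value and the exact join used inside the package are assumptions.

```r
library(dplyr)

# Toy long-format statistics (stand-in for the output of eval_xg3_avail();
# 'value' is an assumed column name)
stats_long <- tribble(
  ~loc_code, ~xg3_variable, ~statistic, ~value,
  "A",       "lg3_lcl",     "nryears",  12,
  "A",       "hg3_lcl",     "nryears",  10,
  "B",       "hg3_lcl",     "nryears",  8
)

conditions_df <- tribble(
  ~xg3_variable, ~statistic, ~criterion, ~direction,
  "lg3_lcl",     "nryears",  10,         "min"
)

# Only rows with a matching condition survive the join
joined <- stats_long %>%
  inner_join(conditions_df, by = c("xg3_variable", "statistic"))

# Locations that disappeared completely in the join
dropped_locs <- stats_long %>%
  distinct(loc_code) %>%
  anti_join(joined, by = "loc_code")
```

Here location "B" has no lg3_lcl statistic, so no condition can be tested for it and it drops out entirely.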
For larger datasets, eval_xg3_series() can take quite some time, whereas the user may want to repeatedly try different sets of conditions until a satisfying selection of locations is returned. However, the output of both eval_xg3_avail() and eval_xg3_series() will not change as long as the data and the chosen values of max_gap and min_dur are not altered.

For that reason, the user can also prepare a list object with the respective results of eval_xg3_avail() and eval_xg3_series(), which must be named "avail" and "ser", respectively. This list can be used as data input instead, and in that case xg3_type, max_gap and min_dur are not needed (they will be ignored).
Conditions can be specified for each of the summary statistics returned by eval_xg3_avail and eval_xg3_series. Consequently, XG3 availability conditions and XG3 series conditions can be distinguished.

The conditions parameter takes a dataframe that must have the following columns:
- xg3_variable: One of: "combined", "lg3_lcl", "lg3_ost", "vg3_lcl", "vg3_ost", "hg3_lcl", "hg3_ost".
- statistic: Name of the statistic to be evaluated.
- criterion: Numeric. Defines the value of the statistic on which the condition will be based.
- direction: One of: "min", "max", "equal". Together with criterion, this completes the condition which will be evaluated with respect to the specific xg3_variable: for direction = "min", the statistic must be the criterion value or larger; for direction = "max", the statistic must be the criterion value or lower; for direction = "equal", the statistic must be equal to the criterion value.
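
How direction and criterion combine into a TRUE/FALSE test can be sketched as follows. The data are invented and the column name value (holding the computed statistic) is an assumption; the snippet only mirrors the three rules stated above and is not taken from the package source.

```r
library(dplyr)

# Toy statistics already joined with their conditions
# ('value' is an assumed name for the column holding the computed statistic)
tested <- tribble(
  ~loc_code, ~statistic,     ~value, ~criterion, ~direction,
  "A",       "ser_lastyear", 2016,   2015,       "min",
  "B",       "ser_lastyear", 2012,   2015,       "min",
  "A",       "firstyear",    1995,   2000,       "max",
  "B",       "nryears",      10,     10,         "equal"
)

tested <- tested %>%
  mutate(
    cond_met = case_when(
      direction == "min"   ~ value >= criterion,  # criterion value or larger
      direction == "max"   ~ value <= criterion,  # criterion value or lower
      direction == "equal" ~ value == criterion   # exactly the criterion value
    )
  )
```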
Each condition is one row of the dataframe. The dataframe should have at least one row, and may have many. Each combination of xg3_variable and statistic must be unique.
Conditions on XG3 variables that are absent from data or not implied by xg3_type will be dropped without warning. Hence, it is up to the user to do sensible things.
The possible statistics for XG3 availability conditions are: nryears, firstyear, lastyear.
The possible statistics for XG3 series conditions are: ser_length, ser_nryears, ser_rel_nryears, ser_firstyear, ser_lastyear, ser_pval_uniform, ser_mean, ser_sd, ser_se_6y, ser_rel_sd_lcl, ser_rel_se_6y_lcl. The last six are not defined for the XG3 variable 'combined', and the last two are only defined for variables with a local vertical CRS.
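
Because unsuitable conditions are dropped without warning, it can help to check a conditions dataframe before use. The sketch below is a hypothetical pre-check, not a function of the package: it labels each condition as an availability or series condition, flags series statistics that the text above says are not defined for 'combined', and tests uniqueness of the xg3_variable and statistic combinations.

```r
library(dplyr)

# Statistic names copied from the two lists above
avail_stats <- c("nryears", "firstyear", "lastyear")
series_stats <- c("ser_length", "ser_nryears", "ser_rel_nryears",
                  "ser_firstyear", "ser_lastyear", "ser_pval_uniform",
                  "ser_mean", "ser_sd", "ser_se_6y",
                  "ser_rel_sd_lcl", "ser_rel_se_6y_lcl")

conditions_df <- tribble(
  ~xg3_variable, ~statistic,     ~criterion, ~direction,
  "combined",    "nryears",      10,         "min",
  "lg3_lcl",     "ser_lastyear", 2015,       "min",
  "combined",    "ser_mean",     -0.5,       "max"
)

# Hypothetical check (not part of the package): label the condition family
# and flag the last six series statistics when used with 'combined'
checked <- conditions_df %>%
  mutate(
    family = case_when(
      statistic %in% avail_stats  ~ "availability",
      statistic %in% series_stats ~ "series"
    ),
    undefined_for_combined =
      xg3_variable == "combined" & statistic %in% tail(series_stats, 6)
  )

# Each combination of xg3_variable and statistic must be unique
has_duplicates <-
  anyDuplicated(conditions_df[c("xg3_variable", "statistic")]) > 0
```

The third toy condition is flagged, since ser_mean is among the statistics that are not defined for the 'combined' variable.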
eval_xg3_avail, eval_xg3_series

Other functions to select locations: selectlocs_chem()
if (FALSE) {
watina <- connect_watina()
library(dplyr)
mylocs <- get_locs(watina, area_codes = "TOR", loc_type = c("P", "S"))
mydata <-
  mylocs %>%
  get_xg3(watina, 2000)
mydata %>% arrange(loc_code, hydroyear)
# Number of locations in mydata:
mydata %>% distinct(loc_code) %>% count()
# Number of hydrological years per location and XG3 variable:
mydata %>%
  group_by(loc_code) %>%
  collect() %>%
  summarise(lg3_lcl = sum(!is.na(lg3_lcl)),
            hg3_lcl = sum(!is.na(hg3_lcl)),
            vg3_lcl = sum(!is.na(vg3_lcl)))
conditions_df <-
  tribble(
    ~xg3_variable, ~statistic, ~criterion, ~direction,
    "lg3_lcl", "ser_lastyear", 2015, "min",
    "hg3_lcl", "ser_lastyear", 2015, "min"
  )
conditions_df
result <-
  mydata %>%
  selectlocs_xg3(xg3_type = c("L", "H"),
                 max_gap = 1,
                 min_dur = 5,
                 conditions = conditions_df,
                 list = TRUE)
# or:
# mystats <- list(avail = eval_xg3_avail(mydata,
#                                        xg3_type = c("L", "H")),
#                 ser = eval_xg3_series(mydata,
#                                       xg3_type = c("L", "H"),
#                                       max_gap = 1,
#                                       min_dur = 5))
# result <-
#   mystats %>%
#   selectlocs_xg3(conditions = conditions_df,
#                  list = TRUE)
result$combined_result_filtered
result[2:4]
# Disconnect:
dbDisconnect(watina)
}