Vignette for dynamic patterns in validation functions #525

Open
yjunechoe opened this issue Mar 4, 2024 · 0 comments
Collaborator

yjunechoe commented Mar 4, 2024

I like to think that the big unifying theme of the new 0.12.0 release is making validation functions more dynamic. Broadly, this came in the form of improved {tidyselect} support and enabling {glue} syntax, but in the process we also unlocked some really cool but non-obvious patterns that I think may be worth documenting.

There are still some questions in my mind about the scope and structure of such a vignette (perhaps this could be part of a more general and technical vignette for intermediate-to-advanced users on "programming with pointblank" or "extending pointblank"), but I wanted to put the idea out there. I'm happy to draft something if this is of interest!


Just to sketch out a couple examples of new patterns:

1. glue() syntax in label allows injecting human-readable labels

The pattern is to first define a "dictionary" (a named vector) that maps column names to human-readable names, then use it to substitute {.col} via indexing:

library(pointblank)
library(dplyr) # provides the storms and starwars datasets used below

col_labels <- c(
  "lat" = "Latitude",
  "long" = "Longitude"
)
agent1a <- create_agent(storms) %>% 
  col_vals_not_null(
    columns = c(lat, long),
    label = "{col_labels[.col]} information should be present"
  ) %>% 
  interrogate(show_step_label = TRUE, progress = TRUE)
#> 
#> ── Interrogation Started - there are 2 steps ──────────────────────────────
#> ✔ Step 1: OK. - Latitude information should be present
#> ✔ Step 2: OK. - Longitude information should be present
#> 
#> ── Interrogation Completed ────────────────────────────────────────────────

Going further, we can also use all_of() from the new tidyselect integration to express both columns and labels dynamically:

agent1b <- create_agent(storms) %>% 
  col_vals_not_null(
    columns = all_of(names(col_labels)),
    label = "{col_labels[.col]} information should be present"
  ) %>% 
  interrogate(show_step_label = TRUE, progress = TRUE)
#> 
#> ── Interrogation Started - there are 2 steps ──────────────────────────────
#> ✔ Step 1: OK. - Latitude information should be present
#> ✔ Step 2: OK. - Longitude information should be present
#> 
#> ── Interrogation Completed ────────────────────────────────────────────────
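
To see why the indexing trick works, it may help to run the same kind of template through glue directly, outside of pointblank. This is only a sketch under assumptions, not a description of pointblank's internals: `.col` is set by hand here, and `label_for()` is a hypothetical helper (not part of pointblank) that falls back to the raw column name when the dictionary has no entry, since plain `col_labels[.col]` would give `NA` in that case.

```r
library(glue)

col_labels <- c(lat = "Latitude", long = "Longitude")

# Hypothetical helper (not part of pointblank): look up a label,
# falling back to the column name itself when no label is defined
label_for <- function(col, dict = col_labels) {
  out <- unname(dict[col])
  ifelse(is.na(out), col, out)
}

# Stand-in for the value pointblank supplies for each validation step
.col <- c("lat", "long", "wind")

# glue evaluates the expression inside the braces as ordinary R code
labels <- glue("{label_for(.col)} information should be present")
print(labels)
```

The first two labels use the dictionary entries and the third falls back to "wind". Whether a helper like this is visible from inside a pointblank label depends on where the label is evaluated; the examples above only demonstrate a vector defined in the calling environment.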

2. any_of() in columns allows smart subsetting of columns that are present from a larger pool of thematically grouped columns

For example, we might define a function that wraps col_vals_gte(), applying a rule (must be a non-negative number) to a pool of columns that we know to contain information about a measure of time:

validate_time_cols <- function(agent) {
  time_measures <- c("year", "month", "day", "hour", "minute", "second")
  agent %>% 
    col_vals_gte(
      columns = any_of(time_measures),
      value = 0,
      label = "Measure of time `{.col}` should be positive"
    )
}

dplyr::storms currently records year, month, day, and hour, but not minute or second. Only the columns that exist are picked up and validated, all with this single function.

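# time_measures is local to validate_time_cols(), so redefine it here for the check
time_measures <- c("year", "month", "day", "hour", "minute", "second")
intersect(colnames(storms), time_measures)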
#> [1] "year"  "month" "day"   "hour"
agent2a <- create_agent(storms) %>% 
  validate_time_cols() %>% 
  interrogate(show_step_label = TRUE, progress = TRUE)
#> 
#> ── Interrogation Started - there are 4 steps ──────────────────────────────
#> ✔ Step 1: OK. - Measure of time `year` should be positive
#> ✔ Step 2: OK. - Measure of time `month` should be positive
#> ✔ Step 3: OK. - Measure of time `day` should be positive
#> ✔ Step 4: OK. - Measure of time `hour` should be positive
#> 
#> ── Interrogation Completed ────────────────────────────────────────────────

If the storms API is known to sometimes (but not always) include information about minute, the same function can gracefully handle that variation, thanks to any_of():

storms2 <- storms %>% 
  mutate(minute = sample(0:59, n(), replace = TRUE)) # valid minutes run from 0 to 59
agent2b <- create_agent(storms2) %>% 
  validate_time_cols() %>% 
  interrogate(show_step_label = TRUE, progress = TRUE)
#> 
#> ── Interrogation Started - there are 5 steps ──────────────────────────────
#> ✔ Step 1: OK. - Measure of time `year` should be positive
#> ✔ Step 2: OK. - Measure of time `month` should be positive
#> ✔ Step 3: OK. - Measure of time `day` should be positive
#> ✔ Step 4: OK. - Measure of time `hour` should be positive
#> ✔ Step 5: OK. - Measure of time `minute` should be positive
#> 
#> ── Interrogation Completed ────────────────────────────────────────────────
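
The difference between any_of() and all_of() is easiest to see in isolation. The following is a minimal sketch using dplyr::select() on a toy data frame (the data frame and the variable names `lenient` and `strict` are made up for illustration); it assumes the selection helpers behave the same way inside pointblank's columns argument, which is what the shared tidyselect integration suggests.

```r
library(dplyr)

# Toy table that has only some of the pooled columns
df <- data.frame(year = 2020L, month = 1L, day = 15L)
time_measures <- c("year", "month", "day", "hour", "minute", "second")

# any_of() keeps whichever pooled columns exist and ignores the rest
lenient <- names(select(df, any_of(time_measures)))

# all_of() is strict: a missing column is an error, caught here for display
strict <- tryCatch(
  names(select(df, all_of(time_measures))),
  error = function(e) "error: some columns are missing"
)

print(lenient)
print(strict)
```

This is why all_of() suits the label dictionary in the first pattern (every listed column is expected to exist), while any_of() suits a pool of columns that may or may not be present.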

3. Shared tidyselect expression in columns = ... and has_columns(...) to conditionally skip a step

pointblank has always had this pattern, but it's become even simpler to reason about now that columns and has_columns() share the exact same tidyselect implementation.

For example, this function says "check for missing values in factor columns, but only if any exist in the data":

validate_factor_completeness <- function(agent) {
  agent %>% 
    col_vals_not_null(
      columns = where(is.factor),
      label = "Factor `{.col}` should not have missing data",
      active = ~ . %>% has_columns(where(is.factor))
    )
}

Now we have a very general and modular function that can be applied to any dataset.

It picks out the one factor column in dplyr::storms and validates it:

agent3a <- create_agent(storms) %>% 
  validate_factor_completeness() %>% 
  interrogate(show_step_label = TRUE, progress = TRUE)
#> 
#> ── Interrogation Started - there is a single validation step ──────────────
#> ✔ Step 1: OK. - Factor `status` should not have missing data
#> 
#> ── Interrogation Completed ────────────────────────────────────────────────

It skips the step for dplyr::starwars because there are no factor columns:

agent3b <- create_agent(starwars) %>% 
  validate_factor_completeness() %>% 
  interrogate(show_step_label = TRUE, progress = TRUE)
#> 
#> ── Interrogation Started - there is a single validation step ──────────────
#> ℹ Step 1 is not set as active. Skipping.
#> 
#> ── Interrogation Completed ────────────────────────────────────────────────
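
The condition used in active can also be checked on its own, which may help when reasoning about whether a step will run. This is a rough sketch assuming pointblank 0.12.0 or later, where has_columns() accepts tidyselect expressions; the two toy data frames are made up for illustration.

```r
library(pointblank)
library(dplyr) # for where()

with_factor <- data.frame(grp = factor(c("a", "b")), x = 1:2)
without_factor <- data.frame(
  x = 1:2,
  y = c("a", "b"),
  stringsAsFactors = FALSE # keep y as character, not factor
)

# The same tidyselect expression that the validation step uses in `columns`
has_factor_1 <- has_columns(with_factor, where(is.factor))
has_factor_2 <- has_columns(without_factor, where(is.factor))

print(has_factor_1)
print(has_factor_2)
```

Going by the interrogation results above, the first check should come back TRUE and the second FALSE, mirroring the step running for storms and being skipped for starwars.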