Feature request: assertions on attributes of data frame #98

Open
JordanGutterman opened this issue May 5, 2020 · 2 comments

Comments

@JordanGutterman

I previously used this package to check that joins in a pipeline do not introduce new rows via duplicates using the following pattern:

df3 <- df %>%
   left_join(df2, by="var") %>%
   assert(nrow(.) == nrow(df))

Similarly, I checked other attributes of the data frame at that point in the pipeline by passing its current state with the . placeholder.

This commit added a check that columns are passed to assert(), which makes sense per the current documentation but breaks my use case. So this is a request to allow passing logical checks that do not operate on columns, or to provide another way to check attributes of the data frame being built at that point in the pipeline.

@maia-sh

maia-sh commented Jul 29, 2020

Hi @JordanGutterman,

I stumbled across a similar issue and came to the following solutions using verify. Perhaps they can help you.

library(dplyr)
library(assertr)

# Make toy dataframes
my_cars <- 
  mtcars %>% 
  mutate(id = row_number())
  
cars_info <- 
  my_cars %>% 
  select(id) %>% 
  mutate(color = "purple", year = 1974)

# Option 1: check then join
my_cars %>% 
  verify(nrow(anti_join(., cars_info, by = "id")) == 0) %>% 
  left_join(cars_info, by = "id") %>% 
  head()
#>    mpg cyl disp  hp drat    wt  qsec vs am gear carb id  color year
#> 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4  1 purple 1974
#> 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4  2 purple 1974
#> 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1  3 purple 1974
#> 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1  4 purple 1974
#> 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2  5 purple 1974
#> 6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1  6 purple 1974

# Option 2: join then check
my_cars %>% 
  left_join(cars_info, by = "id") %>% 
  verify(nrow(.) == nrow(my_cars)) %>% 
  head()
#>    mpg cyl disp  hp drat    wt  qsec vs am gear carb id  color year
#> 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4  1 purple 1974
#> 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4  2 purple 1974
#> 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1  3 purple 1974
#> 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1  4 purple 1974
#> 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2  5 purple 1974
#> 6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1  6 purple 1974

Created on 2020-07-29 by the reprex package (v0.3.0)
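Both options above pass because cars_info has exactly one row per id. As a minimal sketch of the failing case, the snippet below duplicates one row of the lookup table (cars_info_dup is a hypothetical name, not from the reprex) and wraps the pipeline in tryCatch() so the script keeps running. It assumes verify() stops with an R error under its default error function.

```r
library(dplyr)
library(assertr)

# Same toy data as in the reprex above
my_cars <- mtcars %>% mutate(id = row_number())
cars_info <- my_cars %>%
  select(id) %>%
  mutate(color = "purple", year = 1974)

# Hypothetical lookup table with one duplicated id, so the join adds a row
cars_info_dup <- bind_rows(cars_info, cars_info[1, ])

# Join then check: the row-count check should now fail and stop the pipeline
check_failed <- tryCatch({
  my_cars %>%
    left_join(cars_info_dup, by = "id") %>%
    verify(nrow(.) == nrow(my_cars))
  FALSE  # reached only if verify() let the data through
}, error = function(e) TRUE)

check_failed
```

If check_failed is TRUE, the join changed the row count and verify() halted the pipeline before any downstream step ran.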

@gregleleu

Hi,
Depending on your case, not duplicating rows means that var in df2 has no duplicates, so you could do:

df3 <- df %>%
   left_join(df2 %>% assert(is_uniq, var), by="var")

This will fail if df2 has duplicates, which would add more rows, and as a bonus it tells you where the duplicates are.
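To make this concrete, here is a small sketch with made-up data in which df2 repeats one key. The data frames and their x and y columns are assumptions for illustration only. The unchecked join silently returns more rows than df, while the checked version is expected to stop at the assert() step, assuming the default error function raises an R error.

```r
library(dplyr)
library(assertr)

# Hypothetical toy data: the key var == 2 appears twice in df2
df  <- data.frame(var = c(1, 2, 3), x = c("a", "b", "c"))
df2 <- data.frame(var = c(1, 2, 2), y = c("p", "q", "r"))

# Without a check, the join silently returns more rows than df has
joined <- left_join(df, df2, by = "var")
rows_added <- nrow(joined) > nrow(df)

# With the uniqueness check on df2, the pipeline stops before joining
join_failed <- tryCatch({
  df %>%
    left_join(df2 %>% assert(is_uniq, var), by = "var")
  FALSE  # reached only if assert() found no duplicates
}, error = function(e) TRUE)

c(rows_added = rows_added, join_failed = join_failed)
```

Note that this checks the cause (duplicate keys in df2) rather than the symptom (a changed row count), so it only covers joins where duplicated keys on the right-hand side are the concern.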
