Using arrow results in error "not really a table object" #473

Open
6 tasks done
DavZim opened this issue Apr 17, 2023 · 2 comments

Comments

@DavZim
Contributor

DavZim commented Apr 17, 2023

Prework

  • Read and agree to the code of conduct and contributing guidelines.
  • If there is already a relevant issue, whether open or closed, comment on the existing thread instead of posting a new issue.
  • Post a minimal reproducible example so the maintainer can troubleshoot the problems you identify. A reproducible example is:
    • Runnable: post enough R code and data so any onlooker can create the error on their own computer.
    • Minimal: reduce runtime wherever possible and remove complicated details that are irrelevant to the issue at hand.
    • Readable: format your code according to the tidyverse style guide.

Description

When interrogating an agent whose target table is an arrow object (an arrow Table or Dataset), I get the following error: "The 'table' in this validation step is not really a table object."


When I convert the arrow table to a data.frame first, pointblank works as expected:

create_agent(as.data.frame(df)) |> # NOTE the as.data.frame here
  col_is_numeric(vars(x)) |> 
  interrogate()

#> ── Interrogation Started - there is a single validation step ──────────────────────────────────────────────── 
#> ✔ Step 1: OK.
#> ── Interrogation Completed ──────────────────────────────────────────────────────────────────────────────────
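
The same workaround can be wrapped in a small helper so that Tables and Datasets are handled in one place. This is only a sketch: `as_agent_input()` is a hypothetical name, not a pointblank or arrow function, and it assumes the arrow class names `ArrowTabular`, `Dataset`, and `arrow_dplyr_query` together with arrow's `dplyr::collect()` methods.

```r
# Hypothetical helper (not part of pointblank or arrow): collect arrow inputs
# into memory so that create_agent() receives an ordinary data frame.
as_agent_input <- function(tbl) {
  # Assumed class names: Tables and RecordBatches inherit from "ArrowTabular",
  # on-disk datasets from "Dataset", lazy queries are "arrow_dplyr_query".
  arrow_classes <- c("ArrowTabular", "Dataset", "arrow_dplyr_query")
  if (inherits(tbl, arrow_classes)) {
    # dplyr::collect() materializes the data; everything is loaded into RAM.
    return(as.data.frame(dplyr::collect(tbl)))
  }
  tbl  # data frames and all other inputs pass through unchanged
}

# Plain data frames are returned untouched.
plain <- data.frame(x = 1:3, y = c("a", "b", "c"))
identical(as_agent_input(plain), plain)
```

With this in place, `create_agent(as_agent_input(tbl))` could be used for any of these inputs. Collecting pulls the whole table into memory, so this may not be practical for large Datasets.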

Reproducible example

library(pointblank)
library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp


df <- arrow_table(x = 1:3, y = c("a", "b", "c"))

agent <- create_agent(df) |> 
  col_is_numeric(vars(x))
agent |> get_agent_report(display_table = FALSE)
#> # A tibble: 1 × 14
#>       i type  columns values precon active eval  units n_pass f_pass W     S    
#>   <int> <chr> <chr>   <chr>  <chr>  <lgl>  <chr> <int>  <int>  <dbl> <lgl> <lgl>
#> 1     1 col_… x       <NA>   <NA>   NA     <NA>     NA     NA     NA NA    NA   
#> # … with 2 more variables: N <lgl>, extract <lgl>

agent |> interrogate() |> get_agent_report(display_table = FALSE)
#> # A tibble: 1 × 14
#>       i type  columns values precon active eval  units n_pass f_pass W     S    
#>   <int> <chr> <chr>   <chr>  <chr>  <lgl>  <chr> <int>  <int>  <dbl> <lgl> <lgl>
#> 1     1 col_… x       <NA>   <NA>   TRUE   ERROR    NA     NA     NA NA    NA   
#> # … with 2 more variables: N <lgl>, extract <int>

# repeat with an arrow Dataset opened from disk --------------------
write_dataset(df, "arrow-dataset")
ds <- open_dataset("arrow-dataset")

agent <- create_agent(ds) |> 
  col_is_numeric(vars(x))
agent |> get_agent_report(display_table = FALSE)
#> # A tibble: 1 × 14
#>       i type  columns values precon active eval  units n_pass f_pass W     S    
#>   <int> <chr> <chr>   <chr>  <chr>  <lgl>  <chr> <int>  <int>  <dbl> <lgl> <lgl>
#> 1     1 col_… x       <NA>   <NA>   NA     <NA>     NA     NA     NA NA    NA   
#> # … with 2 more variables: N <lgl>, extract <lgl>

agent |> interrogate() |> get_agent_report(display_table = FALSE)
#> # A tibble: 1 × 14
#>       i type  columns values precon active eval  units n_pass f_pass W     S    
#>   <int> <chr> <chr>   <chr>  <chr>  <lgl>  <chr> <int>  <int>  <dbl> <lgl> <lgl>
#> 1     1 col_… x       <NA>   <NA>   TRUE   ERROR    NA     NA     NA NA    NA   
#> # … with 2 more variables: N <lgl>, extract <int>

Created on 2023-04-17 with reprex v2.0.2

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.1 (2022-06-23)
#>  os       Ubuntu 18.04.6 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Etc/UTC
#>  date     2023-04-17
#>  pandoc   2.18 @ /usr/lib/rstudio-server/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version  date (UTC) lib source
#>  arrow       * 11.0.0.3 2023-03-08 [1] RSPM (R 4.2.1)
#>  assertthat    0.2.1    2019-03-21 [1] RSPM (R 4.2.1)
#>  bit           4.0.4    2020-08-04 [1] RSPM (R 4.2.1)
#>  bit64         4.0.5    2020-08-30 [1] RSPM (R 4.2.1)
#>  blastula      0.3.3    2023-01-07 [1] RSPM (R 4.2.1)
#>  cli           3.6.0    2023-01-09 [1] RSPM (R 4.2.1)
#>  digest        0.6.31   2022-12-11 [1] RSPM (R 4.2.1)
#>  dplyr         1.1.1    2023-03-22 [1] RSPM (R 4.2.1)
#>  evaluate      0.16     2022-08-09 [1] RSPM (R 4.2.1)
#>  fansi         1.0.3    2022-03-24 [1] RSPM (R 4.2.1)
#>  fastmap       1.1.1    2023-02-24 [1] RSPM (R 4.2.1)
#>  fs            1.6.1    2023-02-06 [1] RSPM (R 4.2.1)
#>  generics      0.1.3    2022-07-05 [1] RSPM (R 4.2.1)
#>  glue          1.6.2    2022-02-24 [1] RSPM (R 4.2.1)
#>  htmltools     0.5.4    2022-12-07 [1] RSPM (R 4.2.1)
#>  knitr         1.42     2023-01-25 [1] RSPM (R 4.2.1)
#>  lifecycle     1.0.3    2022-10-07 [1] RSPM (R 4.2.1)
#>  magrittr      2.0.3    2022-03-30 [1] RSPM (R 4.2.1)
#>  pillar        1.8.1    2022-08-19 [1] RSPM (R 4.2.1)
#>  pkgconfig     2.0.3    2019-09-22 [1] RSPM (R 4.2.1)
#>  pointblank  * 0.11.3   2023-02-09 [1] RSPM (R 4.2.1)
#>  purrr         1.0.1    2023-01-10 [1] RSPM (R 4.2.1)
#>  R6            2.5.1    2021-08-19 [1] RSPM (R 4.2.1)
#>  reprex        2.0.2    2022-08-17 [2] RSPM (R 4.2.1)
#>  rlang         1.1.0    2023-03-14 [1] RSPM (R 4.2.1)
#>  rmarkdown     2.16     2022-08-24 [1] RSPM (R 4.2.1)
#>  rstudioapi    0.14     2022-08-22 [2] RSPM (R 4.2.1)
#>  sessioninfo   1.2.2    2021-12-06 [1] RSPM (R 4.2.1)
#>  tibble        3.2.1    2023-03-20 [1] RSPM (R 4.2.1)
#>  tidyselect    1.2.0    2022-10-10 [1] RSPM (R 4.2.1)
#>  utf8          1.2.2    2021-07-24 [1] RSPM (R 4.2.1)
#>  vctrs         0.6.1    2023-03-22 [1] RSPM (R 4.2.1)
#>  withr         2.5.0    2022-03-03 [2] RSPM (R 4.2.1)
#>  xfun          0.38     2023-03-24 [1] RSPM (R 4.2.1)
#>  yaml          2.3.7    2023-01-23 [1] RSPM (R 4.2.1)
#> 
#>  [1] /home/NAME/R/x86_64-pc-linux-gnu-library/4.2
#>  [2] /usr/r-library/admin-library/4.2
#>  [3] /opt/R/4.2.1/lib/R/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────
@rich-iannone
Member

Thanks for reporting this and providing a lot of details! This is definitely not right and requires a fix.

@dragosmg

FWIW, I see this more as a feature request than a bug. Think of arrow as another backend. I think the error message is informative (though not perfect). An arrow dataset is neither a data.frame nor a database table, so I would expect the current approach not to work. I couldn't find a claim in the {pointblank} documentation that the tbl argument of create_agent() can be an arrow::Table.

As a first suggestion, it would be great to have the documentation of the supported backends in a more prominent location (e.g. a paragraph on the {pkgdown} site).

A second suggestion: as a first step, maybe error with a clear message that arrow tables (or datasets, etc.) are not (yet) supported, and open a follow-up issue to implement support for arrow inputs? (I have done some work on {arrow} in the past, and I think this might not be a trivial endeavour.)
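
To make the second suggestion concrete, an early input check might look roughly like the sketch below. `stop_if_arrow()` is a hypothetical name and does not reflect how pointblank validates its `tbl` argument; the class names it checks are assumptions about how arrow objects identify themselves.

```r
# Hypothetical guard, not pointblank's actual implementation: fail early with
# an explicit message when an arrow object is passed in as the target table.
stop_if_arrow <- function(tbl) {
  # Assumed class names for arrow Tables/RecordBatches, Datasets, and queries.
  arrow_classes <- c("ArrowTabular", "Dataset", "arrow_dplyr_query")
  if (inherits(tbl, arrow_classes)) {
    stop(
      "Arrow inputs (class '", class(tbl)[1], "') are not yet supported. ",
      "Collect the data into a data frame first, e.g. with `dplyr::collect()`.",
      call. = FALSE
    )
  }
  invisible(tbl)  # supported inputs are returned unchanged
}

# A stand-in object carrying arrow-like classes, so the sketch runs without arrow.
fake_arrow <- structure(list(), class = c("Table", "ArrowTabular"))
msg <- tryCatch(stop_if_arrow(fake_arrow), error = function(e) conditionMessage(e))
print(msg)
```

Checking by class keeps the guard independent of whether {arrow} is installed, at the cost of relying on class names that could change between arrow releases.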

(by the way, thanks a lot for the great package and for the R in Pharma workshop)

@rich-iannone rich-iannone modified the milestones: v0.12.0, v0.13.0 Feb 20, 2024