Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CKANConnection uses an old dbplyr interface #187

Open
fjuniorr opened this issue Oct 21, 2022 · 12 comments · May be fixed by #211
Open

CKANConnection uses an old dbplyr interface #187

fjuniorr opened this issue Oct 21, 2022 · 12 comments · May be fixed by #211
Milestone

Comments

@fjuniorr
Copy link
Contributor

fjuniorr commented Oct 21, 2022

Here is the message when we connect to a CKAN site with DataStore enabled:

library("ckanr"); library("dplyr")
ckan <- src_ckan("https://dados.mg.gov.br/")
#> url: https://dados.mg.gov.br/

resource <- tbl(src = ckan, name = "72d031e9-2753-469a-acfa-2d67417a2f49")
#> Warning: <CKANConnection> uses an old dbplyr interface
#> ℹ Please install a newer version of the package or contact the maintainer
#> This warning is displayed once every 8 hours.

According to Hadley this should be a simple fix and the details are available at dbplyr 2.0.0 backend API • dbplyr.

Created on 2022-10-21 with reprex v2.0.2

Session Info and Traceback
sessionInfo()
#> R version 4.2.1 (2022-06-23)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur ... 10.16
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] dplyr_1.0.10 ckanr_0.6.0  DBI_1.1.3   
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.9       compiler_4.2.1   pillar_1.8.1     dbplyr_2.2.1    
#>  [5] highr_0.9        tools_4.2.1      digest_0.6.29    jsonlite_1.8.1  
#>  [9] evaluate_0.16    lifecycle_1.0.2  tibble_3.1.8     pkgconfig_2.0.3 
#> [13] rlang_1.0.6      reprex_2.0.2     cli_3.4.1        rstudioapi_0.14 
#> [17] crul_1.3         curl_4.3.2       yaml_2.3.5       xfun_0.33       
#> [21] fastmap_1.1.0    withr_2.5.0      stringr_1.4.1    knitr_1.40      
#> [25] generics_0.1.3   fs_1.5.2         vctrs_0.4.2      triebeard_0.3.0 
#> [29] tidyselect_1.1.2 glue_1.6.2       httpcode_0.3.0   R6_2.5.1        
#> [33] fansi_1.0.3      rmarkdown_2.16   purrr_0.3.4      magrittr_2.0.3  
#> [37] urltools_1.7.3   htmltools_0.5.3  assertthat_0.2.1 utf8_1.2.2      
#> [41] stringi_1.7.8

Created on 2022-10-21 with reprex v2.0.2

@fjuniorr fjuniorr modified the milestones: v0.7, v0.8 Mar 12, 2023
@rdenham
Copy link

rdenham commented May 26, 2024

With the release of dbplyr 2.5.0 this is now pretty urgent. Any plans to address this?

@florianm florianm linked a pull request May 26, 2024 that will close this issue
11 tasks
@florianm
Copy link
Contributor

@rdenham thanks for the ping! I've taken a quick stab at upgrading the dbplyr generics. Feel free to try out my branch (see PR) if you have time, any feedback appreciated.

@rdenham
Copy link

rdenham commented May 26, 2024

Thanks for your prompt reply @florianm !

The problem I have is with the dplyr interface, just like in the example at the start of this issue:

library("ckanr"); library("dplyr")
ckan <- src_ckan("https://dados.mg.gov.br/")

This will throw a slightly misleading error (see this issue) but basically, the dbplyr function src_sql is now fully deprecated, so I can't currently get the dplyr interface working.

My example url, in case it's useful, is:

ckan <- src_ckan("https://www.data.qld.gov.au/")

Please let me know if I've missed something.

@rdenham
Copy link

rdenham commented May 26, 2024

Just having a look around, it seems like https://github.com/r-dbi/bigrquery might provide me with some idea on how to upgrade this part to work with DBI; there are some similar ideas in there.

@florianm
Copy link
Contributor

florianm commented May 26, 2024

I have missed that one, I initially went by the dbplyr upgrade guide.
Sure enough, https://dbplyr.tidyverse.org/reference/src_sql.html says "Deprecated: please use directly use a DBIConnection object instead." and it's used in ckanr::src_ckan.

So we need to upgrade src_ckan to use https://dbi.r-dbi.org/reference/DBIConnection-class.html in R/dplyr.R:

src_ckan <- function(url) {
  if (!requireNamespace("dplyr", quietly = TRUE)) {
    stop("Please install dplyr", call. = FALSE)
  }
  drv <- new("CKANDriver")
  con <- dbConnect(drv, url = url)
  info <- dbGetInfo(con)
  src_sql("ckan", con, info = info) ## This line needs to be upgraded
}

Any ideas welcome! I'll set up a local CKAN test instance if I get a chance later. Greetings from Western Australia! (was formerly involved in data.wa.gov.au)

@rdenham
Copy link

rdenham commented May 26, 2024

Thanks, I'll spend a little bit of time on it later tonight. Shouldn't be hard, just I need to work out how it all fits together first :-).

Greetings from Western Australia!

And the same to you from Queensland!

@rdenham
Copy link

rdenham commented May 26, 2024

If I get this right, we should drop the approach of:

my_ckan <- src_ckan("http://demo.ckan.org")

and instead use something like:

con <- dbConnect(ckanr::ckan(), url = "https://www.data.qld.gov.au/")
tbl(con, "587f65ae-6675-4b8e-bac5-606ce7f4446a")

where ckanr::ckan() is just something like:

ckan <- function(url) {
  if (!requireNamespace("dplyr", quietly = TRUE)) {
    stop("Please install dplyr", call. = FALSE)
  }
  drv <- new("CKANDriver")
}

Then the nomal dbi stuff should work, and the dplyr interface will now just follow.

Here's my test:

## DBI approach
sql = 'select "Density class", "Full" from "587f65ae-6675-4b8e-bac5-606ce7f4446a" limit 5'
dbGetQuery(con, sql)

## dplyr approach
tbl(con, "587f65ae-6675-4b8e-bac5-606ce7f4446a") %>% 
  select(`Density class`, Full) 

I do get a Warning: Translator is missing window variants of the following aggregate functions: paste. Not sure how serious this is, and I think can be addressed using dbplyr::sql_paste or something.

@florianm
Copy link
Contributor

Thanks for that! I can get the connection to work but am running into various errors with the examples. Pending a closer look, this seems like we'd need to implement/change some of the generics and update examples and vignettes.

Examples of errors I'm seeing:

# Minimal working updated src_ckan
src_ckan <- function(url) {
  if (!requireNamespace("dplyr", quietly = TRUE)) {stop("Please install dplyr", call. = FALSE)}
  drv <- new("CKANDriver")
  dbConnect(drv, url = url)
}

my_ckan <- src_ckan("https://www.data.qld.gov.au/")

tbl2 <- tbl(src = my_ckan, name = "587f65ae-6675-4b8e-bac5-606ce7f4446a")
Error in tbl.DBIConnection(src = my_ckan, name = "587f65ae-6675-4b8e-bac5-606ce7f4446a") : 
  argument "from" is missing, with no default

> tbl1 <- tbl(src = my_ckan, from = "test", name = tbl_list[1])
Error in ds_search_sql(as.character(statement), url = conn@url, as = "table") : 
  {"help": "https://www.data.qld.gov.au/api/3/action/help_show?name=datastore_search_sql", "error": {"query": ["(psycopg2.errors.UndefinedTable) relation \"test\" does not exist\nLINE 2: FROM \"test\"\n             ^\n\n[SQL: SELECT * FROM (SELECT *\nFROM \"test\"\nWHERE (0 = 1)) AS blah LIMIT 32001 ;]\n(Background on this error at: https://sqlalche.me/e/14/f405)"], "info": {"statement": ["SELECT * FROM (SELECT *\nFROM \"test\"\nWHERE (0 = 1)) AS blah LIMIT 32001 ;"], "params": [{}], "orig": ["relation \"test\" does not exist\nLINE 2: FROM \"test\"\n             ^\n"]}, "__type": "Validation Error"}, "success": false}

I'll keep looking into this as bandwidth allows.

@rdenham
Copy link

rdenham commented May 27, 2024

my thoughts were to drop src_ckan, and just have something like ckan as the way to connect to the 'database', just like you would in other drivers for other DBI::dbConnect() examples. So instead of src_ckan, I suggest:

ckan <- function() {
  drv <- new("CKANDriver")
}

Then do the connection like any other DBI connection, ie

con <- dbConnect(ckanr::ckan(), url = "https://www.data.qld.gov.au/")

The src_ckan function is no longer needed as you can get the tbl directly from a DBI connection now. Like the help for dbplyr::src_sql says, please use directly use a DBIConnection object instead.

This evening I'll try to go through the dplyr.R code in ckanr to try to clarify, and drop the deprecated components. I'll share that and you can see if that's the approach you'd like to take.

@florianm
Copy link
Contributor

florianm commented May 27, 2024

I'm not a maintainer and only speaking as a contributor here, so this is just my opinion.
I like the approach of having just one function to create a connection. I assume we will have to adjust all examples, as the errors I'm seeing indicate that the DBI connection works differently to the R4 object returned by ckanr v0.7.0 (dbplyr v1). I may be wrong about this.
So the function signature could be

function_name_here <- function(url) {
  if (!requireNamespace("dplyr", quietly = TRUE)) {stop("Please install dplyr", call. = FALSE)}
  drv <- new("CKANDriver")
  DBI::dbConnect(drv, url = url)
}

If we retain the function name src_ckan:

  • We amend one existing and add no new function.
  • Existing scripts can keep at least this step - however, other steps may need to change. We may need to overload other generics like tbl.
  • No need to add a new function for the driver if we can define it inside src_ckan. This is assuming that this will work and we don't need to define the driver in a separate function.
  • If we factor out the driver to a function ckanr::ckan, it might be more expressive to name the driver returning function something indicating this functionality, e.g. ckanr::ckan_driver() or similar.

Pro your approach:

  • I like the function name "ckan" more than "src_ckan".
  • I would like the function "ckan" to return a DBI connection and take the URL as sole argument. This would make the usage more concise.
# Perfectly valid and idiomatic usage of DBI
con <- DBI::dbConnect(ckanr::ckan(), url = "https://www.data.qld.gov.au/")

# Less typing, easier to memorize
con <- ckanr::ckan("https://www.data.qld.gov.au/")

Pedantry aside, I can create a valid DCI connection to the QLD CKAN, but DBI::tbl throws the error that it requires a "from" parameter. What could I be doing wrong?

src_ckan <- function(url) {
  if (!requireNamespace("dplyr", quietly = TRUE)) {
    stop("Please install dplyr", call. = FALSE)
  }
  # drv <- new("CKANDriver")
  dbConnect(ckanr::ckan(), url = url)
}

#' @export
ckan <- function() {
  drv <- new("CKANDriver")
  drv
}
> my_ckan <- src_ckan("https://www.data.qld.gov.au/")
> tbl_list <- DBI::dbListTables(my_ckan, limit=5)

# The parameter "from" seems to be required but no example in ckanr shows it being used. Odd.
> tbl1 <- tbl(src = my_ckan, name = tbl_list[1])
Error in tbl.DBIConnection(src = my_ckan, name = tbl_list[1]) : 
  argument "from" is missing, with no default

# Let's try a "from".
> tbl1 <- tbl(src = my_ckan, from = "test", name = tbl_list[1])
Error in ds_search_sql(as.character(statement), url = conn@url, as = "table") : 
  {"help": "https://www.data.qld.gov.au/api/3/action/help_show?name=datastore_search_sql", "error": {"query": ["(psycopg2.errors.UndefinedTable) relation \"test\" does not exist\nLINE 2: FROM \"test\"\n             ^\n\n[SQL: SELECT * FROM (SELECT *\nFROM \"test\"\nWHERE (0 = 1)) AS blah LIMIT 32001 ;]\n(Background on this error at: https://sqlalche.me/e/14/f405)"], "info": {"statement": ["SELECT * FROM (SELECT *\nFROM \"test\"\nWHERE (0 = 1)) AS blah LIMIT 32001 ;"], "params": [{}], "orig": ["relation \"test\" does not exist\nLINE 2: FROM \"test\"\n             ^\n"]}, "__type": "Validation Error"}, "success": false}

# Using a valid table name as value for "from"
> tbl1 <- tbl(src = my_ckan, from = tbl_list[1], name = tbl_list[1])
Error in `db_query_fields.DBIConnection()`:
! Can't query fields.
Using SQL: _id
Using SQL: _full_text
Using SQL: Motor Accident Insurance Commission
Caused by error in `curl::curl_fetch_memory()`:
! URL rejected: Malformed input to a URL function

@rdenham
Copy link

rdenham commented May 27, 2024

Obviously, I'd defer to you and the package authors, but my reasoning for the

con <- DBI::dbConnect(ckanr::ckan(), url = "https://www.data.qld.gov.au/")

style is that it follows the standard approach of connecting to a backend, like

con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")

and similarly for SQLite, duckdb etc. So consistent and therefore intuitive for users. I think this would also solve your difficulty in using DBI::tbl, but I can't tell exactly why it doesn't work in your example. Maybe you'd like to check my fork? For me, that should be able to be used like:

library(dbplyr)
library(dplyr)
library(ckanr)

drv = ckan()
dbGetInfo(drv)
summary(drv)
dbUnloadDriver(drv) # not sure what this should do

con <- dbConnect(drv, url = "https://www.data.qld.gov.au/")

dbDisconnect(con) # not sure what this should do 


dbGetInfo(con) # shows url

sql = 'select "Density class", "Full" from "587f65ae-6675-4b8e-bac5-606ce7f4446a" limit 5'
rs <- dbSendQuery(con, sql)
dbFetch(rs)
dbGetQuery(con, sql)

dbListTables(con, limit =5)

res = dbReadTable(con, '587f65ae-6675-4b8e-bac5-606ce7f4446a')
class(res)

# doesn't work at the moment
#dbExistsTable(con, '587f65ae-6675-4b8e-bac5-606ce7f4446a')

sql = 'select "Density class", "Full" from "587f65ae-6675-4b8e-bac5-606ce7f4446a" limit 5'
rs <- dbSendQuery(con, sql)

dbFetch(rs)

dbListFields(rs)

## dplyr interface

tab1 <- tbl(con, "587f65ae-6675-4b8e-bac5-606ce7f4446a")

# check translations
# explain won't wok
try(
tab1 %>% 
  summarise(sdf = sd(Full)) %>% 
  explain())

# but show query does
tab1 %>% 
  summarise(sdf = sd(Full)) %>% 
  show_query()

# paste 
tab1 %>% 
  mutate(x = paste(`_id`, `Monitoring period`)) %>% 
  show_query()

# count
tab1 %>% 
  summarise(nobs=n()) %>% 
  show_query()

tab1 %>% 
  summarise(xc=cor(Full)) %>% 
  show_query()

# not sure that's what we want
tab1 %>% 
  summarise(xp=paste(`Density class`)) %>% 
  show_query()


tab1 %>% 
  mutate(x = paste(`_id`, `Monitoring period`)) %>% 
  select(x)


tab1 %>% 
  select(-`_full_text`)

I think that most of the code in dplyr.R is no longer needed, so I've simplified that a lot, but you might like to check that a bit more thoroughly.

I notice that the vignette doesn't have much in it on the dplyr methods, so it shouldn't be hard to update, and perhaps expand on this. I'm happy to do this if you like.

@florianm
Copy link
Contributor

This looks great, I'll try it out!

IMHO it would be beneficial to replace ckanr wrappers with idiomatic DBI / dplyr / dbplyr code and examples (vignette?). The only purpose of ckanr wrappers would be to handle CKAN datastore API behaviour or to simplify code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants