Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dplyr interface of ds_search_sql #48

Open
wush978 opened this issue Aug 26, 2015 · 32 comments
Open

dplyr interface of ds_search_sql #48

wush978 opened this issue Aug 26, 2015 · 32 comments

Comments

@wush978
Copy link
Contributor

wush978 commented Aug 26, 2015

Dear @sckott ,

I think it is possible to implement a subset of dplyr interface for ckanr. I guess we need to use the engine in dplyr to generate the SQL statement and submit the statement to ckan server via ds_search_sql. Does this feature fit this package?

@sckott
Copy link
Contributor

sckott commented Aug 26, 2015

Possibly. Since we don't use dplyr already, I think it would best in Suggests, so that users that don't need it don't have to install it since it is a big dependency.

I don't yet see how it would work. Can you give an example? with pseudocode

@wush978
Copy link
Contributor Author

wush978 commented Aug 26, 2015

Sure.
Moreover, I'll directly implement a prototype. Any progress will be updated here.

@sckott
Copy link
Contributor

sckott commented Aug 26, 2015

thanks

wush978 added a commit to wush978/ckanr that referenced this issue Aug 28, 2015
wush978 added a commit to wush978/ckanr that referenced this issue Aug 28, 2015
@wush978
Copy link
Contributor Author

wush978 commented Aug 28, 2015

Oops, I guess I need to push after finishing the prototype. Sorry!

@wush978
Copy link
Contributor Author

wush978 commented Aug 30, 2015

Dear @sckott ,

master...wush978:dev/dplyr shows my idea of implementing interface of dplyr. There are two files: dbi.R implements the DBI interface and dplyr.R implements the dplyr's sql interface according to the DBI interface.

This is not a complete implementation, so only the following script works.

library(ckanr)
url <- 'http://demo.ckan.org/'
sql <- 'SELECT * from "f4129802-22aa-4437-b9f9-8a8f3b7b2a53" LIMIT 2'
src <- src_ckan(url)
tb <- tbl(src, sql(sql))
# Then we could do something like this:
# dplyr::filter(tb, ...) %>% compute

If you think this approach is OK, then I will keep implementing the complete interface of dplyr.

@sckott
Copy link
Contributor

sckott commented Sep 8, 2015

@wush978 Sorry, just got back from vacation. Looks good.

If you think this approach is OK, then I will keep implementing the complete interface of dplyr.

What is the rest of the interface?

@wush978
Copy link
Contributor Author

wush978 commented Sep 9, 2015

@sckott

What is the rest of the interface?

I only implement a minimal example to demonstrate my idea of using DBI to link ds_search_sql to dplyr. Currently, all the following basic verbs in dplyr do not work.

  • select
  • filter
  • mutate
  • arrange
  • group_by
  • summarise

And some verbs of joining...

@sckott
Copy link
Contributor

sckott commented Sep 9, 2015

@wush978 what I'm wondering about is if you have to write versions of select, filter, etc. verbs in this pkg, or if we can just use dplyr itself?

@wush978
Copy link
Contributor Author

wush978 commented Sep 9, 2015

We use these verbs directly, but we need to guide dplyr how to use the ds_search_sql to submit the SQL statement. That is to say, we do not need to implement these verbs. We override some internal behavior inside dplyr according to https://cran.r-project.org/web/packages/dplyr/vignettes/new-sql-backend.html.

@sckott
Copy link
Contributor

sckott commented Sep 9, 2015

Okay, sounds great!

@sckott
Copy link
Contributor

sckott commented Feb 6, 2016

Any progress on this @wush978

@wush978
Copy link
Contributor Author

wush978 commented Feb 6, 2016

Thanks for reminding me. I did forget this issue.

I'll try to do some progress during Chinese New Year.

@sckott
Copy link
Contributor

sckott commented Feb 7, 2016

thanks!

@wush978
Copy link
Contributor Author

wush978 commented Feb 13, 2016

Hi @sckott ,

Today I added some unit tests (12cb680). The basic verbs: filter, arrange, select are workable now. The distinct is failed and I need to check if it is workable with PostgreSQL which is the backend of the CKAN server.

By the way, the non-ascii column name will make the server return http 500, so I only test these features with ascii encoded tables.

And I temporarily make ckanr depend on DBI for simplicity. Otherwise, I need to import all S4 classes and methods from DBI. How do you think about making the package depend on DBI?

@sckott
Copy link
Contributor

sckott commented Feb 13, 2016

hmm, that commit links gives a 404, where should that point to?

on DBI - perhaps. Definitely worth trying, haven't thought about yet if it's not worth it, but I'm guessing it will be fine

@wush978
Copy link
Contributor Author

wush978 commented Feb 14, 2016

Oops, the link should be: wush978@12cb680

@sckott
Copy link
Contributor

sckott commented Feb 18, 2016

@wush978 I had a look, looks very nice. what is the CKAN instance you used for testing?

@wush978
Copy link
Contributor Author

wush978 commented Feb 20, 2016

I am using http://demo.ckan.org to test

@sckott
Copy link
Contributor

sckott commented Feb 20, 2016

thanks

@sckott
Copy link
Contributor

sckott commented Feb 24, 2016

@wush978 what versions of dplyr/DBI do you have? I'm getting errors trying to run your tests with your branch

@sckott
Copy link
Contributor

sckott commented Feb 24, 2016

e.g,.

tb <- tbl(src, name = name_list[3])
#>  Error in mget(plabels[hasSubclass], env) : invalid first argument

@wush978
Copy link
Contributor Author

wush978 commented Feb 27, 2016

@sckott

Here it is:

> packageVersion("dplyr")
[1] ‘0.4.3’
> packageVersion("DBI")
[1] ‘0.3.1’
> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=zh_TW.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=zh_TW.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=zh_TW.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=zh_TW.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.4.3      ckanr_0.1.0.9190 DBI_0.3.1        testthat_0.10.0 

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.2      digest_0.6.8     crayon_1.3.1     assertthat_0.1   R6_2.1.1         jsonlite_0.9.17  magrittr_1.5    
 [8] httr_1.0.0       stringi_1.0-1    curl_0.9.3       rstudioapi_0.3.1 tools_3.2.3      stringr_1.0.0    parallel_3.2.3  
[15] memoise_0.2.1   

@wush978
Copy link
Contributor Author

wush978 commented Feb 27, 2016

@sckott ,

The unit tests for basic verbs are updated and I also tried left_join.

Hope they work in your computer.

@sckott
Copy link
Contributor

sckott commented Feb 29, 2016

Okay, looks good. I had a dev version of DBI ahead of what you had, works good. now
I do get an error on https://github.com/wush978/ckanr/blob/dev/dplyr/tests/testthat/test-dplyr.R#L94 but if I changed to the 2nd element name_list[2] it worked, but aren't data always changing on demo.ckan.org?

Do you think the implementation is nearly complete now?

@wush978
Copy link
Contributor Author

wush978 commented Mar 1, 2016

Yes, it is nearly complete.

We just need a stable testing environment. Could you give me any suggestion?

And maybe we need a short example to demonstrate this feature. Do you have any idea of where to put the examples?

@sckott
Copy link
Contributor

sckott commented Mar 1, 2016

We just need a stable testing environment. Could you give me any suggestion?

There's many CKAN instances. Not sure which is best, but I imagine there's many that have good up time, and don't have transient datasets in them. See servers(). Perhaps try https://data.noaa.gov/dataset

And maybe we need a short example to demonstrate this feature. Do you have any idea of where to put the examples?

You could create a man file with an appropriate name and include examples in that. e..g., like

#' foo bar foo bar
#' ....... more text .....
#' @name dplyr-interface
NULL

@wush978
Copy link
Contributor Author

wush978 commented Mar 1, 2016

Thanks.

I'll implement them on the weekend and send a PR.

@wush978
Copy link
Contributor Author

wush978 commented Mar 7, 2016

@sckott ,

Sorry that I cannot find time to write some code on the last weekend. Do you have any schedule of this issue? Please let me know if we need to be hurry or I'll try to find sometimes on the next weekend to do this.

@sckott
Copy link
Contributor

sckott commented Mar 7, 2016

No rush on this. And thanks for all your work on it

@wush978 wush978 mentioned this issue Mar 13, 2016
@wush978
Copy link
Contributor Author

wush978 commented Mar 13, 2016

You're welcome. I'm glad to have a chance to contribute.

@florianm
Copy link
Contributor

@wush978 and @sckott - your work has saved my Department a ton of headache!

We have moved behind a Microsoft single-sign-on authentication wall, and although we can make the CKAN API accessible to our intranet (SSO exempt), the resource URLs can't be easily made SSO exempt. This means that we cannot access our data on CKAN via read.table("ckan resource url")!
However, since dplyr can access the datastore directly:

ckan <- ckanr::src_ckan("https://data.dpaw.wa.gov.au/")
dplyr::tbl(src = ckan$con, from = res_id) %>% as_tibble(.)

All we need is for our data to be a clean CSV that uploads into the datastore.
Thought you should know that your work has made our lives heaps easier. Much appreciated!

@sckott
Copy link
Contributor

sckott commented Jun 27, 2017

nice!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants