Make data repository registration a public API #351

Open
dokempf opened this issue Mar 10, 2023 · 2 comments

@dokempf
Contributor

dokempf commented Mar 10, 2023

Description of the desired feature:

Currently, the list of available data repository integrations is hard-coded in the code base. It would be a rather small change to expose it through a public API. This would allow power users (most likely library developers) to add their own data repository implementations to pooch. I see this as an interesting option for supporting domain-specific data repositories, which one may not want to support directly in the domain-agnostic pooch library.

Are you willing to help implement and maintain this feature?

Yes, with no specific timeline for the implementation.

@dokempf dokempf added the enhancement Idea or request for a new feature label Mar 10, 2023
@leouieda
Member

@dokempf how do you envision this looking as a public API? Would it be in the form of an argument to the DOIDownloader, for example?

@dokempf
Contributor Author

dokempf commented Jun 22, 2023

Hey @leouieda, one of my main priorities with this would be that the interface is accessible to the end user (as opposed to only a library maintainer). Otherwise, library maintainers would limit the set of data repositories their users can use. From my understanding, this rules out the approach of customizing the downloader class.

I was thinking more of a registration mechanism. This can be done either explicitly, like

pooch.register_data_repository(MyCustomRepositoryClass)

or

@pooch.register_data_repository
class MyCustomRepositoryClass(DataRepository):
    ...

or more implicitly, like:

# within the pooch core implementation:
chain_of_responsibility = DataRepository.__subclasses__()

or

# PRO: Allows control of where to insert in chain

# within the pooch core implementation:
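class DataRepository:
    _chain = []

    def __init_subclass__(cls, prepend=False, **kwargs):
        super().__init_subclass__(**kwargs)
        if prepend:
            # Insert at the front so this class is tried first
            DataRepository._chain.insert(0, cls)
        else:
            # Default: add to the end of the chain
            DataRepository._chain.append(cls)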
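To illustrate how such a chain might be consulted, here is a minimal, self-contained sketch. It does not use pooch itself: the `initialize` classmethod, the `resolve` helper, and the two example subclasses are hypothetical names chosen for illustration, and `prepend=True` is assumed to mean "try this class first".

```python
class DataRepository:
    """Hypothetical base class for this sketch, not pooch's implementation."""

    _chain = []

    def __init_subclass__(cls, prepend=False, **kwargs):
        super().__init_subclass__(**kwargs)
        if prepend:
            DataRepository._chain.insert(0, cls)
        else:
            DataRepository._chain.append(cls)

    @classmethod
    def initialize(cls, doi, archive_url):
        """Return an instance if this class handles the URL, else None."""
        return None


class GenericRepository(DataRepository):
    """Fallback that accepts every URL (illustrative only)."""

    @classmethod
    def initialize(cls, doi, archive_url):
        return cls()


class PangaeaRepository(DataRepository, prepend=True):
    """Domain-specific class that must be tried before the fallback."""

    @classmethod
    def initialize(cls, doi, archive_url):
        return cls() if "pangaea.de" in archive_url else None


def resolve(doi, archive_url):
    """Walk the chain and return the first repository that accepts the URL."""
    for repository_class in DataRepository._chain:
        repository = repository_class.initialize(doi, archive_url)
        if repository is not None:
            return repository
    raise ValueError(f"No registered repository handles {archive_url}")
```

Because `PangaeaRepository` is declared with `prepend=True`, it is tried before `GenericRepository` even though it is defined later.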

I think my personal preference is the first and very simple register function.

To give you a bit of background on my motivation: I want to advertise pooch in the future for reproducing computations from DOIs. To do that, it should support as many data repositories as possible (both generic and domain-specific ones). I understand that the pooch core can only support a limited number of data repositories. Therefore, I think users should be able to contribute data repository implementations to a separate project that has weaker stability guarantees. As a proof of concept, I have implemented such a project (https://github.com/dokempf/pooch_repositories). It adds a "meta-repository" that accesses https://re3data.org to dispatch faster to, e.g., DataVerse repositories, and adds partial support for https://pangaea.de/
