Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow running pooch on air-gapped systems #352

Open
dokempf opened this issue Mar 10, 2023 · 5 comments
Open

Allow running pooch on air-gapped systems #352

dokempf opened this issue Mar 10, 2023 · 5 comments
Labels
enhancement Idea or request for a new feature

Comments

@dokempf
Copy link
Contributor

dokempf commented Mar 10, 2023

Description of the desired feature:

When building Python libraries (as opposed to Python applications) that use Pooch for data downloading, we currently have to accept the fact that our software will not be usable in scenarios without network access (either due to temporary inavailability or due to the running environment being air-gapped). I think pooch itself could mitigate this risk by doing the following things:

  • Make it easy to fetch all files from a Poochs registry (e.g. by allowing fetch() without arguments)
  • Explicitly document how
    • library developers can use the above to create a "data prefetch CLI entrypoint"
    • users can relocate the cache directory onto an air-gapped system

A challenge that would require some discussion is to also make other API requests (e.g. in DOI resolution) obsolete by caching the responses in a relocatable location.

Are you willing to help implement and maintain this feature?

Yes, with no specific timeline for the implementation.

@dokempf dokempf added the enhancement Idea or request for a new feature label Mar 10, 2023
@leouieda
Copy link
Member

@dokempf allowing users to configure the cache location is already implemented through environment variables that package developers can enable. Having a "fetch all" function could also be easily made by package developers with 2 lines of code:

for fname in POOCH_INSTANCE.registry:
    POOCH_INSTANCE.fetch(fname)

Having such a function in Pooch itself would probably not help much since package developers would have to implement it on their end as well since users are not even meant to know about Pooch.

So maybe this is already resolved?

@dokempf
Copy link
Contributor Author

dokempf commented Jun 22, 2023

You are right that the "fetch all" thing is minor and can easily be implemented by a loop. What does not work air-gapped though is DOI resolution. This requires requests caching through e.g. https://github.com/requests-cache/requests-cache. I do have a partial implementation of this that I could finish now that this upstream issue is fixed.

@leouieda
Copy link
Member

But if the data have already been fetched, we wouldn't need to resolve the DOI. Caching the response opens up a load of potential issues with cache invalidation, which I'd rather avoid. Pooch is meant to be relatively simple and straight forward. So maybe not worth the maintenance burden for a somewhat niche application?

@dokempf
Copy link
Contributor Author

dokempf commented Jun 22, 2023

Currently, that only works when the registry is given explicitly. If it is populated from the data repository, it does not work offline, because it needs to do DOI resolution to learn what files there are. Could there be an easy fix, where the registry file is stored in the cache for this scenario?

@leouieda
Copy link
Member

Could there be an easy fix, where the registry file is stored in the cache for this scenario?

Again, this opens up a lot of issues around validating the cached registry. These things have a tendency to hide non-obvious bugs and we've been bitten by them before.

In this case, users could implement the caching of the registry with a few lines of code:

import json

POOCH = pooch.Pooch(...)
try:
    POOCH.load_registry_from_doi()
    with open("registry.json", "w") as output:
        json.dump(POOCH.registry, output)
except:
    if os.path.exists("registry.json"):
        with open("registry.json") as input:     
            POOCH.registry = json.load(input)
    else:
        raise ...

It's an easy fix from the user side but implementing this into Pooch would mean designing an API around all of this that wouldn't break any existing code out there. With Pooch being pulled by scipy and scikit-image, if we push a release that breaks compatibility in any way, it tends to be messy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Idea or request for a new feature
Projects
None yet
Development

No branches or pull requests

2 participants