Improve download stability #19

Open
aazuspan opened this issue Sep 10, 2021 · 4 comments
Labels
enhancement New feature or request

Comments

@aazuspan
Owner

The current download system is pretty solid with automated retrying, but the cdsapi package has a more extensive system that should improve download stability. See their implementation for reference.
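
For context, the kind of retry loop being discussed can be sketched roughly as below. This is a minimal illustration with exponential backoff, not the cdsapi implementation and not this project's current code; `retry_with_backoff` and all of its parameters are hypothetical names chosen for the example.

```python
import time


def retry_with_backoff(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds or max_attempts is reached.

    Hypothetical helper: waits base_delay * 2**attempt seconds between tries.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            # Re-raise once the final attempt has failed.
            if attempt == max_attempts - 1:
                raise
            # Double the wait after each failed attempt.
            sleep(base_delay * 2 ** attempt)
```

A real version would probably catch only network-related exceptions and cap the maximum delay; both are left out here to keep the sketch short.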

@aazuspan
Owner Author

aazuspan commented Mar 8, 2023

There are a few recent additions to the EE API that may make this easier.

  1. ee.Image.getDownloadURL now accepts a format parameter so we no longer have to deal with zipped GeoTiffs.
  2. ee.data.computePixels allows image downloads without the intermediate URL generation step. I'm not sure what the performance implications are, but it should at least simplify if not speed up downloads. This method seems to accept the same parameters and be subject to the same size restrictions as getDownloadURL. I believe this was previously REST API only, but is now available through the Python client API.
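
As a rough sketch of the first point, requesting an unzipped GeoTIFF might look like the following. The helper name `geotiff_download_url` and the default CRS are assumptions made for illustration; `image` is expected to be an `ee.Image`, and Earth Engine is assumed to be initialized by the caller.

```python
def geotiff_download_url(image, region, scale, crs="EPSG:4326"):
    """Return a download URL for an unzipped GeoTIFF of an ee.Image.

    Hypothetical helper: `region` is an ee.Geometry (or GeoJSON) and
    `scale` is the pixel size in the units of `crs`.
    """
    params = {
        "region": region,
        "scale": scale,
        "crs": crs,
        # Without this key, getDownloadURL defaults to a zipped GeoTIFF.
        "format": "GEO_TIFF",
    }
    return image.getDownloadURL(params)
```

The returned URL can then be fetched with whatever HTTP client the download system uses, with no unzipping step afterward.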

@aazuspan
Owner Author

Just ran some benchmarks on download speed and fsspec seems to have a big advantage over the current requests system. It can also handle concurrent downloads out of the box. That may be useful, but unfortunately I don't think it will be enough to let us drop joblib as a dependency since we'll still need that for grabbing URLs.

I don't love the idea of adding a new dependency, but if it can reduce download times substantially and simplify the download system, I think it's worth adding fsspec.

@aazuspan
Owner Author

With ee.data.computePixels now available in the Python API (as of 2023-02-15), that will probably be the most straightforward way to grab image data.

It has the same size limitation as other methods, but allows data to be retrieved directly rather than through an intermediate URL, which should be a win for performance, simplicity, and reliability. It would also let us avoid adding fsspec and probably drop requests as a dependency.

I need to do some benchmarking to make sure there are no downsides, but at the moment this looks like the way to go. Note that as with all direct GEO_TIFF format downloads, it does not currently export band names, which means we unfortunately have to grab them manually with getInfo.
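
A minimal sketch of that approach, assuming the image has already been clipped and scaled to fit within the size limit: `compute_geotiff` is a hypothetical helper, and the `compute_pixels` argument exists only so the function can be exercised with a stub instead of a live Earth Engine session.

```python
def compute_geotiff(image, compute_pixels=None):
    """Return (geotiff_bytes, band_names) for an ee.Image.

    Hypothetical helper; assumes Earth Engine is initialized and the image
    is already bounded and scaled so the request fits the size limit.
    """
    if compute_pixels is None:
        import ee  # imported lazily so the sketch can be tested with a stub

        compute_pixels = ee.data.computePixels

    # GEO_TIFF output does not carry band names, so fetch them separately.
    band_names = image.bandNames().getInfo()
    data = compute_pixels({"expression": image, "fileFormat": "GEO_TIFF"})
    return data, band_names
```

The returned bytes could be written straight to a `.tif` file, with the band names reattached when the file is opened.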

@aazuspan
Owner Author

A quick-and-dirty benchmark test says computePixels is noticeably faster than downloading with fsspec and the current requests implementation, even for a single image where you have to grab bandNames. With more images, that improvement should scale since bandNames will only need to be retrieved once.

Time to download a single-band GridMET image at native resolution:

| Method | Time |
| --- | --- |
| getDownloadURL + requests | 3.2s |
| getDownloadURL + fsspec | 2.2s |
| computePixels | 1.9s |
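
The benchmark script itself isn't included in this thread. A comparable timing harness might look like the sketch below; `time_call` is a hypothetical helper, and taking the best of several runs is an assumption, not necessarily how the numbers above were measured.

```python
import time


def time_call(func, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        # Keep the fastest run to reduce noise from network variability.
        best = min(best, time.perf_counter() - start)
    return best
```

Each download method would be wrapped in a zero-argument function and passed to `time_call`, ideally against the same image and region.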
