
Refactor data handling #32

Open
rabernat opened this issue Aug 23, 2021 · 7 comments

Comments

@rabernat
Collaborator

We might want to use a utility like Pooch or a catalog like Intake to simplify how we are handling data. We could make things a lot easier on the students. On the other hand, perhaps understanding how to deal with real data (urls, broken links, etc.) is a valuable experience?
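For reference, the core of what a utility like Pooch automates is "download once, then verify a checksum, so we notice when a primary source silently changes." A minimal standard-library sketch of that idea (the URL, filename, and hash in the usage example are placeholders, not real course data):

```python
import hashlib
from pathlib import Path
from urllib.request import urlretrieve


def fetch(url, filename, known_sha256, cache_dir="data_cache"):
    """Download a data file once and verify its SHA-256 checksum.

    Roughly what Pooch does for us: skip the download if a cached
    copy exists, and fail loudly if the file does not match the
    pinned hash (e.g. because the upstream source changed).
    """
    path = Path(cache_dir) / filename
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(url, str(path))
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    if digest != known_sha256:
        raise ValueError(f"Checksum mismatch for {filename}: got {digest}")
    return path


# Hypothetical usage in a notebook:
# path = fetch("https://example.org/sst_monthly.nc", "sst_monthly.nc",
#              known_sha256="<pinned hash>")
```

Pooch layers a registry of hashes, sensible cache locations, and better error messages on top of this pattern, while Intake instead abstracts the catalog of datasets. Either way, a changed upstream file surfaces as a checksum failure rather than a silently different result in class.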

@tjcrone
Contributor

tjcrone commented Aug 23, 2021

The easiest way to handle data files is to include them, wherever possible, in the repo. I agree there is great value in having students download data from primary sources. But URLs and sources change often, and whether or not we use Pooch or Intake, the notebooks are likely to break every year, and the professor will need to fix the link/source before the Tuesday lecture. Students will only have to deal with broken links when a fix is not made before class.

Maybe a mixed approach is best, where we include a few of the most-often used and smaller datasets in the repo, and use links to primary sources for other datasets, especially ones where it would be nice to have the latest data. I am not opposed to using Pooch or Intake, but the real problem is that primary source URLs/APIs/formats change, and I don't think these tools can fix that. An effort to find data sources and URLs/permalinks that have been consistent over the years would also be worthwhile.

@rabernat
Collaborator Author

> The easiest way to handle data files is to include them, wherever possible, in the repo.

I like to avoid data files in the repo if possible because it bloats the repo size.

Perhaps there is an intermediate solution: we use a Zenodo repo to store all the data for the course. That way we can be sure it will be immutable.
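One sketch of how the Zenodo idea could work in practice: commit a small JSON registry of filenames and pinned SHA-256 hashes to the textbook repo, point it at the (immutable) Zenodo deposit, and use a helper to check a local data directory against it. Everything below (registry layout, file names) is hypothetical, not an agreed design:

```python
import hashlib
import json
from pathlib import Path


def verify_data_dir(registry_path, data_dir):
    """Check every file listed in a JSON registry against its pinned hash.

    The registry maps filename -> expected SHA-256 hex digest, e.g.
    {"sst_monthly.nc": "ab12..."}. Returns a dict of filename -> bool,
    with False for missing or modified files. Because a Zenodo deposit
    is immutable, these hashes should never need updating mid-semester.
    """
    registry = json.loads(Path(registry_path).read_text())
    results = {}
    for name, expected in registry.items():
        path = Path(data_dir) / name
        if not path.exists():
            results[name] = False
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        results[name] = digest == expected
    return results
```

A pre-class check like this would catch broken or partially downloaded data before the Tuesday lecture, rather than in the middle of it.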

@tjcrone
Contributor

tjcrone commented Aug 23, 2021

I agree that data in the repo is not ideal. The notebooks will never break, but there are downsides including bloat as you note. In some places (e.g. https://earth-env-data-science.github.io/assignments/numpy_matplotlib.html) we have data files stored in the ~rpa directory of the LDEO web server. Also not ideal, but these notebooks never broke, and I really appreciated that.

I like the idea of a repo for data that we control. It could be a separate GitHub repo with data for the class: it might be somewhat large, but students would not clone it, and it would stay more static than the textbook repo. Or it could be a Zenodo repo, which I know less about but looks great. It would also be good to keep a few instances where students get data from primary sources, for the pedagogical upside. We could make a good effort to find permalinks/sources that are on the more stable side.

@tjcrone
Contributor

tjcrone commented Aug 23, 2021

Eventually, our Open Storage Network pod might be a good place. Poking around, I found Dryad (https://datadryad.org/stash/our_membership) which is also interesting. Lots of others on this list from Nature: https://www.nature.com/sdata/policies/repositories.

@PedroVelez

Hi,

First, thanks a lot for making An Introduction to Earth and Environmental Data Science such a nice and useful tool. I learned Python by following it, and it inspired me to create https://euroargodev.github.io/argoonlineschool

We are still working on the Argo Online School, and I am trying to overcome the problem of having large files in the repository. So far I have them hosted on a web page, so users of the school can download them. I prefer to store the original files rather than have users download the latest version from the Argo repositories, since those change and the Jupyter notebooks may stop working.

I have been following this issue, and I wonder whether you have come up with a solution, or whether you have tried Git Large File Storage.
Thanks!
Pedro

@tjcrone
Contributor

tjcrone commented Sep 17, 2021

Thanks for reaching out @PedroVelez. Your book looks awesome! I'm glad this book was a help to you. I think we are leaning toward Zenodo, and definitely away from large files in the repo, but I don't think we have settled on a solution. Git LFS looks very cool and might be a reasonable way forward. We like Zenodo because it provides DOIs, and we also like having students get data directly from primary sources, so they learn to deal with the difficulties that sometimes come up, which is instructive. But we haven't settled on anything yet, and I think Git LFS should be part of the discussion. It would be great to hear how you solve this problem as you move forward.

@PedroVelez

Hi, I explored Git LFS, but users have to install it themselves; I think it is aimed at more expert developers.

To keep things simple and suited to the target audience of the Argo Online School, I think the best option is to use Google Drive to store the contents of the ./Data folder. That way users can see the structure of the folder and decide whether to download just one file or all of them. This is the example we are using.
Any comments are appreciated.
Pedro
