proposal: Organize data files by dataset, not type of data #320

Open
ethanjli opened this issue Dec 30, 2023 · 0 comments
Labels
proposal A suggestion for a significant change

Motivation

Currently, datasets stored on the PlanktoScope's SD card are organized under top-level img, objects, clean, and export folders, with each of those folders containing its own duplicate copy of the dataset folder structure as subfolders. This makes it annoying to delete all the data for just one dataset. It also interacts badly with our filebrowser web interface when trying to download a zip archive of the raw images for a single dataset (e.g. to segment later, and/or for archival): when we download the img folder, it includes raw images for any previous datasets we forgot to delete; and it's inadvisable to instead download a specific subfolder of the img folder, because the zip archive loses the name of that folder. This annoying workflow for downloading and deleting a dataset caused me a lot of frustration during high-intensity PlanktoScope operations on the ToTS Sikuliaq research cruise in the summer of 2023, where the high workload often resulted in forgetting to delete the dataset from the previous acquisition.

Proposal

We should reorganize the folder structure of datasets so that img, objects, clean, and export are the subfolders of each dataset folder. This way, we can download all files of all types associated with a dataset just by downloading a single folder as a zip archive; and we can just delete a dataset folder to delete all data associated with it.
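To make the reorganization concrete, here is a minimal Python sketch of moving one existing dataset from the current type-first layout to the proposed dataset-first layout. It is only an illustration under stated assumptions: the function name `migrate_dataset`, the `root` data directory, and the idea that a dataset is identified by the same relative path (`dataset_rel`) under each of the four type folders are hypothetical, not part of the existing PlanktoScope software.

```python
import shutil
from pathlib import Path

# The four data-type folders named in this proposal.
DATA_TYPES = ("img", "objects", "clean", "export")


def migrate_dataset(root: Path, dataset_rel: str) -> list[Path]:
    """Move <root>/<type>/<dataset_rel> to <root>/<dataset_rel>/<type>.

    Hypothetical helper: assumes each type folder mirrors the same
    relative dataset path. Returns the destination folders it created.
    """
    moved = []
    for kind in DATA_TYPES:
        src = root / kind / dataset_rel
        if not src.is_dir():
            continue  # this dataset has no data of this type
        dst = root / dataset_rel / kind
        if dst.exists():
            raise FileExistsError(f"refusing to overwrite {dst}")
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dst))
        moved.append(dst)
    return moved
```

After such a move, deleting or downloading `<root>/<dataset_rel>` would cover every data type for that dataset in one operation.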

In order to keep it easy to access, download, and delete all the EcoTaxa export ZIP files from the PlanktoScope, we would need to provide an alternate interface which aggregates all the EcoTaxa export ZIP files from all datasets. This could be a simplified "file manager" interface which provides individual & bulk dataset management actions, while our file browser interface would be for more advanced usage. Perhaps this interface could be made accessible at http://pkscope.local/ps/data/index . In the future, that interface could also be a frontend to rclone for uploading/transferring datasets to cloud storage.
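As a rough sketch of what the aggregation behind such an interface might look like, the following Python function collects export ZIP files across all datasets in the proposed layout. The function name `find_export_archives` is hypothetical, and the sketch assumes that the ZIP files sit directly inside each dataset's export subfolder, which may not match the actual export layout.

```python
from pathlib import Path


def find_export_archives(root: Path) -> dict[str, list[Path]]:
    """Group export ZIP files by dataset folder (relative to root).

    Assumes the proposed layout <root>/<dataset>/export/*.zip, where
    <dataset> may itself be a nested path.
    """
    archives: dict[str, list[Path]] = {}
    for zip_path in sorted(root.rglob("export/*.zip")):
        # The dataset folder is the parent of the export folder.
        dataset = zip_path.parent.parent.relative_to(root).as_posix()
        archives.setdefault(dataset, []).append(zip_path)
    return archives
```

A simplified data-management page could render one row per key of the returned dictionary, with download and delete actions applied to the listed files or to the whole dataset folder.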

Unresolved questions to address as part of this proposal:

  • Would it be necessary/helpful/simpler to just have a single folder (e.g. project-id_sample-id_acq-id) for each dataset, instead of an entire tree of nested folders (which is what we have right now)?
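
For the single-folder option in the question above, a flat folder name could be built along these lines. This is only a sketch: the function name `dataset_folder_name` and the validation rules are assumptions, and the underscore separator simply follows the example name given in the question.

```python
def dataset_folder_name(project_id: str, sample_id: str, acq_id: str) -> str:
    """Join the three IDs into one flat folder name.

    Hypothetical helper: rejects empty IDs and path separators so the
    result is always a single folder, never a nested path.
    """
    parts = (project_id, sample_id, acq_id)
    for part in parts:
        if not part or "/" in part or "\\" in part:
            raise ValueError(f"invalid ID for a folder name: {part!r}")
    return "_".join(parts)
```

One design point to weigh: if the IDs themselves may contain underscores, a flat name like this cannot be split back into its three parts unambiguously, whereas a nested folder tree keeps them separate.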
@ethanjli ethanjli added the proposal A suggestion for a significant change label Dec 30, 2023
Projects
Status: Draft