proposal: Organize data files by dataset, not type of data #320

ethanjli · 2023-12-30T04:43:15Z

Motivation

Currently, datasets stored on the PlanktoScope's SD card are organized into img, objects, clean, and export folders at the top level, with each folder having its own duplicate copy of the dataset folder structure as subfolders. This makes it annoying to delete all the data for just one dataset. It also interacts badly with our filebrowser web interface when trying to download a zip archive of the raw images for a single dataset (e.g. to segment later, and/or for archival): when we download the img folder, the archive includes raw images for any previous datasets we had forgotten to delete; and it's inadvisable to instead download a specific subfolder of the img folder, because the zip archive loses the name of that folder. The annoying workflow for downloading and deleting a dataset caused me lots of frustration during high-intensity PlanktoScope operations on the ToTS Sikuliaq research cruise in the summer of 2023, where the high workload often resulted in forgetting to delete the dataset from the previous acquisition.

Proposal

We should reorganize the folder structure of datasets so that img, objects, clean, and export are the subfolders of each dataset folder. This way, we can download all files of all types associated with a dataset just by downloading a single folder as a zip archive; and we can just delete a dataset folder to delete all data associated with it.
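To make the reorganization concrete, here is a minimal sketch of the path mapping, not an actual migration tool. It assumes that every existing file lives at `<type>/<dataset folders...>/<file>` relative to the data root; the function name `dataset_first_path` and the example folder names are hypothetical, and the real export folder may have extra nesting that this sketch does not handle.

```python
from pathlib import PurePosixPath

# The four top-level folders named in this proposal.
DATA_TYPES = {"img", "objects", "clean", "export"}


def dataset_first_path(old_rel: PurePosixPath) -> PurePosixPath:
    """Map <type>/<dataset...>/<file> to <dataset...>/<type>/<file>.

    Assumption: the file sits directly inside the innermost dataset folder.
    """
    parts = old_rel.parts
    if len(parts) < 3 or parts[0] not in DATA_TYPES:
        raise ValueError(f"not a type-first data path: {old_rel}")
    data_type, *dataset, filename = parts
    return PurePosixPath(*dataset, data_type, filename)
```

For example, `dataset_first_path(PurePosixPath("img/proj/sample/acq/0001.jpg"))` gives `proj/sample/acq/img/0001.jpg`, so everything belonging to the `proj/sample/acq` dataset ends up under one folder.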

In order to keep it easy to access, download, and delete all the EcoTaxa export ZIP files from the PlanktoScope, we would need to provide an alternate interface which aggregates all the EcoTaxa export ZIP files from all datasets. This could be a simplified "file manager" interface which provides individual & bulk dataset management actions, while our file browser interface would be for more advanced usage. Perhaps this interface could be made accessible at http://pkscope.local/ps/data/index . In the future, that interface could also be a frontend to rclone for uploading/transferring datasets to cloud storage.
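As a rough illustration of what the aggregating interface would need to do behind the scenes, the following sketch collects export archives across all datasets in the proposed layout. It assumes that each dataset folder has an `export` subfolder with ZIP files directly inside it; the function name `find_export_archives` is hypothetical, and nothing here reflects existing PlanktoScope code.

```python
from pathlib import Path


def find_export_archives(data_root: Path) -> dict[str, list[Path]]:
    """Group export ZIP files by the dataset folder that contains them.

    Assumption: archives live at <dataset...>/export/<name>.zip under data_root.
    """
    archives: dict[str, list[Path]] = {}
    for zip_path in sorted(data_root.rglob("export/*.zip")):
        # The dataset folder is the parent of the "export" folder.
        dataset = zip_path.parent.parent.relative_to(data_root).as_posix()
        archives.setdefault(dataset, []).append(zip_path)
    return archives
```

A bulk download or bulk delete action could then iterate over the returned mapping without the user having to visit each dataset folder.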

Unresolved questions to address as part of this proposal:

Would it be necessary/helpful/simpler to just have a single folder (e.g. project-id_sample-id_acq-id) for each dataset, instead of an entire tree of nested folders (which is what we have right now)?

ethanjli added the "proposal" label (A suggestion for a significant change) on Dec 30, 2023
