yaml files to define dataset download links/checksums. #117
…ums, filenames, etc. This will be useful as it will make it easier to define a set of different datasets (e.g., limiting other datasets to only the elements in ani2x). For now I've not changed the constructors (e.g., they accept "for_unit_testing"). I think this can be changed to a variable "mode" that takes a string: "full_dataset" as the default, "unit_testing" as an option, and, in cases where we need to limit elements, something like "SPICE_HNCO".
…ld/should be stored along with the hdf5 file, so that the source of the original dataset is clearer (especially relevant for things like spice that are likely to see updates to the hdf5 files distributed).
…ide of the curation module. A simple version number has been added to the curation to make it easier to keep track of and increment hdf5 files if they change.
The curation scripts for the various spice datasets have been updated to allow us to filter by element, so we can create datasets that we will be able to train with ani. With the tests refactored in a recently merged PR, I will update this PR to get rid of "for_unit_testing" and instead use a string that will allow us to toggle between different datasets ("full", "test", and for spice datasets "full_7element", "test_7element", or some names of that nature).
…e and how versioning is captured.
I like these changes! This makes handling the datasets more modular and moving the parameters to the yaml files simplifies the control flow within the classes!
I'm going to merge this now. I resolved a qcarchive issue so I was able to fold in the PhAlkEthOH dataset as well.
As the title suggests, I modified the datasets so that the url, checksum, filename, etc. are now stored in yaml files instead of being hardcoded.
The basic structure of the yaml file (e.g., ani1x):
This will be useful as it will make it easier to define a set of different datasets (e.g., limiting other datasets to only the elements in ani2x).
This is basically the same info I had in the datasets before, but presumably we can define any number of files, rather than just the full dataset or the unit testing dataset.
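To make the idea concrete, here is a minimal sketch of how an entry loaded from such a yaml file might be used to check a downloaded file. It is not the schema used in this PR: the key names (`url`, `filename`, `md5`), the placeholder URL, the choice of md5 as the checksum, and the `verify_checksum` helper are all assumptions for illustration. A plain dict stands in for the parsed yaml.

```python
import hashlib

# Hypothetical stand-in for the parsed yaml content.
# Key names and values are placeholders, not the real schema.
DATASET_ENTRIES = {
    "full_dataset": {
        "url": "https://example.org/full.hdf5.gz",
        "filename": "full.hdf5.gz",
        "md5": "00000000000000000000000000000000",
    },
    "unit_testing": {
        "url": "https://example.org/test.hdf5.gz",
        "filename": "test.hdf5.gz",
        "md5": "00000000000000000000000000000000",
    },
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, entry):
    """Compare the file's md5 against the value recorded in the entry."""
    return md5_of_file(path) == entry["md5"]
```

With this layout, adding another file (for example an element-restricted subset) would only require a new entry in the yaml file, with no change to the dataset class.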
For now I've not changed the constructors (e.g., they accept "for_unit_testing"). I was going to hold off on changing this; I think it is best to wait until the other PRs are merged, as those are changing other aspects of the datasets and tests. I think this can be changed to a variable "dataset_type" that takes a string. For example, this could be "full_dataset" (default), "unit_testing", or something like "elements_HCNO" (i.e., a restricted dataset).
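A rough sketch of what the string-based selection could look like, assuming the yaml content has been parsed into a dict keyed by dataset type. The function names `select_entry` and `legacy_flag_to_dataset_type`, the filenames, and the entry layout are hypothetical; only the "dataset_type" strings come from the proposal above.

```python
# Hypothetical parsed yaml content, keyed by the proposed dataset_type strings.
ENTRIES = {
    "full_dataset": {"filename": "full.hdf5.gz"},
    "unit_testing": {"filename": "test.hdf5.gz"},
    "elements_HCNO": {"filename": "hcno.hdf5.gz"},
}


def select_entry(entries, dataset_type="full_dataset"):
    """Look up the entry for a dataset_type, failing loudly on unknown names."""
    try:
        return entries[dataset_type]
    except KeyError:
        raise ValueError(
            f"Unknown dataset_type {dataset_type!r}; "
            f"available: {sorted(entries)}"
        ) from None


def legacy_flag_to_dataset_type(for_unit_testing):
    """Map the existing boolean flag onto the proposed string values."""
    return "unit_testing" if for_unit_testing else "full_dataset"
```

A helper like `legacy_flag_to_dataset_type` would let the constructors keep accepting "for_unit_testing" during the transition while the lookup itself is driven by the string.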
update: I've decided to extend the same approach to the dataset curation. These yaml files will define the download link and checksum. These files could/should be stored along with the resulting curated files to better define the source (especially useful in cases where the original dataset source is updated).