Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

yaml files to define dataset download links/checksums. #117

Merged
merged 20 commits into from
May 29, 2024

Conversation

chrisiacovella
Copy link
Member

@chrisiacovella chrisiacovella commented May 10, 2024

As the title suggests, I modified the datasets such that, instead of hardcoding in the url, checksum, filename, etc. these are now stored in yaml files.

The basic structure of the yaml file (e.g., ani1x):

dataset: ani1x
full_dataset:
  gz_data_file:
    length: 4510287721
    md5: 408cdcf9768ac96a8ae8ade9f078c51b
    name: ani1x_dataset.hdf5.gz
  hdf5_data_file:
    md5: 361b7c4b9a4dfeece70f0fe6a893e76a
    name: ani1x_dataset.hdf5
  processed_data_file:
    md5: null
    name: ani1x_dataset_processed.npz
  url: https://www.dropbox.com/scl/fi/d98h9kt4pl40qeapqzu00/ani1x_dataset.hdf5.gz?rlkey=7q1o8hh9qzbxehsobjurcksit&dl=1
unit_testing_nc_1000:
  gz_data_file:
    length: 1761417
    md5: f47a92bf4791607d9fc92a4cf16cd096
    name: ani1x_dataset_nc_1000.hdf5.gz
  hdf5_data_file:
    md5: 776d38c18f3aa37b00360556cf8d78cc
    name: ani1x_dataset_nc_1000.hdf5
  processed_data_file:
    md5: null
    name: ani1x_dataset_nc_1000_processed.npz
  url: https://www.dropbox.com/scl/fi/26expl20116cqacdk9l1t/ani1x_dataset_ntc_1000.hdf5.gz?rlkey=swciz9dfr7suia6nrsznwbk6i&st=ryqysch3&dl=1

This will be useful as it will make it easier to define a set of different datasets (e.g., limiting other datasets to only the elements in ani2x).

This is basically the same info I had in the datasets before, but presumably we can define any number of files, rather than just the full dataset or the unit testing dataset.

For now I've not changed the constructors (e.g., they accept "for_unit_testing"). I was going to hold off on changing this (I think wait to the other PRs are merged as those are changing other aspects of the datasets and tests). I think this can be changed to a variable "dataset_type" that takes a string. For example, this could be "full_dataset" (default), "unit_testing" and like "elements_HCNO" (i.e., a restricted dataset.

update: I've decided to extend the same approach to the dataset curation. These yaml files will define the download link and checksum. These files could/should be stored along with the resulting curated files to better define the source (especially useful in cases where the original dataset source is updated).

Status

  • Ready to go

…ums, filenames, etc. This will be useful as it will make it easier to define a set of different datasets (e.g., limiting other datasets to only the elements in ani2x). For now I've not changed the constructors (e.g., they accept "for_unit_testing"). I think this can be changed to a variable "mode" that takes a string (default being "full_dataset", an option for "unit_testing" and in cases we need to limit elements it could be like "SPICE_HNCO" or something of that nature. )
…ld/should be stored along with the hdf5 file, such that it is clearer the source of the original dataset (especially relevant for things like spice that are likely to see updates to the hdf5 files distributed).
@chrisiacovella
Copy link
Member Author

The curation scripts have been updated for the various spice datasets to allow us to filter by element, to create datasets that we will be able to train with ani.

With the tests refactored in a recently merged PR, I will update this PR get rid of the "for_unit_testing" and instead have this be a string that will allow us to toggle different datasets ("full", "test"...and for spice datasets "full_7element", "test_7element" or some names of that nature).

Copy link
Member

@wiederm wiederm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like these changes! This makes handling the datasets more modular and moving the parameters to the yaml files simplifies the control flow within the classes!

@chrisiacovella
Copy link
Member Author

I'm going to merge this now. I resolved a qcarchive issue so I was able to fold in the PhAlkEthOH dataset as well.

@chrisiacovella chrisiacovella merged commit 7fde764 into choderalab:main May 29, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants