yaml files to define dataset download links/checksums. #117

chrisiacovella · 2024-05-10T05:39:37Z

As the title suggests, I modified the datasets such that, instead of hardcoding in the url, checksum, filename, etc. these are now stored in yaml files.

The basic structure of the yaml file (e.g., ani1x):

dataset: ani1x
full_dataset:
  gz_data_file:
    length: 4510287721
    md5: 408cdcf9768ac96a8ae8ade9f078c51b
    name: ani1x_dataset.hdf5.gz
  hdf5_data_file:
    md5: 361b7c4b9a4dfeece70f0fe6a893e76a
    name: ani1x_dataset.hdf5
  processed_data_file:
    md5: null
    name: ani1x_dataset_processed.npz
  url: https://www.dropbox.com/scl/fi/d98h9kt4pl40qeapqzu00/ani1x_dataset.hdf5.gz?rlkey=7q1o8hh9qzbxehsobjurcksit&dl=1
unit_testing_nc_1000:
  gz_data_file:
    length: 1761417
    md5: f47a92bf4791607d9fc92a4cf16cd096
    name: ani1x_dataset_nc_1000.hdf5.gz
  hdf5_data_file:
    md5: 776d38c18f3aa37b00360556cf8d78cc
    name: ani1x_dataset_nc_1000.hdf5
  processed_data_file:
    md5: null
    name: ani1x_dataset_nc_1000_processed.npz
  url: https://www.dropbox.com/scl/fi/26expl20116cqacdk9l1t/ani1x_dataset_ntc_1000.hdf5.gz?rlkey=swciz9dfr7suia6nrsznwbk6i&st=ryqysch3&dl=1

This will be useful as it will make it easier to define a set of different datasets (e.g., limiting other datasets to only the elements in ani2x).

This is basically the same info I had in the datasets before, but presumably we can define any number of files, rather than just the full dataset or the unit testing dataset.

For now I've not changed the constructors (e.g., they accept "for_unit_testing"). I was going to hold off on changing this (I think wait to the other PRs are merged as those are changing other aspects of the datasets and tests). I think this can be changed to a variable "dataset_type" that takes a string. For example, this could be "full_dataset" (default), "unit_testing" and like "elements_HCNO" (i.e., a restricted dataset.

update: I've decided to extend the same approach to the dataset curation. These yaml files will define the download link and checksum. These files could/should be stored along with the resulting curated files to better define the source (especially useful in cases where the original dataset source is updated).

Status

Ready to go

…ums, filenames, etc. This will be useful as it will make it easier to define a set of different datasets (e.g., limiting other datasets to only the elements in ani2x). For now I've not changed the constructors (e.g., they accept "for_unit_testing"). I think this can be changed to a variable "mode" that takes a string (default being "full_dataset", an option for "unit_testing" and in cases we need to limit elements it could be like "SPICE_HNCO" or something of that nature. )

…ld/should be stored along with the hdf5 file, such that it is clearer the source of the original dataset (especially relevant for things like spice that are likely to see updates to the hdf5 files distributed).

…ide of the curation module. A simple version number has been added to the curation to make it easier to keep track of and increment hdf5 files if they change.

chrisiacovella · 2024-05-23T06:48:04Z

The curation scripts have been updated for the various spice datasets to allow us to filter by element, to create datasets that we will be able to train with ani.

With the tests refactored in a recently merged PR, I will update this PR get rid of the "for_unit_testing" and instead have this be a string that will allow us to toggle different datasets ("full", "test"...and for spice datasets "full_7element", "test_7element" or some names of that nature).

…ation.

…e and how versioning is captured.

wiederm

I like these changes! This makes handling the datasets more modular and moving the parameters to the yaml files simplifies the control flow within the classes!

chrisiacovella · 2024-05-29T16:32:06Z

I'm going to merge this now. I resolved a qcarchive issue so I was able to fold in the PhAlkEthOH dataset as well.

chrisiacovella added 3 commits May 9, 2024 22:32

Merge branch 'main' into yaml_for_datasets

3d534f7

Modified curation to also use yaml files. The curation yaml files cou…

30a9346

…ld/should be stored along with the hdf5 file, such that it is clearer the source of the original dataset (especially relevant for things like spice that are likely to see updates to the hdf5 files distributed).

chrisiacovella requested a review from wiederm May 14, 2024 19:56

chrisiacovella added 4 commits May 17, 2024 09:48

Merge branch 'main' into yaml_for_datasets

cb4274c

Merge branch 'choderalab:main' into yaml_for_datasets

95fb63b

Merge branch 'choderalab:main' into yaml_for_datasets

b714fd8

Restructured curation scripts and moved to subdirectory (scripts) ins…

c2c00cf

…ide of the curation module. A simple version number has been added to the curation to make it easier to keep track of and increment hdf5 files if they change.

chrisiacovella and others added 8 commits May 23, 2024 22:10

updating yaml files to have more info and datasets

7c6e87c

Updated datasets to take "version_select" rather than "for_unit_testing"

51019fe

Updated unit tests to deal with version_select

735fbb0

Merge branch 'main' into yaml_for_datasets

5944923

fixing training test that seemed to have deleted the trainer instanti…

25534e6

…ation.

merge of main includes removed for_unit_testing in certain tests.

03fb54b

made curation more consistent with datasets in terms of yaml structur…

c5c7153

…e and how versioning is captured.

removed redundant release version in openff spice

c961855

wiederm approved these changes May 25, 2024

View reviewed changes

chrisiacovella added 5 commits May 28, 2024 11:20

adding in additional tests

6fb39c6

Added PhAlkEthOH dataset.

bf8ea68

Added PhAlkEthOH dataset.

718747f

accidentally left a debug statement in

196eb3f

refactoring names

04e19cf

chrisiacovella merged commit 7fde764 into choderalab:main May 29, 2024
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

yaml files to define dataset download links/checksums. #117

yaml files to define dataset download links/checksums. #117

chrisiacovella commented May 10, 2024 •

edited

chrisiacovella commented May 23, 2024

wiederm left a comment

chrisiacovella commented May 29, 2024

yaml files to define dataset download links/checksums. #117

yaml files to define dataset download links/checksums. #117

Conversation

chrisiacovella commented May 10, 2024 • edited

Status

chrisiacovella commented May 23, 2024

wiederm left a comment

Choose a reason for hiding this comment

chrisiacovella commented May 29, 2024

chrisiacovella commented May 10, 2024 •

edited