Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE]: Cache in CLI #792

Open
KnathanM opened this issue Apr 13, 2024 · 7 comments
Open

[FEATURE]: Cache in CLI #792

KnathanM opened this issue Apr 13, 2024 · 7 comments
Assignees
Labels
enhancement a new feature request
Milestone

Comments

@KnathanM
Copy link
Contributor

#697 added caching to v2. We haven't made it available through the CLI yet

@KnathanM KnathanM added the enhancement a new feature request label Apr 13, 2024
@davidegraff
Copy link
Contributor

IMO it should be dynamically determined based on dataset size, i.e., any dataset fewer than (just a random number) 50k molecules should be cached unless a user tells us not to (--no-cache).

notes:

  • we should run some numbers to get a sense of the memory size for a dataset of N molecules. My original napkin math was that a user could reasonably store 50k in only a couple gigs, but the recent precision changes from V2: Convert from float64 to float32 #761 could increase that number. Alongside that, we should decide what a reasonable expectation of memory is for our users. 4 GB/cpu is pretty typical, so maybe 8GB is reasonable? It’s up to the devs
  • it’d be a nice feature (in like 2.2 or 2.3), if we can extend data loading parallelism to the caching itself. Currently the caching is serial whereas data loading can be done in parallel.

@UnixJunkie
Copy link

UnixJunkie commented May 7, 2024

I completely second @davidegraff: maybe use 50k or 100k as the default limit.
If the dataset is bigger than that, automatically use --no_cache_mol.
You might also want to make this the default when the model is being trained on a GPU.

@UnixJunkie
Copy link

Can this be made available through the CLI?
It seems required for very large datasets.

@KnathanM
Copy link
Contributor Author

KnathanM commented May 8, 2024

Not caching is the default (and only option currently) in the CLI, which works for all sizes of datasets. Soon we plan to add an option to cache for small datasets.

I understand that your dataset is large. The CLI should work for your dataset as no caching is performed.

@UnixJunkie
Copy link

Ok, do you want me to share a 10M public dataset w/ you so that you can reproduce the problem?

@UnixJunkie
Copy link

~10M molecules; classification setting

@KnathanM
Copy link
Contributor Author

KnathanM commented May 8, 2024

Yes, I can try running the CLI on it tomorrow and see if I can reproduce your error. Please send details of the dataset in issue #858. Thank you

@KnathanM KnathanM self-assigned this May 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement a new feature request
Projects
None yet
Development

No branches or pull requests

4 participants