[FEATURE]: Cache in CLI #792

KnathanM · 2024-04-13T19:50:40Z

#697 added caching to v2. We haven't made it available through the CLI yet.

davidegraff · 2024-04-13T21:56:02Z

IMO it should be dynamically determined based on dataset size, i.e., any dataset with fewer than 50k molecules (just a random number) should be cached unless the user tells us not to (--no-cache).

notes:

- We should run some numbers to get a sense of the memory size for a dataset of N molecules. My original napkin math was that a user could reasonably store 50k in only a couple gigs, but the recent precision changes from V2: Convert from float64 to float32 #761 could increase that number. Alongside that, we should decide what a reasonable expectation of memory is for our users. 4 GB/CPU is pretty typical, so maybe 8 GB is reasonable? It’s up to the devs.
- It’d be a nice feature (in like 2.2 or 2.3) if we can extend data loading parallelism to the caching itself. Currently the caching is serial, whereas data loading can be done in parallel.
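
To make the proposal concrete, here is a minimal sketch of the size-based default plus a napkin-math memory estimate. Everything in it is an assumption for illustration, not the actual CLI: the `--cache`/`--no-cache` flag pair, the 50k limit, the helper names, and the memory model (float32 atom and bond features plus an int64 edge index) are all hypothetical, and real per-molecule sizes would need to be measured.

```python
import argparse
from typing import Optional

# Hypothetical default, taken from the "random number" floated above.
DEFAULT_CACHE_LIMIT = 50_000


def build_parser() -> argparse.ArgumentParser:
    """Parser with a hypothetical tri-state cache flag (not the real CLI)."""
    parser = argparse.ArgumentParser()
    # --cache / --no-cache; None means the user expressed no preference.
    parser.add_argument(
        "--cache", action=argparse.BooleanOptionalAction, default=None
    )
    return parser


def should_cache(
    n_molecules: int,
    user_choice: Optional[bool],
    limit: int = DEFAULT_CACHE_LIMIT,
) -> bool:
    """An explicit user choice wins; otherwise cache only small datasets."""
    if user_choice is not None:
        return user_choice
    return n_molecules < limit


def estimate_cache_bytes(
    n_molecules: int,
    atoms_per_mol: int,
    bonds_per_mol: int,
    atom_dim: int,
    bond_dim: int,
    bytes_per_value: int = 4,  # float32
) -> int:
    """Rough size of cached graphs; ignores Python object overhead."""
    directed_edges = 2 * bonds_per_mol  # each bond stored in both directions
    atom_bytes = atoms_per_mol * atom_dim * bytes_per_value
    bond_bytes = directed_edges * bond_dim * bytes_per_value
    index_bytes = directed_edges * 2 * 8  # (source, target) as int64
    return n_molecules * (atom_bytes + bond_bytes + index_bytes)


if __name__ == "__main__":
    args = build_parser().parse_args()
    n = 10_000  # placeholder dataset size
    print("cache:", should_cache(n, args.cache))
```

Keeping the flag tri-state (unset, on, off) lets the size-based default apply only when the user has not said anything, so `--no-cache` and a forced `--cache` both stay available.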

UnixJunkie · 2024-05-07T04:21:59Z

I completely second @davidegraff: maybe use 50k or 100k as the default limit.
If the dataset is bigger than that, automatically use --no_cache_mol.
You might also want to make this the default when the model is being trained on a GPU.

UnixJunkie · 2024-05-08T01:12:24Z

Can this be made available through the CLI?
It seems required for very large datasets.

KnathanM · 2024-05-08T01:29:16Z

Not caching is the default (and only option currently) in the CLI, which works for all sizes of datasets. Soon we plan to add an option to cache for small datasets.

I understand that your dataset is large. The CLI should work for your dataset as no caching is performed.

UnixJunkie · 2024-05-08T01:31:34Z

Ok, do you want me to share a 10M public dataset w/ you so that you can reproduce the problem?

UnixJunkie · 2024-05-08T01:31:54Z

~10M molecules; classification setting

KnathanM · 2024-05-08T01:37:14Z

Yes, I can try running the CLI on it tomorrow and see if I can reproduce your error. Please send details of the dataset in issue #858. Thank you

KnathanM added the enhancement (a new feature request) label Apr 13, 2024

kevingreenman added this to the v2.1.0 milestone Apr 16, 2024

UnixJunkie mentioned this issue May 7, 2024

[v2 Feature Request]: On-the-Fly Graph Generation #858

Closed

KnathanM self-assigned this May 9, 2024
