
[TODO/BUG?]: add a warning to docs about all predictions being 0 with inconsistent specification of GPU devices #830

Open
kevingreenman opened this issue Apr 23, 2024 · 1 comment

@kevingreenman
Member

Notes
Akshat found that one can end up with predictions of 0 for all inputs when loading a model from a best.pt file.

@hwpang found that this is related to using `--devices "1,"` vs. `--devices "1"`: if the former is used during training and the latter during prediction, the mean and scale matrices in the unscale transform end up on two different GPUs.

Model loaded onto GPU 1:

```
tensor([[-92.4739]], device='cuda:1')
tensor([[77.8905]], device='cuda:1')
```

GPU 0 used for predicting:

```
tensor([[0.]], device='cuda:0')
tensor([[0.]], device='cuda:0')
```
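For reference, here is a minimal plain-PyTorch check (not part of Chemprop; the helper name is hypothetical) that would surface this kind of mismatch after loading a checkpoint:

```python
import torch


def report_device_mismatches(model: torch.nn.Module, expected: torch.device) -> None:
    """Print any parameter or buffer of `model` that is not on `expected`.

    The all-zero predictions above came from exactly this situation: the
    unscale transform's mean/scale buffers lived on cuda:1 while inference
    ran on cuda:0.
    """
    for name, tensor in list(model.named_parameters()) + list(model.named_buffers()):
        if tensor.device != expected:
            print(f"{name}: on {tensor.device}, expected {expected}")


# e.g., after loading a checkpoint:
# report_device_mismatches(model, torch.device("cuda:0"))
```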

From @JacksonBurns:
This has to do with how Lightning loads models from checkpoints and the default behavior of `map_location`. This is technically intended behavior on their side; we just aren't providing a map location.
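For context, Lightning's `load_from_checkpoint` accepts a `map_location` argument. A minimal sketch of passing it explicitly (the checkpoint path is a placeholder, and the `MPNN` import path is an assumption about the Chemprop v2 model class):

```python
import torch

from chemprop.models import MPNN  # assumed import path for the Chemprop v2 LightningModule

# Map every tensor in the checkpoint onto the device that will be used for
# prediction, so the unscale transform's mean/scale land on the same GPU as
# the rest of the model.
model = MPNN.load_from_checkpoint("best.pt", map_location=torch.device("cuda:0"))
```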

If there's no foolproof way for us to prevent this from happening, we should add a warning to our documentation telling users to specify their device numbers consistently between training and prediction.

@kevingreenman kevingreenman added bug Something isn't working todo add an item to the to-do list labels Apr 23, 2024
@kevingreenman kevingreenman added this to the v2.0.1 milestone Apr 23, 2024
@davidegraff
Contributor

My vote is for a warning. The two inputs have different meanings, and it's not up to us to try to guess what a user actually meant to type. Guessing can lead to inconsistencies down the road, e.g., should we also assume the same behavior with chemprop train? IMO the answer is clearly no, and we want the argument to be treated the same across scripts.
