
Multi-GPU support exists❓ [QUESTION] #210

Open
JonathanSchmidt1 opened this issue May 11, 2022 · 21 comments
Labels
enhancement New feature or request

Comments

@JonathanSchmidt1

We are interested in training nequip potentials on large datasets of several million structures.
Consequently, we wanted to know whether multi-GPU support exists, or whether someone knows if the networks can be integrated into PyTorch Lightning.
Best regards and thank you very much,
Jonathan
PS: this might be related to #126

@JonathanSchmidt1 JonathanSchmidt1 added the question Further information is requested label May 11, 2022
@Linux-cpp-lisp
Collaborator

Hi @JonathanSchmidt1 ,

Thanks for your interest in our code/method for your project! Sounds like an interesting application; please feel free to get in touch by email and let us know how it's going (we're always interested to hear about what people are working on using our methods).

Re multi-GPU training: I have a draft branch horovod using the Horovod distributed training framework. This is an in-progress draft and has only been successfully tested so far for a few epochs on multiple CPUs. The branch is also a little out-of-sync with the latest version, but I will try to merge that back in in the coming days. If you are interested, you are more than welcome to use this branch, with the understanding that you would be acting as a sort of "alpha tester." If you do use the branch, please carefully check any results you get for sanity and against results with Horovod disabled, and please report any issues/suspicions here or by email. (One disclaimer is that the horovod branch is not a development priority for us this summer, so I will likely be slow to respond.) PRs are also welcome, though I appreciate people reaching out to discuss first if the PR involves major development or restructuring.
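
For orientation, the underlying Horovod pattern is the standard one sketched below, with a toy model and random data as stand-ins; this is not the actual nequip trainer integration on the horovod branch, just what the framework itself asks for:

```python
# Minimal sketch of the generic Horovod + PyTorch pattern (toy model/data) --
# NOT the nequip trainer code, just the framework-level wiring.
import torch
import horovod.torch as hvd

hvd.init()                                   # one process per GPU, launched via horovodrun
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())  # pin this rank to its own GPU
device = "cuda" if torch.cuda.is_available() else "cpu"

model = torch.nn.Linear(8, 1).to(device)     # toy stand-in for the real model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3 * hvd.size())

# Average gradients across ranks on every optimizer step.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters()
)

# Start all ranks from identical model and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for _ in range(10):                          # toy loop over random data
    x = torch.randn(32, 8, device=device)
    y = torch.randn(32, 1, device=device)
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
```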

PyTorch Lightning is a lot more difficult to integrate with. Getting a simple training loop going would be easy, but it would use a different configuration file, and integrating it with the full set of important nequip features (correctly calculated and averaged metrics, careful data normalization, EMA, correct global numerical precision and JIT settings, and so on) would be difficult and would involve a lot of subtle stumbling blocks we have already dealt with in the nequip code. For this reason I would really recommend against this path unless you want to deal carefully with all of this. (If you do, of course, it would be great if you could share that work!)
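
To make that concrete, the "simple training loop" part would look roughly like the hypothetical sketch below (placeholder batch keys and model, nothing nequip-specific); everything listed above is exactly what such a wrapper would not handle:

```python
# Hypothetical bare-bones Lightning wrapper -- not nequip code. It shows how
# little of the important machinery (metrics, normalization, EMA, precision/JIT)
# a naive port would actually cover.
import pytorch_lightning as pl
import torch


class LitForceField(pl.LightningModule):
    def __init__(self, model: torch.nn.Module):
        super().__init__()
        self.model = model

    def training_step(self, batch, batch_idx):
        # "inputs"/"forces" are placeholder batch keys, for illustration only
        pred = self.model(batch["inputs"])
        loss = torch.nn.functional.mse_loss(pred, batch["forces"])
        self.log("loss_f", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


# trainer = pl.Trainer(accelerator="gpu", devices=4, strategy="ddp")
# trainer.fit(LitForceField(my_model), train_dataloaders=my_loader)
```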

Thanks!

@Linux-cpp-lisp
Collaborator

OK, I've merged the latest develop -> horovod, see #211.

@Linux-cpp-lisp
Collaborator

If you try this, please run the Horovod unit tests tests/integration/test_train_horovod.py and confirm that they (1) are not skipped (i.e. horovod is installed) and (2) pass.

@JonathanSchmidt1
Author

Thank you very much. I will see how it goes.

@JonathanSchmidt1
Author

As usual, other things got in the way, but I could finally test it.
Running tests/integration/test_train_horovod.py worked.
I also confirmed that normal training on GPU works (nequip-train configs/minimal.yaml).

Now if I run with --horovod, training of the first epoch seems fine, but there is a problem with the metrics.
I checked the torch_runstats lib and could not find any get_state; are you maybe using a modified version?

Epoch batch loss loss_f f_mae f_rmse
0 1 1.06 1.06 24.3 32.5
Traceback (most recent call last):
  File "/home/test_user/.conda/envs/nequip2/bin/nequip-train", line 33, in <module>
    sys.exit(load_entry_point('nequip', 'console_scripts', 'nequip-train')())
  File "/raid/scratch/testuser/nequip/nequip/scripts/train.py", line 87, in main
    trainer.train()
  File "/raid/scratch/testuser/nequip/nequip/train/trainer.py", line 827, in train
    self.epoch_step()
  File "/raid/scratch/testuser/nequip/nequip/train/trainer.py", line 991, in epoch_step
    self.metrics.gather()
  File "/raid/scratch/testuser/nequip/nequip/train/metrics.py", line 274, in gather
    {
  File "/raid/scratch/testuser/nequip/nequip/train/metrics.py", line 275, in <dictcomp>
    k1: {k2: rs.get_state() for k2, rs in v1.items()}
  File "/raid/scratch/testuser/nequip/nequip/train/metrics.py", line 275, in <dictcomp>
    k1: {k2: rs.get_state() for k2, rs in v1.items()}
AttributeError: 'RunningStats' object has no attribute 'get_state'

@Linux-cpp-lisp
Collaborator

Hi @JonathanSchmidt1 ,

Surprised that the tests run if the training won't... that sounds like a sign that the tests are broken 😄

Whoops yes I forgot to mention, I haven't merged the code I was writing to enable multi-GPU training in torch_runstats yet; you can find it on the branch https://github.com/mir-group/pytorch_runstats/tree/state-reduce.
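
For reference, the reduction that branch needs to do is essentially the standard pooled-mean update sketched below (illustrative names only, not the exact torch_runstats API):

```python
import torch


def merge_running_mean(state_a: torch.Tensor, n_a: int,
                       state_b: torch.Tensor, n_b: int):
    # Combine running means accumulated on two ranks. This mirrors the
    # `self._state += n * (state - self._state) / (self._n + n)` update in
    # accumulate_state, but is only an illustration, not the library code.
    if n_a + n_b == 0:
        return state_a, 0
    merged = state_a + n_b * (state_b - state_a) / (n_a + n_b)
    return merged, n_a + n_b


# two ranks, 100 samples each, different local means:
a = torch.tensor([1.0, 2.0])
b = torch.tensor([3.0, 4.0])
print(merge_running_mean(a, 100, b, 100))  # -> (tensor([2., 3.]), 200)
```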

@JonathanSchmidt1
Author

Thank you, that fixed it for one GPU:
horovodrun -np 1 nequip-train configs/example.yaml --horovod
works now.
If I use two GPUs, I get an error because some tensors are on the wrong devices during the metric evaluation:

File "/raid/scratch/testuser/nequip/nequip/train/trainer.py", line 993, in epoch_step
[1,0]:    self.metrics.gather()
[1,0]:  File "/raid/scratch/testuser/nequip/nequip/train/metrics.py", line 288, in gather
[1,0]:    self.running_stats[k1][k2].accumulate_state(rs_state)
[1,0]:  File "/home/test_user/.conda/envs/nequip2/lib/python3.9/site-packages/torch_runstats/_runstats.py", line 331, in accumulate_state
[1,0]:    self._state += n * (state - self._state) / (self._n + n)
[1,0]:RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!

I checked, and "n" and "state" are on cuda:1 while "self._state" and "self._n" are on cuda:0.
I am not sure how it is supposed to be: are they all expected to be on cuda:0 for this step, or each on their own GPU?

@Linux-cpp-lisp
Collaborator

Aha... here's that "this is very untested" 😁 I think PyTorch / Horovod may be too smart for its own good and is reloading transmitted tensors onto different CUDA devices when they are all available to the same host... I will look into this when I get a chance.
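
(If anyone wants to experiment before I get to it: the likely band-aid is to move whatever the collective hands back onto the local rank's device before accumulating, along the lines of the hypothetical helper below; it is not part of nequip or torch_runstats.)

```python
import torch


def to_local_device(obj, device):
    # Recursively move gathered tensors onto this rank's device; hypothetical
    # helper showing the kind of remapping that would avoid the cuda:0 /
    # cuda:1 mismatch in the traceback above.
    if torch.is_tensor(obj):
        return obj.to(device)
    if isinstance(obj, dict):
        return {k: to_local_device(v, device) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(to_local_device(v, device) for v in obj)
    return obj


# usage sketch: remap the received state before RunningStats accumulates it
# rs_state = to_local_device(rs_state, torch.device("cuda", local_rank))
```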

@JonathanSchmidt1
Author

That would be great. I will also try to find the time to look into it, but I think I will need some time to understand the whole codebase.

@Linux-cpp-lisp Linux-cpp-lisp added enhancement New feature or request and removed question Further information is requested labels Feb 20, 2023
@JonathanSchmidt1
Author

JonathanSchmidt1 commented Mar 22, 2023

I thought reviving the issue might be more convenient than continuing by email.
So here are some quick notes on issues I noticed when testing the ddp branch.

  • Every process seems to get its own wandb log. It's not possible to restart because wandb finds an existing run in each process and then crashes.

  • Sometimes there is a random crash after a few hundred epochs; I have no idea yet why, and it was not reproducible:
    WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 215968 closing signal SIGTERM
    WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 215970 closing signal SIGTERM
    WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 215971 closing signal SIGTERM
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -15) local_rank: 1 (pid: 215969) of binary: /home/test_user/.conda/envs/nequip2/bin/python
    Traceback (most recent call last):
    File "/home/test_user/.conda/envs/nequip2/bin/torchrun", line 8, in
    sys.exit(main())
    File "/home/test_user/.conda/envs/nequip2/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 345, in wrapper
    return f(*args, **kwargs)
    File "/home/test_user/.conda/envs/nequip2/lib/python3.9/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
    File "/home/test_user/.conda/envs/nequip2/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
    File "/home/test_user/.conda/envs/nequip2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in call
    return launch_agent(self._config, self._entrypoint, list(args))
    File "/home/test_user/.conda/envs/nequip2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
    /home/test_user/.conda/envs/nequip2/bin/nequip-train FAILED
    Failures:
    <NO_OTHER_FAILURES>
    Root Cause (first observed failure):
    [0]:
    time : 2023-03-21_21:38:56
    host : dgx2
    rank : 1 (local_rank: 1)
    exitcode : -15 (pid: 215969)
    error_file: <N/A>
    traceback : Signal 15 (SIGTERM) received by PID 215969
    /home/test_user/.conda/envs/nequip2/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown
    warnings.warn('resource_tracker: There appear to be %d '
    /home/test_user/.conda/envs/nequip2/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown
    warnings.warn('resource_tracker: There appear to be %d '
    /home/test_user/.conda/envs/nequip2/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown
    warnings.warn('resource_tracker: There appear to be %d '
    /home/test_user/.conda/envs/nequip2/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown
    warnings.warn('resource_tracker: There appear to be %d '

  • At the moment each process seems to load the network onto every GPU; e.g., running with 8 GPUs I get this output from nvidia-smi:

    | 0 N/A N/A 804401 C ...a/envs/nequip2/bin/python 18145MiB |
    | 0 N/A N/A 804402 C ...a/envs/nequip2/bin/python 1499MiB |
    | 0 N/A N/A 804403 C ...a/envs/nequip2/bin/python 1499MiB |
    | 0 N/A N/A 804404 C ...a/envs/nequip2/bin/python 1499MiB |
    | 0 N/A N/A 804405 C ...a/envs/nequip2/bin/python 1499MiB |
    | 0 N/A N/A 804406 C ...a/envs/nequip2/bin/python 1499MiB |
    | 0 N/A N/A 804407 C ...a/envs/nequip2/bin/python 1499MiB |
    | 0 N/A N/A 804408 C ...a/envs/nequip2/bin/python 1499MiB |
    | 1 N/A N/A 804401 C ...a/envs/nequip2/bin/python 1499MiB |
    | 1 N/A N/A 804402 C ...a/envs/nequip2/bin/python 19101MiB |
    | 1 N/A N/A 804403 C ...a/envs/nequip2/bin/python 1499MiB |
    | 1 N/A N/A 804404 C ...a/envs/nequip2/bin/python 1499MiB |
    | 1 N/A N/A 804405 C ...a/envs/nequip2/bin/python 1499MiB |
    | 1 N/A N/A 804406 C ...a/envs/nequip2/bin/python 1499MiB |
    | 1 N/A N/A 804407 C ...a/envs/nequip2/bin/python 1499MiB |
    | 1 N/A N/A 804408 C ...a/envs/nequip2/bin/python 1499MiB |
    | 2 N/A N/A 804401 C ...a/envs/nequip2/bin/python 1499MiB |
    | 2 N/A N/A 804402 C ...a/envs/nequip2/bin/python 1499MiB |
    | 2 N/A N/A 804403 C ...a/envs/nequip2/bin/python 17937MiB |
    | 2 N/A N/A 804404 C ...a/envs/nequip2/bin/python 1499MiB |
    | 2 N/A N/A 804405 C ...a/envs/nequip2/bin/python 1499MiB |
    | 2 N/A N/A 804406 C ...a/envs/nequip2/bin/python 1499MiB |
    | 2 N/A N/A 804407 C ...a/envs/nequip2/bin/python 1499MiB |
    | 2 N/A N/A 804408 C ...a/envs/nequip2/bin/python 1499MiB |
    | 3 N/A N/A 804401 C ...a/envs/nequip2/bin/python 1499MiB |
    | 3 N/A N/A 804402 C ...a/envs/nequip2/bin/python 1499MiB |
    | 3 N/A N/A 804403 C ...a/envs/nequip2/bin/python 1499MiB |
    ......

@Linux-cpp-lisp
Collaborator

Hi @JonathanSchmidt1 ,

Thanks!

Every process seems to get its own wandb log. It's not possible to restart because wandb finds an existing run in each process and then crashes.

Hm, yes... this one will be a little nontrivial, since we need to not only prevent wandb init on the nonzero ranks but probably also sync the wandb-updated config to them.
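
The usual pattern would be something like the sketch below (plain torch.distributed calls, not yet wired into nequip's wandb handling): init on rank 0 only, then broadcast the possibly-updated config to the other ranks.

```python
import torch.distributed as dist
import wandb


def init_wandb_rank_zero(config: dict, project: str) -> dict:
    # Sketch only, not the nequip implementation: rank 0 talks to wandb, every
    # other rank just receives the (possibly wandb-updated) config.
    rank = dist.get_rank() if dist.is_initialized() else 0
    payload = [None]
    if rank == 0:
        run = wandb.init(project=project, config=config)
        payload[0] = run.config.as_dict()    # config after any wandb-side edits
    if dist.is_initialized():
        dist.broadcast_object_list(payload, src=0)
    return payload[0] if payload[0] is not None else config
```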

Sometimes random crash after a few 100 epochs have no idea yet why. Was also not reproducible.

Weird... usually when we see something like this it means out-of-memory, or that the cluster's scheduler went crazy.

At the moment each process seems to load the network on each gpu e.g. running with 8 gpus I get this output from nvidia-smi:

Not sure exactly what I'm looking at here, but yes, every GPU will get its own copy of the model, as hinted by the name "Distributed Data Parallel".

@JonathanSchmidt1
Author

Out-of-memory errors could make sense and might be connected to the last issue, since with the same batch size per GPU I did not get OOM errors when running on a single GPU.

The output basically says that each worker process uses memory (most likely a copy of the model) on every GPU; however, with DDP each worker is supposed to have a copy only on its own GPU, with gradient updates then sent all-to-all. From previous experience with DDP, I would expect the output to look like this:
| 0 N/A N/A 804401 C ...a/envs/nequip2/bin/python 18145MiB |
| 1 N/A N/A 804402 C ...a/envs/nequip2/bin/python 19101MiB |
| 2 N/A N/A 804403 C ...a/envs/nequip2/bin/python 17937MiB |

@peastman
Contributor

I'd also be very interested in this feature. I have access to a system with four A100s on each node. Being able to use all four would make training go a lot faster.

@JonathanSchmidt1
Author

JonathanSchmidt1 commented Apr 5, 2023

I spent some time debugging the issue, and it seems that the metrics.gather and loss.gather calls cause the extra processes to spawn. If I remove these calls, there is only one process per GPU and I can scale to 16 GPUs (before, it would run OOM because of the extra processes). However, continuing the training after stopping still somehow causes extra processes to spawn, but only on the zeroth GPU.
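
For reference, the device-pinning pattern I am checking the trainer against is the usual one below (illustrative torchrun-style setup, not the nequip trainer code); if every rank is pinned before init_process_group and every tensor entering a collective already lives on that rank's device, no stray contexts should appear on GPU 0.

```python
import os

import torch
import torch.distributed as dist

# Illustrative DDP setup, assuming launch via torchrun (which sets LOCAL_RANK).
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)            # pin before any CUDA/collective work
dist.init_process_group(backend="nccl")

device = torch.device("cuda", local_rank)

# Anything fed to a collective (e.g. a metrics gather) should already be on
# this rank's device; otherwise NCCL can end up touching cuda:0.
local_metric = torch.zeros(3, device=device)
dist.all_reduce(local_metric, op=dist.ReduceOp.SUM)
```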

@rschireman

Hi all,

Any updates on this feature? I also have some rather large datasets.

@JonathanSchmidt1
Author

Just a small update. As I had access to a different cluster with Horovod, I tested the horovod branch again, and with the fixed runstats version and a few small changes it ran without the issues of the ddp version. I also got decent speedups, despite using single-GPU nodes:
N_nodes (1 P100 per node): [1, 2, 4, 8, 16, 32]
Speedup: [1.0, 1.629, 3.387, 6.642, 9.572, 17.384]
PS: I have not yet confirmed whether the loss is the same for different node counts with Horovod.
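
(For reference, the parallel efficiency implied by the numbers above can be checked with the trivial snippet below: it stays above ~80% up to 8 nodes and drops to roughly 60% and 54% at 16 and 32 nodes.)

```python
# Parallel efficiency for the speedups quoted above (speedup / node count).
nodes = [1, 2, 4, 8, 16, 32]
speedup = [1.0, 1.629, 3.387, 6.642, 9.572, 17.384]

for n, s in zip(nodes, speedup):
    print(f"{n:3d} nodes: speedup {s:6.2f}, efficiency {s / n:.0%}")
```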

@rschireman

Hi @JonathanSchmidt1,

Did you also receive a message like this when using the horovod branch on 2 GPUs:

[1,0]<stderr>:Processing dataset...
[1,1]<stderr>:Processing dataset...

@JonathanSchmidt1
Author

JonathanSchmidt1 commented Oct 27, 2023

The dataset processing only seems to happen in one process for me, so I only get the message once. Anyway, if that is causing problems for you, it might work to process the dataset beforehand and then start the training.
PS: I have tested some of the models now, and the loss reported during training seems correct.

@sklenard

sklenard commented Feb 9, 2024

Hi,

I am also quite interested in the multi-GPU training capability. I did some tests with the ddp branch using PyTorch 2.1.1 on up to 16 GPUs (4 V100 per node) with a dataset of ~5k configurations. In all my tests I achieved the same results as a single-GPU reference. I was wondering whether this feature is still under active development and whether there is any plan to merge it into the develop branch?

@beidouamg

beidouamg commented Apr 25, 2024

Hi @sklenard,

I am trying to use the multi-GPU feature, but I am having some trouble with it.
I installed the ddp branch with PyTorch 2.1.1 by changing
"torch>=1.8,<=1.12,!=1.9.0",  # torch.fx added in 1.8
to
"torch>=1.8,<=2.1.1,!=1.9.0",  # torch.fx added in 1.8
in setup.py in the nequip folder.

This way, the ddp branch can be installed without any error.
However, when I run nequip-train, I get this error:

[W init.cpp:842] Warning: Use _jit_set_fusion_strategy, bailout depth is deprecated. Setting to (STATIC, 2) (function operator())
Traceback (most recent call last):
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/bin/nequip-train", line 8, in <module>
    sys.exit(main())
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/nequip/scripts/train.py", line 76, in main
    trainer = fresh_start(config)
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/nequip/scripts/train.py", line 189, in fresh_start
    config = init_n_update(config)
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/nequip/utils/wandb.py", line 17, in init_n_update
    wandb.init(
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1200, in init
    raise e
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1177, in init
    wi.setup(kwargs)
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 190, in setup
    self._wl = wandb_setup.setup(settings=setup_settings)
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 327, in setup
    ret = _setup(settings=settings)
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 320, in _setup
    wl = _WandbSetup(settings=settings)
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 303, in __init__
    _WandbSetup._instance = _WandbSetup__WandbSetup(settings=settings, pid=pid)
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 114, in __init__
    self._setup()
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 250, in _setup
    self._setup_manager()
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 277, in _setup_manager
    self._manager = wandb_manager._Manager(settings=self._settings)
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_manager.py", line 139, in __init__
    self._service.start()
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/service/service.py", line 250, in start
    self._launch_server()
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/service/service.py", line 244, in _launch_server
    _sentry.reraise(e)
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/analytics/sentry.py", line 154, in reraise
    raise exc.with_traceback(sys.exc_info()[2])
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/service/service.py", line 242, in _launch_server
    self._wait_for_ports(fname, proc=internal_proc)
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/service/service.py", line 132, in _wait_for_ports
    raise ServiceStartTimeoutError(
wandb.sdk.service.service.ServiceStartTimeoutError: Timed out waiting for wandb service to start after 30.0 seconds. Try increasing the timeout with the `_service_wait` setting.

It seems that there is something wrong with wandb.
I wonder how you installed this branch; maybe there is some difference between the version you installed and the one I installed, since more than two months have passed. It would be great if you could recall how you installed it or share which version you installed.
Thank you very much!

@Linux-cpp-lisp
Collaborator

@beidouamg this looks like a network error unrelated to the ddp branch, but maybe there is a race condition. Have you tried running without wandb enabled?
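
Two quick things to try if it does turn out to be wandb: disable wandb in your config to rule it out, or raise the service-start timeout that the error message itself points at. A minimal sketch of the latter (placeholder project name; 300 s is arbitrary):

```python
import wandb

# Sketch only: bump the wandb service startup timeout named in the
# ServiceStartTimeoutError above. Alternatively, set the environment variable
# WANDB_MODE=offline to take the wandb service out of the picture entirely.
run = wandb.init(
    project="nequip-ddp-test",               # placeholder project name
    settings=wandb.Settings(_service_wait=300),
)
run.finish()
```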
