
Multi-GPU support exists❓ [QUESTION] #210

Open
JonathanSchmidt1 opened this issue May 11, 2022 · 21 comments
Labels
enhancement New feature or request

Comments

@JonathanSchmidt1

We are interested in training nequip potentials on large datasets of several million structures.
Consequently, we wanted to know whether multi-GPU support exists, or whether someone knows if the networks can be integrated into PyTorch Lightning.
Best regards and thank you very much,
Jonathan
PS: this might be related to #126

@JonathanSchmidt1 JonathanSchmidt1 added the question Further information is requested label May 11, 2022
@Linux-cpp-lisp
Collaborator

Hi @JonathanSchmidt1 ,

Thanks for your interest in our code/method for your project! Sounds like an interesting application; please feel free to get in touch by email and let us know how it's going (we're always interested to hear about what people are working on using our methods).

Re multi-GPU training: I have a draft branch horovod using the Horovod distributed training framework. This is an in-progress draft and has only been successfully tested so far for a few epochs on multiple CPUs. The branch is also a little out-of-sync with the latest version, but I will try to merge that back in in the coming days. If you are interested, you are more than welcome to use this branch, with the understanding that you would be acting as a sort of "alpha tester." If you do use the branch, please carefully check any results you get for sanity and against results with Horovod disabled, and please report any issues/suspicions here or by email. (One disclaimer is that the horovod branch is not a development priority for us this summer, so I will likely be slow to respond.) PRs are also welcome, though I appreciate people reaching out to discuss first if the PR involves major development or restructuring.
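
For orientation, the underlying Horovod pattern is the standard one sketched below, with a toy model and random data as stand-ins; this is not the actual nequip trainer integration on the horovod branch, just what the framework itself asks for:

```python
# Minimal sketch of the generic Horovod + PyTorch pattern (toy model/data) --
# NOT the nequip trainer code, just the framework-level wiring.
import torch
import horovod.torch as hvd

hvd.init()                                   # one process per GPU, launched via horovodrun
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())  # pin this rank to its own GPU
device = "cuda" if torch.cuda.is_available() else "cpu"

model = torch.nn.Linear(8, 1).to(device)     # toy stand-in for the real model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3 * hvd.size())

# Average gradients across ranks on every optimizer step.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters()
)

# Start all ranks from identical model and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for _ in range(10):                          # toy loop over random data
    x = torch.randn(32, 8, device=device)
    y = torch.randn(32, 1, device=device)
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
```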

PyTorch Lightning is a lot more difficult to integrate with. Getting a simple training loop going would be easy, but it would use a different configuration file, and integrating it with the full set of important nequip features (correctly calculated and averaged metrics, careful data normalization, EMA, correct global numerical precision and JIT settings, and so on) would be difficult and would involve a lot of subtle stumbling blocks we have already dealt with in the nequip code. For this reason I would really recommend against this path unless you want to deal carefully with all of this. (If you do, of course, it would be great if you could share that work!)
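
To make that concrete, the "simple training loop" part would look roughly like the hypothetical sketch below (placeholder batch keys and model, nothing nequip-specific); everything listed above is exactly what such a wrapper would not handle:

```python
# Hypothetical bare-bones Lightning wrapper -- not nequip code. It shows how
# little of the important machinery (metrics, normalization, EMA, precision/JIT)
# a naive port would actually cover.
import pytorch_lightning as pl
import torch


class LitForceField(pl.LightningModule):
    def __init__(self, model: torch.nn.Module):
        super().__init__()
        self.model = model

    def training_step(self, batch, batch_idx):
        # "inputs"/"forces" are placeholder batch keys, for illustration only
        pred = self.model(batch["inputs"])
        loss = torch.nn.functional.mse_loss(pred, batch["forces"])
        self.log("loss_f", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


# trainer = pl.Trainer(accelerator="gpu", devices=4, strategy="ddp")
# trainer.fit(LitForceField(my_model), train_dataloaders=my_loader)
```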

Thanks!

@Linux-cpp-lisp
Collaborator

OK, I've merged the latest develop -> horovod, see #211.

@Linux-cpp-lisp
Collaborator

If you try this, please run the Horovod unit tests tests/integration/test_train_horovod.py and confirm that they (1) are not skipped (i.e. horovod is installed) and (2) pass.

@JonathanSchmidt1
Author

Thank you very much. I will see how it goes.

@JonathanSchmidt1
Author

As usual, other things got in the way, but I could finally test it.
Running tests/integration/test_train_horovod.py worked.
I also confirmed that normal training on GPU works (nequip-train configs/minimal.yaml).

Now if I run with --horovod, training of the first epoch seems fine, but there is a problem with the metrics.
I checked the torch_runstats lib and could not find any get_state; are you maybe using a modified version?

Epoch batch loss loss_f f_mae f_rmse
0 1 1.06 1.06 24.3 32.5
Traceback (most recent call last):
  File "/home/test_user/.conda/envs/nequip2/bin/nequip-train", line 33, in <module>
    sys.exit(load_entry_point('nequip', 'console_scripts', 'nequip-train')())
  File "/raid/scratch/testuser/nequip/nequip/scripts/train.py", line 87, in main
    trainer.train()
  File "/raid/scratch/testuser/nequip/nequip/train/trainer.py", line 827, in train
    self.epoch_step()
  File "/raid/scratch/testuser/nequip/nequip/train/trainer.py", line 991, in epoch_step
    self.metrics.gather()
  File "/raid/scratch/testuser/nequip/nequip/train/metrics.py", line 274, in gather
    {
  File "/raid/scratch/testuser/nequip/nequip/train/metrics.py", line 275, in <dictcomp>
    k1: {k2: rs.get_state() for k2, rs in v1.items()}
  File "/raid/scratch/testuser/nequip/nequip/train/metrics.py", line 275, in <dictcomp>
    k1: {k2: rs.get_state() for k2, rs in v1.items()}
AttributeError: 'RunningStats' object has no attribute 'get_state'

@Linux-cpp-lisp
Collaborator

Hi @JonathanSchmidt1 ,

Surprised that the tests run if the training won't... that sounds like a sign that the tests are broken 😄

Whoops yes I forgot to mention, I haven't merged the code I was writing to enable multi-GPU training in torch_runstats yet; you can find it on the branch https://github.com/mir-group/pytorch_runstats/tree/state-reduce.
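
For reference, the reduction that branch needs to do is essentially the standard pooled-mean update sketched below (illustrative names only, not the exact torch_runstats API):

```python
import torch


def merge_running_mean(state_a: torch.Tensor, n_a: int,
                       state_b: torch.Tensor, n_b: int):
    # Combine running means accumulated on two ranks. This mirrors the
    # `self._state += n * (state - self._state) / (self._n + n)` update in
    # accumulate_state, but is only an illustration, not the library code.
    if n_a + n_b == 0:
        return state_a, 0
    merged = state_a + n_b * (state_b - state_a) / (n_a + n_b)
    return merged, n_a + n_b


# two ranks, 100 samples each, different local means:
a = torch.tensor([1.0, 2.0])
b = torch.tensor([3.0, 4.0])
print(merge_running_mean(a, 100, b, 100))  # -> (tensor([2., 3.]), 200)
```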

@JonathanSchmidt1
Author

Thank you, that fixed it for one GPU:
horovodrun -np 1 nequip-train configs/example.yaml --horovod
works now.
If I use two GPUs, I get an error because some tensors are on the wrong devices during the metric evaluation:

File "/raid/scratch/testuser/nequip/nequip/train/trainer.py", line 993, in epoch_step
[1,0]:    self.metrics.gather()
[1,0]:  File "/raid/scratch/testuser/nequip/nequip/train/metrics.py", line 288, in gather
[1,0]:    self.running_stats[k1][k2].accumulate_state(rs_state)
[1,0]:  File "/home/test_user/.conda/envs/nequip2/lib/python3.9/site-packages/torch_runstats/_runstats.py", line 331, in accumulate_state
[1,0]:    self._state += n * (state - self._state) / (self._n + n)
[1,0]:RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!

I checked, and "n" and "state" are on cuda:1 while "self._state" and "self._n" are on cuda:0.
I am not sure how it is supposed to be: are they all expected to be on cuda:0 for this step, or each on their own GPU?

@Linux-cpp-lisp
Collaborator

Aha... here's that "this is very untested" 😁 I think PyTorch / Horovod may be too smart for its own good and is reloading transmitted tensors onto different CUDA devices when they are all available to the same host... I will look into this when I get a chance.
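
(If anyone wants to experiment before I get to it: the likely band-aid is to move whatever the collective hands back onto the local rank's device before accumulating, along the lines of the hypothetical helper below; it is not part of nequip or torch_runstats.)

```python
import torch


def to_local_device(obj, device):
    # Recursively move gathered tensors onto this rank's device; hypothetical
    # helper showing the kind of remapping that would avoid the cuda:0 /
    # cuda:1 mismatch in the traceback above.
    if torch.is_tensor(obj):
        return obj.to(device)
    if isinstance(obj, dict):
        return {k: to_local_device(v, device) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(to_local_device(v, device) for v in obj)
    return obj


# usage sketch: remap the received state before RunningStats accumulates it
# rs_state = to_local_device(rs_state, torch.device("cuda", local_rank))
```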

@JonathanSchmidt1
Author

That would be great. I will also try to find the time to look into it, but I think I will need some time to understand the whole codebase.

@Linux-cpp-lisp Linux-cpp-lisp added enhancement New feature or request and removed question Further information is requested labels Feb 20, 2023
@JonathanSchmidt1
Author

JonathanSchmidt1 commented Mar 22, 2023

I thought reviving the issue might be more convenient than continuing by email.
So here are some quick notes on issues I noticed when testing the ddp branch.

  • Every process seems to get its own wandb log. It's not possible to restart because wandb finds an existing run in each process and then crashes.

  • Sometimes there is a random crash after a few hundred epochs; I have no idea yet why, and it was not reproducible:
    WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 215968 closing signal SIGTERM
    WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 215970 closing signal SIGTERM
    WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 215971 closing signal SIGTERM
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -15) local_rank: 1 (pid: 215969) of binary: /home/test_user/.conda/envs/nequip2/bin/python
    Traceback (most recent call last):
    File "/home/test_user/.conda/envs/nequip2/bin/torchrun", line 8, in
    sys.exit(main())
    File "/home/test_user/.conda/envs/nequip2/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 345, in wrapper
    return f(*args, **kwargs)
    File "/home/test_user/.conda/envs/nequip2/lib/python3.9/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
    File "/home/test_user/.conda/envs/nequip2/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
    File "/home/test_user/.conda/envs/nequip2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in call
    return launch_agent(self._config, self._entrypoint, list(args))
    File "/home/test_user/.conda/envs/nequip2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
    /home/test_user/.conda/envs/nequip2/bin/nequip-train FAILED
    Failures:
    <NO_OTHER_FAILURES>
    Root Cause (first observed failure):
    [0]:
    time : 2023-03-21_21:38:56
    host : dgx2
    rank : 1 (local_rank: 1)
    exitcode : -15 (pid: 215969)
    error_file: <N/A>
    traceback : Signal 15 (SIGTERM) received by PID 215969
    /home/test_user/.conda/envs/nequip2/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown
    warnings.warn('resource_tracker: There appear to be %d '
    /home/test_user/.conda/envs/nequip2/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown
    warnings.warn('resource_tracker: There appear to be %d '
    /home/test_user/.conda/envs/nequip2/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown
    warnings.warn('resource_tracker: There appear to be %d '
    /home/test_user/.conda/envs/nequip2/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown
    warnings.warn('resource_tracker: There appear to be %d '

  • At the moment each process seems to load the network onto every GPU; e.g., running with 8 GPUs I get this output from nvidia-smi:

    | 0 N/A N/A 804401 C ...a/envs/nequip2/bin/python 18145MiB |
    | 0 N/A N/A 804402 C ...a/envs/nequip2/bin/python 1499MiB |
    | 0 N/A N/A 804403 C ...a/envs/nequip2/bin/python 1499MiB |
    | 0 N/A N/A 804404 C ...a/envs/nequip2/bin/python 1499MiB |
    | 0 N/A N/A 804405 C ...a/envs/nequip2/bin/python 1499MiB |
    | 0 N/A N/A 804406 C ...a/envs/nequip2/bin/python 1499MiB |
    | 0 N/A N/A 804407 C ...a/envs/nequip2/bin/python 1499MiB |
    | 0 N/A N/A 804408 C ...a/envs/nequip2/bin/python 1499MiB |
    | 1 N/A N/A 804401 C ...a/envs/nequip2/bin/python 1499MiB |
    | 1 N/A N/A 804402 C ...a/envs/nequip2/bin/python 19101MiB |
    | 1 N/A N/A 804403 C ...a/envs/nequip2/bin/python 1499MiB |
    | 1 N/A N/A 804404 C ...a/envs/nequip2/bin/python 1499MiB |
    | 1 N/A N/A 804405 C ...a/envs/nequip2/bin/python 1499MiB |
    | 1 N/A N/A 804406 C ...a/envs/nequip2/bin/python 1499MiB |
    | 1 N/A N/A 804407 C ...a/envs/nequip2/bin/python 1499MiB |
    | 1 N/A N/A 804408 C ...a/envs/nequip2/bin/python 1499MiB |
    | 2 N/A N/A 804401 C ...a/envs/nequip2/bin/python 1499MiB |
    | 2 N/A N/A 804402 C ...a/envs/nequip2/bin/python 1499MiB |
    | 2 N/A N/A 804403 C ...a/envs/nequip2/bin/python 17937MiB |
    | 2 N/A N/A 804404 C ...a/envs/nequip2/bin/python 1499MiB |
    | 2 N/A N/A 804405 C ...a/envs/nequip2/bin/python 1499MiB |
    | 2 N/A N/A 804406 C ...a/envs/nequip2/bin/python 1499MiB |
    | 2 N/A N/A 804407 C ...a/envs/nequip2/bin/python 1499MiB |
    | 2 N/A N/A 804408 C ...a/envs/nequip2/bin/python 1499MiB |
    | 3 N/A N/A 804401 C ...a/envs/nequip2/bin/python 1499MiB |
    | 3 N/A N/A 804402 C ...a/envs/nequip2/bin/python 1499MiB |
    | 3 N/A N/A 804403 C ...a/envs/nequip2/bin/python 1499MiB |
    ......

@Linux-cpp-lisp
Collaborator

Hi @JonathanSchmidt1 ,

Thanks!

Every process seems to get its own wandb log. It's not possible to restart because wandb finds an existing run in each process and then crashes.

Hm, yes... this one will be a little nontrivial, since we need to not only prevent wandb init on the nonzero ranks but probably also sync the wandb-updated config to them.
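
The usual pattern would be something like the sketch below (plain torch.distributed calls, not yet wired into nequip's wandb handling): init on rank 0 only, then broadcast the possibly-updated config to the other ranks.

```python
import torch.distributed as dist
import wandb


def init_wandb_rank_zero(config: dict, project: str) -> dict:
    # Sketch only, not the nequip implementation: rank 0 talks to wandb, every
    # other rank just receives the (possibly wandb-updated) config.
    rank = dist.get_rank() if dist.is_initialized() else 0
    payload = [None]
    if rank == 0:
        run = wandb.init(project=project, config=config)
        payload[0] = run.config.as_dict()    # config after any wandb-side edits
    if dist.is_initialized():
        dist.broadcast_object_list(payload, src=0)
    return payload[0] if payload[0] is not None else config
```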

Sometimes random crash after a few 100 epochs have no idea yet why. Was also not reproducible.

Weird... usually when we see something like this it means out-of-memory, or that the cluster's scheduler went crazy.

At the moment each process seems to load the network on each gpu e.g. running with 8 gpus I get this output from nvidia-smi:

Not sure exactly what I'm looking at here, but yes, every GPU will get its own copy of the model, as hinted by the name "Distributed Data Parallel".

@JonathanSchmidt1
Author

Out-of-memory errors could make sense and might be connected to the last issue, since with the same batch size per GPU I did not get OOM errors when running on a single GPU.

The output basically says that each worker process uses memory (most likely a copy of the model) on every GPU; however, with DDP each worker is supposed to have a copy only on its own GPU, with gradient updates then sent all-to-all. From previous experience with DDP, I would expect the output to look like this:
| 0 N/A N/A 804401 C ...a/envs/nequip2/bin/python 18145MiB |
| 1 N/A N/A 804402 C ...a/envs/nequip2/bin/python 19101MiB |
| 2 N/A N/A 804403 C ...a/envs/nequip2/bin/python 17937MiB |

@peastman
Contributor

I'd also be very interested in this feature. I have access to a system with four A100s on each node. Being able to use all four would make training go a lot faster.

@JonathanSchmidt1
Author

JonathanSchmidt1 commented Apr 5, 2023

I spent some time debugging the issue, and it seems that the metrics.gather and loss.gather calls cause the extra processes to spawn. If I remove these calls, there is only one process per GPU and I can scale to 16 GPUs (before, it would run OOM because of the extra processes). However, continuing the training after stopping still somehow causes extra processes to spawn, but only on the zeroth GPU.
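
For reference, the device-pinning pattern I am checking the trainer against is the usual one below (illustrative torchrun-style setup, not the nequip trainer code); if every rank is pinned before init_process_group and every tensor entering a collective already lives on that rank's device, no stray contexts should appear on GPU 0.

```python
import os

import torch
import torch.distributed as dist

# Illustrative DDP setup, assuming launch via torchrun (which sets LOCAL_RANK).
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)            # pin before any CUDA/collective work
dist.init_process_group(backend="nccl")

device = torch.device("cuda", local_rank)

# Anything fed to a collective (e.g. a metrics gather) should already be on
# this rank's device; otherwise NCCL can end up touching cuda:0.
local_metric = torch.zeros(3, device=device)
dist.all_reduce(local_metric, op=dist.ReduceOp.SUM)
```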

@rschireman

Hi all,

Any updates on this feature? I also have some rather large datasets.

@JonathanSchmidt1
Author

Just a small update. As I had access to a different cluster with Horovod, I tested the horovod branch again, and with the fixed runstats version and a few small changes it ran without the issues of the ddp version. I also got decent speedups, despite using single-GPU nodes:
N_nodes (1 P100 per node): [1, 2, 4, 8, 16, 32]
Speedup: [1.0, 1.629, 3.387, 6.642, 9.572, 17.384]
PS: I have not yet confirmed whether the loss is the same for different node counts with Horovod.
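
(For reference, the parallel efficiency implied by the numbers above can be checked with the trivial snippet below: it stays above ~80% up to 8 nodes and drops to roughly 60% and 54% at 16 and 32 nodes.)

```python
# Parallel efficiency for the speedups quoted above (speedup / node count).
nodes = [1, 2, 4, 8, 16, 32]
speedup = [1.0, 1.629, 3.387, 6.642, 9.572, 17.384]

for n, s in zip(nodes, speedup):
    print(f"{n:3d} nodes: speedup {s:6.2f}, efficiency {s / n:.0%}")
```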

@rschireman

Hi @JonathanSchmidt1,

Did you also receive a message like this when using the horovod branch on 2 GPUs:

[1,0]<stderr>:Processing dataset...
[1,1]<stderr>:Processing dataset...

@JonathanSchmidt1
Author

JonathanSchmidt1 commented Oct 27, 2023

The dataset processing only seems to happen in one process for me, so I only get the message once. Anyway, if that is causing problems for you, it might work to process the dataset beforehand and then start the training.
PS: I have tested some of the models now, and the loss reported during training seems correct.

@sklenard

sklenard commented Feb 9, 2024

Hi,

I am also quite interested in the multi-GPU training capability. I did some tests with the ddp branch using PyTorch 2.1.1 on up to 16 GPUs (4 V100 per node) with a dataset of ~5k configurations. In all my tests I achieved the same results as a single-GPU reference. I was wondering whether this feature is still under active development and whether there is any plan to merge it into the develop branch?

@beidouamg

beidouamg commented Apr 25, 2024

Hi @sklenard,

I am trying to use the multi-GPU feature, but I am having some trouble with it.
I installed the ddp branch with PyTorch 2.1.1 by changing
"torch>=1.8,<=1.12,!=1.9.0",  # torch.fx added in 1.8
to
"torch>=1.8,<=2.1.1,!=1.9.0",  # torch.fx added in 1.8
in setup.py in the nequip folder.

This way, the ddp branch can be installed without any error.
However, when I run nequip-train, I get this error:

[W init.cpp:842] Warning: Use _jit_set_fusion_strategy, bailout depth is deprecated. Setting to (STATIC, 2) (function operator())
Traceback (most recent call last):
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/bin/nequip-train", line 8, in <module>
    sys.exit(main())
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/nequip/scripts/train.py", line 76, in main
    trainer = fresh_start(config)
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/nequip/scripts/train.py", line 189, in fresh_start
    config = init_n_update(config)
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/nequip/utils/wandb.py", line 17, in init_n_update
    wandb.init(
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1200, in init
    raise e
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1177, in init
    wi.setup(kwargs)
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 190, in setup
    self._wl = wandb_setup.setup(settings=setup_settings)
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 327, in setup
    ret = _setup(settings=settings)
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 320, in _setup
    wl = _WandbSetup(settings=settings)
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 303, in __init__
    _WandbSetup._instance = _WandbSetup__WandbSetup(settings=settings, pid=pid)
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 114, in __init__
    self._setup()
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 250, in _setup
    self._setup_manager()
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 277, in _setup_manager
    self._manager = wandb_manager._Manager(settings=self._settings)
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_manager.py", line 139, in __init__
    self._service.start()
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/service/service.py", line 250, in start
    self._launch_server()
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/service/service.py", line 244, in _launch_server
    _sentry.reraise(e)
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/analytics/sentry.py", line 154, in reraise
    raise exc.with_traceback(sys.exc_info()[2])
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/service/service.py", line 242, in _launch_server
    self._wait_for_ports(fname, proc=internal_proc)
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/service/service.py", line 132, in _wait_for_ports
    raise ServiceStartTimeoutError(
wandb.sdk.service.service.ServiceStartTimeoutError: Timed out waiting for wandb service to start after 30.0 seconds. Try increasing the timeout with the `_service_wait` setting.

It seems that there is something wrong with wandb.
I wonder how you installed this branch; maybe there is some difference between the version you installed and the one I installed, since more than two months have passed. It would be great if you could recall how you installed it or share which version you installed.
Thank you very much!

@Linux-cpp-lisp
Collaborator

@beidouamg this looks like a network error unrelated to the ddp branch, but maybe there is a race condition. Have you tried running without wandb enabled?
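
Two quick things to try if it does turn out to be wandb: disable wandb in your config to rule it out, or raise the service-start timeout that the error message itself points at. A minimal sketch of the latter (placeholder project name; 300 s is arbitrary):

```python
import wandb

# Sketch only: bump the wandb service startup timeout named in the
# ServiceStartTimeoutError above. Alternatively, set the environment variable
# WANDB_MODE=offline to take the wandb service out of the picture entirely.
run = wandb.init(
    project="nequip-ddp-test",               # placeholder project name
    settings=wandb.Settings(_service_wait=300),
)
run.finish()
```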
