torch.distributed sys.excepthook crashes if distributed backend was deinitialized #126379

Closed
szmigacz opened this issue May 16, 2024 · 2 comments
Labels: high priority, module: c10d, oncall: distributed, triage review, triaged

Comments

szmigacz commented May 16, 2024

🐛 Describe the bug

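torch.distributed installs a sys.excepthook, _distributed_excepthook(*args), defined in torch/distributed/distributed_c10d.py. The hook calls get_rank(), which raises if the default process group has already been destroyed, so any uncaught exception after destroy_process_group() makes the hook itself crash.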

import torch

torch.distributed.init_process_group('gloo')
torch.distributed.destroy_process_group()

raise ZeroDivisionError

Raises:

Error in sys.excepthook:
Traceback (most recent call last):
  File "/opt/homebrew/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1339, in _distributed_excepthook
    prefix = f"[rank{get_rank()}]"
                     ^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1746, in get_rank
    default_pg = _get_default_group()
                 ^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1008, in _get_default_group
    raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.

Original exception was:
Traceback (most recent call last):
  File "/private/tmp/hook.py", line 8, in <module>
    raise ZeroDivisionError
ZeroDivisionError
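
A possible user-side guard, sketched under assumptions: the hook below looks up the rank defensively and drops the prefix when the lookup fails. The names make_rank_excepthook and get_rank (any zero-argument callable) are hypothetical and chosen for illustration; this is not the code in distributed_c10d.py and not necessarily the change that landed.

```python
import sys
import traceback


def make_rank_excepthook(get_rank):
    """Build an excepthook that prefixes each traceback line with the rank.

    `get_rank` is a hypothetical zero-argument callable. If it raises
    (for example because the process group no longer exists), the
    traceback is printed without a prefix instead of crashing the hook.
    """

    def hook(exc_type, exc, tb):
        try:
            prefix = f"[rank{get_rank()}]: "
        except Exception:
            # Rank is unavailable; fall back to an unprefixed traceback.
            prefix = ""
        text = "".join(traceback.format_exception(exc_type, exc, tb))
        for line in text.splitlines():
            sys.stderr.write(f"{prefix}{line}\n")
        sys.stderr.flush()

    return hook
```

With PyTorch installed, one might set sys.excepthook = make_rank_excepthook(torch.distributed.get_rank) after init_process_group(); the original exception would then still be reported after destroy_process_group().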

Versions

Collecting environment information...
PyTorch version: 2.3.0
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 14.4.1 (arm64)
GCC version: Could not collect
Clang version: 15.0.0 (clang-1500.3.9.4)
CMake version: version 3.26.4
Libc version: N/A

Python version: 3.10.14 (main, Mar 19 2024, 21:46:16) [Clang 15.0.0 (clang-1500.3.9.4)] (64-bit runtime)
Python platform: macOS-14.4.1-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M1 Max

Versions of relevant libraries:
[pip3] torch==2.3.0
[conda] Could not collect

cc @ezyang @gchanan @zou3519 @kadeng @msaroufim @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k

@mikaylagawarecki mikaylagawarecki added the oncall: distributed Add this issue/PR to distributed oncall triage queue label May 20, 2024
@yf225 yf225 added the module: c10d Issues/PRs related to collective communications and process groups label May 20, 2024

yf225 commented May 20, 2024

Thanks! We would be open to taking a PR for this.

@wconstab wconstab self-assigned this May 20, 2024
@wconstab wconstab added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label May 20, 2024

ezyang commented May 21, 2024

Thanks @wconstab, I appreciate it

wconstab added a commit that referenced this issue May 21, 2024
fixes #126379

This is the easy fix.  An additional fix that I did not do is to
deregister the excepthook (or rather, restore the original one) when
calling dist.destroy_process_group.  This might be a bit complicated in
practice, so landing as is for now.
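
The restore-on-destroy idea mentioned above could look roughly like the sketch below. It is an illustration only: install_hook, restore_hook and the _state dict are hypothetical names, not PyTorch APIs. One possible complication it tries to handle is another library replacing sys.excepthook in the meantime, in which case restoring would clobber that hook.

```python
import sys

# Hypothetical bookkeeping: the hook we installed and the one it replaced.
_state = {"ours": None, "previous": None}


def install_hook(new_hook):
    """Install `new_hook` as sys.excepthook, remembering the previous hook."""
    if _state["ours"] is not None:
        return  # already installed; keep the first saved hook
    _state["previous"] = sys.excepthook
    _state["ours"] = new_hook
    sys.excepthook = new_hook


def restore_hook():
    """Put the previous hook back, but only if ours is still the active one.

    Returns True if the previous hook was restored, False otherwise.
    """
    ours, previous = _state["ours"], _state["previous"]
    _state["ours"] = _state["previous"] = None
    if ours is not None and sys.excepthook is ours:
        sys.excepthook = previous
        return True
    # Nothing installed, or someone else replaced the hook: leave it alone.
    return False
```

In this sketch, install_hook would be called where the process group is created and restore_hook where it is destroyed.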

Also, couldn't figure out a clean way to test this.  assertRaisesRegex
wasn't getting a string value, probably because of the stderr
redirection done via the excepthook in the first place.

ghstack-source-id: 2ffe93e104eec809881411120f08a37b9285fe16
Pull Request resolved: #126739
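
On the testing difficulty noted in the commit message, one option is to run the failing script in a child interpreter and inspect its stderr, which sidesteps the in-process stderr redirection. The sketch below is an assumption-laden illustration: run_snippet is a hypothetical helper, and BROKEN is a pure-Python stand-in for the repro (a hook that itself raises), so it runs without PyTorch.

```python
import subprocess
import sys


def run_snippet(source, timeout=60):
    """Run `source` in a fresh interpreter; return (returncode, stderr)."""
    proc = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stderr


# Stand-in for the repro: an excepthook that fails like the one in this issue.
BROKEN = """
import sys

def hook(*args):
    raise ValueError("Default process group has not been initialized")

sys.excepthook = hook
raise ZeroDivisionError
"""

code, err = run_snippet(BROKEN)
# The interpreter reports a failing hook with this marker on stderr.
hook_crashed = "Error in sys.excepthook" in err
```

A regression test for the fix could pass the repro from this issue as the source and assert that the marker is absent while ZeroDivisionError is still reported. Note that init_process_group with the default env:// rendezvous needs MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE set in the child's environment.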