torch.distributed sys.excepthook crashes if distributed backend was deinitialized #126379

szmigacz · 2024-05-16T04:46:43Z

🐛 Describe the bug

The sys.excepthook defined in pytorch/torch/distributed/distributed_c10d.py (line 1361 in 0214711, def _distributed_excepthook(*args):) calls get_rank(), which is not available if the distributed process group was already destroyed.

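import torch

# Run under torchrun, or set MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE,
# so that the default env:// rendezvous can complete.
torch.distributed.init_process_group('gloo')
torch.distributed.destroy_process_group()

# Any uncaught exception raised after the group is destroyed triggers the bug.
raise ZeroDivisionError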

Raises:

Error in sys.excepthook:
Traceback (most recent call last):
  File "/opt/homebrew/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1339, in _distributed_excepthook
    prefix = f"[rank{get_rank()}]"
                     ^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1746, in get_rank
    default_pg = _get_default_group()
                 ^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1008, in _get_default_group
    raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.

Original exception was:
Traceback (most recent call last):
  File "/private/tmp/hook.py", line 8, in <module>
    raise ZeroDivisionError
ZeroDivisionError
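
The nested failure above happens because the hook builds its prefix with get_rank() unconditionally. Below is a minimal sketch of a more defensive hook. It makes no claim about how the actual fix in PyTorch is written: `make_rank_excepthook` is a hypothetical helper, and the rank lookup and the initialization check are passed in as callables so the sketch runs without a process group.

```python
import sys
import traceback


def make_rank_excepthook(get_rank, is_initialized, base_hook=sys.__excepthook__):
    """Build an excepthook that prefixes traceback lines with the rank.

    Hypothetical helper, not PyTorch's implementation. `get_rank` and
    `is_initialized` are caller-supplied callables.
    """

    def hook(exc_type, exc_value, exc_tb):
        # Only ask for the rank while the process group still exists;
        # after destroy_process_group() the lookup would raise.
        if not is_initialized():
            base_hook(exc_type, exc_value, exc_tb)
            return
        prefix = f"[rank{get_rank()}]"
        text = "".join(traceback.format_exception(exc_type, exc_value, exc_tb))
        for line in text.splitlines():
            print(f"{prefix}: {line}", file=sys.stderr)

    return hook
```

With PyTorch installed, this could be wired up as `sys.excepthook = make_rank_excepthook(torch.distributed.get_rank, torch.distributed.is_initialized)`.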

Versions

Collecting environment information...
PyTorch version: 2.3.0
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 14.4.1 (arm64)
GCC version: Could not collect
Clang version: 15.0.0 (clang-1500.3.9.4)
CMake version: version 3.26.4
Libc version: N/A

Python version: 3.10.14 (main, Mar 19 2024, 21:46:16) [Clang 15.0.0 (clang-1500.3.9.4)] (64-bit runtime)
Python platform: macOS-14.4.1-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M1 Max

Versions of relevant libraries:
[pip3] torch==2.3.0
[conda] Could not collect

cc @ezyang @gchanan @zou3519 @kadeng @msaroufim @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k


yf225 · 2024-05-20T17:45:19Z

Thanks! We would be open to taking a PR for this.

ezyang · 2024-05-21T00:21:13Z

Thanks @wconstab, I appreciate it

Fixes #126379

This is the easy fix. An additional fix that I did not do is to deregister the excepthook (or rather, restore the original one) when calling dist.destroy_process_group. This might be a bit complicated in practice, so landing as is for now.

Also, I couldn't figure out a clean way to test this. assertRaisesRegex wasn't getting a string value, probably because of the stderr redirection done via the excepthook in the first place.

ghstack-source-id: 2ffe93e104eec809881411120f08a37b9285fe16

Pull Request resolved: #126739
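
The follow-up mentioned in this commit message, restoring the original hook when the process group is destroyed, can be sketched with the standard library alone. This is an illustration under assumptions, not the change that landed: `install_excepthook` and `restore_excepthook` are hypothetical names, and where PyTorch would call them is left open.

```python
import sys

# Hook that was active before ours was installed; None means nothing to restore.
_previous_hook = None


def install_excepthook(hook):
    """Install `hook` and remember the hook it replaces (hypothetical helper)."""
    global _previous_hook
    if _previous_hook is None:
        _previous_hook = sys.excepthook
    sys.excepthook = hook


def restore_excepthook():
    """Put back the remembered hook; safe to call when nothing was installed."""
    global _previous_hook
    if _previous_hook is not None:
        sys.excepthook = _previous_hook
        _previous_hook = None
```

One complication this sketch ignores: if other code replaces sys.excepthook after `install_excepthook` runs, `restore_excepthook` would overwrite that later hook as well.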

mikaylagawarecki added the oncall: distributed Add this issue/PR to distributed oncall triage queue label May 20, 2024

yf225 added the module: c10d Issues/PRs related to collective communications and process groups label May 20, 2024

wconstab added the high priority label May 20, 2024

pytorch-bot bot added the triage review label May 20, 2024

wconstab self-assigned this May 20, 2024

wconstab added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label May 20, 2024

wconstab mentioned this issue May 21, 2024

[c10d] fix excepthook crash on exc after destroy_process_group #126739

Closed

pytorchmergebot closed this as completed in 8c9d332 May 21, 2024
