
Trouble adding ignite.distributed to my baseline training process #2032

Answered by ydcjeff
aksg87 asked this question in Q&A

Hi @aksg87, it looks like you are using the spawn method to run distributed training, and the error comes from the PyTorch DataLoader being unable to pickle a SwigPyObject. Make sure everything the DataLoader (and the Dataset it wraps) holds is picklable.
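As a rough sketch (not your exact code), the usual workaround is to keep only picklable things, such as file paths, on the Dataset and create the SWIG-backed object lazily inside __getitem__. The dataset name and the assumption that the SwigPyObject comes from a SimpleITK image reader are illustrative only:

import torch
from torch.utils.data import Dataset

class CTScanDataset(Dataset):
    # Hypothetical dataset: store only picklable file paths, not SWIG-backed
    # image handles, so spawned workers can pickle the dataset safely.
    def __init__(self, paths):
        self.paths = list(paths)  # plain strings pickle fine

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        import SimpleITK as sitk  # SWIG-based; load lazily inside the worker
        image = sitk.ReadImage(self.paths[idx])
        array = sitk.GetArrayFromImage(image)  # convert to a plain numpy array
        return torch.from_numpy(array).float()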

Also, please try the launch method to run distributed training; it is faster than the spawn method.

# spawn method
python train.py --args (training args)
# launch method
python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py --backend nccl --args (training args)

If you use the launch method, calling the training loop becomes:

import ignite.distributed as idist

if __name__ == "__main__":

    with idist.Parallel(backend=backend) as parallel:  # no need for `nproc_per_node` as it is handled by `torch.distributed.launch`
        parallel.run(training, config)  # run your training function, e.g. training(local_rank, config)
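For comparison, here is a minimal sketch of the spawn variant, assuming a hypothetical training(local_rank, config) function; in this case nproc_per_node must be passed to idist.Parallel because no external launcher creates the processes:

import ignite.distributed as idist

def training(local_rank, config):
    # your usual training loop; idist provides rank/device helpers
    print(idist.get_rank(), idist.device())

if __name__ == "__main__":
    config = {}
    # spawn variant: idist.Parallel creates the worker processes itself
    with idist.Parallel(backend="nccl", nproc_per_node=8) as parallel:
        parallel.run(training, config)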

Replies: 2 comments, 8 replies

Answer selected by aksg87