
load pretrained file error in ddp mode training #2321

Open
noname111234 opened this issue Jan 3, 2024 · 16 comments
Labels
bug Something isn't working

Comments

@noname111234

noname111234 commented Jan 3, 2024

Describe the bug

Folder: speechbrain/recipes/LibriSpeech/ASR/seq2seq
python -m torch.distributed.launch --nproc_per_node=4 train.py hparams/train_BPE_5000.yaml --device='cuda'
When I ran the above script, a training error occurred.

This issue only happens in DDP-mode training.
Thanks.

The error message is:
AssertionError: HPU device module is not loaded
[2024-01-03 15:50:44,493] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 5495) of binary: /folderbin/python
Traceback (most recent call last):
  File "/folderlib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/folderlib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/folderlib/python3.10/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/folderlib/python3.10/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/folderlib/python3.10/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/folderlib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/folderlib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/folderlib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

Expected behaviour

To Reproduce

No response

Environment Details

No response

Relevant Log Output

No response

Additional Context

No response

@noname111234 noname111234 added the bug Something isn't working label Jan 3, 2024
@noname111234
Author

[pip3] flake8==3.7.9
[pip3] numpy==1.26.3
[pip3] torch==2.1.2
[pip3] torchaudio==2.1.2
[pip3] torchvision==0.16.2
[pip3] triton==2.1.0
[conda] numpy 1.26.3 pypi_0 pypi
[conda] torch 2.1.2 pypi_0 pypi
[conda] torchaudio 2.1.2 pypi_0 pypi
[conda] torchvision 0.16.2 pypi_0 pypi
[conda] triton 2.1.0 pypi_0 pypi

These are my software versions.

@TParcollet
Collaborator

Hi, the traceback of the error is not complete. Could you please post the full error?

@avishaiElmakies

avishaiElmakies commented Jan 16, 2024

I am having a similar issue with DDP.
I think that DDP isn't able to save all of the recoverables in time before it reaches a point where it needs to load them (e.g. if you have 1 epoch and then evaluate the brain right after).
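(A minimal sketch of the kind of guard this describes, assuming the usual fit-then-evaluate structure of a SpeechBrain training script; whether it helps depends on the recipe. ddp_barrier comes from speechbrain.utils.distributed, everything else below is illustrative.)

from speechbrain.utils.distributed import ddp_barrier

# brain, hparams, train_data, valid_data, test_data come from the usual recipe setup
brain.fit(
    hparams["epoch_counter"],
    train_data,
    valid_data,
    train_loader_kwargs=hparams["dataloader_options"],
    valid_loader_kwargs=hparams["dataloader_options"],
)

# Make every rank wait until the checkpoint written at the end of training
# exists on disk before evaluate() tries to recover it on all processes.
ddp_barrier()

brain.evaluate(test_data, test_loader_kwargs=hparams["test_dataloader_opts"])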

This is my YAML:


# Seed needs to be set at top of yaml, before objects with parameters are made
seed: 12
__set_seed: !apply:torch.manual_seed [12]
output_folder: finetune_e_wavlm_large
# eder_file: finetune_emo/eder.txt
save_folder: &id008 !ref <output_folder>/save
train_log: &id009 !ref <output_folder>/train_log.txt
local-rank: 0
distributed_launch: false

## some important constants
sample_rate: 16000
download_base_path: "../../models"
dataset_json: ../../data_jsons/emotion_finetune_dataset.json
split_ratio: [0.8, 0.1, 0.1]

window_length: 1 # win_len = 0.02 * 1 = 0.02s
stride: 1 # stride = 0.02 * 1 = 0.02s

encoder_dim: 1024
# Outputs
out_n_neurons: 4  # BPE size, index(blank/eos/bos) = 0

# Dataloader options
# With data_parallel batch_size is split into N jobs
# With DDP batch_size is multiplied by N jobs
dataloader_options:
  batch_size: 2
  shuffle: true
  num_workers: 2    # 2 on linux but 0 works on windows
  drop_last: false
  pin_memory: true
  collate_fn: !name:speechbrain.dataio.batch.PaddedBatch

test_dataloader_opts:
  batch_size: 2
  collate_fn: !name:speechbrain.dataio.batch.PaddedBatch

epoch_counter: &id007 !new:speechbrain.utils.epoch_loop.EpochCounter

  limit: 1


add_noise_aug: &id020 !new:speechbrain.processing.speech_augmentation.AddNoise
  snr_low: 10
  snr_high: 20
  mix_prob: 0.5 


drop_freq_aug: &id021 !new:speechbrain.processing.speech_augmentation.DropFreq
  drop_freq_high: 0.5
  drop_count_low: 2
  drop_count_high: 4
  drop_prob : 0.5



drop_chunk_aug: &id022 !new:speechbrain.processing.speech_augmentation.DropChunk
  drop_length_low : 1600
  drop_length_high : 16000
  drop_count_low: 1
  drop_count_high: 2


augmentation: !new:speechbrain.processing.augmentation.Augmenter
  parallel_augment: false
  concat_original: false
  min_augmentations: 0
  max_augmentations: 3
  repeat_augment: 1
  noise_aug: *id020
  freq_aug: *id021
  chunk_aug: *id022

input_norm: &id001 !new:speechbrain.processing.features.InputNormalization
  norm_type: sentence
  std_norm: false


wav2vec2: &id002 !new:speechbrain.lobes.models.huggingface_wav2vec.HuggingFaceWav2Vec2
  source: microsoft/wavlm-large
  output_norm: true
  freeze: false
  freeze_feature_extractor: true
  save_path: !ref <download_base_path>/wavlm-large

avg_pool: !new:speechbrain.nnet.pooling.Pooling1d
  pool_type: avg
  kernel_size: 1
  stride: 1
  ceil_mode: true

output_mlp: &id003 !new:speechbrain.nnet.linear.Linear
  input_size: !ref <encoder_dim>
  n_neurons: !ref <out_n_neurons>
  bias: false

log_softmax: !new:speechbrain.nnet.activations.Softmax
  apply_log: true

compute_cost: !name:emotion.finetune.utils.weighted_nll_loss

# can be used with compute_cost; probably better for DDP to work like this.
# should also know the labels in advance; can work with different labels and label
# encoders (a sketch of such a weighted loss follows this YAML)
weights : 
  "s": 8.0
  "h": 20.0
  "n": 1.0
  "a": 20.0

modules:
  input_norm: *id001
  wav2vec2: *id002
  output_mlp: *id003

opt_class: !name:torch.optim.Adam
  lr: 0.0001

wav2vec2_opt_class: !name:torch.optim.Adam
  lr: 0.00001

lr_annealing: &id005 !new:speechbrain.nnet.schedulers.NewBobScheduler
  initial_value: 0.0001
  improvement_threshold: 0.0025
  annealing_factor: 0.8
  patient: 0

lr_annealing_wav2vec2: &id006 !new:speechbrain.nnet.schedulers.NewBobScheduler
  initial_value: 0.00001
  improvement_threshold: 0.0025
  annealing_factor: 0.9
  patient: 0


checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
  checkpoints_dir: *id008
  recoverables:
    scheduler_model: *id005
    scheduler_wav2vec: *id006
    counter: *id007
train_logger: !new:speechbrain.utils.train_logger.FileTrainLogger
  save_file: *id009

error_stats: !name:speechbrain.utils.metric_stats.ClassificationStats
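For illustration only, a hypothetical sketch of what a class-weighted NLL loss like the weighted_nll_loss referenced above could look like (the name and signature are assumptions; the actual implementation lives in the poster's emotion.finetune.utils):

import torch.nn.functional as F

def weighted_nll_loss(log_probs, targets, class_weights):
    # log_probs:     (batch, n_classes) log-probabilities from the log-softmax layer
    # targets:       (batch,) integer class ids
    # class_weights: 1-D tensor of per-class weights, e.g. built from the
    #                `weights` mapping above via the label encoder
    return F.nll_loss(log_probs, targets, weight=class_weights)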

This is what I add as recoverables in my code (during init_optimizers):

if self.checkpointer is not None:
    self.checkpointer.add_recoverable("wav2vec2_opt", self.wav2vec2_optimizer)
    self.checkpointer.add_recoverable("optimizer", self.optimizer)
    self.checkpointer.add_recoverable("input_norm", self.modules.input_norm)
    self.checkpointer.add_recoverable("output_mlp", self.modules.output_mlp)
    self.checkpointer.add_recoverable("wav2vec2", self.modules.wav2vec2)

Traceback:


Traceback (most recent call last):
  File "/cs/labs/oabend/avishai.elma/src/emotion_src/emotion_finetune.py", line 63, in <module>
    main()
  File "/cs/labs/oabend/avishai.elma/src/emotion_src/emotion_finetune.py", line 52, in main
    brain.evaluate(
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/speechbrain/core.py", line 1531, in evaluate
    self.on_evaluate_start(max_key=max_key, min_key=min_key)
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/speechbrain/core.py", line 994, in on_evaluate_start
    self.checkpointer.recover_if_possible(
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/speechbrain/utils/checkpoints.py", line 891, in recover_if_possible
    self.load_checkpoint(chosen_ckpt, device)
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/speechbrain/utils/checkpoints.py", line 904, in load_checkpoint
    self._call_load_hooks(checkpoint, device)
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/speechbrain/utils/checkpoints.py", line 1039, in _call_load_hooks
    default_hook(obj, loadpath, end_of_epoch, device)
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/speechbrain/utils/checkpoints.py", line 96, in torch_recovery
    obj.load_state_dict(torch.load(path, map_location=device), strict=True)
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/serialization.py", line 1028, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/serialization.py", line 1246, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
EOFError: Ran out of input
[2024-01-16 18:02:11,451] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2696972 closing signal SIGTERM
[2024-01-16 18:02:29,572] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 2696973) of binary: /cs/labs/oabend/avishai.elma/lab_env_v2/bin/python
Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
emotion_finetune.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-01-16_18:02:11
  host      : drape-02.cs.huji.ac.il
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2696973)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

[image attachment]

As we can see, it isn't able to save all of the files.
Sometimes the error is "EOFError: Ran out of input" and sometimes it is a KeyError.

When not using DDP, or even when using DDP with just one process, everything works fine.

env:

SpeechBrain system description
==============================
Python version:
3.9.2 (default, Feb 28 2021, 17:03:44) 
[GCC 10.2.1 20210110]
==============================
Installed Python packages:
aiohttp==3.9.1
aiosignal==1.3.1
alembic==1.13.0
AMFM-decompy==1.0.11
annotated-types==0.6.0
antlr4-python3-runtime==4.8
asteroid-filterbanks==0.4.0
asttokens==2.4.1
async-timeout==4.0.3
attrs==23.1.0
audioread==3.0.1
bitarray==2.9.0
blis==0.7.11
catalogue==2.0.10
certifi==2022.12.7
cffi==1.16.0
charset-normalizer==2.1.1
click==8.1.7
cloudpathlib==0.16.0
colorama==0.4.6
coloredlogs==15.0.1
colorlog==6.8.0
comm==0.2.0
confection==0.1.4
contourpy==1.2.0
cycler==0.12.1
cymem==2.0.8
Cython==3.0.6
datasets==2.15.0
debugpy==1.8.0
decorator==5.1.1
dill==0.3.7
docopt==0.6.2
einops==0.7.0
en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl#sha256=86cc141f63942d4b2c5fcee06630fd6f904788d2f0ab005cce45aadb8fb73889
exceptiongroup==1.2.0
executing==2.0.1
fairseq @ git+https://github.com/pytorch/fairseq@da8fb630880d529ab47e53381c30ddc8ad235216
filelock==3.9.0
flatbuffers==23.5.26
fonttools==4.46.0
frozenlist==1.4.0
fsspec==2023.10.0
greenlet==3.0.2
huggingface-hub==0.19.4
humanfriendly==10.0
hydra-core==1.0.7
HyperPyYAML==1.2.2
idna==3.4
importlib-metadata==7.0.0
importlib-resources==6.1.1
inflect==7.0.0
iniconfig==2.0.0
ipykernel==6.27.1
ipython==8.18.1
jedi==0.19.1
Jinja2==3.1.2
joblib==1.3.2
julius==0.2.7
jupyter_client==8.6.0
jupyter_core==5.5.1
kiwisolver==1.4.5
langcodes==3.3.0
lazy_loader==0.3
librosa==0.10.0
lightning==2.1.2
lightning-utilities==0.10.0
llvmlite==0.36.0
lxml==4.9.3
Mako==1.3.0
markdown-it-py==3.0.0
MarkupSafe==2.1.3
matplotlib==3.8.2
matplotlib-inline==0.1.6
mdurl==0.1.2
mpmath==1.3.0
msgpack==1.0.7
multidict==6.0.4
multiprocess==0.70.15
murmurhash==1.0.10
nest-asyncio==1.5.8
networkx==3.0
npy-append-array==0.9.16
numba==0.53.0
numpy==1.22.0
omegaconf==2.0.6
onnxruntime==1.16.3
optuna==3.5.0
packaging==23.2
pandas==2.1.4
parso==0.8.3
pexpect==4.9.0
Pillow==9.3.0
platformdirs==4.0.0
pluggy==1.3.0
pooch==1.8.0
portalocker==2.8.2
preshed==3.0.9
primePy==1.3
prompt-toolkit==3.0.43
protobuf==4.25.1
psutil==5.9.7
ptyprocess==0.7.0
pure-eval==0.2.2
pyannote.audio==3.1.1
pyannote.core==5.0.0
pyannote.database==5.0.1
pyannote.metrics==3.2.1
pyannote.pipeline==3.0.1
pyarrow==14.0.1
pyarrow-hotfix==0.6
pycparser==2.21
pydantic==2.5.2
pydantic_core==2.14.5
Pygments==2.17.2
pyparsing==3.1.1
pytest==7.4.3
python-dateutil==2.8.2
pytorch-lightning==2.1.2
pytorch-metric-learning==2.3.0
pytz==2023.3.post1
PyYAML==6.0.1
pyzmq==25.1.2
regex==2023.10.3
requests==2.28.1
resampy==0.4.2
rich==13.7.0
ruamel.yaml==0.18.5
ruamel.yaml.clib==0.2.8
sacrebleu==2.4.0
safetensors==0.4.1
scikit-learn==1.3.2
scipy==1.11.4
semver==3.0.2
sentencepiece==0.1.99
shellingham==1.5.4
simplejson==3.19.2
six==1.16.0
smart-open==6.4.0
sortedcontainers==2.4.0
soundfile==0.12.1
soxr==0.3.7
spacy==3.7.2
spacy-legacy==3.0.12
spacy-loggers==1.0.5
speechbrain==0.5.16
SQLAlchemy==2.0.23
srsly==2.4.8
stack-data==0.6.3
sympy==1.12
tabulate==0.9.0
tensorboardX==2.6.2.2
-e git+https://github.com/facebookresearch/textlesslib.git@ba33d669d8284b4f7bfe81e7384e83ab799fe384#egg=textless
tgt==1.5
thinc==8.2.1
threadpoolctl==3.0.0
tokenizers==0.15.0
tomli==2.0.1
torch==2.1.1+cu118
torch-audiomentations==0.11.0
torch-pitch-shift==1.2.4
torchaudio==2.1.1+cu118
torchcrepe==0.0.22
torchmetrics==1.2.1
torchvision==0.16.1+cu118
tornado==6.4
tqdm==4.66.1
traitlets==5.14.0
transformers==4.36.1
triton==2.1.0
typer==0.9.0
typing_extensions==4.9.0
tzdata==2023.3
Unidecode==1.3.7
urllib3==1.26.13
wasabi==1.1.2
wcwidth==0.2.12
weasel==0.3.4
wget==3.2
xxhash==3.4.1
yarl==1.9.4
zipp==3.17.0
==============================
Could not get git revision
==============================
CUDA version:
11.8

@Adel-Moumen
Collaborator

CC @TParcollet and @pplantinga

@TParcollet
Collaborator

Hi @avishaiElmakies, adding so many recoverables this way, and especially in that spot of the code (the init_optimizers function), is a bit unconventional. It's better to declare them on the checkpointer in the YAML. Can we see the train.py associated with this error? Many thanks.
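For reference, a minimal sketch of what that could look like in the YAML above, reusing the entries it already defines (the two optimizers would still be added at runtime, since they are only instantiated from the !name: classes inside init_optimizers):

checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
  checkpoints_dir: !ref <save_folder>
  recoverables:
    input_norm: !ref <input_norm>
    wav2vec2: !ref <wav2vec2>
    output_mlp: !ref <output_mlp>
    scheduler_model: !ref <lr_annealing>
    scheduler_wav2vec: !ref <lr_annealing_wav2vec2>
    counter: !ref <epoch_counter>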

@avishaiElmakies

avishaiElmakies commented Jan 16, 2024

@TParcollet Thanks for the fast reply!
I will admit I used someone else's code that added the optimizers there, so I thought it was fine. Is there a better place to add them?
My code doesn't always have the modules in the YAML (I want my code to be able to take a huggingface.co model and use it, specifically the model used for emotion diarization, so I would like to have that flexibility).
I will try to add the files shortly.

@avishaiElmakies

Code.zip

These are the files I use. emotion_finetune.py is the main file; emotion is a package I created that contains most of the logic for the fine-tuning.

@TParcollet
Collaborator

Side note while waiting for the .py: the new version of SpeechBrain (v1) on the develop branch can handle any model originating from HF nicely from the lobes.

@avishaiElmakies

avishaiElmakies commented Jan 17, 2024

@TParcollet I added the code in the Code.zip file (in my previous message).

I will look at v1.0. I used pip to install the library, but if v1 is more appropriate for me I will try to use it.

@TParcollet
Collaborator

Alright, I won't be able to do a detailed review of that code right now because it is highly customised. I will however give some advice: there is no reason to use such a compositional approach; you could build a simple, standard SpeechBrain recipe that does what this emotion_finetune.py script is doing. It would be much, much cleaner and easier to debug. But this is most likely not a DDP problem.

@avishaiElmakies

@TParcollet
Do I need to use speechbrain v1 for that?
Do you have an example?

@TParcollet
Collaborator

You don't need v1 for that, but updating to the latest version is always the best solution, especially since this one is a big change. I am not sure I fully understand what you are trying to do here, but if it's fine-tuning a model through one of our interfaces, that's not the best possible way (you can do it, as you are trying, but not doing it right can lead to errors). Interfaces are built for inference, not for fine-tuning (in theory). If you want to fine-tune, it's better to get the checkpoint and build an actual training recipe with a .yaml and a .py using the Pretrainer class.
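For reference, a minimal sketch of that pattern; the path and the checkpoint file name below are placeholders, not taken from this issue:

pretrained_path: /path/to/checkpoint_or_hf_repo  # placeholder

pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
  collect_in: !ref <save_folder>
  loadables:
    wav2vec2: !ref <wav2vec2>
  paths:
    wav2vec2: !ref <pretrained_path>/wav2vec2.ckpt

The training script would then call something like sb.utils.distributed.run_on_main(hparams["pretrainer"].collect_files) and hparams["pretrainer"].load_collected() before fit().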

Now, for your use case, you could try forcing the checkpointer to save a checkpoint somewhere at some point in your training recipe. Do not forget to call it within the run_on_main environment, or somewhere that is only executed in the main process, if you want to use DDP. Unfortunately, your code is a bit too complex in its structure for me to help. Maybe others may want to have a try if they have time @Adel-Moumen

@avishaiElmakies

avishaiElmakies commented Jan 17, 2024

@TParcollet
I will try to explain what I want to do.
I need to fine-tune a wavlm model on emotion data. My main requirement is the ability to fine-tune two models: wavlm itself from scratch on the data, and an already fine-tuned model, and then see which is better. I hoped my code would let me do this by only changing the YAML.

The finetuned model is this one:
https://huggingface.co/speechbrain/emotion-diarization-wavlm-large

And I used the code there as inspiration for my code (on how to fine-tune a model):
https://www.dropbox.com/sh/woudm1v31a7vyp5/AADAMxpQOXaxf8E_1hX202GJa?dl=0

I also looked at the instructions on how to fine-tune a model.

I will look into what you said. If you have any more advice I would appreciate it very much.

EDIT: I should probably add that when fine-tuning the already fine-tuned model, I would like to change the output MLP.

@Adel-Moumen
Collaborator

Maybe @BenoitWang can provide some tips here? :-)

@avishaiElmakies

avishaiElmakies commented Jan 17, 2024

I tried making the save run via run_on_main, but it is still not working.

if stage == sb.Stage.VALID:
    old_lr, new_lr = self.hparams.lr_annealing(1 - stats["accuracy"])
    sb.nnet.schedulers.update_learning_rate(self.optimizer, new_lr)

    old_lr_wav2vec2, new_lr_wav2vec2 = self.hparams.lr_annealing_wav2vec2(1 - stats["accuracy"])
    sb.nnet.schedulers.update_learning_rate(self.wav2vec2_optimizer, new_lr_wav2vec2)

    meta["lr"] = old_lr
    meta["lr_wav2vec2"] = old_lr_wav2vec2
    self.hparams.train_logger.log_stats(stats_meta=meta, valid_stats=stats)
    if epoch % 1 == 0:  # TODO: change this to 3
        checkpointer_meta = stats.copy()
        checkpointer_meta.pop("confusion_matrix", None)
        sb.utils.distributed.run_on_main(
            self.checkpointer.save_and_keep_only,
            kwargs={
                "meta": checkpointer_meta,
                "max_keys": ["accuracy"],
                "num_to_keep": 1,
            },
        )
        print("Saved checkpoint")

I am getting an error:

Traceback (most recent call last):
  File "/cs/labs/oabend/avishai.elma/src/emotion_src/emotion_finetune.py", line 63, in <module>
    main()
  File "/cs/labs/oabend/avishai.elma/src/emotion_src/emotion_finetune.py", line 46, in main
    brain = finetune(modules, hparams, datasets,class_weights=class_weights, run_opts=run_opts,
  File "/cs/labs/oabend/avishai.elma/src/emotion_src/emotion/finetune/utils.py", line 136, in finetune
    emo_brain.fit(
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/speechbrain/core.py", line 1367, in fit
    self._fit_valid(valid_set=valid_set, epoch=epoch, enable=enable)
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/speechbrain/core.py", line 1281, in _fit_valid
    self.on_stage_end(Stage.VALID, avg_valid_loss, epoch)
  File "/cs/labs/oabend/avishai.elma/src/emotion_src/emotion/finetune/finetune_brain.py", line 119, in on_stage_end
    sb.utils.distributed.run_on_main(
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/speechbrain/utils/distributed.py", line 65, in run_on_main
    ddp_barrier()
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/speechbrain/utils/distributed.py", line 118, in ddp_barrier
    torch.distributed.barrier()
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: Detected mismatch between collectives on ranks. Rank 1 is running collective: CollectiveFingerPrint(SequenceNumber=6858OpType=BARRIER), but Rank 0 is running collective: CollectiveFingerPrint(SequenceNumber=0OpType=GATHER).Collectives differ in the following aspects:          Sequence number: 6858vs 0  Op type: BARRIERvs GATHER
[E ProcessGroupGloo.cpp:2810] [Rank 0]: Rank 1 failed to pass monitoredBarrier in 7200000 ms
[E ProcessGroupGloo.cpp:138] [Rank 0]: Ranks 1 failed to pass monitoredBarrier in 7200000 ms
speechbrain.core - Exception:
Traceback (most recent call last):
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/speechbrain/utils/distributed.py", line 60, in run_on_main
    func(*args, **kwargs)
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/speechbrain/utils/checkpoints.py", line 679, in save_and_keep_only
    self.save_checkpoint(
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/speechbrain/utils/checkpoints.py", line 586, in save_checkpoint
    torch.distributed.broadcast_object_list(communication_list, src=0)
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2603, in broadcast_object_list
    broadcast(object_sizes_tensor, src=src, group=group)
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1906, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: Detected mismatch between collectives on ranks. Rank 0 is running collective: CollectiveFingerPrint(SequenceNumber=6858, OpType=BROADCAST, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))), but Rank 1 is running collective: CollectiveFingerPrint(SequenceNumber=0OpType=REDUCE).Collectives differ in the following aspects:     Sequence number: 6858vs 0  Op type: BROADCASTvs REDUCE  Tensor Tensor shapes: 1vs   Tensor Tensor dtypes: Longvs   Tensor Tensor devices: TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))vs 

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/cs/labs/oabend/avishai.elma/src/emotion_src/emotion_finetune.py", line 63, in <module>
    main()
  File "/cs/labs/oabend/avishai.elma/src/emotion_src/emotion_finetune.py", line 46, in main
    brain = finetune(modules, hparams, datasets,class_weights=class_weights, run_opts=run_opts,
  File "/cs/labs/oabend/avishai.elma/src/emotion_src/emotion/finetune/utils.py", line 136, in finetune
    emo_brain.fit(
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/speechbrain/core.py", line 1367, in fit
    self._fit_valid(valid_set=valid_set, epoch=epoch, enable=enable)
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/speechbrain/core.py", line 1281, in _fit_valid
    self.on_stage_end(Stage.VALID, avg_valid_loss, epoch)
  File "/cs/labs/oabend/avishai.elma/src/emotion_src/emotion/finetune/finetune_brain.py", line 119, in on_stage_end
    sb.utils.distributed.run_on_main(
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/speechbrain/utils/distributed.py", line 62, in run_on_main
    ddp_barrier()
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/speechbrain/utils/distributed.py", line 118, in ddp_barrier
    torch.distributed.barrier()
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: [Rank 0]: Ranks 1 failed to pass monitoredBarrier in 7200000 ms
[2024-01-17 21:02:13,945] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2012414) of binary: /cs/labs/oabend/avishai.elma/lab_env_v2/bin/python
Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/cs/labs/oabend/avishai.elma/lab_env_v2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
emotion_finetune.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-01-17_21:02:13
  host      : firth-02.cs.huji.ac.il
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2012415)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-01-17_21:02:13
  host      : firth-02.cs.huji.ac.il
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2012414)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

I will probably try version 1.0 from the develop branch tomorrow, but from what I saw in the code I am not sure this will fix it.
I am also having a problem with some unused parameters, for which I will open a separate bug report.

@avishaiElmakies

avishaiElmakies commented Jan 18, 2024

Hi,
I updated my code to v1 and after the refactoring this seems to work fine again!
I would love it if someone could look at issue #2340.
