Load pretrained file error in DDP mode training #2321
Comments
[pip3] flake8==3.7.9 (this is my software version).
Hi, the traceback of the error is not full. Could you please post the full error?
I am having a similar issue with DDP. This is my YAML:
This is what I add as a recoverable in my code (during init_optimizers):
Traceback:
As we can see, it isn't able to save all of the files. When not using DDP, or even when using DDP with a single process, it works fine. Env:
CC @TParcollet and @pplantinga
Hi @avishaiElmakies, adding so many recoverables this way, and especially in that spot of the code (the init_optimizers function), is a bit unconventional. It's better to declare them via the checkpointer in the YAML. Can we see the train.py associated with this error? Many thanks.
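For reference, a minimal sketch of that pattern in Python (in a recipe the checkpointer is usually declared in the YAML instead; the model, counter, and save directory below are illustrative placeholders, not from this thread):

    import torch
    from speechbrain.utils.checkpoints import Checkpointer
    from speechbrain.utils.epoch_loop import EpochCounter

    # Hypothetical stand-ins for the recoverables discussed above.
    model = torch.nn.Linear(10, 10)
    epoch_counter = EpochCounter(limit=10)

    # Declare all recoverables in one place, rather than registering
    # them from inside init_optimizers.
    checkpointer = Checkpointer(
        checkpoints_dir="results/save",
        recoverables={"model": model, "counter": epoch_counter},
    )
    checkpointer.save_checkpoint()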
@TParcollet Thanks for the fast reply!
Those are the files I use. Emotion_finetune.py is the main file; Emotion is a package I created that has most of the logic for the fine-tuning.
Side note while waiting for the .py: the new version of SpeechBrain (v1) on the develop branch can handle any model originating from HF nicely from the lobes.
@TParcollet I added the code in the code.zip file (in my previous message). I will look at 1.0; I used pip to install the library, but if v1 is more appropriate for me, I will try to use it.
Alright, I won't be able to do a detailed review of that code right now because it is highly customised. I will, however, give one piece of advice: there is no reason to use such a compositional approach; you could build a simple, standard SpeechBrain recipe doing what this emotion_finetune.py script is doing. It would be much cleaner and easier to debug. But this is most likely not a DDP problem.
@TParcollet
You don't need v1 for that, but updating to the latest version is always the best solution, especially since this one is a big change. I am not sure I fully understand what you are trying to do here, but if it's fine-tuning a model through one of our interfaces, that's not the best possible way (you can do it, as you are trying, but not doing something right might lead to errors). Interfaces are built for inference, not for fine-tuning (in theory). If you want to fine-tune, it's better to get the checkpoint and build an actual training recipe with a .yaml and a .py using the Pretrainer class:
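A minimal sketch of that Pretrainer pattern (the model and the checkpoint path below are illustrative assumptions, not the actual model from this thread):

    import torch
    from speechbrain.utils.parameter_transfer import Pretrainer

    # Hypothetical model that should receive the pretrained weights.
    model = torch.nn.Linear(10, 10)

    # Fetch the checkpoint (from HF or a local path) and load it into `model`.
    pretrainer = Pretrainer(
        collect_in="pretrained_ckpt",
        loadables={"model": model},
        paths={"model": "speechbrain/some-model/model.ckpt"},  # illustrative
    )
    pretrainer.collect_files()
    pretrainer.load_collected()

In a recipe, this same object is normally declared in the .yaml and invoked from the .py before training starts.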
@TParcollet The fine-tuned model is this one: I used the code there as inspiration for my code (for how to fine-tune a model). I also looked at the instructions on how to fine-tune a model. I will look into what you said; if you have any advice, I would appreciate it very much. EDIT: I should probably add that, when fine-tuning the fine-tuned model, I would like to change the output MLP.
Maybe @BenoitWang can provide some tips here? :-)
I tried making the save run on the main process only (run_on_main), but it is still not working.
I am getting an error:
I will probably try version 1.0 from develop tomorrow, but from what I saw in the code, I am not sure this will fix it.
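For reference, the usual SpeechBrain pattern for main-process-only work under DDP is run_on_main; a minimal sketch, where save_fn and the path are hypothetical:

    import torch
    from speechbrain.utils.distributed import run_on_main

    # Hypothetical saving function; run_on_main executes it on rank 0 only
    # and synchronizes the other DDP processes afterwards.
    def save_fn(path):
        torch.save({"demo": True}, path)

    run_on_main(save_fn, kwargs={"path": "demo_ckpt.pt"})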
Hi,
Describe the bug
Folder: speechbrain/recipes/LibriSpeech/ASR/seq2seq
python -m torch.distributed.launch --nproc_per_node=4 train.py hparams/train_BPE_5000.yaml --device='cuda'
When I ran the above script, a training error occurred.
This issue happens only in DDP-mode training.
Thanks.
The bug message is here:
AssertionError: HPU device module is not loaded
[2024-01-03 15:50:44,493] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 5495) of binary: /folderbin/python
Traceback (most recent call last):
  File "/folderlib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/folderlib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/folderlib/python3.10/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/folderlib/python3.10/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/folderlib/python3.10/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/folderlib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/folderlib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/folderlib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
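(Side note: torch.distributed.launch, which appears in this traceback, has been deprecated in recent PyTorch releases in favour of torchrun; assuming the same recipe, the equivalent launch would be: torchrun --nproc_per_node=4 train.py hparams/train_BPE_5000.yaml --device='cuda')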
Expected behaviour
To Reproduce
No response
Environment Details
No response
Relevant Log Output
No response
Additional Context
No response