[BUG] Error with nn.transformers layer size with Zero stage 3 #5543

Open
q737645224 opened this issue May 17, 2024 · 1 comment
Labels
bug Something isn't working training

Comments

@q737645224

Describe the bug
A clear and concise description of what the bug is.
The parameter sizes of the nn.Transformer layer do not match, and the parameters cannot be loaded when ZeRO stage 3 is used. With stage 2, the parameters load normally.
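
A minimal diagnostic sketch, assuming the mismatch comes from how ZeRO stage 3 partitions parameters. Under stage 3 each rank typically holds only an empty placeholder tensor locally, and DeepSpeed records the full shape in the parameter's `ds_shape` attribute, so comparing a checkpoint against `param.shape` alone can report a size mismatch that is not real. The helper `find_shape_mismatches` and the stand-in objects are hypothetical and make no DeepSpeed or PyTorch calls; they only illustrate the comparison.

```python
from types import SimpleNamespace


def find_shape_mismatches(named_params, state_dict):
    """Return {name: (model_shape, checkpoint_shape)} for entries that disagree.

    Hypothetical helper: `named_params` yields (name, param) pairs and
    `state_dict` maps names to objects exposing a `.shape` attribute.
    """
    mismatches = {}
    for name, param in named_params:
        if name not in state_dict:
            continue
        # Assumption: a ZeRO stage 3 parameter carries its full shape in
        # `ds_shape`; fall back to `.shape` for ordinary parameters.
        model_shape = tuple(getattr(param, "ds_shape", param.shape))
        ckpt_shape = tuple(state_dict[name].shape)
        if model_shape != ckpt_shape:
            mismatches[name] = (model_shape, ckpt_shape)
    return mismatches


# Stand-ins for illustration only (not real tensors).
checkpoint = {"encoder.weight": SimpleNamespace(shape=(4, 4))}
partitioned = SimpleNamespace(shape=(0,), ds_shape=(4, 4))  # stage 3 style placeholder
plain_empty = SimpleNamespace(shape=(0,))                   # no ds_shape recorded

ok = find_shape_mismatches([("encoder.weight", partitioned)], checkpoint)
bad = find_shape_mismatches([("encoder.weight", plain_empty)], checkpoint)
```

With a real model this could be called as `find_shape_mismatches(model.named_parameters(), checkpoint_state_dict)`. If it returns an empty dict under stage 3, the full shapes agree and the loading path, rather than the checkpoint contents, is the more likely place to look.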

To Reproduce
Steps to reproduce the behavior:

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior
A clear and concise description of what you expected to happen.
image

ds_report output
Please run ds_report to give us details about your setup.
image

Screenshots
If applicable, add screenshots to help explain your problem.

System info (please complete the following information):

  • OS: [e.g. Ubuntu 18.04]
  • GPU count and types [e.g. two machines with x8 A100s each]
  • Interconnects (if applicable) [e.g., two machines connected with 100 Gbps IB]
  • Python version
  • Any other relevant info about your setup

Launcher context
Are you launching your experiment with the deepspeed launcher, MPI, or something else?

Docker context
Are you using a specific docker image that you can share?

Additional context
Add any other context about the problem here.

@q737645224 q737645224 added bug Something isn't working training labels May 17, 2024
@loadams
Contributor

loadams commented May 17, 2024

Can you please add a title?

@loadams loadams changed the title [BUG] [BUG] Error with nn.transformers layer size with Zero stage 3 May 22, 2024