Issues: microsoft/DeepSpeed
[BUG] deepspeed overlap_comm data race
labels: bug, training · #5545 · opened May 18, 2024 by yangyihang-bytedance
[Question] How to run Mixtral inference on multiple nodes?
labels: bug, inference · #5544 · opened May 17, 2024 by leachee99
[REQUEST] DeepSpeed-Ulysses with pure DeepSpeed ZeRO
labels: enhancement · #5542 · opened May 16, 2024 by ppengtang
[BUG] Zero3: Gather the params for inference (huggingface_language_model.generate) at the end of one epoch and re-partition them for the next epoch's training
labels: bug, training · #5539 · opened May 15, 2024 by Coobiw
[BUG] Version >0.14.0 leads to RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
labels: bug, training · #5538 · opened May 15, 2024 by pacman100
[BUG] FlopsProfiler upsample FLOPs computation bug
labels: bug, training · #5537 · opened May 15, 2024 by xgbj
[BUG] CUDA error in pipeline parallelism
labels: bug, training · #5536 · opened May 15, 2024 by sunkun1997
[BUG] fp_quantizer is not built correctly with a non-JIT installation
labels: bug, inference · #5535 · opened May 14, 2024 by twaka
[BUG] AttributeError: module 'torch.nn.functional' has no attribute 'scaled_dot_product_attention'
labels: bug, compression · #5534 · opened May 14, 2024 by harborsarah
[BUG] Zero3: Post-backward hook is not triggered for submodules whose inputs have .requires_grad=False
labels: bug, training · #5524 · opened May 12, 2024 by deepcharm
[BUG] Why were the results inconsistent across two identical tests with zero2 + overlap_comm?
labels: bug, training · #5523 · opened May 11, 2024 by Suparjie
[BUG] Why does ZeroOneAdam OOM more easily than the Adam optimizer?
labels: bug, training · #5521 · opened May 10, 2024 by npuichigo
[BUG] BertLMHeadModel.from_pretrained hangs when using ZeRO-3 / ZeRO-3 offload
labels: bug, training · #5520 · opened May 10, 2024 by XenonLamb
[BUG] Uneven work distribution caused by get_shard_size changes
#5515 · opened May 9, 2024 by oelayan7
[BUG] When initializing model_engine, if an mpu is specified, it can lead to an excessively large checkpoint size, and the checkpoint may not be convertible through the zero_to_fp32.py script
labels: bug, training · #5514 · opened May 9, 2024 by Kwen-Chen
[REQUEST] Launcher mode with SSH bypass
labels: enhancement · #5510 · opened May 8, 2024 by dogacancolak-kensho
[BUG] Mismatch between dtype settings in model and ds_config results in NaN loss
labels: bug, training · #5509 · opened May 8, 2024 by Taiki-azrs
[REQUEST] Enable both CPU and NVMe offload for the optimizer
labels: enhancement · #5508 · opened May 8, 2024 by shanhx2000
[BUG] Unexpected high memory usage (OOM) when fine-tuning Llama2-7B
labels: bug, training · #5507 · opened May 8, 2024 by shanhx2000
[BUG] 3 GPUs do not perform as well as expected compared with 2 GPUs; NVIDIA vs. AMD performance; FlashAttention not supported on AMD GPUs
labels: bug, training · #5503 · opened May 6, 2024 by 0781532
[BUG] Jamba (Mamba+MoE) + ZeRO3 + LoRA training hangs
labels: bug, training · #5502 · opened May 6, 2024 by hijkzzz
[REQUEST] Add documentation on how to run fast inference of transformers models with ZeRO-3
labels: enhancement · #5498 · opened May 3, 2024 by lewtun
[BUG] import deepspeed, MissingCUDAException
labels: bug, build · #5497 · opened May 3, 2024 by zsaladin
[BUG] Memory Leak in Stage 2 Optimizer
labels: bug, training · #5496 · opened May 2, 2024 by chiragjn