Issues: microsoft/DeepSpeed
[BUG] deepspeed overlap_comm data race
labels: bug, training · #5545 · opened May 18, 2024 by yangyihang-bytedance
[Question] How to run Mixtral inference on multiple nodes?
labels: bug, inference · #5544 · opened May 17, 2024 by leachee99
[REQUEST] DeepSpeed-Ulysses with pure DeepSpeed ZeRO
labels: enhancement · #5542 · opened May 16, 2024 by ppengtang
[BUG] Zero3: Gather the params for inference (huggingface_language_model.generate) at the end of one epoch and re-partition them for the next epoch's training
labels: bug, training · #5539 · opened May 15, 2024 by Coobiw
[BUG] Version >0.14.0 leads to RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
labels: bug, training · #5538 · opened May 15, 2024 by pacman100
[BUG] FlopsProfiler upsample FLOPs computation bug
labels: bug, training · #5537 · opened May 15, 2024 by xgbj
[BUG] CUDA error in pipeline parallelism
labels: bug, training · #5536 · opened May 15, 2024 by sunkun1997
[BUG] fp_quantizer is not built correctly with a non-JIT installation
labels: bug, inference · #5535 · opened May 14, 2024 by twaka
[BUG] AttributeError: module 'torch.nn.functional' has no attribute 'scaled_dot_product_attention'
labels: bug, compression · #5534 · opened May 14, 2024 by harborsarah
[BUG] Zero3: Post-backward hook is not triggered for submodules whose inputs have .requires_grad=False
labels: bug, training · #5524 · opened May 12, 2024 by deepcharm
[BUG] Why were the results inconsistent across two identical tests with zero2 + overlap_comm?
labels: bug, training · #5523 · opened May 11, 2024 by Suparjie
[BUG] Why does ZeroOneAdam OOM more easily than the Adam optimizer?
labels: bug, training · #5521 · opened May 10, 2024 by npuichigo
[BUG] BertLMHeadModel.from_pretrained hangs when using ZeRO-3 / ZeRO-3 offload
labels: bug, training · #5520 · opened May 10, 2024 by XenonLamb
[BUG] Uneven work distribution caused by get_shard_size changes
#5515 · opened May 9, 2024 by oelayan7
[BUG] When initializing model_engine, if an mpu is specified, it can lead to an excessively large checkpoint size, and the checkpoint may not be convertible through the zero_to_fp32.py script
labels: bug, training · #5514 · opened May 9, 2024 by Kwen-Chen
[REQUEST] Launcher mode with SSH bypass
labels: enhancement · #5510 · opened May 8, 2024 by dogacancolak-kensho
[BUG] Mismatch between dtype settings in model and ds_config results in NaN loss
labels: bug, training · #5509 · opened May 8, 2024 by Taiki-azrs
[REQUEST] Enable both CPU and NVMe offload for the optimizer
labels: enhancement · #5508 · opened May 8, 2024 by shanhx2000
[BUG] Unexpected high memory usage (OOM) when fine-tuning Llama2-7B
labels: bug, training · #5507 · opened May 8, 2024 by shanhx2000
[BUG] 3 GPUs do not perform as well as expected compared with 2 GPUs; NVIDIA vs. AMD performance; FlashAttention not supported on AMD GPUs
labels: bug, training · #5503 · opened May 6, 2024 by 0781532
[BUG] Jamba (Mamba+MoE) + ZeRO3 + LoRA training hangs
labels: bug, training · #5502 · opened May 6, 2024 by hijkzzz
[REQUEST] Add documentation on how to run fast inference of transformers models with ZeRO-3
labels: enhancement · #5498 · opened May 3, 2024 by lewtun
[BUG] import deepspeed, MissingCUDAException
labels: bug, build · #5497 · opened May 3, 2024 by zsaladin
[BUG] Memory Leak in Stage 2 Optimizer
labels: bug, training · #5496 · opened May 2, 2024 by chiragjn