Make the quantized data shape compatible with original tensor shape #5483

Open · wants to merge 44 commits into base: master

Conversation

sfc-gh-reyazda (Contributor)

This PR adds a simple modification to the FP-Quantizer module so that the quantized weight's shape is compatible with the original checkpoint weight's shape.
This addresses Snowflake-Labs/snowflake-arctic#16
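For illustration only, a minimal sketch of the idea (the helper and `quantize_fn` below are hypothetical, not the actual FP-Quantizer API): flatten, quantize, then view the quantized buffer with the original tensor shape so shape checks against the checkpoint keep passing.

```python
import torch

def quantize_keep_shape(weight: torch.Tensor, quantize_fn) -> torch.Tensor:
    """Hypothetical helper: quantize_fn is assumed to return a flat buffer with
    one element per input element; viewing it with the original shape keeps the
    quantized weight compatible with the checkpoint weight's shape."""
    orig_shape = weight.shape
    q_flat = quantize_fn(weight.reshape(-1))
    return q_flat.view(orig_shape)
```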

@sfc-gh-reyazda (Contributor, Author)

@microsoft-github-policy-service agree [company="{your company}"]

@microsoft-github-policy-service agree [company="Snowflake"]

@sfc-gh-reyazda (Contributor, Author)

@microsoft-github-policy-service agree company="your company"

@microsoft-github-policy-service agree company=Snowflake

sfc-gh-reyazda and others added 17 commits June 9, 2024 17:57
Previously this returned the optimizer name that was configured, rather than the
optimizer actually in effect after this function's processing; the two are not
always aligned.

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
…oft#5159)

Enhance testing: Skip fused_optimizer tests if not supported.

Added condition check to skip fused_optimizer tests if FusedAdam and
FusedLamb are not supported by the accelerator. This enhancement ensures
that the tests are appropriately skipped when the hardware configuration
does not support these optimizers, preventing potential issues.

Details:
- Introduced a condition check to determine support for FusedAdam and
FusedLamb.
- If not supported, fused_optimizer tests are skipped to improve test
reliability.
- Improved compatibility and stability across different hardware
configurations.
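
A minimal sketch of such a skip condition, assuming the op-builder compatibility check is available on the current accelerator (the helper and test below are illustrative, not the exact test code):

```python
import pytest

def fused_optimizers_supported() -> bool:
    # Illustrative check: ask the FusedAdam op builder whether it can be
    # built/loaded on the current accelerator.
    try:
        from deepspeed.ops.op_builder import FusedAdamBuilder
        return FusedAdamBuilder().is_compatible()
    except Exception:
        return False

@pytest.mark.skipif(not fused_optimizers_supported(),
                    reason="FusedAdam/FusedLamb not supported on this accelerator")
def test_fused_adam_step():
    ...
```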

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Fixes microsoft#4989

In addition to this PR, the changes below are required to build the following
extensions successfully. Please note that not all unit tests for these
extensions will pass with this PR; more details on the unit test results
are below. These unit tests are skipped in CI anyway, so they will not
break the CI.
- transformer_inference
- quantizer
- random_ltd

- pytorch/pytorch#121030
- microsoft#5402


Unit test results (rocm/pytorch:rocm6.1_ubuntu20.04_py3.9_pytorch_2.1.2)
on MI200:

**transformer_inference:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n
4 unit/ops/transformer/inference

Before this PR: 
==== 674 failed, 622 skipped, 8 warnings, 1728 errors in 123.66s
(0:02:03) =====

After this PR:
========== 555 failed, 983 passed, 1486 skipped, 8 warnings in 14.35s
==========

**quantizer:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n
4 unit/ops/quantizer

Before this PR: 
==== 244 failed, 8 warnings in 48.02s ====

After this PR:
===== 187 failed, 57 passed, 8 warnings in 14.74s ====

I could not find random_ltd related unit tests to run.

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
This PR enables building the below extensions for AMD GPUs with warp
size 32.
- transformer_inference
- quantizer
- random_ltd


This PR works stand-alone for torch versions <=2.0. For the latest
versions, microsoft#5401 must be merged in addition to this PR.

Unit test results (rocm/pytorch:rocm6.1_ubuntu20.04_py3.9_pytorch_2.1.2)
on NAVI3x:

**transformer_inference:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n
4 unit/ops/transformer/inference

Before this PR:
===== 674 failed, 622 skipped, 8 warnings, 1728 errors in 69.37s
(0:01:09) =====

After this PR:
========== 476 failed, 1062 passed, 1486 skipped, 8 warnings in 9.31s
==========

**quantizer:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n
4 unit/ops/quantizer

Before this PR:
==== 244 failed, 8 warnings in 30.53s ====

After this PR:
====== 186 failed, 58 passed, 8 warnings in 8.89s ======

I could not find random_ltd related unit tests to run.

Fixes: 
microsoft#4753
microsoft#5474
ROCm#68

cc: @jithunnair-amd

---------

Co-authored-by: rraminen@amd.com <rraminen>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
* Use all_reduce instead of all_gather to fetch module parameters. This
improves performance by removing the overhead of concatenation and
slicing, which are no longer required.
* Instead, all tensor views are created prior to the collective
(all_reduce), so upon its completion only the parameter status is
updated.
* The behavior is enabled via a new boolean flag under the section
"zero_optimization": { "stage3_use_all_reduce_for_fetch_params": true }
(see the config sketch below).
* By default the optimization is not enabled.
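
For reference, a minimal DeepSpeed config sketch enabling the new flag (the surrounding values are illustrative):

```python
ds_config = {
    "train_batch_size": 8,  # illustrative value
    "zero_optimization": {
        "stage": 3,
        # Fetch parameters with all_reduce instead of all_gather (off by default)
        "stage3_use_all_reduce_for_fetch_params": True,
    },
}
```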

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Hi,
please review the following changes.
I added support for BF16 to CPU Adam. BF16, FP16 and float are supported
at compilation time; the correct template is called at runtime according
to the input params' dtype.

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Fixes the following error:
/datadisk2/wengshiy/llm.devkit/DeepSpeed/deepspeed/runtime/utils.py
    return get_accelerator().FloatTensor(float(v)).detach()
TypeError: new(): data must be a sequence (got float)

The cuda accelerator modified the interface to fix a warning:
microsoft@177dc14
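
A minimal sketch of the kind of fix, assuming the goal is simply to build a scalar tensor on the active accelerator (the exact change in the commit may differ):

```python
import torch
from deepspeed.accelerator import get_accelerator

def scalar_tensor(v) -> torch.Tensor:
    # torch.tensor accepts a bare Python float, avoiding
    # "new(): data must be a sequence (got float)".
    return torch.tensor(float(v), device=get_accelerator().device_name()).detach()
```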

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
…icrosoft#5519)

The following error occurs on XPU while running the unit test
"DeepSpeed/tests/unit/moe/test_moe.py":
DeepSpeed/deepspeed/moe/sharded_moe.py line 223, in top1gating
RuntimeError: Expected all tensors to be on the same device, but found
at least two devices, xpu:0 and cpu!

Fixed by converting the tensor to the expected device.
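
Illustrative only, a sketch of the kind of device conversion applied (names are hypothetical; the real fix is inside top1gating):

```python
import torch

def move_to_same_device(aux: torch.Tensor, logits: torch.Tensor) -> torch.Tensor:
    # Ensure auxiliary tensors created on the CPU end up on the same device
    # as the gating logits (e.g. xpu:0) before they are combined.
    return aux.to(logits.device)
```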

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
**Fix overwriting of the compiled wrapper class attributes by those of
the wrapped class itself: copy only those attributes which are not
already present in the wrapper.**

In the current implementation of the `CompiledModuleWrapper`, the wrapper
attributes (e.g. the `forward` method) are overwritten by `self.__dict__ =
module.__dict__.copy()`:

```
def CompiledModuleWrapper(mod, compile_config: Union[CompileConfig, None] = None):
    class wrapper(mod.__class__):
        def __init__(self, module, compile_config: Union[CompileConfig, None] = None):
            self.__dict__ = module.__dict__.copy()
```
This causes the `wrapper`'s `forward` method not to be called and,
consequently, the wrapped module not to be compiled. Instead, the wrapped
module's `forward` method is called, as illustrated in the diagram
below (a real scenario from DeepSpeed-Chat):


![compiled_module_wrapper_bug](https://github.com/microsoft/DeepSpeed/assets/75629718/00eeb3d1-927c-49c7-84ab-f882821cc452)

The proposed fix copies only those attributes which are not already present
in the wrapper class, thus implementing the desired inheritance behavior of
the wrapper.
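
A minimal sketch of the proposed attribute copy, under the assumption that a `hasattr` check against the wrapper is sufficient (the committed code may differ in details):

```python
def CompiledModuleWrapper(mod, compile_config=None):
    class wrapper(mod.__class__):
        def __init__(self, module, compile_config=None):
            # Copy only attributes that the wrapper does not already provide,
            # so wrapper attributes (e.g. a wrapper-defined forward) survive.
            for name, value in vars(module).items():
                if not hasattr(self, name):
                    setattr(self, name, value)
        # The real wrapper also defines forward(), compile(), etc.; they are
        # omitted here since only the attribute-copy change is illustrated.

    return wrapper(mod, compile_config)
```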

Attached is a simple reproducer of the problem.

[compiled_module_wrapper_bug.zip](https://github.com/microsoft/DeepSpeed/files/15378282/compiled_module_wrapper_bug.zip)

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
ditto

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Addresses the following warning:

```
/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/transformers/utils/hub.py:123: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
```

and the code on the transformers side is
[here](https://github.com/huggingface/transformers/blob/1a585c1222a56bcaecc070966d558d4a9d862e83/src/transformers/utils/hub.py#L86C1-L96C81).
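
For completeness, the replacement suggested by the warning is to point HF_HOME at the cache location instead (the path below is illustrative):

```python
import os

# Deprecated: setting TRANSFORMERS_CACHE
# os.environ["TRANSFORMERS_CACHE"] = "/path/to/cache"   # illustrative
os.environ["HF_HOME"] = "/path/to/cache"                # use HF_HOME instead
```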
…5546)

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
deepcharm and others added 21 commits June 10, 2024 17:27
The new "timers" section describes configuration for different timers.

Specifically, in the "throughput" section, it is possible to disable the
throughput timer (enabled by default). This allows to avoid the
performance degradation whenever the throughput measurement is not
needed, for example in production environment.

No device synchronize() is invoked when "synchronized" is set to False
(default is True). This allows to produce approximate throughput
measurements with minimal performance penalty.
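
A minimal config sketch for the options described above, assuming the flag names "enabled" and "synchronized" under the "throughput" section (values shown only to illustrate the two flags):

```python
ds_config = {
    "timers": {
        "throughput": {
            # Disable the throughput timer entirely to avoid its overhead ...
            "enabled": False,
            # ... or keep it enabled but skip device synchronize() for
            # approximate, low-overhead measurements.
            "synchronized": False,
        }
    }
}
```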

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
The `DistributedAttention` in DeepSpeed-Ulysses is written for compatibility
with the training code in
[Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/blob/main/megatron/model/transformer.py#L811),
as it only takes the sequence tensors as positional parameters. However,
this is not compatible with frequently used scenarios that pass extra
parameters, such as the following scenario when using Flash Attention:
```python
ulysses_attn = DistributedAttention(local_attention=flash_attn_func, sequence_process_group=None, scatter_idx=2, gather_idx=1)

attn_output = ulysses_attn(
    query_states,
    key_states,
    value_states,
    dropout,
    softmax_scale,
    causal=causal,
)

```
Therefore, a `**kwargs` parameter has been added to increase
compatibility with more local attention implementations, while making
minimal code modifications (a sketch of the signature change follows below).
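
A self-contained sketch of the signature change, assuming only the argument forwarding matters here (the real class also performs the sequence-parallel all-to-all exchanges):

```python
import torch

class DistributedAttentionSketch(torch.nn.Module):
    def __init__(self, local_attention):
        super().__init__()
        self.local_attn = local_attention

    def forward(self, query, key, value, *args, **kwargs):
        # Extra positional/keyword arguments (dropout, softmax_scale,
        # causal=..., etc.) are now forwarded to the local attention.
        return self.local_attn(query, key, value, *args, **kwargs)
```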

Co-authored-by: Kwen-Chen <2133949025@qq.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
…osoft#5562)

This PR updates the `nv-ds-chat` GitHub workflow to include the
`hybrid_engine.py` file in the path. This ensures the DS-Chat flow is
tested whenever any changes are made to the Hybrid Engine.
When running with MiCS, we found many handle prints from DeepSpeed in the
output log; this PR removes them to suppress the noise.
Until now, only the last layer (idx=-1) was considered, using
FINAL_LAYER_NORM_INDEX which is set to -1.
This PR allows the user to pass a custom value for models where this
default does not apply.
See an example of usage in the HabanaAI/Megatron-DeepSpeed fork repository:

https://github.com/HabanaAI/Megatron-DeepSpeed/blob/c9feb8cacabc6dd4da4266cff08db555a21122e2/tools/verify_checkpoint_non_tp_consistency.py#L296

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Add CUDA versions 12.4 and 12.5 to the list
Fixed the Windows build.

Fixes applied:
- Remove some more ops that don't build on Windows.
- Remove the use of symlinks that didn't work correctly and replace with
`shutil.copytree()`.
- Small fixes to make the C++ code compile.

Tested with Python 3.9 and CUDA 12.1.

---------

Co-authored-by: Costin Eseanu <costineseanu@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
…osoft#5099)

The estimate was counting 4 bytes per model param and 4 bytes per gradient;
fixed it to 2 bytes each, under the assumption of FP16/BF16.
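
As a quick sanity check of the corrected arithmetic (the model size below is chosen arbitrarily for illustration):

```python
params = 7e9                   # e.g. a 7B-parameter model
bytes_per_param = 2            # FP16/BF16 weight (previously counted as 4)
bytes_per_grad = 2             # FP16/BF16 gradient (previously counted as 4)
total_gib = params * (bytes_per_param + bytes_per_grad) / 2**30
print(f"{total_gib:.1f} GiB")  # ~26.1 GiB instead of ~52.2 GiB with 4+4 bytes
```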

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
This is a simple fix for inference woq part, changing from `'cuda'` to
`get_accelerator().device_name()`.
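
For reference, a small sketch of the device-agnostic pattern the fix moves to:

```python
import torch
from deepspeed.accelerator import get_accelerator

# Resolves to the active accelerator's device type (e.g. "cuda" or another
# backend) instead of hard-coding 'cuda'.
device = get_accelerator().device_name()
weight = torch.empty(16, device=device)
```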

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Use dp_world_size for grad reduction, instead of seq_dp_world_size.
Currently, for zero0, only sparse tensors use the correct world_size.

Tiny model with sp=4 grad norm test:

grad_norm | step1 | step2 | step3 | step4 | step5 | step100
-- | -- | -- | -- | -- | -- | --
zero1 | 15.825 | 16.646 | 15.853 | 16.159 | 17.333 | 15.555
zero0 | 3.956 | 4.161 | 3.963 | 4.040 | 4.333 | 3.889
zero0 (this patch) | 15.825 | 16.646 | 15.853 | 16.159 | 17.333 | 15.554
In the process of adding onebit optimizer support for XPU devices, we
noticed that, across accelerators, the main difference in the
implementation of `compressed_allreduce` lies in `packbits` and
`unpackbits`: CUDA uses cupy and NPU uses torch_npu. Instead of replacing
these with XPU-only functions, we provide a CompressedBackend to do the
`compressed_allreduce` work, where users can add their own
packbits/unpackbits kernels, which is a general path for all kinds of
accelerators.

In this PR, we:
1. Add CompressedBackend for onebitAdam, onebitLamb and zerooneAdam
2. Add an XPU implementation of packbits/unpackbits with SYCL, built in
PackbitsBuilder
3. Add tests for onebit with CompressedBackend
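
To make the packbits/unpackbits contract concrete, here is a plain PyTorch CPU reference of what an accelerator-specific kernel computes (illustrative only; the SYCL kernel in PackbitsBuilder is not shown):

```python
import torch

_BIT_WEIGHTS = torch.tensor([128, 64, 32, 16, 8, 4, 2, 1], dtype=torch.uint8)

def packbits(bits: torch.Tensor) -> torch.Tensor:
    """Pack a 1-D 0/1 tensor into uint8, 8 bits per byte (MSB first)."""
    n = bits.numel()
    padded = torch.zeros(((n + 7) // 8) * 8, dtype=torch.uint8)
    padded[:n] = bits.to(torch.uint8)
    return (padded.view(-1, 8) * _BIT_WEIGHTS).sum(dim=1, dtype=torch.uint8)

def unpackbits(packed: torch.Tensor, n: int) -> torch.Tensor:
    """Inverse of packbits: recover the first n bits as a bool tensor."""
    bits = packed.unsqueeze(-1).bitwise_and(_BIT_WEIGHTS).ne(0)
    return bits.view(-1)[:n]
```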

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Updated hpu-gaudi2 tests content as quantizer module is not yet
supported.
MII legacy tests use `from transformers import Conversation`
[here](https://github.com/microsoft/DeepSpeed-MII/blob/c171c4ee290e96c0d3e618b654be8add5eca973b/mii/legacy/method_table.py#L8).

Conversation was removed from transformers
[here](huggingface/transformers#31165), so we pin
to a version from before that removal until we can unpin.
…icrosoft#5590)

We have encountered an accuracy issue when running torch compile +
zero3 + activation checkpointing. Specifically, some grads get zeroed
(running without torch compile, this issue is not encountered). This
issue was also reproduced by Umesh Chand from the DS team. We found that
in the PyTorch repo, torch compile has been specifically disabled for
checkpointing using the decorator @torch._disable_dynamo().
Reference to the WA in the PyTorch repo:
https://github.com/pytorch/pytorch/blob/ec8b254ef49b4a057cf89c2ae64520fb7b423a3e/torch/utils/checkpoint.py#L324
This indicates that there is some issue with torch compile and
checkpointing (not necessarily DS related).

Given that the checkpointing function in DeepSpeed is based on the
PyTorch function, we propose to adopt this WA to ensure correct behavior
(it can be removed later if the underlying issue is fixed).
Note: this shouldn't impact non-torch-compile cases.
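
A minimal sketch of the workaround, following the upstream PyTorch usage linked above (the decorated function here is a stand-in, not DeepSpeed's actual checkpoint entry point):

```python
import torch

@torch._disable_dynamo()
def checkpoint_entry(run_function, *args):
    # Keep torch.compile (dynamo) from tracing into activation checkpointing,
    # mirroring the workaround in torch/utils/checkpoint.py.
    return run_function(*args)
```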

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
instead of "if" that causes host/device synchronization and introduces a
bubble, while clamp is hapenning on the device
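
For illustration, a generic clamp-based replacement for a host-side branch (a sketch, not the commit's exact code):

```python
import torch

def bounded(value: torch.Tensor, upper: float) -> torch.Tensor:
    # Branch-free alternative to `if value > upper: value = upper`:
    # clamp runs on the device, so no host/device synchronization is needed.
    return value.clamp(max=upper)
```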
`deepspeed.runtime.zero.stage_1_and_2.DeepSpeedZeroOptimizer.average_tensor`
only sets the reduction stream to wait for the default stream. This is fine in
cases where the computation time is longer than the communication time,
but when the communication time is longer, the ipg_buffer may be overwritten
before the communication completes.



![image](https://github.com/microsoft/DeepSpeed/assets/35059704/950cbf8a-f439-4cf9-a364-dcdfd47f46a0)



To fix this bug, the easiest way is to also make the default stream wait for
the reduction stream at the **same point**. For example, at point 1, the
`reduction stream` needs to wait for '2', so we add a wait_stream to the
`reduction stream` waiting for the `default stream`. Also, the `default
stream` needs to wait for 'A', so we add a wait_stream to the
`default stream` waiting for the `reduction stream` before 'B'.


![image](https://github.com/microsoft/DeepSpeed/assets/35059704/588a9469-d3f9-4c39-976d-3ae0502cf1d1)



Compared with the modification in
microsoft#5523, wait_stream does not
cause host synchronization.

Compared with the modification in
microsoft#5545, this modification is
simpler and the logic is the same: just wait for what needs to be
waited on.
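
A minimal sketch of the symmetric waits described above (a CUDA-only sketch; the real change is inside the ZeRO stage 1/2 optimizer):

```python
import torch

comp_stream = torch.cuda.default_stream()
reduce_stream = torch.cuda.Stream()

# Point 1: the reduction stream waits for the compute stream to finish
# writing the ipg_buffer before launching the all-reduce on it.
reduce_stream.wait_stream(comp_stream)
with torch.cuda.stream(reduce_stream):
    pass  # launch the gradient all-reduce here

# Added by this change: before reusing/overwriting the ipg_buffer ('B'),
# the default stream also waits for the reduction stream ('A').
comp_stream.wait_stream(reduce_stream)
```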

---

With this modification, losses of Qwen-1.5 with and without overlap_comm
are totally identical.


![image](https://github.com/microsoft/DeepSpeed/assets/35059704/4d48d54e-e55b-4230-8b99-93549910a43f)

---

On the contrary, there is an obvious gap with a small sequence length,
which means a short computation time.


![image](https://github.com/microsoft/DeepSpeed/assets/35059704/c80af498-3358-4e36-9b13-8f266551d51d)

Co-authored-by: gp513 <guopeng34@huawei.com>
Co-authored-by: CurryRice233 <nmeia@qq.com>
Co-authored-by: Joe Mayer <114769929+jomayeri@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>