
Stable Cascade fix INT8 compression with NNCF #7987

Closed
wants to merge 1 commit

Conversation

Disty0
Contributor

@Disty0 Disty0 commented May 19, 2024

What does this PR do?

Fixes NNCF compression with Stable Cascade.

NNCF compresses the model weights to 8-bit and reduces the model footprint by half.
The full Stable Cascade model can run on 6-8 GB GPUs using NNCF compression and model CPU offload.

Fixed Issues:

NNCF compression uses uint8, and the Stable Cascade Prior Pipeline tries to use the dtype from the model weights.
This makes the pipeline unable to run, since most operations don't support Byte (uint8) tensors.

This PR checks for uint8 and int8 and uses BF16 or FP32 instead, depending on GPU support.

Notes

CUDA uses torch.cuda.is_bf16_supported().
IPEX (Intel Arc) only checks the device, since every XPU device supports BF16.

I don't know if there is a way to get the original dtype used before the NNCF compression step.
The current method I implemented ignores the user-provided torch dtype.
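
For reference, the dtype fallback described above could look roughly like the sketch below. This is only an illustration of the idea, not the exact diff in this PR, and the helper name is made up:

import torch

def fallback_dtype(weight_dtype: torch.dtype, device: torch.device) -> torch.dtype:
    # NNCF stores weights as uint8/int8, which most ops can't run in,
    # so fall back to BF16 when the GPU supports it, otherwise FP32.
    if weight_dtype in (torch.uint8, torch.int8):
        if device.type == "cuda" and torch.cuda.is_bf16_supported():
            return torch.bfloat16
        if device.type == "xpu":
            # IPEX / Intel Arc: every XPU device supports BF16
            return torch.bfloat16
        return torch.float32
    return weight_dtype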

Example use of NNCF compression

pip install nncf==2.7.0

import copy
import nncf

device = "cuda"

def nncf_compress_model(model):
    return_device = model.device
    model.eval()
    # Keep a copy of the input embeddings and restore them after compression
    backup_embeddings = None
    if hasattr(model, "get_input_embeddings"):
        backup_embeddings = copy.deepcopy(model.get_input_embeddings())
    model = nncf.compress_weights(model.to(device)).to(return_device)
    if hasattr(model, "set_input_embeddings") and backup_embeddings is not None:
        model.set_input_embeddings(backup_embeddings)
    return model

pipe.prior_prior = pipe.prior_pipe.prior = nncf_compress_model(pipe.prior_pipe.prior)

BF16

[BF16 output screenshot]

UINT8

[UINT8 output screenshot]

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@sayakpaul
Member

Thanks for this contribution and for bringing NNCF compression to our attention.

I think it's a bit of an anti-pattern to our library's philosophy that we're changing the data type under the hood to something that wasn't requested by the user in the first place.

Instead, can't we type-cast the modules explicitly after the pipeline is loaded and then do the changes you mentioned in the PR description? In that case, we can just have a very nice doc about it and users will be able to follow it.

@yiyixuxu WDYT?

@Disty0
Contributor Author

Disty0 commented May 20, 2024

Dug into the NNCF code a bit and found a way to get the original dtype:
2.7.0: https://github.com/openvinotoolkit/nncf/blob/release_v270/nncf/torch/quantization/weights_compression.py#L34
Latest: https://github.com/openvinotoolkit/nncf/blob/9cfc7b48f6511356021801790e725453b2612ac4/nncf/torch/quantization/layers.py#L1059

This returns the original dtype and seems to work fine with both the Full Prior and the Lite Prior:

self.prior.down_blocks[0][0].channelwise[0].pre_ops["0"].scale.dtype

Note: I am using NNCF 2.7.0 because 2.8.0 and newer require example inputs with the PyTorch backend.
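
A more generic version of that lookup could look like the following sketch; it assumes the NNCF scale tensors keep the pre-compression dtype and are reachable as parameters or buffers, which may not hold for every NNCF version:

import torch

def nncf_original_dtype(model, default=torch.float32):
    # Walk parameters and buffers and return the dtype of the first
    # NNCF scale tensor found; fall back to `default` otherwise.
    for name, tensor in list(model.named_parameters()) + list(model.named_buffers()):
        if name.endswith("scale") and tensor.is_floating_point():
            return tensor.dtype
    return default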

@yiyixuxu
Collaborator

Yeah, I agree with @sayakpaul.
I don't think we can accept this PR, but happy to add a doc page for it!

@Disty0
Contributor Author

Disty0 commented May 20, 2024

NNCF compression autocasts to the original dtype when running the model, so any change we make here doesn't affect the weights.
The dtype change in the pipeline is mainly for the latents and the generator: they either fail outright with uint8 or cause a dtype mismatch, since the model weights are converted back to the original dtype at run time while the inputs are left untouched.

@sayakpaul
Member

Thanks for explaining, but it really is quite antithetical to the library design. If there's a way to patch the call in some manner, I think that would still be acceptable to put in the docs, but otherwise this really seems difficult.

@Disty0
Contributor Author

Disty0 commented May 20, 2024

We can add something like self._autocast_dtype that can be changed from outside the pipeline, but I didn't want to add another variable.
Or we can convert the dtype variable used in the pipeline to self._dtype, since it is already different from the property. But that one will get out of sync when pipe.to() is used.

Something like this could work:

if getattr(self, "_autocast_dtype", None) is not None:
    dtype = self._autocast_dtype
else:
    dtype = next(self.prior.parameters()).dtype
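
If that route were taken, the user-side setup would be a single assignment after compressing the model (again, _autocast_dtype is only a proposed name here, not an existing diffusers attribute):

pipe.prior_pipe._autocast_dtype = torch.bfloat16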

@Disty0
Contributor Author

Disty0 commented May 20, 2024

Actually, I forgot to check the simplest thing first: self.dtype doesn't get updated by NNCF.

This also works:

dtype = self.dtype

Edit: But it breaks again when model CPU offload is used.

@Disty0
Contributor Author

Disty0 commented May 20, 2024

Adding a self._autocast_dtype variable and setting it manually from the outside seems like the only thing that works reliably.

@Disty0 Disty0 marked this pull request as ready for review May 20, 2024 20:57
@Disty0 Disty0 closed this May 26, 2024
@Disty0
Contributor Author

Disty0 commented May 29, 2024

Another workaround that doesn't require a code change on the diffusers side:

backup_clip_txt_pooled_mapper = copy.deepcopy(pipe.prior_pipe.prior.clip_txt_pooled_mapper)

pipe.prior_prior = pipe.prior_pipe.prior = nncf_compress_model(pipe.prior_pipe.prior)

pipe.prior_prior.clip_txt_pooled_mapper = pipe.prior_pipe.prior.clip_txt_pooled_mapper = backup_clip_txt_pooled_mapper

@sayakpaul
Member

This is nice! Would you rather include documentation about it? @yiyixuxu what say? @stevhliu can help with doc guidance.

@Disty0
Contributor Author

Disty0 commented May 29, 2024

Some notes about NNCF:

  • It is autocast: the model weights are stored in INT8 and cast back to the original dtype (BF16/FP16, etc.) on the forward pass.
  • nncf.compress_weights() should be run on each component of the model separately. You can't run it on the entire pipe; you have to compress the modules you want one by one (first pipe.unet, then pipe.text_encoder, etc.).
  • Modules being compressed have to be on the execution device when compressing, otherwise NNCF fails with device mismatch errors.
    When using model offload, you can manually send a module to the execution device, compress it, then send it back to the CPU before moving on to the next module (see the sketch below).
    But you can't apply sequential offload before the compression step, because the "meta" device will mess things up.

SDXL and SD 1.5 work out of the box. Stable Cascade needs this (#7987 (comment)) workaround.

nncf==2.7.0 is required if you don't want to provide example inputs.
You can create a torch.compile backend to get example inputs at runtime for NNCF if you want to use the newer versions, but that is not ideal.
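
A per-component loop along those lines might look like the sketch below; the helper name is made up and the component names are the usual SDXL / SD 1.5 ones, so adjust them for other pipelines:

import nncf

def nncf_compress_pipeline(pipe, device="cuda"):
    # Model-offload case: compress each component separately, moving it to the
    # execution device for the compression step and back to the CPU afterwards.
    # Do not apply sequential offload before this step ("meta" device issues).
    for name in ("unet", "text_encoder", "text_encoder_2"):
        module = getattr(pipe, name, None)
        if module is None:
            continue
        module.eval()
        setattr(pipe, name, nncf.compress_weights(module.to(device)).to("cpu"))
    return pipe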

@Disty0
Contributor Author

Disty0 commented May 29, 2024

This is how I implemented it in SDNext:
https://github.com/vladmandic/automatic/blob/master/modules/sd_models_compile.py#L104

I am treating it as if it were compiled because it breaks loading LoRAs, not because it is actually compiled. (It's not compiled.)

This function applies NNCF or other functions to known module types:

https://github.com/vladmandic/automatic/blob/master/modules/sd_models_compile.py#L30

@yiyixuxu
Collaborator

@Disty0
would you be willing to add a doc page for this?
cc @stevhliu
