
Stable Cascade fix INT8 compression with NNCF #7987

Closed
wants to merge 1 commit

Conversation

Disty0
Contributor

@Disty0 Disty0 commented May 19, 2024

What does this PR do?

Fixes NNCF compression with Stable Cascade.

NNCF compresses the model weights to 8-bit and reduces the model footprint by half.
The full Stable Cascade model can run on 6-8 GB GPUs using NNCF compression and model CPU offload.

Fixed Issues:

NNCF compression uses uint8, and the Stable Cascade Prior Pipeline tries to use the dtype from the model weights.
This makes the pipeline unable to run, since most operations don't support Byte (uint8) tensors.

This PR checks for uint8 and int8 and uses BF16 or FP32 instead, depending on GPU support.

Notes

CUDA uses torch.cuda.is_bf16_supported().
IPEX (Intel Arc) only checks the device, since every XPU device supports BF16.

I don't know if there is a way to get the original dtype used before the NNCF compression step.
The current method I implemented ignores the user-provided torch dtype.
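
For reference, the dtype fallback described above could look roughly like the sketch below. This is only an illustration of the idea, not the exact diff in this PR, and the helper name is made up:

import torch

def fallback_dtype(weight_dtype: torch.dtype, device: torch.device) -> torch.dtype:
    # NNCF stores weights as uint8/int8, which most ops can't run in,
    # so fall back to BF16 when the GPU supports it, otherwise FP32.
    if weight_dtype in (torch.uint8, torch.int8):
        if device.type == "cuda" and torch.cuda.is_bf16_supported():
            return torch.bfloat16
        if device.type == "xpu":
            # IPEX / Intel Arc: every XPU device supports BF16
            return torch.bfloat16
        return torch.float32
    return weight_dtype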

Example use of NNCF compression

pip install nncf==2.7.0

import copy
import nncf

device = "cuda"

def nncf_compress_model(model):
    return_device = model.device
    model.eval()
    # Keep a copy of the input embeddings and restore them after compression
    backup_embeddings = None
    if hasattr(model, "get_input_embeddings"):
        backup_embeddings = copy.deepcopy(model.get_input_embeddings())
    model = nncf.compress_weights(model.to(device)).to(return_device)
    if hasattr(model, "set_input_embeddings") and backup_embeddings is not None:
        model.set_input_embeddings(backup_embeddings)
    return model

pipe.prior_prior = pipe.prior_pipe.prior = nncf_compress_model(pipe.prior_pipe.prior)

BF16

[BF16 output screenshot]

UINT8

[UINT8 output screenshot]

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@sayakpaul
Member

Thanks for this contribution and for bringing NNCF compression to our attention.

I think it's a bit of an anti-pattern to our library's philosophy that we're changing the data type under the hood to something that wasn't requested by the user in the first place.

Instead, can't we type-cast the modules explicitly after the pipeline is loaded and then do the changes you mentioned in the PR description? In that case, we can just have a very nice doc about it and users will be able to follow it.

@yiyixuxu WDYT?

@Disty0
Contributor Author

Disty0 commented May 20, 2024

Dug into the NNCF code a bit and found a way to get the original dtype:
2.7.0: https://github.com/openvinotoolkit/nncf/blob/release_v270/nncf/torch/quantization/weights_compression.py#L34
Latest: https://github.com/openvinotoolkit/nncf/blob/9cfc7b48f6511356021801790e725453b2612ac4/nncf/torch/quantization/layers.py#L1059

This returns the original dtype and seems to work fine with both the Full Prior and the Lite Prior:

self.prior.down_blocks[0][0].channelwise[0].pre_ops["0"].scale.dtype

Note: I am using NNCF 2.7.0 because 2.8.0 and newer require example inputs with the PyTorch backend.
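
A more generic version of that lookup could look like the following sketch; it assumes the NNCF scale tensors keep the pre-compression dtype and are reachable as parameters or buffers, which may not hold for every NNCF version:

import torch

def nncf_original_dtype(model, default=torch.float32):
    # Walk parameters and buffers and return the dtype of the first
    # NNCF scale tensor found; fall back to `default` otherwise.
    for name, tensor in list(model.named_parameters()) + list(model.named_buffers()):
        if name.endswith("scale") and tensor.is_floating_point():
            return tensor.dtype
    return default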

@yiyixuxu
Collaborator

Yeah, I agree with @sayakpaul.
I don't think we can accept this PR, but happy to add a doc page for it!

@Disty0
Contributor Author

Disty0 commented May 20, 2024

NNCF compression autocasts to the original dtype when running the model, so any change we make here doesn't affect the weights.
The dtype change in the pipeline is mainly for the latents and the generator: they either fail outright with uint8 or cause a dtype mismatch, since the model weights are converted back to the original dtype at run time while the inputs are left untouched.

@sayakpaul
Member

Thanks for explaining, but it really is quite antithetical to the library design. If there's a way to patch the call in some manner, I think that would still be acceptable to put in the docs, but otherwise this really seems difficult.

@Disty0
Contributor Author

Disty0 commented May 20, 2024

We can add something like self._autocast_dtype that can be changed from outside the pipeline, but I didn't want to add another variable.
Or we can convert the dtype variable used in the pipeline to self._dtype, since it is already different from the property. But that one will get out of sync when pipe.to() is used.

Something like this could work:

if getattr(self, "_autocast_dtype", None) is not None:
    dtype = self._autocast_dtype
else:
    dtype = next(self.prior.parameters()).dtype
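
If that route were taken, the user-side setup would be a single assignment after compressing the model (again, _autocast_dtype is only a proposed name here, not an existing diffusers attribute):

pipe.prior_pipe._autocast_dtype = torch.bfloat16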

@Disty0
Contributor Author

Disty0 commented May 20, 2024

Actually, I forgot to check the simplest thing first: self.dtype doesn't get updated by NNCF.

This also works:

dtype = self.dtype

Edit: But it breaks again when model CPU offload is used.

@Disty0
Contributor Author

Disty0 commented May 20, 2024

Adding a self._autocast_dtype variable and setting it manually from the outside seems like the only thing that works reliably.

@Disty0 Disty0 marked this pull request as ready for review May 20, 2024 20:57
@Disty0 Disty0 closed this May 26, 2024
@Disty0
Contributor Author

Disty0 commented May 29, 2024

Another workaround that doesn't require a code change on the diffusers side:

backup_clip_txt_pooled_mapper = copy.deepcopy(pipe.prior_pipe.prior.clip_txt_pooled_mapper)

pipe.prior_prior = pipe.prior_pipe.prior = nncf_compress_model(pipe.prior_pipe.prior)

pipe.prior_prior.clip_txt_pooled_mapper = pipe.prior_pipe.prior.clip_txt_pooled_mapper = backup_clip_txt_pooled_mapper

@sayakpaul
Member

This is nice! Would you rather include documentation about it? @yiyixuxu what say? @stevhliu can help with doc guidance.

@Disty0
Contributor Author

Disty0 commented May 29, 2024

Some notes about NNCF:

  • It is autocast: the model weights are stored in INT8 and cast back to the original dtype (BF16/FP16, etc.) on the forward pass.
  • nncf.compress_weights() should be run on each component of the model separately. You can't run it on the entire pipe; you have to compress the modules you want one by one (first pipe.unet, then pipe.text_encoder, etc.).
  • Modules being compressed have to be on the execution device when compressing, otherwise NNCF fails with device mismatch errors.
    When using model offload, you can manually send a module to the execution device, compress it, then send it back to the CPU before moving on to the next module (see the sketch below).
    But you can't apply sequential offload before the compression step, because the "meta" device will mess things up.

SDXL and SD 1.5 work out of the box. Stable Cascade needs this (#7987 (comment)) workaround.

nncf==2.7.0 is required if you don't want to provide example inputs.
You can create a torch.compile backend to get example inputs at runtime for NNCF if you want to use the newer versions, but that is not ideal.
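
A per-component loop along those lines might look like the sketch below; the helper name is made up and the component names are the usual SDXL / SD 1.5 ones, so adjust them for other pipelines:

import nncf

def nncf_compress_pipeline(pipe, device="cuda"):
    # Model-offload case: compress each component separately, moving it to the
    # execution device for the compression step and back to the CPU afterwards.
    # Do not apply sequential offload before this step ("meta" device issues).
    for name in ("unet", "text_encoder", "text_encoder_2"):
        module = getattr(pipe, name, None)
        if module is None:
            continue
        module.eval()
        setattr(pipe, name, nncf.compress_weights(module.to(device)).to("cpu"))
    return pipe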

@Disty0
Contributor Author

Disty0 commented May 29, 2024

This is how I implemented it in SDNext:
https://github.com/vladmandic/automatic/blob/master/modules/sd_models_compile.py#L104

I am treating it as if it were compiled because it breaks loading LoRAs, not because it is actually compiled. (It's not compiled.)

This function applies NNCF or other functions to known module types:

https://github.com/vladmandic/automatic/blob/master/modules/sd_models_compile.py#L30

@yiyixuxu
Collaborator

@Disty0
would you be willing to add a doc page for this?
cc @stevhliu
