
[Performance] Keep sigmas on CPU #15823

Merged (5 commits, Jun 9, 2024)

Conversation

drhead
Contributor

@drhead drhead commented May 17, 2024

Description

  • Currently, the k diffusion sampler code creates a tensor for sigmas on the GPU. This will cause forced device syncs every step because they are used for control flow within the sampler code.
  • I changed them to stay on CPU. Every operation using the sigma values works when the value is on the CPU (much like it would if you had a native python list of floats). This also allows it to work ahead as control flow is no longer dependent on the GPU.
  • There are still other blocking ops, so this may not give an immediate performance benefit. I'll open a separate PR for the other one that I know of.
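The sync-on-control-flow behavior described above can be illustrated with a small, hypothetical sketch (not code from this PR): any Python branch on a CUDA tensor's value forces the host to wait for queued GPU work, while branching on a CPU tensor reads the value directly.

```python
import torch

# Hypothetical illustration: a sigma schedule kept on the CPU.
# k-diffusion-style samplers branch on sigma values each step; with a
# CUDA tensor, each branch would call .item() under the hood and block
# the host until all queued GPU kernels finish.
sigmas = torch.linspace(14.6, 0.0, 20)  # torch defaults to the CPU device

for i in range(len(sigmas) - 1):
    # Reads a plain float from CPU memory -- no device sync required.
    if sigmas[i + 1] == 0:
        step_kind = "final denoising step"
    else:
        step_kind = "regular step"

assert sigmas.device.type == "cpu"
```

With the schedule on the GPU instead, the same `if` would be a hidden synchronization point on every step.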


@Panchovix

Panchovix commented May 17, 2024

It seems it interferes with #15751?

I get

Traceback (most recent call last):
      File "G:\Stable difussion\stable-diffusion-webui\modules\call_queue.py", line 57, in f
        res = list(func(*args, **kwargs))
                   ^^^^^^^^^^^^^^^^^^^^^
      File "G:\Stable difussion\stable-diffusion-webui\modules\call_queue.py", line 36, in f
        res = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
      File "G:\Stable difussion\stable-diffusion-webui\modules\txt2img.py", line 109, in txt2img
        processed = processing.process_images(p)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "G:\Stable difussion\stable-diffusion-webui\modules\processing.py", line 839, in process_images
        res = process_images_inner(p)
              ^^^^^^^^^^^^^^^^^^^^^^^
      File "G:\Stable difussion\stable-diffusion-webui\extensions\sd-webui-controlnet\scripts\batch_hijack.py", line 59, in processing_process_images_hijack
        return getattr(processing, '__controlnet_original_process_images_inner')(p, *args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "G:\Stable difussion\stable-diffusion-webui\modules\processing.py", line 975, in process_images_inner
        samples_ddim = p.sample(conditioning=p.c, unconditional_conditioning=p.uc, seeds=p.seeds, subseeds=p.subseeds, subseed_strength=p.subseed_strength, prompts=p.prompts)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "G:\Stable difussion\stable-diffusion-webui\modules\processing.py", line 1322, in sample
        samples = self.sampler.sample(self, x, conditioning, unconditional_conditioning, image_conditioning=self.txt2img_image_conditioning(x))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "G:\Stable difussion\stable-diffusion-webui\modules\sd_samplers_kdiffusion.py", line 181, in sample
        sigmas = self.get_sigmas(p, steps)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^
      File "G:\Stable difussion\stable-diffusion-webui\modules\sd_samplers_kdiffusion.py", line 118, in get_sigmas
        sigmas = scheduler.function(n=steps, **sigmas_kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    TypeError: get_align_your_steps_sigmas() missing 1 required positional argument: 'device'

---

@drhead
Contributor Author

drhead commented May 17, 2024

It seems it interferes with #15751?

That is a problem on their end, and can be resolved by not sending the sigmas to the device.

LoganBooker added a commit to LoganBooker/stable-diffusion-webui that referenced this pull request May 17, 2024
* Consistent with implementations in k-diffusion.
* Makes this compatible with AUTOMATIC1111#15823
@Panchovix

I was testing with hr-fix and it seems it doesn't work?

Traceback (most recent call last):
      File "G:\Stable difussion\stable-diffusion-webui\modules\call_queue.py", line 57, in f
        res = list(func(*args, **kwargs))
                   ^^^^^^^^^^^^^^^^^^^^^
      File "G:\Stable difussion\stable-diffusion-webui\modules\call_queue.py", line 36, in f
        res = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
      File "G:\Stable difussion\stable-diffusion-webui\modules\txt2img.py", line 109, in txt2img
        processed = processing.process_images(p)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "G:\Stable difussion\stable-diffusion-webui\modules\processing.py", line 839, in process_images
        res = process_images_inner(p)
              ^^^^^^^^^^^^^^^^^^^^^^^
      File "G:\Stable difussion\stable-diffusion-webui\extensions\sd-webui-controlnet\scripts\batch_hijack.py", line 59, in processing_process_images_hijack
        return getattr(processing, '__controlnet_original_process_images_inner')(p, *args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "G:\Stable difussion\stable-diffusion-webui\modules\processing.py", line 975, in process_images_inner
        samples_ddim = p.sample(conditioning=p.c, unconditional_conditioning=p.uc, seeds=p.seeds, subseeds=p.subseeds, subseed_strength=p.subseed_strength, prompts=p.prompts)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "G:\Stable difussion\stable-diffusion-webui\modules\processing.py", line 1338, in sample
        return self.sample_hr_pass(samples, decoded_samples, seeds, subseeds, subseed_strength, prompts)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "G:\Stable difussion\stable-diffusion-webui\modules\processing.py", line 1423, in sample_hr_pass
        samples = self.sampler.sample_img2img(self, samples, noise, self.hr_c, self.hr_uc, steps=self.hr_second_pass_steps or self.steps, image_conditioning=image_conditioning)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "G:\Stable difussion\stable-diffusion-webui\modules\sd_samplers_kdiffusion.py", line 172, in sample_img2img
        samples = self.launch_sampling(t_enc + 1, lambda: self.func(self.model_wrap_cfg, xi, extra_args=self.sampler_extra_args, disable=False, callback=self.callback_state, **extra_params_kwargs))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "G:\Stable difussion\stable-diffusion-webui\modules\sd_samplers_common.py", line 272, in launch_sampling
        return func()
               ^^^^^^
      File "G:\Stable difussion\stable-diffusion-webui\modules\sd_samplers_kdiffusion.py", line 172, in <lambda>
        samples = self.launch_sampling(t_enc + 1, lambda: self.func(self.model_wrap_cfg, xi, extra_args=self.sampler_extra_args, disable=False, callback=self.callback_state, **extra_params_kwargs))
                                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "G:\Stable difussion\stable-diffusion-webui\venv\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
        return func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
      File "G:\Stable difussion\stable-diffusion-webui\modules\sd_samplers_extra.py", line 71, in restart_sampler
        x = heun_step(x, old_sigma, new_sigma)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "G:\Stable difussion\stable-diffusion-webui\modules\sd_samplers_extra.py", line 20, in heun_step
        d = to_d(x, old_sigma, denoised)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "G:\Stable difussion\stable-diffusion-webui\repositories\k-diffusion\k_diffusion\sampling.py", line 48, in to_d
        return (x - denoised) / utils.append_dims(sigma, x.ndim)
               ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

---

It only happens when hi-res starts to generate (after tiled upscale process)

@drhead
Contributor Author

drhead commented May 18, 2024

I was testing with hr-fix and it seems it doesn't work?

That's not a hi-res fix issue, it's a k-diffusion issue specific to certain samplers (Euler and Heun at least; DPM++ 2M works fine). I'll have to work on an upstream patch for k-diffusion for this fix to be viable, then. Marking as draft until then.

@drhead drhead marked this pull request as draft May 18, 2024 19:36
@drhead
Contributor Author

drhead commented May 19, 2024

So, for this to resolve, either this PR needs to be merged upstream (crowsonkb/k-diffusion#109) or I have to monkey patch the function. This probably isn't going to be a merge candidate any time soon so I'll give the upstream PR some time.

@drhead drhead marked this pull request as ready for review June 8, 2024 23:01
@drhead
Contributor Author

drhead commented Jun 8, 2024

Now that the new schedulers are merged, I've gone ahead and patched the to_d method there and stripped the device argument from all schedulers, as it is no longer necessary. This should be ready to merge.

@AUTOMATIC1111
Owner

AUTOMATIC1111 commented Jun 9, 2024

Is there a performance improvement?

@drhead
Contributor Author

drhead commented Jun 9, 2024

Is there a performance improvement?

Yes. As stated, the current implementation induces a forced device sync during every sampling step, and the CPU has to wait for the GPU to finish its work before certain control flow operations can proceed. I have a more detailed explanation of the impact here: crowsonkb/k-diffusion#108. And it is completely safe to have the sigmas tensor on the CPU (apart from the single operation that I had to patch), because the sigmas are only used for control flow and scalar operations.

Notably, this is also the last blocking operation that I am aware of in the main inference loop, so PyTorch dispatch will be completely unblocked after this, and you should see nearly uninterrupted 100% GPU usage.

@AUTOMATIC1111
Owner

What are the best conditions for observing the improvements? Doing 50-step 512x768 SD1 generations on 3090, and iterations per second between this and dev are not noticeably different.

@drhead
Contributor Author

drhead commented Jun 9, 2024

What are the best conditions for observing the improvements? Doing 50-step 512x768 SD1 generations on 3090, and iterations per second between this and dev are not noticeably different.

It's probably easiest to look at GPU usage with a fast update speed, since it ideally should almost never drop below 100%. Without this change, you will notice a small dip between each step. It happens once per step, so it may be subtle, and it would probably be more pronounced on slower CPUs. But if you can see that dip, you know the GPU is sitting idle for some period when it should be working.

My standard when testing this was always 512x512 at batch size 4. For completeness, I was also using --opt-channelslast (which, after the other performance patches, should be noticeably faster on Ampere and above, where it wasn't always before). Live preview will probably also need to be off, since it also includes blocking operations.

@AUTOMATIC1111
Owner

AUTOMATIC1111 commented Jun 9, 2024

Here's what I get with disabled live previews:
dev:
100%|█████████████████| 50/50 [00:13<00:00, 3.73it/s]
100%|█████████████████| 50/50 [00:13<00:00, 3.66it/s]
100%|█████████████████| 50/50 [00:13<00:00, 3.70it/s]
(Task Manager GPU usage screenshot)

This PR:
100%|█████████████████| 50/50 [00:13<00:00, 3.63it/s]
100%|█████████████████| 50/50 [00:13<00:00, 3.68it/s]
100%|█████████████████| 50/50 [00:13<00:00, 3.66it/s]
(Task Manager GPU usage screenshot)

Let's see if anyone else reports improvement.

@AUTOMATIC1111
Owner

AUTOMATIC1111 commented Jun 9, 2024

All right, I'm ready to merge this functionality in, but after playing with it, my conclusion is that the effect can be achieved by just doing

sigmas = scheduler.function(n=steps, **sigmas_kwargs, device=devices.cpu)

There's no need for all other changes that remove the device argument from samplers code. If you want, you can edit this PR to only have that change, or I can do that change myself. Or if I'm wrong, correct me.
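In context, that one-line approach might look like the following sketch (hypothetical code; `get_sigmas` and `linear_sigmas` here are stand-ins for the method in `sd_samplers_kdiffusion.py` and a real scheduler, not the actual diff):

```python
import torch

def linear_sigmas(n, sigma_min=0.03, sigma_max=14.6, device="cpu"):
    # Stand-in scheduler with the k-diffusion-style signature
    # (step count n plus a device keyword argument).
    return torch.linspace(sigma_max, sigma_min, n, device=device)

def get_sigmas(scheduler_function, steps, **sigmas_kwargs):
    # Forcing device to the CPU at the single call site keeps the whole
    # schedule off the GPU without removing the device argument from
    # every scheduler implementation.
    return scheduler_function(n=steps, **sigmas_kwargs, device=torch.device("cpu"))

sigmas = get_sigmas(linear_sigmas, steps=20)
```

The appeal of this variant is that schedulers keep their upstream-compatible signatures and only the call site changes.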

@drhead
Contributor Author

drhead commented Jun 9, 2024

All right, I'm ready to merge this functionality in, but after playing with it, my conclusion is that the effect can be achieved by just doing

sigmas = scheduler.function(n=steps, **sigmas_kwargs, device=devices.cpu)

There's no need for all other changes that remove the device argument from samplers code. If you want, you can edit this PR to only have that change, or I can do that change myself. Or if I'm wrong, correct me.

Yeah, I agree the device arguments don't necessarily have to be removed; setting the device to CPU should work. The to_d patch is still necessary to avoid breaking Euler, Heun, and I think a couple of other samplers, since to_d performs a pointwise op that requires the sample and sigmas to be on the same device (unless we have different sigmas within a batch, this is equivalent to the scalar multiply it is now).
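The `to_d` change discussed above can be sketched roughly like this (an assumption about the shape of the patch, not the merged code): move the small sigma tensor to the sample's device before the pointwise division.

```python
import torch

def append_dims(x, target_dims):
    # Mirrors k_diffusion.utils.append_dims: append trailing singleton
    # dimensions until x has target_dims dimensions.
    return x[(...,) + (None,) * (target_dims - x.ndim)]

def to_d(x, sigma, denoised):
    # Sketch of a CPU-sigma-tolerant to_d: sigma may live on the CPU
    # while x and denoised stay on the GPU, so move it across first.
    # For a single scalar sigma this transfer is negligible.
    return (x - denoised) / append_dims(sigma.to(x.device), x.ndim)

# CPU-only usage example: sigma is a 0-dim CPU tensor, as it would be
# when the schedule is kept on the CPU.
x = torch.ones(2, 3)
denoised = torch.zeros(2, 3)
d = to_d(x, torch.tensor(2.0), denoised)  # elementwise (x - denoised) / 2
```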

@AUTOMATIC1111 AUTOMATIC1111 merged commit 1d0bb39 into AUTOMATIC1111:dev Jun 9, 2024
3 checks passed