
[Performance] Keep sigmas on CPU #15823

Merged (5 commits, Jun 9, 2024)

Conversation

drhead
Contributor

@drhead drhead commented May 17, 2024

Description

  • Currently, the k diffusion sampler code creates a tensor for sigmas on the GPU. This will cause forced device syncs every step because they are used for control flow within the sampler code.
  • I changed them to stay on CPU. Every operation using the sigma values works when the value is on the CPU (much like it would if you had a native python list of floats). This also allows it to work ahead as control flow is no longer dependent on the GPU.
  • There are still other blocking ops, so this may not give an immediate performance benefit. I'll open a separate PR for the other one that I know of.
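The sync-on-control-flow behavior described above can be illustrated with a small, hypothetical sketch (not code from this PR): any Python branch on a CUDA tensor's value forces the host to wait for queued GPU work, while branching on a CPU tensor reads the value directly.

```python
import torch

# Hypothetical illustration: a sigma schedule kept on the CPU.
# k-diffusion-style samplers branch on sigma values each step; with a
# CUDA tensor, each branch would call .item() under the hood and block
# the host until all queued GPU kernels finish.
sigmas = torch.linspace(14.6, 0.0, 20)  # torch defaults to the CPU device

for i in range(len(sigmas) - 1):
    # Reads a plain float from CPU memory -- no device sync required.
    if sigmas[i + 1] == 0:
        step_kind = "final denoising step"
    else:
        step_kind = "regular step"

assert sigmas.device.type == "cpu"
```

With the schedule on the GPU instead, the same `if` would be a hidden synchronization point on every step.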


@Panchovix

Panchovix commented May 17, 2024

It seems it interferes with #15751?

I get

Traceback (most recent call last):
      File "G:\Stable difussion\stable-diffusion-webui\modules\call_queue.py", line 57, in f
        res = list(func(*args, **kwargs))
                   ^^^^^^^^^^^^^^^^^^^^^
      File "G:\Stable difussion\stable-diffusion-webui\modules\call_queue.py", line 36, in f
        res = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
      File "G:\Stable difussion\stable-diffusion-webui\modules\txt2img.py", line 109, in txt2img
        processed = processing.process_images(p)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "G:\Stable difussion\stable-diffusion-webui\modules\processing.py", line 839, in process_images
        res = process_images_inner(p)
              ^^^^^^^^^^^^^^^^^^^^^^^
      File "G:\Stable difussion\stable-diffusion-webui\extensions\sd-webui-controlnet\scripts\batch_hijack.py", line 59, in processing_process_images_hijack
        return getattr(processing, '__controlnet_original_process_images_inner')(p, *args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "G:\Stable difussion\stable-diffusion-webui\modules\processing.py", line 975, in process_images_inner
        samples_ddim = p.sample(conditioning=p.c, unconditional_conditioning=p.uc, seeds=p.seeds, subseeds=p.subseeds, subseed_strength=p.subseed_strength, prompts=p.prompts)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "G:\Stable difussion\stable-diffusion-webui\modules\processing.py", line 1322, in sample
        samples = self.sampler.sample(self, x, conditioning, unconditional_conditioning, image_conditioning=self.txt2img_image_conditioning(x))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "G:\Stable difussion\stable-diffusion-webui\modules\sd_samplers_kdiffusion.py", line 181, in sample
        sigmas = self.get_sigmas(p, steps)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^
      File "G:\Stable difussion\stable-diffusion-webui\modules\sd_samplers_kdiffusion.py", line 118, in get_sigmas
        sigmas = scheduler.function(n=steps, **sigmas_kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    TypeError: get_align_your_steps_sigmas() missing 1 required positional argument: 'device'

---

@drhead
Contributor Author

drhead commented May 17, 2024

It seems it interferes with #15751?

That is a problem on their end, and can be resolved by not sending the sigmas to the device.

LoganBooker added a commit to LoganBooker/stable-diffusion-webui that referenced this pull request May 17, 2024
* Consistent with implementations in k-diffusion.
* Makes this compatible with AUTOMATIC1111#15823
@Panchovix

I was testing with hr-fix and it seems it doesn't work?

Traceback (most recent call last):
      File "G:\Stable difussion\stable-diffusion-webui\modules\call_queue.py", line 57, in f
        res = list(func(*args, **kwargs))
                   ^^^^^^^^^^^^^^^^^^^^^
      File "G:\Stable difussion\stable-diffusion-webui\modules\call_queue.py", line 36, in f
        res = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
      File "G:\Stable difussion\stable-diffusion-webui\modules\txt2img.py", line 109, in txt2img
        processed = processing.process_images(p)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "G:\Stable difussion\stable-diffusion-webui\modules\processing.py", line 839, in process_images
        res = process_images_inner(p)
              ^^^^^^^^^^^^^^^^^^^^^^^
      File "G:\Stable difussion\stable-diffusion-webui\extensions\sd-webui-controlnet\scripts\batch_hijack.py", line 59, in processing_process_images_hijack
        return getattr(processing, '__controlnet_original_process_images_inner')(p, *args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "G:\Stable difussion\stable-diffusion-webui\modules\processing.py", line 975, in process_images_inner
        samples_ddim = p.sample(conditioning=p.c, unconditional_conditioning=p.uc, seeds=p.seeds, subseeds=p.subseeds, subseed_strength=p.subseed_strength, prompts=p.prompts)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "G:\Stable difussion\stable-diffusion-webui\modules\processing.py", line 1338, in sample
        return self.sample_hr_pass(samples, decoded_samples, seeds, subseeds, subseed_strength, prompts)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "G:\Stable difussion\stable-diffusion-webui\modules\processing.py", line 1423, in sample_hr_pass
        samples = self.sampler.sample_img2img(self, samples, noise, self.hr_c, self.hr_uc, steps=self.hr_second_pass_steps or self.steps, image_conditioning=image_conditioning)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "G:\Stable difussion\stable-diffusion-webui\modules\sd_samplers_kdiffusion.py", line 172, in sample_img2img
        samples = self.launch_sampling(t_enc + 1, lambda: self.func(self.model_wrap_cfg, xi, extra_args=self.sampler_extra_args, disable=False, callback=self.callback_state, **extra_params_kwargs))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "G:\Stable difussion\stable-diffusion-webui\modules\sd_samplers_common.py", line 272, in launch_sampling
        return func()
               ^^^^^^
      File "G:\Stable difussion\stable-diffusion-webui\modules\sd_samplers_kdiffusion.py", line 172, in <lambda>
        samples = self.launch_sampling(t_enc + 1, lambda: self.func(self.model_wrap_cfg, xi, extra_args=self.sampler_extra_args, disable=False, callback=self.callback_state, **extra_params_kwargs))
                                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "G:\Stable difussion\stable-diffusion-webui\venv\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
        return func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
      File "G:\Stable difussion\stable-diffusion-webui\modules\sd_samplers_extra.py", line 71, in restart_sampler
        x = heun_step(x, old_sigma, new_sigma)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "G:\Stable difussion\stable-diffusion-webui\modules\sd_samplers_extra.py", line 20, in heun_step
        d = to_d(x, old_sigma, denoised)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "G:\Stable difussion\stable-diffusion-webui\repositories\k-diffusion\k_diffusion\sampling.py", line 48, in to_d
        return (x - denoised) / utils.append_dims(sigma, x.ndim)
               ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

---

It only happens when hi-res starts to generate (after tiled upscale process)

@drhead
Contributor Author

drhead commented May 18, 2024

I was testing with hr-fix and it seems it doesn't work?

That's not a hi-res fix issue, it's a k-diffusion issue specific to certain samplers (Euler and Heun at least; DPM++ 2M works fine). I'll have to work on an upstream patch for k-diffusion for this fix to be viable, then. Marking as draft until then.

@drhead drhead marked this pull request as draft May 18, 2024 19:36
@drhead
Contributor Author

drhead commented May 19, 2024

So, for this to resolve, either this PR needs to be merged upstream (crowsonkb/k-diffusion#109) or I have to monkey patch the function. This probably isn't going to be a merge candidate any time soon so I'll give the upstream PR some time.

@drhead drhead marked this pull request as ready for review June 8, 2024 23:01
@drhead
Contributor Author

drhead commented Jun 8, 2024

Now that the new schedulers are merged, I've gone ahead and patched the to_d method there and stripped the device argument from all schedulers, as it is no longer necessary. This should be ready to merge.

@AUTOMATIC1111
Owner

AUTOMATIC1111 commented Jun 9, 2024

Is there a performance improvement?

@drhead
Contributor Author

drhead commented Jun 9, 2024

Is there a performance improvement?

Yes. As stated, the current implementation induces a forced device sync during every sampling step, and the CPU has to wait for the GPU to finish its work before certain control flow operations can proceed. I have a more detailed explanation of the impact here: crowsonkb/k-diffusion#108. And it is completely safe to have the sigmas tensor on the CPU (apart from the single operation that I had to patch), because the sigmas are only used for control flow and scalar operations.

Notably, this is also the last blocking operation that I am aware of in the main inference loop, so PyTorch dispatch will be completely unblocked after this, and you should see nearly uninterrupted 100% GPU usage.

@AUTOMATIC1111
Owner

What are the best conditions for observing the improvements? Doing 50-step 512x768 SD1 generations on 3090, and iterations per second between this and dev are not noticeably different.

@drhead
Contributor Author

drhead commented Jun 9, 2024

What are the best conditions for observing the improvements? Doing 50-step 512x768 SD1 generations on 3090, and iterations per second between this and dev are not noticeably different.

It's probably easiest to look at GPU usage with a fast update speed, since it ideally should almost never drop below 100%. Without this change, you will notice a small dip between each step. It happens once per step, so it may be subtle, and it would probably be more pronounced on slower CPUs. But if you can see that dip, you know the GPU is sitting idle for some period when it should be working.

My standard when testing this was always 512x512 at batch size 4. For completeness, I was also using --opt-channelslast (which, after the other performance patches, should be noticeably faster on Ampere and above, where it wasn't always before). Live preview will probably also need to be off, since it also includes blocking operations.

@AUTOMATIC1111
Owner

AUTOMATIC1111 commented Jun 9, 2024

Here's what I get with disabled live previews:
dev:
100%|█████████████████| 50/50 [00:13<00:00, 3.73it/s]
100%|█████████████████| 50/50 [00:13<00:00, 3.66it/s]
100%|█████████████████| 50/50 [00:13<00:00, 3.70it/s]
(Task Manager GPU usage screenshot)

This PR:
100%|█████████████████| 50/50 [00:13<00:00, 3.63it/s]
100%|█████████████████| 50/50 [00:13<00:00, 3.68it/s]
100%|█████████████████| 50/50 [00:13<00:00, 3.66it/s]
(Task Manager GPU usage screenshot)

Let's see if anyone else reports improvement.

@AUTOMATIC1111
Owner

AUTOMATIC1111 commented Jun 9, 2024

All right, I'm ready to merge this functionality in, but after playing with it, my conclusion is that the effect can be achieved by just doing

sigmas = scheduler.function(n=steps, **sigmas_kwargs, device=devices.cpu)

There's no need for all other changes that remove the device argument from samplers code. If you want, you can edit this PR to only have that change, or I can do that change myself. Or if I'm wrong, correct me.
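In context, that one-line approach might look like the following sketch (hypothetical code; `get_sigmas` and `linear_sigmas` here are stand-ins for the method in `sd_samplers_kdiffusion.py` and a real scheduler, not the actual diff):

```python
import torch

def linear_sigmas(n, sigma_min=0.03, sigma_max=14.6, device="cpu"):
    # Stand-in scheduler with the k-diffusion-style signature
    # (step count n plus a device keyword argument).
    return torch.linspace(sigma_max, sigma_min, n, device=device)

def get_sigmas(scheduler_function, steps, **sigmas_kwargs):
    # Forcing device to the CPU at the single call site keeps the whole
    # schedule off the GPU without removing the device argument from
    # every scheduler implementation.
    return scheduler_function(n=steps, **sigmas_kwargs, device=torch.device("cpu"))

sigmas = get_sigmas(linear_sigmas, steps=20)
```

The appeal of this variant is that schedulers keep their upstream-compatible signatures and only the call site changes.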

@drhead
Contributor Author

drhead commented Jun 9, 2024

All right, I'm ready to merge this functionality in, but after playing with it, my conclusion is that the effect can be achieved by just doing

sigmas = scheduler.function(n=steps, **sigmas_kwargs, device=devices.cpu)

There's no need for all other changes that remove the device argument from samplers code. If you want, you can edit this PR to only have that change, or I can do that change myself. Or if I'm wrong, correct me.

Yeah, I agree the device arguments don't necessarily have to be removed; setting the device to CPU should work. The to_d patch is still necessary to avoid breaking Euler, Heun, and I think a couple of other samplers, since to_d performs a pointwise op that requires the sample and sigmas to be on the same device (unless we have different sigmas within a batch, this is equivalent to the scalar multiply it is now).
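The `to_d` change discussed above can be sketched roughly like this (an assumption about the shape of the patch, not the merged code): move the small sigma tensor to the sample's device before the pointwise division.

```python
import torch

def append_dims(x, target_dims):
    # Mirrors k_diffusion.utils.append_dims: append trailing singleton
    # dimensions until x has target_dims dimensions.
    return x[(...,) + (None,) * (target_dims - x.ndim)]

def to_d(x, sigma, denoised):
    # Sketch of a CPU-sigma-tolerant to_d: sigma may live on the CPU
    # while x and denoised stay on the GPU, so move it across first.
    # For a single scalar sigma this transfer is negligible.
    return (x - denoised) / append_dims(sigma.to(x.device), x.ndim)

# CPU-only usage example: sigma is a 0-dim CPU tensor, as it would be
# when the schedule is kept on the CPU.
x = torch.ones(2, 3)
denoised = torch.zeros(2, 3)
d = to_d(x, torch.tensor(2.0), denoised)  # elementwise (x - denoised) / 2
```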

@AUTOMATIC1111 AUTOMATIC1111 merged commit 1d0bb39 into AUTOMATIC1111:dev Jun 9, 2024
3 checks passed