[Performance] Run live preview inference on a cudastream #15837

Open · wants to merge 5 commits into base: dev
Conversation

@drhead (Contributor) commented May 19, 2024

Description

I have vastly improved the performance of live preview by making two changes:

  • running the decode on its own CUDA stream which lets operations parallelize
  • making the required DtoH transfer non-blocking
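
A minimal sketch of the idea (illustrative names only, not the actual diff in this PR) -- decode the preview on a side CUDA stream and copy it to the host without blocking:

    import torch

    # Illustrative names only -- the real code lives in the webui's modules.
    preview_stream = torch.cuda.Stream()  # side stream, separate from the sampling stream

    def decode_preview(latents, taesd_decoder):
        # Make the side stream wait until the sampler has produced `latents`,
        # then run the cheap TAESD decode there so it overlaps with the next step.
        preview_stream.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(preview_stream):
            decoded = taesd_decoder(latents)
            # Non-blocking DtoH copy: needs pinned host memory to be truly async,
            # and the buffer must not be read until the copy has finished.
            host_buf = torch.empty(decoded.shape, dtype=decoded.dtype, pin_memory=True)
            host_buf.copy_(decoded, non_blocking=True)
            copy_done = torch.cuda.Event()
            copy_done.record(preview_stream)
        # The caller must call copy_done.synchronize() before reading host_buf.
        return host_buf, copy_done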

As a result, live preview (at least with TAESD) is basically free now. On a 150-step, 512x512, batch-4 inference, it takes 25.1s to complete without live preview and 25.7s with live preview happening as often as it can (100ms delay, every step). In practice it doesn't actually preview that fast, but I find it hard to imagine that this isn't fast enough for almost anyone.

In its current state, I am about as certain as I can be that this will cause problems for anyone who isn't using an NVIDIA card! It shouldn't run at all on CPU, but I know torch with AMD backends still calls things "cuda", so I would love feedback on how well it works on different hardware, to see whether attempting to use CUDA streams on AMD makes your card explode or something, so that something can be done about that.

Screenshots/videos:

Delicious compute overlap:
[profiler screenshot]
Stream 7 is the main/default CUDA stream; stream 13 is the live preview CUDA stream.


drhead marked this pull request as ready for review May 19, 2024 01:03
@light-and-ray (Contributor)

But will it slow down the main generation?

@drhead (Contributor, Author) commented May 19, 2024

> But will it slow down the main generation?

Having live preview enabled will always slightly slow down the main generation, since it involves doing more work than not having live preview on. This implementation should have a much smaller performance impact, since it not only ensures that live preview never blocks the main generation, but also overlaps its compute to an extent.

@gel-crabs (Contributor)

It works on AMD! I've used CUDA streams before with CuPy, so the support is there, and the CUDA checks will work on AMD.

You will have to turn on "Show previews of all images generated in a batch as a grid" in the Live Previews settings, or else you will get a 'Tensor' object has no attribute 'save' error (not exclusive to AMD).

However, I intermittently get purple images, noise images, etc. as live previews during generation, in between normal live previews. Forcing non_blocking to False in sd_samplers_common fixes this (I assume at a cost to performance).

It may have something to do with the other performance patches I've applied recently; can anyone else reproduce this?

@drhead (Contributor, Author) commented May 19, 2024

> It works on AMD! I've used CUDA streams before with CuPy, so the support is there, and the CUDA checks will work on AMD.
>
> You will have to turn on "Show previews of all images generated in a batch as a grid" in the Live Previews settings, or else you will get a 'Tensor' object has no attribute 'save' error (not exclusive to AMD).
>
> However, I intermittently get purple images, noise images, etc. as live previews during generation, in between normal live previews. Forcing non_blocking to False in sd_samplers_common fixes this (I assume at a cost to performance).
>
> It may have something to do with the other performance patches I've applied recently; can anyone else reproduce this?

So that's where that function is used... I didn't test single image generation at all, sorry.

Please tell me if the commit I just made fixes both problems. If it doesn't, then that means there's a serious problem with non-blocking on AMD that needs further investigation.

@gel-crabs (Contributor)

>> It works on AMD! I've used CUDA streams before with CuPy, so the support is there, and the CUDA checks will work on AMD. [...]
>
> So that's where that function is used... I didn't test single image generation at all, sorry.
>
> Please tell me if the commit I just made fixes both problems. If it doesn't, then that means there's a serious problem with non-blocking on AMD that needs further investigation.

The issue with single image generation is fixed!

The issue with intermittent purple/noisy images is still there, but less frequent. It turns out that removing --disable-nan-check from my startup file fixes it, though. Do you have the NaN check disabled?

@drhead (Contributor, Author) commented May 19, 2024

I do have it disabled. The NaN check forces a sync when it runs, which would explain why removing --disable-nan-check hides the problem. Whatever is happening indicates that synchronizing the CUDA stream doesn't do what it should on AMD.
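
For context, the NaN check acts as a sync point because testing a CUDA tensor for NaNs means reading a value back on the host; a schematic example (not the webui's exact implementation):

    import torch

    def has_nan(x: torch.Tensor) -> bool:
        # .item() copies a single scalar from device to host and blocks the CPU
        # until the GPU work that produced it has finished, so calling this after
        # each step inserts a synchronization point that can mask stream-ordering
        # races like the one discussed here.
        return torch.isnan(x).any().item()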

@gel-crabs (Contributor) commented May 20, 2024

> I do have it disabled. The NaN check forces a sync when it runs, which would explain why removing --disable-nan-check hides the problem. Whatever is happening indicates that synchronizing the CUDA stream doesn't do what it should on AMD.

I really should've taken into account that my setup is unsuitable for testing right now (I've started getting actual memory errors due to hot weather and a dirt-cheap mobo).

Since this issue involves synchronization with the CPU/memory, it could very well just be that. This issue seems to only start after my fans spin up.

I'm going to test this again in a few days after I rebuild my computer. Sorry for leading you on a wild goose chase if this turns out to be an issue on my end.

(Several comments between @wfjsw and @drhead were marked as outdated or resolved and are hidden.)

@drhead (Contributor, Author) commented May 20, 2024

On another note, I do think this needs to include some sort of forced maximum interval between live preview updates as an option. Having it fully async like this is amazing when it can keep up (even though it trends towards providing fewer updates than the user may have asked for), but that won't always be the case on every system.
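
One shape such an option could take (purely a sketch; none of these names exist in the PR) is a small throttle that forces a synchronous preview update whenever the async path has been quiet for too long:

    import time

    class PreviewThrottle:
        """Hypothetical helper: cap the time between live preview updates."""

        def __init__(self, max_interval: float = 2.0):
            self.max_interval = max_interval   # seconds allowed between updates
            self.last_update = time.monotonic()

        def should_force_update(self) -> bool:
            # If the async preview path hasn't delivered anything recently,
            # fall back to a blocking update so previews never go stale.
            return (time.monotonic() - self.last_update) > self.max_interval

        def mark_updated(self) -> None:
            self.last_update = time.monotonic()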

(Another comment from @wfjsw was marked as outdated and is hidden.)

@wfjsw (Contributor) commented May 21, 2024

Never mind, it is my fault; I somehow dropped an 's' when applying the patches.

It does error out, but the error is not in the console; it ends up in the sysinfo endpoint.

@Soulreaver90

I run an AMD RX 6700 XT. I've swapped between this and what's in production, and noticed my speed goes down quite a bit using this commit. At 1024x with SDXL, I get an average of 1.10s/it, but with this commit it drops to 1.50s/it, slower by 40s. I haven't tested much else, but it doesn't improve performance for me unless I am missing something.

@gel-crabs (Contributor)

> I run an AMD RX 6700 XT. I've swapped between this and what's in production, and noticed my speed goes down quite a bit using this commit. At 1024x with SDXL, I get an average of 1.10s/it, but with this commit it drops to 1.50s/it, slower by 40s. I haven't tested much else, but it doesn't improve performance for me unless I am missing something.

Did you run multiple generations each time you tested?

@drhead (Contributor, Author) commented May 21, 2024

> I run an AMD RX 6700 XT. I've swapped between this and what's in production, and noticed my speed goes down quite a bit using this commit. At 1024x with SDXL, I get an average of 1.10s/it, but with this commit it drops to 1.50s/it, slower by 40s. I haven't tested much else, but it doesn't improve performance for me unless I am missing something.

Do you have the other performance PR patches applied? The changes might depend on that to a degree, and I exclusively tested on the assumption that this is building on top of the other patches.

If you do, and it still shows a performance regression even after trying multiple times, I would like you to run profiling and figure out a way to get me the file (it might end up being somewhere around 200MB). Having profiling data for an AMD card would be extremely helpful. I have been doing it by wrapping the main processing loop in processing.py, around line 980:

            with devices.without_autocast() if devices.unet_needs_upcast else devices.autocast():
                from torch.profiler import profile, record_function, ProfilerActivity
                with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
                        record_shapes=True,
                        profile_memory=True, # Track memory allocation
                        with_stack=True,
                        with_flops=True) as prof:
                    with record_function("model_inference"):
                        samples_ddim = p.sample(conditioning=p.c, unconditional_conditioning=p.uc, seeds=p.seeds, subseeds=p.subseeds, subseed_strength=p.subseed_strength, prompts=p.prompts)

                # Print profiling results
                print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=20))
                print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=20))
                # Export to Chrome trace format
                prof.export_chrome_trace("trace_livepreview.json")

Apply this code and start the WebUI. Run a 10-step inference to let it warm up and get all of the first-time operations out of the way (they make the profile unnecessarily large and hard to read). Once that is done, run a 20-step inference, making sure that it actually shows you live previews in that time span. This will overwrite the first profile.

This will give a printout of the top 20 operators by self time on both CPU and GPU, and export a trace file that can be opened at http://ui.perfetto.dev/. The profile shouldn't contain personal information (I chose the Chrome trace even though TensorBoard traces contain more useful info like SM utilization, since TensorBoard traces show your hostname), but you should check the trace in Perfetto yourself to verify.

@Soulreaver90

> Do you have the other performance PR patches applied? The changes might depend on that to a degree, and I exclusively tested on the assumption that this is building on top of the other patches.
>
> If you do, and it still shows a performance regression even after trying multiple times, I would like you to run profiling [...]

I did not test this with the performance patches; I had actually tested them separately. I am cloning a clean instance for testing and will apply this on top of the performance patches.

@gel-crabs (Contributor)

It had nothing to do with my memory, same issues.

@drhead (Contributor, Author) commented May 28, 2024

> It had nothing to do with my memory, same issues.

In my own testing lately I have noticed a few issues with occasional noise outputs, but they're infrequent. I will probably want to make this a toggleable option if I can't eliminate these with more careful syncs.

@gel-crabs (Contributor)

>> It had nothing to do with my memory, same issues.
>
> In my own testing lately I have noticed a few issues with occasional noise outputs, but they're infrequent. I will probably want to make this a toggleable option if I can't eliminate these with more careful syncs.

Ahh. If you have any ideas for places in the file where the stream sync could go, I'd be glad to help out, as the bad previews are more frequent on my machine/setup.

Another idea may be to put the sync in progress.py, as that's where the live preview itself is updated.
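
As a sketch of that suggestion (illustrative names, not actual webui code; `copy_done` is assumed to be the event recorded on the preview stream right after the non-blocking copy, as in the sketch in the PR description above), the consumer side could wait on that event so only the preview path blocks and the sampling stream is never stalled:

    import torch

    def build_preview(host_buf: torch.Tensor, copy_done: torch.cuda.Event):
        # Wait only for the preview's DtoH copy to land; sampling keeps running.
        copy_done.synchronize()
        # host_buf is now safe to read. Assuming a CHW float tensor in [0, 1],
        # convert it to an HWC uint8 array the way an image library expects.
        return host_buf.clamp(0, 1).mul(255).byte().permute(1, 2, 0).numpy()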
