Thread-safety for rendering related states #1393

Open
Drakulix wants to merge 7 commits into master from feature/thread-safe-rendering
Conversation

Drakulix (Member)

Best reviewed commit-by-commit.

The biggest commit is easily 5b2e919, which makes the CacheMap thread-safe. A lot of the double-buffered state is used to inform rendering, so I am not sure if this can be avoided.

The rest of the changes are most likely less controversial.

Drakulix force-pushed the feature/thread-safe-rendering branch 3 times, most recently from f3ec96f to cb9f536 on April 17, 2024 at 13:36
Drakulix force-pushed the feature/thread-safe-rendering branch from cb9f536 to 1fe3c70 on April 17, 2024 at 13:47
ids1024 (Member) commented Apr 17, 2024

Where RwLock is used, do we expect a lot more readers than writers? Otherwise it may perform worse than just a simple Mutex.

Though really, in a lot of cases it won't make much of a performance difference; nor will using a Mutex instead of RefCell. (It would be nice to have benchmarks, but that isn't really easy here.)

For CacheMap... if rendering is in a different thread from the thread handling Wayland requests and updating the cached state, would it make sense to have cached and pending in different locks? For the most part, the only place both are needed should be when applying the commit.

Though again, it might not make much difference if the lock isn't held for very long and we don't have a whole bunch of threads competing for it.

In general, using Mutex/RwLock instead of RefCell is a fairly uncontroversial change. Though multiple threads open up the possibility of deadlocks from inconsistent acquisition order of multiple locks, race conditions of various sorts, and all that.
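
To illustrate the split-lock idea, a minimal sketch with a hypothetical `DoubleBuffered` type (not the actual `CacheMap` API): pending and cached sit behind separate `Mutex`es, so the Wayland thread writing pending state and a render thread reading cached state only contend when a commit is applied.

```rust
use std::sync::Mutex;

/// Hypothetical double-buffered state with separate locks for the
/// pending (Wayland-thread) and cached (render-thread) copies.
struct DoubleBuffered<T: Clone> {
    pending: Mutex<T>,
    cached: Mutex<T>,
}

impl<T: Clone> DoubleBuffered<T> {
    fn new(initial: T) -> Self {
        Self {
            pending: Mutex::new(initial.clone()),
            cached: Mutex::new(initial),
        }
    }

    /// Wayland thread: mutate pending state only.
    fn update_pending(&self, f: impl FnOnce(&mut T)) {
        f(&mut self.pending.lock().unwrap());
    }

    /// Render thread: read cached state only.
    fn with_cached<R>(&self, f: impl FnOnce(&T) -> R) -> R {
        f(&self.cached.lock().unwrap())
    }

    /// Applying a commit is the only place both locks are held.
    /// Always acquire them in the same order to avoid deadlocks.
    fn apply(&self) {
        let pending = self.pending.lock().unwrap();
        let mut cached = self.cached.lock().unwrap();
        *cached = pending.clone();
    }
}
```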

YaLTeR (Contributor) commented Apr 18, 2024

Ngl all these locks throughout make me very uneasy. I feel like it can lead to a lot of issues like: rendering thread gets some value (lock/unlock), then main thread gets a surface commit and updates values, then rendering thread later in the function gets another value (lock/unlock) and expects it to "match" the first value it got earlier in the function, but it no longer does.
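
Roughly this pattern (a contrived sketch with a hypothetical per-surface state type, just to illustrate the hazard):

```rust
use std::sync::Mutex;

// Hypothetical per-surface state; a commit updates all fields atomically.
struct SurfaceState {
    geometry: (i32, i32, i32, i32),
    damage: Vec<(i32, i32, i32, i32)>,
}

fn render(state: &Mutex<SurfaceState>) {
    // Read 1: lock, copy, unlock.
    let geometry = state.lock().unwrap().geometry;

    // ... the main thread handles a commit here, replacing geometry AND damage ...

    // Read 2: lock, clone, unlock. This may observe a newer commit than
    // read 1 did, so `damage` no longer belongs to `geometry`.
    let damage = state.lock().unwrap().damage.clone();

    // Each access was individually safe, but the pair is inconsistent.
    let _ = (geometry, damage);
}
```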

cmeissl (Collaborator) commented Apr 18, 2024

I agree with @ids1024 that using RwLock might not yield any benefit here and a Mutex might be the easier option. RwLock also has a few footguns which can result in "unexpected" deadlocks.
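
One classic example (a sketch; whether it actually deadlocks depends on the platform's lock policy, which std's RwLock inherits):

```rust
use std::sync::RwLock;

fn recursive_read(lock: &RwLock<u32>) {
    let first = lock.read().unwrap();

    // If another thread requests lock.write() right here, a
    // writer-preferring implementation blocks new readers until the
    // writer is served, while the writer itself waits for `first`
    // to be dropped...

    let second = lock.read().unwrap(); // ...so this read can block forever.

    println!("{} {}", *first, *second);
}
```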

I also have similar concerns regarding atomicity of wayland operations as @YaLTeR. This could easily lead to unexpected behavior.

So instead of going into more detail on this PR's changes, I want to talk about the actual requirements and whether there might be an alternative way to solve them. AFAICT the motivation behind the changes is to allow concurrent rendering of multiple outputs (and probably also screencasts). I have also already thought about how we could support offloading screencopy to a separate thread, because this is currently one issue I see on a low-performance board.
I did something similar in an old non-Rust implementation: the main thread still processed the Wayland events, the rendering "sampled" the current state (mostly cloning stuff around), blocking Wayland processing only for a short time, and then offloaded the actual rendering to a separate thread.

I haven't thought that completely through, but from a quick look it should be possible to make the render elements (especially the WaylandSurfaceRenderElement) sample the state and implement Send. The biggest issue I see at the moment is how we handle importing memory/data into the renderer. This is currently done while realizing the render elements, so we would have to re-think that to keep time low during the "sample" phase. But maybe it is possible to split this into multiple smaller issues we can address. For example, "sampling" the state in WaylandSurfaceRenderElement only once would already yield better performance, because we could lock the state once instead of every time a function is invoked on the object (this is something that got visible in one of my last profiling sessions, after getting rid of a lot of other stuff :)).
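
Very roughly, the sampling could look like this (hypothetical types, not the actual WaylandSurfaceRenderElement API):

```rust
use std::sync::{Arc, Mutex};

// Hypothetical live surface state, owned by the main (Wayland) thread.
struct SurfaceData {
    location: (i32, i32),
    opaque_regions: Vec<(i32, i32, i32, i32)>,
}

// Immutable snapshot taken once per frame; plain data, so it is Send
// and the render thread never touches the live state again.
struct SurfaceSample {
    location: (i32, i32),
    opaque_regions: Vec<(i32, i32, i32, i32)>,
}

fn sample(data: &Arc<Mutex<SurfaceData>>) -> SurfaceSample {
    // One lock per surface per frame, instead of one per accessor call.
    let guard = data.lock().unwrap();
    SurfaceSample {
        location: guard.location,
        opaque_regions: guard.opaque_regions.clone(),
    }
}
```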

So some requirements to support this that come to my mind are:

  • Share Textures whenever possible (multiple renderers with a shared context)
  • Async upload of memory/buffers, or making it "lazy" and moving it to a render/upload thread (potentially only locking per renderer/id, or using some kind of optimistic concurrency and in the worst case wasting some bandwidth)
  • Keep atomicity of wayland operations
  • Avoid lock contention as far as possible

Drakulix (Member, Author) commented Apr 18, 2024

> Where RwLock is used, do we expect a lot more readers than writers? Otherwise it may perform worse than just a simple Mutex.

The idea was that this is the primary rendering-related state, which might be processed multiple times for rendering but is updated more rarely. Surely a Mutex wouldn't make a huge difference though.

> For CacheMap... if rendering is in a different thread from the thread handling Wayland requests and updating the cached state, would it make sense to have cached and pending in different locks? For the most part, the only place both are needed should be when applying the commit.

> Though again, it might not make much difference if the lock isn't held for very long and we don't have a whole bunch of threads competing for it.

I thought about this as well, came to the same conclusion, and thus went with the easier solution.

> Ngl all these locks throughout make me very uneasy. I feel like it can lead to a lot of issues like: rendering thread gets some value (lock/unlock), then main thread gets a surface commit and updates values, then rendering thread later in the function gets another value (lock/unlock) and expects it to "match" the first value it got earlier in the function, but it no longer does.

Yeah, this definitely needs to be tested and probably bug-fixed a lot once any compositor actually makes use of it. It might lead to a few imperfect frames in its current state. I can try to update and carry these patches until I have a proper implementation to show off (hopefully soon), to properly assess whether this really is an issue.

> I did something similar in an old non-Rust implementation: the main thread still processed the Wayland events, the rendering "sampled" the current state (mostly cloning stuff around), blocking Wayland processing only for a short time, and then offloaded the actual rendering to a separate thread.

Yes, that is the general idea; however, I feel like doing this after accumulating all the elements is the wrong approach. The idea of parallel rendering comes from the fact that you realistically can't drive two 240 Hz displays from one main loop, because the timings to commit in time are so tight. We actually profile and measure the "elements" step separately from "drawing" in cosmic-comp, and the former takes a good amount of time even after the last round of allocation optimizations.

I really want to move this step into a render thread as well, so while I like the theoretical idea of snapshotting the surface state, I don't feel like this would actually work as intended.

For what it's worth, cosmic-comp now puts its whole Shell state behind an RwLock to allow parallel read access from the rendering threads, which means all Wayland events that need to call back into that state are already blocked, most importantly commit. So I don't know if any of this would actually affect cosmic-comp. As such, I would be interested to hear whether any compositors are planning to implement this differently.
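
Condensed, that setup looks roughly like this (hypothetical Shell type, heavily simplified):

```rust
use std::sync::{Arc, RwLock};
use std::thread;

struct Shell; // stands in for the whole compositor state

fn spawn_render_threads(
    shell: Arc<RwLock<Shell>>,
    outputs: usize,
) -> Vec<thread::JoinHandle<()>> {
    (0..outputs)
        .map(|output| {
            let shell = Arc::clone(&shell);
            thread::spawn(move || {
                // One frame, for illustration; a real thread loops on vblank.
                // Readers run in parallel; a surface commit on the main
                // thread takes the write lock and blocks until all render
                // threads have released their read guards.
                let _guard = shell.read().unwrap();
                // ... build render elements and draw output `output` ...
                let _ = output;
            })
        })
        .collect()
}
```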

> The biggest issue I see at the moment is how we handle importing memory/data into the renderer.

And that problem runs deeper. The renderers mostly cache things using insert_if_missing, not insert_if_missing_threadsafe, which is why I didn't update the user-data structs of the GlesRenderer or the MultiRenderer yet. Sharing contexts with the MultiRenderer also isn't exactly straightforward. Right now I am instantiating a complete GpuManager per thread, even though I would really like to share textures.

I don't think we should handle the import on the main thread, btw. Importing with implicit sync might stall for quite a long time; with the MultiRenderer it might even involve a CPU copy. So locking EGLContexts to use them across multiple threads (except for creating shared contexts) is luckily out of the question, which is another reason to move element creation into the render threads.

> Async upload of memory/buffers, or making it "lazy" and moving it to a render/upload thread (potentially only locking per renderer/id, or using some kind of optimistic concurrency and in the worst case wasting some bandwidth)

So the idea would be to decouple import from element creation? That might be a good idea and would easily allow moving this to another thread. (And just continue using old buffers while the import thread is still busy?)
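
Something like this, perhaps (hypothetical types and channel plumbing, nothing from the actual API): element creation only queues the buffer, a dedicated thread imports it in the background, and rendering keeps the previously imported texture while the import is in flight.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-ins for a client buffer and an imported GPU texture.
struct Buffer(Vec<u8>);
struct Texture(u64);

fn import(buffer: &Buffer) -> Texture {
    // ... expensive upload, implicit-sync wait, possibly a CPU copy ...
    Texture(buffer.0.len() as u64)
}

fn main() {
    let (buf_tx, buf_rx) = mpsc::channel::<Buffer>();
    let (done_tx, done_rx) = mpsc::channel::<Texture>();

    // Dedicated upload thread: imports run off the main loop.
    thread::spawn(move || {
        for buffer in buf_rx {
            let _ = done_tx.send(import(&buffer));
        }
    });

    let mut current: Option<Texture> = None;

    // Per frame: queue the newest buffer, then render with whatever
    // texture is ready, keeping the old one while the import is busy.
    buf_tx.send(Buffer(vec![0; 4])).unwrap();
    if let Ok(texture) = done_rx.try_recv() {
        current = Some(texture);
    }
    match &current {
        Some(tex) => println!("draw with texture {}", tex.0),
        None => println!("no texture yet, reuse/skip this frame"),
    }
}
```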

For example "sampling" the state in WayandSurfaceRenderElement only once would already yield better performance because we could lock the state once instead of in every time a function is invoked on the object (This is something that got visible in one of my last profiling sessions after getting rid of a lot of other stuff :)).

In general I think this is a good idea, no matter on which thread this takes place. This ensures a consistent rendered state and might already solve 99% of the problems @YaLTeR has with this PR?

YaLTeR (Contributor) commented Apr 18, 2024

> I don't think we should handle the import on the main thread, btw. Importing with implicit sync might stall for quite a long time; with the MultiRenderer it might even involve a CPU copy.

SHM uploads are especially painful in the profiles. They happen every time you mouse over a different window (cursor surfaces are usually SHM), and also for clients like foot and maybe swww that can/do go fullscreen.

For example "sampling" the state in WayandSurfaceRenderElement only once would already yield better performance because we could lock the state once instead of in every time a function is invoked on the object (This is something that got visible in one of my last profiling sessions after getting rid of a lot of other stuff :)).

> In general I think this is a good idea, no matter on which thread this takes place. This ensures a consistent rendered state and might already solve 99% of the problems @YaLTeR has with this PR?

That will be very welcome regardless. I had to make my own poor man's implementation of this for animations (window resize, window close): https://github.com/YaLTeR/niri/blob/47f6c85f64a20a9e32bf402941ecb78065aff80a/src/render_helpers/surface.rs No opaque regions or damage there though.
