Ubershaders and pipeline pre-compilation (and dedicated transfer queues). #90400
This resembles godotengine/godot-proposals#5229 and godotengine/godot-proposals#6497 a lot, although I haven't proposed it for VoxelGI and LightmapGI yet as these are not Environment or CameraEffects properties. If such a setting is disabled, we can assume the user is OK with having runtime shader compilation occur the first time the setting is enabled (since they'll probably be in an options menu while doing so).
I gave this a shot and got pretty successful results. The current caveat is that pipeline compilation will be less likely to be triggered for resources loaded through a background thread in a loading screen unless the game features a scene first with the feature used in place. If not, then it must defer the loading to the surface cache creation instead. However, the results are pretty good. The pre-compilation on the TPS demo has gone down significantly: that's around 300 pipelines, down from 650+ pipelines in the OP, pretty much doubling the speed of the initial load in the demo that I showcased in the video, and it still has no pipeline stutters during drawing. I haven't detected any regressions from implementing this yet, but trying to find edge cases is still worth investigating.
Godot.Third-Person.Shooter.Demo.DEBUG.2024-04-11.13-19-04-00.00.04.468-00.00.11.535.mp4
I still think we could use some global settings to fine-tune the behavior (e.g. automatically detect, always pre-compile, never pre-compile), but this gets us much closer to an ideal level of pre-compilations that I wanted to see from the start.
I investigated Canvas Renderer support and the potential problems we'd have to fix to fully take advantage of it. First off, Canvas Renderer does suffer from the exact same problem: pipelines are compiled at drawing time if necessary. The total number of pipelines this happens on is fairly small. Still, it's undeniable you can get stutters from behavior such as enabling and disabling lights in proximity of the elements. I added the entire framework for supporting ubershaders but ultimately left it disabled for now for a few reasons, even if it does work as intended.
For now I'm leaning towards addressing other issues the PR currently has (such as an extra CPU cost due to a mutex I want to avoid), but if anyone has an example of a project that requires lots of different shaders and pipelines, is entirely 2D, and suffers from stutters, that'd give me a good example to use as a reference.
Tested locally with Vulkan Forward+ and Mobile rendering methods, it works as expected. Shader compilation stutter is completely gone in the TPS demo when shooting or destroying an enemy. Runtime performance is identical to master when no shader compilation occurs.
The profilers that track pipeline compilations also work as expected. Docs look good to me as well.
This comes at the cost of slightly longer startup times, but I'd say it's worth it.
Benchmark
PC specifications
- CPU: Intel Core i9-13900K
- GPU: NVIDIA GeForce RTX 4090
- RAM: 64 GB (2×32 GB DDR5-5800 C30)
- SSD: Solidigm P44 Pro 2 TB
- OS: Linux (Fedora 39)
Using a Linux x86_64 optimized editor build (with LTO).
Startup + shutdown times when running https://github.com/godotengine/tps-demo's main menu:
Cold driver shader cache
$ hyperfine -iw1 -p "rm -rf ~/.cache/nvidia/GLCache" "bin/godot.linuxbsd.editor.x86_64 --path ~/Documents/Godot/tps-demo --quit" "bin/godot.linuxbsd.editor.x86_64.transfer_and_pipelines --path ~/Documents/Godot/tps-demo --quit"
Benchmark 1: bin/godot.linuxbsd.editor.x86_64 --path ~/Documents/Godot/tps-demo --quit
Time (mean ± σ): 2.412 s ± 0.029 s [User: 1.057 s, System: 0.294 s]
Range (min … max): 2.371 s … 2.463 s 10 runs
Benchmark 2: bin/godot.linuxbsd.editor.x86_64.transfer_and_pipelines --path ~/Documents/Godot/tps-demo --quit
Time (mean ± σ): 2.555 s ± 0.247 s [User: 1.418 s, System: 0.318 s]
Range (min … max): 2.079 s … 2.719 s 10 runs
Warm shader driver cache
$ hyperfine -iw1 "bin/godot.linuxbsd.editor.x86_64 --path ~/Documents/Godot/tps-demo --quit" "bin/godot.linuxbsd.editor.x86_64.transfer_and_pipelines --path ~/Documents/Godot/tps-demo --quit"
Benchmark 1: bin/godot.linuxbsd.editor.x86_64 --path ~/Documents/Godot/tps-demo --quit
Time (mean ± σ): 2.152 s ± 0.028 s [User: 0.831 s, System: 0.271 s]
Range (min … max): 2.126 s … 2.204 s 10 runs
Benchmark 2: bin/godot.linuxbsd.editor.x86_64.transfer_and_pipelines --path ~/Documents/Godot/tps-demo --quit
Time (mean ± σ): 2.236 s ± 0.039 s [User: 0.917 s, System: 0.294 s]
Range (min … max): 2.193 s … 2.320 s 10 runs
Summary
bin/godot.linuxbsd.editor.x86_64 --path ~/Documents/Godot/tps-demo --quit ran
1.04 ± 0.02 times faster than bin/godot.linuxbsd.editor.x86_64.transfer_and_pipelines --path ~/Documents/Godot/tps-demo --quit
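The relative-speed figure in the summary can be re-derived from the means and standard deviations listed above. The sketch below assumes standard first-order error propagation for a ratio of two independent measurements; that this is what the benchmarking tool does internally is an assumption, and the function name `relative_speed` is made up for illustration.

```python
import math


def relative_speed(mean_fast, sd_fast, mean_slow, sd_slow):
    """Return (ratio, uncertainty) of slow/fast, assuming independent runs.

    Uses first-order error propagation for a quotient (an assumption about
    how the reported "± 0.02" was obtained, not a documented fact).
    """
    ratio = mean_slow / mean_fast
    rel_err = math.sqrt((sd_fast / mean_fast) ** 2 + (sd_slow / mean_slow) ** 2)
    return ratio, ratio * rel_err


# Numbers copied from the benchmark output above (seconds).
warm = relative_speed(2.152, 0.028, 2.236, 0.039)
cold = relative_speed(2.412, 0.029, 2.555, 0.247)
print(f"warm cache: {warm[0]:.2f} ± {warm[1]:.2f}")
print(f"cold cache: {cold[0]:.2f} ± {cold[1]:.2f}")
```

With the warm-cache numbers this gives roughly 1.04 ± 0.02, matching the summary. The cold-cache ratio comes out near 1.06 with an uncertainty of about 0.10, so the cold-cache difference is within the noise of these 10 runs.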
PS: I wonder how this will interact with #88199 – does Metal make this approach possible?
Should be completely fine as far as I know as the PR's approach is completely driver-agnostic. A lot of the changes on this one are just basically fixing a lot of stuff that wasn't thread safe, so it could expose some other bugs if part of the Metal driver assumed that wasn't gonna happen (which was a common issue in the D3D12 one but easily fixed).
For folks keeping up to date with this PR, we encountered a few problems that currently make this a bit risky to merge with the Metal backend. I'm unsure at the moment if the problem originates from the PR itself or just from the fact that the Metal backend was never put through such heavily multithreaded work in the past. This wouldn't be entirely unexpected, as this PR had to implement multiple fixes to the D3D12 driver to avoid race conditions. As far as I'm concerned, the PR is done and it's been stable for us on Windows and Linux so far, but keep it in mind if Mac support is important for you. I'll attempt to see if the issues on Mac can be identified and solved.
Small update: it seems most of it was related to the default secondary thread stack size being different and a bit too small for Godot. Increasing this has fixed most of the crashes. I'm still tracking one remaining issue but the PR seems to be working fine now on Mac.
…ice. Add ubershaders and rework pipeline caches for Forward+ and Mobile.
- Implements asynchronous transfer queues from PR godotengine#87590.
- Adds ubershaders that can run with specialization constants specified as push constants.
- Pipelines with specialization constants can compile in the background.
- Added monitoring for pipeline compilations.
- Materials and shaders can now be created asynchronously on background threads.
- Meshes that are loaded on background threads can also compile pipelines as part of the loading process.
Looks great now! This is the final culmination of a lot of work spread over many months. I am very glad to see it finished.
This is ready to merge, and I suggest we merge it quickly to avoid conflicts.
I have personally tested on many devices including Win10, Linux, MacOS, and Android. I tested the TPS demo on all platforms, but I also tested the Nuku Warriors demo on Windows and multiple misc. demos on Linux. I am confident at this point that this is good enough for merging.
Amazing work @DarioSamo 🎉
Anyone wanting an introduction to this merge can have a look at the tutorial introduced by the PR to the docs here: https://docs.godotengine.org/en/latest/tutorials/performance/pipeline_compilations.html
Tutorial for the new functionality added by godotengine/godot#90400
Super excited to try this with the XR Editor in Dev 4. Bit of a question however - Does this support the compatibility renderer?
I'm afraid it's pretty much not possible by design. Modern APIs like Vulkan are the only ones that provide direct control over creating pipelines, which is what this entire system is designed around.
This PR could've fixed #95112, needs further testing.
This is a big PR with quite a bit of history that should be evaluated very thoroughly to decide where we want to make some concessions and to mitigate the side effects as much as possible. However, the benefits are essential to shipping games with the engine and making the final experience for users much better. To read further on @reduz's notes about the topic, you can check out these documents (Part 1 and Part 2).
Due to the complexity of this PR and how 4.3 is currently in feature freeze, I'd definitely not consider this PR until 4.3 is out. If you want the TL;DR: skip ahead to the two videos with the TPS demo to see the immediate difference.
NOTE: These improvements will only affect the Forward+ and Mobile renderers. No changes are expected for Compatibility.
Transfer queues
First of all, this PR supersedes the transfer queues PR and effectively uses it as its base. The reliance on needing to unlock parts of the behavior of RenderingDevice to make it multithread-friendly to reap the benefits was far too much to keep both PRs separate. As mentioned in that previous PR, merging it as is will cause a small performance regression unless #86333 is merged first.
Pipeline compilation
Modern APIs like Vulkan and D3D12 have made rendering pipeline management very explicit: their creation is no longer hidden behind the current rendering state and handled on demand by the driver. Instead, the developer must create the entire pipeline ahead of time and wait on a blocking operation that can take a significant amount of time depending on the complexity of the shader and the speed of the hardware. This has seen some improvements recently with the introduction of new extensions like VK_EXT_graphics_pipeline_library, but as always, Godot must engineer solutions aimed towards resolving the problem for as much hardware as possible and use such features optionally for optimization in the future.
Godot has the responsibility to perform as fast as possible for the end user, which leaves it no choice but to generate pipelines with as little code and as few requirements as possible. The engine achieves this through the use of shader compilation macros (shader variants) and the use of specialization constants to optimize code for a particular pipeline (pipeline variants). While Godot resolves shader variant compilation and can even ship the shader cache to skip the step altogether, it couldn't resolve pipeline variant compilation ahead of time at all before this PR.
If you're familiar with the "stutters when playing the game for the first time" phenomenon that has plagued all games shipped with Godot 4's RD-based renderers, this is pretty much the entire root of the problem. This is not a problem exclusive to Godot as it's been very evident in lots of commercial releases that include very extensive shader pre-compilation steps the first time a game starts or a driver update happens. The issue is so prevalent even Digital Foundry points it out as the #1 problem plaguing PC game releases in this article and they never fail to mention the existence of the problem on any new game that suffers from it.
Ubershaders for Godot 4
The exciting part about this PR is an effective solution was developed to address this problem completely without the need to introduce extensive shader pre-compilation steps or any input from the game developer whatsoever. Instead, attempts have been made to make pipeline compilation a part of loading assets as much as possible. Not only does this mean most pipeline compilation is no longer resolved at drawing time, it can also even be done in background threads and presented as part of a regular loading screen. That means the game is no longer at the mercy of the renderer introducing these stutters when it needs to draw, but it makes the behavior much more predictable and able to be handled as part of a loading process.
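As a rough illustration of moving compilation from drawing time to loading time, here is a minimal Python sketch. It is a toy model, not the engine's code: `PipelineStore`, `load_mesh`, `draw` and the `"MESH"`/`"DRAW"` labels are hypothetical names, and a string stands in for a compiled pipeline.

```python
import threading
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


class PipelineStore:
    """Toy model: one pipeline per material name, compiled at most once."""

    def __init__(self):
        self._lock = threading.Lock()
        self.ready = {}
        self.compilations = Counter()  # counts per trigger, e.g. "MESH" or "DRAW"

    def ensure(self, material, source):
        # The lock is held for the whole "compile" to keep the toy simple;
        # a real implementation would not serialize compilations like this.
        with self._lock:
            if material not in self.ready:
                # Stand-in for the blocking driver call that builds a pipeline.
                self.ready[material] = f"pipeline:{material}"
                self.compilations[source] += 1
            return self.ready[material]


def load_mesh(store, materials, pool):
    """Loading path: request every pipeline the mesh needs on worker threads."""
    futures = [pool.submit(store.ensure, m, "MESH") for m in materials]
    return [f.result() for f in futures]


def draw(store, material):
    """Drawing path: only compiles if loading did not already cover it."""
    return store.ensure(material, "DRAW")


store = PipelineStore()
with ThreadPoolExecutor(max_workers=4) as pool:
    load_mesh(store, ["rock", "muzzle_flash"], pool)
draw(store, "muzzle_flash")   # already compiled during loading
draw(store, "never_loaded")   # falls back to a draw-time compilation
print(dict(store.compilations))
```

In this model, anything requested during loading never shows up as a draw-time compilation; only the material that loading never saw is compiled while drawing.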
The main improvement this PR makes is the introduction of ubershaders once more to the engine, but these are quite different from what was previously done in Godot 3. Unlike the previous version of the engine, these shaders do not correspond to generating text shaders with specializations and compiling them in the background, which could lead to a lot of CPU usage that'd take lots of time in weaker systems. Instead, ubershaders are mostly still very similar to the current shaders the engine already has, with a key difference: specialization constants are pulled from push constants instead. This means that the engine is able to use a version of the shader already that can be used for drawing immediately while the specialized version is generated in the background. Pipeline variants are much faster to generate like this instead of relying on runtime shader compilation to insert the constants as part of the shader text, as they work directly on the SPIR-V and skip the need to compile the shader from text again.
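The fallback behavior described above can be sketched in a few lines of Python. This is only a model under stated assumptions: `PipelineSlot`, `pipeline_for_draw` and `demo` are hypothetical names, strings and tuples stand in for GPU pipelines, and the returned "push constants" value simply represents the constants the ubershader would read at draw time.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class PipelineSlot:
    """An always-ready ubershader plus specialized variants built in the background."""

    def __init__(self, ubershader, executor, compile_specialized):
        self.ubershader = ubershader
        self._executor = executor
        self._compile = compile_specialized
        self._futures = {}
        self._lock = threading.Lock()

    def pipeline_for_draw(self, constants):
        """Return (pipeline, push_constants) without ever blocking the draw."""
        with self._lock:
            future = self._futures.get(constants)
            if future is None:
                # First request for this variant: start compiling it off-thread.
                future = self._executor.submit(self._compile, constants)
                self._futures[constants] = future
        if future.done():
            # Specialized variant is ready: constants are baked in.
            return future.result(), None
        # Not ready yet: draw with the ubershader and feed it the same
        # constants at draw time instead.
        return self.ubershader, constants

    def wait(self, constants):
        """Block until a variant finishes (only used by the demo below)."""
        return self._futures[constants].result()


def demo():
    gate = threading.Event()

    def slow_compile(constants):
        gate.wait()  # simulate a long-running pipeline compilation
        return ("specialized", constants)

    key = (("use_lights", True),)
    with ThreadPoolExecutor(max_workers=1) as pool:
        slot = PipelineSlot("ubershader", pool, slow_compile)
        first = slot.pipeline_for_draw(key)    # compilation still pending
        gate.set()
        slot.wait(key)
        second = slot.pipeline_for_draw(key)   # compilation finished
    return key, first, second


key, first, second = demo()
print(first[0], second[0])
```

The first draw uses the ubershader with the constants passed alongside it; once the background compilation completes, later draws switch to the specialized pipeline.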
Specialization constants are a big part of how Godot optimizes pipelines, but they've been limited by parts of the design as to how many can actually be used. Any additional constant implied an explosion of variants that led to the pipeline cache structure getting even bigger (160 KB in just pointers in Forward+ for any single material in master at the moment!), and every new addition meant that if the state is very dynamic, stutters would occur due to extra pipeline compilation. This was quite evident in the Mobile renderer, which uses a specialization constant to disable lights if they're not used: as soon as a light popped up, stutters due to pipeline compilation were inevitable.
With this change, a new simple hashing system for pipeline caching is introduced instead:
Pipeline compilation at loading time
The other key part behind the PR is the introduction of pipeline compilation of the ubershaders in two extra steps.
The difference in how both of these changes work together is pretty evident on the TPS demo by simulating a clean run as an end user would see the first time they run the game. A big chunk of the stutters are gone, especially the one that happens the first time the character shoots, which is a typical case of a stutter that only happened at drawing time despite the effect being loaded in the scene tree already.
Both of these videos have pipeline caching disabled and the driver cache deleted between each run.
master (dc91479): Godot.Third-Person.Shooter.Demo.DEBUG.2024-04-08.13-26-46-00.00.03.950-00.00.25.336.mp4
transfer_and_pipelines: 2024-04-08.13-28-29-00.00.07.952-00.00.21.936.mp4
It's also worth noting how the loading screen animation actually plays out more of the time instead of having one big stutter at the end due to the initial pipeline compilation at drawing time. These loading times are also significantly shortened by making multiple improvements to the behavior of both the shader and pipeline compiler, allowing it to multi-thread more effectively and use more of the system's resources.
The negatives (and how we can mitigate them)
As was expected, these benefits do not come for free. But there are multiple ways we can attempt to mitigate most of the extra cost, and this is an area I'm open to feedback on and that we can further optimize in future versions as well.
The biggest reason behind these negatives is the engine's flexibility. Features can be turned on and off without explicit operations from the user at a global level: a scene can be instanced to use VoxelGI while another one might use Lightmaps instead. As a matter of fact, this is exactly what the TPS demo does, so any run of the game must pre-compile the Lightmap variants because it can't know ahead of time which method the user has chosen without looking at the scene's contents, which is yet to be instanced during mesh loading.
One of the things I hope to improve while this PR is in progress is reducing the amount of variants that are pre-compiled as much as possible. Therefore it'd be great to gather feedback on which of these methods are most effective and how to implement them:
It's worth noting that under the current implementation, a false positive from any of these will not cause the engine to misbehave: at worst, it just causes the drawing time stutters the current version already has.
Testing methodology
The driver shader cache (e.g. %LocalAppData%/NVIDIA/GLCache) must be deleted before each run. No test should be considered valid without deleting this cache first and foremost.
Trying to measure the results can be a bit tricky as they are heavily dependent on the behavior you see in a project. While the benefits are visually evident, as seen in the videos, it is hard to measure the effects of pipeline compilation at drawing time, as they present themselves as stutters that happen all throughout the game instead of in one particular scenario.
New performance monitors
Some new statistics have been added to the performance monitors which should help verify without a shadow of a doubt if the pipeline pre-compilation is working as intended. There are four different pipeline compilation sources that are identified, and they should help towards understanding where an extended loading time or stutter comes from.
Quoted from the documentation added by this PR:
- RENDERING_INFO_PIPELINE_COMPILATIONS_MESH: Number of pipeline compilations that were triggered by loading meshes. These compilations will show up as longer loading times the first time a user runs the game and the pipeline is required.
- RENDERING_INFO_PIPELINE_COMPILATIONS_SURFACE: Number of pipeline compilations that were triggered by building the surface cache before rendering the scene. These compilations will show up as a stutter when loading scenes the first time a user runs the game and the pipeline is required.
- RENDERING_INFO_PIPELINE_COMPILATIONS_DRAW: Number of pipeline compilations that were triggered while drawing the scene. These compilations will show up as stutters during gameplay the first time a user runs the game and the pipeline is required.
- RENDERING_INFO_PIPELINE_COMPILATIONS_SPECIALIZATION: Number of pipeline compilations that were triggered to optimize the current scene. These compilations are done in the background and should not cause any stutters whatsoever.
bugsquad edit: Fixes #61233
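One way to read the four counters is to sample them before and after an event (a scene load, the first shot fired) and look at which ones moved. The Python sketch below only models that bookkeeping; how the values are read from the engine is left out, and the `EFFECTS` table and `explain` function are hypothetical helpers whose wording paraphrases the documentation quoted above.

```python
# Suffixes of the four monitors, mapped to the symptom the docs attribute to each.
EFFECTS = {
    "MESH": "longer loading time",
    "SURFACE": "stutter while loading a scene",
    "DRAW": "stutter during gameplay",
    "SPECIALIZATION": "background work, no stutter expected",
}


def explain(before, after):
    """Return {source: (new_compilations, expected_symptom)} for counters that grew."""
    report = {}
    for source, effect in EFFECTS.items():
        delta = after.get(source, 0) - before.get(source, 0)
        if delta > 0:
            report[source] = (delta, effect)
    return report


# Made-up sample values for illustration only, not measurements.
before_shot = {"MESH": 120, "SURFACE": 30, "DRAW": 0, "SPECIALIZATION": 40}
after_shot = {"MESH": 120, "SURFACE": 30, "DRAW": 2, "SPECIALIZATION": 45}
report = explain(before_shot, after_shot)
for source, (count, effect) in report.items():
    print(f"{source}: +{count} ({effect})")
```

A non-zero DRAW delta is the case to investigate, since it is the only source the documentation ties to stutters during gameplay.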
TODO
Contributed by W4 Games. 🍀