Vulkan graphics pipelines use excessive amount of memory on Galaxy S23 #101635

Open
r-eckert opened this issue Jan 16, 2025 · 15 comments
Comments

@r-eckert

Tested versions

  • Reproducible in: 4.4.dev7, master starting from 98deb2a
  • Not reproducible in: 4.3.stable

System information

Samsung Galaxy S23 Ultra, Android 14, Vulkan (Mobile), Adreno 740

Issue description

After trying to update our project to Godot 4.4 we found that it crashes while loading on my Android phone.
The profiler showed graphics memory exceeding 4 GB before the app closed, while the same game running on Godot 4.3 uses only around 800 MB of graphics memory.

I have bisected everything between 4.3 and 4.4 and traced the problem to #90400 getting merged.

Here is a memory report generated by RenderingDevice.get_driver_and_device_memory_report from a version of our game with lots of content removed so it starts at all:

Total Driver Memory:76.373 MB
Total Driver Num Allocations: 51018
Total Device Memory:3755.699 MB
Total Device Num Allocations: 854

Memory use by object type (CSV format):

Category; Driver memory in MB; Driver Allocation Count; Device memory in MB; Device Allocation Count
UNKNOWN;0.0;0;0.0;0
INSTANCE;19.86637;1520;0.0;0
PHYSICAL_DEVICE;0.0;0;0.0;0
DEVICE;0.136612;580;0.247704;42
QUEUE;0.0;0;27.88282;11
SEMAPHORE;0.00267;10;0.0;0
COMMAND_BUFFER;0.0;0;1.484375;101
FENCE;0.001068;4;0.0;0
DEVICE_MEMORY;0.0;0;1056.0;8
BUFFER;10.4068;25260;0.0;0
IMAGE;1.001448;1756;0.181641;1
EVENT;0.0;0;0.0;0
QUERY_POOL;0.001633;6;0.007828;2
BUFFER_VIEW;0.000992;1;0.0;0
IMAGE_VIEW;0.836527;857;0.0;0
SHADER_MODULE;29.46983;1736;0.0;0
PIPELINE_CACHE;4.005884;286;0.0;0
PIPELINE_LAYOUT;1.361336;936;0.0;0
RENDER_PASS;0.018356;189;0.0;0
PIPELINE;2.593163;3315;2668.945;607
DESCRIPTOR_SET_LAYOUT;3.279724;8207;0.0;0
SAMPLER;0.01339;39;0.0;0
DESCRIPTOR_POOL;1.765778;1800;0.800781;46
DESCRIPTOR_SET;0.0;0;0.0;0
FRAMEBUFFER;1.206264;538;0.1492;36
COMMAND_POOL;0.403104;3976;0.0;0
DESCRIPTOR_UPDATE_TEMPLATE_KHR;0.0;0;0.0;0
SURFACE_KHR;0.000031;1;0.0;0
SWAPCHAIN_KHR;0.002022;1;0.0;0
DEBUG_UTILS_MESSENGER_EXT;0.0;0;0.0;0
DEBUG_REPORT_CALLBACK_EXT;0.0;0;0.0;0
ACCELERATION_STRUCTURE;0.0;0;0.0;0
VMA_BUFFER_OR_IMAGE;0.0;0;0.0;0

You can see that device memory for pipelines is 2668.945mb. The same line in Godot 4.3 (and 4.4 before #90400 got merged) shows a little over 1mb.
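As a rough aid for reading reports like the one above, the following sketch parses the semicolon-separated rows and ranks categories by device memory. It only uses a few rows copied from the report. The function names `parse_report` and `top_device_memory` are made up for this illustration and are not part of Godot, and the fifth header name is taken from the later reports in this thread.

```python
import csv
import io

# A few rows copied from the report above; the full report parses the same way.
# The fifth header name is taken from the later reports in this thread.
REPORT = """\
Category; Driver memory in MB; Driver Allocation Count; Device memory in MB; Device Allocation Count
QUEUE;0.0;0;27.88282;11
DEVICE_MEMORY;0.0;0;1056.0;8
SHADER_MODULE;29.46983;1736;0.0;0
PIPELINE;2.593163;3315;2668.945;607
"""


def parse_report(text):
    """Return {category: (driver_mb, driver_allocs, device_mb, device_allocs)}."""
    rows = csv.reader(io.StringIO(text), delimiter=";")
    next(rows)  # skip the header line
    return {
        r[0].strip(): (float(r[1]), int(r[2]), float(r[3]), int(r[4]))
        for r in rows
        if len(r) >= 5  # ignore blank or truncated lines
    }


def top_device_memory(report, n=3):
    """Categories sorted by device memory, largest first (hypothetical helper)."""
    ranked = sorted(report.items(), key=lambda kv: kv[1][2], reverse=True)
    return [(name, values[2]) for name, values in ranked[:n]]


report = parse_report(REPORT)
for name, device_mb in top_device_memory(report):
    print(f"{name}: {device_mb:.1f} MB")
```

With these rows, `PIPELINE` comes out first, ahead of `DEVICE_MEMORY` and `QUEUE`.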

I modified my engine build to log every allocation related to pipeline objects and found that it allocates a block of 24 MB for some of the pipelines. Adding up those 24 MB allocations gives me pretty much exactly the excess memory use compared to a build without that PR.
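The arithmetic behind that observation can be sketched as follows. This is only an illustration, assuming every outsized allocation is exactly 24 MB. The helper name `split_into_blocks` is hypothetical, and the total is the `PIPELINE` device memory from the report above.

```python
BLOCK_MB = 24.0  # block size seen in the allocation log described above


def split_into_blocks(device_mb, block_mb=BLOCK_MB):
    """Return (whole_blocks, remainder_mb) for a pipeline device-memory total."""
    blocks = int(device_mb // block_mb)
    remainder = device_mb - blocks * block_mb
    return blocks, remainder


# PIPELINE device memory from the report above.
blocks, remainder = split_into_blocks(2668.945)
print(f"{blocks} blocks of {BLOCK_MB:.0f} MB, {remainder:.3f} MB left over")
```

Under that assumption the total corresponds to 111 blocks of 24 MB, with a few MB left over for everything else counted under pipelines.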

I suspect that it is related to the Ubershader that is used while the optimized pipeline is compiled.
A hacky attempt to disable the feature by preventing the "define UBERSHADER" in the shader from being set resulted in the weird allocations disappearing.
But I am not that familiar with the code yet and I am also running out of time that I can invest in this problem so hopefully someone here can find a proper workaround.

The attached MRP contains just a camera looking at a single cube and a script to print the memory report.
On my device this already uses 72mb for pipelines. Strangely on a Oneplus 6 it was only 6.6mb and on a Oneplus 8 it uses 12.6mb for pipelines which still seems excessive compared to 1mb but apparently this depends heavily on hardware or driver version.
I could not test yet how this scales with the real project on those other devices.

I understand that this is likely related to a driver issue that we can't fix but maybe it can be mitigated somehow? If not maybe ubershaders can be deactivated depending on hardware or a project setting?

Steps to reproduce

  1. Open the attached project
  2. Enable Deploy with Remote Debug
  3. Deploy to Android device
  4. Observe the logged memory report

Minimal reproduction project (MRP)

pipeline_memory_mrp.zip

@DarioSamo
Contributor

DarioSamo commented Jan 16, 2025

Can you show what numbers we're dealing with here concerning the number of pipelines created by your project, as shown in the monitors described here?

https://docs.godotengine.org/en/latest/tutorials/performance/pipeline_compilations.html

The amount of memory in use is indeed entirely driver-dependent. The only way to avoid the excessive memory usage caused by pipeline compilation would be to turn off specialization altogether, but that'd give you lower performance than 4.3 which opted to just specialize and stutter on the spot. This option is not offered at the moment and I'd push against such a thing being offered as it'd probably lead to users turning off things they shouldn't.

Strangely on a Oneplus 6 it was only 6.6mb and on a Oneplus 8 it uses 12.6mb for pipelines which still seems excessive compared to 1mb but apparently this depends heavily on hardware or driver version.

This is not that strange and I wouldn't say it's excessive. It's preloading the content before it shows it so it doesn't stutter. That's by design and you're likely to end up with a similar amount of usage when you end up exhausting their usage on the scene.

@r-eckert
Author

The real project which is where the 2668.945mb number is from creates 110 pipelines from meshes, 43 from surfaces, 12 from specialization and 6 from canvas.

The numbers for the Oneplus devices are from the MRP which uses much less than the real project but enough to show that something weird is going on with 72mb used on the Galaxy S23. I just added the Oneplus numbers for comparison.
The MRP creates 4 from surfaces and 1 from specialization

@DarioSamo
Contributor

DarioSamo commented Jan 16, 2025

The real project which is where the 2668.945mb number is from creates 110 pipelines from meshes, 43 from surfaces, 12 from specialization and 6 from canvas.

These pipeline numbers are very small. This sounds like a nasty code generation bug on this particular driver.

For reference, the third person shooter example project reaches about double the amount of pipelines that you've mentioned.

@DarioSamo
Contributor

DarioSamo commented Jan 16, 2025

The MRP creates 4 from surfaces and 1 from specialization

That'd likely hint that the problem is the ubershader generation itself rather than the specialization. Indeed, it seems like the only way to fix it here would be to disable ubershaders altogether for this particular device and always only opt for generating the specialized variant and stutter (which is the 4.3 behavior).

@akien-mga akien-mga added this to the 4.4 milestone Jan 16, 2025
@akien-mga akien-mga moved this from Unassessed to Release Blocker in 4.x Release Blockers Jan 16, 2025
@clayjohn
Member

@r-eckert Can you provide some context as to how you measured the device memory for pipelines in 4.3? RenderingDevice.get_driver_and_device_memory_report() wasn't added until 4.4 dev 1. Did you make a custom engine build from 4.3 with #96044 cherry-picked on top?

@clayjohn
Member

clayjohn commented Jan 28, 2025

Tested with a Pixel 4 (Adreno 640) and I can't reproduce this issue. I get 16.578125 mb from the pipelines.

The fact that the S23 seems to be allocating memory in multiples of 24 mb might be a hint as to what is going wrong.

Edit: Actually, testing with dev 3, I get 0.511719 mb from pipelines

@DarioSamo
Contributor

Tested with a Pixel 4 (Adreno 640) and I can't reproduce this issue. I get 16.578125 mb from the pipelines.
Edit: Actually, testing with dev 3, I get 0.511719 mb from pipelines

I still think this fits within reason though. By design there'll be more pipelines compiled ahead of time; usually the memory they consume is small compared to the benefit of avoiding stutter. The device reported in the OP seems like an outlier compared to the weaker phones we've tested on, but we don't have much control over the code generation in that regard.

My recommendation would be to try disabling the pipeline cache feature in project settings and see if that affects it.

@r-eckert
Author

@r-eckert Can you provide some context as to how you measured the device memory for pipelines in 4.3? RenderingDevice.get_driver_and_device_memory_report() wasn't added until 4.4 dev 1. Did you make a custom engine build from 4.3 with #96044 cherry-picked on top?

I was originally comparing the memory usage with the profiler in Android Studio. I did not actually test on 4.3 using the memory report. I only started using that to find out what exactly was causing the memory use since the Android Studio profiler only labeled the memory as "Graphics".

@clayjohn clayjohn moved this from Release Blocker to Bad in 4.x Release Blockers Jan 28, 2025
@clayjohn
Member

Testing with the Forward+ renderer I get
Beta1: 241 mb
dev3: 145 mb

My recommendation would be to try disabling the pipeline cache feature in project settings and see if that affects it.

This didn't help unfortunately.

@Calinou
Member

Calinou commented Jan 30, 2025

Testing on a Samsung Galaxy Tab S9 Ultra (16 GB RAM) with Android 14. This tablet has the same SoC as a Samsung Galaxy S23 (Snapdragon 8 Gen 2) but has more RAM in its 1 TB variant (16 GB instead of 12 GB).

4.4.dev3

Godot Engine v4.4.dev3.official.f4af8201b - https://godotengine.org
Vulkan 1.3.128 - Forward Mobile - Using Device #0: Qualcomm - Adreno (TM) 740

=== Driver Memory Report ===
Launch with --extra-gpu-memory-tracking and build with DEBUG_ENABLED for this functionality to work.
Device memory may be unavailable if the API does not support it(e.g. VK_EXT_device_memory_report is unsupported).

Total Driver Memory:9.852648 MB
Total Driver Num Allocations: 6710
Total Device Memory:171.5334 MB
Total Device Num Allocations: 222

Memory use by object type (CSV format):

Category; Driver memory in MB; Driver Allocation Count; Device memory in MB; Device Allocation Count
UNKNOWN;0.0;0;0.0;0
INSTANCE;0.455457;255;0.0;0
PHYSICAL_DEVICE;0.0;0;0.0;0
DEVICE;0.074432;288;0.246601;29
QUEUE;0.0;0;9.296883;5
SEMAPHORE;0.002136;8;0.0;0
COMMAND_BUFFER;0.0;0;0.96875;34
FENCE;0.000534;2;0.0;0
DEVICE_MEMORY;0.0;0;160.0;4
BUFFER;0.082397;200;0.0;0
IMAGE;0.175781;426;0.357422;1
EVENT;0.0;0;0.0;0
QUERY_POOL;0.001633;6;0.007828;2
BUFFER_VIEW;0.000992;1;0.0;0
IMAGE_VIEW;0.200348;202;0.0;0
SHADER_MODULE;5.313644;712;0.0;0
PIPELINE_CACHE;0.482186;116;0.0;0
PIPELINE_LAYOUT;0.3983;398;0.0;0
RENDER_PASS;0.005444;55;0.0;0
PIPELINE;0.416645;645;0.150131;114
DESCRIPTOR_SET_LAYOUT;0.942505;2832;0.0;0
SAMPLER;0.013733;40;0.0;0
DESCRIPTOR_POOL;0.822136;188;0.453125;22
DESCRIPTOR_SET;0.0;0;0.0;0
FRAMEBUFFER;0.380615;195;0.052612;11
COMMAND_POOL;0.081676;139;0.0;0
DESCRIPTOR_UPDATE_TEMPLATE_KHR;0.0;0;0.0;0
SURFACE_KHR;0.000031;1;0.0;0
SWAPCHAIN_KHR;0.002022;1;0.0;0
DEBUG_UTILS_MESSENGER_EXT;0.0;0;0.0;0
DEBUG_REPORT_CALLBACK_EXT;0.0;0;0.0;0
ACCELERATION_STRUCTURE;0.0;0;0.0;0
VMA_BUFFER_OR_IMAGE;0.0;0;0.0;0

4.4.beta1

Godot Engine v4.4.beta1.official.d33da79d3 - https://godotengine.org
Vulkan 1.3.128 - Forward Mobile - Using Device #0: Qualcomm - Adreno (TM) 740

=== Driver Memory Report ===
Launch with --extra-gpu-memory-tracking and build with DEBUG_ENABLED for this functionality to work.
Device memory may be unavailable if the API does not support it(e.g. VK_EXT_device_memory_report is unsupported).

Total Driver Memory:10.0953722000122 MB
Total Driver Num Allocations: 6259
Total Device Memory:244.196598052979 MB
Total Device Num Allocations: 242

Memory use by object type (CSV format):

Category; Driver memory in MB; Driver Allocation Count; Device memory in MB; Device Allocation Count
UNKNOWN;0.0;0;0.0;0
INSTANCE;1.09360408782959;278;0.0;0
PHYSICAL_DEVICE;0.0;0;0.0;0
DEVICE;0.10919952392578;349;0.24674606323242;31
QUEUE;0.0;0;9.29688262939453;5
SEMAPHORE;0.00614166259766;23;0.0;0
COMMAND_BUFFER;0.0;0;1.5234375;43
FENCE;0.00080108642578;3;0.0;0
DEVICE_MEMORY;0.0;0;160.0;4
BUFFER;0.0823974609375;200;0.0;0
IMAGE;0.182373046875;442;0.357421875;1
EVENT;0.0;0;0.00001525878906;2
QUERY_POOL;0.00163269042969;6;0.00782775878906;2
BUFFER_VIEW;0.00099182128906;1;0.0;0
IMAGE_VIEW;0.20828247070312;210;0.0;0
SHADER_MODULE;4.87978172302246;564;0.0;0
PIPELINE_CACHE;0.63574600219727;139;0.0;0
PIPELINE_LAYOUT;0.34844970703125;338;0.0;0
RENDER_PASS;0.00786209106445;78;0.0;0
PIPELINE;0.44414138793945;674;72.2350921630859;119
DESCRIPTOR_SET_LAYOUT;0.79085922241211;2362;0.0;0
SAMPLER;0.01373291015625;40;0.0;0
DESCRIPTOR_POOL;0.81290054321289;204;0.4765625;24
DESCRIPTOR_SET;0.0;0;0.0;0
FRAMEBUFFER;0.380615234375;195;0.0526123046875;11
COMMAND_POOL;0.09380722045898;151;0.0;0
DESCRIPTOR_UPDATE_TEMPLATE_KHR;0.0;0;0.0;0
SURFACE_KHR;0.00003051757812;1;0.0;0
SWAPCHAIN_KHR;0.00202178955078;1;0.0;0
DEBUG_UTILS_MESSENGER_EXT;0.0;0;0.0;0
DEBUG_REPORT_CALLBACK_EXT;0.0;0;0.0;0
ACCELERATION_STRUCTURE;0.0;0;0.0;0
VMA_BUFFER_OR_IMAGE;0.0;0;0.0;0

Interestingly, this is present in the output even though the output appears correct:

Launch with --extra-gpu-memory-tracking and build with DEBUG_ENABLED for this functionality to work.

In fact, it still appears if I add --extra-gpu-memory-tracking to the Main Run Args project setting. If I run the project on desktop, all values return 0 as the argument isn't passed automatically. If I add it to Main Run Args and run on desktop, it causes a crash on startup (I'll report this separately). This happens even if I just run the project manager with that argument.

Edit: The crash on desktop is likely similar to #95967.
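To make the difference between the 4.4.dev3 and 4.4.beta1 reports above easier to see, here is a small sketch that subtracts the device memory of one report from the other. Only the categories with non-trivial device memory are copied in, and `device_memory_growth` is a name invented for this example.

```python
# Device memory in MB, copied from the 4.4.dev3 and 4.4.beta1 reports above.
DEV3 = {
    "QUEUE": 9.296883,
    "COMMAND_BUFFER": 0.96875,
    "DEVICE_MEMORY": 160.0,
    "PIPELINE": 0.150131,
    "DESCRIPTOR_POOL": 0.453125,
}
BETA1 = {
    "QUEUE": 9.29688262939453,
    "COMMAND_BUFFER": 1.5234375,
    "DEVICE_MEMORY": 160.0,
    "PIPELINE": 72.2350921630859,
    "DESCRIPTOR_POOL": 0.4765625,
}


def device_memory_growth(before, after):
    """Per-category change in device memory (MB), largest increase first."""
    names = sorted(set(before) | set(after))
    deltas = {n: after.get(n, 0.0) - before.get(n, 0.0) for n in names}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)


growth = device_memory_growth(DEV3, BETA1)
for name, delta in growth:
    print(f"{name:16s} {delta:+9.3f} MB")
```

The `PIPELINE` row grows by about 72 MB, which is close to three blocks of 24 MB, while the other categories barely move.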

@clayjohn
Member

clayjohn commented Feb 1, 2025

@r-eckert Is there any chance you have an S22 or S24 that you can test your main project with?

Alternatively, is there any way you could privately share APKs of your project (with the memory report) exported from dev 3 and beta 1?

@clayjohn
Member

clayjohn commented Feb 1, 2025

I have made some headway investigating this. We know a few things:

  1. The problem is definitely the ubershaders
  2. The problem appears to be limited to that one device/driver
  3. Each individual pipeline is taking way more memory than it should
  4. There are way more pipelines now than there were before the ubershader

In #102217 I reduced the total number of pipelines that get generated by making better use of the information we have at startup. That reduced memory usage from pipelines by 1/3, but it doesn't help the pathological explosion of size that each ubershader pipeline has.

To investigate a bit further I tried running the shader through the Adreno Offline Compiler to see if we can glean more relevant information:

The following results are from the fragment shader of the default spatial material, captured on the first frame of running.

Dev3
Adreno Offline Compiler (AOC)
    -----------------------------
    AOC Version     : 2.0
    Compiler Version: E031.42.11.00


======== Shader Stats FS ========

         Shader Preamble Stats
Total instruction count                                     :   469
ALU instruction count 32bit                                 :   18
ALU instruction count 16bit                                 :   0
Complex instruction count 32bit                             :   0
Complex instruction count 16bit                             :   0
Texture read instruction count                              :   0
Memory read instruction count                               :   4
Memory write instruction count                              :   52
Flow control instruction count                              :   1
Barrier and fence Instruction count                         :   0
Short latency sync instruction count                        :   49
Long latency sync instruction count                         :   2
Miscellaneous instruction count                             :   343

         Main Shader Stats
Total instruction count                                     :   3654
ALU instruction count 32bit                                 :   1534
ALU instruction count 16bit                                 :   40
Complex instruction count 32bit                             :   71
Complex instruction count 16bit                             :   1
Texture read instruction count                              :   22
Memory read instruction count                               :   135
Memory write instruction count                              :   0
Flow control instruction count                              :   80
Barrier and fence Instruction count                         :   58
Short latency sync instruction count                        :   140
Long latency sync instruction count                         :   75
Miscellaneous instruction count                             :   1498
Full precision register footprint per shader instance       :   21
Half precision register footprint per shader instance       :   41
Overall register footprint per shader instance              :   21
Scratch memory usage per shader instance                    :   0
Output component count                                      :   4
Input component count                                       :   10
ALU fiber occupancy percentage                              :   50

Beta 1
    Adreno Offline Compiler (AOC)
    -----------------------------
    AOC Version     : 2.0
    Compiler Version: E031.42.11.00


======== Shader Stats FS ========

         Shader Preamble Stats
Total instruction count                                     :   800
ALU instruction count 32bit                                 :   45
ALU instruction count 16bit                                 :   0
Complex instruction count 32bit                             :   0
Complex instruction count 16bit                             :   0
Texture read instruction count                              :   0
Memory read instruction count                               :   40
Memory write instruction count                              :   129
Flow control instruction count                              :   1
Barrier and fence Instruction count                         :   0
Short latency sync instruction count                        :   98
Long latency sync instruction count                         :   37
Miscellaneous instruction count                             :   450

         Main Shader Stats
Total instruction count                                     :   5483
ALU instruction count 32bit                                 :   1918
ALU instruction count 16bit                                 :   97
Complex instruction count 32bit                             :   100
Complex instruction count 16bit                             :   0
Texture read instruction count                              :   38
Memory read instruction count                               :   104
Memory write instruction count                              :   0
Flow control instruction count                              :   180
Barrier and fence Instruction count                         :   88
Short latency sync instruction count                        :   176
Long latency sync instruction count                         :   93
Miscellaneous instruction count                             :   2689
Full precision register footprint per shader instance       :   26
Half precision register footprint per shader instance       :   48
Overall register footprint per shader instance              :   26
Scratch memory usage per shader instance                    :   20
Output component count                                      :   4
Input component count                                       :   11
ALU fiber occupancy percentage                              :   37

I then used the following diff to force disable optional features

Diff
diff --git a/servers/rendering/renderer_rd/shaders/forward_mobile/scene_forward_mobile_inc.glsl b/servers/rendering/renderer_rd/shaders/forward_mobile/scene_forward_mobile_inc.glsl
index 49c8905dbf..6225c7a744 100644
--- a/servers/rendering/renderer_rd/shaders/forward_mobile/scene_forward_mobile_inc.glsl
+++ b/servers/rendering/renderer_rd/shaders/forward_mobile/scene_forward_mobile_inc.glsl
@@ -89,15 +89,15 @@ float sc_packed_3() {
 #endif

 bool sc_use_light_projector() {
-       return ((sc_packed_0() >> 0) & 1U) != 0;
+       return false;
 }

 bool sc_use_light_soft_shadows() {
-       return ((sc_packed_0() >> 1) & 1U) != 0;
+       return false;
 }

 bool sc_use_directional_soft_shadows() {
-       return ((sc_packed_0() >> 2) & 1U) != 0;
+       return false;
 }

 bool sc_decal_use_mipmaps() {
@@ -113,23 +113,23 @@ bool sc_disable_fog() {
 }

 bool sc_use_depth_fog() {
-       return ((sc_packed_0() >> 6) & 1U) != 0;
+       return false;
 }

 bool sc_use_fog_aerial_perspective() {
-       return ((sc_packed_0() >> 7) & 1U) != 0;
+       return false;
 }

 bool sc_use_fog_sun_scatter() {
-       return ((sc_packed_0() >> 8) & 1U) != 0;
+       return false;
 }

 bool sc_use_fog_height_density() {
-       return ((sc_packed_0() >> 9) & 1U) != 0;
+       return false;
 }

 bool sc_use_lightmap_bicubic_filter() {
-       return ((sc_packed_0() >> 10) & 1U) != 0;
+       return false;
 }

 bool sc_multimesh() {

Master with all features force disabled
    Adreno Offline Compiler (AOC)
    -----------------------------
    AOC Version     : 2.0
    Compiler Version: E031.42.11.00


======== Shader Stats FS ========

         Shader Preamble Stats
Total instruction count                                     :   484
ALU instruction count 32bit                                 :   27
ALU instruction count 16bit                                 :   0
Complex instruction count 32bit                             :   0
Complex instruction count 16bit                             :   0
Texture read instruction count                              :   0
Memory read instruction count                               :   4
Memory write instruction count                              :   58
Flow control instruction count                              :   1
Barrier and fence Instruction count                         :   0
Short latency sync instruction count                        :   45
Long latency sync instruction count                         :   2
Miscellaneous instruction count                             :   347

         Main Shader Stats
Total instruction count                                     :   3900
ALU instruction count 32bit                                 :   1574
ALU instruction count 16bit                                 :   40
Complex instruction count 32bit                             :   71
Complex instruction count 16bit                             :   0
Texture read instruction count                              :   28
Memory read instruction count                               :   129
Memory write instruction count                              :   0
Flow control instruction count                              :   98
Barrier and fence Instruction count                         :   53
Short latency sync instruction count                        :   161
Long latency sync instruction count                         :   88
Miscellaneous instruction count                             :   1658
Full precision register footprint per shader instance       :   20
Half precision register footprint per shader instance       :   40
Overall register footprint per shader instance              :   20
Scratch memory usage per shader instance                    :   0
Output component count                                      :   4
Input component count                                       :   11
ALU fiber occupancy percentage                              :   50

Notably you can see the total instruction count changes dramatically and the Beta 1 ubershader uses scratch memory (which I suspect is the key thing here).
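The comparison can be made concrete with a short sketch that computes the change in main-shader instruction count relative to Dev3. The numbers are copied from the three AOC dumps above. The dictionary layout and the `percent_change` helper are assumptions made for this illustration.

```python
# Main Shader Stats copied from the three AOC dumps above.
STATS = {
    "Dev3": {"instructions": 3654, "flow_control": 80, "scratch": 0, "occupancy": 50},
    "Beta 1": {"instructions": 5483, "flow_control": 180, "scratch": 20, "occupancy": 37},
    "Master, features disabled": {"instructions": 3900, "flow_control": 98, "scratch": 0, "occupancy": 50},
}


def percent_change(baseline, value):
    """Relative change of value against baseline, in percent."""
    return (value - baseline) / baseline * 100.0


base = STATS["Dev3"]
for build, s in STATS.items():
    change = percent_change(base["instructions"], s["instructions"])
    print(
        f"{build}: {s['instructions']} instructions ({change:+.1f}%), "
        f"scratch {s['scratch']}, occupancy {s['occupancy']}%"
    )
```

Beta 1 has roughly 50% more main-shader instructions than Dev3 and is the only one of the three that uses scratch memory. Force-disabling the optional features brings the instruction count back to within about 7% of Dev3.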

Ultimately, I think we are just going to need to disable the Ubershader for the Adreno 740.

@clayjohn
Member

clayjohn commented Feb 5, 2025

@Calinou Can you test the following branches on your device?

  1. https://github.com/DarioSamo/godot/tree/dont-unroll-ubershader
  2. clayjohn@2e07ca8 (which includes Reduce mobile pipeline compilations #102217 as well as an additional optimization)
  3. https://github.com/clayjohn/godot/tree/mobile-pipelines-all-settings (which includes 2 and some additional changes)

With 3, on my Adreno 640 I am back to 0.5 MB in the MRP. With 2 I get 8 MB instead of 16. If my theory is correct, 1 won't do anything, 2 will help a bit but not much, and 3 will fix the problem :)

unroll
    Adreno Offline Compiler (AOC)
    -----------------------------
    AOC Version     : 2.0
    Compiler Version: E031.42.11.00


======== Shader Stats FS ========

         Shader Preamble Stats
Total instruction count                                     :   808
ALU instruction count 32bit                                 :   45
ALU instruction count 16bit                                 :   0
Complex instruction count 32bit                             :   0
Complex instruction count 16bit                             :   0
Texture read instruction count                              :   0
Memory read instruction count                               :   40
Memory write instruction count                              :   130
Flow control instruction count                              :   1
Barrier and fence Instruction count                         :   0
Short latency sync instruction count                        :   99
Long latency sync instruction count                         :   37
Miscellaneous instruction count                             :   456

         Main Shader Stats
Total instruction count                                     :   5483
ALU instruction count 32bit                                 :   1918
ALU instruction count 16bit                                 :   97
Complex instruction count 32bit                             :   100
Complex instruction count 16bit                             :   0
Texture read instruction count                              :   38
Memory read instruction count                               :   104
Memory write instruction count                              :   0
Flow control instruction count                              :   180
Barrier and fence Instruction count                         :   88
Short latency sync instruction count                        :   176
Long latency sync instruction count                         :   93
Miscellaneous instruction count                             :   2689
Full precision register footprint per shader instance       :   26
Half precision register footprint per shader instance       :   48
Overall register footprint per shader instance              :   26
Scratch memory usage per shader instance                    :   20
Output component count                                      :   4
Input component count                                       :   11
ALU fiber occupancy percentage                              :   37

my WIP patch
    Adreno Offline Compiler (AOC)
    -----------------------------
    AOC Version     : 2.0
    Compiler Version: E031.42.11.00


======== Shader Stats FS ========

         Shader Preamble Stats
Total instruction count                                     :   498
ALU instruction count 32bit                                 :   26
ALU instruction count 16bit                                 :   0
Complex instruction count 32bit                             :   0
Complex instruction count 16bit                             :   0
Texture read instruction count                              :   0
Memory read instruction count                               :   4
Memory write instruction count                              :   62
Flow control instruction count                              :   1
Barrier and fence Instruction count                         :   0
Short latency sync instruction count                        :   46
Long latency sync instruction count                         :   2
Miscellaneous instruction count                             :   357

         Main Shader Stats
Total instruction count                                     :   5002
ALU instruction count 32bit                                 :   1958
ALU instruction count 16bit                                 :   51
Complex instruction count 32bit                             :   97
Complex instruction count 16bit                             :   0
Texture read instruction count                              :   34
Memory read instruction count                               :   144
Memory write instruction count                              :   0
Flow control instruction count                              :   123
Barrier and fence Instruction count                         :   88
Short latency sync instruction count                        :   186
Long latency sync instruction count                         :   109
Miscellaneous instruction count                             :   2212
Full precision register footprint per shader instance       :   26
Half precision register footprint per shader instance       :   48
Overall register footprint per shader instance              :   26
Scratch memory usage per shader instance                    :   8
Output component count                                      :   4
Input component count                                       :   11
ALU fiber occupancy percentage                              :   37

Compilation succeeded.

Moving all settings to spec constants
    Adreno Offline Compiler (AOC)
    -----------------------------
    AOC Version     : 2.0
    Compiler Version: E031.42.11.00


======== Shader Stats FS ========

         Shader Preamble Stats
Total instruction count                                     :   458
ALU instruction count 32bit                                 :   32
ALU instruction count 16bit                                 :   0
Complex instruction count 32bit                             :   0
Complex instruction count 16bit                             :   0
Texture read instruction count                              :   0
Memory read instruction count                               :   5
Memory write instruction count                              :   58
Flow control instruction count                              :   1
Barrier and fence Instruction count                         :   0
Short latency sync instruction count                        :   41
Long latency sync instruction count                         :   1
Miscellaneous instruction count                             :   320

         Main Shader Stats
Total instruction count                                     :   4049
ALU instruction count 32bit                                 :   1495
ALU instruction count 16bit                                 :   38
Complex instruction count 32bit                             :   64
Complex instruction count 16bit                             :   0
Texture read instruction count                              :   22
Memory read instruction count                               :   87
Memory write instruction count                              :   0
Flow control instruction count                              :   88
Barrier and fence Instruction count                         :   36
Short latency sync instruction count                        :   128
Long latency sync instruction count                         :   69
Miscellaneous instruction count                             :   2022
Full precision register footprint per shader instance       :   19
Half precision register footprint per shader instance       :   39
Overall register footprint per shader instance              :   20
Scratch memory usage per shader instance                    :   0
Output component count                                      :   4
Input component count                                       :   9
ALU fiber occupancy percentage                              :   50

Compilation succeeded.

It looks like my WIP patch brings the scratch memory usage way down. My working theory is that scratch memory is responsible for the pathological increase in size, so hopefully testing will confirm that it helps.
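To follow a value such as scratch memory across several builds without reading each dump by hand, the stats text can be parsed into a dictionary. The sketch below assumes the layout seen in the dumps in this thread: a section title ending in "Stats", followed by `label : integer` lines. `parse_aoc_stats` is a hypothetical helper, not something shipped with the Adreno Offline Compiler, and the sample is a shortened excerpt of the WIP patch dump.

```python
import re

# Shortened excerpt of the "my WIP patch" dump above.
SAMPLE = """\
         Shader Preamble Stats
Total instruction count                                     :   498
Memory write instruction count                              :   62

         Main Shader Stats
Total instruction count                                     :   5002
Scratch memory usage per shader instance                    :   8
ALU fiber occupancy percentage                              :   37
"""

# A label, a colon, and an integer value; version lines such as "2.0" do not match.
STAT_LINE = re.compile(r"^(?P<label>.+?)\s*:\s*(?P<value>\d+)\s*$")


def parse_aoc_stats(text):
    """Return {section_title: {label: int}} for an AOC stats dump."""
    sections = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.endswith("Stats"):
            # Section titles look like "Main Shader Stats".
            current = sections.setdefault(stripped, {})
            continue
        match = STAT_LINE.match(stripped)
        if match and current is not None:
            current[match["label"]] = int(match["value"])
    return sections


stats = parse_aoc_stats(SAMPLE)
print(stats["Main Shader Stats"]["Scratch memory usage per shader instance"])
```

Keeping the preamble and main sections separate matters because both use the same labels, for example "Total instruction count".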

@r-eckert
Author

r-eckert commented Feb 6, 2025

I just tested branch 3 with our full project on my S23, and it no longer crashes. Unfortunately it still uses more than 1 GB for pipelines. It might still kill the game on lower-spec devices, but now I can test the full game on my phone again, so that's something.

=== Driver Memory Report ===
Launch with --extra-gpu-memory-tracking and build with DEBUG_ENABLED for this functionality to work.
Device memory may be unavailable if the API does not support it(e.g. VK_EXT_device_memory_report is unsupported).

Total Driver Memory:152.896263122559 MB
Total Driver Num Allocations: 79805
Total Device Memory:1892.51273727417 MB
Total Device Num Allocations: 1562

Memory use by object type (CSV format):

Category; Driver memory in MB; Driver Allocation Count; Device memory in MB; Device Allocation Count
UNKNOWN;0.0;0;0.0;0
INSTANCE;55.7987051010132;3757;0.0;0
PHYSICAL_DEVICE;0.0;0;0.0;0
DEVICE;0.23361206054688;933;0.25259780883789;75
QUEUE;0.0;0;37.1757888793945;14
SEMAPHORE;0.00667572021484;25;0.0;0
COMMAND_BUFFER;0.0;0;3.84375;127
FENCE;0.00106811523438;4;0.0;0
DEVICE_MEMORY;0.0;0;544.0;6
BUFFER;12.1915283203125;29592;0.0;0
IMAGE;1.20703125;2186;1.1875;2
EVENT;0.0;0;0.00001525878906;2
QUERY_POOL;0.00163269042969;6;0.00782775878906;2
BUFFER_VIEW;0.00099182128906;1;0.0;0
IMAGE_VIEW;7.69058227539062;7754;0.0;0
SHADER_MODULE;44.2954740524292;2638;0.0;0
PIPELINE_CACHE;9.15967750549316;563;0.0;0
PIPELINE_LAYOUT;2.17038726806641;1400;0.0;0
RENDER_PASS;0.06209564208984;651;0.0;0
PIPELINE;5.92645263671875;7019;1304.15357971191;1102
DESCRIPTOR_SET_LAYOUT;5.30092239379883;13872;0.0;0
SAMPLER;0.01338958740234;39;0.0;0
DESCRIPTOR_POOL;2.98871994018555;5593;1.22265625;70
DESCRIPTOR_SET;0.0;0;0.0;0
FRAMEBUFFER;5.15935897827148;2319;0.66902160644531;162
COMMAND_POOL;0.68590545654297;1451;0.0;0
DESCRIPTOR_UPDATE_TEMPLATE_KHR;0.0;0;0.0;0
SURFACE_KHR;0.00003051757812;1;0.0;0
SWAPCHAIN_KHR;0.00202178955078;1;0.0;0
DEBUG_UTILS_MESSENGER_EXT;0.0;0;0.0;0
DEBUG_REPORT_CALLBACK_EXT;0.0;0;0.0;0
ACCELERATION_STRUCTURE;0.0;0;0.0;0
VMA_BUFFER_OR_IMAGE;0.0;0;0.0;0
Monitors:
Canvas 4
Mesh 45
Surface 155
Draw 77
Specialization 163
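As a rough sanity check, the pipeline device memory can be divided by the number of compilations reported by the monitors, for the original report and for this one. This is only a sketch with two caveats. The monitors count compilations, which do not necessarily map one-to-one to driver allocations, and the two runs may not have used exactly the same content. The helper name `mb_per_compile` is made up for this example.

```python
def mb_per_compile(pipeline_device_mb, monitors):
    """Return (average MB per compilation, total compilations)."""
    total = sum(monitors.values())
    return pipeline_device_mb / total, total


# Original report: PIPELINE device memory and the monitor counts given earlier in the thread.
before = mb_per_compile(
    2668.945,
    {"mesh": 110, "surface": 43, "specialization": 12, "canvas": 6},
)
# Branch 3: PIPELINE device memory and the monitors from the report above.
after = mb_per_compile(
    1304.15357971191,
    {"canvas": 4, "mesh": 45, "surface": 155, "draw": 77, "specialization": 163},
)

for label, (avg_mb, count) in (("before", before), ("branch 3", after)):
    print(f"{label}: {count} compilations, {avg_mb:.1f} MB per compilation")
```

By this crude measure the average cost per compilation drops from about 15.6 MB to about 2.9 MB, while the number of compilations rises from 171 to 444. That would explain why the total is only cut roughly in half.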

@clayjohn
Copy link
Member

clayjohn commented Feb 6, 2025

@r-eckert It looks like that branch is blowing up the number of surface, draw, and specialization compiles. Are you changing quality settings at run time by any chance? Or any of the following:

  1. Changing soft shadow samples
  2. Changing MSAA dynamically
  3. Using multiple Viewports with different MSAA levels?
  4. Using multiple Viewports with different HDR2D settings?
  5. Using VRS?

It's odd that your pipeline size is only cut by half, so there must be something in your project that isn't captured by the MRP. Is there any chance you can produce an MRP that is more representative of your project, or provide us with access to your project for testing?

At this point I don't think it is worth spending any more time investigating this issue since it only appears with your device and your project and we don't have access to either.

@clayjohn clayjohn moved this from Bad to Not Critical in 4.x Release Blockers Feb 6, 2025
@akien-mga akien-mga modified the milestones: 4.4, 4.5 Mar 2, 2025
Projects
Status: Not Critical
Status: For team assessment
6 participants