Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unnecessary barrier + possibly incorrect barriers? #1097

Open
jherico opened this issue Jan 10, 2024 · 6 comments · May be fixed by #1103
Open

Unnecessary barrier + possibly incorrect barriers? #1097

jherico opened this issue Jan 10, 2024 · 6 comments · May be fixed by #1103

Comments

@jherico
Copy link
Contributor

jherico commented Jan 10, 2024

You're moving both the storage buffers between the compute and graphics queues and vice versa, but only the storageBuffers.output.buffer is actually used on the graphics queue.

bufferBarrier.buffer = storageBuffers.input.buffer;

Looking more closely at this, I think the logic in addComputeToComputeBarriers is wrong. The loop in buildComputeCommandBuffer executes 64 times, using the descriptor set to swap which buffer is being read and which one is being written, but the addComputeToComputeBarriers only ever does one thing... it ensures that both the input and output buffers are going from a write state to a read state.

One barrier should be going from read to write, and the other should be going from write to read, and the next frame they should swap back, over and over until the loop is exited, at which point the cross-queue barriers come into play.

@jherico jherico changed the title Unnecessary barrier Unnecessary barrier + possibly incorrect barriers? Jan 11, 2024
@jherico
Copy link
Contributor Author

jherico commented Jan 11, 2024

This is the intra-compute barrier function I've done in my port:

    void computeToComputeBarrier(const vk::CommandBuffer& cmdBuf, bool reverse) {
        std::array<vk::BufferMemoryBarrier2, 2> barriers;
        barriers[0] = vk::BufferMemoryBarrier2{ //
                                                // src stage and access
                                                vk::PipelineStageFlagBits2::eComputeShader, vk::AccessFlagBits2::eShaderWrite,
                                                // dst stage and access
                                                vk::PipelineStageFlagBits2::eComputeShader, vk::AccessFlagBits2::eShaderRead,
                                                // src and dst stage
                                                VK_QUEUE_FAMILY_IGNORED, VK_QUEUE_FAMILY_IGNORED,
                                                // buffer, offset and size
                                                reverse ? storageBuffers.input.buffer : storageBuffers.output.buffer, 0, VK_WHOLE_SIZE
        };
        barriers[1] = vk::BufferMemoryBarrier2{ //
                                                // src stage and access
                                                vk::PipelineStageFlagBits2::eComputeShader, vk::AccessFlagBits2::eShaderRead,
                                                // dst stage and access
                                                vk::PipelineStageFlagBits2::eComputeShader, vk::AccessFlagBits2::eShaderWrite,
                                                // src and dst stage
                                                VK_QUEUE_FAMILY_IGNORED, VK_QUEUE_FAMILY_IGNORED,
                                                // buffer, offset and size
                                                reverse ? storageBuffers.output.buffer : storageBuffers.input.buffer, 0, VK_WHOLE_SIZE
        };
        cmdBuf.pipelineBarrier2(vk::DependencyInfo{ {}, nullptr, barriers, nullptr });
    }

I'm able to run this without any validation errors, though I'm not getting validation errors on your version either, so I guess the layers can't detect something subtle like this.

@jherico
Copy link
Contributor Author

jherico commented Jan 11, 2024

Sorry, I didn't realize you already had a global "rework sync" issue that would cover this here #871

@jherico jherico linked a pull request Feb 5, 2024 that will close this issue
@SRSaunders
Copy link
Contributor

SRSaunders commented May 1, 2024

I may have been the last one to touch this sync code, so I should probably try to answer your comments above:

You're moving both the storage buffers between the compute and graphics queues and vice versa, but only the storageBuffers.output.buffer is actually used on the graphics queue.

I think you may be right about this for the compute-to-graphics transition.

One barrier should be going from read to write, and the other should be going from write to read, and the next frame they should swap back, over and over until the loop is exited, at which point the cross-queue barriers come into play.

The only compute queue memory barriers that matter are the write-to-read barriers. These are needed to ensure the previous stage write operations are completed before reads happen in the next stage of the pipeline (when the buffers flip from write-mode to read-mode). That's why you only see write-to-read barriers in the code. Read-to-write operations do not require memory barriers - these are execution based. See https://github.com/KhronosGroup/Vulkan-Docs/wiki/Synchronization-Examples-(Legacy-synchronization-APIs) for more info.

I will have another look at the barriers, including the cross-queue ones, to see if any need correction or simplification (i.e. make sure masks are correct and get rid of unneeded barriers for buffers not involved in transitions).

@SRSaunders
Copy link
Contributor

SRSaunders commented May 16, 2024

After studying the existing code, the compute-to-compute buffer memory barriers are functional as is but could use some optimization based on your analysis above. The write-to-read barrier is correct, but only needs to be applied to one of the buffers during each call, i.e. the one being written to. There is no need to create a memory barrier for the other buffer which is being read from, since the same vkCmdPipelineBarrier() call provides an execution barrier which is enough.

The split queue transfer operations use access masks and pipeline stages for the graphics and compute queues. For the graphics -> compute queue transfer I think it's better to use VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT instead of VK_PIPELINE_STAGE_VERTEX_INPUT_BIT, and also use VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT for the access mask.

Regarding the "extra" queue ownership transitions for input.buffer, it's true that it doesn't need to be moved back and forth between queues, but the action is harmless and does not cause performance issues. I think it's better to just leave it alone for simplicity and consistency.

Thanks @jherico for pointing out these issues. I will submit a PR that addresses them, and in addition adds double buffering for the compute queue for parallel operation with the graphics queue. I always wondered why there were 2 compute command buffers defined, but only 1 was used. I suspect the double buffering code was never completed. I have completed this with a small number of changes primarily to the semaphores. The results are quite interesting and speed up the fps for this example by ~50% on Windows, and >100% on Linux. No change on macOS since there is no dedicated compute queue on that platform - it's shared with the graphics queue.

Here are some RenderDoc screen caps so you can see what is happening with the compute buffers in the compute-to-compute transitions (using my pending PR). The purple triangles are the memory barriers, and they follow the yellow write operations. The read operations are also marked in yellow, but they are not followed by memory barriers. You can see how the memory barriers are interleaved for input.buffer and output.buffer.

For input.buffer:
RenderDoc input-buffer

For output.buffer
RenderDoc output-buffer

@jherico
Copy link
Contributor Author

jherico commented May 21, 2024

My experience with this is that when I corrected the barriers, the behavior changed drastically, specifically the amount of computed movement per frame increased.

I suspect that the current barrier setup may be sufficiently correct (or just too complicated) to satisfy the validation layers but is causing the computations in the 64-pass compute loop to be overwritten frequently.

I'll try to grab a local renderdoc capture and see if I can verify the issue.

@SaschaWillems
Copy link
Owner

There are now two pull request for this. Which one should a look into?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants