Move output swizzling pass before fusions #1651
base: develop
Conversation
krzysz00 commented Sep 13, 2024
- Move the output fusion swizzling before fusions
- Due to the tension between LDS consolidation and multibuffering, remove the reuse-lds call before the output swizzle. As a consequence, remove the "increasing total LDS usage" heuristic from output swizzle enablement, since it should probably be fine
- Fix an issue where fusion traversal wasn't working correctly, resulting in insufficiently vectorized writes to global memory despite previous attempts to fix the issue
- Fix a test that wasn't using i8 LDS
- Update the packed arithmetic test to check for vectorized writes
- Add a guard in case the ExistingOps strictness is still letting LDS writes into the output swizzle rewrite
@@ -441,14 +426,6 @@ void RockOutputSwizzlePass::runOnOperation() {
      << ldsRequiredBytes << " bytes, skipping pass\n");
    return;
  }
  // heuristic: if we need more LDS, skip this pass
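For context, the dropped heuristic was a guard of roughly this shape. This is only a sketch: every name other than `ldsRequiredBytes` is an assumption, not the actual deleted code.

```cpp
// Sketch only: skip the output-swizzle rewrite whenever it would grow total
// LDS usage compared to what the kernel needs without the swizzle.
// `ldsUsedWithoutSwizzleBytes` is a hypothetical name, not the real variable.
if (ldsRequiredBytes > ldsUsedWithoutSwizzleBytes) {
  LLVM_DEBUG(llvm::dbgs() << "Output swizzle would increase LDS usage to "
                          << ldsRequiredBytes << " bytes, skipping pass\n");
  return;
}
```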
we should check if there's any performance regression due to this. I'm happy to do this if you are busy with other things.
%input_reg = rock.alloc() : memref<16xf32, #gpu.address_space<private>>
%output_reg = rock.alloc() : memref<16xf32, #gpu.address_space<private>>
%ws_lds = rock.alloc() : memref<64xf32, #gpu.address_space<workgroup>>
If rock.alloc() is always supposed to allocate i8, should we add that as a check in GpuAllocOp::verify()?
It's more that the LDS reduce pass fails if you don't do this so ... yeah, I'll add a check.
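Roughly along these lines, perhaps. This is only a sketch; the accessor name `getOutput()` and the exact diagnostic wording are assumptions, not the final rocMLIR code:

```cpp
// Sketch of a GpuAllocOp verifier that requires workgroup (LDS) allocations
// to use an i8 element type, so the byte-level LDS reuse pass can reason
// about them. Private (register) allocations keep their typed form.
LogicalResult GpuAllocOp::verify() {
  auto memrefType = cast<MemRefType>(getOutput().getType());
  auto addrSpace =
      dyn_cast_or_null<gpu::AddressSpaceAttr>(memrefType.getMemorySpace());
  if (addrSpace && addrSpace.getValue() == gpu::AddressSpace::Workgroup &&
      !memrefType.getElementType().isInteger(8))
    return emitOpError("workgroup (LDS) allocations must use i8 elements");
  return success();
}
```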
What problem are you solving here? As in, what is the motivation for moving this upwards in the pipeline and relying "more" on "utils" to get vectorization data?
Force-pushed from df5a80d to 4cc423b
After a long discussion with @dhernandez0, I'm feeling we should not be doing this without more analysis on how this affects things.

Before:

After this PR:
Again, coming back to my original question: what problem are you solving here?
The problem I wanted to solve here was that we'd be doing badly-formatted reads from global memory, because we'd be reading in the MFMA layout and not the coalesced-read-promoting layout that you get after doing the LDS swizzle, on the reasonable assumption that the fusion inputs are stored somewhat like the final output. Which might've been false.
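To make the layout point concrete, here is a toy, standalone illustration (not rocMLIR code; the wave size, elements per lane, and the exact MFMA layout are simplifying assumptions) of why per-lane accesses are strided in the accumulator layout but contiguous after the swizzle:

```cpp
// Toy model: print which global-memory element indices one lane touches in a
// strided, MFMA-accumulator-like layout vs. a swizzled, contiguous layout.
#include <cstdio>

int main() {
  const int waveSize = 64;     // lanes per wave (assumed for illustration)
  const int elemsPerLane = 4;  // accumulator elements each lane holds
  const int lane = 3;          // look at one arbitrary lane

  std::printf("MFMA-like layout (strided accesses):   ");
  for (int i = 0; i < elemsPerLane; ++i)
    std::printf("%d ", lane + i * waveSize);  // neighbors are waveSize apart
  std::printf("\nSwizzled layout (contiguous accesses): ");
  for (int i = 0; i < elemsPerLane; ++i)
    std::printf("%d ", lane * elemsPerLane + i);  // one vectorizable run
  std::printf("\n");
  return 0;
}
```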
That itself sounds like a good idea... however, I think we can pragmatically verify that this is the case, as in we only do the output swizzle if the gemm output buffer agrees with
I had some time and I've run some performance experiments (see the file attached). There are some performance regressions:

We should do performance experiments for fusions as well, I think.
Thanks @dhernandez0...
I've done a quick experiment with fusion (conv+add+relu) on MI300:
tensorflow code:
There is a nice speed up in this case: develop 0.0541076ms.

To reproduce, run:
python3 slow_tf.py
MIGRAPHX_DISABLE_PASSES=auto_contiguous MIGRAPHX_TRACE_BENCHMARKING=3 ./bin/migraphx-driver perf --exhaustive-tune --onnx slow_nhwc_tf.onnx
Thanks @dhernandez0! So it works out nicely when the layout of
Yes, I think so, it makes sense. However, I think this is a realistic case for most use cases; tensors of a network generally have the same layout. In the recent ticket https://github.com/ROCm/rocMLIR-internal/issues/1625,

After running 5 times, here are the averages of the previous experiment:
@krzysz00 it just came to my mind: I used the develop branch after the upstream merge for these experiments. I think this PR branch is not up to date.
Force-pushed from 7f45086 to 3a49093
Force-pushed from 3bde8c6 to 5a4cc0f
…ert the enableApplicability change
Force-pushed from 0c8e550 to de2595a