-
Notifications
You must be signed in to change notification settings - Fork 1
[TTL] Fuse DMA tile loops via pre-conversion grouping #95
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
[TTL] Fuse DMA tile loops via pre-conversion grouping #95
Conversation
a52dfaa to
99dfbfb
Compare
…-lowering-fuse-sibling-loops
Added assertSupportedLayoutForTileLoop() function that asserts sharded layouts are not yet supported, with a reference to issue #118 Updated getTileGridShape() to take a Location parameter and call the assertion Updated getTileGridShapeFromValue() to pass v.getLoc() to getTileGridShape()
4806159 to
c5913c0
Compare
e5a511b to
9554082
Compare
9554082 to
d952f07
Compare
|
Thinking out loud: so the main motivator for the compute op is that we can fuse trivially, but these are datamovement so we can't use a compute op, so we do the fusing more manually? Is there any way we could abstract that further? |
|
|
||
| namespace { | ||
|
|
||
| static constexpr llvm::StringLiteral kTileLoopMarker = "ttkernel.tile_loop"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: duplicate definition
| } | ||
|
|
||
| /// Check if two loops are adjacent in the same block with only constants | ||
| /// between them. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if these alias?
Lowered UNFUSED (correct):
for tile in 0..4: noc_async_read_tile(tile, A, cb0) // issue all reads
for tile in 0..4: noc_async_write_tile(tile, cb0, B) // issue all writes to DRAM
for tile in 0..4: noc_async_read_tile(tile, B, cb1) // issue all reads from DRAM
noc_async_read_barrier() // wait for loop 1
noc_async_write_barrier() // wait for loop 2 - ALL writes to B complete
noc_async_read_barrier() // wait for loop 3
Lowered FUSED (loops 2+3 have same bounds):
for tile in 0..4: noc_async_read_tile(tile, A, cb0)
for tile in 0..4:
noc_async_write_tile(tile, cb0, B) // issue write to B[tile]
noc_async_read_tile(tile, B, cb1) // immediately read B[tile] - RACE!
// write is async, hasn't landed yet
noc_async_read_barrier()
noc_async_write_barrier()
noc_async_read_barrier()
| // corresponding global barrier. Untyped handles are rejected by the | ||
| // verifier, but we also fail the rewrite defensively. | ||
| auto kind = getTransferKindFromHandleType(adaptor.getXf().getType()); | ||
| auto kind = getTransferKindFromHandleType(op.getXf().getType()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ttl type is required to be able to figure out the direction.
| // CHECK: } | ||
|
|
||
| // Consecutive barriers deduplicated to single barrier. | ||
| // CHECK: noc_async_read_barrier(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check not/check next to make sure this is actually deduplicated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are many -NOT checks in other tests... I don't think they need to be everywhere
…-lowering-fuse-sibling-loops
creation, CB pointer retrieval) before tile loops, enabling DMA tile loop fusion directly during conversion. - Add pre-conversion grouping that collects adjacent copy ops with matching tile grid bounds and emits fused loops - Recursively process nested regions (e.g., scf.for loop bodies) - Remove fuse-tile-loops pipeline option (no longer needed) - Update test expectations for fused output
- Add dominance check in emitGroupedCopies to prevent use-before-def when CB/tensor operands are defined between copy operations - Remove TTKernelFuseSiblingTileLoops pass (pre-conversion grouping handles fusion during ConvertTTLToTTKernel) - Add edge case tests for grouping rejection and multi-tile writes - Update dma_single_core.mlir to remove FUSED pipeline checks
6685250 to
7be41b3
Compare
…fails, enabling partial fusion instead of falling back to no fusion. - Add partial_fusion_four_copies test verifying two fused loops are generated when CB bindings break the chain.
7be41b3 to
5429a7c
Compare
zoecarver
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thank you!
What?
Restructure TTL-to-TTKernel lowering to emit setup ops before tile loops, enabling DMA tile loop fusion directly during conversion.
Why?
Previously, setup ops (tensor accessor creation, CB pointer retrieval) were emitted inline before each tile loop, blocking the FuseSiblingTileLoops pass from fusing adjacent loops. This change emits all setup ops first, then a single fused loop for copies with matching tile grids.
Pre-conversion grouping is more efficient than pattern-based lowering followed by post-hoc fusion. Setup ops are emitted once before the fused loop rather than inside each individual loop.
How?
How to Test?
Checklist: