@brnorris03
Contributor

@brnorris03 brnorris03 commented Dec 15, 2025

What?

Restructure TTL-to-TTKernel lowering to emit setup ops before tile loops, enabling DMA tile loop fusion directly during conversion.

Why?

Previously, setup ops (tensor accessor creation, CB pointer retrieval) were emitted inline before each tile loop, blocking the FuseSiblingTileLoops pass from fusing adjacent loops. This change emits all setup ops first, then a single fused loop for copies with matching tile grids.

Pre-conversion grouping is more efficient than pattern-based lowering followed by post-hoc fusion: setup ops are emitted once before the fused loop rather than once before each individual loop.

How?

  • Add pre-conversion grouping in ConvertTTLToTTKernel.cpp that collects adjacent copy ops with matching tile grid bounds
  • Emit fused loops directly during conversion (setup block → single tile loop with all DMAs)
  • Recursively process nested regions (e.g., scf.for loop bodies)
  • Partial fusion: when the dominance check fails mid-group (a CB or tensor operand is defined between copies), the group is split into subgroups and each subgroup is fused on its own (e.g., 4 copies with a CB bind in the middle -> two fused loops instead of four separate loops).
  • Remove fuse-tile-loops pipeline option (no longer needed)
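
The grouping and partial-fusion steps above can be sketched outside MLIR as follows. This is a minimal illustration, not the code in ConvertTTLToTTKernel.cpp: `CopyOp`, `groupCopies`, and the integer block positions are hypothetical stand-ins for the real ops and the dominance query, and the sketch assumes that setup ops are hoisted to the position of the first copy in each group.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for a TTL copy op; not the real MLIR op class.
struct CopyOp {
  std::pair<int, int> tileGrid;   // tile grid bounds (rows, cols)
  std::size_t position;           // index of the copy within its block
  std::size_t operandsDefinedAt;  // index of the latest op defining an operand
};

// Split a run of adjacent copies into fusable subgroups.
// A new subgroup starts when the tile grid changes, or when an operand of the
// current copy is defined at or after the subgroup's first copy. In that case
// setup ops hoisted before the fused loop would use a value before its
// definition (assumption: setup is emitted at the first copy of the group).
std::vector<std::vector<CopyOp>> groupCopies(const std::vector<CopyOp> &copies) {
  std::vector<std::vector<CopyOp>> groups;
  for (const CopyOp &copy : copies) {
    assert(copy.operandsDefinedAt < copy.position &&
           "operands must be defined before the copy that uses them");
    bool startNew = groups.empty();
    if (!startNew) {
      const CopyOp &first = groups.back().front();
      bool sameGrid = first.tileGrid == copy.tileGrid;
      bool dominates = copy.operandsDefinedAt < first.position;
      startNew = !sameGrid || !dominates;
    }
    if (startNew)
      groups.emplace_back();
    groups.back().push_back(copy);
  }
  return groups;
}
```

With four copies at positions 1, 2, 4, and 5, where the last two use a CB bound at position 3, `groupCopies` returns two groups of two copies each, which corresponds to the "two fused loops instead of four" case in the list above.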

How to Test?

llvm-lit -sv test/ttlang/Conversion/TTLToTTKernel/
llvm-lit -sv test/ttlang/Translate/TTLToCpp/

Checklist:

  • Self-reviewed (style, logic)
  • Added tests (or justified none needed)
  • PR is small and focused (one task)

@brnorris03 brnorris03 changed the base branch from main to bnorris/ttl-dm-kernel-lowering December 15, 2025 07:45
@brnorris03 brnorris03 force-pushed the bnorris/ttl-dm-kernel-lowering-fuse-sibling-loops branch from a52dfaa to 99dfbfb Compare December 15, 2025 14:48
Base automatically changed from bnorris/ttl-dm-kernel-lowering to main December 15, 2025 22:12
Added assertSupportedLayoutForTileLoop() function that asserts sharded layouts are not yet supported, with a reference to issue #118
Updated getTileGridShape() to take a Location parameter and call the assertion
Updated getTileGridShapeFromValue() to pass v.getLoc() to getTileGridShape()
@brnorris03 brnorris03 force-pushed the bnorris/ttl-dm-kernel-lowering-fuse-sibling-loops branch from 4806159 to c5913c0 Compare December 16, 2025 03:01
@brnorris03 brnorris03 changed the title from "[ttl] kernel lowering fuse sibling loops in dm threads" to "[ttkernel] Fuse sibling loops in dm threads" Dec 16, 2025
@brnorris03 brnorris03 marked this pull request as ready for review December 16, 2025 03:33
@brnorris03 brnorris03 requested a review from a team as a code owner December 16, 2025 03:33
@brnorris03 brnorris03 requested a review from zoecarver December 16, 2025 03:33
@brnorris03 brnorris03 changed the title from "[ttkernel] Fuse sibling loops in dm threads" to "[ttkernel] Fuse generated sibling loops in dm threads" Dec 16, 2025
@brnorris03 brnorris03 force-pushed the bnorris/ttl-dm-kernel-lowering-fuse-sibling-loops branch 2 times, most recently from e5a511b to 9554082 Compare December 16, 2025 04:04
@brnorris03 brnorris03 force-pushed the bnorris/ttl-dm-kernel-lowering-fuse-sibling-loops branch from 9554082 to d952f07 Compare December 16, 2025 04:05
@zoecarver
Contributor

Thinking out loud: so the main motivator for the compute op is that we can fuse trivially, but these are data-movement ops, so we can't use a compute op and have to do the fusing more manually? Is there any way we could abstract that further?


namespace {

static constexpr llvm::StringLiteral kTileLoopMarker = "ttkernel.tile_loop";
Contributor

nit: duplicate definition

}

/// Check if two loops are adjacent in the same block with only constants
/// between them.
Contributor

What if these alias?

  Lowered UNFUSED (correct):
  for tile in 0..4: noc_async_read_tile(tile, A, cb0)    // issue all reads
  for tile in 0..4: noc_async_write_tile(tile, cb0, B)   // issue all writes to DRAM
  for tile in 0..4: noc_async_read_tile(tile, B, cb1)    // issue all reads from DRAM
  noc_async_read_barrier()   // wait for loop 1
  noc_async_write_barrier()  // wait for loop 2 - ALL writes to B complete
  noc_async_read_barrier()   // wait for loop 3

  Lowered FUSED (loops 2+3 have same bounds):
  for tile in 0..4: noc_async_read_tile(tile, A, cb0)
  for tile in 0..4:
    noc_async_write_tile(tile, cb0, B)  // issue write to B[tile]
    noc_async_read_tile(tile, B, cb1)   // immediately read B[tile] - RACE!
                                        // write is async, hasn't landed yet
  noc_async_read_barrier()
  noc_async_write_barrier()
  noc_async_read_barrier()
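
One conservative way to rule out the race described above is to refuse to place two transfers in the same fused loop body when they touch the same tensor and at least one of them is a write. The sketch below is an assumption for illustration, not something this PR is known to implement: `Copy`, `Direction`, and `conflictsWithGroup` are hypothetical names, and tensors are compared by name where a real pass would compare SSA values or consult alias information.

```cpp
#include <string>
#include <vector>

// Read: tensor -> CB. Write: CB -> tensor.
enum class Direction { Read, Write };

// Hypothetical stand-in for a copy op and the tensor it touches.
struct Copy {
  Direction dir;
  std::string tensor;  // illustrative identifier only
};

// Returns true if adding `next` to `group` would put two async transfers on
// the same tensor into one loop body with at least one of them a write.
// Two reads of the same tensor are allowed.
bool conflictsWithGroup(const std::vector<Copy> &group, const Copy &next) {
  for (const Copy &prev : group) {
    bool sameTensor = prev.tensor == next.tensor;
    bool anyWrite =
        prev.dir == Direction::Write || next.dir == Direction::Write;
    if (sameTensor && anyWrite)
      return true;
  }
  return false;
}
```

Applied to the example, the write to B followed by the read from B would be flagged, so loops 2 and 3 would stay separate and keep the barrier between them.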

// corresponding global barrier. Untyped handles are rejected by the
// verifier, but we also fail the rewrite defensively.
-auto kind = getTransferKindFromHandleType(adaptor.getXf().getType());
+auto kind = getTransferKindFromHandleType(op.getXf().getType());
Contributor

Can you explain this change?

Contributor Author

The TTL type is needed to determine the transfer direction.

// CHECK: }

// Consecutive barriers deduplicated to single barrier.
// CHECK: noc_async_read_barrier();
Contributor

Add a CHECK-NOT or CHECK-NEXT to make sure this is actually deduplicated?

Contributor Author

There are many -NOT checks in other tests; I don't think they need to be everywhere.

…creation, CB pointer retrieval) before tile loops, enabling DMA tile
loop fusion directly during conversion.

- Add pre-conversion grouping that collects adjacent copy ops with
  matching tile grid bounds and emits fused loops
- Recursively process nested regions (e.g., scf.for loop bodies)
- Remove fuse-tile-loops pipeline option (no longer needed)
- Update test expectations for fused output
@brnorris03 brnorris03 changed the title from "[ttkernel] Fuse generated sibling loops in dm threads" to "[TTL] Fuse DMA tile loops via pre-conversion grouping" Dec 25, 2025
- Add dominance check in emitGroupedCopies to prevent use-before-def
  when CB/tensor operands are defined between copy operations
- Remove TTKernelFuseSiblingTileLoops pass (pre-conversion grouping
  handles fusion during ConvertTTLToTTKernel)
- Add edge case tests for grouping rejection and multi-tile writes
- Update dma_single_core.mlir to remove FUSED pipeline checks
@brnorris03 brnorris03 force-pushed the bnorris/ttl-dm-kernel-lowering-fuse-sibling-loops branch from 6685250 to 7be41b3 Compare December 25, 2025 06:20
…fails, enabling partial fusion instead of falling back to no fusion.

- Add partial_fusion_four_copies test verifying two fused loops are generated when CB bindings break the chain.
@brnorris03 brnorris03 force-pushed the bnorris/ttl-dm-kernel-lowering-fuse-sibling-loops branch from 7be41b3 to 5429a7c Compare December 25, 2025 06:21
Contributor

@zoecarver zoecarver left a comment

LGTM! Thank you!
