@brnorris03 brnorris03 commented Dec 5, 2025

This draft PR is solely for discussion on a proposed ttl dialect (not intended to merge). See TTL_Dialect_Plan.md

@brnorris03 force-pushed the bnorris/ttl-dialect-plan branch from 11769f9 to 4c4f4d1 on December 5, 2025 17:38
@brnorris03 force-pushed the bnorris/ttl-dialect-plan branch from 4c4f4d1 to 16a5f0e on December 5, 2025 17:38
```
Python Kernel → Python AST → TTL Dialect → TTL Passes → TTKernel → ConvertTTKernelToEmitC → C++ Source
                                 ↓                                                             ↓
                  Validation, Synchronization                                            C++ Compiler
```
**Contributor** commented:
With the C++ source plus metadata about input/output tensors, CBs, etc., we go directly to the TT-NN generic operation (it internally compiles and runs the C++):

https://github.com/tenstorrent/tt-metal/blob/0ae4611214adb349a8621a46605943e0dac0e82b/ttnn/cpp/ttnn/operations/generic/generic_op.hpp#L20

**@brnorris03** (author) replied:

Great point! I am changing the runtime integration section completely and will update the workflows here.

```cpp
// Calculate total elements for TTKernel CB conversion
int64_t getTotalElements() const {
  int64_t elementsPerBlock = std::accumulate(
      getShape().begin(), getShape().end(), 1, std::multiplies<int64_t>());
```
**Contributor** commented:

Probably can re-use getElementsPerBlock below.
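A minimal sketch of that suggestion, assuming a `getElementsPerBlock` helper and member names like those in the snippet above (the `numBuffers` factor is a guess at what distinguishes the total from the per-block count, not the actual TTL type definition):

```cpp
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical stand-in for the TTL CB type; only the two accessors matter.
struct CBTypeSketch {
  std::vector<int64_t> shape;
  int64_t numBuffers = 2;  // assumed extra factor, not from the source

  int64_t getElementsPerBlock() const {
    // int64_t{1} as the init keeps the product in 64-bit arithmetic.
    return std::accumulate(shape.begin(), shape.end(), int64_t{1},
                           std::multiplies<int64_t>());
  }

  // The refactor: delegate instead of repeating the accumulate call.
  int64_t getTotalElements() const {
    return getElementsPerBlock() * numBuffers;
  }
};
```

One incidental benefit of delegating: the original snippet passes a plain `int` literal `1` as the accumulate init, while the helper can pin the init to `int64_t` in a single place.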

> Note: TTKernel doesn't support per-transaction waits. All `ttl.wait`
> operations lower to global DMA barriers (`ttkernel.noc_async_read_barrier`
> or `ttkernel.noc_async_write_barrier`). This type exists for ordering
> and future optimization opportunities.
**Contributor** commented:

In the case of a transfer from a pipe, the wait will likely lower into waiting on a semaphore.

**@brnorris03** (author) replied:

This document is no longer being edited; it was split into more manageable parts in the docs/ttl directory.

@brnorris03 force-pushed the bnorris/ttl-dialect-plan branch 7 times, most recently from 052d517 to a414bb1 on December 9, 2025 23:24
@brnorris03 force-pushed the bnorris/ttl-dialect-plan branch from a414bb1 to acca0ba on December 9, 2025 23:41
@brnorris03 force-pushed the bnorris/ttl-dialect-plan branch from 1f3bd1d to 51686b7 on December 10, 2025 01:34
@brnorris03 force-pushed the bnorris/ttl-dialect-plan branch from 8a738d0 to 6f40799 on December 15, 2025 16:06
```tablegen
let summary = "Handle for asynchronous transfer with transaction ID tracking";
let description = [{
  Transfer handle for DMA operations that maps to a TTKernel transaction ID (TRID).
  Each ttl.copy operation receives a unique TRID (0-15), and ttl.wait operations
```
**Contributor** commented:

What function in TTKernel returns the TRID?

**@brnorris03** (author) replied:

None; the compiler must generate it.
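One way the lowering pass could generate TRIDs, sketched under the assumption (from the description above) that the hardware provides exactly 16 transaction IDs (0-15); the class name and round-robin policy are illustrative, not TTKernel API:

```cpp
#include <cstdint>

// Toy compiler-side TRID allocator: hands out transaction IDs 0-15 in
// round-robin order, wrapping after the 16th allocation. A real pass would
// also have to prove the previous transfer on a reused TRID has completed.
class TridAllocator {
public:
  uint8_t allocate() {
    uint8_t trid = next_;
    next_ = (next_ + 1) % kNumTrids;  // wrap back to 0 after TRID 15
    return trid;
  }

private:
  static constexpr uint8_t kNumTrids = 16;
  uint8_t next_ = 0;
};
```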

**Contributor** commented:

Is there example code?

> Arity requirement: The dst_range tuple must have the same arity as the
> grid rank to prevent ambiguity. For a 2D grid (grid_x, grid_y), both dimensions
> must be specified explicitly. Use `slice(x, x+1)` for a single core in that dimension.
**Contributor** commented:

I think the language spec has a weaker constraint: pipes within the same pipe net must have the same dimensionality, but that dimensionality can be arbitrary, since the grid_size and core functions give us this ability. For example, a 1D pipe net can be defined within a 2D grid.

**@brnorris03** (author) replied:

I will update to match.
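The relaxed constraint the reviewer describes can be sketched as a verifier check; the function and type names here are illustrative, not the actual TTL verifier:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One extent per pipe-net dimension; rank = shape.size().
using PipeShape = std::vector<int64_t>;

// Hedged sketch: pipes in one pipe net need not match the grid rank,
// only each other. A 1D pipe net inside a 2D grid is therefore legal.
bool pipeNetRanksMatch(const std::vector<PipeShape> &pipes) {
  if (pipes.empty())
    return true;
  std::size_t rank = pipes.front().size();
  for (const auto &p : pipes)
    if (p.size() != rank)  // mixed ranks within one net: reject
      return false;
  return true;
}
```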

> Runtime representation: PipeNet carries no runtime data. During lowering to TTKernel,
> PipeNet operations are expanded and removed:
> - `ttl.create_pipenet %pipe1, %pipe2, ...` → stores pipe list in operation operands
**Contributor** commented:

I wonder if it is worth materializing each pipe description here. In most cases the pipe list will be formed with a Python list comprehension; maybe we just capture that comprehension's loop nest. But maybe not in the MVP.

**@brnorris03** (author) replied:

Not sure; this is just one possibility. I think it's probably easier to use a container (tensor) for storing the pipes.

TTL-specific attributes are defined below:

```tablegen
def TTL_SliceAttr : AttrDef<TTL_Dialect, "Slice"> {
```
**Contributor** commented:

Do we have a representation for slicing into a tensor accessor?

**@brnorris03** (author) replied:

Not at the moment, but ttnn tensors do support slice. We may need an extra op (that we define) to extract slices from the tensor accessor before it can be used as an arg in another op.

**Contributor** commented:

Yes, TT-NN does have slicing, but I am referring to tensor slicing with ttl.copy. I guess we need a way to convert slice expressions to the shard id/page id in noc_async_xxx_shard/page.
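The core of that conversion might look like the sketch below: mapping a slice start, expressed in shard coordinates, onto the flat shard id that the noc_async shard/page calls consume. The row-major flattening over a `[grid_rows, grid_cols]` shard grid is an assumption for illustration, not the documented TT-Metal layout:

```cpp
#include <cstdint>

// Hypothetical helper: flatten a 2D shard coordinate (derived from a slice
// expression) into the linear shard id used by noc_async_*_shard.
// Assumes row-major ordering of shards across the grid.
int64_t sliceToShardId(int64_t shard_row, int64_t shard_col,
                       int64_t grid_cols) {
  return shard_row * grid_cols + shard_col;  // row-major flatten
}
```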

```mlir
// TTL IR (%tensor : tensor<..., #ttl.tensor_encoding<DeviceDRAM,
//                          #ttl.layout<sharded, grid=[2,2]>>>)
%accessor = ttl.tensor_accessor %tensor
%xf = ttl.copy %accessor[%shard_id], %cb
```
**Contributor** commented:

Where does this %shard_id come from? Do we convert slices into it?

**@brnorris03** (author) replied:

That's the high-level idea, but not completely sure about the syntax yet.

**@brnorris03** (author) added:

So probably not the direct indexing shown above, but something updated to the more MLIR-typical form (again, different syntax can be implemented as needed/wanted):

```mlir
// TTL IR (After Python AST Compilation)
// Python: shard_id = ttl.core(dim=1)
//         xf = ttl.copy(a[shard_id], a_blk)

// Tensor accessor wraps the tensor with its layout metadata
%a_accessor = ttl.tensor_accessor %a
    : tensor<64x64xf32, #ttl.tensor_encoding<DeviceDRAM, #ttl.layout<sharded, grid=[2,2]>>>
    -> !ttl.accessor<tensor<64x64xf32, #ttl.tensor_encoding<DeviceDRAM, #ttl.layout<sharded, grid=[2,2]>>>>

// Core coordinate flattened to 1D (0-3 for 2x2 grid)
%shard_id = ttl.core {dims = 1} : index

// Reserve CB slot
%a_blk = ttl.cb_reserve %a_cb : !ttl.circular_buffer<[1,1], !ttcore.tile<32x32,f32>, 2>
    -> tensor<1x1x!ttcore.tile<32x32,f32>, #ttl.tensor_encoding<L1, #ttl.layout<tiled>>>

// Copy from accessor slice to CB block
// Indices are explicit operands; direction inferred from operand types
%xf_a = ttl.copy
    from %a_accessor at [%shard_id]
    to %a_blk
    : !ttl.accessor<...>, index -> !ttl.transfer_handle

ttl.wait %xf_a : !ttl.transfer_handle
ttl.cb_push %a_cb, %a_blk : !ttl.circular_buffer<...>, tensor<...>
```

@brnorris03 force-pushed the bnorris/ttl-dialect-plan branch from 0cbcd6c to c53cfcc on December 18, 2025 03:50