[Discussion] ttl dialect proposal (plan)
#54
base: main
Conversation
Force-pushed 11769f9 to 4c4f4d1
Force-pushed 4c4f4d1 to 16a5f0e
docs/TTL_Dialect_Plan.md (Outdated)
```
Python Kernel → Python AST → TTL Dialect → TTL Passes → TTKernel → ConvertTTKernelToEmitC → C++ Source
                                  ↓                                         ↓
                     Validation, Synchronization                       C++ Compiler
```
With the C++ source plus metadata about input/output tensors, CBs, etc., we go directly to the TT-NN generic operation (it internally compiles and runs the C++):
Great point! I am changing the runtime integration section completely and will update the workflows here.
docs/TTL_Dialect_Plan.md (Outdated)
```cpp
// Calculate total elements for TTKernel CB conversion
int64_t getTotalElements() const {
  int64_t elementsPerBlock = std::accumulate(
      getShape().begin(), getShape().end(), 1, std::multiplies<int64_t>());
```
Probably can re-use getElementsPerBlock below.
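A minimal sketch of the suggested re-use, where `getTotalElements` delegates to a `getElementsPerBlock` helper instead of repeating the reduction. The surrounding struct, the shape member, and the `numBuffers` factor are assumptions for illustration, not the actual TTL code:

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical stand-in for the CB type discussed in the thread.
struct CBTypeSketch {
  std::vector<std::int64_t> shape;
  std::int64_t numBuffers; // assumed extra factor for illustration

  // Single home for the accumulate reduction over the block shape.
  std::int64_t getElementsPerBlock() const {
    return std::accumulate(shape.begin(), shape.end(), std::int64_t{1},
                           std::multiplies<std::int64_t>());
  }

  // Re-uses the helper rather than duplicating std::accumulate.
  std::int64_t getTotalElements() const {
    return getElementsPerBlock() * numBuffers;
  }
};
```

Note the explicit `std::int64_t{1}` initial value: passing a plain `1` (as in the quoted snippet) makes `std::accumulate` compute in `int`, which can overflow for large shapes.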
docs/TTL_Dialect_Plan.md (Outdated)
> Note: TTKernel doesn't support per-transaction waits. All ttl.wait operations lower to global DMA barriers (`ttkernel.noc_async_read_barrier` or `ttkernel.noc_async_write_barrier`). This type exists for ordering and future optimization opportunities.
In the case of a transfer from a pipe, the wait will likely lower into waiting on a semaphore.
docs/TTL_Dialect_Plan.md (Outdated)
This document is no longer being edited; it was split into more manageable parts in the docs/ttl directory.
Force-pushed 052d517 to a414bb1
Force-pushed a414bb1 to acca0ba
Force-pushed 1f3bd1d to 51686b7
…dd table of python ttl -> ttl dialect mapping
Force-pushed 8a738d0 to 6f40799
```tablegen
let summary = "Handle for asynchronous transfer with transaction ID tracking";
let description = [{
  Transfer handle for DMA operations that maps to a TTKernel transaction ID (TRID).
  Each ttl.copy operation receives a unique TRID (0-15), and ttl.wait operations
```
What function in TTKernel returns trid?
None; the compiler must generate it.
Is there example code?
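The thread leaves this open; below is a minimal sketch (an assumption for illustration, not TTKernel or tt-mlir code) of how a compiler could assign TRIDs per ttl.copy, round-robin over the 0-15 range mentioned in the quoted description:

```cpp
#include <cassert>

// Hypothetical compiler-side TRID allocator: since no TTKernel function
// returns a TRID, the lowering would hand one out per ttl.copy.
// Wrapping after 16 allocations is a simplification; a real allocator
// would have to prove the previous transfer with that TRID has completed.
class TridAllocator {
  int next_ = 0;

public:
  int allocate() {
    int trid = next_;
    next_ = (next_ + 1) & 0xF; // TRIDs are limited to 0-15
    return trid;
  }
};
```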
docs/ttl/02_TTL_Type_System.md (Outdated)
> Arity requirement: The dst_range tuple must have the same arity as the grid rank to prevent ambiguity. For a 2D grid (grid_x, grid_y), both dimensions must be specified explicitly. Use slice(x, x+1) for a single core in that dimension.
I think the language spec has a weaker constraint: pipes within the same pipe net must have the same dimensionality, but that dimensionality can be arbitrary, since the grid_size and core functions give us this ability. For example, a 1D pipe net can be defined within a 2D grid.
I will update to match.
> Runtime representation: PipeNet carries no runtime data. During lowering to TTKernel, PipeNet operations are expanded and removed:
> - ttl.create_pipenet %pipe1, %pipe2, ... → stores pipe list in operation operands
I wonder if it is worth materializing each pipe description here. In most cases the pipe list will be formed with a Python list comprehension; maybe we just capture that comprehension's loop nest. But maybe not in the MVP.
Not sure; this is just one possibility. I think it's probably easier to use a container (tensor) for storing the pipes.
> TTL-specific attributes are defined below:

```tablegen
def TTL_SliceAttr : AttrDef<TTL_Dialect, "Slice"> {
```
Do we have a representation for slicing into a tensor accessor?
Not at the moment, but ttnn tensors do support slice. We may need an extra op (that we define) to extract slices from the tensor accessor before it can be used as an arg in another op.
Yes, TT-NN does have slicing, but I am referring to tensor slicing with ttl.copy. I guess we need a way to convert slice expressions to a shard id/page id for noc_async_xxx_shard/page.
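To make the idea concrete, here is a hedged sketch of such a conversion (all names, and the row-major grid layout, are assumptions for illustration, not TTKernel API): a slice that covers exactly one shard maps to a flat shard id by linearizing the shard's grid coordinate.

```cpp
#include <cassert>
#include <cstdint>

// Half-open slice [start, stop), mimicking Python's slice(start, stop).
struct Slice {
  std::int64_t start, stop;
};

// For a tensor sharded over a 2D grid with shardRows x shardCols shards,
// a slice aligned to one shard maps to a flat shard id via row-major
// linearization of the grid coordinate. A full implementation would also
// verify that the slice is shard-aligned and spans exactly one shard.
std::int64_t sliceToShardId(Slice rowSlice, Slice colSlice,
                            std::int64_t shardRows, std::int64_t shardCols,
                            std::int64_t gridCols) {
  std::int64_t gy = rowSlice.start / shardRows; // grid row of the shard
  std::int64_t gx = colSlice.start / shardCols; // grid column of the shard
  return gy * gridCols + gx;                    // row-major flat shard id
}
```

For the 64x64 tensor sharded on a 2x2 grid from the example below, each shard is 32x32, so the slice `[32:64, 0:32]` would land on grid coordinate (1, 0), i.e. flat shard id 2.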
docs/ttl/02_TTL_Type_System.md (Outdated)
```mlir
// TTL IR (%tensor : tensor<..., #ttl.tensor_encoding<DeviceDRAM,
//                                #ttl.layout<sharded, grid=[2,2]>>>)
%accessor = ttl.tensor_accessor %tensor
%xf = ttl.copy %accessor[%shard_id], %cb
```
Where this %shard_id comes from? Do we convert slices into it?
That's the high-level idea, but not completely sure about the syntax yet.
So probably not the direct indexing like above, but something closer to the more MLIR-typical form below (again, different syntax can be implemented as needed/wanted):
```mlir
// TTL IR (After Python AST Compilation)
// Python: shard_id = ttl.core(dim=1)
//         xf = ttl.copy(a[shard_id], a_blk)

// Tensor accessor wraps the tensor with its layout metadata
%a_accessor = ttl.tensor_accessor %a
    : tensor<64x64xf32, #ttl.tensor_encoding<DeviceDRAM, #ttl.layout<sharded, grid=[2,2]>>>
    -> !ttl.accessor<tensor<64x64xf32, #ttl.tensor_encoding<DeviceDRAM, #ttl.layout<sharded, grid=[2,2]>>>>

// Core coordinate flattened to 1D (0-3 for 2x2 grid)
%shard_id = ttl.core {dims = 1} : index

// Reserve CB slot
%a_blk = ttl.cb_reserve %a_cb : !ttl.circular_buffer<[1,1], !ttcore.tile<32x32,f32>, 2>
    -> tensor<1x1x!ttcore.tile<32x32,f32>, #ttl.tensor_encoding<L1, #ttl.layout<tiled>>>

// Copy from accessor slice to CB block
// Indices are explicit operands; direction inferred from operand types
%xf_a = ttl.copy from %a_accessor at [%shard_id] to %a_blk
    : !ttl.accessor<...>, index -> !ttl.transfer_handle
ttl.wait %xf_a : !ttl.transfer_handle
ttl.cb_push %a_cb, %a_blk : !ttl.circular_buffer<...>, tensor<...>
```

Force-pushed 0cbcd6c to c53cfcc
This draft PR is solely for discussion on a proposed ttl dialect (not intended to merge). See TTL_Dialect_Plan.md.