@brnorris03
Contributor

@brnorris03 brnorris03 commented Dec 31, 2025

What?

Fixes the generation of TensorAccessorArgs so that it happens per function rather than per copy site, using a base compile-time args index (the number of CBs in the kernel) for the first TensorAccessor offset and computing subsequent indices incrementally. The arguments to thread functions are filtered to include only the tensors used in that thread.

Before (per-copy materialization with placeholders):

// At each copy site:
auto accessor = TensorAccessor(TensorAccessorArgs<42, 0>(), bankBase, pageSize);
// Python regex replaces 42 with actual offset

After (pre-materialization with simple offsets):

// At function entry (once per function):
auto args_0 = TensorAccessorArgs<3, 0>();  // base_cta=3 (num CBs), runtime_offset=0
auto accessor_0 = TensorAccessor(args_0, bankBase_0, pageSize);
auto args_1 = TensorAccessorArgs<4, 1>();  // base_cta+1, runtime_offset+1
auto accessor_1 = TensorAccessor(args_1, bankBase_1, pageSize);

// At copy sites: reuse pre-materialized accessors
noc_async_read_tile(tile_idx, accessor_0, write_ptr);

Why?

Tensor accessor construction was duplicated at every copy site, and the generated C++ then had to be post-processed: a Python regex replaced placeholder compile-time-arg indices with the actual offsets.

How?

The lowering now pre-materializes TensorAccessorArgs and TensorAccessor values at function entry for NOC kernels and reuses them at each copy site, allowing multiple tensor arguments to be handled via chained offsets instead of hardcoded indices.
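
For context, the chained form (discussed in the review below and later replaced by plain index offsets) looks roughly like this on the device side. This is a minimal sketch, not generated output: it reuses the names from the snippets above and assumes TensorAccessorArgs can be constructed constexpr so that next_compile_time_args_offset() can supply the next accessor's template argument.

// Sketch only: two accessors at the entry of a data-movement thread function.
constexpr auto args_0 = TensorAccessorArgs<3, 0>();    // base = number of CBs
auto accessor_0 = TensorAccessor(args_0, bankBase_0, pageSize);

// The next accessor chains off the previous one instead of hardcoding 4, 5, ...
constexpr auto args_1 = TensorAccessorArgs<args_0.next_compile_time_args_offset(), 1>();
auto accessor_1 = TensorAccessor(args_1, bankBase_1, pageSize);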

How to Test?

  • the existing check-ttlang-all target

Checklist:

  • Self-reviewed (style, logic)
  • Added tests (or justified none needed)
  • PR is small and focused (one task)

Closes #168

commit 39ea4c2
Author: Boyana Norris <[email protected]>
Date:   Tue Dec 30 18:28:46 2025 -0800

    Add ttlang-translate tool (#201)

    Add a `ttlang-translate` tool with just the ttkernel to C translation
    (all we need). Removes dependence on ttmlir-translate (reducing build
    needs).

    Part of #200

commit 420d098
Author: Boyana Norris <[email protected]>
Date:   Tue Dec 30 16:59:12 2025 -0800

    Uplift tt-mlir (#169)

    Update to latest tt-mlir, which also includes an llvm-project update.

    Build fixes:
    - Added `lib/CAPI/CMakeLists.txt` to build `TTLangCAPI` with
    `ENABLE_AGGREGATION` and `MLIRFuncDialect` linkage
    - Modified `python/CMakeLists.txt` to use INTERFACE target
    `TTLangPythonCAPI` linking upstream `TTMLIRPythonCAPI` plus `TTLangCAPI`
    - Extended `find_library` search in `python/CMakeLists.txt` to check
    build tree, install tree, and toolchain venv
    - Added a check for TT hardware in CMake and created a stub target for
    check-ttlang-python-lit that just prints a message when TTNN is not
    available, instead of running tests that would just produce an error.

    Test updates:
    - Added fallback defaults in `test/lit.cfg.py` for `ttlang_obj_root` and
    `ttlang_source_dir`.
    - Updated Python binding tests to work with shared MLIR registry.

    Postponed: debugging hardware tests CI since each iteration takes ~1.5
    hours
    *   [x] Self-reviewed (style, logic)
    *   [x] Added tests (or justified none needed)
    *   [x] PR is small and focused (one task)

commit 475f3b5
Author: Zoe Carver <[email protected]>
Date:   Tue Dec 30 08:21:31 2025 -0800

    Add lower-affine (#163) (#194)

    Well that was easy... was just missing a pass.

    Fixes #163

commit 0448da0
Author: Zoe Carver <[email protected]>
Date:   Tue Dec 30 06:59:25 2025 -0800

    Update decorators to be spelled @ttl.kernel(), @ttl.compute(), @ttl.datamovement() instead of @pykernel_gen(), @compute(), @DataMovement() (#186). (#190)

    New module exporting kernel, compute, datamovement as the renamed
    decorator API

    Added _get_annotation_name() helper to handle both CircularBuffer and
    ttl.CircularBuffer type annotations

    fixes #186
…xt for tt-mlir (CAPI shared library) when using a tt-mlir build tree and not any install tree
@brnorris03 brnorris03 force-pushed the bnorris/fix-tensoraccessor branch from 09f7afc to ed5db70 on January 2, 2026 01:16
@brnorris03 brnorris03 force-pushed the bnorris/fix-tensoraccessor branch from 0b871df to 0114b63 on January 2, 2026 03:46
 static FailureOr<Value>
-getBufferAddressFromRuntimeArg(Value tensor, Location loc,
-                               ConversionPatternRewriter &rewriter) {
+getBufferAddressFromRuntimeArg(Value tensor, Location loc, OpBuilder &builder) {
Contributor Author

Switched helper builders (this one and others below) to OpBuilder so they can be called from non-rewrite contexts and to avoid mutating pattern rewriter state outside a rewrite.
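
A minimal sketch of why this works from both call sites (buildBufferAddress is a hypothetical stand-in, not the helper in this PR): MLIR rewriters, including ConversionPatternRewriter, derive from OpBuilder, so patterns can still pass their rewriter while non-rewrite code passes a plain builder.

  // Hypothetical helper mirroring the signature change above (sketch only).
  #include "mlir/IR/Builders.h"
  #include "mlir/IR/Location.h"
  #include "mlir/IR/Value.h"

  static mlir::Value buildBufferAddress(mlir::Value tensor, mlir::Location loc,
                                        mlir::OpBuilder &builder) {
    (void)loc;
    (void)builder;  // would call builder.create<...>(loc, ...) here
    return tensor;
  }

  // From a conversion pattern (ConversionPatternRewriter is-an OpBuilder):
  //   buildBufferAddress(value, op.getLoc(), rewriter);
  // From a non-rewrite context, e.g. a plain pass:
  //   mlir::OpBuilder builder(op->getContext());
  //   builder.setInsertionPoint(op);
  //   buildBufferAddress(value, op.getLoc(), builder);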

- Replace dangling reference in CopyLowering with shared_ptr
- Remove magic number fallback in materializeTensorAccessor, emit proper
  error when ttl.base_cta_index attribute is missing
- Add [[nodiscard]] to FailureOr-returning functions
- Add OpBuilder::InsertionGuard in materializeFunctionTensorAccessors
- Add invalid test case for missing base_cta_index attribute
- Update existing tests to include ttl.base_cta_index attribute and
  explicit CHECK patterns for TensorAccessorArgs chaining
@brnorris03 brnorris03 marked this pull request as ready for review January 2, 2026 16:31
@brnorris03 brnorris03 requested a review from a team as a code owner January 2, 2026 16:31
@brnorris03 brnorris03 requested a review from zoecarver January 2, 2026 16:31
@brnorris03 brnorris03 changed the title [TTL] Use chained TensorAccessorArgs [TTL] Emit chained TensorAccessorArgs Jan 2, 2026
@brnorris03 brnorris03 changed the title [TTL] Emit chained TensorAccessorArgs [TTL] Emit chained TensorAccessorArgs at dm function level (not per copy) Jan 2, 2026
@brnorris03 brnorris03 marked this pull request as ready for review January 7, 2026 05:28
@brnorris03 brnorris03 requested a review from zoecarver January 7, 2026 05:32
@brnorris03
Contributor Author

But what I think we are missing in this PR is that base_cta_index = len(CB_symbol_table) for every dm function. So the metadata will be indexed incorrectly. I think we could probably generate a program that reads the wrong tensor's metadata and generates an incorrect runtime output (edit: I originally thought I did this with shape = 64x64 but I was reading the wrong tile output).

I have not changed the argument passing logic. It is up to the frontend to ensure that host code is passing the correct tensor arguments in the correct order. The base index is computed because at the moment, CBs are passed first. If the front end adds an arbitrary non-CB, non-tensor argument after the CBs, then yeah, number of CBs would be the wrong base index for the tensor arguments. But at present there are no such arbitrary arguments between the CBs and tensors, so base being the # of CBs (kernel-scope) is correct.

@zoecarver
Contributor

so base being the # of CBs (kernel-scope) is correct

What is kernel scope here? Do you mean per "kernel" (thread) that we save to an individual C++ file at the end?

@zoecarver
Contributor

Let's try to break this down a bit:

  • when I say kernel I mean each C++ file saved at the end of compilation
  • we have CBs and TAs that are passed as args to the kernel (nothing else, no arbitrary args)
  • CBs metadata are always consistent across all kernels, each kernel gets 3 CBs for eg simple_add (no matter if it uses them or not)
  • TAs metadata are also always consistent across all kernels (we pass all kernels all TAs even if they are not captured/used)
  • CTA is static metadata (is_sharded, is_dram, etc.)
  • CRTA is buffer addresses
  • CTA contains all CBs and TAs [1]
  • CRTA is filtered per kernel [2]
  • next_compile_time_args_offset is per kernel

[1] you can see here:

kernel_compile_time_args = cb_indices + list(tensor_accessor_args)

[2] you can see here:

tensor_indices = self.kernel_tensor_indices[kernel_idx]
common_runtime_args = [args[idx].buffer_address() for idx in tensor_indices]

So now maybe you see the problem I'm getting at. Let's take simple_add dm_out for example:

  CTA = [0, 1, 2, args_for_lhs, args_for_rhs, args_for_out]
                  ↑ CTA[3]      ↑ CTA[3+N]    ↑ CTA[3+N+M]

And dm_out only uses the out TA (last one) which should be CTA[5].

ttl.base_cta_index = get_cb_count(), this is (contrary to the doc comment) the global list of CBs, which is correct, so we are starting with 3 for lhs. BUT next_compile_time_args_offset is per kernel, so if there is only one TensorAccessor in the kernel, it will get the first "next offset" which is 0 IIUC. So:

  Kernel     compile_time_args (CTA)                    TensorAccessorArgs
  dm_write   [0, 1, 2, args_lhs, args_rhs, args_out]    <3, 0> reads CTA[3] = args_lhs ❌

Now does the issue I'm trying to highlight make sense? Or have I gone wrong somewhere in my understanding here?



def get_cb_count():
    """Get current total CB count (kernel-scope)."""
Contributor

This is not kernel scope; _cb_index_counter is incremented across all kernels and is only reset at the very start of the program.

Contributor Author

@brnorris03 brnorris03 Jan 7, 2026

Does it have to be program-global? Shouldn't it be reset per kernel (where kernel is composed of the three thread functions)?

Raises:
ValueError: If dtype is not supported
"""
dtype_int = dtype.value
Contributor

When does this break? I don't want to speculatively update this to accommodate other binding patterns. Maybe it makes more sense to update the bindings to be consistent? Regardless, if this PR doesn't introduce a pattern that requires this change, maybe we can leave it for whatever PR does?

Contributor Author

Ensuring that this doesn't just crash when a wrong type is passed to a helper method is not a "speculative" update, IMO. But reverting since this is out of scope of the PR and nothing crashes at the moment with the few examples we have.

CopyLowering(const TypeConverter &typeConverter, MLIRContext *context,
             FuncAccessorMapsPtr funcAccessorMaps)
    : OpConversionPattern(typeConverter, context),
      funcAccessorMaps(std::move(funcAccessorMaps)) {}
Contributor

To make my above point more concrete: every use of the shared_ptr moves it, so it must not need to be shared! std::move nulls out the pointer, so either this is a bug with sharing, or we don't actually need sharing.
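
A minimal stand-alone illustration of that hazard, with a placeholder payload type rather than this file's classes: once the caller std::moves the shared_ptr into one pattern, the next pattern constructed from the same variable receives a null map.

  #include <cassert>
  #include <map>
  #include <memory>
  #include <string>
  #include <utility>

  using FuncAccessorMaps = std::map<std::string, int>;        // stand-in payload type
  using FuncAccessorMapsPtr = std::shared_ptr<FuncAccessorMaps>;

  struct SomePattern {
    // Taking by value and moving into the member is fine on its own...
    explicit SomePattern(FuncAccessorMapsPtr maps) : maps(std::move(maps)) {}
    FuncAccessorMapsPtr maps;
  };

  int main() {
    auto shared = std::make_shared<FuncAccessorMaps>();
    SomePattern first(std::move(shared));   // ...but moving at the call site
    SomePattern second(std::move(shared));  // leaves `shared` null for the next pattern
    assert(first.maps && !second.maps);     // "sharing" never actually happened
  }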

@brnorris03
Contributor Author

Let's try to break this down a bit:

  • when I say kernel I mean each C++ file saved at the end of compilation
  • we have CBs and TAs that are passed as args to the kernel (nothing else, no arbitrary args)
  • CBs metadata are always consistent across all kernels, each kernel gets 3 CBs for eg simple_add (no matter if it uses them or not)
  • TAs metadata are also always consistent across all kernels (we pass all kernels all TAs even if they are not captured/used)
  • CTA is static metadata (is_sharded, is_dram, etc.)
  • CRTA is buffer addresses
  • CTA contains all CBs and TAs [1]
  • CRTA is filtered per kernel [2]
  • next_compile_time_args_offset is per kernel

[1] you can see here:

kernel_compile_time_args = cb_indices + list(tensor_accessor_args)

[2] you can see here:

tensor_indices = self.kernel_tensor_indices[kernel_idx]
common_runtime_args = [args[idx].buffer_address() for idx in tensor_indices]

So now maybe you see the problem I'm getting at. Let's take simple_add dm_out for example:

  CTA = [0, 1, 2, args_for_lhs, args_for_rhs, args_for_out]
                  ↑ CTA[3]      ↑ CTA[3+N]    ↑ CTA[3+N+M]

And dm_out only uses the out TA (last one) which should be CTA[5].

ttl.base_cta_index = get_cb_count(), this is (contrary to the doc comment) the global list of CBs, which is correct, so we are starting with 3 for lhs. BUT next_compile_time_args_offset is per kernel, so if there is only one TensorAccessor in the kernel, it will get the first "next offset" which is 0 IIUC. So:

  Kernel     compile_time_args (CTA)                    TensorAccessorArgs
  dm_write   [0, 1, 2, args_lhs, args_rhs, args_out]    <3, 0> reads CTA[3] = args_lhs ❌

Now does the issue I'm trying to highlight make sense? Or have I gone wrong somewhere in my understanding here?

I think you are conflating some kind of global tensor index with the local (per thread) tensor argument index, which is all that the tensor accessor arg operation is for -- specifying which of the arguments passed to the C++ function are tensors, nothing more. The above example input indices are 3, 4, and 5, always -- meaning that the third, fourth and fifth arguments are tensors. The host must set up the kernel with the correct tensors which could be anywhere and any subset of however many other tensors there might be.

@brnorris03
Contributor Author

so base being the # of CBs (kernel-scope) is correct

What is kernel scope here? Do you mean per "kernel" (thread) that we save to an individual C++ file at the end?

Kernel is the Metalium definition: each kernel has three threads (one compute, two dm).

@zoecarver
Contributor

zoecarver commented Jan 7, 2026

The host must set up the kernel with the correct tensors which could be anywhere and any subset of however many other tensors there might be.

If this is how you want to model it, then you need to update what is passed for CTA. But to be clear, it will need to be the exact subset of used tensors (same logic as CRTA).

@zoecarver
Contributor

zoecarver commented Jan 7, 2026

I still think we are making this more complicated than it needs to be. We are doing a lot of work just to use next_compile_time_args_offset, which seems like a less clear API in general. It makes sense for a programmer writing a kernel to use next_compile_time_args_offset, but the compiler needs to calculate the indexes anyway, so why not just pass them directly? For a programmer, it's easy to get indexes wrong; for the compiler, it's not.

My proposal here, now that we've agreed there is an issue (I hope), is to pass base_cta_index and base_crta_index as attributes, then just use those values directly in the lowering. I think that would be a lot simpler and a smaller change.

With the approach in this PR, we still pass base_cta_index, then we have to materialize accessors with chaining, and the runtime must filter CTA per kernel (or C++ must use global indexing).

With what I'm proposing we'd pass both base_cta_index and base_crta_index as function attributes and the compiler would compute exact indices for each tensor accessor during lowering. So:

  // Instead of chaining:
  auto args = TensorAccessorArgs<base_cta_index, 0>();  // Wrong if CTA is global

  // Compiler directly computes:
  auto args = TensorAccessorArgs<5, 0>();  // Exact index for 'out' tensor

This also makes the tests very easy to verify because we can say "yep that's the 6th arg, so the literal 5 makes sense".
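
To make the "compiler computes the indexes anyway" point concrete, here is a minimal stand-alone sketch of that computation (ctaIndexFor and argCounts are hypothetical, not code from this PR; argCounts would be each preceding accessor's compile-time-arg footprint):

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  // Sketch: exact compile-time-arg index of the k-th tensor accessor used by a
  // thread, if CBs occupy the first baseCta slots and each earlier accessor
  // contributes argCounts[i] compile-time args.
  uint32_t ctaIndexFor(uint32_t baseCta, const std::vector<uint32_t> &argCounts,
                       std::size_t k) {
    uint32_t idx = baseCta;
    for (std::size_t i = 0; i < k; ++i)
      idx += argCounts[i];   // skip over the args of earlier tensors
    return idx;
  }

  // For the simple_add layout above: ctaIndexFor(3, {N, M, K}, 2) == 3 + N + M,
  // i.e. the index of args_for_out.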

@brnorris03
Contributor Author

brnorris03 commented Jan 7, 2026

I don't really know why what you suggest is any simpler and it sounds like you understand the ttnn runtime argument passing better than me, so please feel free to take over this PR, I will move on to something else. I don't really care what is done for this as long as (1) can handle arbitrary kernels and (2) requires no changes to generated code.

@zoecarver
Contributor

zoecarver commented Jan 7, 2026

(edit: sorry I sent this before I saw your most recent comment)

I don't want to be pedantic here, but I do want to make sure we have a shared understanding.

I think you are conflating some kind of global tensor index with the local (per thread) tensor argument index, which is all that the tensor accessor arg operation is for -- specifying which of the arguments passed to the C++ function are tensors, nothing more.

My comment above is very specific as to what I am talking about regarding CTA and CRTA (not "some kind of global tensor index" and not conflating those two). What are you referring to specifically? Are you referring to CTA or CRTA?

CTA reads ArgsConfig metadata (a bitfield); it is not, as far as I understand, for "specifying which of the arguments [...] are tensors".

@zoecarver
Contributor

Feel free to take over this PR, I will move on to something else.

I want to collaborate with you on this, and I'm sorry if this review doesn't feel that way. I am happy to help finish this PR if you'd like, but I'd also be happy to continue working with you on it. Maybe we can chat offline and work on it together?

I think this is almost good to go, there is just this small bug that I'd really like to align on, because it's very relevant to how we lower tensor accessors and fixing it will allow us to handle arbitrary kernels :)

…f chaining

- Removed base_crta_index attribute reading from C++ code
- CRTA index is always 0, track internally in conversion
- Updated buildTensorAccessor to use baseCTA + accessorIndex and baseCRTA + accessorIndex
- Changed from chaining (prev_args) to simple index offsets for all accessors
- Updated test expectations:
  - TTLToCpp tests use hardcoded offsets (e.g. TensorAccessorArgs<3, 1>)
  - TTLToTTKernel tests use hardcoded offsets instead of prev_args chaining
- Removed base_crta_index attributes from test MLIR
@brnorris03 brnorris03 changed the title [TTL] Emit chained TensorAccessorArgs at dm function level (not per copy) [TTL] Fix TensorAccessorArgs generation in thread functions Jan 7, 2026
@brnorris03
Contributor Author

replaced by #220

@brnorris03 brnorris03 closed this Jan 7, 2026