@brnorris03
Contributor

@brnorris03 brnorris03 commented Dec 31, 2025

What?

Fixes the generation of TensorAccessorArgs so that it happens per function rather than per copy site, using a base compile-time args index (the number of CBs in the kernel) for the first TensorAccessor offset and computing subsequent indices incrementally. The arguments to thread functions are filtered to include only the tensors used in that thread.

Before (per-copy materialization with placeholders):

// At each copy site:
auto accessor = TensorAccessor(TensorAccessorArgs<42, 0>(), bankBase, pageSize);
// Python regex replaces 42 with actual offset

After (pre-materialization with simple offsets):

// At function entry (once per function):
auto args_0 = TensorAccessorArgs<3, 0>();  // base_cta=3 (num CBs), runtime_offset=0
auto accessor_0 = TensorAccessor(args_0, bankBase_0, pageSize);
auto args_1 = TensorAccessorArgs<4, 1>();  // base_cta+1, runtime_offset+1
auto accessor_1 = TensorAccessor(args_1, bankBase_1, pageSize);

// At copy sites: reuse pre-materialized accessors
noc_async_read_tile(tile_idx, accessor_0, write_ptr);

Why?

Tensor accessor construction was duplicated at every copy site, and the generated C++ then had to be post-processed: a Python regex replaced placeholder compile-time-arg indices with the actual offsets.

How?

The lowering now pre-materializes TensorAccessorArgs and TensorAccessor values at function entry for NOC kernels and reuses them at each copy site, allowing multiple tensor arguments to be handled via chained offsets instead of hardcoded indices.
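
For context, the chained form (discussed in the review below and later replaced by plain index offsets) looks roughly like this on the device side. This is a minimal sketch, not generated output: it reuses the names from the snippets above and assumes TensorAccessorArgs can be constructed constexpr so that next_compile_time_args_offset() can supply the next accessor's template argument.

// Sketch only: two accessors at the entry of a data-movement thread function.
constexpr auto args_0 = TensorAccessorArgs<3, 0>();    // base = number of CBs
auto accessor_0 = TensorAccessor(args_0, bankBase_0, pageSize);

// The next accessor chains off the previous one instead of hardcoding 4, 5, ...
constexpr auto args_1 = TensorAccessorArgs<args_0.next_compile_time_args_offset(), 1>();
auto accessor_1 = TensorAccessor(args_1, bankBase_1, pageSize);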

How to Test?

  • the existing check-ttlang-all target

Checklist:

  • Self-reviewed (style, logic)
  • Added tests (or justified none needed)
  • PR is small and focused (one task)

Closes #168

commit 39ea4c2
Author: Boyana Norris <[email protected]>
Date:   Tue Dec 30 18:28:46 2025 -0800

    Add ttlang-translate tool (#201)

    Add a `ttlang-translate` tool with just the ttkernel to C translation
    (all we need). Removes dependence on ttmlir-translate (reducing build
    needs).

    Part of #200

commit 420d098
Author: Boyana Norris <[email protected]>
Date:   Tue Dec 30 16:59:12 2025 -0800

    Uplift tt-mlir (#169)

    Update to latest tt-mlir, which also includes an llvm-project update.

    Build fixes:
    - Added `lib/CAPI/CMakeLists.txt` to build `TTLangCAPI` with
    `ENABLE_AGGREGATION` and `MLIRFuncDialect` linkage
    - Modified `python/CMakeLists.txt` to use INTERFACE target
    `TTLangPythonCAPI` linking upstream `TTMLIRPythonCAPI` plus `TTLangCAPI`
    - Extended `find_library` search in `python/CMakeLists.txt` to check
    build tree, install tree, and toolchain venv
    - Added a check for TT hardware in CMake and created a stub target for
    check-ttlang-python-lit that just prints a message when TTNN is not
    available, instead of running tests that would just produce an error.

    Test updates:
    - Added fallback defaults in `test/lit.cfg.py` for `ttlang_obj_root` and
    `ttlang_source_dir`.
    - Updated Python binding tests to work with shared MLIR registry.

    Postponed: debugging hardware tests CI since each iteration takes ~1.5
    hours
    *   [x] Self-reviewed (style, logic)
    *   [x] Added tests (or justified none needed)
    *   [x] PR is small and focused (one task)

commit 475f3b5
Author: Zoe Carver <[email protected]>
Date:   Tue Dec 30 08:21:31 2025 -0800

    Add lower-affine (#163) (#194)

    Well that was easy... was just missing a pass.

    Fixes #163

commit 0448da0
Author: Zoe Carver <[email protected]>
Date:   Tue Dec 30 06:59:25 2025 -0800

    Update decorators to be spelled @ttl.kernel(), @ttl.compute(), @ttl.datamovement() instead of @pykernel_gen(), @compute(), @DataMovement() (#186). (#190)

    New module exporting kernel, compute, datamovement as the renamed
    decorator API

    Added _get_annotation_name() helper to handle both CircularBuffer and
    ttl.CircularBuffer type annotations

    fixes #186
…xt for tt-mlir (CAPI shared library) when using a tt-mlir build tree and not any install tree
@brnorris03 brnorris03 force-pushed the bnorris/fix-tensoraccessor branch from 09f7afc to ed5db70 on January 2, 2026 01:16
@brnorris03 brnorris03 force-pushed the bnorris/fix-tensoraccessor branch from 0b871df to 0114b63 on January 2, 2026 03:46
 static FailureOr<Value>
-getBufferAddressFromRuntimeArg(Value tensor, Location loc,
-                               ConversionPatternRewriter &rewriter) {
+getBufferAddressFromRuntimeArg(Value tensor, Location loc, OpBuilder &builder) {
Contributor Author

Switched helper builders (this one and others below) to OpBuilder so they can be called from non-rewrite contexts and to avoid mutating pattern rewriter state outside a rewrite.
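
A minimal sketch of why this works from both call sites (buildBufferAddress is a hypothetical stand-in, not the helper in this PR): MLIR rewriters, including ConversionPatternRewriter, derive from OpBuilder, so patterns can still pass their rewriter while non-rewrite code passes a plain builder.

  // Hypothetical helper mirroring the signature change above (sketch only).
  #include "mlir/IR/Builders.h"
  #include "mlir/IR/Location.h"
  #include "mlir/IR/Value.h"

  static mlir::Value buildBufferAddress(mlir::Value tensor, mlir::Location loc,
                                        mlir::OpBuilder &builder) {
    (void)loc;
    (void)builder;  // would call builder.create<...>(loc, ...) here
    return tensor;
  }

  // From a conversion pattern (ConversionPatternRewriter is-an OpBuilder):
  //   buildBufferAddress(value, op.getLoc(), rewriter);
  // From a non-rewrite context, e.g. a plain pass:
  //   mlir::OpBuilder builder(op->getContext());
  //   builder.setInsertionPoint(op);
  //   buildBufferAddress(value, op.getLoc(), builder);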

- Replace dangling reference in CopyLowering with shared_ptr
- Remove magic number fallback in materializeTensorAccessor, emit proper
  error when ttl.base_cta_index attribute is missing
- Add [[nodiscard]] to FailureOr-returning functions
- Add OpBuilder::InsertionGuard in materializeFunctionTensorAccessors
- Add invalid test case for missing base_cta_index attribute
- Update existing tests to include ttl.base_cta_index attribute and
  explicit CHECK patterns for TensorAccessorArgs chaining
@brnorris03 brnorris03 marked this pull request as ready for review January 2, 2026 16:31
@brnorris03 brnorris03 requested a review from a team as a code owner January 2, 2026 16:31
@brnorris03 brnorris03 requested a review from zoecarver January 2, 2026 16:31
@brnorris03 brnorris03 changed the title [TTL] Use chained TensorAccessorArgs [TTL] Emit chained TensorAccessorArgs Jan 2, 2026
@brnorris03 brnorris03 changed the title [TTL] Emit chained TensorAccessorArgs [TTL] Emit chained TensorAccessorArgs at dm function level (not per copy) Jan 2, 2026
@brnorris03 brnorris03 marked this pull request as ready for review January 7, 2026 05:28
@brnorris03 brnorris03 requested a review from zoecarver January 7, 2026 05:32
@brnorris03
Contributor Author

But what I think we are missing in this PR is that base_cta_index = len(CB_symbol_table) for every dm function. So the metadata will be indexed incorrectly. I think we could probably generate a program that reads the wrong tensor's metadata and generates an incorrect runtime output (edit: I originally thought I did this with shape = 64x64 but I was reading the wrong tile output).

I have not changed the argument passing logic. It is up to the frontend to ensure that host code is passing the correct tensor arguments in the correct order. The base index is computed because at the moment, CBs are passed first. If the front end adds an arbitrary non-CB, non-tensor argument after the CBs, then yeah, number of CBs would be the wrong base index for the tensor arguments. But at present there are no such arbitrary arguments between the CBs and tensors, so base being the # of CBs (kernel-scope) is correct.

@zoecarver
Contributor

so base being the # of CBs (kernel-scope) is correct

What is kernel scope here? Do you mean per "kernel" (thread) that we save to an individual C++ file at the end?

@zoecarver
Contributor

Let's try to break this down a bit:

  • when I say kernel I mean each C++ file saved at the end of compilation
  • we have CBs and TAs that are passed as args to the kernel (nothing else, no arbitrary args)
  • CBs metadata are always consistent across all kernels, each kernel gets 3 CBs for eg simple_add (no matter if it uses them or not)
  • TAs metadata are also always consistent across all kernels (we pass all kernels all TAs even if they are not captured/used)
  • CTA is static metadata (is_sharded, is_dram, etc.)
  • CRTA is buffer addresses
  • CTA contains all CBs and TAs [1]
  • CRTA is filtered per kernel [2]
  • next_compile_time_args_offset is per kernel

[1] you can see here:

kernel_compile_time_args = cb_indices + list(tensor_accessor_args)

[2] you can see here:

tensor_indices = self.kernel_tensor_indices[kernel_idx]
common_runtime_args = [args[idx].buffer_address() for idx in tensor_indices]

So now maybe you see the problem I'm getting at. Let's take simple_add dm_out for example:

  CTA = [0, 1, 2, args_for_lhs, args_for_rhs, args_for_out]
                  ↑ CTA[3]      ↑ CTA[3+N]    ↑ CTA[3+N+M]

And dm_out only uses the out TA (last one) which should be CTA[5].

ttl.base_cta_index = get_cb_count(), this is (contrary to the doc comment) the global list of CBs, which is correct, so we are starting with 3 for lhs. BUT next_compile_time_args_offset is per kernel, so if there is only one TensorAccessor in the kernel, it will get the first "next offset" which is 0 IIUC. So:

  Kernel     compile_time_args (CTA)                    TensorAccessorArgs
  dm_write   [0, 1, 2, args_lhs, args_rhs, args_out]    <3, 0> reads CTA[3] = args_lhs ❌

Now does the issue I'm trying to highlight make sense? Or have I gone wrong somewhere in my understanding here?



def get_cb_count():
    """Get current total CB count (kernel-scope)."""
Contributor

This is not kernel scope; _cb_index_counter is incremented across all kernels and is only reset at the very start of the program.

Contributor Author

@brnorris03 brnorris03 Jan 7, 2026

Does it have to be program-global? Shouldn't it be reset per kernel (where kernel is composed of the three thread functions)?

Raises:
ValueError: If dtype is not supported
"""
dtype_int = dtype.value
Contributor

When does this break? I don't want to speculatively update this to accommodate other binding patterns. Maybe it makes more sense to update the bindings to be consistent? Regardless, if this PR doesn't introduce a pattern that requires this change, maybe we can leave it for whatever PR does?

Contributor Author

Ensuring that this doesn't just crash when a wrong type is passed to a helper method is not a "speculative" update, IMO. But reverting since this is out of scope of the PR and nothing crashes at the moment with the few examples we have.

CopyLowering(const TypeConverter &typeConverter, MLIRContext *context,
             FuncAccessorMapsPtr funcAccessorMaps)
    : OpConversionPattern(typeConverter, context),
      funcAccessorMaps(std::move(funcAccessorMaps)) {}
Contributor

To make my above point more concrete: every use of the shared_ptr moves it, so it must not need to be shared! std::move nulls out the pointer, so either this is a bug with sharing, or we don't actually need sharing.
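
A minimal stand-alone illustration of that hazard, with a placeholder payload type rather than this file's classes: once the caller std::moves the shared_ptr into one pattern, the next pattern constructed from the same variable receives a null map.

  #include <cassert>
  #include <map>
  #include <memory>
  #include <string>
  #include <utility>

  using FuncAccessorMaps = std::map<std::string, int>;        // stand-in payload type
  using FuncAccessorMapsPtr = std::shared_ptr<FuncAccessorMaps>;

  struct SomePattern {
    // Taking by value and moving into the member is fine on its own...
    explicit SomePattern(FuncAccessorMapsPtr maps) : maps(std::move(maps)) {}
    FuncAccessorMapsPtr maps;
  };

  int main() {
    auto shared = std::make_shared<FuncAccessorMaps>();
    SomePattern first(std::move(shared));   // ...but moving at the call site
    SomePattern second(std::move(shared));  // leaves `shared` null for the next pattern
    assert(first.maps && !second.maps);     // "sharing" never actually happened
  }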

@brnorris03
Contributor Author

Let's try to break this down a bit:

  • when I say kernel I mean each C++ file saved at the end of compilation
  • we have CBs and TAs that are passed as args to the kernel (nothing else, no arbitrary args)
  • CBs metadata are always consistent across all kernels, each kernel gets 3 CBs for eg simple_add (no matter if it uses them or not)
  • TAs metadata are also always consistent across all kernels (we pass all kernels all TAs even if they are not captured/used)
  • CTA is static metadata (is_sharded, is_dram, etc.)
  • CRTA is buffer addresses
  • CTA contains all CBs and TAs [1]
  • CRTA is filtered per kernel [2]
  • next_compile_time_args_offset is per kernel

[1] you can see here:

kernel_compile_time_args = cb_indices + list(tensor_accessor_args)

[2] you can see here:

tensor_indices = self.kernel_tensor_indices[kernel_idx]
common_runtime_args = [args[idx].buffer_address() for idx in tensor_indices]

So now maybe you see the problem I'm getting at. Let's take simple_add dm_out for example:

  CTA = [0, 1, 2, args_for_lhs, args_for_rhs, args_for_out]
                  ↑ CTA[3]      ↑ CTA[3+N]    ↑ CTA[3+N+M]

And dm_out only uses the out TA (last one) which should be CTA[5].

ttl.base_cta_index = get_cb_count(), this is (contrary to the doc comment) the global list of CBs, which is correct, so we are starting with 3 for lhs. BUT next_compile_time_args_offset is per kernel, so if there is only one TensorAccessor in the kernel, it will get the first "next offset" which is 0 IIUC. So:

  Kernel     compile_time_args (CTA)                    TensorAccessorArgs
  dm_write   [0, 1, 2, args_lhs, args_rhs, args_out]    <3, 0> reads CTA[3] = args_lhs ❌

Now does the issue I'm trying to highlight make sense? Or have I gone wrong somewhere in my understanding here?

I think you are conflating some kind of global tensor index with the local (per thread) tensor argument index, which is all that the tensor accessor arg operation is for -- specifying which of the arguments passed to the C++ function are tensors, nothing more. The above example input indices are 3, 4, and 5, always -- meaning that the third, fourth and fifth arguments are tensors. The host must set up the kernel with the correct tensors which could be anywhere and any subset of however many other tensors there might be.

@brnorris03
Contributor Author

so base being the # of CBs (kernel-scope) is correct

What is kernel scope here? Do you mean per "kernel" (thread) that we save to an individual C++ file at the end?

Kernel is the Metalium definition: each kernel has three threads (one compute, two dm).

@zoecarver
Contributor

zoecarver commented Jan 7, 2026

The host must set up the kernel with the correct tensors which could be anywhere and any subset of however many other tensors there might be.

If this is how you want to model it, then you need to update what is passed for CTA. But to be clear, it will need to be the exact subset of used tensors (same logic as CRTA).

@zoecarver
Contributor

zoecarver commented Jan 7, 2026

I still think we are making this more complicated than it needs to be. We are doing a lot of work just to use next_compile_time_args_offset, which seems like a less clear API in general. It makes sense for a programmer writing a kernel to use next_compile_time_args_offset, but the compiler needs to calculate the indexes anyway, so why not just pass them directly? For a programmer, it's easy to get indexes wrong; for the compiler, it's not.

My proposal here, now that we've agreed there is an issue (I hope), is to pass base_cta_index and base_crta_index as attributes, then just use those values directly in the lowering. I think that would be a lot simpler and a smaller change.

With the approach in this PR, we still pass base_cta_index, then we have to materialize accessors with chaining, and the runtime must filter CTA per kernel (or C++ must use global indexing).

With what I'm proposing we'd pass both base_cta_index and base_crta_index as function attributes and the compiler would compute exact indices for each tensor accessor during lowering. So:

  // Instead of chaining:
  auto args = TensorAccessorArgs<base_cta_index, 0>();  // Wrong if CTA is global

  // Compiler directly computes:
  auto args = TensorAccessorArgs<5, 0>();  // Exact index for 'out' tensor

This also makes the tests very easy to verify because we can say "yep that's the 6th arg, so the literal 5 makes sense".
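
To make the "compiler computes the indexes anyway" point concrete, here is a minimal stand-alone sketch of that computation (ctaIndexFor and argCounts are hypothetical, not code from this PR; argCounts would be each preceding accessor's compile-time-arg footprint):

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  // Sketch: exact compile-time-arg index of the k-th tensor accessor used by a
  // thread, if CBs occupy the first baseCta slots and each earlier accessor
  // contributes argCounts[i] compile-time args.
  uint32_t ctaIndexFor(uint32_t baseCta, const std::vector<uint32_t> &argCounts,
                       std::size_t k) {
    uint32_t idx = baseCta;
    for (std::size_t i = 0; i < k; ++i)
      idx += argCounts[i];   // skip over the args of earlier tensors
    return idx;
  }

  // For the simple_add layout above: ctaIndexFor(3, {N, M, K}, 2) == 3 + N + M,
  // i.e. the index of args_for_out.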

@brnorris03
Contributor Author

brnorris03 commented Jan 7, 2026

I don't really know why what you suggest is any simpler and it sounds like you understand the ttnn runtime argument passing better than me, so please feel free to take over this PR, I will move on to something else. I don't really care what is done for this as long as (1) can handle arbitrary kernels and (2) requires no changes to generated code.

@zoecarver
Contributor

zoecarver commented Jan 7, 2026

(edit: sorry I sent this before I saw your most recent comment)

I don't want to be pedantic here, but I do want to make sure we have a shared understanding.

I think you are conflating some kind of global tensor index with the local (per thread) tensor argument index, which is all that the tensor accessor arg operation is for -- specifying which of the arguments passed to the C++ function are tensors, nothing more.

My comment above is very specific as to what I am talking about regarding CTA and CRTA (not "some kind of global tensor index" and not conflating those two). What are you referring to specifically? Are you referring to CTA or CRTA?

CTA reads ArgsConfig metadata (a bitfield); it is not, as far as I understand, for "specifying which of the arguments [...] are tensors".

@zoecarver
Contributor

Feel free to take over this PR, I will move on to something else.

I want to collaborate with you on this, and I'm sorry if this review doesn't feel that way. I am happy to help finish this PR if you'd like, but I'd also be happy to continue working with you on it. Maybe we can chat offline and work on it together?

I think this is almost good to go, there is just this small bug that I'd really like to align on, because it's very relevant to how we lower tensor accessors and fixing it will allow us to handle arbitrary kernels :)

…f chaining

- Removed base_crta_index attribute reading from C++ code
- CRTA index is always 0, track internally in conversion
- Updated buildTensorAccessor to use baseCTA + accessorIndex and baseCRTA + accessorIndex
- Changed from chaining (prev_args) to simple index offsets for all accessors
- Updated test expectations:
  - TTLToCpp tests use hardcoded offsets (e.g. TensorAccessorArgs<3, 1>)
  - TTLToTTKernel tests use hardcoded offsets instead of prev_args chaining
- Removed base_crta_index attributes from test MLIR
@brnorris03 brnorris03 changed the title [TTL] Emit chained TensorAccessorArgs at dm function level (not per copy) [TTL] Fix TensorAccessorArgs generation in thread functions Jan 7, 2026
@brnorris03
Contributor Author

replaced by #220

@brnorris03 brnorris03 closed this Jan 7, 2026