[TTL] Fix TensorAccessorArgs generation in thread functions #203
Conversation
commit 39ea4c2
Author: Boyana Norris <[email protected]>
Date: Tue Dec 30 18:28:46 2025 -0800

    Add ttlang-translate tool (#201)

    Add a `ttlang-translate` tool with just the ttkernel to C translation (all we need). Removes dependence on ttmlir-translate (reducing build needs). Part of #200

commit 420d098
Author: Boyana Norris <[email protected]>
Date: Tue Dec 30 16:59:12 2025 -0800

    Uplift tt-mlir (#169)

    Update to latest tt-mlir, which also includes an llvm-project update.

    Build fixes:
    - Added `lib/CAPI/CMakeLists.txt` to build `TTLangCAPI` with `ENABLE_AGGREGATION` and `MLIRFuncDialect` linkage
    - Modified `python/CMakeLists.txt` to use INTERFACE target `TTLangPythonCAPI` linking upstream `TTMLIRPythonCAPI` plus `TTLangCAPI`
    - Extended `find_library` search in `python/CMakeLists.txt` to check build tree, install tree, and toolchain venv
    - Added a check for TT hardware in CMake; when TTNN is not available, create a stub target for check-ttlang-python-lit that just prints a message instead of running tests that would just produce an error

    Test updates:
    - Added fallback defaults in `test/lit.cfg.py` for `ttlang_obj_root` and `ttlang_source_dir`
    - Updated Python binding tests to work with shared MLIR registry

    Postponed: debugging hardware tests CI since each iteration takes ~1.5 hours

    * [x] Self-reviewed (style, logic)
    * [x] Added tests (or justified none needed)
    * [x] PR is small and focused (one task)

commit 475f3b5
Author: Zoe Carver <[email protected]>
Date: Tue Dec 30 08:21:31 2025 -0800

    Add lower-affine (#163) (#194)

    Well that was easy... was just missing a pass. Fixes #163

commit 0448da0
Author: Zoe Carver <[email protected]>
Date: Tue Dec 30 06:59:25 2025 -0800

    Update decorators to be spelled @ttl.kernel(), @ttl.compute(), @ttl.datamovement() instead of @pykernel_gen(), @compute(), @DataMovement() (#186). (#190)

    - New module exporting kernel, compute, datamovement as the renamed decorator API
    - Added _get_annotation_name() helper to handle both CircularBuffer and ttl.CircularBuffer type annotations

    Fixes #186
…xt for tt-mlir (CAPI shared library) when using a tt-mlir build tree and not any install tree available until after build)
Force-pushed from 09f7afc to ed5db70.
…t-lang into bnorris/fix-tensoraccessor
…cessor' of github.com:tenstorrent/tt-lang into bnorris/fix-tensoraccessor
Force-pushed from 0b871df to 0114b63.
```diff
 static FailureOr<Value>
-getBufferAddressFromRuntimeArg(Value tensor, Location loc,
-                               ConversionPatternRewriter &rewriter) {
+getBufferAddressFromRuntimeArg(Value tensor, Location loc, OpBuilder &builder) {
```
Switched helper builders (this one and others below) to OpBuilder so they can be called from non-rewrite contexts and to avoid mutating pattern rewriter state outside a rewrite.
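For illustration, a minimal sketch of why taking `OpBuilder &` is the more general choice (the helper body and names beyond the signature are hypothetical):

```cpp
#include "mlir/IR/Builders.h"
#include "mlir/IR/Value.h"
#include "mlir/Support/LogicalResult.h"
#include "mlir/Transforms/DialectConversion.h"

using namespace mlir;

// A helper that only needs to create ops can take a plain OpBuilder; a
// ConversionPatternRewriter is-an OpBuilder, so rewrite patterns still pass
// their rewriter, while non-rewrite callers pass whatever builder they have.
static FailureOr<Value> getBufferAddress(Value tensor, Location loc,
                                         OpBuilder &builder) {
  // ... builder.create<...>(loc, ...) as needed ...
  (void)tensor; (void)loc; (void)builder;
  return failure(); // placeholder body
}

static void callFromPattern(Value tensor, Location loc,
                            ConversionPatternRewriter &rewriter) {
  (void)getBufferAddress(tensor, loc, rewriter); // implicit upcast to OpBuilder&
}
```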
- Replace dangling reference in CopyLowering with shared_ptr
- Remove magic-number fallback in materializeTensorAccessor; emit a proper error when the ttl.base_cta_index attribute is missing
- Add [[nodiscard]] to FailureOr-returning functions
- Add OpBuilder::InsertionGuard in materializeFunctionTensorAccessors
- Add an invalid test case for a missing base_cta_index attribute
- Update existing tests to include the ttl.base_cta_index attribute and explicit CHECK patterns for TensorAccessorArgs chaining
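A small sketch of the two idioms mentioned in this list (everything other than the MLIR types is a placeholder name):

```cpp
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/IR/Builders.h"
#include "mlir/Support/LogicalResult.h"

using namespace mlir;

// [[nodiscard]] turns a silently ignored FailureOr result into a compiler warning.
[[nodiscard]] static FailureOr<Value> materializeTensorAccessor(OpBuilder &builder,
                                                                Location loc) {
  // ... build the accessor-related ops here (sketch only) ...
  (void)builder; (void)loc;
  return failure();
}

static void materializeFunctionTensorAccessors(func::FuncOp funcOp,
                                               OpBuilder &builder) {
  // The guard restores the previous insertion point when it goes out of scope,
  // so repositioning to the function entry block does not leak to later callers.
  OpBuilder::InsertionGuard guard(builder);
  builder.setInsertionPointToStart(&funcOp.getBody().front());
  FailureOr<Value> accessor = materializeTensorAccessor(builder, funcOp.getLoc());
  if (failed(accessor))
    return; // [[nodiscard]] makes it hard to forget this check
}
```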
I have not changed the argument-passing logic. It is up to the frontend to ensure that host code passes the correct tensor arguments in the correct order. The base index is computed because, at the moment, CBs are passed first. If the frontend adds an arbitrary non-CB, non-tensor argument after the CBs, then yes, the number of CBs would be the wrong base index for the tensor arguments. But at present there are no such arbitrary arguments between the CBs and tensors, so using the number of CBs (kernel-scope) as the base is correct.
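To make the layout concrete, here is a hypothetical generated data-movement kernel (names and counts invented; the `TensorAccessorArgs<cta, crta>` form follows the pattern discussed in this PR):

```cpp
void kernel_main() {
  // CB indices occupy the first compile-time arg slots...
  constexpr uint32_t cb_in0 = get_compile_time_arg_val(0);
  constexpr uint32_t cb_in1 = get_compile_time_arg_val(1);
  constexpr uint32_t cb_out = get_compile_time_arg_val(2);

  // ...so with 3 CBs, the first tensor's accessor args start at CTA index 3,
  // i.e. the base index equals the number of CBs.
  constexpr auto in0_args = TensorAccessorArgs<3, 0>();
  // (the CB indices and accessor args would be consumed by the copies that follow)
  (void)cb_in0; (void)cb_in1; (void)cb_out; (void)in0_args;
}
```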
What is kernel scope here? Do you mean per "kernel" (thread) that we save to an individual C++ file at the end?
Let's try to break this down a bit:
[1] you can see here:
[2] you can see here:
So now maybe you see the problem I'm getting at. Let's take simple_add dm_out for example: dm_out only uses the …
Now does the issue I'm trying to highlight make sense? Or have I gone wrong somewhere in my understanding here?
```python
def get_cb_count():
    """Get current total CB count (kernel-scope)."""
```
This is not kernel scope; _cb_index_counter is incremented across all kernels and is only reset at the very start of the program.
Does it have to be program-global? Shouldn't it be reset per kernel (where a kernel is composed of the three thread functions)?
```python
    Raises:
        ValueError: If dtype is not supported
    """
    dtype_int = dtype.value
```
When does this break? I don't want to speculatively update this to accommodate other binding patterns. Maybe it makes more sense to update the bindings to be consistent? Regardless, if this PR doesn't introduce a pattern that requires this change, maybe we can leave it for whatever PR does?
Ensuring that this doesn't just crash when a wrong type is passed to a helper method is not a "speculative" update, IMO. But reverting since this is out of scope of the PR and nothing crashes at the moment with the few examples we have.
```cpp
CopyLowering(const TypeConverter &typeConverter, MLIRContext *context,
             FuncAccessorMapsPtr funcAccessorMaps)
    : OpConversionPattern(typeConverter, context),
      funcAccessorMaps(std::move(funcAccessorMaps)) {}
```
To make my above point more concrete: every use of the shared_ptr moves it, so it must not need to be shared! std::move nulls out the pointer, so either this is a bug with sharing, or we don't actually need sharing.
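A standalone illustration of the point about moving a `shared_ptr` (plain standard C++, unrelated to the PR's own types):

```cpp
#include <cassert>
#include <memory>

int main() {
  auto p = std::make_shared<int>(42);
  auto q = std::move(p); // move-construction leaves p empty (null)
  assert(!p && q && *q == 42);
  // If every consumer moves the pointer, ownership is being transferred rather
  // than shared, so a unique_ptr (or passing by const reference) would express
  // the intent more directly.
  return 0;
}
```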
I think you are conflating some kind of global tensor index with the local (per-thread) tensor argument index, which is all that the tensor accessor arg operation is for -- specifying which of the arguments passed to the C++ function are tensors, nothing more. The above example input indices are 3, 4, and 5, always -- meaning that the third, fourth, and fifth arguments are tensors. The host must set up the kernel with the correct tensors, which could be anywhere and any subset of however many other tensors there might be.
Kernel is the Metalium definition. Each kernel has three threads (one compute, two data movement).
If this is how you want to model it, then you need to update what is passed for CTA. But to be clear, it will need to be the exact subset of used tensors (same logic as CRTA).
I still think we are making this more complicated than it needs to be. We are doing a lot of work just to use …

My proposal here, now that we've agreed there is an issue (I hope), is to pass … With the approach in this PR, we still pass … With what I'm proposing, we'd pass both …

This also makes the tests very easy to verify, because we can say "yep, that's the 6th arg, so the literal 5 makes sense".
I don't really know why what you suggest is any simpler, and it sounds like you understand the ttnn runtime argument passing better than I do, so please feel free to take over this PR; I will move on to something else. I don't really care what is done here as long as it (1) can handle arbitrary kernels and (2) requires no changes to the generated code.
(edit: sorry I sent this before I saw your most recent comment) I don't want to be pedantic here, but I do want to make sure we have a shared understanding.
My comment above is very specific as to what I am talking about regarding CTA and CRTA (not "some kind of global tensor index", and not conflating those two). What are you referring to specifically? Are you referring to CTA or CRTA? CTA reads …
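For context on the two mechanisms being contrasted, a sketch of how device code typically reads them (my summary of the Metalium kernel-side API; the comment above is cut off here, so treat the exact call names as assumptions):

```cpp
void kernel_main() {
  // Compile-time args (CTA): baked into the kernel binary, read by index.
  constexpr uint32_t some_cta = get_compile_time_arg_val(3);

  // Common runtime args (CRTA): set once by the host at launch, shared by all cores.
  uint32_t some_crta = get_common_arg_val<uint32_t>(0);

  (void)some_cta; (void)some_crta;
}
```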
I want to collaborate with you on this, and I'm sorry if this review doesn't feel that way. I am happy to help finish this PR if you'd like, but I'd also be happy to continue working with you on it. Maybe we can chat offline and work on it together? I think this is almost good to go; there is just this small bug that I'd really like to align on, because it's very relevant to how we lower tensor accessors, and fixing it will allow us to handle arbitrary kernels :)
…f chaining
- Removed base_crta_index attribute reading from C++ code
- CRTA index is always 0; track it internally in the conversion
- Updated buildTensorAccessor to use baseCTA + accessorIndex and baseCRTA + accessorIndex
- Changed from chaining (prev_args) to simple index offsets for all accessors
- Updated test expectations:
  - TTLToCpp tests use hardcoded offsets (e.g. TensorAccessorArgs<3, 1>)
  - TTLToTTKernel tests use hardcoded offsets instead of prev_args chaining
- Removed base_crta_index attributes from test MLIR
Replaced by #220.
What?
Fixes the generation of `TensorAccessorArgs` to be per-function rather than per-copy-site, using a base compile-time-args index (the number of CBs in the kernel) for the first TensorAccessor offset and computing subsequent indices incrementally. The arguments to thread functions are filtered to include only the tensors used in that thread.

Before (per-copy materialization with placeholders):
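An illustrative sketch of the pattern described (hypothetical names and values, not the literal snippet from this PR):

```cpp
void kernel_main() {
  const uint32_t src_addr = get_arg_val<uint32_t>(0);

  // Copy site 1: accessor args rebuilt in place, with a placeholder CTA index
  // that had to be patched by hand in the generated C++.
  {
    constexpr auto args = TensorAccessorArgs<0 /* placeholder */, 0>();
    const auto accessor = TensorAccessor(args, src_addr, /*page_size=*/2048);
    // ... copy using accessor ...
  }
  // Copy site 2: the same construction duplicated again.
  {
    constexpr auto args = TensorAccessorArgs<0 /* placeholder */, 0>();
    const auto accessor = TensorAccessor(args, src_addr, /*page_size=*/2048);
    // ... copy using accessor ...
  }
}
```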
After (pre-materialization with simple offsets):
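Likewise an illustrative sketch rather than the original snippet: one accessor per tensor argument, materialized once at entry, with the first CTA offset equal to the number of CBs (assumed to be 3 here):

```cpp
void kernel_main() {
  const uint32_t in_addr = get_arg_val<uint32_t>(0);
  const uint32_t out_addr = get_arg_val<uint32_t>(1);

  // Materialized once at function entry and reused by every copy below.
  constexpr auto in_args = TensorAccessorArgs<3, 0>();   // base CTA index = #CBs
  const auto in_accessor = TensorAccessor(in_args, in_addr, /*page_size=*/2048);
  constexpr auto out_args = TensorAccessorArgs<4, 1>();  // base + 1
  const auto out_accessor = TensorAccessor(out_args, out_addr, /*page_size=*/2048);

  // ... copies reuse in_accessor / out_accessor ...
}
```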
Why?
Tensor accessor construction was duplicated at copy sites, and the generated C++ code had to be modified by hand.
How?
The lowering now pre-materializes
`TensorAccessorArgs` and `TensorAccessor` values at function entry for NOC kernels and reuses them, allowing multiple tensor arguments to be handled via chained offsets instead of hardcoded indices.

How to Test?
Run the `check-ttlang-all` target.

Checklist:
Closes #168