
Conversation

@zoecarver zoecarver commented Dec 23, 2025

Port the Python bindings from d2m to ttl. Refactor to remove the existing runtime tests and replace them with parameterized pytest tests. Write new lit tests.

Removed the metal runtime path and all d2m logic, and renamed accordingly.

Overhauled operators.py to auto-generate almost everything (kind of cool).

Filed a few issues for patterns we don't support; there is still more work to do:

  • get pytest running (it hangs right now; maybe this can be deferred)
  • get lit working natively (hacked right now because I lost the battle with the lit config; required)
  • add a large-ish fused kernel (likely a follow-up change)
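A minimal sketch of what the parameterized pytest replacement could look like. The op table and names here are illustrative, not the actual ttl registry; in the real tests the entries would be auto-generated from the op definitions.

```python
import operator

import pytest

# Illustrative op table; real entries would be auto-generated from the
# op definitions rather than listed by hand.
SINGLE_OPS = [
    pytest.param(operator.add, 2.0, 3.0, 5.0, id="add"),
    pytest.param(operator.mul, 2.0, 3.0, 6.0, id="mul"),
]


@pytest.mark.parametrize("op,lhs,rhs,expected", SINGLE_OPS)
def test_single_op(op, lhs, rhs, expected):
    # Stand-in for compiling and running a single-op kernel, then
    # comparing the device result against a host-side reference.
    assert op(lhs, rhs) == expected
```

Each table entry becomes its own test case, so adding an op is a one-line change (or none at all, once the table is generated).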

Base automatically changed from bnorris/integrate-store to main December 23, 2025 17:55
@zoecarver zoecarver force-pushed the zoecarver/ttl-python-bindings-take-three branch 2 times, most recently from e8ce977 to 2fe77d6 Compare December 24, 2025 00:33
Comment on lines +113 to +116
// TODO: Revisit shape rank validation for TTNN tensors.
// TTNN tensors have 4D device shape (grid + shard) while CBs have 2D shard
// shape. For now, only validate element types match. The relationship between
// tensor shape and CB shape needs further investigation.
zoecarver (author):

@brnorris03 curious to hear your thoughts on this

Contributor:

I am fine with doing this validation later in a separate PR. I myself need to do a bit more reading to understand that fully.
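The deferred check described in the TODO can be sketched as below; the function name and string representation of element types are hypothetical, for illustration only.

```python
def validate_tensor_cb_binding(tensor_elem_type: str, cb_elem_type: str) -> bool:
    # Per the TODO above: TTNN tensors carry a 4D device shape
    # (grid + shard) while CBs carry a 2D shard shape, so rank/shape
    # validation is deferred; only element types are compared for now.
    return tensor_elem_type == cb_elem_type
```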

@@ -0,0 +1,142 @@
#!/usr/bin/env python3
zoecarver (author):

Also curious to hear your thoughts on this

Contributor:

Definitely good to generate from the defs and eliminate manual additions
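The auto-generation idea could be sketched like this, with a hypothetical op-definition table standing in for the real defs (the actual operators.py generates from the dialect's op definitions, not from a dict like this).

```python
# Hypothetical op-definition table; illustrative only.
OP_DEFS = {
    "add": 2,  # name -> arity
    "exp": 1,
}


def _make_op(name, arity):
    def op(*args):
        if len(args) != arity:
            raise TypeError(f"{name} expects {arity} operand(s), got {len(args)}")
        # Stand-in for emitting the corresponding IR operation.
        return (name, args)

    op.__name__ = name
    return op


# One generated wrapper per definition; no manual additions needed.
OPS = {name: _make_op(name, arity) for name, arity in OP_DEFS.items()}
```

Adding an op to the defs automatically yields a wrapper, which is what eliminates the manual additions.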

Comment on lines 227 to 229
# TODO(XX): Fix TensorAccessorArgs CTA offsets. C++ emits placeholder 42+idx;
# replace it with the actual offset idx * args_per_tensor.
if args_per_tensor > 0:
zoecarver (author):

We are going to need to totally overhaul this, but I think that should maybe come as a post-commit fix if we can tolerate this awful hack for a bit.
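A hypothetical reconstruction of the regex rewrite the TODO describes (the actual implementation may differ): the C++ side emits literal offsets of 42 + idx as placeholders, and the Python side rewrites each to the real offset idx * args_per_tensor.

```python
import re


def patch_cta_placeholders(cpp_source: str, args_per_tensor: int) -> str:
    # Hypothetical sketch of the placeholder rewrite: any
    # TensorAccessorArgs<N> with N >= 42 is treated as the
    # placeholder 42 + idx and replaced with idx * args_per_tensor.
    def repl(match):
        value = int(match.group(1))
        if value < 42:
            return match.group(0)  # already a real offset; leave it
        return f"TensorAccessorArgs<{(value - 42) * args_per_tensor}>"

    return re.sub(r"TensorAccessorArgs<(\d+)>", repl, cpp_source)
```

The fragility the discussion below calls out is visible here: the rewrite only works while no legitimate offset ever reaches 42.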

test/lit.cfg.py Outdated
)
)

# Run tests via tt-lang-hw-sim VM (for tests requiring simulator)
zoecarver (author):

Temporary; I will remove this once I build up the courage to fight the lit configs again.

# UNSUPPORTED: system-darwin
# RUN: %python -m pytest %s -v

# NOT YET RUNNING: this is just a placeholder for how this could look.
zoecarver (author):

Also would love your thoughts on this pytest strategy.

brnorris03 (Contributor) commented Dec 26, 2025:

I put together some pytest scaffolding for the middle end (but re-usable for DSL inputs, too) at #167 -- it should enable very minimalistic test definitions (with single ops fully auto-generated, as you already started here). I think it would be great to have a single test base that does all the logistics of compiling and running.

In any case, after you merge this, I should be able to consolidate the DSL-based and middle-end end-to-end tests if the design works as expected.
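The "single test base" idea could look roughly like this; the class names and the trivial host backend are invented for illustration, not taken from #167.

```python
class KernelTestBase:
    # Hypothetical base class: subclasses supply compile_and_run, and
    # the base handles the compare-against-reference logistics.
    def compile_and_run(self, kernel, *inputs):
        raise NotImplementedError

    def check(self, kernel, reference, *inputs):
        actual = self.compile_and_run(kernel, *inputs)
        expected = reference(*inputs)
        assert actual == expected, f"{actual!r} != {expected!r}"


class HostSimTest(KernelTestBase):
    # Trivial "backend" for illustration: runs the kernel directly on host.
    def compile_and_run(self, kernel, *inputs):
        return kernel(*inputs)
```

A test then only names the kernel and its reference, keeping each test definition minimal.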

@zoecarver zoecarver marked this pull request as ready for review December 24, 2025 00:40
@zoecarver zoecarver requested a review from a team as a code owner December 24, 2025 00:40
)


def create_stream_layout_for_input(
Contributor:

Should this be removed?

zoecarver (author):

Good catch. Was able to remove a lot of stuff in the last commit :)

core: Target core coordinates for multicast
mcast: Multicast dimensions
"""
return d2m.semaphore_set(
Contributor:

Do we have corresponding ttl ops?

zoecarver (author):

Not yet, I suggest we do this after the fact given the size of this PR and the prerequisite of semaphore support landing.

Contributor:

would be good to create a separate PR for this bugfix

Comment on lines +438 to +443
// TODO(XX): Placeholder CTA offsets - Python regex replaces 42+argIdx
// with actual offsets from ttnn.TensorAccessorArgs.get_compile_time_args().
auto argIdx = getTensorFuncArgIndex(tensor);
int32_t ctaPlaceholder =
42 + (failed(argIdx) ? 0 : static_cast<int32_t>(*argIdx));
constexpr int32_t crtaPlaceholder = 0;
Contributor:

Regex replacement in the C++? Sounds kind of risky... Can you instead track the actual compile-time args offsets during MLIR generation? Something like this:

  1. Compute args_per_tensor early in _compile_and_run_kernel (where actual tensors are available):

    args_per_tensor = len(ttnn.TensorAccessorArgs(args[0]).get_compile_time_args())
  2. Pass it through the compilation pipeline as a kwarg to TTLGenericCompiler.__init__ and store it.

  3. In ttl_ast.py _emit_entry, compute CTA offsets as tensor arguments are collected:

    cta_offset = tensor_index * self.args_per_tensor
  4. Attach each CTA offset as a function argument attribute when creating the func.FuncOp:

    self.func_entry.setArgAttr(i, "ttl.cta_offset", IntegerAttr.get(..., cta_offset))
  5. When compiling, in materializeTensorAccessor, we'd just read the attribute from the function argument
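The offset computation in step 3 can be sketched with a small helper (hypothetical name, assuming every tensor contributes the same number of compile-time args):

```python
def cta_offsets(num_tensors: int, args_per_tensor: int) -> list[int]:
    # Step 3 above: each tensor's compile-time args occupy a contiguous
    # block, so accessor i starts at offset i * args_per_tensor.
    return [i * args_per_tensor for i in range(num_tensors)]
```

For example, three tensors with four compile-time args each yield offsets [0, 4, 8].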

brnorris03 (Contributor) commented Dec 26, 2025:

Although this may be overkill? I re-read the documentation and played a bit with my examples: you always start the args at 0, then call next_compile_time_args_offset() for the rest of the args. For example, for the simple binary op below, the CTA offset is always 0. When wouldn't it be 0 for ttl dm threads?

What runtime parameters do we have at the moment?

  const uint32_t a_addr = get_arg_val<uint32_t>(0);
  const uint32_t b_addr = get_arg_val<uint32_t>(1);
  const uint32_t n_tiles = get_arg_val<uint32_t>(2);
  const uint32_t start_id =
      get_arg_val<uint32_t>(3); // Starting tile ID for this core

  constexpr auto cb_in0 = tt::CBIndex::c_0;
  constexpr auto cb_in1 = tt::CBIndex::c_1;

  const uint32_t tile_size_bytes = get_tile_size(cb_in0);

  constexpr auto args_a = TensorAccessorArgs<0>();
  constexpr auto args_b =
      TensorAccessorArgs<args_a.next_compile_time_args_offset()>();

  const auto a = TensorAccessor(args_a, a_addr, tile_size_bytes);
  const auto b = TensorAccessor(args_b, b_addr, tile_size_bytes);

I think in our case this should always be correct, right? And if not 0, you control the args, so you would know which is the first tensor (accessor) arg; but at the moment I don't think we have any others, so it should be 0.

// First accessor at offset 0
auto args1 = TensorAccessorArgs<0>();
// Second accessor starts after first one's CTA args
auto args2 = TensorAccessorArgs<args1.next_compile_time_args_offset()>();
// Third accessor starts after second one's CTA args  
auto args3 = TensorAccessorArgs<args2.next_compile_time_args_offset()>();
...

or if we have runtime args at some point:

// First tensor accessor starts at CRTA offset 0
auto args_src = TensorAccessorArgs<0, 0>();

// Second tensor accessor starts after the first one's CRTA args
auto args_dst = TensorAccessorArgs<args_src.next_compile_time_args_offset(), 
                                   args_src.next_common_runtime_args_offset()>();

I checked and ttkernel doesn't support this yet, but I can add it if needed (first would try to figure out if we can do entirely in tt-lang).
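The chaining above can be modeled with a small helper (hypothetical name). Unlike a flat i * args_per_tensor scheme, it also covers accessors whose compile-time arg counts differ:

```python
def chained_cta_offsets(args_counts: list[int]) -> list[int]:
    # Mirrors the TensorAccessorArgs chaining shown above: accessor 0
    # starts at offset 0, and accessor i starts where accessor i-1's
    # compile-time args end (next_compile_time_args_offset()).
    offsets = []
    offset = 0
    for count in args_counts:
        offsets.append(offset)
        offset += count
    return offsets
```

For example, accessors contributing 4, 4, and 6 compile-time args start at offsets [0, 4, 8].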

Contributor:

I propose that this (hacky) fix is split out from this PR and merged first. In the meantime I am working on adding support for CTA and CRTA argument chaining as shown above (to tt-mlir). When that is merged, we can get rid of the hack and update the AST code generator to include the base CTA and CRTA indices in the dm threads.

brnorris03 (Contributor) left a review:

Much offline discussion on the tensor accessors, but nothing that would stop merging this. I would like to get the current e2e working within some limited context (with known next steps to remove limitations).

Comment on lines 338 to 343
args_per_tensor = len(ttnn.TensorAccessorArgs(args[0]).get_compile_time_args())

# Write all kernels to /tmp for debugging
for name, thread_type in kernel_info:
cpp_source = ttkernel_to_cpp_by_name(module, name)
_write_kernel_to_tmp(name, cpp_source)
_write_kernel_to_tmp(name, cpp_source, args_per_tensor)
Contributor:

Assumes args_per_tensor is the same for all tensors, which is not necessarily true: we could have different-layout tensors in non-elementwise kinds of functions (but OK for now).
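The per-tensor variant could be sketched as below; get_cta is a stand-in for ttnn.TensorAccessorArgs(t).get_compile_time_args(), injected here so the sketch stays self-contained.

```python
def args_per_tensor_list(tensors, get_cta):
    # Compute the compile-time arg count for each tensor individually
    # instead of assuming tensors[0] is representative, which handles
    # mixed-layout tensors in non-elementwise functions.
    return [len(get_cta(t)) for t in tensors]
```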

zoecarver (author):

Now it is just num_tensors = len(args); does that sound good?

Contributor:

minor: consider putting the invalid test in a subdirectory

Comment on lines +43 to +45
lhs_accessor = TensorAccessor(lhs)
rhs_accessor = TensorAccessor(rhs)
out_accessor = TensorAccessor(out)
Contributor:

I am starting to wonder whether having these explicitly specified by the user has any value at all. If we generate them, we can ensure consistency among reads/compute/writes. I guess that's more of a discussion point, not something you need to address in this PR.

zoecarver (author) commented Dec 26, 2025:

I think we agreed that TAs can go away (in the spec at least)

// CircularBufferType
//===--------------------------------------------------------------------===//

tt_type_class<CircularBufferType>(m, "CircularBufferType")
Contributor:

thank you for adding it!

brnorris03 (Contributor) left a review:

LGTM!

@zoecarver zoecarver enabled auto-merge (squash) December 27, 2025 00:24
@zoecarver zoecarver force-pushed the zoecarver/ttl-python-bindings-take-three branch from 126e1bd to c4ef41e Compare December 27, 2025 00:24
@zoecarver zoecarver merged commit 3fd3933 into main Dec 27, 2025
5 checks passed
@zoecarver zoecarver deleted the zoecarver/ttl-python-bindings-take-three branch December 27, 2025 00:27