[ttl] The one python bindings PR #159
Conversation
Force-pushed from e8ce977 to 2fe77d6
```cpp
// TODO: Revisit shape rank validation for TTNN tensors.
// TTNN tensors have 4D device shape (grid + shard) while CBs have 2D shard
// shape. For now, only validate element types match. The relationship between
// tensor shape and CB shape needs further investigation.
```
@brnorris03 curious to hear your thoughts on this
I am fine with doing this validation later in a separate PR. I myself need to do a bit more reading to understand that fully.
```diff
@@ -0,0 +1,142 @@
+#!/usr/bin/env python3
```
Also curious to hear your thoughts on this
Definitely good to generate from the defs and eliminate manual additions
python/ttlang/ttl_api.py (Outdated)
```python
# TODO(XX): Fix TensorAccessorArgs CTA offsets. C++ emits placeholder 42+idx,
# replace with actual offset = idx * args_per_tensor.
if args_per_tensor > 0:
```
We are going to need to totally overhaul this, but I think that should maybe come as a post-commit fix if we can tolerate this awful hack for a bit.
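As an illustration of what the interim hack described in the TODO amounts to, here is a minimal sketch. The helper name and exact regex are hypothetical; the real code rewrites the placeholder `42+idx` offsets emitted by the C++ into `idx * args_per_tensor`:

```python
import re


def patch_cta_placeholders(cpp_source: str, args_per_tensor: int) -> str:
    """Rewrite placeholder TensorAccessorArgs offsets (42 + tensor index)
    into real offsets (index * args_per_tensor)."""

    def repl(match: re.Match) -> str:
        idx = int(match.group(1)) - 42  # recover the tensor argument index
        return f"TensorAccessorArgs<{idx * args_per_tensor}>"

    return re.sub(r"TensorAccessorArgs<(\d+)>", repl, cpp_source)
```

This stays entirely in post-processing of the generated C++ source, which is what makes it fragile: any legitimate literal template argument would be clobbered too.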
test/lit.cfg.py (Outdated)
```python
)
)

# Run tests via tt-lang-hw-sim VM (for tests requiring simulator)
```
Temporary, will remove once I build up the courage to fight lit configs again
test/python/test_elementwise_ops.py (Outdated)
```python
# UNSUPPORTED: system-darwin
# RUN: %python -m pytest %s -v

# NOT YET RUNNING this is just a placeholder for how this could look.
```
Also would love your thoughts on this pytest strategy.
I put together some pytest scaffolding for the middle end (but re-usable for DSL inputs, too) at #167 -- it should enable very minimalistic test definitions (and single ops fully auto-generated, as you already started here). I think it would be great to have a single test base that does all the logistics of compiling and running.
In any case, after you merge this, I should be able to consolidate both the DSL-based and middle-end end-to-end tests if the design works as expected.
python/ttlang/layouts.py (Outdated)
```python
)


def create_stream_layout_for_input(
```
Should this be removed?
Good catch. Was able to remove a lot of stuff in the last commit :)
```python
        core: Target core coordinates for multicast
        mcast: Multicast dimensions
    """
    return d2m.semaphore_set(
```
Do we have corresponding ttl ops?
Not yet, I suggest we do this after the fact given the size of this PR and the prerequisite of semaphore support landing.
Would be good to create a separate PR for this bugfix.
```cpp
// TODO(XX): Placeholder CTA offsets - Python regex replaces 42+argIdx
// with actual offsets from ttnn.TensorAccessorArgs.get_compile_time_args().
auto argIdx = getTensorFuncArgIndex(tensor);
int32_t ctaPlaceholder =
    42 + (failed(argIdx) ? 0 : static_cast<int32_t>(*argIdx));
constexpr int32_t crtaPlaceholder = 0;
```
Regex replacement in the C++? Sounds kind of risky... Can you instead track the actual compile-time args offsets during the MLIR generation? Something like this:

- Compute `args_per_tensor` early in `_compile_and_run_kernel` (where actual tensors are available): `args_per_tensor = len(ttnn.TensorAccessorArgs(args[0]).get_compile_time_args())`
- Pass it through the compilation pipeline as a kwarg to `TTLGenericCompiler.__init__` and store it.
- In `ttl_ast.py` `_emit_entry`, compute CTA offsets as tensor arguments are collected: `cta_offset = tensor_index * self.args_per_tensor`
- Attach each CTA offset as a function argument attribute when creating the `func.FuncOp`: `self.func_entry.setArgAttr(i, "ttl.cta_offset", IntegerAttr.get(..., cta_offset))`
- When compiling, in `materializeTensorAccessor`, we'd just read the attribute from the function argument.
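The offset computation in those steps is trivial to state on its own. A minimal sketch (helper name hypothetical; attaching the `ttl.cta_offset` attribute would happen in the MLIR builder):

```python
def cta_offsets(num_tensor_args: int, args_per_tensor: int) -> list:
    """Compute the compile-time-arg offset for each tensor argument.

    Assumes every tensor contributes the same number of CTA entries
    (the uniform-layout case discussed in this thread).
    """
    return [i * args_per_tensor for i in range(num_tensor_args)]
```

Each offset would then be attached as a `ttl.cta_offset` argument attribute and read back in `materializeTensorAccessor`, with no post-hoc source rewriting.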
Although this may be overkill? I re-read the documentation and played a bit with my examples, and you always start the args at 0, then call next_compile_time_args_offset() for the rest of the args, e.g., for a simple binary op example below CTA offset is always 0. When wouldn't it be 0 for ttl dm threads?
What runtime parameters do we have at the moment?
```cpp
const uint32_t a_addr = get_arg_val<uint32_t>(0);
const uint32_t b_addr = get_arg_val<uint32_t>(1);
const uint32_t n_tiles = get_arg_val<uint32_t>(2);
const uint32_t start_id =
    get_arg_val<uint32_t>(3); // Starting tile ID for this core

constexpr auto cb_in0 = tt::CBIndex::c_0;
constexpr auto cb_in1 = tt::CBIndex::c_1;
const uint32_t tile_size_bytes = get_tile_size(cb_in0);

constexpr auto args_a = TensorAccessorArgs<0>();
constexpr auto args_b =
    TensorAccessorArgs<args_a.next_compile_time_args_offset()>();
const auto a = TensorAccessor(args_a, a_addr, tile_size_bytes);
const auto b = TensorAccessor(args_b, b_addr, tile_size_bytes);
```
I think in our case this should always be correct, right? And if not 0, you control the args, so you would know which is the first tensor (accessor) arg; but at the moment I don't think we have any others, so it should be 0.
```cpp
// First accessor at offset 0
auto args1 = TensorAccessorArgs<0>();
// Second accessor starts after first one's CTA args
auto args2 = TensorAccessorArgs<args1.next_compile_time_args_offset()>();
// Third accessor starts after second one's CTA args
auto args3 = TensorAccessorArgs<args2.next_compile_time_args_offset()>();
...
```
or if we have runtime args at some point:
```cpp
// First tensor accessor starts at CRTA offset 0
auto args_src = TensorAccessorArgs<0, 0>();
// Second tensor accessor starts after the first one's CRTA args
auto args_dst = TensorAccessorArgs<args_src.next_compile_time_args_offset(),
                                   args_src.next_common_runtime_args_offset()>();
```
I checked and ttkernel doesn't support this yet, but I can add it if needed (first would try to figure out if we can do entirely in tt-lang).
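For the non-uniform case, the chaining above can be mirrored on the host side by accumulating per-accessor CTA arg counts. A sketch (hypothetical helper, mimicking `next_compile_time_args_offset()`):

```python
from itertools import accumulate


def chained_cta_offsets(args_per_accessor: list) -> list:
    """Each accessor's CTA offset is the running total of the CTA arg
    counts of all accessors before it, matching the
    next_compile_time_args_offset() chaining shown above."""
    return list(accumulate([0] + args_per_accessor[:-1]))
```

With uniform counts this degenerates to `idx * args_per_tensor`, so it covers both cases discussed here.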
I propose that this (hacky) fix is split out from this PR and merged first. In the meantime I am working on adding support for CTA and CRTA argument chaining as shown above (in tt-mlir). Once that is merged, we can get rid of the hack and update the AST code generator to pass the base CTA and CRTA indices to the dm threads.
brnorris03 left a comment:
Much offline discussion on the tensor accessors, but nothing that would stop merging this. I would like to get the current e2e working within some limited context (with known next steps to remove limitations).
python/ttlang/ttl_api.py (Outdated)
```diff
  args_per_tensor = len(ttnn.TensorAccessorArgs(args[0]).get_compile_time_args())

  # Write all kernels to /tmp for debugging
  for name, thread_type in kernel_info:
      cpp_source = ttkernel_to_cpp_by_name(module, name)
-     _write_kernel_to_tmp(name, cpp_source)
+     _write_kernel_to_tmp(name, cpp_source, args_per_tensor)
```
Assumes args_per_tensor is the same for all tensors, which is not necessarily true; non-elementwise kinds of functions could have tensors with different layouts (but OK for now).
Now it's just `num_tensors = len(args)`; does that sound good?
minor: consider putting the invalid test in a subdirectory
```python
lhs_accessor = TensorAccessor(lhs)
rhs_accessor = TensorAccessor(rhs)
out_accessor = TensorAccessor(out)
```
I am starting to wonder whether having these be explicitly specified by the user has any value at all? If we generate them, then we can ensure consistency among reads/compute/writes. I guess that's more of a discussion point, not a thing you need to address in this PR.
I think we agreed that TAs can go away (in the spec at least)
```cpp
// CircularBufferType
//===--------------------------------------------------------------------===//

tt_type_class<CircularBufferType>(m, "CircularBufferType")
```
thank you for adding it!
brnorris03 left a comment:
LGTM!
Force-pushed from 126e1bd to c4ef41e
Port the python bindings from d2m to ttl. Refactor to remove the existing runtime tests and replace them with parameterized pytest tests. Write new lit tests.
Removed the metal runtime path and all d2m logic, and renamed things accordingly.
Overhauled operators.py to auto-generate almost everything (kind of cool).
Filed a few issues for patterns we don't support; there is still more work to do: