[example] ttnn fused kernel with block-based compute #57
Description
This pull request introduces a fused elementwise operation example for TTNN, showing how to perform chained elementwise computations (such as Output = exp(A + B) + C) without storing intermediate results in memory. The implementation maximizes DST register utilization, processes tiles in blocks for better memory access patterns, and reduces memory bandwidth by eliminating intermediate reads and writes. The changes include a detailed README, a compute kernel for the fused operation, and a ternary reader kernel for block-based input loading.
Implemented kernels/compute/fused_elementwise.cpp: a block-based compute kernel that fuses three operations (A + B, a unary op (exp or relu), then + C) in DST registers, processing up to 4 tiles per DST cycle and handling remainder tiles for arbitrary input sizes.
Added kernels/dataflow/reader_ternary.cpp: a reader kernel that loads tiles from three input tensors in blocks, reserves buffer space once per block, and synchronizes each block's reads with a single barrier.
The kernels avoid intermediate buffer storage, use block-based synchronization, and keep intermediate values in hardware registers, which reduces DRAM accesses and improves throughput.
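The block-based fusion described above can be modelled on the host in plain C++. The sketch below is not the kernel from this PR and does not use any TTNN or tt-metal API. It only illustrates the control flow: tiles are consumed in blocks of up to 4, the last block may be shorter (the remainder), and each element goes through add, exp, and add without being written to an intermediate buffer. The names `kTileElems`, `kMaxBlockTiles`, `num_blocks`, and `fused_exp_add` are hypothetical, and the 32 x 32 tile size is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical constants for this sketch; the real tile shape and DST
// capacity depend on the hardware and data format.
constexpr std::size_t kTileElems = 32 * 32;  // elements per tile (assumed)
constexpr std::size_t kMaxBlockTiles = 4;    // tiles processed per block

// Number of blocks needed to cover num_tiles, counting a short final block.
std::size_t num_blocks(std::size_t num_tiles) {
    return (num_tiles + kMaxBlockTiles - 1) / kMaxBlockTiles;
}

// Host-side model of Output = exp(A + B) + C, processed block by block.
// Inputs are flat arrays holding a whole number of tiles.
std::vector<float> fused_exp_add(const std::vector<float>& a,
                                 const std::vector<float>& b,
                                 const std::vector<float>& c) {
    assert(a.size() == b.size() && b.size() == c.size());
    assert(a.size() % kTileElems == 0);

    const std::size_t num_tiles = a.size() / kTileElems;
    std::vector<float> out(a.size());

    for (std::size_t tile = 0; tile < num_tiles; tile += kMaxBlockTiles) {
        // A full block holds kMaxBlockTiles tiles; the last block holds
        // whatever remains.
        const std::size_t block_tiles =
            std::min(kMaxBlockTiles, num_tiles - tile);
        const std::size_t begin = tile * kTileElems;
        const std::size_t end = begin + block_tiles * kTileElems;

        for (std::size_t i = begin; i < end; ++i) {
            // "dst" stands in for a DST register slot: the intermediate
            // value lives here and is never stored to a separate buffer.
            float dst = a[i] + b[i];
            dst = std::exp(dst);
            dst += c[i];
            out[i] = dst;  // single write of the final result
        }
    }
    return out;
}
```

Calling `fused_exp_add` on three inputs of 6 tiles each runs two blocks, one of 4 tiles and one of 2. On the device, the per-block work would additionally be bracketed by buffer reservation and synchronization, which this model leaves out.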
Type of Change
Related Issues
Testing
Test Configuration
Tests Added/Modified
MLIR Changes (if applicable)
Checklist
pre-commit run --all-files)
Additional Notes