Ldjurovic/fast exp new #897
base: main
Pull request overview
This PR optimizes exponential calculation in fast approximation mode by consolidating the sanitization and calculation steps into a single LOADMACRO sequence that is recorded and replayed using the lltt::replay mechanism. This reduces instruction overhead and improves performance.
Key Changes
- Replaced ~100 lines of manual LOADMACRO invocations with ~25 lines using the replay buffer approach
- Updated threshold value from -88.5 to -86.6 and adjusted B_MINUS_C constant
- Added comprehensive test suite for fast exponential approximation
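As a rough illustration of how a clamp threshold and a `B_MINUS_C`-style constant can fit together, here is a minimal sketch of a Schraudolph-style fast exponential in Python. That the kernel follows this scheme is an assumption suggested only by the constant's name; the values of `A`, `B`, and `C` below are textbook float32 choices, not the PR's constants, and `fast_exp` is a hypothetical helper.

```python
import math
import struct

# Assumed Schraudolph-style constants for float32; NOT taken from the PR.
A = (1 << 23) / math.log(2)  # scales x so that x / ln(2) lands in the exponent field
B = 127 * (1 << 23)          # float32 exponent bias shifted into the exponent field
C = 0                        # hypothetical error-balancing correction; 0 in this sketch
THRESHOLD = -86.6            # clamp value mentioned in the PR overview


def fast_exp(x: float) -> float:
    """Approximate exp(x) by constructing a float32 bit pattern directly."""
    x = max(x, THRESHOLD)                 # sanitization: clamp very negative inputs
    bits = int(A * x + (B - C))           # calculation: a single multiply-add
    bits = max(0, min(bits, 0x7F7FFFFF))  # keep the pattern a finite, non-negative float
    return struct.unpack("<f", struct.pack("<I", bits))[0]
```

With `C = 0` the result is a piecewise-linear interpolation between powers of two, so the relative error stays within a few percent; a nonzero `C` trades a small bias for a lower maximum error.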
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| `tt_llk_wormhole_b0/common/inc/sfpu/ckernel_sfpu_exp.h` | Refactored fast approximation mode to use replay buffer with 24 recorded instructions; updated constants and LOADMACRO setup |
| `tt_llk_blackhole/common/inc/sfpu/ckernel_sfpu_exp.h` | Similar refactoring for Blackhole architecture with 16 recorded instructions; includes variable rename from `in` to `val` |
| `tests/sources/fast_exp_test.cpp` | New C++ test implementation for fast exponential calculation across all TRISC kernels |
| `tests/python_tests/test_fast_exp.py` | New Python test suite with multiple input dimensions and format configurations |
| `tests/python_tests/helpers/utils.py` | Extended `passed_test` function to support custom tolerances and one-face checking |
Comments suppressed due to low confidence (1)
tests/python_tests/test_fast_exp.py:72
- Variable generate_golden is not used.
generate_golden = get_golden_generator(UnarySFPUGolden)
TTI_SFPCONFIG(0x0000, 0x4, 0x0); // Load it into macro sequence register 0 (destination = 4)

TTI_SFPCONFIG(
    0x0010, 0x8 /*LOADMACRO control*/, 0x1); // Specifies that the store in LOAMACRO “Sequence 0” will inherit the instr_mod0 field from the LOADMACRO
Copilot AI, Dec 1, 2025
Typo in comment: "LOAMACRO" should be "LOADMACRO".
Suggested change:
- 0x0010, 0x8 /*LOADMACRO control*/, 0x1); // Specifies that the store in LOAMACRO “Sequence 0” will inherit the instr_mod0 field from the LOADMACRO
+ 0x0010, 0x8 /*LOADMACRO control*/, 0x1); // Specifies that the store in LOADMACRO “Sequence 0” will inherit the instr_mod0 field from the LOADMACRO
181479b to a605af0
L1_to_L1_iterations: int = 1,
custom_rtol: float = None,
custom_atol: float = None,
one_face_check: bool = False,
Better yet, add a num_faces parameter that defaults to 4, and set it to 1 in your case. That seems more scalable than a boolean flag.
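A minimal sketch of what that suggestion could look like, written in plain Python. The function name `passed_test_sketch`, the default tolerances, and the assumption that a face holds 16x16 values stored contiguously are all hypothetical; the real `passed_test` helper in `tests/python_tests/helpers/utils.py` is not reproduced here.

```python
import math

FACE_ELEMENTS = 16 * 16  # assumption: one face holds 16x16 values, stored contiguously


def passed_test_sketch(golden, result, rtol=0.05, atol=0.05, num_faces=4):
    """Compare only the first num_faces faces of two flat sequences."""
    limit = num_faces * FACE_ELEMENTS
    return all(
        math.isclose(r, g, rel_tol=rtol, abs_tol=atol)
        for g, r in zip(golden[:limit], result[:limit])
    )


# Usage: only the first face of the result matches the golden data.
golden = [1.0] * 1024
result = [1.0] * 256 + [5.0] * 768
one_face_ok = passed_test_sketch(golden, result, num_faces=1)
all_faces_ok = passed_test_sketch(golden, result)
```

A single integer parameter covers the one-face case, the full-tile case, and anything in between, which is what makes it more general than a boolean.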
tests/python_tests/test_fast_exp.py
Outdated
input_dimensions=[[32, 32], [32, 64], [64, 32], [64, 64], [128, 32], [32, 128]],
approx_mode=[ApproximationMode.Yes],
mathop=[MathOperation.Exp],
dest_acc=[DestAccumulation.No], # , DestAccumulation.Yes],
tests/python_tests/test_fast_exp.py
Outdated
golden_tensor,
res_tensor,
formats.output_format,
custom_atol=0.1,
This tolerance is large; can we go lower?
using namespace ckernel;
using namespace ckernel::sfpu;

const int iterations = 32;
Suggested change:
- const int iterations = 32;
Seems unused.
TTI_SFPLOADI(0, 0xA, lo16(B_MINUS_C));
TTI_SFPLOADI(0, 0x8, hi16(B_MINUS_C));
Suggested change:
- TTI_SFPLOADI(0, 0xA, lo16(B_MINUS_C));
- TTI_SFPLOADI(0, 0x8, hi16(B_MINUS_C));
+ TTI_SFPLOADI(ckernel::p_sfpu::LREG0, sfpi::SFPLOADI_MOD0_LOWER, lo16(B_MINUS_C));
+ TTI_SFPLOADI(ckernel::p_sfpu::LREG0, sfpi::SFPLOADI_MOD0_UPPER, hi16(B_MINUS_C));
It'd be nice to replace the magic numbers with named constants. That makes the code much easier to read later.
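To show why the named constants read better, here is a small Python model of loading a 32-bit constant as two 16-bit halves. The modifier values `0xA` and `0x8` are taken from the quoted diff; the semantics of `load_immediate` (each mode overwrites one half and preserves the other) and the value of `EXAMPLE_CONSTANT` are assumptions made for illustration, not a description of the actual SFPLOADI instruction.

```python
# Modifier values as they appear in the quoted diff.
SFPLOADI_MOD0_LOWER = 0xA  # assumed: write the low 16 bits, keep the high half
SFPLOADI_MOD0_UPPER = 0x8  # assumed: write the high 16 bits, keep the low half


def lo16(value: int) -> int:
    """Return the low 16 bits of a 32-bit value."""
    return value & 0xFFFF


def hi16(value: int) -> int:
    """Return the high 16 bits of a 32-bit value."""
    return (value >> 16) & 0xFFFF


def load_immediate(register: int, mode: int, half: int) -> int:
    """Hypothetical model of a 16-bit immediate load into one half of a register."""
    if mode == SFPLOADI_MOD0_LOWER:
        return (register & 0xFFFF0000) | half
    if mode == SFPLOADI_MOD0_UPPER:
        return (register & 0x0000FFFF) | (half << 16)
    raise ValueError(f"unsupported mode: {mode:#x}")


EXAMPLE_CONSTANT = 0x3F7A1234  # hypothetical value, not the PR's B_MINUS_C
reg = load_immediate(0, SFPLOADI_MOD0_LOWER, lo16(EXAMPLE_CONSTANT))
reg = load_immediate(reg, SFPLOADI_MOD0_UPPER, hi16(EXAMPLE_CONSTANT))
```

With the names in place, each call states which half it writes, so a reader does not need to look up what `0xA` or `0x8` means.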
37a5248 to 39aba02
🚀 tt-metal post-commit tests
Post-commit workflow: #19889451922
ba3c2e9 to a2f80dd
🚀 tt-metal post-commit tests
Post-commit workflow: #20427636630
Post-commit workflow: #20428364068
Post-commit workflow: #20460108882
Post-commit workflow: #20462076957
Ticket
Problem description
Speed up the exponential calculation in fast approximation mode.
What's changed
Moved both sanitization and calculation into a single LOADMACRO sequence.