-
Notifications
You must be signed in to change notification settings - Fork 22
Add SFPU eltwise max test #1008
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Thank you for your contribution! 🚀 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds comprehensive test coverage for the SFPU eltwise max operation, which was previously untested. The implementation includes both C++ kernel tests and Python validation tests to verify correctness across multiple data formats and architectures.
Key changes:
- Added C++ kernel test (
sfpu_eltwise_max_test.cpp) with unpack, math, and pack sections for hardware validation - Added Python test (
test_sfpu_eltwise_max.py) with parametrized test cases covering Float32, Float16, Float16_b, and Bfp8_b formats - Test validates element-wise max operation by comparing two tiles and storing the result
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/sources/sfpu_eltwise_max_test.cpp | Implements low-level kernel test that unpacks 2 tiles, performs element-wise max comparison using _calculate_max_, and packs results back to L1 memory |
| tests/python_tests/test_sfpu_eltwise_max.py | Python test driver that generates stimuli, executes the kernel test, and validates results against PyTorch's torch.maximum golden reference |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| res_address = write_stimuli_to_l1( | ||
| test_config, | ||
| src_A, | ||
| src_B, |
Copilot
AI
Dec 26, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The src_B variable is generated but never used in this test. The eltwise max operation only uses src_A which contains 2 tiles. Consider either removing src_B from the generate_stimuli call or updating the function to not generate it if it's not needed.
| "testname": test_name, | ||
| "dest_acc": dest_acc, | ||
| "input_A_dimensions": input_dimensions, | ||
| "input_B_dimensions": input_dimensions, |
Copilot
AI
Dec 26, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The input_B_dimensions parameter is set but not actually used since the test only operates on src_A data. Consider removing this parameter from the test_config or setting it to None/an empty list to clarify that only input A is being used.
| "input_B_dimensions": input_dimensions, | |
| "input_B_dimensions": None, |
| # Force dest_acc for certain formats | ||
| if formats.input_format in [DataFormat.Float16, DataFormat.Float32]: | ||
| dest_acc = DestAccumulation.Yes |
Copilot
AI
Dec 26, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The dest_acc parameter is overridden for Float16 and Float32 formats, but this happens after it's already been parametrized in the test. This means the test will effectively ignore the DestAccumulation.No parameter for these formats. Consider either removing DestAccumulation.No from the parametrization for these formats, or documenting why it's overridden.
| # Force dest_acc for certain formats | |
| if formats.input_format in [DataFormat.Float16, DataFormat.Float32]: | |
| dest_acc = DestAccumulation.Yes | |
| # For Float16/Float32, destination accumulation must be enabled in hardware. | |
| # If the test was parametrized with dest_acc=No for these formats, skip this | |
| # combination instead of silently overriding the parameter value. | |
| if ( | |
| formats.input_format in [DataFormat.Float16, DataFormat.Float32] | |
| and dest_acc == DestAccumulation.No | |
| ): | |
| pytest.skip( | |
| "dest_acc=No is not supported for Float16/Float32; destination accumulation must be enabled." | |
| ) |
| for (int i = 0; i < TILE_CNT; i++) | ||
| { | ||
| _llk_pack_<DstSync::SyncHalf, is_fp32_dest_acc_en, false>(i, L1_ADDRESS(buffer_Res[i])); | ||
| } |
Copilot
AI
Dec 26, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pack loop packs TILE_CNT tiles back to L1, but only the first tile contains the result of the max operation. The second tile remains unchanged from its input value. Consider packing only the first tile (i = 0) since that's the only one that contains meaningful output.
| for (int i = 0; i < TILE_CNT; i++) | |
| { | |
| _llk_pack_<DstSync::SyncHalf, is_fp32_dest_acc_en, false>(i, L1_ADDRESS(buffer_Res[i])); | |
| } | |
| _llk_pack_<DstSync::SyncHalf, is_fp32_dest_acc_en, false>(0, L1_ADDRESS(buffer_Res[0])); |
30fe32d to
8d06a61
Compare
|
@fvranicTT @nvelickovicTT I still need to measure peft difference between old sfpi and new TTI code and see if this really gives us perf gain |
0a335b7 to
02f4ee6
Compare
91fbc8f to
0c8b17f
Compare
| ], # Number of iterations to run the test in order to minimize profiler overhead in measurement | ||
| input_dimensions=[ | ||
| [32, 32], | ||
| # [32, 64], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these here to stay?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will see once I fix CI errors I am getting
Ticket
no
Problem description
We are not testing sfpu eltwise max anywhere explicitly. Also this will serve as starting point for some optimizations needed to be done for SDPA. Test is AI generated so I guess I will have to pass through it once to make sure everything is written fine
What's changed
Added python and cpp test
Type of change