Conversation

@wransom-TT wransom-TT commented Jan 1, 2026

Ticket

tenstorrent/tt-metal#27845

Metal PR:

tenstorrent/tt-metal#35216

Problem description

MPWI (max pool with indices) currently only supports kernel_hw <= 32.

What's changed

The SFPU MPWI functions have been updated to accumulate results over multiple chunks, lifting the kernel_hw <= 32 limit.
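The chunked reduction can be illustrated with a plain scalar model (a sketch only; `MaxWithIndex` and the function name are illustrative, not part of the LLK API):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy scalar model of chunked max-with-indices: each chunk covers at most
// 32 kernel elements; the running maximum and its index are carried across
// chunks, much like the SFPU version keeps them in accumulator tiles.
struct MaxWithIndex {
    float value;
    std::uint32_t index;
};

MaxWithIndex max_pool_with_indices_chunked(const std::vector<float>& window) {
    constexpr std::size_t kChunk = 32; // per-chunk limit (kernel_hw <= 32)
    MaxWithIndex acc{window[0], 0};
    for (std::size_t base = 0; base < window.size(); base += kChunk) {
        const std::size_t end = std::min(base + kChunk, window.size());
        // Reduce this chunk, merging into the running accumulator.
        for (std::size_t i = base; i < end; ++i) {
            if (window[i] > acc.value) {
                acc = {window[i], static_cast<std::uint32_t>(i)};
            }
        }
    }
    return acc;
}
```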

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

Checklist

Copilot AI review requested due to automatic review settings January 1, 2026 19:19
@wransom-TT wransom-TT requested a review from rtawfik01 as a code owner January 1, 2026 19:19
@wransom-TT wransom-TT added the metal-post-commit-tests label (used for running tt-metal post-commit tests on wormhole and blackhole) Jan 1, 2026

github-actions bot commented Jan 1, 2026

Thank you for your contribution! 🚀
If you want to run metal post-commit tests, you can add the metal-post-commit-tests label to this pull request.
📖 For more information, please refer to our CONTRIBUTING guide.

Copilot AI left a comment

Pull request overview

This PR adds support for large kernels (kernel_hw > 32) to the Max Pool With Indices (MPWI) operation by implementing accumulation over multiple chunks. Previously, MPWI only supported kernels with height/width ≤ 32.

Key changes:

  • Added an accumulate template parameter to enable chunk-based processing for large kernels
  • Renamed the third parameter from tile_idx (unused) to chunk to track the chunk index during accumulation
  • Implemented accumulation logic that stores intermediate results in separate destination tiles and combines them across chunks
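The accumulate/chunk interplay described above can be modeled roughly as follows (all names here are hypothetical stand-ins for the actual SFPU load/swap/store sequences, not LLK code):

```cpp
#include <algorithm>
#include <cstdint>

// Toy model of the `accumulate` template parameter: when enabled, every
// chunk after the first merges with the previously stored running maximum
// (which the SFPU code keeps in the tile following the output tile).
struct State {
    float current = 0.0f;  // result of the current chunk
    float running = 0.0f;  // running result carried across chunks
};

template <bool accumulate>
float finalize_chunk(State& s, std::uint32_t chunk) {
    if constexpr (accumulate) {
        if (chunk > 0) {
            s.current = std::max(s.current, s.running); // merge with previous chunks
        }
        s.running = s.current; // keep the running result for the next chunk
    }
    return s.current; // value written back to the output tile
}
```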

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File: tt_llk_wormhole_b0/common/inc/sfpu/ckernel_sfpu_max_pool_indices.h
Description: Updated MPWI functions to support large-kernel accumulation for the Wormhole B0 architecture

File: tt_llk_blackhole/common/inc/sfpu/ckernel_sfpu_max_pool_indices.h
Description: Updated MPWI functions to support large-kernel accumulation for the Blackhole architecture


TT_SFPSTORE(p_sfpu::LREG0, InstrModLoadStore::DEFAULT, ADDR_MOD_3, values_accum_tile_offset + col_offset);
}

// store the final result to DST 0 (data) and DST 2 (indexes)
Copilot AI Jan 1, 2026

The comment uses "indexes" but throughout the rest of the codebase and documentation, the term "indices" is consistently used. This should be changed to "indices" for consistency.

Suggested change
// store the final result to DST 0 (data) and DST 2 (indexes)
// store the final result to DST 0 (data) and DST 2 (indices)


github-actions bot commented Jan 1, 2026

🚀 tt-metal post-commit tests

Branch: wransom/mpwi_large
Test Configuration:

  • Wormhole tests: 🟢 Enabled
  • Blackhole tests: 🟢 Enabled

Test Results:

  • Wormhole tests: 🛑 Failed - View run
  • C++ post-commit tests: 🎯 Passed - View run
  • Blackhole tests: 🛑 Failed - View run
  • Device perf regressions tests: 🎯 Passed - View run

🔗 Links

📊 Post-commit workflow: #20644145065


github-actions bot commented Jan 2, 2026

🚀 tt-metal post-commit tests

Branch: wransom/mpwi_large
Test Configuration:

  • Wormhole tests: 🟢 Enabled
  • Blackhole tests: 🟢 Enabled

Test Results:

  • Wormhole tests: 🛑 Failed - View run
  • C++ post-commit tests: 🎯 Passed - View run
  • Blackhole tests: 🛑 Failed - View run
  • Device perf regressions tests: 🎯 Passed - View run

🔗 Links

📊 Post-commit workflow: #20644810710

@wransom-TT wransom-TT added and removed the metal-post-commit-tests label Jan 5, 2026

github-actions bot commented Jan 5, 2026

🚀 tt-metal post-commit tests

Branch: wransom/mpwi_large
Test Configuration:

  • Wormhole tests: 🟢 Enabled
  • Blackhole tests: 🟢 Enabled

Test Results:

  • Wormhole tests: 🎯 Passed - View run
  • C++ post-commit tests: 🎯 Passed - View run
  • Blackhole tests: 🛑 Failed - View run
  • Device perf regressions tests: 🎯 Passed - View run

🔗 Links

📊 Post-commit workflow: #20725100735


github-actions bot commented Jan 5, 2026

🚀 tt-metal post-commit tests

Branch: wransom/mpwi_large
Test Configuration:

  • Wormhole tests: 🟢 Enabled
  • Blackhole tests: 🟢 Enabled

Test Results:

  • Wormhole tests: 🎯 Passed - View run
  • C++ post-commit tests: 🎯 Passed - View run
  • Blackhole tests: 🛑 Failed - View run
  • Device perf regressions tests: 🎯 Passed - View run

🔗 Links

📊 Post-commit workflow: #20727817391

}
else
{
static_assert(!accumulate, "accumulate mode is not supported for TILE layout");
Contributor

Please make sure that layout == TILE in that case; that way, the assert will trigger for ROW_MAJOR + accumulate as well.

Contributor Author

This is in the else block of the if constexpr (layout == ckernel::DataLayout::ROW_MAJOR) so I think it's already checking this.

}
else
{
static_assert(!accumulate, "accumulate mode is not supported for TILE layout");
Contributor

ditto

Contributor Author

same thing here

bool accumulate = false>
inline void _calculate_max_pool_with_indices_(const uint values_tile_idx, const uint indices_tile_idx, const uint chunk)
{
// size of each tile in Dest is 64 rows
Collaborator

Consider adding LLK_ASSERTs to check that the values of values_tile_idx, indices_tile_idx, and chunk are valid.

Contributor Author

will do

Contributor Author

I'm not sure I can do much to validate the chunk number. Chunks should increase one by one from 0, but without a way to save state between calls I don't see how to check this.

For the DST tiles, the only real requirement is that each tile has a clear tile after it when accumulation is enabled. Otherwise there are no restrictions; would you like me to add these checks?

@github-actions

🚀 tt-metal post-commit tests

Branch: wransom/mpwi_large
Test Configuration:

  • Wormhole tests: 🟢 Enabled
  • Blackhole tests: 🟢 Enabled

Test Results:

  • Wormhole tests: 🚫 Workflow Cancelled - Check logs for details
  • C++ post-commit tests: 🚫 Workflow Cancelled - Check logs for details
  • Blackhole tests: 🚫 Workflow Cancelled - Check logs for details
  • Device perf regressions tests: 🚫 Workflow Cancelled - Check logs for details

🔗 Links

📊 Post-commit workflow: #20929388907

@github-actions

🚀 tt-metal post-commit tests

Branch: wransom/mpwi_large
Test Configuration:

  • Wormhole tests: 🟢 Enabled
  • Blackhole tests: 🟢 Enabled

Test Results:

  • Wormhole tests: 🚫 Workflow Cancelled - Check logs for details
  • C++ post-commit tests: 🚫 Workflow Cancelled - Check logs for details
  • Blackhole tests: 🚫 Workflow Cancelled - Check logs for details
  • Device perf regressions tests: 🚫 Workflow Cancelled - Check logs for details

🔗 Links

📊 Post-commit workflow: #20930842380

@github-actions

🚀 tt-metal post-commit tests

Branch: wransom/mpwi_large
Test Configuration:

  • Wormhole tests: 🟢 Enabled
  • Blackhole tests: 🟢 Enabled

Test Results:

  • Wormhole tests: 🚫 Workflow Cancelled - Check logs for details
  • C++ post-commit tests: 🛑 Failed - View run
  • Blackhole tests: 🚫 Workflow Cancelled - Check logs for details
  • Device perf regressions tests: 🚫 Workflow Cancelled - Check logs for details

🔗 Links

📊 Post-commit workflow: #21005031265

@github-actions

🚀 tt-metal post-commit tests

Branch: wransom/mpwi_large
Test Configuration:

  • Wormhole tests: 🟢 Enabled
  • Blackhole tests: 🟢 Enabled

Test Results:

  • Wormhole tests: 🎯 Passed - View run
  • C++ post-commit tests: 🛑 Failed - View run
  • Blackhole tests: 🎯 Passed - View run
  • Device perf regressions tests: 🎯 Passed - View run

🔗 Links

📊 Post-commit workflow: #21005217637

{ // for all but the first chunk we need to load the previous result from DST 1 and 3 and do a max with the current result in DST 0 and 2
TT_SFPLOAD(p_sfpu::LREG1, InstrModLoadStore::DEFAULT, ADDR_MOD_3, values_accum_tile_offset + col_offset); // previous accumulated value
TT_SFPLOAD(p_sfpu::LREG5, instr_mod_index, ADDR_MOD_3, indices_accum_tile_offset + col_offset); // previous accumulated index
TTI_SFPSWAP(0, p_sfpu::LREG0, p_sfpu::LREG1, p_sfpswap::ALL_ROWS_MAX); // LREG0 contains max of current and previous value
@fvranicTT fvranicTT Jan 20, 2026


This instruction requires 2 cycles to complete and should always be followed by an SFPNOP instruction. Could we swap the order of the second SFPLOAD and the SFPSWAP to hide this NOP?
Edit: never mind, it's already handled by the order of the STOREs.

template <bool APPROXIMATION_MODE, bool is_fp32_dest_acc_en, int ITERATIONS>
inline void _calculate_max_pool_with_indices_generic_(const uint values_tile_idx, const uint indices_tile_idx, const uint tile_idx /* unused */)
template <bool APPROXIMATION_MODE, bool is_fp32_dest_acc_en, int ITERATIONS, bool accumulate = false>
inline void _calculate_max_pool_with_indices_generic_(const uint values_tile_idx, const uint indices_tile_idx, const uint chunk)
Contributor

After looking into this locally, I think we can eliminate some loads and stores; will write more later.
Basically, after reducing an 8-row block, the result was stored to memory and immediately reloaded for the next reduction stage. I think we can avoid that.

@fvranicTT fvranicTT Jan 20, 2026

In the original implementation we reduce 8 rows and immediately store the result, only to reload it right after:

Original:

  1. Load rows 0-7 from Dest Reg -> LREG0-3, LREG4-7
  2. Reduce (max in LREG0/LREG4)
  3. Store LREG0/LREG4 -> Dest Reg offset 0
  4. Load rows 8-15 from Dest Reg -> LREG0-3, LREG4-7
  5. Reduce (max in LREG0/LREG4)
  6. Store LREG0/LREG4 -> Dest Reg offset 16
  7. Load from Dest Reg offset 0 -> LREG0/LREG4 <- reload result of step 3
  8. Load from Dest Reg offset 16 -> LREG1/LREG5 <- reload result of step 6
  9. SWAP
  10. store final

Idea:

  1. Load rows 0-7 from Dest Reg -> LREG0-3, LREG4-7
  2. Reduce (max in LREG0/LREG4)
  3. Store LREG0/LREG4 -> Dest Reg offset 0
  4. Load rows 8-15 from Dest Reg -> LREG0-3, LREG4-7
  5. Reduce (max in LREG0/LREG4)
  6. Skip store - keep in LREG0/LREG4
  7. Load from Dest Reg offset 0 -> LREG1/LREG5
  8. Skip load - second result already in LREG0/LREG4
  9. SWAP
  10. store final

So steps 6 and step 8 differ, and this is where we can save some time.

This cuts out 2 stores and 2 loads per call to process_16_rows; across the 4 calls, that's 8 loads and 8 stores saved.
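A toy scalar analogue of the saving (local variables stand in for LREGs, an array stands in for Dest; this is not the actual SFPU code):

```cpp
#include <algorithm>
#include <array>

// Reduce two 8-element blocks of a 16-row column. Block 1's result must be
// stored (its "registers" are reused for block 2), but block 2's result is
// kept local instead of being stored and immediately reloaded.
float reduce_16(const std::array<float, 16>& rows, std::array<float, 16>& dest) {
    // Block 1: reduce and store (frees the "registers" for block 2).
    float first = *std::max_element(rows.begin(), rows.begin() + 8);
    dest[0] = first;
    // Block 2: reduce but SKIP the store - keep the result local (step 6).
    float second = *std::max_element(rows.begin() + 8, rows.end());
    // Final swap: reload only block 1's result; block 2 never left the
    // "registers" (step 8 is skipped).
    return std::max(dest[0], second);
}
```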

Contributor

This one's a bit more tricky. Originally we process all rows 0-15 first, then all rows 16-31, then do the final swaps:

process_16_rows(0, even);       // rows 0-15 even
process_16_rows(0, odd);        // rows 0-15 odd
process_16_rows(32, even);      // rows 16-31 even, store result
process_16_rows(32, odd);       // rows 16-31 odd, store result
final_swap(even);               // has to reload rows 16-31 even
final_swap(odd);                // has to reload rows 16-31 odd

But there's no dependency between even and odd columns (they touch different memory), so we can process each column completely and then move on:

process_16_rows(0, even, true);
process_16_rows(32, even, false);   // keep result in LREG0/LREG4
final_swap(even);                   // use what's already in registers

process_16_rows(0, odd, true);
process_16_rows(32, odd, false);    // keep result in LREG0/LREG4
final_swap(odd);                    // use what's already in registers

Now final_swap() can use the result directly from LREG0/LREG4 instead of loading it back from memory.
That's another 2 stores and 2 loads saved per column, so 4 more of each total.
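The reordering can be sketched in the same toy model: each column is finished end-to-end before the next begins, so the final combine consumes a value that never left the "registers" (plain C++ stand-ins again, not the LLK implementation):

```cpp
#include <algorithm>
#include <vector>

// Process one 32-row column completely before moving to the next.
float reduce_column(const std::vector<float>& col) {
    // Rows 0-15: reduce and "store" (kept as a local here).
    float low = *std::max_element(col.begin(), col.begin() + 16);
    // Rows 16-31: reduce; the result stays "in registers".
    float high = *std::max_element(col.begin() + 16, col.begin() + 32);
    // Final combine uses `high` directly - no reload of rows 16-31.
    return std::max(low, high);
}

// Even column first, then odd: no dependency between them, so each can be
// finished in turn.
std::vector<float> reduce_columns(const std::vector<std::vector<float>>& cols) {
    std::vector<float> out;
    for (const auto& col : cols) {
        out.push_back(reduce_column(col));
    }
    return out;
}
```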

@fvranicTT fvranicTT Jan 20, 2026

OK, I summarized my ideas, wrote the instructions up for the LLM, and here is what it produced. I haven't had time to actually test it, but I went through the code and I think it does what I summarized above:

/**
 * @brief Calculates column-wise MaxPool of a tile, placing output values into the first row.
 *        Also places the index of the max value into the first row of the indices tile.
 *        Supports {FP32, FP16_B} for values, and {UINT16, INT32, UINT32} for indices, inferred from the Dest mode used.
 *        Can reduce up to 32 rows of a tile.
 * @tparam APPROXIMATION_MODE Whether to use the approximation mode (unused).
 * @tparam is_fp32_dest_acc_en Whether Dest is in 32bit mode (true) or 16bit mode (false).
 * @tparam ITERATIONS The number of iterations to use for the MaxPool operation (unused).
 * @tparam accumulate Whether to accumulate results for large kernels (default is false).
 * @param values_tile_idx The index of the tile in the Dest register containing the data to be reduced.
 * @param indices_tile_idx The index of the tile in the Dest register containing the indices of the data.
 * @param chunk The chunk index for large kernel accumulation.
 *
 * Note this function is only implemented for ROW_MAJOR data layout, so when _init_max_pool_with_indices_ is called
 * it must be called with layout=DataLayout::ROW_MAJOR.
 */
template <bool APPROXIMATION_MODE, bool is_fp32_dest_acc_en, int ITERATIONS, bool accumulate = false>
inline void _calculate_max_pool_with_indices_generic_(const uint values_tile_idx, const uint indices_tile_idx, const uint chunk)
{
    // size of each tile in Dest is 64 rows
    constexpr uint dst_tile_size         = 64;
    const uint values_tile_offset        = values_tile_idx * dst_tile_size;
    const uint indices_tile_offset       = indices_tile_idx * dst_tile_size;
    const uint values_accum_tile_offset  = (values_tile_idx + 1) * dst_tile_size;
    const uint indices_accum_tile_offset = (indices_tile_idx + 1) * dst_tile_size;
    // each face is 16 rows
    constexpr uint eight_row_offset   = 16;
    constexpr uint sixteen_row_offset = 32;
    constexpr uint8_t instr_mod_index = is_fp32_dest_acc_en ? InstrModLoadStore::INT32 : InstrModLoadStore::LO16;

    // ROW MAJOR DATA VERSION OF MPWI
    // DATA IS EXPECTED TO BE IN THE FOLLOWING ORDER IN DEST:
    // Face 0 Row 0
    // Face 1 Row 0
    // Face 0 Row 1
    // Face 1 Row 1
    // .
    // .
    // .
    // Face 0 Row 31
    // Face 1 Row 31

    // Reduces 8 rows to max in LREG0/LREG4, optionally stores result.
    auto reduce_8_rows = [instr_mod_index](const uint val_base, const uint idx_base, const bool store_result) __attribute__((always_inline))
    {
        // data - precomputed base address eliminates repeated arithmetic
        TT_SFPLOAD(p_sfpu::LREG0, InstrModLoadStore::DEFAULT, ADDR_MOD_7, val_base + 0);  // Row 0 and 1
        TT_SFPLOAD(p_sfpu::LREG1, InstrModLoadStore::DEFAULT, ADDR_MOD_7, val_base + 4);  // Row 2 and 3
        TT_SFPLOAD(p_sfpu::LREG2, InstrModLoadStore::DEFAULT, ADDR_MOD_7, val_base + 8);  // Row 4 and 5
        TT_SFPLOAD(p_sfpu::LREG3, InstrModLoadStore::DEFAULT, ADDR_MOD_7, val_base + 12); // Row 6 and 7
        // index
        TT_SFPLOAD(p_sfpu::LREG4, instr_mod_index, ADDR_MOD_7, idx_base + 0);
        TT_SFPLOAD(p_sfpu::LREG5, instr_mod_index, ADDR_MOD_7, idx_base + 4);
        TT_SFPLOAD(p_sfpu::LREG6, instr_mod_index, ADDR_MOD_7, idx_base + 8);
        TT_SFPLOAD(p_sfpu::LREG7, instr_mod_index, ADDR_MOD_7, idx_base + 12);

        // Reduce 8 rows to max in LREG0/LREG4 via replay buffer
        lltt::replay(0, 7);

        // Only store when necessary - caller controls this
        if (store_result)
        {
            TT_SFPSTORE(p_sfpu::LREG0, InstrModLoadStore::DEFAULT, ADDR_MOD_7, val_base + 0);
            TT_SFPSTORE(p_sfpu::LREG4, instr_mod_index, ADDR_MOD_7, idx_base + 0);
        }
        // Result: Max of 8 rows in LREG0 (values) and LREG4 (indices)
    };

    // OPTIMIZATION: Flattened process_16_rows - eliminates nested lambda overhead
    // and removes redundant store-load pairs.
    // Note: After reducing the second 8-row block, the result is already in LREG0/LREG4.
    // We only need to reload the first block's result into LREG1/LREG5 for the final swap.
    // store_result: if false, result stays in LREG0/LREG4 for caller to use directly.
    auto process_16_rows = [&reduce_8_rows, values_tile_offset, indices_tile_offset, eight_row_offset, instr_mod_index](
                               const uint base_offset, const uint col_offset, const bool store_result) __attribute__((always_inline))
    {
        // Precompute base addresses for both 8-row blocks
        const uint val_base_first  = values_tile_offset + base_offset + col_offset;
        const uint idx_base_first  = indices_tile_offset + base_offset + col_offset;
        const uint val_base_second = values_tile_offset + eight_row_offset + base_offset + col_offset;
        const uint idx_base_second = indices_tile_offset + eight_row_offset + base_offset + col_offset;

        // First 8 rows: reduce and STORE (we need to free registers for second block)
        reduce_8_rows(val_base_first, idx_base_first, true);

        // Second 8 rows: reduce but DON'T STORE - keep result in LREG0/LREG4
        reduce_8_rows(val_base_second, idx_base_second, false);

        // Now: LREG0/LREG4 contains Max(R8-15), need to load Max(R0-7) into LREG1/LREG5
        TT_SFPLOAD(p_sfpu::LREG1, InstrModLoadStore::DEFAULT, ADDR_MOD_7, val_base_first);
        TT_SFPLOAD(p_sfpu::LREG5, instr_mod_index, ADDR_MOD_7, idx_base_first);

        // Swap to get Max(R0-15) in LREG0/LREG4
        TTI_SFPSWAP(0, p_sfpu::LREG0, p_sfpu::LREG1, p_sfpswap::ALL_ROWS_MAX);

        // Only store if needed - caller may use LREG0/LREG4 directly
        if (store_result)
        {
            TT_SFPSTORE(p_sfpu::LREG4, instr_mod_index, ADDR_MOD_7, idx_base_first);
            TT_SFPSTORE(p_sfpu::LREG0, InstrModLoadStore::DEFAULT, ADDR_MOD_7, val_base_first);
        }
    };

    // Final swap: combine first 16 rows with second 16 rows.
    // OPTIMIZATION: After process_16_rows(sixteen_row_offset, col), Max(R16-31) is already in LREG0/LREG4.
    // We only need to load Max(R0-15) into LREG1/LREG5, saving 2 loads per column.
    auto final_swap =
        [values_tile_offset, indices_tile_offset, values_accum_tile_offset, indices_accum_tile_offset, instr_mod_index, chunk](
            const uint col_offset) __attribute__((always_inline))
    {
        // Precompute addresses
        const uint val_first = values_tile_offset + col_offset;
        const uint idx_first = indices_tile_offset + col_offset;

        // LREG0/LREG4 already contains Max(R16-31) from the previous process_16_rows call
        // Only need to load Max(R0-15) into LREG1/LREG5
        TT_SFPLOAD(p_sfpu::LREG1, InstrModLoadStore::DEFAULT, ADDR_MOD_7, val_first); // Max(R0-15) for F0,1
        TT_SFPLOAD(p_sfpu::LREG5, instr_mod_index, ADDR_MOD_7, idx_first);
        TTI_SFPSWAP(0, p_sfpu::LREG0, p_sfpu::LREG1, p_sfpswap::ALL_ROWS_MAX); // LREG0 contains Max(R0-31) for F0,1

        if constexpr (accumulate)
        {
            const uint val_accum = values_accum_tile_offset + col_offset;
            const uint idx_accum = indices_accum_tile_offset + col_offset;
            if (chunk > 0)
            { // for all but the first chunk we need to load the previous result from DST 1 and 3 and do a max with the current result in DST 0 and 2
                TT_SFPLOAD(p_sfpu::LREG1, InstrModLoadStore::DEFAULT, ADDR_MOD_7, val_accum); // previous accumulated value
                TT_SFPLOAD(p_sfpu::LREG5, instr_mod_index, ADDR_MOD_7, idx_accum);            // previous accumulated index
                TTI_SFPSWAP(0, p_sfpu::LREG0, p_sfpu::LREG1, p_sfpswap::ALL_ROWS_MAX); // LREG0 contains max of current and previous value
            }
            // for each chunk we store the running result to DST 1 and 3
            TT_SFPSTORE(p_sfpu::LREG4, instr_mod_index, ADDR_MOD_7, idx_accum);
            TT_SFPSTORE(p_sfpu::LREG0, InstrModLoadStore::DEFAULT, ADDR_MOD_7, val_accum);
        }

        // store the final result to DST 0 (data) and DST 2 (indices)
        TT_SFPSTORE(p_sfpu::LREG4, instr_mod_index, ADDR_MOD_7, idx_first);
        TT_SFPSTORE(p_sfpu::LREG0, InstrModLoadStore::DEFAULT, ADDR_MOD_7, val_first);
    };

    // OPTIMIZATION: Process each column completely before moving to the next.
    // This allows the second process_16_rows to leave Max(R16-31) in LREG0/LREG4,
    // which final_swap can use directly without reloading.
    // Saves 2 stores + 2 loads per column = 4 stores + 4 loads total.
    constexpr int even_column_offset = 0;
    constexpr int odd_column_offset  = 2;

    // Even columns: process rows 0-15, then 16-31, then final swap
    process_16_rows(0, even_column_offset, true);                      // Store Max(R0-15) for final_swap to load
    process_16_rows(sixteen_row_offset, even_column_offset, false);    // Keep Max(R16-31) in LREG0/LREG4
    final_swap(even_column_offset);                                    // Uses LREG0/LREG4 directly

    // Odd columns: process rows 0-15, then 16-31, then final swap
    process_16_rows(0, odd_column_offset, true);                       // Store Max(R0-15) for final_swap to load
    process_16_rows(sixteen_row_offset, odd_column_offset, false);     // Keep Max(R16-31) in LREG0/LREG4
    final_swap(odd_column_offset);                                     // Uses LREG0/LREG4 directly
}


Labels

blackhole, metal-post-commit-tests, wormhole


4 participants