mla ps support paged 64 and 3buffer layout for ds3.2 #1917
Pull request overview
This pull request adds support for page size 64 paging and a 3-buffer KV cache layout for MLA (Multi-Head Latent Attention) operations in DeepSeek V3.2. The PR introduces new parameters (`page_size` and `nhead_kv`) across the entire call stack to enable flexible paging strategies and to support a specialized 3-buffer KV cache layout with FP8 quantization.
Changes:
- Added `page_size` and `nhead_kv` parameters throughout the MLA API stack (Python, C++, and CUDA); see the sketch after this list
- Introduced support for a "byte" datatype to handle the 3-buffer layout with separate nope/scale/rope buffers
- Added new kernel assembly files (`mla.co`, `mla_page64.co`) for optimized page size 64 support
- Updated metadata generation logic to properly calculate KV offsets for paged layouts
- Added test infrastructure for the 3-buffer layout with helper functions for initialization and validation
Reviewed changes
Copilot reviewed 13 out of 15 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| op_tests/test_mla_sparse.py | Added kv_last_page_lens and page_size/nhead_kv parameters to function calls |
| op_tests/test_mla_persistent.py | Major additions: 3-buffer KV cache helper functions, new test cases for 3-buffer layout |
| op_tests/test_mla.py | Updated function signatures to include new page_size and nhead_kv parameters |
| hsa/gfx942/mla/mla_page64.co | New assembly kernel binary for page size 64 support |
| hsa/gfx942/mla/mla_asm.csv | Added kernel configuration entry for "byte" datatype |
| hsa/gfx942/mla/mla.co | New assembly kernel binary for 3-buffer layout support |
| csrc/py_itfs_cu/asm_mla.cu | Extended to handle "byte" datatype and removed page_size extraction from KV tensor |
| csrc/kernels/mla/metadata/v1_comm.cuh | Code formatting improvements and copyright update |
| csrc/kernels/mla/metadata/v1_2_device.cuh | Added paged layout support with proper kv_offset calculation |
| csrc/kernels/mla/metadata.cu | Added kv_last_page_lens and page_size parameters to metadata generation |
| csrc/include/rocm_ops.hpp | Updated Python bindings with new parameters |
| csrc/include/mla.h | Updated function signatures with new parameters |
| csrc/include/attention_asm_mla.h | Updated function signatures with new parameters |
| aiter/ops/attention.py | Added default values for backward compatibility |
| aiter/mla.py | Added page_size and nhead_kv parameters with default values |
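The 3-buffer helper functions added to op_tests/test_mla_persistent.py are not reproduced here, but a minimal sketch of the layout they set up might look as follows. The buffer shapes, dtypes, and the nope/rope widths are assumptions based on the PR description (separate nope/scale/rope buffers with FP8 quantization), not the test's actual code.

```python
import torch

def init_3buffer_kv_cache(
    num_pages: int,
    page_size: int = 64,
    kv_lora_rank: int = 512,   # assumed width of the "nope" latent portion
    rope_dim: int = 64,        # assumed width of the "rope" portion
    device: str = "cuda",
):
    """Allocate the three separate buffers of the 3-buffer KV layout (sketch)."""
    # "nope" latent values, quantized to FP8 (e4m3fnuz is the gfx942 variant)
    kv_nope = torch.empty(
        num_pages, page_size, kv_lora_rank,
        dtype=torch.float8_e4m3fnuz, device=device,
    )
    # dequantization scales for the FP8 buffer (scale granularity is an assumption)
    kv_scale = torch.empty(
        num_pages, page_size, 1, dtype=torch.float32, device=device
    )
    # rotary ("rope") portion kept in higher precision
    kv_rope = torch.empty(
        num_pages, page_size, rope_dim, dtype=torch.bfloat16, device=device
    )
    return kv_nope, kv_scale, kv_rope
```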
Motivation
Add support for page size 64 paging and a 3-buffer KV cache layout for MLA (Multi-Head Latent Attention) operations in DeepSeek V3.2.
Technical Details
- Added `page_size` and `nhead_kv` parameters throughout the MLA API stack (Python, C++, and CUDA)
- Introduced support for a "byte" datatype to handle the 3-buffer layout with separate nope/scale/rope buffers
- Added a new kernel assembly file (`mla_a16w8_qh16_m16x4_n16x1_coex0_mask1_ps_page64_ds32.co`) for optimized page size 64 support
- Updated metadata generation logic to properly calculate KV offsets for paged layouts; see the sketch after this list
- Added test infrastructure for the 3-buffer layout with helper functions for initialization and validation
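For reference, the sketch below shows the standard paged-attention indexing arithmetic that a KV-offset calculation of this kind performs. The actual device-side logic lives in csrc/kernels/mla/metadata/v1_2_device.cuh and may organize the computation differently; this is only an illustration of the idea.

```python
import torch

def kv_offset(
    kv_indptr: torch.Tensor,   # [batch + 1] offsets into kv_indices
    kv_indices: torch.Tensor,  # flattened page table across all batches
    batch_idx: int,
    token_idx: int,
    page_size: int = 64,
) -> int:
    # Which logical page of this sequence holds the token, ...
    logical_page = token_idx // page_size
    # ... which physical page that logical page maps to, ...
    physical_page = int(kv_indices[kv_indptr[batch_idx] + logical_page])
    # ... and the token's flat offset: page base plus slot within the page.
    return physical_page * page_size + token_idx % page_size
```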
Test Plan
```
python3 op_tests/test_mla_persistent.py -blk=64 -d=bf16 -kvd=fp8 -pl=3BUFFER
```
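Here, `-blk=64` presumably selects the page (block) size of 64, `-d=bf16` the query/output dtype, `-kvd=fp8` the KV cache dtype, and `-pl=3BUFFER` the new 3-buffer KV cache layout; these flag meanings are inferred from the PR description rather than from the test's argument parser.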
Test Result
Submission Checklist