Add gfx950 mla a8w8 qh32 kernel #1912

slippedJim · 2026-01-27T06:21:42Z

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Copilot

Pull request overview

This PR adds support for the gfx950 MLA (Multi-Head Latent Attention) a8w8 (8-bit activation, 8-bit weight) qh32 (32 query heads) kernel. The changes enable a new kernel configuration for fp8 data types with 32 heads, running in persistent mode with a decode sequence length of 4.

Changes:

Added new kernel entry to the assembly CSV configuration for fp8/fp8 with 32 heads
Updated metadata generation logic to support 32-head configurations alongside existing 16 and 128-head support
Modified test harness to disable causal masking and reduce iteration count for testing
Added debug logging to trace tensor information and kernel arguments

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 7 comments.

File	Description
hsa/gfx950/mla/mla_asm.csv	Registers new fp8/fp8 kernel with 32-head configuration
csrc/py_itfs_cu/asm_mla.cu	Adds configuration branch for fp8/fp8 32-head decode with qlen=4 and enables debug logging
csrc/kernels/mla/metadata/v1_comm.cuh	Introduces NUM_HEADS dispatcher macro for compile-time head count specialization
csrc/kernels/mla/metadata/v1_2_device.cuh	Extends metadata generation to support 32-head fp8 configurations
aiter/ops/attention.py	Updates metadata calculation to include 32-head case for fp8 data types
aiter/mla.py	Extends native support check and adds tensor debugging utilities
op_tests/test_mla_persistent.py	Disables causal masking for testing new kernel configuration
aiter/test_common.py	Reduces performance test iterations for faster debugging
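
The head-count gating described in the summary above can be pictured as a small table-driven check. The following is only a sketch under stated assumptions: the function name `is_native_head_count`, the string dtype labels, and the constant `NATIVE_FP8_HEAD_COUNTS` are hypothetical and are not taken from `aiter/mla.py` or `aiter/ops/attention.py`. It also assumes that the existing 16-head and 128-head paths apply to fp8/fp8 inputs, which the summary does not state explicitly.

```python
# Hypothetical sketch: not the actual aiter implementation.
# Head counts assumed to have a native fp8/fp8 path; 32 is the case this PR adds,
# 16 and 128 are the pre-existing cases named in the review summary.
NATIVE_FP8_HEAD_COUNTS = frozenset({16, 32, 128})


def is_native_head_count(q_dtype: str, kv_dtype: str, num_heads: int) -> bool:
    """Return True if this (dtype, head count) combination has a native path.

    Dtypes are plain strings here to keep the sketch dependency-free; a real
    check would compare framework dtype objects instead.
    """
    if q_dtype != "fp8" or kv_dtype != "fp8":
        # Only the fp8/fp8 case is modelled in this sketch.
        return False
    return num_heads in NATIVE_FP8_HEAD_COUNTS
```

Keeping the supported head counts in one set means a new configuration such as qh32 becomes a one-line change at the Python level, although the kernel registration and the metadata generation still have to be updated separately, as the per-file summary shows.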

slippedJim added 2 commits January 23, 2026 17:01

update

3ec581a

update

d0b74aa

slippedJim requested review from a team and Copilot January 27, 2026 06:21

Copilot AI reviewed Jan 27, 2026

slippedJim force-pushed the jim/dev/mla_qh32 branch from f98d992 to 2d8cb5b Compare January 28, 2026 03:35

update kernels & add restriction in test script

92790a5

slippedJim force-pushed the jim/dev/mla_qh32 branch from 2d8cb5b to 92790a5 Compare January 28, 2026 03:38

slippedJim added 7 commits January 27, 2026 21:46

typo

ae79db1

update

c0ecad1

fix missing headers

6dc8524

update for qh32 qseq1

1a79c63

update

68db7cb

update

7509d9d

update

48c2ae1

slippedJim marked this pull request as draft January 30, 2026 06:17

slippedJim added 2 commits January 30, 2026 02:31

update

37b33f8

Merge branch 'main' into jim/dev/mla_qh32

12d5bf9
