Conversation

@PerryZhang01 (Contributor) commented Dec 11, 2025

This PR integrates a new paged attention Triton kernel that supports the sliding_window and sink parameters. It also refactors the attention layers to use attention backend dispatch.

Accuracy of gpt-oss on the GSM8K dataset:
[image: accuracy results]
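
For readers skimming the thread, here is a minimal sketch of what attention backend dispatch with sliding_window and sink parameters can look like. This is an illustration only, under assumed names: AttentionParams, AttentionBackend, TritonPagedAttentionBackend, and get_attention_backend are all hypothetical and are not this repository's actual API.

```python
# Hedged sketch of attention backend dispatch; every name here is a
# hypothetical placeholder, not this PR's real classes or kernels.
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class AttentionParams:
    sliding_window: Optional[int] = None  # e.g. 128; None disables windowing
    num_sink_tokens: int = 0              # leading "sink" tokens always kept attendable


class AttentionBackend:
    """Interface each backend implements; attention layers call forward()
    and stay agnostic to which kernel actually runs."""

    def forward(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                params: AttentionParams) -> torch.Tensor:
        raise NotImplementedError


class TritonPagedAttentionBackend(AttentionBackend):
    def forward(self, q, k, v, params):
        # A real backend would launch a paged-attention Triton kernel here,
        # forwarding params.sliding_window and params.num_sink_tokens.
        raise NotImplementedError("sketch only; no kernel is launched")


def get_attention_backend(name: str) -> AttentionBackend:
    # Central dispatch: a layer resolves its backend once at construction
    # instead of branching on the kernel at every call site.
    backends = {"triton_paged": TritonPagedAttentionBackend}
    return backends[name]()
```

The design point, echoed later in this thread, is to keep per-backend if/else out of layer code: each layer holds a backend object, and kernel-specific details stay behind the interface.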

        output_zeros=False,
    )
else:
    if self.rotary_emb is not None:
Collaborator:
Did this pass get missed?

Contributor Author (@PerryZhang01):
Yeah. Do we currently have any model that uses reshape_and_cache_with_pertoken_quant? If not, we don't want to introduce a new dispatch or if/else branch; if it becomes necessary, we can add it then.
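
For context, the branch being weighed would look roughly like the sketch below. Only reshape_and_cache_with_pertoken_quant is named in the thread; the wrapper function, the ops handle, the use_pertoken_quant flag, and the argument list are all hypothetical placeholders.

```python
# Hedged sketch of the dispatch branch under discussion. Only
# reshape_and_cache_with_pertoken_quant is named in the thread; the
# wrapper, the ops handle, the flag, and the argument list are
# hypothetical placeholders, not this repository's actual signatures.
def write_kv_cache(ops, key, value, key_cache, value_cache,
                   slot_mapping, use_pertoken_quant: bool) -> None:
    if use_pertoken_quant:
        # per-token quantized KV-cache write path
        ops.reshape_and_cache_with_pertoken_quant(
            key, value, key_cache, value_cache, slot_mapping)
    else:
        # default (non-quantized) write path
        ops.reshape_and_cache(
            key, value, key_cache, value_cache, slot_mapping)
```

The author's point is that this branch only earns its keep once a model actually exercises the quantized path.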

self.reduce_indptr = reduce_indptr
self.reduce_final_map = reduce_final_map
self.reduce_partial_map = reduce_partial_map
if block_tables_converted is not None:
Collaborator:
Keep these.

Contributor Author (@PerryZhang01):
These lines were accidentally deleted during a rebase onto main; I will restore them.
