GroupQueryAttention produces incorrect results when loaded from mxr #3596
Comments
The IR printout from the trimmed model is the same for both the compiled and the loaded program.
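For reference, a minimal sketch of how the two printouts can be compared, assuming the MIGraphX Python bindings; the file names are placeholders:

```python
import migraphx

# Parse and compile the trimmed model (file name is hypothetical).
prog = migraphx.parse_onnx("gqa_trimmed.onnx")
prog.compile(migraphx.get_target("gpu"))

# Round-trip the compiled program through the .mxr serialization.
migraphx.save(prog, "gqa_trimmed.mxr")
loaded = migraphx.load("gqa_trimmed.mxr")

# Both programs print identical IR, even though their runtime results differ.
print(prog)
print(loaded)
```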
I observed that the elements of the query tensor in the compute_attention_probabilities kernel were all shifted forward by 4 elements. After printing the pointer addresses, it appears that, for some reason, the pointer of the query tensor_view points to the address of the seqlens_k tensor, but only in the loaded model.
I know why this fails. We don't serialize the
#3613 might fix this issue if you want to try it out.
This issue occurs in the llama2 fp16 and int4-weight models, as well as in a trimmed model that returns after the first GQA node.
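A sketch of a reproduction along these lines, again assuming the Python bindings; the model path is a placeholder, and generate_argument is assumed to mirror the C++ helper that fills parameters with deterministic data:

```python
import numpy as np
import migraphx

# Compile directly from ONNX, then round-trip through .mxr.
compiled = migraphx.parse_onnx("gqa_trimmed.onnx")
compiled.compile(migraphx.get_target("gpu"))
migraphx.save(compiled, "gqa_trimmed.mxr")
loaded = migraphx.load("gqa_trimmed.mxr")

# Feed both programs the same inputs.
params = {}
for name, shape in compiled.get_parameter_shapes().items():
    params[name] = migraphx.generate_argument(shape)

out_compiled = np.array(compiled.run(params)[-1])
out_loaded = np.array(loaded.run(params)[-1])

# Expected True; per this issue, the .mxr round-trip makes it False
# for models containing a GroupQueryAttention node.
print(np.allclose(out_compiled, out_loaded))
```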