
GroupQueryAttention produces incorrect results when loaded from mxr #3596

Open
turneram opened this issue Nov 6, 2024 · 4 comments

turneram commented Nov 6, 2024

This issue occurs in the llama2 fp16 and int4-weights models, as well as in a trimmed model that returns immediately after the first GroupQueryAttention (GQA) node: the output is correct when the program is compiled directly, but incorrect after the program is saved to and reloaded from an .mxr file.
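
For reference, a minimal reproduction sketch using the MIGraphX Python module (the ONNX path, input values, and choice of output index below are placeholders based on the trimmed model and may need adjusting):

```python
import numpy as np
import migraphx

onnx_path = "llama2_gqa_trimmed.onnx"   # placeholder path for the trimmed model
mxr_path = "llama2_gqa_trimmed.mxr"

# Compile directly from ONNX.
compiled = migraphx.parse_onnx(onnx_path)
compiled.compile(migraphx.get_target("gpu"))

# Save the compiled program and load it back.
migraphx.save(compiled, mxr_path)
loaded = migraphx.load(mxr_path)

# Identical inputs for both programs (shapes match the trimmed model's parameters).
inputs = {
    "input_ids": migraphx.argument(np.random.randint(0, 32000, (1, 9)).astype(np.int64)),
    "attention_mask": migraphx.argument(np.ones((1, 9), dtype=np.int64)),
    "past_key_values.0.key": migraphx.argument(np.zeros((1, 32, 4096, 128), dtype=np.float16)),
    "past_key_values.0.value": migraphx.argument(np.zeros((1, 32, 4096, 128), dtype=np.float16)),
}

# Assuming the last program output is the GQA result (as in the IR printed below).
out_compiled = np.array(compiled.run(inputs)[-1])
out_loaded = np.array(loaded.run(inputs)[-1])

# Expected: identical outputs; observed: the loaded program diverges.
diff = np.abs(out_compiled.astype(np.float32) - out_loaded.astype(np.float32)).max()
print("max abs diff:", diff)
```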

@turneram turneram self-assigned this Nov 6, 2024

turneram commented Nov 6, 2024

The IR printout from the trimmed model is the same for both the compiled and the loaded program:

module: "main"
@0 = check_context::migraphx::gpu::context -> float_type, {}, {}
@1 = hip::hip_allocate_memory[shape=int8_type, {69763080}, {1},id=main:scratch] -> int8_type, {69763080}, {1}
@2 = hip::hip_copy_literal[id=main:@literal:0] -> half_type, {1, 4096, 12288}, {0, 12288, 1}
@3 = hip::hip_copy_literal[id=main:@literal:3] -> half_type, {32000, 4096}, {4096, 1}
@4 = load[offset=112,end=184](@1) -> int64_type, {1, 9}, {9, 1}
input_ids = @param:input_ids -> int64_type, {1, 9}, {9, 1}
@6 = hip::copy_to_gpu(input_ids,@4) -> int64_type, {1, 9}, {9, 1}
@7 = load[offset=72,end=108](@1) -> float_type, {1, 9}, {9, 1}
@8 = gpu::code_object[code_object=9168,symbol_name=convert_kernel,global=9,local=1024,](@6,@7) -> float_type, {1, 9}, {9, 1}
@9 = load[offset=147456,end=221184](@1) -> half_type, {1, 9, 4096}, {36864, 4096, 1}
@10 = gpu::code_object[code_object=9168,symbol_name=gather_kernel,global=36864,local=1024,](@3,@8,@9) -> half_type, {1, 9, 4096}, {36864, 4096, 1}
@11 = load[offset=73728,end=147456](@1) -> half_type, {1, 9, 4096}, {36864, 4096, 1}
@12 = gpu::code_object[code_object=10208,symbol_name=convert_mul_mul_reduce_sum_convert_add_rsqrt_mul_kernel,global=2304,local=256,](@10,@11) -> half_type, {1, 9, 4096}, {36864, 4096, 1}
@13 = load[offset=67330056,end=67551240](@1) -> half_type, {1, 9, 12288}, {110592, 12288, 1}
@14 = gpu::gemm[alpha=1,beta=0,compute_fp32=1,trans_batch=0,solution_idx=0](@12,@2,@13) -> half_type, {1, 9, 12288}, {110592, 12288, 1}
past_key_values.0.value = @param:past_key_values.0.value -> half_type, {1, 32, 4096, 128}, {16777216, 524288, 128, 1}
@16 = load[offset=33554432,end=67108864](@1) -> half_type, {1, 32, 4096, 128}, {16777216, 524288, 128, 1}
@17 = hip::copy_to_gpu(past_key_values.0.value,@16) -> half_type, {1, 32, 4096, 128}, {16777216, 524288, 128, 1}
@18 = hip::hip_copy_literal[id=main:@literal:1] -> half_type, {4096, 64}, {64, 1}
@19 = hip::hip_copy_literal[id=main:@literal:2] -> half_type, {4096, 64}, {64, 1}
past_key_values.0.key = @param:past_key_values.0.key -> half_type, {1, 32, 4096, 128}, {16777216, 524288, 128, 1}
@21 = load[offset=0,end=33554432](@1) -> half_type, {1, 32, 4096, 128}, {16777216, 524288, 128, 1}
@22 = hip::copy_to_gpu(past_key_values.0.key,@21) -> half_type, {1, 32, 4096, 128}, {16777216, 524288, 128, 1}
attention_mask = @param:attention_mask -> int64_type, {1, 9}, {9, 1}
@24 = load[offset=67108872,end=67108944](@1) -> int64_type, {1, 9}, {9, 1}
@25 = hip::copy_to_gpu(attention_mask,@24) -> int64_type, {1, 9}, {9, 1}
@26 = load[offset=67108864,end=67108868](@1) -> int32_type, {1, 1}, {1, 1}
@27 = gpu::code_object[code_object=9344,symbol_name=convert_reduce_sum_add_convert_kernel,global=16,local=64,](@25,@26) -> int32_type, {1, 1}, {1, 1}
@28 = hip::copy_from_gpu(@17) -> half_type, {1, 32, 4096, 128}, {16777216, 524288, 128, 1}
@29 = hip::copy_from_gpu(@22) -> half_type, {1, 32, 4096, 128}, {16777216, 524288, 128, 1}
@30 = reshape_lazy[dims={1, 9, 96, 128}](@14) -> half_type, {1, 9, 96, 128}, {110592, 12288, 128, 1}
@31 = transpose[permutation={0, 2, 1, 3}](@30) -> half_type, {1, 96, 9, 128}, {110592, 128, 12288, 1}
@32 = load[offset=67108872,end=67330056](@1) -> half_type, {1, 96, 9, 128}, {110592, 128, 12288, 1}
@33 = gpu::code_object[code_object=10920,symbol_name=gqa_rotary_embedding_kernel,global=110592,local=1024,](@31,@27,@19,@18,@32) -> half_type, {1, 96, 9, 128}, {110592, 128, 12288, 1}
@34 = gpu::code_object[code_object=9768,symbol_name=concat_past_present_kernel,global=73728,local=1024,](@33,@22,@17,@27) -> half_type, {1, 96, 9, 128}, {110592, 128, 12288, 1}
@35 = load[offset=67330056,end=69689352](@1) -> half_type, {1, 32, 9, 4096}, {1179648, 36864, 4096, 1}
@36 = gpu::code_object[code_object=12288,symbol_name=compute_attention_probabilities_kernel,global=466944,local=1024,](@34,@22,@17,@27,@35) -> half_type, {1, 32, 9, 4096}, {1179648, 36864, 4096, 1}
@37 = gpu::code_object[code_object=11224,symbol_name=gqa_softmax_kernel,global=288,local=1024,](@33,@22,@36,@27,@36) -> half_type, {1, 32, 9, 4096}, {1179648, 36864, 4096, 1}
@38 = load[offset=69689352,end=69763080](@1) -> half_type, {1, 9, 4096}, {36864, 4096, 1}
@39 = gpu::code_object[code_object=9904,symbol_name=compute_attention_scores_kernel,global=36864,local=1024,](@33,@22,@17,@27,@37,@38) -> half_type, {1, 9, 4096}, {36864, 4096, 1}
@40 = hip::copy_from_gpu(@39) -> half_type, {1, 9, 4096}, {36864, 4096, 1}
@41 = hip::sync_stream(@29,@28,@40) -> half_type, {1, 32, 4096, 128}, {16777216, 524288, 128, 1}
@42 = @return(@41,@28,@40)
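
The comparison above can be reproduced by dumping the IR of both programs from Python (a short sketch reusing the `compiled` and `loaded` programs from the reproduction snippet, and assuming that printing a program emits its IR as text, as in the listing above):

```python
# Dump and diff the IR text of the compiled and loaded programs.
compiled_ir = str(compiled)
loaded_ir = str(loaded)
print("IR identical:", compiled_ir == loaded_ir)  # True for the trimmed model, even though results differ
```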


turneram commented Nov 6, 2024

I observed that the elements of the query tensor in the compute_attention_probabilities kernel were all shifted forward by 4 elements. After printing the pointer addresses, it appears that the pointer of the query tensor_view points to the address of the seqlens_k tensor, but only in the loaded model. The 8-byte difference between the two addresses is exactly 4 half_type elements, which matches the observed shift.

| Kernel | Tensor | Compiled model address | Loaded model address |
|---|---|---|---|
| Rotary | Output | 0x7f215ba00008 | 0x7f8f33000008 |
| Rotary | SeqLens | 0x7f215ba00000 | 0x7f8f33000000 |
| Concat | Query | 0x7f215ba00008 | 0x7f8f33000008 |
| Concat | SeqLens | 0x7f215ba00000 | 0x7f8f33000000 |
| Compute probabilities | Query | 0x7f215ba00008 | 0x7f8f33000000 |
| Compute probabilities | SeqLens | 0x7f215ba00000 | 0x7f8f33000000 |
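
As a plain NumPy sketch of the arithmetic (no MIGraphX involved): reading a half_type buffer from an address 8 bytes before its true start makes every element appear shifted forward by 8 / sizeof(half) = 4 positions, matching the observation above.

```python
import numpy as np

# Stand-in for the scratch buffer: the seqlens tensor occupies the first 8 bytes
# (4 half elements) and the query tensor starts at byte offset 8.
scratch = np.arange(16, dtype=np.float16)

query_correct = scratch[4:]    # view starting at the query's true offset (0x...008)
query_misread = scratch[:-4]   # same length, but starting 8 bytes early (0x...000)

print(query_correct[:4])       # [4. 5. 6. 7.]
print(query_misread[:4])       # [0. 1. 2. 3.] -> query data appears shifted forward by 4 elements
```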


pfultz2 commented Nov 11, 2024

I know why this fails. We don't serialize the output_arg in the code_object op, so when the program is loaded, the concat_past_present_kernel will return the last argument rather than the first argument.
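
A conceptual Python sketch of that failure mode (not MIGraphX internals; the field names and the -1 "use the last argument" default are assumptions for illustration): any field omitted from an op's serialized form silently reverts to its default when the program is loaded back from .mxr.

```python
import json

class CodeObjectOpSketch:
    """Toy stand-in for an op whose serializer forgets a field."""

    def __init__(self, symbol_name, output_arg=-1):
        self.symbol_name = symbol_name
        self.output_arg = output_arg   # -1 stands for "return the last argument"

    def to_value(self):
        # Bug: output_arg is not written out.
        return {"symbol_name": self.symbol_name}

    @classmethod
    def from_value(cls, v):
        # output_arg is absent from the serialized value, so it falls back to the default.
        return cls(v["symbol_name"])

op = CodeObjectOpSketch("concat_past_present_kernel", output_arg=0)
roundtripped = CodeObjectOpSketch.from_value(json.loads(json.dumps(op.to_value())))
print(op.output_arg, roundtripped.output_arg)   # 0 vs -1: the loaded op returns the wrong argument
```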


pfultz2 commented Nov 11, 2024

#3613 might fix this issue if you want to try it out.
