
GroupQueryAttention produces incorrect results when loaded from mxr #3596

Open
turneram opened this issue Nov 6, 2024 · 4 comments

turneram commented Nov 6, 2024

This issue occurs in the llama2 fp16 and int4-weights models, as well as in a trimmed model that returns immediately after the first GroupQueryAttention (GQA) node: the output is correct when the program is compiled directly, but incorrect after the program is saved to and reloaded from an .mxr file.
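
For reference, a minimal reproduction sketch using the MIGraphX Python module (the ONNX path, input values, and choice of output index below are placeholders based on the trimmed model and may need adjusting):

```python
import numpy as np
import migraphx

onnx_path = "llama2_gqa_trimmed.onnx"   # placeholder path for the trimmed model
mxr_path = "llama2_gqa_trimmed.mxr"

# Compile directly from ONNX.
compiled = migraphx.parse_onnx(onnx_path)
compiled.compile(migraphx.get_target("gpu"))

# Save the compiled program and load it back.
migraphx.save(compiled, mxr_path)
loaded = migraphx.load(mxr_path)

# Identical inputs for both programs (shapes match the trimmed model's parameters).
inputs = {
    "input_ids": migraphx.argument(np.random.randint(0, 32000, (1, 9)).astype(np.int64)),
    "attention_mask": migraphx.argument(np.ones((1, 9), dtype=np.int64)),
    "past_key_values.0.key": migraphx.argument(np.zeros((1, 32, 4096, 128), dtype=np.float16)),
    "past_key_values.0.value": migraphx.argument(np.zeros((1, 32, 4096, 128), dtype=np.float16)),
}

# Assuming the last program output is the GQA result (as in the IR printed below).
out_compiled = np.array(compiled.run(inputs)[-1])
out_loaded = np.array(loaded.run(inputs)[-1])

# Expected: identical outputs; observed: the loaded program diverges.
diff = np.abs(out_compiled.astype(np.float32) - out_loaded.astype(np.float32)).max()
print("max abs diff:", diff)
```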

@turneram turneram self-assigned this Nov 6, 2024

turneram commented Nov 6, 2024

The IR printout from the trimmed model is the same for both the compiled and the loaded program:

module: "main"
@0 = check_context::migraphx::gpu::context -> float_type, {}, {}
@1 = hip::hip_allocate_memory[shape=int8_type, {69763080}, {1},id=main:scratch] -> int8_type, {69763080}, {1}
@2 = hip::hip_copy_literal[id=main:@literal:0] -> half_type, {1, 4096, 12288}, {0, 12288, 1}
@3 = hip::hip_copy_literal[id=main:@literal:3] -> half_type, {32000, 4096}, {4096, 1}
@4 = load[offset=112,end=184](@1) -> int64_type, {1, 9}, {9, 1}
input_ids = @param:input_ids -> int64_type, {1, 9}, {9, 1}
@6 = hip::copy_to_gpu(input_ids,@4) -> int64_type, {1, 9}, {9, 1}
@7 = load[offset=72,end=108](@1) -> float_type, {1, 9}, {9, 1}
@8 = gpu::code_object[code_object=9168,symbol_name=convert_kernel,global=9,local=1024,](@6,@7) -> float_type, {1, 9}, {9, 1}
@9 = load[offset=147456,end=221184](@1) -> half_type, {1, 9, 4096}, {36864, 4096, 1}
@10 = gpu::code_object[code_object=9168,symbol_name=gather_kernel,global=36864,local=1024,](@3,@8,@9) -> half_type, {1, 9, 4096}, {36864, 4096, 1}
@11 = load[offset=73728,end=147456](@1) -> half_type, {1, 9, 4096}, {36864, 4096, 1}
@12 = gpu::code_object[code_object=10208,symbol_name=convert_mul_mul_reduce_sum_convert_add_rsqrt_mul_kernel,global=2304,local=256,](@10,@11) -> half_type, {1, 9, 4096}, {36864, 4096, 1}
@13 = load[offset=67330056,end=67551240](@1) -> half_type, {1, 9, 12288}, {110592, 12288, 1}
@14 = gpu::gemm[alpha=1,beta=0,compute_fp32=1,trans_batch=0,solution_idx=0](@12,@2,@13) -> half_type, {1, 9, 12288}, {110592, 12288, 1}
past_key_values.0.value = @param:past_key_values.0.value -> half_type, {1, 32, 4096, 128}, {16777216, 524288, 128, 1}
@16 = load[offset=33554432,end=67108864](@1) -> half_type, {1, 32, 4096, 128}, {16777216, 524288, 128, 1}
@17 = hip::copy_to_gpu(past_key_values.0.value,@16) -> half_type, {1, 32, 4096, 128}, {16777216, 524288, 128, 1}
@18 = hip::hip_copy_literal[id=main:@literal:1] -> half_type, {4096, 64}, {64, 1}
@19 = hip::hip_copy_literal[id=main:@literal:2] -> half_type, {4096, 64}, {64, 1}
past_key_values.0.key = @param:past_key_values.0.key -> half_type, {1, 32, 4096, 128}, {16777216, 524288, 128, 1}
@21 = load[offset=0,end=33554432](@1) -> half_type, {1, 32, 4096, 128}, {16777216, 524288, 128, 1}
@22 = hip::copy_to_gpu(past_key_values.0.key,@21) -> half_type, {1, 32, 4096, 128}, {16777216, 524288, 128, 1}
attention_mask = @param:attention_mask -> int64_type, {1, 9}, {9, 1}
@24 = load[offset=67108872,end=67108944](@1) -> int64_type, {1, 9}, {9, 1}
@25 = hip::copy_to_gpu(attention_mask,@24) -> int64_type, {1, 9}, {9, 1}
@26 = load[offset=67108864,end=67108868](@1) -> int32_type, {1, 1}, {1, 1}
@27 = gpu::code_object[code_object=9344,symbol_name=convert_reduce_sum_add_convert_kernel,global=16,local=64,](@25,@26) -> int32_type, {1, 1}, {1, 1}
@28 = hip::copy_from_gpu(@17) -> half_type, {1, 32, 4096, 128}, {16777216, 524288, 128, 1}
@29 = hip::copy_from_gpu(@22) -> half_type, {1, 32, 4096, 128}, {16777216, 524288, 128, 1}
@30 = reshape_lazy[dims={1, 9, 96, 128}](@14) -> half_type, {1, 9, 96, 128}, {110592, 12288, 128, 1}
@31 = transpose[permutation={0, 2, 1, 3}](@30) -> half_type, {1, 96, 9, 128}, {110592, 128, 12288, 1}
@32 = load[offset=67108872,end=67330056](@1) -> half_type, {1, 96, 9, 128}, {110592, 128, 12288, 1}
@33 = gpu::code_object[code_object=10920,symbol_name=gqa_rotary_embedding_kernel,global=110592,local=1024,](@31,@27,@19,@18,@32) -> half_type, {1, 96, 9, 128}, {110592, 128, 12288, 1}
@34 = gpu::code_object[code_object=9768,symbol_name=concat_past_present_kernel,global=73728,local=1024,](@33,@22,@17,@27) -> half_type, {1, 96, 9, 128}, {110592, 128, 12288, 1}
@35 = load[offset=67330056,end=69689352](@1) -> half_type, {1, 32, 9, 4096}, {1179648, 36864, 4096, 1}
@36 = gpu::code_object[code_object=12288,symbol_name=compute_attention_probabilities_kernel,global=466944,local=1024,](@34,@22,@17,@27,@35) -> half_type, {1, 32, 9, 4096}, {1179648, 36864, 4096, 1}
@37 = gpu::code_object[code_object=11224,symbol_name=gqa_softmax_kernel,global=288,local=1024,](@33,@22,@36,@27,@36) -> half_type, {1, 32, 9, 4096}, {1179648, 36864, 4096, 1}
@38 = load[offset=69689352,end=69763080](@1) -> half_type, {1, 9, 4096}, {36864, 4096, 1}
@39 = gpu::code_object[code_object=9904,symbol_name=compute_attention_scores_kernel,global=36864,local=1024,](@33,@22,@17,@27,@37,@38) -> half_type, {1, 9, 4096}, {36864, 4096, 1}
@40 = hip::copy_from_gpu(@39) -> half_type, {1, 9, 4096}, {36864, 4096, 1}
@41 = hip::sync_stream(@29,@28,@40) -> half_type, {1, 32, 4096, 128}, {16777216, 524288, 128, 1}
@42 = @return(@41,@28,@40)
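
The comparison above can be reproduced by dumping the IR of both programs from Python (a short sketch reusing the `compiled` and `loaded` programs from the reproduction snippet, and assuming that printing a program emits its IR as text, as in the listing above):

```python
# Dump and diff the IR text of the compiled and loaded programs.
compiled_ir = str(compiled)
loaded_ir = str(loaded)
print("IR identical:", compiled_ir == loaded_ir)  # True for the trimmed model, even though results differ
```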


turneram commented Nov 6, 2024

I observed that the elements of the query tensor in the compute_attention_probabilities kernel were all shifted forward by 4 elements. After printing the pointer addresses, it appears that the pointer of the query tensor_view points to the address of the seqlens_k tensor, but only in the loaded model. The 8-byte difference between the two addresses is exactly 4 half_type elements, which matches the observed shift.

| Kernel | Tensor | Compiled model address | Loaded model address |
|---|---|---|---|
| Rotary | Output | 0x7f215ba00008 | 0x7f8f33000008 |
| Rotary | SeqLens | 0x7f215ba00000 | 0x7f8f33000000 |
| Concat | Query | 0x7f215ba00008 | 0x7f8f33000008 |
| Concat | SeqLens | 0x7f215ba00000 | 0x7f8f33000000 |
| Compute probabilities | Query | 0x7f215ba00008 | 0x7f8f33000000 |
| Compute probabilities | SeqLens | 0x7f215ba00000 | 0x7f8f33000000 |
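
As a plain NumPy sketch of the arithmetic (no MIGraphX involved): reading a half_type buffer from an address 8 bytes before its true start makes every element appear shifted forward by 8 / sizeof(half) = 4 positions, matching the observation above.

```python
import numpy as np

# Stand-in for the scratch buffer: the seqlens tensor occupies the first 8 bytes
# (4 half elements) and the query tensor starts at byte offset 8.
scratch = np.arange(16, dtype=np.float16)

query_correct = scratch[4:]    # view starting at the query's true offset (0x...008)
query_misread = scratch[:-4]   # same length, but starting 8 bytes early (0x...000)

print(query_correct[:4])       # [4. 5. 6. 7.]
print(query_misread[:4])       # [0. 1. 2. 3.] -> query data appears shifted forward by 4 elements
```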


pfultz2 commented Nov 11, 2024

I know why this fails. We don't serialize the output_arg in the code_object op, so when the program is loaded, the concat_past_present_kernel will return the last argument rather than the first argument.
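
A conceptual Python sketch of that failure mode (not MIGraphX internals; the field names and the -1 "use the last argument" default are assumptions for illustration): any field omitted from an op's serialized form silently reverts to its default when the program is loaded back from .mxr.

```python
import json

class CodeObjectOpSketch:
    """Toy stand-in for an op whose serializer forgets a field."""

    def __init__(self, symbol_name, output_arg=-1):
        self.symbol_name = symbol_name
        self.output_arg = output_arg   # -1 stands for "return the last argument"

    def to_value(self):
        # Bug: output_arg is not written out.
        return {"symbol_name": self.symbol_name}

    @classmethod
    def from_value(cls, v):
        # output_arg is absent from the serialized value, so it falls back to the default.
        return cls(v["symbol_name"])

op = CodeObjectOpSketch("concat_past_present_kernel", output_arg=0)
roundtripped = CodeObjectOpSketch.from_value(json.loads(json.dumps(op.to_value())))
print(op.output_arg, roundtripped.output_arg)   # 0 vs -1: the loaded op returns the wrong argument
```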


pfultz2 commented Nov 11, 2024

#3613 might fix this issue if you want to try it out.
