Accuracy Issue for Sharded Llama #19948
Add a warning for sharded llama accuracy until iree-org/iree#19948 is resolved.
This reverts the changes to `SchedulingExecution.cpp` from commit 4fffb0e. That change caused sharded llama models to output corrupt tokens (iree-org#19948)
Collected two new traces from the shortfin server: one with the specified commit, which produces corrupt tokens, and one with the commit reverted, which produces good token output. Not sure if there are any insights that can be gleaned from these:
Good Output
Prompt - 0:
<|begin_of_text|>Name the capital of the United States.<|eot_id|>
Response:
data: assistant
The capital of the United States is Washington, D.C.
--------------------------------------------------
Bad Output
Prompt - 0:
<|begin_of_text|>Name the capital of the United States.<|eot_id|>
Response:
data: assistant
://://://://://://://://://://://://://_REF
I
I
I
--------------------------------------------------
Expanding on:
Prompt - 0:
<|begin_of_text|>Name the capital of the United States.<|eot_id|>
Response:
data: assistant
://://://://://://://://://://://://://_REF
I
I
I
--------------------------------------------------
The bad tokens start to appear during the 3rd decode invocation. The outputs, in order, are:
Prefill: 128006 - empty string (good)
From there it repeats for a while, until it hits the tokens at the end of the list above, which are also nonsensical. The good tokens, by contrast, had no repetitions or discernible patterns in the output. As you can see, at the 3rd decode step the outputs start to differ (1129 vs 791).
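To pin down exactly where the two traces diverge, a small script can walk the token streams in lockstep. This is a minimal sketch, assuming the good and bad per-step token IDs have been dumped to `good_tokens.npy` and `bad_tokens.npy` (hypothetical file names; one token ID per decode step):

```python
import numpy as np

# Hypothetical dumps of the per-step token IDs from the two shortfin traces;
# the file names and layout (one token ID per step) are assumptions.
good = np.load("good_tokens.npy")
bad = np.load("bad_tokens.npy")

# Walk the two streams in lockstep and report the first decode step where
# they diverge (expected around the 3rd decode step: 1129 vs 791).
for step in range(min(len(good), len(bad))):
    if good[step] != bad[step]:
        print(f"first divergence at decode step {step}: "
              f"good={good[step]} bad={bad[step]}")
        break
else:
    print("no divergence found in the compared range")
```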
What happened?
The shortfin server is showing corrupt outputs at HEAD of IREE when running llama3.1_8b_tp8 (nod-ai/shark-ai#934). I was able to bisect this to 4fffb0e.
I'm not sure of a good way to reproduce this or to provide a good signal on the IREE side. I'll include steps to invoke iree-run-module while I investigate, in case you're already aware of how to do it. Otherwise, let me know if there's a better reproduction method I could provide.
Steps to reproduce your issue
The weights are at /data/llama3.1/weights/8b/fp16/tp8 on mi300x-3, or /shark_dev/data/llama3.1/weights/8b/fp16/tp8 on mi300x.
```
mkdir 8b_short_inputs
cd 8b_short_inputs
wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_8b_tp8/inputs/prefill/iree_issue_corrupt_shards/tokens.npy
wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_8b_tp8/inputs/prefill/iree_issue_corrupt_shards/seq_ids.npy
wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_8b_tp8/inputs/prefill/iree_issue_corrupt_shards/seq_block_ids.npy
wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_8b_tp8/inputs/prefill/iree_issue_corrupt_shards/cache_state_shard_0.npy
wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_8b_tp8/inputs/prefill/iree_issue_corrupt_shards/cache_state_shard_1.npy
wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_8b_tp8/inputs/prefill/iree_issue_corrupt_shards/cache_state_shard_2.npy
wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_8b_tp8/inputs/prefill/iree_issue_corrupt_shards/cache_state_shard_3.npy
wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_8b_tp8/inputs/prefill/iree_issue_corrupt_shards/cache_state_shard_4.npy
wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_8b_tp8/inputs/prefill/iree_issue_corrupt_shards/cache_state_shard_5.npy
wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_8b_tp8/inputs/prefill/iree_issue_corrupt_shards/cache_state_shard_6.npy
wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_8b_tp8/inputs/prefill/iree_issue_corrupt_shards/cache_state_shard_7.npy
```
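Before running the module, it can help to sanity-check the downloaded inputs. A minimal sketch using numpy (file names match the downloads above):

```python
import numpy as np

# Print shape and dtype of each downloaded prefill input as a sanity check.
names = ["tokens", "seq_ids", "seq_block_ids"] + [
    f"cache_state_shard_{i}" for i in range(8)
]
for name in names:
    arr = np.load(f"{name}.npy")
    print(f"{name}.npy: shape={arr.shape} dtype={arr.dtype}")
```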
```
iree-compile llama3.1_8b_tp8.mlir \
  -o llama3.1_8b_tp8.vmfb \
  --iree-hal-target-device=hip[0] \
  --iree-hal-target-device=hip[1] \
  --iree-hal-target-device=hip[2] \
  --iree-hal-target-device=hip[3] \
  --iree-hal-target-device=hip[4] \
  --iree-hal-target-device=hip[5] \
  --iree-hal-target-device=hip[6] \
  --iree-hal-target-device=hip[7] \
  --iree-hip-target=gfx942 \
  --iree-dispatch-creation-enable-aggressive-fusion=true \
  --iree-global-opt-propagate-transposes=true \
  --iree-opt-aggressively-propagate-transposes=true \
  --iree-opt-data-tiling=false \
  --iree-preprocessing-pass-pipeline='builtin.module(util.func(iree-preprocessing-generalize-linalg-matmul-experimental))' \
  --iree-hal-indirect-command-buffers=true \
  --iree-stream-resource-memory-model=discrete \
  --iree-hal-memoization=true \
  --iree-opt-strip-assertions
```
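A rough sketch of what the iree-run-module invocation with the inputs above might look like. The entry-point name (prefill_bs4), the input ordering, and the parameter file path are all assumptions; check the exported MLIR and the local weights directory for the real values:

```
# Sketch only: function name, input order, and .irpa path are assumptions.
iree-run-module \
  --module=llama3.1_8b_tp8.vmfb \
  --device=hip://0 \
  --device=hip://1 \
  --device=hip://2 \
  --device=hip://3 \
  --device=hip://4 \
  --device=hip://5 \
  --device=hip://6 \
  --device=hip://7 \
  --parameters=model=/data/llama3.1/weights/8b/fp16/tp8/model.irpa \
  --function=prefill_bs4 \
  --input=@tokens.npy \
  --input=@seq_ids.npy \
  --input=@seq_block_ids.npy \
  --input=@cache_state_shard_0.npy \
  --input=@cache_state_shard_1.npy \
  --input=@cache_state_shard_2.npy \
  --input=@cache_state_shard_3.npy \
  --input=@cache_state_shard_4.npy \
  --input=@cache_state_shard_5.npy \
  --input=@cache_state_shard_6.npy \
  --input=@cache_state_shard_7.npy
```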
What component(s) does this issue relate to?
No response
Version information
0781072
Additional context
No response