Bad results when serving Llama3.1_8b_f16 with shortfin #934
Comments
@stbaione have you seen this before?
This is new to me, as of this morning when Alex found it. I'm looking into it.
At HEAD for shark-ai/IREE, using the gguf from our user docs:

Unsharded:

```bash
curl http://localhost:8003/generate -H "Content-Type: application/json" -d '{
    "text": "Name the capital of the United States.",
    "sampling_params": {"max_completion_tokens": 50}
}'
```

```text
data: Washington D.C.
What is the capital of the United States? Washington D.C.
What is the capital of the United States? Washington D.C.
What is the capital of the United States? Washington D.C.
What is the capital of the United States
```

Sharded:

I got a memory access fault on this during decode.

This is new to me. The earliest I've heard this reported was this morning from Avi. He reported that the unsharded 8b outputs looked good in testing yesterday. My suspicion is that this regression is related to us updating our pinned IREE versions. The sharding issue seems like something else. Still looking into it...
If you have an easy repro, you could try a partial revert of 2c61420, just in
Looks like the output Avi got did actually show repetitiveness, so this may not specifically be related to that requirements bump. Alex found NaNs in the KV cache after prefill; the KV cache has no NaNs prior to prefill. Looking further.
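For reference, a minimal sketch of that kind of NaN check, assuming the KV cache pages can be dumped as a torch tensor (the tensor name and layout below are placeholders, not shortfin's actual API):

```python
import torch

def count_nans(kv_cache_pages: torch.Tensor) -> int:
    """Count NaN entries in a dumped KV cache tensor (hypothetical layout)."""
    return int(torch.isnan(kv_cache_pages).sum().item())

# Hypothetical usage: dump the cache pages before and after prefill and compare.
before_prefill = torch.zeros(8, 16, 32)   # stand-in for the real cache dump
after_prefill = before_prefill.clone()
print(count_nans(before_prefill), count_nans(after_prefill))
```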
Focusing on the repetitive unsharded output first. Rewound shark-ai to ff15379 with IREE at HEAD and got the following output:

```bash
curl http://localhost:8003/generate -H "Content-Type: application/json" -d '{
    "text": "Name the capital of the United States.",
    "sampling_params": {"max_completion_tokens": 50}
}'
```

```text
data: Washington D.C. is the capital of the United States. Washington D.C. is a special district that is not part of any state but is the capital of the United States. Washington D.C. is a federal district that serves as the seat of the
```

Something to note is that not all prompts are repetitive. For example:

```bash
curl http://localhost:8003/generate -H "Content-Type: application/json" -d '{
    "text": "Who is your favorite author?",
    "sampling_params": {"max_completion_tokens": 50}
}'
```

```text
data: I have a few favorite authors, but one of my all-time favorites is Jane Austen. I love her witty dialogue and her insight into the human heart. Her novels are like a warm hug on a cold day. I could read them over and over
```

...strange. Anyways, bisecting shark-ai.
The repetitive output bisects to d12e384. However, it's hard to determine whether this is actually a bug. That commit fixed a concurrency bug, and as mentioned, prompts other than "Name the capital of the United States." worked with it included; we'd need to test a larger sample to determine whether one version is actually more accurate than the other (a crude screening idea is sketched below). A single golden prompt doesn't really cut it for this. We have ml_perf_harness_llama, which would test against 8,131 samples and give us an accuracy measurement. That will be useful for measuring accuracy in the future, but it still needs a system to run on. Although, maybe I could kick off an 8b run locally over the weekend and get a definitive answer based on the results.

Looking at the sharding issue now. It is not caused by the same commit.
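Until a harness run is available, one crude way to screen a larger prompt sample for the repetitive failure mode above is an n-gram repeat check on the completions. The heuristic and thresholds below are my own, not part of ml_perf_harness_llama:

```python
def looks_repetitive(completion: str, n: int = 6, max_repeats: int = 2) -> bool:
    """Flag a completion if any n-word phrase occurs more than max_repeats times."""
    words = completion.split()
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return any(grams.count(g) > max_repeats for g in set(grams))

# The repetitive capital answer trips the check; the Jane Austen answer does not.
bad = ("Washington D.C. What is the capital of the United States? Washington D.C. "
       "What is the capital of the United States? Washington D.C. "
       "What is the capital of the United States? Washington D.C.")
good = ("I have a few favorite authors, but one of my all-time favorites is Jane Austen. "
        "I love her witty dialogue and her insight into the human heart.")
print(looks_repetitive(bad), looks_repetitive(good))  # True False
```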
For sharding (8b tp8), I got a similar issue to Alex with incoherent results on Sharkmi300x-3 from using these weights:
It seems like the weights I'm using are slightly different from
The corrupt output tokens for sharded models are due to iree-org/iree@4fffb0e. Will have an IREE issue filed for it.

On the repetitive-output commit on the shark-ai side, I'm attempting a local harness run against 8b_tp8 with the sharded fix applied. If the run is successful, we'll be able to see the accuracy score with and without the commit, over 7,000 samples.
IREE issue filed for the corrupt sharded outputs: iree-org/iree#19948
The shark-ai issue appears to not actually be an issue. I had a side quest where I noticed that the server was never actually outputting stop tokens. Turns out that you have to wrap your prompt in proper llama chat tags for the server to emit a stop token (see the examples below). On top of that, the llama3_8b_fp16 dataset from our docs ships a tokenizer config for the base 8b model while the weights are the instruct variant, so the configured stop token didn't match the one the model generates.

After using the proper config and retrying with tags applied, I got this:

```bash
curl http://localhost:8003/generate -H "Content-Type: application/json" -d '{
    "text": "<|begin_of_text|>Name the capital of the United States.<|eot_id|>",
    "sampling_params": {
        "max_completion_tokens": 100
    }
}'
```

```text
data: assistant
The capital of the United States is Washington, D.C.
```

```bash
curl http://localhost:8003/generate -H "Content-Type: application/json" -d '{
    "text": "<|begin_of_text|>Name the capital of the United States. Your answer should be a minimum of 100 characters.<|eot_id|>",
    "sampling_params": {
        "max_completion_tokens": 100
    }
}'
```

```text
data: assistant
The capital of the United States is Washington, D.C. (short for District of Columbia), which is a federal district located on the east coast of the country, along the Potomac River. It is home to many iconic landmarks, including the White House, the Capitol Building, and the Smithsonian museums.
```

If I were to remove the tags, the server would generate tokens until it reached the max_completion_tokens limit.

To be extra confident, I wrote a quick test script that sent 30 requests in parallel to the server, with tags applied, and logged the results (a sketch of that kind of script is below). They look great and can be seen here.

Takeaway: check model artifacts. This is the second time, in my experience, that we've had weird server behavior due to busted/incorrect artifacts. Another takeaway: always use llama chat tags. Need to update the examples in the docs. Along with this, the specified commit was a bug fix that fixed our concurrency tests and is logically correct, so it makes sense to leave it in.

Note: this is with iree-org/iree@4fffb0e reverted.
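For reference, a sketch along the lines of that parallel test (the actual script isn't reproduced here): the endpoint and request fields follow the curl examples above; the prompt set, concurrency, and everything else are assumptions.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8003/generate"  # shortfin server from the serving docs

def ask(prompt: str, max_tokens: int = 100) -> str:
    # Wrap the prompt in llama chat tags so the model emits its stop token.
    tagged = f"<|begin_of_text|>{prompt}<|eot_id|>"
    payload = json.dumps({
        "text": tagged,
        "sampling_params": {"max_completion_tokens": max_tokens},
    }).encode()
    req = urllib.request.Request(
        URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()

# 30 concurrent requests, logging each raw response.
prompts = ["Name the capital of the United States."] * 30
with ThreadPoolExecutor(max_workers=30) as pool:
    for i, result in enumerate(pool.map(ask, prompts)):
        print(f"[{i:02d}] {result.strip()}")
```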
PR created in IREE for sharding: iree-org/iree#19958
Use llama chat tags in example requests. More details on this can be found [here](#934 (comment))
…lama3_8b_fp16` dataset (#953)

The tokenizers specified for this dataset are for `llama3_8b_fp16`, while the model is `llama3_8b_fp16_instruct`. The `eos_token` for `8b` and `8b-instruct` are different:

```text
8b:
<|begin_of_text|> {generated_text} <|end_of_text|>
<|end_of_text|> - 128001

8b-Instruct:
<|begin_of_text|> {generated_text} <|eot_id|>
<|eot_id|> - 128009
```

Using the wrong config causes Llama to output text forever. Our model generated `128009`s, but the server doesn't recognize it as the proper stop token and keeps calling for generations. More details [here](#934 (comment))
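A quick way to catch this kind of mismatch, assuming the tokenizer is in Hugging Face format (the path below is a placeholder):

```python
from transformers import AutoTokenizer

# Placeholder path: point this at the tokenizer directory shipped with the dataset.
tok = AutoTokenizer.from_pretrained("/path/to/llama3.1-8b-instruct/tokenizer")

# For the instruct model this should print <|eot_id|> 128009;
# the base 8b tokenizer reports <|end_of_text|> 128001 instead.
print(tok.eos_token, tok.eos_token_id)
```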
Following the serving guide and running llama3.1_8b_f16 at tp8, I am getting incoherent results.
For example:
Produces: