Bad results when serving Llama3.1_8b_f16 with shortfin #934

Open
Alex-Vasile opened this issue Feb 7, 2025 · 12 comments

@Alex-Vasile
Contributor

Following the serving guide and running llama3.1_8b_f16 at tp8, I am getting incoherent results.

For example:

curl http://localhost:8000/generate \
    -H "Content-Type: application/json" \
    -d '{
        "text": "Name the capital of the United States.",
        "sampling_params": {"max_completion_tokens": 50}
    }'

Produces:

data: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
@dan-garvey
Member

@stbaione have you seen this before?

@stbaione
Contributor

stbaione commented Feb 7, 2025

@stbaione have you seen this before?

This is new to me as of this morning, when Alex found it. I'm looking into it.

@stbaione
Contributor

stbaione commented Feb 7, 2025

At HEAD for shark-ai/IREE, using the GGUF from our user docs:

Unsharded

curl http://localhost:8003/generate \
    -H "Content-Type: application/json" \
    -d '{
        "text": "Name the capital of the United States.",
        "sampling_params": {"max_completion_tokens": 50}
    }'

data:  Washington D.C.
What is the capital of the United States? Washington D.C.
What is the capital of the United States? Washington D.C.
What is the capital of the United States? Washington D.C.
What is the capital of the United States

Sharded

I got a memory access fault on decode for this one. This is new to me; the earliest I've heard it reported was this morning from Avi.

He reported that the unsharded 8b outputs looked good in testing yesterday. My suspicion is that this regression is related to us updating our pinned IREE versions. The sharding issue seems like something else. Still looking into it...

@ScottTodd
Member

My suspicion is that this regression is related to us updating our pinned IREE versions.

If you have an easy repro, you could try a partial revert of 2c61420, just in shortfin/CMakeLists.txt. That would switch shortfin to build against the older IREE runtime code while still using the latest IREE compiler code / .vmfb generation.
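
Roughly, that partial revert could look like this (a sketch; assumes the commit and path above are still current):

```bash
# Check shortfin/CMakeLists.txt out from the parent of the IREE version bump,
# leaving the rest of the tree (including the compiler pin) at HEAD.
git checkout 2c61420~1 -- shortfin/CMakeLists.txt

# Then rebuild shortfin against the older IREE runtime and rerun the serving repro.
```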

@stbaione
Contributor

stbaione commented Feb 7, 2025

Looks like the output Avi got did actually show repetitiveness, so this may not specifically be related to that requirements bump.

Alex found NaNs in the KV cache after prefill; there are no NaNs in the KV cache prior to prefill. Looking further.

@stbaione
Contributor

stbaione commented Feb 7, 2025

Focusing on the repetitive unsharded output first. Rewound shark-ai to ff15379 with IREE at HEAD.

Got the following output:

curl http://localhost:8003/generate \
    -H "Content-Type: application/json" \
    -d '{
        "text": "Name the capital of the United States.",
        "sampling_params": {"max_completion_tokens": 50}
    }'
data:  Washington D.C. is the capital of the United States. Washington D.C. is a special district that is not part of any state but is the capital of the United States. Washington D.C. is a federal district that serves as the seat of the

Something to note is that not all prompts produce repetitive output. For example:

curl http://localhost:8003/generate \
    -H "Content-Type: application/json" \
    -d '{
        "text": "Who is your favorite author?",
        "sampling_params": {"max_completion_tokens": 50}
    }'

data:  I have a few favorite authors, but one of my all-time favorites is Jane Austen. I love her witty dialogue and her insight into the human heart. Her novels are like a warm hug on a cold day. I could read them over and over

Strange. Anyway, bisecting shark-ai.
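
For reference, the bisect workflow is roughly this (a sketch; it uses ff15379 as the known-good commit, per the run above):

```bash
# Sketch: HEAD shows the repetitive output (bad); ff15379 gave the expected output (good).
git bisect start
git bisect bad HEAD
git bisect good ff15379
# At each step: rebuild/re-export, restart the server, rerun the curl request above,
# then mark the checked-out commit with `git bisect good` or `git bisect bad`.
# When git reports the first bad commit:
git bisect reset
```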

@stbaione
Contributor

stbaione commented Feb 8, 2025

The repetitive output bisects to d12e384.

However, it's hard to determine whether this is actually a bug. That commit fixed a concurrency bug, and, as mentioned, prompts other than "Name the capital of the United States." worked with it included. We would need to test a larger sample to determine whether one version is actually more accurate than the other; a single golden prompt doesn't really cut it for this.

We have ml_perf_harness_llama, which would test against 8,131 samples and give us an accuracy measurement. It will be useful for determining accuracy in the future, but it still needs a system to run on. That said, maybe I could kick off an 8b run locally over the weekend and get a definitive answer from the results.

Looking at the sharding issue now. It is not caused by the same commit.

@aviator19941
Collaborator

aviator19941 commented Feb 8, 2025

For sharding (8b tp8), I got a similar issue to Alex's, with incoherent results on Sharkmi300x-3, using these weights:

  --irpa-file=/shark-dev/8b/instruct/weights/tp8/llama3.1_8b_instruct_fp16_tp8.irpa

It seems like the weights I'm using are slightly different from /data/llama3.1/weights/8b/fp16/tp8/llama3.1_8b_fp16_tp8_parameters.irpa.
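
To confirm that the two parameter files actually differ, a quick check (sketch):

```bash
# Sketch: compare checksums of the two tp8 parameter files mentioned above.
sha256sum \
  /shark-dev/8b/instruct/weights/tp8/llama3.1_8b_instruct_fp16_tp8.irpa \
  /data/llama3.1/weights/8b/fp16/tp8/llama3.1_8b_fp16_tp8_parameters.irpa
```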

curl http://localhost:8081/generate \
    -H "Content-Type: application/json" \
    -d '{
        "text": "Name the capital of the United States.",
        "sampling_params": {"max_completion_tokens": 50}
    }'
data:  Washington!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! !.  I! !!!!!!!
.NameTheNameThecapital ofthe capital ofthe!!!!.!.!
curl http://localhost:8081/generate \
    -H "Content-Type: application/json" \
    -d '{
        "text": "1, 2, 3, 4, 5",
        "sampling_params": {"max_completion_tokens": 50}
    }'
data: , ://://://://://://://://://://://://://://://://://://://://://://://://://://://://://://://://://://://://://://://://://://://://://://://://://

@stbaione
Contributor

stbaione commented Feb 8, 2025

The corrupt output tokens for sharded models are due to iree-org/iree@4fffb0e.

Will have an iree-run-module repro & ticket filed by Monday morning.
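
The repro will be roughly of this shape (a sketch only; the device, paths, function name, and input below are placeholders, not the actual repro):

```bash
# Sketch of an iree-run-module invocation; every value here is a placeholder.
iree-run-module \
  --device=hip://0 \
  --module=/path/to/llama3.1_8b_fp16.vmfb \
  --parameters=model=/path/to/llama3.1_8b_fp16.irpa \
  --function=decode_bs4 \
  --input=@/path/to/decode_arg0.npy
# For the sharded (tp8) case, --device and --parameters would be repeated per shard;
# the exact layout depends on how the model was exported.
```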

On the repetitive-output commit on the shark-ai side, I'm attempting a local harness run against 8b_tp8 with the sharded fix applied. If the run is successful, we'll be able to see the accuracy score with and without the commit, over 7,000 samples.

@stbaione
Contributor

IREE issue filed for the corrupt sharded outputs: iree-org/iree#19948

@stbaione
Contributor

stbaione commented Feb 11, 2025

The shark-ai issue appears to not actually be an issue. I had a side quest where I noticed that the server was never actually outputting stop tokens. It turns out that you have to wrap your prompt in proper llama chat tags for the server to emit a stop token, for example <|begin_of_text|>Hello, how are you?<|eot_id|>. These tags are expected; the model would have been trained with them applied.

On top of that, llama3_8b_fp16 from our hf_datasets script actually has a llama3.1-8b tokenizer_config, NOT a llama3.1-8b-instruct tokenizer_config. I checked mi300x-3's configs for 8b and 405b, and they were both incorrect. These will need to be updated.
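
A quick way to tell which config you have is to check the eos token (a sketch, assuming the tokenizer_config.json stores eos_token as a plain string):

```bash
# Sketch: the instruct config should use <|eot_id|> (128009) as its stop token,
# while the base 8b config uses <|end_of_text|> (128001).
jq '.eos_token' tokenizer_config.json
```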

After using the proper config and retrying with tags applied, I got this:

curl http://localhost:8003/generate \
    -H "Content-Type: application/json" \
    -d '{
        "text": "<|begin_of_text|>Name the capital of the United States.<|eot_id|>",
        "sampling_params": {"max_completion_tokens": 100}
    }'
data: assistant
The capital of the United States is Washington, D.C.
curl http://localhost:8003/generate \
    -H "Content-Type: application/json" \
    -d '{
        "text": "<|begin_of_text|>Name the capital of the United States. Your answer should be a minimum of 100 characters.<|eot_id|>",
        "sampling_params": {"max_completion_tokens": 100}
    }'
data: assistant

The capital of the United States is Washington, D.C. (short for District of Columbia), which is a federal district located on the east coast of the country, along the Potomac River. It is home to many iconic landmarks, including the White House, the Capitol Building, and the Smithsonian museums.

If I were to remove the tags, the server would generate tokens until it reached the max_completion_tokens value.

To be extra confident, I wrote a quick test script that sent 30 requests in parallel to the server, with tags applied, and logged the results. They look great and can be seen here.
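
The parallel check amounts to something like this (a sketch of the idea, not the exact script):

```bash
# Sketch: fire 30 tagged requests at the server concurrently and save each response.
for i in $(seq 1 30); do
  curl -s http://localhost:8003/generate \
    -H "Content-Type: application/json" \
    -d '{
          "text": "<|begin_of_text|>Name the capital of the United States.<|eot_id|>",
          "sampling_params": {"max_completion_tokens": 100}
        }' > "response_${i}.txt" &
done
wait
cat response_*.txt
```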

Takeaways: check model artifacts. In my experience, this is the second time we've had weird server behavior due to busted/incorrect artifacts. Another takeaway: always use llama chat tags. We need to update the examples in the docs.

Along with this, the specified commit was a bug fix that fixed our concurrency tests and is logically correct, so it makes sense to leave it in.

Note

This is with iree-org/iree@4fffb0e reverted

@stbaione
Contributor

PR Created in IREE for sharding: iree-org/iree#19958

stbaione added a commit that referenced this issue Feb 11, 2025
Use llama chat tags in example requests.

More details on this can be found
[here](#934 (comment))
stbaione added a commit that referenced this issue Feb 11, 2025
…lama3_8b_fp16` dataset (#953)

The tokenizers specified for this dataset are for `llama3_8b_fp16`,
while the model is `llama3_8b_fp16_instruct`. The `eos_token` for `8b`
and `8b-instruct` are different:

```text
8b:

<|begin_of_text|> {generated_text} <|end_of_text|>

<|end_of_text|> - 128001

8b-Instruct:

<|begin_of_text|> {generated_text} <|eot_id|>


<|eot_id|> - 128009
```

Using the wrong config causes Llama to output text forever. Our model
generated `128009`s, but the server doesn't recognize it as the proper
stop token and keeps calling for generations.

More details
[here](#934 (comment))
monorimet pushed a commit that referenced this issue Feb 13, 2025
Use llama chat tags in example requests.

monorimet pushed a commit that referenced this issue Feb 13, 2025
…lama3_8b_fp16` dataset (#953)