Bad results when serving Llama3.1_8b_f16 with shortfin #934

Open
Alex-Vasile opened this issue Feb 7, 2025 · 12 comments

@Alex-Vasile
Contributor

Following the serving guide and running llama3.1_8b_f16 at tp8, I am getting incoherent results.

For example:

curl http://localhost:8000/generate \
    -H "Content-Type: application/json" \
    -d '{
        "text": "Name the capital of the United States.",
        "sampling_params": {"max_completion_tokens": 50}
    }'

Produces:

data: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
@dan-garvey
Member

@stbaione have you seen this before?

@stbaione
Contributor

stbaione commented Feb 7, 2025

@stbaione have you seen this before?

This is new to me as of this morning, when Alex found it. I'm looking into it.

@stbaione
Contributor

stbaione commented Feb 7, 2025

At HEAD for shark-ai/IREE, using the GGUF from our user docs:

Unsharded

curl http://localhost:8003/generate \
    -H "Content-Type: application/json" \
    -d '{
        "text": "Name the capital of the United States.",
        "sampling_params": {"max_completion_tokens": 50}
    }'

data:  Washington D.C.
What is the capital of the United States? Washington D.C.
What is the capital of the United States? Washington D.C.
What is the capital of the United States? Washington D.C.
What is the capital of the United States

Sharded

I got a memory access fault on decode for this one. This is new to me; the earliest I've heard it reported was this morning from Avi.

He reported that the unsharded 8b outputs looked good in testing yesterday. My suspicion is that this regression is related to us updating our pinned IREE versions. The sharding issue seems like something else. Still looking into it...

@ScottTodd
Member

My suspicion is that this regression is related to us updating our pinned IREE versions.

If you have an easy repro, you could try a partial revert of 2c61420, just in shortfin/CMakeLists.txt. That would switch shortfin to build against the older IREE runtime code while still using the latest IREE compiler code / .vmfb generation.
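
Roughly, that partial revert could look like this (a sketch; assumes the commit and path above are still current):

```bash
# Check shortfin/CMakeLists.txt out from the parent of the IREE version bump,
# leaving the rest of the tree (including the compiler pin) at HEAD.
git checkout 2c61420~1 -- shortfin/CMakeLists.txt

# Then rebuild shortfin against the older IREE runtime and rerun the serving repro.
```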

@stbaione
Contributor

stbaione commented Feb 7, 2025

Looks like the output Avi got did actually show repetitiveness, so this may not specifically be related to that requirements bump.

Alex found NaNs in the KV cache after prefill; there are no NaNs in the KV cache prior to prefill. Looking further.

@stbaione
Contributor

stbaione commented Feb 7, 2025

Focusing on the repetitive unsharded output first. Rewound shark-ai to ff15379 with IREE at HEAD.

Got the following output:

curl http://localhost:8003/generate \
    -H "Content-Type: application/json" \
    -d '{
        "text": "Name the capital of the United States.",
        "sampling_params": {"max_completion_tokens": 50}
    }'
data:  Washington D.C. is the capital of the United States. Washington D.C. is a special district that is not part of any state but is the capital of the United States. Washington D.C. is a federal district that serves as the seat of the

Something to note is that not all prompts produce repetitive output. For example:

curl http://localhost:8003/generate \
    -H "Content-Type: application/json" \
    -d '{
        "text": "Who is your favorite author?",
        "sampling_params": {"max_completion_tokens": 50}
    }'

data:  I have a few favorite authors, but one of my all-time favorites is Jane Austen. I love her witty dialogue and her insight into the human heart. Her novels are like a warm hug on a cold day. I could read them over and over

Strange. Anyway, bisecting shark-ai.
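
For reference, the bisect workflow is roughly this (a sketch; it uses ff15379 as the known-good commit, per the run above):

```bash
# Sketch: HEAD shows the repetitive output (bad); ff15379 gave the expected output (good).
git bisect start
git bisect bad HEAD
git bisect good ff15379
# At each step: rebuild/re-export, restart the server, rerun the curl request above,
# then mark the checked-out commit with `git bisect good` or `git bisect bad`.
# When git reports the first bad commit:
git bisect reset
```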

@stbaione
Contributor

stbaione commented Feb 8, 2025

The repetitive output bisects to d12e384.

However, it's hard to determine whether this is actually a bug. That commit fixed a concurrency bug, and, as mentioned, prompts other than "Name the capital of the United States." worked with it included. We would need to test a larger sample to determine whether one version is actually more accurate than the other; a single golden prompt doesn't really cut it for this.

We have ml_perf_harness_llama, which would test against 8,131 samples and give us an accuracy measurement. It will be useful for determining accuracy in the future, but it still needs a system to run on. That said, maybe I could kick off an 8b run locally over the weekend and get a definitive answer from the results.

Looking at the sharding issue now. It is not caused by the same commit.

@aviator19941
Collaborator

aviator19941 commented Feb 8, 2025

For sharding (8b tp8), I got a similar issue to Alex's, with incoherent results on Sharkmi300x-3, using these weights:

  --irpa-file=/shark-dev/8b/instruct/weights/tp8/llama3.1_8b_instruct_fp16_tp8.irpa

It seems like the weights I'm using are slightly different from /data/llama3.1/weights/8b/fp16/tp8/llama3.1_8b_fp16_tp8_parameters.irpa.
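
To confirm that the two parameter files actually differ, a quick check (sketch):

```bash
# Sketch: compare checksums of the two tp8 parameter files mentioned above.
sha256sum \
  /shark-dev/8b/instruct/weights/tp8/llama3.1_8b_instruct_fp16_tp8.irpa \
  /data/llama3.1/weights/8b/fp16/tp8/llama3.1_8b_fp16_tp8_parameters.irpa
```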

curl http://localhost:8081/generate \
    -H "Content-Type: application/json" \
    -d '{
        "text": "Name the capital of the United States.",
        "sampling_params": {"max_completion_tokens": 50}
    }'
data:  Washington!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! !.  I! !!!!!!!
.NameTheNameThecapital ofthe capital ofthe!!!!.!.!
curl http://localhost:8081/generate \
    -H "Content-Type: application/json" \
    -d '{
        "text": "1, 2, 3, 4, 5",
        "sampling_params": {"max_completion_tokens": 50}
    }'
data: , ://://://://://://://://://://://://://://://://://://://://://://://://://://://://://://://://://://://://://://://://://://://://://://://://://

@stbaione
Contributor

stbaione commented Feb 8, 2025

The corrupt output tokens for sharded models are due to iree-org/iree@4fffb0e.

Will have an iree-run-module repro & ticket filed by Monday morning.
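
The repro will be roughly of this shape (a sketch only; the device, paths, function name, and input below are placeholders, not the actual repro):

```bash
# Sketch of an iree-run-module invocation; every value here is a placeholder.
iree-run-module \
  --device=hip://0 \
  --module=/path/to/llama3.1_8b_fp16.vmfb \
  --parameters=model=/path/to/llama3.1_8b_fp16.irpa \
  --function=decode_bs4 \
  --input=@/path/to/decode_arg0.npy
# For the sharded (tp8) case, --device and --parameters would be repeated per shard;
# the exact layout depends on how the model was exported.
```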

On the repetitive-output commit on the shark-ai side, I'm attempting a local harness run against 8b_tp8 with the sharded fix applied. If the run is successful, we'll be able to see the accuracy score with and without the commit, over 7,000 samples.

@stbaione
Contributor

IREE issue filed for the corrupt sharded outputs: iree-org/iree#19948

@stbaione
Contributor

stbaione commented Feb 11, 2025

The shark-ai issue appears to not actually be an issue. I had a side quest where I noticed that the server was never actually outputting stop tokens. It turns out that you have to wrap your prompt in proper llama chat tags for the server to emit a stop token, for example <|begin_of_text|>Hello, how are you?<|eot_id|>. These tags are expected; the model would have been trained with them applied.

On top of that, llama3_8b_fp16 from our hf_datasets script actually has a llama3.1-8b tokenizer_config, NOT a llama3.1-8b-instruct tokenizer_config. I checked mi300x-3's configs for 8b and 405b, and they were both incorrect. These will need to be updated.
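
A quick way to tell which config you have is to check the eos token (a sketch, assuming the tokenizer_config.json stores eos_token as a plain string):

```bash
# Sketch: the instruct config should use <|eot_id|> (128009) as its stop token,
# while the base 8b config uses <|end_of_text|> (128001).
jq '.eos_token' tokenizer_config.json
```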

After using the proper config and retrying with tags applied, I got this:

curl http://localhost:8003/generate \
    -H "Content-Type: application/json" \
    -d '{
        "text": "<|begin_of_text|>Name the capital of the United States.<|eot_id|>",
        "sampling_params": {"max_completion_tokens": 100}
    }'
data: assistant
The capital of the United States is Washington, D.C.
curl http://localhost:8003/generate \
    -H "Content-Type: application/json" \
    -d '{
        "text": "<|begin_of_text|>Name the capital of the United States. Your answer should be a minimum of 100 characters.<|eot_id|>",
        "sampling_params": {"max_completion_tokens": 100}
    }'
data: assistant

The capital of the United States is Washington, D.C. (short for District of Columbia), which is a federal district located on the east coast of the country, along the Potomac River. It is home to many iconic landmarks, including the White House, the Capitol Building, and the Smithsonian museums.

If I were to remove the tags, the server would generate tokens until it reached the max_completion_tokens value.

To be extra confident, I wrote a quick test script that sent 30 requests in parallel to the server, with tags applied, and logged the results. They look great and can be seen here.
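
The parallel check amounts to something like this (a sketch of the idea, not the exact script):

```bash
# Sketch: fire 30 tagged requests at the server concurrently and save each response.
for i in $(seq 1 30); do
  curl -s http://localhost:8003/generate \
    -H "Content-Type: application/json" \
    -d '{
          "text": "<|begin_of_text|>Name the capital of the United States.<|eot_id|>",
          "sampling_params": {"max_completion_tokens": 100}
        }' > "response_${i}.txt" &
done
wait
cat response_*.txt
```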

Takeaways: check model artifacts. In my experience, this is the second time we've had weird server behavior due to busted/incorrect artifacts. Another takeaway: always use llama chat tags. We need to update the examples in the docs.

Along with this, the specified commit was a bug fix that fixed our concurrency tests and is logically correct, so it makes sense to leave it in.

Note

This is with iree-org/iree@4fffb0e reverted

@stbaione
Contributor

PR Created in IREE for sharding: iree-org/iree#19958

stbaione added a commit that referenced this issue Feb 11, 2025
Use llama chat tags in example requests.

More details on this can be found
[here](#934 (comment))
stbaione added a commit that referenced this issue Feb 11, 2025
…lama3_8b_fp16` dataset (#953)

The tokenizers specified for this dataset are for `llama3_8b_fp16`,
while the model is `llama3_8b_fp16_instruct`. The `eos_token` for `8b`
and `8b-instruct` are different:

```text
8b:

<|begin_of_text|> {generated_text} <|end_of_text|>

<|end_of_text|> - 128001

8b-Instruct:

<|begin_of_text|> {generated_text} <|eot_id|>


<|eot_id|> - 128009
```

Using the wrong config causes Llama to output text forever. Our model
generated `128009`s, but the server doesn't recognize it as the proper
stop token and keeps calling for generations.

More details
[here](#934 (comment))
monorimet pushed a commit that referenced this issue Feb 13, 2025
Use llama chat tags in example requests.

monorimet pushed a commit that referenced this issue Feb 13, 2025
…lama3_8b_fp16` dataset (#953)