Use llama-8b-instruct tokenizer.json and tokenizer_config.json for `llama3_8b_fp16` dataset (#953)

The tokenizers specified for this dataset are the base `llama3_8b_fp16` ones,
while the model is actually `llama3_8b_fp16_instruct`. The `eos_token` differs
between `8b` and `8b-instruct`:

```text
8b:

<|begin_of_text|> {generated_text} <|end_of_text|>

<|end_of_text|> - 128001

8b-Instruct:

<|begin_of_text|> {generated_text} <|eot_id|>

<|eot_id|> - 128009
```

Using the wrong config causes Llama to output text forever: the model emits
`128009` (`<|eot_id|>`), but the server, going by the base tokenizer config,
doesn't recognize it as the stop token and keeps requesting generations.
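
The mismatch is easy to reproduce outside the server. A minimal sketch (assuming the `transformers` library and access to the NousResearch Hugging Face repos; not part of this change) that prints the two stop tokens:

```python
# Hypothetical check, not part of this commit: compare the eos_token
# configured for the base and instruct tokenizers.
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("NousResearch/Meta-Llama-3-8B")
instruct = AutoTokenizer.from_pretrained("NousResearch/Meta-Llama-3-8B-Instruct")

print(base.eos_token, base.eos_token_id)          # <|end_of_text|> 128001
print(instruct.eos_token, instruct.eos_token_id)  # <|eot_id|> 128009
```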

More details [here](#934 (comment)).
stbaione authored Feb 11, 2025
1 parent b78f901 commit fbc69de
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion sharktank/sharktank/utils/hf_datasets.py
```diff
@@ -103,7 +103,7 @@ def alias_dataset(from_name: str, to_name: str):
         ),
         RemoteFile(
             "tokenizer_config.json",
-            "NousResearch/Meta-Llama-3-8B",
+            "NousResearch/Meta-Llama-3-8B-Instruct",
             "tokenizer_config.json",
             extra_filenames=["tokenizer.json"],
         ),
```
