Use llama-8b-instruct tokenizer.json and tokenizer_config.json for `llama3_8b_fp16` dataset (#953)

The tokenizers specified for this dataset are the base `llama3_8b_fp16` ones,
while the model is actually `llama3_8b_fp16_instruct`. The `eos_token` differs
between `8b` and `8b-instruct`:

```text
8b:

<|begin_of_text|> {generated_text} <|end_of_text|>

<|end_of_text|> - 128001

8b-Instruct:

<|begin_of_text|> {generated_text} <|eot_id|>

<|eot_id|> - 128009
```

Using the wrong config causes Llama to output text forever: the model emits
`128009` (`<|eot_id|>`), but the server, going by the base tokenizer config,
doesn't recognize it as the stop token and keeps requesting generations.
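
The mismatch is easy to reproduce outside the server. A minimal sketch (assuming the `transformers` library and access to the NousResearch Hugging Face repos; not part of this change) that prints the two stop tokens:

```python
# Hypothetical check, not part of this commit: compare the eos_token
# configured for the base and instruct tokenizers.
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("NousResearch/Meta-Llama-3-8B")
instruct = AutoTokenizer.from_pretrained("NousResearch/Meta-Llama-3-8B-Instruct")

print(base.eos_token, base.eos_token_id)          # <|end_of_text|> 128001
print(instruct.eos_token, instruct.eos_token_id)  # <|eot_id|> 128009
```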

More details [here](#934 (comment)).
stbaione authored Feb 11, 2025
1 parent b78f901 commit fbc69de
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion sharktank/sharktank/utils/hf_datasets.py
```diff
@@ -103,7 +103,7 @@ def alias_dataset(from_name: str, to_name: str):
         ),
         RemoteFile(
             "tokenizer_config.json",
-            "NousResearch/Meta-Llama-3-8B",
+            "NousResearch/Meta-Llama-3-8B-Instruct",
             "tokenizer_config.json",
             extra_filenames=["tokenizer.json"],
         ),
```
