Use llama-8b-instruct tokenizer.json and tokenizer_config.json for `llama3_8b_fp16` dataset #953

stbaione · 2025-02-11T16:30:16Z

The tokenizers specified for this dataset are for `llama3_8b_fp16`, while the model is `llama3_8b_fp16_instruct`. The `eos_token` differs between `8b` and `8b-instruct`:

8b:

<|begin_of_text|> {generated_text} <|end_of_text|>

<|end_of_text|> - 128001

8b-Instruct:

<|begin_of_text|> {generated_text} <|eot_id|>

<|eot_id|> - 128009
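The mismatch described above can be caught before serving with a small consistency check. The following is a minimal sketch, not code from this PR: the function names and the `EXPECTED_EOS` table are hypothetical, and it assumes that `tokenizer_config.json` stores `eos_token` either as a string or as an object with a `content` field, and that `tokenizer.json` lists special tokens under `added_tokens`.

```python
# Expected EOS token and id per model variant, as quoted in this PR description.
EXPECTED_EOS = {
    "8b": ("<|end_of_text|>", 128001),
    "8b-instruct": ("<|eot_id|>", 128009),
}


def eos_from_configs(tokenizer_config: dict, tokenizer_json: dict):
    """Return (eos_token, eos_token_id) from already-parsed config dicts.

    Assumption: eos_token is a plain string or a dict with a "content" key,
    and special tokens appear in tokenizer_json["added_tokens"].
    """
    eos = tokenizer_config["eos_token"]
    if isinstance(eos, dict):
        eos = eos["content"]
    ids = {t["content"]: t["id"] for t in tokenizer_json.get("added_tokens", [])}
    # ids.get returns None when the token is missing from added_tokens.
    return eos, ids.get(eos)


def tokenizer_matches_variant(variant: str, tokenizer_config: dict, tokenizer_json: dict) -> bool:
    """True when the configured EOS token and id match the given variant."""
    return eos_from_configs(tokenizer_config, tokenizer_json) == EXPECTED_EOS[variant]


# Usage with inline stand-ins for the two JSON files.
added = {"added_tokens": [
    {"id": 128001, "content": "<|end_of_text|>"},
    {"id": 128009, "content": "<|eot_id|>"},
]}
base_config = {"eos_token": "<|end_of_text|>"}
instruct_config = {"eos_token": "<|eot_id|>"}

print(tokenizer_matches_variant("8b-instruct", base_config, added))
print(tokenizer_matches_variant("8b-instruct", instruct_config, added))
```

In practice the two dicts would come from `json.load` on the dataset's `tokenizer_config.json` and `tokenizer.json`.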

Using the wrong config causes Llama to generate text indefinitely. The model emits `128009` (`<|eot_id|>`), but the server does not recognize it as the stop token and keeps requesting further generations.
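To illustrate the failure mode, here is a toy decode loop. It is a sketch under stated assumptions, not the server's actual generation code: `generate` and `fake_instruct_model` are hypothetical names, and the `max_tokens` cap exists only so the example terminates.

```python
from typing import Callable, List


def generate(next_token: Callable[[List[int]], int],
             eos_token_id: int,
             max_tokens: int = 16) -> List[int]:
    """Collect tokens until the model emits eos_token_id or the cap is hit."""
    out: List[int] = []
    while len(out) < max_tokens:
        tok = next_token(out)
        if tok == eos_token_id:
            break  # stop token recognized: generation ends here
        out.append(tok)
    return out


def fake_instruct_model(tokens: List[int]) -> int:
    """Hypothetical stand-in for an instruct model.

    Emits three ordinary tokens, then 128009 (<|eot_id|>) on every later call.
    """
    return 128009 if len(tokens) >= 3 else 100 + len(tokens)


# Stop id from the instruct tokenizer: the loop ends after three tokens.
correct = generate(fake_instruct_model, eos_token_id=128009)

# Stop id from the base tokenizer: 128009 is treated as ordinary output,
# so the loop only ends because of the max_tokens cap.
wrong = generate(fake_instruct_model, eos_token_id=128001)

print(len(correct), len(wrong))
```

With the base tokenizer's stop id, nothing but the cap ends the loop, which matches the endless output described above.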

More details are in the linked comment on #934.

Use llama-8b-instruct tokenizer.json and tokenizer_config.json for `llama3_8b_fp16` dataset

2492f6a

stbaione requested a review from renxida February 11, 2025 16:30

stbaione added 2 commits February 11, 2025 10:34

Merge branch 'main' into hfdataset-instruct-tokenizer-fix

cd475c7

Merge branch 'main' into hfdataset-instruct-tokenizer-fix

ac338fb

renxida approved these changes Feb 11, 2025


stbaione merged commit fbc69de into nod-ai:main Feb 11, 2025
30 of 34 checks passed
