Commit 75cbd15

Fix(doc): address missing doc changes (#2362)
* fix: add multiple tips about eos_token masking
* fix: format dataset preprocessing doc
* Update docs/dataset-formats/conversation.qmd

Co-authored-by: salman <[email protected]>
1 parent 2efe1b4 commit 75cbd15

4 files changed: +27, -7 lines

docs/config.qmd (+2 -1)

@@ -166,14 +166,15 @@ datasets:
 # IMPORTANT: The following fields determine which parts of the conversation to train on.
 # Priority order: message_field_training > message_field_training_detail > train_on_inputs or role in roles_to_train
 # See examples at `docs/dataset-formats/conversation.qmd`
-# Note: If the below 4 fields are empty, defaults to training only on the last message.
+# Note: If the below 4 fields are set to empty, defaults to training only on the last message.
 
 # Optional[List[str]]. Roles to train on. The tokens from these roles will be considered for the loss.
 roles_to_train: ["assistant"] # default
 # Optional[str]. Which EOS tokens to train on in the conversation. Possible values are:
 # - all: train on all EOS tokens
 # - turn (default): train on the EOS token at the end of each trainable turn
 # - last: train on the last EOS token in the conversation
+# TIP: Please make sure that your `tokenizer.eos_token` is the same as the EOS/EOT token in the template. Otherwise, set `eos_token` under `special_tokens`.
 train_on_eos: last
 # The key in the message turn that indicates via boolean whether tokens of a turn should be considered for training. Useful to selectively train on certain turns besides the `roles_to_train`.
 message_field_training: training
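
For orientation, a minimal sketch of how the fields in this hunk fit together in a config; the dataset path is a placeholder and the values are just the documented defaults/options, not something this commit changes:

```yaml
datasets:
  - path: ./data/chats.jsonl   # placeholder path, not from the diff
    type: chat_template
    # Train on tokens from assistant turns (the default shown above)...
    roles_to_train: ["assistant"]
    # ...and on the EOS token that closes each trainable turn.
    train_on_eos: turn
```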

docs/dataset-formats/conversation.qmd (+7 -1)

@@ -104,6 +104,10 @@ datasets:
 type: chat_template
 ```
 
+::: {.callout-important}
+Please make sure that your `tokenizer.eos_token` is the same as the EOS/EOT token in the template. Otherwise, set `eos_token` under `special_tokens`.
+:::
+
 5. (Advanced) Using fine-grained control over tokens and turns to train in a conversation
 
 For a data sample that looks like:
@@ -151,4 +155,6 @@ datasets:
 message_field_training_detail: train_detail
 ```
 
-Tip: It is not necessary to use both `message_field_training` and `message_field_training_detail` at a time.
+::: {.callout-tip}
+It is not necessary to set both `message_field_training` and `message_field_training_detail` at once.
+:::
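
Illustrating the tip above with a rough config sketch (placeholder path; not part of this commit), fine-grained control usually needs only one of the two fields:

```yaml
datasets:
  - path: ./data/chats.jsonl   # placeholder path
    type: chat_template
    # Per-turn control: each message carries a boolean under this key, and
    # flagged turns are trained on in addition to any roles_to_train entries.
    message_field_training: training
    # message_field_training_detail is deliberately left unset -- per the tip,
    # only one of the two fields is needed at a time.
```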

docs/dataset_preprocessing.qmd (+14 -5)

@@ -3,19 +3,24 @@ title: Dataset Preprocessing
 description: How datasets are processed
 ---
 
+## Overview
+
 Dataset pre-processing is the step where Axolotl takes each dataset you've configured alongside
-the (dataset format)[../dataset-formats/] and prompt strategies to:
+the [dataset format](docs/dataset-formats) and prompt strategies to:
+
 - parse the dataset based on the *dataset format*
 - transform the dataset to how you would interact with the model based on the *prompt strategy*
 - tokenize the dataset based on the configured model & tokenizer
 - shuffle and merge multiple datasets together if using more than one
 
 The processing of the datasets can happen one of two ways:
 
-1. Before kicking off training by calling `python -m axolotl.cli.preprocess /path/to/your.yaml --debug`
+1. Before kicking off training by calling `axolotl preprocess config.yaml --debug`
 2. When training is started
 
-What are the benefits of pre-processing? When training interactively or for sweeps
+### What are the benefits of pre-processing?
+
+When training interactively or for sweeps
 (e.g. you are restarting the trainer often), processing the datasets can oftentimes be frustratingly
 slow. Pre-processing will cache the tokenized/formatted datasets according to a hash of dependent
 training parameters so that it will intelligently pull from its cache when possible.
@@ -28,8 +33,12 @@ default path of `./last_run_prepared/`, but will ignore anything already cached
 setting `dataset_prepared_path: ./last_run_prepared`, the trainer will use whatever pre-processed
 data is in the cache.
 
-What are the edge cases? Let's say you are writing a custom prompt strategy or using a user-defined
+### What are the edge cases?
+
+Let's say you are writing a custom prompt strategy or using a user-defined
 prompt template. Because the trainer cannot readily detect these changes, we cannot change the
-calculated hash value for the pre-processed dataset. If you have `dataset_prepared_path: ...` set
+calculated hash value for the pre-processed dataset.
+
+If you have `dataset_prepared_path: ...` set
 and change your prompt templating logic, it may not pick up the changes you made and you will be
 training over the old prompt.
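
As a quick reminder of the moving parts described in this doc (a sketch only; `config.yaml` is a placeholder name):

```yaml
# Run `axolotl preprocess config.yaml --debug` first; training then reuses the
# tokenized datasets cached under this path while the hash of the dependent
# training parameters stays the same.
dataset_prepared_path: ./last_run_prepared
```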

docs/faq.qmd (+4 -0)

@@ -46,3 +46,7 @@ description: Frequently asked questions
 **Q: `Content end boundary is the same as start boundary for turn ___. This is likely an empty turn.`**
 
 > A: This is likely an empty turn.
+
+**Q: The EOS/EOT token is incorrectly being masked or not being masked.**
+
+> A: This is caused by a mismatch between `tokenizer.eos_token` and the EOS/EOT token in the template. Set `eos_token` under `special_tokens` to the same EOS/EOT token used in the template.
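
A minimal sketch of the fix the answer points to (the token string is a placeholder; substitute the end-of-turn token your chat template actually uses):

```yaml
special_tokens:
  # Align the tokenizer's EOS with the template's end-of-turn token so that
  # EOS masking (train_on_eos) behaves as expected.
  eos_token: "<|im_end|>"   # placeholder -- use your template's EOT token
```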
