Commit 75cbd15

Fix(doc): address missing doc changes (#2362)
* fix: add multiple tips about eos_token masking
* fix: format dataset preprocessing doc
* Update docs/dataset-formats/conversation.qmd

Co-authored-by: salman <[email protected]>
1 parent 2efe1b4 commit 75cbd15

4 files changed: +27, -7 lines

docs/config.qmd (+2 -1)

@@ -166,14 +166,15 @@ datasets:
 # IMPORTANT: The following fields determine which parts of the conversation to train on.
 # Priority order: message_field_training > message_field_training_detail > train_on_inputs or role in roles_to_train
 # See examples at `docs/dataset-formats/conversation.qmd`
-# Note: If the below 4 fields are empty, defaults to training only on the last message.
+# Note: If the below 4 fields are set to empty, defaults to training only on the last message.
 
 # Optional[List[str]]. Roles to train on. The tokens from these roles will be considered for the loss.
 roles_to_train: ["assistant"] # default
 # Optional[str]. Which EOS tokens to train on in the conversation. Possible values are:
 # - all: train on all EOS tokens
 # - turn (default): train on the EOS token at the end of each trainable turn
 # - last: train on the last EOS token in the conversation
+# TIP: Please make sure that your `tokenizer.eos_token` is the same as the EOS/EOT token in the template. Otherwise, set `eos_token` under `special_tokens`.
 train_on_eos: last
 # The key in the message turn that indicates via boolean whether tokens of a turn should be considered for training. Useful to selectively train on certain turns besides the `roles_to_train`.
 message_field_training: training
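
For orientation, a minimal sketch of how the fields in this hunk fit together in a config; the dataset path is a placeholder and the values are just the documented defaults/options, not something this commit changes:

```yaml
datasets:
  - path: ./data/chats.jsonl   # placeholder path, not from the diff
    type: chat_template
    # Train on tokens from assistant turns (the default shown above)...
    roles_to_train: ["assistant"]
    # ...and on the EOS token that closes each trainable turn.
    train_on_eos: turn
```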

docs/dataset-formats/conversation.qmd (+7 -1)

@@ -104,6 +104,10 @@ datasets:
 type: chat_template
 ```
 
+::: {.callout-important}
+Please make sure that your `tokenizer.eos_token` is the same as the EOS/EOT token in the template. Otherwise, set `eos_token` under `special_tokens`.
+:::
+
 5. (Advanced) Using fine-grained control over tokens and turns to train in a conversation
 
 For a data sample that looks like:
@@ -151,4 +155,6 @@ datasets:
 message_field_training_detail: train_detail
 ```
 
-Tip: It is not necessary to use both `message_field_training` and `message_field_training_detail` at a time.
+::: {.callout-tip}
+It is not necessary to set both `message_field_training` and `message_field_training_detail` at once.
+:::
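
Illustrating the tip above with a rough config sketch (placeholder path; not part of this commit), fine-grained control usually needs only one of the two fields:

```yaml
datasets:
  - path: ./data/chats.jsonl   # placeholder path
    type: chat_template
    # Per-turn control: each message carries a boolean under this key, and
    # flagged turns are trained on in addition to any roles_to_train entries.
    message_field_training: training
    # message_field_training_detail is deliberately left unset -- per the tip,
    # only one of the two fields is needed at a time.
```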

docs/dataset_preprocessing.qmd (+14 -5)

@@ -3,19 +3,24 @@ title: Dataset Preprocessing
 description: How datasets are processed
 ---
 
+## Overview
+
 Dataset pre-processing is the step where Axolotl takes each dataset you've configured alongside
-the (dataset format)[../dataset-formats/] and prompt strategies to:
+the [dataset format](docs/dataset-formats) and prompt strategies to:
+
 - parse the dataset based on the *dataset format*
 - transform the dataset to how you would interact with the model based on the *prompt strategy*
 - tokenize the dataset based on the configured model & tokenizer
 - shuffle and merge multiple datasets together if using more than one
 
 The processing of the datasets can happen one of two ways:
 
-1. Before kicking off training by calling `python -m axolotl.cli.preprocess /path/to/your.yaml --debug`
+1. Before kicking off training by calling `axolotl preprocess config.yaml --debug`
 2. When training is started
 
-What are the benefits of pre-processing? When training interactively or for sweeps
+### What are the benefits of pre-processing?
+
+When training interactively or for sweeps
 (e.g. you are restarting the trainer often), processing the datasets can oftentimes be frustratingly
 slow. Pre-processing will cache the tokenized/formatted datasets according to a hash of dependent
 training parameters so that it will intelligently pull from its cache when possible.
@@ -28,8 +33,12 @@ default path of `./last_run_prepared/`, but will ignore anything already cached
 setting `dataset_prepared_path: ./last_run_prepared`, the trainer will use whatever pre-processed
 data is in the cache.
 
-What are the edge cases? Let's say you are writing a custom prompt strategy or using a user-defined
+### What are the edge cases?
+
+Let's say you are writing a custom prompt strategy or using a user-defined
 prompt template. Because the trainer cannot readily detect these changes, we cannot change the
-calculated hash value for the pre-processed dataset. If you have `dataset_prepared_path: ...` set
+calculated hash value for the pre-processed dataset.
+
+If you have `dataset_prepared_path: ...` set
 and change your prompt templating logic, it may not pick up the changes you made and you will be
 training over the old prompt.
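
As a quick reminder of the moving parts described in this doc (a sketch only; `config.yaml` is a placeholder name):

```yaml
# Run `axolotl preprocess config.yaml --debug` first; training then reuses the
# tokenized datasets cached under this path while the hash of the dependent
# training parameters stays the same.
dataset_prepared_path: ./last_run_prepared
```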

docs/faq.qmd (+4 -0)

@@ -46,3 +46,7 @@ description: Frequently asked questions
 **Q: `Content end boundary is the same as start boundary for turn ___. This is likely an empty turn.`**
 
 > A: This is likely an empty turn.
+
+**Q: The EOS/EOT token is incorrectly being masked or not being masked.**
+
+> A: This is caused by a mismatch between `tokenizer.eos_token` and the EOS/EOT token in the template. Set `eos_token` under `special_tokens` to the same EOS/EOT token used in the template.
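
A minimal sketch of the fix the answer points to (the token string is a placeholder; substitute the end-of-turn token your chat template actually uses):

```yaml
special_tokens:
  # Align the tokenizer's EOS with the template's end-of-turn token so that
  # EOS masking (train_on_eos) behaves as expected.
  eos_token: "<|im_end|>"   # placeholder -- use your template's EOT token
```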
