@RaphaelKreft

Implementation of an SFT dataset for SFT training in Megatron, originally developed for Visual Instruction Tuning.

  • Like gpt_dataset, it uses an IndexedDataset as the low-level dataset.
  • From there it loads pre-tokenized SFT data, then masks the user prompts (in our case including image tokens).
  • Per training sample: it loads a single sample from the indexed dataset, then pads it to the maximum sequence length (see the sketch below).
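
For orientation, a minimal sketch of that per-sample flow, assuming hypothetical helper names (prompt_mask_fn, pad_token_id) that stand in for the actual SFTIndexedDataset internals:

```python
import numpy as np
import torch

def build_sft_sample(indexed_dataset, idx, max_seq_len, pad_token_id, prompt_mask_fn):
    # Load one pre-tokenized sample from the low-level IndexedDataset.
    ids = torch.from_numpy(np.asarray(indexed_dataset[idx], dtype=np.int64))

    tokens = ids[:-1].contiguous()      # model inputs
    labels = ids[1:].contiguous()       # next-token targets
    loss_mask = torch.ones_like(labels, dtype=torch.float)
    # Mask the user prompt (including image tokens); prompt_mask_fn is assumed to
    # return a boolean mask over the labels.
    loss_mask[prompt_mask_fn(labels)] = 0.0

    # Pad tokens/labels/loss_mask up to the maximum sequence length.
    pad = max_seq_len - tokens.numel()
    if pad > 0:
        tokens = torch.cat([tokens, torch.full((pad,), pad_token_id, dtype=torch.long)])
        labels = torch.cat([labels, torch.full((pad,), pad_token_id, dtype=torch.long)])
        loss_mask = torch.cat([loss_mask, torch.zeros(pad)])

    return {"tokens": tokens, "labels": labels, "loss_mask": loss_mask}
```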

Added

  • --sft CLI argument (when given, the new SFTIndexedDataset is used)
  • SFTIndexedDataset, which loads and prepares pre-tokenized SFT data

Questions / Todos

  • Add dynamic loading of the "begin of user prompt" and "end-of-turn" sequences (could be part of the tokenizer; they are currently hard-coded and thus only work for the Llama 3 Vision model chat template)
  • Implement an option to mask the loss on all special tokens (BOS, EOD, EOS, SFT-related special tokens)?

return matches


def get_matching_mask_by_start_end(tokens, begin_seq: torch.Tensor, end_seq: torch.Tensor):


CRUCIAL BUG: We're Searching in the Wrong Place

Let’s make this concrete with a simple example.

Sequence:  a b c d e

Tokens:    a b c d
Labels:    b c d e

Each loss_mask[i] controls whether we train on predicting labels[i] from tokens[i], i.e.:

  • loss_mask[0]: a → b
  • loss_mask[1]: b → c
  • loss_mask[2]: c → d
  • loss_mask[3]: d → e

The Problem

Say we want to mask the prediction of 'c' (don’t train the model to predict 'c').

Current (incorrect) logic:

  • We search for 'c' in tokens → find index 2
  • Set loss_mask[2] = 0

Result:

a → b   (trained)
b → c   (trained)   ❌ we wanted to mask this one
c → d   (masked)
d → e   (trained)

This disables the c→d prediction instead of b→c.


Correct logic

We should search for 'c' in labels, not tokens:

  • 'c' is at label index 1
  • Set loss_mask[1] = 0

Result:

a → b   (trained)
b → c   (masked)    ✅ correct
c → d   (trained)
d → e   (trained)
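
The same example as a tiny standalone snippet (illustrative only, not the PR's code):

```python
import torch

vocab = {"a": 0, "b": 1, "c": 2, "d": 3, "e": 4}
seq = torch.tensor([vocab[t] for t in "abcde"])
tokens, labels = seq[:-1], seq[1:]

# Incorrect: search in tokens -> masks the c -> d step.
wrong_mask = torch.ones_like(labels, dtype=torch.float)
wrong_mask[tokens == vocab["c"]] = 0.0      # -> [1., 1., 0., 1.]

# Correct: search in labels -> masks the b -> c step.
right_mask = torch.ones_like(labels, dtype=torch.float)
right_mask[labels == vocab["c"]] = 0.0      # -> [1., 0., 1., 1.]
```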

@RaphaelKreft (Author):

I have now renamed the arguments of the sequence-matching methods (to "data" and "sequence") and, more importantly, pass them the labels to calculate the mask.

end_len = len(end_seq)

if 0 < begin_len <= len(tokens):
    matches_begin = get_matching_mask(tokens, begin_seq, only_begin=True)


The same applies here: get_matching_mask should work on the labels rather than the tokens.

Imagine the sequence: "<|start_header_id|>Assistant<|end_header_id|> Hi"

We want to enable the loss from <|end_header_id|> → Hi, as the model should learn to predict “Hi” from <|end_header_id|>.
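
A minimal sketch of that point, using made-up token IDs rather than the real Llama 3 special-token IDs:

```python
import torch

# Assumed IDs for illustration only.
START_HDR, ASSISTANT, END_HDR, HI = 10, 11, 12, 13

seq = torch.tensor([START_HDR, ASSISTANT, END_HDR, HI])
tokens, labels = seq[:-1], seq[1:]

# Mask computed on the labels: the header tokens are never training targets,
# but the <|end_header_id|> -> "Hi" step stays enabled.
loss_mask = torch.ones_like(labels, dtype=torch.float)
loss_mask[(labels == START_HDR) | (labels == ASSISTANT) | (labels == END_HDR)] = 0.0
# loss_mask == [0., 0., 1.]  -> only predicting "Hi" from <|end_header_id|> is trained
```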

"position_ids": position_ids,
}

def _get_ltor_masks_and_position_ids(self, tokens,


I would prefer using data (as in the original _get_ltor_masks_and_position_ids) instead of tokens here. In the original implementation, tokens are passed in to prevent generation after the EOD token during pretraining; for SFT, maybe we should pass in the labels instead.
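
A hedged sketch of why the choice matters (simplified; not Megatron's actual helper):

```python
import torch

def eod_loss_mask(data: torch.Tensor, eod_token: int) -> torch.Tensor:
    """Zero the loss wherever `data` holds the EOD token.

    If `data` is the tokens, this silences the generate-after-EOD step;
    if `data` is the labels, it silences the predict-EOD step instead.
    """
    mask = torch.ones_like(data, dtype=torch.float)
    mask[data == eod_token] = 0.0
    return mask
```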

if not only_begin:
    matches_float = matches.float().unsqueeze(0).unsqueeze(0)  # (1, 1, N)
    kernel = torch.ones(1, 1, query_len, device=sequence.device)
    expanded = F.conv1d(matches_float, kernel, padding=query_len - 1)


Do you think using conv1d here for the padding is a bit overkill? @TJ-Solergibert


Well, it can be, but as long as (1) it doesn't hurt performance and (2) everyone is comfortable with it, it's fine.

Keep in mind that this function runs (1) on the CPU and (2) while the GPU is processing the previous batch, so as long as you don't hit a CPU OOM error and are not bottlenecked by the DataLoader, you are good. To check for the latter, just compare the throughput against a run with mock data.
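
For reference, the conv1d trick expands a 0/1 vector of match starts into a mask covering every position of each match; a plain loop does the same thing. A sketch with assumed shapes (the PR's exact slicing may differ):

```python
import torch
import torch.nn.functional as F

def expand_matches_conv(match_starts: torch.Tensor, query_len: int) -> torch.Tensor:
    # match_starts: 0/1 vector of length N marking where a query match begins.
    m = match_starts.float().unsqueeze(0).unsqueeze(0)                       # (1, 1, N)
    kernel = torch.ones(1, 1, query_len, device=match_starts.device)
    out = F.conv1d(m, kernel, padding=query_len - 1).squeeze(0).squeeze(0)   # length N + query_len - 1
    return out[: match_starts.numel()] > 0                                   # True where a match covers the position

def expand_matches_loop(match_starts: torch.Tensor, query_len: int) -> torch.Tensor:
    covered = torch.zeros(match_starts.numel(), dtype=torch.bool)
    for s in torch.nonzero(match_starts).flatten().tolist():
        covered[s : s + query_len] = True
    return covered

# Both return tensor([False, True, True, False, False]) for:
#   expand_matches_conv(torch.tensor([0, 1, 0, 0, 0]), query_len=2)
```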

RaphaelKreft and others added 18 commits October 13, 2025 22:55
- Load sequences from tokenizer properties instead of tokenizing at runtime
- Pre-compute token sequences as tensors in __init__
- Use .to() instead of torch.tensor() in hot path for efficiency
- Reduces overhead during training data loading
…mples and then exit. Remove dummy packing arg and code from main code. (untested)
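
The first commit above (pre-computing the special-token sequences from tokenizer properties) suggests a pattern roughly like the following; begin_user_prompt_ids and end_of_turn_ids are hypothetical tokenizer attributes, not the actual names:

```python
import torch

class SFTSpecialSequences:
    """Sketch of 'pre-compute in __init__, only .to() in the hot path'."""

    def __init__(self, tokenizer):
        # Built once from tokenizer properties, instead of tokenizing at runtime.
        self.begin_user_seq = torch.tensor(tokenizer.begin_user_prompt_ids, dtype=torch.long)
        self.end_turn_seq = torch.tensor(tokenizer.end_of_turn_ids, dtype=torch.long)

    def for_labels(self, labels: torch.Tensor):
        # Hot path: only a device move, no torch.tensor() construction.
        return (self.begin_user_seq.to(labels.device),
                self.end_turn_seq.to(labels.device))
```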