Description
This change adds support for next-fit bin packing in `MultipackBatchSampler`, which makes it possible to use sample packing while preserving the order of the examples.
Motivation and Context
Current sample packing reorders the examples in ways that can affect training results. This PR adds a new flag, `--sample_pack_sequentially`, which uses a simple greedy / sequential next-fit bin packing. If you use `--sample_pack_sequentially` alone, the order of the examples is determined by the underlying `RandomSampler`. If you use `--sample_pack_sequentially` together with `--curriculum_sampling`, the order of the examples is the same as in the training data. Both options are useful and differ from the current behavior. For example, previously it was not possible to do proper curriculum learning, since the bin packing algorithm would reorder the examples anyway.
How has this been tested?
I tested it by running the new config:
examples/llama-3/lora-1b-sample-packing-sequentially.yml
On this dataset, packing efficiency is high (>97%).
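For reference, the relevant part of that config would look roughly like the following sketch. Only `sample_pack_sequentially` is the new option from this PR; I'm assuming the existing `sample_packing` and `curriculum_sampling` keys alongside it:

```yaml
sample_packing: true
sample_pack_sequentially: true   # new option added by this PR
curriculum_sampling: true        # optional: keep the training-data order
```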
I would also like to bring attention to a potential bug in the multipack code: `group_size` and `bin_size`