Sequential sample packing #2404

Open: wants to merge 1 commit into main


DreamGenX
Contributor

Description

This change adds support for next-fit bin packing in MultipackBatchSampler, which makes it possible to use sample packing while preserving the order of the examples.

Motivation and Context

The current sample packing changes the order of examples in ways that can affect training results. This PR adds a new flag, --sample_pack_sequentially, which uses simple greedy (sequential) next-fit bin packing.
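
As a rough illustration of the idea (not the actual MultipackBatchSampler code), next-fit packing keeps a single open bin and closes it as soon as the next example does not fit, so examples never move relative to each other. The function name `next_fit_pack`, the capacity, and the example lengths are made up for this sketch.

```python
from typing import List, Sequence


def next_fit_pack(lengths: Sequence[int], capacity: int) -> List[List[int]]:
    """Pack example indices into bins in order, never reopening a closed bin."""
    bins: List[List[int]] = []
    current: List[int] = []
    used = 0
    for idx, length in enumerate(lengths):
        if length > capacity:
            raise ValueError(f"example {idx} has length {length} > capacity {capacity}")
        # Close the current bin as soon as the next example does not fit.
        if current and used + length > capacity:
            bins.append(current)
            current, used = [], 0
        current.append(idx)
        used += length
    if current:
        bins.append(current)
    return bins


# Hypothetical token lengths, in the order the sampler yields them.
lengths = [500, 300, 400, 900, 100, 200]
packed = next_fit_pack(lengths, capacity=1024)
print(packed)  # [[0, 1], [2], [3, 4], [5]]
```

Reading the bins left to right gives back indices 0 through 5 in their original order, which is the property this flag is meant to provide. The cost is that a closed bin is never revisited, so some bins stay partly empty.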

If you use --sample_pack_sequentially alone, the order of the examples is determined by the underlying RandomSampler. If you combine it with --curriculum_sampling, the examples keep the same order as in the training data. Both options are useful, and both differ from the current behavior.

For example, previously it was not possible to do proper curriculum learning since the bin packing algorithm would reorder the examples anyway.
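
To make the reordering problem concrete, here is a small sketch that compares an order-changing packer with next-fit on a short-to-long curriculum. It assumes a first-fit-decreasing style packer as a stand-in for the existing behavior; `first_fit_decreasing` and `next_fit` are hypothetical helpers written for this comparison, not the sampler's real code.

```python
from typing import List, Sequence


def first_fit_decreasing(lengths: Sequence[int], capacity: int) -> List[List[int]]:
    """Stand-in for an order-changing packer; assumes every length <= capacity."""
    # Sort indices longest-first, then place each in the first bin with room.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins: List[List[int]] = []
    space: List[int] = []
    for idx in order:
        for b, free in enumerate(space):
            if lengths[idx] <= free:
                bins[b].append(idx)
                space[b] -= lengths[idx]
                break
        else:
            bins.append([idx])
            space.append(capacity - lengths[idx])
    return bins


def next_fit(lengths: Sequence[int], capacity: int) -> List[List[int]]:
    """Order-preserving packer: only the most recent bin is ever open."""
    bins: List[List[int]] = [[]]
    used = 0
    for idx, length in enumerate(lengths):
        if bins[-1] and used + length > capacity:
            bins.append([])
            used = 0
        bins[-1].append(idx)
        used += length
    return bins


# Hypothetical curriculum: examples sorted from short to long.
curriculum = [100, 200, 300, 400, 500, 600]
ffd_order = [i for b in first_fit_decreasing(curriculum, 1000) for i in b]
nf_order = [i for b in next_fit(curriculum, 1000) for i in b]
print(ffd_order)  # [5, 3, 4, 2, 1, 0]
print(nf_order)   # [0, 1, 2, 3, 4, 5]
```

In this toy case the stand-in packer uses the same number of bins but serves the longest examples first, which defeats the curriculum, while next-fit keeps the intended order.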

How has this been tested?

I tested it by running the new config: examples/llama-3/lora-1b-sample-packing-sequentially.yml
On this dataset, packing efficiency is high (>97%).
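
For reference, a minimal sketch of how a packing efficiency figure like this could be computed, assuming efficiency means the fraction of available token slots that hold real tokens. The sampler's own metric may be defined differently, and `packing_efficiency` and the numbers below are illustrative only.

```python
from typing import List, Sequence


def packing_efficiency(
    lengths: Sequence[int], bins: List[List[int]], capacity: int
) -> float:
    """Fraction of token slots across all bins that hold real tokens."""
    if not bins:
        return 0.0
    used = sum(lengths[i] for b in bins for i in b)
    return used / (len(bins) * capacity)


# Hypothetical lengths and bins; not taken from the dataset in this PR.
lengths = [500, 300, 400, 900, 100, 200]
bins = [[0, 1], [2], [3, 4], [5]]
efficiency = packing_efficiency(lengths, bins, capacity=1024)
print(f"{efficiency:.1%}")  # 58.6%
```

The toy numbers pack poorly on purpose. How well next-fit does depends on the distribution of example lengths relative to the sequence length.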


I would also like to bring attention to a potential bug in the multipack code:
