Data Mixture Modification Script #74

alexdremov · 2025-05-27T09:32:33Z

This script modifies the metadata of a blended dataset. It can:

- incorporate new datasets
- remove datasets that are already in the blend
- remove tokens that have already been seen

The processing procedure file is also committed in this PR (https://github.com/swiss-ai/Megatron-LM/pull/74/files#diff-d4fed1e9afb714170efaffdf51f31de4ffa2419a7fd029508fa496fb2ebfa7ba)
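To make the three operations in the description concrete, here is a minimal sketch on a toy model of a blend. It is not the code from this PR: the `Blend` mapping (dataset prefix to remaining token budget) and the function names `mixing_weights`, `add_dataset`, `remove_dataset` and `remove_seen_tokens` are all hypothetical, and the real metadata format may differ.

```python
from typing import Dict, Mapping

# Hypothetical representation: dataset prefix -> remaining token budget.
Blend = Dict[str, int]


def mixing_weights(blend: Mapping[str, int]) -> Dict[str, float]:
    """Normalize token budgets into weights that sum to 1."""
    total = sum(blend.values())
    if total <= 0:
        raise ValueError("blend has no remaining tokens")
    return {name: tokens / total for name, tokens in blend.items()}


def add_dataset(blend: Mapping[str, int], name: str, tokens: int) -> Blend:
    """Return a new blend that also contains `name`."""
    if name in blend:
        raise ValueError(f"{name!r} is already in the blend")
    return {**blend, name: tokens}


def remove_dataset(blend: Mapping[str, int], name: str) -> Blend:
    """Return a new blend without `name`."""
    if name not in blend:
        raise KeyError(name)
    return {k: v for k, v in blend.items() if k != name}


def remove_seen_tokens(blend: Mapping[str, int], seen: Mapping[str, int]) -> Blend:
    """Subtract already consumed tokens per dataset, clamping at zero."""
    return {
        name: max(tokens - seen.get(name, 0), 0)
        for name, tokens in blend.items()
    }
```

In this toy model the weights are recomputed from the remaining budgets after every change, so each operation returns a new blend instead of mutating the old one.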

alexdremov · 2025-05-27T09:35:23Z

scripts/tools/create_dataset_metadata.py

+
+        unwrapped_new_datasets = create_data_prefix([dataset])
+        new_megatron_datasets = [
+            self.create_megatron_dataset(i) for i in unwrapped_new_datasets


The only reason for creating the Megatron dataset is to determine the number of samples. This is kind of ugly, but I could not think of another solution.
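For intuition about the quantity being computed, the sample count can be approximated with simple arithmetic. This is a rough sketch under stated assumptions, not what `create_megatron_dataset` does: it assumes samples are fixed-length windows of `seq_length` tokens over a single epoch, with one extra token reserved for the shifted label. The function name `estimate_num_samples` and its parameters are hypothetical.

```python
def estimate_num_samples(num_tokens: int, seq_length: int, extra_token: bool = True) -> int:
    """Estimate how many fixed-length samples fit into `num_tokens`.

    Assumption: each sample spans `seq_length` tokens, and one extra token
    is needed overall so the last sample still has a next-token label.
    """
    if seq_length <= 0:
        raise ValueError("seq_length must be positive")
    usable = num_tokens - (1 if extra_token else 0)
    return max(usable // seq_length, 0)
```

Such an estimate ignores epoch handling, document boundaries and any other details the real dataset builder accounts for, so it serves as a sanity check on the count rather than a replacement for building the dataset.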

alexdremov · 2025-05-27T09:53:00Z

Verify that the current mixing-weight calculation is consistent with the main code. This can be done by removing and then re-adding the same dataset.
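The remove-then-add check could look roughly like the sketch below. It is written against generic callables so the script's real functions can be plugged in; the helper names `weights_close` and `roundtrip_preserves_weights`, the `toy_*` stand-ins, and the dict-based blend are all hypothetical and only illustrate the idea.

```python
import math
from typing import Callable, Dict

Blend = Dict[str, int]       # hypothetical: dataset prefix -> token budget
Weights = Dict[str, float]


def weights_close(a: Weights, b: Weights, rel_tol: float = 1e-9) -> bool:
    """True if both mappings have the same keys and nearly equal values."""
    return a.keys() == b.keys() and all(
        math.isclose(a[k], b[k], rel_tol=rel_tol) for k in a
    )


def roundtrip_preserves_weights(
    blend: Blend,
    name: str,
    remove: Callable[[Blend, str], Blend],
    add: Callable[[Blend, str, int], Blend],
    weights: Callable[[Blend], Weights],
) -> bool:
    """Remove `name`, add it back with the same size, and compare weights."""
    before = weights(blend)
    size = blend[name]
    restored = add(remove(blend, name), name, size)
    return weights_close(before, weights(restored))


# Toy stand-ins for the real implementations (assumptions, not the PR's code).
def toy_weights(blend: Blend) -> Weights:
    total = sum(blend.values())
    return {k: v / total for k, v in blend.items()}


def toy_remove(blend: Blend, name: str) -> Blend:
    return {k: v for k, v in blend.items() if k != name}


def toy_add(blend: Blend, name: str, tokens: int) -> Blend:
    return {**blend, name: tokens}
```

If the script's weights after the round trip differ from the weights produced by the main code for the original blend, the two calculations have diverged.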

datamixture

b8406b0

fixes

ed0857f
