alexdremov commented May 27, 2025

This script can process blended dataset metadata:

  • incorporate new datasets
  • remove datasets that are already in the blend
  • remove tokens that have already been seen
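
A minimal sketch of these three operations, for illustration only. It assumes the metadata can be reduced to a remaining-token count per dataset prefix and that mixing weights are proportional to those counts. `BlendMetadata` and its methods are hypothetical names, not the script's actual API or metadata format.

```python
from dataclasses import dataclass, field


@dataclass
class BlendMetadata:
    """Hypothetical stand-in for blended dataset metadata: tokens left per dataset prefix."""

    tokens: dict[str, int] = field(default_factory=dict)

    def add_dataset(self, prefix: str, num_tokens: int) -> None:
        # Incorporate a new dataset; refuse to silently overwrite an existing one.
        if prefix in self.tokens:
            raise ValueError(f"{prefix!r} is already in the blend")
        self.tokens[prefix] = num_tokens

    def remove_dataset(self, prefix: str) -> None:
        # Remove a dataset that is currently in the blend (KeyError if it is absent).
        del self.tokens[prefix]

    def remove_seen_tokens(self, seen: dict[str, int]) -> None:
        # Subtract tokens already consumed in training, clamping at zero.
        for prefix, consumed in seen.items():
            self.tokens[prefix] = max(self.tokens[prefix] - consumed, 0)

    def weights(self) -> dict[str, float]:
        # Assumption: weights are proportional to the remaining token counts.
        total = sum(self.tokens.values())
        if total == 0:
            raise ValueError("no tokens left in the blend")
        return {prefix: n / total for prefix, n in self.tokens.items()}


blend = BlendMetadata({"a": 600, "b": 400})
blend.add_dataset("c", 1000)
blend.remove_dataset("b")
blend.remove_seen_tokens({"a": 100})
print(blend.weights())
```

In this toy run, dataset `a` keeps 500 tokens and `c` keeps 1000, so `c` ends up with twice the weight of `a`.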

The processing procedure file is also committed in this PR (https://github.com/swiss-ai/Megatron-LM/pull/74/files#diff-d4fed1e9afb714170efaffdf51f31de4ffa2419a7fd029508fa496fb2ebfa7ba).


unwrapped_new_datasets = create_data_prefix([dataset])
new_megatron_datasets = [
    self.create_megatron_dataset(i) for i in unwrapped_new_datasets

The only reason for creating the Megatron dataset is to determine the number of samples. This is kind of ugly, but I could not think of another solution.

alexdremov commented May 27, 2025

  • verify that the current mixing weight calculation is consistent with the main code. This can be done by removing a dataset and then adding the same dataset back
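
A rough sketch of what such a round-trip check could look like. It assumes weights can be computed from a mapping of dataset name to sample count. `round_trip_is_consistent` and `weights_from_samples` are hypothetical helpers. In a real check, the weight function would be the script's own calculation and the baseline would come from the main training code.

```python
import math
from typing import Callable, Dict

Weights = Dict[str, float]


def weights_from_samples(samples: Dict[str, int]) -> Weights:
    # Placeholder calculation (an assumption): weight proportional to sample count.
    total = sum(samples.values())
    return {name: n / total for name, n in samples.items()}


def round_trip_is_consistent(
    samples: Dict[str, int],
    name: str,
    compute_weights: Callable[[Dict[str, int]], Weights],
    rel_tol: float = 1e-9,
) -> bool:
    """Remove `name`, add it back, and compare weights before and after."""
    before = compute_weights(samples)
    # Remove the dataset from the blend.
    without = {k: v for k, v in samples.items() if k != name}
    # Add it back with the same sample count; it now comes last in the mapping.
    restored = {**without, name: samples[name]}
    after = compute_weights(restored)
    return before.keys() == after.keys() and all(
        math.isclose(before[k], after[k], rel_tol=rel_tol) for k in before
    )


samples = {"a": 600, "b": 400, "c": 1000}
print(round_trip_is_consistent(samples, "b", weights_from_samples))
```

Because the re-added dataset moves to the end of the mapping, this check also flags any weight calculation that depends on dataset order.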
