
[feat/bug] Concatenated dataset takes too much resources #136


Closed
jlamypoirier opened this issue Jan 30, 2025 · 0 comments · Fixed by #146
Labels: bug, enhancement, Priority

@jlamypoirier
Collaborator

🧐 Problem Description

Concatenating datasets helps treat large datasets consisting of multiple files as one big dataset, but it turns out this causes trouble, because the sampling index arrays get too big. For example, if we're sampling 1T tokens at a 1K sequence length with an average of 100 tokens/doc, we get (see the back-of-the-envelope sketch after this list):

  • 10G doc idx entries (40 GB)
  • 1G sample idx entries (8 GB)
  • 1G shuffle idx entries (4 GB)
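
For reference, a back-of-the-envelope sketch of where those numbers come from (not Fast-LLM code; it just assumes 4-byte entries for the doc and shuffle indices and 8-byte entries for the sample index, which matches the sizes above):

```python
# Rough sizing of the sampling index arrays for 1T tokens @ 1K seqlen, 100 tokens/doc.
TOKENS = 10**12        # 1T tokens sampled
SEQ_LEN = 1024         # ~1K tokens per sequence
TOKENS_PER_DOC = 100   # average document length

num_docs = TOKENS // TOKENS_PER_DOC   # ~10G doc idx entries
num_samples = TOKENS // SEQ_LEN       # ~1G sample / shuffle idx entries

doc_idx_bytes = num_docs * 4          # assumed 4 bytes per doc idx entry
sample_idx_bytes = num_samples * 8    # assumed 8 bytes per sample idx entry
shuffle_idx_bytes = num_samples * 4   # assumed 4 bytes per shuffle idx entry

print(f"doc idx:     ~{doc_idx_bytes / 1e9:.0f} GB")
print(f"sample idx:  ~{sample_idx_bytes / 1e9:.0f} GB")
print(f"shuffle idx: ~{shuffle_idx_bytes / 1e9:.0f} GB")
```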

This is still somewhat manageable, but anything bigger will clearly cause memory issues. It also blocks two optimizations suggested in #132:

  • Sample on GPU: may cause OOM.
  • Sample in parallel: Hard to do with a single big dataset.

💡 Proposed Solution

A mix of two solutions:

  1. Split sampling into smaller chunks (one or multiple epochs). This is good for mid-sized datasets, but it won't work for bigger ones if we shuffle the whole dataset. We'd also need multiple arrays / files, so we need to be careful about the open file count. And it's only possible if we shuffle epochs independently.
  2. Split the dataset into smaller chunks, i.e., go back to blending dataset chunks that are sampled and shuffled independently. This is the most viable solution (see the sketch after this list). We can still solve [feat] Integrate dataset re-weighting and preprocessing into Fast-LLM for streamlined data loading #25 with hierarchical blending / concatenate_memmap. But we have more options than before thanks to concatenated datasets, and we don't have to match blended chunks with actual files (e.g., blend concatenations of multiple datasets, or concatenate and then split into chunks of a fixed size independent of the file structure, etc.).
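
A minimal, hypothetical sketch of option 2, just to illustrate the idea (the names, chunking scheme, and blending logic are illustrative, not the actual Fast-LLM implementation): each chunk keeps its own small shuffle index, and sampling blends across chunks, so no single index array has to cover the whole dataset.

```python
# Hypothetical sketch: split a concatenated dataset into fixed-size chunks
# (independent of the file structure), shuffle each chunk independently,
# and blend the chunks at sampling time.
import numpy as np

rng = np.random.default_rng(0)

# Pretend concatenated dataset: one global document index space.
num_docs = 1_000_000
chunk_size = 100_000  # documents per chunk, independent of the source files

# Each chunk gets its own small shuffle index instead of one huge global array.
chunks = [
    rng.permutation(np.arange(start, min(start + chunk_size, num_docs), dtype=np.int64))
    for start in range(0, num_docs, chunk_size)
]

# Blend: draw the next document from a chunk chosen in proportion to its size.
weights = np.array([len(c) for c in chunks], dtype=np.float64)
weights /= weights.sum()
cursors = [0] * len(chunks)

def next_document() -> int:
    """Return the next document index by blending independently shuffled chunks."""
    c = rng.choice(len(chunks), p=weights)
    doc = chunks[c][cursors[c] % len(chunks[c])]
    cursors[c] += 1
    return int(doc)

print([next_document() for _ in range(5)])
```

In practice the chunk boundaries would presumably be chosen by token count and the blending weights by the dataset config, but the point is that each per-chunk index stays small enough to build, shuffle, and sample independently (and in parallel, or on GPU).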