🧐 Problem Description
Concatenating datasets helps treat large datasets consisting of multiple files as one big dataset, but it turns out this causes trouble, because the sampling index arrays get too big. Ex. if we're sampling 1T tokens @ 1K seqlen, avg 100 tokens/doc, we'll have (see the back-of-the-envelope sketch after this list):
10G doc idx entries (40 GB)
1G sample idx entries (8 GB)
1G shuffle idx entries (4 GB)
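For reference, a quick back-of-the-envelope check of those numbers. The entry widths (int32 doc idx, an int32 pair per sample idx entry, int32 shuffle idx) are assumptions chosen to reproduce the sizes above, not read off the Fast-LLM code:

```python
# Rough size estimate for the sampling index arrays quoted above.
# Entry widths are assumptions (int32 doc idx, int32 pair per sample idx,
# int32 shuffle idx), picked to match the 40 GB / 8 GB / 4 GB figures.

tokens = 10**12        # 1T tokens sampled
seq_len = 1000         # 1K sequence length
tokens_per_doc = 100   # average document length

docs = tokens // tokens_per_doc   # ~10G document index entries
samples = tokens // seq_len       # ~1G samples

doc_idx_bytes = docs * 4              # int32 per entry       -> ~40 GB
sample_idx_bytes = samples * 2 * 4    # (doc, offset) int32s  -> ~8 GB
shuffle_idx_bytes = samples * 4       # int32 per entry       -> ~4 GB

for name, size in [("doc idx", doc_idx_bytes),
                   ("sample idx", sample_idx_bytes),
                   ("shuffle idx", shuffle_idx_bytes)]:
    print(f"{name}: {size / 10**9:.0f} GB")
```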
This is still somewhat manageable, but scaling further will clearly cause memory issues. It also blocks two optimizations suggested in #132:
Sample on GPU: may cause OOM.
Sample in parallel: hard to do with a single big dataset.
💡 Proposed Solution
A mix of two solutions:
Split sampling into smaller chunks (one or multiple epochs). Good for mid-sized datasets, but won't work for bigger ones if we shuffle the whole dataset. We'd also need multiple arrays / files, so we'd have to be careful about the open file count. And it's only possible if we shuffle epochs independently.
Split the dataset into smaller chunks, i.e. go back to blending dataset chunks, sampled and shuffled independently. This is the most viable solution (sketched below). We can still solve [feat] Integrate dataset re-weighting and preprocessing into Fast-LLM for streamlined data loading #25 with hierarchical blending / concatenate_memmap. But we have more options than before thanks to concatenated datasets, and we don't have to match blended chunks with actual files (ex. blend concatenations of multiple datasets, concatenate then split into chunks of a fixed size independent of the file structure, etc.)
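A minimal sketch of option 2, assuming a hypothetical `BlendedChunks` wrapper (the class name, the `chunks` / `weights` arguments, and the toy list chunks are all illustrative, not the Fast-LLM API): each chunk keeps its own small shuffle array, and a single top-level array decides which chunk each global sample is drawn from, so no index array ever scales with the concatenated dataset.

```python
import numpy as np


class BlendedChunks:
    """Toy sketch: blend chunks that are each sampled/shuffled independently.

    `chunks` is any sequence of random-access chunk datasets, each assumed to
    hold at least its share of the requested samples.
    """

    def __init__(self, chunks, weights, num_samples, seed=0):
        rng = np.random.RandomState(seed)
        weights = np.asarray(weights, dtype=np.float64)
        # Number of samples each chunk contributes (rounding details ignored).
        per_chunk = (weights / weights.sum() * num_samples).astype(np.int64)

        # Per-chunk shuffle: each index array only covers that chunk's share
        # of the samples, so nothing here scales with the full dataset.
        self._chunk_shuffles = [rng.permutation(int(n)) for n in per_chunk]
        self._chunks = chunks

        # Top-level blend: one small array saying which chunk each global
        # sample comes from, plus its position within that chunk's order.
        chunk_ids = np.repeat(np.arange(len(chunks)), per_chunk)
        rng.shuffle(chunk_ids)
        within = np.empty(len(chunk_ids), dtype=np.int64)
        for c in range(len(chunks)):
            mask = chunk_ids == c
            within[mask] = np.arange(mask.sum())
        self._chunk_ids = chunk_ids
        self._within_chunk = within

    def __len__(self):
        return len(self._chunk_ids)

    def __getitem__(self, index):
        c = self._chunk_ids[index]
        local = self._chunk_shuffles[c][self._within_chunk[index]]
        return self._chunks[c][local]


# Example: three toy chunks blended 50/30/20 into 10 samples.
chunks = [[f"chunk{c}-doc{d}" for d in range(10)] for c in range(3)]
dataset = BlendedChunks(chunks, weights=[5, 3, 2], num_samples=10)
print([dataset[i] for i in range(len(dataset))])
```

The trade-off is the one noted above: shuffling happens within each chunk plus across the chunk assignment, rather than over the whole concatenated dataset at once.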