🧐 Problem Description
Concatenating datasets helps treat large datasets consisting of multiple files as one big dataset, but it turns out this causes trouble, because the sampling index arrays get too big. Ex. if we're sampling 1T tokens @ 1K seqlen, avg 100 tokens/doc, we'll have (see the back-of-the-envelope sketch after this list):
10G doc idx entries (40 GB)
1G sample idx entries (8 GB)
1G shuffle idx entries (4 GB)
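For reference, a quick back-of-the-envelope check of those numbers. The entry widths (int32 doc idx, an int32 pair per sample idx entry, int32 shuffle idx) are assumptions chosen to reproduce the sizes above, not read off the Fast-LLM code:

```python
# Rough size estimate for the sampling index arrays quoted above.
# Entry widths are assumptions (int32 doc idx, int32 pair per sample idx,
# int32 shuffle idx), picked to match the 40 GB / 8 GB / 4 GB figures.

tokens = 10**12        # 1T tokens sampled
seq_len = 1000         # 1K sequence length
tokens_per_doc = 100   # average document length

docs = tokens // tokens_per_doc   # ~10G document index entries
samples = tokens // seq_len       # ~1G samples

doc_idx_bytes = docs * 4              # int32 per entry       -> ~40 GB
sample_idx_bytes = samples * 2 * 4    # (doc, offset) int32s  -> ~8 GB
shuffle_idx_bytes = samples * 4       # int32 per entry       -> ~4 GB

for name, size in [("doc idx", doc_idx_bytes),
                   ("sample idx", sample_idx_bytes),
                   ("shuffle idx", shuffle_idx_bytes)]:
    print(f"{name}: {size / 10**9:.0f} GB")
```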
This is still somewhat manageable, but scaling further will clearly cause memory issues. It also blocks two optimizations suggested in #132:
Sample on GPU: may cause OOM.
Sample in parallel: hard to do with a single big dataset.
💡 Proposed Solution
A mix of two solutions:
Split sampling into smaller chunks (one or multiple epochs). Good for mid-sized datasets, but won't work for bigger ones if we shuffle the whole dataset. We'd also need multiple arrays / files, so we'd have to be careful about the open file count. And it's only possible if we shuffle epochs independently.
Split the dataset into smaller chunks, i.e. go back to blending dataset chunks, sampled and shuffled independently. This is the most viable solution (sketched below). We can still solve [feat] Integrate dataset re-weighting and preprocessing into Fast-LLM for streamlined data loading #25 with hierarchical blending / concatenate_memmap. But we have more options than before thanks to concatenated datasets, and we don't have to match blended chunks with actual files (ex. blend concatenations of multiple datasets, concatenate then split into chunks of a fixed size independent of the file structure, etc.)
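A minimal sketch of option 2, assuming a hypothetical `BlendedChunks` wrapper (the class name, the `chunks` / `weights` arguments, and the toy list chunks are all illustrative, not the Fast-LLM API): each chunk keeps its own small shuffle array, and a single top-level array decides which chunk each global sample is drawn from, so no index array ever scales with the concatenated dataset.

```python
import numpy as np


class BlendedChunks:
    """Toy sketch: blend chunks that are each sampled/shuffled independently.

    `chunks` is any sequence of random-access chunk datasets, each assumed to
    hold at least its share of the requested samples.
    """

    def __init__(self, chunks, weights, num_samples, seed=0):
        rng = np.random.RandomState(seed)
        weights = np.asarray(weights, dtype=np.float64)
        # Number of samples each chunk contributes (rounding details ignored).
        per_chunk = (weights / weights.sum() * num_samples).astype(np.int64)

        # Per-chunk shuffle: each index array only covers that chunk's share
        # of the samples, so nothing here scales with the full dataset.
        self._chunk_shuffles = [rng.permutation(int(n)) for n in per_chunk]
        self._chunks = chunks

        # Top-level blend: one small array saying which chunk each global
        # sample comes from, plus its position within that chunk's order.
        chunk_ids = np.repeat(np.arange(len(chunks)), per_chunk)
        rng.shuffle(chunk_ids)
        within = np.empty(len(chunk_ids), dtype=np.int64)
        for c in range(len(chunks)):
            mask = chunk_ids == c
            within[mask] = np.arange(mask.sum())
        self._chunk_ids = chunk_ids
        self._within_chunk = within

    def __len__(self):
        return len(self._chunk_ids)

    def __getitem__(self, index):
        c = self._chunk_ids[index]
        local = self._chunk_shuffles[c][self._within_chunk[index]]
        return self._chunks[c][local]


# Example: three toy chunks blended 50/30/20 into 10 samples.
chunks = [[f"chunk{c}-doc{d}" for d in range(10)] for c in range(3)]
dataset = BlendedChunks(chunks, weights=[5, 3, 2], num_samples=10)
print([dataset[i] for i in range(len(dataset))])
```

The trade-off is the one noted above: shuffling happens within each chunk plus across the chunk assignment, rather than over the whole concatenated dataset at once.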