Add force_sample_level to split_dataset_by_node by muyihao · Pull Request #8268 · huggingface/datasets

muyihao · 2026-06-13T14:52:49Z

Allow users to explicitly request sample-level sharding for streaming `IterableDataset` even when `num_shards % world_size == 0`. Default behavior is unchanged.

When `num_physical_files < world_size`, or when shards divide evenly but are too few or too imbalanced for shard-level sharding, applying sample-level sharding before expensive `map`/`filter` avoids duplicating those transformations across nodes:

ds = split_dataset_by_node(ds, rank, world_size, force_sample_level=True)
ds = ds.map(tokenize_fn)  # only the examples this rank consumes

Map-style datasets ignore the flag.
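To illustrate the difference between the two strategies, here is a minimal pure-Python sketch that does not call `datasets` at all. The helpers `shard_level` and `sample_level` are hypothetical stand-ins, not library functions, and the strided shard assignment is an assumption made for illustration; the sample-level helper mirrors the "keep 1 example out of `world_size`" behavior described for iterable datasets. `tokenize_fn` here is a dummy that only records which examples it sees.

```python
from itertools import islice


def shard_level(shards, rank, world_size):
    """Give each rank a disjoint subset of whole shards.

    The strided assignment (rank, rank + world_size, ...) is an assumption
    for illustration, not necessarily how the library assigns shards.
    """
    if len(shards) % world_size != 0:
        raise ValueError("shard-level splitting needs num_shards % world_size == 0")
    for shard in shards[rank::world_size]:
        yield from shard


def sample_level(shards, rank, world_size):
    """Every rank walks all shards but keeps 1 example out of every world_size."""
    stream = (example for shard in shards for example in shard)
    yield from islice(stream, rank, None, world_size)


# Two shards that divide evenly across two ranks but are badly imbalanced.
shards = [[0, 1, 2, 3, 4, 5], [6, 7]]
world_size = 2

by_shard = [list(shard_level(shards, r, world_size)) for r in range(world_size)]
by_sample = [list(sample_level(shards, r, world_size)) for r in range(world_size)]

tokenized = []  # records every example the stand-in tokenizer sees


def tokenize_fn(example):
    tokenized.append(example)
    return example * 10


# Transforming after the sample-level split touches only rank 0's examples.
rank0_output = [tokenize_fn(ex) for ex in sample_level(shards, 0, world_size)]
```

In this toy setup, shard-level splitting leaves rank 0 with six examples and rank 1 with two, while sample-level splitting gives each rank four, and the stand-in tokenizer runs only on the four examples rank 0 keeps. The trade-off is that under sample-level splitting every rank still reads all shards.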

Closes #8253


muyihao · 2026-06-13T14:54:51Z

@lhoestq, hi, could you please review this PR? Thanks!

muyihao wants to merge 1 commit into huggingface:main from muyihao:feat/split-dataset-by-node-strategy
