Add force_sample_level to split_dataset_by_node #8268

Open
muyihao wants to merge 1 commit into huggingface:main from muyihao:feat/split-dataset-by-node-strategy

muyihao commented Jun 13, 2026

Allow users to explicitly request sample-level sharding for streaming `IterableDataset` even when `num_shards % world_size == 0`. Default behavior is unchanged.

When `num_physical_files < world_size`, or when shards divide evenly but are too few or too imbalanced for shard-level sharding, applying sample-level sharding before expensive `map`/`filter` avoids duplicating those transformations across nodes:

    ds = split_dataset_by_node(ds, rank, world_size, force_sample_level=True)
    ds = ds.map(tokenize_fn)  # only the examples this rank consumes

Map-style datasets ignore the flag.
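
To make the difference between the two strategies concrete, here is a toy pure-Python model. It does not use the `datasets` library and is not the code in this PR: `shard_level`, `sample_level`, and `count_map_calls` are hypothetical helpers. Shards are modeled as plain lists, and shard-level assignment is assumed to be strided across ranks. It counts how many times an expensive transform runs on each rank when the split happens before the `map`.

```python
from typing import Iterable, Iterator, List

Shards = List[List[int]]


def shard_level(shards: Shards, rank: int, world_size: int) -> Iterator[int]:
    """Toy shard-level split: each rank reads only its own subset of shards."""
    for shard in shards[rank::world_size]:
        yield from shard


def sample_level(shards: Shards, rank: int, world_size: int) -> Iterator[int]:
    """Toy sample-level split: every rank scans all shards but keeps
    one example out of every `world_size`."""
    index = 0
    for shard in shards:
        for example in shard:
            if index % world_size == rank:
                yield example
            index += 1


def count_map_calls(stream: Iterable[int]) -> int:
    """Apply a stand-in 'expensive' transform lazily and count its calls."""
    calls = 0

    def expensive(x: int) -> int:
        nonlocal calls
        calls += 1
        return x * 2

    for _ in map(expensive, stream):
        pass
    return calls


# Two shards that divide evenly across two ranks but are badly imbalanced.
shards: Shards = [[0], [1, 2, 3, 4, 5, 6, 7]]
world_size = 2

shard_counts = [
    count_map_calls(shard_level(shards, rank, world_size))
    for rank in range(world_size)
]
sample_counts = [
    count_map_calls(sample_level(shards, rank, world_size))
    for rank in range(world_size)
]

print(shard_counts)   # [1, 7]
print(sample_counts)  # [4, 4]
```

In this model the shard-level split leaves one rank with 1 example and the other with 7, while the sample-level split gives each rank 4. Because the split is applied before the transform, each rank only pays for the examples it keeps, although every rank still reads all of the raw data.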

Closes #8253

muyihao commented Jun 13, 2026

@lhoestq, hi, could you please help review this PR? Thanks!

Development

Successfully merging this pull request may close these issues.

Feature request: Add a streaming_shard operator for early sample-level sharding when file-level sharding is insufficient
