@finbarrtimbers finbarrtimbers commented Oct 31, 2025

This lets us use multiple processes to load datasets.

Runs:

  • Single GPU finetune job: Beaker
  • Single GPU GRPO run: Beaker

Note

Standardizes parallel dataset loading by using max_num_processes() for num_proc across the repository.

  • Performance/Parallelism:
    • Add open_instruct.utils.max_num_processes() helper and apply it to nearly all datasets.load_dataset(...) calls across the codebase to set num_proc for parallel data loading.
    • Update modules and scripts (e.g., decontamination/*, open_instruct/dataset_transformation.py, quantize/*, scripts/data/*, evaluation/util scripts) to import the helper and pass num_proc consistently.
  • Refactor:
    • Minor import adjustments to reference open_instruct.utils and max_num_processes where used.
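As a rough illustration, the helper could be as simple as the following minimal sketch (hypothetical; the actual `open_instruct.utils.max_num_processes()` may compute or cap the count differently):

```python
import os


def max_num_processes() -> int:
    """Return the number of worker processes to use for dataset loading.

    Hypothetical sketch: the real open_instruct.utils.max_num_processes()
    may compute or cap this differently.
    """
    # os.cpu_count() can return None on some platforms; fall back to 1.
    return max(1, os.cpu_count() or 1)


# Typical call site -- datasets.load_dataset accepts a num_proc argument:
#     ds = load_dataset("some/dataset", num_proc=max_num_processes())
```

Passing the same helper everywhere keeps the `num_proc` setting consistent across all `load_dataset(...)` calls instead of hard-coding per-script values.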

Written by Cursor Bugbot for commit a5e6d55.

@finbarrtimbers finbarrtimbers marked this pull request as ready for review October 31, 2025 22:16
cursor[bot]

This comment was marked as outdated.

@natolambert natolambert left a comment


LGTM - I didn't explicitly test any of the changes

@finbarrtimbers finbarrtimbers added this pull request to the merge queue Nov 3, 2025
Merged via the queue into main with commit 1a2b79a Nov 3, 2025
4 checks passed
@finbarrtimbers finbarrtimbers deleted the parallel-dataset branch November 3, 2025 19:17
3 participants