@finbarrtimbers finbarrtimbers commented Oct 31, 2025

This lets us use multiple processes to load datasets.

Runs:

  • Single GPU finetune job: Beaker
  • Single GPU GRPO run: Beaker

Note

Standardizes parallel dataset loading by using max_num_processes() for num_proc across the repository.

  • Performance/Parallelism:
    • Add open_instruct.utils.max_num_processes() helper and apply it to nearly all datasets.load_dataset(...) calls across the codebase to set num_proc for parallel data loading.
    • Update modules and scripts (e.g., decontamination/*, open_instruct/dataset_transformation.py, quantize/*, scripts/data/*, evaluation/util scripts) to import the helper and pass num_proc consistently.
  • Refactor:
    • Minor import adjustments to reference open_instruct.utils and max_num_processes where used.
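As a rough illustration, the helper could be as simple as the following minimal sketch (hypothetical; the actual `open_instruct.utils.max_num_processes()` may compute or cap the count differently):

```python
import os


def max_num_processes() -> int:
    """Return the number of worker processes to use for dataset loading.

    Hypothetical sketch: the real open_instruct.utils.max_num_processes()
    may compute or cap this differently.
    """
    # os.cpu_count() can return None on some platforms; fall back to 1.
    return max(1, os.cpu_count() or 1)


# Typical call site -- datasets.load_dataset accepts a num_proc argument:
#     ds = load_dataset("some/dataset", num_proc=max_num_processes())
```

Passing the same helper everywhere keeps the `num_proc` setting consistent across all `load_dataset(...)` calls instead of hard-coding per-script values.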

Written by Cursor Bugbot for commit a5e6d55.

@finbarrtimbers finbarrtimbers marked this pull request as ready for review October 31, 2025 22:16
cursor[bot]

This comment was marked as outdated.

@natolambert natolambert left a comment


LGTM - I didn't explicitly test any of the changes

@finbarrtimbers finbarrtimbers added this pull request to the merge queue Nov 3, 2025
Merged via the queue into main with commit 1a2b79a Nov 3, 2025
4 checks passed
@finbarrtimbers finbarrtimbers deleted the parallel-dataset branch November 3, 2025 19:17
3 participants