This repository contains distributed training recipes orchestrated with Metaflow. The examples demonstrate different approaches to distributed training of large language models (LLMs) using frameworks such as TorchTune and Hugging Face TRL.
The repository is organized into separate examples, each showcasing a different distributed training approach:
- TorchTune: Implementation of supervised fine-tuning (SFT) using Meta's TorchTune framework.
- Hugging Face GRPO: Implementation of Group Relative Policy Optimization (GRPO) training using Hugging Face's TRL library.
To run these examples you will need:
- Outerbounds and `metaflow-torchrun` installed (`pip install outerbounds metaflow-torchrun`)
- A Kubernetes cluster with GPU nodes (the examples are configured for H100 GPUs)
- Access to model weights (most examples pull from Hugging Face Hub)
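As a rough illustration of how these examples are wired together, the sketch below shows a minimal Metaflow flow that launches a multi-node torchrun job via the `metaflow-torchrun` extension. The step names, resource requests, and `train.py` entrypoint are placeholders rather than code from this repository; see the individual example flows for the actual configurations.

```python
from metaflow import FlowSpec, step, kubernetes, torchrun, current


class DistributedTrainFlow(FlowSpec):
    """Minimal sketch: a two-node torchrun job orchestrated by Metaflow."""

    @step
    def start(self):
        # Fan out into num_parallel workers; each worker becomes one torchrun node.
        self.next(self.train, num_parallel=2)

    @kubernetes(gpu=8, memory=600_000)  # placeholder resource requests
    @torchrun
    @step
    def train(self):
        # metaflow-torchrun exposes current.torch.run, which launches the
        # training script with torchrun across all participating nodes.
        current.torch.run(
            entrypoint="train.py",            # hypothetical training script
            entrypoint_args={"epochs": "1"},  # hypothetical arguments
        )
        self.next(self.join)

    @step
    def join(self, inputs):
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    DistributedTrainFlow()
```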
The TorchTune example demonstrates supervised fine-tuning for large language models; as illustrated in the sketch after this list, it supports:
- Fully Sharded Data Parallel (FSDP) training
- Multi-node distributed training
- Checkpoint resumption
- Gradient accumulation
- Activation checkpointing and offloading
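These features are implemented inside TorchTune's own recipes; the plain-PyTorch sketch below only illustrates the underlying ideas (FSDP sharding, gradient accumulation, and activation checkpointing) with assumed model and data loaders. It is not the recipe code used in this example, and real recipes apply activation checkpointing per transformer layer rather than to the whole forward pass.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.utils.checkpoint import checkpoint


def train(model, loader, accum_steps=4):
    # One process per GPU, launched by torchrun.
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    # FSDP shards parameters, gradients, and optimizer state across ranks.
    model = FSDP(model.cuda())
    optim = torch.optim.AdamW(model.parameters(), lr=2e-5)

    for step_idx, (inputs, labels) in enumerate(loader):
        # Activation checkpointing: recompute activations during backward
        # instead of storing them, trading extra compute for memory.
        # (Shown here on the whole model for brevity.)
        logits = checkpoint(model, inputs.cuda(), use_reentrant=False)
        loss = torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), labels.cuda().view(-1)
        )

        # Gradient accumulation: average the loss over accum_steps
        # micro-batches before taking a single optimizer step.
        (loss / accum_steps).backward()
        if (step_idx + 1) % accum_steps == 0:
            optim.step()
            optim.zero_grad()
```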
See the TorchTune README for more details.
The Hugging Face GRPO example demonstrates training with Group Relative Policy Optimization (GRPO); as shown in the sketch after this list, it:
- Uses reward functions such as accuracy and formatting checks
- Leverages Hugging Face's TRL library
- Supports multi-node training with DeepSpeed
- Shows integration with various accelerator configurations
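For orientation, the sketch below shows the general shape of a GRPO setup in TRL: reward functions are plain Python callables that score generated completions and are passed to `GRPOTrainer`. The model name, dataset, and reward logic here are illustrative assumptions, not the configuration used in this example.

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer


def format_reward(completions, **kwargs):
    # Illustrative formatting reward: 1.0 if the completion wraps its
    # reasoning in <think>...</think> tags, otherwise 0.0.
    return [
        1.0 if "<think>" in c and "</think>" in c else 0.0
        for c in completions
    ]


# Hypothetical prompt dataset; the actual example defines its own data
# and an accuracy reward that checks answers against references.
dataset = load_dataset("trl-lib/tldr", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder model id
    reward_funcs=[format_reward],
    args=GRPOConfig(output_dir="grpo-sketch"),
    train_dataset=dataset,
)
trainer.train()
```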
See the Hugging Face GRPO README for more details.