
Update to NGC PyTorch release 25.03 #219


Open · wants to merge 1 commit into main

Conversation

@tscholak (Collaborator) commented Apr 2, 2025

✨ Description

This PR updates the Dockerfile base image from nvcr.io/nvidia/pytorch:24.11-py3 to nvcr.io/nvidia/pytorch:25.03-py3.

The new base image brings updated versions of CUDA, PyTorch, cuDNN, NCCL, RAPIDS, and other key components. Notably, it includes PyTorch 2.7.0 RC1 (2.7.0a0+7c8ec84dab).

This change ensures compatibility with newer hardware and libraries, unlocks recent performance improvements, and aligns us with the most up-to-date NVIDIA PyTorch ecosystem.

Benchmarks and stability checks are still required to confirm that training behavior is unchanged.

🔍 Type of change

  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)

📝 Changes

  1. Updated Dockerfile base image to nvcr.io/nvidia/pytorch:25.03-py3.
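
The change itself is a one-line diff. A sketch is below, assuming the base image is set directly in the `FROM` line (the actual Dockerfile may pin it via a build `ARG` instead):

```diff
-FROM nvcr.io/nvidia/pytorch:24.11-py3
+FROM nvcr.io/nvidia/pytorch:25.03-py3
```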

🔄 Package Version Changes

| Package | Previous Version | Updated Version | Key User-Facing Changes or Enhancements |
|---|---|---|---|
| PyTorch | 2.6.0a0+df5bbc09d1 | 2.7.0a0+7c8ec84dab (RC1) | Still pre-release; use with caution. |
| NVIDIA CUDA | 12.6.3 | 12.8.1.012 | Better compatibility with newer GPUs, bug fixes, minor performance improvements. |
| NVIDIA cuDNN | 9.5.1.17 | 9.8.0.87 | Adds support for CUDA compute capabilities 12.0 and 10.0; improves inference performance on newer architectures. |
| NVIDIA NCCL | 2.22.3 | 2.25.1 | Automatically loads XML topology; better multi-node configuration out of the box. |
| NVIDIA RAPIDS™ | 24.10 | 25.02 | Docker images now use CUDA 12.8; no major user-facing API changes, but improves compatibility with the rest of the stack. |
| rdma-core | 39.0 | 50.0 | Adds support for new RDMA adapters (Alibaba iWarp, Azure); only relevant in specialized network setups. |
| Nsight Compute | 2024.3.2.3 | 2025.1.1.2 | Enables kernel profiling on the Microsoft Compute Driver Model (MCDM); useful for profiling on Windows setups. |
| Nsight Systems | 2024.6.1.90 | 2025.1.1.110 | Adds Dask and PyTorch profiling support and trace enhancements for new GPU generations (Blackwell). |
| TensorRT™ | 10.6.0.26 | 10.9.0.34 | Internal performance tuning for inference workloads; no new APIs exposed in public docs. |
| Torch-TensorRT | 2.6.0a0 | 2.7.0a0 | ~4x speedup in the graph isomorphism check via a new hash function; improves compilation time for large models. |
| NVIDIA DALI® | 1.43 | 1.47 | Adds CUDA 12.8 support, graph-level optimizations (common subgraph elimination), and more flexible operator specs. |
| JupyterLab | 4.2.5 | 4.3.5 | Fixes common UI bugs (undo, scrolling); improves accessibility and editor contrast. |
| TransformerEngine | 1.12 | 2.1 | Better optimizer API compatibility, improvements in Mixture-of-Experts (MoE) routing, support for new attention formats. |
| TensorRT Model Optimizer | 0.19.0 | 0.23 | Adds FP8/FP6/FP4 quantization support, Blackwell compatibility, and the ability to export a quantized lm_head. |
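
To confirm the key versions inside the updated container, a quick sanity check using standard torch introspection APIs (expected values per the table above; this is a sketch, not part of the PR):

```python
# Print the component versions reported by the torch build in the container.
import torch

print("PyTorch:", torch.__version__)              # expect 2.7.0a0+7c8ec84dab
print("CUDA:   ", torch.version.cuda)             # expect 12.8
print("cuDNN:  ", torch.backends.cudnn.version())
print("NCCL:   ", torch.cuda.nccl.version())      # needs a CUDA-enabled torch build
```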

✅ Checklist

General

  • 📜 I have read and followed the contributing guidelines.
  • 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
  • 🎉 The functionality is complete, and I have tested the changes.
  • 📝 I have updated the documentation if needed.
  • ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
  • 🧩 I have commented my code, especially in hard-to-understand areas.

Dependencies and Configuration

  • 🐋 I have updated the Docker configuration or dependencies, if applicable.
  • 🔄 I have ensured compatibility with the existing setup after dependency changes.

Testing

  • 🧪 I have added or updated tests to cover my changes.
  • ✔️ New and existing tests pass locally with my changes.
  • 🚦 I have tested these changes on GPUs and verified training stability.
  • 🏋️ I have tested the changes on realistic training workloads, if applicable.

Performance Impact

  • 📊 I have run benchmarks where applicable to evaluate the performance impact.
  • ✅ The benchmarks show no performance regression.
  • 🚀 The benchmarks indicate a potential performance improvement.
  • ⚠️ The benchmarks indicate a potential performance degradation.
  • 📈 I have provided benchmark results and detailed any performance impact below, if applicable.

📊 Performance Impact Details

Benchmarks still need to be run to validate training behavior and performance on the updated stack. This includes verifying training throughput, GPU utilization, memory footprint, and loss curves.
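
As a starting point for those checks, a minimal throughput-and-peak-memory probe might look like the sketch below. `model` and `batch` are placeholders for a real training setup; this complements rather than replaces the full benchmark suite:

```python
import time
import torch

def probe_training_step(model, batch, warmup=5, iters=20):
    """Rough throughput and peak-memory probe for a bare training step."""
    device = next(model.parameters()).device
    torch.cuda.reset_peak_memory_stats(device)
    start = None
    for i in range(warmup + iters):
        if i == warmup:  # start timing only after warmup iterations
            torch.cuda.synchronize(device)
            start = time.perf_counter()
        loss = model(batch).mean()  # placeholder forward pass + loss
        loss.backward()
        model.zero_grad(set_to_none=True)
    torch.cuda.synchronize(device)
    elapsed = time.perf_counter() - start
    print(f"{iters / elapsed:.2f} steps/s, peak memory "
          f"{torch.cuda.max_memory_allocated(device) / 2**30:.2f} GiB")
```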

@jlamypoirier (Collaborator) left a comment


There is no pre-built flash-attn wheel for torch 2.7 (https://github.com/Dao-AILab/flash-attention/releases/tag/v2.7.4.post1), and the same goes for mamba (#194), so I recommend we wait.
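
For context: flash-attn wheels are compiled against a specific torch release, so with no matching wheel, pip falls back to a lengthy source build, and an ABI-mismatched wheel typically fails at import time. A minimal check of whether the installed pair loads together (a sketch, not part of this PR):

```python
# Verify that the installed flash-attn wheel loads against the current torch.
import torch

print("torch:", torch.__version__)
try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError as err:
    # A missing package and an ABI mismatch both surface as ImportError here.
    print("flash-attn unavailable:", err)
```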
