
Update to NGC PyTorch release 25.03 #219


Open · wants to merge 1 commit into main

Conversation

@tscholak (Collaborator) commented Apr 2, 2025

✨ Description

This PR updates the Dockerfile base image from nvcr.io/nvidia/pytorch:24.11-py3 to nvcr.io/nvidia/pytorch:25.03-py3.

The new base image brings updated versions of CUDA, PyTorch, cuDNN, NCCL, RAPIDS, and other key components. Notably, it includes PyTorch 2.7.0 RC1 (2.7.0a0+7c8ec84dab).

This change ensures compatibility with newer hardware and libraries, unlocks recent performance improvements, and aligns us with the most up-to-date NVIDIA PyTorch ecosystem.

Benchmarks and stability checks are still required to confirm that training behavior is unchanged.

🔍 Type of change

  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)

📝 Changes

  1. Updated Dockerfile base image to nvcr.io/nvidia/pytorch:25.03-py3.
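
The change itself is a one-line diff. A sketch is below, assuming the base image is set directly in the `FROM` line (the actual Dockerfile may pin it via a build `ARG` instead):

```diff
-FROM nvcr.io/nvidia/pytorch:24.11-py3
+FROM nvcr.io/nvidia/pytorch:25.03-py3
```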

🔄 Package Version Changes

| Package | Previous Version | Updated Version | Key User-Facing Changes or Enhancements |
|---|---|---|---|
| PyTorch | 2.6.0a0+df5bbc09d1 | 2.7.0a0+7c8ec84dab (RC1) | Still pre-release; use with caution. |
| NVIDIA CUDA | 12.6.3 | 12.8.1.012 | Better compatibility with newer GPUs, bug fixes, minor performance improvements. |
| NVIDIA cuDNN | 9.5.1.17 | 9.8.0.87 | Adds support for CUDA compute capabilities 12.0 and 10.0; improves inference performance on newer architectures. |
| NVIDIA NCCL | 2.22.3 | 2.25.1 | Automatically loads XML topology; better multi-node configuration out of the box. |
| NVIDIA RAPIDS™ | 24.10 | 25.02 | Docker images now use CUDA 12.8; no major user-facing API changes, but improves compatibility with the rest of the stack. |
| rdma-core | 39.0 | 50.0 | Adds support for new RDMA adapters (Alibaba iWarp, Azure); only relevant in specialized network setups. |
| Nsight Compute | 2024.3.2.3 | 2025.1.1.2 | Enables kernel profiling on the Microsoft Compute Driver Model (MCDM); useful for profiling on Windows setups. |
| Nsight Systems | 2024.6.1.90 | 2025.1.1.110 | Adds Dask and PyTorch profiling support and trace enhancements for new GPU generations (Blackwell). |
| TensorRT™ | 10.6.0.26 | 10.9.0.34 | Internal performance tuning for inference workloads; no new APIs exposed in public docs. |
| Torch-TensorRT | 2.6.0a0 | 2.7.0a0 | ~4x speedup in the graph isomorphism check via a new hash function; improves compilation time for large models. |
| NVIDIA DALI® | 1.43 | 1.47 | Adds CUDA 12.8 support, graph-level optimizations (common subgraph elimination), and more flexible operator specs. |
| JupyterLab | 4.2.5 | 4.3.5 | Fixes common UI bugs (undo, scrolling); improves accessibility and editor contrast. |
| TransformerEngine | 1.12 | 2.1 | Better optimizer API compatibility, improvements in Mixture-of-Experts (MoE) routing, support for new attention formats. |
| TensorRT Model Optimizer | 0.19.0 | 0.23 | Adds FP8/FP6/FP4 quantization support, Blackwell compatibility, and the ability to export a quantized lm_head. |
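
To confirm the key versions inside the updated container, a quick sanity check using standard torch introspection APIs (expected values per the table above; this is a sketch, not part of the PR):

```python
# Print the component versions reported by the torch build in the container.
import torch

print("PyTorch:", torch.__version__)              # expect 2.7.0a0+7c8ec84dab
print("CUDA:   ", torch.version.cuda)             # expect 12.8
print("cuDNN:  ", torch.backends.cudnn.version())
print("NCCL:   ", torch.cuda.nccl.version())      # needs a CUDA-enabled torch build
```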

✅ Checklist

General

  • 📜 I have read and followed the contributing guidelines.
  • 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
  • 🎉 The functionality is complete, and I have tested the changes.
  • 📝 I have updated the documentation if needed.
  • ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
  • 🧩 I have commented my code, especially in hard-to-understand areas.

Dependencies and Configuration

  • 🐋 I have updated the Docker configuration or dependencies, if applicable.
  • 🔄 I have ensured compatibility with the existing setup after dependency changes.

Testing

  • 🧪 I have added or updated tests to cover my changes.
  • ✔️ New and existing tests pass locally with my changes.
  • 🚦 I have tested these changes on GPUs and verified training stability.
  • 🏋️ I have tested the changes on realistic training workloads, if applicable.

Performance Impact

  • 📊 I have run benchmarks where applicable to evaluate the performance impact.
  • ✅ The benchmarks show no performance regression.
  • 🚀 The benchmarks indicate a potential performance improvement.
  • ⚠️ The benchmarks indicate a potential performance degradation.
  • 📈 I have provided benchmark results and detailed any performance impact below, if applicable.

📊 Performance Impact Details

Benchmarks still need to be run to validate training behavior and performance on the updated stack. This includes verifying training throughput, GPU utilization, memory footprint, and loss curves.
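
As a starting point for those checks, a minimal throughput-and-peak-memory probe might look like the sketch below. `model` and `batch` are placeholders for a real training setup; this complements rather than replaces the full benchmark suite:

```python
import time
import torch

def probe_training_step(model, batch, warmup=5, iters=20):
    """Rough throughput and peak-memory probe for a bare training step."""
    device = next(model.parameters()).device
    torch.cuda.reset_peak_memory_stats(device)
    start = None
    for i in range(warmup + iters):
        if i == warmup:  # start timing only after warmup iterations
            torch.cuda.synchronize(device)
            start = time.perf_counter()
        loss = model(batch).mean()  # placeholder forward pass + loss
        loss.backward()
        model.zero_grad(set_to_none=True)
    torch.cuda.synchronize(device)
    elapsed = time.perf_counter() - start
    print(f"{iters / elapsed:.2f} steps/s, peak memory "
          f"{torch.cuda.max_memory_allocated(device) / 2**30:.2f} GiB")
```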

@jlamypoirier (Collaborator) left a comment


There is no pre-built flash-attn wheel for torch 2.7 (https://github.com/Dao-AILab/flash-attention/releases/tag/v2.7.4.post1), and the same goes for mamba (#194), so I recommend we wait.
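
For context: flash-attn wheels are compiled against a specific torch release, so with no matching wheel, pip falls back to a lengthy source build, and an ABI-mismatched wheel typically fails at import time. A minimal check of whether the installed pair loads together (a sketch, not part of this PR):

```python
# Verify that the installed flash-attn wheel loads against the current torch.
import torch

print("torch:", torch.__version__)
try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError as err:
    # A missing package and an ABI mismatch both surface as ImportError here.
    print("flash-attn unavailable:", err)
```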
