26 Jun 00:26

dhuangnm

a0ba49b

v0.7.1.3

What's Changed

Allow requests 2.32.5+ versions by @dhuangnm in #2859

Full Changelog: 0.7.1.2...0.7.1.3

Contributors

dhuangnm

15 Jun 13:02

dsikka

0.12.0

6d2a090

v0.12.0 Latest

LLM Compressor v0.12.0 Release Notes

This release upgrades to Transformers v5 with improved MoE support, streamlines the dataset interface, and adds multi-GPU acceleration for model-free PTQ. Major highlights include comprehensive Transformers v5 integration with refactored MoE linearization, a simplified dataset split API that removes legacy multi-stage logic, multi-GPU distribution for model-free PTQ workflows, and expanded model coverage with Nemotron Ultra FP8 examples.

This release changes the example scripts in ways that affect backwards compatibility with previous examples and scripts. See the Transformers v5 section below for more information.

Key Highlights ✨

Transformers v5 Upgrade (#2647): Full integration with Transformers v5, including refactored MoE linearization with load_context for efficient loading, updated model structure handling, and improved tied embeddings support. Maintains LM eval performance across the transition. Note: LLM Compressor no longer supports installation with transformers<5.0.0.
Simplified Dataset Interface (#2551): Removed legacy multi-split logic, replacing splits={"calibration": "train[:100]"} with a cleaner split="train[:100]" API. Legacy argument usage is deprecated and will be removed in a future release.
Multi-GPU Model-Free PTQ (#2773): Added support for distributing model-free PTQ jobs across multiple GPUs, parallelizing and speeding up quantization workflows.
Nemotron Ultra Support (#2803): Added FP8 quantization example for Nemotron Ultra models in the model-free PTQ examples.

Transformers v5

Examples and Model Loading

Example regexes and recipes have been updated to reflect new model structures introduced by Transformers v5
Examples which utilize disk offloading or mixture-of-experts (MoE) calibration now load models with load_context provided by llmcompressor.utils. This context is a catch-all context and should be used in all scripts for efficient model loading.

- from compressed_tensors.offload import load_offloaded_model
- from llmcompressor.modeling.moe.linearize import load_quantizable_moe
- 
- with load_offloaded_model(), load_quantizable_moe():
-     model = AutoModelForCausalLM.from_pretrained(model_id)

+ from llmcompressor.utils import load_context
+ 
+ with load_context():
+     model = AutoModelForCausalLM.from_pretrained(model_id)

dtype now defaults to "auto", so this explicit argument has been removed to reduce verbosity

- model = AutoModelForCausalLM.from_pretrained(model_id, dtype="auto")
+ model = AutoModelForCausalLM.from_pretrained(model_id)

from_pretrained no longer supports use_auth_token. This argument has been removed from oneshot

Expanded and Refactored MoE Support

Applying quantization to Mixture-of-Experts (MoE) models requires explicit linearization and class overriding in order to efficiently calibrate experts. This logic has been implemented by LLM Compressor through two pathways:
- llmcompressor.modeling.moe.linearize::linearize_moe, which replaces experts modules with linearized and calibration-friendly classes AFTER weights have already been loaded
- llmcompressor.modeling.moe.linearize::load_quantizable_moe, which replaces experts modules with linearized and calibration-friendly classes BEFORE weights have been loaded. This context is more efficient and reduces runtime during model loading.

Both of these pathways are called as needed by llmcompressor.utils::load_context. These implementations can automatically handle >90% of all model definitions provided by transformers. For unconventional or custom model definitions, see Adding MoE Calibration Support for a New Model.
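
To illustrate what "linearizing" experts can mean, here is a minimal pure-Python sketch. It assumes the experts are stored as a single fused [num_experts, out, in] weight and splits it into one small linear layer per expert, so each expert becomes its own module that calibration can observe. The Linear class and linearize_experts function are hypothetical stand-ins for illustration only; the real implementation operates on torch modules and is not shown in these notes.

```python
from dataclasses import dataclass
from typing import List

Matrix = List[List[float]]  # rows = output features, columns = input features


@dataclass
class Linear:
    """Hypothetical stand-in for a per-expert linear layer."""

    weight: Matrix

    def __call__(self, x: List[float]) -> List[float]:
        # Plain matrix-vector product: one dot product per output row.
        return [sum(w * v for w, v in zip(row, x)) for row in self.weight]


def linearize_experts(fused_weight: List[Matrix]) -> List[Linear]:
    """Split a fused [num_experts, out, in] weight into one Linear per expert."""
    return [Linear(weight=[row[:] for row in expert]) for expert in fused_weight]


# Two experts, each mapping 2 inputs to 1 output.
fused = [[[1.0, 0.0]], [[0.0, 2.0]]]
experts = linearize_experts(fused)
outputs = [expert([3.0, 4.0]) for expert in experts]
print(outputs)  # [[3.0], [8.0]]
```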

Multi-GPU Model-Free PTQ

Model-free PTQ now supports distributing quantization jobs across multiple GPUs when available. This feature automatically detects available GPUs and parallelizes the quantization workflow, significantly reducing processing time for large models.
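
The release notes do not describe how work is divided between GPUs. As a rough sketch of one plausible scheme, the snippet below assigns checkpoint shard files to devices round-robin. The function name assign_round_robin and the shard file names are hypothetical, and this is not the library's actual scheduling logic.

```python
from typing import Dict, List


def assign_round_robin(items: List[str], num_devices: int) -> Dict[int, List[str]]:
    """Spread work items over devices so each device gets a near-equal share."""
    if num_devices < 1:
        raise ValueError("num_devices must be at least 1")
    assignment: Dict[int, List[str]] = {d: [] for d in range(num_devices)}
    for index, item in enumerate(items):
        assignment[index % num_devices].append(item)
    return assignment


# Hypothetical shard names for a five-shard checkpoint.
shards = [f"model-{i:05d}-of-00005.safetensors" for i in range(1, 6)]
plan = assign_round_robin(shards, num_devices=2)
for device, files in plan.items():
    print(f"cuda:{device} -> {len(files)} shard(s)")
```

With five shards and two devices, device 0 receives three shards and device 1 receives two.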

Simplified Dataset Interface

The dataset split interface has been refactored to remove legacy multi-stage logic that previously supported separate datasets for training, oneshot, and eval in a single command. Since training and eval tasks are no longer supported in the same command, the API has been simplified.

Old interface:

oneshot(
  model,
  dataset="ultrachat",
  splits={"calibration": "train_sft[:100]"}
)

New interface:

oneshot(
  model,
  dataset="ultrachat",
  split="train_sft[:100]"
)

The new API is backwards compatible and will issue warnings when using the old dictionary-based splits argument.
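
The backwards-compatible behavior described above could look roughly like the following shim. This is a sketch under stated assumptions rather than the library's code: resolve_split is a hypothetical helper, and the exact warning type and message used by LLM Compressor are not given in these notes.

```python
import warnings
from typing import Mapping, Optional, Union


def resolve_split(
    split: Optional[str] = None,
    splits: Union[str, Mapping[str, str], None] = None,
) -> Optional[str]:
    """Hypothetical shim: map the legacy `splits` argument onto `split`."""
    if splits is None:
        return split
    if split is not None:
        raise ValueError("Pass either `split` or the legacy `splits`, not both")
    warnings.warn(
        "`splits` is deprecated; use `split` instead",
        DeprecationWarning,
        stacklevel=2,
    )
    if isinstance(splits, str):
        return splits
    if len(splits) != 1:
        raise ValueError("Only a single split is supported")
    # Legacy form: {"calibration": "train_sft[:100]"} -> "train_sft[:100]"
    return next(iter(splits.values()))


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy = resolve_split(splits={"calibration": "train_sft[:100]"})

modern = resolve_split(split="train_sft[:100]")
print(legacy == modern, len(caught))  # True 1
```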

Nemotron 3 Ultra Examples

This release adds model-free PTQ examples for NVIDIA's Nemotron-3-Ultra-550B model.
Pre-quantized FP8 checkpoints are available on HuggingFace:

- NVIDIA-Nemotron-3-Ultra-550B-A55B-FP8-dynamic
- NVIDIA-Nemotron-3-Ultra-550B-A55B-FP8-block

See the model-free PTQ examples for usage details.

Breaking Changes

The minimum transformers version has been bumped up to v5.9

New Contributors

@JINO-ROHIT made their first contribution in #2773
@u7k4rs6 made their first contribution in #2779
@soyr-redhat made their first contribution in #2794
@arpitkh101 made their first contribution in #2589
@Priya95715 made their first contribution in #2768

Full Changelog: 0.11.0...0.12.0

Contributors

arpitkh101, JINO-ROHIT, and 3 other contributors

02 Jun 17:19

dhuangnm

0.11.0

8dc4851

v0.11.0

LLM Compressor v0.11.0

This release focuses on distributed computing enhancements, quantization lifecycle improvements, and expanded model support. Major highlights include DDP support for AWQ and SmoothQuant with significant speedups (up to 3.2x), a comprehensive refactor of the Compressed Tensors API, and observer/lifecycle refactors that simplify quantization workflows. New model support includes Qwen 3.5/3.6, Gemma 4, Kimi K2.6, and experimental DeepSeek-V4 support along with quantized checkpoints.

Note: LLM Compressor v0.11.0 removes support for Sparse24 quantization formats and sparse model compression. This decision was made based upon lack of community interest and maintainability concerns. Support for sparse compression may be re-introduced as part of a future release. For Sparse24 compression support, please use LLM Compressor v0.10.0.2.

Key Highlights ✨

DDP Performance: AWQ and SmoothQuant now support DDP with 2.9-3.2x speedups and up to 51% memory reduction per GPU (with 4 GPUs).
Compressed Tensors Refactor: Simplified API with clear entrypoints, removed sparsity support, streamlined compressor architecture
Quantization Lifecycle: Unified calibration timing (now at epoch end), decoupled observation from qparam calculation
Extended Quantization Support: GPTQ actorder now works across all weight strategies, AWQ refactored for NVFP4 compatibility
Converter Entrypoint: New tool and framework for converting from AutoAWQ and ModelOpt NVFP4 to Compressed-Tensors, as well as decompressing Compressed-Tensors checkpoints
Large Model Support: DDP+GPTQ+disk offloading fixes for models like Qwen3-VL-235B-A22B

DDP and Lifecycle Updates

AWQ DDP Support: Added DDP (Distributed Data Parallel) functionality for AWQ resulting in significant speedups and reduced GPU memory usage:

Model	Single-GPU Time	DDP Time (4 GPUs)	Speedup	Single-GPU Memory	DDP Memory	Memory Reduction
Llama-3-8B	7.02 min	2.40 min	2.9x	10.20 GB	4.99 GB	51%
Llama-3-8B (masked)	8.13 min	2.67 min	3.0x	10.14 GB	4.98 GB	51%
Qwen3-30B-A3B	459.65 min	143.90 min	3.2x	4.13 GB	3.36 GB	19%

Accuracy metrics remain comparable between DDP and single-GPU approaches.
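
The Speedup and Memory Reduction columns follow directly from the measured times and memory figures. The short script below recomputes them from the table values; the helper names are illustrative only.

```python
# (single-GPU minutes, DDP minutes, single-GPU GB, DDP GB) taken from the table above
rows = {
    "Llama-3-8B": (7.02, 2.40, 10.20, 4.99),
    "Llama-3-8B (masked)": (8.13, 2.67, 10.14, 4.98),
    "Qwen3-30B-A3B": (459.65, 143.90, 4.13, 3.36),
}


def speedup(single: float, ddp: float) -> float:
    """How many times faster the DDP run is."""
    return single / ddp


def memory_reduction(single: float, ddp: float) -> float:
    """Fraction of per-GPU memory saved by the DDP run."""
    return 1.0 - ddp / single


summary = {
    name: (round(speedup(t1, t4), 1), round(100 * memory_reduction(m1, m4)))
    for name, (t1, t4, m1, m4) in rows.items()
}
for name, (factor, percent) in summary.items():
    print(f"{name}: {factor}x faster, {percent}% less memory per GPU")
```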

SmoothQuant DDP Support: Added DDP support for SmoothQuant resulting in significant speedups:

GPUs	Total Time	Peak GPU Mem	Speedup
1 GPU	94.1 min	8.93 GB	1.00x
2 GPU	58.7 min	7.06 GB	1.60x
4 GPU	28.7 min	7.06 GB	3.28x

Special thanks to @dzhengAP for their excellent contributions to the SmoothQuantModifier!

Observer Refactor: Decoupled observation from quantization parameter calculation, allowing natural separation of responsibilities where observer.forward() records statistics about observed tensors while get_qparams() performs qparam calculation. This simplifies design and expands the types of observers supported. Key changes:
- Observers now have update_statistics_from_observed() for forward pass and get_qparams() for parameter calculation
- Global scale logic now entirely contained in observers (observers have references to fused weight observers for global_scale calculation)
- Removed module references from observers, simplified observer utilities
- Fixed imatrix observer synchronization in DDP and imatrix+global_scale bug
- Consolidated synchronization logic with new activation_statistics concept for activation observers and one weight observer

DDP Support for Activation Quantization: Added DDP support for quantization schemes with activation quantization. Extended QuantizationModifier to support distributed activation calibration via PR #2391 (merged Mar 27, 2026).

Implementation: At SEQUENTIAL_EPOCH_END and CALIBRATION_EPOCH_END, activation observer min/max values are all-reduced across ranks. Scale/zero-point are then recomputed from the global statistics so all ranks have identical quantization parameters.
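
The synchronization step can be sketched without a real process group by simulating the ranks as list entries. In the snippet below, all_reduce_min_max stands in for a MIN/MAX all-reduce, and compute_qparams uses a standard asymmetric int8 formula. Both functions, the sample statistics, and the choice of quantization scheme are assumptions for illustration, not the code from the PR.

```python
from typing import List, Tuple

# (min, max) activation statistics recorded independently on each rank.
# The values are made up for illustration.
per_rank_stats: List[Tuple[float, float]] = [
    (-0.25, 1.55),
    (-1.00, 0.75),
    (-0.50, 1.00),
    (-0.10, 0.20),
]


def all_reduce_min_max(stats: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Stand-in for all_reduce(MIN) on the minima and all_reduce(MAX) on the maxima."""
    return min(lo for lo, _ in stats), max(hi for _, hi in stats)


def compute_qparams(min_val: float, max_val: float, num_bits: int = 8) -> Tuple[float, int]:
    """Asymmetric scale / zero-point for a signed integer grid (assumed scheme)."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    # The representable range must contain zero.
    min_val, max_val = min(min_val, 0.0), max(max_val, 0.0)
    scale = (max_val - min_val) / (qmax - qmin) or 1.0  # guard against all-zero input
    zero_point = round(qmin - min_val / scale)
    return scale, max(qmin, min(qmax, zero_point))


# Without synchronization, every rank would derive different parameters.
local_qparams = [compute_qparams(lo, hi) for lo, hi in per_rank_stats]

# With synchronization, all ranks recompute from the same global statistics.
global_min, global_max = all_reduce_min_max(per_rank_stats)
shared_qparams = compute_qparams(global_min, global_max)
print(len(set(local_qparams)), "distinct local results vs. one shared result")
```

Because every rank recomputes from the same global minimum and maximum, the resulting scale and zero-point are identical everywhere, which is the property the implementation note describes.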

Key Changes:
- Added synchronize(), recompute_qparams(), recompute_global_scale() to Observer base class
- Added sync_activation_observers() to QuantizationMixin (shared by QuantizationModifier and GPTQModifier)
- Batch all async dist.all_reduce operations and wait once, matching GPTQ DDP pattern

DDP+GPTQ+Disk Offloading for Large Models: Added fixes and features to enable DDP+GPTQ+disk offloading to work for very large models (e.g., Qwen3-VL-235B-A22B). Key improvements include:
- Reduced shared memory overload and mmap issues for big models with DDP + CPU/disk offloading
- Fixed MoE calibration context to use same offloading as original module (previously reverted to CPU offloading causing issues)
- Only store original modules when needed to avoid mmap issues
- Added synchronization steps during model saving to prevent thread timeout issues
- Added sync points for MoE calibration context to handle NCCL timeout when different threads take varying time on large models
- Fixed NVFP4 DDP support on A100 (NCCL broadcast workaround for FP8)
- Reduced memory requirements of moe_calibration_context by removing retained module references after replacement

Distributed Model Compression: Accelerate the model compression step (bit packing) by assigning modules across ranks and compressing them in parallel, greatly reducing runtime for large models, scaling linearly with the number of GPUs available.

Quantization Lifecycle Refactor: Altered quantization lifecycle so weight and activation calibration both now happen on epoch end (previously weight calibration happened at start for QuantizationModifier but end for other modifiers). Benefits include simpler code, faster runtime due to reduced on/offloading during quantization, and quantization now disabled across the board during calibration (previously modifier-dependent).

Microscale Calibration Refactor: Refactored microscale formats which require fused global_scale calculation. Rather than treating global scale as a generic qparam in the observer with additional post-modifications, the observer is now entirely responsible for global_scale. Observers are now fused (made aware of other observers with which they share a global_scale) so they can calculate a joint global_scale. Note: this requires that all fused observers have generated statistics through their forward method. This massively simplifies global_scale handling while maintaining accuracy.
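
The idea of fused observers can be sketched in a few lines. The FusedObserver class and fuse helper below are hypothetical, and the global-scale formula (FP8 E4M3 maximum times FP4 E2M1 maximum, divided by the joint absolute maximum) is an assumed NVFP4-style convention rather than something stated in these notes.

```python
from typing import Iterable, List, Optional

FP8_E4M3_MAX = 448.0  # largest finite FP8 E4M3 value
FP4_E2M1_MAX = 6.0    # largest FP4 E2M1 value


class FusedObserver:
    """Hypothetical observer that shares a global scale with its fused peers."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.amax: Optional[float] = None
        self.fused: List["FusedObserver"] = [self]

    def forward(self, values: Iterable[float]) -> None:
        # Record statistics only; no qparams are computed here.
        seen = max(abs(v) for v in values)
        self.amax = seen if self.amax is None else max(self.amax, seen)

    def get_global_scale(self) -> float:
        # Every fused observer must have seen data before a joint scale exists.
        if any(obs.amax is None for obs in self.fused):
            raise RuntimeError("all fused observers must run forward() first")
        joint_amax = max(obs.amax for obs in self.fused)
        return FP8_E4M3_MAX * FP4_E2M1_MAX / joint_amax


def fuse(*observers: FusedObserver) -> None:
    """Make each observer aware of the full group it shares a scale with."""
    group = list(observers)
    for obs in observers:
        obs.fused = group


q, k, v = FusedObserver("q_proj"), FusedObserver("k_proj"), FusedObserver("v_proj")
fuse(q, k, v)
q.forward([0.5, -2.0])
k.forward([4.0, 1.0])
v.forward([-1.0, 0.25])
scales = [obs.get_global_scale() for obs in (q, k, v)]
print(scales)  # [672.0, 672.0, 672.0]
```

All three observers report the same value because the scale is derived from the joint maximum (4.0 here), not from each observer's own statistics.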

New Model Support

Qwen 3.5 and Qwen 3.6: Calibration support has been added as part of this release with instructions summarized in the documentation for Qwen3.5 and Qwen3.6. Several quantized checkpoints have also been released, including:
- RedHatAI/Qwen3.6-35B-A3B-NVFP4
- RedHatAI/Qwen3.6-35B-A3B-FP8-dynamic

Gemma 4: Calibration support has been added with details listed in the documentation for Gemma 4. Several quantized checkpoints have also been released, including:

Kimi K2.6: This model was originally released in W4A16 packed quantized format. Decompression support has been enabled through the converters entrypoint and calibration support has also been added with details listed in the documentation for Kimi K2.6. Quantized checkpoints have also been released:
- RedHatAI/Kimi-K2.6-NVFP4
- RedHatAI/Kimi-K2.6-FP8-BLOCK

DeepSeek-V4: Support for quantization of DeepSeekV4 Flash and Pro models. These features are currently available via experimental branches, but are planned for integration as part of the next release of LLM Compressor. More details can be found here. Sample checkpoint:
- RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8

Converter Entrypoint (Compressed-Tensors)

Model Format Conversion: Added Converter entrypoint to enable decompression and conversion of models from various packed quantized formats to Compressed-Tensors format. Currently supports:
- AutoAWQ to CT conversion
- Compressed-Tensors Decompression
- ModelOpt NVFP4 to CT Conversion
- FP8 Block Decompression (popularized by DeepSeek)
More details: https://docs.vllm.ai/projects/llm-compressor/en/latest/guides/entrypoints/convert/

Compressed Tensors

Contributors

markypizz, prdeepakbabu, and 25 other contributors

06 May 20:04

dhuangnm

0.7.1.2

e481603

v0.7.1.2

What's Changed

[For 0.7.1.2] Update pillow version to fix security issue by @dhuangnm in #2683

Full Changelog: 0.7.1.1...0.7.1.2

Contributors

dhuangnm

05 May 14:09

dhuangnm

0.9.0.3

f7b4d7a

v0.9.0.3

What's Changed

[For 0.9.0.3] Update pillow to fix security issue for release 0.9.0.3 by @dhuangnm in #2660

Full Changelog: 0.9.0.2...0.9.0.3

Contributors

dhuangnm

01 May 13:22

dhuangnm

0.10.0.2

5a1552e

v0.10.0.2

What's Changed

[For 0.10.0.2] Update pillow to fix security issue for release by @dhuangnm in #2661

Full Changelog: 0.10.0.1...0.10.0.2

Contributors

dhuangnm

13 Mar 14:41

dhuangnm

0.10.0.1

6a6cfec

v0.10.0.1

What's Changed

[Patch Release] Update compressed-tensors version in setup.py by @dsikka in #2466

Full Changelog: 0.10.0...0.10.0.1

Contributors

dsikka

04 Mar 23:50

dhuangnm

0.7.1.1

068af0c

v0.7.1.1

What's Changed

Update pillow upper bound to 12.1.1 for 0.7.1.n release by @dhuangnm in #2429

Full Changelog: 0.7.1...0.7.1.1

Contributors

dhuangnm

02 Mar 16:23

dhuangnm

0.10.0

bdb6547

v0.10.0

LLM Compressor v0.10.0

We're excited to announce LLM Compressor v0.10.0! This release brings significant performance improvements, updated quantization capabilities, and enhanced model offloading support.

Highlights:

Distributed GPTQ with major performance improvements
Enhanced compressed-tensors offloading (disk and distributed)
Migration from accelerate to compressed-tensors offloading
GPTQ support for FP4 microscale schemes (NVFP4, MXFP4)
MXFP4 accuracy improvements

Distributed Data Parallel (DDP) GPTQ Support ✨

GPTQ now supports fully distributed functionality, resulting in significant speedups across the board.

Performance Benchmarks

model_id	world_size	max_time	max_memory	save_time	flex_extract	eval_time
Meta-Llama-3-8B-Instruct	1	745.03	5.82	19.57	0.7066	95.28
Meta-Llama-3-8B-Instruct	2	372.20	5.57	49.10	0.7089	95.24
Meta-Llama-3-8B-Instruct	4	264.07	5.82	52.50	0.7180	96.74
Qwen3-30B-A3B	1	14207.53	6.56	748.23	0.8704	209.93
Qwen3-30B-A3B	2	7018.25	6.36	696.65	0.8810	205.89
Qwen3-30B-A3B	4	3694.46	6.36	723.05	0.8832	217.62

GPTQ takes advantage of the underlying DDP improvements for calibration and adds on weight parallel compression. We also improved non-DDP GPTQ to be more accurate, resulting in a free 5% GSM8K accuracy improvement for Meta-Llama-3-8B-Instruct.
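
To put the max_time column in perspective, the script below derives the speedup and parallel efficiency (speedup divided by world size) from the benchmark table. The table does not state the time unit, but the ratios are unitless either way. The scaling helper is illustrative only.

```python
from typing import Dict, Tuple

# max_time per world_size, copied from the benchmark table above
max_time: Dict[str, Dict[int, float]] = {
    "Meta-Llama-3-8B-Instruct": {1: 745.03, 2: 372.20, 4: 264.07},
    "Qwen3-30B-A3B": {1: 14207.53, 2: 7018.25, 4: 3694.46},
}


def scaling(times: Dict[int, float]) -> Dict[int, Tuple[float, float]]:
    """Return {world_size: (speedup, parallel efficiency)} relative to one GPU."""
    baseline = times[1]
    return {
        world_size: (round(baseline / t, 2), round(baseline / t / world_size, 2))
        for world_size, t in times.items()
    }


results = {model: scaling(times) for model, times in max_time.items()}
for model, by_size in results.items():
    for world_size, (factor, efficiency) in by_size.items():
        print(f"{model} x{world_size}: {factor}x speedup, {efficiency:.0%} efficiency")
```

On these numbers, the larger Qwen3-30B-A3B model scales close to linearly up to 4 GPUs (about 3.85x), while Meta-Llama-3-8B-Instruct reaches about 2.82x.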

An example leveraging DDP with GPTQ can be found here. With 2 GPUs, run it using the following command:

torchrun --nproc_per_node=2 llama3_ddp_example.py

Enhanced Compressed-Tensors Offloading ✨

Compressed-tensors now supports loading transformers models that are offloaded to disk and/or offloaded across distributed process ranks.

Disk Offloading

Disk offloading allows users to load and compress very large models which normally would not fit in CPU memory.

Usage:

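from compressed_tensors.offload import offloaded_model
from transformers import AutoModelForCausalLM

with offloaded_model():  # enables CT offloading while the model loads
    model = AutoModelForCausalLM.from_pretrained(
        model_id,  # Hugging Face model ID or local path
        device_map="auto_offload",  # load onto cpu, fall back to disk if too large
        offload_folder="./offload_folder",
    )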

Examples:

Kimi-K2 with NVFP4

Distributed Offloading

When loading offloaded models across distributed process ranks, offloaded_model ensures that the offloaded model memory is shared between ranks.

Usage:

from compressed_tensors.offload import dist_init, offloaded_model
from datasets import load_dataset
from transformers import AutoModelForCausalLM
from llmcompressor.datasets.utils import get_rank_partition  # import path assumed

dist_init()  # initialize distributed process group
with offloaded_model():  # enables CT offloading
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto_offload",  # set device map
        offload_folder="./offload_folder",
    )

# (optional) partition the dataset so each rank does not load the full dataset into CPU memory
ds = load_dataset(
    DATASET_ID,
    split=get_rank_partition(DATASET_SPLIT, NUM_CALIBRATION_SAMPLES)
)
# note: oneshot will also partition the data if it detects DDP and all ranks have the same data

# the rest of the flow is unchanged: set up modifiers, call oneshot, etc.

Invoke the script with:

torchrun --nproc_per_node=<num_processes> script.py
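
For intuition about the optional dataset partitioning step, the sketch below shows one way a per-rank split string could be built using the Hugging Face datasets slicing syntax. The rank_partition function is a hypothetical illustration that takes the rank explicitly; it is not the library's get_rank_partition, whose exact behavior (for example, how leftover samples are handled) is not described in these notes.

```python
def rank_partition(split: str, num_samples: int, rank: int, world_size: int) -> str:
    """Build a split string selecting this rank's contiguous share of the samples."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    per_rank = num_samples // world_size  # any remainder is dropped in this sketch
    start = rank * per_rank
    return f"{split}[{start}:{start + per_rank}]"


partitions = [rank_partition("train_sft", 512, rank, world_size=4) for rank in range(4)]
print(partitions)
# ['train_sft[0:128]', 'train_sft[128:256]', 'train_sft[256:384]', 'train_sft[384:512]']
```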

Offload Options Reference

Non-Distributed Mode:

device_map	"auto"	"cuda"	"cpu"	"auto_offload"
offloaded_model required?	No	No	No	Yes
Behavior	Try to load model onto all visible cuda devices. Fallback to cpu and disk if model too large	Try to load model onto first cuda device only. Error if model is too large	Try to load model onto cpu. Error if the model is too large	Try to load model onto cpu. Fallback to disk if model is too large
LLM Compressor Use Case	Recommended for "basic" pipeline			Recommended for "sequential" pipeline

Distributed Mode:

device_map	"auto"	"cuda"	"cpu"	"auto_offload"
offloaded_model required?	Yes	Yes	Yes	Yes
Behavior	Try to load model onto device 0, then broadcast replicas to other devices. Fallback to cpu and disk if model too large	Try to load model onto device 0 only, then broadcast replicas to other devices. Error if model is too large	Try to load model onto cpu. Error if the model is too large	Try to load model onto cpu. Fallback to disk if model is too large
LLM Compressor Use Case	Recommended for "basic" pipeline			Recommended for "sequential" pipeline

For more information regarding the behavior and options for loading offloaded models, see the compressed-tensors PR #572.
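
The "offloaded_model required?" rows of the two tables above reduce to a simple rule, encoded here as a small lookup helper. The function and constant names are hypothetical; only the rule itself comes from the tables.

```python
DEVICE_MAPS = ("auto", "cuda", "cpu", "auto_offload")

# Pipelines recommended in the tables for each device_map (same in both modes).
RECOMMENDED_PIPELINE = {"auto": "basic", "auto_offload": "sequential"}


def requires_offloaded_model(device_map: str, distributed: bool) -> bool:
    """Return True if loading must happen inside the offloaded_model context."""
    if device_map not in DEVICE_MAPS:
        raise ValueError(f"unknown device_map: {device_map!r}")
    # Distributed mode always requires it; otherwise only "auto_offload" does.
    return distributed or device_map == "auto_offload"


for device_map in DEVICE_MAPS:
    print(
        device_map,
        requires_offloaded_model(device_map, distributed=False),
        requires_offloaded_model(device_map, distributed=True),
    )
```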

Migration from Accelerate to Compressed-Tensors Offloading

Important: LLM Compressor v0.10 no longer uses the offloading logic provided by Hugging Face's accelerate library and instead integrates with the model offloading provided by compressed-tensors.

Benefits of Compressed-Tensors Offloading

The compressed-tensors offloading implementation provides many practical benefits over the accelerate library:

Built for dynamic workloads - CT offloading was designed for use cases like LLM Compressor's, where parameters are added and removed to modules, and module offloads can dynamically change.
Universal model compatibility - The architecture of accelerate offloading meant that many transformer models did not fully support it. Adding support often required changes and patches to the model definition. By contrast, CT offloading does not require any modifications to model definitions and works with full transparency across all transformer model definitions we've tested.
Better performance - CT offloading is often faster and requires lower peak memory than accelerate offloading because of its lazy-loading implementation, in which individual parameters are onloaded only when required.
Distributed support - CT offloading supports distributed offloads coordinated between process ranks, allowing for models to be offloaded across ranks for parallelized workloads.
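
The lazy-loading behavior mentioned in the list above can be illustrated with a toy example. The LazyParameter class below is a hypothetical sketch of the general technique (load on first access, release on offload), not the compressed-tensors implementation.

```python
from typing import Callable, List, Optional


class LazyParameter:
    """Hypothetical parameter that is only loaded when first accessed."""

    def __init__(self, name: str, loader: Callable[[], List[float]]) -> None:
        self.name = name
        self._loader = loader
        self._value: Optional[List[float]] = None

    def onload(self) -> List[float]:
        # Load on first use, then serve the cached copy.
        if self._value is None:
            self._value = self._loader()
        return self._value

    def offload(self) -> None:
        # Drop the in-memory copy; the next onload() reloads it.
        self._value = None


load_log: List[str] = []


def make_loader(name: str) -> Callable[[], List[float]]:
    def load() -> List[float]:
        load_log.append(name)  # record each simulated read from disk
        return [0.0] * 4

    return load


names = ("layer0.weight", "layer1.weight", "layer2.weight")
params = {name: LazyParameter(name, make_loader(name)) for name in names}

params["layer1.weight"].onload()
params["layer1.weight"].onload()  # served from cache, no second load
print(load_log)  # ['layer1.weight']
```

Only the parameter that was actually touched is ever read, which is why peak memory can stay low when modules are processed one at a time.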

Models such as qwen2_audio, whisper, and others are now supported without requiring patches to the model definition.

For further details on DDP and offloading support, see the Big Models and Distributed Support guide

GPTQ FP4 Microscale Schemes Support

GPTQ now supports FP4 microscale schemes including NVFP4 and MXFP4. Applying GPTQ to these schemes can result in improved recovery and overall quantization accuracy.

Examples:

MXFP4 Accuracy Improvements

MXFP4 support has been updated with accuracy improvements for its weight scale generation. The updated models can now be validated in vLLM using the marlin kernel when doing weight-only quantization (MXFP4A16).

This is supported as of vLLM v0.14.0: compressed_tensors_w4a16_mxfp4.py

Note: MXFP4 with activation quantization is not yet enabled in vLLM for compressed-tensors models.

AWQ Performance Improvements

Large scale refactor and optimization of AWQ resulted in:

5-10% speedup on dense models
1-5% speedup on MoE models

Package Updates

AutoRound is now a required package
An optional extra, qwen, has been added for pre-processing utilities (e.g., qwen_vl_utils) that can be used for Qwen VL examples, such as https://github.com/vllm-project/llm-compressor/blob/main/examples/multimodal_vision/qwen3_vl_example.py