Support loading Quark quantized models in Transformers #36372

Merged — 22 commits, Mar 20, 2025

Changes from all commits (22 commits)
1f87b7d
add quark quantizer
fxmarty-amd Feb 21, 2025
c405adb
add quark doc
fxmarty-amd Feb 21, 2025
eb189de
clean up doc
fxmarty-amd Feb 21, 2025
36d18cf
fix tests
fxmarty-amd Feb 21, 2025
8d233b4
make style
fxmarty-amd Feb 21, 2025
5f24cee
more style fixes
fxmarty-amd Feb 21, 2025
d275c87
cleanup imports
fxmarty-amd Feb 21, 2025
f5e1817
cleaning
fxmarty-amd Feb 24, 2025
70e30fa
precise install
fxmarty-amd Feb 24, 2025
05efcb0
Merge branch 'main' into quark-quantizer-upstream
fxmarty-amd Mar 7, 2025
ea2b62e
Merge branch 'quark-quantizer-upstream' of https://github.com/fxmarty…
fxmarty-amd Mar 7, 2025
c2e5ba0
Update docs/source/en/quantization/quark.md
fxmarty-amd Mar 7, 2025
9ee20b1
Update tests/quantization/quark_integration/test_quark.py
fxmarty-amd Mar 7, 2025
9b0c135
Update src/transformers/utils/quantization_config.py
fxmarty-amd Mar 7, 2025
a1b2c8b
remove import guard as suggested
fxmarty-amd Mar 7, 2025
93d8480
update copyright headers
fxmarty-amd Mar 10, 2025
2be83a1
add quark to transformers-quantization-latest-gpu Dockerfile
fxmarty-amd Mar 10, 2025
3f76848
make tests pass on transformers main + quark==0.7
fxmarty-amd Mar 10, 2025
fda836f
add missing F8_E4M3 and F8_E5M2 keys from str_to_torch_dtype
fxmarty-amd Mar 10, 2025
d8ca5e5
Merge remote-tracking branch 'origin/main' into quark-quantizer-upstream
BowenBao Mar 13, 2025
7da2a57
Merge remote-tracking branch 'origin/main' into quark-quantizer-upstream
BowenBao Mar 19, 2025
f6dbb79
Merge branch 'main' into quark-quantizer-upstream
MekkCyber Mar 20, 2025
3 changes: 3 additions & 0 deletions docker/transformers-quantization-latest-gpu/Dockerfile
@@ -79,6 +79,9 @@ RUN git clone https://github.com/NetEase-FuXi/EETQ.git && cd EETQ/ && git submod
# Add compressed-tensors for quantization testing
RUN python3 -m pip install --no-cache-dir compressed-tensors

# Add AMD Quark for quantization testing
RUN python3 -m pip install --no-cache-dir amd-quark

# Add transformers in editable mode
RUN python3 -m pip install --no-cache-dir -e ./transformers[dev-torch]

2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -187,6 +187,8 @@
title: Optimum
- local: quantization/quanto
title: Quanto
- local: quantization/quark
title: Quark
- local: quantization/torchao
title: torchao
- local: quantization/spqr
4 changes: 4 additions & 0 deletions docs/source/en/main_classes/quantization.md
@@ -88,3 +88,7 @@ Learn how to quantize models in the [Quantization](../quantization) guide.
## FineGrainedFP8Config

[[autodoc]] FineGrainedFP8Config

## QuarkConfig

[[autodoc]] QuarkConfig
3 changes: 2 additions & 1 deletion docs/source/en/quantization/overview.md
@@ -40,6 +40,7 @@ Use the Space below to help you pick a quantization method depending on your har
| [VPTQ](./vptq) | 🔴 | 🔴 | 🟢 | 🟡 | 🔴 | 🔴 | 🟢 | 1/8 | 🔴 | 🟢 | 🟢 | https://github.com/microsoft/VPTQ |
| [FINEGRAINED_FP8](./finegrained_fp8) | 🟢 | 🔴 | 🟢 | 🔴 | 🔴 | 🔴 | 🔴 | 8 | 🔴 | 🟢 | 🟢 | |
| [SpQR](./spqr) | 🔴 | 🔴 | 🟢 | 🔴 | 🔴 | 🔴 | 🟢 | 3 | 🔴 | 🟢 | 🟢 | https://github.com/Vahe1994/SpQR/ |
| [Quark](./quark.md) | 🔴 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 | ? | 2/4/6/8/9/16 | 🔴 | 🔴 | 🟢 | https://quark.docs.amd.com/latest/ |

## Resources

@@ -55,4 +56,4 @@ If you are looking for a user-friendly quantization experience, you can use the
* [Bitsandbytes Space](https://huggingface.co/spaces/bnb-community/bnb-my-repo)
* [GGUF Space](https://huggingface.co/spaces/ggml-org/gguf-my-repo)
* [MLX Space](https://huggingface.co/spaces/mlx-community/mlx-my-repo)
* [AuoQuant Notebook](https://colab.research.google.com/drive/1b6nqC7UZVt8bx4MksX7s656GXPM-eWw4?usp=sharing#scrollTo=ZC9Nsr9u5WhN)
* [AuoQuant Notebook](https://colab.research.google.com/drive/1b6nqC7UZVt8bx4MksX7s656GXPM-eWw4?usp=sharing#scrollTo=ZC9Nsr9u5WhN)
84 changes: 84 additions & 0 deletions docs/source/en/quantization/quark.md
@@ -0,0 +1,84 @@
<!--Copyright 2025 Advanced Micro Devices, Inc. and The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Quark

[Quark](https://quark.docs.amd.com/latest/) is a deep learning quantization toolkit designed to be agnostic to specific data types, algorithms, and hardware. Different pre-processing strategies, algorithms and data-types can be combined in Quark.

The PyTorch support integrated in 🤗 Transformers primarily targets AMD CPUs and GPUs, and is mainly intended for evaluation purposes. For example, [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) can be used with the 🤗 Transformers backend to seamlessly evaluate a wide range of models quantized through Quark.
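
For instance, a quantized checkpoint from the Hub can be evaluated directly through the harness's `hf` (Transformers) backend. The snippet below is a minimal sketch — the checkpoint is the same example used later in this guide, and the exact flags may vary depending on the lm-evaluation-harness version:

```bash
pip install lm_eval

lm_eval --model hf \
    --model_args pretrained=EmbeddedLLM/Llama-3.1-8B-Instruct-w_fp8_per_channel_sym \
    --tasks wikitext \
    --device cuda:0 \
    --batch_size 8
```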

Users interested in Quark can refer to its [documentation](https://quark.docs.amd.com/latest/) to get started quantizing models and using them in supported open-source libraries!

Although Quark has its own checkpoint / [configuration format](https://huggingface.co/amd/Llama-3.1-8B-Instruct-FP8-KV-Quark-test/blob/main/config.json#L26), the library also supports producing models with a serialization layout compliant with other quantization/runtime implementations ([AutoAWQ](https://huggingface.co/docs/transformers/quantization/awq), [native fp8 in 🤗 Transformers](https://huggingface.co/docs/transformers/quantization/finegrained_fp8)).

To load Quark quantized models in Transformers, the `amd-quark` library first needs to be installed:

```bash
pip install amd-quark
```

## Support matrix

Models quantized through Quark support a wide range of features, which can be combined freely. Regardless of their configuration, all quantized models can seamlessly be reloaded through `PreTrainedModel.from_pretrained`.

The table below shows a few features supported by Quark:

| **Feature** | **Supported subset in Quark** | |
|---------------------------------|-----------------------------------------------------------------------------------------------------------|---|
| Data types | int8, int4, int2, bfloat16, float16, fp8_e5m2, fp8_e4m3, fp6_e3m2, fp6_e2m3, fp4, OCP MX, MX6, MX9, bfp16 | |
| Pre-quantization transformation | SmoothQuant, QuaRot, SpinQuant, AWQ | |
| Quantization algorithm | GPTQ | |
| Supported operators | ``nn.Linear``, ``nn.Conv2d``, ``nn.ConvTranspose2d``, ``nn.Embedding``, ``nn.EmbeddingBag`` | |
| Granularity | per-tensor, per-channel, per-block, per-layer, per-layer type | |
| KV cache | fp8 | |
| Activation calibration | MinMax / Percentile / MSE | |
| Quantization strategy | weight-only, static, dynamic, with or without output quantization | |

## Models on Hugging Face Hub

Public models using Quark native serialization can be found at https://huggingface.co/models?other=quark.

Although Quark also supports [models using `quant_method="fp8"`](https://huggingface.co/models?other=fp8) and [models using `quant_method="awq"`](https://huggingface.co/models?other=awq), Transformers loads these models through [AutoAWQ](https://huggingface.co/docs/transformers/quantization/awq) or through the [native fp8 support in 🤗 Transformers](https://huggingface.co/docs/transformers/quantization/finegrained_fp8) instead.

Comment on lines +54 to +55 — Member:

Maybe we can do something there so that we are able to run these checkpoints in Quark. Will it work out of the box if we modify the `config.quantization_config` and pass the new config to the model in `from_pretrained`?
Or we could add a function / context manager that modifies `AUTO_QUANTIZATION_CONFIG_MAPPING` and `AUTO_QUANTIZER_MAPPING`.
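
For illustration, a rough sketch of the second suggestion — a context manager that temporarily routes a given `quant_method` to the Quark quantizer. This is purely hypothetical and not part of this PR; whether such checkpoints then load correctly through Quark is exactly the open question above:

```python
# Hypothetical sketch only — not part of this PR.
from contextlib import contextmanager

from transformers.quantizers.auto import AUTO_QUANTIZATION_CONFIG_MAPPING, AUTO_QUANTIZER_MAPPING
from transformers.quantizers.quantizer_quark import QuarkHfQuantizer
from transformers.utils.quantization_config import QuarkConfig


@contextmanager
def route_to_quark(quant_method: str = "fp8"):
    """Temporarily map `quant_method` to the Quark quantizer and config classes."""
    previous_quantizer = AUTO_QUANTIZER_MAPPING.get(quant_method)
    previous_config = AUTO_QUANTIZATION_CONFIG_MAPPING.get(quant_method)
    AUTO_QUANTIZER_MAPPING[quant_method] = QuarkHfQuantizer
    AUTO_QUANTIZATION_CONFIG_MAPPING[quant_method] = QuarkConfig
    try:
        yield
    finally:
        # Restore the original mappings so other loads are unaffected.
        AUTO_QUANTIZER_MAPPING[quant_method] = previous_quantizer
        AUTO_QUANTIZATION_CONFIG_MAPPING[quant_method] = previous_config
```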

## Using Quark models in Transformers

Here is an example of how one can load a Quark model in Transformers:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EmbeddedLLM/Llama-3.1-8B-Instruct-w_fp8_per_channel_sym"
model = AutoModelForCausalLM.from_pretrained(model_id)
model = model.to("cuda")

print(model.model.layers[0].self_attn.q_proj)
# QParamsLinear(
# (weight_quantizer): ScaledRealQuantizer()
# (input_quantizer): ScaledRealQuantizer()
# (output_quantizer): ScaledRealQuantizer()
# )

tokenizer = AutoTokenizer.from_pretrained(model_id)
inp = tokenizer("Where is a good place to cycle around Tokyo?", return_tensors="pt")
inp = inp.to("cuda")

res = model.generate(**inp, min_new_tokens=50, max_new_tokens=100)

print(tokenizer.batch_decode(res)[0])
# <|begin_of_text|>Where is a good place to cycle around Tokyo? There are several places in Tokyo that are suitable for cycling, depending on your skill level and interests. Here are a few suggestions:
# 1. Yoyogi Park: This park is a popular spot for cycling and has a wide, flat path that's perfect for beginners. You can also visit the Meiji Shrine, a famous Shinto shrine located in the park.
# 2. Imperial Palace East Garden: This beautiful garden has a large, flat path that's perfect for cycling. You can also visit the
```
2 changes: 2 additions & 0 deletions src/transformers/__init__.py
100755 → 100644
@@ -1042,6 +1042,7 @@
"HiggsConfig",
"HqqConfig",
"QuantoConfig",
"QuarkConfig",
"SpQRConfig",
"TorchAoConfig",
"VptqConfig",
@@ -6278,6 +6279,7 @@
HiggsConfig,
HqqConfig,
QuantoConfig,
QuarkConfig,
SpQRConfig,
TorchAoConfig,
VptqConfig,
8 changes: 8 additions & 0 deletions src/transformers/modeling_utils.py
100755 → 100644
@@ -536,6 +536,10 @@ def load_sharded_checkpoint(model, folder, strict=True, prefer_safe=True):
str_to_torch_dtype["U32"] = torch.uint32
str_to_torch_dtype["U64"] = torch.uint64

if is_torch_greater_or_equal("2.1.0"):
str_to_torch_dtype["F8_E4M3"] = torch.float8_e4m3fn
str_to_torch_dtype["F8_E5M2"] = torch.float8_e5m2
Comment on lines +539 to +541 — Contributor Author:

@SunMarc @MekkCyber I also had to add this in fda836f following recent changes to modeling_utils.py, in order for the example in the documentation to work.

This corresponds to https://github.com/huggingface/safetensors/blob/53fe06c3efd40ff62520f74818819590b2bc25de/bindings/python/py_src/safetensors/torch.py#L385-L386

Contributor:


Doesn't ROCm only support `torch.float8_e4m3fnuz`?

Contributor Author:

Yes, only `torch.float8_e4m3fnuz`.

However, we are able to load models quantized in the `torch.float8_e4m3fn` format and convert them to fnuz, similar to https://github.com/ROCm/vllm/blob/0f2300e3d831de673f4b2aef96aff2d38c499263/vllm/model_executor/layers/quantization/utils/w8a8_utils.py#L290-L311. I think fnuz is not in the safetensors spec.
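
For reference, a minimal sketch of such a conversion, loosely following the vLLM utility linked above (assumptions: per-tensor scales and a PyTorch build exposing `torch.float8_e4m3fnuz`; this is illustrative only, not the code path used by Quark):

```python
# Illustrative sketch of an e4m3fn -> e4m3fnuz weight conversion for ROCm,
# loosely following the vLLM utility linked above. Not the code used by Quark.
import torch


def e4m3fn_to_e4m3fnuz(weight: torch.Tensor, weight_scale: torch.Tensor):
    assert weight.dtype == torch.float8_e4m3fn
    as_int8 = weight.view(torch.int8).clone()
    # The bit pattern 0b10000000 is -0.0 in e4m3fn but NaN in e4m3fnuz; map it to +0.0.
    as_int8[as_int8 == -128] = 0
    weight_fnuz = as_int8.view(torch.float8_e4m3fnuz)
    # fnuz has an exponent bias shifted by one, so the same bits represent half the value;
    # doubling the dequantization scale compensates for that.
    return weight_fnuz, weight_scale * 2.0
```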



def load_state_dict(
checkpoint_file: Union[str, os.PathLike],
@@ -3672,6 +3676,10 @@ def to(self, *args, **kwargs):

if getattr(self, "quantization_method", None) == QuantizationMethod.HQQ:
raise ValueError("`.to` is not supported for HQQ-quantized models.")

if dtype_present_in_args and getattr(self, "quantization_method", None) == QuantizationMethod.QUARK:
raise ValueError("Casting a Quark quantized model to a new `dtype` is not supported.")

# Checks if the model has been loaded in 4-bit or 8-bit with BNB
if getattr(self, "quantization_method", None) == QuantizationMethod.BITS_AND_BYTES:
if dtype_present_in_args:
5 changes: 5 additions & 0 deletions src/transformers/quantizers/auto.py
100755 → 100644
@@ -1,4 +1,5 @@
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
# Modifications Copyright (C) 2025, Advanced Micro Devices, Inc. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -31,6 +32,7 @@
QuantizationConfigMixin,
QuantizationMethod,
QuantoConfig,
QuarkConfig,
SpQRConfig,
TorchAoConfig,
VptqConfig,
@@ -49,6 +51,7 @@
from .quantizer_higgs import HiggsHfQuantizer
from .quantizer_hqq import HqqHfQuantizer
from .quantizer_quanto import QuantoHfQuantizer
from .quantizer_quark import QuarkHfQuantizer
from .quantizer_spqr import SpQRHfQuantizer
from .quantizer_torchao import TorchAoHfQuantizer
from .quantizer_vptq import VptqHfQuantizer
@@ -61,6 +64,7 @@
"gptq": GptqHfQuantizer,
"aqlm": AqlmHfQuantizer,
"quanto": QuantoHfQuantizer,
"quark": QuarkHfQuantizer,
"eetq": EetqHfQuantizer,
"higgs": HiggsHfQuantizer,
"hqq": HqqHfQuantizer,
@@ -81,6 +85,7 @@
"gptq": GPTQConfig,
"aqlm": AqlmConfig,
"quanto": QuantoConfig,
"quark": QuarkConfig,
"hqq": HqqConfig,
"compressed-tensors": CompressedTensorsConfig,
"fbgemm_fp8": FbgemmFp8Config,
113 changes: 113 additions & 0 deletions src/transformers/quantizers/quantizer_quark.py
@@ -0,0 +1,113 @@
# coding=utf-8
# Copyright 2025 Advanced Micro Devices, Inc. and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from typing import TYPE_CHECKING, Any, Dict

from ..file_utils import is_torch_available
from .base import HfQuantizer


if TYPE_CHECKING:
from ..modeling_utils import PreTrainedModel

if is_torch_available():
import torch

from ..utils import is_accelerate_available, is_quark_available, logging


if is_accelerate_available():
from accelerate.utils import set_module_tensor_to_device

logger = logging.get_logger(__name__)


CHECKPOINT_KEYS = {
"weight_scale": "weight_quantizer.scale",
"bias_scale": "bias_quantizer.scale",
"input_scale": "input_quantizer.scale",
"output_scale": "output_quantizer.scale",
"weight_zero_point": "weight_quantizer.zero_point",
"bias_zero_point": "bias_quantizer.zero_point",
"input_zero_point": "input_quantizer.zero_point",
"output_zero_point": "output_quantizer.zero_point",
}


class QuarkHfQuantizer(HfQuantizer):
"""
Quark quantizer (https://quark.docs.amd.com/latest/).
"""

requires_calibration = True # On-the-fly quantization with quark is not supported for now.
required_packages = ["quark"]

# Checkpoints are expected to be already quantized when loading a quark model. However, as some keys from
# the checkpoint might mismatch the model parameters keys, we use the `create_quantized_param` method
# to load the checkpoints, remapping the keys.
requires_parameters_quantization = True

def __init__(self, quantization_config, **kwargs):
super().__init__(quantization_config, **kwargs)

self.json_export_config = quantization_config.json_export_config

def validate_environment(self, *args, **kwargs):
if not is_quark_available():
raise ImportError(
"Loading a Quark quantized model requires the `quark` library but it was not found in the environment. Please refer to https://quark.docs.amd.com/latest/install.html."
)

def _process_model_before_weight_loading(self, model: "PreTrainedModel", **kwargs):
from quark.torch.export.api import _map_to_quark

_map_to_quark(
model,
self.quantization_config.quant_config,
pack_method=self.json_export_config.pack_method,
custom_mode=self.quantization_config.custom_mode,
)

return model

def check_quantized_param(
self,
model: "PreTrainedModel",
param_value: "torch.Tensor",
param_name: str,
state_dict: Dict[str, Any],
**kwargs,
) -> bool:
return True

def create_quantized_param(
self, model, param, param_name, param_device, state_dict, unexpected_keys
) -> "torch.nn.Parameter":
postfix = param_name.split(".")[-1]

if postfix in CHECKPOINT_KEYS:
param_name = param_name.replace(postfix, CHECKPOINT_KEYS[postfix])

set_module_tensor_to_device(model, param_name, param_device, value=param)

def _process_model_after_weight_loading(self, model: "PreTrainedModel", **kwargs):
return model

def is_serializable(self, safe_serialization=None):
return False

@property
def is_trainable(self):
return False
8 changes: 8 additions & 0 deletions src/transformers/testing_utils.py
@@ -116,6 +116,7 @@
is_pytesseract_available,
is_pytest_available,
is_pytorch_quantization_available,
is_quark_available,
is_rjieba_available,
is_sacremoses_available,
is_safetensors_available,
@@ -1299,6 +1300,13 @@ def require_fbgemm_gpu(test_case):
return unittest.skipUnless(is_fbgemm_gpu_available(), "test requires fbgemm-gpu")(test_case)


def require_quark(test_case):
"""
Decorator for quark dependency
"""
return unittest.skipUnless(is_quark_available(), "test requires quark")(test_case)
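
As a usage illustration, the new decorator composes with the existing ones — a hypothetical sketch, not the actual `tests/quantization/quark_integration/test_quark.py` added by this PR:

```python
# Hypothetical test sketch showing how require_quark composes with other decorators.
import unittest

from transformers import AutoModelForCausalLM
from transformers.testing_utils import require_quark, require_torch_gpu


@require_quark
@require_torch_gpu
class QuarkLoadingTest(unittest.TestCase):
    def test_load_quark_checkpoint(self):
        # Example checkpoint from the documentation added in this PR.
        model_id = "EmbeddedLLM/Llama-3.1-8B-Instruct-w_fp8_per_channel_sym"
        model = AutoModelForCausalLM.from_pretrained(model_id)
        self.assertIsNotNone(model)
```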


def require_flute_hadamard(test_case):
"""
Decorator marking a test that requires higgs and hadamard
1 change: 1 addition & 0 deletions src/transformers/utils/__init__.py
100755 → 100644
@@ -181,6 +181,7 @@
is_pytesseract_available,
is_pytest_available,
is_pytorch_quantization_available,
is_quark_available,
is_rich_available,
is_rjieba_available,
is_sacremoses_available,
16 changes: 16 additions & 0 deletions src/transformers/utils/import_utils.py
100755 → 100644
@@ -45,6 +45,11 @@ def _is_package_available(pkg_name: str, return_version: bool = False) -> Union[
package_version = "N/A"
if package_exists:
try:
# TODO: Once python 3.9 support is dropped, `importlib.metadata.packages_distributions()`
# should be used here to map from package name to distribution names
# e.g. PIL -> Pillow, Pillow-SIMD; quark -> amd-quark; onnxruntime -> onnxruntime-gpu.
# `importlib.metadata.packages_distributions()` is not available in Python 3.9.

# Primary method to get the package version
package_version = importlib.metadata.version(pkg_name)
except importlib.metadata.PackageNotFoundError:
@@ -62,6 +67,12 @@ def _is_package_available(pkg_name: str, return_version: bool = False) -> Union[
except ImportError:
# If the package can't be imported, it's not available
package_exists = False
elif pkg_name == "quark":
# TODO: remove once `importlib.metadata.packages_distributions()` is supported.
try:
package_version = importlib.metadata.version("amd-quark")
except Exception:
package_exists = False
else:
# For packages other than "torch", don't attempt the fallback and set as not available
package_exists = False
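
For context, the TODO above could eventually be handled along these lines once Python 3.9 support is dropped — a sketch assuming `importlib.metadata.packages_distributions()` (Python ≥ 3.10) is available, not part of this PR:

```python
# Sketch only: packages_distributions() maps top-level import names to the distributions
# providing them, e.g. {"quark": ["amd-quark"], ...}, removing per-package special cases.
import importlib.metadata


def _distribution_version(import_name: str) -> str:
    distributions = importlib.metadata.packages_distributions().get(import_name, [])
    if not distributions:
        raise importlib.metadata.PackageNotFoundError(import_name)
    # Take the first distribution providing the import name, e.g. "amd-quark" for "quark".
    return importlib.metadata.version(distributions[0])
```
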
@@ -150,6 +161,7 @@ def _is_package_available(pkg_name: str, return_version: bool = False) -> Union[
_gptqmodel_available = _is_package_available("gptqmodel")
# `importlib.metadata.version` doesn't work with `awq`
_auto_awq_available = importlib.util.find_spec("awq") is not None
_quark_available = _is_package_available("quark")
_is_optimum_quanto_available = False
try:
importlib.metadata.version("optimum_quanto")
@@ -1118,6 +1130,10 @@ def is_optimum_quanto_available():
return _is_optimum_quanto_available


def is_quark_available():
return _quark_available


def is_compressed_tensors_available():
return _compressed_tensors_available
