Support loading Quark quantized models in Transformers #36372
Conversation
cc @SunMarc @MekkCyber for quantization!
@SunMarc let me fix the conflicts that emerged. Happy to share more context if needed!
Yeah, I'll review it after the conflicts are fixed. Sorry for the delay ;)
Thanks for the PR! Left a few comments. I was thinking it could be nice to have a blog/article to promote Quark, wdyt? (e.g. how it works in general, then talk about the different integrations in vLLM/Transformers)
Although Quark also supports [models using `quant_method="fp8"`](https://huggingface.co/models?other=fp8) and [models using `quant_method="awq"`](https://huggingface.co/models?other=awq), Transformers rather loads these models through [AutoAWQ](https://huggingface.co/docs/transformers/quantization/awq) or through the [native fp8 support in 🤗 Transformers](https://huggingface.co/docs/transformers/quantization/finegrained_fp8).
Maybe we can do something there so that we are able to run these checkpoints in Quark. Will it work out of the box if we modify `config.quantization_config` and pass the new config to the model in `from_pretrained`? Or we could add a function / context manager that modifies `AUTO_QUANTIZATION_CONFIG_MAPPING` and `AUTO_QUANTIZER_MAPPING`.
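A rough sketch of the context-manager idea, assuming the Quark classes added by this PR are importable as `QuarkConfig` and `QuarkHfQuantizer` (their exact import paths are an assumption here, not taken from the diff):

```python
from contextlib import contextmanager

from transformers.quantizers.auto import (
    AUTO_QUANTIZATION_CONFIG_MAPPING,
    AUTO_QUANTIZER_MAPPING,
)

# The two imports below are assumptions about where the classes from this PR live.
from transformers.quantizers.quantizer_quark import QuarkHfQuantizer
from transformers.utils.quantization_config import QuarkConfig


@contextmanager
def route_to_quark(quant_method: str):
    """Temporarily route a quant_method (e.g. "fp8" or "awq") to the Quark quantizer."""
    previous_config = AUTO_QUANTIZATION_CONFIG_MAPPING.get(quant_method)
    previous_quantizer = AUTO_QUANTIZER_MAPPING.get(quant_method)
    AUTO_QUANTIZATION_CONFIG_MAPPING[quant_method] = QuarkConfig
    AUTO_QUANTIZER_MAPPING[quant_method] = QuarkHfQuantizer
    try:
        yield
    finally:
        # Restore the original mappings so other loads are unaffected.
        if previous_config is not None:
            AUTO_QUANTIZATION_CONFIG_MAPPING[quant_method] = previous_config
        if previous_quantizer is not None:
            AUTO_QUANTIZER_MAPPING[quant_method] = previous_quantizer
```

Inside such a context manager, a `from_pretrained` call on an fp8 or AWQ checkpoint would be dispatched to the Quark quantizer instead of the default one.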
Thanks for this integration @fxmarty-amd ! Only two very small nits
src/transformers/modeling_utils.py (outdated)
```diff
-param = param[:]
+if param_ndim > 0:
+    param = param[:]
+else:
+    # param[:] does not work on 0-dim tensors. Nevertheless, we need to materialize the PySafeSlice in case the model is loaded using safetensors.
+    param = param[...]
```
@SunMarc @MekkCyber Are the changes in `modeling_utils.py` in 3f76848 acceptable to you?
Some of our tensors are 0-dim, and neither PyTorch nor safetensors can execute `param = param[:]` on 0-dim tensors (`IndexError: slice() cannot be applied to a 0-dim tensor.`). However, `param = param[...]` is acceptable for 0-dim tensors. Actually, I think we could even just use `param = param[...]` in all cases here.
Re-running the tests, I noticed failures following #36512 due to this issue.
WDYT?
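For reference, a minimal reproduction of the 0-dim indexing behaviour discussed above (plain PyTorch tensors; the lazy safetensors `PySafeSlice` is described in the diff comment as behaving analogously when materialized):

```python
import torch

t = torch.tensor(1.0)  # a 0-dim tensor

# Slicing a 0-dim tensor fails:
# IndexError: slice() cannot be applied to a 0-dim tensor.
try:
    _ = t[:]
except IndexError as e:
    print(e)

# Ellipsis indexing works on 0-dim (and any other) tensors.
print(t[...])  # tensor(1.)
```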
Indeed, we also found out about this issue and I think it is better to fix it in a separate PR. I will merge it as soon as the tests are green.
sounds good! will reset modeling_utils.py in this PR afterwards.
src/transformers/modeling_utils.py (outdated)
```python
if shard_file.endswith(".safetensors"):
    param = file_pointer.get_slice(serialized_param_name)
    param_ndim = len(param.get_shape())
else:
    param = bin_state_dict[serialized_param_name]
    param_ndim = param.ndim
```
and this change
```python
if is_torch_greater_or_equal("2.1.0"):
    str_to_torch_dtype["F8_E4M3"] = torch.float8_e4m3fn
    str_to_torch_dtype["F8_E5M2"] = torch.float8_e5m2
```
@SunMarc @MekkCyber I also had to add this in fda836f following recent changes to modeling_utils.py, in order for the example in the documentation to work.
This corresponds to https://github.com/huggingface/safetensors/blob/53fe06c3efd40ff62520f74818819590b2bc25de/bindings/python/py_src/safetensors/torch.py#L385-L386
Doesn't ROCm only support `torch.float8_e4m3fnuz`?
Yes, only `torch.float8_e4m3fnuz`. However, we are able to load models quantized in the `torch.float8_e4m3fn` format and convert to fnuz, similar to https://github.com/ROCm/vllm/blob/0f2300e3d831de673f4b2aef96aff2d38c499263/vllm/model_executor/layers/quantization/utils/w8a8_utils.py#L290-L311. I think fnuz is not in the safetensors spec.
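A minimal sketch of that kind of e4m3fn → e4m3fnuz conversion, modeled on the linked vLLM utility rather than the exact Quark/Transformers code (the function name and signature are illustrative):

```python
import torch

def e4m3fn_to_e4m3fnuz(weight: torch.Tensor, weight_scale: torch.Tensor):
    """Convert an e4m3fn-quantized weight and its scale to e4m3fnuz for ROCm."""
    assert weight.dtype == torch.float8_e4m3fn
    # The bit pattern 0x80 encodes -0.0 in e4m3fn but NaN in e4m3fnuz, so zero it out.
    as_int8 = weight.view(torch.int8)
    as_int8[as_int8 == -128] = 0
    weight_fnuz = as_int8.view(torch.float8_e4m3fnuz)
    # For the same bit pattern, the e4m3fnuz value is half the e4m3fn value,
    # so the scale is doubled to keep the dequantized result identical.
    return weight_fnuz, weight_scale * 2.0
```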
Hi @fxmarty-amd! Can you resolve the conflicts please, otherwise good to go from my side
@MekkCyber thanks a lot, will resolve the conflicts once #36580 is merged. I'll be away for a bit, so @kewang-xlnx will take over this PR. cc @kewang-xlnx
Thanks @fxmarty-amd, merged with main and resolved conflicts. @MekkCyber thanks for reviewing! Please take a look.
Thanks @BowenBao ! I just left some small comments and questions
Thanks again for iterating! For driving usage, as I said before, it would be nice to have a blog post or maybe a Space to easily quantize models with Quark ;) (similar to bnb-my-repo, gguf-my-repo or mlx-my-repo).
…6372)

* add quark quantizer
* add quark doc
* clean up doc
* fix tests
* make style
* more style fixes
* cleanup imports
* cleaning
* precise install
* Update docs/source/en/quantization/quark.md

  Co-authored-by: Marc Sun <[email protected]>

* Update tests/quantization/quark_integration/test_quark.py

  Co-authored-by: Marc Sun <[email protected]>

* Update src/transformers/utils/quantization_config.py

  Co-authored-by: Marc Sun <[email protected]>

* remove import guard as suggested
* update copyright headers
* add quark to transformers-quantization-latest-gpu Dockerfile
* make tests pass on transformers main + quark==0.7
* add missing F8_E4M3 and F8_E5M2 keys from str_to_torch_dtype

---------

Co-authored-by: Marc Sun <[email protected]>
Co-authored-by: Bowen Bao <[email protected]>
Co-authored-by: Mohamed Mekkouri <[email protected]>
This PR adds the ability to load Quark models through `PreTrainedModel.from_pretrained`, by integrating the Quark library through a quantizer in Transformers. Example:
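A minimal sketch of what such a load looks like; the model id below is a placeholder for any Quark-quantized checkpoint and is not taken from the original example:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model id: any checkpoint quantized with Quark (quant_method="quark") should work.
model_id = "organization/some-quark-quantized-model"

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Quark quantization makes it possible to", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```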
AMD is developing its in-house quantizer, Quark (https://quark.docs.amd.com/latest/), released under the MIT license, which supports a broad range of quantization pre-processing, algorithms, dtypes and target hardware.
This PR can be tested with the release version of Quark on PyPI:
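Assuming the package is published on PyPI under the name `amd-quark` (the package name is an assumption here), installation would be along the lines of `pip install amd-quark`.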
Why integrate into Transformers?
Quark is helpful for exploring a wide range of quantization strategies, and integrating into Transformers means easy binding with the broader OSS ecosystem, e.g. `lm-evaluation-harness`, or using Transformers as a vLLM backend. In the future, we plan to broaden the data types and algorithms supported in Quark to target current and future hardware, and to make them usable through Transformers.
Who carries the maintenance burden?
@fxmarty-amd, @kewang-xlnx and @BowenBao are committed to developing and maintaining Quark support in Transformers.
Alternative
#35915 could be an option, but it does not really allow integrating with external OSS libraries in native Transformers (without pulling in our own code to register the quantization config / quantizer).