Support loading Quark quantized models in Transformers #36372
Conversation
cc @SunMarc @MekkCyber for quantization!
@SunMarc let me fix the conflicts that emerged. Happy to share more context if needed!
Yeah, I'll review it after the conflicts are fixed. Sorry for the delay ;)
Thanks for the PR! Left a few comments. I was thinking it could be nice to have a blog/article to promote Quark, wdyt? (e.g. how it works in general, then talk about the different integrations in vLLM/Transformers)
Although Quark also supports [models using `quant_method="fp8"`](https://huggingface.co/models?other=fp8) and [models using `quant_method="awq"`](https://huggingface.co/models?other=awq), Transformers rather loads these models through [AutoAWQ](https://huggingface.co/docs/transformers/quantization/awq) or through the [native fp8 support in 🤗 Transformers](https://huggingface.co/docs/transformers/quantization/finegrained_fp8).
Maybe we can do something there so that we are able to run these checkpoints in Quark. Will it work out of the box if we modify `config.quantization_config` and pass the new config to the model in `from_pretrained`? Or we could add a function / context manager that modifies `AUTO_QUANTIZATION_CONFIG_MAPPING` and `AUTO_QUANTIZER_MAPPING`.
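A rough sketch of the context-manager idea, assuming the Quark classes added by this PR are importable as `QuarkConfig` and `QuarkHfQuantizer` (their exact import paths are an assumption here, not taken from the diff):

```python
from contextlib import contextmanager

from transformers.quantizers.auto import (
    AUTO_QUANTIZATION_CONFIG_MAPPING,
    AUTO_QUANTIZER_MAPPING,
)

# The two imports below are assumptions about where the classes from this PR live.
from transformers.quantizers.quantizer_quark import QuarkHfQuantizer
from transformers.utils.quantization_config import QuarkConfig


@contextmanager
def route_to_quark(quant_method: str):
    """Temporarily route a quant_method (e.g. "fp8" or "awq") to the Quark quantizer."""
    previous_config = AUTO_QUANTIZATION_CONFIG_MAPPING.get(quant_method)
    previous_quantizer = AUTO_QUANTIZER_MAPPING.get(quant_method)
    AUTO_QUANTIZATION_CONFIG_MAPPING[quant_method] = QuarkConfig
    AUTO_QUANTIZER_MAPPING[quant_method] = QuarkHfQuantizer
    try:
        yield
    finally:
        # Restore the original mappings so other loads are unaffected.
        if previous_config is not None:
            AUTO_QUANTIZATION_CONFIG_MAPPING[quant_method] = previous_config
        if previous_quantizer is not None:
            AUTO_QUANTIZER_MAPPING[quant_method] = previous_quantizer
```

Inside such a context manager, a `from_pretrained` call on an fp8 or AWQ checkpoint would be dispatched to the Quark quantizer instead of the default one.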
Thanks for this integration @fxmarty-amd ! Only two very small nits
src/transformers/modeling_utils.py (outdated)
```diff
-param = param[:]
+if param_ndim > 0:
+    param = param[:]
+else:
+    # param[:] does not work on 0-dim tensors. Nevertheless, we need to materialize the PySafeSlice in case the model is loaded using safetensors.
+    param = param[...]
```
@SunMarc @MekkCyber Are the changes in `modeling_utils.py` in 3f76848 acceptable to you?
Some of our tensors are 0-dim, and neither PyTorch nor safetensors can execute `param = param[:]` on 0-dim tensors (`IndexError: slice() cannot be applied to a 0-dim tensor.`). However, `param = param[...]` is acceptable for 0-dim tensors. Actually, I think we could even just use `param = param[...]` in all cases here.
Re-running the tests, I noticed failures following #36512 due to this issue.
WDYT?
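For reference, a minimal reproduction of the 0-dim indexing behaviour discussed above (plain PyTorch tensors; the lazy safetensors `PySafeSlice` is described in the diff comment as behaving analogously when materialized):

```python
import torch

t = torch.tensor(1.0)  # a 0-dim tensor

# Slicing a 0-dim tensor fails:
# IndexError: slice() cannot be applied to a 0-dim tensor.
try:
    _ = t[:]
except IndexError as e:
    print(e)

# Ellipsis indexing works on 0-dim (and any other) tensors.
print(t[...])  # tensor(1.)
```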
Indeed, we also found out about this issue and I think it is better to fix it in a separate PR. I will merge it as soon as the tests are green.
sounds good! will reset modeling_utils.py in this PR afterwards.
src/transformers/modeling_utils.py (outdated)
```python
if shard_file.endswith(".safetensors"):
    param = file_pointer.get_slice(serialized_param_name)
    param_ndim = len(param.get_shape())
else:
    param = bin_state_dict[serialized_param_name]
    param_ndim = param.ndim
```
and this change
```python
if is_torch_greater_or_equal("2.1.0"):
    str_to_torch_dtype["F8_E4M3"] = torch.float8_e4m3fn
    str_to_torch_dtype["F8_E5M2"] = torch.float8_e5m2
```
@SunMarc @MekkCyber I also had to add this in fda836f following recent changes to modeling_utils.py, in order for the example in the documentation to work.
This corresponds to https://github.com/huggingface/safetensors/blob/53fe06c3efd40ff62520f74818819590b2bc25de/bindings/python/py_src/safetensors/torch.py#L385-L386
Doesn't ROCm only support `torch.float8_e4m3fnuz`?
Yes, only `torch.float8_e4m3fnuz`. However, we are able to load models quantized in the `torch.float8_e4m3fn` format and convert to fnuz, similar to https://github.com/ROCm/vllm/blob/0f2300e3d831de673f4b2aef96aff2d38c499263/vllm/model_executor/layers/quantization/utils/w8a8_utils.py#L290-L311. I think fnuz is not in the safetensors spec.
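A minimal sketch of that kind of e4m3fn → e4m3fnuz conversion, modeled on the linked vLLM utility rather than the exact Quark/Transformers code (the function name and signature are illustrative):

```python
import torch

def e4m3fn_to_e4m3fnuz(weight: torch.Tensor, weight_scale: torch.Tensor):
    """Convert an e4m3fn-quantized weight and its scale to e4m3fnuz for ROCm."""
    assert weight.dtype == torch.float8_e4m3fn
    # The bit pattern 0x80 encodes -0.0 in e4m3fn but NaN in e4m3fnuz, so zero it out.
    as_int8 = weight.view(torch.int8)
    as_int8[as_int8 == -128] = 0
    weight_fnuz = as_int8.view(torch.float8_e4m3fnuz)
    # For the same bit pattern, the e4m3fnuz value is half the e4m3fn value,
    # so the scale is doubled to keep the dequantized result identical.
    return weight_fnuz, weight_scale * 2.0
```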
Hi @fxmarty-amd! Can you resolve the conflicts please, otherwise good to go from my side
@MekkCyber thanks a lot, will resolve the conflicts once #36580 is merged. I'll be away for a bit, so @kewang-xlnx will take over this PR. cc @kewang-xlnx
Thanks @fxmarty-amd, merged with main and resolved conflicts. @MekkCyber thanks for reviewing! Please take a look.
Thanks @BowenBao ! I just left some small comments and questions
Thanks again for iterating! For driving usage, as I said before, it would be nice to have a blog post or maybe a Space to easily quantize models with Quark ;) (similar to bnb-my-repo, gguf-my-repo or mlx-my-repo).
…6372)

* add quark quantizer
* add quark doc
* clean up doc
* fix tests
* make style
* more style fixes
* cleanup imports
* cleaning
* precise install
* Update docs/source/en/quantization/quark.md

  Co-authored-by: Marc Sun <[email protected]>

* Update tests/quantization/quark_integration/test_quark.py

  Co-authored-by: Marc Sun <[email protected]>

* Update src/transformers/utils/quantization_config.py

  Co-authored-by: Marc Sun <[email protected]>

* remove import guard as suggested
* update copyright headers
* add quark to transformers-quantization-latest-gpu Dockerfile
* make tests pass on transformers main + quark==0.7
* add missing F8_E4M3 and F8_E5M2 keys from str_to_torch_dtype

---------

Co-authored-by: Marc Sun <[email protected]>
Co-authored-by: Bowen Bao <[email protected]>
Co-authored-by: Mohamed Mekkouri <[email protected]>
This PR adds the ability to load Quark models through `PreTrainedModel.from_pretrained`, by integrating the Quark library through a quantizer in Transformers. Example:
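A minimal sketch of what such a load looks like; the model id below is a placeholder for any Quark-quantized checkpoint and is not taken from the original example:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model id: any checkpoint quantized with Quark (quant_method="quark") should work.
model_id = "organization/some-quark-quantized-model"

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Quark quantization makes it possible to", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```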
AMD is developing its in-house quantizer, Quark (https://quark.docs.amd.com/latest/), released under the MIT license, which supports a broad range of quantization pre-processing, algorithms, dtypes and target hardware.
This PR can be tested with the release version of Quark on PyPI:
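Assuming the package is published on PyPI under the name `amd-quark` (the package name is an assumption here), installation would be along the lines of `pip install amd-quark`.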
Why integrate into Transformers?
Quark is helpful for exploring a wide range of quantization strategies, and integrating into Transformers means easy binding with the broader OSS ecosystem, e.g. `lm-evaluation-harness`, or using Transformers as a vLLM backend. In the future, we plan to broaden the data types and algorithms supported in Quark to target current and future hardware, and to make them usable through Transformers.
Who carries the maintenance burden?
@fxmarty-amd, @kewang-xlnx and @BowenBao are committed to developing and maintaining Quark support in Transformers.
Alternative
#35915 could be an option, but it does not really allow integrating with external OSS libraries in native Transformers (without pulling in our own code to register the quantization config / quantizer).