
[Intel GPU] Docs of XPUInductorQuantizer #3293

Open
wants to merge 41 commits into base: main

Conversation


@ZhiweiYan-96 commented Mar 18, 2025

Description

Add a tutorial for XPUInductorQuantizer, which serves as the INT8 quantization backend for Intel GPU inside PT2E.

cc @gujinghui @EikanWang @fengyuan14 @guangyey


pytorch-bot bot commented Mar 18, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/tutorials/3293

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 41fc5b6 with merge base 63295e8:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Comment on lines 180 to 183
::
quantizer = XPUInductorQuantizer()
quantizer.set_global(get_xpu_inductor_symm_quantization_config())


The code formatting has not taken effect.

Author

Thanks for the reminder; added the fix.
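
For reference, the fix is presumably a blank line after the ``::`` marker plus indentation, so that Sphinx renders the snippet as a literal block:

::

    quantizer = XPUInductorQuantizer()
    quantizer.set_global(get_xpu_inductor_symm_quantization_config())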

@ZhiweiYan-96 marked this pull request as draft March 19, 2025 05:48
@svekars added the 2.7 label Mar 19, 2025
@svekars requested a review from AlannaBurke March 21, 2025 16:14
@@ -96,6 +96,13 @@ Prototype features are not available as part of binary distributions like PyPI o
:link: ../prototype/pt2e_quant_x86_inductor.html
:tags: Quantization

.. customcarditem::
:header: PyTorch 2 Export Quantization with Intel GPU Backend through Inductor


Intel XPU

Author

At an earlier stage, when we uploaded the RFCs, we recommended using "GPU" instead of "XPU" for user readability. Has this naming decision changed?

@@ -0,0 +1,234 @@
PyTorch 2 Export Quantization with Intel GPU Backend through Inductor


Intel XPU

Author

Ditto.

Contributor

Suggested change
PyTorch 2 Export Quantization with Intel GPU Backend through Inductor
Export Quantization with Intel GPU Backend through Inductor

utilizes PyTorch 2 Export Quantization flow and lowers the quantized model into the inductor.

The pytorch 2 export quantization flow uses the torch.export to capture the model into a graph and perform quantization transformations on top of the ATen graph.
This approach is expected to have significantly higher model coverage, better programmability, and a simplified UX.


This approach is expected to have significantly higher model coverage with better programmability and a simplified user experience.

Author

Thanks for the suggestions; modified.

The quantization flow mainly includes three steps:

- Step 1: Capture the FX Graph from the eager Model based on the `torch export mechanism <https://pytorch.org/docs/main/export.html>`_.
- Step 2: Apply the Quantization flow based on the captured FX Graph, including defining the backend-specific quantizer, generating the prepared model with observers,


Apply the quantization flow based on the captured FX Graph, including defining the backend-specific quantizer, generating the prepared model with observers,

Author

Thanks for the suggestions; I have changed the description here.

performing the prepared model's calibration, and converting the prepared model into the quantized model.
- Step 3: Lower the quantized model into inductor with the API ``torch.compile``.
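
Taken together, the three steps amount to roughly the sketch below. The helper names (``XPUInductorQuantizer``, ``get_xpu_inductor_symm_quantization_config``) follow the snippets quoted in this PR; the model and inputs are hypothetical placeholders, and the export API shown (``torch.export.export_for_training``) is an assumption that may differ from the tutorial's final code.

::

    import torch
    from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e
    from torch.ao.quantization.quantizer.xpu_inductor_quantizer import (
        XPUInductorQuantizer,
        get_xpu_inductor_symm_quantization_config,
    )

    model = MyModel().eval().to("xpu")  # hypothetical float model
    example_inputs = (torch.randn(1, 3, 224, 224, device="xpu"),)

    # Step 1: capture the FX graph via the torch export mechanism
    exported_model = torch.export.export_for_training(model, example_inputs).module()

    # Step 2: define the backend-specific quantizer, insert observers,
    # calibrate with representative data, then convert to a quantized model
    quantizer = XPUInductorQuantizer()
    quantizer.set_global(get_xpu_inductor_symm_quantization_config())
    prepared_model = prepare_pt2e(exported_model, quantizer)
    prepared_model(*example_inputs)  # calibration pass
    converted_model = convert_pt2e(prepared_model)

    # Step 3: lower the quantized model into Inductor with torch.compile
    with torch.no_grad():
        optimized_model = torch.compile(converted_model)
        optimized_model(*example_inputs)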

During Step 3, Inductor decides which kernels the model is dispatched into. There are two kinds of kernels from which the Intel GPU benefits: oneDNN kernels and Triton kernels. `Intel oneAPI Deep Neural Network Library (oneDNN) <https://github.com/uxlfoundation/oneDNN>`_ contains


If this is end-user documentation, I think we should focus on PyTorch itself and remove this explanatory section.

Author

Thanks for the suggestion; I removed the lengthy description of oneDNN and Triton and instead added a brief mention at Step 3 above.

Post Training Quantization
----------------------------

Static quantization is the only method we support currently. QAT and dynamic quantization will be available in later versions.


Remove the forward-looking context from the current introduction: "QAT and dynamic quantization will be available in later versions."

Author

Thanks for the suggestion; removed.


::

pip install torchvision pytorch-triton-xpu --index-url https://download.pytorch.org/whl/nightly/xpu


Let's use the standard ``pip install torch torchvision torchaudio`` rather than a separate command that highlights internal dependencies.

Author

@ZhiweiYan-96 Mar 24, 2025

We may need to keep using our own channels: torchvision is customized for XPU, and users should be able to run the examples in this doc successfully; the standard channel would raise a runtime error. Synced with @jingxu10, and I changed it to ``pip3 install torch torchvision torchaudio pytorch-triton-xpu --index-url https://download.pytorch.org/whl/xpu`` instead of the nightly wheel.
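
For reference, the resulting command as a copyable block:

::

    pip3 install torch torchvision torchaudio pytorch-triton-xpu --index-url https://download.pytorch.org/whl/xpu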

@ZhiweiYan-96 requested a review from CuiYifeng March 24, 2025 06:52

The high-level architecture of this flow could look like this:

.. image:: ../_static/img/pt2e_quant_xpu_inductor.png


Please note that Float Model, Example Input, and XPUInductorQuantizer are invisible in dark mode.

Author

Thanks for the reminder; the picture has been modified.

PyTorch 2 Export Quantization with Intel GPU Backend through Inductor
======================================================================

**Author**: `Yan Zhiwei <https://github.com/ZhiweiYan-96>`_, `Wang Eikan <https://github.com/EikanWang>`_, `Zhang, Liangang <https://github.com/liangan1>`_, `Liu River <https://github.com/riverliuintel>`_, `Cui Yifeng <https://github.com/CuiYifeng>`_


Please unify the style of names.

Author

Thanks, modified.

Comment on lines 110 to 112
quant_min=-128,
quant_max=127,
qscheme=torch.per_tensor_symmetric,


Please consider whether we need more detailed annotations here to explain the meaning of these key parameters to users.

Author

Thanks, an explanation has been added.
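
For illustration, annotations along these lines could accompany the activation spec (a sketch built on ``QuantizationSpec`` from ``torch.ao.quantization.quantizer``; the observer choice here is an assumption, not necessarily what the tutorial uses):

::

    import torch
    from torch.ao.quantization.observer import HistogramObserver
    from torch.ao.quantization.quantizer import QuantizationSpec

    act_quantization_spec = QuantizationSpec(
        dtype=torch.int8,                    # activations are stored as signed 8-bit integers
        quant_min=-128,                      # smallest representable int8 value
        quant_max=127,                       # largest representable int8 value
        qscheme=torch.per_tensor_symmetric,  # one scale per tensor; zero point fixed at 0
        is_dynamic=False,                    # scales are computed offline during calibration
        observer_or_fake_quant_ctr=HistogramObserver.with_args(reduce_range=False),
    )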

dtype=torch.int8,
quant_min=-128,
quant_max=127,
qscheme=torch.per_channel_symmetric,


Ditto.

Author

Thanks, an explanation has been added.
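
Similarly for the per-channel weight spec (again a sketch; ``PerChannelMinMaxObserver`` is an assumed observer choice):

::

    import torch
    from torch.ao.quantization.observer import PerChannelMinMaxObserver
    from torch.ao.quantization.quantizer import QuantizationSpec

    weight_quantization_spec = QuantizationSpec(
        dtype=torch.int8,                     # weights are stored as signed 8-bit integers
        quant_min=-128,
        quant_max=127,
        qscheme=torch.per_channel_symmetric,  # a separate scale for each output channel
        ch_axis=0,                            # the channel dimension along which scales vary
        is_dynamic=False,
        observer_or_fake_quant_ctr=PerChannelMinMaxObserver.with_args(),
    )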

@ZhiweiYan-96 requested a review from CuiYifeng March 24, 2025 08:10
--------------

This tutorial introduces XPUInductorQuantizer, which aims to serve quantized model inference on Intel GPUs. The tutorial will cover how it
utilizes the PyTorch 2 Export Quantization flow and lowers the quantized model into the inductor.


What are you trying to say in this phrase: "lowers the quantized model into the inductor"?

Author

It's the standard terminology in torch.compile.
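
Concretely, "lowering" refers to handing the converted model to ``torch.compile`` so that Inductor generates device kernels for it, as in Step 3 (``converted_model`` and ``example_inputs`` as in the flow sketched earlier):

::

    with torch.no_grad():
        # "lowering": Inductor compiles the converted quantized model into device kernels
        optimized_model = torch.compile(converted_model)
        optimized_model(*example_inputs)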

optimized_model(*example_inputs)

In a more advanced scenario, int8-mixed-bf16 quantization comes into play. In this instance,
a convolution or GEMM operator produces the output in BFloat16 instead of Float32 in the absence


Suggested change
a convolution or GEMM operator produces the output in BFloat16 instead of Float32 in the absence
a Convolution or GEMM operator produces the output in BFloat16 instead of Float32 in the absence

or

Suggested change
a convolution or GEMM operator produces the output in BFloat16 instead of Float32 in the absence
a Conv or GEMM operator produces the output in BFloat16 instead of Float32 in the absence

Author

Thanks for the suggestion. We may keep this as is, since "convolution" here is a plain common noun.
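
For context, the int8-mixed-bf16 path described in this hunk amounts to running the same lowering under autocast, along these lines (a sketch assuming XPU autocast support; the exact dtype plumbing may differ in the final tutorial):

::

    with torch.no_grad(), torch.amp.autocast("xpu", dtype=torch.bfloat16):
        # with autocast enabled, convolution/GEMM outputs stay in BFloat16 instead of Float32
        optimized_model = torch.compile(converted_model)
        optimized_model(*example_inputs)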

--------------

This tutorial introduces XPUInductorQuantizer, which aims to serve quantized models for inference on Intel GPUs.
It utilizes the PyTorch 2 Export Quantization flow and lowers the quantized model into the inductor.


Can we standardize capitalization of Inductor?

Author

Thanks for the reminder; the style is now aligned.

@ZhiweiYan-96 marked this pull request as ready for review April 1, 2025 08:01
@ZhiweiYan-96 requested a review from alexsin368 April 1, 2025 13:30
Author

ZhiweiYan-96 commented Apr 2, 2025

Hi @svekars @AlannaBurke, could you please help review our documentation? This PR adds a tutorial for PT2E INT8 on the Intel GPU backend. We appreciate your feedback and suggestions.

Contributor

@svekars left a comment

A few editorial suggestions.

@@ -0,0 +1,234 @@
PyTorch 2 Export Quantization with Intel GPU Backend through Inductor
Contributor

Suggested change
PyTorch 2 Export Quantization with Intel GPU Backend through Inductor
Export Quantization with Intel GPU Backend through Inductor

This tutorial introduces XPUInductorQuantizer, which aims to serve quantized models for inference on Intel GPUs.
It utilizes the PyTorch 2 Export Quantization flow and lowers the quantized model into the inductor.

The Pytorch 2 Export Quantization flow uses `torch.export` to capture the model into a graph and perform quantization transformations on top of the ATen graph.
Contributor

Do we need to call it "PyTorch 2 Export Quantization flow" or can it be just "Export Quantization flow"?

Suggested change
The Pytorch 2 Export Quantization flow uses `torch.export` to capture the model into a graph and perform quantization transformations on top of the ATen graph.
The PyTorch 2 Export Quantization flow uses ``torch.export`` to capture the model into a graph and perform quantization transformations on top of the ATen graph.

Author

Hi @svekars, "PyTorch 2 Export" here should be the full expansion of "pt2e" as used in APIs like prepare_pt2e and convert_pt2e. Could we keep it, as in the X86InductorQuantizer tutorial at https://pytorch.org/tutorials/prototype/pt2e_quant_x86_inductor.html?

@AlannaBurke requested review from HamidShojanazeri and removed requests for riverliuintel and alexsin368 April 7, 2025 19:30
@AlannaBurke added the module: xpu label Apr 7, 2025
Contributor

@AlannaBurke left a comment

Update with @svekars's suggestions and then I think this will be good. Also requested a review from @HamidShojanazeri.

Author

ZhiweiYan-96 commented Apr 9, 2025

Hi @AlannaBurke @svekars @HamidShojanazeri, I've applied the suggestions in the latest commits. Could you please review again and approve if there are no further issues with this tutorial? Many thanks for your advice.
