[#2188] Align test_workflow_ops_xpu tests with pytorch. #2893
jmamzax wants to merge 5 commits into intel:main from
Conversation
`test_learnable_forward_per_channel_cuda_xpu`, `test_learnable_backward_per_channel_cuda_xpu`
Pull request overview
Updates the XPU quantization workflow tests to match upstream PyTorch’s learnable per-channel fake-quant test cases, addressing failures tracked in #2188.
Changes:
- Removed Hypothesis-driven input generation for the two learnable per-channel CUDA/XPU tests and replaced it with fixed shapes/axes.
- Added dtype coverage for the learnable per-channel forward/backward tests (float32 and bfloat16).
- Dropped the unused `to_tensor` import after refactoring the backward test setup.
Comments suppressed due to low confidence (2)
test/xpu/quantization/core/test_workflow_ops_xpu.py:104
`shape = (2, 1, 2, 10)` with `axis = 1` makes `channel_size = X_base.size(axis)` equal to 1, so this "per-channel" test only exercises the single-channel case and won't catch channel-dependent bugs. Consider using a shape/axis combination where the selected dimension is > 1 (while still matching the intended PyTorch reference).
```python
shape = (2, 1, 2, 10)
axis = 1
for dtype in [torch.float32, torch.bfloat16]:
    X_base = torch.randn(shape, device="xpu").to(dtype)
    channel_size = X_base.size(axis)
```
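To illustrate the reviewer's point, here is a minimal sketch (mine, not from the PR or PyTorch) of per-channel fake quantization over plain Python lists, alongside a deliberately buggy variant that ignores the channel axis. With a single channel the two are indistinguishable; only a multi-channel input with distinct per-channel scales exposes the bug:

```python
def fake_quantize_per_channel(rows, scales, zero_points, qmin=-128, qmax=127):
    """rows[c][i]: value i of channel c; each channel has its own scale/zero point."""
    out = []
    for c, row in enumerate(rows):
        s, zp = scales[c], zero_points[c]
        # quantize -> clamp to [qmin, qmax] -> dequantize
        out.append([(min(max(round(v / s) + zp, qmin), qmax) - zp) * s for v in row])
    return out

def fake_quantize_buggy(rows, scales, zero_points, qmin=-128, qmax=127):
    """Buggy variant: always uses channel 0's scale/zero point."""
    s, zp = scales[0], zero_points[0]
    return [[(min(max(round(v / s) + zp, qmin), qmax) - zp) * s for v in row]
            for row in rows]

one_channel = [[0.5, 1.5]]
two_channels = [[0.5, 1.5], [0.5, 1.5]]

# With channel_size == 1 the bug is invisible: both functions agree.
assert fake_quantize_per_channel(one_channel, [0.1], [0]) == \
       fake_quantize_buggy(one_channel, [0.1], [0])

# With distinct per-channel scales the outputs diverge, exposing the bug.
assert fake_quantize_per_channel(two_channels, [0.1, 0.2], [0, 0]) != \
       fake_quantize_buggy(two_channels, [0.1, 0.2], [0, 0])
```

This is why a test with `channel_size == 1` gives no coverage of the per-channel code path, regardless of how many dtypes it loops over.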
test/xpu/quantization/core/test_workflow_ops_xpu.py:106
- Inside the dtype loop, `torch.randn(...).to(dtype)` (and the subsequent `.to(dtype)` conversions) introduces extra allocations/copies on XPU. Prefer creating tensors with the target dtype directly (e.g., pass `dtype=` to the factory functions) to keep the test lighter and reduce overhead.
```python
for dtype in [torch.float32, torch.bfloat16]:
    X_base = torch.randn(shape, device="xpu").to(dtype)
    channel_size = X_base.size(axis)
    scale_base = (
        torch.normal(mean=0, std=1, size=(channel_size,)).clamp(1e-4, 100).to(dtype)
    )
```
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.
```python
scale_base = (
    torch.normal(mean=0, std=1, size=(channel_size,)).clamp(1e-4, 100).to(dtype)
)
zero_point_base = torch.normal(mean=0, std=128, size=(channel_size,)).to(dtype)
```
In `_test_learnable_forward_per_channel_cuda`, `scale_base` and `zero_point_base` are created on the CPU and only cast to `dtype`. Since `X_base` is on XPU, this introduces host→device transfers (or potential device-mismatch issues if the downstream helper doesn't move them). Consider creating these tensors directly with `device="xpu"` (and with the target dtype at creation) to keep all inputs on the same device and avoid extra copies.
Suggested change:

```diff
-scale_base = (
-    torch.normal(mean=0, std=1, size=(channel_size,)).clamp(1e-4, 100).to(dtype)
-)
-zero_point_base = torch.normal(mean=0, std=128, size=(channel_size,)).to(dtype)
+scale_base = torch.normal(
+    mean=0,
+    std=1,
+    size=(channel_size,),
+    device="xpu",
+    dtype=dtype,
+).clamp(1e-4, 100)
+zero_point_base = torch.normal(
+    mean=0,
+    std=128,
+    size=(channel_size,),
+    device="xpu",
+    dtype=dtype,
+)
```
Part of issue #2188. The tests `test_learnable_forward_per_channel_cuda_xpu` and `test_learnable_backward_per_channel_cuda_xpu` were updated to match the PyTorch test cases.