
[#2188] Align test_workflow_ops_xpu tests with pytorch. #2893

Open
jmamzax wants to merge 5 commits into intel:main from
jmamzax:dev/jmamzax/issue-2188_fix_test_learnable_per_channel_cuda

Conversation


@jmamzax jmamzax commented Feb 16, 2026

Part of issue #2188. The tests test_learnable_forward_per_channel_cuda_xpu and test_learnable_backward_per_channel_cuda_xpu were updated to match the upstream PyTorch test cases.

test_learnable_forward_per_channel_cuda_xpu
test_learnable_backward_per_channel_cuda_xpu
@astachowiczhabana astachowiczhabana linked an issue Feb 24, 2026 that may be closed by this pull request
Copilot AI review requested due to automatic review settings March 2, 2026 11:04

Copilot AI left a comment


Pull request overview

Updates the XPU quantization workflow tests to match upstream PyTorch’s learnable per-channel fake-quant test cases, addressing failures tracked in #2188.

Changes:

  • Removed Hypothesis-driven input generation for the two learnable per-channel CUDA/XPU tests and replaced it with fixed shapes/axes.
  • Added dtype coverage for the learnable per-channel forward/backward tests (float32 and bfloat16).
  • Dropped the unused to_tensor import after refactoring the backward test setup.
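For context, per-channel fake quantization (the operation these tests exercise) applies, for each channel c with learnable scale s_c and zero point z_c: q = clamp(round(x / s_c + z_c), qmin, qmax), then dequantizes as (q - z_c) * s_c. A minimal pure-Python sketch of that math, kept torch-free; this is illustrative only and not the XPU kernel under test, which the tests reach through PyTorch's fake-quant ops on device tensors:

```python
def fake_quantize_per_channel(x, scales, zero_points, qmin=0, qmax=255):
    """Fake-quantize a list of channel rows, one (scale, zero_point) per row.

    Illustrative sketch of the quantize -> clamp -> dequantize round trip that
    the learnable per-channel tests compare against a reference implementation.
    """
    out = []
    for row, s, z in zip(x, scales, zero_points):
        q_row = []
        for v in row:
            q = round(v / s + z)           # quantize with this channel's params
            q = max(qmin, min(qmax, q))    # clamp to the quantized range
            q_row.append((q - z) * s)      # dequantize back to float
        out.append(q_row)
    return out
```

Values inside the representable range survive the round trip (up to rounding to the nearest quantization step), while out-of-range values saturate at qmin/qmax, which is exactly the behavior a per-channel test must check per channel.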
Comments suppressed due to low confidence (2)

test/xpu/quantization/core/test_workflow_ops_xpu.py:104

  • shape = (2, 1, 2, 10) with axis = 1 makes channel_size = X_base.size(axis) equal to 1, so this “per-channel” test only exercises the single-channel case and won’t catch channel-dependent bugs. Consider using a shape/axis combination where the selected dimension is > 1 (while still matching the intended PyTorch reference).
    shape = (2, 1, 2, 10)
    axis = 1

    for dtype in [torch.float32, torch.bfloat16]:
        X_base = torch.randn(shape, device="xpu").to(dtype)
        channel_size = X_base.size(axis)
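The reviewer's point can be checked without tensors: for this shape/axis pair the per-channel dimension has extent 1, so every "per-channel" parameter tensor degenerates to a single element. A trivial plain-Python sketch mirroring `X_base.size(axis)`:

```python
# Shape/axis combination from the test (illustrative; the real test
# operates on torch tensors allocated on the XPU device).
shape = (2, 1, 2, 10)
axis = 1

channel_size = shape[axis]  # mirrors X_base.size(axis)
print(channel_size)         # 1 -> only the single-channel case is covered

# An axis with extent > 1 would exercise channel-dependent code paths:
print(shape[3])             # 10 channels along the last dimension
```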

test/xpu/quantization/core/test_workflow_ops_xpu.py:106

  • Inside the dtype loop, torch.randn(...).to(dtype) (and the subsequent .to(dtype) conversions) introduces extra allocations/copies on XPU. Prefer creating tensors with the target dtype directly (e.g., pass dtype= to the factory functions) to keep the test lighter and reduce overhead.
    for dtype in [torch.float32, torch.bfloat16]:
        X_base = torch.randn(shape, device="xpu").to(dtype)
        channel_size = X_base.size(axis)
        scale_base = (
            torch.normal(mean=0, std=1, size=(channel_size,)).clamp(1e-4, 100).to(dtype)


Copilot AI review requested due to automatic review settings March 6, 2026 07:34

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.



Comment on lines +105 to +108
        scale_base = (
            torch.normal(mean=0, std=1, size=(channel_size,)).clamp(1e-4, 100).to(dtype)
        )
        zero_point_base = torch.normal(mean=0, std=128, size=(channel_size,)).to(dtype)

Copilot AI Mar 6, 2026


In _test_learnable_forward_per_channel_cuda, scale_base and zero_point_base are created on the CPU and only cast to dtype. Since X_base is on XPU, this introduces host→device transfers (or potential device-mismatch issues if the downstream helper doesn’t move them). Consider creating these tensors directly on device='xpu' (and with the target dtype at creation) to keep all inputs on the same device and avoid extra copies.

Suggested change
        scale_base = (
            torch.normal(mean=0, std=1, size=(channel_size,)).clamp(1e-4, 100).to(dtype)
        )
        zero_point_base = torch.normal(mean=0, std=128, size=(channel_size,)).to(dtype)
        scale_base = torch.normal(
            mean=0,
            std=1,
            size=(channel_size,),
            device="xpu",
            dtype=dtype,
        ).clamp(1e-4, 100)
        zero_point_base = torch.normal(
            mean=0,
            std=128,
            size=(channel_size,),
            device="xpu",
            dtype=dtype,
        )



Development

Successfully merging this pull request may close these issues.

[Bug Skip]: new failures in 2025-10-17

4 participants