
Conversation

@mwiktor-intel

Fixes #1900. The implementation uses two oneMKL LAPACK routines,
geqrf and orgqr, the latter for recovering the explicit Q matrix. Since torch and LAPACK use different storage formats, a hard transposition (changing the memory layout, not only the strides) was necessary. The iteration over the batch exploits the internal memory layout of the processed data.
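For context, the core of such a kernel is a geqrf/orgqr pair per matrix. The sketch below is a simplified, single-matrix, float-only illustration of the oneMKL USM call sequence; the function name and buffer handling are placeholders, not the PR's actual code.

    #include <oneapi/mkl.hpp>
    #include <sycl/sycl.hpp>
    #include <algorithm>
    #include <cstdint>

    // Sketch: factor one m x n column-major matrix `a` (lda = m) in place,
    // then overwrite its leading m x k block with the explicit Q, where
    // k = min(m, n). `tau` must hold k Householder reflectors.
    void qr_single(sycl::queue& q, float* a, float* tau,
                   std::int64_t m, std::int64_t n) {
      const std::int64_t k = std::min(m, n);
      const std::int64_t s1 =
          oneapi::mkl::lapack::geqrf_scratchpad_size<float>(q, m, n, m);
      const std::int64_t s2 =
          oneapi::mkl::lapack::orgqr_scratchpad_size<float>(q, m, k, k, m);
      const std::int64_t len = std::max(s1, s2);
      float* scratch = sycl::malloc_device<float>(len, q);
      // geqrf leaves R in the upper triangle and the reflectors below it.
      oneapi::mkl::lapack::geqrf(q, m, n, a, m, tau, scratch, len).wait();
      // orgqr expands the reflectors into the first k columns of Q.
      oneapi::mkl::lapack::orgqr(q, m, k, k, a, m, tau, scratch, len).wait();
      sycl::free(scratch, q);
    }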

@CuiYifeng
Contributor

CuiYifeng commented Nov 26, 2025

  • QR is an MKL op rather than a SYCL op. Please move the kernel code to xpu/mkl/.
  • The existing xpu/mkl/BatchLinearAlgebra.cpp is a good home for the kernel code (refer to stock PyTorch). Correspondingly, the op-level code can be added in xpu/BatchLinearAlgebra.cpp.
  • Please check the lint error.
  • Adding a new test case is good. Please check whether there are related cases in test/xpu/skip_list_common.py. If so, please reactivate them.

@Silv3S
Contributor

Silv3S commented Nov 26, 2025

There were no tests on the skip lists, as QR was silently falling back to CPU. After removing the fallback, some tests may start to fail.

Contributor

@CuiYifeng CuiYifeng left a comment


Please check the comments and fix the related failing cases.

Comment on lines +69 to +83
TORCH_IMPL_FUNC(linalg_qr_xpu_out)(const Tensor& A,
                                   std::string_view mode,
                                   const Tensor& Q,
                                   const Tensor& R) {
#if defined(USE_ONEMKL_XPU)
  xpu::linalg_qr_kernel(A, mode, Q, R);
#else
  auto A_cpu = A.to(at::kCPU);
  auto Q_cpu = at::empty_like(Q, at::kCPU);
  auto R_cpu = at::empty_like(R, at::kCPU);
  at::cpu::linalg_qr_out(Q_cpu, R_cpu, A_cpu, mode);
  Q.copy_(Q_cpu);
  R.copy_(R_cpu);
#endif // USE_ONEMKL_XPU
}
Contributor


My suggestion is to register geqrf_kernel_xpu/orgqr_kernel_xpu to geqrf_stub/orgqr_stub, which allows us to reuse the op-level code in stock PyTorch and to reuse these two kernels in the future.
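For reference, a minimal sketch of what such a registration could look like; the kernel names are hypothetical, and the exact stub signatures should be taken from ATen's BatchLinearAlgebra.h rather than from this sketch.

    #include <ATen/native/BatchLinearAlgebra.h>
    #include <ATen/native/DispatchStub.h>

    namespace at::native {

    // Hypothetical XPU kernels matching the stock stub signatures:
    // geqrf_stub expects  void(const Tensor& input, const Tensor& tau);
    // orgqr_stub expects  Tensor&(Tensor& result, const Tensor& tau).
    void geqrf_kernel_xpu(const Tensor& input, const Tensor& tau);
    Tensor& orgqr_kernel_xpu(Tensor& result, const Tensor& tau);

    // Registering against the existing stubs lets stock PyTorch's op-level
    // linalg_qr code drive the XPU kernels, with no separate dispatch entry.
    REGISTER_XPU_DISPATCH(geqrf_stub, &geqrf_kernel_xpu);
    REGISTER_XPU_DISPATCH(orgqr_stub, &orgqr_kernel_xpu);

    } // namespace at::native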

Comment on lines +9446 to +9456
- func: linalg_qr(Tensor A, str mode='reduced') -> (Tensor Q, Tensor R)
  python_module: linalg
  variants: function
  structured_delegate: linalg_qr.out

- func: linalg_qr.out(Tensor A, str mode='reduced', *, Tensor(a!) Q, Tensor(b!) R) -> (Tensor(a!) Q, Tensor(b!) R)
  python_module: linalg
  structured: True
  dispatch:
    XPU: linalg_qr_xpu_out


Author

@mwiktor-intel mwiktor-intel Dec 1, 2025


Using the existing PT stubs for implementing QR has both pros and cons. On the pro side, we make the most of the existing infrastructure, with a simplified flow on the SYCL/XPU side. The implementation proposed here, however, also has several strengths. Namely:

  1. Although QR is a composition of geqrf and orgqr, the initial request was for QR, not for wrappers around the mentioned MKL functions intended to fit the QR meta-operator.
  2. The MKL functions are matrix-oriented and can operate only on matrices. Thus, every subsequent call would require slicing the tensor and sending its parts to the kernel, which would force a contiguous copy separately for each slice instead of once at the beginning of the proposed kernel. Since the proposed kernel uses direct pointers, it can walk through the batch simply by advancing pointers; no slicing or memory reorganization is needed (see the sketch after this comment).
  3. The biggest potential efficiency trap lies in the matrix-layout difference between PT and LAPACK: the former stores data column-wise, the latter row-wise. To call the MKL functions correctly, the data must be transposed.
     In the fused version proposed in this PR, the transposition happens twice, on entry and on exit. Splitting this kernel into two separate calls would have to match what users expect when calling geqrf and orgqr, which means the data would have to be transposed twice per call. "Transposed" here means actually reordering the data in memory since, unlike the PT functions, MKL does not support strides.

Thus, although providing these two kernels separately is a natural development path, I would prioritize it for later, after QR and the two pending requests are ready.
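To illustrate point 2, here is a minimal sketch of the pointer walk over a batch, assuming b column-major m x n matrices packed back to back; the names, the float-only type, and the pre-allocated scratchpad are illustrative, not the PR's actual code.

    #include <oneapi/mkl.hpp>
    #include <sycl/sycl.hpp>
    #include <algorithm>
    #include <cstdint>

    // Factor b matrices packed contiguously in `a`: each matrix occupies
    // m * n elements, so advancing a raw pointer replaces per-slice tensor
    // slicing and per-slice contiguity enforcement.
    void geqrf_batched(sycl::queue& q, float* a, float* tau,
                       std::int64_t b, std::int64_t m, std::int64_t n,
                       float* scratch, std::int64_t scratch_len) {
      const std::int64_t k = std::min(m, n);
      for (std::int64_t i = 0; i < b; ++i) {
        oneapi::mkl::lapack::geqrf(q, m, n, a + i * m * n, m,
                                   tau + i * k, scratch, scratch_len).wait();
      }
    }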

Contributor

@CuiYifeng CuiYifeng Dec 2, 2025


@mwiktor-intel Regarding the layout differences, could you point to where the MKL documentation says that LAPACK is row-major? https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-dpcpp/2025-2/geqrf-usm-version.html Please pay attention to the leading dimension lda of the input matrix. Suppose matrix A is stored column-major with shape (m, n); we can set lda=m to indicate the correct layout.

Author


LDA cannot be a replacement for strides. LDA can be used to skip some elements of a matrix, but with LDA we cannot force a different data orientation. Specifically, say we have a 3x2 matrix. LAPACK expects this 6-element array in memory to represent the data as follows:
[ 0 2 4
  1 3 5 ]
LDA here is 2; setting LDA to 3 does not change anything here. The purpose of LDA is for when, for whatever reason, we have some bigger structure in memory, say,
[ 0 3 6
  1 4 7
  2 5 8 ]
and we want to process only the upper 2 rows. LDA here should be 3, telling the function that, although the logical number of rows is 2, the procedure should skip 3 values to get to the data of the 2nd column.

Some BLAS functions take a row_major or col_major order, but that is not the case within LAPACK.
Note that, per the documentation, LDA cannot be set to less than the number of rows of the processed matrix.

Contributor

@CuiYifeng CuiYifeng Dec 2, 2025


Thanks for the elaboration. Let me rephrase your example 1: given a 3x2 matrix stored as [0, 2, 4, 1, 3, 5] in memory, the logical order is [[0, 1], [2, 3], [4, 5]] if this matrix is column-major, right? You mentioned "LDA here is 2", but why can LDA here be smaller than the number of rows?
In your example 2: given the memory buffer [0, 3, 6, 1, 4, 7, 2, 5, 8], the logical order of a 3x3 column-major matrix is [[0, 1, 2], [3, 4, 5], [6, 7, 8]]. We can use shape 2x3 and LDA=3 to get the column-major matrix [[0, 1, 2], [3, 4, 5]] from this buffer. In this example, I agree with your understanding.
LDA is the distance in memory between the starts of consecutive columns (for column-major storage) or consecutive rows (for row-major storage). In other words, LDA means stride[0] for row-major or stride[1] for column-major in the buffer (not the stride in the matrix, as in the bigger memory you mentioned).
Let's return to the API of geqrf. Given a buffer, the shape (m, n) and LDA tell the API how to get the data from the buffer. Furthermore, you have noticed that the documentation requires "The leading dimension of a, at least max(1, m)"; why not at least max(1, n)? In my understanding, this implies that the geqrf API requires a column-major input matrix.
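As a small illustration of this reading of lda (a sketch against the oneMKL USM API linked above; the buffer contents, function name, and pre-allocated tau/scratchpad are hypothetical):

    #include <oneapi/mkl.hpp>
    #include <sycl/sycl.hpp>
    #include <cstdint>

    // Treat a 3x3 column-major buffer [0, 3, 6, 1, 4, 7, 2, 5, 8] as the
    // 2x3 submatrix [[0, 1, 2], [3, 4, 5]] by passing shape (m=2, n=3) and
    // lda=3: element (i, j) sits at a[i + j * lda], so lda > m simply skips
    // the buffer's third row when stepping from column to column.
    void factor_top_two_rows(sycl::queue& q, float* a /* 9 device floats */,
                             float* tau /* >= 2 device floats */,
                             float* scratch, std::int64_t scratch_len) {
      const std::int64_t m = 2, n = 3, lda = 3;
      oneapi::mkl::lapack::geqrf(q, m, n, a, lda, tau,
                                 scratch, scratch_len).wait();
    }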

@CuiYifeng CuiYifeng requested a review from Copilot December 3, 2025 03:01
Contributor

Copilot AI left a comment


Pull request overview

This PR implements the QR decomposition operator for XPU devices using Intel's oneMKL LAPACK routines (geqrf for the QR factorization and orgqr for recovering the orthogonal Q matrix). The implementation handles batch processing and requires a hard memory transposition due to the different storage formats of PyTorch and LAPACK.

Key changes:

  • Adds native XPU implementation of linalg_qr operator with support for 'reduced', 'complete', and 'r' modes
  • Implements batch processing for QR decomposition with proper memory layout handling
  • Comprehensive test coverage for different tensor dimensions, batch sizes, and decomposition modes

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.

Summary per file:
yaml/native/native_functions.yaml: registers linalg_qr with XPU dispatch
test/xpu/test_linalg_xpu.py: adds comprehensive test cases for QR modes with orthogonality and triangularity verification
src/ATen/native/xpu/mkl/BatchLinearAlgebra.h: declares the QR kernel function interface
src/ATen/native/xpu/mkl/BatchLinearAlgebra.cpp: implements the QR decomposition kernel using oneMKL LAPACK
src/ATen/native/xpu/XPUFallback.template: removes linalg_qr.out from the fallback list (now natively supported)
src/ATen/native/xpu/BatchLinearAlgebra.cpp: implements the dispatcher function with a CPU fallback when oneMKL is unavailable


int64_t n = a_contig.sizes().at(range - 2);
int64_t m = a_contig.sizes().at(range - 1);
int64_t mn = int64_t(m * n);
int64_t b = numel ==0 ? 0 : numel / mn;

Copilot AI Dec 3, 2025


Missing space around '==' operator.

Suggested change
int64_t b = numel ==0 ? 0 : numel / mn;
int64_t b = numel == 0 ? 0 : numel / mn;

int64_t b = numel ==0 ? 0 : numel / mn;


if (b==0 && mode=="complete" && n>0) {

Copilot AI Dec 3, 2025


Missing spaces around comparison operators. Should be b == 0, mode == "complete", and n > 0 for consistency with coding standards.

Suggested change
if (b==0 && mode=="complete" && n>0) {
if (b == 0 && mode == "complete" && n > 0) {

Comment on lines +601 to +608
std::vector v(dimensions.begin(), dimensions.end());
if (mode != "r") {
v[range - 1] = v[range - 2];
v[range - 2] = out_q_columns;
} else {
v = std::vector<long>({0});
}
auto q_dimensions = at::IntArrayRef(v);

Copilot AI Dec 3, 2025


[nitpick] Variable name 'v' is ambiguous. Consider renaming to something more descriptive like 'q_dimensions_vec' to clarify its purpose.

Suggested change
std::vector v(dimensions.begin(), dimensions.end());
if (mode != "r") {
v[range - 1] = v[range - 2];
v[range - 2] = out_q_columns;
} else {
v = std::vector<long>({0});
}
auto q_dimensions = at::IntArrayRef(v);
std::vector<long> q_dimensions_vec(dimensions.begin(), dimensions.end());
if (mode != "r") {
q_dimensions_vec[range - 1] = q_dimensions_vec[range - 2];
q_dimensions_vec[range - 2] = out_q_columns;
} else {
q_dimensions_vec = std::vector<long>({0});
}
auto q_dimensions = at::IntArrayRef(q_dimensions_vec);

v[range - 1] = v[range - 2];
v[range - 2] = out_q_columns;
} else {
v = std::vector<long>({0});

Copilot AI Dec 3, 2025


Creating a vector with size 0 dimension {0} may cause issues when used to construct q_dimensions. This should likely be an empty vector std::vector<long>() or retain the original dimensions.

Suggested change
v = std::vector<long>({0});
v = std::vector<long>();

Comment on lines +617 to +619
oneapi::mkl::lapack::geqrf_scratchpad_size<scalar_t>(queue, n+1, m+1, n+1);
int64_t bufsize2 =
oneapi::mkl::lapack::orgqr_scratchpad_size<scalar_t>(queue, n+1, m+1, m+1, n+1);

Copilot AI Dec 3, 2025


The scratchpad size calculations use n+1 and m+1, but the actual LAPACK calls on lines 636 and 644-654 use n and m without the +1 offset. This dimension mismatch could lead to buffer overflows or incorrect scratchpad allocation.

Suggested change
oneapi::mkl::lapack::geqrf_scratchpad_size<scalar_t>(queue, n+1, m+1, n+1);
int64_t bufsize2 =
oneapi::mkl::lapack::orgqr_scratchpad_size<scalar_t>(queue, n+1, m+1, m+1, n+1);
oneapi::mkl::lapack::geqrf_scratchpad_size<scalar_t>(queue, n, m, n);
int64_t bufsize2 =
oneapi::mkl::lapack::orgqr_scratchpad_size<scalar_t>(queue, n, out_q_columns, tau_len, n);

for (int batch_item = 0; batch_item < b; batch_item++) {


if (mn!=0) // make QR if there is something to orthogonalize

Copilot AI Dec 3, 2025


Missing spaces around '!=' operator. Should be mn != 0 for consistency.

Suggested change
if (mn!=0) // make QR if there is something to orthogonalize
if (mn != 0) // make QR if there is something to orthogonalize

#endif // USE_ONEMKL_XPU
}



Copilot AI Dec 3, 2025


Empty comment line should be removed.

Suggested change

#include <ATen/native/DispatchStub.h>
#include <ATen/native/LinearAlgebraUtils.h>
#include <ATen/ops/linalg_qr_native.h>
#include <ATen/ops/linalg_qr_cpu_dispatch.h>

Copilot AI Dec 3, 2025


The include linalg_qr_cpu_dispatch.h appears unused in this file since the CPU fallback uses at::cpu::linalg_qr_out which should be available through the native header.

Suggested change
#include <ATen/ops/linalg_qr_cpu_dispatch.h>

Contributor

@CuiYifeng CuiYifeng left a comment


Please check the comments (including Copilot's). This PR needs to be reviewed again after the series of failing cases is fixed.
I suggest refreshing the code format before review. Furthermore, please also check the performance compared to CPU.

Comment on lines +590 to +593
b=1;
for (int dimension=0; dimension<range-2; dimension++) {
b*=a_contig.sizes().at(dimension);
}
Contributor


at::native::batchCount can help you get the flattened batch size: https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/LinearAlgebraUtils.h#L112
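For illustration, the quoted loop could collapse to a single call; a sketch, assuming batchCount keeps its current form in the linked LinearAlgebraUtils.h (it multiplies all dimensions except the trailing two):

    #include <ATen/native/LinearAlgebraUtils.h>

    // Equivalent of the hand-written loop above: the product of every
    // dimension of a_contig except the last two matrix dimensions.
    int64_t b = at::native::batchCount(a_contig);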
