From a40b2f456f3feb192a54b72e35140c5699482e97 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Pekka=20J=C3=A4=C3=A4skel=C3=A4inen?=
Date: Tue, 13 Dec 2022 13:47:12 +0200
Subject: [PATCH] cl_khr_defined_builtin_kernels

First WiP draft of a defined BiKs extension.
---
 ext/cl_khr_defined_builtin_kernels.asciidoc |  330 +++++
 ext/cl_khr_defined_builtin_kernels.html     | 1288 +++++++++++++++++++
 2 files changed, 1618 insertions(+)
 create mode 100644 ext/cl_khr_defined_builtin_kernels.asciidoc
 create mode 100644 ext/cl_khr_defined_builtin_kernels.html

diff --git a/ext/cl_khr_defined_builtin_kernels.asciidoc b/ext/cl_khr_defined_builtin_kernels.asciidoc
new file mode 100644
index 00000000..a503b5a2
--- /dev/null
+++ b/ext/cl_khr_defined_builtin_kernels.asciidoc
@@ -0,0 +1,330 @@
// Copyright 2018-2022 The Khronos Group. This work is licensed under a
// Creative Commons Attribution 4.0 International License; see
// http://creativecommons.org/licenses/by/4.0/
= cl_khr_defined_builtin_kernels

:source-highlighter: coderay

[[cl_khr_defined_builtin_kernels]]
== Khronos-Defined Built-in Kernels (Early Draft)

The purpose of this extension is to provide a standardized set of built-in
kernels with well-defined semantics, useful for accelerating applications
from various domains. The extension specification is designed to be a
"living" document that expands rapidly through the addition of new
well-defined built-in kernel definitions and updates to previously defined
ones.

=== General Information

==== Name Strings

`cl_khr_defined_builtin_kernels`

==== Version History

[cols="1,1,3",options="header",]
|====
| *Date* | *Version* | *Description*
| 2022-12-13 | 0.1.0 | First formulation as an extension specification, as proposed by Ben Ashbaugh.
|====

==== Dependencies

This extension is written against the OpenCL Specification version 3.0.12.

This extension requires OpenCL 1.2 or later.

==== Contributors

Pekka Jääskeläinen, Intel and Tampere University.

Topi Leppänen, Tampere University.

Jan Solanti, Tampere University.

Ben Ashbaugh, Intel.


=== Overview

OpenCL 1.2 specifies a built-in kernel (BiK) as a kernel that is executed on
an OpenCL device or custom device by fixed-function hardware or in firmware.
Applications can query the built-in kernels supported by a device or custom
device.

BiKs are referred to by name (a C string) without any semantics attached to
the functionality. The semantics behind a name are completely
device-specific and are typically documented in vendor-specific extension
specifications.

The goal of this extension is to lower the bar for utilizing
hardware-accelerated functions in drivers by providing a library of
well-defined BiKs that covers common acceleration needs and is designed to
evolve easily over time.

The device drivers that implement this extension can freely choose which
subset of the defined BiKs they implement and advertise to clients. Clients
can use the BiKs to accelerate their applications by invoking them manually.
The extension is also designed to support automated task-graph lowering
tooling later.

==== Background

ASIC-based coarse-grained hardware accelerators are specialized logic meant
to speed up the execution of workloads of interest, or to provide
improvements in energy efficiency.
Examples of contemporary workloads that benefit from hardware acceleration
over software-based implementations include video coding, deep learning,
cryptography, software-defined radio and graphics rendering.

FPGAs form a special case somewhere between instruction-set architectures
and fixed-function hardware accelerators. While advances in high-level
synthesis tools have attempted to bridge the programmability gap between GPU
and FPGA programming, FPGAs are still considered devices on which efficient
implementations are challenging to achieve. Due to the extensive manual
optimization work required to implement the accelerated functionality
efficiently, defining FPGA designs as a system of "hardware accelerator IPs"
is still a widely used application abstraction. FPGAs can thus be seen as a
platform that can realize and integrate any hardware accelerator
implementable with the programmable fabric.

The means to utilize hardware accelerators have typically been
vendor-specific and abstracted behind domain-specific libraries. The
overhead of this "bunch of libraries" approach shows at the lowest level of
integration: each library utilizes a low-level (typically vendor-specific)
library to interface with the actual hardware, and thus does not integrate
efficiently with other libraries or with software-programmable processors
that might be available on the same chip.

==== Rationale

OpenCL's built-in kernel abstraction allows pushing both hardware-accelerated
and software-defined kernels to the same command-queues, providing a
powerful means for asynchronous execution of heterogeneous task graphs on
diverse heterogeneous platforms. The ability to invoke hardware accelerators
while being able to synchronize and optimize data transfers at the lowest
levels of the driver stack can provide significant latency benefits,
especially when combined with the command-buffering mechanism.

However, the BiK abstraction works well only when it is widely adopted by
vendors, and when multiple vendors implement the same definitions. Otherwise
each vendor specifies and implements its own BiKs closely matching its own
hardware accelerator properties, resulting in a lack of cross-vendor
portability in the API abstraction presented to the upper layers of
heterogeneous computing software stacks.

This extension standardizes a set of well-defined BiKs that clients can call
from higher-level programming stacks built with different languages and
multiple libraries, possibly mixing accelerator calls with software kernel
commands, and relying on the driver stack to optimize the execution
(especially the synchronization and communication) as a low-level
heterogeneous task graph. It aims to promote the use of BiKs as a
programming model for hardware-accelerated functionality, to improve the
cross-vendor portability of hardware-accelerated computing.

=== Modifications to section 4.2 of the OpenCL API Specification

Modify *Table 5*, _Device Queries_, of section 4.2, by adding the following
sentence to the description cell of `CL_DEVICE_BUILT_IN_KERNELS`:

[quote]
The semantics of the returned built-in kernels are undefined or defined in
vendor-specific documentation, unless the name starts with the prefix
`khr_`, in which case it is a built-in kernel with semantics defined in
Appendix I.
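
For illustration only (informative; not part of the proposed specification
text), a host program might detect a defined BiK and instantiate it with the
existing OpenCL 1.2 built-in kernel APIs roughly as follows. Error handling
is mostly elided, and `khr_blas_gemm_float` is one of the draft example
entries from Appendix I below:

[source,c]
----
#include <stdlib.h>
#include <string.h>
#include <CL/cl.h>

/* Returns a kernel object for the khr_blas_gemm_float BiK, or NULL if the
   device does not advertise it. */
cl_kernel create_gemm_bik(cl_context ctx, cl_device_id dev)
{
  size_t size = 0;
  cl_int err = clGetDeviceInfo(dev, CL_DEVICE_BUILT_IN_KERNELS, 0, NULL,
                               &size);
  if (err != CL_SUCCESS || size == 0)
    return NULL;

  char *names = (char *)malloc(size);
  clGetDeviceInfo(dev, CL_DEVICE_BUILT_IN_KERNELS, size, names, NULL);

  /* The query returns a semicolon-separated list of kernel names; a robust
     client would tokenize on ';' instead of using a substring search. */
  cl_kernel kernel = NULL;
  if (strstr(names, "khr_blas_gemm_float") != NULL) {
    cl_program prog = clCreateProgramWithBuiltInKernels(
        ctx, 1, &dev, "khr_blas_gemm_float", &err);
    if (err == CL_SUCCESS)
      kernel = clCreateKernel(prog, "khr_blas_gemm_float", &err);
  }
  free(names);
  return kernel;
}
----

The returned kernel would then be given its arguments with `clSetKernelArg`
in the order listed in Appendix I and enqueued like any other kernel, for
example with `clEnqueueNDRangeKernel`.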

=== Add new appendix "Appendix I - Defined Built-in Kernels" to the OpenCL API Specification

This appendix describes standard built-in kernels (BiKs) with well-defined
semantics. A conformant device can report support for zero or more of the
built-in kernels via the `CL_DEVICE_BUILT_IN_KERNELS` or
`CL_DEVICE_BUILT_IN_KERNELS_WITH_VERSION` device queries.

The general client-side abstraction of a defined built-in kernel is similar
to a call to a C function whose implementation is hidden. The device driver
can invoke one or more physical hardware accelerators combined with firmware
to implement the semantics as efficiently as possible.

It is the driver's responsibility to handle efficient synchronization and
communication with the hardware accelerator, internal accelerator state
management, and resource sharing across multiple OpenCL contexts.

==== Standard Built-in Kernels ====

The following list of recognized built-ins is organized according to
application domain and processed data types. It is expected to grow and be
updated while preserving backwards compatibility.

[caption="Table A.I.1. "]
.Standard Built-in Kernels and Their Semantics. *The table has been populated with a small set of non-trivial example entries which are subject to change and the list to expand during drafting.*
[cols="1,3,2,2"]
|===
4+| *General linear algebra*
// https://netlib.org/blas/blasqr.pdf
| Name | Description | NDRange Dimensions | Arguments
| *khr_blas_gemm_float*
| xGEMM: General matrix multiplication with real single-precision floating-point numbers, as described in the Basic Linear Algebra Subprograms. Performs C = alpha * trans(A) * trans(B) + beta * C, where A, B and C are matrices, and alpha and beta are scalars. trans() is a configurable transpose operation.
a|
[start=1]
. The height.
. The width.
a|
[start=0]
. int: transpose operation (trans) type for matrix A (0 = none, 1 = transpose, 2 = conjugate transpose)
. int: transpose type for matrix B (0 = none, 1 = transpose, 2 = conjugate transpose)
. float: scalar (alpha) to multiply the matrix multiplication result elements with
. float* (input): matrix A
. int: leading dimension of A (0 = row-major, 1 = column-major)
. float* (input): matrix B
. int: leading dimension of B (0 = row-major, 1 = column-major)
. float: scalar (beta) to multiply the C matrix elements with before adding them to the result
. float* (input & output): matrix C, which is added to the matrix multiplication result and stores the output
. int: leading dimension of C (0 = row-major, 1 = column-major)
4+| OpenCL C Semantics
4+a|
[source,c]
----
__kernel void __khr_blas_gemm_float(
    int transA, int transB, float alpha, const global float *A, int ldA,
    const global float *B, int ldB,
    float beta, global float *C, int ldC) {
  // TBD: An example implementation that can be used for verification
  // and as a fallback SW implementation.
}
----

4+| *OpenVX Neural Network Extension Compatible Kernels*
// Copied from https://registry.khronos.org/OpenVX/extensions/vx_khr_nn/1.2/html/d6/d9a/group__group__cnn.html#ga69764625f436c14d739fc467515c1584
| Name | Description | NDRange Dimensions | Arguments
| *khr_openvx_nn_extension_convolution_uchar*
| Convolution for 8-bit unsigned integer inputs and weights.
a|
[start=1]
. Batch size.
. Width.
. Height.
a|
[start=0]
. uchar* [in]: The input tensor data. The three lower dimensions represent a single input; all following dimensions represent the number of batches, possibly nested.
The dimension order is [width, height, #IFM, #batches].
. uchar* [in]: Weights, as a 4D tensor with dimensions [kernel_x, kernel_y, #IFM, #OFM].
. uchar* [in]: Biases (optional, ignored if NULL). The biases may be shared (one per OFM) or unshared (one per OFM * output location). The possible layouts are either [#OFM] or [width, height, #OFM]. The biases' data type must match the data type of the inputs. (Kernel parameter #2)
. size_t: (dilation_x) "Inflate" the kernel by inserting zeros between the kernel elements in the x direction. The value is the number of zeros to insert.
. size_t: (dilation_y) "Inflate" the kernel by inserting zeros between the kernel elements in the y direction. The value is the number of zeros to insert.
. int: Rounding method for calculating the output dimensions.
. int: A VX_TYPE_ENUM of the vx_convert_policy_e enumeration.
. size_t: Number of elements padded at each side in the x dimension of the input.
. size_t: Number of elements padded at each side in the y dimension of the input.
. int: A VX_TYPE_ENUM of the vx_round_policy_e enumeration.
. uchar* [out]: The output tensor data. The output will have the same number and structure of dimensions as the input. The output tensor data type must be the same as the inputs'. (Kernel parameter #4)

4+| OpenCL C Semantics
4+a|
[source,c]
----
__kernel void __khr_openvx_nn_extension_convolution_uchar(
    const uchar *input, const uchar *weights, const uchar *biases,
    size_t dilation_x, size_t dilation_y,
    int down_scale_rounding, int overflow_policy, size_t padding_x,
    size_t padding_y, int rounding_policy, uchar *output) {
  // TBD.
}
----

4+| *Direct Input/Output Operations*
4+| Kernels for accessing data sources and destinations directly without host involvement.
| Name | Description | NDRange Dimensions | Arguments
| *khr_io_stream_in_uchar*
| Non-blocking read of data from a sensor/stream associated with the device.
a| -
a|
[start=0]
. uchar* [out]: The data.
. size_t* [in+out]: In: the number of bytes to read. Out: the number of bytes that could be read (can be 0). (Compatible with the `cl_pocl_content_size` extension for optimizing data transfers.)

4+| OpenCL C Semantics
4+a|
[source,c]
----
__kernel void __khr_io_stream_in_uchar(
    uchar *output, size_t *num) {
  // It is not feasible to describe this kernel in OpenCL C as I/O devices
  // are not representable with it.
}
----

| *khr_io_stream_out_uchar*
| Non-blocking write of data to an output/sink associated with the device.
a| -
a|
[start=0]
. uchar* [in]: The data to write.
. size_t* [in+out]: In: the number of bytes to write. Out: the number of bytes that could be written (can be 0).
4+| OpenCL C Semantics
4+a|
[source,c]
----
__kernel void __khr_io_stream_out_uchar(
    uchar *input, size_t *num) {
  // It is not feasible to describe this kernel in OpenCL C as I/O devices
  // are not representable with it.
}
----

| *khr_io_stream_in_blocking_uchar*
| Blocking read of data from a sensor/stream associated with the device.
a| -
a|
[start=0]
. uchar* [out]: The data.
. size_t* [in]: The number of bytes to read before returning.

4+| OpenCL C Semantics
4+a|
[source,c]
----
__kernel void __khr_io_stream_in_blocking_uchar(uchar *output, size_t *num) {
  size_t remaining = *num;
  while (remaining) {
    size_t num_read = remaining;
    __khr_io_stream_in_uchar(output, &num_read);
    remaining -= num_read;
    output += num_read;
  }
}
----

|===

==== Launching BiKs from the Device Side ====

BiKs are primarily meant to be launched as kernel commands via host-side
command queues. Optionally, they can be callable from the device side via
`enqueue_kernel`. This capability can be queried on a per-BiK basis at
compile time in OpenCL C by checking for macro definitions with the
following naming convention: `cl_khr_bik_BUILTIN_KERNEL_NAME`. When a BiK
macro is defined, a kernel following the naming convention
`__khr_BUILTIN_KERNEL_NAME()` can be enqueued by the program on the device
side like a software-defined kernel. An informative sketch of such a launch
is given after the open questions below.


=== Open questions

. Should we enable launching BiKs from the device side without requiring device-side enqueue? The main problem is BiKs with an NDRange, as they are not simple single-WI helper functions.
+
--
*UNRESOLVED*

--

. Should the NDRange be used at all in BiKs? It feels somewhat unnatural, as the NDRange is typically used to imply SPMD parallelism while the hardware/firmware is free to choose whatever degree of parallelism to implement the function with. On the other hand, something similar applies to software kernel launches, as the work-items can be executed serially when adhering to barrier semantics.
+
--
*UNRESOLVED*

--

. Different accelerators prefer different channel orders (NHWC vs. NCHW, ...) for the processed data. Should the channel order be passed as a BiK argument (like the example GEMM's row/column order), or is it better to have different BiK variations for each?
+
--
*UNRESOLVED*

--

. How should preference be denoted? Some BiKs are more efficient on a given device because they map more naturally to the underlying HW accelerator, but the slower variations (for example, with a suboptimal channel order in NN accelerators) might still be beneficially accelerated.
+
--
*UNRESOLVED*

--

. Since the defined built-in kernel concept is basically a C-like API inside another API, should it be made more generic and thus directly usable for SYCL and Vulkan as well?
+
--
*UNRESOLVED*

--
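
For illustration only (informative), a device-side launch might look as
follows. The sketch assumes OpenCL C 2.0 device-side enqueue, a device that
defines `cl_khr_bik_io_stream_out_uchar`, and that the implementation
pre-declares the corresponding `__khr_io_stream_out_uchar()` function
whenever the macro is defined; the `producer` kernel itself is a
hypothetical example:

[source,c]
----
// Forward a buffer to the device's output stream BiK, if available.
__kernel void producer(__global uchar *data, uint n) {
#ifdef cl_khr_bik_io_stream_out_uchar
  queue_t q = get_default_queue();
  // Enqueue the BiK as a single work-item child kernel; the driver is
  // free to map it to the hardware accelerator as it sees fit.
  enqueue_kernel(q, CLK_ENQUEUE_FLAGS_NO_WAIT, ndrange_1D(1),
                 ^{
                    size_t num = n; // in: bytes to write
                    __khr_io_stream_out_uchar(data, &num);
                    // out: num now holds the bytes actually written
                  });
#endif
}
----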