From a40b2f456f3feb192a54b72e35140c5699482e97 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Pekka=20J=C3=A4=C3=A4skel=C3=A4inen?=
Date: Tue, 13 Dec 2022 13:47:12 +0200
Subject: [PATCH] cl_khr_defined_builtin_kernels

First WiP draft of a defined BiKs extension.
---
 ext/cl_khr_defined_builtin_kernels.asciidoc |  330 +++++
 ext/cl_khr_defined_builtin_kernels.html     | 1288 +++++++++++++++++++
 2 files changed, 1618 insertions(+)
 create mode 100644 ext/cl_khr_defined_builtin_kernels.asciidoc
 create mode 100644 ext/cl_khr_defined_builtin_kernels.html

diff --git a/ext/cl_khr_defined_builtin_kernels.asciidoc b/ext/cl_khr_defined_builtin_kernels.asciidoc
new file mode 100644
index 00000000..a503b5a2
--- /dev/null
+++ b/ext/cl_khr_defined_builtin_kernels.asciidoc
@@ -0,0 +1,330 @@
// Copyright 2018-2022 The Khronos Group. This work is licensed under a
// Creative Commons Attribution 4.0 International License; see
// http://creativecommons.org/licenses/by/4.0/
= cl_khr_defined_builtin_kernels

:source-highlighter: coderay

[[cl_khr_defined_builtin_kernels]]
== Khronos-Defined Built-in Kernels (Early Draft)

The purpose of this extension is to provide a standardized set of built-in
kernels with well-defined semantics, useful for accelerating applications
from various domains. The extension specification is designed to be a
"living" document that expands rapidly through the addition of new
well-defined built-in kernel definitions and updates to previously defined
ones.

=== General Information

==== Name Strings

`cl_khr_defined_builtin_kernels`

==== Version History

[cols="1,1,3",options="header",]
|====
| *Date* | *Version* | *Description*
| 2022-12-13 | 0.1.0 | First formulation as an extension specification, as proposed by Ben Ashbaugh.
|====

==== Dependencies

This extension is written against the OpenCL Specification version 3.0.12.

This extension requires OpenCL 1.2 or later.

==== Contributors

Pekka Jääskeläinen, Intel and Tampere University.

Topi Leppänen, Tampere University.

Jan Solanti, Tampere University.

Ben Ashbaugh, Intel.


=== Overview

OpenCL 1.2 specifies a built-in kernel (BiK) as a kernel that is executed on
an OpenCL device or custom device by fixed-function hardware or in firmware.
Applications can query the built-in kernels supported by a device or custom
device.

BiKs are referred to by name (a C string) without any semantics attached to
the functionality. The semantics behind a name are completely
device-specific and are typically documented in vendor-specific extension
specifications.

The goal of this extension is to lower the bar for utilizing
hardware-accelerated functions in drivers by providing a library of
well-defined BiKs that covers common acceleration needs and is designed to
evolve easily over time.

The device drivers that implement this extension can freely choose which
subset of the defined BiKs they implement and advertise to clients. Clients
can use the BiKs to accelerate their applications by invoking them manually.
The extension is also designed to support automated task-graph lowering
tooling later.

==== Background

ASIC-based coarse-grained hardware accelerators are specialized logic meant
to speed up the execution of workloads of interest, or to provide
improvements in energy efficiency.
Examples of contemporary workloads that benefit from hardware acceleration
over software-based implementations include video coding, deep learning,
cryptography, software-defined radio and graphics rendering.

FPGAs form a special case somewhere between instruction-set architectures
and fixed-function hardware accelerators. While advances in high-level
synthesis tools have attempted to bridge the programmability gap between GPU
and FPGA programming, FPGAs are still considered devices on which efficient
implementations are challenging to achieve. Due to the extensive manual
optimization work required to implement the accelerated functionality
efficiently, defining FPGA designs as a system of "hardware accelerator IPs"
is still a widely used application abstraction. FPGAs can thus be seen as a
platform that can realize and integrate any hardware accelerator
implementable with the programmable fabric.

The means to utilize hardware accelerators have typically been
vendor-specific and abstracted behind domain-specific libraries. The
overhead of this "bunch of libraries" approach shows at the lowest level of
integration: each library utilizes a low-level (typically vendor-specific)
library to interface with the actual hardware, and thus does not integrate
efficiently with other libraries or with software-programmable processors
that might be available on the same chip.

==== Rationale

OpenCL's built-in kernel abstraction allows pushing both hardware-accelerated
and software-defined kernels to the same command-queues, providing a
powerful means for asynchronous execution of heterogeneous task graphs on
diverse heterogeneous platforms. The ability to invoke hardware accelerators
while being able to synchronize and optimize data transfers at the lowest
levels of the driver stack can provide significant latency benefits,
especially when combined with the command-buffering mechanism.

However, the BiK abstraction works well only when it is widely adopted by
vendors, and when multiple vendors implement the same definitions. Otherwise
each vendor specifies and implements its own BiKs closely matching its own
hardware accelerator properties, resulting in a lack of cross-vendor
portability in the API abstraction presented to the upper layers of
heterogeneous computing software stacks.

This extension standardizes a set of well-defined BiKs that clients can call
from higher-level programming stacks built with different languages and
multiple libraries, possibly mixing accelerator calls with software kernel
commands, and relying on the driver stack to optimize the execution
(especially the synchronization and communication) as a low-level
heterogeneous task graph. It aims to promote the use of BiKs as a
programming model for hardware-accelerated functionality, to improve the
cross-vendor portability of hardware-accelerated computing.

=== Modifications to section 4.2 of the OpenCL API Specification

Modify *Table 5*, _Device Queries_, of section 4.2, by adding the following
sentence to the description cell of `CL_DEVICE_BUILT_IN_KERNELS`:

[quote]
The semantics of the returned built-in kernels are undefined or defined in
vendor-specific documentation, unless the name starts with the prefix
`khr_`, in which case it is a built-in kernel with semantics defined in
Appendix I.
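
For illustration only (informative; not part of the proposed specification
text), a host program might detect a defined BiK and instantiate it with the
existing OpenCL 1.2 built-in kernel APIs roughly as follows. Error handling
is mostly elided, and `khr_blas_gemm_float` is one of the draft example
entries from Appendix I below:

[source,c]
----
#include <stdlib.h>
#include <string.h>
#include <CL/cl.h>

/* Returns a kernel object for the khr_blas_gemm_float BiK, or NULL if the
   device does not advertise it. */
cl_kernel create_gemm_bik(cl_context ctx, cl_device_id dev)
{
  size_t size = 0;
  cl_int err = clGetDeviceInfo(dev, CL_DEVICE_BUILT_IN_KERNELS, 0, NULL,
                               &size);
  if (err != CL_SUCCESS || size == 0)
    return NULL;

  char *names = (char *)malloc(size);
  clGetDeviceInfo(dev, CL_DEVICE_BUILT_IN_KERNELS, size, names, NULL);

  /* The query returns a semicolon-separated list of kernel names; a robust
     client would tokenize on ';' instead of using a substring search. */
  cl_kernel kernel = NULL;
  if (strstr(names, "khr_blas_gemm_float") != NULL) {
    cl_program prog = clCreateProgramWithBuiltInKernels(
        ctx, 1, &dev, "khr_blas_gemm_float", &err);
    if (err == CL_SUCCESS)
      kernel = clCreateKernel(prog, "khr_blas_gemm_float", &err);
  }
  free(names);
  return kernel;
}
----

The returned kernel would then be given its arguments with `clSetKernelArg`
in the order listed in Appendix I and enqueued like any other kernel, for
example with `clEnqueueNDRangeKernel`.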

=== Add new appendix "Appendix I - Defined Built-in Kernels" to the OpenCL API Specification

This appendix describes standard built-in kernels (BiKs) with well-defined
semantics. A conformant device can report support for zero or more of the
built-in kernels via the `CL_DEVICE_BUILT_IN_KERNELS` or
`CL_DEVICE_BUILT_IN_KERNELS_WITH_VERSION` device queries.

The general client-side abstraction of a defined built-in kernel is similar
to a call to a C function whose implementation is hidden. The device driver
can invoke one or more physical hardware accelerators combined with firmware
to implement the semantics as efficiently as possible.

It is the driver's responsibility to handle efficient synchronization and
communication with the hardware accelerator, internal accelerator state
management, and resource sharing across multiple OpenCL contexts.

==== Standard Built-in Kernels ====

The following list of recognized built-ins is organized according to
application domain and processed data types. It is expected to grow and be
updated while preserving backwards compatibility.

[caption="Table A.I.1. "]
.Standard Built-in Kernels and Their Semantics. *The table has been populated with a small set of non-trivial example entries which are subject to change and the list to expand during drafting.*
[cols="1,3,2,2"]
|===
4+| *General linear algebra*
// https://netlib.org/blas/blasqr.pdf
| Name | Description | NDRange Dimensions | Arguments
| *khr_blas_gemm_float*
| xGEMM: General matrix multiplication with real single-precision floating-point numbers, as described in the Basic Linear Algebra Subprograms. Performs C = alpha * trans(A) * trans(B) + beta * C, where A, B and C are matrices, and alpha and beta are scalars. trans() is a configurable transpose operation.
a|
[start=1]
. The height.
. The width.
a|
[start=0]
. int: transpose operation (trans) type for matrix A (0 = none, 1 = transpose, 2 = conjugate transpose)
. int: transpose type for matrix B (0 = none, 1 = transpose, 2 = conjugate transpose)
. float: scalar (alpha) to multiply the matrix multiplication result elements with
. float* (input): matrix A
. int: leading dimension of A (0 = row-major, 1 = column-major)
. float* (input): matrix B
. int: leading dimension of B (0 = row-major, 1 = column-major)
. float: scalar (beta) to multiply the C matrix elements with before adding them to the result
. float* (input & output): matrix C, which is added to the matrix multiplication result and stores the output
. int: leading dimension of C (0 = row-major, 1 = column-major)
4+| OpenCL C Semantics
4+a|
[source,c]
----
__kernel void __khr_blas_gemm_float(
    int transA, int transB, float alpha, const global float *A, int ldA,
    const global float *B, int ldB,
    float beta, global float *C, int ldC) {
  // TBD: An example implementation that can be used for verification
  // and as a fallback SW implementation.
}
----

4+| *OpenVX Neural Network Extension Compatible Kernels*
// Copied from https://registry.khronos.org/OpenVX/extensions/vx_khr_nn/1.2/html/d6/d9a/group__group__cnn.html#ga69764625f436c14d739fc467515c1584
| Name | Description | NDRange Dimensions | Arguments
| *khr_openvx_nn_extension_convolution_uchar*
| Convolution for 8-bit unsigned integer inputs and weights.
a|
[start=1]
. Batch size.
. Width.
. Height.
a|
[start=0]
. uchar* [in]: The input tensor data. The three lower dimensions represent a single input; all following dimensions represent the number of batches, possibly nested.
The dimension order is [width, height, #IFM, #batches].
. uchar* [in]: Weights, as a 4D tensor with dimensions [kernel_x, kernel_y, #IFM, #OFM].
. uchar* [in]: Biases (optional, ignored if NULL). The biases may be shared (one per OFM) or unshared (one per OFM * output location). The possible layouts are either [#OFM] or [width, height, #OFM]. The biases' data type must match the data type of the inputs. (Kernel parameter #2)
. size_t: (dilation_x) "Inflate" the kernel by inserting zeros between the kernel elements in the x direction. The value is the number of zeros to insert.
. size_t: (dilation_y) "Inflate" the kernel by inserting zeros between the kernel elements in the y direction. The value is the number of zeros to insert.
. int: Rounding method for calculating the output dimensions.
. int: A VX_TYPE_ENUM of the vx_convert_policy_e enumeration.
. size_t: Number of elements padded at each side in the x dimension of the input.
. size_t: Number of elements padded at each side in the y dimension of the input.
. int: A VX_TYPE_ENUM of the vx_round_policy_e enumeration.
. uchar* [out]: The output tensor data. The output will have the same number and structure of dimensions as the input. The output tensor data type must be the same as the inputs'. (Kernel parameter #4)

4+| OpenCL C Semantics
4+a|
[source,c]
----
__kernel void __khr_openvx_nn_extension_convolution_uchar(
    const uchar *input, const uchar *weights, const uchar *biases,
    size_t dilation_x, size_t dilation_y,
    int down_scale_rounding, int overflow_policy, size_t padding_x,
    size_t padding_y, int rounding_policy, uchar *output) {
  // TBD.
}
----

4+| *Direct Input/Output Operations*
4+| Kernels for accessing data sources and destinations directly without host involvement.
| Name | Description | NDRange Dimensions | Arguments
| *khr_io_stream_in_uchar*
| Non-blocking read of data from a sensor/stream associated with the device.
a| -
a|
[start=0]
. uchar* [out]: The data.
. size_t* [in+out]: In: the number of bytes to read. Out: the number of bytes that could be read (can be 0). (Compatible with the `cl_pocl_content_size` extension for optimizing data transfers.)

4+| OpenCL C Semantics
4+a|
[source,c]
----
__kernel void __khr_io_stream_in_uchar(
    uchar *output, size_t *num) {
  // It is not feasible to describe this kernel in OpenCL C as I/O devices
  // are not representable with it.
}
----

| *khr_io_stream_out_uchar*
| Non-blocking write of data to an output/sink associated with the device.
a| -
a|
[start=0]
. uchar* [in]: The data to write.
. size_t* [in+out]: In: the number of bytes to write. Out: the number of bytes that could be written (can be 0).
4+| OpenCL C Semantics
4+a|
[source,c]
----
__kernel void __khr_io_stream_out_uchar(
    uchar *input, size_t *num) {
  // It is not feasible to describe this kernel in OpenCL C as I/O devices
  // are not representable with it.
}
----

| *khr_io_stream_in_blocking_uchar*
| Blocking read of data from a sensor/stream associated with the device.
a| -
a|
[start=0]
. uchar* [out]: The data.
. size_t* [in]: The number of bytes to read before returning.

4+| OpenCL C Semantics
4+a|
[source,c]
----
__kernel void __khr_io_stream_in_blocking_uchar(uchar *output, size_t *num) {
  size_t remaining = *num;
  while (remaining) {
    size_t num_read = remaining;
    __khr_io_stream_in_uchar(output, &num_read);
    remaining -= num_read;
    output += num_read;
  }
}
----

|===

==== Launching BiKs from the Device Side ====

BiKs are primarily meant to be launched as kernel commands via host-side
command queues. Optionally, they can be callable from the device side via
`enqueue_kernel`. This capability can be queried on a per-BiK basis at
compile time in OpenCL C by checking for macro definitions with the
following naming convention: `cl_khr_bik_BUILTIN_KERNEL_NAME`. When a BiK
macro is defined, a kernel following the naming convention
`__khr_BUILTIN_KERNEL_NAME()` can be enqueued by the program on the device
side like a software-defined kernel. An informative sketch of such a launch
is given after the open questions below.


=== Open questions

. Should we enable launching BiKs from the device side without requiring device-side enqueue? The main problem is BiKs with an NDRange, as they are not simple single-WI helper functions.
+
--
*UNRESOLVED*

--

. Should the NDRange be used at all in BiKs? It feels somewhat unnatural, as the NDRange is typically used to imply SPMD parallelism while the hardware/firmware is free to choose whatever degree of parallelism to implement the function with. On the other hand, something similar applies to software kernel launches, as the work-items can be executed serially when adhering to barrier semantics.
+
--
*UNRESOLVED*

--

. Different accelerators prefer different channel orders (NHWC vs. NCHW, ...) for the processed data. Should the channel order be passed as a BiK argument (like the example GEMM's row/column order), or is it better to have different BiK variations for each?
+
--
*UNRESOLVED*

--

. How should preference be denoted? Some BiKs are more efficient on a given device because they map more naturally to the underlying HW accelerator, but the slower variations (for example, with a suboptimal channel order in NN accelerators) might still be beneficially accelerated.
+
--
*UNRESOLVED*

--

. Since the defined built-in kernel concept is basically a C-like API inside another API, should it be made more generic and thus directly usable for SYCL and Vulkan as well?
+
--
*UNRESOLVED*

--
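
For illustration only (informative), a device-side launch might look as
follows. The sketch assumes OpenCL C 2.0 device-side enqueue, a device that
defines `cl_khr_bik_io_stream_out_uchar`, and that the implementation
pre-declares the corresponding `__khr_io_stream_out_uchar()` function
whenever the macro is defined; the `producer` kernel itself is a
hypothetical example:

[source,c]
----
// Forward a buffer to the device's output stream BiK, if available.
__kernel void producer(__global uchar *data, uint n) {
#ifdef cl_khr_bik_io_stream_out_uchar
  queue_t q = get_default_queue();
  // Enqueue the BiK as a single work-item child kernel; the driver is
  // free to map it to the hardware accelerator as it sees fit.
  enqueue_kernel(q, CLK_ENQUEUE_FLAGS_NO_WAIT, ndrange_1D(1),
                 ^{
                    size_t num = n; // in: bytes to write
                    __khr_io_stream_out_uchar(data, &num);
                    // out: num now holds the bytes actually written
                  });
#endif
}
----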