
Commit b36c134

[GPU] Fix oneDNN FP16 convolution format selection for channel expansion operations (#33131)
### Details: - When FP16 dynamic convolution has small input channels (≤4) and large output channels (e.g., 1024), the current format selection logic chooses `bfyx → fsv16`, which triggers oneDNN reference kernel instead of optimized JIT kernel, resulting in significant performance degradation. - Override output format to planar (bfyx) when input channels are small (≤ 16), and output channels are large (≥ 32) **Current behavior:** - Input: 3 channels → Converted to `bfyx` - Output: 1024 channels → Remains `fsv16` (only changed when output ≤ 4) - Result: `bfyx → fsv16` combination uses **reference kernel** (slow) #### Root Cause The fsv16 blocked format is optimized for reading many channels but introduces overhead when used for writing outputs in channel-expansion scenarios (small input → large output). oneDNN's reference kernel is selected because: 1. **Inefficient write pattern**: fsv16 output requires interleaved writes every 16 elements (non-contiguous) 2. **No optimized implementation**: oneDNN doesn't provide JIT-optimized kernel for fsv16 output generation from small input channels 3. **Scatter write overhead**: Writing 1024 channels in fsv16 format requires complex block-strided access ### Tickets: - [CVS-177671](https://jira.devtools.intel.com/browse/CVS-177671) Signed-off-by: Andrew Park <[email protected]>
1 parent 0e34cb4 commit b36c134

File tree

1 file changed: +21 −3 lines changed


src/plugins/intel_gpu/src/graph/layout_optimizer.cpp

Lines changed: 21 additions & 3 deletions
@@ -1028,12 +1028,30 @@ void layout_optimizer::set_onednn_dyn_conv_preferred_format(convolution_node& no
     node.set_preferred_input_fmt(0, get_fsv16_format(rank));
     node.set_preferred_output_fmt(0, get_fsv16_format(rank));
 
-    // Override with default format for small channels (≤ 4)
-    if (input_channels > 0 && input_channels <= 4) {
+    // Override input for small channels (≤ 16)
+    // fsv16 format uses 16-element blocks. channels ≤ 16 waste block padding
+    // e.g. 8ch uses only 8/16 elements per block (50% waste), planar format is more efficient
+    if (input_channels > 0 && input_channels <= 16) {
         node.set_preferred_input_fmt(0, format::get_default_format(rank));
     }
 
-    if (output_channels > 0 && output_channels <= 4) {
+    // Override output for small channels (≤ 16)
+    // same as input - avoid fsv16 block padding overhead for small channel counts
+    if (output_channels > 0 && output_channels <= 16) {
+        node.set_preferred_output_fmt(0, format::get_default_format(rank));
+    }
+
+    // Override output for channel expansion operations (small input → large output)
+    // when expanding from small input channels (≤16) to large output channels (≥32),
+    // planar output format enables OneDNN to select optimized JIT kernel instead of reference kernel
+    // Thresholds explained:
+    // - input ≤ 16: matches fsv16 block size, input side uses planar format (set above)
+    // - output ≥ 32: 2 or more fsv16 blocks (32/16=2), where blocked write overhead exceeds
+    //   sequential write benefits. planar format provides better cache locality
+    //   and memory access patterns for large channel generation
+    // e.g. 3ch → 1024ch would create 64 fsv16 blocks with scattered writes,
+    //   but planar format allows efficient sequential writes
+    if (input_channels > 0 && input_channels <= 16 && output_channels >= 32) {
         node.set_preferred_output_fmt(0, format::get_default_format(rank));
     }
 }
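Read together, the three overrides amount to the selection rule below. This is only a condensed restatement of the diff above; the free-standing helper is hypothetical and not part of the patch.

```cpp
// Condensed restatement of the preferred-format overrides added above:
// true means the planar (default) format replaces fsv16 on that side.
struct conv_format_override {
    bool planar_input;
    bool planar_output;
};

inline conv_format_override pick_overrides(int input_channels, int output_channels) {
    constexpr int fsv_block = 16;  // fsv16 block size
    const bool small_in  = input_channels > 0 && input_channels <= fsv_block;
    const bool small_out = output_channels > 0 && output_channels <= fsv_block;
    const bool expansion = small_in && output_channels >= 2 * fsv_block;  // e.g. 3 -> 1024
    return {small_in, small_out || expansion};
}
```

For the shapes in the ticket (3 input channels, 1024 output channels) this yields planar input and planar output, i.e. `bfyx → bfyx`, instead of the previous `bfyx → fsv16` combination that fell back to the reference kernel.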
