This repository contains the analysis and progressive optimisation of a generic matrix multiplication algorithm (GEMM) for large matrices (
The repository develops the algorithm through various stages:
GEMM_O2.c: Pure scalar code.GEMM_O3.c: code vectorised by the compiler using the-O3flag.GEMM_vaddmul.c: First vector approach usingvle32.v,vfmul.vf,vfadd.vv.GEMM_vmacc.c: Latency reduction by replacing the Mul/Add pair with the combined instructionvfmacc.vf.GEMM_tiled.c: Implementation of Loop Nest Blocking (Tiling) by dividing the spatial problem into submatrices that fit into the L1 cache, eliminating stores with strides to RAM.GEMM_unroll.c: Enabling ILP (Instruction Level Parallelism) with the implementation of loop unrolling with depth 4 and interleaving to enable OOO (Out-of-Order) execution, utilising the VPU pipeline and mitigating RAW risks.GEMM_omp.c: multithreading application using OpenMP, to make the most of the SBC’s 8 physical cores.
The experiment was carried out on a Banana BPI-F3 SBC (with a RISC-V architecture processor), measuring the pure computation time required to calculate the resulting matrix
perf stat -e cycles,instructions,branches,branch-misses,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses ./P3GEMM_version 2000 3000 100| Metric | GEMM_O2 (Base) |
GEMM_O3 |
GEMM_vaddmul |
GEMM_vmacc |
GEMM_tiled |
GEMM_unrolled_tiled |
GEMM_omp |
|---|---|---|---|---|---|---|---|
| Computing Time | 2.881 s | 0.812 s | 0.776 s | 0.708 s | 0.319 s | 0.211 s | 0.043 s |
| Speedup | 1.0x | 3.75x | 3.69x | 4.06x | 8.99x | 15.02x | 66.58x |
CPU Cycles (cycles) |
5.484 M | 2.177 M | 2.138 M | 2.010 M | 1.389 M | 1.234 M | 1.436 M |
| Instructions | 5.412 M | 1.562 M | 793 M | 782 M | 675 M | 674 M | 741 M |
IPC (insn per cycle) |
0.99 | 0.72 | 0.37 | 0.39 | 0.49 | 0.55 | 0.52 |
L1 Loads (loads) |
1.949 M | 466 M | 574 M | 574 M | 361 M | 223 M | 231 M |
| L1 Load Misses | 1.955.513 | 1.877.319 | 2.044.323 | 2.036.231 | 2.479.215 | 2.415.152 | 2.300.855 |
L1 Stores (stores) |
689 M | 244 M | 280 M | 280 M | 91 M | 91 M | 91 M |
| L1 Store Misses | 568 K | 553 K | 550 K | 550 K | 546 K | 543 K | 549 M |
Key takeaway: Maximum optimisation is not achieved simply by inserting vector instructions, but by managing how data flows between RAM, L1 cache blocks and the silicon’s vector ALUs and then enabling multithreading.
- Instruction reduction: Manual vector optimisation (_vaddmul and _vmacc) and block-based optimisation (_tiled) significantly reduce the total number of instructions compared to automatic vectorisation _O3. This is because the compiler is very conservative in its application, and uses an LMUL of 1 to avoid excessive pressure on registers. This can be examined in the assembly code generated by the compiler:
90 0082 D7F7060D vsetvli a5,a3,e32,m1,ta,ma # LMUL=1
91 .loc 1 20 8 is_stmt 1
92 .loc 1 20 22 is_stmt 0
93 0086 87600502 vle32.v v1,0(a0)
94 008a 93952700 slli a1,a5,2
95 .loc 1 20 40
96 008e 07610602 vle32.v v2,0(a2)
19:P3GEMM.c **** for (j = 0; j < m; j++)
97 .loc 1 19 24 discriminator 1
98 0092 9D8E sub a3,a3,a5
99 0094 2E95 add a0,a0,a1
100 0096 2E96 add a2,a2,a1
101 .loc 1 20 26
102 0098 D79021B2 vfmacc.vv v1,v3,v2 #Uses vmacc
103 .loc 1 20 16
104 009c A7600702 vse32.v v1,0(a4)- The misleading IPC of _O2: Although _O2 has the highest IPC (0.99), it is inefficient; the processor runs fast but only executes scalar instructions.
- Efficient operation fusion: The _vmacc version reduces the number of instructions compared to _vaddmul (from 793M to 782M), demonstrating that the multiply and accumulate operations are fused into a single CPU cycle.
- Loop Nest Blocking (Tiling): the tiled version achieves an 9x speedup reduction (down from 2.881 s to 0.319 s) compared to the base _O2 version. This performance. This performance improvement can only be achieved by implementing an efficient load-and-store strategy.
- Hardware-Saturating Instruction Interleaving: Implementing a row-wise loop unrolling factor of 4 effectively hidden FMA (Fused Multiply-Add) latency, maximizing pipeline throughput and driving instruction per cycle (IPC) efficiency up to 0.55.
- Store Loads traffic: _tiled reduces L1 cache writes from 689 million to 91 million (a reduction of 86.7%), confirming that partial sums are retained in the registers before being written to memory.
- Data bus independence: _tiled reduces L1 reads (loads) to one-fifth of the base version, preventing the CPU from suffering from data starvation and raising its actual IPC to 0.49.
- Consistency of mandatory failures: Write failures (store-misses) remain constant (~550 K) across all versions because they correspond to the mandatory initialisation of the array.
A strong scaling experiment was conducted on the Banana Pi board to evaluate the parallel efficiency of the 2D Register-Blocked GEMM kernel. The matrix sizes for the benchmark were set to
The table below summarizes the execution time, speedup factor, and parallel efficiency achieved as the thread count scales from 1 to 8 cores, with OpenMP thread affinity fully enabled (OMP_PLACES=cores, OMP_PROC_BIND=close).
| Threads ( |
Compute Time (s) | Measured Speedup ( |
Ideal Speedup | Parallel Efficiency ( |
|---|---|---|---|---|
| 1 | 0.199959 | 1.00x | 1.00x | 100.0% |
| 2 | 0.109174 | 1.83x | 2.00x | 91.5% |
| 3 | 0.081090 | 2.47x | 3.00x | 82.3% |
| 4 | 0.065696 | 3.04x | 4.00x | 76.0% |
| 5 | 0.057543 | 3.47x | 5.00x | 69.4% |
| 6 | 0.051422 | 3.89x | 6.00x | 64.8% |
| 7 | 0.048268 | 4.14x | 7.00x | 59.1% |
| 8 | 0.043392 | 4.61x | 8.00x | 57.6% |
-
High Initial Efficiency (1 to 4 Cores): The application scales exceptionally well up to 4 threads, maintaining a parallel efficiency of 76%. This confirms that the OpenMP work-sharing directive
#pragma omp parallel for collapse(2)successfully achieves an optimal load balance when distributing the matrix tiles across the available processors. -
Memory Bandwidth Bottleneck (5 to 8 Cores): Beyond 4 threads, the speedup curve experiences a standard sub-linear saturation, reaching a final acceleration of 4.61x at 8 threads. This behavior is a textbook example of a memory-bound limitation in Dense Linear Algebra. As more processing cores concurrently issue vectorized vector-register loads (
__riscv_vle32), the shared L2/L3 cache system and the main RAM bus bandwidth become saturated, introducing slight memory stalls. -
Thread Affinity Impact: Forcing tight hardware constraints prevents the Linux kernel scheduler from migrating threads across different physical cores. This guarantees strict cache-locality for the
$64 \times 64$ tiles, eliminating cache thrashing and providing highly reproducible compute timings. -
Numerical Consistency: The output matrix
$C$ yields identical floating-point precision data across all thread configurations, verifying the mathematical correctness of the loop tail cleaning logic and ensuring that the implementation remains entirely race-condition free.
The Banana Pi features a 32 KB L1 data cache and operates with 32-bit floating-point precision (4 bytes per element). An empirical study was carried out by varying the tile size to find the hardware’s optimal saturation point. Here are the results for the tiled version without unrolling :
| Tile Size | Computation Time | L1 Loads | L1 Misses | % Misses | L1 Cache Status |
|---|---|---|---|---|---|
| 32 x 32 | 0.712 s | 588,147,925 | 2,267,033 | 0.39% | Underutilised |
| 50 x 50 | 0.310 s | 368,267,051 | 2,762,719 | 0.75% | Optimal net balance |
| 64 x 64 | 0.320 s | 361,152,490 | 2,463,928 | 0.68% | Alignment optimum |
| 128 x 128 | 0.452 s | 359,436,762 | 8,375,178 | 2.33% | Saturation and Overflow |
To compile natively in the RISC-V environment using vector support (requires gcc with support for RVV 1.0):
# Compile all versions
make
# Run by entering dimensions: N P M (N x P) (P x M)
./python3 P3GEMM_su 2000 3000 100-
Implement the same algorithm in C++ to measure the impact of zero-cost abstractions on the code.
-
Use GEMM to implement a 2D convolution using the im2col algorithm.
