GitHub - AlexanderGSC/rvv-gemm: This repository contains the analysis and progressive optimisation of a generic matrix multiplication algorithm (GEMM) for large matrices running natively on a 64-bit RISC-V architecture utilising the RVV 1.0 vector extension and advanced cache locality techniques (Tiling).

GEMM Optimization with RISC-V RVV 1.0

This repository contains the analysis and progressive optimisation of a generic matrix multiplication algorithm (GEMM) for large matrices ($2000 \times 3000 \times 100$), running natively on a 64-bit RISC-V architecture utilising the RVV 1.0 vector extension and advanced cache locality techniques (Tiling), ILP, Out Of Order execution and multithreading via openMP. All improvements are cumulative:

The repository develops the algorithm through various stages:

GEMM_O2.c: Pure scalar code.
GEMM_O3.c: code vectorised by the compiler using the -O3 flag.
GEMM_vaddmul.c: First vector approach using vle32.v, vfmul.vf, vfadd.vv.
GEMM_vmacc.c: Latency reduction by replacing the Mul/Add pair with the combined instruction vfmacc.vf.
GEMM_tiled.c: Implementation of Loop Nest Blocking (Tiling) by dividing the spatial problem into submatrices that fit into the L1 cache, eliminating stores with strides to RAM.
GEMM_unroll.c: Enabling ILP (Instruction Level Parallelism) with the implementation of loop unrolling with depth 4 and interleaving to enable OOO (Out-of-Order) execution, utilising the VPU pipeline and mitigating RAW risks.
GEMM_omp.c: multithreading application using OpenMP, to make the most of the SBC’s 8 physical cores.

Benchmark Results

The experiment was carried out on a Banana BPI-F3 SBC (with a RISC-V architecture processor), measuring the pure computation time required to calculate the resulting matrix $C$:

perf stat -e cycles,instructions,branches,branch-misses,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses ./P3GEMM_version 2000 3000 100

Metric	`GEMM_O2` (Base)	`GEMM_O3`	`GEMM_vaddmul`	`GEMM_vmacc`	`GEMM_tiled`	`GEMM_unrolled_tiled`	`GEMM_omp`
Computing Time	2.881 s	0.812 s	0.776 s	0.708 s	0.319 s	0.211 s	0.043 s
Speedup	1.0x	3.75x	3.69x	4.06x	8.99x	15.02x	66.58x
CPU Cycles (`cycles`)	5.484 M	2.177 M	2.138 M	2.010 M	1.389 M	1.234 M	1.436 M
Instructions	5.412 M	1.562 M	793 M	782 M	675 M	674 M	741 M
IPC (`insn per cycle`)	0.99	0.72	0.37	0.39	0.49	0.55	0.52
L1 Loads (`loads`)	1.949 M	466 M	574 M	574 M	361 M	223 M	231 M
L1 Load Misses	1.955.513	1.877.319	2.044.323	2.036.231	2.479.215	2.415.152	2.300.855
L1 Stores (`stores`)	689 M	244 M	280 M	280 M	91 M	91 M	91 M
L1 Store Misses	568 K	553 K	550 K	550 K	546 K	543 K	549 M

Key takeaway: Maximum optimisation is not achieved simply by inserting vector instructions, but by managing how data flows between RAM, L1 cache blocks and the silicon’s vector ALUs and then enabling multithreading.

Instruction reduction: Manual vector optimisation (_vaddmul and _vmacc) and block-based optimisation (_tiled) significantly reduce the total number of instructions compared to automatic vectorisation _O3. This is because the compiler is very conservative in its application, and uses an LMUL of 1 to avoid excessive pressure on registers. This can be examined in the assembly code generated by the compiler:

  90 0082 D7F7060D 		vsetvli	a5,a3,e32,m1,ta,ma                 # LMUL=1
  91              		.loc 1 20 8 is_stmt 1
  92              		.loc 1 20 22 is_stmt 0
  93 0086 87600502 		vle32.v	v1,0(a0)
  94 008a 93952700 		slli	a1,a5,2
  95              		.loc 1 20 40
  96 008e 07610602 		vle32.v	v2,0(a2)
  19:P3GEMM.c      ****         	for (j = 0; j < m; j++) 
  97              		.loc 1 19 24 discriminator 1
  98 0092 9D8E     		sub	a3,a3,a5
  99 0094 2E95     		add	a0,a0,a1
 100 0096 2E96     		add	a2,a2,a1
 101              		.loc 1 20 26
 102 0098 D79021B2 		vfmacc.vv	v1,v3,v2                       #Uses vmacc
 103              		.loc 1 20 16
 104 009c A7600702 		vse32.v	v1,0(a4)

The misleading IPC of _O2: Although _O2 has the highest IPC (0.99), it is inefficient; the processor runs fast but only executes scalar instructions.
Efficient operation fusion: The _vmacc version reduces the number of instructions compared to _vaddmul (from 793M to 782M), demonstrating that the multiply and accumulate operations are fused into a single CPU cycle.
Loop Nest Blocking (Tiling): the tiled version achieves an 9x speedup reduction (down from 2.881 s to 0.319 s) compared to the base _O2 version. This performance. This performance improvement can only be achieved by implementing an efficient load-and-store strategy.
Hardware-Saturating Instruction Interleaving: Implementing a row-wise loop unrolling factor of 4 effectively hidden FMA (Fused Multiply-Add) latency, maximizing pipeline throughput and driving instruction per cycle (IPC) efficiency up to 0.55.
Store Loads traffic: _tiled reduces L1 cache writes from 689 million to 91 million (a reduction of 86.7%), confirming that partial sums are retained in the registers before being written to memory.
Data bus independence: _tiled reduces L1 reads (loads) to one-fifth of the base version, preventing the CPU from suffering from data starvation and raising its actual IPC to 0.49.
Consistency of mandatory failures: Write failures (store-misses) remain constant (~550 K) across all versions because they correspond to the mandatory initialisation of the array.

Strong Scaling

A strong scaling experiment was conducted on the Banana Pi board to evaluate the parallel efficiency of the 2D Register-Blocked GEMM kernel. The matrix sizes for the benchmark were set to $N=2000$, $M=3000$, and $K=100$, utilizing a fixed cache tile size of $64 \times 64$ elements.

Execution Results & Speedup Table

The table below summarizes the execution time, speedup factor, and parallel efficiency achieved as the thread count scales from 1 to 8 cores, with OpenMP thread affinity fully enabled (OMP_PLACES=cores, OMP_PROC_BIND=close).

Threads ($N$)	Compute Time (s)	Measured Speedup ($S$)	Ideal Speedup	Parallel Efficiency ($E$)
1	0.199959	1.00x	1.00x	100.0%
2	0.109174	1.83x	2.00x	91.5%
3	0.081090	2.47x	3.00x	82.3%
4	0.065696	3.04x	4.00x	76.0%
5	0.057543	3.47x	5.00x	69.4%
6	0.051422	3.89x	6.00x	64.8%
7	0.048268	4.14x	7.00x	59.1%
8	0.043392	4.61x	8.00x	57.6%

Key Insights & Performance Takeaways

High Initial Efficiency (1 to 4 Cores): The application scales exceptionally well up to 4 threads, maintaining a parallel efficiency of 76%. This confirms that the OpenMP work-sharing directive #pragma omp parallel for collapse(2) successfully achieves an optimal load balance when distributing the matrix tiles across the available processors.
Memory Bandwidth Bottleneck (5 to 8 Cores): Beyond 4 threads, the speedup curve experiences a standard sub-linear saturation, reaching a final acceleration of 4.61x at 8 threads. This behavior is a textbook example of a memory-bound limitation in Dense Linear Algebra. As more processing cores concurrently issue vectorized vector-register loads (__riscv_vle32), the shared L2/L3 cache system and the main RAM bus bandwidth become saturated, introducing slight memory stalls.
Thread Affinity Impact: Forcing tight hardware constraints prevents the Linux kernel scheduler from migrating threads across different physical cores. This guarantees strict cache-locality for the $64 \times 64$ tiles, eliminating cache thrashing and providing highly reproducible compute timings.
Numerical Consistency: The output matrix $C$ yields identical floating-point precision data across all thread configurations, verifying the mathematical correctness of the loop tail cleaning logic and ensuring that the implementation remains entirely race-condition free.

Tile Scaling Analysis and L1 Cache Behaviour

The Banana Pi features a 32 KB L1 data cache and operates with 32-bit floating-point precision (4 bytes per element). An empirical study was carried out by varying the tile size to find the hardware’s optimal saturation point. Here are the results for the tiled version without unrolling :

Tile Size	Computation Time	L1 Loads	L1 Misses	% Misses	L1 Cache Status
32 x 32	0.712 s	588,147,925	2,267,033	0.39%	Underutilised
50 x 50	0.310 s	368,267,051	2,762,719	0.75%	Optimal net balance
64 x 64	0.320 s	361,152,490	2,463,928	0.68%	Alignment optimum
128 x 128	0.452 s	359,436,762	8,375,178	2.33%	Saturation and Overflow

Compilation

To compile natively in the RISC-V environment using vector support (requires gcc with support for RVV 1.0):

# Compile all versions
make

# Run by entering dimensions: N P M (N x P) (P x M)
./python3 P3GEMM_su 2000 3000 100

Next Steps

Implement the same algorithm in C++ to measure the impact of zero-cost abstractions on the code.
Use GEMM to implement a 2D convolution using the im2col algorithm.

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
src		src
LICENSE		LICENSE
P3.md		P3.md
P3sinsu.png		P3sinsu.png
Readme.md		Readme.md
result.png		result.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GEMM Optimization with RISC-V RVV 1.0

Benchmark Results

Strong Scaling

Execution Results & Speedup Table

Key Insights & Performance Takeaways

Tile Scaling Analysis and L1 Cache Behaviour

Compilation

Next Steps

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

GEMM Optimization with RISC-V RVV 1.0

Benchmark Results

Strong Scaling

Execution Results & Speedup Table

Key Insights & Performance Takeaways

Tile Scaling Analysis and L1 Cache Behaviour

Compilation

Next Steps

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages