
Commit 1ed7fca — Completed Python program
Parent: 1c0b872
10 files changed: +283 −0 lines

.gitignore

+2
.idea
**/__pycache__

README.md

+44
# MatrixMultBenchmark

This project benchmarks the matrix multiplication algorithm across different hardware and language combinations.
For hardware there are three options: single-core, multi-core, and GPU.

Adjust the matrix size so each run takes ~10s.

For CPU testing, take a screenshot of HWiNFO when the benchmark is done.
For GPU testing, start logging when you choose the device to benchmark; stop logging when the benchmark is done.
Make sure there are at least 3 rows with 100% core utilization.

# Results
`Power` measures the clock speed and power consumption under certain workloads.
These are measured using the HWiNFO program.

Bandwidth is measured in Gbps. It serves as a metric of the data transfer rate: if the recorded bandwidth reaches the design limit, it bottlenecks performance and the run should be disqualified.

For CPU:
* Power refers to the "Core+SoC power".
* Single-core frequency refers to the frequency of the core that is under stress.

GPU benchmarks are more involved because data has to be copied to and from main memory.
I've decided to record the average and the max of several metrics.
Max shows the performance in a single event. For example, my dGPU clocks at 1.8-1.9GHz when it's doing actual computation.
Average shows the performance with all factors considered. For example, the dGPU clocks at just 300MHz when receiving data from main memory, and the average clock from the CPU sending data to the CPU receiving results is ~1.5GHz. This gives an estimate of the relative performance over the entire test.
* Main memory bandwidth DOES NOT mean video RAM bandwidth, but rather the rate at which the CPU pulls from/pushes to main memory.
* PCIe bandwidth: in this benchmark I only count uni-directional bandwidth. The equation is `Link speed * Encoding(128/130) * Lanes`.
* VRAM bandwidth: the equation is `Clock * Bus width * Pump rate`.
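As a quick sanity check, the two equations can be evaluated with the Gen3 x16 link and 128-bit GDDR6 figures from `result/legion-r7000-2020.md`. The 8 GT/s link speed is the PCIe Gen3 rate, and a pump rate of 8 is the value that reproduces the VRAM figure in the result file (both are my assumptions, not stated in this README):

```python
# PCIe: Link speed * Encoding(128/130) * Lanes
link_speed = 8.0   # GT/s per lane for PCIe Gen3 (assumption)
lanes = 16
pcie_gbps = link_speed * 128 / 130 * lanes
print(f'PCIe: {pcie_gbps:.0f} Gbps')   # ~126 Gbps, matching the result file

# VRAM: Clock * Bus width * Pump rate
clock = 1.5        # GHz, GDDR6 max clock from the result file
bus_width = 128    # bits
pump_rate = 8      # transfers per clock (assumption that reproduces the recorded figure)
vram_gbps = clock * bus_width * pump_rate
print(f'VRAM: {vram_gbps:.0f} Gbps')   # 1536 Gbps
```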
For dGPU:
* Power is the sum of the CPU "Core+SoC power" and the "GPU power". Avg only.
* PCIe transfer rate: much lower when idle. Avg & Max.
* VRAM bandwidth: much lower when idle. Avg & Max.

For iGPU:
* Power refers to the "CPU Package power". [See this](https://www.hwinfo.com/forum/threads/how-to-read-apu-power-consumption-properly.8206/)
* Bandwidth: the iGPU doesn't use PCIe (I think); it uses main memory instead, and the memory clock is constant.

# Tools
The `tool` folder contains tools for processing benchmark results.
* calc.xlsx: a spreadsheet that computes the average of several runs. The sample data is from my Python iGPU benchmark. Ignore the colored columns if you're just using it.
* bandwidth-calc.py: calculates PCIe/VRAM bandwidth.
* efficiency-calc.py: calculates computations per second and computations per joule.
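The efficiency calculation can be reproduced by hand. This sketch uses the single-core Python figures recorded in `result/legion-r7000-2020.md` (1200x100 matrices, 10.458s, 9.703W); note the script divides by 1000, so the reported values are in thousands:

```python
# Same arithmetic as efficiency-calc.py, applied to the single-core run.
n, m = 1200, 100                    # the two numbers given to the Matrices class
computation = n ** 2 * m            # 144,000,000 multiply-accumulates
t, power = 10.458, 9.703            # seconds, watts

comp_per_s = int(computation / t / 1000)            # reported in thousands
comp_per_j = int(computation / (power * t) / 1000)  # reported in thousands
print(comp_per_s, comp_per_j)       # 13769 1419, matching the result file
```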

gpu/opencl.cl

+11
__kernel void multiply(int n, int m, int p,
                       __global int *a, __global int *b, __global int *c)
{
    int t = get_global_id(0);   // one work-item per element of c (n*p total)
    int row_a = t / p;          // row in c (n == p in this benchmark, so t/n is equivalent)
    int col_b = t % p;          // column in c

    int sum = 0;                // accumulate locally so c need not be zero-initialized
    for (int i = 0; i < m; i++) {
        sum += a[row_a*m + i] * b[i*p + col_b];
    }
    c[t] = sum;
}
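The kernel's flattened indexing can be sanity-checked on the host. Below is a minimal NumPy port of the same per-element loop (the function name `multiply_host` is mine); it uses `t // p` for the row and `t % p` for the column, which in this benchmark coincides with dividing by `n` since n == p:

```python
import numpy as np

def multiply_host(a, b):
    """Host-side port of the kernel's per-work-item loop (row-major, flattened c)."""
    n, m = a.shape
    _, p = b.shape
    c = np.zeros(n * p, dtype=np.int32)
    for t in range(n * p):              # t plays the role of get_global_id(0)
        row_a, col_b = t // p, t % p
        for i in range(m):
            c[t] += a[row_a, i] * b[i, col_b]
    return c.reshape(n, p)

a = np.arange(6, dtype=np.int32).reshape(2, 3)
b = np.arange(6, dtype=np.int32).reshape(3, 2)
assert np.array_equal(multiply_host(a, b), a @ b)  # agrees with NumPy's matmul
```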

lang/py3/benchmark.py

+114
from multiprocessing import Pool
import pyopencl as ocl
import pyopencl.array
import numpy as np
import time
from random import randint

with open('../../gpu/opencl.cl') as f:
    cl_prg = f.read()


class Matrix:

    def __init__(self):
        self.data = [[]]
        self.rows = 0
        self.cols = 0

    def print(self):
        for row in self.data:
            print(' '.join(str(c) for c in row) + ';\n')
        print(f'{self.rows}x{self.cols}')

    @staticmethod
    def __gen__(col):
        return [randint(0, 1000) for _ in range(col)]

    def resize(self, row, col):
        if self.rows != row or self.cols != col:
            with Pool() as pool:
                self.data = pool.map(self.__gen__, [col] * row)
            self.rows = row
            self.cols = col


class Matrices:

    def __init__(self):
        self.a = Matrix()
        self.b = Matrix()

    def resize(self, n, m):  # renamed from `np` to avoid shadowing the numpy import
        self.a.resize(n, m)
        self.b.resize(m, n)


class Stopwatch:

    def __init__(self):
        self.msgs = []
        self.start = time.perf_counter()

    def lap(self, msg: str):
        t = time.perf_counter()
        duration = t - self.start
        self.start = t
        self.msgs.append(f'{msg}: {duration:.4f}s')

    def print(self):
        print('\n'.join(self.msgs))


def __row__(row, b):
    cl = []
    for column in range(len(b[0])):
        s = 0
        for i, other in enumerate(b):
            s += row[i] * other[column]
        cl.append(s)
    return cl


def single(m: Matrices) -> list[list[int]]:
    """Multiplies using a single core"""
    start = time.perf_counter()
    n = []
    for row in m.a.data:
        n.append(__row__(row, m.b.data))
    duration = time.perf_counter() - start
    print(f'Single-core: {duration:.4f}s')
    return n


def multiple(m: Matrices) -> list[list[int]]:
    """Multiplies using multiple cores"""
    start = time.perf_counter()
    with Pool() as p:
        result = p.starmap(__row__, ((row, m.b.data) for row in m.a.data))
    duration = time.perf_counter() - start
    print(f'Multi-core: {duration:.4f}s')
    return result


def opencl(m: Matrices, dev: ocl.Device) -> list[list[int]]:
    """Multiplies using OpenCL"""
    ctx = ocl.Context(devices=(dev,))
    sw = Stopwatch()
    with ocl.CommandQueue(ctx) as q:
        # int32 matches the kernel's `int *` arguments
        a = ocl.array.to_device(q, np.array(m.a.data, dtype=np.int32))
        b = ocl.array.to_device(q, np.array(m.b.data, dtype=np.int32))
        # zero-filled so the kernel can safely accumulate into it
        s = ocl.array.zeros(q, m.a.rows * m.b.cols, np.int32)
        sw.lap('->GPU')

        prg = ocl.Program(ctx, cl_prg).build()
        prg.multiply(q, s.shape, None,
                     np.int32(m.a.rows), np.int32(m.a.cols), np.int32(m.b.cols),
                     a.data, b.data, s.data)
        s = s.reshape(m.a.rows, m.b.cols)
        q.finish()
        sw.lap('GPU compute')

        result = s.map_to_host().tolist()
        sw.lap('->CPU')
    sw.print()
    return result

lang/py3/main.py

+38
import os

import pyopencl as ocl

import benchmark

os.environ['PYOPENCL_NO_CACHE'] = '1'

if __name__ == '__main__':
    m = benchmark.Matrices()
    while True:
        hw_choice = int(input(
            '1. Single-core\n'
            '2. Multi-core\n'
            '3. GPU (OpenCL)\n'
        ))
        if hw_choice == 1:
            m.resize(1200, 100)
            input('Start')
            benchmark.single(m)
        elif hw_choice == 2:
            m.resize(1900, 190)
            input('Start')
            benchmark.multiple(m)
        else:
            m.resize(8250, 1500)
            devices = []
            print('The following devices in your system support OpenCL:')
            for platform in ocl.get_platforms():
                print(platform.name + ' | ' + platform.version)
                for device in platform.get_devices():
                    print(f'{len(devices)}. {device.name} | {device.version} | {device.max_compute_units} CU')
                    devices.append(device)
            device_choice = int(input('Enter device number to benchmark: '))
            device = devices[device_choice]

            benchmark.opencl(m, device)

lang/py3/requirements.txt

+1
pyopencl
numpy

result/legion-r7000-2020.md

+45
# System
* Name: Lenovo Legion R7000 2020
* Class: Laptop
* CPU: AMD Ryzen 5 4600H (TDP 45W)
* RAM: Dual-channel DDR4-3200 CL22
* GPU0: Integrated AMD Vega 'gfx902'
  * 6 CU, 384 shader units, 512MB 128bit DDR4
* GPU1: Nvidia GeForce GTX 1650 Ti (TDP 50W)
  * 16 CU, 1024 shader units, 4GB 128bit GDDR6
  * Max PCIe bandwidth: Gen3 x16 -> 126Gbps

# Note
iGPU
* Memory capacity: with most of my background tasks closed, it still consumes ~300MB. When running the tests, it reaches the 512MB limit.
* The results fluctuate a lot for some reason: GPU processing time varies between 5s and 9s, so I ran 20 benchmarks for the iGPU. Other devices produce consistent results, and I only ran 3 times per device. See the yellow columns in `calc.xlsx` for details.

# Power
* CPU
  * Single core: 4.0GHz, 9.703W
    * Main memory bandwidth: RAvg 1.198 RMax 3.661 WAvg 0.368 WMax 0.903
  * All cores: ~3.9GHz, 49.567W
    * Main memory bandwidth: RAvg 5.517 RMax 20.396 WAvg 2.180 WMax 3.255
* GPU0
  * Core: ~1.104GHz, max 1.5GHz. 15.233W
  * Main memory bandwidth: RAvg 36.104 RMax 47.743 WAvg 0.413 WMax 2.285
  * 'VRAM' bandwidth: 1.6GHz -> 409.6Gbps
* GPU1
  * Core: ~1.601GHz, max 1.9GHz. CPU 5.278W + GPU 33.389W = Total 38.667W
  * Main memory bandwidth: RAvg 2.177 RMax 11.036 WAvg 0.606 WMax 2.758
  * PCIe bandwidth: Avg 7.228GHz -> 114Gbps, Max 8GHz -> 126Gbps
  * VRAM bandwidth: Avg 1.304GHz -> 1335Gbps, Max 1.5GHz -> 1536Gbps

# Results (Windows 10 21H2)
## Python 3
* Single-core: 10.458s, 1200x100 | 100x1200
  * 13769 comp/s, 1419 comp/j
* All cores: 10.016s, 1900x190 | 190x1900
  * 68480 comp/s, 1381 comp/j
* GPU0: ->GPU 1.666s, Compute 7.821s, ->CPU 1.264s, Total 10.752s, 8250x1500 | 1500x8250
  * 9495326 comp/s, 623339 comp/j
* GPU1: ->GPU 2.293s, Compute 5.542s, ->CPU 1.916s, Total 9.751s, 9000x2100 | 2100x9000
  * 17444364 comp/s, 451143 comp/j

tool/bandwidth-calc.py

+14
choice = int(input(
    '1. PCIe\n'
    '2. VRAM\n'
))

if choice == 1:
    link_speed = float(input('Enter link speed in GHz: '))
    lane_count = int(input('Enter number of lanes: '))
    # 128/130 is the encoding scheme of PCIe Gen3+; result is in Gbps
    print(link_speed * 128 / 130 * lane_count)
elif choice == 2:
    clock_speed = float(input('Enter clock speed in GHz: '))
    bus_width = int(input('Enter bus width in bits: '))
    pump_rate = int(input('Enter pump rate: '))
    # result is in Gbps
    print(clock_speed * bus_width * pump_rate)

tool/calc.xlsx

11.7 KB
Binary file not shown.

tool/efficiency-calc.py

+14
n = int(input('Enter the first number of the Matrices class: '))
m = int(input('Enter the second number of the Matrices class: '))

computation = n ** 2 * m
print(str(computation) + ' computations')

t = float(input('Enter time: '))
power = float(input('Enter power: '))

# divided by 1000: reported values are in thousands of computations
comp_per_time = int(computation / t / 1000)
print(str(comp_per_time) + ' computations per second')

comp_per_j = int(computation / (power * t) / 1000)
print(str(comp_per_j) + ' computations per joule')
