Need help understanding and running likwid on a dual socket system; failure to detect / bench inter-socket link AKA UPI #664

Open · simonhf opened this issue Feb 25, 2025 · 2 comments

simonhf commented Feb 25, 2025

I'm trying to use likwid to better understand and detect the "inter-socket link" AKA UPI on my system, which is:

CPU name:       Intel(R) Xeon(R) Platinum 8480+
CPU type:       Intel SapphireRapids processor

But so far nothing has worked as expected:

I built and installed likwid like this:

$ sudo apt update
$ sudo apt install --yes build-essential git libnuma-dev
$ git clone https://github.com/RRZE-HPC/likwid.git
$ cd likwid
$ make
$ sudo make install
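
As a quick sanity check of the build (assuming the default /usr/local install prefix ends up on PATH and LD_LIBRARY_PATH), likwid-topology should list both sockets and their NUMA domains:

$ likwid-topology -g   # prints sockets, cores, caches and NUMA domains; both sockets should show up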

When I ask it to "print available performance groups for current processor":

$ sudo likwid-perfctr -a
Group name	Description
--------------------------------------------------------------------------------
   MEM_DP	Overview of DP arithmetic and main memory performance
   BRANCH	Branch prediction miss rate/ratio
     DATA	Load to store ratio
   MEM_SP	Overview of SP arithmetic and main memory performance
  L2CACHE	L2 cache miss rate/ratio
    CLOCK	Power and Energy consumption
   ENERGY	Power and Energy consumption
 FLOPS_HP	Half Precision MFLOP/s
       L3	L3 cache bandwidth in MBytes/s
   HBM_SP	Overview of SP arithmetic and main memory performance
      MEM	Memory bandwidth in MBytes/s
 TLB_DATA	L1 data TLB miss rate/ratio
   DIVIDE	Divide unit information
FLOPS_AVX	Packed AVX MFLOP/s
 FLOPS_DP	Double Precision MFLOP/s
  L3CACHE	L3 cache miss rate/ratio
TLB_INSTR	L1 Instruction TLB miss rate/ratio
  DDR_HBM	Memory bandwidth in MBytes/s for DDR and HBM
   MEM_HP	Overview of HP arithmetic and main memory performance
       L2	L2 cache bandwidth in MBytes/s
   HBM_HP	Overview of HP arithmetic and main memory performance
   HBM_DP	Overview of DP arithmetic and main memory performance
      HBM	HBM bandwidth in MBytes/s
      TMA	Top down cycle allocation
 FLOPS_SP	Single Precision MFLOP/s
  SPECI2M	Memory bandwidth in MBytes/s including SpecI2M

This is where things already get confusing, because I'm expecting it to show something like the following [1], but in my output above UPI is missing:

$ likwid-perfctr -a
Group name  Description
--------------------------------------------------------------------------------
MEM_DP  Overview of arithmetic and main memory performance
UPI  UPI data traffic
...

Why is UPI missing for me? And how can I get it to show up?
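
One thing that might narrow this down (just an idea, and the UPI/QPI event names are an assumption on my side since they differ per architecture) is to grep the raw event list for matching uncore events:

$ likwid-perfctr -e | egrep -i "upi|qpi"   # lists the events/counters known for this CPU; empty output would mean no UPI events are defined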

Also, [1] shows commands to test the performance impact of UPI, e.g.:

# pin domain is the whole node, but the workgroup specifies 20 threads on the first socket S0 and the four
# data streams are also allocated on the first socket S0

$ likwid-pin -c N:0-39 likwid-bench -t triad_mem_avx512_fma -w S0:20GB:20-0:S0,1:S0,2:S0,3:S0
Cycles: 7995555762
Time: 3.198232e+00 sec
MByte/s: 100055.29
Cycles per update: 0.799556
Cycles per cacheline: 6.396445

# pin domain is still the whole node, but the four data streams are now placed on the second socket S1

$ likwid-pin -c N:0-39 likwid-bench -t triad_mem_avx512_fma -w S0:20GB:20-0:S1,1:S1,2:S1,3:S1
Cycles: 19063612050
Time: 7.625461e+00 sec
MByte/s: 41964.68
Cycles per update: 1.906361
Cycles per cacheline: 15.250890

For its own example above, [1] says: "In the above example, we can see that the bandwidth is dropped from ~100 GB/s to ~41.9 GB/s. This is almost a 2.5x performance difference."
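
For context, my reading of the -w workgroup string used in these commands is roughly the following (based on the likwid-bench help text, so please correct me if I have this wrong):

# S0:20GB:20-0:S1,1:S1,2:S1,3:S1
#   S0     -> thread domain: the 20 benchmark threads run on socket 0
#   20GB   -> total working set size
#   20     -> number of threads
#   -0:S1,1:S1,2:S1,3:S1 -> the four data streams of the triad kernel are allocated in the S1 memory domain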

However, when I run the above commands on my dual-socket system, there curiously appears to be little "MByte/s" difference:

$ likwid-pin -c N:0-39 likwid-bench -t triad_mem_avx512_fma -w S0:20GB:20-0:S0,1:S0,2:S0,3:S0 2>&1 | egrep "(Cycles:|Time:|MByte/s:|Cycles per update:|Cycles per cacheline:)"
Cycles:			3371184384
Time:			1.685579e+00 sec
MByte/s:		189845.71
Cycles per update:	0.337118
Cycles per cacheline:	2.696948

$ likwid-pin -c N:0-39 likwid-bench -t triad_mem_avx512_fma -w S0:20GB:20-0:S1,1:S1,2:S1,3:S1 2>&1 | egrep "(Cycles:|Time:|MByte/s:|Cycles per update:|Cycles per cacheline:)"
Cycles:			3350410040
Time:			1.675192e+00 sec
MByte/s:		191022.87
Cycles per update:	0.335041
Cycles per cacheline:	2.680328

How can there be so little difference, given that the UPI link should be working hard to move the data between the two sockets? Or how should I modify these commands to make them work as expected?

Thanks!

[1] https://pramodkumbhar.com/2020/03/architectural-optimisations-using-likwid-profiler/

@simonhf simonhf changed the title Need help understanding and running likwid on a dual socket system Need help understanding and running likwid on a dual socket system; failure to detect / bench inter-socket link AKA UPI Feb 25, 2025
TomTheBear added a commit that referenced this issue Feb 26, 2025
TomTheBear (Member) commented:

LIKWID does not offer the UPI group for Intel SapphireRapids. The system used in the tutorial page is an Intel Haswell EP system. Counters, events and therefore performance groups are architecture-specific. So, it doesn't hide just for you ;)

I checked on two of our SPR nodes:

  • On the production system, Linux's NUMA balancing feature is active (/proc/sys/kernel/numa_balancing) and I get similar performance independent of the data location. The Linux kernel detects that all accesses to the data come from the remote socket, so it moves the data over (see the /proc/vmstat sketch after the measurements below for one way to observe this).
  • On the test system, I can enable/disable the NUMA balancing feature. With NUMA balancing off, you get the expected results:
$ likwid-bench -t triad_mem_avx512_fma -w S0:20GB:20:1:2-0:S1,1:S1,2:S1,3:S1 2>&1 | egrep "(Cycles:|Time:|MByte/s:|Cycles per update:|Cycles per cacheline:)"
Cycles:			11011722738
Time:			5.505768e+00 sec
MByte/s:		116241.74
Cycles per update:	0.550586
Cycles per cacheline:	4.404689
$ likwid-bench -t triad_mem_avx512_fma -w S0:20GB:20:1:2-0:S0,1:S0,2:S0,3:S0 2>&1 | egrep "(Cycles:|Time:|MByte/s:|Cycles per update:|Cycles per cacheline:)"
Cycles:			5490815030
Time:			2.745366e+00 sec
MByte/s:		233120.10
Cycles per update:	0.274541
Cycles per cacheline:	2.196326
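
(If you want to see what the kernel is doing here, one option is to watch the automatic NUMA balancing counters in /proc/vmstat around a run; the counter names below are what recent kernels expose and may differ slightly on your kernel version:)

$ egrep "numa_hint_faults|numa_pages_migrated" /proc/vmstat   # note the values before the run
$ likwid-bench -t triad_mem_avx512_fma -w S0:20GB:20:1:2-0:S1,1:S1,2:S1,3:S1
$ egrep "numa_hint_faults|numa_pages_migrated" /proc/vmstat   # large deltas mean NUMA balancing migrated pages during the run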

I committed a UPI group for SPR to the master branch. You can reinstall LIKWID from the master branch, or just download the file and put it into ~/.likwid/groups/SPR.
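
(Roughly, the manual route looks like this; the exact source path and file name depend on the repository layout, so take the actual group file from the groups directory of the master branch:)

$ mkdir -p ~/.likwid/groups/SPR
$ cp <likwid-src>/groups/<sapphirerapids>/UPI.txt ~/.likwid/groups/SPR/   # source path and file name are placeholders, adjust to the repository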

$ likwid-perfctr -c S0:0@S1:0 -g UPI likwid-bench -t triad_mem_avx512_fma -w S0:20GB:20:1:2-0:S1,1:S1,2:S1,3:S1
MByte/s:		116132.91
+----------------------------------------+-------------+------------+------------+------------+
|                 Metric                 |     Sum     |     Min    |     Max    |     Avg    |
+----------------------------------------+-------------+------------+------------+------------+
| Received data bandwidth [MByte/s] STAT |  25805.0898 | 12902.5449 | 12902.5449 | 12902.5449 |
|    Received data volume [GByte] STAT   |    348.5890 |   174.2945 |   174.2945 |   174.2945 |
|   Sent data bandwidth [MByte/s] STAT   |  77215.5450 | 38607.7725 | 38607.7725 | 38607.7725 |
|      Sent data volume [GByte] STAT     |   1043.0690 |   521.5345 |   521.5345 |   521.5345 |
|   Total data bandwidth [MByte/s] STAT  | 103020.6348 | 51510.3174 | 51510.3174 | 51510.3174 |
|     Total data volume [GByte] STAT     |   1391.6582 |   695.8291 |   695.8291 |   695.8291 |
+----------------------------------------+-------------+------------+------------+------------+
$ likwid-perfctr -c S0:0@S1:0 -g UPI likwid-bench -t triad_mem_avx512_fma -w S0:20GB:20:1:2-0:S0,1:S0,2:S0,3:S0
MByte/s:		230375.63
+----------------------------------------+---------+---------+---------+---------+
|                 Metric                 |   Sum   |   Min   |   Max   |   Avg   |
+----------------------------------------+---------+---------+---------+---------+
| Received data bandwidth [MByte/s] STAT | 55.1544 | 27.5772 | 27.5772 | 27.5772 |
|    Received data volume [GByte] STAT   |  0.5542 |  0.2771 |  0.2771 |  0.2771 |
|   Sent data bandwidth [MByte/s] STAT   | 15.7742 |  7.8871 |  7.8871 |  7.8871 |
|      Sent data volume [GByte] STAT     |  0.1584 |  0.0792 |  0.0792 |  0.0792 |
|   Total data bandwidth [MByte/s] STAT  | 70.9288 | 35.4644 | 35.4644 | 35.4644 |
|     Total data volume [GByte] STAT     |  0.7126 |  0.3563 |  0.3563 |  0.3563 |
+----------------------------------------+---------+---------+---------+---------+

This is without the MarkerAPI. If you want to use the MarkerAPI and NUMA balancing is active, you probably will not see the UPI traffic, because the Linux kernel would already have moved the data during the warmup phase, so there is no UPI traffic left while the benchmark runs.

I hope this clarifies it for you. If you have further questions, feel free to ask.

simonhf commented Feb 26, 2025

@TomTheBear thanks so much for the comment and explanation which all makes sense to me! :-)

However, when trying out the commands, I'm still getting unexpected results.

Here I disable NUMA balancing and re-run the command lines:

$ cat /proc/sys/kernel/numa_balancing
1

$ sudo sysctl -w kernel.numa_balancing=0; cat /proc/sys/kernel/numa_balancing # disable NUMA balancing
kernel.numa_balancing = 0
0

$ likwid-pin -c N:0-39 likwid-bench -t triad_mem_avx512_fma -w S0:20GB:20-0:S0,1:S0,2:S0,3:S0 2>&1 | egrep "(Cycles:|Time:|MByte/s:|Cycles per update:|Cycles per cacheline:)"
Cycles:			6696678972
Time:			3.348323e+00 sec
MByte/s:		191140.47
Cycles per update:	0.334834
Cycles per cacheline:	2.678672

$ likwid-pin -c N:0-39 likwid-bench -t triad_mem_avx512_fma -w S0:20GB:20-0:S1,1:S1,2:S1,3:S1 2>&1 | egrep "(Cycles:|Time:|MByte/s:|Cycles per update:|Cycles per cacheline:)"
Cycles:			3510586692
Time:			1.755282e+00 sec
MByte/s:		182306.94
Cycles per update:	0.351059
Cycles per cacheline:	2.808469

$ sudo sysctl -w kernel.numa_balancing=1; cat /proc/sys/kernel/numa_balancing # enable NUMA balancing
kernel.numa_balancing = 1
1

But the MByte/s is still unexpectedly similar. Why?

Then I noticed that your commands above don't use likwid-pin, so I tried again without it:

$ sudo sysctl -w kernel.numa_balancing=0; cat /proc/sys/kernel/numa_balancing # disable NUMA balancing
kernel.numa_balancing = 0
0

$ likwid-bench -t triad_mem_avx512_fma -w S0:20GB:20-0:S0,1:S0,2:S0,3:S0 2>&1 | egrep "(Cycles:|Time:|MByte/s:|Cycles per update:|Cycles per cacheline:)"
Cycles:			8593043156
Time:			4.296490e+00 sec
MByte/s:		148958.80
Cycles per update:	0.429652
Cycles per cacheline:	3.437217

$ likwid-bench -t triad_mem_avx512_fma -w S0:20GB:20-0:S1,1:S1,2:S1,3:S1 2>&1 | egrep "(Cycles:|Time:|MByte/s:|Cycles per update:|Cycles per cacheline:)"
Cycles:			6035619464
Time:			3.017794e+00 sec
MByte/s:		106037.72
Cycles per update:	0.603562
Cycles per cacheline:	4.828496

$ sudo sysctl -w kernel.numa_balancing=1; cat /proc/sys/kernel/numa_balancing # enable NUMA balancing
kernel.numa_balancing = 1
1

$ likwid-bench -t triad_mem_avx512_fma -w S0:20GB:20-0:S0,1:S0,2:S0,3:S0 2>&1 | egrep "(Cycles:|Time:|MByte/s:|Cycles per update:|Cycles per cacheline:)"
Cycles:			4378500026
Time:			2.189245e+00 sec
MByte/s:		146169.12
Cycles per update:	0.437850
Cycles per cacheline:	3.502800

$ likwid-bench -t triad_mem_avx512_fma -w S0:20GB:20-0:S1,1:S1,2:S1,3:S1 2>&1 | egrep "(Cycles:|Time:|MByte/s:|Cycles per update:|Cycles per cacheline:)"
Cycles:			5847790280
Time:			2.923883e+00 sec
MByte/s:		109443.50
Cycles per update:	0.584779
Cycles per cacheline:	4.678232

Now I do see a MByte/s difference, but it appears to have nothing to do with the NUMA balancing setting.

Questions:

  1. Why does likwid-pin appear, in the above examples, to unexpectedly always cause UPI usage/activity, i.e. force the MByte/s lower? (One idea I could try to check this is sketched below.)
  2. Why does enabling NUMA balancing change the results for you, but not for me?
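
For question 1, one thing I could try next to see where the stream pages actually end up is numastat from the numactl package (assuming it is installed; the pgrep pattern is just illustrative):

$ numastat -p $(pgrep -n likwid-bench)   # per-NUMA-node memory breakdown of the running benchmark, taken from another shell while it runs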

Thanks for your help so far! :-)
