Software Tuning, Performance Optimization & Platform Monitoring
Discussion around monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform monitoring

An interesting issue when testing inter-socket latency on a Skylake processor

NickChiu
Beginner

Hi, all

I tested the inter-socket latency of a dual Xeon Platinum 8180 server with lmbench's lat_mem_rd. The command pins execution to core 0 (socket 0) while allocating memory from NUMA node 1 (socket 1), so every memory access that misses the caches has to cross the UPI link. The command and its output were:

numactl -C 0 -m 1 ./lat_mem_rd -P 1 -N 5 -t 4096m 1024

NickChiu_0-1617776222356.png

(The left column is the test buffer size, and the right column is the measured latency in nanoseconds.)

The funny thing was that, after the test buffer outgrew all levels of on-chip cache, the latency dropped by about 10 ns as the buffer size continued to increase. I tried increasing the number of iterations with -N and the warmup period with -W, but the latency drop persisted.

Then I wrote a simpler benchmark to figure it out. My code initializes a fully random linked list, where each node contains a pointer to the next node. The list-walking part is exactly the same as lmbench's; the only difference is how the address pattern is initialized. My code's result is:

NickChiu_1-1617776222357.png

154 ns, which is close to the peak of lmbench's result. I then shifted my code step by step towards lmbench's to find out what caused the latency drop. I reproduced the drop simply by keeping the one-way linked-list length within a specific range; once the length exceeds a certain value, the resulting latency goes back up.

To verify this conclusion, I adjusted lmbench’s code:

NickChiu_2-1617776222361.png

NickChiu_3-1617776222366.png

I simply increased the one-way list length by a factor of 9 and adjusted the count parameter so that the latency calculation stays correct. Now the result for the same command became:

NickChiu_4-1617776222367.png

The significant latency drop disappeared, and the result agreed with my code's. I also tested it with Intel's official tool, Memory Latency Checker:

NickChiu_5-1617776222367.png

Its result matched the original lmbench's.

I did some more digging and found that I can eliminate this phenomenon by disabling "directory mode" in the BIOS→UPI configuration menu.

NickChiu_6-1617776222372.png

With directory mode disabled, lmbench, Intel MLC, and my code all gave the same inter-socket latency: 169–170 ns.

NickChiu_7-1617776222379.png

NickChiu_8-1617776222379.png

NickChiu_9-1617776222379.png

So, here's my hypothesis: there is some mechanism associated with "directory mode" that optimizes inter-socket latency. It "cheats" successfully on the most commonly used latency benchmarks, but somehow fails on simple modifications of those benchmarks.

Here's a table summarizing all the data mentioned above:

NickChiu_10-1617776222380.png

As an extension to this topic, the mechanism appears to do even more under heavily loaded traffic:

NickChiu_11-1617776222380.png

What I would like to discuss here is: what mechanism could this be? Is there any chance I could take advantage of it to optimize my code?
