I tested the inter-socket memory latency of a dual-socket Xeon Platinum 8180 server with lat_mem_rd from lmbench. The command and its output were:
numactl -C 0 -m 1 ./lat_mem_rd -P 1 -N 5 -t 4096m 1024
(The left column is the test buffer size; the right column is the measured latency in nanoseconds.)
The funny thing was that, after the test buffer grew past the capacity of all on-chip cache levels, the latency dropped by 10 ns as the buffer size continued to increase. I tried increasing the number of iterations with the -N parameter and the warm-up period with -W, but the latency drop still occurred.
Then I wrote a simpler benchmark to try to figure it out. My code builds a fully random linked list in which each node holds a pointer to the next node. The pointer-chasing part is exactly the same as in lmbench; the only difference is how the address pattern is initialized. My code’s result is 154 ns, which is close to the peak of lmbench’s result.
I then changed my code step by step to make it more like lmbench, to find out what caused the latency drop. I successfully reproduced the drop simply by keeping the one-way link access length within a specific range. If the access length goes beyond a certain value, the resulting latency becomes higher.
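For anyone who wants to reproduce this, here is a minimal sketch of this kind of pointer-chasing benchmark. It is not the code used for the numbers in this post and it is not lmbench's implementation. All names (node_t, build_random_chain, chase, ns_per_load) are hypothetical, a 64-byte cache line is assumed, and the steps parameter reflects one reading of "one-way link access length", namely the number of dependent loads walked in one timed pass.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* One node per cache line (64 bytes assumed), so every load hits a new line. */
typedef struct node {
    struct node *next;
    char pad[64 - sizeof(struct node *)];
} node_t;

/* Small xorshift PRNG so the shuffle is reproducible; state must be nonzero. */
static uint64_t xorshift64(uint64_t *s) {
    uint64_t x = *s;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *s = x;
}

/* Link n nodes into one random cycle: shuffle a visiting order
 * (Fisher-Yates), then point each node at its successor in that order. */
static void build_random_chain(node_t *nodes, size_t n, uint64_t seed) {
    assert(n > 0);
    if (seed == 0) seed = 0x9E3779B97F4A7C15ULL;
    size_t *order = malloc(n * sizeof *order);
    if (!order) abort();
    for (size_t i = 0; i < n; i++) order[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)(xorshift64(&seed) % (i + 1)); /* slight modulo bias is fine here */
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t k = 0; k < n; k++)
        nodes[order[k]].next = &nodes[order[(k + 1) % n]];
    free(order);
}

/* Walk `steps` dependent loads; each load's address comes from the previous one. */
static node_t *chase(node_t *p, size_t steps) {
    while (steps--) p = p->next;
    return p;
}

/* Time one walk and normalize by the number of loads actually performed.
 * The final pointer is returned through `end` so the walk cannot be optimized away. */
static double ns_per_load(node_t *start, size_t steps, node_t **end) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    node_t *p = chase(start, steps);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    *end = p;
    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)steps;
}
```

To use it, allocate a buffer of nodes much larger than the last-level cache, call build_random_chain once, then call ns_per_load with different values of steps. Running the resulting binary under the same numactl -C 0 -m 1 binding as above would keep the core on one socket and the memory on the other. Note that the divisor must match the number of loads in the timed walk, which is the same bookkeeping as lmbench's count parameter.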
To verify this conclusion, I modified lmbench’s code:
I simply increased the one-way link access length by a factor of 9 and adjusted the count parameter so that the per-load latency is still calculated correctly. The new result for the same command became:
The significant latency drop disappeared, and the result agreed with my code. I also tested with Intel’s official tool, Memory Latency Checker (MLC):
Its result matched the original lmbench.
I did some more digging and found that I could eliminate this phenomenon by disabling “directory mode” in the BIOS → UPI configuration menu.
With directory mode disabled, lmbench, Intel MLC, and my code all gave the same inter-socket latency: 169~170 ns.
So here’s my hypothesis: there is some mechanism associated with “directory mode” that optimizes inter-socket latency. It successfully “cheats” on the most commonly used latency benchmarks, but it fails when these benchmarks are modified even slightly.
Here’s the table summarizing all the data mentioned previously:
As an extension of this topic, this mechanism appears to do even more under heavily loaded traffic.
What I would like to discuss here is: what could this mechanism be? And is there any chance that I could take advantage of it to optimize my code?