Significant increase in cache-to-cache data transfer latency

Vinay_Y_ · ‎10-12-2017

Hi All,

I am working on low-latency software where I need to transfer data between cores very quickly. I was comparing two machines with the Intel MLC tool.

I ran exactly this command on both machines:
sudo ./mlc -e -r -c3 -i2 -l128 --c2c_latency

and the following are the results for the two CPUs:

[CPU 1]
Intel(R) Xeon(R) Gold 6144 CPU @ 3.50GHz
Number of NUMA nodes = 1
uname -a
Linux cresco31 4.4.87-18.29-default #1 SMP Wed Sep 13 07:07:43 UTC 2017 (3e35b20) x86_64 x86_64 x86_64 GNU/Linux
sudo ./mlc -e -r -c3 -i2 -l128 --c2c_latency
Latency = 231.1 core clocks (66.0 ns)

[CPU 2]
Intel(R) Xeon(R) CPU E5-2643 v2 @ 3.50GHz
Number of NUMA nodes = 1
uname -a
Linux cresco29 4.4.73-18.17-default #1 SMP Fri Jun 23 20:25:06 UTC 2017 (f462a66) x86_64 x86_64 x86_64 GNU/Linux
sudo ./mlc -e -r -c3 -i2 -l128 --c2c_latency
Latency = 157.6 core clocks (45.0 ns)

We can see that cache-to-cache latency has increased significantly on the newer CPU. Does this mean that the new CPUs are slower in this regard?
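
As a sanity check on the two readings, here is a minimal C++ sketch of the clocks-to-nanoseconds conversion and the relative increase. It assumes MLC counts "core clocks" at the nominal 3.5 GHz of both parts; the function names are made up for illustration.

```cpp
// Convert a latency in core clocks to nanoseconds.
// Assumption: the clock runs at freq_ghz (GHz == cycles per nanosecond).
constexpr double clocks_to_ns(double core_clocks, double freq_ghz) {
    return core_clocks / freq_ghz;
}

// Relative increase of `newer` over `older` as a fraction (0.5 means +50%).
constexpr double relative_increase(double older, double newer) {
    return (newer - older) / older;
}
```

With the figures above, clocks_to_ns(231.1, 3.5) is about 66.0 ns and clocks_to_ns(157.6, 3.5) is about 45.0 ns, so both readings are consistent with a 3.5 GHz clock, and relative_increase(45.0, 66.0) comes out at roughly 0.47, i.e. about 47% higher latency on the Gold 6144.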

Thanks

Vinay

McCalpinJohn · ‎10-16-2017

As core counts increase, designers often have to make changes that allow increased throughput but at the cost of increased latency. The Xeon E5-2643 v2 is a 6-core part built on a single ring, while the Xeon Gold 6144 is built on a two-dimensional mesh, so it is not surprising that there is an increase in cache-to-cache transfer latency.

The specific numbers you show are a bit odd, and when I try this test I also get numbers that don't make any sense -- they are slow and they don't change when I change the values for the "-i" and "-c" options. (I am testing on a two-socket Xeon Platinum 8160 node -- 24 cores, 2.1 GHz nominal, 3.7 GHz max Turbo.) There may be something funny with the core bindings on the Xeon Scalable processors.

The default version of the command "sudo ./mlc --c2c_latency" gives more reasonable results:

# ./mlc --c2c_latency
Intel(R) Memory Latency Checker - v3.4
Command line parameters: --c2c_latency

Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT latency   48.3
Local Socket L2->L2 HITM latency   48.3
Remote Socket L2->L2 HITM latency (data address homed in writer socket)
           Reader Numa Node
Writer Numa Node     0         1
            0         -   112.2
            1   113.1         -
Remote Socket L2->L2 HITM latency (data address homed in reader socket)
           Reader Numa Node
Writer Numa Node     0         1
            0         -   177.9
            1   181.2         -

Vinay_Y_ · ‎10-17-2017

Thanks, John, for your response. When I run just the basic command ./mlc --c2c_latency, I get 13.5 ns, but I doubt this number because as soon as I add the -r option (./mlc -r --c2c_latency) the latency becomes 66 ns again. So it is still unclear to me why there is such a large difference.

Just to let you know about the problem I am working on: we have a set of heavy calculations, so I am trying to build a 3-stage pipeline on the CPU by dividing the calculation into multiple stages. In this pipeline, only 128 bytes are sent to the next core for the next set of calculations, so it is like a single producer/consumer setup.
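
To make the handoff concrete, here is a minimal sketch of a single-producer/single-consumer ring passing 128-byte messages between two stages. It is only an illustration under assumptions, not the actual code: the names Message and SpscRing are hypothetical, and the 64-byte alignment assumes the usual 64-byte cache line on these Xeons.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical 128-byte payload; with 64-byte lines it spans exactly two cache lines.
struct alignas(64) Message {
    unsigned char bytes[128];
};

// Illustrative SPSC ring buffer. Capacity must be a power of two so that
// index wrapping can be done with a mask.
template <std::size_t Capacity>
class SpscRing {
    static_assert((Capacity & (Capacity - 1)) == 0, "Capacity must be a power of two");
public:
    // Producer thread only. Returns false if the ring is full.
    bool try_push(const Message& m) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == Capacity) return false;
        slots_[head & (Capacity - 1)] = m;
        head_.store(head + 1, std::memory_order_release);  // publish the slot
        return true;
    }

    // Consumer thread only. Returns false if the ring is empty.
    bool try_pop(Message& out) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t head = head_.load(std::memory_order_acquire);
        if (head == tail) return false;
        out = slots_[tail & (Capacity - 1)];
        tail_.store(tail + 1, std::memory_order_release);  // release the slot
        return true;
    }

private:
    alignas(64) std::atomic<std::size_t> head_{0};  // written only by the producer
    alignas(64) std::atomic<std::size_t> tail_{0};  // written only by the consumer
    alignas(64) Message slots_[Capacity];
};
```

In a layout like this, one handoff touches the two payload lines plus the line holding the head index, so a single 128-byte transfer may cost more than one cache-to-cache latency as reported by MLC.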

When I measured the time, the data transfer took 16% of the overall calculation on the older CPU, but on the new CPU this has grown to 25%, which is causing the problem. Any ideas?
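
For reference, a small sketch of what those two percentages imply. It converts the transfer's share of total time into a transfer-to-compute ratio; the function name is made up, and comparing the two ratios assumes the compute time per item is the same on both CPUs, which may not hold.

```cpp
// Transfer time relative to compute time, given the transfer's share of the
// total: share = transfer / (transfer + compute).
constexpr double transfer_to_compute(double transfer_share) {
    return transfer_share / (1.0 - transfer_share);
}
```

transfer_to_compute(0.25) / transfer_to_compute(0.16) is 1.75, so if the compute part were unchanged, the transfer itself would be taking about 1.75 times longer. That is more than the 66.0 / 45.0 ≈ 1.47 ratio from MLC, which suggests that either the handoff degraded by more than the raw latency or the compute part got faster on the new CPU and the share grew for that reason.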