
Florian_H_ · 10-10-2016

We are currently testing different CPUs for use in multithreaded low-latency applications.

Comparing the E5-2689 v4 against the E5-2667 v3, we are seeing higher cache-to-cache latencies on the v4 CPU, which has 2 more cores: up to around 35% higher for local L2->L2 HIT, and roughly 16-17% higher for the remote-socket cases. Is this to be expected by design?

E5-2667 v3
./mlc --c2c_latency
Intel(R) Memory Latency Checker - v3.1a
Command line parameters: --c2c_latency

Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT latency        27.8
Local Socket L2->L2 HITM latency        32.3
Remote Socket LLC->LLC HITM latency (data address homed in writer socket)
                        Reader Numa Node
Writer Numa Node     0       1
            0        -    63.9
            1     63.7       -
Remote Socket LLC->LLC HITM latency (data address homed in reader socket)
                        Reader Numa Node
Writer Numa Node     0       1
            0        -    63.8
            1     63.9       -

E5-2689 v4
./mlc --c2c_latency
Intel(R) Memory Latency Checker - v3.1a
Command line parameters: --c2c_latency

Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT latency        37.6
Local Socket L2->L2 HITM latency        40.9
Remote Socket LLC->LLC HITM latency (data address homed in writer socket)
                        Reader Numa Node
Writer Numa Node     0       1
            0        -    74.4
            1     75.0       -
Remote Socket LLC->LLC HITM latency (data address homed in reader socket)
                        Reader Numa Node
Writer Numa Node     0       1
            0        -    74.2
            1     74.5       -
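For reference, the relative increases implied by the two runs above can be worked out directly. This is only a small sketch: the numbers are copied from the MLC output, the dictionary keys are labels chosen here for illustration, and the two directions of each remote-socket table are averaged into one value.

```python
# Latencies in ns, copied from the two `mlc --c2c_latency` runs above.
# Key names are illustrative labels, not MLC identifiers.
def mean(values):
    return sum(values) / len(values)

v3 = {
    "local_l2_hit": 27.8,
    "local_l2_hitm": 32.3,
    "remote_hitm_writer_homed": mean([63.9, 63.7]),
    "remote_hitm_reader_homed": mean([63.8, 63.9]),
}
v4 = {
    "local_l2_hit": 37.6,
    "local_l2_hitm": 40.9,
    "remote_hitm_writer_homed": mean([74.4, 75.0]),
    "remote_hitm_reader_homed": mean([74.2, 74.5]),
}

def percent_increase(old, new):
    """Relative increase of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0

increase = {name: percent_increase(v3[name], v4[name]) for name in v3}

for name, pct in increase.items():
    print(f"{name:28s} {v3[name]:6.2f} -> {v4[name]:6.2f} ns  (+{pct:.1f}%)")
```

With these inputs the local L2->L2 HIT case comes out at about +35%, local HITM at about +27%, and the remote-socket cases at about +16% to +17%.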

McCalpinJohn · 10-10-2016

I don't know what the right answer is supposed to be, but there are potentially a lot of confounding factors:

The Xeon E5-2667 v3 is an 8-core processor, so it has only one ring and one home agent. Therefore there will be a smaller average number of "hops" between the core making the request and the L3 slice that holds the data.
The default snooping mode is different for Xeon E5 v3 and v4. I have not had a chance to study the default mode on the Xeon E5 v4 in detail, but it is often the case that changes made to increase throughput have the side effect of slight increases in latency.
The cache-to-cache intervention latency is dependent on the core frequency and the uncore frequency.
- The nominal frequency on the Xeon E5-2689 v4 is slightly lower than on the Xeon E5-2667 v3, but the max single-core Turbo frequency is higher. The core frequency needs to be either monitored or controlled (or both).
- The uncore frequency will vary depending on the load and on several BIOS settings related to performance and energy efficiency. This can also be controlled explicitly as discussed at https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/600913#comment-1872473
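As one possible way to monitor the core frequency while the test runs, here is a minimal sketch. It assumes a Linux system whose cpufreq driver exposes `scaling_cur_freq` (in kHz) under `/sys/devices/system/cpu`. The function name and the sampling loop are illustrative choices, and the value reported by the driver may be an estimate rather than an exact measurement. It does not cover the uncore frequency.

```python
import glob
import os
import re
import time

def read_core_freqs_mhz(root="/sys/devices/system/cpu"):
    """Return {cpu_number: frequency in MHz} from cpufreq sysfs files.

    Assumes each core has <root>/cpuN/cpufreq/scaling_cur_freq holding
    a frequency in kHz (the usual Linux cpufreq layout).
    """
    freqs = {}
    pattern = os.path.join(root, "cpu[0-9]*", "cpufreq", "scaling_cur_freq")
    for path in glob.glob(pattern):
        match = re.match(r"cpu(\d+)", os.path.relpath(path, root))
        if match is None:
            continue
        with open(path) as f:
            khz = int(f.read().strip())
        freqs[int(match.group(1))] = khz / 1000.0
    return freqs

if __name__ == "__main__":
    # Sample once per second for a few seconds; run this alongside mlc.
    for _ in range(5):
        sample = read_core_freqs_mhz()
        print(" ".join(f"cpu{n}={mhz:.0f}" for n, mhz in sorted(sample.items())))
        time.sleep(1)
```

Running this in a second terminal during both the v3 and v4 measurements would at least show whether the cores doing the transfers sit at comparable frequencies. Pinning the frequency is a separate step and depends on the BIOS and the cpufreq driver in use.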

Higher cache-to-cache and memory latencies on E5-2689 v4 vs E5-2667 v3