
Higher cache-to-cache and memory latencies on E5-2689 v4 vs E5-2667 v3

Florian_H_

We are currently testing different CPUs for use in multithreaded low-latency applications.

Comparing the E5-2689 v4 with the E5-2667 v3, we see cache-to-cache latencies up to around 35% higher on the v4 CPU, which has 2 more cores (local L2->L2 HIT: 37.6 ns vs 27.8 ns). Is this expected by design?

E5-2667 v3
./mlc --c2c_latency
Intel(R) Memory Latency Checker - v3.1a
Command line parameters: --c2c_latency

Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT  latency        27.8
Local Socket L2->L2 HITM latency        32.3
Remote Socket LLC->LLC HITM latency (data address homed in writer socket)
                        Reader Numa Node
Writer Numa Node     0       1
            0        -    63.9
            1     63.7       -
Remote Socket LLC->LLC HITM latency (data address homed in reader socket)
                        Reader Numa Node
Writer Numa Node     0       1
            0        -    63.8
            1     63.9       -

E5-2689 v4
./mlc --c2c_latency
Intel(R) Memory Latency Checker - v3.1a
Command line parameters: --c2c_latency

Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT  latency        37.6
Local Socket L2->L2 HITM latency        40.9
Remote Socket LLC->LLC HITM latency (data address homed in writer socket)
                        Reader Numa Node
Writer Numa Node     0       1
            0        -    74.4
            1     75.0       -
Remote Socket LLC->LLC HITM latency (data address homed in reader socket)
                        Reader Numa Node
Writer Numa Node     0       1
            0        -    74.2
            1     74.5       -
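
For reference, the relative increase per category can be worked out from the two outputs above. This is only a sketch of the arithmetic: the latencies are copied from the mlc runs, the two directions of each remote table are averaged for brevity, and the dictionary keys and the helper name `pct_increase` are made up for illustration.

```python
# Latencies in ns, copied from the two "mlc --c2c_latency" outputs above.
# Remote values are the mean of the two directions (0->1 and 1->0).
v3 = {
    "local L2->L2 HIT": 27.8,
    "local L2->L2 HITM": 32.3,
    "remote HITM, homed in writer": (63.9 + 63.7) / 2,
    "remote HITM, homed in reader": (63.8 + 63.9) / 2,
}
v4 = {
    "local L2->L2 HIT": 37.6,
    "local L2->L2 HITM": 40.9,
    "remote HITM, homed in writer": (74.4 + 75.0) / 2,
    "remote HITM, homed in reader": (74.2 + 74.5) / 2,
}


def pct_increase(old: float, new: float) -> float:
    """Relative increase of new over old, in percent."""
    return (new - old) / old * 100.0


# Percentage increase of the v4 result over the v3 result, per category.
increase = {name: pct_increase(v3[name], v4[name]) for name in v3}

for name, pct in increase.items():
    print(f"{name:30s} {v3[name]:6.1f} ns -> {v4[name]:6.1f} ns  (+{pct:.1f}%)")
```

By this arithmetic the roughly 35% figure applies to the local L2->L2 HIT case; the local HITM case comes out at about 27%, and the remote-socket cases at about 16 to 17%.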

McCalpinJohn

I don't know what the right answer is supposed to be, but there are potentially a lot of confounding factors:

  • The Xeon E5-2667 v3 is an 8-core processor, so it has only one ring and one home agent. It therefore has a smaller average number of "hops" between the core making the request and the L3 slice that holds the data.
  • The default snooping mode is different for Xeon E5 v3 and v4. I have not had a chance to study the default mode on the Xeon E5 v4 in detail, but changes made to increase throughput often have the side effect of slightly increasing latency.
  • The cache-to-cache intervention latency depends on both the core frequency and the uncore frequency.
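
One way to take frequency out of the comparison is to express each latency in clock cycles rather than nanoseconds. The sketch below is only an illustration of that conversion: the helper `ns_to_cycles` is a made-up name, and the frequencies in the loop are placeholders, not measured values for either system. The actual core and uncore clocks during the mlc run would have to be read on each machine. Because the transfer crosses both the core and uncore clock domains, a single-frequency conversion is a rough normalization at best.

```python
def ns_to_cycles(latency_ns: float, freq_ghz: float) -> float:
    """Convert a latency in nanoseconds to clock cycles at freq_ghz.

    One GHz is one cycle per nanosecond, so cycles = ns * GHz.
    """
    return latency_ns * freq_ghz


# Local L2->L2 HIT latencies reported by mlc in the original post.
hit_latency_ns = {"E5-2667 v3": 27.8, "E5-2689 v4": 37.6}

# Placeholder frequencies (ASSUMPTION): substitute the clocks actually
# observed on each system while the benchmark was running.
for freq_ghz in (2.5, 3.0, 3.5):
    for cpu, ns in hit_latency_ns.items():
        cycles = ns_to_cycles(ns, freq_ghz)
        print(f"{cpu} at {freq_ghz:.1f} GHz: {ns} ns = {cycles:.1f} cycles")
```

If the two systems were running at different clocks during the test, part of the gap in nanoseconds may reflect frequency rather than topology or snoop mode.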