We are currently testing different CPUs for use in multithreaded low-latency applications.
In a comparison of the E5-2689 v4 against the E5-2667 v3, we are seeing cache-to-cache latencies up to around 35% higher (local L2->L2 HIT) on the v4 CPU, which has 2 more cores. Is this to be expected by design?
E5-2667 v3
./mlc --c2c_latency
Intel(R) Memory Latency Checker - v3.1a
Command line parameters: --c2c_latency
Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT latency 27.8
Local Socket L2->L2 HITM latency 32.3
Remote Socket LLC->LLC HITM latency (data address homed in writer socket)
                    Reader Numa Node
Writer Numa Node         0        1
               0         -     63.9
               1      63.7        -
Remote Socket LLC->LLC HITM latency (data address homed in reader socket)
                    Reader Numa Node
Writer Numa Node         0        1
               0         -     63.8
               1      63.9        -
E5-2689 v4
./mlc --c2c_latency
Intel(R) Memory Latency Checker - v3.1a
Command line parameters: --c2c_latency
Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT latency 37.6
Local Socket L2->L2 HITM latency 40.9
Remote Socket LLC->LLC HITM latency (data address homed in writer socket)
                    Reader Numa Node
Writer Numa Node         0        1
               0         -     74.4
               1      75.0        -
Remote Socket LLC->LLC HITM latency (data address homed in reader socket)
                    Reader Numa Node
Writer Numa Node         0        1
               0         -     74.2
               1      74.5        -
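For reference, the relative increase for each measurement can be worked out directly from the two mlc outputs above. The short sketch below does that arithmetic; the dictionary keys and the `pct_increase` helper are names made up for illustration, and the numbers are copied from the posted output. It gives roughly 35% for local L2->L2 HIT, roughly 27% for local HITM, and roughly 16% to 18% for the remote-socket cases.

```python
# Latencies in ns, copied from the mlc --c2c_latency output above.
# Key names are illustrative labels, not mlc terminology.
V3_NS = {
    "local L2->L2 HIT": 27.8,
    "local L2->L2 HITM": 32.3,
    "remote HITM, writer-homed, writer 0 -> reader 1": 63.9,
    "remote HITM, writer-homed, writer 1 -> reader 0": 63.7,
    "remote HITM, reader-homed, writer 0 -> reader 1": 63.8,
    "remote HITM, reader-homed, writer 1 -> reader 0": 63.9,
}
V4_NS = {
    "local L2->L2 HIT": 37.6,
    "local L2->L2 HITM": 40.9,
    "remote HITM, writer-homed, writer 0 -> reader 1": 74.4,
    "remote HITM, writer-homed, writer 1 -> reader 0": 75.0,
    "remote HITM, reader-homed, writer 0 -> reader 1": 74.2,
    "remote HITM, reader-homed, writer 1 -> reader 0": 74.5,
}


def pct_increase(old, new):
    """Relative increase of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


def compare(old_results, new_results):
    """Return {label: percent increase} for every label in both result sets."""
    return {
        label: pct_increase(old_results[label], new_results[label])
        for label in old_results
        if label in new_results
    }


if __name__ == "__main__":
    for label, pct in compare(V3_NS, V4_NS).items():
        print(f"{label}: +{pct:.1f}%")
```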
I don't know what the right answer is supposed to be, but there are potentially a lot of confounding factors:
- The Xeon E5-2667 v3 is an 8-core processor, so it has only one ring and one home agent. Therefore there will be a smaller average number of "hops" between the core making the request and the L3 slice that holds the data.
- The default snooping mode is different for Xeon E5 v3 and v4. I have not had a chance to study the default mode on the Xeon E5 v4 in detail, but it is often the case that changes made to increase throughput have the side effect of slight increases in latency.
- The cache-to-cache intervention latency is dependent on the core frequency and the uncore frequency.
- The nominal frequency on the Xeon E5-2689 v4 is slightly lower than on the Xeon E5-2667 v3, but the max single-core Turbo frequency is higher. The core frequency needs to be either monitored or controlled (or both) during the test.
- The uncore frequency will vary depending on the load and on several BIOS settings related to performance and energy efficiency. This can also be controlled explicitly as discussed at https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/600913#comment-1872473
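As a rough illustration of the "monitor" half of that advice, the sketch below samples per-core frequencies on Linux while a test is running. It assumes the kernel exposes the cpufreq `scaling_cur_freq` files (values in kHz) under `/sys/devices/system/cpu`; what that value actually reflects depends on the cpufreq driver in use. The function names are hypothetical, and this does not cover the uncore frequency, which needs the approach described in the linked thread.

```python
from pathlib import Path


def read_core_freqs_mhz(sysfs_root="/sys/devices/system/cpu"):
    """Return {cpu_index: MHz} read from cpufreq's scaling_cur_freq files.

    Assumption: the files exist and hold a frequency in kHz. CPUs whose
    file is missing or unreadable are simply skipped.
    """
    freqs = {}
    for path in Path(sysfs_root).glob("cpu[0-9]*/cpufreq/scaling_cur_freq"):
        cpu_dir = path.parent.parent.name  # e.g. "cpu12"
        try:
            cpu = int(cpu_dir[3:])
            freqs[cpu] = int(path.read_text().strip()) / 1000.0
        except (OSError, ValueError):
            continue
    return freqs


def freq_range_mhz(freqs):
    """Return (min, max) over all sampled cores, or None if nothing was read."""
    if not freqs:
        return None
    values = list(freqs.values())
    return min(values), max(values)


if __name__ == "__main__":
    sample = read_core_freqs_mhz()
    for cpu in sorted(sample):
        print(f"cpu{cpu}: {sample[cpu]:.0f} MHz")
    print("range:", freq_range_mhz(sample))
```

Sampling this in a loop alongside `./mlc --c2c_latency` on both systems would show whether the two runs were taken at comparable core frequencies.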