angus-hewlett
Beginner

Understanding Core-to-Core latencies on different generations

Hi all,

I'm implementing some multi-threaded, multi-core code & profiling with VTune.

First off: I know that inter-thread / inter-core dependency should be kept to a minimum. "Vectorize innermost, parallelize outermost". Unfortunately my outermost parallelizable code is still pretty fine-grained (on the order of 100 µs per task) & relatively tightly integrated layout-wise. The working set of each task is small, <32 KB.

I have three quad-core systems at my disposal - an old Harpertown Xeon, a Sandy Bridge (i5 - no HT) and an Atom Silvermont (Z3xxx). The first two are development systems, the latter is the ultimate target.

So far I've been able to isolate data such that simultaneously executing tasks aren't trying to write to the same cache line, but nevertheless some core-to-core communication is required, particularly at job wakeup. The jobs are recurring, so I've experimented with pinning each job to a specific thread and core, so far that hasn't improved performance as much as I'd like (and as the cores are paired on Silvermont & Harpertown, there can be no more than two threads in close interaction).
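For reference, the kind of cache-line isolation described above is usually done by aligning and padding each task's hot data to 64-byte lines, so no two cores ever write the same line. This is a hypothetical sketch, not the poster's actual layout:

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64

/* Hypothetical per-task state, padded so that no two tasks'
 * hot data share a 64-byte cache line. */
typedef struct {
    alignas(CACHE_LINE) int counter;           /* written by exactly one core */
    char pad[CACHE_LINE - sizeof(int)];        /* fill out the rest of the line */
} task_slot_t;

task_slot_t slots[4];   /* one slot per worker thread */
```

With this layout, adjacent slots in the array land on different cache lines, so simultaneous writers never invalidate each other's lines (no false sharing).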

What I'm trying to get my head around is the relative penalty of going core-to-core on each system, and the characteristics of going core-to-core. Is it typically equivalent to the slowest shared memory, or potentially worse?

I understand that:

Sandy Bridge has 4 cores sharing only the LLC, with a latency of ~30 cycles.

Silvermont has 2x 2 cores, each pair sharing a 1 MB L2 with a latency of ~12 cycles; no LLC.

Harpertown has 2x 2 cores, each pair sharing a 6 MB L2 with a latency of ~15 cycles; no LLC.

I've done all I can to minimize false sharing, but nevertheless there will be some situations where data needs to be fetched from another core. So I'm trying to understand the relative penalty for fetching data from another core with shared cache, vs another core without. If there's no shared cache, is the request going all the way to main memory, or worse does it have to wait for a write from the other core to main memory & then fetch the data -- or is the system agent able to transfer data from one L2 to another without going to main memory and back? How about core-to-core latency via L2..? Is that just the L2 latency, or if the other core has the cache line in "M" or "E" state, is there a further penalty?

Is it even likely to be worth trying to synchronize tasks of 100us or less? I know larger would be better - just not an option here. Does anyone have any advice re the smallest work chunk it's feasible to attempt to send to a remote core?

(FWIW, the synchronization primitives & OS - semaphores, and Windows 7/8/10 - seem to be behaving well, I'm not experiencing priority inversions or scheduling problems, but VTune is flagging up performance degradation in the tasks themselves - i.e. clocks-per-instruction for the same code is increasing dramatically when they're run across multiple cores vs single threaded).

Thanks much.

7 Replies
jimdempseyatthecove
Black Belt

With tasks that short, and if the process is generally busy (tasks waiting in the queue), I would recommend that you write your own synchronization primitives targeted at the Z3xxx.

RE: clocks per instruction

What may be of more interest is clocks per instruction exclusive of synchronization. You may have to calculate that yourself.
Also, consider the throughput as well as worst latency. These may be more important to your application.
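Calculating CPI exclusive of synchronization might look like the following; the idea is simply to subtract the cycles and instructions attributed to the spin/sync code (e.g. from a per-function VTune breakdown) before dividing. The function name and inputs are illustrative:

```c
#include <assert.h>

/* Hypothetical helper: compute clocks-per-instruction for the real work
 * only, by removing the cycles and instructions that the profiler
 * attributes to synchronization (spin loops, wait calls). */
static double cpi_excluding_sync(double total_cycles, double total_insts,
                                 double sync_cycles, double sync_insts)
{
    return (total_cycles - sync_cycles) / (total_insts - sync_insts);
}
```

A task that shows 2000 cycles / 1000 instructions overall (CPI 2.0) but spent 1000 cycles on 500 spin instructions has a "real work" CPI of (2000-1000)/(1000-500) = 2.0 as well; the split tells you whether the degradation is in the work or in the waiting.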

Jim Dempsey

angus-hewlett
Beginner

Thank you. When you say own synchronization primitives.. I'm familiar with atomic operations, spinlocks and suchlike; I've been using WaitForSingleObject() & a Windows semaphore as I was under the impression that would burn fewer CPU cycles with busy-waits. Is there a half-way house that's more efficient than the OS synch primitives & less wasteful than spinlocks? NtDelayExecution perhaps?

Throughput-wise, I think I'm OK as the working set is so small. Even reloading it in its entirety on an Atom shouldn't be too bad a penalty (6.4 GB/s ≈ 6.4 kB/µs, so ~5 µs for the full 32 KB) - and with appropriate thread pinning and prefetch, L2 misses should, I think, be relatively infrequent.

McCalpinJohn
Black Belt

I have not seen any good documentation of cache intervention latencies on the various Intel processors.   There are some hints in Chapter 2 of the Optimization Reference Manual (document 248966), but it is not always clear which processor models are being referred to, and the "client" and "server" parts often have sufficiently different uncore implementations that the performance values may be quite different.

The inclusive L3 cache on the Sandy Bridge gives it a completely different cache-to-cache intervention process than on a Silvermont system, so I think considering those results is mostly going to make things confusing.  The Harpertown is fairly old now, and I would not trust that the performance characteristics are much like those of the Atom-based systems.

The questions you have on intervention latency are good ones, and I could answer them for Sandy Bridge, Ivy Bridge, or Haswell processors.   Atom is a very different beast, and Atoms with more than one core pair are even more differenter....  ;-)

To get a handle on the low-level cache-to-cache intervention performance characteristics, I have three approaches:

(1) Use a pointer-chasing code with a short (cache-containable) pointer chain.  Initialize it using one core and then chase it exactly once using a different core.   Low-overhead timing is critical, since the execution time is short.
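Approach (1) might be sketched as below. The sizes and names are illustrative; the thread pinning (SetThreadAffinityMask / pthread_setaffinity_np) and the low-overhead RDTSC timing around the second chase are omitted here for brevity:

```c
#include <assert.h>
#include <stdlib.h>

#define CHAIN_LEN 256   /* 256 nodes x 64 B = 16 KB: cache-containable */

/* One node per cache line, so every hop is a distinct line fill. */
typedef struct node { struct node *next; char pad[56]; } node_t;

static node_t pool[CHAIN_LEN];

/* Link the pool into one pseudo-random cycle so the hardware
 * prefetcher cannot predict the next line. */
static node_t *build_chain(void)
{
    int perm[CHAIN_LEN];
    for (int i = 0; i < CHAIN_LEN; i++) perm[i] = i;
    for (int i = CHAIN_LEN - 1; i > 0; i--) {   /* Fisher-Yates shuffle */
        int j = rand() % (i + 1);
        int t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (int i = 0; i < CHAIN_LEN; i++)
        pool[perm[i]].next = &pool[perm[(i + 1) % CHAIN_LEN]];
    return &pool[perm[0]];
}

/* The timed kernel: chase once on the initializing core to warm its
 * cache, then pin to a different core and time exactly one more pass. */
static node_t *chase(node_t *p, int hops)
{
    while (hops--) p = p->next;
    return p;
}
```

Because each load's address depends on the previous load, the chase serializes on the intervention latency; dividing the measured time by CHAIN_LEN gives an estimate of the per-line cost.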

(2) Use a "producer-consumer" code with two threads alternating increments to a shared data item a large number of times.  Use a separate variable (in a different cache line) as the "flag" to show when the data is ready for the other thread to increment.   This is not quite as easy to understand, since there are several different types of transactions taking place, but it has the advantage of being repeatable for as many iterations as desired.   The setup needs to be done carefully, but the producer and consumer both run something like the following code:

            /* === Begin Main Ping-Pong Loop === */
            /* `flag` and `data` must be volatile-qualified (or C11 atomics)
               so the compiler actually re-reads them in the spin loop */
            while (increments <= NTRADES) {
                // 1. wait for the flag to tell me that the data is ready
                while (*flag != READY) {
                    spins++;         // negligible overhead to track this
                }
                // 2. once the data is ready, read it
                new_data = *data;
                // 3. check whether the other thread is finished, else increment the data
                if (new_data == -1) {   /* special value: other thread already finished */
                    *data = 0;
                    break;  /* exit ping-pong loop */
                } else {
                    *data = *data + 1;
                }
                // 4. flip the flag -- the two threads interpret the values oppositely,
                //    so writing NOTREADY here reads as "READY" to the other thread
                //    (can be avoided with sneaky double-buffering)
                *flag = NOTREADY;
                increments++;     // only need to track this separately for debugging
            }
            /* === End Main Ping-Pong Loop === */
            *data = -1;     /* should not be necessary, but tell the other thread to quit now */

There are lots of variations and sneaky tricks that can be used. On a system that uses home snooping, the whole thing should be double-buffered so that each thread only writes to local variables. Double-buffering can also remove one of the writes from the critical path: if the other thread has set the flag to notify you that new data is ready, you are guaranteed that it has finished using the buffer you previously sent to it (the one it used for input), so you are now free to overwrite that buffer. Combining the two double-buffering schemes gives a quad-buffering scheme with the minimum number of transactions on the critical path.

I did not run this when I was testing a Silvermont system, but on Sandy Bridge systems the cross-chip producer/consumer latency could be as low as 240 ns or so (one-way handoff latency with both chips running at 3.1 GHz).  

(3) For a measurement using a more complete software stack, I run the OpenMP benchmarks from the Edinburgh Parallel Computing Center. https://www.epcc.ed.ac.uk/research/computing/performance-characterisation-and-benchmarking/epcc-open...

jimdempseyatthecove
Black Belt

>>some core-to-core communication is required, particularly at job wakeup.

If the job suspends for a very short time (less than a few ms), and the intention in suspending is not to yield to other processes but to reduce power (an embedded concern for your Z3xxx), then consider using MONITOR and MWAIT, assuming they are supported and enabled in the BIOS.

If on the other hand you are connected to power, you might also consider a tight flag test loop containing the PAUSE instruction.

Both solutions can perform an initial short-run test loop prior to falling into the longer secondary wait loops.
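The two-phase pattern described here (a short spin with PAUSE, then a fall-back to a blocking primitive) might be sketched as follows. The spin count and the choice of blocking fallback (a Windows semaphore, or MWAIT where available) are assumptions to be tuned per platform:

```c
#include <assert.h>
#include <stdatomic.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define cpu_relax() _mm_pause()   /* PAUSE: eases power and pipeline pressure */
#else
#define cpu_relax() ((void)0)     /* portability fallback for non-x86 builds */
#endif

#define SPIN_LIMIT 1000   /* tuning knob: how long to spin before blocking */

/* Returns 1 if the flag became set during the spin phase,
 * 0 if the caller should fall back to a blocking wait
 * (e.g. WaitForSingleObject on a semaphore, or MWAIT). */
static int spin_then_block(atomic_int *flag)
{
    for (int i = 0; i < SPIN_LIMIT; i++) {
        if (atomic_load_explicit(flag, memory_order_acquire))
            return 1;
        cpu_relax();
    }
    return 0;   /* caller blocks on its OS primitive here */
}
```

The acquire load on the flag ensures that data published by the producer before its release-store of the flag is visible to the woken consumer.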

Use of single-producer/single-consumer queues will have the lowest latency and best throughput. You can have multiple of these queues if needed. For the small number of threads (4 max) managing multiple queues should not be difficult. Note, there is a second class of queue called single-producer/multi-consumer queue, as well as multi-producer/single-consumer. All three of these classes of thread-safe queues are more efficient to use than a generic multi-producer/multi-consumer queue.

As to which queue (or queues) are best for your application, you will have to make that determination.

Note, there are two classes of each of these queues: lock-free/wait-free and locking.

The lock-free/wait-free queue has the higher per-operation overhead, but has the benefit that it cannot stall. With a locking queue, by contrast, a thread holding the lock can get preempted by the O/S to run something else; in that situation, the queue is locked out for the duration of the preemption of the thread holding the lock.
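A single-producer/single-consumer queue of the kind described here can be sketched with C11 atomics. The sizing and API are illustrative, and a production version would also pad `head` and `tail` onto separate cache lines to avoid false sharing between producer and consumer:

```c
#include <stdatomic.h>

#define QSIZE 16   /* capacity; must be a power of two */

/* Minimal SPSC ring buffer: the producer only writes `head`, the
 * consumer only writes `tail`, so no lock or read-modify-write is
 * needed -- just one release-store per operation. */
typedef struct {
    void *slot[QSIZE];
    atomic_uint head;   /* next slot to fill  (producer-owned) */
    atomic_uint tail;   /* next slot to drain (consumer-owned) */
} spsc_queue_t;

static int spsc_push(spsc_queue_t *q, void *item)
{
    unsigned h = atomic_load_explicit(&q->head, memory_order_relaxed);
    unsigned t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (h - t == QSIZE) return 0;            /* full */
    q->slot[h % QSIZE] = item;
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return 1;
}

static int spsc_pop(spsc_queue_t *q, void **item)
{
    unsigned t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    unsigned h = atomic_load_explicit(&q->head, memory_order_acquire);
    if (t == h) return 0;                    /* empty */
    *item = q->slot[t % QSIZE];
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
    return 1;
}
```

The single-writer-per-index invariant is what makes this class of queue cheaper than a generic multi-producer/multi-consumer queue: neither side ever needs a compare-and-swap.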

Jim Dempsey

McCalpinJohn
Black Belt

I don't know if it supports the Atom-based processors, but version 3 of the Intel Memory Latency Checker has a cache-to-cache intervention sub-test.  I am just starting to play with it on Haswell processors, but it looks like it has enough knobs to generate a good breadth of information.

McCalpinJohn
Black Belt

I will have to write my own tests to really understand what is happening, but the cache-to-cache intervention latency test in version 3 of the Intel Memory Latency Checker is flexible enough to show some aspects of the internal structure of the Xeon E5-26xx v3 processors.   The Xeon E5 v3 Uncore Performance Monitoring Reference Manual (document 331051) Figure 1-2 shows a block diagram of a 12-core Xeon E5-1600/2600/4600 v3 processor, showing 8 cores on the "main" ring and four cores on a separate loop connected to the "main" ring via bridges (SBox0-SBox3).   One might guess from such a diagram that there will be a certain asymmetry in L3 cache access and in core-to-core intervention latency.

On a system with Xeon E5-2680 v3 (12-core) processors, I measured the cache intervention latency for all core pairs (using the "-cN" option to set the reading thread and the "-wM" option to set the writing thread), and found that the results grouped clearly into two groups -- one group contained 8 cores and one group contained 4 cores.     Intervention latencies were slightly different within and across the two groups, with numbers in the range of:

  • Between cores in the 4-group:            84-86 cycles
  • Reader in 8-group, Writer in 4-group: 87-90 cycles
  • Reader in 4-group, Writer in 8-group: 88-91 cycles
  • Between cores in the 8-group:            90-94 cycles

These are clearly fairly small differences, but the data is clean enough that they appear to be real.

If I understand the tester correctly, these are average latencies over a block of addresses.   The cache lines within this block will map across the 12 L3 slices using an undocumented hash, and there is no guarantee that the addresses being used are mapped to all of the L3 slices an equal number of times. 

With some effort, one could extend this to three dimensions: reader location, writer location, L3 slice location.    The mapping of addresses to L3 slices can be obtained by repeatedly accessing one address and using the access counters in the L3 CBoxes to figure out which L3 slice controls that address.   It seems likely that the range of intervention latency values will increase when each L3 slice is considered separately, and it is possible that the difference will be large enough to consider exploiting for data placement of shared-memory variables used in communication and synchronization between core pairs.

I also measured cross-socket intervention latencies, with a clear 8/4 split being visible in these results as well:

  • Writer in 4-group on Socket 1, Reader in 8-group on Socket 0:    212-215 cycles
  • Writer in 4-group on Socket 1, Reader in 4-group on Socket 0:    219-221 cycles
  • Writer in 8-group on Socket 1, Reader in 8-group on Socket 0:    215-218 cycles
  • Writer in 8-group on Socket 1, Reader in 4-group on Socket 0:    222-225 cycles

These differences are almost certainly not large enough to worry about, but it is fun to see some reflection of the processor die layout in the performance values....

angus-hewlett
Beginner

Thanks for all your help - am using double buffered writes already (locals get written back to member vars at the end of a 64-iteration work loop) & lock-free 1W/1R queues. (As an aside - the locals are interleaved vector data, the member variables are array-of-structures).

Single producer / single consumer also seems to have the advantage that if I pin a thread to a particular core, the objects used exclusively by the consumer will tend to remain in that core's L2. Slightly less elegant, but not enough to worry about.

I've implemented busy-waits with PAUSE for the worker threads' active spin, and a semaphore & priority drop for their sleep-state (the parent thread wakes them up when it wakes up, and puts them back to sleep right before it sleeps). (It's a realtime audio application, so there are periodic bursts of work every couple of msec).

I seem to be able to get pretty good performance with two threads -- if I try to use all four cores, the probability of stalls of one kind or another becomes too high. (The workload doesn't partition nicely for 3 threads - has to be powers of two). Affinity / pinning doesn't seem to help things much on either Sandy Bridge or Silvermont - the former because its LLC is doing the job, the latter because the risk of getting stalled waiting for a system task seems to outweigh the gain from better cache use.

Thanks again!

- Angus.
