Hi
We have a native MIC application that divides the input buffer amongst many threads (configurable; usually at least 4 per CPU).
The input arrives via DMA in 2 MB blocks. When the application is started, our intention was to start all the required threads and halt them at a barrier between each DMA block.
First we tried a spin barrier on a volatile int counter: while (count < TOTAL_THREADS) { /* do nothing */ }.
This was very inefficient, so we then tried the synchronisation method given on this page:
http://en.cppreference.com/w/cpp/thread/condition_variable/wait
The whole thread cohort takes about 400,000 microseconds to run, but if we time the contents of an individual thread, it only takes 30-70 us. I appreciate that their execution is probably staggered, but the synchronisation method is clearly proving to be a huge overhead.
We don't need our code to be portable, so we could use Intel intrinsics/atomics. All we need is to be able to put the threads to sleep when they complete each iteration of their main loop, and then to wake them all up simultaneously.
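Roughly, the pattern we took from that page looks like this (a simplified sketch, not our actual code; the names are just illustrative):

```cpp
#include <condition_variable>
#include <mutex>

// Simplified sketch of the condition_variable pattern adapted from the
// cppreference page above; names are illustrative only.
struct BlockGate {
    std::mutex m;
    std::condition_variable cv;
    unsigned long generation = 0;            // bumped once per 2 MB DMA block

    // Each worker calls this after finishing its slice of the current block.
    void wait_for_next_block(unsigned long last_seen) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return generation != last_seen; });
    }

    // The thread handling the DMA completion calls this to wake everyone.
    void release_all() {
        { std::lock_guard<std::mutex> lk(m); ++generation; }
        cv.notify_all();
    }
};
```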
TIA
First, ensure that the 4 threads you assume are on a single core are in fact on a single core, and will remain so for the duration of the run of the application. This requires pinning thread affinity to the core, or each thread to a specific logical processor within the core.
Next, create an array of 4 volatile ints, one for each logical processor (HT) of the same core, initialized to 0.
Each of the four threads of a core knows its HT sibling number (0:3; it is your responsibility to figure this out).
On entry to the barrier, each thread, knowing its thread number within the core (0:3), increments its private counter in the array, indicating it has reached the barrier. This is done without LOCK. After incrementing, it waits until all the other entries are greater than or equal to its value.
Use __int64 if you think you need to protect against overflow (or add logic to handle the situation).
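Something along these lines (a rough sketch only; not tested on a Phi, and working out the HT sibling number is left to you):

```cpp
// Rough sketch of the per-core barrier described above (untested).
// One counter per HT sibling of the core; no LOCKed instructions needed
// because each slot is written by exactly one thread.
#define HT_PER_CORE 4

volatile long core_barrier[HT_PER_CORE] = {0, 0, 0, 0};

// 'ht' is this thread's sibling number within the core (0..3).
void core_barrier_wait(int ht)
{
    long my_count = core_barrier[ht] + 1;   // plain store, no LOCK prefix
    core_barrier[ht] = my_count;
    for (int i = 0; i < HT_PER_CORE; ++i) {
        while (core_barrier[i] < my_count) {
            // spin; a pause/delay hint could go here (see below)
        }
    }
}
```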
Jim Dempsey
Thanks, will definitely give that a try.
Is there anything I can put inside the spin-wait loop? I found _mm_pause in the docs, but the compiler tells me it is not supported on the MIC.
I have pthread_yield() in there at the moment, but I can't say that it makes any difference compared to an empty loop.
*edit - the thread cohort completes in about 950 microseconds now...
Try an asm block with PAUSE. It may be the case that the intrinsic _mm_pause() was not defined in a header, or it might be that the Phi does not have this instruction. PAUSE has been available since the P4; I cannot see why it would be removed for the Phi.
BTW, _mm_pause is in emmintrin.h; maybe you have a missing header.
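That is, something like the following (untested here, since I do not have a Phi on hand):

```cpp
// Inline-asm fallback for _mm_pause(), if only the intrinsic is missing.
static inline void cpu_pause(void)
{
    __asm__ __volatile__("pause" ::: "memory");
}
```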
Jim Dempsey
The pause instruction represented by the _mm_pause() intrinsic is an artifact of the Intel Pentium 4 implementation of Intel Hyper-Threading Technology. That technology, intended for non-arithmetic workloads, could take advantage of idle ALU cycles by splitting the resource between two threads: while one thread was blocked waiting for memory, the other could be issuing instructions into the pipeline. However, one thread could saturate the pipeline by issuing instructions every cycle (as in spin-waiting). The pause instruction was designed basically as a hardware yield, to give the other thread a chance to get some cycles. Most HPC applications turned Intel Hyper-Threading Technology off, since maximum floating-point throughput could be achieved with just one thread per core, which also gave lower thread-management overhead.
With the Intel Xeon Phi coprocessor, the scheduler uses "smart round-robin scheduling" to distribute the cycles among the HW threads. No thread can issue in two successive cycles (a compromise made in the decoder in order to increase its operating frequency), so there is no need for a pause instruction: if all four HW threads on a core are trying to issue instructions, none of them can get locked out, by the basic nature of the architecture.
In addition to ganging the requests among the local teams on each core, you might also experiment with dividing the wakeup in a classic "telephone tree" maneuver, where each awakened local core master awakens its local neighbors and also two (or more) other core masters, cascading and parallelizing the wakeup, and then reversing the process on the join.
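As a rough illustration of the telephone-tree idea (the constants and names here are assumptions for the sketch, not a tested implementation):

```cpp
#include <atomic>

// One release flag per core master, fanned out as a binary tree so the
// wakeup takes O(log NUM_CORES) steps instead of one thread poking every
// core. The join would run the same tree in reverse.
constexpr int NUM_CORES = 60;                 // illustrative count
std::atomic<int> release_flag[NUM_CORES];

// Called by the master thread of core 'id' once its own flag has been set:
// it releases its two children, which release theirs, and so on.
void cascade_release(int id)
{
    int left  = 2 * id + 1;
    int right = 2 * id + 2;
    if (left  < NUM_CORES) release_flag[left].store(1, std::memory_order_release);
    if (right < NUM_CORES) release_flag[right].store(1, std::memory_order_release);
}
```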
As noted in the Xeon Phi Instruction Set Architecture manual (document 327364-001, September 2012, page 659), the PAUSE instruction is not supported on the Xeon Phi. Instead, the Xeon Phi implements the DELAY instruction (page 628), which is accessible via the _mm_delay_32() or _mm_delay_64() intrinsics (in zmmintrin.h).
Unlike PAUSE, the DELAY instruction allows (requires) the user to specify a specific number of cycles to wait.
I must confess that I am confused by several aspects of the description of the instruction in the Xeon Phi ISA manual.
(1) I don't understand why there is any issue with the "CURRENT_CLOCK_COUNT" (which is perversely undefined in the text).
If the instruction counts down from the value that you set to zero, then no other "clock" counts are relevant.
(2) The description includes the phrase "This instruction should prevent the issuing of additional instructions on the issuing thread as soon as possible [...]". What does that mean? Either the instruction *does* prevent the issuing of additional instructions on the issuing thread or it *doesn't* --- this is a statement about the hardware, and I don't see that there is anything the user can do about it.
I have not tried to use this instruction, but I am interested in the development of fast barriers for the four thread contexts on a single physical core. The OpenMP barrier function is painfully slow in this case. Using version 2 of "synchbench" from the EPCC OpenMP benchmarks in C, I just re-measured the OpenMP barrier overhead on a Xeon Phi SE10P using 2, 3, or 4 threads bound to one physical core:
2 threads: BARRIER overhead = 1.20 microseconds (>1300 cycles)
3 threads: BARRIER overhead = 1.76 microseconds (>1900 cycles)
4 threads: BARRIER overhead = 2.39 microseconds (>2600 cycles)
I certainly hope that a barrier in a shared cache can be made faster.
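For reference, the measurement amounts to something like the following (a simplified sketch, not the actual EPCC synchbench code, which also subtracts the overhead of a reference loop without the barrier):

```cpp
#include <omp.h>
#include <cstdio>

int main() {
    const int reps = 100000;

    // Bind the team to one physical core externally (e.g. via
    // KMP_AFFINITY/OMP_PLACES) and set OMP_NUM_THREADS to 2, 3, or 4.
    double t0 = omp_get_wtime();
    #pragma omp parallel
    {
        for (int i = 0; i < reps; ++i) {
            #pragma omp barrier
        }
    }
    double t1 = omp_get_wtime();

    std::printf("approximate barrier overhead: %.2f microseconds\n",
                (t1 - t0) * 1e6 / reps);
    return 0;
}
```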
Yes, the "instruction should prevent..." line DOES sound strange, almost like a specification rather than an explanation. I'll see if I can find out any details. As far as CURRENT_CLOCK_COUNT, I'm not sure what the source of such a clock might be (lower 32-bits of the CPU_CLOCK_UNHALTED?) but to me the description seems pretty straightforward, a means to avoid the modulo arithmetic of overflow.
Thanks for looking into this, Robert!
The confusion about the "modulo" may come from the lack of a description of how the instruction works.
I read the description as: Put a value in a counter. The counter gets decremented every cycle until it reaches zero, at which point your thread context will be eligible for scheduling again. If this is the case, then overflow and modulo arithmetic don't seem relevant?
I suppose that if you want counts that are bigger than 2^31 but less than 2^32, then you need to worry about the difference between decrementing signed and unsigned integers, and that might be considered a "modulo" problem? If this is the case, then it could probably be explained much more clearly?
In any case, if you are using this for fine-tuning spin-waits, then you probably want to use values that are in the range of 4-8 cycles -- enough to get out of the way of another thread that could use the core, but not so much that you delay progress by delaying your "wake up" time by more cycles than you have saved.
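For example, a spin-wait along these lines (just a sketch; the flag name and the 8-cycle count are arbitrary choices):

```cpp
#ifdef __MIC__
#include <zmmintrin.h>        // _mm_delay_32 / _mm_delay_64 on the coprocessor
#endif

// Spin until the flag is set, stepping out of the other HW threads' way
// for a few cycles between checks.
void spin_until_set(volatile int *flag)
{
    while (*flag == 0) {
#ifdef __MIC__
        _mm_delay_32(8);      // roughly 4-8 cycles, per the discussion above
#endif
    }
}
```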
JohnD,
Thanks for pointing out _mm_delay_..(nn).
Not having a Xeon Phi here, I cannot test. Nick W's report seems to indicate the compiler issues "... not supported" for _mm_pause() instead of "... not supported - substituting _mm_delay_64(16) instead". Of course, the programmer can insert a #define to fix the code, yet keep it portable for those CPUs that do not support _mm_delay_..(nn).
Robert, I disagree with your position that _mm_pause is not needed (or, alternatively, that _mm_delay_..(nn) is not needed), though I have to admit that, not being able to set up a test, I am unable to demonstrate this with an example. My justification is based on the observations of others: many applications report finding a sweet spot running 3 of the 4 threads within a core. Using _mm_delay_..(nn), if implemented as described, should reduce the round-robin cycle time by 25%. Note that when running 3 of the 4 threads, an application typically uses scattered affinity and undersubscribes the number of threads by 1/4, leaving the O/S to run the remaining threads in a null job, presumably issuing _mm_delay_..(nn) in a do-nothing loop (or alternatively MONITOR/MWAIT). This gives you a "so what" type of position. To me, it is wasteful to sequester 25% of the hardware resources for lack of gumption on the programmer's part to make use of them. IOW, all threads of a core do not have to do the same thing.
Jim Dempsey
Jim, I'm not sure what you mean regarding sequestering 25% of the machine. It's my understanding that when you have fewer than a full complement of HW threads running on an Intel Xeon Phi core, no cycles are skipped because of the reduced thread count; the available cycles are parcelled out to the HW threads as they become ready, using the "smart" round-robin scheduling. Two threads could use every other cycle if they were not blocked for other reasons, like waiting for data. One thread runs into the decoder latency issue, and still you can use only every other cycle. But likely lots of cycles are missed because no thread on the core is ready.
Robert,
It is not 25% of the machine; rather, it is 25% of the round-robin scheduler. The actual amount will vary depending on what the other HT siblings are doing. This might make for an interesting (and seemingly useless) test for someone to do.
Four threads of the core issue a long run of "inc rax", loop for a period of time / number of iterations, then look at the result. Versus: three threads issuing the same, with the fourth thread of the core issuing _mm_delay_64(1024). This test would likely show the upper limit of the "lost" (recoverable) time.
>> But likely lots of cycles are missed because no thread on the core is ready.
This is correct, and is likely a good indicator of a missed optimization opportunity.
Jim Dempsey
