Edgardo_Doerner · ‎07-11-2014

Hi to everyone,

I have a small question about the Intel Xeon Phi coprocessors: do they suffer performance penalties when the execution paths of different threads diverge?

For example, on GPUs the SIMD execution model imposes heavy performance penalties on kernels with control flow in which the threads follow different paths of execution. As you know, the hardware makes all of these paths execute sequentially.

Is this issue also present on the Intel Xeon Phi coprocessors? I work mostly on Monte Carlo (MC) simulation of particle transport in matter, where the particle histories diverge very quickly. As a first step I want to parallelize the MC codes using OpenMP. Can I run the codes directly on a Xeon Phi coprocessor, or will I suffer performance penalties due to the different execution paths of each particle?

robert-reed · ‎07-11-2014

The simple answer is "no." There are more complicated answers, because the Intel Many Integrated Core Architecture DOES have SIMD vector registers with masking, and it can lose vector efficiency if programmed with diverging vector elements. But traditional GPUs take that vectorization a step further into their core designs. NVIDIA organizes the threads on its streaming multiprocessors (SMs) into what it calls warps. They impose a homogeneity on the operations that can occur on each SM, so that as long as the execution path across the elements remains the same they can apply a lot of vector resources that really accelerate vector computation.

The Intel Many Integrated Core architecture, by contrast, leans much closer to a traditional CPU organization. Each core supports four HW threads that are completely independent in their execution--the architecture takes advantage of normal pipeline latencies to intermix up to four instruction streams in a HW scheduler, using what is referred to as smart round-robin scheduling. If one of the four HW threads takes a different branch, then it follows that branch wherever it might lead, on its portion of the core's available cycles. Each core has its own L1 cache and a segment of the L2 cache, so if the threads on that core can coordinate to share data cached in common, you can get some real performance boosts. Conversely, if the focus of the threads on a core diverges, there could be increased memory pressure to supply disparate sets of cache lines for those threads, but it's nothing like the lock-step programming of an SM.
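To make the thread-level picture concrete, here is a minimal sketch of the kind of OpenMP loop the original question describes, in which each iteration follows its own particle history and branches independently. Everything in it is an illustrative assumption rather than code from any real transport package: `next_uniform`, `run_history`, and `simulate` are hypothetical names, the xorshift generator is a stand-in for a proper parallel RNG, and the "physics" is a toy absorb-or-scatter loop.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-history RNG (xorshift64); returns a value in [0, 1). */
static double next_uniform(uint64_t *state) {
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return (double)(x >> 11) / 9007199254740992.0; /* divide by 2^53 */
}

/* Toy history: at each step the particle is absorbed with probability
 * absorb_prob, otherwise it scatters and continues. The number of steps
 * differs from history to history, so control flow diverges immediately. */
static int run_history(uint64_t seed, double absorb_prob, int max_steps) {
    uint64_t state = (seed + 1) * 0x9E3779B97F4A7C15ULL;
    if (state == 0) state = 1; /* xorshift must not start at zero */
    int steps = 0;
    while (steps < max_steps) {
        steps++;
        if (next_uniform(&state) < absorb_prob)
            break; /* absorbed: this history ends here */
    }
    return steps;
}

/* Each OpenMP thread runs whole histories on its own instruction stream.
 * Seeding by history index keeps the result independent of thread count. */
long simulate(long n_histories, double absorb_prob, int max_steps) {
    assert(n_histories >= 0 && max_steps > 0);
    long total = 0;
    #pragma omp parallel for reduction(+:total) schedule(dynamic, 64)
    for (long i = 0; i < n_histories; i++) {
        total += run_history((uint64_t)i, absorb_prob, max_steps);
    }
    return total;
}
```

A call such as `simulate(1000000, 0.3, 1000)` returns the total number of interactions over all histories. The `schedule(dynamic, 64)` clause is one possible choice for balancing histories of very different lengths; the chunk size is a guess that would need tuning.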

Edgardo_Doerner · ‎07-12-2014

Dear Robert,

thanks for your reply... it seems that this architecture could be very useful for my interests... maybe I will buy a "developer starter kit", hehehe...

jimdempseyatthecove · ‎07-13-2014

In addition to Robert's comments, the only other negative of divergence is its effect on the L1 instruction cache. The four threads within a core share the same L1 instruction cache. After divergence, each thread of the core competes with the other threads of that core for the L1 instruction cache.

RE: Robert's: if the threads on that core can coordinate to share data cached in common, you can get some real performance boosts

There is a 5-part series of IDZ blog articles I wrote on this subject, "The Chronicles of Phi". Go to the IDZ blogs under the Resources link and type "chronicles" into the search field. All five should show up near the top.

This series of blogs illustrates an exceptional example of how the threads on a core can coordinate to share data cached in common: an otherwise expertly optimized program exhibited an additional 47% boost in performance with the addition of coordinated sharing of the core's common L1/L2 data caches (and other tricks).

RE: Monte Carlo (MC) simulation of particle transport on matter, where the particle histories diverge very quickly.

One important thing to keep in mind is that the performance advantage of the Xeon Phi comes primarily from its wide vector unit. To take full advantage of it, you must code your particle divergence such that it maintains a wide vector organization. This generally requires a different design strategy from a host program that has only a 2-wide or 4-wide vector unit, and where the cost of going scalar is no more than, say, 2x the computation time. The vector-to-scalar ratio is much larger for Xeon Phi. Use it to your advantage.
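One way to read "maintain a wide vector organization" is sketched below: hold particles in a structure-of-arrays batch, apply one step to the whole batch with the branch written as a per-lane select, then compact the survivors so that later steps still fill the vector lanes. This is only an illustration under assumptions, not the design from the blog series. `ParticleBatch`, `step_batch`, and `compact_batch` are hypothetical names, the random numbers in `r` are assumed to be generated beforehand, and the absorb/scatter rule is a toy model.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical structure-of-arrays layout: one array per particle property. */
typedef struct {
    float *energy; /* current particle energy */
    float *weight; /* statistical weight */
    int   *alive;  /* 1 while the history is still active, else 0 */
} ParticleBatch;

/* Apply one interaction step to every particle in the batch.
 * r[i] is a pre-generated uniform random number for particle i.
 * Absorbed particles deposit all their energy; scattered particles deposit
 * the fraction `loss`. Both outcomes are expressed as selects, so a
 * compiler can map the loop onto masked vector instructions. */
float step_batch(ParticleBatch *b, const float *r, size_t n,
                 float p_absorb, float loss) {
    assert(b != NULL && r != NULL);
    float deposited = 0.0f;
    #pragma omp simd reduction(+:deposited)
    for (size_t i = 0; i < n; i++) {
        int active    = b->alive[i];
        int absorbed  = active & (r[i] < p_absorb);
        int scattered = active & !absorbed;
        float e   = b->energy[i];
        float dep = absorbed ? e : (scattered ? e * loss : 0.0f);
        deposited   += dep * b->weight[i];
        b->energy[i] = e - dep;
        b->alive[i]  = scattered;
    }
    return deposited;
}

/* Move surviving particles to the front so the next step works on a dense
 * batch instead of wasting lanes on finished histories. Returns the new
 * particle count. */
size_t compact_batch(ParticleBatch *b, size_t n) {
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        if (b->alive[i]) {
            b->energy[k] = b->energy[i];
            b->weight[k] = b->weight[i];
            b->alive[k]  = 1;
            k++;
        }
    }
    return k;
}
```

A driver would alternate `step_batch` and `compact_batch` until the batch is empty, refilling it with new source particles when it gets small. Whether the loop actually vectorizes depends on the compiler and flags, so it is worth checking the vectorization report.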

Jim Dempsey

Branch divergence on Intel Xeon Phi Co-processors