Solved: running non vectorized multithreaded applications

Ricardo_F_1 · ‎04-28-2016

hi,

I have a multithreaded application (up to 20 threads, for example) that executes non-vectorized code, i.e. each thread performs different operations on different pieces of data. Some of those operations could be vectorized, but they are not very computationally intensive.

I'm not sure such an application would be a good fit for host execution with the vectorized operations offloaded, since those operations aren't very computationally intensive on each piece of data. I was wondering whether it would make sense to run the application natively on the Phi with 1 thread per core and then issue the vectorized operations so that they run on threads of free cores... what do you guys think? Can I have vectorized and non-vectorized code running on different cores? Are the cores independent?

I have already tested the multithreaded application running several threads per core, and sure enough performance degrades, since those threads have no inherent vectorization.

Gregg_S_Intel · ‎04-29-2016

I think the question is can you run the scalar and vector code simultaneously on different KNC cores, as opposed to the uniformity needed to run on a GPU.

To that question: each KNC core can run fully independent code. No core shares a cache, pipeline, or register with another core.

KNL will be slightly different; two cores share an L2 cache, but are otherwise independent.
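
To illustrate the point about independent code on separate threads, here is a minimal sketch in standard C++ (not Phi-specific). One thread runs branch-heavy scalar work while another runs a unit-stride loop that a compiler can typically auto-vectorize. The names `collatz_steps`, `saxpy`, `MixedResult` and `run_mixed` are made up for this example, and the sketch does not pin threads to cores; whether the two threads land on different cores is left to the OS or to whatever affinity settings you use.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Scalar, branch-heavy work (hypothetical stand-in for the non-vectorized threads):
// counts Collatz steps until n reaches 1.
long collatz_steps(unsigned long n) {
    long steps = 0;
    while (n != 1) {
        n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
        ++steps;
    }
    return steps;
}

// Vector-friendly work: unit stride, no dependence between iterations,
// so the compiler is free to vectorize the loop.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    const std::size_t n = x.size();
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

struct MixedResult {
    long scalar_steps;
    float vector_first;
    float vector_last;
};

// Runs the scalar and the vectorizable routine concurrently on two threads.
MixedResult run_mixed() {
    long steps = 0;
    std::vector<float> x(1024, 2.0f);
    std::vector<float> y(1024, 1.0f);

    std::thread scalar_thread([&steps] { steps = collatz_steps(6); });
    std::thread vector_thread([&x, &y] { saxpy(3.0f, x, y); });

    scalar_thread.join();
    vector_thread.join();
    return {steps, y.front(), y.back()};
}
```

The two threads share no data, so neither has to wait on the other; that is the property the answer above relies on.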


jimdempseyatthecove · ‎04-29-2016

On KNC it is better to run at least 2 threads per core, with some exceptions. If your application is solely the 20-thread non-vectorized code, then it would be better to run it on the host CPU (with 2 to 4 hardware threads available). This is easy enough for you to test. If your application has the 20 non-vector (aka scalar) threads plus a large amount of vectorizable and parallelizable code that communicates heavily with the 20 scalar threads, then it may be better to place those threads within the KNC. Again, easy enough to test.

Jim Dempsey
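
As a rough illustration of what "2 threads per core" means for a 20-thread application, the sketch below assigns software threads to abstract (core, hardware-thread) slots, filling each core before moving to the next. `Slot`, `place_threads` and `cores_used` are hypothetical names for this example, and the indices are abstract: they are not the OS logical CPU numbers, which follow their own scheme on the coprocessor.

```cpp
#include <vector>

// Abstract placement slot; indices are illustrative, not OS logical CPU ids.
struct Slot {
    int core;       // which core the thread is assigned to
    int hw_thread;  // which hardware-thread slot on that core
};

// Fill `threads_per_core` slots on a core before moving on to the next core.
std::vector<Slot> place_threads(int num_threads, int threads_per_core) {
    std::vector<Slot> slots;
    for (int t = 0; t < num_threads; ++t) {
        slots.push_back({t / threads_per_core, t % threads_per_core});
    }
    return slots;
}

// Number of cores occupied by this placement (rounded up).
int cores_used(int num_threads, int threads_per_core) {
    return (num_threads + threads_per_core - 1) / threads_per_core;
}
```

With 20 threads, 2 per core occupies 10 cores and 1 per core occupies 20, which leaves a different number of cores free for the vectorized work in each case.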

TimP · ‎04-29-2016

Non-vector applications tend to depend more on multiple threads per core for peak performance. If your desire to run dissimilar code on various cores implies poor cache locality, Intel® Xeon Phi may not be effective.

I can't make sense of your scheme for choosing between offload and native mode.

Ricardo_F_1 · ‎04-29-2016

Thank you for your answers. I'll test the several variations of my app for sure.

Maybe I should have formulated the question in a different way: are the Phi cores independent, like host Xeon processors, with each core having its own vector pipeline, or is coordinating the vector pipelines of the several Phi cores crucial?

I do know that the Phi cores are weaker, so a scalar multithreaded application will perform worse there than on the host. But I wanted to know whether running a scalar multithreaded application on the Phi is OK, and whether I could vectorize just portions of the threads.

Gregg_S_Intel · ‎04-29-2016

The cores are independent like host Xeon processors.

jimdempseyatthecove · ‎04-29-2016

The Knights Corner version (KNC) has an in-order core. Due to latencies within the core pipeline, a single thread cannot fully utilize the core. Two threads within the same core run with little interference between each other (but share the same L1 and L2 cache). A scalar hardware thread on KNC may run at ~1/10 the speed of a scalar thread on the Xeon host CPU. This will change for the next-gen KNL. Where to run your scalar code depends on whether your app also has heavy vectorization (and is parallelizable), together with how much communication you require between the scalar and vector code.

Jim Dempsey
