
Sequential Performance on the Xeon Phi

Ashkan_T_
Beginner

Hi, I have been running different benchmarks on the Xeon Phi. In comparison with an E5-2620 Xeon processor running at 2.00 GHz, I noticed a large difference in sequential performance (almost 10x, depending on the case).

Can we conclude that the Xeon Phi always shares the frequency between hardware threads, even for sequential code? In other words, is the 1.053 GHz clock frequency effectively divided by 4 (if the core switches between threads in a round-robin fashion)?

If that is true, would it be possible to take advantage of a core's full frequency at all?

jimdempseyatthecove
Honored Contributor III

Rough description: the core on the Knights Corner processor is in-order. The hardware threads within each core compete for core resources in a round-robin manner, only amongst those threads seeking core resources. Because the execution phase takes non-zero time (combined with in-order execution on KNC), a single thread within a core cannot attain the maximum throughput for that core. Attaining maximum throughput for a core on KNC requires the use of two or more threads. The optimal number will vary with the application; it is almost never 1 thread per core.

The KNC is not designed to run single-threaded scalar code. It is designed to run many-core, wide-SIMD vector programs. A "traditional" scaling performance test is therefore not applicable. A KNC scaling benchmark should generally consist of three tests (a sketch of such a sweep follows the list):

All cores (less 1 for offload), 2 threads/core
All cores (less 1 for offload), 3 threads/core
All cores (less 1 for offload), 4 threads/core
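As a concrete illustration, here is a minimal sketch of how such a sweep might be driven, assuming the Intel OpenMP runtime's KMP_PLACE_THREADS control (which existed for KNC); the 60c figure assumes a 61-core card with one core left for the OS/offload daemon, and the kernel is a placeholder, not a tuned benchmark:

```c
/* scaling_sweep.c -- hypothetical skeleton for the three-test sweep above.
 * One possible native KNC build:
 *   icc -mmic -qopenmp -O2 scaling_sweep.c -o scaling_sweep
 * Run on the card with, e.g.:
 *   KMP_AFFINITY=compact KMP_PLACE_THREADS=60c,2t ./scaling_sweep
 *   KMP_AFFINITY=compact KMP_PLACE_THREADS=60c,3t ./scaling_sweep
 *   KMP_AFFINITY=compact KMP_PLACE_THREADS=60c,4t ./scaling_sweep
 */
#include <stdio.h>
#include <omp.h>

#define N (1 << 24)
static float a[N], b[N], c[N];

int main(void)
{
    for (int i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    double t0 = omp_get_wtime();
    /* Placeholder kernel: wide and vectorizable, so it exercises the VPU
       on every thread that the KMP_PLACE_THREADS setting exposes. */
    #pragma omp parallel for simd
    for (int i = 0; i < N; i++)
        c[i] = a[i] * b[i] + c[i];
    double t1 = omp_get_wtime();

    printf("threads=%d  time=%.6f s\n", omp_get_max_threads(), t1 - t0);
    return 0;
}
```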

There are some cases where it may be beneficial to test with fewer than all cores. But generally, if the application cannot make use of all cores (in some combination of the thread counts above), then it may not be suitable for KNC.

You should be able to search for a more detailed description of this behavior.

Jim Dempsey

TimP
Honored Contributor III

The observation that single-thread, non-vector performance on KNC is about 10% of the host is well in line with expectations. Thus, you have no reason to disbelieve what is written in the architecture guides.

As Jim said, full performance on MIC requires making use of at least 2 threads per core, and all but 2 cores, with SIMD vectorization.  If your application depends on VPU, the VPU can take an operand from a given thread usually every other cycle, occasionally every third cycle, in the absence of much competition from other threads.

With single-thread scalar code and 32-bit data, you would usually be using less than 0.1% of the capability of a MIC CPU, versus perhaps 5% of a typical laptop CPU, so it is clear that MIC is specialized to be effective only with fully parallel applications.

Ashkan_T_
Beginner

Dear Jim,

Thanks very much for your description.

In terms of utilizing the Xeon Phi, I agree with you about using more than 1 thread per core for parallel programming. But the reason for asking this question was that, in a comparison with another parallel platform, we realized that although the speedup results on the Xeon Phi are much better, the absolute performance results are almost the same. For that reason, we decided to investigate the differences in sequential runtime as well.

What I understand from your sentence "The hardware threads within each core compete for core resources in a round-robin manner only amongst those threads seeking core resources" is that if one runs single-threaded code, there should be no competition, and that thread should be able to reach at least near 1 GHz. What we get as a result is closer to 250 MHz than to 1 GHz. I don't think in-order execution alone can have such a huge impact.

Ashkan_T_
Beginner

Dear Tim,

Thank you!

Here is another example: a linked list where, at each node, we sort an array of integers. Sequential performance for lists of 100K to 1M nodes and arrays of 1000 elements is 1.6x better on the TILEPro64 than on the Xeon Phi. I can't figure out why.

The Xeon Phi's L2 caches are 8x larger than those of the TILEPro64, and the TILEPro64 runs at 866 MHz. Do you think the 3-way VLIW architecture of the TILEPro64 could be the reason in this case?

McCalpinJohn
Honored Contributor III

The Xeon Phi documentation makes it very clear that a single thread cannot issue instructions in two consecutive cycles, so a core must be running at least 2 threads in order to execute instructions every cycle. This information is not difficult to find: I did a search on "introduction to xeon phi" and the first result was a link to a nice presentation, https://software.intel.com/sites/default/files/Slides-Intel-Xeon-Phi-coprocessor-hardware-and-software-architecture-introduction.pdf. The instruction issue limit is described quite clearly on slide 6 (as well as in most other descriptions of the Xeon Phi processor).

The Xeon Phi core can execute one or two instructions per cycle, of which only one can be a vector instruction.  It is a bit more difficult to find detailed information on which instructions can dual-issue, but the limit of one vector instruction per cycle is typically the most important part for CPU-bound workloads.  (Memory-bound workloads will have lots of stall cycles, so limits on instruction issue rates will be irrelevant.)

So your Xeon E5-2620 processor core can execute up to 4 instructions per cycle at 2 GHz, while a single thread on the Xeon Phi can execute up to 2 instructions every 2 cycles at 1 GHz. That gives an 8:1 ratio for peak instruction issue. This is only rarely an appropriate comparison, but it shows that a 10:1 ratio is not unreasonable.
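In code form, John's peak-issue arithmetic works out as follows (a back-of-envelope check only, using the 1.053 GHz figure from the original question):

```c
#include <stdio.h>

int main(void)
{
    /* Xeon E5-2620: up to 4 instructions per cycle at 2.0 GHz. */
    double host_peak = 4.0 * 2.0e9;                 /* 8.0e9 instr/s */

    /* One KNC thread: up to 2 instructions every 2 cycles (a thread
       cannot issue in consecutive cycles) at 1.053 GHz. */
    double phi_thread_peak = (2.0 / 2.0) * 1.053e9; /* ~1.05e9 instr/s */

    printf("peak issue ratio: %.1f : 1\n", host_peak / phi_thread_peak);
    /* Prints "peak issue ratio: 7.6 : 1", in line with the ~8:1 above. */
    return 0;
}
```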

There are similarly large differences in some aspects of memory performance.  The Xeon E5-2620 has a large L3 cache with a latency of about 40 cycles (20 ns at 2.0 GHz), which is well over 10x faster than the ~275ns latency to memory on the Xeon Phi (which has no L3 cache).   If the data is not in the L3 of the Xeon E5-2620, its L2 Hardware Prefetchers are much more aggressive than the L2 Hardware Prefetchers on the Xeon Phi, which can make the ~3x difference in memory latency turn into an even larger performance difference.
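Latencies like the 20 ns and ~275 ns figures above are typically measured with a dependent pointer chase. A minimal sketch, assuming a buffer large enough to miss every cache level (a fixed odd stride stands in for the randomized permutation a robust test would use, since simple strides can be caught by prefetchers):

```c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>   /* only for omp_get_wtime() */

#define N ((size_t)1 << 26)   /* 512 MB of size_t: larger than any cache */

int main(void)
{
    size_t *next = malloc(N * sizeof *next);
    if (!next) return 1;

    /* Build one long cycle through the buffer. The stride is odd, so
       with N a power of two the cycle visits every element. */
    const size_t stride = 4099;
    for (size_t i = 0; i < N; i++)
        next[i] = (i + stride) % N;

    size_t p = 0;
    const size_t iters = 10000000;       /* 1e7 dependent loads */
    double t0 = omp_get_wtime();
    for (size_t i = 0; i < iters; i++)
        p = next[p];                     /* each load depends on the last */
    double t1 = omp_get_wtime();

    /* Printing p keeps the chase from being optimized away. */
    printf("avg load-to-use latency: %.1f ns (p=%zu)\n",
           (t1 - t0) / iters * 1e9, p);
    free(next);
    return 0;
}
```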

Of course all of these are statements about single-thread performance.  The Xeon Phi clearly targets multi-threaded workloads, and many optimizations that would improve single-thread performance would hurt overall chip performance (mostly by using too much power and/or area and thereby limiting either the number of cores on the chip or the frequency of the cores or both).

jimdempseyatthecove
Honored Contributor III

Ashkan,

My statement, combined with Tim's, rephrased here:

Due to the in-order core design, and because instructions (principally VPU) take longer than zero time, combined with the clock not being infinitely divisible, a thread's instruction takes at least one clock, plus some amount greater than zero, rounded up in whole clock intervals to that thread's next round-robin scheduled turn. This results in a single thread per core operating, at best, on every other clock cycle. With two threads per core, at best, one thread operates on every even tick and the other on every odd tick. The actual performance, and the optimal thread count, will vary due to memory and cache latencies. When a thread (in a core) is waiting on cache/RAM latency, it relinquishes its turn in the round-robin pecking order. Thus, should you be using 3 or 4 threads per core, the additional threads can advance a time slot in the round-robin pecking order.

In the unlikely event that the majority of an application's computation is 100% register based, and the floating-point instructions are all adds or subtracts (almost zero probability), then 2 threads per core would be optimal. If the application is a mix of register and L1 cache accesses and/or longer-executing instructions (multiply, divide, or trig), then more than 2 threads would be warranted. Generally, 3 threads per core are optimal, except where the code is written so that the threads within a core can take advantage of L2 cache data loaded by a different thread of that core, or where data has to be fetched from RAM. In these two cases, it may be beneficial to use 4 threads per core.

The preceding is a generalized description, meant to give you an overview in the coarsest of terms.

A linked list of scalar variables is among the least suitable uses of KNC.

Your problem needs more investigation and description.

If your problem statement is that you have, say, 100K nodes in a linked list, each node containing an array of 1000 integers, then you could write a vectorized version of Quicksort to produce 16 lanes of sorted values, which are then merged to 8 lanes, then to 4 lanes, then to 2 lanes, then to 1 lane. (Note that the 8-lane stage can perform two 8-lane merges at a time, the 4-lane stage four 4-lane merges at a time, and so on.) Some experimentation will have to be done to get the best timing.

The whole point of the above is to get you to think in terms of vectors, and to frame the problem such that the whole width of the vector is put to use most of the time.
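To make the shape of that approach concrete, here is a scalar C sketch of the lane-sort-then-merge structure for one node's 1000-integer array (hypothetical and unvectorized: qsort stands in for the 16-wide vectorized sort, and the pairwise merge rounds are the steps that would map onto KNC's 512-bit VPU):

```c
#include <stdlib.h>
#include <string.h>

#define LANES 16

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Merge the sorted runs src[lo..mid) and src[mid..hi) into dst[lo..hi). */
static void merge(const int *src, int *dst, size_t lo, size_t mid, size_t hi)
{
    size_t i = lo, j = mid, k = lo;
    while (i < mid && j < hi)
        dst[k++] = (src[i] <= src[j]) ? src[i++] : src[j++];
    while (i < mid) dst[k++] = src[i++];
    while (j < hi)  dst[k++] = src[j++];
}

/* Sort one node's array: 16 independent "lanes" are sorted first (the
 * part a vectorized Quicksort would do 16-wide), then merged pairwise
 * 16 -> 8 -> 4 -> 2 -> 1. */
void lane_sort(int *a, size_t n)
{
    int *tmp = malloc(n * sizeof *tmp);
    if (!tmp) return;
    size_t chunk = (n + LANES - 1) / LANES;   /* lane length; 63 for n=1000 */

    for (size_t lo = 0; lo < n; lo += chunk) {          /* sort each lane */
        size_t len = (lo + chunk <= n) ? chunk : n - lo;
        qsort(a + lo, len, sizeof *a, cmp_int);
    }

    int *src = a, *dst = tmp;
    for (size_t width = chunk; width < n; width *= 2) { /* 4 merge rounds */
        for (size_t lo = 0; lo < n; lo += 2 * width) {
            size_t mid = (lo + width < n) ? lo + width : n;
            size_t hi  = (lo + 2 * width < n) ? lo + 2 * width : n;
            merge(src, dst, lo, mid, hi);
        }
        int *t = src; src = dst; dst = t;               /* swap buffers */
    }
    if (src != a)
        memcpy(a, src, n * sizeof *a);
    free(tmp);
}
```

Calling lane_sort(node->values, 1000) on each list node keeps the per-node work vector-shaped, even though the list traversal itself stays scalar.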

Jim Dempsey

Ashkan_T_
Beginner

Thank you all.

John and Jim, perfect answers, thank you! Unfortunately, I can flag only one of them as the best reply!

@Jim, just a comment on the last paragraph: we deliberately wanted to compare the platforms in different scenarios, with and without vectorization, to highlight their differences. I agree that such a benchmark cannot utilize the Xeon Phi at all.

jimdempseyatthecove
Honored Contributor III

Ashkan,

This is not addressed to you personally, so do not take it personally...

Notwithstanding tests to determine the basic characteristics of the KNC Xeon Phi, principally to experience for yourself what is expressed in the literature, the only scaling tests that make sense are those that determine the number of KNC Xeon Phis to use within a single box, and optionally in a cluster. Bear in mind that an SMP-hosted multiple-KNC setup is markedly different from an MPI-structured one (distributed across multiple host nodes), as is the programming you would use on each. While one can run MPI and coarrays on a single SMP system, that is not necessarily the most suitable way of programming for an SMP system.

What I am trying to say is that "How does it scale per hardware thread?" or "How does it scale per core?" is not a material test for KNC. The proper scaling question is "How does it scale per KNC CPU?" Then, for a specific application: "What is the aggregate performance difference when using 2, 3, and 4 threads per core?"

Jim Dempsey

TimP
Honored Contributor III

Several Intel publications advise applying parallel and vector optimization, including evaluation of vector vs. non-vector performance and threaded scaling, to the host version of an application before trying it on Intel(r) Xeon Phi(tm). One point made is that, due to the commonality of hardware architecture and compiler implementations, most of the same optimization efforts should be effective on both sides. Another, relevant to the earlier parts of this thread, is that there is no point in spending time on MIC with applications which aren't amenable to such optimization on the host.

On only a couple of minor points, optimizations which require specific source code modifications on the host side may happen automatically on Xeon Phi (with normal compiler flag settings):

1) VECTOR ALWAYS or equivalent pragmas to override "protects against exception" in vectorization of conditionals requiring speculative evaluation. I sometimes wonder: if -fp-model strict is required to enable handling of floating-point exceptions (e.g. Fortran IEEE_arithmetic), why would the compiler make this distinction between MIC and host at the default -fp-model fast?

2) Vector misalignment, where an array section is stored with one alignment and read back with another, requiring merging/blending of lanes just updated in registers with other lanes from cache.

In those cases, if the specific optimizations aren't applied on the host side, vectorized single-thread execution may perform as well on KNC as on the host. A concrete illustration of point 1 follows.
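Here is a hypothetical loop of the kind point 1 means. At default options on the host, the Intel compiler may report that the condition "protects exception" and decline to vectorize, because vectorizing evaluates sqrtf speculatively for lanes where the condition is false; the pragma overrides that check:

```c
#include <math.h>

void masked_sqrt(float *restrict y, const float *restrict x, int n)
{
    /* Intel C/C++ pragma: override the vectorizer's exception-safety
       heuristic for this conditional loop. */
    #pragma vector always
    for (int i = 0; i < n; i++)
        if (x[i] > 0.0f)
            y[i] = sqrtf(x[i]);
}
```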
