Software Archive
Read-only legacy content

Disable Turbo Boost and SMT for Xeon Phi

Judicael_Z_
Beginner
2,171 Views

Hello,

Is it possible to disable Turbo Boost and SMT for Xeon Phi?

11 Replies
Loc_N_Intel
Employee

The /usr/bin/micsmc tool lets you enable or disable turbo mode. Be aware that only certain coprocessor models (not all) support this. You can run the following command to find out whether your coprocessor supports the turbo setting:

# micsmc --turbo

jimdempseyatthecove
Honored Contributor III

You would not want to disable SMT. Xeon Phi is an in-order processor, so a single hardware thread incurs inter-stage stalls in the core's pipeline. Running a second thread on a core therefore comes at little expense, and for many applications with high memory traffic, three threads per core are significantly better than two. On occasion, four are better than three.

With the right environment variables you can easily restrict your (OpenMP) processing to 1, 2, 3, or 4 threads per core.
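For instance, a sketch using the Intel OpenMP runtime's variables (the 60-core count here is an assumption; adjust it for your coprocessor model):

```shell
# Restrict an OpenMP run to 2 threads per core on an assumed 60-core coprocessor.
# KMP_PLACE_THREADS takes "<cores>c,<threads-per-core>t" (Intel compiler runtime).
export KMP_PLACE_THREADS=60c,2t
export OMP_NUM_THREADS=$((60 * 2))
echo "places=$KMP_PLACE_THREADS threads=$OMP_NUM_THREADS"
```

Swapping 2t for 1t, 3t, or 4t gives the other per-core configurations.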

Jim Dempsey

JJK
New Contributor III

I fully agree with Jim Dempsey. However, if one really wanted to disable SMT, you could take logical CPUs offline by writing 0 to /sys/devices/system/cpu/cpuNN/online.

You can find out which logical CPU belongs to which core by looking at /sys/devices/system/cpu/cpuNN/topology/thread_siblings_list, e.g.

$  cat /sys/devices/system/cpu/cpu96/topology/thread_siblings_list
93-96
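Building on that, a minimal sketch (the helper name is mine, and it handles only the "first-last" range form shown above; the actual sysfs write needs root, so it is left commented out):

```shell
# Expand a thread_siblings_list entry such as "93-96" into individual CPU ids,
# then every sibling except the first could be taken offline via sysfs.
expand_siblings() {
  seq "${1%-*}" "${1#*-}"   # "93-96" -> 93 94 95 96, one per line
}

for cpu in $(expand_siblings "93-96" | tail -n +2); do
  echo "would offline cpu$cpu"
  # echo 0 > /sys/devices/system/cpu/cpu$cpu/online   # requires root
done
```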

TaylorIoTKidd
New Contributor I

Assuming you are referring to each core running 4 HW threads: you cannot disable SMT as you can on the big-core processors. Though it looks like hyperthreading/SMT, it is a different beast.

I recommend you use OpenMP KMP affinities or the like to distribute your threads. The advantage of this technique is that it is architecture-independent and should be more robust across architectures.

Why would you want to have only one thread running per core? Have you looked at the KMP affinity types "compact" and "scatter"?
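For reference, these affinity types are set via KMP_AFFINITY in the Intel OpenMP runtime; a sketch ("verbose" just makes the runtime print the chosen binding at startup):

```shell
# "compact" packs threads onto as few cores as possible;
# "scatter" spreads them round-robin, one per core first.
export KMP_AFFINITY=verbose,scatter
echo "affinity=$KMP_AFFINITY"
```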

Regards
--
Taylor
Judicael_Z_
Beginner

Hello all,

Thanks a lot for the quick replies; and sorry for getting back with some latency.

Loc, thanks for the micsmc tip; funny, I used micsmc for other matters a few days ago :). I'm freshly reading the guides and my "cache" is still being built :)

Jan Just, thank you. So, setting logical cores offline is the only way.

Jim and Taylor, thanks a lot for confirming something I've only heard of without any sound reading about it yet. All my questions are basically answered at this point ... but new ones came up (further below)

I did read about KMP affinity in the quick-start developer guide, but I haven't played with it. Setting cores offline is in my plans, but not KMP affinity or other environment variables, because I would like to rule out any interference created by anything (e.g. a kworker) running on a logical sibling. We're trying to observe jitter as well.

OK Jim, I now perfectly understand why multiple HT would be beneficial. The PowerPC on Blue Gene/Q has a similar behaviour; for instance, it is apparently impossible to reach the maximum memory bandwidth with fewer than 2 HT per core.

Taylor, could you please give some explanation, or point me to something to read, to understand why Xeon Phi SMT is different from hyperthreading as we know it on Xeon, for instance? I'm in HPC, and the rule of thumb is to disable hyperthreading because it is "usually" not beneficial. One reason hyperthreading is undesirable is that it tends to generate unpredictable performance measurements, and, as much as possible, repeatable, predictable measurements are desired (that's why Turbo Boost was being disabled as well). I'm trying to run tests without SMT on Xeon Phi to make observations. Many thanks for your time.

jimdempseyatthecove
Honored Contributor III

Judicael,

The only way to get consistent performance figures is to turn the power off. Then you consistently get 0 performance.

Disabling HT (and O/S threads for that matter) for the sole purpose of attaining a linear scaling chart is foolhardy (IMHO). A production system would use the number of threads necessary to attain the best performance based on either time or time/power.

One of the main reasons why hyperthreading is "undesirable" is that the programmer does not go to the effort of making it productive. Combine this with hyperthreading messing up their scaling charts and, not wanting to show what appear to be deficiencies in their code, they would rather disable HT to hide the fact. Well-behaving but poorly performing code is still poorly performing code.

If I were a system administrator in charge of a production system that could get 5% more efficiency on the equipment I have simply by enabling HT, I would do so. In many cases HT can return much better than 5%.

Jim Dempsey

Judicael_Z_
Beginner

Jim, you're totally right; especially because of the way I presented the hyperthreading issue.

Things can be complex enough that good programming alone cannot do the right tuning, though. For instance, the HPC runtime (e.g. the MPI middleware) and the HPC application are usually not written by the same person, and there is not enough room for application-level tuning to avoid cache thrashing when an asynchronous middleware thread wakes up once in a while to do some obscure stuff. A well-tuned middleware written by X plus a well-tuned application written by Y do not always yield a well-tuned binary, even if the middleware is extremely well documented.

The predictability issue is also not just about nice charts, although that does matter :). HPC is concerned with performance, so 5% performance-increase claims are easily publication-worthy, meaning that many people would never throw hyperthreading away for lack of predictability if they can get a 5% extra speedup. Another problem that I did not mention is the lack of synchronicity; it can be very detrimental to applications that do a lot of collective communications. This paper discusses the issue: Pete Beckman, Kamil Iskra, Kazutomo Yoshii, and Susan Coghlan, "Operating system issues for petascale systems," ACM SIGOPS Operating Systems Review 40(2), 29-33. The paper is not about hyperthreading but about variability. Variability can give you outright negative speedup, and for the programmer, the assumption of reasonable predictability can dictate both the algorithms and the optimizations when it comes to communication orchestration.

TaylorIoTKidd
New Contributor I

I believe the short answer is that 4 threads are optimal to fully utilize the pipeline and maximize the computational bandwidth (i.e. the number of floating-point operations performed per cycle). Executing scalar code, even if effectively using the 512-bit VPU, isn't going to give you competitive performance, given that the core is an enhanced Pentium-generation CPU.

You can find a longer and more informative answer in Reza Rahman's article, Intel(r) Xeon Phi(tm) Core Micro-architecture.

Regards
--
Taylor
Judicael_Z_
Beginner

Hello Taylor,

Many thanks for the explanation and the link; very useful!

--Judi

James_C_Intel2
Employee

If your concern is really about getting "good" scaling charts as you increase the number of threads, then the fundamental issue is that you're using the wrong value on the X-axis. Rather than plotting performance vs "number of threads", you should be plotting it against "number of cores", and showing four different lines (1T/C, 2T/C, 3T/C, 4T/C).

If you just plot against "number of threads", you confuse things. Is your "60 thread" point for 15Cx4T, 20Cx3T, 30Cx2T or 60Cx1T? They can be expected to have very different performance!

If you are using OpenMP, then KMP_PLACE_THREADS should be your friend. (See https://software.intel.com/en-us/blogs/2013/02/15/new-kmp-place-threads-openmp-affinity-variable-in-update-2-compiler for instance).
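A sketch of such a sweep driven by KMP_PLACE_THREADS (the core counts and ./app are placeholders, assuming a 60-core part):

```shell
# One configuration per line: vary threads-per-core and core count independently.
for t in 1 2 3 4; do                  # threads per core
  for c in 15 30 60; do               # number of cores (placeholder values)
    export KMP_PLACE_THREADS="${c}c,${t}t"
    export OMP_NUM_THREADS=$((c * t))
    echo "cores=$c tpc=$t total_threads=$OMP_NUM_THREADS"
    # ./app > "result_${c}c_${t}t.log"   # your OpenMP binary goes here
  done
done
```

Each log then contributes one point to the line for its threads-per-core value.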

(This aside from the whole "Speedup is evil, parallel efficiency is good" rant :-) http://en.wikipedia.org/wiki/Speedup )

Judicael_Z_
Beginner

Hello James,

Good advice! Thanks a lot. I'm actually trying to observe different things, such as jitter, the impact of resource partitioning on scalability, and the overhead (if any) of resource partitioning. We're working on node resource partitioning and OS specialization over these partitions. Think vanilla Linux with cgroups, a resource controller, and some hacking in the kernel to make it lean on select partitions meant to host the HPC runtime with little interference.

So what I have on the X-axis is a node configuration (we call it a setup). For all the setups, the number of cores and hardware threads per core is kept fixed, but we vary the partitioning as well as which aspect of the OS runs on each partition. So when running on a regular Xeon (which I already did), my concern was mistakenly attributing some variation to a setup change when it was in fact being impacted by hyperthreading, for instance. Basically, I wanted "deterministic" CPU behaviour (as much as possible) so as to conclude that the variations I observe are due to setup changes. This said, it is now clear to me that whatever I do, I'll have to repeat it for each of the four scenarios that you mentioned. The more observations we have, the better anyway :)
