Impact of disabling hyperthreading in Linux

SB17 · ‎10-18-2013

I'm not sure this belongs in this forum, but the forum's title is performance optimization.

Hyperthreading significantly reduces performance of compute-intensive applications (for example, Linpack). Hyperthreading can be disabled in the BIOS, but that is not always convenient.

Hyperthreading can also be disabled from within Linux by taking the extra hyperthreading logical CPUs offline (through sysfs or through Linux kernel boot options).
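The run-time approach mentioned above (taking the extra hyperthreading logical CPUs offline) first needs to know which logical CPUs share a core. Below is a minimal Python sketch, assuming a Linux system that exposes `/sys/devices/system/cpu/cpuN/topology/thread_siblings_list`. The function names are hypothetical, and the policy of keeping the lowest-numbered logical CPU of each core is an assumption made for illustration, not something stated in this thread.

```python
import glob
import re


def parse_cpu_list(text):
    """Expand a sysfs CPU list such as "0,16" or "0-1,8" into sorted ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def siblings_to_offline(sibling_lists):
    """Given {cpu: thread_siblings_list text}, keep the lowest-numbered
    logical CPU of each core and return all the others, sorted."""
    offline = set()
    for text in sibling_lists.values():
        offline.update(parse_cpu_list(text)[1:])
    return sorted(offline)


def read_sibling_lists(root="/sys/devices/system/cpu"):
    """Read thread_siblings_list for every CPU that exposes a topology
    directory; returns an empty dict if the files are not present."""
    result = {}
    for path in glob.glob(root + "/cpu[0-9]*/topology/thread_siblings_list"):
        cpu = int(re.search(r"cpu(\d+)/topology", path).group(1))
        with open(path) as f:
            result[cpu] = f.read()
    return result
```

Calling `siblings_to_offline(read_sibling_lists())` returns the logical CPU numbers that could then be taken offline as root by writing `0` to `/sys/devices/system/cpu/cpuN/online`. The sketch only computes the list and does not change any system state.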

Given how hyperthreading works, does hyperthreading disabled from within Linux still have an impact on HPC applications (such as Linpack or something else)?

Sorry if this is off-topic.

SB17 · ‎10-18-2013

CPU - Sandy Bridge

Patrick_F_Intel1 · ‎10-18-2013

Hello Black,

I think you are asking: Is disabling one of the HT cpus on a core at run-time (after booting the OS) the same as disabling HT in the BIOS (before booting the OS)?

The answer to this question is: it depends, and it probably varies from one CPU generation to the next (such as Core 2 to Nehalem to Sandy Bridge, and so on).

Why does disabling HT help linpack? Well... when HT is enabled, some parts of the core are duplicated for the logical cpu (such as registers) but most of the cpu is not duplicated. In particular, the floating point unit is not duplicated. With linpack, there isn't any 'extra work' the extra HT threads can be doing, the threads are all using the same unshared floating point unit. So you can get better performance by disabling HT.

I asked the above question years ago to the cpu architects and the answer was... it depends. When a cpu is halted by the OS (or whatever process you are using to turn off one of the HT threads on the core) some split resources become available for the remaining thread, some do not. You might be curious to know exactly what happens but that info won't be disclosed.

So we don't know the exact answer to: is shutting down 1 HT thread after booting the same as disabling HT in bios.

In any case, the thing that really matters is: can you tell the difference with a real world application?

I don't know the answer to this question. You will have to try it and see.

Pat

TimP · ‎10-18-2013

I don't see your specific question.

As you maybe implied, the Intel MKL library, as used in HPL benchmarking, automatically uses just 1 logical per core when HyperThreading is detected. It does this on a per shell basis, as may be done (for an entire application) by setting affinity mask (e.g. KMP_AFFINITY=compact,1,1).

Building applications without optimizations to suppress repeated divisions (gcc -freciprocal-math, icc -no-prec-div -no-prec-sqrt) may give rise to situations where HT improves performance. This would not be as pronounced on Ivy Bridge as on Sandy Bridge CPUs, due to reduced latency of those operations on the newer CPU.

It's not unusual to see a shared system set up with HT disabled, on the assumption that only a minority of the jobs could gain performance per watt by using HT.

As to the question of whether setting affinity and the number of threads to limit each core to 1 thread is equivalent to disabling HT entirely, one of the more common suspects for differences may be the effective size of the instruction TLB. Most other important resources are shared dynamically on recent CPUs, so as long as only one thread is running it can use the entire cache and fill buffer set, .....

McCalpinJohn · ‎10-18-2013

I would like to repeat & re-emphasize part of Patrick Fay's answer:

"When a cpu is halted by the OS (or whatever process you are using to turn off one of the HT threads on the core) some split resources become available for the remaining thread, some do not."

I worked through many of the details of these issues in the development of IBM's POWER5 processor, and it is very hard to come up with a concise answer. In general, running one thread on a processor with multi-threading enabled is not the same as running one thread on the same processor with multi-threading disabled, because some resources are statically allocated to the extra thread(s) when multi-threading is enabled. Whether the absence of access to these statically allocated resources hurts the performance of your code depends on lots of details of both the implementation (that the vendors don't usually want to discuss) and the application (which the vendor probably does not have time to analyze in detail).

Of course over time the implementations are being tuned and tweaked so that the performance penalty of enabled but unused thread contexts is generally decreasing, but that is an average statement over many workloads and counter-examples are almost certain to exist.

At TACC, HyperThreading is disabled on all of our production systems. Our experience has been that although HyperThreading provides some improvement in throughput for many applications, it also provides a mechanism by which the OS can severely degrade throughput by making poor process placement decisions. I don't think that we have enough quantitative data to know whether the overall workload throughput would increase or decrease with HyperThreading enabled, but we certainly know that the performance degradations would be a nightmare for all of us with user support responsibilities.

On the other hand, when I was at IBM on the POWER5 design team, we ran almost all of our benchmarks with HyperThreading enabled because we knew how to precisely control the system to prevent the OS from making scheduling mistakes. A quick check of the SPECfp2006 rate results shows that for the more than 1100 submissions for Xeon E5 processors that support HyperThreading, there appears to be only 1 result that does not have HyperThreading enabled --- and that entry is clearly a mistake! (The detailed results show that they are running 2 copies of each test code on each core and getting the same performance on each benchmark as they got for submissions with the same processor and same software that are labelled as running with HyperThreading enabled).

So there is no information in the published SPEC benchmarks about the usefulness of HyperThreading (for current processors and current compilers), which means that you are left to test the performance differences for yourself. That is not a bad thing, but it would be nice if the vendors published SPEC results with a wider spread of configurations (auto-parallelization on vs off, turbo mode on vs off, HyperThreading on vs off, etc.)

SB17 · ‎10-18-2013

Thank you very much for your interesting answers.

Our organization's experience shows that hyperthreading reduces performance (double-precision floating point), so our overall decision is to turn hyperthreading off.

Hello Pat
It is difficult to say what the real applications will be. We expect a mixed application workload. Preference is given to compute-intensive applications using double-precision floating point (IEEE 754).

Thanks TimP
Apparently what I observed was the effect of the TLB (or something else).

I built Linpack with OpenBLAS for one node (2 CPUs) and tested several configurations:
1) hyperthreading turned on in the BIOS
2) hyperthreading turned off in the BIOS
3) hyperthreading turned on in the BIOS with a limited number of visible cores (in grub: "kernel /boot/vmlinuz-2.6.13-Ora10g root=/dev/sda1 ro maxcpus=16"; "echo 0 > /sys/devices/system/cpu/cpuX/online" also works, but the first option is more elegant)
4) hyperthreading turned on in the BIOS, Linpack run with CPU binding to physical cores ("numactl --physcpubind --localalloc", using cores with different "core id" values from /proc/cpuinfo)

Results:
Performance 2) = 3)
Performance 2) and 3) is considerably greater than 1)
Performance 4) is approximately equal to 1)
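For configuration 4, the list of logical CPUs with distinct core ids can be derived from /proc/cpuinfo rather than picked by hand. The following is a rough Python sketch, assuming the usual x86 /proc/cpuinfo layout with "processor", "physical id" and "core id" fields. The function names and the `./xhpl` binary name are hypothetical, and this is not necessarily how the binding was done in the tests above.

```python
def one_cpu_per_core(cpuinfo_text):
    """Return one logical CPU number per (physical id, core id) pair,
    keeping the first logical CPU listed for each physical core."""
    chosen = {}
    fields = {}
    # The trailing "" makes sure the last record is flushed as well.
    for line in cpuinfo_text.splitlines() + [""]:
        if line.strip():
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
        elif fields:
            if "processor" in fields:
                core = (fields.get("physical id"), fields.get("core id"))
                chosen.setdefault(core, int(fields["processor"]))
            fields = {}
    return sorted(chosen.values())


def numactl_command(cpus, program):
    """Build an argument list that binds `program` to the given logical
    CPUs and allocates memory on the local NUMA node."""
    cpu_arg = ",".join(str(c) for c in cpus)
    return ["numactl", "--physcpubind=" + cpu_arg, "--localalloc"] + list(program)
```

With the contents of /proc/cpuinfo read into a string `text`, `numactl_command(one_cpu_per_core(text), ["./xhpl"])` gives an argument list that could be passed to `subprocess.run`. The number of threads the BLAS library starts is a separate setting that binding alone does not change.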

Thus, for our overall solution it is easier to turn hyperthreading off.

However, as you noted, some tasks can perform better with hyperthreading, but switching it on and off in the BIOS for each task is a little tedious :) It is much more pleasant to do everything remotely from the console and scripts.

From the results so far it is clear that the current Linpack build sees no difference between hyperthreading turned off in the BIOS and limiting the number of visible cores in Linux (maxcpus=16).

However, this does not mean that no difference would be seen in other applications. That is why I was interested in your opinion.

P.S.
Today I learned that Intel HPL gives a slightly better result, so there is room for further study of the problem.

Thank you for your attention to my problem

Sergey

SB17 · ‎10-18-2013

Thanks, "Dr. Bandwidth".

Maybe a stupid question: how do you turn off hyperthreading? Manually in the BIOS, or with some utility?

SB17 · ‎10-18-2013

With hyperthreading it's clear to me: it should be switched off.

While I'm asking such questions: how do you feel about Turbo Boost in clusters? There is an opinion that Turbo Boost causes load imbalance between processes (showing up at MPI barriers, synchronizations, etc.), and as a result the Linpack score gets worse.

What do you think about this?

Bernard · ‎10-18-2013

It should be supported by the BIOS. It is up to the BIOS vendor to enable this option.

Bernard · ‎10-18-2013

HT should be helpful in situations where two threads are issuing non-interdependent instructions which will not saturate the execution units of the same port. Btw, I suppose that execution units like the adder or multiplier can be pipelined, and the latency of some arithmetic instructions in this case can depend on the time in cycles needed to process an operation like fadd or fmul.

McCalpinJohn · ‎10-18-2013

Enabling or disabling HyperThreading has to be done by the BIOS. I suppose it is not inconceivable that one could build hardware that could be switched "live", but it would be a lot of additional work to verify that the switchover was done cleanly. If the difference between 1-thread performance with HyperThreading enabled and disabled is small, then there is not much justification for this extra work.

We do run with Intel Turbo Boost enabled on our production systems. We definitely see an overall speedup with Turbo enabled, but part of the reason might be that our systems are very well cooled, so they are (essentially) always running with all cores at the maximum allowed frequency. For the TACC Stampede system, that is 3.1 GHz on the nominal 2.7 GHz Intel Xeon E5-2680 processors. (For the times when jobs are not running, we switch the cpufreq governor to allow the cores to slow down to their minimum 1.2 GHz to save power.)

I could imagine that a system running with less aggressive cooling (and particularly with more variable cooling) might have more CPU frequency variability, but as long as the processors are running at least as fast as the nominal frequency the variability should not hurt performance very often. (There are always some perverse cases where speeding something up slows the whole process down, but those tend to be rare.)

Higher turbo boost frequencies (up to 3.5 GHz) are possible if fewer cores are "in use", but it looks like this means that the "unused" cores have to be in a C-state of C3 or higher. We limit the processors to C1 so that we don't have trouble with interrupt latency -- in our configuration, waking up a processor core from C1 takes 3 microseconds, while waking it up from C3 takes 96 microseconds. A latency of 96 microseconds is too long for effective communication between the Xeon E5 and Xeon Phi processors and is too long for effective communication over InfiniBand. Since Intel enables these "deep" C states by default, it is safe to assume that this latency (or even higher values) is fine in many environments. It is possible that this problem could be overcome by smarter control over interrupt "steering", but that is getting into parts of the software stack that I really don't know anything about.
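To see which idle states on a given node exceed a wake-up latency budget, the values the kernel advertises can be read from sysfs. This is a small Python sketch, assuming a Linux cpuidle driver that exposes `name` and `latency` (exit latency in microseconds) under `/sys/devices/system/cpu/cpuN/cpuidle/stateM/`. The function names and the 10 microsecond budget in the usage line are hypothetical, and the advertised values may differ from measured ones such as the 3 and 96 microseconds quoted above.

```python
import glob
import os


def read_idle_states(cpu=0, root="/sys/devices/system/cpu"):
    """Return [(state name, exit latency in microseconds)] for one CPU,
    or an empty list if the cpuidle directory is not present."""
    states = []
    for d in sorted(glob.glob(f"{root}/cpu{cpu}/cpuidle/state[0-9]*")):
        with open(os.path.join(d, "name")) as f:
            name = f.read().strip()
        with open(os.path.join(d, "latency")) as f:
            latency_us = int(f.read())
        states.append((name, latency_us))
    return states


def states_over_budget(states, budget_us):
    """Names of the idle states whose exit latency exceeds the budget."""
    return [name for name, latency_us in states if latency_us > budget_us]
```

For example, `states_over_budget(read_idle_states(), 10)` lists the states that a latency-sensitive node might want to avoid. Each state directory also has a `disable` file that root can write `1` to, though whether that is the mechanism used on the systems described above is not stated here.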

SB17 · ‎10-18-2013

Thanks for the detailed answers

Cole__Allen · ‎01-11-2018

Thank you for the detailed answer. I was looking for a solution too. I was a bit confused about hyperthreading on my device and wanted to know whether it will harm my PCB or the SMT assembly. The BIOS will help me with hyperthreading as well.