Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Maximizing performance for turbo-boost and/or hyperthreaded processors

Daniel_Faken
Beginner
Hello all,

Can anyone give guidance on how to get the most performance for processors with these features?
- it seems like one should adjust the number of threads used by TBB (or Cilk) depending on the architecture of the machine & the nature of the code.

Let me give an example:
My colleague has a 2-core i7 processor with hyperthreading. The best-case timings he sees for a particular test case are approximately as follows:
1 thread - 41 sec
2 threads - 25 sec
3 threads - 24 sec
4 threads - 24 sec

Obviously (?) there is not much benefit from hyperthreading here - the parallel speedup is outweighed by the expense of further dividing up the work & merging (reducing) the results.

So it *seems to me* like the logical thing to do here would be to force TBB/Cilk to use the same number of threads as there are cores. (A similar situation is being discussed in a related thread: http://software.intel.com/en-us/forums/showthread.php?t=77703&o=d&s=lr)

Now, in practice, my colleague sees times that vary somewhat wildly, apparently because the clock speed is changing due to Turbo Boost (he has an i7-620M, with a 2.66 GHz base clock and a 3.33 GHz max turbo clock).
For instance, the single-thread time varies between 41 sec and 76 sec.

So to me this further reinforces that the logical thing would be to use only #threads = #cores (= 2 here), since this minimizes the amount of work being done, so the CPU heats up more slowly, which should give more consistent, higher clock speeds. As mentioned above, the extra work isn't helping anyway, so why not get rid of it to reduce the heating?

Can anyone give guidance on this? Are my conclusions misguided? It seems like this would be a general concern for many developers.

thanks,
Daniel

PS: Even less clear would be the situation where hyperthreading *does* provide a speedup, but the extra work throttles Turbo Boost so much that no net benefit is seen. Perhaps this is even the case here, but I can't really tell.
robert-reed
Valued Contributor II
It does appear that there's some constraint keeping the application you describe from scaling beyond two threads, but all in all this one run does not suggest that you need to change anything: while the 3- and 4-thread numbers are not significantly better, they're also not worse. So I suspect you have other examples that don't fare so well with 3 or 4 threads.

Intel TBB and Cilk do have mechanisms to adjust the size of the thread pool (for TBB, look at task_scheduler_init), but before exploring those options, I'd suggest you try to understand more completely what is going on in the application now. There are numerous performance tools, including Intel's VTune Analyzer, that you can use to understand the nature of the bottleneck. A hot-spot analysis would show where the work is being done. You might also try Intel Parallel Amplifier, which has some thread-analysis tools built in (VTune Analyzer also has one, Intel Thread Profiler, but it is a separate install).
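
Just for reference, a minimal sketch of what that looks like with task_scheduler_init (the "2" here is simply the physical-core count on your colleague's machine, and the parallel_for body is a stand-in, assuming a compiler with lambda support):

#include <tbb/task_scheduler_init.h>
#include <tbb/parallel_for.h>

int main() {
    // Cap the TBB worker pool at 2 threads (one per physical core on the
    // i7-620M); the default is one worker per hardware thread, HT included.
    tbb::task_scheduler_init init(2);

    // Any parallel work started while 'init' is alive uses that pool size.
    tbb::parallel_for(0, 1000, [](int leaf) {
        (void)leaf;  // stand-in for the real per-leaf work
    });
    return 0;
}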

Typically scaling is limited because of some resource issue: not enough memory, not enough bandwidth, saturated ALUs, lock contention, etc. Sometimes those limits are hit in a well balanced application and the only solution, if any, is a new algorithm. But sometimes there are easy solutions once the bottleneck is understood.
Daniel_Faken
Beginner
Hello Robert,

My problem is not that the application does not scale beyond two threads - it's that it doesn't appear to scale beyond the number of "cores".

These numbers were all obtained on the same system (the dual-core i7 mentioned previously), and the number of threads was adjusted by setting CILK_NWORKERS to the appropriate value (except for 1 thread, where a non-Cilk_spawning version was used). (I have a TBB version too)
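
(For completeness: the worker count can also be set from inside the program through the Cilk Plus runtime API rather than the environment; a minimal sketch, with the "2" again standing in for the physical-core count:)

#include <cilk/cilk_api.h>
#include <cstdio>

int main() {
    // Equivalent to setting CILK_NWORKERS=2 in the environment: limit the
    // Cilk Plus runtime to two workers. This has to happen before the
    // runtime starts, i.e. before the first cilk_spawn.
    __cilkrts_set_param("nworkers", "2");
    std::printf("Cilk workers: %d\n", __cilkrts_get_nworkers());
    return 0;
}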

The task in question is an octree-based operation, where the octree is quite deep (over 25 levels) and there is plenty of opportunity for parallelism. The hot spot identified by VTune (and I think I tried Parallel Inspector too) is in the 'leaf' operation. (Generally speaking; I haven't profiled it on my colleague's machine, but this is by far where most of the work is done.)

But even beyond my particular application it seems like the question is important -
Intel's article on hyperthreading (http://software.intel.com/en-us/articles/performance-insights-to-intel-hyper-threading-technology/) mentions several cases where hyperthreading will not yield a performance gain - and may yield a performance decrease:
  • Extremely high memory bandwidth applications. Intel HT Technology increases the demand placed on the memory subsystem when running two threads. If an application is capable of utilizing all the memory bandwidth with Intel HT Technology disabled, then the performance will not increase when Intel HT Technology is enabled. It is possible in some circumstances that performance will degrade, due to increased memory demands and/or data caching effects in these instances. [...]
  • Extremely compute-efficient applications. If the processor's execution resources are already well utilized, then there is little to be gained by enabling Intel HT Technology. For instance, code that already can execute four instructions per cycle will not increase performance when running with Intel HT Technology enabled, as the processor core can only execute a maximum of four instructions per cycle.
So if we assume that a given app has one or more of these characteristics, then it seems like it would make sense to avoid having #threads > #cores.
And as before it seems like "turbo boost" technology could exacerbate the problem, making this even more important.

thanks,
Daniel

robert-reed
Valued Contributor II
OK, it sounds like you're further along than I surmised from your first note: you have a hot spot identified but you didn't say whether you'd figured out if it is a computational problem (i.e., floating-point activity in the leaf operation) or a bandwidth limit. Hyper-Threading technology can improve bandwidth limits (within reason) through latency hiding but can exacerbate performance issues due to ALU saturation.

And "turbo boost" technology can certainly complicate the analysis process. I've heard suggestions to disable it while doing performance analysis data collection just so you can get a repeatable result, but I don't have any personal experience with doing that.

Do the threads in the octree subdivision carry a lot of context as they descend 25 or so levels? (Any subdivision within the octree can be represented by a pair of corner points, but if you're subdividing lists of occupants or something like that, you could accumulate quite a snowball.) If all four threads are trying to carve their own chunk out of the octree "universe", then you may see higher-than-expected cache misses in one of the inner caches.

Because Hyper-Threading technology can benefit some algorithms while limiting others, a blanket policy like "avoid having #threads > #cores" only shifts the sweet spot of which algorithms run optimally without scheduler-policy changes. But by providing the thread-pool-size calls you've discovered, we leave open the possibility that applications can select the optimal pool size based on some heuristic related to the algorithm, while expecting that most algorithms will do fine with the default of one pool thread per HW thread (HT threads included).
Daniel_Faken
Beginner
Thanks for the reply, Robert.

For the question about octree subdivision: the recursion (and so the threads) carries some context on the stack - say 1 KB per level of the octree - but most of the state is shared. This doesn't cause a lot of contention, though, because the leaf operation does so much work (all integer, no floating-point).
I suppose if I had >100 threads this on-stack context might become a problem.
There *is*, however, potentially a lot of cache-missing going on, because the octree itself can be 100 MB or larger. But the misses would happen when accessing the shared state, so hyperthreading presumably doesn't help, since those structures are mutexed.
So maybe this could become a problem when the number of threads (cores) gets very large: enough threads might complete their expensive leaf operation at the same time that lock contention becomes an issue. So as our customers' core counts increase, we may need to change the shared state to be less contentious.
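
(If that ever does become the bottleneck, one direction would be per-thread accumulation, e.g. via tbb::combinable - sketched below purely as an illustration, with made-up names - so the leaf operations never touch a mutex and the per-thread results are merged once at the end.)

#include <tbb/combinable.h>
#include <tbb/parallel_for.h>
#include <cstdio>

// Hypothetical per-leaf result type, purely for illustration.
struct LeafStats { long count = 0; };

int main() {
    tbb::combinable<LeafStats> local_stats;   // one instance per thread, no mutex

    tbb::parallel_for(0, 1000, [&](int leaf) {
        (void)leaf;
        // Each thread updates only its own copy -> no lock contention
        // inside the hot leaf operation.
        local_stats.local().count += 1;       // stand-in for the real leaf work
    });

    // One single-threaded merge at the end replaces the mutexed shared update.
    LeafStats total;
    local_stats.combine_each([&](const LeafStats& s) { total.count += s.count; });
    std::printf("leaves processed: %ld\n", total.count);
    return 0;
}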

But for now, as you suggest, we can tune the #threads ourselves. I definitely understand you don't want to make this the default policy - but is there a straightforward way to do this tuning oneself? That is, a portable way to get the number of actual physical cores from within the program?
The "TBB and hyper-threading" thread I mentioned before hasn't answered this question either.

Again, with the complications-but-potential-speedup added by Intel's own Hyper-Threading and Turbo Boost technologies, it seems logical to me that Intel address this by (at least) providing a way to do this tuning ourselves in their parallel toolkits (TBB, Cilk, etc.) - if they haven't already. Just a suggestion, of course! I'm sure the TBB/Cilk developers are better placed to evaluate the best solution for this.

thanks again,
Daniel
jimdempseyatthecove
Honored Contributor III
>>the structures are mutexed

I suspect this is where the problem is.

My guess is that if you try your program on a 4-core processor with HT, you will see approximately the same problem with more than 2 threads.

Turbo Boost will muck up your timing stats, so if you can turn it off for testing, or test on a processor without Turbo Boost, that will help you straighten out your algorithm. Then later, look into adjusting for systems with Turbo Boost. Example: on a system with 2 cores with HT, run on "processors" 0,2 for a while, then 1,3 (or 0,1 then 2,3, depending on how HT is mapped on your OS).
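
Something like the following Linux-only sketch would do that pinning programmatically (CPUs 0 and 2 are just an example - check how your OS enumerates the HT siblings - or use "taskset -c 0,2" from the shell instead):

#ifndef _GNU_SOURCE
#define _GNU_SOURCE   // for sched_setaffinity / CPU_SET
#endif
#include <sched.h>
#include <cstdio>

int main() {
    // Pin this process (and the worker threads it later creates,
    // which inherit the mask) to logical CPUs 0 and 2.
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    CPU_SET(2, &set);
    if (sched_setaffinity(0 /* current process */, sizeof(set), &set) != 0) {
        std::perror("sched_setaffinity");
        return 1;
    }
    // ... now run the TBB/Cilk workload ...
    return 0;
}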

Jim Dempsey
jimdempseyatthecove
Honored Contributor III
I forgot to add:

When you have a system without Turbo Boost but with HT, try running the app using 2 threads consisting of the HT siblings. This will confirm or reject your hypothesis about HT providing no benefit.

Can you post your test program?

Jim Dempsey