topic FP-intensive hyperthreading performance on latest Xeons in Intel® Moderncode for Parallel Architectures

FP-intensive hyperthreading performance on latest Xeons

avalys — Thu, 08 Jul 2010 18:32:27 GMT

We have recently purchased a dual Intel X5650 workstation to run an internally developed, floating-point-intensive simulation under Ubuntu 10.04.

Each X5650 has 6 cores, so there are 12 cores in total. The code is trivially parallel, so I have been running it mostly with 12 threads, and observing approximately "1200%" processor utilization through "top".

HyperThreading is enabled in the BIOS, so the operating system nominally sees 24 cores available. If I increase the number of threads to 24, top reports approximately 2000% processor utilization - however, it does not appear that the actual code performance increases by 20/12.

My question is - how does HyperThreading actually work on the latest generation of Xeons? Would a floating-point intensive code benefit from scheduling more than one thread per core? Does the answer change if the working set is on the order of the cache size, as compared to several times larger, or if there are substantial I/O operations (e.g. writing simulation outputs to disk)?

Additionally - how should I interpret processor utilization percentages from "top" when hyperthreading is enabled?

jimdempseyatthecove — Fri, 09 Jul 2010 13:22:11 GMT

In general terms, the answer is that it depends on the application.
Some applications benefit more than others.

When you are 100% FP bound in all threads, and all data is in cache, then the benefit from HT is at its minimum (perhaps less than +5%). As the data moves from entirely in cache toward entirely out of cache, you see more benefit (it could be greater than 30%).

When a good portion of the code is integer, you can see much larger benefits (hard to peg a number to this).

When a thread interacts with the O/S (disk read/write, page fault, etc.), the benefits are even better.

Do not rely on an artificial benchmark to estimate these numbers. Use measurements taken from your code.

Jim Dempsey
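
One way to act on the advice to measure your own code is a small timing harness. The following is only a sketch: it assumes the simulation can be launched from the command line and accepts its thread count as the last argument (for example a hypothetical `./simulate --threads 12`), which the thread does not confirm. The function names are illustrative.

```python
import subprocess
import sys
import time


def time_command(cmd):
    """Run cmd to completion and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


def compare_thread_counts(base_cmd, counts):
    """Time base_cmd once per thread count, appending the count as the last argument."""
    return {n: time_command(base_cmd + [str(n)]) for n in counts}


def speedup(baseline_seconds, candidate_seconds):
    """Ratio above 1.0 means the candidate run finished faster than the baseline."""
    return baseline_seconds / candidate_seconds


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python measure_ht.py ./simulate --threads
    timings = compare_thread_counts(sys.argv[1:], [12, 24])
    for n, seconds in timings.items():
        print(f"{n} threads: {seconds:.2f} s")
    print(f"24 vs 12 threads: {speedup(timings[12], timings[24]):.2f}x")
```

Wall-clock time for the same amount of work is the figure of merit here, not the utilization that top reports. Repeating each run a few times and comparing the spread would make the comparison more trustworthy.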

TimP — Fri, 09 Jul 2010 14:30:06 GMT

The hyperthreads on a core share its floating-point unit. If you read the MKL docs, you will see that in the ideal case, where a single thread makes 100% use of the FPU, HT is expected to reduce performance.
If you are hoping to see a gain from HT on an application which spends much time on cache misses, this will depend on the trade-off between earlier cache capacity eviction and better parallelization of cache miss resolution. If the 2 threads are sharing the same pages, you have a better chance.
On most recent distros, the default for top is to lump together all the threads of a single application. So, if you have 4 cores each running 2 hyperthreads, it could rise to 800%, even though the performance is not much better than 4 threads spread out across the cores.
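
To put a per-process figure from top in context, it helps to know how many logical CPUs and how many physical cores the kernel sees. The sketch below assumes a Linux system that exposes CPU topology under `/sys/devices/system/cpu`; the helper names are illustrative and not part of any tool mentioned in this thread.

```python
import glob

# Each logical CPU lists the logical CPUs that share its physical core.
SIBLINGS_GLOB = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"


def read_sibling_lists(pattern=SIBLINGS_GLOB):
    """Return one sibling-list string (e.g. "0,12") per online logical CPU."""
    lists = []
    for path in glob.glob(pattern):
        with open(path) as f:
            lists.append(f.read().strip())
    return lists


def summarize(sibling_lists):
    """Return (logical_cpus, physical_cores) from the sibling-list strings."""
    logical = len(sibling_lists)
    physical = len(set(sibling_lists))  # siblings of one core report the same list
    return logical, physical


def top_ceiling_percent(logical_cpus):
    """Largest %CPU a single process can show when top sums over its threads."""
    return 100 * logical_cpus


if __name__ == "__main__":
    logical, physical = summarize(read_sibling_lists())
    print(f"{logical} logical CPUs on {physical} physical cores")
    print(f"top ceiling for one process: {top_ceiling_percent(logical)}%")
```

On the dual X5650 machine described in the question, this should report 24 logical CPUs on 12 cores, so the ceiling is 2400% and a reading of 2000% says only that the logical CPUs are busy, not that 20 cores' worth of work is being done.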

Arthur_Moroz — Tue, 13 Jul 2010 21:45:33 GMT

Off-topic: I'm just wondering whether SSE registers/processing are shared across cores, or whether each core has its own instance. (Just in case: I'm banned from Google search :)

Dmitry_Vyukov — Tue, 13 Jul 2010 22:19:33 GMT

Each core has its own set of SSE registers and execution units.
AFAIK HT sibling threads inside of a core share execution units, but have separate sets of registers.
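
Because sibling threads share a core's execution units, a fair 12-thread baseline needs one thread per physical core rather than two threads stacked on the same core. The following is a minimal sketch of one way to arrange that on Linux, assuming the sibling-list strings come from `/sys/devices/system/cpu/cpuN/topology/thread_siblings_list`; the function names are illustrative.

```python
import os


def parse_cpulist(text):
    """Parse a kernel CPU list such as "0,12" or "0-1" into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def one_cpu_per_core(sibling_lists):
    """Pick the lowest-numbered logical CPU from each physical core."""
    return sorted({min(parse_cpulist(s)) for s in sibling_lists})


def pin_to_one_thread_per_core(sibling_lists):
    """Restrict the calling process to one logical CPU per core (Linux only)."""
    cpus = one_cpu_per_core(sibling_lists)
    os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    return cpus
```

Threads and child processes started after the call inherit the restricted CPU set, so a 12-thread run launched this way cannot land two threads on the same core.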