Linpack runs only on cores, not threads?

mprnt · 06-17-2016

hi there,

I have been using the Linpack binaries for quite a long time for various purposes, mostly performance and stability diagnostics.

I noticed that with current versions of the binaries and current CPUs, Linpack runs only on the physical cores and no longer on the HT threads (I am quite sure this was different earlier); e.g. with an E5-2683 v4 (16 cores, 32 threads) I see only 16 CPUs in use in the OS...

Any idea why this changed? Is the reason the CPU or the binary?

Thank you for your help

Regards

Martin

TimP · 06-17-2016

Did you forget about MKL_DYNAMIC? https://software.intel.com/en-us/node/528547

I don't think there has been any change. MKL has always defaulted to what is now called OMP_PLACES=cores for CPUs where that is optimum for performance.
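To check whether MKL_DYNAMIC is what limits the run to 16 threads, one could launch the benchmark with dynamic adjustment turned off and an explicit thread count. The following is only a minimal sketch: MKL_DYNAMIC and MKL_NUM_THREADS are real MKL environment variables, but the helper names and the binary path are placeholders of mine, and I have not verified how a particular Linpack binary responds to them.

```python
import os
import subprocess


def mkl_thread_env(num_threads, base_env=None):
    """Return a copy of the environment that asks MKL for a fixed thread count."""
    env = dict(os.environ if base_env is None else base_env)
    # Ask MKL not to reduce the thread count on its own.
    env["MKL_DYNAMIC"] = "FALSE"
    # Request e.g. one thread per logical CPU (32 on a 16-core CPU with HT).
    env["MKL_NUM_THREADS"] = str(num_threads)
    return env


def run_linpack(binary, num_threads):
    """Run a Linpack binary (the path is a placeholder) with that environment."""
    return subprocess.run([binary], env=mkl_thread_env(num_threads), check=True)


if __name__ == "__main__":
    # Only show the settings here; call run_linpack() with your own binary path.
    env = mkl_thread_env(os.cpu_count() or 1)
    print("MKL_DYNAMIC=" + env["MKL_DYNAMIC"],
          "MKL_NUM_THREADS=" + env["MKL_NUM_THREADS"])
```

If the OS then shows all 32 logical CPUs busy, the earlier behaviour came from the MKL default rather than from the CPU. Expect the measured GFLOPS to be no better, and possibly worse, than with one thread per core.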

Ying_H_Intel · 06-19-2016

Hi Martin,

Right, MKL defaults to using the physical cores instead of the HT logical cores. There is some explanation in the MKL user guide.

See also the post https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/294954 for your reference.

Best Regards,

Ying

McCalpinJohn · 06-20-2016

I think that MKL uses two threads per core on the first generation Xeon Phi (Knights Corner) processors because that is necessary to get full performance. On all other Intel processors there is no benefit (for LINPACK) in using more than one thread context per core, and (depending on how one does the parallelization) there can be a significant performance degradation due to the smaller amount of cache available per thread.

TimP · 06-20-2016

The hand-optimized functions in MKL for Intel(r) Xeon Phi(tm) use all the logical processors effectively. Although 2 or 3 threads per core would be sufficient to keep the FPU running at full speed, the extra threads are used efficiently for data movement, such as partial transpose.

On the "big core" CPU, which I thought was the subject of this thread, a single thread per core is sufficient to keep the FPU running at full speed, and the delays associated with switching among hyperthreads are noticeable even if cache capacity is sufficient. In floating-point code, the most likely case where a 2nd thread per core is used effectively has been on CPUs with long-latency divide, when the divide is not avoided by the Intel compiler "throughput option" -no-prec-div. Another, of course, is the situation of frequent cache misses, where one thread may run while the other is resolving misses. So the relationship between cache activity and the effectiveness of hyperthreads is complicated.

OpenMP 4.5 introduced a mechanism for your own program to do what MKL does by default. With OMP_PLACES=cores set, omp_get_num_places() returns the number of places, which corresponds to 1 thread per core. I think so far only the latest Intel beta compiler has implemented OpenMP 4.5 fully, although 16.0.3 may have omp_get_num_places. While Intel libiomp5 sets OMP_PLACES=threads as a default, there are OpenMP implementations which default to OMP_PLACES=cores, as well as those which don't implement it.
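To illustrate what the two place granularities mean for the thread count, here is a rough sketch that groups logical CPUs into places from the Linux sysfs topology files. It is not how libiomp5 or MKL determine places internally; the function names are my own, and read_topology() assumes a Linux system that exposes /sys/devices/system/cpu/cpuN/topology.

```python
import glob
import os
import re


def read_topology(sysfs_root="/sys/devices/system/cpu"):
    """Map logical CPU number -> (package id, core id) using Linux sysfs."""
    topology = {}
    for cpu_dir in glob.glob(os.path.join(sysfs_root, "cpu[0-9]*")):
        match = re.fullmatch(r"cpu(\d+)", os.path.basename(cpu_dir))
        if not match:
            continue
        topo_dir = os.path.join(cpu_dir, "topology")
        try:
            with open(os.path.join(topo_dir, "physical_package_id")) as f:
                package = int(f.read())
            with open(os.path.join(topo_dir, "core_id")) as f:
                core = int(f.read())
        except (OSError, ValueError):
            continue  # offline CPU, or topology files not available
        topology[int(match.group(1))] = (package, core)
    return topology


def places(topology, granularity):
    """Group logical CPUs into places, one per hardware thread or per core."""
    if granularity == "threads":
        return [{cpu} for cpu in sorted(topology)]
    if granularity == "cores":
        by_core = {}
        for cpu, core_key in sorted(topology.items()):
            # Hyperthreads of one core share the same (package, core) key.
            by_core.setdefault(core_key, set()).add(cpu)
        return list(by_core.values())
    raise ValueError("granularity must be 'threads' or 'cores'")


if __name__ == "__main__":
    topo = read_topology()
    print("places with cores:  ", len(places(topo, "cores")))
    print("places with threads:", len(places(topo, "threads")))
```

On a 16-core CPU with hyperthreading enabled, the "cores" grouping should give 16 places of two logical CPUs each, and the "threads" grouping 32 places of one logical CPU each.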

I haven't seen reports on whether any version of libgomp actually implements omp_get_num_places() (as opposed to accepting it but not returning the correct result). Nor is there documentation on whether that function call from gcc has been tested with libiomp5.