Community
seyedalireza_y_
Beginner

MKL DGEMM Hyperthreading.

Hi, 

 

I'm trying to call DGEMM on relatively big matrices (m=10000, n=100000, k=10000) on Knights Landing.

When I profile with VTune, I can see that a call to MKL DGEMM has 68 threads working (the number of physical cores), but the expectation is that it uses 272 threads (logical cores) because of hyper-threading. Other parts of my code, where I use OpenMP simd directives, use up to 272 threads. I'm wondering if there are any settings I need to set up in order to get hyper-threading working for my case.

 

Thanks,

Ali

7 Replies
TimP
Black Belt

Did you look into the MKL_DYNAMIC setting? Did you find an advantage in using all the logical threads (vs. spreading a smaller number evenly across cores) in your omp parallel (not simd alone) regions?

Gregg_S_Intel
Employee

MKL_NUM_THREADS

Note that many applications perform best using fewer than 4 threads per core.

See this article for information about setting the number of threads per core, https://software.intel.com/en-us/articles/process-and-thread-affinity-for-intel-xeon-phi-processors-...

McCalpinJohn
Black Belt

On most Intel processors DGEMM only uses one thread per core.   The code is very tightly constructed (with very careful cache blocking) to give the best performance in this configuration.   There are very few avoidable stalls that could be overlapped with work in the other logical processor. Using the other logical processor would cut the available cache in half, which would reduce the block sizes, increase the cache miss rates, and decrease the overall performance.

I have not looked at Intel's DGEMM implementation for Xeon Phi x200, but it is easy to believe that it has the same properties. (The first-generation Xeon Phi (Knights Corner) was an exception because a single thread could only issue instructions every other cycle, so two threads were required to reach maximum speed on compute-bound codes. This limitation is not present in the second-generation Xeon Phi (Knights Landing): one thread of execution can issue two instructions every cycle, getting reasonably close to peak performance.)

seyedalireza_y_
Beginner

Thank you everyone for your thorough help and explanations. I tried it and found that, as Dr. McCalpin pointed out, it's better not to override the setting: performance declines when using more threads per core, and the code already seems to reach peak performance.

SergeyKostrov
Valued Contributor II

1. I would also recommend looking at the compact and scatter settings for the KMP_AFFINITY environment variable.

2. My extensive experience with the xGEMM MKL functions shows that they are all very well optimized when it comes to threading, and it can also be controlled with the OMP_NUM_THREADS and KMP_AFFINITY environment variables.

3. If, for example, a CPU with 4 cores and 8 logical CPUs is used, then OMP_NUM_THREADS needs to be set to 4; setting it to 8 does not improve performance.

4. Take into account that a programmer's control is very simple, and this is how it could look:

...
#ifdef _RTTHREADTOPU_BINDING_SHOWINFO
    _RTLIBAPI RTtchar g_szThreadToPU[] = RTU("KMP_AFFINITY=granularity=fine,proclist=[0,2,4,6],explicit,verbose");
#else
    _RTLIBAPI RTtchar g_szThreadToPU[] = RTU("KMP_AFFINITY=granularity=fine,proclist=[0,2,4,6],explicit");
#endif
...
SergeyKostrov
Valued Contributor II

>>...On most Intel processors DGEMM only uses one thread per core... Absolutely correct and I confirm that.
SergeyKostrov
Valued Contributor II

>>...I'm trying to call DGEMM on relatively big matrices (m=10000, n=100000, k=10000) on Knights Landing...

I finally completed a set of tests for 100Kx100K square dense matrices on a KNL server with 64 cores. There are no performance improvements if more than 64 threads are used. Here are the test results:

...
Matrix multiplication C=A*B where matrix A( 114688x114688 ) and matrix B( 114688x114688 )
Allocating memory for matrices
Intializing matrix data
Matrix multiplication started
Matrix multiplication completed at 1941.544 seconds
Deallocating memory
Processing Completed
...