I'm trying to call DGEMM on relatively big matrices (m=10000, n=100000, k=10000) on Knights Landing.
When I profile with VTune, I can see that the call to MKL DGEMM has 68 threads working (which is the number of physical cores), but I expected it to use 272 threads (logical cores) because of hyper-threading. Other parts of my code, where I use (openmp simd) directives, use up to 272 threads. I'm wondering whether there are any settings I need to configure to get hyper-threading working in my case.
Did you look into the MKL_DYNAMIC setting? And did you find an advantage in using all the logical threads (vs. spreading a smaller number evenly across the cores) in your omp parallel (not simd-only) regions?
Note that many applications perform best using fewer than 4 threads per core.
See this article for information about setting the number of threads per core, https://software.intel.com/en-us/articles/process-and-thread-affinity-for-intel-xeon-phi-processors-...
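As a concrete sketch of the kind of settings being discussed: MKL reads its threading controls from the environment at initialization time, so they must be set before the first MKL call. The snippet below is illustrative only; the value 68 assumes a 68-core Knights Landing part, and you would adjust it for your SKU and for however many threads per core you end up wanting.

```python
import os

# These must be set before the first MKL call (e.g., before importing
# an MKL-backed library such as NumPy linked against MKL).
os.environ["MKL_DYNAMIC"] = "FALSE"          # don't let MKL shrink the team
os.environ["MKL_NUM_THREADS"] = "68"         # one thread per physical core
os.environ["KMP_AFFINITY"] = "compact,granularity=fine"  # pin threads
```

Equivalently, you can export the same variables in the shell before launching the program; the Intel OpenMP runtime variable KMP_HW_SUBSET (e.g. 68c,1t vs. 68c,4t) is another way to select threads per core, as described in the article linked above.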
On most Intel processors, DGEMM uses only one thread per core. The code is very tightly constructed (with very careful cache blocking) to give the best performance in this configuration. There are very few avoidable stalls that could be overlapped with work in the other logical processor. Using the other logical processor would cut the available cache per thread in half, which would reduce the block sizes, increase the cache miss rates, and decrease the overall performance.
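To make the cache-blocking idea concrete, here is a toy sketch (not MKL's implementation) of a tiled matrix multiply: the computation is done in bs-by-bs tiles so that each tile of A, B, and C stays cache-resident while it is reused. Halving the cache available to a thread forces a smaller bs, which means each element is reloaded more often.

```python
def blocked_matmul(A, B, n, bs):
    """Multiply two n-by-n matrices (lists of lists) in bs-by-bs tiles.

    Pure-Python illustration of the loop structure only; a real BLAS
    kernel adds register blocking, vectorization, and packing.
    """
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):            # tile rows of C
        for kk in range(0, n, bs):        # tile the shared dimension
            for jj in range(0, n, bs):    # tile columns of C
                # Work entirely within the current tiles, which are
                # small enough to stay in cache while being reused.
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + bs, n)):
                            C[i][j] += a * B[k][j]
    return C
```

The tile size bs is chosen so the working set (roughly three bs-by-bs tiles) fits in the cache level being targeted, which is exactly the budget that shrinks when two threads share one core's cache.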
I have not looked at Intel's DGEMM implementation for Xeon Phi x200, but it is easy to believe that it has the same properties. (The first-generation Xeon Phi (Knights Corner) was an exception, because a single thread could only issue instructions every other cycle, so two threads were required to reach maximum speed on compute-bound codes. This limitation is not present in the second-generation Xeon Phi (Knights Landing): one thread of execution can issue two instructions every cycle, getting reasonably close to peak performance.)
Thank you everyone for your thorough help and explanations. I tried it and found that, as Dr. McCalpin pointed out, it's better not to override the setting: performance declines when using more threads per core, and the code already seems to reach peak performance.