We are using MKL on a Red Hat Linux network server with a Xeon processor that has 32 physical cores (64 logical cores). The application uses a thread pool to handle network requests in parallel. Each request is handled independently. Performance improves with more threads:
4 threads : 45 seconds
8 threads : 23 seconds
16 threads : 15 seconds
24 threads : 14 seconds
32 threads : 15 seconds
However, performance plateaus at about 16 threads and drops slightly at 32 threads. When I replace the MKL cblas_sgemm function with the ATLAS implementation, performance keeps improving linearly from 1 thread to 32 threads.
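To put numbers on the plateau, the timings above can be converted to speedup and parallel efficiency relative to the 4-thread run. This is only a small arithmetic sketch; the function names `speedup` and `parallel_efficiency` are made up for illustration, and the 4-thread run is used as the baseline because no 1-thread MKL timing is listed.

```c
#include <assert.h>

/* Speedup of a run relative to a baseline run (both in seconds). */
double speedup(double base_seconds, double seconds)
{
    assert(seconds > 0.0);
    return base_seconds / seconds;
}

/* Fraction of the ideal speedup actually achieved when going from
   base_threads to threads. 1.0 means perfect linear scaling. */
double parallel_efficiency(int base_threads, double base_seconds,
                           int threads, double seconds)
{
    assert(base_threads > 0 && threads > 0);
    double ideal = (double)threads / (double)base_threads;
    return speedup(base_seconds, seconds) / ideal;
}
```

With the numbers reported here, `parallel_efficiency(4, 45.0, 8, 23.0)` is about 0.98, `parallel_efficiency(4, 45.0, 16, 15.0)` is 0.75, and `parallel_efficiency(4, 45.0, 32, 15.0)` is 0.375, so scaling is nearly ideal up to 8 threads and falls off sharply after that.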
Limiting the MKL thread count, either by calling mkl_set_num_threads(1) at the beginning of the main function or by setting the corresponding environment variable to 1, does not help and gives the same result. A multiprocess solution, surprisingly, has the same problem. Another experiment, which sleeps for a short time before each call to MKL cblas_sgemm, shows linear but not ideal scaling. It looks like there is some resource contention inside the MKL cblas_sgemm implementation. Or are we missing something here?
Any comment or suggestion is highly appreciated! And thanks much in advance!
Thanks for the quick response! Sorry, I can't share the code because of company policy. The problematic part is the first convolution layer, which takes a 256*256*3 image as input. We are using MKL 11.1, linked with -l/3rdparty/libmkl.a
Hi Yu, 1/ I am not asking you to share the private code, but you could create the simplest possible sgemm example that shows the problem. 2/ Do I understand correctly that the problem sizes in your case are ~256 x 256? 3/ Version 11.1 of MKL is 5 years old. Could you check the latest MKL 2017 u3 or the newest 2018? You can download these binaries for free.
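A minimal reproducer along the lines of point 1/ might look like the sketch below. It is not the original application: `job_t`, `gemm`, `worker`, `run_workers`, and the `USE_MKL` macro are hypothetical names, the matrices are assumed square, and a naive triple loop stands in for the BLAS call when MKL is not available. Each thread works on its own private buffers, mirroring the independent requests described earlier.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <time.h>

#ifdef USE_MKL
#include <mkl.h>  /* cblas_sgemm, mkl_set_num_threads */
#endif

/* Hypothetical per-thread job: `reps` SGEMM calls on private n x n buffers. */
typedef struct {
    int n;
    int reps;
    float result;  /* C[0] after the last call, used as a sanity check */
} job_t;

/* C = A * B for row-major n x n matrices. */
static void gemm(int n, const float *a, const float *b, float *c)
{
#ifdef USE_MKL
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0f, a, n, b, n, 0.0f, c, n);
#else
    /* Naive fallback so the harness builds without MKL. */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float acc = 0.0f;
            for (int k = 0; k < n; k++)
                acc += a[i * n + k] * b[k * n + j];
            c[i * n + j] = acc;
        }
#endif
}

static void *worker(void *arg)
{
    job_t *job = arg;
    size_t count = (size_t)job->n * (size_t)job->n;
    float *a = malloc(count * sizeof *a);
    float *b = malloc(count * sizeof *b);
    float *c = malloc(count * sizeof *c);
    assert(a && b && c);
    for (size_t i = 0; i < count; i++) { a[i] = 1.0f; b[i] = 1.0f; }
    for (int r = 0; r < job->reps; r++)
        gemm(job->n, a, b, c);
    job->result = c[0];  /* equals n when A and B are all ones */
    free(a); free(b); free(c);
    return NULL;
}

/* Runs `threads` independent workers, returns elapsed wall-clock seconds,
   and stores C[0] from the first worker in *check. */
double run_workers(int threads, int n, int reps, float *check)
{
    assert(threads > 0 && n > 0 && reps > 0 && check);
#ifdef USE_MKL
    mkl_set_num_threads(1);  /* one MKL thread per application thread */
#endif
    pthread_t *tids = malloc((size_t)threads * sizeof *tids);
    job_t *jobs = malloc((size_t)threads * sizeof *jobs);
    assert(tids && jobs);
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int t = 0; t < threads; t++) {
        jobs[t] = (job_t){ n, reps, 0.0f };
        int rc = pthread_create(&tids[t], NULL, worker, &jobs[t]);
        assert(rc == 0);
        (void)rc;
    }
    for (int t = 0; t < threads; t++)
        pthread_join(tids[t], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    *check = jobs[0].result;
    free(tids); free(jobs);
    return (double)(t1.tv_sec - t0.tv_sec) + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
}
```

Build with `-pthread`; to exercise MKL, define `USE_MKL` and add the link line appropriate for your installation. Because every thread does the same fixed amount of work, the elapsed time from `run_workers(threads, n, reps, &check)` would stay roughly flat as `threads` grows if scaling were ideal, so a rising time at 16 or 32 threads would reproduce the effect discussed here.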
Thanks for the comments! And yes, the input image size for the convolution layer is 256*256*3 channels, but the actual problem size might be much bigger, depending on the neural network algorithm. I tried MKL 2018 on the machine and got the same result. But this time, VTune gave us a very clear report about the memory bandwidth. After some more digging, we think the bottleneck is memory bandwidth. MKL does such an excellent job of optimization that it utilizes the memory bandwidth more fully than ATLAS does. That's probably why the CPU usage always caps at 16. It also explains why the multiprocess solution has the same problem. Thanks again :)
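The bandwidth explanation can be illustrated with a very simple saturation model: run time shrinks in proportion to the thread count until it hits a floor set by a shared resource, here assumed to be memory bandwidth. This is only a sketch under stated assumptions, not something measured with VTune. The function names are hypothetical, the 180-second single-thread figure is extrapolated from the 4-thread timing (4 x 45 s) assuming perfect scaling up to that point, and the 14-second floor is simply the best time reported in this thread.

```c
#include <assert.h>

/* Predicted run time when per-thread work scales perfectly but total run
   time cannot drop below a floor imposed by a shared resource. */
double predicted_seconds(int threads, double single_thread_seconds,
                         double floor_seconds)
{
    assert(threads > 0);
    double compute = single_thread_seconds / (double)threads;
    return compute > floor_seconds ? compute : floor_seconds;
}

/* Smallest thread count at which the floor, not compute, sets the time. */
int saturation_threads(double single_thread_seconds, double floor_seconds)
{
    assert(single_thread_seconds > 0.0 && floor_seconds > 0.0);
    int threads = 1;
    while (single_thread_seconds / (double)threads > floor_seconds)
        threads++;
    return threads;
}
```

With those assumed inputs, `predicted_seconds` gives 45 s, 22.5 s, and 14 s for 4, 8, and 16 threads, close to the reported 45 s, 23 s, and 15 s, and `saturation_threads(180.0, 14.0)` returns 13, which falls between the 8-thread and 16-thread measurements where the curve flattens. The same floor would apply to separate processes, since they share the same memory system.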