Problematic MKL-DNN convolution performance for some sizes

gustafsson__bengt · 03-27-2018

I have encountered a fairly severe performance issue with the MKL DNN convolution operations in MKL2018.1.

As far as I can tell, the library falls back to a more or less naive loop when the number of output channels is below 8. I don't have the source code, so this is just an educated guess. However, as the printout below shows, the performance factor is over 40 per output channel: with 32 input channels, 4 output channels take 25.327 ms while 8 output channels take 1.214 ms, about 21 times slower in total for half the work. An obvious remedy would be to round up the number of output channels to a multiple of eight. My question is whether you plan to implement something that makes such convolutions faster, or at least not much slower than the same operation with the number of output channels rounded up. If this has been fixed in MKL2018.2, I apologize; I haven't had time to test it yet.
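
A minimal sketch of the rounding-up workaround, for illustration only. It assumes the filter weights are stored as a plain dense array in [out_ch][in_ch][kh][kw] order; the helper names `round_up` and `pad_output_channels` are hypothetical and not part of MKL. The extra output channels get all-zero filters, so the first `out_ch` output planes are unchanged and the rest can be discarded.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Round n up to the next multiple of `multiple` (4 -> 8, 8 -> 8, 9 -> 16).
inline std::size_t round_up(std::size_t n, std::size_t multiple) {
    return (n + multiple - 1) / multiple * multiple;
}

// Hypothetical helper: append zero-filled filters so that the number of
// output channels becomes a multiple of `multiple`.
// Assumed weight layout: [out_ch][in_ch][kh][kw], dense, row-major.
std::vector<float> pad_output_channels(const std::vector<float>& weights,
                                       std::size_t out_ch, std::size_t in_ch,
                                       std::size_t kh, std::size_t kw,
                                       std::size_t multiple = 8) {
    const std::size_t per_filter = in_ch * kh * kw;
    assert(weights.size() == out_ch * per_filter);

    // out_ch is the outermost dimension, so the original filters occupy the
    // front of the padded buffer and the zero filters follow them.
    std::vector<float> padded(round_up(out_ch, multiple) * per_filter, 0.0f);
    std::copy(weights.begin(), weights.end(), padded.begin());
    return padded;
}
```

For the first test case below (32 input channels, 4 output channels, 3x3 kernel) this would produce a filter bank for 8 output channels, of which only the first 4 output planes are meaningful.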

All of the tests below were run on an Intel i7-8700 CPU with an image size of 256x256 pixels times the stated number of channels. The kernel is 3x3 in the spatial dimensions. No specific thread count was set. The build was an x64 release build using the Microsoft VS2017 compiler.

The actual test program is attached, and includes a very naive implementation to get a baseline for the times.
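
The attachment is not reproduced here. As a rough idea of what such a naive baseline can look like, here is a sketch of a direct convolution. The layouts ([in_ch][h][w] input, [out_ch][in_ch][k][k] weights), stride 1, and zero padding that keeps the output the same size as the input are all assumptions, and `naive_conv2d` is a hypothetical name, not necessarily what the attached program does.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive direct convolution (cross-correlation, as is usual in DNN code).
// Assumptions: stride 1, odd kernel size k, zero padding of k/2 on each side.
std::vector<float> naive_conv2d(const std::vector<float>& input,
                                const std::vector<float>& weights,
                                int in_ch, int out_ch, int h, int w, int k) {
    assert(k % 2 == 1);
    assert(input.size() == static_cast<std::size_t>(in_ch) * h * w);
    assert(weights.size() == static_cast<std::size_t>(out_ch) * in_ch * k * k);

    const int pad = k / 2;
    std::vector<float> output(static_cast<std::size_t>(out_ch) * h * w, 0.0f);

    for (int oc = 0; oc < out_ch; ++oc)
        for (int y = 0; y < h; ++y)
            for (int x = 0; x < w; ++x) {
                float acc = 0.0f;
                for (int ic = 0; ic < in_ch; ++ic)
                    for (int ky = 0; ky < k; ++ky)
                        for (int kx = 0; kx < k; ++kx) {
                            const int iy = y + ky - pad;
                            const int ix = x + kx - pad;
                            // Pixels outside the image count as zero.
                            if (iy < 0 || iy >= h || ix < 0 || ix >= w) continue;
                            const std::size_t in_idx =
                                (static_cast<std::size_t>(ic) * h + iy) * w + ix;
                            const std::size_t w_idx =
                                ((static_cast<std::size_t>(oc) * in_ch + ic) * k + ky) * k + kx;
                            acc += input[in_idx] * weights[w_idx];
                        }
                output[(static_cast<std::size_t>(oc) * h + y) * w + x] = acc;
            }
    return output;
}
```

The work in this loop nest grows linearly with the number of output channels, which matches the naive timings below roughly doubling from 4 to 8 to 16 output channels.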

Run with 32 input channels and 4 output channels.
naive took 93.86 ms
mkl took 25.327 ms
Naive 3.70593 times slower
Max diff is 9.15527e-05

Run with 32 input channels and 8 output channels.
naive took 189.08 ms
mkl took 1.214 ms
Naive 155.75 times slower
Max diff is 0.00012207

Run with 32 input channels and 16 output channels.
naive took 343.295 ms
mkl took 1.711 ms
Naive 200.64 times slower
Max diff is 0.000137329

Run with 16 input channels and 16 output channels.
naive took 213.477 ms
mkl took 1.004 ms
Naive 212.626 times slower
Max diff is 3.8147e-05

Evarist_F_Intel · 03-27-2018

Hi Gustafsson,

The limitation on input and output channels (each must be a multiple of 8) is substantial. Whenever the channel counts are not a multiple of 8, we use a GEMM-based fallback algorithm that suffers from extra data movement. We don't have plans to fix that in Intel MKL.
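
To illustrate where extra data movement can come from in a GEMM-based convolution in general, here is a sketch of the common im2col approach. This is not a description of the Intel MKL internals, which are not shown in this thread; the function name `im2col`, the [in_ch][h][w] layout, stride 1, and zero padding are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// im2col: unfold the input [in_ch][h][w] into a matrix of shape
// [in_ch*k*k][h*w]. A convolution then becomes one matrix product:
//   output[out_ch][h*w] = weights[out_ch][in_ch*k*k] * cols[in_ch*k*k][h*w]
// Assumptions: stride 1, odd kernel size k, zero padding of k/2.
std::vector<float> im2col(const std::vector<float>& input,
                          int in_ch, int h, int w, int k) {
    assert(k % 2 == 1);
    assert(input.size() == static_cast<std::size_t>(in_ch) * h * w);

    const int pad = k / 2;
    const std::size_t rows = static_cast<std::size_t>(in_ch) * k * k;
    const std::size_t cols = static_cast<std::size_t>(h) * w;
    std::vector<float> out(rows * cols, 0.0f);  // padded positions stay zero

    for (int ic = 0; ic < in_ch; ++ic)
        for (int ky = 0; ky < k; ++ky)
            for (int kx = 0; kx < k; ++kx) {
                const std::size_t row =
                    (static_cast<std::size_t>(ic) * k + ky) * k + kx;
                for (int y = 0; y < h; ++y)
                    for (int x = 0; x < w; ++x) {
                        const int iy = y + ky - pad;
                        const int ix = x + kx - pad;
                        if (iy < 0 || iy >= h || ix < 0 || ix >= w) continue;
                        // Each input pixel is copied up to k*k times.
                        out[row * cols + static_cast<std::size_t>(y) * w + x] =
                            input[(static_cast<std::size_t>(ic) * h + iy) * w + ix];
                    }
            }
    return out;
}
```

With a 3x3 kernel the unfolded buffer is nine times the size of the input, and it has to be written and read again before any arithmetic happens, which is one plausible source of overhead in this kind of fallback.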

In general, I would highly recommend that you try the open-source version, https://github.com/intel/mkl-dnn. At the moment it has the same limitations with respect to the number of input/output channels, but it provides performance and API improvements over the DNN component in Intel MKL. We also intend to natively support an arbitrary number of channels, but there is no ETA at the moment.