Description: With Intel MKL compiled with AVX512 support, matmul is slower than expected for certain matrix sizes. Let C = np.matmul(A, B), where A.shape = (m, k) and B.shape = (k, n). If m < 192 and n is a multiple of 1024, performance is noticeably worse than for neighboring sizes. For example, on my machine with an "Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz" and a single thread (export OMP_NUM_THREADS=1), np.matmul(A, B) takes 120 ms when A.shape = (191, 20000) and B.shape = (20000, 1024). With the same A and B.shape = (20000, 1023) or (20000, 1025), it takes 80 ms. With A.shape = (192, 20000) and B.shape = (20000, 1024), it takes 75 ms. After many experiments, I found that performance is bad whenever m < 192 and n is 1024, 2048, 3072, ...; the value of k does not seem to matter. The tests above used numpy with the MKL backend as installed by Anaconda; intel-tensorflow shows the same behavior.
Operating system and version: CentOS Linux release 7.4.1708
Library version: Intel Optimized tensorflow 1.15.0 installed with "pip install intel-tensorflow==1.15.0", and numpy 1.18.1 shipped with Anaconda
Compiler version: gcc 4.8.5
Steps to reproduce the error (include makefiles, command lines, small test cases, and build instructions)
import numpy as np
import time

a = np.random.random((191,20000)).astype(np.float32)
b = np.random.random((20000,1024)).astype(np.float32)
for i in range(20):
    time1 = time.time()
    c = np.matmul(a,b)
    time2 = time.time()
    print(time2 - time1)
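To compare the slow shape against its neighbors in one run, the reproduction can be extended into a small sweep. The following is only a sketch: the helper names time_matmul and sweep are made up for illustration, the choice of reporting the best of 20 runs after one warm-up call is my assumption rather than how the numbers above were measured, and OMP_NUM_THREADS is set from inside the script on the assumption that this happens before numpy loads MKL.

```python
import os

# Assumption: single-threaded, as in the report. This must be set before
# numpy is imported so the OpenMP runtime sees it at start-up.
os.environ.setdefault("OMP_NUM_THREADS", "1")

import time

import numpy as np


def time_matmul(m, k, n, repeats=20):
    """Return the best wall-clock time (seconds) of an (m, k) x (k, n) float32 matmul."""
    a = np.random.random((m, k)).astype(np.float32)
    b = np.random.random((k, n)).astype(np.float32)
    np.matmul(a, b)  # warm-up call, not timed
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        np.matmul(a, b)
        best = min(best, time.perf_counter() - start)
    return best


def sweep(ms, ns, k=20000, repeats=20):
    """Time every (m, n) combination; returns {(m, n): best_seconds}."""
    return {(m, n): time_matmul(m, k, n, repeats) for m in ms for n in ns}


if __name__ == "__main__":
    # Shapes taken from the description: m on both sides of 192,
    # n at and around a multiple of 1024.
    results = sweep(ms=(191, 192), ns=(1023, 1024, 1025))
    for (m, n), seconds in results.items():
        print(f"m={m:4d} n={n:5d}  {seconds * 1e3:8.2f} ms")
```

If the behavior described above is present, the (191, 1024) row should stand out from the other five. Adding 2048 and 3072 to ns, or changing k, would check the remaining claims in the description.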
Working compiler, tool, or library version, and accelerator driver version (for regressions)