**Description**: With Intel MKL compiled with AVX512 support, **matmul** performance is poor for certain matrix sizes. Let C = np.matmul(A, B), where A.shape = (**m**, **k**) and B.shape = (**k**, **n**). If **m** < 192 and **n** is a multiple of 1024, performance is worse than expected. For example, on my machine, which has an "Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz", with A.shape = (191, 20000) and B.shape = (20000, 1024), np.matmul(A, B) takes 120 ms (with *export OMP_NUM_THREADS=1*). With A.shape = (191, 20000) and B.shape = (20000, 1023) or (20000, 1025), np.matmul(A, B) takes 80 ms. With A.shape = (192, 20000) and B.shape = (20000, 1024), np.matmul takes 75 ms. I ran many experiments and found that whenever **m** < 192 and **n** is 1024, 2048, 3072, ..., performance is poor; the value of **k** does not seem to matter. These tests used numpy with the MKL backend installed by Anaconda; intel-tensorflow shows the same result.

**Operating system and version** : CentOS Linux release 7.4.1708

**Library version**: Intel Optimized tensorflow 1.15.0 installed with "pip install intel-tensorflow==1.15.0", and numpy 1.18.1 shipped with Anaconda
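To make results comparable across machines, it may help to print the exact versions and the BLAS build that numpy is linked against. The following is a minimal sketch; `describe_environment` is a hypothetical helper name, not part of numpy, and the output of `np.show_config()` varies between numpy versions.

```python
import platform
import sys

import numpy as np


def describe_environment():
    """Collect version details useful for comparing timings (hypothetical helper)."""
    return {
        "python": sys.version.split()[0],
        "numpy": np.__version__,
        "platform": platform.platform(),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
    # Prints numpy's BLAS/LAPACK build information; an MKL-backed build
    # should list MKL libraries here.
    np.show_config()
```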

**Compiler version**: gcc 4.8.5

Steps to reproduce the error (include makefiles, command lines, small test cases, and build instructions)

    import numpy as np
    import time

    a = np.random.random((191, 20000)).astype(np.float32)
    b = np.random.random((20000, 1024)).astype(np.float32)
    for i in range(20):
        time1 = time.time()
        c = np.matmul(a, b)
        time2 = time.time()
        print(time2 - time1)
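To compare the slow and fast shapes side by side, the reproduction can be extended into a small sweep over the values of **m** and **n** mentioned in the description. This is only a sketch: `time_matmul` and `sweep` are hypothetical helper names, the median over a few repeats is my own choice of statistic, and the timings will differ by machine and MKL version.

```python
import os

# Pin to one thread, as in the report. This takes effect only if it is set
# before numpy is first imported in the process.
os.environ.setdefault("OMP_NUM_THREADS", "1")

import time

import numpy as np


def time_matmul(m, k, n, repeats=5):
    """Return the median wall-clock seconds of np.matmul for (m, k) x (k, n) float32."""
    a = np.random.random((m, k)).astype(np.float32)
    b = np.random.random((k, n)).astype(np.float32)
    np.matmul(a, b)  # warm-up call, not timed
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        np.matmul(a, b)
        samples.append(time.perf_counter() - start)
    return float(np.median(samples))


def sweep(ms=(191, 192), ns=(1023, 1024, 1025), k=20000, repeats=5):
    """Time every (m, n) combination and return {(m, n): median seconds}."""
    return {(m, n): time_matmul(m, k, n, repeats) for m in ms for n in ns}


if __name__ == "__main__":
    for (m, n), seconds in sweep().items():
        print(f"m={m:4d} n={n:5d} k=20000: {seconds * 1e3:7.1f} ms")
```

If the behavior described above is present, the (191, 1024) row should stand out as slower than its neighbors.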

Working compiler, tool, or library version, and accelerator driver version (for regressions)

You can report this problem to the MKL team through the Intel Online Service Center.

