MKL matmul with AVX-512 shows poor performance for certain matrix sizes

Wang__Shuo · 04-11-2020

Description: With Intel MKL built with AVX-512 support, matmul performance is poor for certain matrix sizes. Let C = np.matmul(A, B), where A.shape = (m, k) and B.shape = (k, n). If m < 192 and n is a multiple of 1024, performance is worse than expected. For example, on my machine with an "Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz" and a single thread (export OMP_NUM_THREADS=1), np.matmul(A, B) takes 120 ms when A.shape = (191, 20000) and B.shape = (20000, 1024), but only 80 ms when B.shape = (20000, 1023) or (20000, 1025). With A.shape = (192, 20000) and B.shape = (20000, 1024), it takes 75 ms. After many experiments, I found that performance is poor whenever m < 192 and n is 1024, 2048, 3072, ...; the value of k does not seem to matter. These tests used numpy with the MKL backend as installed by Anaconda; intel-tensorflow shows the same behavior.

Operating system and version : CentOS Linux release 7.4.1708

Library version: Intel Optimized tensorflow 1.15.0 installed with "pip install intel-tensorflow==1.15.0", and numpy 1.18.1 shipped with Anaconda

Compiler version: gcc 4.8.5

Steps to reproduce the error (include makefiles, command lines, small test cases, and build instructions)

import time
import numpy as np

# float32 inputs with m = 191 (< 192) and n = 1024 (a multiple of 1024)
a = np.random.random((191, 20000)).astype(np.float32)
b = np.random.random((20000, 1024)).astype(np.float32)
for i in range(20):
    start = time.perf_counter()
    c = np.matmul(a, b)
    print(time.perf_counter() - start)  # seconds per matmul call
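
To compare the slow case with its neighbors in one run, the script above can be extended into a small sweep over m and n. This is only a sketch: the helper names time_matmul and sweep are made up for illustration, and taking the best of a few repeats after one warm-up call is my own choice, not part of the original measurements. As in the description, set OMP_NUM_THREADS=1 before starting Python to get single-threaded timings.

```python
import time

import numpy as np


def time_matmul(m, k, n, repeats=5):
    """Return the best wall-clock time in seconds for a float32 (m, k) x (k, n) matmul."""
    rng = np.random.default_rng(0)
    a = rng.random((m, k), dtype=np.float32)
    b = rng.random((k, n), dtype=np.float32)
    np.matmul(a, b)  # warm-up call, not timed
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        np.matmul(a, b)
        best = min(best, time.perf_counter() - start)
    return best


def sweep(ms, ns, k=20000, repeats=5):
    """Time every (m, n) combination and return {(m, n): seconds}."""
    return {(m, n): time_matmul(m, k, n, repeats) for m in ms for n in ns}


if __name__ == "__main__":
    # Shapes taken from the description: m just below and at 192,
    # n just below, at, and just above 1024.
    for (m, n), seconds in sweep((191, 192), (1023, 1024, 1025)).items():
        print(f"m={m:4d} n={n:5d}: {seconds * 1e3:7.1f} ms")
```

If the problem is present, the (191, 1024) row should stand out against the other five; the absolute numbers will depend on the CPU and the MKL version.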

Working compiler, tool, or library version, and accelerator driver version (for regressions)

Gennady_F_Intel · 07-07-2020

You can report this problem to the MKL team through the Intel Online Service Center.