I will contact with you by email . for the question itself, developer may refer to
The Intel® Math Kernel Library (Intel® MKL) is multi-threaded and employs internal buffers for fast memory allocation. Typically the first subroutine call initializes the threads and internal buffers. Therefore, the first function call may take more time compared to the subsequent calls with the same arguments. Although the initialization time usually insignificant compared to the execution time of SGEMM for large matrices, it can be substantial when timing SGEMM for small matrices. To remove the initialization time from the performance measurement, we recommend making a call to SGEMM with sufficiently bigger parameters (for example, M=N=K=100) and ignoring the time required for the first call. Using a small matrix for the first call won’t initialize the threads since Intel MKL executes multi-threaded code only for sufficiently large matrices.