Intel® oneAPI Math Kernel Library

Is MKL speed dependent upon how contiguous memory is?

L__D__Marks — Wed, 24 Oct 2018 18:44:36 GMT

Does the speed of the MKL BLAS/LAPACK library routines change significantly when memory is contiguous versus fragmented? I have a strange problem that looks like a "memory cache leak" (not a memory leak) and leads to a slowdown of a program.

Let me set the stage first. Reproducibly (using Ganglia to monitor), I have noticed on a cluster that the cached memory increases relatively slowly. When it becomes large, something like 2/3 of the total memory (Intel Gold with 32 cores and 192 GB), a program runs slower by a factor of about 1.5. If I clear the cache and sync the disk (I have not tested which of the two matters) with "sync ; echo 3 > /proc/sys/vm/drop_caches", the program returns to its original speed (about 1.5 times faster).

The issue seems to be associated with I/O: the relevant code uses MPI, and only the core that does the I/O shows the cache leak. The program does a fair amount of I/O, but not massive amounts (10-40 MB). I compile using ifort with -assume buffered_io. My suspicion is that this may leave some cached files behind at the end, effectively a "cache leak".

The program makes a large number of BLAS/LAPACK calls. It is plausible that memory is less contiguous when the cached memory is large, i.e. the RAM is fragmented. Can this lead to a speed change in the BLAS/LAPACK routines?
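One way to check whether the slowdown tracks the page cache is to log the cached fraction of RAM next to the program's existing timers. The following is a minimal sketch, assuming a Linux node whose /proc/meminfo reports `MemTotal` and `Cached` in kB; the function names are hypothetical and are not part of the program discussed here.

```python
# Sketch: report how much of RAM the kernel currently counts as page cache.
# Assumes the Linux /proc/meminfo layout ("Name:   value kB").


def parse_meminfo(text):
    """Return a dict mapping /proc/meminfo field names to their numeric value."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def cached_fraction(text):
    """Fraction of MemTotal that is reported as Cached."""
    info = parse_meminfo(text)
    return info["Cached"] / info["MemTotal"]


def read_cached_fraction(path="/proc/meminfo"):
    """Read the file at `path` and return the cached fraction of RAM."""
    with open(path) as handle:
        return cached_fraction(handle.read())
```

Calling `read_cached_fraction()` at each timing checkpoint would show whether the run time rises as the fraction approaches 2/3.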

Alice_H_Intel — Thu, 25 Oct 2018 04:17:43 GMT

Hello,

Thanks for your question. I will investigate it and get back to you soon.

Thanks,

Alice

L__D__Marks — Thu, 15 Nov 2018 15:45:45 GMT

Did you find out anything?

Gennady_F_Intel — Thu, 15 Nov 2018 16:08:14 GMT

If you export MKL_VERBOSE=1, do you see the LAPACK/BLAS execution times change for the same routines and the same input problem sizes? Are you sure that no third-party process is running at the same time?
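To compare two runs, the verbose log can be reduced to a per-routine total. The following is a rough sketch, assuming each call is logged as a line of the form `MKL_VERBOSE ROUTINE(args) <time><unit> ...` with the unit being one of ns, us, ms or s; the exact layout may differ between MKL versions, and the function name `summarize` is hypothetical.

```python
import re
from collections import defaultdict

# Assumed line shape: "MKL_VERBOSE DGEMM(N,N,...) 1.50ms CNR:OFF ..."
_LINE = re.compile(r"^MKL_VERBOSE\s+(\w+)\(.*\)\s+(\d+(?:\.\d+)?)(ns|us|ms|s)\b")
_TO_SECONDS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0}


def summarize(log_text):
    """Return {routine: (call_count, total_seconds)} for a verbose log."""
    totals = defaultdict(lambda: [0, 0.0])
    for line in log_text.splitlines():
        match = _LINE.match(line.strip())
        if match is None:
            continue  # version banner or unrelated program output
        routine, value, unit = match.groups()
        entry = totals[routine]
        entry[0] += 1
        entry[1] += float(value) * _TO_SECONDS[unit]
    return {name: (count, seconds) for name, (count, seconds) in totals.items()}
```

Running this on one log captured before and one after dropping the caches would show whether the same routines with the same sizes actually take longer.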

L__D__Marks — Thu, 15 Nov 2018 16:14:39 GMT

I am 100% certain this had nothing to do with other processes (there were none). Very reproducibly, "sync ; echo 3 > /proc/sys/vm/drop_caches" improved the speed by about a factor of 1.5.

N.B., the code already has a number of timers in it.