Hello, where would I find the dll without tracing code? Currently I use one from MKL 11.1.2 (64-bit, version 5.0.2013.1126, file size 1043kB, modified 2014-01-31 12:23) and the profiler shows this (judging by OpenMP source I found, I assume that __kmp_print_storage_map_gtid is some printfoid tracing function, which eats 69.31% of the time):

Inclusive Samples %  Function Name
100.00  cs.exe
 99.25  - RtlUserThreadStart
 99.25  -- BaseThreadInitThunk
 80.50  --- __kmp_launch_worker(void *)
 80.50  ---- __kmp_launch_thread
 69.32  ----- __kmp_fork_barrier(int,int)
 69.31  ------ __kmp_print_storage_map_gtid
 11.16  ----- __kmp_invoke_task_func
 11.15  ------ __kmp_invoke_microtask
 10.85  ------- mkl_blas_dgemm
  0.19  ------- mkl_lapack_dlasr3
  0.05  ------- etc...
 18.76  __tmainCRTStartup
 18.76  - AfxWinMain(struct HINSTANCE__ *,struct HINSTANCE__ *,char *,int)

I did find:

C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\intel64\compiler
C:\Program Files (x86)\Intel\Composer XE 2013 SP1\redist\intel64\compiler

but they are the wrong ones.
The libiomp5md.dll file in the compiler redistribution folder is the right one to use. If you are seeing a performance issue, could you post some sample code so that we can check further?
(Sorry about the disappearing newlines)
I've diffed the files and they are identical:
C:\Windows\System32>fc "C:\Program Files (x86)\Intel\Composer XE 2013 SP1\redist\intel64\compiler\libiomp5md.dll" "C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\intel64\compiler\libiomp5md.dll"
Comparing files C:\PROGRAM FILES (X86)\INTEL\COMPOSER XE 2013 SP1\REDIST\INTEL64\COMPILER\libiomp5md.dll and C:\PROGRAM FILES (X86)\COMMON FILES\INTEL\SHARED LIBRARIES\REDIST\INTEL64\COMPILER\LIBIOMP5MD.DLL
FC: no differences encountered
As for the test case: profile some largish dgemm calls (1000 x 1000 matrices) and that should produce something like the above.
Could you provide some details on this? How many threads are you using to run the MKL functions? What is the hardware platform? Also, how is dgemm called there? Is it just a single dgemm call, or is it called in a loop?
When I run some simple code here, I do not see this problem.
I misread the test case before - here's what it really does (and it's 8 threads on an i7):
for (int k = 0; k < 5000; ++k) {
    v = some 999-element row vector;
    compute v' * v (via dgemm, result is a 999 x 999 matrix)
    int g = f(k); // g = 1, 2 or 3
    add the result to some matrix M
}
I'll rewrite this to make the dgemm calls nontrivial, which should make the threading overhead disappear. However, I still think it's a problem that __kmp_print_storage_map_gtid appears at all.
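One possible shape for such a rewrite, sketched in plain C++ without MKL so that it stands alone: stack the row vectors into a matrix V and form V' * V once, instead of adding up one outer product per row. The names accumulate_per_row, accumulate_batched and max_abs_diff are hypothetical, the naive loops only stand in for the dgemm calls, and the g = f(k) selector from the test case is left out. This is an illustration of the idea under those assumptions, not the actual rewrite.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Row-major dense storage; dimensions are passed separately.
using Matrix = std::vector<double>;

// Pattern from the test case: one outer product v' * v per row, added to M.
// In the real code each of these products would be its own small dgemm call.
Matrix accumulate_per_row(const Matrix& V, std::size_t rows, std::size_t n) {
    assert(V.size() == rows * n);
    Matrix M(n * n, 0.0);
    for (std::size_t k = 0; k < rows; ++k) {
        const double* v = V.data() + k * n;  // k-th row vector
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                M[i * n + j] += v[i] * v[j];
    }
    return M;
}

// Batched variant: M = V' * V as a single (n x rows) * (rows x n) product.
// With MKL this would be one large dgemm call instead of `rows` tiny ones.
Matrix accumulate_batched(const Matrix& V, std::size_t rows, std::size_t n) {
    assert(V.size() == rows * n);
    Matrix M(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < rows; ++k)
                sum += V[k * n + i] * V[k * n + j];
            M[i * n + j] = sum;
        }
    return M;
}

// Largest element-wise absolute difference between two equally sized matrices.
double max_abs_diff(const Matrix& a, const Matrix& b) {
    assert(a.size() == b.size());
    double worst = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = std::fabs(a[i] - b[i]);
        if (d > worst) worst = d;
    }
    return worst;
}
```

Both functions compute the same M; the batched form just does the work in one product, so the threading runtime forks and joins once rather than once per row. If the results really go to different matrices depending on g, the rows would presumably have to be grouped by g first, giving one product per group.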
Now the test case went from 40 seconds down to 2 seconds (which is nice :-) but the profiler still shows 46.54% of the time being spent in __kmp_print_storage_map_gtid. With a different call stack though (it's doing some eigenvalue stuff now).
Point being: A build of libiomp5md without tracing would still be nice.