Since we are using the Intel MKL library, we have to load Intel's OpenMP library (libiomp5md.dll) at run time and exclude vcomp.lib at link time, but we still have to compile and link with VC++. If I run my 64-bit release build directly, part of my code does not fully utilize the cores I specified and runs very slowly. It appears to use multiple cores, but it may be even slower than running on one core. If I attach the same release build to the Visual Studio debugger, without doing anything else, it fully utilizes the cores I specified. Does anybody have any ideas?
We are using Visual Studio 2010 on Windows 7 Professional. libiomp5md.dll shows a file version of 5.0.2012.803.
This is a very big application. The part with the issue uses OpenMP but not MKL; other parts of the application use MKL. My code uses OpenMP heavily. Most of it works well, and the code in trouble is actually very similar to the other parts.
The following links may be of use in your case, since your application mixes two OpenMP runtime libraries.
As I read the original post, the poster already recognized that vcomp.lib has to be excluded so that only a single Intel OpenMP instance is active; that runtime also supports the vcomp calls.
This raises the possibility of adjusting KMP_AFFINITY and the number of threads to improve the distribution of work across cores.
If Intel Hyper-Threading is active, MKL will use a single thread per core, but you will need to set OMP_NUM_THREADS and KMP_AFFINITY to get a similar effect from the C++ parallel regions, e.g.
to spread threads out 1 per core.
I don't know what effects might be produced by transitioning from 1 thread per core in MKL to something different in the C++ code.
If you have a two-socket platform, affinity will be particularly important.
It is hard to guess what may be happening without knowing details of the application. Does the application create threads of its own (I mean non-OpenMP threads)? If it does, then resource oversubscription is possible. Some applications gain from setting the environment variable KMP_BLOCKTIME=0, especially in the case of oversubscription, when idle-spinning OpenMP worker threads slow down the active OpenMP threads.
If the problem is different, then you can try to create a small reproducer and submit a support request.
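The KMP_BLOCKTIME=0 suggestion is normally applied by setting the variable in the shell or system environment before launching the program. As a rough sketch of an alternative, the C++ below sets it from inside the process. The helper names set_env_var and request_zero_blocktime are hypothetical, and the sketch assumes the call happens before the OpenMP runtime reads its environment, which this thread does not confirm for libiomp5md.dll.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper (not part of any OpenMP API): set an environment
// variable for the current process, overwriting any existing value.
bool set_env_var(const char* name, const char* value) {
    assert(name && value);
#ifdef _WIN32
    return _putenv_s(name, value) == 0;    // Microsoft CRT
#else
    return setenv(name, value, 1) == 0;    // POSIX
#endif
}

// Hypothetical helper: request the KMP_BLOCKTIME=0 setting suggested above.
// Assumption: this runs before the first OpenMP parallel region, so the
// runtime has not yet read its environment.
bool request_zero_blocktime() {
    return set_env_var("KMP_BLOCKTIME", "0");
}
```

Calling request_zero_blocktime() as the first statement in main() would be the natural place to try it. If it makes no difference, setting the variable outside the process is the more dependable test.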
After some trial and error, the issue is resolved. Part of my code is called repeatedly, millions of times, and it uses a few local std::vector objects of a data type whose size is in the hundreds of bytes. The memory management should be very simple compared with the complexity of the computations involved, but somehow it drags down the whole process.
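The post does not say what change fixed it. One common way to take allocation out of a hot function is to let the caller own the scratch vector and reuse it across calls, as in the sketch below. Sample, sum_with_local_vector and sum_with_reused_buffer are hypothetical stand-ins, since the real type and functions are not shown in the thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical payload standing in for the poster's data type; the real
// type is not shown. 16 doubles is typically 128 bytes.
struct Sample {
    double values[16];
};

// Pattern described in the post: a fresh local vector on every call, so
// every call allocates and frees heap memory.
double sum_with_local_vector(std::size_t n) {
    std::vector<Sample> scratch(n);
    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        scratch[i].values[0] = static_cast<double>(i);
        total += scratch[i].values[0];
    }
    return total;
}

// Alternative sketch: the caller owns the scratch buffer and passes it in.
// Once the buffer has grown to the largest size needed, resize() stays
// within the existing capacity and no longer touches the heap.
double sum_with_reused_buffer(std::size_t n, std::vector<Sample>& scratch) {
    scratch.resize(n);
    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        scratch[i].values[0] = static_cast<double>(i);  // overwrite stale data
        total += scratch[i].values[0];
    }
    return total;
}
```

In an OpenMP region, each thread would declare its own scratch vector inside the parallel region but outside the loop, so that no buffer is shared between threads. Elements left over from a previous call are not reset by resize(), so the function has to overwrite whatever it reads, as it does here.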