topic MKL and the Parallel option in Intel® oneAPI Math Kernel Library

MKL and the Parallel option

cppcoder — Thu, 24 Feb 2011 18:44:27 GMT

Hi,

I'm using Visual Studio 2008, Intel compiler v11.1, and the MKL library that comes with it. I started my project using the Sequential option for MKL, but now I want to use the Parallel option. However, when I switch to Parallel and recompile (release build), I see no performance improvement, and the executable does not use more CPUs than the sequential version does (one). I have 8 cores (EDITED) and more than 5 GB of RAM, run Windows 7 x64, and generate an x64 executable (the fp model used is precise).

In my case, I'm generating about 800k random numbers with VSL functions and then taking the log of those numbers with another VSL function. I think that such a volume of computation should benefit from parallelism. What am I doing wrong?

The only thing I change is the MKL option from Sequential to Parallel.

Thanks

EDIT: Setting the variable MKL_NUM_THREADS=4 before running my program from the command line does not change anything I stated above.

Gennady_F_Intel — Thu, 24 Feb 2011 19:15:50 GMT

That's unexpected behaviour. We need to check it on our side. Did you compare the execution time of the sequential vs. the threaded version?

cppcoder — Thu, 24 Feb 2011 19:22:46 GMT

Yes, it's approximately the same time (which I don't find surprising given that no more CPUs appear to be used).

I'm sorry, but that is not the case. I was timing other parts of my program together with the VSL functions. Now that I have isolated the time the VSL routines take, I have noticed the following (all times were measured with pairs of GetTickCount() calls):

The sequential version takes much less time (15-32 ms over several runs) than the parallel version (~1000 to ~2000 ms over several runs).
CPU usage never goes beyond 20%, even when I set MKL_NUM_THREADS to the maximum number of processors (8).
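
A note on the measurement itself: GetTickCount() typically has a resolution of about 10-16 ms, so readings of 15-32 ms are only one or two timer ticks. The following is a minimal sketch of a finer-grained timing helper. It assumes a C++11 compiler (std::chrono is not available in Visual Studio 2008), and the name time_ms is hypothetical. The untimed warm-up call is there to keep possible one-time costs, such as thread creation on the first threaded call, out of the measurement.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: average wall-clock time of work() in milliseconds.
// One untimed warm-up call runs first so that one-time setup costs
// are not counted in the average.
template <typename F>
double time_ms(F&& work, int repeats = 10) {
    assert(repeats > 0);
    using clock = std::chrono::steady_clock;

    work();  // warm-up, not timed

    const auto start = clock::now();
    for (int i = 0; i < repeats; ++i) {
        work();
    }
    const std::chrono::duration<double, std::milli> elapsed = clock::now() - start;
    return elapsed.count() / repeats;
}
```

Wrapping each VSL call sequence in a lambda and passing it to time_ms would give one averaged figure for the sequential build and one for the parallel build.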

I guess MKL is using several cores after all, but the computations I do (generating random numbers and taking their log) are not demanding enough to show a noticeable difference or to benefit from parallelism.

If you have a different take, please let me know.

Thanks.

Ilya_B_Intel — Fri, 25 Feb 2011 07:57:58 GMT

cppcoder, can you please name the exact routines you use for random number generation, including the method, and say which log function you use?

It would also help if you could provide your link line.

Seth_Sampson — Fri, 25 Feb 2011 14:37:12 GMT

I use MKL and IPP to compute FFTs on one computer (E8200, 2 cores, 2 GB RAM, Windows XP) and on another (2 x Xeon X5670, 24 cores with HT, 64 GB RAM, Windows 7 x64), but the results show no significant change. CPU usage on the Xeon X5670 machine never goes beyond 10%, so I am confused as well.

TimP — Fri, 25 Feb 2011 17:04:22 GMT

If your objective is to keep the hyperthreads busy on the Windows task manager, without caring about performance, did you read the discussions about MKL_DYNAMIC? You may be spending much of your time in MKL functions which can't use so many threads, so you would have to answer the questions about specifics before you could get expert comments.

cppcoder — Mon, 07 Mar 2011 18:48:29 GMT

Quoting Ilya Burylov (Intel)

cppcoder, can you please name the exact routines you use for random number generation, including the method, and say which log function you use?
It would also help if you could provide your link line.

Yes, I use these functions in the order specified below:

vdRngGamma

vdLog10

vdRngUniform

vdLog10

Ilya_B_Intel — Wed, 09 Mar 2011 08:36:08 GMT

cppcoder,

vdRngGamma and vdRngUniform are not threaded in MKL. The threaded function vdLog10 accounts for only around 10-15% of the overall time in this call sequence, so the benefit of its parallelization is not visible.
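
To put a number on the 10-15% figure: Amdahl's law bounds the overall speedup at 1 / ((1 - p) + p / n) when only a fraction p of the run time is spread across n threads. The following is a minimal sketch of that arithmetic. The function name amdahl_speedup is hypothetical, and the bound assumes ideal scaling with zero threading overhead, which real runs will not reach.

```cpp
#include <cassert>

// Hypothetical helper: upper bound on overall speedup (Amdahl's law).
//   parallel_fraction: share of the sequential run time that is threaded (0..1)
//   threads:           number of threads applied to that share
// Assumes perfect scaling and no threading overhead.
double amdahl_speedup(double parallel_fraction, int threads) {
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(threads >= 1);
    const double serial_fraction = 1.0 - parallel_fraction;
    return 1.0 / (serial_fraction + parallel_fraction / threads);
}
```

With p = 0.15 and 8 threads the bound is about 1.15x, and with p = 0.10 it is about 1.10x, so even a perfectly threaded vdLog10 could not shorten this call sequence by much.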

In general, threading a sequence of VML and VSL function calls is more efficient at a higher level than function by function. Threading at a higher level helps to minimize threading overhead and cache issues.

To thread VSL functions, you can use one of these techniques:

Creating independent streams
Splitting streams into blocks with the vslSkipAheadStream function
Splitting streams into several disjoint subsequences with the vslLeapfrogStream function
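
The block-splitting idea can be sketched without MKL. The example below is not MKL code: it uses the C++ standard library's std::minstd_rand, whose discard() member stands in for vslSkipAheadStream, and the function names are hypothetical. Each worker builds its own generator from the same seed, skips ahead to the start of its block, and fills only its own slice, so the combined output matches what a single generator would produce.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

using Engine = std::minstd_rand;
using Value = Engine::result_type;

// Reference: draw n values in order from a single generator.
std::vector<Value> generate_sequential(Value seed, std::size_t n) {
    Engine engine(seed);
    std::vector<Value> out(n);
    for (auto& v : out) v = engine();
    return out;
}

// Block-splitting: worker k owns the index range [k * block, (k + 1) * block).
// The loop over k runs sequentially here, but the iterations share no state,
// so each one could be handed to its own thread.
std::vector<Value> generate_in_blocks(Value seed, std::size_t n, std::size_t workers) {
    assert(workers > 0);
    std::vector<Value> out(n);
    const std::size_t block = (n + workers - 1) / workers;  // ceiling division
    for (std::size_t k = 0; k < workers; ++k) {
        const std::size_t begin = std::min(k * block, n);
        const std::size_t end = std::min(begin + block, n);
        Engine engine(seed);    // same seed in every worker
        engine.discard(begin);  // skip ahead to this worker's block
        for (std::size_t i = begin; i < end; ++i) out[i] = engine();
    }
    return out;
}
```

Because the blocks reproduce the sequential stream exactly, results do not depend on the number of workers. Leapfrogging differs only in the assignment: worker k would take elements k, k + workers, k + 2 * workers, and so on.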

See Intel Math Kernel Library Vector Statistical Library Note for details (chapter 7.3.5).