Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

Implicit multithreading?

pilot117
Beginner
1,337 Views
Hi,

Two questions about MKL functions, from my recent test program:

1. Does the MKL function use multiple cores via multithreading implicitly? I have timed "sgemv" on an 8-core CPU. My test results show it is surprisingly fast: only about twice the time of the GPU BLAS function "cublasSgemv" for large matrices (10000 by 10000).

2. If I call "sgemv" repeatedly, it becomes faster and faster. Finally, for a 2000 by 2000 matrix, it is as fast as the GPU version "cublasSgemv". But initially it is much slower than the GPU version (about 30 times). The code looks like:

for (int i = 0; i < 100; i++)
{
    // start timing
    sgemv(...);
    // end timing
}

My timer resolution is fine enough to distinguish 0.0001 ms.

Could any expert explain the reason behind it?

Many thanks in advance!
0 Kudos
11 Replies
barragan_villanueva_
Valued Contributor I
1,337 Views

Hi,

As to the first question:
Just try to set the environment variable KMP_AFFINITY=verbose or KMP_AFFINITY=verbose,compact to see the multithreading, if any.

FYI: the Intel thread affinity environment variable KMP_AFFINITY for OpenMP is explained in the Intel compiler user guide, topic "Thread Affinity Interface (Linux* and Windows*)".

Also, please search for other articles related to KMP_AFFINITY in the MKL knowledge base for more details.

--Victor
0 Kudos
Sergey_K_Intel2
Employee
1,337 Views

Hi,

As to the second question, I didn't catch which thing seemed strange:
MKL runs faster than CUDA when sizes > 2000?
sgemv from MKL runs faster as the input sizes grow, or as the number of loop iterations grows?
BTW, what was your 8-core CPU?

Thanks
0 Kudos
Vladimir_Lunev
New Contributor I
1,337 Views
Hello, pilot117,

1). Yes, by default MKL uses "number-of-physical-cores" threads.
2). Sure, fluctuations are possible. You can try changing the code to obtain more stable results:

for (int i = 0; i < 100; i++)
{
    // start timing
    for (int j = 0; j < 100; j++)
    {
        sgemv(...);
    }
    // end timing
}
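To go with this: the default thread count can also be checked or overridden from the environment. A minimal sketch (OMP_NUM_THREADS is the standard OpenMP control; MKL_NUM_THREADS is MKL's own variable and takes precedence for MKL calls):

```shell
# use 4 threads for OpenMP-threaded code in general
export OMP_NUM_THREADS=4

# or control MKL specifically (overrides OMP_NUM_THREADS for MKL routines)
export MKL_NUM_THREADS=4
```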

-Vladimir

0 Kudos
TimP
Honored Contributor III
1,337 Views
The first run is likely to be slower on account of cache misses. Running the test repeatedly minimizes the cache "warm up" effect. That may or may not be your desire.
As suggested earlier, appropriate KMP_AFFINITY settings appear likely to produce faster results earlier.
0 Kudos
pilot117
Beginner
1,337 Views

Quoting - Sergey_K_Intel2

Hi,

As to the second question, I didn't catch which thing seemed strange:
MKL runs faster than CUDA when sizes > 2000?
sgemv from MKL runs faster as the input sizes grow, or as the number of loop iterations grows?
BTW, what was your 8-core CPU?


Hi, Sergey,

MKL runs fast but NOT faster than CUDA. My GPU card is a Tesla C1060, which is powerful. In my test, when the matrix size is greater than 2000, the time of MKL sgemv is around 2 to 3 times that of cublasSgemv. But I just felt it should not be that fast; that's why I guessed it uses multiple cores.

In a loop, it did go faster and faster. Finally, for a 2000 by 2000 matrix, MKL is much faster than cublas. Very good performance!

My CPU is 2 quad-core Xeon X5560 @ 2.80GHz.

Thanks!


0 Kudos
pilot117
Beginner
1,337 Views
many thanks all for the nice information!

I have set KMP_AFFINITY=verbose, and this is the output:

KMP_AFFINITY: Affinity capable, using global cpuid instr info
KMP_AFFINITY: Initial OS proc set respected:
{0,1,2,3,4,5,6,7}
KMP_AFFINITY: 8 available OS procs - Uniform topology of
KMP_AFFINITY: 2 packages x 4 cores/pkg x 1 threads/core (8 total cores)
KMP_AFFINITY: Internal thread 0 bound to OS proc set {0,1,2,3,4,5,6,7}
KMP_AFFINITY: Internal thread 1 bound to OS proc set {0,1,2,3,4,5,6,7}
KMP_AFFINITY: Internal thread 3 bound to OS proc set {0,1,2,3,4,5,6,7}
KMP_AFFINITY: Internal thread 2 bound to OS proc set {0,1,2,3,4,5,6,7}
KMP_AFFINITY: Internal thread 5 bound to OS proc set {0,1,2,3,4,5,6,7}
KMP_AFFINITY: Internal thread 4 bound to OS proc set {0,1,2,3,4,5,6,7}
KMP_AFFINITY: Internal thread 6 bound to OS proc set {0,1,2,3,4,5,6,7}
KMP_AFFINITY: Internal thread 7 bound to OS proc set {0,1,2,3,4,5,6,7}

So I think MKL did distribute the sgemv computation among the 8 cores. I guess that is why the MKL BLAS function (sgemv, a level-2 routine) is much faster than the CUDA BLAS function.

Another question: how do I specify the number of processors to use? Using OpenMP and setting the number of threads is one way, and with it I get the desired output. But I want MKL to use only proc 3 and proc 0. I have tried this command:

setenv KMP_AFFINITY verbose,granularity=fine,proclist=[3,0],explicit

but it seems the specified parameters are not recognized:

OMP: Warning #66: KMP_AFFINITY: parameter has been specified already, ignoring "granularity=fine".
OMP: Warning #67: KMP_AFFINITY: parameter invalid, ignoring "proclist=[3".
OMP: Warning #67: KMP_AFFINITY: parameter invalid, ignoring "0]".
OMP: Warning #67: KMP_AFFINITY: parameter invalid, ignoring "explicit".

How should I modify the command so that I can control which cores are used?
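One hedged guess at those warnings: in csh the bracketed list [3,0] is also a filename-globbing pattern, so it is worth quoting the whole value so the shell passes it through verbatim. An untested suggestion (if the runtime still rejects proclist, the installed OpenMP runtime version may simply predate that modifier):

```shell
# bash/sh form, value quoted so the brackets and commas survive intact
export KMP_AFFINITY="verbose,granularity=fine,proclist=[3,0],explicit"

# csh equivalent would be:
#   setenv KMP_AFFINITY "verbose,granularity=fine,proclist=[3,0],explicit"
```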

0 Kudos
dbacchus
Beginner
1,337 Views

A bit off-topic: does anybody know if any consistent comparative benchmarks exist for BLAS 3/LAPACK between MKL and CUDA/CULA (or any other GPU library)? Matvecs and rotational transformations are less interesting for me; dense linear systems and eigenproblems are of the greatest interest...
0 Kudos
Michael_C_Intel4
Employee
1,337 Views
Quoting - dbacchus

A bit off-topic: does anybody know if any consistent comparative benchmarks exist for BLAS 3/LAPACK between MKL and CUDA/CULA (or any other GPU library)? Matvecs and rotational transformations are less interesting for me; dense linear systems and eigenproblems are of the greatest interest...

Hi,

There's a world-famous benchmark called Linpack, which solves a system of linear equations; that's probably what you need. MKL ships Linpack in its distribution, in the benchmarks/linpack folder.

Michael.
0 Kudos
barragan_villanueva_
Valued Contributor I
1,337 Views
Quoting - pilot117
Another question: how do I specify the number of processors to use? Using OpenMP and setting the number of threads is one way, and with it I get the desired output. But I want MKL to use only proc 3 and proc 0.

Hi,

On Linux there is the taskset utility for that. Run taskset --help for more details.
So, just try the following command to use only CPUs 0 and 3:

% taskset -c 0,3 <your_program>

BTW, also set OMP_NUM_THREADS=2 in the environment if MKL should run parallel on 2 threads.
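Putting the two together, a typical invocation might look like this (./my_mkl_test is a placeholder name for the benchmark binary):

```shell
# pin the process to CPUs 0 and 3, and let MKL/OpenMP use 2 threads
env OMP_NUM_THREADS=2 taskset -c 0,3 ./my_mkl_test
```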

-- Victor
0 Kudos
pilot117
Beginner
1,337 Views

Quoting - barragan_villanueva_

Hi,

On Linux there is the taskset utility for that. Run taskset --help for more details.
So, just try the following command to use only CPUs 0 and 3:

% taskset -c 0,3 <your_program>

BTW, also set OMP_NUM_THREADS=2 in the environment if MKL should run parallel on 2 threads.

-- Victor


Hi, Victor, thank you so much for this information! BTW, do you know how to profile the processors using some Unix tools? I mean, to get an idea of how the jobs are distributed among the processor cores?

0 Kudos
barragan_villanueva_
Valued Contributor I
1,337 Views
Quoting - pilot117


Hi, Victor, thank you so much for this information! BTW, do you know how to profile the processors using some Unix tools? I mean, to get an idea of how the jobs are distributed among the processor cores?


Hi,

There are many such tools. For example, from Intel:

Intel Thread Checker for Linux
Intel VTune™ Performance Analyzer for Linux


Just visit: http://software.intel.com/en-us/articles/intel-software-development-products-for-linux/
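For a quick look without Intel tools, standard Linux utilities can also show how load is spread across cores. A sketch (mpstat comes from the sysstat package and may need installing first):

```shell
# per-core utilization, refreshed every second (sysstat package)
mpstat -P ALL 1

# or run top and press '1' to toggle the per-CPU view;
# the per-core counters it reads live in /proc/stat
grep '^cpu' /proc/stat
```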
0 Kudos
Reply