Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

Implicit multithreading?

pilot117
Beginner
1,337 Views
Hi,

Two questions about MKL functions, from my recent test program:

1. Does the MKL function use multiple cores via multithreading implicitly? I have timed "sgemv" on an 8-core CPU. My test results show it is surprisingly fast: only about twice the time of the GPU BLAS function "cublasSgemv" for large matrices (10000 by 10000).

2. If I call "sgemv" repeatedly, it becomes faster and faster. Finally, for a 2000 by 2000 matrix, it is as fast as the GPU version "cublasSgemv". But initially it is much slower than the GPU version (about 30 times). The code looks like:

for (int i = 0; i < 100; i++)
{
    // start timing
    sgemv(...);
    // end timing
}

My timer resolution is fine enough to distinguish 0.0001 ms.

Could any expert explain the reason behind it?

Many thanks in advance!
0 Kudos
11 Replies
barragan_villanueva_
Valued Contributor I
1,337 Views

Hi,

As to the first question:
Just try to set the environment variable KMP_AFFINITY=verbose or KMP_AFFINITY=verbose,compact to see the multithreading, if any.

FYI: the Intel thread affinity environment variable KMP_AFFINITY for OpenMP is explained in the Intel compiler user guide, topic "Thread Affinity Interface (Linux* and Windows*)".

Also, please search for other articles related to KMP_AFFINITY in the MKL knowledge base for more details.

--Victor
0 Kudos
Sergey_K_Intel2
Employee
1,337 Views

Hi,

As to the second question, I didn't catch which thing seemed strange:
MKL runs faster than CUDA when sizes > 2000?
sgemv from MKL runs faster as the input sizes grow, or as the number of loop iterations grows?
BTW, what was your 8-core CPU?

Thanks
0 Kudos
Vladimir_Lunev
New Contributor I
1,337 Views
Hello, pilot117,

1). Yes, by default MKL uses "number-of-physical-cores" threads.
2). Sure, fluctuations are possible. You can try changing the code to obtain more stable results:

for (int i = 0; i < 100; i++)
{
    // start timing
    for (int j = 0; j < 100; j++)
    {
        sgemv(...);
    }
    // end timing
}
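To go with this: the default thread count can also be checked or overridden from the environment. A minimal sketch (OMP_NUM_THREADS is the standard OpenMP control; MKL_NUM_THREADS is MKL's own variable and takes precedence for MKL calls):

```shell
# use 4 threads for OpenMP-threaded code in general
export OMP_NUM_THREADS=4

# or control MKL specifically (overrides OMP_NUM_THREADS for MKL routines)
export MKL_NUM_THREADS=4
```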

-Vladimir

0 Kudos
TimP
Honored Contributor III
1,337 Views
The first run is likely to be slower on account of cache misses. Running the test repeatedly minimizes the cache "warm up" effect. That may or may not be your desire.
As suggested earlier, appropriate KMP_AFFINITY settings appear likely to produce faster results earlier.
0 Kudos
pilot117
Beginner
1,337 Views

Quoting - Sergey_K_Intel2

Hi,

As to the second question, I didn't catch which thing seemed strange:
MKL runs faster than CUDA when sizes > 2000?
sgemv from MKL runs faster as the input sizes grow, or as the number of loop iterations grows?
BTW, what was your 8-core CPU?


Hi, Sergey,

MKL runs fast but NOT faster than CUDA. My GPU card is a Tesla C1060, which is powerful. In my test, when the matrix size is greater than 2000, the time of MKL sgemv is around 2 to 3 times that of cublasSgemv. But I just felt it should not be that fast; that's why I guessed it uses multiple cores.

In a loop, it did go faster and faster. Finally, for a 2000 by 2000 matrix, MKL is much faster than cublas. Very good performance!

My CPU is 2 quad-core Xeon X5560 @ 2.80GHz.

Thanks!


0 Kudos
pilot117
Beginner
1,337 Views
many thanks all for the nice information!

I have set KMP_AFFINITY=verbose, and this is the output:

KMP_AFFINITY: Affinity capable, using global cpuid instr info
KMP_AFFINITY: Initial OS proc set respected:
{0,1,2,3,4,5,6,7}
KMP_AFFINITY: 8 available OS procs - Uniform topology of
KMP_AFFINITY: 2 packages x 4 cores/pkg x 1 threads/core (8 total cores)
KMP_AFFINITY: Internal thread 0 bound to OS proc set {0,1,2,3,4,5,6,7}
KMP_AFFINITY: Internal thread 1 bound to OS proc set {0,1,2,3,4,5,6,7}
KMP_AFFINITY: Internal thread 3 bound to OS proc set {0,1,2,3,4,5,6,7}
KMP_AFFINITY: Internal thread 2 bound to OS proc set {0,1,2,3,4,5,6,7}
KMP_AFFINITY: Internal thread 5 bound to OS proc set {0,1,2,3,4,5,6,7}
KMP_AFFINITY: Internal thread 4 bound to OS proc set {0,1,2,3,4,5,6,7}
KMP_AFFINITY: Internal thread 6 bound to OS proc set {0,1,2,3,4,5,6,7}
KMP_AFFINITY: Internal thread 7 bound to OS proc set {0,1,2,3,4,5,6,7}

So I think MKL did distribute the sgemv computation among the 8 cores. I guess that is why the MKL BLAS function (sgemv, a level-2 routine) is much faster than the CUDA BLAS function.

Another question: how do I specify the number of processors to use? Using OpenMP and setting the number of threads is one way, and with it I get the desired output. But I want MKL to use only proc 3 and proc 0. I have tried this command:

setenv KMP_AFFINITY verbose,granularity=fine,proclist=[3,0],explicit

but it seems the specified parameters are not recognized:

OMP: Warning #66: KMP_AFFINITY: parameter has been specified already, ignoring "granularity=fine".
OMP: Warning #67: KMP_AFFINITY: parameter invalid, ignoring "proclist=[3".
OMP: Warning #67: KMP_AFFINITY: parameter invalid, ignoring "0]".
OMP: Warning #67: KMP_AFFINITY: parameter invalid, ignoring "explicit".

How should I modify the command so that I can control which cores are used?
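One hedged guess at those warnings: in csh the bracketed list [3,0] is also a filename-globbing pattern, so it is worth quoting the whole value so the shell passes it through verbatim. An untested suggestion (if the runtime still rejects proclist, the installed OpenMP runtime version may simply predate that modifier):

```shell
# bash/sh form, value quoted so the brackets and commas survive intact
export KMP_AFFINITY="verbose,granularity=fine,proclist=[3,0],explicit"

# csh equivalent would be:
#   setenv KMP_AFFINITY "verbose,granularity=fine,proclist=[3,0],explicit"
```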

0 Kudos
dbacchus
Beginner
1,337 Views

A bit off-topic: does anybody know if any consistent comparative benchmarks exist for BLAS 3/LAPACK between MKL and CUDA/CULA (or any other GPU library)? Matvecs and rotational transformations are less interesting for me; dense linear systems and eigenproblems are of the greatest interest...
0 Kudos
Michael_C_Intel4
Employee
1,337 Views
Quoting - dbacchus

A bit off-topic: does anybody know if any consistent comparative benchmarks exist for BLAS 3/LAPACK between MKL and CUDA/CULA (or any other GPU library)? Matvecs and rotational transformations are less interesting for me; dense linear systems and eigenproblems are of the greatest interest...

Hi,

There's a world-famous benchmark called Linpack, which solves a system of linear equations; that's probably what you need. MKL ships Linpack in its distribution, in the benchmarks/linpack folder.

Michael.
0 Kudos
barragan_villanueva_
Valued Contributor I
1,337 Views
Quoting - pilot117
Another question: how do I specify the number of processors to use? Using OpenMP and setting the number of threads is one way, and with it I get the desired output. But I want MKL to use only proc 3 and proc 0.

Hi,

On Linux there is the taskset utility for that. Run taskset --help for more details.
So, just try the following command to use only CPUs 0 and 3:

% taskset -c 0,3 <your_program>

BTW, also set OMP_NUM_THREADS=2 in the environment if MKL should run parallel on 2 threads.
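Putting the two together, a typical invocation might look like this (./my_mkl_test is a placeholder name for the benchmark binary):

```shell
# pin the process to CPUs 0 and 3, and let MKL/OpenMP use 2 threads
env OMP_NUM_THREADS=2 taskset -c 0,3 ./my_mkl_test
```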

-- Victor
0 Kudos
pilot117
Beginner
1,337 Views

Quoting - barragan_villanueva_

Hi,

On Linux there is the taskset utility for that. Run taskset --help for more details.
So, just try the following command to use only CPUs 0 and 3:

% taskset -c 0,3 <your_program>

BTW, also set OMP_NUM_THREADS=2 in the environment if MKL should run parallel on 2 threads.

-- Victor


Hi, Victor, thank you so much for this information! BTW, do you know how to profile the processors using some Unix tools? I mean, to get an idea of how the jobs are distributed among the processor cores?

0 Kudos
barragan_villanueva_
Valued Contributor I
1,337 Views
Quoting - pilot117


Hi, Victor, thank you so much for this information! BTW, do you know how to profile the processors using some Unix tools? I mean, to get an idea of how the jobs are distributed among the processor cores?


Hi,

There are many such tools. For example, from Intel:

Intel Thread Checker for Linux
Intel VTune™ Performance Analyzer for Linux


Just visit: http://software.intel.com/en-us/articles/intel-software-development-products-for-linux/
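For a quick look without Intel tools, standard Linux utilities can also show how load is spread across cores. A sketch (mpstat comes from the sysstat package and may need installing first):

```shell
# per-core utilization, refreshed every second (sysstat package)
mpstat -P ALL 1

# or run top and press '1' to toggle the per-CPU view;
# the per-core counters it reads live in /proc/stat
grep '^cpu' /proc/stat
```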
0 Kudos
Reply