Hi,
Two questions about MKL functions, based on my recent test program:
1. Do MKL functions implicitly use multiple cores via multithreading? I timed "sgemv" on an 8-core CPU, and the result is surprisingly fast: for large matrices (10000 by 10000) it takes only about twice as long as the GPU BLAS function "cublasSgemv".
2. If I call "sgemv" repeatedly, it gets faster and faster. Eventually, for a 2000 by 2000 matrix, it is as fast as the GPU version, "cublasSgemv", although initially it is much slower (about 30 times). The code looks like this:
for (int i=0; i<100; i++)
{
    // start timing
    sgemv(...);
    // end timing
}
My timer resolution is fine enough to distinguish 0.0001 ms.
Could any experts explain the reason behind this?
Many thanks in advance!
11 Replies
Hi,
As to the first question:
Try setting the environment variable KMP_AFFINITY=verbose or KMP_AFFINITY=verbose,compact to see whether multithreading is in use.
FYI: the Intel thread affinity environment variable KMP_AFFINITY for OpenMP is explained in the Intel compiler user guide topic "Thread Affinity Interface (Linux* and Windows*)".
Also, please search the MKL knowledge base for other articles related to KMP_AFFINITY for more details.
--Victor
Hi,
As to the second question, I didn't catch which part seemed strange:
Is it that MKL runs faster than CUDA when sizes are > 2000?
Or that MKL sgemv runs faster as the input sizes grow, or as the number of loop iterations grows?
BTW, which 8-core CPU are you using?
Thanks
Hello, pilot117,
1). Yes, by default MKL uses "Number-of-physical-cores" threads.
2). Sure, fluctuations are possible. You can try changing the code like this to obtain more stable results:
for (int i=0; i<100; i++)
{
    // start timing
    for (int j=0; j<100; j++)
    {
        sgemv(...);
    }
    // end timing
}
-Vladimir
The first run is likely to be slower on account of cache misses. Running the test repeatedly minimizes the cache "warm up" effect. That may or may not be your desire.
As suggested earlier, appropriate KMP_AFFINITY settings appear likely to produce faster results earlier.
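The warm-up effect described above can be separated from the steady-state cost by making a few untimed calls first and then averaging over many timed calls. The following is only a minimal sketch: `naive_sgemv` is a hypothetical stand-in for the MKL call (which would need MKL headers and linking), and `now_ms` and `time_sgemv_ms` are illustrative helper names that do not appear in this thread.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical stand-in for sgemv: y = A * x for a row-major n x n matrix. */
static void naive_sgemv(int n, const float *a, const float *x, float *y)
{
    for (int i = 0; i < n; i++) {
        float sum = 0.0f;
        for (int j = 0; j < n; j++)
            sum += a[(size_t)i * n + j] * x[j];
        y[i] = sum;
    }
}

/* Monotonic wall-clock time in milliseconds. */
static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e3 + (double)ts.tv_nsec * 1e-6;
}

/* Average time per call in ms: `warmup` untimed calls, then `reps` timed
   calls. Returns -1.0 if memory allocation fails. */
double time_sgemv_ms(int n, int warmup, int reps)
{
    assert(n > 0 && warmup >= 0 && reps > 0);
    size_t nn = (size_t)n * (size_t)n;
    float *a = malloc(nn * sizeof *a);
    float *x = malloc((size_t)n * sizeof *x);
    float *y = malloc((size_t)n * sizeof *y);
    if (!a || !x || !y) {
        free(a); free(x); free(y);
        return -1.0;
    }
    for (size_t k = 0; k < nn; k++) a[k] = 1.0f;
    for (int k = 0; k < n; k++) x[k] = 1.0f;

    volatile float sink = 0.0f; /* keeps the compiler from discarding the work */

    /* Untimed calls: caches get populated and any thread pool gets created. */
    for (int r = 0; r < warmup; r++) {
        naive_sgemv(n, a, x, y);
        sink += y[0];
    }

    /* Timed calls: one timer interval around the whole batch. */
    double t0 = now_ms();
    for (int r = 0; r < reps; r++) {
        naive_sgemv(n, a, x, y);
        sink += y[0];
    }
    double elapsed = now_ms() - t0;

    free(a); free(x); free(y);
    return elapsed / reps;
}
```

Calling `time_sgemv_ms(2000, 0, 1)` and `time_sgemv_ms(2000, 5, 100)` gives a rough cold versus warm comparison; with the real library, the stand-in would be replaced by the actual sgemv call.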
Quoting - Sergey Kazakov (Intel)
Hi,
As to the second question, I didn't catch which part seemed strange:
Is it that MKL runs faster than CUDA when sizes are > 2000?
Or that MKL sgemv runs faster as the input sizes grow, or as the number of loop iterations grows?
BTW, which 8-core CPU are you using?
Thanks
Hi, Sergey,
MKL runs fast but NOT faster than CUDA. My GPU card is a Tesla C1060, which is powerful. In my test, when the matrix size is greater than 2000, MKL sgemv takes around 2 to 3 times as long as cublasSgemv. I felt it should not be that fast, which is why I guessed it uses multiple cores.
In a loop, it did get faster and faster. Eventually, for a 2000 by 2000 matrix, MKL is much faster than cuBLAS. Very good performance!
My CPU is 2 quad-core Xeon X5560 @ 2.80GHz.
Thanks!
Many thanks to all for the helpful information!
I have set KMP_AFFINITY=verbose, and this is the output:
KMP_AFFINITY: Affinity capable, using global cpuid instr info
KMP_AFFINITY: Initial OS proc set respected:
{0,1,2,3,4,5,6,7}
KMP_AFFINITY: 8 available OS procs - Uniform topology of
KMP_AFFINITY: 2 packages x 4 cores/pkg x 1 threads/core (8 total cores)
KMP_AFFINITY: Internal thread 0 bound to OS proc set {0,1,2,3,4,5,6,7}
KMP_AFFINITY: Internal thread 1 bound to OS proc set {0,1,2,3,4,5,6,7}
KMP_AFFINITY: Internal thread 3 bound to OS proc set {0,1,2,3,4,5,6,7}
KMP_AFFINITY: Internal thread 2 bound to OS proc set {0,1,2,3,4,5,6,7}
KMP_AFFINITY: Internal thread 5 bound to OS proc set {0,1,2,3,4,5,6,7}
KMP_AFFINITY: Internal thread 4 bound to OS proc set {0,1,2,3,4,5,6,7}
KMP_AFFINITY: Internal thread 6 bound to OS proc set {0,1,2,3,4,5,6,7}
KMP_AFFINITY: Internal thread 7 bound to OS proc set {0,1,2,3,4,5,6,7}
So I think MKL did distribute the sgemv computation among the 8 cores. I guess that is why the MKL BLAS 1 function is much faster than the CUDA BLAS 1 function.
Another question: how do I specify which processors to use? Using OpenMP and setting the number of threads is one way, and it gives me the desired output. But I want MKL to use only proc 3 and proc 0. I tried this command:
setenv KMP_AFFINITY verbose,granularity=fine,proclist=[3,0],explicit
but it seems the specified parameters are not recognized:
OMP: Warning #66: KMP_AFFINITY: parameter has been specified already, ignoring "granularity=fine".
OMP: Warning #67: KMP_AFFINITY: parameter invalid, ignoring "proclist=[3".
OMP: Warning #67: KMP_AFFINITY: parameter invalid, ignoring "0]".
OMP: Warning #67: KMP_AFFINITY: parameter invalid, ignoring "explicit".
How should I modify the command so that I can control which cores are used?
A bit off-topic: does anybody know whether any consistent comparative benchmarks exist for BLAS 3/LAPACK between MKL and CUDA/CULA (or any other GPU library)? Matvecs and rotational transformations are less interesting to me. Dense linear systems and eigenproblems are of great interest...
Quoting - dbacchus
A bit off-topic: does anybody know whether any consistent comparative benchmarks exist for BLAS 3/LAPACK between MKL and CUDA/CULA (or any other GPU library)? Matvecs and rotational transformations are less interesting to me. Dense linear systems and eigenproblems are of great interest...
Hi,
There is a well-known benchmark called Linpack, which solves a system of linear equations; that is probably what you need. The MKL distribution includes Linpack in the benchmarks/linpack folder.
Michael.
Quoting - pilot117
Another question: how do I specify which processors to use? Using OpenMP and setting the number of threads is one way, and it gives me the desired output. But I want MKL to use only proc 3 and proc 0.
Hi,
On Linux, the taskset utility exists for that. Run taskset --help for more details.
So, try the following command to use only CPUs 0 and 3, followed by the program you want to run:
% taskset -c 0,3 <your_program>
BTW, also set OMP_NUM_THREADS=2 in the environment if MKL should run in parallel on 2 threads.
-- Victor
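The same restriction can also be applied from inside the program. The sketch below uses the Linux call `sched_setaffinity`, which is the mechanism taskset relies on; `pin_to_cpus` and `allowed_cpu_count` are hypothetical helper names introduced here for illustration, and the code assumes Linux with glibc.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/* Restrict the calling thread to the listed CPUs. Threads created afterwards
   inherit the mask. Returns 0 on success, -1 on failure (errno is set). */
int pin_to_cpus(const int *cpus, size_t count)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (size_t i = 0; i < count; i++) {
        assert(cpus[i] >= 0 && cpus[i] < CPU_SETSIZE);
        CPU_SET(cpus[i], &set);
    }
    /* pid 0 means "the calling thread". */
    return sched_setaffinity(0, sizeof(set), &set);
}

/* Number of CPUs the calling thread may run on, or -1 on error. */
int allowed_cpu_count(void)
{
    cpu_set_t set;
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    return CPU_COUNT(&set);
}
```

For the case discussed here, the call would be `int cpus[] = {0, 3}; pin_to_cpus(cpus, 2);`, made at the start of main before the first MKL or OpenMP call so that worker threads created later inherit the mask. Setting OMP_NUM_THREADS=2 is still needed to limit the thread count itself.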
Quoting - Victor Pasko (Intel)
Hi,
On Linux, the taskset utility exists for that. Run taskset --help for more details.
So, try the following command to use only CPUs 0 and 3, followed by the program you want to run:
% taskset -c 0,3 <your_program>
BTW, also set OMP_NUM_THREADS=2 in the environment if MKL should run in parallel on 2 threads.
-- Victor
Hi Victor, thank you so much for this information! Btw, do you know how to profile the processors using Unix tools? I mean, how can I get an idea of how the jobs are distributed among the processor cores?
Quoting - pilot117
Hi Victor, thank you so much for this information! Btw, do you know how to profile the processors using Unix tools? I mean, how can I get an idea of how the jobs are distributed among the processor cores?
Hi,
There are many such tools. For example, from Intel:
Intel Thread Checker for Linux
Intel VTune(TM) Performance Analyzer for Linux
Just visit: http://software.intel.com/en-us/articles/intel-software-development-products-for-linux/