I am implementing a code (an iterative matrix solver, in Fortran) which basically works like this (pseudocode, with a rough Fortran sketch below):



while (error > tolerance AND iterations < max_iterations)
{
    a few DGEMV calls
    a few DDOT calls
    a few DNRM2 calls
}
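In Fortran, the body of that loop looks roughly like the sketch below (variable names are illustrative and the update is simplified -- the real QMR recurrences are more involved than this; it just shows where the BLAS calls sit):

subroutine solver_sketch(n, A, b, x, tol, itmax)
  implicit none
  integer, intent(in) :: n, itmax
  double precision, intent(in)    :: A(n,n), b(n), tol
  double precision, intent(inout) :: x(n)
  double precision :: r(n), Ax(n), err, rho
  double precision, external :: ddot, dnrm2
  integer :: it

  it  = 0
  err = huge(err)
  do while (err > tol .and. it < itmax)
     call dgemv('N', n, n, 1.0d0, A, n, x, 1, 0.0d0, Ax, 1)   ! Ax = A*x        (DGEMV)
     r   = b - Ax                                             ! residual
     rho = ddot(n, r, 1, r, 1)                                ! an inner product (DDOT)
     err = dnrm2(n, r, 1)                                     ! residual norm    (DNRM2)
     ! ... the actual QMR search-direction and solution updates go here ...
     it  = it + 1
  end do
end subroutine solver_sketch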
The problem is as follows:
1. The time required to solve increases erratically as I increase the number of threads.
I put it in a loop that works like this (not the actual script):
export MKL_DYNAMIC=FALSE
for i in 1 2 3 4 5 6 7 8; do
    export MKL_NUM_THREADS=$i
    time ./a.out
done
CPU usage always matches the number of threads (for instance, 200% for 2 threads, 800% for 8 threads), but the lowest time (and hence the best performance) is observed at 2 threads, which is bothering me.
2. I am linking my program as:
ifort -fp-model source ParaQMR.f90 -xSSE3 -lmkl_blas95 -lmkl_intel -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -m32 && time ./a.out
(I can attach the source code but it will be of no real use since the code reads data from a 2 GB text file which obviously I can't upload)
My hardware is an Intel Xeon with 8 cores / 8 threads, running 32-bit Ubuntu 10.04 (I know 32-bit is a mistake, but I doubt it makes a difference here).
3. I tried VTune, and although I am not a computer science major or anything, I did my best to interpret the results. What I gather from the graph below is that most of the time is spent creating and destroying threads each time a new iteration starts, but I am not sure. Is there any way to circumvent this problem?
I tried everything, but nothing seems to work. I have tried:
1. Using my own matrix-vector multiplication code. It performs far worse than MKL BLAS.
2. Intel Inspector. I couldn't make head or tail of what was happening.
3. Experimenting with different flags and with a different compiler (gfortran with -fopenmp instead). It didn't work.
(By "didn't work" I mean it didn't produce the decrease in time I expected.)




Please help me out and let me know if I should upload more VTune Graphs.
8 Replies
The following comments might be more topical, and might get more expert suggestions, on the MKL forum:
By setting MKL_DYNAMIC=FALSE, I believe you are disabling MKL's own effort to place threads efficiently, so you should be setting KMP_AFFINITY (or the MKL equivalent) explicitly for each number of threads, according to your platform: probably 1 thread per CPU on a dual-CPU box when running 2 threads, 1 thread per cache on a CPU with split caches, and never 2 threads per core while other cores are idle (see the example at the end of this post).
As you are using dgemv, you might expect performance to drop as soon as you run 2 threads each on 1 or more cores; are you trying to quantify that?
If your cache footprint is very large, it is possible to see your performance peak as soon as you have enough threads to use all of the cache. In such a case, VTune cache events could clarify it.
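As a starting point for the affinity setting, something along these lines (exact syntax depends on your compiler and OpenMP runtime version, so treat it as a sketch):
export KMP_AFFINITY=verbose,granularity=fine,scatter
Here "scatter" places successive threads on different cores before doubling up on any core, and "verbose" prints the resulting binding so you can check the placement.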
I used KMP_AFFINITY and set it as:
export MKL_DYNAMIC=FALSE
export MKL_NUM_THREADS=8
export KMP_AFFINITY="verbose,granularity=fine,proclist=[0,1,2,3,4,5,6,7],explicit"
ifort blah blah
It didn't help. As far as I know, cache was being managed optimally and a second thread was not started on any core if another core was sitting idle.
I still believe that what is happening is:
it = 0
create threads > carry out dgemv > destroy threads
create threads > carry out ddot > destroy threads
create threads > carry out dnrm > destroy threads
it = 1
create threads > carry out dgemv > destroy threads
.
.
.
Can this be the problem? If yes, can it be avoided?
Hi,
Have you tried export MKL_DYNAMIC=TRUE?
It lets MKL choose a good number of threads for the problem. As Tim noted, for the DGEMV and DDOT functions, increasing the number of threads may not improve performance. If MKL_DYNAMIC is FALSE, MKL is forced to use exactly the thread count you set.
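(You can also control this from inside the program via the MKL service routines instead of environment variables; a rough, untested sketch:)
      call mkl_set_dynamic(1)          ! same effect as MKL_DYNAMIC=TRUE: let MKL pick the thread count
      ! or, to force a fixed count:
      ! call mkl_set_dynamic(0)        ! MKL_DYNAMIC=FALSE
      ! call mkl_set_num_threads(2)    ! MKL_NUM_THREADS=2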
Thanks,
Chao
Hi,
The overhead of forking and joining OpenMP threads can be significant if the data volume in the MKL functions is not big enough.
Is SMT (Hyper-Threading) on? In that case 8 CPUs would mean 4 cores with Hyper-Threading. Please clarify.
I didn't get your point on joining OMP threads. I have a 10000x10000 matrix being multiplied with a 10000x1 vector.
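(If I have the arithmetic right, that matrix alone is 10000 x 10000 x 8 bytes, i.e. about 800 MB of double-precision data, so each DGEMV has to stream far more data than any cache can hold.)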
I don't think SMT is on (Googling it only gave me results about BIOS settings, which I have sworn never to fiddle with). The only settings I have tweaked are the ones mentioned: MKL_NUM_THREADS, MKL_DYNAMIC, and more recently KMP_AFFINITY.
Also, MKL_DYNAMIC=TRUE uses 2 threads most of the time, which is irritating, since my goal is a code that uses the maximum system resources.
To see the SMT setting from libiomp5, please add `verbose' to your KMP_AFFINITY setting:
KMP_AFFINITY=verbose,$KMP_AFFINITY
If a logical thread is bound to more than one CPU (likely two), then SMT is ON.
Also please look at related MKL articles/discussions:
http://software.intel.com/en-us/articles/intel-math-kernel-library-intel-mkl-intel-mkl-100-threading/
http://software.intel.com/en-us/articles/setting-thread-affinity-on-smt-or-ht-enabled-systems/
http://software.intel.com/en-us/forums/showthread.php?t=68141
BTW, if Hyper-Threading Technology is enabled on the system, it is recommended to set the number of threads equal to the number of real processors or cores, i.e. only half the number of logical processors.
Just one more question:
Will changing my code to C benefit the speed in any way?
Quoting nunoxic
Will changing my code to C benefit the speed in any way?
Unlikely from what you have said so far.
