I work with a large ccNUMA SGI Altix system. One of our users is trying to benchmark some LAPACK routines on our system and is getting disappointing scaling: performance stops improving beyond 4 threads.
The test I am running is of diagonalizing a 4097x4097 matrix of double precision floats. It uses the routine DSYEV.
From analysing the hotspots in VTune, I find that almost all the time is spent in overhead and spin time from the functions:
[OpenMP dispatcher]<- pthread_create_child and in [OpenMP fork].
The code was compiled using ifort with the options: -O3 -openmp -g -traceback -xHost -align -ansi-alias -mkl=parallel. I am using version 13.1.0.146 of the compiler and version 11 of MKL. The system is made up of 8-core Xeon Sandy Bridge sockets.
The code was run with the environment variables:
OMP_NUM_THREADS=16
MKL_NUM_THREADS=16
KMP_STACKSIZE=2gb
OMP_NESTED=FALSE
MKL_DYNAMIC=FALSE
KMP_LIBRARY=turnaround
KMP_AFFINITY=disabled
It is also run with the SGI command for NUMA systems 'dplace -x2', which locks the threads to their cores.
So I suspect that there is something up with the options for the MKL, or the library isn't configured properly for our system. I have attached the code used.
Does anybody have any ideas on this?
Jim
MKL matrix multiply is hand-optimized to make effective use of multiple threads. For matrix dimensions of 4096, it can effectively use at least 244 threads on the Intel(R) Xeon Phi(TM). That version of MKL won't perform efficiently on matrices with dimensions less than 32, but host MKL, or source code compiled with OpenMP, can effectively use a number of threads matched to the problem size. For a problem so small that 4 threads would be the limit, single-threaded in-line expansion, e.g. Fortran MATMUL, should be better than launching a threaded job, e.g. via MKL.
By the way, the ifort -opt-matmul option (MKL support for MATMUL) isn't available on Intel(R) Xeon Phi(TM). What is available is "automatic offload", where MKL function calls on the host are executed on the coprocessor, subject to an environment variable and a sufficiently large problem size.
Okay, TimP.
Xeon Phi cards aside, is it true that a routine like DSYEV is only parallel up to 4 threads in MKL on Xeon processors?
Also, let us not forget that my main problem is that when running MKL DSYEV on my system, even on 4 threads, all the threads spend most of the time idle and only about 1% of the time in DSYEV. I still don't know why this is. When you run my program with 4 threads on your machines through VTune hotspots, does it also show that the threads are idle 99% of the time?
On a 4-core platform, MKL defaults to 4 threads even if 8 logical processors are visible.
Hi James,
No, my guess was wrong: MKL has no such limitation for this function in version 11. (We did have one for small data sizes before.)
I ran a test on one machine with Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz, 2 packages of 8 cores each, HT disabled, for a total of 2x8=16 threads,
run with matrix 4096x4096
export MKL_NUM_THREADS=2
real 27.0s ; cpu 53.9s
export MKL_NUM_THREADS=4
real 15.2s ; cpu 60.7s
export MKL_NUM_THREADS=8
real 9.5s ; cpu 75.8s
export MKL_NUM_THREADS=16
real 7.4s ; cpu 117.2s
So the problem should not be here.
Best Regards,
Ying
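To make the scaling in these timings easier to read, here is a minimal sketch that turns the wall-clock numbers above into speedup and parallel efficiency. No single-thread time was posted, so the 2-thread run is taken as the baseline; that choice, and the names `real_seconds` and `relative_scaling`, are assumptions made for illustration only.

```python
# Wall-clock ("real") seconds from the post above, keyed by MKL_NUM_THREADS.
real_seconds = {2: 27.0, 4: 15.2, 8: 9.5, 16: 7.4}


def relative_scaling(real_by_threads):
    """Return {threads: (speedup, efficiency)} relative to the smallest thread count.

    Assumption: the smallest thread count measured is used as the baseline,
    because no single-thread timing is available.
    """
    base_threads = min(real_by_threads)
    base_time = real_by_threads[base_threads]
    result = {}
    for threads in sorted(real_by_threads):
        speedup = base_time / real_by_threads[threads]
        ideal = threads / base_threads  # perfect scaling from the baseline
        result[threads] = (speedup, speedup / ideal)
    return result


scaling = relative_scaling(real_seconds)
for threads, (speedup, efficiency) in scaling.items():
    print(f"{threads:2d} threads: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

Relative to 2 threads, the 16-thread run is roughly 3.6x faster against an ideal of 8x, so the routine keeps gaining from extra threads on this machine, but with diminishing returns.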
Okay, it's good to hear that MKL routines can use more than 4 threads!
Thanks for your timings; they are a good comparison and help push me closer to the source of the problem. I ran the same thing on our system as you did, Ying, and here are my timings:
run with matrix 4096x4096
export MKL_NUM_THREADS=2
real 27.3s ; cpu 54.2s
export MKL_NUM_THREADS=4
real 15.9s ; cpu 62.3s
export MKL_NUM_THREADS=8
real 11.3s ; cpu 88.4s
export MKL_NUM_THREADS=16
real 13.6s ; cpu 212.6s
These were run with the other options:
OMP_NUM_THREADS= # 2,4,8,16
MKL_NUM_THREADS= # 2,4,8,16
KMP_STACKSIZE=2gb
OMP_NESTED=FALSE
MKL_DYNAMIC=FALSE
KMP_LIBRARY=turnaround
KMP_AFFINITY=disabled
So our results agree up to 8 threads. At 16, however, things start to look different on my machine. With 16 threads, the run spans two sockets on my machine. The sockets are connected via a NUMA link, unlike your machine, where your two packages will have uniform access to memory.
So basically this MKL routine doesn't seem to scale beyond a single socket on our machine, which is the problem the user reported. Please comment and suggest what I should do next.
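One way to put the two sets of timings side by side is to look at how total CPU time grows with the thread count. The sketch below uses only the numbers posted in this thread; the dictionary names and the `cpu_inflation` helper are hypothetical, and the 2-thread run is again assumed as the baseline.

```python
# (real, cpu) seconds keyed by MKL_NUM_THREADS, copied from the two posts.
ying = {2: (27.0, 53.9), 4: (15.2, 60.7), 8: (9.5, 75.8), 16: (7.4, 117.2)}
jim = {2: (27.3, 54.2), 4: (15.9, 62.3), 8: (11.3, 88.4), 16: (13.6, 212.6)}


def cpu_inflation(timings):
    """Return {threads: cpu_time / cpu_time_at_smallest_thread_count}.

    A value near 1.0 means the extra threads share the same total work;
    larger values mean CPU time is going to something other than that work.
    """
    base_cpu = timings[min(timings)][1]
    return {threads: cpu / base_cpu for threads, (_, cpu) in sorted(timings.items())}


ying_inflation = cpu_inflation(ying)
jim_inflation = cpu_inflation(jim)
for threads in sorted(jim):
    print(
        f"{threads:2d} threads: CPU time x{ying_inflation[threads]:.2f} (E5-2680 box), "
        f"x{jim_inflation[threads]:.2f} (Altix)"
    )
```

At 16 threads the Altix run burns about 3.9x the CPU time of its 2-thread run, against about 2.2x on the E5-2680 machine. CPU time alone cannot say whether the extra time is spin-waiting or slower memory access across sockets, so this only narrows down where to look with VTune.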
Sergey,
Sorry, but your tests only demonstrate that your Diag.exe does not scale beyond 4 threads at all. (Times with 8, 16 and 32 threads are the same as with 4 threads.) That is probably because you only seem to have 8 cores in the system. And 100% utilisation of all cores does not say anything about issues with NUMA; idle spinning can create that just as well.
As MKL can use the resources of the 4 cores fully with 1 thread per core, it's hardly surprising that more threads don't improve performance.