Intel® oneAPI Math Kernel Library

Benchmarking MKL LAPACK on a ccNUMA system

james_B_8
Beginner

I work with a large ccNUMA SGI Altix system. One of our users is trying to benchmark some LAPACK routines on our system and is getting disappointing scaling: performance stops improving beyond 4 threads.

The test I am running diagonalizes a 4097x4097 matrix of double-precision floats using the routine DSYEV.
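For reference, the core of the test looks roughly like this (an illustrative sketch, not the attached code itself; the random symmetric matrix and the timing calls are mine):

program dsyev_bench
    use omp_lib
    implicit none
    integer, parameter :: n = 4097
    double precision, allocatable :: a(:,:), w(:), work(:)
    double precision :: t0, t1
    integer :: lwork, info
    allocate(a(n,n), w(n), work(1))
    call random_number(a)
    a = a + transpose(a)                                ! symmetric test matrix
    call dsyev('V', 'U', n, a, n, w, work, -1, info)    ! workspace query
    lwork = int(work(1))
    deallocate(work); allocate(work(lwork))
    t0 = omp_get_wtime()
    call dsyev('V', 'U', n, a, n, w, work, lwork, info) ! the timed solve
    t1 = omp_get_wtime()
    print '(a,f8.1,a)', ' real ', t1 - t0, 's'
end program dsyev_bench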

From analysing the hotspots in VTune, I find that almost all the time is spent in overhead and spin time in the functions:

[OpenMP dispatcher] <- pthread_create_child, and [OpenMP fork].

The code was compiled with ifort using the options: -O3 -openmp -g -traceback -xHost -align -ansi-alias -mkl=parallel, with compiler version 13.1.0.146 and MKL version 11. The system is made up of 8-core Xeon Sandy Bridge sockets.

The code was run with the environment variables:

OMP_NUM_THREADS=16
MKL_NUM_THREADS=16
KMP_STACKSIZE=2gb
OMP_NESTED=FALSE
MKL_DYNAMIC=FALSE
KMP_LIBRARY=turnaround
KMP_AFFINITY=disabled

It is also run with the SGI NUMA placement command 'dplace -x2', which pins the threads to their cores.

So I suspect that something is wrong with the MKL options, or that the library isn't configured properly for our system. I have attached the code used.

Does anybody have any ideas on this?

Jim

TimP
Honored Contributor III

MKL matrix multiply is hand-optimized to maximize effective use of multiple threads.  For matrix dimensions of 4096, it can effectively use at least 244 threads on the Intel® Xeon Phi™.  That version of MKL won't perform efficiently on matrices with dimensions less than 32, but it is possible to use a number of threads corresponding to the problem size effectively with host MKL, or by compiling from source code with OpenMP.  For a problem so small that 4 threads would be the limit, single-threaded in-line expansion, e.g. Fortran MATMUL, should be better than launching a threaded job, e.g. via MKL.
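To make the last point concrete, here is a minimal sketch of the two alternatives (sizes illustrative):

program small_vs_threaded
    implicit none
    integer, parameter :: ns = 8
    double precision :: a(ns,ns), b(ns,ns), c(ns,ns)
    call random_number(a); call random_number(b)
    c = matmul(a, b)      ! tiny case: single-threaded, candidate for inlining
    ! For large n (thousands), the threaded MKL call is the right tool:
    ! call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
    print *, sum(c)
end program small_vs_threaded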

By the way, ifort's -opt-matmul (MKL support for MATMUL) isn't available on the Intel® Xeon Phi™.  What is available is "automatic offload", where MKL function calls on the host are executed on the coprocessor, subject to an environment variable and sufficiently large problem size.
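If I remember correctly, the controlling variable is the following (worth verifying against the MKL 11 documentation):

MKL_MIC_ENABLE=1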

james_B_8
Beginner

Okay, TimP.

Xeon Phi cards aside - is it true that a routine like DSYEV is only parallel up to 4 threads in MKL on Xeon processors?

Also, let us not forget that my main problem is that when running MKL DSYEV on my system, even on 4 threads, all the threads spend most of the time idle and only about 1% of the time in DSYEV. I still don't know why this is. When you run my program with 4 threads on your machines through VTune hotspots, does it also show the threads idle 99% of the time?

SergeyKostrov
Valued Contributor II
>>...when running MKL DSYEV on my system even on 4 threads, all the threads spend most of the time idle and about 1% of the time in DSYEV. I still don't know why this is...

I'll repeat the tests on my Ivy Bridge with 4 CPUs and provide additional technical details for comparison.

Note: It looks like the processing in that case is memory- or I/O-bound, not CPU-bound.
TimP
Honored Contributor III

On a 4-core platform, MKL defaults to 4 threads even if 8 logical processors are visible.

SergeyKostrov
Valued Contributor II
>>...on a 4 core platform mkl defaults to 4 threads even if 8 logical are visible...

I do not confirm this (for a 64-bit Windows platform / non-NUMA) and I'll provide lots of technical details as soon as all my verifications are completed.
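A quick way to check what MKL would use by default on any box (a minimal sketch using MKL's service function mkl_get_max_threads):

program check_defaults
    use omp_lib
    implicit none
    integer, external :: mkl_get_max_threads
    print *, 'logical processors visible:', omp_get_num_procs()
    print *, 'MKL default thread count  :', mkl_get_max_threads()
end program check_defaults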
SergeyKostrov
Valued Contributor II
>>...we have a spare 1800 threads that apparently can never be utilized by MKL...

Actually, you can utilize them, but with a different method, which I call Application Based Partitioning (ABP).
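A minimal sketch of the idea (my simplified interpretation: each OpenMP thread runs a sequential MKL call on its own independent partition; sizes and names are illustrative):

program abp_sketch
    implicit none
    integer, parameter :: n = 512, nprob = 16, lwork = 3*n
    double precision, allocatable :: a(:,:,:), w(:,:), work(:,:)
    integer :: i, info
    allocate(a(n,n,nprob), w(n,nprob), work(lwork,nprob))
    call random_number(a)
    call mkl_set_num_threads(1)   ! MKL stays sequential inside the region
!$omp parallel do private(info)
    do i = 1, nprob
        a(:,:,i) = a(:,:,i) + transpose(a(:,:,i))       ! symmetrize
        call dsyev('N', 'U', n, a(1,1,i), n, w(1,i), work(1,i), lwork, info)
    end do
!$omp end parallel do
    print *, 'smallest eigenvalue of partition 1:', w(1,1)
end program abp_sketch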
Ying_H_Intel
Employee

Hi James, 

No, my guess was wrong: MKL has no such limitation for the function in version 11. (We had done this for small data sizes before.)

I did a test on one machine: Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz, 2 packages, 8 cores each, HT disabled, so 2x8=16 threads in total.

run with matrix 4096x4096

export MKL_NUM_THREADS=2

 real      27.0s ; cpu      53.9s

export MKL_NUM_THREADS=4

 real      15.2s ; cpu      60.7s

export MKL_NUM_THREADS=8

 real       9.5s ; cpu      75.8s

export MKL_NUM_THREADS=16

 real       7.4s ; cpu     117.2s

So the problem should not be here.

Best Regards,

Ying

james_B_8
Beginner

Okay, that's good to hear that MKL routines can use more than 4 threads!

Thanks for your timings; they are a good comparison and help push me closer to the source of the problem. I ran the same thing on our system as you did, Ying, and here are my timings:

run with matrix 4096x4096

export MKL_NUM_THREADS=2

real      27.3s ; cpu      54.2s

export MKL_NUM_THREADS=4

real      15.9s ; cpu      62.3s

export MKL_NUM_THREADS=8

real      11.3s ; cpu      88.4s

export MKL_NUM_THREADS=16

real      13.6s ; cpu     212.6s

These were run with the other options:

OMP_NUM_THREADS= # 2,4,8,16
MKL_NUM_THREADS= # 2,4,8,16
KMP_STACKSIZE=2gb
OMP_NESTED=FALSE
MKL_DYNAMIC=FALSE
KMP_LIBRARY=turnaround
KMP_AFFINITY=disabled

So our results agree up to 8 threads. At 16, however, things start to look different on my machine: 16 threads span two sockets, and the sockets are connected via a NUMA link, unlike your machine, where the two packages have uniform access to memory.

So basically this MKL routine doesn't seem to scale beyond a single socket on our machine, which is the problem the user reported. Could you comment and suggest what I should try next?
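One thing I plan to check (a standard ccNUMA consideration, sketch below; 'mat' is just an illustrative name) is whether all the matrix pages end up on the first socket because a single thread initializes them - on Linux, the first write decides which node a page lands on:

integer :: i, j
!$omp parallel do schedule(static) private(i)
do j = 1, n
    do i = 1, n
        mat(i, j) = 0.0d0   ! first touch places the page on this thread's node
    end do
end do
!$omp end parallel do
! ...then read the real data into mat and call dsyev as before.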


SergeyKostrov
Valued Contributor II
Here are the results of another set of tests:

[ 4 OMP & KMP threads ]

C:\WuTemp\FortTestApp1\x64\Release>Diag.exe
Read the Hamilton-matrix...
allocation of mat of 16000x16000
Read the Hamilton-matrix... ...end!
Diagonalization with dsyev:
real 1466.5s ; cpu 11263.2s
...done! FIN!

Note: Total number of Win32 threads used during processing was 64 (plus 1 thread for the main process).

[ 16 OMP & KMP threads ]

C:\WuTemp\FortTestApp1\x64\Release>Diag.exe
Read the Hamilton-matrix...
allocation of mat of 16000x16000
Read the Hamilton-matrix... ...end!
Diagonalization with dsyev:
real 1435.0s ; cpu 11043.2s
...done! FIN!

Note: Total number of Win32 threads used during processing was 64 (plus 1 thread for the main process).

[ 32 OMP & KMP threads ]

Read the Hamilton-matrix...
allocation of mat of 16000x16000
Read the Hamilton-matrix... ...end!
Diagonalization with dsyev:
real 1469.4s ; cpu 11306.9s
...done! FIN!

Note: Total number of Win32 threads used during processing was 64 (plus 1 thread for the main process).
SergeyKostrov
Valued Contributor II
Command line options:

/nologo /O3 /QaxAVX /QxAVX /Qparallel /heap-arrays1024 /Qopt-matmul- /arch:AVX /fp:fast=2 /module:"x64\Release\\" /object:"x64\Release\\" /Fd"x64\Release\vc90.pdb" /libs:static /threads /Qmkl:parallel /c

[ Screenshot 1 ] diagtestapp1.jpg
SergeyKostrov
Valued Contributor II
[ Screenshot 2 ] diagtestapp2.jpg
SergeyKostrov
Valued Contributor II
Number of OMP and KMP threads vs. calculation time (4 CPUs / 8 cores):

04 - calculated time (in seconds): ~338
08 - calculated time (in seconds): ~338
16 - calculated time (in seconds): ~330
32 - calculated time (in seconds): ~329
64 - failed to calculate; the errors are as follows:

OMP: Error #136: Cannot create thread.
OMP: System error #1455: The paging file is too small for this operation to complete.
OMP: Error #178: Function GetExitCodeThread() failed:
OMP: System error #6: The handle is invalid.

(If KMP_STACKSIZE=2gb from the original post was still in effect, 64 threads would try to reserve on the order of 128 GB of stack, which would explain the paging-file error.)
SergeyKostrov
Valued Contributor II
[ Screenshot 3 - OMP Errors: 136, 1455 and 178 ] pagingfiletoosmall.jpg
SergeyKostrov
Valued Contributor II
Even though the previous post is not directly related to the subject of this thread, I'll provide a reproducer and instructions on how the problem can be reproduced.
SergeyKostrov
Valued Contributor II
>>...on a 4 core platform mkl defaults to 4 threads even if 8 logical are visible...

Tim, as you can see in the screenshots in my previous posts, 64 worker threads were created, plus one thread for the main application (65 in total).

>>...when running MKL DSYEV on my system even on 4 threads, all the threads spend most of the time idle and about 1% of the time in DSYEV...

James, utilization of all 8 logical cores was ~100%, and that is simply proof that there is some issue with NUMA.
a_kaliazin
Beginner

Sergey,

Sorry, but your tests only demonstrate that your Diag.exe does not scale beyond 4 threads at all (the times with 8, 16 and 32 threads are the same as with 4), probably because you only have 8 cores in the system. And 100% utilisation of all cores says nothing about NUMA issues: idle spinning can produce that just as well.
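One way to separate spinning from real work (assuming the Intel OpenMP runtime, given that KMP_LIBRARY=turnaround was set) is to make idle threads sleep immediately and see whether the cpu time collapses while the real time stays the same:

KMP_BLOCKTIME=0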

SergeyKostrov
Valued Contributor II
A.kaliazin,

I stated from the beginning of the investigation that a set of tests would be done on an Ivy Bridge system with 4 CPUs and 8 logical CPUs. Another of my comments was:

...I could only confirm that performance scaling for cases with 1 CPU, 2 CPUs and 4 CPUs looks right. Unfortunately, I don't have a system with more than 4 CPUs...

However, I know what CPU-bound, memory-bound, or I/O-bound processing looks like, and my other statement regarding CPU utilization was:

...It looks like processing in that case is Memory or I/O bound, and it is Not CPU bound...

I know that my tests could be considered generic because I don't have a NUMA system. If you have a NUMA system, please try the test application.
TimP
Honored Contributor III

As MKL can use the resources of the 4 cores fully with 1 thread per core, it's hardly surprising that more threads don't improve performance.
