I work with a large ccNUMA SGI Altix system. One of our users is trying to benchmark some LAPACK routines on our system and is getting disappointing scaling: it stops scaling after 4 threads.
The test I am running diagonalizes a 4097x4097 matrix of double-precision floats using the routine DSYEV.
From analysing the hotspots in VTune, I find that almost all the time is spent in overhead and spin time from the functions:
[OpenMP dispatcher]<- pthread_create_child and in [OpenMP fork].
The code was compiled using ifort with the options: -O3 -openmp -g -traceback -xHost -align -ansi-alias -mkl=parallel, using version 126.96.36.199 of the compiler and version 11 of MKL. The system is made up of 8-core Xeon Sandy Bridge sockets.
The code was run with the envars:
It is also run with the SGI command for NUMA systems 'dplace -x2', which locks the threads to their cores.
So I suspect that there is something up with the options for the MKL, or the library isn't configured properly for our system. I have attached the code used.
Does anybody have any ideas on this?
Hello again. Yes, the user had tried larger matrices and got similar scaling problems. When he ran the same code on a different machine, he managed to get it to scale beyond 8 threads for 16kx16k. I reran the code with a 16kx16k matrix with 4, 8, and 16 OMP threads on our ccNUMA system. The results of profiling for 4 threads are:
[OpenMP fork]          1414.601s    1414.601s
[OpenMP dispatcher]    1165.936s    1165.936s
[OpenMP worker]         153.393s     153.393s
lapack_dsyev             45.606s           0s
diag                      2.468s           0s
Where the first column is CPU time and the second is Overhead and spin time. The results for 8 and 16 threads show a similar trend.
Nearly all the time is spent idle even for 4 threads. It can't be because there isn't enough work to do, surely?
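To put a number on that, here is a rough sketch in Python that sums the two columns from the 4-thread profile above. It assumes the five entries listed are the whole picture (they are only the hotspots I quoted, so the real totals will differ slightly), and the variable names are mine, not anything from VTune.

```python
# CPU time and overhead/spin time in seconds, copied from the 4-thread
# profile quoted above. Only the listed hotspots are included, so the
# totals are an approximation of the full profile.
profile = {
    "[OpenMP fork]": (1414.601, 1414.601),
    "[OpenMP dispatcher]": (1165.936, 1165.936),
    "[OpenMP worker]": (153.393, 153.393),
    "lapack_dsyev": (45.606, 0.0),
    "diag": (2.468, 0.0),
}

# Sum each column across all listed functions.
total_cpu = sum(cpu for cpu, _ in profile.values())
total_spin = sum(spin for _, spin in profile.values())

# Share of the listed CPU time that is overhead and spin time.
spin_fraction = total_spin / total_cpu

print(f"total CPU time:    {total_cpu:.3f} s")
print(f"overhead and spin: {total_spin:.3f} s")
print(f"fraction idle:     {spin_fraction:.1%}")
```

On these numbers, roughly 98% of the listed CPU time is overhead and spin, and only the remaining couple of percent is in lapack_dsyev and diag.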
So does anyone have any ideas on this?
Sorry, that's the input file containing the matrix. The 16kx16k one is ~2 GB in size, so I didn't include it initially. I'll upload it tomorrow when I go back to work; apparently we're allowed up to 4 GB on here...
Sorry for the delay. The user's code for generating the matrices is leviathan in complexity and takes forever. However, all one needs for dsyev is a real symmetric matrix, so I wrote my own code (attached) that generates a simple 16kx16k Fiedler matrix: A(i,j) = abs(i-j). This will output an unformatted Fortran file called 'matrix.chk'. Use this as the input file for the other program.
I can confirm that this also gives the same problems on our system as our user's matrix.
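For anyone who just wants to see what the test matrix looks like without opening the attachment, here is a minimal Python sketch of the same definition, A(i,j) = abs(i-j). It is not the attached Fortran generator: the function name `fiedler` and the small size `n = 5` are only for illustration, and it does not write 'matrix.chk' (the unformatted Fortran record layout is compiler-dependent, so I have left that out).

```python
def fiedler(n):
    """Return the n x n Fiedler matrix A(i,j) = |i - j| as a list of rows."""
    return [[float(abs(i - j)) for j in range(n)] for i in range(n)]


# Illustrative size only; the test in this thread uses 16k x 16k.
n = 5
a = fiedler(n)

# dsyev needs a real symmetric matrix, so check A(i,j) == A(j,i).
is_symmetric = all(a[i][j] == a[j][i] for i in range(n) for j in range(n))

for row in a:
    print(row)
print("symmetric:", is_symmetric)
```

The matrix is symmetric with a zero diagonal by construction, which is all dsyev requires of its input.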
It's a ~200-socket ccNUMA machine. Each socket is an 8-core Intel Xeon E5-4650L with about 7.5 GB RAM per core. You request cores and memory for jobs using the MOAB scheduler.
Hi Sergey, James,
Thanks a lot for the test. Just a quick thought:
Some BLAS functions are threaded with OpenMP, but in order to keep good performance, they start at most 4 threads. As the function gesv depends on BLAS functions, the scalability of your test is limited to 4 threads. We will check it again and let you know the details.
Indeed, please let us know ASAP; we have a spare 1800 threads that apparently can never be utilized by MKL.
Also, if BLAS is limited to 4 threads, how can it ever fully utilize a Xeon Phi card?