Benchmarking MKL LAPACK on ccNUMA system

james_B_8 · ‎06-04-2013

I work with a large ccNUMA SGI Altix system. One of our users is trying to benchmark some LAPACK routines on our system and is getting some disappointing scaling - stops scaling after 4 threads.

The test I am running is of diagonalizing a 4097x4097 matrix of double precision floats. It uses the routine DSYEV.

From analysing the hotspots in VTune, I find that almost all the time is spent in overhead and spin time from the functions:

[OpenMP dispatcher]<- pthread_create_child and in [OpenMP fork].

The code was compiled using ifort with the options: -O3 -openmp -g -traceback -xHost -align -ansi-alias -mkl=parallel, using version 13.1.0.146 of the compiler and version 11 of MKL. The system is made up of 8-core Xeon Sandy Bridge sockets.

The code was run with the environment variables:

OMP_NUM_THREADS=16
MKL_NUM_THREADS=16
KMP_STACKSIZE=2gb
OMP_NESTED=FALSE
MKL_DYNAMIC=FALSE
KMP_LIBRARY=turnaround
KMP_AFFINITY=disabled

It is also run with the SGI command for NUMA systems 'dplace -x2', which locks the threads to their cores.
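To check where scaling stops, the same settings can be swept over several thread counts. The sketch below only renders the launch lines and does not run anything; the binary name `./diag` and the helper names `make_env` and `launch_line` are assumptions made for illustration, while the variable values are copied from the list above.

```python
# Settings copied from the post above; only the thread count is varied.
BASE_ENV = {
    "KMP_STACKSIZE": "2gb",
    "OMP_NESTED": "FALSE",
    "MKL_DYNAMIC": "FALSE",
    "KMP_LIBRARY": "turnaround",
    "KMP_AFFINITY": "disabled",
}


def make_env(threads):
    """Return the environment settings for one run with the given thread count."""
    env = dict(BASE_ENV)
    env["OMP_NUM_THREADS"] = str(threads)
    env["MKL_NUM_THREADS"] = str(threads)
    return env


def launch_line(threads, binary="./diag"):
    """Render a shell-style launch line; './diag' is a hypothetical binary name."""
    settings = " ".join(f"{k}={v}" for k, v in sorted(make_env(threads).items()))
    return f"{settings} dplace -x2 {binary}"


if __name__ == "__main__":
    # Print one launch line per thread count in the sweep.
    for threads in (1, 2, 4, 8, 16):
        print(launch_line(threads))
```

Timing each of these runs separately gives a scaling curve that can be compared with results from other machines.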

So I suspect that there is something up with the options for the MKL, or the library isn't configured properly for our system. I have attached the code used.

Does anybody have any ideas on this?

Jim

SergeyKostrov · ‎06-04-2013

>>...One of our users is trying to benchmark some LAPACK routines on our system and is getting some disappointing
>>scaling - stops scaling after 4 threads.
>>
>>The test I am running is of diagonalizing a 4097x4097 matrix of double precision floats. It uses the routine DSYEV...

It seems to me that only a small performance advantage could be achieved for a 4097x4097 matrix ( I would rate it as small ). Here are two questions:

- Why 4097x4097 and not 4096x4096?
- Did you try larger matrix sizes, like 16Kx16K, 32Kx32K, and so on?

james_B_8 · ‎06-10-2013

Hello again. Yes, the user had tried larger matrices and got similar problems with scaling. When he ran the same code on a different machine, he managed to get it to scale beyond 8 threads for 16kx16k. I reran the code with a 16kx16k matrix with 4, 8, and 16 OMP threads on our ccNUMA system. The results of profiling for 4 threads are:
[OpenMP fork]          1414.601s   1414.601s
[OpenMP dispatcher]    1165.936s   1165.936s
[OpenMP worker]         153.393s    153.393s
lapack_dsyev             45.606s          0s
diag                      2.468s          0s

Where the first column is CPU time and the second is Overhead and spin time. The results for 8 and 16 threads show a similar trend.

Nearly all the time is spent idle even for 4 threads. It can't be because there isn't enough work to do, surely?
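To put a number on "nearly all the time", here is a minimal sketch that sums the two columns of the 4-thread profile above. It only covers the five hotspots listed in the post, so the result is the overhead share among those rows, not of the whole run; the function name `overhead_fraction` is made up for this example.

```python
# Rows from the 4-thread VTune profile above:
# (name, CPU time in seconds, overhead and spin time in seconds).
profile = [
    ("[OpenMP fork]", 1414.601, 1414.601),
    ("[OpenMP dispatcher]", 1165.936, 1165.936),
    ("[OpenMP worker]", 153.393, 153.393),
    ("lapack_dsyev", 45.606, 0.0),
    ("diag", 2.468, 0.0),
]


def overhead_fraction(rows):
    """Return the share of the listed CPU time that is overhead and spin time."""
    total_cpu = sum(cpu for _, cpu, _ in rows)
    total_overhead = sum(spin for _, _, spin in rows)
    return total_overhead / total_cpu


if __name__ == "__main__":
    print(f"overhead and spin share: {overhead_fraction(profile):.1%}")
```

For these rows the share comes out at roughly 98%, which leaves only about 48s of the listed CPU time attributed to lapack_dsyev and diag.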

So does anyone have any ideas on this?

SergeyKostrov · ‎06-10-2013

>>...Nearly all the time is spent idle even for 4 threads. It can't be because there isn't enough work to do, surely?

I agree that something is wrong, and here are some more questions:

- How much memory does the system have?
- Could you verify how physical and virtual memory were used during these tests? ( if you're on a Linux system try to use a graphical utility similar to Windows Task Manager )

I'll do a verification of your test codes on my Ivy Bridge system ( see * ) with Intel C++ Compiler XE 13.1.0.149 [ IA-32 & X64 ] ( Update 2 ) and MKL version 11.0.3.

( * ) - Intel Core i7-3840QM ( Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/compare/70846 )

SergeyKostrov · ‎06-11-2013

James, I compiled your test case on a Windows 7 Professional 64-bit OS with 64-bit Fortran compiler using the following command line:

ifort.exe /O3 /Qopenmp /QxHost /Qmkl:parallel /Qansi-alias Diag.f90

but execution fails because a matrix.chk file is not found:

..\DiagTestApp>Diag.exe
Read the Hamilton-matrix...
forrtl: severe (29): file not found, unit 11, file ..\DiagTestApp\matrix.chk
Image          PC                 Routine   Line      Source
Diag.exe       00000001400659C7   Unknown   Unknown   Unknown
Diag.exe       0000000140061383   Unknown   Unknown   Unknown
Diag.exe       0000000140034FA6   Unknown   Unknown   Unknown
Diag.exe       000000014001A975   Unknown   Unknown   Unknown
Diag.exe       00000001400195B0   Unknown   Unknown   Unknown
Diag.exe       000000014000B6E9   Unknown   Unknown   Unknown
Diag.exe       0000000140001985   Unknown   Unknown   Unknown
Diag.exe       0000000140001076   Unknown   Unknown   Unknown
Diag.exe       00000001400F814C   Unknown   Unknown   Unknown
Diag.exe       000000014004EC2F   Unknown   Unknown   Unknown
kernel32.dll   0000000076B5652D   Unknown   Unknown   Unknown
ntdll.dll      000000007724C521   Unknown   Unknown   Unknown
...

james_B_8 · ‎06-11-2013

Sorry, that's the input file containing the matrix. The 16kx16k one is ~2gb in size so I didn't include it initially. I'll upload it tomorrow when I go back to work; apparently we're allowed up to 4gb on here...

SergeyKostrov · ‎06-11-2013

>>...Sorry, that's the input file containing the matrix. the 16kx16k one is ~2gb in size so I didn't include it initially. I'll upload it
>>tomorrow when I go back to work apparently we're allowed up to 4gb on here...

Is there any chance to modify source codes and generate some random values, or some right numbers to get a solution? I think it will be the best solution...

Anyway, on my side the application ( initial version ) is ready for testing. My system has 32GB of physical memory and 96GB of Virtual Memory and I think it will be able to handle your test case.

james_B_8 · ‎06-14-2013

Hi

Sorry for the delay. The user's code for generating the matrices is leviathan in complexity and takes forever. However, all one needs for dsyev is a real symmetric matrix, so I wrote my own code (attached) that generates a simple 16kx16k Fiedler matrix: A(i,j) = abs(i-j). This will output a file in unformatted Fortran called 'matrix.chk'. Use this as the input file for the other program.

I can confirm that this also gives the same problems on our system as our user's matrix.
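For anyone without the attachment, the matrix itself is easy to reproduce. This is a minimal sketch of the entry rule A(i,j) = abs(i-j) only; it is not the attached generator, it does not write the unformatted 'matrix.chk' file, and the function names are made up for the example.

```python
def fiedler(n):
    """Build the n x n matrix with entries A(i,j) = |i - j|.

    The value |i - j| is the same for 0-based and 1-based indices,
    so Python's 0-based ranges match the Fortran definition.
    """
    return [[abs(i - j) for j in range(n)] for i in range(n)]


def is_symmetric(a):
    """Check A(i,j) == A(j,i) for all i, j, as dsyev requires of its input."""
    n = len(a)
    return all(a[i][j] == a[j][i] for i in range(n) for j in range(i + 1, n))


if __name__ == "__main__":
    # A small case is enough to see the structure; the thread uses n = 16000.
    for row in fiedler(4):
        print(row)
    print("symmetric:", is_symmetric(fiedler(4)))
```

The result is real and symmetric with a zero diagonal, which is all dsyev needs as input.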

SergeyKostrov · ‎06-14-2013

>>...I wrote my own code (attached) that generates a simple 16kx16k Fiedler matrix: A(i,j) = abs(i-j). This will output a file in
>>unformatted fortran called 'matrix.chk'. Use this as the input file for the other program...

I'll let you know results of my tests and thank you for the matrix generation program.

james_B_8 · ‎06-18-2013

Did you manage to get anywhere with it?

J

SergeyKostrov · ‎06-19-2013

>>>>...One of our users is trying to benchmark some LAPACK routines on our system and is getting some
>>>>disappointing scaling - stops scaling after 4 threads...
>>
>>Did you manage to get anywhere with it?

Yes and I'll post my results soon.

SergeyKostrov · ‎06-19-2013

Could you provide some technical details about hardware your user is using?

james_B_8 · ‎06-19-2013

It's a ~200 socket ccNUMA machine. Each socket is an 8-core Intel Xeon E5-4650L with about 7.5 GB of RAM per core. You request cores and memory for jobs using the MOAB scheduler.

SergeyKostrov · ‎06-19-2013

Here are results on Ivy Bridge system.

[ Hardware ]

Dell Precision Mobile M4700
Intel Core i7-3840QM ( Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/compare/70846 )
Size of L3 Cache = 8MB ( shared between all cores for data & instructions )
Size of L2 Cache = 1MB ( 256KB per core / shared for data & instructions )
Size of L1 Cache = 256KB ( 32KB per core for data & 32KB per core for instructions )
Windows 7 Professional 64-bit
32GB of RAM
96GB of VM

[ 64-bit application on Windows 7 Professional 64-bit OS ]

Command line to compile:

ifort.exe /O3 /Qopenmp /QxHost /Qmkl:parallel /Qansi-alias /align:array32byte Diag.f90

[ Number of CPUs used: 4 ( 4 threads ) ]

Read the Hamilton-matrix...
allocation of mat of 16000x 16000
Read the Hamilton-matrix... ...end!
Diagonalization with dsyev:
real 1096.2s ; cpu 4380.5s
...done!
FIN!

[ Number of CPUs used: 2 ( 2 threads ) ]

Read the Hamilton-matrix...
allocation of mat of 16000x 16000
Read the Hamilton-matrix... ...end!
Diagonalization with dsyev:
real 1454.9s ; cpu 2908.5s
...done!
FIN!

[ Number of CPUs used: 1 ( 1 thread ) ]

Read the Hamilton-matrix...
allocation of mat of 16000x 16000
Read the Hamilton-matrix... ...end!
Diagonalization with dsyev:
real 2532.1s ; cpu 2529.7s
...done!
FIN!

[ Summary ]

1 CPU  - real 2532.1s ; cpu 2529.7s
2 CPUs - real 1454.9s ; cpu 2908.5s
4 CPUs - real 1096.2s ; cpu 4380.5s

SergeyKostrov · ‎06-19-2013

[ 32-bit application on Windows 7 Professional 64-bit OS ]

Command line to compile:

ifort.exe /O3 /Qopenmp /QxHost /Qmkl:parallel /Qansi-alias /align:array32byte Diag.f90

[ Number of CPUs used: 4 ( 4 threads ) ]

Read the Hamilton-matrix...
/Error diag 41 trying to allocate arryas mat and e diag, arryas mat and e - out of memory

[ Number of CPUs used: 2 ( 2 threads ) ]

Read the Hamilton-matrix...
/Error diag 41 trying to allocate arryas mat and e diag, arryas mat and e - out of memory

[ Number of CPUs used: 1 ( 1 thread ) ]

Read the Hamilton-matrix...
/Error diag 41 trying to allocate arryas mat and e diag, arryas mat and e - out of memory

[ Summary ]

1 CPU - N/A
2 CPU - N/A
4 CPU - N/A

Note: As you can see a test for a 32-bit application failed.
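A rough size estimate is consistent with the out-of-memory failure. The sketch below computes the storage for the 16000x16000 double precision matrix alone and compares it with a 2 GiB limit; that limit is an assumption here (the default user address space of a 32-bit Windows process), and the workspace that dsyev needs on top is not counted.

```python
def dense_matrix_bytes(n, bytes_per_element=8):
    """Storage for a dense n x n matrix of 8-byte (double precision) values."""
    return n * n * bytes_per_element


GIB = 1024 ** 3

# Assumption: default 2 GiB user address space for a 32-bit Windows process.
LIMIT_32BIT = 2 * GIB

n = 16000
mat_bytes = dense_matrix_bytes(n)

if __name__ == "__main__":
    print(f"matrix storage: {mat_bytes} bytes = {mat_bytes / GIB:.2f} GiB")
    print(f"share of the assumed 2 GiB limit: {mat_bytes / LIMIT_32BIT:.1%}")
```

The matrix needs 2,048,000,000 bytes, which matches the ~2gb input file mentioned earlier and leaves almost nothing of the assumed limit for the program, its libraries, and the eigenvalue array.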

SergeyKostrov · ‎06-19-2013

>>[ Summary ]
>>
>>1 CPU - real 2532.1s ; cpu 2529.7s
>>2 CPUs - real 1454.9s ; cpu 2908.5s
>>4 CPUs - real 1096.2s ; cpu 4380.5s

I could only confirm that performance scaling for cases with 1 CPU, 2 CPUs and 4 CPUs looks right. Unfortunately, I don't have a system with greater than 4 CPUs.
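The summary can be turned into speedup and parallel efficiency figures. This is a minimal sketch using only the wall-clock ("real") times quoted above; the function name `scaling_table` is made up for the example.

```python
# Wall-clock ("real") times in seconds from the summary above, keyed by thread count.
timings = {1: 2532.1, 2: 1454.9, 4: 1096.2}


def scaling_table(wall_times):
    """Return (threads, speedup, efficiency) rows relative to the 1-thread run."""
    base = wall_times[1]
    rows = []
    for threads in sorted(wall_times):
        speedup = base / wall_times[threads]
        rows.append((threads, speedup, speedup / threads))
    return rows


if __name__ == "__main__":
    for threads, speedup, efficiency in scaling_table(timings):
        print(f"{threads} threads: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

On these numbers the run is about 1.74x faster on 2 threads and about 2.31x faster on 4 threads, so efficiency falls from roughly 87% to roughly 58% between 2 and 4 threads on this 4-core machine.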

SergeyKostrov · ‎06-20-2013

This is a short follow up and I wonder if Intel software engineers could verify scalability on a system with 8, or 16, or even more CPUs? Thanks in advance.

Note: Take into account that a set of environment variables was provided:

...
OMP_NUM_THREADS=16
MKL_NUM_THREADS=16
KMP_STACKSIZE=2gb
OMP_NESTED=FALSE
MKL_DYNAMIC=FALSE
KMP_LIBRARY=turnaround
KMP_AFFINITY=disabled
...

Ying_H_Intel · ‎06-21-2013

Hi Sergey, James,

Thanks a lot for the test. Just a quick thought:

Some BLAS functions are threaded with OpenMP, but in order to keep good performance they start at most 4 threads. Since dsyev depends on those BLAS functions, the scalability of your test is limited to 4 threads. We will check it again and let you know the details.

Best Regards,

Ying

james_B_8 · ‎06-21-2013

Indeed, please let us know ASAP, we have a spare 1800 threads that apparently can never be utilized by MKL.

Also if blas is limited to 4 threads, how can it ever fully utilize a Xeon Phi card?

o_0

SergeyKostrov · ‎06-21-2013

>>...Also if blas is limited to 4 threads, how can it ever fully utilize a Xeon Phi card?..

These MKL thread limitations do not look right and I see inconsistency because some MKL functions on my system used 8 threads instead of 4. Here is a really small example:

...
C = MATMUL( A, B ) ! Calculate product of two dense matrices
...

and 8 threads were used.