
I work with a large ccNUMA SGI Altix system. One of our users is trying to benchmark some LAPACK routines on it and is getting disappointing scaling: performance stops improving beyond 4 threads.

The test diagonalizes a 4097x4097 matrix of double-precision floats using the routine DSYEV.

Analysing the hotspots in VTune, I find that almost all the time is spent in overhead and spin time in the functions:

`[OpenMP dispatcher] <- pthread_create_child` and `[OpenMP fork]`.

The code was compiled with ifort 13.1.0.146 using the options `-O3 -openmp -g -traceback -xHost -align -ansi-alias -mkl=parallel`, against MKL version 11. The system is made up of 8-core Sandy Bridge Xeon sockets.

The code was run with the environment variables:

- OMP_NUM_THREADS=16
- MKL_NUM_THREADS=16
- KMP_STACKSIZE=2gb
- OMP_NESTED=FALSE
- MKL_DYNAMIC=FALSE
- KMP_LIBRARY=turnaround
- KMP_AFFINITY=disabled

It is also run under the SGI NUMA placement command `dplace -x2`, which pins the threads to their cores.
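For a quick cross-check outside the full application, the same class of eigensolve can be timed from a short script. This is only a sketch added for illustration, not the attached code: it uses NumPy's `eigh` (which dispatches to LAPACK's symmetric eigensolvers, the DSYEV family) and a small 512x512 stand-in for the 4097x4097 matrix.

```python
import time
import numpy as np

n = 512  # small stand-in for the 4097x4097 case described above
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n))
A = (A + A.T) / 2.0          # symmetrize: DSYEV requires a real symmetric matrix

t0 = time.perf_counter()
w, v = np.linalg.eigh(A)     # dispatches to LAPACK's symmetric eigensolver
elapsed = time.perf_counter() - t0

# Sanity check: A v = v diag(w) within floating-point tolerance
assert np.allclose(A @ v, v * w, atol=1e-8)
print(f"n={n}: {elapsed:.3f}s")
```

Running the script under different `OMP_NUM_THREADS`/`MKL_NUM_THREADS` settings gives a rough per-size scaling curve to compare against the full run.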

So I suspect that something is wrong with the MKL options, or that the library isn't configured properly for our system. I have attached the code used.

Does anybody have any ideas on this?

Jim



> stops scaling after 4 threads. The test I am running is of diagonalizing a 4097x4097 matrix of double precision floats. It uses the routine DSYEV...

It seems to me that only a little performance advantage can be achieved for a 4097x4097 matrix (I would rate it as small). Here are two questions:

- Why 4097x4097 and not 4096x4096?
- Did you try larger matrix sizes, like 16Kx16K, 32Kx32K, and so on?


Hello again. Yes, the user had tried larger matrices and saw similar scaling problems. When he ran the same code on a different machine, he managed to get it to scale beyond 8 threads for 16kx16k. I reran the code with a 16kx16k matrix with 4, 8, and 16 OMP threads on our ccNUMA system. The profiling results for 4 threads are:

| Function | CPU Time | Overhead and Spin Time |
|---|---|---|
| [OpenMP fork] | 1414.601s | 1414.601s |
| [OpenMP dispatcher] | 1165.936s | 1165.936s |
| [OpenMP worker] | 153.393s | 153.393s |
| lapack_dsyev | 45.606s | 0s |
| diag | 2.468s | 0s |

The results for 8 and 16 threads show a similar trend.

Nearly all the time is spent idle, even with only 4 threads. Surely it can't be for lack of work to do?

So does anyone have any ideas on this?


The **matrix.chk** file is **not** found:

```
..\DiagTestApp>Diag.exe
Read the Hamilton-matrix...
forrtl: severe (29): file not found, unit 11, file ..\DiagTestApp\matrix.chk
Image              PC                Routine   Line     Source
Diag.exe           00000001400659C7  Unknown   Unknown  Unknown
Diag.exe           0000000140061383  Unknown   Unknown  Unknown
Diag.exe           0000000140034FA6  Unknown   Unknown  Unknown
Diag.exe           000000014001A975  Unknown   Unknown  Unknown
Diag.exe           00000001400195B0  Unknown   Unknown  Unknown
Diag.exe           000000014000B6E9  Unknown   Unknown  Unknown
Diag.exe           0000000140001985  Unknown   Unknown  Unknown
Diag.exe           0000000140001076  Unknown   Unknown  Unknown
Diag.exe           00000001400F814C  Unknown   Unknown  Unknown
Diag.exe           000000014004EC2F  Unknown   Unknown  Unknown
kernel32.dll       0000000076B5652D  Unknown   Unknown  Unknown
ntdll.dll          000000007724C521  Unknown   Unknown  Unknown
...
```


Sorry, that's the input file containing the matrix. The 16kx16k one is ~2GB in size, so I didn't include it initially. I'll upload it tomorrow when I go back to work; apparently we're allowed up to 4GB on here...


> The 16kx16k one is ~2GB in size so I didn't include it initially. I'll upload it tomorrow when I go back to work; apparently we're allowed up to 4GB on here...

Is there any chance to modify the source code to generate some random values, or some right numbers, to get a solution? I think that would be the best solution... Anyway, on my side the application (initial version) is ready for testing. My system has 32GB of physical memory and 96GB of virtual memory, and I think it will be able to handle your test case.


Hi,

Sorry for the delay. The user's code for generating the matrices is leviathan in complexity and takes forever to run. However, all one needs for DSYEV is a real symmetric matrix, so I wrote my own code (attached) that generates a simple 16kx16k Fiedler matrix: A(i,j) = abs(i-j). It writes the matrix to an unformatted Fortran file called 'matrix.chk'. Use this as the input file for the other program.
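For reference, the Fiedler construction above is a one-liner in NumPy. This is just an illustrative sketch of the same A(i,j) = abs(i-j) matrix at a small size, not the attached Fortran generator, and it skips the unformatted 'matrix.chk' output step.

```python
import numpy as np

def fiedler(n):
    """n x n Fiedler matrix: A(i, j) = |i - j|, real and symmetric."""
    i = np.arange(n)
    return np.abs(i[:, None] - i[None, :]).astype(np.float64)

A = fiedler(6)                  # small size for illustration; the test uses 16k x 16k
assert np.array_equal(A, A.T)   # symmetric, as DSYEV requires
print(A[0])                     # first row of the 6x6 case
```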

I can confirm that this also gives the same problems on our system as our users matrix.


Did you manage to get anywhere with it?

J


It's a ~200-socket ccNUMA machine. Each socket is an 8-core Intel Xeon E5-4650L with about 7.5GB of RAM per core. You request cores and memory for jobs through the MOAB scheduler.


**[ Hardware ]**

- Dell Precision Mobile M4700
- Intel Core i7-3840QM ( Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/compare/70846 )
- Size of L3 Cache = 8MB ( shared between all cores for data & instructions )
- Size of L2 Cache = 1MB ( 256KB per core / shared for data & instructions )
- Size of L1 Cache = 256KB ( 32KB per core for data & 32KB per core for instructions )
- Windows 7 Professional 64-bit
- 32GB of RAM / 96GB of VM

**[ 64-bit application on Windows 7 Professional 64-bit OS ]** Command line to compile: `ifort.exe /O3 /Qopenmp /QxHost /Qmkl:parallel /Qansi-alias /align:array32byte Diag.f90`

**[ Number of CPUs used: 4 ( 4 threads ) ]**

```
Read the Hamilton-matrix...
allocation of mat of 16000x 16000
Read the Hamilton-matrix... ...end!
Diagonalization with dsyev: real 1096.2s ; cpu 4380.5s
...done! FIN!
```

**[ Number of CPUs used: 2 ( 2 threads ) ]**

```
Read the Hamilton-matrix...
allocation of mat of 16000x 16000
Read the Hamilton-matrix... ...end!
Diagonalization with dsyev: real 1454.9s ; cpu 2908.5s
...done! FIN!
```

**[ Number of CPUs used: 1 ( 1 thread ) ]**

```
Read the Hamilton-matrix...
allocation of mat of 16000x 16000
Read the Hamilton-matrix... ...end!
Diagonalization with dsyev: real 2532.1s ; cpu 2529.7s
...done! FIN!
```

**[ Summary ]**

| CPUs | real | cpu |
|---|---|---|
| 1 | 2532.1s | 2529.7s |
| 2 | 1454.9s | 2908.5s |
| 4 | 1096.2s | 4380.5s |
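Reading that summary as speedups (a bit of arithmetic added here on the numbers reported above, nothing more):

```python
# Wall-clock ("real") times from the summary above, per CPU count
real = {1: 2532.1, 2: 1454.9, 4: 1096.2}

for p in (2, 4):
    speedup = real[1] / real[p]
    efficiency = speedup / p
    print(f"{p} CPUs: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

That works out to roughly 1.74x on 2 CPUs (~87% efficiency) but only about 2.31x on 4 CPUs (~58%), so the scaling loss is visible even within a single 4-core socket.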


**[ 32-bit application on Windows 7 Professional 64-bit OS ]** Command line to compile: `ifort.exe /O3 /Qopenmp /QxHost /Qmkl:parallel /Qansi-alias /align:array32byte Diag.f90`

With 1, 2, and 4 threads the run fails the same way:

```
Read the Hamilton-matrix...
/Error diag 41 trying to allocate arryas mat and e
diag, arryas mat and e - out of memory
```

**[ Summary ]** 1 CPU - N/A ; 2 CPUs - N/A ; 4 CPUs - N/A

**Note:** As you can see, the test for the 32-bit application failed: a 16000x16000 double-precision matrix alone needs ~2GB, which is roughly the entire user address space of a 32-bit process.



Hi Sergey, James,

Thanks a lot for the test. Just a quick thought:

Some of the BLAS functions are threaded with OpenMP, but in order to keep good performance they start at most 4 threads. As the function gesv depends on those BLAS functions, the scalability of your test is limited to 4. We will check it again and let you know the details.

Best Regards,

Ying


Indeed, please let us know ASAP; we have 1800 spare threads that apparently can never be utilized by MKL.

Also, if BLAS is limited to 4 threads, how can it ever fully utilize a Xeon Phi card?

o_0

