
Sleeping Threads in MKL

schorscherl
Beginner
767 Views

In a simple test program I have measured the performance
of MKL 6.0 DGEMM on a dual Xeon (2.66 GHz, 533 MHz FSB) for
different matrix sizes.
When OMP_NUM_THREADS is greater than 1, the program
stalls, i.e. the threads just go to sleep
and do not do any more work. The matrix size at
which this happens differs from run to run with the
same binary.
Has anybody else seen this effect? Any ideas?

Thanks,
Georg.
Henry_G_Intel
Employee
Hi Georg,
I have some ideas about what might be happening but I need a little more information about your program. DGEMM scales very well for large matrices. Are you seeing any parallel speedup when DGEMM is executed with two threads? Is OMP_NUM_THREADS greater than the number of CPUs? Do the threads start sleeping as the matrix sizes get smaller? Is DGEMM called inside an OpenMP parallel region?

Henry
schorscherl
Beginner
Hi Henry,

This is my simple test program:

do i=10,200
   jend=(dble(1000)**3+1.d0)/(5*(dble(i)**3))+1
   st=MPI_WTIME()

   do j=1,jend
      call dgemm('N','N',i,i,i,1.d0,a,i,b,i,0.d0,c,i)
   enddo
   st=MPI_WTIME()-st
   write (*,*) i,(jend*2.d0*dble(i)*dble(i)*dble(i))/st
enddo
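To make the arithmetic of the test loop above explicit, here is a small shell sketch of how the repetition count jend and the flop count per timed block follow from the matrix size i. The function names dgemm_reps and dgemm_flops are made up for this illustration and are not part of the original program. Integer division is used in place of the Fortran truncation, which gives the same result for positive values.

```shell
# Repetition count for matrix size i, mirroring
#   jend = (1000**3 + 1) / (5 * i**3) + 1   (truncated to an integer)
dgemm_reps() {
    i=$1
    echo $(( (1000000000 + 1) / (5 * i * i * i) + 1 ))
}

# Floating-point operations in one timed block: jend * 2 * i**3
# (an i x i matrix product costs about 2*i**3 operations).
dgemm_flops() {
    i=$1
    reps=$(dgemm_reps "$i")
    echo $(( reps * 2 * i * i * i ))
}

# Print the first few sizes of the sweep as: i jend flops
for size in 10 11 12 16; do
    echo "$size $(dgemm_reps "$size") $(dgemm_flops "$size")"
done
```

Dividing the flop count by the measured seconds gives the rate that the program prints in its last column, so every timed block does roughly the same total work regardless of i.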


I see some parallel speedup, although with matrix sizes
as small as in this program the speedup is moderate
(this was written in order to investigate performance
of MKL for a larger application program that tends to
use rather small matrices with DGEMM).
OMP_NUM_THREADS was set to 2. The program works fine
up to i=40 or so and then hangs, but not always at
the same i - sometimes it gets as high as 100, another
time i=50 is the limit. As you can see, the code does
not use any OpenMP by itself (I have left out
the variable declarations etc.). The MPI_WTIME()
is there for convenience, one can of course use any
other timing mechanism.

We have also seen this effect in "real" OpenMP
applications that were compiled with the Intel
compilers, on IA32 as well as on IA64 systems.
Starting with MKL 6, though, it became very pronounced.

Kind Regards,
Georg.
Henry_G_Intel
Employee
Hi Georg,
I compiled the following program with the Intel 7.1 Fortran compiler and ran it on a dual-processor Windows 2000 Pro workstation with OMP_NUM_THREADS set to one or two threads:
      program mklomp
 
      double precision a(200,200), b(200,200), c(200,200)
 
      integer start, finish, rate
      real seconds
 
      call system_clock (COUNT_RATE = rate)
 
      do i = 10, 200
         jend = (dble(1000)**3 + 1.d0) / (5 * (dble(i)**3)) + 1
 
         call system_clock (COUNT = start)
 
         do j = 1, jend
            call dgemm('N', 'N', i, i, i, 1.d0, a, i, b, i, 0.d0, c, i)
         enddo
 
         call system_clock (COUNT = finish)
         seconds = float (finish - start) / float (rate)
 
         write(*,*) i, jend, seconds,
     +        (jend * 2.d0 * dble(i) * dble(i) * dble(i)) / seconds
      enddo
      end

The program did not hang and showed reasonable parallel speedup going from one to two threads.

Please check that my test program is an accurate representation of yours. What operating system are you using?

Best regards,
Henry
schorscherl
Beginner
Hi Henry,

I'm using Linux (Debian, Red Hat, SuSE; it happens on all
of them, with different compiler and libc versions).

I have compiled your program with ifc 7.1 and linked to
MKL 6.0:

ifc -parallel -static momptest.f -L/opt/intel/mkl/lib/32 -lmkl_ia32

With OMP_NUM_THREADS=2 it sometimes hangs after a few iterations, as described.
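Because the hang is intermittent, it can help to count how often it occurs over several runs. The following is only a sketch: count_hangs is a hypothetical helper, not something used in this thread, and it assumes the GNU coreutils timeout command, which exits with status 124 when it has to kill the command. The program's output is discarded so that only the count is printed.

```shell
# Run a command repeatedly under a time limit and count the runs that time out.
# Usage: count_hangs RUNS SECONDS COMMAND [ARGS...]
count_hangs() {
    runs=$1
    limit=$2
    shift 2
    hangs=0
    n=0
    while [ "$n" -lt "$runs" ]; do
        timeout "$limit" "$@" >/dev/null 2>&1
        # GNU timeout returns 124 if the time limit was reached.
        if [ $? -eq 124 ]; then
            hangs=$((hangs + 1))
        fi
        n=$((n + 1))
    done
    echo "$hangs"
}
```

For example, after `export OMP_NUM_THREADS=2`, calling `count_hangs 10 600 ./a.out` would report how many of ten runs failed to finish within ten minutes. The time limit is a guess and has to be chosen well above the normal run time, otherwise a slow run is miscounted as a hang.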

A little side note: I had to insert something like
a=0 before the main loop so that the compiler
generates an (auto-)parallel region. I have found this
to be necessary because, if the program runs into
MKL (DGEMM) without having executed at least one parallel
region first, I get runtime errors about stacksize
problems (the shell limit is 4 GBytes!), reproducibly at
i=17:
...
16 48829 0.2635000 1518053739.65626
Unable to set worker thread stacksize to 4194304
Perhaps try reducing KMP_STACKSIZE or increasing your shell stack limit.

Setting KMP_STACKSIZE to anything doesn't help. But
maybe I'm doing something seriously wrong here...

Kind regards,
Georg.

Henry_G_Intel
Employee
Hi Georg,
The -parallel option should not be necessary to use MKL, nor should it be necessary to execute a parallel region before calling an MKL function. This could be an MKL bug, but I'm not able to reproduce it locally. Please submit this issue to Intel Premier Support. The MKL experts can probably explain what's happening.

What error message is given about stack limits? You shouldn't have to adjust the KMP_STACKSIZE environment variable because MKL functions should not overflow the thread stacks.

Best regards,
Henry
schorscherl
Beginner
Hi Henry,

OK, so this time I've done it by the book. Here is my shell log:
---------------------------------------------------------
~/loopkernels > ifc momptest.f -L/opt/intel/mkl/lib/32 -lmkl_ia32 -lguide -lpthread
   program MKLOMP

29 Lines Compiled
~/loopkernels > ./a.out
10 200001 0.4974000 804185788.957131
11 150263 0.4789000 835247688.594273
12 115741 0.3382000 1182734750.32458
13 91034 0.3794000 1054305166.88130
14 72887 0.4212000 949676754.895770
15 59260 0.3890000 1028290492.21332
16 48829 0.2761000 1448776363.54996
OMP abort: Unable to set worker thread stack size to 4195328 bytes
Try reducing KMP_STACKSIZE or increasing the shell stack limit.

Abort
----------------------------------------------------------------
No KMP_STACKSIZE was set here, and OMP_NUM_THREADS was 2.
There is no problem with OMP_NUM_THREADS=1.

As I said, my shell stack limit is 4 GBytes. If I add -parallel
to the compiler command, the stacksize problem goes away because
of the additional parallel region in the initialization loop(s). If I
prevent those loops from being parallelized, the stacksize problem
reappears.
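For anyone who wants to put the numbers from the log above side by side on their own machine, here is a minimal sketch. It assumes bash, where `ulimit -s` reports the stack limit in kilobytes or as the word "unlimited". The helper names stack_limit_bytes and stack_fits are invented for this example. It only compares the requested worker stack size from the error message with the shell limit and does not explain why the OpenMP runtime fails.

```shell
# Convert a `ulimit -s` value (kilobytes, or "unlimited") to bytes.
stack_limit_bytes() {
    case "$1" in
        unlimited) echo unlimited ;;
        *) echo $(( $1 * 1024 )) ;;
    esac
}

# Succeed if a requested thread stack size (bytes) fits within the limit
# (given in the same form that `ulimit -s` prints).
stack_fits() {
    requested=$1
    limit=$(stack_limit_bytes "$2")
    [ "$limit" = unlimited ] || [ "$requested" -le "$limit" ]
}

limit_kb=$(ulimit -s)
echo "shell stack limit: $(stack_limit_bytes "$limit_kb") bytes"
echo "KMP_STACKSIZE=${KMP_STACKSIZE:-unset} OMP_NUM_THREADS=${OMP_NUM_THREADS:-unset}"

# 4195328 is the worker stack size quoted in the abort message above.
if stack_fits 4195328 "$limit_kb"; then
    echo "requested worker stack fits within the shell limit"
else
    echo "requested worker stack exceeds the shell limit"
fi
```

With a 4 GByte limit the requested size fits easily, which is consistent with the observation that the shell limit alone does not account for the abort.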

I think I will now submit both issues (stacksize and sleeping threads)
to Premier Support. Thank you for your help all the same.

Kind regards,
Georg.

Henry_G_Intel
Employee
Hi Georg,
When the MKL team gives you a solution to this problem, please post it here.

Thanks,
Henry