Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Max array size for reduction in OpenMP?

Tue_B_
Novice
1,321 Views

Hi guys

I'm trying to optimize some matrix multiplication, and to do that I need to do an OpenMP reduction on a matrix (array). However, once the array dimension gets even slightly large I get stack overflow problems, and I'm not sure why.

    call KMP_SET_STACKSIZE_S(8000000)
    blocksize=800
    NUMA=8
    dim=blocksize*NUMA
    allocate(bA(dim,dim))
    allocate(bB(dim,dim))
    allocate(bC(dim,dim))
    print*, 'initiating first touch'
    call omp_set_num_threads(NUMA)
    !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,first,last)
    !$OMP DO SCHEDULE(STATIC)
    do i=1,NUMA
        first=(i-1)*blocksize+1   ! blocksize columns per chunk (not NUMA)
        last=i*blocksize
        bA(:,first:last)=0.d0
        bB(first:last,:)=0.d0
        bC(:,first:last)=0.d0
    end do
    !$OMP END DO
    !$OMP END PARALLEL    
    print*, 'first touch done'

    print*, 'Threads',omp_get_max_threads()
    Runtimebegin=omp_get_wtime()
    !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,first,last) REDUCTION(+:bC)   
    !$OMP DO SCHEDULE(STATIC)
    do i=1,NUMA
      first=(i-1)*blocksize+1
      call dgemm('N','N',dim,dim,blocksize,1.d0,bA(1,first),dim,bB(first,1),dim,0.d0,bC,dim)
    end do
    !$OMP END DO
    !$OMP END PARALLEL
    Runtimeend=omp_get_wtime()

When blocksize is 800 I get the stack overflow problem, but when I put it down to 80 everything is okay.

I realize that by putting the array in a reduction clause I implicitly make a private copy of the array on each thread, but even so, my machine should have plenty of memory to handle this.

Can anyone tell me how to get around this? Or at least what exactly I'm doing wrong, because so far I haven't found anything in the OpenMP documentation that helps.

 

Cheers

Tue

4 Replies
Steven_L_Intel1
Employee

You probably need to set the environment variable KMP_STACKSIZE to a larger value. The default is 4 MB on Intel 64 and 2 MB on IA-32.
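For example, on Windows the variable can be set in the console before launching the program (the 64m value and the program name are only illustrative; m and g suffixes are accepted):

```shell
:: give each OpenMP thread a 64 MB stack, then run the program
set KMP_STACKSIZE=64m
myprogram.exe
```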

John_Campbell
New Contributor II

Assuming the arrays you are using are real(4), each array is about 160 MB in size.

When you use REDUCTION(+:bC), this creates a copy of the array for each thread (640 MB for 4 threads, or 1280 MB for 8 threads).
Assuming this is what you want to do, you could instead allocate the array as bC(dim,dim,num_threads) and use it as shared, with each thread writing to its own slice so there is no conflict between threads.

This should overcome the stack overflow problem, although I don't think it would be a good approach. You will need to do the reduction across the third dimension afterwards.

num_threads = omp_get_max_threads ()
allocate ( bC(dim,dim,num_threads) )

!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,first,thread_id)
!$OMP DO SCHEDULE(STATIC)
do i=1,NUMA
  thread_id = omp_get_thread_num () + 1  ! 0-based thread id -> Fortran index
  first=(i-1)*blocksize+1
  ! beta=0 is safe here because the static schedule gives each thread one iteration
  call dgemm('N','N',dim,dim,blocksize,1.d0,bA(1,first),dim,bB(first,1),dim,0.d0,  &
             bC(1,1,thread_id),dim)
end do
!$OMP END DO
!$OMP END PARALLEL
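The reduction afterwards is then a sum over the third dimension. Here is a small self-contained sketch of the whole scheme; matmul stands in for dgemm so the example needs no BLAS, and the sizes n and nblk are made up for illustration:

```fortran
! Sketch: per-thread partial products in a shared 3-D array, then a
! manual reduction over the thread dimension.
program partial_reduction
  use omp_lib
  implicit none
  integer, parameter :: n = 8, nblk = 4, w = n/nblk
  real(8) :: A(n,n), B(n,n)
  real(8), allocatable :: Cpart(:,:,:), C(:,:)
  integer :: nthreads, t, i, first, last

  call random_number(A)
  call random_number(B)
  nthreads = omp_get_max_threads()
  allocate (Cpart(n,n,nthreads), C(n,n))
  Cpart = 0.d0

  !$omp parallel do schedule(static) private(i,first,last,t)
  do i = 1, nblk
    t = omp_get_thread_num() + 1          ! each thread accumulates into its own slice
    first = (i-1)*w + 1
    last  = i*w
    Cpart(:,:,t) = Cpart(:,:,t) + matmul(A(:,first:last), B(first:last,:))
  end do
  !$omp end parallel do

  C = 0.d0                                ! manual reduction across the thread copies
  do t = 1, nthreads
    C = C + Cpart(:,:,t)
  end do

  print *, 'max error vs matmul(A,B):', maxval(abs(C - matmul(A,B)))
end program partial_reduction
```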

I am also uncertain about your use of bB(first,1), as this does not look like your blocking approach.

John

Tue_B_
Novice

Hey guys thanks for both the replies!

Steve, you are of course right that stack size is the problem. I had already played around with it and found that increasing it seemed to be of no use; apparently there is a maximum stack size of 1 GB on Windows, which I was already over. I guess there is no way to increase this limit?

https://software.intel.com/en-us/articles/memory-limits-applications-windows

John, your way will fix the stack overflow problem, but unfortunately I think the method will be too slow. Regarding bB(first,1): you are right, it is not the blocking approach, at least not yet, but conceptually it should be the same.

 

What I have been trying to do is related to this question I asked previously (which I will update on when I find what I want):

https://software.intel.com/en-us/forums/topic/542559

So the idea is to do matrix multiplication on NUMA systems as efficiently as possible. Right now my idea is as follows:

split A column-wise into NUMA blocks, [A1,A2,A3,...,Anuma], and B row-wise into NUMA blocks.

You then have A*B = A1*B1 + A2*B2 + ... + Anuma*Bnuma, so each block product can be computed locally on one NUMA node, and all that is left is the summation across the NUMA nodes.
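That identity is easy to check with a small serial sketch (matmul stands in for dgemm; the sizes n and nblk are made up for the example):

```fortran
! Check A*B = A1*B1 + ... + Anblk*Bnblk, where Ak is a column block of A
! and Bk is the matching row block of B.
program block_sum_check
  implicit none
  integer, parameter :: n = 6, nblk = 3, w = n/nblk
  real(8) :: A(n,n), B(n,n), C(n,n)
  integer :: k, first, last

  call random_number(A)
  call random_number(B)

  C = 0.d0
  do k = 1, nblk
    first = (k-1)*w + 1
    last  = k*w
    ! column block of A times the matching row block of B
    C = C + matmul(A(:,first:last), B(first:last,:))
  end do

  print *, 'max error:', maxval(abs(C - matmul(A,B)))   ! rounding noise only
end program block_sum_check
```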

The thing is, in order to do the above directly, each node needs a private copy of C (C = A*B), which is too memory-intensive...

However, if I also split the rows of A and the columns of B, I can control the dimension of the temporary private C.

With a hard upper limit of 1 GB for the stack size, the max dimension I can use for a real*8 matrix and 8 NUMA threads is 160*160, which will hopefully be enough elements not to lose too much to overhead.

Code-wise, I'm currently stuck on the fact that I can't do a reduction over a subarray, REDUCTION(+:bC(first:last,first:last)), but I'm hoping this can be overcome with a pointer.

Anyway thank you guys for your input, I will post an update on what I end up with later.

Steven_L_Intel1
Employee

You are correct regarding the 1 GB stack size limit - that's a wall you can't break through. You might try compiling with /heap-arrays and see if that helps.
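For example (the source file name is made up; /heap-arrays moves automatic and temporary arrays from the stack to the heap):

```shell
:: Windows ifort: put automatic/temporary arrays on the heap instead of the stack
ifort /Qopenmp /heap-arrays matmul_numa.f90
```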
