Hi,
I am trying to use OpenMP Sections to parallelize function calls, e.g.
[cpp]!$OMP PARALLEL default(shared)
!$OMP SECTIONS
!$OMP SECTION
      call swap(a1,b1,n)
!$OMP SECTION
      call swap(a2,b2,n)
!$OMP END SECTIONS
!$OMP END PARALLEL[/cpp]where swap(a,b,n) swaps arrays a and b of size n:
[cpp]subroutine swap(a,b,n)
integer :: n
real :: a(n), b(n)
real :: atmp(n)
atmp = a
a = b
b = atmp
return
end[/cpp]The problem is that when n is small the code runs with no problem, but when n is large (in my case n = 9,280,656) it terminates at run time with the bare error message 'Aborted'. Even with the check all and warn all options turned on, I still only got the 'Aborted' message. Using the TotalView debugger, I found that the array atmp in subroutine swap had a bad address, and that it occurred in the second thread.
I am using the ifort 11.0/074 build. Note that I didn't have to use the -mcmodel=large option to compile the code because the common block is still under 2 GB; however, the problem does seem to be related to the array size n.
Does anyone have any clue? Any help is much appreciated.
Thanks.
2 Replies
Your local arrays in swap, such as atmp, are automatic arrays, created on the stack, with no check for successful allocation; allocatable arrays with error checking are recommended in such cases. So you would expect to hit the thread stack size limit at some point if you don't use the environment variable KMP_STACKSIZE, or the corresponding library call, to increase the limit.
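A minimal sketch of the allocatable approach, assuming the interface from the original post (the error handling shown here is illustrative, not prescribed):

```fortran
subroutine swap(a, b, n)
  implicit none
  integer, intent(in)    :: n
  real,    intent(inout) :: a(n), b(n)
  real,    allocatable   :: atmp(:)
  integer :: ierr

  ! Allocate from the heap instead of the thread stack,
  ! and verify that the allocation actually succeeded.
  allocate(atmp(n), stat=ierr)
  if (ierr /= 0) then
    print *, 'swap: allocation of atmp failed, n =', n
    return
  end if

  atmp = a
  a    = b
  b    = atmp

  deallocate(atmp)
end subroutine swap
```

Alternatively, keep the automatic array and raise the per-thread stack limit before running, e.g. `export KMP_STACKSIZE=512m`; the value needed depends on n and on how many automatic arrays each thread uses.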
If you want to write efficient code, you would optimize the size of the temporary data blocks, most likely making them small enough for first-level cache locality, and write a loop that copies by cacheable blocks. Then you wouldn't need a stack size increase on that account, and you wouldn't depend on as many compiler or library decisions about nontemporal storage. In fact, if you copy one element at a time, compiler optimization ought to produce even more efficient code (assuming successful vectorization), as the temporary storage can be registerized.
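One way to sketch the blocked copy described above (the block size and subroutine name are assumptions; tune the block to your first-level cache):

```fortran
subroutine swap_blocked(a, b, n)
  implicit none
  integer, intent(in)    :: n
  real,    intent(inout) :: a(n), b(n)
  integer, parameter     :: blk = 4096   ! elements per block; tune for L1 cache
  real    :: tmp(blk)                    ! small fixed-size temporary, stays on stack
  integer :: i, m

  do i = 1, n, blk
    m = min(blk, n - i + 1)
    ! The temporary block is small enough to remain cache-resident,
    ! so no large automatic array and no stack-size increase is needed.
    tmp(1:m)   = a(i:i+m-1)
    a(i:i+m-1) = b(i:i+m-1)
    b(i:i+m-1) = tmp(1:m)
  end do
end subroutine swap_blocked
```

In the limit blk = 1 this degenerates to a plain element-wise loop with a scalar temporary, which a vectorizing compiler can keep entirely in registers.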
Quoting - tim18
Your local arrays in swap, such as atmp, are automatic arrays, created on the stack, with no check for successful allocation; allocatable arrays with error checking are recommended in such cases. So you would expect to hit the thread stack size limit at some point if you don't use the environment variable KMP_STACKSIZE, or the corresponding library call, to increase the limit.
If you want to write efficient code, you would optimize the size of the temporary data blocks, most likely making them small enough for first-level cache locality, and write a loop that copies by cacheable blocks. Then you wouldn't need a stack size increase on that account, and you wouldn't depend on as many compiler or library decisions about nontemporal storage. In fact, if you copy one element at a time, compiler optimization ought to produce even more efficient code (assuming successful vectorization), as the temporary storage can be registerized.
Thank you very much for the helpful explanation. Actually, the swap subroutine originally used a loop with a scalar temporary variable to do the swap; I changed it to the form shown in my first post, and the code seemed to run faster (in the serial version, with the -O3 -ipo -xhost options turned on). I figured the compiler could probably optimize the vectorized code better, and in fact I have started using F90 array syntax as much as I can...
It seems you suggest a loop over blocks of the array instead. I will run some experiments to play with it.
Thanks
