<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Some help/ideas with OpenMP parallelism in Intel® Fortran Compiler</title>
    <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Some-help-ideas-with-OpenMP-parallelism/m-p/741288#M853</link>
    <description>&lt;DIV style="margin:0px;"&gt;&lt;/DIV&gt;
&lt;BR /&gt;Fivos,&lt;BR /&gt;&lt;BR /&gt;I forgot to add: for your second part, if the number of entries to merge is large, you can add marginally more complex code to compute the source and destination ranges for each thread's portion, then start a second parallel region (or use a barrier within the first parallel region) and perform a concurrent merge into separate parts of the X and Y arrays. I will leave that as an exercise on your part.&lt;BR /&gt;&lt;BR /&gt;Jim Dempsey&lt;BR /&gt;</description>
    <pubDate>Thu, 12 Nov 2009 17:28:05 GMT</pubDate>
    <dc:creator>jimdempseyatthecove</dc:creator>
    <dc:date>2009-11-12T17:28:05Z</dc:date>
    <item>
      <title>Some help/ideas with OpenMP parallelism</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Some-help-ideas-with-OpenMP-parallelism/m-p/741285#M850</link>
      <description>Hi everyone,&lt;BR /&gt;&lt;BR /&gt;I am using the Intel Fortran Compiler 11.0.83 for Linux (Ubuntu 8.10, 64-bit) to build an algorithm for CFD (Smoothed Particle Hydrodynamics). I have used OpenMP directives for the computationally heavy areas of the algorithm in order to distribute the workload among more than one CPU, while I have left some other areas for serial execution. However, as the complexity of the simulated problems increased, these serial regions started to become a bottleneck, so I would like to ask for ideas from other, more experienced members on possible ways to parallelize them. In the following I present specific regions of the algorithm and the way I have thought to parallelize them, along with some concerns.&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;SPAN style="text-decoration: underline;"&gt;1st part :&lt;/SPAN&gt;&lt;/STRONG&gt; I have a large set of data and I want to find the minimum/maximum of an array. In serial code this would be done easily in such a manner:&lt;BR /&gt;&lt;BR /&gt;c ---------------------------------------------&lt;BR /&gt;xmin = 1000000 ! or another large number&lt;BR /&gt;xmax = -1000000 ! or another small number&lt;BR /&gt;do i=1,ntot&lt;BR /&gt;xmin=min(xmin,X(i)) ! X is an array which holds the values of the variable whose min/max is sought&lt;BR /&gt;xmax=max(xmax,X(i))&lt;BR /&gt;enddo&lt;BR /&gt;c -----------------------------------------------&lt;BR /&gt;&lt;BR /&gt;The parallel code that I have thought of for the same job is:&lt;BR /&gt;&lt;BR /&gt;c -----------------------------------------------&lt;BR /&gt;real*8,allocatable :: minimum(:),maximum(:) ! these arrays store the local minimum/maximum of each thread&lt;BR /&gt;dimension data_in (5000) ! this is the array containing the variable whose min/max is sought&lt;BR /&gt;&lt;BR /&gt;call omp_set_num_threads(8)&lt;BR /&gt;allocate (minimum(1:8),maximum(1:8)) ! 8 since there are 8 threads&lt;BR /&gt;&lt;BR /&gt;!$OMP PARALLEL PRIVATE(I,ID_OMP) &lt;BR /&gt;&lt;BR /&gt;id_omp=OMP_GET_THREAD_NUM()+1&lt;BR /&gt;&lt;BR /&gt;maximum(id_omp)=-1000&lt;BR /&gt;minimum(id_omp)=+1000&lt;BR /&gt;&lt;BR /&gt;!$OMP DO &lt;BR /&gt;do i=1,n&lt;BR /&gt;minimum(id_omp)=min(minimum(id_omp),data_in(i))&lt;BR /&gt;maximum(id_omp)=max(maximum(id_omp),data_in(i))&lt;BR /&gt;enddo&lt;BR /&gt;!$OMP END DO&lt;BR /&gt;!$OMP END PARALLEL&lt;BR /&gt;&lt;BR /&gt;amaximum_tot=-1000&lt;BR /&gt;aminimum_tot=+1000&lt;BR /&gt;&lt;BR /&gt;do i=1,8&lt;BR /&gt;amaximum_tot=max(maximum(i),amaximum_tot)&lt;BR /&gt;aminimum_tot=min(minimum(i),aminimum_tot)&lt;BR /&gt;enddo&lt;BR /&gt;&lt;BR /&gt;deallocate (minimum,maximum)&lt;BR /&gt;c -----------------------------------------------&lt;BR /&gt;&lt;BR /&gt;&lt;EM&gt;In this way the local minimum/maximum of one thread is protected from other threads' read/write operations. &lt;BR /&gt;Any other ideas for this part?&lt;/EM&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;SPAN style="text-decoration: underline;"&gt;2nd part :&lt;/SPAN&gt; &lt;/STRONG&gt;At a specific point I need to re-order the values of several arrays by omitting the values that do not satisfy a specific condition.&lt;BR /&gt;&lt;BR /&gt;The serial code is:&lt;BR /&gt;&lt;BR /&gt;c -----------------------------------------------&lt;BR /&gt;do i=1,ntot ! ntot is the maximum array index &lt;BR /&gt;if( [some condition] ) cycle&lt;BR /&gt;inew=inew+1&lt;BR /&gt;X(inew)=X(i)&lt;BR /&gt;Y(inew)=Y(i)&lt;BR /&gt;... (same for all other arrays) ...&lt;BR /&gt;enddo&lt;BR /&gt;&lt;BR /&gt;ntot=inew&lt;BR /&gt;c -----------------------------------------------&lt;BR /&gt;&lt;BR /&gt;The parallel code that I have thought of for that area:&lt;BR /&gt;&lt;BR /&gt;c -----------------------------------------------&lt;BR /&gt;real*8,allocatable :: X_temp(:,:),Y_temp(:,:)&lt;BR /&gt;integer, allocatable :: inew(:) &lt;BR /&gt;&lt;BR /&gt;call omp_set_num_threads(8)&lt;BR /&gt;allocate (X_temp(1:600,1:8),Y_temp(1:600,1:8))&lt;BR /&gt;allocate (inew(1:8))&lt;BR /&gt;&lt;BR /&gt;do i=1,8&lt;BR /&gt;inew(i)=0&lt;BR /&gt;enddo&lt;BR /&gt;&lt;BR /&gt;!$OMP PARALLEL SHARED (INEW) PRIVATE (I,MY_ID) &lt;BR /&gt;&lt;BR /&gt;MY_ID=OMP_GET_THREAD_NUM()+1&lt;BR /&gt;&lt;BR /&gt;!$OMP DO&lt;BR /&gt;do i=1,n&lt;BR /&gt;if ( [condition] ) cycle&lt;BR /&gt;inew(my_id)=inew(my_id)+1&lt;BR /&gt;X_temp(inew(my_id),my_id)=X(i)&lt;BR /&gt;Y_temp(inew(my_id),my_id)=Y(i)&lt;BR /&gt;enddo&lt;BR /&gt;!$OMP END DO&lt;BR /&gt;!$OMP END PARALLEL&lt;BR /&gt;&lt;BR /&gt;ntot=0&lt;BR /&gt;do i=1,8&lt;BR /&gt;do j=1,inew(i)&lt;BR /&gt;ntot=ntot+1&lt;BR /&gt;X(ntot)=X_temp(j,i)&lt;BR /&gt;Y(ntot)=Y_temp(j,i)&lt;BR /&gt;enddo&lt;BR /&gt;enddo&lt;BR /&gt;c -----------------------------------------------&lt;BR /&gt;&lt;BR /&gt;&lt;EM&gt;The above code works correctly, but it involves the allocation of a temporary (thread-local) array for each array I want to reorder. Is there any other way to do this without using so many arrays?&lt;/EM&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;SPAN style="text-decoration: underline;"&gt;3rd part :&lt;/SPAN&gt;&lt;/STRONG&gt; This part involves the creation of a linked list. In practice I have to find how many times a specific integer appears and which is the index of the array for the i-th time this specific integer appears.&lt;BR /&gt;&lt;BR /&gt;In serial this can be done in this way:&lt;BR /&gt;&lt;BR /&gt;c ----------------------------------------&lt;BR /&gt;do i=1,ntot&lt;BR /&gt;NXgrid=int((X(i)-Xmin)/cdx)&lt;BR /&gt;NYgrid=int((Y(i)-Ymin)/cdy)&lt;BR /&gt;NZgrid=int((Z(i)-Zmin)/cdz)&lt;BR /&gt;NPOS=ncelly*ncellz*NXgrid+ncellz*NYgrid+NZgrid+1&lt;BR /&gt;&lt;BR /&gt;npc(Npos)=npc(Npos)+1&lt;BR /&gt;jjj(Npos,npc(Npos))=i&lt;BR /&gt;enddo&lt;BR /&gt;c ----------------------------------------&lt;BR /&gt;&lt;BR /&gt;In this way, for a specific NPOS (NPOS&amp;gt;1), I can find how many occurrences have that NPOS (it is npc(Npos)); the array index for the first occurrence with that NPOS is jjj(Npos,1), and for the i-th occurrence, jjj(Npos,i).&lt;BR /&gt;&lt;BR /&gt;I wasn't able to fully parallelize that part. The calculation of the npc array can be done with a reduction directive as follows:&lt;BR /&gt;&lt;BR /&gt;c ----------------------------------------
&lt;P&gt;!$OMP PARALLEL SHARED(JJJ) PRIVATE(I,ID_OMP) REDUCTION (+ : NPC)&lt;BR /&gt;&lt;BR /&gt;id_omp=OMP_GET_THREAD_NUM()+1&lt;BR /&gt;&lt;BR /&gt;!$OMP DO &lt;BR /&gt;do i=1,n&lt;BR /&gt;npc(npos)=npc(npos)+1&lt;BR /&gt;c ----- jjj(npos,npc(npos))=i &lt;BR /&gt;enddo&lt;BR /&gt;!$OMP END DO&lt;BR /&gt;!$OMP END PARALLEL&lt;BR /&gt;c ---------------------------------------&lt;BR /&gt;&lt;BR /&gt;&lt;EM&gt;How can I include the jjj array calculation in the above parallel region? Note that if the 'c-----' is removed, the results for jjj are not correct. Any ideas on that?&lt;/EM&gt;&lt;BR /&gt;&lt;BR /&gt;I am looking forward to hearing any ideas. Thanks in advance for your time/effort.&lt;/P&gt;</description>
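The 2nd part above is a stream-compaction pattern: each thread filters its slice into a private buffer, and the buffers are concatenated serially afterwards. The thread's code is Fortran/OpenMP; the same per-thread-buffer idea can be sketched in Python (names here are illustrative, not from the thread):

```python
# Sketch of the question's 2nd part: each worker compacts its own
# contiguous slice into a private list (like X_temp(:,my_id)), then a
# serial pass concatenates the lists (like the final ntot loop).
from concurrent.futures import ThreadPoolExecutor

def compact(data, keep, nthreads=4):
    stride = max(len(data) // nthreads, 1)
    slices = [data[i:i + stride] for i in range(0, len(data), stride)]
    with ThreadPoolExecutor(max_workers=nthreads) as ex:
        # each slice is filtered independently; order within slices kept
        parts = ex.map(lambda s: [x for x in s if keep(x)], slices)
    out = []
    for p in parts:   # serial merge, preserving slice order
        out.extend(p)
    return out
```

As in the Fortran version, the merge is serial; the posts below discuss how to remove the per-array temporaries.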
      <pubDate>Thu, 12 Nov 2009 13:04:05 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Some-help-ideas-with-OpenMP-parallelism/m-p/741285#M850</guid>
      <dc:creator>fivos</dc:creator>
      <dc:date>2009-11-12T13:04:05Z</dc:date>
    </item>
    <item>
      <title>Re: Some help/ideas with OpenMP parallelism</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Some-help-ideas-with-OpenMP-parallelism/m-p/741286#M851</link>
      <description>&lt;DIV style="margin: 0px; height: auto;"&gt;&lt;/DIV&gt;
In your 1st part, you don't explain why you don't consider a min reduction. Beyond that, if your concern is performance, there might be an advantage in two levels of loops: the outer one with an OpenMP reduction and the inner one an SSE-vectorizable reduction, somewhat like what you show. &lt;BR /&gt;If you want your algorithm to also work in portable C, of course you must avoid min reduction. The alternatives seem to be, for a small number of threads, to use a scalar reduction as you have done, or a parallel loop with a critical section.&lt;BR /&gt;</description>
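The per-thread scalar reduction that the reply refers to (each thread keeps a private partial min/max, combined serially at the end) can be sketched in Python; this is a hypothetical illustration, not code from the thread:

```python
# Per-thread partial min/max, then a serial combine, mirroring the
# minimum(id_omp)/maximum(id_omp) scheme in the original question.
from concurrent.futures import ThreadPoolExecutor

def chunk_minmax(chunk):
    # each "thread" reduces only its own slice (private accumulators)
    return min(chunk), max(chunk)

def parallel_minmax(data, nthreads=4):
    # contiguous slices, one per worker
    stride = max(len(data) // nthreads, 1)
    chunks = [data[i:i + stride] for i in range(0, len(data), stride)]
    with ThreadPoolExecutor(max_workers=nthreads) as ex:
        partials = list(ex.map(chunk_minmax, chunks))
    # serial combine of the partial results (the do i=1,8 loop)
    return min(p[0] for p in partials), max(p[1] for p in partials)
```

With an OpenMP `reduction(min: xmin) reduction(max: xmax)` clause the combine step is done by the runtime instead.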
      <pubDate>Thu, 12 Nov 2009 14:04:12 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Some-help-ideas-with-OpenMP-parallelism/m-p/741286#M851</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2009-11-12T14:04:12Z</dc:date>
    </item>
    <item>
      <title>Re: Some help/ideas with OpenMP parallelism</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Some-help-ideas-with-OpenMP-parallelism/m-p/741287#M852</link>
      <description>&lt;DIV style="margin:0px;"&gt;&lt;/DIV&gt;
&lt;BR /&gt;Fivos,&lt;BR /&gt;&lt;BR /&gt;The following UNTESTED code should get you started&lt;BR /&gt;&lt;BR /&gt;Jim Dempsey&lt;BR /&gt;&lt;BR /&gt;
&lt;PRE&gt;[cpp]! 1st part : I have a large set of data and
! I want to find the minimum/maximum of an array.
! In serial code this would be done easily in such a manner : 

! ---------------------------------------------
xmin = 1000000    ! or another large number
xmax = -1000000   ! or another small number
do i=1,ntot
 xmin=min(xmin,X(i))            ! X is an array which holds the value of the desired variable to find  min/max
 xmax=max(xmax,X(i))
enddo
! -----------------------------------------------

! The parallel code for the same job that I have thought is:

! -----------------------------------------------
! *** attribute current subroutine with RECURSIVE
! *** add in module
! integer, parameter ::  YOUR_MAX_THREADS = 64 ! or 256, or whatever you want
! these arrays store the local minimum/maximum of each thread
real*8 :: minimum(YOUR_MAX_THREADS),maximum(YOUR_MAX_THREADS)
! this is the array containing the desired variable to find min/max 
! *** n I assume is your 5000 
dimension data_in (n)
! *** add
integer :: num_threads
integer :: id_omp, istride, i, j

!$OMP PARALLEL PRIVATE(I,J,ID_OMP) 
! all threads overwrite num_threads with same value
num_threads = omp_get_num_threads()

id_omp=OMP_GET_THREAD_NUM() + 1
maximum(id_omp)=-1000
minimum(id_omp)=+1000
! all threads overwrite istride with same value
istride = n / num_threads
! protection for number of threads .gt. n
if(istride .eq. 0) istride = 1
! starting position
i = ((id_omp - 1) * istride) + 1
! ending position
j = i + istride - 1
! last thread gets remainder if any
if(id_omp .eq. num_threads) j = n
! protection for number of threads .gt. n
if(i .le. n) then
  minimum(id_omp)=minval(data_in(i:j))
  maximum(id_omp)=maxval(data_in(i:j))
endif
!$OMP END PARALLEL

aminimum_tot=minval(minimum(1:num_threads))
amaximum_tot=maxval(maximum(1:num_threads))
! -----------------------------------------------


! 2nd part : At a specific area I need to re-order
! the values of several arrays by omitting the values
! of arrays that do not satisfy a specific condition. 

! The serial code is :

! -----------------------------------------------
do i=1,ntot       ! ntot is the maximum array index  
if( [some condition]  ) cycle
inew=inew+1
X(inew)=X(i)
Y(inew)=Y(i)
... (same for all other arrays) ....
enddo

ntot=inew
! -----------------------------------------------

! The parallel code for that area that I have thought: 

! -----------------------------------------------
! -----------------------------------------------
! *** attribute current subroutine with RECURSIVE
! *** add in module
! integer, parameter ::  YOUR_MAX_THREADS = 64 ! or 256, or whatever you want
! this array stores the per-thread count of kept entries
integer :: inew(YOUR_MAX_THREADS) 
! this is the array containing the desired variable to find min/max 
! *** n I assume is your 5000 
dimension data_in (n)
! *** add
integer :: num_threads
integer :: id_omp, istride, i, j

!$OMP PARALLEL SHARED (INEW) PRIVATE (I,J,K,MY_ID,inew_my_id) 
! all threads overwrite num_threads with same value
num_threads = omp_get_num_threads()

MY_ID=OMP_GET_THREAD_NUM()+1
inew(MY_ID)=0
! all threads overwrite istride with same value
istride = n / num_threads
! protection for number of threads .gt. n
if(istride .eq. 0) istride = 1
! starting position
i = ((MY_ID - 1) * istride) + 1
! ending position
j = i + istride - 1
! last thread gets remainder if any
if(MY_ID .eq. num_threads) j = n
! protection for number of threads .gt. n
if(i .le. n) then
  inew_my_id = 0
  do k=i,j
    if ( [condition] ) cycle
    X(i+inew_my_id)=X(k)
    Y(i+inew_my_id)=Y(k)
    inew_my_id=inew_my_id+1
  enddo
  inew(my_id)=inew_my_id
endif
!$OMP END PARALLEL

ntot=inew(1)    ! entries kept by the 1st thread are already in place
k = 0
do i=2,num_threads
  k = k + istride
  inew_my_id=inew(i)
  if(inew_my_id .gt. 0) then
    do j=1,inew_my_id
      ntot=ntot+1
      X(ntot)=X(k+j)
      Y(ntot)=Y(k+j)
    enddo
  endif
enddo
! -----------------------------------------------

! 3rd part : This part involves the creation
! of a linked list. In practice I have to find
! how many times a specific integer appears
! and which is the index of the array for
! the i-th time this specific integer appears.  

! In serial this can be done in this way : 

! ----------------------------------------
do i=1,ntot
  NXgrid=int((X(i)-Xmin)/cdx)
  NYgrid=int((Y(i)-Ymin)/cdy)
  NZgrid=int((Z(i)-Zmin)/cdz)
  NPOS=ncelly*ncellz*NXgrid+ncellz*NYgrid+NZgrid+1

  npc(Npos)=npc(Npos)+1
  jjj(Npos,npc(Npos))=i
enddo
! ----------------------------------------

! In this way for a specific NPOS (NPOS&amp;gt;1),
! I can find how many occurencies have that NPOS
! (it is npc(npos)) and the array index for the
! first occurence that has that NPOS would be
! jjj(Npos,1) and for the i-th occurence, jjj(Npos,i).

! I wasn't able to fully parallelize that part.

!$OMP PARALLEL PRIVATE(...)
!$OMP DO
do i=1,ntot
  NXgrid(i)=int((X(i)-Xmin)/cdx)
  NYgrid(i)=int((Y(i)-Ymin)/cdy)
  NZgrid(i)=int((Z(i)-Zmin)/cdz)
  NPOSprecalc(i)=ncelly*ncellz*NXgrid(i)+ncellz*NYgrid(i)+NZgrid(i)+1
enddo
!$OMP END DO
MY_ID=OMP_GET_THREAD_NUM()+1
num_threads = omp_get_num_threads()
istride = size(npc) / num_threads
! protection for number of threads .gt. n
if(istride .eq. 0) istride = 1
! starting position
ifrom = ((MY_ID - 1) * istride) + 1
! ending position
ito = ifrom + istride - 1
! last thread gets remainder if any
if(MY_ID .eq. num_threads) ito = size(npc)
!$OMP DO
do i=1,ntot
  npos = NPOSprecalc(i)
  if(npos .lt. ifrom) cycle
  if(npos .gt. ito) cycle
  npc(Npos)=npc(Npos)+1
  jjj(Npos,npc(Npos))=i
enddo
!$OMP END DO 
!$OMP END PARALLEL 
[/cpp]&lt;/PRE&gt;</description>
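The 3rd part's npc/jjj structure is a per-cell occurrence count plus an index list per cell. Before worrying about the parallel version, the serial invariant can be stated as a small Python sketch (a dict of lists standing in for the jjj matrix; names are illustrative):

```python
# npc(npos) counts how many particles fall in cell npos, and
# jjj(npos, k) is the original index of the k-th such particle.
def bin_indices(npos_values):
    jjj = {}
    # enumerate from 1 to match the Fortran loop index i
    for i, npos in enumerate(npos_values, start=1):
        jjj.setdefault(npos, []).append(i)
    # npc(npos) is simply the length of each per-cell list
    npc = {k: len(v) for k, v in jjj.items()}
    return npc, jjj
```

Any parallel scheme must reproduce exactly this result, which is why a plain reduction on npc alone is not enough: the jjj entries depend on the order in which each cell's counter is incremented.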
      <pubDate>Thu, 12 Nov 2009 17:20:47 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Some-help-ideas-with-OpenMP-parallelism/m-p/741287#M852</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2009-11-12T17:20:47Z</dc:date>
    </item>
    <item>
      <title>Re: Some help/ideas with OpenMP parallelism</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Some-help-ideas-with-OpenMP-parallelism/m-p/741288#M853</link>
      <description>&lt;DIV style="margin:0px;"&gt;&lt;/DIV&gt;
&lt;BR /&gt;Fivos,&lt;BR /&gt;&lt;BR /&gt;I forgot to add, for your second part, if the number of entries for merge is large then you can add margionaly more difficult code to compute the source and destinations for each thread portion then start a parallel region (or use a barrier within the first parallel region) and perform a concurrent merge to seperate parts of the X and Y arrays. I will leave that for an exercize on your part.&lt;BR /&gt;&lt;BR /&gt;Jim Dempsey&lt;BR /&gt;</description>
      <pubDate>Thu, 12 Nov 2009 17:28:05 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Some-help-ideas-with-OpenMP-parallelism/m-p/741288#M853</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2009-11-12T17:28:05Z</dc:date>
    </item>
    <item>
      <title>Re: Some help/ideas with OpenMP parallelism</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Some-help-ideas-with-OpenMP-parallelism/m-p/741289#M854</link>
      <description>&lt;DIV style="margin:0px;"&gt;First of all I have to thankboth ofyou for your quick response and ideas.&lt;BR /&gt;&lt;/DIV&gt;
@ Tim18 : I wasn't aware that there was a min reduction available in openmp directives. I am going to use it now. Thanks for reminding me that!&lt;BR /&gt;&lt;BR /&gt;@ Jim Dempsey: I willexamine furtherthe codes you' ve sent me. Thank you very much for your detailed answer. &lt;BR /&gt;</description>
      <pubDate>Fri, 13 Nov 2009 06:34:43 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Some-help-ideas-with-OpenMP-parallelism/m-p/741289#M854</guid>
      <dc:creator>fivos</dc:creator>
      <dc:date>2009-11-13T06:34:43Z</dc:date>
    </item>
    <item>
      <title>Re: Some help/ideas with OpenMP parallelism</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Some-help-ideas-with-OpenMP-parallelism/m-p/741290#M855</link>
      <description>&lt;DIV style="margin:0px;"&gt;Dear Jim Dempsey,&lt;/DIV&gt;
I have seen the codes you've provided me. Practically you split the computations from i=1,ntot across all CPU threads by dividing the total ntot by the number of threads and assigning different regions of the arrays to each thread explicitly (instead of letting OpenMP do that through the !$OMP DO directive). It works very nicely, especially for the second part, where there is no more need for thread-local temporary arrays. But for the third part I have two questions :&lt;BR /&gt;&lt;BR /&gt;- First : Why should the additional arrays NXgrid(i), NYgrid(i), NZgrid(i), NPOSprecalc(i) be created? I mean, the values of NXgrid, NYgrid, NZgrid, NPOS could be calculated in the final loop, since they are totally independent from each other, couldn't they? In this way there won't be any need for extra arrays.&lt;BR /&gt;&lt;BR /&gt;(it would look something like this: &lt;BR /&gt;from line 185 : &lt;BR /&gt;&lt;BR /&gt;do i=1,ntot &lt;BR /&gt;
&lt;P&gt;NXgrid=int((X(i)-Xmin)/cdx)&lt;BR /&gt;NYgrid=int((Y(i)-Ymin)/cdy)&lt;BR /&gt;NZgrid=int((Z(i)-Zmin)/cdz)&lt;BR /&gt;NPOS=ncelly*ncellz*NXgrid+ncellz*NYgrid+NZgrid+1&lt;BR /&gt;if(npos .lt. ifrom) cycle ...etc... (rest remains as it was) ...&lt;BR /&gt;&lt;BR /&gt;)&lt;BR /&gt;&lt;BR /&gt;- Secondly, is it safe to perform these operations (lines 189, 190) in the parallel region? &lt;BR /&gt;(npc(Npos)=npc(Npos)+1 &lt;BR /&gt;jjj(Npos,npc(Npos))=i )&lt;BR /&gt;I am afraid that this might be a possible data race, since threads might access/read/write the same npc(Npos) element at the same time, affecting the calculations of both npc(...) and jjj(...,...). What do you think about that?&lt;BR /&gt;&lt;BR /&gt;I am going to try it and test it further. And again, thank you for your help. &lt;BR /&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 13 Nov 2009 10:48:24 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Some-help-ideas-with-OpenMP-parallelism/m-p/741290#M855</guid>
      <dc:creator>fivos</dc:creator>
      <dc:date>2009-11-13T10:48:24Z</dc:date>
    </item>
    <item>
      <title>Re: Some help/ideas with OpenMP parallelism</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Some-help-ideas-with-OpenMP-parallelism/m-p/741291#M856</link>
      <description>&lt;DIV style="margin:0px;"&gt;&lt;/DIV&gt;
&lt;BR /&gt;The computation time of NPOS:&lt;BR /&gt;&lt;BR /&gt; NXgrid=int((X(i)-Xmin)/cdx)&lt;BR /&gt; NYgrid=int((Y(i)-Ymin)/cdy)&lt;BR /&gt; NZgrid=int((Z(i)-Zmin)/cdz)&lt;BR /&gt; NPOS=ncelly*ncellz*NXgrid+ncellz*NYgrid+NZgrid+1&lt;BR /&gt;&lt;BR /&gt;is significant.&lt;BR /&gt;&lt;BR /&gt;As you suggest:&lt;BR /&gt;&lt;BR /&gt; all threads calculate NPOS for the complete range of i&lt;BR /&gt; (*** see note below)&lt;BR /&gt; then accumulate and populate portions of the link table&lt;BR /&gt;&lt;BR /&gt;With 8 threads your method has:&lt;BR /&gt;&lt;BR /&gt; 8 x (time to calculate NPOS)&lt;BR /&gt; 8 x (~1/8 of the accumulate-and-populate link table work)&lt;BR /&gt;&lt;BR /&gt;As I suggest:&lt;BR /&gt;&lt;BR /&gt; 8 x (1/8 time to calculate NPOS)&lt;BR /&gt; 8 x (1/8 time to write NPOSprecalc(i))&lt;BR /&gt; second loop setup overhead&lt;BR /&gt; 8 x (~1/8 of the accumulate-and-populate link table work)&lt;BR /&gt;&lt;BR /&gt;Essentially my suggestion reduces the work by&lt;BR /&gt;&lt;BR /&gt; 8 x (7/8 time to calculate NPOS)&lt;BR /&gt;&lt;BR /&gt;at the expense of&lt;BR /&gt;&lt;BR /&gt; 8 x (one memory write)&lt;BR /&gt; plus the time to set up the loop&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;(*** note from above)&lt;BR /&gt;&lt;BR /&gt;Should the loop on i (your method) be parallel, then each thread&lt;BR /&gt;calculates 1/8 of NPOS but stores only ~1/8 of the NPOS entries in&lt;BR /&gt;your accumulators and link table. IOW a loss of ~7/8 of the data.&lt;BR /&gt;&lt;BR /&gt;I will answer the line 189 question in the next post.&lt;BR /&gt;&lt;BR /&gt;Jim Dempsey&lt;BR /&gt;</description>
      <pubDate>Fri, 13 Nov 2009 15:28:13 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Some-help-ideas-with-OpenMP-parallelism/m-p/741291#M856</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2009-11-13T15:28:13Z</dc:date>
    </item>
    <item>
      <title>Re: Some help/ideas with OpenMP parallelism</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Some-help-ideas-with-OpenMP-parallelism/m-p/741292#M857</link>
      <description>&lt;DIV style="margin:0px;"&gt;&lt;/DIV&gt;
&lt;BR /&gt;RE: 189&lt;BR /&gt;&lt;BR /&gt;You are correct.&lt;BR /&gt;&lt;BR /&gt;Remove the !$OMP DO surrounding the loop at 184-192.&lt;BR /&gt;IOW make all threads run the complete iteration space, not a slice.&lt;BR /&gt;The condition inside the loop restricts each thread to a separate zone.&lt;BR /&gt;&lt;BR /&gt;Note, the time to read the stored NPOSprecalc(i) and test for range&lt;BR /&gt;will be relatively fast since NPOSprecalc(i) will likely reside in cache.&lt;BR /&gt;&lt;BR /&gt;Jim Dempsey&lt;BR /&gt;</description>
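The fix described here (every thread scans the whole index range but commits only the cells that fall in its own zone, so no two threads ever touch the same npc/jjj slot) can be sketched in Python; the names and the thread-pool framing are illustrative, not from the thread:

```python
# Every worker scans all indices but only commits entries whose cell
# lies in its own [ifrom, ito] zone. Zones are disjoint, so the shared
# npc/jjj structures need no locks.
from concurrent.futures import ThreadPoolExecutor

def zoned_bin(npos_values, ncells, nthreads=2):
    npc = [0] * (ncells + 1)           # 1-based cell counts
    jjj = [[] for _ in range(ncells + 1)]
    stride = max(ncells // nthreads, 1)
    def worker(tid):
        ifrom = tid * stride + 1
        # last worker takes the remainder of the cell range
        ito = ncells if tid == nthreads - 1 else ifrom + stride - 1
        for i, npos in enumerate(npos_values, start=1):
            if npos >= ifrom and ito >= npos:   # this worker's zone only
                npc[npos] += 1
                jjj[npos].append(i)
    with ThreadPoolExecutor(max_workers=nthreads) as ex:
        list(ex.map(worker, range(nthreads)))
    return npc, jjj
```

The trade-off matches the cost analysis in the previous post: each worker re-reads every precomputed cell index, but the range test is cheap compared with recomputing NPOS.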
      <pubDate>Fri, 13 Nov 2009 15:34:33 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Some-help-ideas-with-OpenMP-parallelism/m-p/741292#M857</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2009-11-13T15:34:33Z</dc:date>
    </item>
    <item>
      <title>Re: Some help/ideas with OpenMP parallelism</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Some-help-ideas-with-OpenMP-parallelism/m-p/741293#M858</link>
      <description>&lt;DIV style="margin:0px;"&gt;&lt;/DIV&gt;
&lt;BR /&gt;Comments in general:&lt;BR /&gt;&lt;BR /&gt;Although !$OMP DO is a convenient way to divide up the iteration space of a loop, it comes with the consequence that each thread does not know the extents of its slice. When this information cannot be used to your advantage, then by all means use !$OMP DO. However, when knowledge of the extents of the slice is important, such as for eliminating critical sections, atomics, or temporary arrays, then the added effort of explicit slicing can be rewarded with substantial performance gains.&lt;BR /&gt;&lt;BR /&gt;The manual slicing technique above assumes all threads assigned to the parallel region have equal availability of CPU resources. If this is not the case (e.g. other apps running on the system), then a little more hand work will be necessary to divide up the slice space (each former slice has sub-slices that can be run by other threads).&lt;BR /&gt;&lt;BR /&gt;To see the issue, consider&lt;BR /&gt;&lt;BR /&gt; !$OMP PARALLEL&lt;BR /&gt;&lt;BR /&gt;scheduling 8 threads while other apps on the system consume 2 of them: those two threads may not begin the parallel region until after the other 6 threads have completed it.&lt;BR /&gt;&lt;BR /&gt;The fix might be to create an !$OMP SECTIONS or !$OMP WORKSHARE, but then this would hardwire a specific number of threads (check to see if the OpenMP task capability is now supported in IVF).&lt;BR /&gt;&lt;BR /&gt;Jim Dempsey&lt;BR /&gt;</description>
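The explicit-slicing arithmetic used throughout Jim's listings (stride = n / num_threads, last thread takes the remainder, empty slices when there are more threads than items) can be written out as a small Python sketch with hypothetical names:

```python
# Compute each worker's 1-based [i, j] index range the way the posted
# Fortran does: equal strides, last worker absorbs the remainder, and
# workers beyond the data get an empty (None) slice.
def slice_bounds(n, nthreads):
    stride = max(n // nthreads, 1)
    bounds = []
    for tid in range(1, nthreads + 1):
        i = (tid - 1) * stride + 1
        j = n if tid == nthreads else i + stride - 1
        if i > n:                 # more workers than items
            bounds.append(None)
        else:
            bounds.append((i, j))
    return bounds
```

This is the information an !$OMP DO hides from the thread, and which the posts above use to avoid critical sections and temporary arrays.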
      <pubDate>Fri, 13 Nov 2009 15:57:35 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Some-help-ideas-with-OpenMP-parallelism/m-p/741293#M858</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2009-11-13T15:57:35Z</dc:date>
    </item>
  </channel>
</rss>

