I am using the Intel Fortran Compiler 11.0.83 for Linux (Ubuntu 8.10, 64-bit) to build an algorithm for CFD (Smoothed Particle Hydrodynamics). I have used OpenMP directives in the computationally heavy areas of the algorithm in order to distribute the workload among more than one CPU, while I have left some other areas for serial execution. However, as the complexity of the simulated problems increased, these serial regions started to become a bottleneck, so I would like to ask more experienced members for ideas on possible ways to parallelize them. Below I present specific regions of the algorithm and the way I have thought to parallelize them, along with some concerns.
1st part : I have a large set of data and I want to find the minimum/maximum of an array. In serial code this would be done as follows:
c ---------------------------------------------
xmin = 1000000   ! or another large number
xmax = -1000000  ! or another small number
do i=1,ntot
xmin=min(xmin,X(i)) ! X is an array which holds the value of the desired variable to find min/max
xmax=max(xmax,X(i))
enddo
c -----------------------------------------------
The parallel code I have come up with for the same job is:
c -----------------------------------------------
real*8,allocatable :: minimum(:),maximum(:) ! these arrays store the local minimum/maximum of each thread
dimension data_in (5000) ! this is the array containing the desired variable to find min/max
call omp_set_num_threads(8)
allocate (minimum(1:8),maximum(1:8)) ! 8 since there are 8 threads
!$OMP PARALLEL PRIVATE(I,ID_OMP)
id_omp=OMP_GET_THREAD_NUM()+1
maximum(id_omp)=-1000
minimum(id_omp)=+1000
!$OMP DO
do i=1,n
minimum(id_omp)=min(minimum(id_omp),data_in(i))
maximum(id_omp)=max(maximum(id_omp),data_in(i))
enddo
!$OMP ENDDO
!$OMP END PARALLEL
amaximum_tot=-1000
aminimum_tot=+1000
do i=1,8
amaximum_tot=max(maximum(i),amaximum_tot)
aminimum_tot=min(minimum(i),aminimum_tot)
enddo
deallocate (minimum,maximum)
c -----------------------------------------------
In this way the local minimum/maximum of each thread is protected from read/write operations by other threads.
Any other ideas for this part?
2nd part : At a specific point I need to re-order the values of several arrays by omitting the entries that do not satisfy a specific condition.
The serial code is :
c -----------------------------------------------
do i=1,ntot ! ntot is the maximum array index
if( [some condition] ) cycle
inew=inew+1
X(inew)=X(i)
Y(inew)=Y(i)
... (same for all other arrays) ....
enddo
ntot=inew
c -----------------------------------------------
The parallel code I have come up with for that area is:
c -----------------------------------------------
real*8,allocatable :: X_temp(:,:),Y_temp(:,:)
integer, allocatable :: inew(:)
call omp_set_num_threads(8)
allocate (X_temp(1:600,1:8),Y_temp(1:600,1:8))
allocate (inew(1:8))
do i=1,8
inew(i)=0
enddo
!$OMP PARALLEL SHARED (INEW) PRIVATE (I,MY_ID)
MY_ID=OMP_GET_THREAD_NUM()+1
!$OMP DO
do i=1,n
if ( [condition] ) cycle
inew(my_id)=inew(my_id)+1
X_temp(inew(my_id),my_id)=X(i)
Y_temp(inew(my_id),my_id)=Y(i)
enddo
!$OMP ENDDO
!$OMP END PARALLEL
ntot=0
do i=1,8
do j=1,inew(i)
ntot=ntot+1
X(ntot)=X_temp(j,i)
Y(ntot)=Y_temp(j,i)
enddo
enddo
c -----------------------------------------------
The above code works correctly, but it requires allocating a temporary (thread-local) array for every array I want to reorder. Is there any other way to do this without using so many arrays?
3rd part : This part involves the creation of a linked list. In practice I have to find how many times a specific integer appears, and which array index corresponds to the i-th time this specific integer appears.
In serial this can be done in this way :
c ----------------------------------------
do i=1,ntot
NXgrid=int((X(i)-Xmin)/cdx)
NYgrid=int((Y(i)-Ymin)/cdy)
NZgrid=int((Z(i)-Zmin)/cdz)
NPOS=ncelly*ncellz*NXgrid+ncellz*NYgrid+NZgrid+1
npc(Npos)=npc(Npos)+1
jjj(Npos,npc(Npos))=i
enddo
c ----------------------------------------
In this way, for a specific NPOS (NPOS>1), I can find how many occurrences have that NPOS (it is npc(Npos)); the array index for the first occurrence with that NPOS is jjj(Npos,1), and for the i-th occurrence it is jjj(Npos,i).
I wasn't able to fully parallelize that part. The calculation of the npc array can be done with a reduction directive as follows:
c ----------------------------------------
!$OMP PARALLEL SHARED(JJJ) PRIVATE(I,ID_OMP,NPOS) REDUCTION (+ : NPC)
id_omp=OMP_GET_THREAD_NUM()+1
!$OMP DO
do i=1,n
! NPOS computed from X(i), Y(i), Z(i) as in the serial loop above
npc(npos)=npc(npos)+1
c ----- jjj(npos,npc(npos))=i
enddo
!$OMP ENDDO
!$OMP END PARALLEL
c ---------------------------------------
How can I include the jjj array calculation in the above parallel region? Note that if the 'c -----' comment marker is removed (i.e. the jjj line is enabled), the results for jjj are not correct. Any ideas on that?
I am looking forward to hearing any ideas. Thanks in advance for your time/effort.
If you want your algorithm to work also in portable C, of course you must avoid the MIN reduction. The alternatives seem to be, for a small number of threads, to use a scalar reduction as you have done, or a parallel loop with a critical section.
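For illustration, a minimal sketch of the critical-section variant (untested; it assumes n, data_in and real*8 data as in the original post, and keeps a private partial min/max per thread so that each thread enters the critical section only once):
! -----------------------------------------------
use omp_lib
real*8 :: xmin, xmax, xmin_loc, xmax_loc
xmin =  huge(xmin)
xmax = -huge(xmax)
!$OMP PARALLEL PRIVATE(I,XMIN_LOC,XMAX_LOC) SHARED(XMIN,XMAX,DATA_IN)
xmin_loc =  huge(xmin_loc)
xmax_loc = -huge(xmax_loc)
!$OMP DO
do i=1,n
   xmin_loc = min(xmin_loc, data_in(i))
   xmax_loc = max(xmax_loc, data_in(i))
enddo
!$OMP ENDDO
! each thread folds its partial result into the shared scalars exactly once
!$OMP CRITICAL (minmax_merge)
xmin = min(xmin, xmin_loc)
xmax = max(xmax, xmax_loc)
!$OMP END CRITICAL (minmax_merge)
!$OMP END PARALLEL
! -----------------------------------------------
If portability to C is not a concern, OpenMP in Fortran also accepts MIN and MAX as reduction operators, so the loop could instead carry REDUCTION(MIN:xmin) and REDUCTION(MAX:xmax) clauses.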
Fivos,
The following UNTESTED code should get you started
Jim Dempsey
[cpp]
! 1st part : I have a large set of data and
! I want to find the minimum/maximum of an array.
! In serial code this would be done easily in such a manner :
! ---------------------------------------------
xmin = 1000000   ! or another large number
xmax = -1000000  ! or another small number
do i=1,ntot
  xmin=min(xmin,X(i)) ! X is an array which holds the value of the desired variable to find min/max
  xmax=max(xmax,X(i))
enddo
! -----------------------------------------------
! The parallel code for the same job that I have thought is:
! -----------------------------------------------
! *** attribute current subroutine with RECURSIVE
! *** add in module
! integer, parameter :: YOUR_MAX_THREADS = 64 ! or 256, or whatever you want
! these arrays store the local minimum/maximum of each thread
real*8 :: minimum(YOUR_MAX_THREADS),maximum(YOUR_MAX_THREADS)
! this is the array containing the desired variable to find min/max
! *** n I assume is your 5000
dimension data_in (n)
! *** add
integer :: num_threads
integer :: id_omp, istride, i, j
!$OMP PARALLEL PRIVATE(I,J,ID_OMP)
! all threads overwrite num_threads with same value
num_threads = omp_get_num_threads()
id_omp=OMP_GET_THREAD_NUM() + 1
maximum(id_omp)=-1000
minimum(id_omp)=+1000
! all threads overwrite istride with same value
istride = n / num_threads
! protection for number of threads .gt. n
if(istride .eq. 0) istride = 1
! starting position
i = ((id_omp - 1) * istride) + 1
! ending position
j = i + istride - 1
! last thread gets remainder if any
if(id_omp .eq. num_threads) j = n
! protection for number of threads .gt. n
if(i .le. n) then
  minimum(id_omp)=minval(data_in(i:j))
  maximum(id_omp)=maxval(data_in(i:j))
endif
!$OMP END PARALLEL
aminimum_tot=minval(minimum(1:num_threads))
amaximum_tot=maxval(maximum(1:num_threads))
! -----------------------------------------------
! 2nd part : At a specific area I need to re-order
! the values of several arrays by omitting the values
! of arrays that do not satisfy a specific condition.
! The serial code is :
! -----------------------------------------------
do i=1,ntot ! ntot is the maximum array index
  if( [some condition] ) cycle
  inew=inew+1
  X(inew)=X(i)
  Y(inew)=Y(i)
  ... (same for all other arrays) ....
enddo
ntot=inew
! -----------------------------------------------
! The parallel code for that area that I have thought:
! -----------------------------------------------
! *** attribute current subroutine with RECURSIVE
! *** add in module
! integer, parameter :: YOUR_MAX_THREADS = 64 ! or 256, or whatever you want
! this array stores the count of kept entries for each thread
integer :: inew(YOUR_MAX_THREADS)
! *** n I assume is the length of X and Y
! *** add
integer :: num_threads
integer :: my_id, inew_my_id, istride, i, j, k
!$OMP PARALLEL SHARED (INEW) PRIVATE (I,J,K,MY_ID,inew_my_id)
! all threads overwrite num_threads with same value
num_threads = omp_get_num_threads()
MY_ID=OMP_GET_THREAD_NUM()+1
inew(MY_ID)=0
! all threads overwrite istride with same value
istride = n / num_threads
! protection for number of threads .gt. n
if(istride .eq. 0) istride = 1
! starting position
i = ((MY_ID - 1) * istride) + 1
! ending position
j = i + istride - 1
! last thread gets remainder if any
if(MY_ID .eq. num_threads) j = n
! protection for number of threads .gt. n
if(i .le. n) then
  inew_my_id = 0
  do k=i,j
    if ( [condition] ) cycle
    X(i+inew_my_id)=X(k)
    Y(i+inew_my_id)=Y(k)
    inew_my_id=inew_my_id+1
  enddo
  inew(my_id)=inew_my_id
endif
!$OMP END PARALLEL
! serial merge of the per-thread compacted blocks
ntot=inew(1) ! inew of 1st thread
k = 0
do i=2,num_threads
  k = k + istride
  inew_my_id=inew(i)
  if(inew_my_id .gt. 0) then
    do j=1,inew_my_id
      ntot=ntot+1
      X(ntot)=X(k+j)
      Y(ntot)=Y(k+j)
    enddo
  endif
enddo
! -----------------------------------------------
! 3rd part : This part involves the creation
! of a linked list. In practice I have to find
! how many times a specific integer appears
! and which is the index of the array for
! the i-th time this specific integer appears.
! In serial this can be done in this way :
! ----------------------------------------
do i=1,ntot
  NXgrid=int((X(i)-Xmin)/cdx)
  NYgrid=int((Y(i)-Ymin)/cdy)
  NZgrid=int((Z(i)-Zmin)/cdz)
  NPOS=ncelly*ncellz*NXgrid+ncellz*NYgrid+NZgrid+1
  npc(Npos)=npc(Npos)+1
  jjj(Npos,npc(Npos))=i
enddo
! ----------------------------------------
! In this way for a specific NPOS (NPOS>1),
! I can find how many occurrences have that NPOS
! (it is npc(npos)) and the array index for the
! first occurrence that has that NPOS would be
! jjj(Npos,1) and for the i-th occurrence, jjj(Npos,i).
! I wasn't able to fully parallelize that part.
!$OMP PARALLEL PRIVATE(...)
!$OMP DO
do i=1,ntot
  NXgrid(i)=int((X(i)-Xmin)/cdx)
  NYgrid(i)=int((Y(i)-Ymin)/cdy)
  NZgrid(i)=int((Z(i)-Zmin)/cdz)
  NPOSprecalc(i)=ncelly*ncellz*NXgrid(i)+ncellz*NYgrid(i)+NZgrid(i)+1
enddo
!$OMP END DO
MY_ID=OMP_GET_THREAD_NUM()+1
num_threads = omp_get_num_threads()
istride = size(npc) / num_threads
! protection for number of threads .gt. n
if(istride .eq. 0) istride = 1
! starting position
ifrom = ((MY_ID - 1) * istride) + 1
! ending position
ito = ifrom + istride - 1
! last thread gets remainder if any
if(MY_ID .eq. num_threads) ito = size(npc)
!$OMP DO
do i=1,ntot
  npos = NPOSprecalc(i)
  if(npos .lt. ifrom) cycle
  if(npos .gt. ito) cycle
  npc(Npos)=npc(Npos)+1
  jjj(Npos,npc(Npos))=i
enddo
!$OMP END DO
!$OMP END PARALLEL
[/cpp]
Fivos,
I forgot to add, for your second part: if the number of entries to merge is large, then you can add marginally more difficult code to compute the source and destination for each thread's portion, then start a parallel region (or use a barrier within the first parallel region) and perform a concurrent merge into separate parts of the X and Y arrays. I will leave that as an exercise on your part.
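As a rough starting point, an untested sketch of that idea under the assumptions of the code above (each thread's surviving entries sit at the start of its slice, with its count in inew, and istride/num_threads still hold the values used during compaction). ioffset, Xout and Yout are hypothetical helper/output arrays; copying into separate output arrays keeps one thread's destination from overlapping another thread's not-yet-copied source:
! -----------------------------------------------
! exclusive prefix sum of the per-thread counts gives each thread
! a disjoint destination range, so the copies can run concurrently
ioffset(1) = 0
do i=2,num_threads
   ioffset(i) = ioffset(i-1) + inew(i-1)
enddo
ntot = ioffset(num_threads) + inew(num_threads)
!$OMP PARALLEL PRIVATE(MY_ID,ISRC,IDST,J)
my_id = OMP_GET_THREAD_NUM() + 1
isrc  = (my_id - 1) * istride    ! start of this thread's compacted block
idst  = ioffset(my_id)           ! start of its destination range
do j=1,inew(my_id)
   Xout(idst + j) = X(isrc + j)
   Yout(idst + j) = Y(isrc + j)
enddo
!$OMP END PARALLEL
! -----------------------------------------------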
Jim Dempsey
@ Jim Dempsey: I will examine further the codes you've sent me. Thank you very much for your detailed answer.
- First: why should the additional arrays NXgrid(i), NYgrid(i), NZgrid(i), NPOSprecalc(i) be created? I mean, the values of NXgrid, NYgrid, NZgrid, NPOS could be calculated in the final loop, since they are totally independent from each other, couldn't they? In this way there won't be any need for extra arrays.
(it would look something like this, starting from line 185 of your code, i.e. the second loop of the 3rd part:
do i=1,ntot
NXgrid=int((X(i)-Xmin)/cdx)
NYgrid=int((Y(i)-Ymin)/cdy)
NZgrid=int((Z(i)-Zmin)/cdz)
NPOS=ncelly*ncellz*NXgrid+ncellz*NYgrid+NZgrid+1
if(npos .lt. ifrom) cycle ... etc. (the rest remains as it was) ...
)
- Secondly, is it safe to perform these operations (lines 189, 190 — the npc and jjj updates) in the parallel region?
(npc(Npos)=npc(Npos)+1
jjj(Npos,npc(Npos))=i )
I am afraid that this might be a data race, since threads might read/write the same npc(npos) element at the same time, affecting both the calculation of npc(...) and of jjj(...,...). What do you think about that?
I am going to try it and test it further. And again thank you for your help.
The computation time of NPOS:
NXgrid=int((X(i)-Xmin)/cdx)
NYgrid=int((Y(i)-Ymin)/cdy)
NZgrid=int((Z(i)-Zmin)/cdz)
NPOS=ncelly*ncellz*NXgrid+ncellz*NYgrid+NZgrid+1
is significant
As you suggest:
all threads calculate NPOS for the complete range of i
(*** see note below)
then accumulate and populate portions of the link table.
With 8 threads your method costs:
8 x (time to calculate NPOS)
8 x (~1/8 of the accumulate-and-populate work on the link table)
As I suggest:
8 x (1/8 of the time to calculate NPOS)
8 x (1/8 of the time to write NPOSprecalc(i))
second-loop setup overhead
8 x (~1/8 of the accumulate-and-populate work on the link table)
Essentially my suggestion saves
8 x (7/8 of the time to calculate NPOS)
at the expense of
8 x (one memory write)
plus the time to set up the second loop.
(*** note from above)
Should the loop on i be parallel (your method), then each thread calculates NPOS for only 1/8 of the iterations, and of those it stores only the ~1/8 that fall in its own zone of the accumulators and link table. IOW, ~7/8 of the data would be lost.
I will answer the line 189 question in the next post.
Jim Dempsey
RE: 189
You are correct.
Remove the !$OMP DO surrounding the loop at lines 184-192 (the second loop of the 3rd part, where npc and jjj are updated); IOW, make all threads run the complete iteration rather than a slice. The condition inside the loop restricts each thread to a separate zone. Note, the time to read the stored NPOSprecalc(i) and test it for range will be relatively fast, since NPOSprecalc(i) will likely reside in cache.
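Putting it together, a sketch of the revised 3rd-part region with this change applied (untested; declarations for NPOSprecalc, npc, jjj and the cell constants as in the earlier post):
! -----------------------------------------------
!$OMP PARALLEL PRIVATE(I,MY_ID,NUM_THREADS,ISTRIDE,IFROM,ITO,NPOS,NXGRID,NYGRID,NZGRID)
!$OMP DO
do i=1,ntot
   NXgrid = int((X(i)-Xmin)/cdx)
   NYgrid = int((Y(i)-Ymin)/cdy)
   NZgrid = int((Z(i)-Zmin)/cdz)
   NPOSprecalc(i) = ncelly*ncellz*NXgrid + ncellz*NYgrid + NZgrid + 1
enddo
!$OMP END DO                     ! implicit barrier: NPOSprecalc is complete
my_id       = OMP_GET_THREAD_NUM() + 1
num_threads = omp_get_num_threads()
istride     = size(npc) / num_threads
if(istride .eq. 0) istride = 1
ifrom = (my_id - 1) * istride + 1
ito   = ifrom + istride - 1
if(my_id .eq. num_threads) ito = size(npc)
do i=1,ntot                      ! no !$OMP DO: every thread runs the full range
   npos = NPOSprecalc(i)
   if(npos .lt. ifrom) cycle     ! cell belongs to another thread's zone
   if(npos .gt. ito)   cycle
   npc(npos) = npc(npos) + 1     ! only this thread ever touches this cell
   jjj(npos, npc(npos)) = i
enddo
!$OMP END PARALLEL
! -----------------------------------------------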
Jim Dempsey
Comments in general:
Although !$OMP DO is a convenient way to divide up the iteration space of a loop, it comes with the consequence that each thread does not know the extents of its slice. When this information cannot be used to your advantage, then by all means use !$OMP DO. However, when knowledge of the extents of the slice is important, such as for eliminating critical sections, atomics, or temporary arrays, then the added effort of explicit slicing can be rewarded with substantial performance gains.
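To make that explicit-slicing pattern reusable, the bounds computation repeated in the snippets above could be factored into a small helper; a sketch, with a hypothetical name and untested:
! -----------------------------------------------
subroutine my_slice(n, ifrom, ito)
   ! give the calling thread the [ifrom,ito] extent of its slice of 1..n
   use omp_lib
   integer, intent(in)  :: n
   integer, intent(out) :: ifrom, ito
   integer :: num_threads, my_id, istride
   num_threads = omp_get_num_threads()
   my_id       = omp_get_thread_num() + 1
   istride = n / num_threads
   if(istride .eq. 0) istride = 1
   ifrom = (my_id - 1) * istride + 1
   ito   = ifrom + istride - 1
   if(my_id .eq. num_threads) ito = n    ! last thread takes the remainder
   if(ifrom .gt. n) then                 ! more threads than work: empty slice
      ifrom = 1
      ito   = 0
   endif
end subroutine my_slice
! -----------------------------------------------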
The manual slicing technique above assumes that all threads assigned to the parallel region have equal availability of CPU resources. If this is not the case (e.g. other apps running on the system), then a little more hand work will be necessary to divide up the slice space (each former slice gets sub-slices that can be run by other threads).
A halfway step exists to resolve the issue of, say, !$OMP PARALLEL scheduling 8 threads while other apps on the system consume 2 of them, so that those two threads do not begin the parallel region until after the other 6 threads have completed it. The fix might be to use !$OMP SECTIONS or !$OMP WORKSHARE, but then this would hardwire a specific number of threads (check to see if the OpenMP task capability is now supported in IVF).
Jim Dempsey
