<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Some help/ideas with OpenMP parallelism in Intel® Fortran Compiler</title>
    <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Some-help-ideas-with-OpenMP-parallelism/m-p/741288#M853</link>
    <description>&lt;DIV style="margin:0px;"&gt;&lt;/DIV&gt;
&lt;BR /&gt;Fivos,&lt;BR /&gt;&lt;BR /&gt;I forgot to add: for your second part, if the number of entries to merge is large, you can add marginally more complex code to compute the source and destination ranges for each thread's portion, then start a second parallel region (or use a barrier within the first parallel region) and perform a concurrent merge into separate parts of the X and Y arrays. I will leave that as an exercise on your part.&lt;BR /&gt;&lt;BR /&gt;Jim Dempsey&lt;BR /&gt;</description>
    <pubDate>Thu, 12 Nov 2009 17:28:05 GMT</pubDate>
    <dc:creator>jimdempseyatthecove</dc:creator>
    <dc:date>2009-11-12T17:28:05Z</dc:date>
    <item>
      <title>Some help/ideas with OpenMP parallelism</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Some-help-ideas-with-OpenMP-parallelism/m-p/741285#M850</link>
      <description>Hi everyone,&lt;BR /&gt;&lt;BR /&gt;I am using the Intel Fortran Compiler 11.0.83 for Linux (Ubuntu 8.10, 64-bit) to build an algorithm for CFD (Smoothed Particle Hydrodynamics). I have used OpenMP directives for the computationally heavy areas of the algorithm in order to distribute the workload among more than one CPU, while I have left some other areas for serial execution. However, as the complexity of the simulated problems increased, these serial regions started to become a bottleneck, so I would like to ask for ideas from other, more experienced members on possible ways to parallelize them. In the following I present specific regions of the algorithm and the way I have thought to parallelize them, along with some concerns.&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;SPAN style="text-decoration: underline;"&gt;1st part :&lt;/SPAN&gt;&lt;/STRONG&gt; I have a large set of data and I want to find the minimum/maximum of an array. In serial code this would be done easily in such a manner:&lt;BR /&gt;&lt;BR /&gt;c ---------------------------------------------&lt;BR /&gt;xmin = 1000000 ! or another large number&lt;BR /&gt;xmax = -1000000 ! or another small number&lt;BR /&gt;do i=1,ntot&lt;BR /&gt;xmin=min(xmin,X(i)) ! X is an array which holds the values of the variable whose min/max is sought&lt;BR /&gt;xmax=max(xmax,X(i))&lt;BR /&gt;enddo&lt;BR /&gt;c -----------------------------------------------&lt;BR /&gt;&lt;BR /&gt;The parallel code that I have thought of for the same job is:&lt;BR /&gt;&lt;BR /&gt;c -----------------------------------------------&lt;BR /&gt;real*8,allocatable :: minimum(:),maximum(:) ! these arrays store the local minimum/maximum of each thread&lt;BR /&gt;dimension data_in (5000) ! this is the array containing the variable whose min/max is sought&lt;BR /&gt;&lt;BR /&gt;call omp_set_num_threads(8)&lt;BR /&gt;allocate (minimum(1:8),maximum(1:8)) ! 8 since there are 8 threads&lt;BR /&gt;&lt;BR /&gt;!$OMP PARALLEL PRIVATE(I,ID_OMP) &lt;BR /&gt;&lt;BR /&gt;id_omp=OMP_GET_THREAD_NUM()+1&lt;BR /&gt;&lt;BR /&gt;maximum(id_omp)=-1000&lt;BR /&gt;minimum(id_omp)=+1000&lt;BR /&gt;&lt;BR /&gt;!$OMP DO &lt;BR /&gt;do i=1,n&lt;BR /&gt;minimum(id_omp)=min(minimum(id_omp),data_in(i))&lt;BR /&gt;maximum(id_omp)=max(maximum(id_omp),data_in(i))&lt;BR /&gt;enddo&lt;BR /&gt;!$OMP END DO&lt;BR /&gt;!$OMP END PARALLEL&lt;BR /&gt;&lt;BR /&gt;amaximum_tot=-1000&lt;BR /&gt;aminimum_tot=+1000&lt;BR /&gt;&lt;BR /&gt;do i=1,8&lt;BR /&gt;amaximum_tot=max(maximum(i),amaximum_tot)&lt;BR /&gt;aminimum_tot=min(minimum(i),aminimum_tot)&lt;BR /&gt;enddo&lt;BR /&gt;&lt;BR /&gt;deallocate (minimum,maximum)&lt;BR /&gt;c -----------------------------------------------&lt;BR /&gt;&lt;BR /&gt;&lt;EM&gt;In this way the local minimum/maximum of one thread is protected from other threads' read/write operations. &lt;BR /&gt;Any other ideas for this part?&lt;/EM&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;SPAN style="text-decoration: underline;"&gt;2nd part :&lt;/SPAN&gt; &lt;/STRONG&gt;At a specific point I need to re-order the values of several arrays by omitting the values that do not satisfy a specific condition.&lt;BR /&gt;&lt;BR /&gt;The serial code is:&lt;BR /&gt;&lt;BR /&gt;c -----------------------------------------------&lt;BR /&gt;do i=1,ntot ! ntot is the maximum array index &lt;BR /&gt;if( [some condition] ) cycle&lt;BR /&gt;inew=inew+1&lt;BR /&gt;X(inew)=X(i)&lt;BR /&gt;Y(inew)=Y(i)&lt;BR /&gt;... (same for all other arrays) ...&lt;BR /&gt;enddo&lt;BR /&gt;&lt;BR /&gt;ntot=inew&lt;BR /&gt;c -----------------------------------------------&lt;BR /&gt;&lt;BR /&gt;The parallel code that I have thought of for that area:&lt;BR /&gt;&lt;BR /&gt;c -----------------------------------------------&lt;BR /&gt;real*8,allocatable :: X_temp(:,:),Y_temp(:,:)&lt;BR /&gt;integer, allocatable :: inew(:) &lt;BR /&gt;&lt;BR /&gt;call omp_set_num_threads(8)&lt;BR /&gt;allocate (X_temp(1:600,1:8),Y_temp(1:600,1:8))&lt;BR /&gt;allocate (inew(1:8))&lt;BR /&gt;&lt;BR /&gt;do i=1,8&lt;BR /&gt;inew(i)=0&lt;BR /&gt;enddo&lt;BR /&gt;&lt;BR /&gt;!$OMP PARALLEL SHARED (INEW) PRIVATE (I,MY_ID) &lt;BR /&gt;&lt;BR /&gt;MY_ID=OMP_GET_THREAD_NUM()+1&lt;BR /&gt;&lt;BR /&gt;!$OMP DO&lt;BR /&gt;do i=1,n&lt;BR /&gt;if ( [condition] ) cycle&lt;BR /&gt;inew(my_id)=inew(my_id)+1&lt;BR /&gt;X_temp(inew(my_id),my_id)=X(i)&lt;BR /&gt;Y_temp(inew(my_id),my_id)=Y(i)&lt;BR /&gt;enddo&lt;BR /&gt;!$OMP END DO&lt;BR /&gt;!$OMP END PARALLEL&lt;BR /&gt;&lt;BR /&gt;ntot=0&lt;BR /&gt;do i=1,8&lt;BR /&gt;do j=1,inew(i)&lt;BR /&gt;ntot=ntot+1&lt;BR /&gt;X(ntot)=X_temp(j,i)&lt;BR /&gt;Y(ntot)=Y_temp(j,i)&lt;BR /&gt;enddo&lt;BR /&gt;enddo&lt;BR /&gt;c -----------------------------------------------&lt;BR /&gt;&lt;BR /&gt;&lt;EM&gt;The above code works correctly, but it involves the allocation of a temporary (thread-local) array for each array I want to reorder. Is there any other way to do this without using so many arrays?&lt;/EM&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;SPAN style="text-decoration: underline;"&gt;3rd part :&lt;/SPAN&gt;&lt;/STRONG&gt; This part involves the creation of a linked list. In practice I have to find how many times a specific integer appears and which is the index of the array for the i-th time this specific integer appears.&lt;BR /&gt;&lt;BR /&gt;In serial this can be done in this way:&lt;BR /&gt;&lt;BR /&gt;c ----------------------------------------&lt;BR /&gt;do i=1,ntot&lt;BR /&gt;NXgrid=int((X(i)-Xmin)/cdx)&lt;BR /&gt;NYgrid=int((Y(i)-Ymin)/cdy)&lt;BR /&gt;NZgrid=int((Z(i)-Zmin)/cdz)&lt;BR /&gt;NPOS=ncelly*ncellz*NXgrid+ncellz*NYgrid+NZgrid+1&lt;BR /&gt;&lt;BR /&gt;npc(Npos)=npc(Npos)+1&lt;BR /&gt;jjj(Npos,npc(Npos))=i&lt;BR /&gt;enddo&lt;BR /&gt;c ----------------------------------------&lt;BR /&gt;&lt;BR /&gt;In this way, for a specific NPOS (NPOS&amp;gt;1), I can find how many occurrences have that NPOS (it is npc(Npos)); the array index for the first occurrence with that NPOS is jjj(Npos,1), and for the i-th occurrence, jjj(Npos,i).&lt;BR /&gt;&lt;BR /&gt;I wasn't able to fully parallelize that part. The calculation of the npc array can be done with a reduction directive as follows:&lt;BR /&gt;&lt;BR /&gt;c ----------------------------------------
&lt;P&gt;!$OMP PARALLEL SHARED(JJJ) PRIVATE(I,ID_OMP) REDUCTION (+ : NPC)&lt;BR /&gt;&lt;BR /&gt;id_omp=OMP_GET_THREAD_NUM()+1&lt;BR /&gt;&lt;BR /&gt;!$OMP DO &lt;BR /&gt;do i=1,n&lt;BR /&gt;npc(npos)=npc(npos)+1&lt;BR /&gt;c ----- jjj(npos,npc(npos))=i &lt;BR /&gt;enddo&lt;BR /&gt;!$OMP END DO&lt;BR /&gt;!$OMP END PARALLEL&lt;BR /&gt;c ---------------------------------------&lt;BR /&gt;&lt;BR /&gt;&lt;EM&gt;How can I include the jjj array calculation in the above parallel region? Note that if the 'c-----' is removed, the results for jjj are not correct. Any ideas on that?&lt;/EM&gt;&lt;BR /&gt;&lt;BR /&gt;I am looking forward to hearing any ideas. Thanks in advance for your time/effort.&lt;/P&gt;</description>
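The 2nd part above is a stream-compaction pattern: each thread filters its slice into a private buffer, and the buffers are concatenated serially afterwards. The thread's code is Fortran/OpenMP; the same per-thread-buffer idea can be sketched in Python (names here are illustrative, not from the thread):

```python
# Sketch of the question's 2nd part: each worker compacts its own
# contiguous slice into a private list (like X_temp(:,my_id)), then a
# serial pass concatenates the lists (like the final ntot loop).
from concurrent.futures import ThreadPoolExecutor

def compact(data, keep, nthreads=4):
    stride = max(len(data) // nthreads, 1)
    slices = [data[i:i + stride] for i in range(0, len(data), stride)]
    with ThreadPoolExecutor(max_workers=nthreads) as ex:
        # each slice is filtered independently; order within slices kept
        parts = ex.map(lambda s: [x for x in s if keep(x)], slices)
    out = []
    for p in parts:   # serial merge, preserving slice order
        out.extend(p)
    return out
```

As in the Fortran version, the merge is serial; the posts below discuss how to remove the per-array temporaries.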
      <pubDate>Thu, 12 Nov 2009 13:04:05 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Some-help-ideas-with-OpenMP-parallelism/m-p/741285#M850</guid>
      <dc:creator>fivos</dc:creator>
      <dc:date>2009-11-12T13:04:05Z</dc:date>
    </item>
    <item>
      <title>Re: Some help/ideas with OpenMP parallelism</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Some-help-ideas-with-OpenMP-parallelism/m-p/741286#M851</link>
      <description>&lt;DIV style="margin: 0px; height: auto;"&gt;&lt;/DIV&gt;
In your 1st part, you don't explain why you don't consider a min reduction. Beyond that, if your concern is performance, there might be an advantage in two levels of loops: the outer one with an OpenMP reduction and the inner one an SSE-vectorizable reduction, somewhat like what you show. &lt;BR /&gt;If you want your algorithm to also work in portable C, of course you must avoid min reduction. The alternatives seem to be, for a small number of threads, to use a scalar reduction as you have done, or a parallel loop with a critical section.&lt;BR /&gt;</description>
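The per-thread scalar reduction that the reply refers to (each thread keeps a private partial min/max, combined serially at the end) can be sketched in Python; this is a hypothetical illustration, not code from the thread:

```python
# Per-thread partial min/max, then a serial combine, mirroring the
# minimum(id_omp)/maximum(id_omp) scheme in the original question.
from concurrent.futures import ThreadPoolExecutor

def chunk_minmax(chunk):
    # each "thread" reduces only its own slice (private accumulators)
    return min(chunk), max(chunk)

def parallel_minmax(data, nthreads=4):
    # contiguous slices, one per worker
    stride = max(len(data) // nthreads, 1)
    chunks = [data[i:i + stride] for i in range(0, len(data), stride)]
    with ThreadPoolExecutor(max_workers=nthreads) as ex:
        partials = list(ex.map(chunk_minmax, chunks))
    # serial combine of the partial results (the do i=1,8 loop)
    return min(p[0] for p in partials), max(p[1] for p in partials)
```

With an OpenMP `reduction(min: xmin) reduction(max: xmax)` clause the combine step is done by the runtime instead.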
      <pubDate>Thu, 12 Nov 2009 14:04:12 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Some-help-ideas-with-OpenMP-parallelism/m-p/741286#M851</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2009-11-12T14:04:12Z</dc:date>
    </item>
    <item>
      <title>Re: Some help/ideas with OpenMP parallelism</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Some-help-ideas-with-OpenMP-parallelism/m-p/741287#M852</link>
      <description>&lt;DIV style="margin:0px;"&gt;&lt;/DIV&gt;
&lt;BR /&gt;Fivos,&lt;BR /&gt;&lt;BR /&gt;The following UNTESTED code should get you started&lt;BR /&gt;&lt;BR /&gt;Jim Dempsey&lt;BR /&gt;&lt;BR /&gt;
&lt;PRE&gt;[cpp]! 1st part : I have a large set of data and
! I want to find the minimum/maximum of an array.
! In serial code this would be done easily in such a manner : 

! ---------------------------------------------
xmin = 1000000    ! or another large number
xmax = -1000000   ! or another small number
do i=1,ntot
 xmin=min(xmin,X(i))            ! X is an array which holds the value of the desired variable to find  min/max
 xmax=max(xmax,X(i))
enddo
! -----------------------------------------------

! The parallel code for the same job that I have thought is:

! -----------------------------------------------
! *** attribute current subroutine with RECURSIVE
! *** add in module
! integer, parameter ::  YOUR_MAX_THREADS = 64 ! or 256, or whatever you want
! these arrays store the local minimum/maximum of each thread
real*8 :: minimum(YOUR_MAX_THREADS),maximum(YOUR_MAX_THREADS)
! this is the array containing the desired variable to find min/max 
! *** n I assume is your 5000 
dimension data_in (n)
! *** add
integer :: num_threads
integer :: id_omp, istride, i, j

!$OMP PARALLEL PRIVATE(I,J,ID_OMP) 
! all threads overwrite num_threads with same value
num_threads = omp_get_num_threads()

id_omp=OMP_GET_THREAD_NUM() + 1
maximum(id_omp)=-1000
minimum(id_omp)=+1000
! all threads overwrite istride with same value
istride = n / num_threads
! protection for number of threads .gt. n
if(istride .eq. 0) istride = 1
! starting position
i = ((id_omp - 1) * istride) + 1
! ending position
j = i + istride - 1
! last thread gets remainder if any
if(id_omp .eq. num_threads) j = n
! protection for number of threads .gt. n
if(i .le. n) then
  minimum(id_omp)=minval(data_in(i:j))
  maximum(id_omp)=maxval(data_in(i:j))
endif
!$OMP END PARALLEL

aminimum_tot=minval(minimum(1:num_threads))
amaximum_tot=maxval(maximum(1:num_threads))
! -----------------------------------------------


! 2nd part : At a specific area I need to re-order
! the values of several arrays by omitting the values
! of arrays that do not satisfy a specific condition. 

! The serial code is :

! -----------------------------------------------
do i=1,ntot       ! ntot is the maximum array index  
if( [some condition]  ) cycle
inew=inew+1
X(inew)=X(i)
Y(inew)=Y(i)
... (same for all other arrays) ....
enddo

ntot=inew
! -----------------------------------------------

! The parallel code for that area that I have thought: 

! -----------------------------------------------
! -----------------------------------------------
! *** attribute current subroutine with RECURSIVE
! *** add in module
! integer, parameter ::  YOUR_MAX_THREADS = 64 ! or 256, or whatever you want
! this array stores the per-thread count of kept entries
integer :: inew(YOUR_MAX_THREADS) 
! this is the array containing the desired variable to find min/max 
! *** n I assume is your 5000 
dimension data_in (n)
! *** add
integer :: num_threads
integer :: id_omp, istride, i, j

!$OMP PARALLEL SHARED (INEW) PRIVATE (I,J,K,MY_ID,inew_my_id) 
! all threads overwrite num_threads with same value
num_threads = omp_get_num_threads()

MY_ID=OMP_GET_THREAD_NUM()+1
inew(MY_ID)=0
! all threads overwrite istride with same value
istride = n / num_threads
! protection for number of threads .gt. n
if(istride .eq. 0) istride = 1
! starting position
i = ((MY_ID - 1) * istride) + 1
! ending position
j = i + istride - 1
! last thread gets remainder if any
if(MY_ID .eq. num_threads) j = n
! protection for number of threads .gt. n
if(i .le. n) then
  inew_my_id = 0
  do k=i,j
    if ( [condition] ) cycle
    X(i+inew_my_id)=X(k)
    Y(i+inew_my_id)=Y(k)
    inew_my_id=inew_my_id+1
  enddo
  inew(my_id)=inew_my_id
endif
!$OMP END PARALLEL

ntot=inew(1)    ! entries kept by the 1st thread are already in place
k = 0
do i=2,num_threads
  k = k + istride
  inew_my_id=inew(i)
  if(inew_my_id .gt. 0) then
    do j=1,inew_my_id
      ntot=ntot+1
      X(ntot)=X(k+j)
      Y(ntot)=Y(k+j)
    enddo
  endif
enddo
! -----------------------------------------------

! 3rd part : This part involves the creation
! of a linked list. In practice I have to find
! how many times a specific integer appears
! and which is the index of the array for
! the i-th time this specific integer appears.  

! In serial this can be done in this way : 

! ----------------------------------------
do i=1,ntot
  NXgrid=int((X(i)-Xmin)/cdx)
  NYgrid=int((Y(i)-Ymin)/cdy)
  NZgrid=int((Z(i)-Zmin)/cdz)
  NPOS=ncelly*ncellz*NXgrid+ncellz*NYgrid+NZgrid+1

  npc(Npos)=npc(Npos)+1
  jjj(Npos,npc(Npos))=i
enddo
! ----------------------------------------

! In this way for a specific NPOS (NPOS&amp;gt;1),
! I can find how many occurencies have that NPOS
! (it is npc(npos)) and the array index for the
! first occurence that has that NPOS would be
! jjj(Npos,1) and for the i-th occurence, jjj(Npos,i).

! I wasn't able to fully parallelize that part.

!$OMP PARALLEL PRIVATE(...)
!$OMP DO
do i=1,ntot
  NXgrid(i)=int((X(i)-Xmin)/cdx)
  NYgrid(i)=int((Y(i)-Ymin)/cdy)
  NZgrid(i)=int((Z(i)-Zmin)/cdz)
  NPOSprecalc(i)=ncelly*ncellz*NXgrid(i)+ncellz*NYgrid(i)+NZgrid(i)+1
enddo
!$OMP END DO
MY_ID=OMP_GET_THREAD_NUM()+1
num_threads = omp_get_num_threads()
istride = size(npc) / num_threads
! protection for number of threads .gt. n
if(istride .eq. 0) istride = 1
! starting position
ifrom = ((MY_ID - 1) * istride) + 1
! ending position
ito = ifrom + istride - 1
! last thread gets remainder if any
if(MY_ID .eq. num_threads) ito = size(npc)
!$OMP DO
do i=1,ntot
  npos = NPOSprecalc(i)
  if(npos .lt. ifrom) cycle
  if(npos .gt. ito) cycle
  npc(Npos)=npc(Npos)+1
  jjj(Npos,npc(Npos))=i
enddo
!$OMP END DO 
!$OMP END PARALLEL 
[/cpp]&lt;/PRE&gt;</description>
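The 3rd part's npc/jjj structure is a per-cell occurrence count plus an index list per cell. Before worrying about the parallel version, the serial invariant can be stated as a small Python sketch (a dict of lists standing in for the jjj matrix; names are illustrative):

```python
# npc(npos) counts how many particles fall in cell npos, and
# jjj(npos, k) is the original index of the k-th such particle.
def bin_indices(npos_values):
    jjj = {}
    # enumerate from 1 to match the Fortran loop index i
    for i, npos in enumerate(npos_values, start=1):
        jjj.setdefault(npos, []).append(i)
    # npc(npos) is simply the length of each per-cell list
    npc = {k: len(v) for k, v in jjj.items()}
    return npc, jjj
```

Any parallel scheme must reproduce exactly this result, which is why a plain reduction on npc alone is not enough: the jjj entries depend on the order in which each cell's counter is incremented.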
      <pubDate>Thu, 12 Nov 2009 17:20:47 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Some-help-ideas-with-OpenMP-parallelism/m-p/741287#M852</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2009-11-12T17:20:47Z</dc:date>
    </item>
    <item>
      <title>Re: Some help/ideas with OpenMP parallelism</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Some-help-ideas-with-OpenMP-parallelism/m-p/741288#M853</link>
      <description>&lt;DIV style="margin:0px;"&gt;&lt;/DIV&gt;
&lt;BR /&gt;Fivos,&lt;BR /&gt;&lt;BR /&gt;I forgot to add, for your second part, if the number of entries for merge is large then you can add margionaly more difficult code to compute the source and destinations for each thread portion then start a parallel region (or use a barrier within the first parallel region) and perform a concurrent merge to seperate parts of the X and Y arrays. I will leave that for an exercize on your part.&lt;BR /&gt;&lt;BR /&gt;Jim Dempsey&lt;BR /&gt;</description>
      <pubDate>Thu, 12 Nov 2009 17:28:05 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Some-help-ideas-with-OpenMP-parallelism/m-p/741288#M853</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2009-11-12T17:28:05Z</dc:date>
    </item>
    <item>
      <title>Re: Some help/ideas with OpenMP parallelism</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Some-help-ideas-with-OpenMP-parallelism/m-p/741289#M854</link>
      <description>&lt;DIV style="margin:0px;"&gt;First of all I have to thankboth ofyou for your quick response and ideas.&lt;BR /&gt;&lt;/DIV&gt;
@ Tim18 : I wasn't aware that there was a min reduction available in openmp directives. I am going to use it now. Thanks for reminding me that!&lt;BR /&gt;&lt;BR /&gt;@ Jim Dempsey: I willexamine furtherthe codes you' ve sent me. Thank you very much for your detailed answer. &lt;BR /&gt;</description>
      <pubDate>Fri, 13 Nov 2009 06:34:43 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Some-help-ideas-with-OpenMP-parallelism/m-p/741289#M854</guid>
      <dc:creator>fivos</dc:creator>
      <dc:date>2009-11-13T06:34:43Z</dc:date>
    </item>
    <item>
      <title>Re: Some help/ideas with OpenMP parallelism</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Some-help-ideas-with-OpenMP-parallelism/m-p/741290#M855</link>
      <description>&lt;DIV style="margin:0px;"&gt;Dear Jim Dempsey,&lt;/DIV&gt;
I have seen the codes you've provided me. Practically you split the computations from i=1,ntot across all CPU threads by dividing the total ntot by the number of threads and assigning different regions of the arrays to each thread explicitly (instead of letting OpenMP do that through the !$OMP DO directive). It works very nicely, especially for the second part, where there is no more need for thread-local temporary arrays. But for the third part I have two questions :&lt;BR /&gt;&lt;BR /&gt;- First : Why should the additional arrays NXgrid(i), NYgrid(i), NZgrid(i), NPOSprecalc(i) be created? I mean, the values of NXgrid, NYgrid, NZgrid, NPOS could be calculated in the final loop, since they are totally independent from each other, couldn't they? In this way there won't be any need for extra arrays.&lt;BR /&gt;&lt;BR /&gt;(it would look something like this: &lt;BR /&gt;from line 185 : &lt;BR /&gt;&lt;BR /&gt;do i=1,ntot &lt;BR /&gt;
&lt;P&gt;NXgrid=int((X(i)-Xmin)/cdx)&lt;BR /&gt;NYgrid=int((Y(i)-Ymin)/cdy)&lt;BR /&gt;NZgrid=int((Z(i)-Zmin)/cdz)&lt;BR /&gt;NPOS=ncelly*ncellz*NXgrid+ncellz*NYgrid+NZgrid+1&lt;BR /&gt;if(npos .lt. ifrom) cycle ...etc... (rest remains as it was) ...&lt;BR /&gt;&lt;BR /&gt;)&lt;BR /&gt;&lt;BR /&gt;- Secondly, is it safe to perform these operations (lines 189, 190) in the parallel region? &lt;BR /&gt;(npc(Npos)=npc(Npos)+1 &lt;BR /&gt;jjj(Npos,npc(Npos))=i )&lt;BR /&gt;I am afraid that this might be a possible data race, since threads might access/read/write the same npc(Npos) element at the same time, affecting the calculations of both npc(...) and jjj(...,...). What do you think about that?&lt;BR /&gt;&lt;BR /&gt;I am going to try it and test it further. And again, thank you for your help. &lt;BR /&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 13 Nov 2009 10:48:24 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Some-help-ideas-with-OpenMP-parallelism/m-p/741290#M855</guid>
      <dc:creator>fivos</dc:creator>
      <dc:date>2009-11-13T10:48:24Z</dc:date>
    </item>
    <item>
      <title>Re: Some help/ideas with OpenMP parallelism</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Some-help-ideas-with-OpenMP-parallelism/m-p/741291#M856</link>
      <description>&lt;DIV style="margin:0px;"&gt;&lt;/DIV&gt;
&lt;BR /&gt;The computation time of NPOS:&lt;BR /&gt;&lt;BR /&gt; NXgrid=int((X(i)-Xmin)/cdx)&lt;BR /&gt; NYgrid=int((Y(i)-Ymin)/cdy)&lt;BR /&gt; NZgrid=int((Z(i)-Zmin)/cdz)&lt;BR /&gt; NPOS=ncelly*ncellz*NXgrid+ncellz*NYgrid+NZgrid+1&lt;BR /&gt;&lt;BR /&gt;is significant.&lt;BR /&gt;&lt;BR /&gt;As you suggest:&lt;BR /&gt;&lt;BR /&gt; all threads calculate NPOS for the complete range of i&lt;BR /&gt; (*** see note below)&lt;BR /&gt; then accumulate and populate portions of the link table&lt;BR /&gt;&lt;BR /&gt;With 8 threads your method has:&lt;BR /&gt;&lt;BR /&gt; 8 x (time to calculate NPOS)&lt;BR /&gt; 8 x (~1/8 of the accumulate-and-populate link table work)&lt;BR /&gt;&lt;BR /&gt;As I suggest:&lt;BR /&gt;&lt;BR /&gt; 8 x (1/8 time to calculate NPOS)&lt;BR /&gt; 8 x (1/8 time to write NPOSprecalc(i))&lt;BR /&gt; second loop setup overhead&lt;BR /&gt; 8 x (~1/8 of the accumulate-and-populate link table work)&lt;BR /&gt;&lt;BR /&gt;Essentially my suggestion reduces the work by&lt;BR /&gt;&lt;BR /&gt; 8 x (7/8 time to calculate NPOS)&lt;BR /&gt;&lt;BR /&gt;at the expense of&lt;BR /&gt;&lt;BR /&gt; 8 x (one memory write)&lt;BR /&gt; plus the time to set up the loop&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;(*** note from above)&lt;BR /&gt;&lt;BR /&gt;Should the loop on i (your method) be parallel, then each thread&lt;BR /&gt;calculates 1/8 of NPOS but stores only ~1/8 of the NPOS entries in&lt;BR /&gt;your accumulators and link table. IOW a loss of ~7/8 of the data.&lt;BR /&gt;&lt;BR /&gt;I will answer the line 189 question in the next post.&lt;BR /&gt;&lt;BR /&gt;Jim Dempsey&lt;BR /&gt;</description>
      <pubDate>Fri, 13 Nov 2009 15:28:13 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Some-help-ideas-with-OpenMP-parallelism/m-p/741291#M856</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2009-11-13T15:28:13Z</dc:date>
    </item>
    <item>
      <title>Re: Some help/ideas with OpenMP parallelism</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Some-help-ideas-with-OpenMP-parallelism/m-p/741292#M857</link>
      <description>&lt;DIV style="margin:0px;"&gt;&lt;/DIV&gt;
&lt;BR /&gt;RE: 189&lt;BR /&gt;&lt;BR /&gt;You are correct.&lt;BR /&gt;&lt;BR /&gt;Remove the !$OMP DO surrounding the loop at 184-192.&lt;BR /&gt;IOW make all threads run the complete iteration space, not a slice.&lt;BR /&gt;The condition inside the loop restricts each thread to a separate zone.&lt;BR /&gt;&lt;BR /&gt;Note, the time to read the stored NPOSprecalc(i) and test for range&lt;BR /&gt;will be relatively fast since NPOSprecalc(i) will likely reside in cache.&lt;BR /&gt;&lt;BR /&gt;Jim Dempsey&lt;BR /&gt;</description>
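The fix described here (every thread scans the whole index range but commits only the cells that fall in its own zone, so no two threads ever touch the same npc/jjj slot) can be sketched in Python; the names and the thread-pool framing are illustrative, not from the thread:

```python
# Every worker scans all indices but only commits entries whose cell
# lies in its own [ifrom, ito] zone. Zones are disjoint, so the shared
# npc/jjj structures need no locks.
from concurrent.futures import ThreadPoolExecutor

def zoned_bin(npos_values, ncells, nthreads=2):
    npc = [0] * (ncells + 1)           # 1-based cell counts
    jjj = [[] for _ in range(ncells + 1)]
    stride = max(ncells // nthreads, 1)
    def worker(tid):
        ifrom = tid * stride + 1
        # last worker takes the remainder of the cell range
        ito = ncells if tid == nthreads - 1 else ifrom + stride - 1
        for i, npos in enumerate(npos_values, start=1):
            if npos >= ifrom and ito >= npos:   # this worker's zone only
                npc[npos] += 1
                jjj[npos].append(i)
    with ThreadPoolExecutor(max_workers=nthreads) as ex:
        list(ex.map(worker, range(nthreads)))
    return npc, jjj
```

The trade-off matches the cost analysis in the previous post: each worker re-reads every precomputed cell index, but the range test is cheap compared with recomputing NPOS.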
      <pubDate>Fri, 13 Nov 2009 15:34:33 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Some-help-ideas-with-OpenMP-parallelism/m-p/741292#M857</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2009-11-13T15:34:33Z</dc:date>
    </item>
    <item>
      <title>Re: Some help/ideas with OpenMP parallelism</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Some-help-ideas-with-OpenMP-parallelism/m-p/741293#M858</link>
      <description>&lt;DIV style="margin:0px;"&gt;&lt;/DIV&gt;
&lt;BR /&gt;Comments in general:&lt;BR /&gt;&lt;BR /&gt;Although !$OMP DO is a convenient way to divide up the iteration space of a loop, it comes with the consequence that each thread does not know the extents of its slice. When this information cannot be used to your advantage, then by all means use !$OMP DO. However, when knowledge of the extents of the slice is important, such as for eliminating critical sections, atomics, or temporary arrays, then the added effort of explicit slicing can be rewarded with substantial performance gains.&lt;BR /&gt;&lt;BR /&gt;The manual slicing technique above assumes all threads assigned to the parallel region have equal availability of CPU resources. If this is not the case (e.g. other apps running on the system), then a little more hand work will be necessary to divide up the slice space (each former slice has sub-slices that can be run by other threads).&lt;BR /&gt;&lt;BR /&gt;To see the issue, consider&lt;BR /&gt;&lt;BR /&gt; !$OMP PARALLEL&lt;BR /&gt;&lt;BR /&gt;scheduling 8 threads while other apps on the system consume 2 of them: those two threads may not begin the parallel region until after the other 6 threads have completed it.&lt;BR /&gt;&lt;BR /&gt;The fix might be to create an !$OMP SECTIONS or !$OMP WORKSHARE, but then this would hardwire a specific number of threads (check to see if the OpenMP task capability is now supported in IVF).&lt;BR /&gt;&lt;BR /&gt;Jim Dempsey&lt;BR /&gt;</description>
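The explicit-slicing arithmetic used throughout Jim's listings (stride = n / num_threads, last thread takes the remainder, empty slices when there are more threads than items) can be written out as a small Python sketch with hypothetical names:

```python
# Compute each worker's 1-based [i, j] index range the way the posted
# Fortran does: equal strides, last worker absorbs the remainder, and
# workers beyond the data get an empty (None) slice.
def slice_bounds(n, nthreads):
    stride = max(n // nthreads, 1)
    bounds = []
    for tid in range(1, nthreads + 1):
        i = (tid - 1) * stride + 1
        j = n if tid == nthreads else i + stride - 1
        if i > n:                 # more workers than items
            bounds.append(None)
        else:
            bounds.append((i, j))
    return bounds
```

This is the information an !$OMP DO hides from the thread, and which the posts above use to avoid critical sections and temporary arrays.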
      <pubDate>Fri, 13 Nov 2009 15:57:35 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Some-help-ideas-with-OpenMP-parallelism/m-p/741293#M858</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2009-11-13T15:57:35Z</dc:date>
    </item>
  </channel>
</rss>

