1. I am solving 3D finite volume temperature fields with Intel Fortran (ifort) on Windows 7, using my own iterative methods. I have tried many variations of OpenMP directives but never got a speedup of more than a factor of 2, even with 16 cores. What bothers me is that in the attached test program the CPU time recordings are odd: the CPU time is more or less independent of NTHREADS, and NTHREADS=1 is faster than ordinary serial loops.
I compile with ifort /c /Qopenmp test3.f90, link with link test3.obj, and run with test3. Test3.f90 is attached together with my test3.exe.
Test3.f90 is written the way it is because the typical loop in my 'big' program looks like:
do k=1,Nz
  do j=1,Ny
    do i=1,Nx
      c(i,j,k) = a(i,j,k)*b(i,j,k) + other matrix elements - other matrix elements
    enddo
  enddo
enddo
Q: Is this type of loop structure impeding the use of OpenMP, and how can I make it better? I also have loops like
do i=1,N
  ! ... some arrays a(3,i) ...
enddo
which also do not run better with OpenMP.
Q: Are there special compiler directives to make this better?
Q: What should I use for diagnostics? (I have to admit that I still compile and link the old-fashioned way with .bat files; I do not use the Visual Studio IDE.)
Q: Can anyone tell me the main tripping hazards for a newbie in OpenMP?
2. On the bright side, other loops like this one scale with the number of threads as expected:
!$OMP PARALLEL PRIVATE(i,j,k) reduction(+:prod)
!$omp do
do k=1,Nz
  do j=1,Ny
    do i=1,Nx
      prod = prod + a(i,j,k)*b(i,j,k)
    enddo
  enddo
enddo
!$omp end do
!$OMP END PARALLEL
Now I am confused and wonder what goes wrong with my matrix element multiplication.
Hoping someone can provide me with a key idea.
Best regards, Johannes
(Quoting the scaling reduction example from the original post.) I don't know why you'd compute this particular reduction, but there are more important situations where it's useful to write an explicitly vectorizable, non-threaded replacement for the inner loop, e.g. using dot_product, and to reserve the threaded reduction for the outer loops. If your loop counts don't fit well with the number of threads, additional frills such as collapse may help; see the sketch below.
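A minimal sketch of that suggestion (my illustration, not the poster's exact code): the inner i loop becomes a dot_product the compiler can vectorize, while the threaded reduction stays on the outer j/k loops, with collapse(2) so the iteration count need not divide evenly by the thread count:

function dotprod3d(a, b, Nx, Ny, Nz) result(prod)
    integer, intent(in) :: Nx, Ny, Nz
    real, intent(in) :: a(Nx,Ny,Nz), b(Nx,Ny,Nz)
    real :: prod
    integer :: j, k
    prod = 0.0
    ! threaded reduction over the outer loops; collapse(2) merges k and j
    ! into Ny*Nz schedulable chunks
    !$OMP PARALLEL DO COLLAPSE(2) REDUCTION(+:prod)
    do k=1,Nz
        do j=1,Ny
            prod = prod + dot_product(a(:,j,k), b(:,j,k))  ! vector inner loop
        enddo
    enddo
    !$OMP END PARALLEL DO
end function dotprod3d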
The reason you saw no additional speedup beyond 2x is that you have reached the memory bandwidth limit of your system.
>> I am solving 3D finite volume temperature fields
You may find improvement by relocating your parallelization to outer levels of your program.
Follow the dictum: Parallel outer / Vector inner
Serial:
  do step A
  do step B
  do step C

Your approach appears to be:
  do parallel step A
  do parallel step B
  do parallel step C

While it may be better to perform:
  do parallel (partition domain space)
    do my thread's part of A
    do my thread's part of B
    do my thread's part of C
  end do parallel
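A minimal Fortran sketch of this pattern (stepA/stepB/stepC and the k-based partition are hypothetical stand-ins for the phases of your solver): the parallel region is entered once, each thread computes its slice bounds, and all steps run inside it. Barriers are only needed where a step reads another thread's results:

subroutine solve_steps(Nz)
    use omp_lib
    implicit none
    integer, intent(in) :: Nz
    integer :: iThread, nThreads, kBegin, kEnd
    !$OMP PARALLEL PRIVATE(iThread, nThreads, kBegin, kEnd)
    iThread  = omp_get_thread_num()
    nThreads = omp_get_num_threads()
    kBegin = 1 + (Nz * iThread) / nThreads     ! partition domain space once
    kEnd   = (Nz * (iThread+1)) / nThreads
    call stepA(kBegin, kEnd)                   ! my thread's part of A
    !$OMP BARRIER                              ! only if B reads A's neighbor slices
    call stepB(kBegin, kEnd)                   ! my thread's part of B
    !$OMP BARRIER
    call stepC(kBegin, kEnd)                   ! my thread's part of C
    !$OMP END PARALLEL
end subroutine solve_steps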
Also, for large collections of particles, consider using a structure-of-arrays (SoA) format. That is to say, use arrays of like attributes, not an array of particles.
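For concreteness, a small illustration of the two layouts (generic names, not from the thread):

module particle_layouts
    implicit none
    ! AoS: one record per particle; loading p(1:N)%x is a strided access
    type :: particle_aos
        real :: x, y, z, charge
    end type
    type(particle_aos), allocatable :: p(:)
    ! SoA: one contiguous array per attribute; loops over i get
    ! unit-stride, easily vectorized accesses
    real, allocatable :: x(:), y(:), z(:), charge(:)
end module particle_layouts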
Jim Dempsey
Here is a sample rework of your program example. Its purpose is to illustrate the SoA format and the use of a higher computation-to-memory-fetch ratio (in other words, an algorithm that is not memory-bandwidth limited). It is not intended to illustrate the best way to work this problem.
! test3.f90
!
PROGRAM test3
    USE omp_lib
    implicit none
    real*4, parameter :: ke = 8.9875518e9   ! Coulomb's constant (to real*4 precision; note e9, not e-9)
    REAL*4, ALLOCATABLE :: posX(:), posY(:), posZ(:), charge(:)
    REAL*4, ALLOCATABLE :: velX(:), velY(:), velZ(:)
    REAL*4, ALLOCATABLE :: forceX(:), forceY(:), forceZ(:), mass(:)
    real*4 :: dX, dY, dZ, fX, fY, fZ, r2, r, forceProduct
    integer :: Nparticles
    integer :: i, j, k, Nthreads
    real*8 :: time0, time1

    print *,'Test3'
    ! - CPU_Time of OMP with 1 Thread is shorter than serial CPU time. Why?
    ! - 4 threads give almost same CPU time as 2 threads
    Nparticles = 50*50*50
    Allocate (posX(Nparticles), posY(Nparticles), posZ(Nparticles), charge(Nparticles))
    Allocate (forceX(Nparticles), forceY(Nparticles), forceZ(Nparticles))
    call RANDOM_NUMBER(posX)
    call RANDOM_NUMBER(posY)
    call RANDOM_NUMBER(posZ)
    call RANDOM_NUMBER(charge)   ! initially 0.0:1.0
    ! make 60:40 mix on charge of 1 and 2
    do i=1, Nparticles
        if(charge(i) > 0.4) then
            charge(i) = 1.0
        else
            charge(i) = 2.0
        endif
    end do

    ! serial calculation
    time0 = omp_get_wtime()
    do i=1,Nparticles
        forceX(i) = 0.0
        forceY(i) = 0.0
        forceZ(i) = 0.0
    end do
    do i=1,Nparticles-1
        do j=i+1, Nparticles
            dX = posX(j) - posX(i)
            dY = posY(j) - posY(i)
            dZ = posZ(j) - posZ(i)
            r2 = dX**2 + dY**2 + dZ**2
            forceProduct = ke * charge(i) * charge(j) / r2
            r = sqrt(r2)
            fX = forceProduct * dX / r
            fY = forceProduct * dY / r
            fZ = forceProduct * dZ / r
            forceX(i) = forceX(i) + fX
            forceY(i) = forceY(i) + fY
            forceZ(i) = forceZ(i) + fZ
            forceX(j) = forceX(j) - fX
            forceY(j) = forceY(j) - fY
            forceZ(j) = forceZ(j) - fZ
        end do
    end do
    time1 = omp_get_wtime()
    print '(A,f6.2/)','serial time calculate forces',(time1-time0)

    ! parallel OpenMP calculation
    !$OMP parallel
    ! dummy parallel region (outside of timed section)
    ! used to establish OpenMP thread pool
    if(omp_get_max_threads() .lt. 0) print *,"This statement won't execute"
    !$OMP end parallel
    do Nthreads = 1, omp_get_max_threads()
        time0 = omp_get_wtime()
        CALL OMP_SET_NUM_THREADS(Nthreads)
        ! note: j and forceProduct added to the private list; sharing them
        ! across threads would be an additional race
        !$OMP parallel private(j, dX, dY, dZ, r2, r, fX, fY, fZ, forceProduct)
        !$OMP DO
        do i=1,Nparticles
            forceX(i) = 0.0
            forceY(i) = 0.0
            forceZ(i) = 0.0
        end do
        !$OMP DO
        do i=1,Nparticles-1
            do j=i+1, Nparticles
                dX = posX(j) - posX(i)
                dY = posY(j) - posY(i)
                dZ = posZ(j) - posZ(i)
                r2 = dX**2 + dY**2 + dZ**2
                forceProduct = ke * charge(i) * charge(j) / r2
                r = sqrt(r2)
                fX = forceProduct * dX / r
                fY = forceProduct * dY / r
                fZ = forceProduct * dZ / r
                forceX(i) = forceX(i) + fX
                forceY(i) = forceY(i) + fY
                forceZ(i) = forceZ(i) + fZ
                forceX(j) = forceX(j) - fX
                forceY(j) = forceY(j) - fY
                forceZ(j) = forceZ(j) - fZ
            end do
        end do
        !$OMP end parallel
        time1 = omp_get_wtime()
        print '(A,i2,A,f6.2/)','Nthreads ',Nthreads,' time ',(time1-time0)
    end do
end PROGRAM test3
Test3
serial time calculate forces 21.08
Nthreads  1 time  21.09
Nthreads  2 time  15.86
Nthreads  3 time  12.22
Nthreads  4 time  10.52
Nthreads  5 time   8.98
Nthreads  6 time   7.91
Nthreads  7 time   7.17
Nthreads  8 time   6.68
Press any key to continue . . .
The above was run on a 4-core/8-thread system.
Jim Dempsey
*** CAUTION ***
The above code is in error: the updates of the (j) forces produce a race condition.
Will revise...
!$OMP DO
do i=1,Nparticles      ! full range: every particle now accumulates only its own force
    do j=1, i-1
        dX = posX(j) - posX(i)
        dY = posY(j) - posY(i)
        dZ = posZ(j) - posZ(i)
        r2 = dX**2 + dY**2 + dZ**2
        forceProduct = ke * charge(i) * charge(j) / r2
        r = sqrt(r2)
        fX = forceProduct * dX / r
        fY = forceProduct * dY / r
        fZ = forceProduct * dZ / r
        forceX(i) = forceX(i) + fX
        forceY(i) = forceY(i) + fY
        forceZ(i) = forceZ(i) + fZ
    end do
    do j=i+1, Nparticles
        dX = posX(j) - posX(i)
        dY = posY(j) - posY(i)
        dZ = posZ(j) - posZ(i)
        r2 = dX**2 + dY**2 + dZ**2
        forceProduct = ke * charge(i) * charge(j) / r2
        r = sqrt(r2)
        fX = forceProduct * dX / r
        fY = forceProduct * dY / r
        fZ = forceProduct * dZ / r
        forceX(i) = forceX(i) + fX
        forceY(i) = forceY(i) + fY
        forceZ(i) = forceZ(i) + fZ
    end do
end do
!$OMP end parallel

Test3
serial time calculate forces 21.17
Nthreads  1 time  33.47
Nthreads  2 time  17.21
Nthreads  3 time  12.95
Nthreads  4 time  11.02
Nthreads  5 time   9.85
Nthreads  6 time   9.12
Nthreads  7 time   8.63
Nthreads  8 time   8.27
Press any key to continue . . .
Note that the single interior loop of the prior post is now split into two loops. The revised algorithm computes each inter-particle force twice, which may seem redundant, but doing so eliminates the race condition of updating forceX(j), forceY(j), and forceZ(j) in the prior loop.
The above has no race condition. As stated earlier, it is not necessarily optimal; rather, it is used for illustrative purposes.
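A third option (my sketch, not from the posts above) keeps the half-pairwise loop and removes the race with an OpenMP array reduction: each thread accumulates the (j) updates into a private copy of the force arrays, which OpenMP sums when the construct ends. This assumes compiler support for reductions on allocatable Fortran arrays and costs three private arrays per thread; schedule(dynamic) compensates for the triangular loop's uneven work per i:

! each pair is computed once; the (j) updates go into per-thread
! private copies of forceX/Y/Z, summed at the end of the construct
!$OMP PARALLEL DO SCHEDULE(DYNAMIC,64) &
!$OMP&  PRIVATE(j, dX, dY, dZ, r2, r, fX, fY, fZ, forceProduct) &
!$OMP&  REDUCTION(+: forceX, forceY, forceZ)
do i=1,Nparticles-1
    do j=i+1, Nparticles
        dX = posX(j) - posX(i)
        dY = posY(j) - posY(i)
        dZ = posZ(j) - posZ(i)
        r2 = dX**2 + dY**2 + dZ**2
        forceProduct = ke * charge(i) * charge(j) / r2
        r = sqrt(r2)
        fX = forceProduct * dX / r
        fY = forceProduct * dY / r
        fZ = forceProduct * dZ / r
        forceX(i) = forceX(i) + fX
        forceY(i) = forceY(i) + fY
        forceZ(i) = forceZ(i) + fZ
        forceX(j) = forceX(j) - fX   ! now safe: private per-thread copy
        forceY(j) = forceY(j) - fY
        forceZ(j) = forceZ(j) - fZ
    end do
end do
!$OMP END PARALLEL DO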
Jim Dempsey
Hi Jim, thanks for the ideas.
My fastest loop for multiplying 800x800x800 array elements is this one:
!$OMP Parallel private(k)
!$OMP do
do k=1,Nz
    c(1:Nx,1:Ny,k) = a(1:Nx,1:Ny,k)*b(1:Nx,1:Ny,k)
enddo
!$OMP end do
!$OMP end Parallel
However, this is only a factor of 1.5 faster on four cores than a simple c(:,:,:) = a(:,:,:)*b(:,:,:) on one core. Could it be that vectorization is already so effective that parallelization won't improve it?
BR, Johannes
/Qopt-streaming-stores:always reduces the execution time by about a factor of 1.5, regardless of the number of threads.
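For reference, the option slots into the compile line from the top of the thread like this:

ifort /c /Qopenmp /Qopt-streaming-stores:always test3.f90

A plausible reading of the gain (my explanation, not stated in the thread): non-temporal streaming stores write c past the cache and avoid the read-for-ownership of the destination lines, cutting the traffic of c = a*b from roughly 16 bytes per real*4 element (read a, read b, read c's line, write c) to 12; together with the reduced cache pollution, that is consistent with the observed factor of about 1.5 in a bandwidth-bound loop.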