Intel® Fortran Compiler

OpenMP and 3D arrays. How to make it fast?

Johannes_A_
Beginner

1. I am solving 3D finite-volume temperature fields with Intel Fortran on Windows 7, using my own iterative methods. I have tried many variations of OpenMP directives but never got a speedup of more than a factor of 2, even with 16 cores. What bothers me is that in the attached test program the CPU time recordings are so odd: the CPU time is more or less independent of NTHREADS, and Nthreads=1 is faster than ordinary serial loops.

I compile with ifort /c /Qopenmp test3.f90, link with link test3.obj, and run with test3. Test3.f90 is attached together with my test3.exe.

Test3.f90 is structured the way it is because the typical loop in my 'big' program looks like:

do k=1,Nz
  do j=1,Ny
    do i=1,Nx
      c(i,j,k) = a(i,j,k)*b(i,j,k) + other matrix elements - other matrix elements
    enddo
  enddo
enddo

Q: Is this type of loop structure impeding the use of OpenMP, and how can I make it better? I also have loops like

  do i=1,N ; some arrays a(3,i) ; enddo 

which also do not run better with OpenMP.

Q: Are there special compiler directives to make it better?

Q: What should I use as diagnostics? (I have to admit that I compile and link the old-fashioned way with .bat files; I do not use the visual mode.)

Q: Can anyone tell me the main tripping hazards for a newbie in OpenMP?

2. The good news is that other loops, like this one, scale with the number of threads as expected:

!$OMP PARALLEL PRIVATE(i,j,k) REDUCTION(+:prod)
!$OMP DO
do k=1,Nz
  do j=1,Ny
    do i=1,Nx
      prod = prod + a(i,j,k)*b(i,j,k)
    enddo
  enddo
enddo
!$OMP END DO
!$OMP END PARALLEL

Now I am confused and wonder what goes wrong with my matrix-element multiplication.

Hoping someone can provide me with a key idea.

Best regards, Johannes

TimP
Honored Contributor III
Q: Are there special compiler directives to make it better? The run-time affinity settings are crucial, particularly if you are trying to avoid degradation due to HyperThreading or on multiple CPUs. Check out OMP_PROC_BIND and OMP_PLACES, or the Intel-specific alternatives to the latter. If you are trying to minimize the time you spend evaluating alternative methods, you should put some priority on learning about MATMUL and the opt-matmul option.
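
For example, since you already drive everything from .bat files, the settings can go there (a minimal sketch; the thread count and binding values are illustrative assumptions, not recommendations for your machine):

set OMP_NUM_THREADS=8
set OMP_PROC_BIND=close
set OMP_PLACES=cores
rem Intel-specific alternative to the two settings above:
rem set KMP_AFFINITY=compact,granularity=fine
test3.exe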

>> 2. The good news is that other loops, like this one, scale with the number of threads as expected: [the reduction(+:prod) loop quoted above]
I don't know why you'd do this particular computation, but there are more important situations where it's useful to write an explicitly vectorizable, non-threaded replacement for the inner loop, e.g. using dot_product, and to reserve the threaded reduction for the outer loops. If your loop counts don't fit well with the number of threads, additional frills such as collapse may help.
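
A minimal sketch of that combination, assuming the a, b, prod and Nx/Ny/Nz of the loop quoted above: dot_product gives the compiler a cleanly vectorizable inner loop, the threaded reduction stays on the outer loops, and collapse(2) merges them so the iteration count divides better among the threads.

prod = 0.0
!$OMP PARALLEL DO COLLAPSE(2) REDUCTION(+:prod)
do k=1,Nz
  do j=1,Ny
    prod = prod + dot_product(a(1:Nx,j,k), b(1:Nx,j,k))
  enddo
enddo
!$OMP END PARALLEL DO
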
jimdempseyatthecove
Honored Contributor III

The reason you saw no additional speedup beyond 2x is that you have reached the memory bandwidth limit of your system.
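
(Back-of-envelope, using the 800x800x800 real*4 arrays that appear later in this thread and an assumed sustained bandwidth of ~20 GB/s: c = a*b reads two arrays and writes one, i.e. at least 3 x 800^3 x 4 bytes ≈ 6.1 GB of traffic per sweep, with only one multiply per 12 bytes moved. At 20 GB/s that sweep takes ≈ 0.3 s no matter how many cores participate.)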

>> I am solving 3D finite volume temperature fields

You may find improvement by relocating your parallelization to outer levels of your program.

Follow the dictum: Parallel outer / Vector inner

Serial:

do step A
do step B
do step C

Your approach appears to be

do parallel step A
do parallel step B
do parallel step C

While it may be better to perform

do parallel (partition domain space)
  do my thread's part of A
  do my thread's part of B
  do my thread's part of C
end do parallel
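
A minimal Fortran sketch of that structure (my illustration; the step bodies and the a, b, c, Nx/Ny/Nz names are placeholders borrowed from the loops in this thread): one parallel region spans all the steps, each !$OMP DO hands a thread its share of a step, and the implied barrier at the end of each DO keeps the steps in order while the thread pool is entered only once.

!$OMP PARALLEL PRIVATE(k)
!$OMP DO              ! my thread's part of step A
do k=1,Nz
  c(1:Nx,1:Ny,k) = 0.0
enddo
!$OMP DO              ! my thread's part of step B (the barrier above keeps A before B)
do k=1,Nz
  c(1:Nx,1:Ny,k) = c(1:Nx,1:Ny,k) + a(1:Nx,1:Ny,k)*b(1:Nx,1:Ny,k)
enddo
!$OMP DO              ! my thread's part of step C
do k=1,Nz
  c(1:Nx,1:Ny,k) = c(1:Nx,1:Ny,k)*0.5
enddo
!$OMP END PARALLEL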
 

Also, for large collections of particles, consider using structure-of-arrays (SoA) format. That is to say, use arrays of like attributes, not an array of particle structures.

Jim Dempsey

0 Kudos
jimdempseyatthecove
Honored Contributor III

Here is a sample rework of your test program. Its purpose is to illustrate the SoA format and the use of a higher computation-per-memory-fetch ratio (in other words, an algorithm that is not memory-bandwidth limited). It is not intended to illustrate the best way to work this problem.

!  test3.f90 
!
PROGRAM test3
    USE omp_lib
    implicit none
    real*4, parameter :: ke = 8.9875518e9 ! Coulomb's constant (to real*4 precision)
    REAL*4, ALLOCATABLE :: posX(:),posY(:),posZ(:), charge(:)
    REAL*4, ALLOCATABLE :: forceX(:),forceY(:),forceZ(:)
    real*4 dX, dY, dZ, fX, fY, fZ, r2, r, forceProduct
    integer :: Nparticles
    integer :: i,j,k, Nthreads

    real*8 :: time0, time1

    print *,'Test3'
    ! - CPU_Time of OMP with 1 Thread is shorter than serial CPU time. Why?
    ! - 4 threads give almost same CPU time as 2 threads

    Nparticles = 50*50*50
    
    Allocate (posX(Nparticles), posY(Nparticles), posZ(Nparticles), charge(Nparticles))
    Allocate (forceX(Nparticles),forceY(Nparticles),forceZ(Nparticles))
    
    call RANDOM_NUMBER(posX)
    call RANDOM_NUMBER(posY)
    call RANDOM_NUMBER(posZ)
    call RANDOM_NUMBER(charge)  ! initially 0.0:1.0
    ! make 60:40 mix on charge of 1 and 2
    do i=1, Nparticles
        if(charge(i) > 0.4) then
            charge(i) = 1.0
        else
            charge(i) = 2.0
        endif
    end do
    
    ! serial calculation
    time0 = omp_get_wtime()
    do i=1,Nparticles
        forceX(i) = 0.0
        forceY(i) = 0.0
        forceZ(i) = 0.0
    end do
    do i=1,Nparticles-1
        do j=i+1, Nparticles
            dX = posX(j) - posX(i)
            dY = posY(j) - posY(i)
            dZ = posZ(j) - posZ(i)
            r2 = dX**2 + dY**2 + dZ**2
            forceProduct = ke * charge(i) * charge(j) / r2
            r = sqrt(r2)
            fX = forceProduct * dX / r
            fY = forceProduct * dY / r
            fZ = forceProduct * dZ / r
            forceX(i) = forceX(i) + fX
            forceY(i) = forceY(i) + fY
            forceZ(i) = forceZ(i) + fZ
            forceX(j) = forceX(j) - fX
            forceY(j) = forceY(j) - fY
            forceZ(j) = forceZ(j) - fZ
        end do
    end do
    time1 = omp_get_wtime()
    print '(A,f6.2/)','serial time calculate forces',(time1-time0)
 
    !parallel openMP calculation
    !$OMP parallel
    ! dummy parallel region (outside of the timed section)
    ! used to establish the OpenMP thread pool
    if(omp_get_max_threads() .lt. 0) print *,"This statement won't execute"
    !$OMP end parallel

    do Nthreads = 1, omp_get_max_threads()
    time0 = omp_get_wtime() 
    CALL OMP_SET_NUM_THREADS(NTHREADS)
    !$OMP parallel private(dX, dY, dZ, r2, r, fX, fY, fZ, forceProduct) ! forceProduct is written by every thread, so it must be private
    !$OMP DO
    do i=1,Nparticles
        forceX(i) = 0.0
        forceY(i) = 0.0
        forceZ(i) = 0.0
    end do
    !$OMP DO
    do i=1,Nparticles-1
        do j=i+1, Nparticles
            dX = posX(j) - posX(i)
            dY = posY(j) - posY(i)
            dZ = posZ(j) - posZ(i)
            r2 = dX**2 + dY**2 + dZ**2
            forceProduct = ke * charge(i) * charge(j) / r2
            r = sqrt(r2)
            fX = forceProduct * dX / r
            fY = forceProduct * dY / r
            fZ = forceProduct * dZ / r
            forceX(i) = forceX(i) + fX
            forceY(i) = forceY(i) + fY
            forceZ(i) = forceZ(i) + fZ
            forceX(j) = forceX(j) - fX
            forceY(j) = forceY(j) - fY
            forceZ(j) = forceZ(j) - fZ
        end do
    end do
    !$OMP end parallel
    time1 = omp_get_wtime()
    print '(A,i2,A,f6.2/)','Nthreads ',Nthreads,' time ',(time1-time0)
    end do
end PROGRAM test3

 

 Test3
serial time calculate forces 21.08
Nthreads  1 time  21.09
Nthreads  2 time  15.86
Nthreads  3 time  12.22
Nthreads  4 time  10.52
Nthreads  5 time   8.98
Nthreads  6 time   7.91
Nthreads  7 time   7.17
Nthreads  8 time   6.68

Press any key to continue . . .

The above was run on a 4-core/8-thread system.

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

*** CAUTION ***

The above code is in error: the updates of the (j) forces produce a race condition between threads.

Will revise...

    !$OMP DO
    do i=1,Nparticles  ! must run to Nparticles here: the (j) particles no longer accumulate force
        do j=1, i-1
            dX = posX(j) - posX(i)
            dY = posY(j) - posY(i)
            dZ = posZ(j) - posZ(i)
            r2 = dX**2 + dY**2 + dZ**2
            forceProduct = ke * charge(i) * charge(j) / r2
            r = sqrt(r2)
            fX = forceProduct * dX / r
            fY = forceProduct * dY / r
            fZ = forceProduct * dZ / r
            forceX(i) = forceX(i) + fX
            forceY(i) = forceY(i) + fY
            forceZ(i) = forceZ(i) + fZ
        end do
        do j=i+1, Nparticles
            dX = posX(j) - posX(i)
            dY = posY(j) - posY(i)
            dZ = posZ(j) - posZ(i)
            r2 = dX**2 + dY**2 + dZ**2
            forceProduct = ke * charge(i) * charge(j) / r2
            r = sqrt(r2)
            fX = forceProduct * dX / r
            fY = forceProduct * dY / r
            fZ = forceProduct * dZ / r
            forceX(i) = forceX(i) + fX
            forceY(i) = forceY(i) + fY
            forceZ(i) = forceZ(i) + fZ
        end do
    end do
    !$OMP end parallel

 Test3
serial time calculate forces 21.17
Nthreads  1 time  33.47
Nthreads  2 time  17.21
Nthreads  3 time  12.95
Nthreads  4 time  11.02
Nthreads  5 time   9.85
Nthreads  6 time   9.12
Nthreads  7 time   8.63
Nthreads  8 time   8.27

Press any key to continue . . .

 

Note that the single interior loop of the prior post is now composed of two loops. The revised algorithm performs each inter-particle force calculation twice, which may seem redundant, but it also eliminates the race condition of updating forceX(j), forceY(j), and forceZ(j) in the prior loop.

The above has no race condition. As stated earlier, it is not necessarily optimal; rather, it is used for illustrative purposes.
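
A third option, sketched here purely for illustration (assuming a compiler with OpenMP 4.5-level support for allocatable arrays in reduction clauses, as recent ifort versions have): put the force arrays themselves in a reduction clause. Each thread then accumulates into a private copy that is summed at the end of the region, so the symmetric (j) updates, and their factor-of-2 work saving, can be kept without a race, at the cost of one copy of each force array per thread. The dynamic schedule is an assumption to balance the triangular loop.

    forceX = 0.0
    forceY = 0.0
    forceZ = 0.0
    !$OMP parallel do private(j, dX, dY, dZ, r2, r, fX, fY, fZ, forceProduct) &
    !$OMP& reduction(+:forceX, forceY, forceZ) schedule(dynamic, 64)
    do i=1,Nparticles-1
        do j=i+1, Nparticles
            dX = posX(j) - posX(i)
            dY = posY(j) - posY(i)
            dZ = posZ(j) - posZ(i)
            r2 = dX**2 + dY**2 + dZ**2
            forceProduct = ke * charge(i) * charge(j) / r2
            r = sqrt(r2)
            fX = forceProduct * dX / r
            fY = forceProduct * dY / r
            fZ = forceProduct * dZ / r
            forceX(i) = forceX(i) + fX
            forceY(i) = forceY(i) + fY
            forceZ(i) = forceZ(i) + fZ
            forceX(j) = forceX(j) - fX
            forceY(j) = forceY(j) - fY
            forceZ(j) = forceZ(j) - fZ
        end do
    end do
    !$OMP end parallel do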

 Jim Dempsey

Johannes_A_
Beginner

Hi Jim, thanks for the ideas.

My fastest loop to multiply 800x800x800 array elements is this one:

!$OMP parallel private(k)
!$OMP do
do k=1,Nz
  c(1:Nx,1:Ny,k) = a(1:Nx,1:Ny,k)*b(1:Nx,1:Ny,k)
enddo
!$OMP end do
!$OMP end parallel

However, this is only a factor of 1.5 faster on four cores than a simple c(:,:,:)=a(:,:,:)*b(:,:,:) on one core. Is vectorization perhaps so effective that parallelization won't improve it?

BR, Johannes

TimP
Honored Contributor III
Did you compare with and without streaming stores? Jim advised you earlier in the thread to consider memory bandwidth, and here you have even more dependence on memory performance than you did when he made that comment. I suppose streaming stores may not improve the threaded case as much as the non-threaded one.
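
For example (a sketch, assuming ifort and the arrays of your loop above): besides the global /Qopt-streaming-stores:always option, the stores can be marked non-temporal per loop, so the freshly computed c values stream past the cache instead of first pulling c's lines in:

!$OMP parallel do private(i,j,k)
do k=1,Nz
  do j=1,Ny
    !DIR$ VECTOR NONTEMPORAL
    do i=1,Nx
      c(i,j,k) = a(i,j,k)*b(i,j,k)
    enddo
  enddo
enddo
!$OMP end parallel do
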
0 Kudos
Johannes_A_
Beginner

/Qopt-streaming-stores:always reduces the execution time by about a factor of 1.5, regardless of the number of threads.

 
