Intel® Fortran Compiler

OPENMP

JohnNichols
Valued Contributor III
subroutine do_1(a,b,n)
  real a(n,n), b(n,n)
  !$OMP PARALLEL SHARED(A,B,N)
    !$OMP DO SCHEDULE(DYNAMIC,1) PRIVATE(I,J)
    do i = 2, n
      do j = 1, i
        b(j,i) = ( a(j,i) + a(j,i-1) ) / 2.0
      end do
    end do
    !$OMP END DO NOWAIT
  !$OMP END PARALLEL
end

Does this actually speed up the analysis?

jimdempseyatthecove
Honored Contributor III

How large is n?

The NOWAIT in this case does nothing for you as you are doing no work within the parallel region following the END DO.

Well... it would reduce two wait barriers to one; however, it would be better to code this loop as:

subroutine do_1(a,b,n)
  integer :: n, i, j
  real a(n,n), b(n,n)
  integer, parameter :: cutoff=123 ! experiment with this setting
  if(n < cutoff) then
     do i = 2, n
       do j = 1, i
         b(j,i) = ( a(j,i) + a(j,i-1) ) / 2.0
       end do
     end do
     return
  end if
  ! serial part starts at i = 2, as in the original loop (i = 1 would read a(j,0))
  do i = 2, cutoff-1
    do j = 1, i
      b(j,i) = ( a(j,i) + a(j,i-1) ) / 2.0
    end do
  end do
  !$OMP PARALLEL DO SCHEDULE(DYNAMIC,1) PRIVATE(I,J) SHARED(A,B,N)
  do i = cutoff, n
    do j = 1, i
      b(j,i) = ( a(j,i) + a(j,i-1) ) / 2.0
    end do
  end do
  !$OMP END PARALLEL DO
end

Jim Dempsey

John_Campbell
New Contributor II

But does it actually speed up the analysis?

This is a very interesting problem, which I have tried to test with a "simple" test program, BUT

There are many problems with testing this example.

First, the calculation: for small "n", the calculation is too small to justify starting an !$OMP region, so Jim's suggestion of a cutoff is a good one, i.e. don't use !$OMP for small n.

I presume that Jim has retained the single-thread calculation up to i=cutoff-1 (do i = 2,cutoff-1) because of the overhead of switching between threads for only i operations per row.

However, what happens for i > cutoff? My testing on an i7-8700K with 2667 MHz memory appears to show that the memory transfer rate is not good enough to sustain the benefit of parallel threads doing a calculation as trivial as "b(1:i,i) = ( a(1:i,i) + a(1:i,i-1) ) / 2.0".

As each i iteration has a varying workload, there can be an issue with which schedule to use. DYNAMIC,1 or STATIC,1 do spread the rows finely enough to even out the workload, but given the memory bottleneck, SCHEDULE(STATIC) works slightly better (in my testing).
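To make the comparison concrete, here is a minimal sketch of the STATIC variant I mean (the subroutine name do_1_static is mine; only the schedule clause differs from do_1):

subroutine do_1_static(a,b,n)
  integer :: n, i, j
  real a(n,n), b(n,n)
  ! SCHEDULE(STATIC): each thread gets one contiguous block of i values,
  ! decided once at loop entry.  With this triangular workload the later
  ! blocks hold more work, but the contiguous memory access appears to
  ! win once the loop is bandwidth-bound.
  !$OMP PARALLEL DO SCHEDULE(STATIC) PRIVATE(i,j) SHARED(a,b,n)
  do i = 2, n
    do j = 1, i
      b(j,i) = ( a(j,i) + a(j,i-1) ) / 2.0
    end do
  end do
  !$OMP END PARALLEL DO
end subroutine do_1_static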

So, the results I have obtained show no significant benefit for large n, but a disadvantage for using !$omp for small n (as Jim addressed). My conclusion is that !$OMP is not suitable in this case. (I wonder what DDR5 memory might achieve?)

Now, coding a test: I am attaching my test code for a range of n from 2^2 up to 2^15. My test code tries to address the potential problems of testing a trivial calculation with an optimising compiler that may optimise away my loops, plus the problem of obtaining an accurate timer.

I changed John's do_1 routine to:

  subroutine do_1 (k,a,b,n,ti)
    integer n,i,j,k
    real*8 a(n,n), b(n,n), ti(3), f

    ti(3) = ti(3) + 1      ! increment timer for call
    f = 2+k                ! modify results for each call iteration
    !$OMP PARALLEL SHARED(A,B,N,f)
       !$OMP DO SCHEDULE(DYNAMIC,1) PRIVATE(I,J)
         do i = 2, n
            do j = 1, i
              b(j,i) = ( a(j,i) + a(j,i-1) ) / f
            end do
        end do
       !$OMP END DO NOWAIT 
    !$OMP END PARALLEL 
  end subroutine do_1

"k", the calling do-loop counter, was used to modify the calculation for each call, and "ti(3)" was used to count calls.

Another problem with this test, when using large arrays for few calculations, is the delay in allocating memory pages, so I repeated the main test-size loop twice to try to separate out any page-allocation delays.

A further benchmark problem can be the accuracy of timers, so I reported SYSTEM_CLOCK ticks as elapsed time, and tried to make sure that the tick updates are based on Windows QueryPerformanceCounter, rather than GetTickCount (which updates only 64 times a second).
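A minimal sketch of that timer check (assuming, as with ifort on Windows, that 64-bit integer arguments select the high-resolution counter):

program tick_check
  implicit none
  ! With Intel Fortran, 64-bit integer arguments make SYSTEM_CLOCK use a
  ! high-resolution source (QueryPerformanceCounter on Windows); a count
  ! rate near 64 would betray a GetTickCount-style timer.
  integer(8) :: t0, t1, rate
  call system_clock(count_rate=rate)
  print *, 'clock ticks per second =', rate
  call system_clock(t0)
  ! ... code being timed goes here ...
  call system_clock(t1)
  print *, 'elapsed seconds =', real(t1 - t0) / real(rate)
end program tick_check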

(I defined arrays B and C to check that the results were the same, but did not code that test.)

So this is how I reached my conclusion as to whether !$OMP helps in this DO_1 routine. I would be interested to hear if others reach a different interpretation.

jimdempseyatthecove
Honored Contributor III

This seems to work better on a Core i7 2600K (4 cores), x64 Release with /QxHost:

  subroutine do_jim_2 (k,a,b,n,ti)
    use omp_lib
    implicit none
    integer n,i,j,k
    real*8 a(n,n), b(n,n), ti(3), f
    integer, parameter :: cutoff=197 ! experiment with this setting
    integer :: totalWork, workProgress, perThreadWork, iThread, myWorkBegin, myWorkEnd
    
    ti(3) = ti(3) + 1
    f = 2+k
    if(n < cutoff) then
       do i = 2, n
         do j = 1, i
           b(j,i) = ( a(j,i) + a(j,i-1) ) / f
         end do
       end do
       return
    end if 
    do i = 2,cutoff-1
      do j = 1, i
        b(j,i) = ( a(j,i) + a(j,i-1) ) / f
      end do
    end do
    ! the amount of work (and memory latency) is proportional to i
    totalWork = 0
    do i = cutoff, n
        totalWork = totalWork + i
    end do
    
    !$OMP PARALLEL PRIVATE(I,J, workProgress, iThread, myWorkBegin, myWorkEnd) SHARED(A,B,N,f,perThreadWork, totalWork)
    iThread = omp_get_thread_num()
    perThreadWork = totalWork / omp_get_num_threads() ! all threads will write the same value (to shared variable)
    myWorkBegin = perThreadWork * iThread
    myWorkEnd = myWorkBegin + perThreadWork - 1
    ! the last thread also takes the division remainder, else the trailing rows are never done
    if(iThread == omp_get_num_threads() - 1) myWorkEnd = totalWork
    workProgress = 0
    do i = cutoff, n
        if(workProgress > myWorkEnd) exit
        if(workProgress >= myWorkBegin) then
            do j = 1, i
                b(j,i) = ( a(j,i) + a(j,i-1) ) / f
            end do
        endif
        workProgress = workProgress + i
    end do
    !$OMP END PARALLEL
  end subroutine do_jim_2

Jim Dempsey

JohnNichols
Valued Contributor III

Dear Jim and John:

Thank you both for the responses.  This started with an observation someone made on this forum that they had a significant improvement in performance with OpenMP.  So the observation raised a vague thought.

I had not really paid attention to it before, so in a spare minute the other day I went and read some material on it.  This is actually the sample provided in the Intel Fortran OpenMP documentation.

I have a Core i7 at work, but I prefer to use my home computer, which is a Core i5, because it is not controlled by the IT police.

In terms of the real problem I was looking at, COVID-19: we now know the statistics only work on a compact urban area. A compact urban area has to be self-contained and surrounded by countryside, e.g. Dallas-Fort Worth, New York/Northern NJ, etc. The underlying probability distribution only works where there are limited inflows and outflows; the real metropolis of NY stretches into four states. The statistics are solid.

surrounded by countryside = absolute requirement

You can model a smaller sub-area, but your attack rate will be wrong and the model is useless from a pure math viewpoint.

I need a model that creates 1000 urban areas, each compact, with 2,000,000 people within each urban area (this could be randomized, but I do not think it will matter). Between the urban areas there is a distance d(i), and the interchange of people between urban areas is proportional to the inverse of d squared. We need to select all the d(i), but we can use a hub system: say 100 hubs have major interaction and each hub has 99 offshoots.

Each person has a set of tags: age, BMI, blood type, location within the urban area, work location, shopping location, mode of transport, wears face mask, infected, hospitalized, moves to another hub, and a few others that I am still thinking about; say 25 elements.

We would run this model for a year, at one-hour intervals, using a random number generator to move the people within limits.

Any thoughts on a data structure that would let us OpenMP the analysis would be appreciated.

I can run this on the supercomputer at work, but I am really lost on how to structure it properly in terms of OpenMP.

The outer loop would move everyone within their set of allowed places, so time of day becomes important, as do shift workers.

JMN

John_Campbell
New Contributor II

"This is actually the sample provided in the Intel Fortran OPENMP documentation. "

My objection is that, from my testing, the example provided in the Intel Fortran documentation is a bad example, as it does not demonstrate any performance gain. I am not sure whether Jim agrees.

In this example, for small n there is not enough computation to justify starting an !$OMP region, and for large n the calculation is swamped by the combined threading and memory-transfer overhead of large arrays. This can negate any efficiency from AVX+ calculations. ("Large" means arrays larger than the cache.) There are too many OMP PARALLEL DO examples where the loop calculation is too trivial.

When the loop calculations are more complex, and the combined memory-transfer demands of all threads don't exceed the available memory bandwidth, there can be significant gains from using OMP.
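For example (my own illustration, not taken from the attached test code), a version of the same triangular loop with far more arithmetic per element loaded is compute-bound rather than bandwidth-bound, and the threads then pay off:

subroutine do_heavy(a, b, n)
  implicit none
  integer, intent(in) :: n
  real*8, intent(in)  :: a(n,n)
  real*8, intent(out) :: b(n,n)
  integer :: i, j
  !$OMP PARALLEL DO SCHEDULE(STATIC) PRIVATE(i,j) SHARED(a,b,n)
  do i = 2, n
     do j = 1, i
        ! many flops per 16 bytes loaded, versus 2 flops in do_1
        b(j,i) = sqrt( exp(a(j,i)) + exp(a(j,i-1)) )
     end do
  end do
  !$OMP END PARALLEL DO
end subroutine do_heavy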

Anyway, that is my interpretation. I am asking to be shown that I am wrong, so I do not keep making the mistake of creating OMP calculations on large arrays that don't work well.

(The solution can be to make more use of the information that is already in cache, reducing the memory transfer rate the calculation requires, although the transfer between each core's L1 cache and the shared all-cores L3 cache is another issue to understand.)

jimdempseyatthecove
Honored Contributor III

John,

The documentation has to show the syntactic requirements of using the OpenMP directives, so simplified examples are warranted. However, these simplified examples should carry a footnote that real performance gains are found in more complicated code, and provide a link to a working example (e.g. a simple N-body simulation) that illustrates the benefits of parallelization.

Jim Dempsey

JohnNichols
Valued Contributor III

To some extent I disagree. You are unlikely to use this type of code unless you are learning at Uni, in which case you will have a professor or, if you are lucky, a fellow student who has a friend, etc., or you have many years of experience and need some help with a problem. Either way, the sample should demonstrate the principle and show the benefit, i.e. that it is much faster.  I put up this sample as it was the one I could find.

I learnt LISP in Australia from three books, before the Internet. I can remember spending a week on a single line of code; one learnt a lot from fooling around.

Just a thought.  

Steve_Lionel
Honored Contributor III

The compiler documentation, as @jimdempseyatthecove says, is meant to show syntax and rules for usage. It is not attempting to teach you how to optimize using OpenMP.

JohnNichols
Valued Contributor III

I was playing and changed the build to Release (I had never used this before), and I started to get two "openmp external not found" link errors. I switched back to Debug and it is OK.

Anyone got any ideas? I am lost.

jimdempseyatthecove
Honored Contributor III

>> I changed the build to Release... started to get two "openmp external not found" link errors... switched back to Debug and it is OK

Check the build properties for Release to make sure OpenMP is enabled. It is not unusual to add OpenMP to a Debug build and forget to add it to the Release build as well.
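For reference (from memory, so verify in your version): in an Intel Visual Fortran project the switch lives under Project Properties > Fortran > Language > Process OpenMP Directives, set to Generate Parallel Code (/Qopenmp). Each configuration (Debug, Release) holds the setting independently, so select All Configurations before changing it.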

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

Pseudocode suggestion:

serial for T = 0, nT
  parallel for each urban area
    compute intra-urban-area interactions
    and build a table of people moving out to other urban areas
  end parallel for each urban area
  serial for each urban area
     move that area's exiters to their destination urban areas
  end serial for each urban area
end for T

Organize the data such that the computations can be made using vector instructions.
Organize access to the data such that it is accessed in a few sequential streams.
If possible, structure the computations to use as few memory fetches and writes as possible.
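A minimal Fortran sketch of that outer structure (the type urban_area and the three step routines are hypothetical placeholders invented here, not a working model); the structure-of-arrays layout follows the vectorization advice above:

module covid_sim
  implicit none
  ! Structure-of-arrays layout: one array per person-tag, so the
  ! per-area inner loops stream through memory and can be vectorized.
  type :: urban_area
     integer :: n_people = 0
     integer, allocatable :: age(:), status(:)  ! per-person tags (add the rest)
     integer, allocatable :: movers(:)          ! people leaving this time step
     integer :: n_movers = 0
  end type urban_area
contains
  subroutine step_all(areas, n_steps)
     type(urban_area), intent(inout) :: areas(:)
     integer, intent(in) :: n_steps
     integer :: t, ia
     do t = 1, n_steps                            ! serial for T = 0, nT
        !$omp parallel do schedule(dynamic) private(ia)
        do ia = 1, size(areas)                    ! parallel for each urban area
           call intra_area_step(areas(ia))        ! intra-urban-area interactions
           call collect_movers(areas(ia))         ! table of people moving out
        end do
        !$omp end parallel do
        do ia = 1, size(areas)                    ! serial exchange: no two threads
           call transfer_movers(areas, ia)        ! write the same destination area
        end do
     end do
  end subroutine step_all
  subroutine intra_area_step(a)   ! stub: the real epidemic model goes here
     type(urban_area), intent(inout) :: a
  end subroutine intra_area_step
  subroutine collect_movers(a)    ! stub
     type(urban_area), intent(inout) :: a
     a%n_movers = 0
  end subroutine collect_movers
  subroutine transfer_movers(areas, ia)  ! stub: runs serially, so no races
     type(urban_area), intent(inout) :: areas(:)
     integer, intent(in) :: ia
  end subroutine transfer_movers
end module covid_sim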

Jim Dempsey
