subroutine do_1(a,b,n) real a(n,n), b(n,n) !$OMP PARALLEL SHARED(A,B,N) !$OMP DO SCHEDULE(DYNAMIC,1) PRIVATE(I,J) do i = 2, n do j = 1, i b(j,i) = ( a(j,i) + a(j,i-1) ) / 2.0 end do end do !$OMP END DO NOWAIT !$OMP END PARALLEL end
Does this actually speed up the analysis --
How large is n?
The NOWAIT in this case does nothing for you as you are doing no work within the parallel region following the END DO.
Well... it would reduce two wait barriers to one, however it would be better to code this loop as:
subroutine do_1(a,b,n) real a(n,n), b(n,n) integer, parameter :: cutoff=123 ! experiment with this setting if(n < cutoff) then do i = 2, n do j = 1, i b(j,i) = ( a(j,i) + a(j,i-1) ) / 2.0 end do end do return end if do i = 1,cutoff-1 do j = 1, i b(j,i) = ( a(j,i) + a(j,i-1) ) / 2.0 end do end do !$OMP PARALLEL DO SCHEDULE(DYNAMIC,1) PRIVATE(I,J) SHARED(A,B,N) do i = cutoff, n do j = 1, i b(j,i) = ( a(j,i) + a(j,i-1) ) / 2.0 end do end do !$OMP END PARALLEL DO end
But does it actually speed up the analysis ?
This is a very interesting problem, which I have tried to test with a "simple" test program, BUT
There are many problems with testing this example.
First the calculation : For small "n", the calculation size is too small to justify starting a !$OMP region, so Jim's suggestion of a cutoff is a good suggestion; for small n, ie don't use !$OMP for small n.
I presume that Jim has retained the single thread calculation up to i=cutoff-1 (do i = 2,cutoff-1) , because of the overhead of switching between threads for only i multiply operations.
However what happens for i > cutoff. My testing on an i7-8700K with 2667MHz memory, appears to show that the memory transfer rate is not good enough to sustain the benefit of parallel threads doing a trivial calculation as " b(1:i,i) = ( a(1:i,i) + a(1:i,i-1) ) / 2.0" .
As each loop has a varying workload, there can be an issue with what schedule to use. DYNAMIC,1 or STATIC,1 do provide a lot of new thread groups to even the workload, but given the memory bottleneck, SCHEDULE(STATIC) works slightly better (in my testing).
So, the results I have obtained show no significant benefit for large n, but a disadvantage for using !$omp for n small (as Jim Addressed). My conclusion is !$OMP is not suitable in this case. (I wonder what DDR5 memory might achieve?)
Now coding a test : I am attaching my test code for a range of n from 2^2 up to 2^15. My testing code tried to address the potential problems of testing a trivial calculation with an optimising compiler that may optimise away my loops, plus the problem of using an accurate timer.
I changed John's do_1 routine to:
subroutine do_1 (k,a,b,n,ti) integer n,i,j,k real*8 a(n,n), b(n,n), ti(3), f ti(3) = ti(3) + 1 ! increment timer for call f = 2+k ! modify results for each call iteration !$OMP PARALLEL SHARED(A,B,N,f) !$OMP DO SCHEDULE(DYNAMIC,1) PRIVATE(I,J) do i = 2, n do j = 1, i b(j,i) = ( a(j,i) + a(j,i-1) ) / f end do end do !$OMP END DO NOWAIT !$OMP END PARALLEL end subroutine do_1
"k" : the calling do loop counter was used to modify the calculation and "ti(3)" was used to count calls.
Another problem with this test when using large arrays for few calculations is the delays in allocating memory pages, so I repeated the main test size loop twice to try and seperate any page memory allocation delays.
A further benchmark problem can be accuracy of timers, so I reported SYSTEM_CLOCK ticks as elapsed time, and tried to make sure that the tick updates are based on Windows QueryPerformanceCounter, rather than GetTickCount (which updates 64 times a second)
( I defined arrays B and C to test the results were the same, but did not code that test)
So this is how I reached my conclusion as to if !$OMP helps in this DO_1 routine. I would be interested if others find a different interpretation.
This seems to work better on Core i7 2600K (4 cores) x64 Release with /QxHost
subroutine do_jim_2 (k,a,b,n,ti) use omp_lib implicit none integer n,i,j,k real*8 a(n,n), b(n,n), ti(3), f integer, parameter :: cutoff=197 ! experiment with this setting integer :: totalWork, workProgress, perThreadWork, iThread, myWorkBegin, myWorkEnd ti(3) = ti(3) + 1 f = 2+k if(n < cutoff) then do i = 2, n do j = 1, i b(j,i) = ( a(j,i) + a(j,i-1) ) / f end do end do return end if do i = 2,cutoff-1 do j = 1, i b(j,i) = ( a(j,i) + a(j,i-1) ) / f end do end do ! the amount of work (and memory latency) is proportional to i totalWork = 0 do i = cutoff, n totalWork = totalWork + i end do !$OMP PARALLEL PRIVATE(I,J, workProgress, iThread, myWorkBegin, myWorkEnd) SHARED(A,B,N,f,perThreadWork, totalWork) iThread = omp_get_thread_num() perThreadWork = totalWork / omp_get_num_threads() ! all threads will write the same value (to shared variable) myWorkBegin = perThreadWork * iThread myWorkEnd = myWorkBegin + perThreadWork - 1 workProgress = 0 do i = cutoff, n if(workProgress > myWorkEnd) exit if(workProgress >= myWorkBegin) then do j = 1, i b(j,i) = ( a(j,i) + a(j,i-1) ) / f end do endif workProgress = workProgress + i end do !$OMP END PARALLEL end subroutine do_jim_2
Dear Jim and John:
Thank you both for the response. This started with an observation that someone made on this forum that they had a significant improvement in performance with OpenMP. So the observation raised a vague thought.
I had not really paid attention to it before and so in a spare minute the other day -- I went and read some stuff on it. This is actually the sample provided in the Intel Fortran OPENMP documentation.
I have a core i7 at work, but I prefer to use my home computer which is a core i5, because it is not controlled by the IT police.
In terms of the real problem I was looking at, COVID -19. We now know the statistics only works on a compact urban area. A compact urban area has to be self contained and surrounded by countryside, so Dallas FW, New York/Northern NJ etc... The underlying probablity distributon only works where there are limited inflows and outflows, so the real metropolis of NY stretches into 4 states. The statistics is solid.
surrounded by countryside = absolute requirement
You can model a smaller sub area, but your attack rate will be wrong and the model is useless from a pure math viewpoint.
I need a model that creates 1000 urban areas, each compact, within each urban area is 2,000,000 people - this could be randomized but I do not think it will matter, between the urban areas there is a distance d(i) and the interchange between urban areas of people is proportional to the inverse of d squared. We need to select all the d(i) -- but we can use a hub system so say 100 have major interaction and each hub has 99 offshoots,
Each person has set of tags, age, BMI, blood type, location within the urban area, work location, shopping location, mode of transport, wears face mask, infected, hospitalized, and a few others that I am still thinking about - say 25 elements, moves to another hub,
We would run this model for a year, at 1 hourly intervals, using a random number generator to move the people within limits.
Any thoughts on the data structure so we can OPENMP the analysis would be appreciated.
I can run this on the supercomputer at work - but I am really lost on how to structure it properly in terms of OPENMP.
The outerloop would move everyone within there set of allowed places, so time of day becomes important and shift workers
"This is actually the sample provided in the Intel Fortran OPENMP documentation. "
My objection is that, from my testing, the example provided by Intel Fortran is a bad example, as it does not demonstrate any performance gain. I am not sure if Jim agrees ?
In this example, for n small, there is not enough computation to justify starting an !$OMP region, and for n large, the calculation is swamped by the combined thread, memory transfer overhead of large arrays. This can negate any efficiency from AVX+ calculations. ("large" is for arrays larger than the cache.) There are too many OMP PARALLEL DO examples where the loop calculation is too trivial.
When the loop calculations are more complex and the combined memory transfer demands of all threads don't exceed the memory transfer bandwidth, there can be significant gains using OMP.
Anyway, that is my interpretation. I am asking to be shown that I am wrong, so I can not keep making the mistake of creating OMP calculations on large arrays, that don't work well.
(The solution can be to make more use of the information that is in cache, reducing the required memory transfer rate for the calculation, although the transfer between each core's L1 cache and all-cores shared L3 cache is another issue to understand)
The documentation has to show the syntactic requirements of using the OpenMP directives. So simplified examples are warranted. However, these simplified examples should be footnoted that real performance gains are found in more complicated code, and provide a link to a working example (e.g. simple N body) that illustrates the benefits of parallelization.
To some extent I disagree, you are unlikely to use this type of code unless you are learning at Uni, in which case you will have a professor or if you are lucky a fellow student who has a friend etc... Or you have many years experience and need some help with a problem, the sample should demonstrate the principal and show the use - ie much faster. I put up this sample as it was the one I could find.
I learnt LISP in Australia from three books before the Internet, I can remember spending a week on a single line of code, one learnt a lot from fooling around.
Just a thought.
>>I changed to release -- never used this before and I started to get openmp externals not found - 2 - switched back to debug and it is ok
Check the build properties for Release to make sure OpenMP is enabled. Id is not unusual to add OpenMP to a Debug build and forget to add it to the Release build as well.
Pseudo code suggestion
serial for T = 0, nT parallel for each urban area compute intra urban area interactions together with build table of people moving out to other urban area end parallel for each urban area serial for each urban area move exiters of urban area to other urban area end serial for each urban area end for T
Organize the data such that computations can be made using vector instructions.
Organize access to data such that it (they) are accessed in a few sequential streams
If possible structure computations to use as few as memory fetches and writes as possible.