Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

OpenMP locks

Fotis_P_
Beginner
1,420 Views

Hello all,

I want to parallelize filling a histogram. I searched around, found this video, and tried its example, thinking "seems simple enough". Here's mine (basically the same):

!initialize matrix, hist
do i=1,Nbins
   call omp_init_lock(hist_locks(i))
enddo
!$omp parallel do
do i=1,N
   bin=nint(matrix(i)/dr)
   call omp_set_lock(hist_locks(bin)) 
   hist(bin)=hist(bin)+1               
   call omp_unset_lock(hist_locks(bin)) 
enddo
!$omp end parallel do
print*, 'end'

(sorry, I can't get it to format as a code segment)

 

But when I run it (Ctrl+F5), it always fails with the same error, "forrtl: severe (157): Program Exception - access violation", on the line with omp_unset_lock. What's going on? I really don't get this error. If it happened on the line where I set the lock, it would maybe kind of make sense ("two threads tried to modify the same memory address at the same time", although that would defeat the whole purpose of the locks). But two threads trying to unlock the same lock? How? (I hope I'm interpreting the error correctly, but whatever the case, how do I fix it?) I know there are other ways to parallelize histogram filling, but I want to understand why this one doesn't work. Also, I know that in the video the example is in C++ and the lock calls pass the lock by reference (using &), if I'm not mistaken. How do you do that in Fortran? I tried using LOC but it doesn't help; I still get the same errors. Should I use pointers somehow?

 

 

Another question: if I insert a " print*, 'a' " statement anywhere inside the do loop, the program runs (Ctrl+F5) and terminates "normally" without any errors, BUT without printing 'end'. That means it did hit problems and terminated prematurely, and indeed, if I run it under the debugger (F5), it reports the same errors as before (without the print). Please, any explanations? I'm starting to go crazy.

0 Kudos
16 Replies
jimdempseyatthecove
Honored Contributor III

Does nint(matrix(I)/dr) produce a number that is .le. 0 or .gt. Nbins?

Add:

bin=nint(matrix(i)/dr)
if(bin .le. 0 .or. bin .gt. Nbins) print *, "Bug - bin .le. 0 .or. bin .gt. Nbins"

Also, in cases where a simple scalar integer is added to a variable, consider using !$omp atomic
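For instance (a sketch only, assuming the declarations from your original post; the locks and their init loop go away entirely):

```fortran
!$omp parallel do private(bin)
do i=1,N
   bin=nint(matrix(i)/dr)
!$omp atomic
   hist(bin)=hist(bin)+1   ! the atomic directive protects this single update
enddo
!$omp end parallel do
```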

Jim Dempsey

 

Fotis_P_
Beginner

jimdempseyatthecove wrote:

Does nint(matrix(I)/dr) produce a number that is .le. 0 or .gt. Nbins?

Add:

bin=nint(matrix(i)/dr)
if(bin .le. 0 .or. bin .gt. Nbins) print *, "Bug - bin .le. 0 .or. bin .gt. Nbins"

Also, in cases where a simple scalar integer is added to a variable, consider using !$omp atomic

Jim Dempsey

 

No, bin can't be less, or equal to 0 or bigger than Nbins. I've made sure of that in the initialization of the array matrix and Nbins. Still, I did put the error checking and I get the same errors (and indeed bin is never <=0 or >Nbins)

jimdempseyatthecove
Honored Contributor III

How did you declare hist_locks?

IOW did you use:

integer(kind=OMP_LOCK_KIND) :: hist_locks(Nbins)

Jim Dempsey

Fotis_P_
Beginner

jimdempseyatthecove wrote:

How did you declare hist_locks?

IOW did you use:

integer(kind=OMP_LOCK_KIND) :: hist_locks(Nbins)

Jim Dempsey

 

I didn't use omp_lock_kind. I just set it to kind=8 (I am on x64), but to make sure, I changed it to kind=omp_lock_kind. Still nothing. Also, one additional piece of information: all 3 arrays are allocatable (so that I can have big sizes).

jimdempseyatthecove
Honored Contributor III

What version of Fortran are you using?

And, do you know if the OpenMP runtime DLL is the one that came with the Fortran compiler?

Jim Dempsey

Fotis_P_
Beginner

jimdempseyatthecove wrote:

What version of Fortran are you using?

And, do you know if the OpenMP runtime DLL is the one that came with the Fortran compiler?

Jim Dempsey

 

Intel(R) Visual Fortran Compiler 18.0.1.156 [Intel(R) 64], and I'm using the omp_lib module that came with it. I honestly think something is wrong with how the arguments are passed, but I don't know exactly what, or how to fix it.

jimdempseyatthecove
Honored Contributor III

In case this is an optimization issue with seemingly unproductive code, add

print *,hist

outside the parallel do, and before the print *,'end'

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

Grr.... bin is not private

!$omp parallel do private(bin)
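For completeness, a sketch of the corrected loop (declarations as in your first post, with omp_destroy_lock added to release the locks afterwards):

```fortran
do i=1,Nbins
   call omp_init_lock(hist_locks(i))
enddo
!$omp parallel do private(bin)   ! bin is now private to each thread
do i=1,N
   bin=nint(matrix(i)/dr)
   call omp_set_lock(hist_locks(bin))
   hist(bin)=hist(bin)+1
   call omp_unset_lock(hist_locks(bin))
enddo
!$omp end parallel do
do i=1,Nbins
   call omp_destroy_lock(hist_locks(i))
enddo
```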

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

The loop control variable I is implicitly private

Fotis_P_
Beginner

jimdempseyatthecove wrote:

Grr.... bin is not private

!$omp parallel do private(bin)

Jim Dempsey

Making bin private seemed to work; I can't believe I didn't think of it. I tried making it a shared variable (which didn't work) but not private. Thank you.

Still, in terms of speed it's not that much better. The wall-clock difference (not CPU time) between the parallel and the serial code is about 10 seconds (serial does it in ~75 seconds, parallel in ~60, give or take, with 4 threads) for a matrix of size N=1000000000, which is a BIG array (I can't go higher anyway, because I have only so much RAM). I would expect better, especially for such a big matrix.

jimdempseyatthecove
Honored Contributor III

For this problem, with very little work in the loop (in the serial version), the omp set/unset lock overhead is excessive relative to the work inside the loop.

Remove the hist_locks and perform the loop thusly:

!$omp parallel private(nThreads, iThread, bin, i)
nThreads = omp_get_num_threads()
iThread = omp_get_thread_num()
! note: a plain do loop - no !$omp do - so every thread runs the full loop,
! and i must therefore be private
do i=1,N
   bin=nint(matrix(i)/dr)
   if(iThread == mod(bin, nThreads)) hist(bin)=hist(bin)+1
enddo
!$omp end parallel
print*, 'end'

If the actual code is much larger, you might consider something else.

Report back what you get with the above change.

Jim Dempsey

 

jimdempseyatthecove
Honored Contributor III

For the above code, each thread redundantly computes bin for every element, but each bin is updated by exactly one thread, so no locks are needed. This exchanges the overhead of the omp set+unset lock for the cost of every thread fetching the same matrix element, in aggregate:

 one of the threads from RAM
 some of the threads from L3 cache
 some of the threads from L2 cache
 some of the threads from L1 cache

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

An alternate is:

!$omp parallel do private(bin) reduction(+:hist)
do i=1,N
   bin=nint(matrix(i)/dr)
   hist(bin)=hist(bin)+1              
enddo
!$omp end parallel do
 

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

Now that you have seen some hints at how to improve performance, can we do better?

Perhaps. The reduction in #14 may be best, as it reduces reads (from cache), but it adds a reduction phase of (nThreads * Nbins) read-modify-write operations.

Consider modifying the #12 method to restrict the thread updates to cache lines instead of individual bins:

integer, parameter :: CacheLineSizeBytes = 64 ! This may change for future CPUs
integer, parameter :: nBinsPerCacheLine = CacheLineSizeBytes / sizeof(hist(1)) ! consider adding a hist_t
...
!$omp parallel private(nThreads, iThread, bin, i)
nThreads = omp_get_num_threads()
iThread = omp_get_thread_num()
do i=1,N
   bin=nint(matrix(i)/dr)
   if(iThread == mod(bin / nBinsPerCacheLine, nThreads)) hist(bin)=hist(bin)+1
enddo
!$omp end parallel

Note, from your post #1 you have 4 hardware threads. Therefore, for an even distribution of bin hits, your system will experience 4 reads for each read/modify/write. Larger thread counts will incur nThreads reads per read/modify/write and may not necessarily benefit from the redundant-read technique of this method. An 8-socket Xeon Platinum could provide 448 threads, and there the redundant read might not be effective.

An additional technique might be:

!$omp parallel private(nThreads, iThread, bin, i, PartitionSize)
nThreads = omp_get_num_threads()
iThread = omp_get_thread_num()
PartitionSize = max(Nbins / nThreads, 1)
do i=1,N
   bin=nint(matrix(i)/dr)
   if(iThread == mod(bin / PartitionSize, nThreads)) hist(bin)=hist(bin)+1
enddo
!$omp end parallel

The whole point of this lengthy discussion is not simply to show you the most efficient way, but rather to teach you how to discover the most efficient way for the system under test.

Jim Dempsey

Fotis_P_
Beginner

Thank you for all the help. I will play around with all of these. You are indeed right: with questions like this, it's better to show people (me) how to solve the problem than to present the solution as if by magic.

jimdempseyatthecove
Honored Contributor III

As with any sketch code presented on these forums, some things are assumed of the reader. In my sketch code above, it is assumed that the array hist is aligned on a cache-line boundary. When (if) you run a test:

!DIR$ ATTRIBUTES ALIGN: 64:: hist

Generally one places the alignment directive immediately preceding the variable declaration.
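A sketch of the placement (assuming hist is the allocatable histogram array from the earlier posts):

```fortran
!DIR$ ATTRIBUTES ALIGN: 64 :: hist
integer, allocatable :: hist(:)   ! the directive applies to this declaration
```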

Jim Dempsey
