Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

OpenMP locks

Fotis_P_
Beginner
1,420 Views

Hello all,

I want to parallelize filling a histogram. I searched around, found this video, and tried its example, thinking "seems simple enough". Here's mine (basically the same):

!initialize matrix, hist
do i=1,Nbins
   call omp_init_lock(hist_locks(i))
enddo
!$omp parallel do
do i=1,N
   bin=nint(matrix(i)/dr)
   call omp_set_lock(hist_locks(bin)) 
   hist(bin)=hist(bin)+1               
   call omp_unset_lock(hist_locks(bin)) 
enddo
!$omp end parallel do
print*, 'end'

(sorry, I can't get it to format as a code segment)

 

But when I run it (Ctrl+F5), it always fails with the same error, "forrtl: severe (157): Program Exception - access violation", on the line with omp_unset_lock. What's going on? I really don't get this error. If it happened on the line where I set the lock, it would maybe kind of make sense ("two threads tried to modify the same memory address at the same time", although that would defeat the whole purpose of the locks). But two threads trying to unlock the same lock? How? (I hope I'm interpreting the error correctly, but whatever the case, how do I fix it?) I know there are other ways to parallelize histogram filling, but I want to understand why this one doesn't work. Also, I know that in the video the example is in C++ and the lock calls pass the lock by reference (using &), if I'm not mistaken. How do you do that in Fortran? I tried using LOC but it doesn't help; I still get the same errors. Should I use pointers somehow?

 

 

Another question: if I insert a " print*, 'a' " statement anywhere inside the do loop, the program runs (Ctrl+F5) and terminates "normally" without any errors, BUT without printing 'end'. That means it did hit problems and terminated prematurely, and indeed, if I run it under the debugger (F5), it reports the same errors as before (without the print). Please, any explanations? I'm starting to go crazy.

0 Kudos
16 Replies
jimdempseyatthecove
Honored Contributor III

Does nint(matrix(I)/dr) produce a number that is .le. 0 or .gt. Nbins?

Add:

bin=nint(matrix(i)/dr)
if(bin .le. 0 .or. bin .gt. Nbins) print *, "Bug - bin .le. 0 .or. bin .gt. Nbins"

Also, in cases where a simple scalar integer is added to a variable, consider using !$omp atomic
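For instance (a sketch only, assuming the declarations from your original post; the locks and their init loop go away entirely):

```fortran
!$omp parallel do private(bin)
do i=1,N
   bin=nint(matrix(i)/dr)
!$omp atomic
   hist(bin)=hist(bin)+1   ! the atomic directive protects this single update
enddo
!$omp end parallel do
```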

Jim Dempsey

 

Fotis_P_
Beginner

jimdempseyatthecove wrote:

Does nint(matrix(I)/dr) produce a number that is .le. 0 or .gt. Nbins?

Add:

bin=nint(matrix(i)/dr)
if(bin .le. 0 .or. bin .gt. Nbins) print *, "Bug - bin .le. 0 .or. bin .gt. Nbins"

Also, in cases where a simple scalar integer is added to a variable, consider using !$omp atomic

Jim Dempsey

 

No, bin can't be less, or equal to 0 or bigger than Nbins. I've made sure of that in the initialization of the array matrix and Nbins. Still, I did put the error checking and I get the same errors (and indeed bin is never <=0 or >Nbins)

jimdempseyatthecove
Honored Contributor III

How did you declare hist_locks?

IOW did you use:

integer(kind=OMP_LOCK_KIND) :: hist_locks(Nbins)

Jim Dempsey

Fotis_P_
Beginner

jimdempseyatthecove wrote:

How did you declare hist_locks?

IOW did you use:

integer(kind=OMP_LOCK_KIND) :: hist_locks(Nbins)

Jim Dempsey

 

I didn't use omp_lock_kind. I just set it to kind=8 (I am on x64), but to make sure, I changed it to kind=omp_lock_kind. Still nothing. Also, one additional piece of information: all 3 arrays are allocatable (so that I can have big sizes).

jimdempseyatthecove
Honored Contributor III

What version of Fortran are you using?

And, do you know if the OpenMP runtime DLL is the one that came with the Fortran compiler?

Jim Dempsey

Fotis_P_
Beginner

jimdempseyatthecove wrote:

What version of Fortran are you using?

And, do you know if the OpenMP runtime DLL is the one that came with the Fortran compiler?

Jim Dempsey

 

Intel(R) Visual Fortran Compiler 18.0.1.156 [Intel(R) 64], and I'm using the omp_lib module that came with it. I honestly think something is wrong with how the arguments are passed, but I don't know exactly what, or how to fix it.

jimdempseyatthecove
Honored Contributor III

In case this is an optimization issue with seemingly unproductive code, add

print *,hist

outside the parallel do, and before the print *,'end'

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

Grr.... bin is not private

!$omp parallel do private(bin)
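For completeness, a sketch of the corrected loop (declarations as in your first post, with omp_destroy_lock added to release the locks afterwards):

```fortran
do i=1,Nbins
   call omp_init_lock(hist_locks(i))
enddo
!$omp parallel do private(bin)   ! bin is now private to each thread
do i=1,N
   bin=nint(matrix(i)/dr)
   call omp_set_lock(hist_locks(bin))
   hist(bin)=hist(bin)+1
   call omp_unset_lock(hist_locks(bin))
enddo
!$omp end parallel do
do i=1,Nbins
   call omp_destroy_lock(hist_locks(i))
enddo
```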

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

The loop control variable I is implicitly private

Fotis_P_
Beginner

jimdempseyatthecove wrote:

Grr.... bin is not private

!$omp parallel do private(bin)

Jim Dempsey

Making bin private seemed to work; I can't believe I didn't think of it. I tried making it a shared variable (which didn't work) but not private. Thank you.

Still, in terms of speed it's not that much better. The wall-clock difference (not CPU time) between the parallel and the serial code is about 10 seconds (serial does it in ~75 seconds, parallel in ~60, give or take, with 4 threads) for a matrix of size N=1000000000, which is a BIG array (I can't go higher anyway, because I have only so much RAM). I would expect better, especially for such a big matrix.

jimdempseyatthecove
Honored Contributor III

For this problem, with very little work in the loop (in the serial version), the omp set/unset lock overhead is excessive relative to the work inside the loop.

Remove the hist_locks and perform the loop thusly:

!$omp parallel private(nThreads, iThread, bin, i)
nThreads = omp_get_num_threads()
iThread = omp_get_thread_num()
! note: a plain do loop - no !$omp do - so every thread runs the full loop,
! and i must therefore be private
do i=1,N
   bin=nint(matrix(i)/dr)
   if(iThread == mod(bin, nThreads)) hist(bin)=hist(bin)+1
enddo
!$omp end parallel
print*, 'end'

If the actual code is much larger, you might consider something else.

Report back what you get with the above change.

Jim Dempsey

 

jimdempseyatthecove
Honored Contributor III

For the above code, each thread redundantly computes bin for every element, but each bin is updated by exactly one thread, so no locks are needed. This exchanges the overhead of the omp set+unset lock for the cost of every thread fetching the same matrix element, in aggregate:

 one of the threads from RAM
 some of the threads from L3 cache
 some of the threads from L2 cache
 some of the threads from L1 cache

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

An alternate is:

!$omp parallel do private(bin) reduction(+:hist)
do i=1,N
   bin=nint(matrix(i)/dr)
   hist(bin)=hist(bin)+1              
enddo
!$omp end parallel do
 

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

Now that you have seen some hints at how to improve performance, can we do better?

Perhaps. The reduction in #14 may be best, as it reduces reads (from cache), but it adds a reduction phase of (nThreads * Nbins) read-modify-write operations.

Consider modifying the #12 method to restrict the thread updates to cache lines instead of individual bins:

integer, parameter :: CacheLineSizeBytes = 64 ! This may change for future CPUs
integer, parameter :: nBinsPerCacheLine = CacheLineSizeBytes / sizeof(hist(1)) ! consider adding a hist_t
...
!$omp parallel private(nThreads, iThread, bin, i)
nThreads = omp_get_num_threads()
iThread = omp_get_thread_num()
do i=1,N
   bin=nint(matrix(i)/dr)
   if(iThread == mod(bin / nBinsPerCacheLine, nThreads)) hist(bin)=hist(bin)+1
enddo
!$omp end parallel

Note, from your post #1 you have 4 hardware threads. Therefore, for an even distribution of bin hits, your system will experience 4 reads for each read/modify/write. Larger thread counts will incur nThreads reads per read/modify/write and may not necessarily benefit from the redundant-read technique of this method. An 8-socket Xeon Platinum could provide 448 threads, and there the redundant read might not be effective.

An additional technique might be:

!$omp parallel private(nThreads, iThread, bin, i, PartitionSize)
nThreads = omp_get_num_threads()
iThread = omp_get_thread_num()
PartitionSize = max(Nbins / nThreads, 1)
do i=1,N
   bin=nint(matrix(i)/dr)
   if(iThread == mod(bin / PartitionSize, nThreads)) hist(bin)=hist(bin)+1
enddo
!$omp end parallel

The whole point of this lengthy discussion is not simply to show you the most efficient way, but rather to teach you how to discover the most efficient way for the system under test.

Jim Dempsey

Fotis_P_
Beginner

Thank you for all the help. I will play around with all of these. You are indeed right: with questions like this, it's better to show people (me) how to solve the problem than to present the solution as if by magic.

jimdempseyatthecove
Honored Contributor III

As with any sketch code presented on these forums, some things are assumed of the reader. In my sketch code above, it is assumed that the array hist is aligned on a cache-line boundary. When (if) you run a test:

!DIR$ ATTRIBUTES ALIGN: 64:: hist

Generally one places the alignment directive immediately preceding the variable declaration.
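A sketch of the placement (assuming hist is the allocatable histogram array from the earlier posts):

```fortran
!DIR$ ATTRIBUTES ALIGN: 64 :: hist
integer, allocatable :: hist(:)   ! the directive applies to this declaration
```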

Jim Dempsey
