Hello all,
I want to parallelize histogram filling. I searched around, found this video, and tried the presenter's example, thinking it seemed simple enough. Here's mine (basically the same):
!initialize matrix, hist
do i=1,Nbins
   call omp_init_lock(hist_locks(i))
enddo
!$omp parallel do
do i=1,N
   bin=nint(matrix(i)/dr)
   call omp_set_lock(hist_locks(bin))
   hist(bin)=hist(bin)+1
   call omp_unset_lock(hist_locks(bin))
enddo
!$omp end parallel do
print*, 'end'
(sorry, I can't make it format as a code segment)
But when I run it (Ctrl+F5), it always fails with the same error, "forrtl: severe (157): Program Exception - access violation", on the line with omp_unset_lock. What's going on? I really don't get this error. If it happened on the line where I set the lock, it would maybe make some sense ("two threads tried to modify the same memory address at the same time", although that would defeat the whole purpose of the locks). But two threads trying to unlock the same lock? How? (I hope I'm interpreting the error correctly, but either way: how do I fix it?) I know there are other methods to parallelize histogram filling, but I need to understand why this one doesn't work.
Also, I know that in the video the example is in C++ and the lock calls are made by reference (using &), if I'm not mistaken. How do you do that in Fortran? I tried using LOC but it doesn't help; I still get the same errors. Should I use pointers somehow?
Another question: if I insert a print*, 'a' statement anywhere inside the do loop, then the program runs (Ctrl+F5) and terminates "normally" without any errors, BUT without printing 'end', meaning it did hit a problem and terminated prematurely. Indeed, if I run it under the debugger (F5) it reports the same errors as before (without the print). Please, any explanations? I'm starting to go crazy.
Does nint(matrix(i)/dr) produce a number that is .le. 0 or .gt. Nbins?
Add:
bin=nint(matrix(i)/dr)
if(bin .le. 0 .or. bin .gt. Nbins) print *, "Bug - bin .le. 0 .or. bin .gt. Nbins"
Also, in cases where a simple scalar integer is added to a variable, consider using !$omp atomic
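A minimal sketch of that atomic suggestion, using the variable names from the original post (matrix, dr, hist, and N are assumed to be declared and initialized as there); !$omp atomic serializes only the single increment, so no lock array is needed:

```fortran
! Histogram fill with an atomic update instead of per-bin locks.
! bin must be private to each thread; i is implicitly private as the loop index.
!$omp parallel do private(bin)
do i = 1, N
   bin = nint(matrix(i)/dr)
   !$omp atomic
   hist(bin) = hist(bin) + 1
enddo
!$omp end parallel do
```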
Jim Dempsey
jimdempseyatthecove wrote:
Does nint(matrix(i)/dr) produce a number that is .le. 0 or .gt. Nbins?
Add:
bin=nint(matrix(i)/dr)
if(bin .le. 0 .or. bin .gt. Nbins) print *, "Bug - bin .le. 0 .or. bin .gt. Nbins"
Also, in cases where a simple scalar integer is added to a variable, consider using !$omp atomic
Jim Dempsey
No, bin can't be less than or equal to 0 or bigger than Nbins; I've made sure of that in the initialization of the matrix array and of Nbins. Still, I did add the error check, and I get the same errors (and indeed bin is never <= 0 or > Nbins).
How did you declare hist_locks?
IOW did you use:
integer(kind=OMP_LOCK_KIND) :: hist_locks(Nbins)
Jim Dempsey
jimdempseyatthecove wrote:
How did you declare hist_locks?
IOW did you use:
integer(kind=OMP_LOCK_KIND) :: hist_locks(Nbins)
Jim Dempsey
I didn't use omp_lock_kind; I just set it to kind=8 (I'm on x64), but to make sure, I changed it to kind=omp_lock_kind. Still nothing. Also, one additional piece of information: all three arrays are allocatable (so I can use big sizes).
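For reference, a hedged sketch of the declarations under discussion, with all three arrays allocatable as described (Nbins and N are assumed to be set earlier):

```fortran
use omp_lib   ! provides omp_lock_kind and the lock routines

integer(kind=omp_lock_kind), allocatable :: hist_locks(:)
integer, allocatable :: hist(:)
real,    allocatable :: matrix(:)

allocate(hist_locks(Nbins), hist(Nbins), matrix(N))
hist = 0
do i = 1, Nbins
   call omp_init_lock(hist_locks(i))   ! each lock must be initialized before use
enddo
```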
What version of Fortran are you using?
And do you know whether the OpenMP runtime DLL is the one that came with the Fortran compiler?
Jim Dempsey
jimdempseyatthecove wrote:
What version of Fortran are you using?
And do you know whether the OpenMP runtime DLL is the one that came with the Fortran compiler?
Jim Dempsey
Intel(R) Visual Fortran Compiler 18.0.1.156 [Intel(R) 64], and I'm using the omp_lib module that came with it. I honestly think something is wrong with how the arguments are passed, but I don't know exactly what, or how to fix it.
In case this is an optimization issue with seemingly unproductive code, add
print *,hist
outside the parallel do, and before the print *,'end'
Jim Dempsey
Grr.... bin is not private
!$omp parallel do private(bin)
Jim Dempsey
The loop control variable i is implicitly private.
jimdempseyatthecove wrote:
Grr.... bin is not private
!$omp parallel do private(bin)
Jim Dempsey
Making bin private seemed to work; I can't believe I didn't think of it. I tried making it a shared variable (which didn't work), but not private. Thank you.
Still, in terms of speed it's not that much better. The real-time difference (not CPU time) between the parallel and the serial code is about 10 seconds: serial does it in ~75 seconds, parallel in ~60, give or take (for 4 threads), for a matrix of size N=1000000000. Which is a BIG array (I can't go higher anyway, because I only have so much RAM). I would expect better, especially for such a big matrix.
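For completeness, the original lock-based loop with the one-clause fix applied (everything else exactly as in the first post):

```fortran
!$omp parallel do private(bin)           ! private(bin) is the fix
do i = 1, N
   bin = nint(matrix(i)/dr)
   call omp_set_lock(hist_locks(bin))    ! serialize updates to this bin only
   hist(bin) = hist(bin) + 1
   call omp_unset_lock(hist_locks(bin))
enddo
!$omp end parallel do
```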
For this problem, with very little work in the loop (serial version), the omp set/unset lock overhead is excessive relative to the code within the loop.
Remove the hist_locks and perform the loop thusly:
!$omp parallel private(nThreads, iThread, bin)
nThreads = omp_get_num_threads()
iThread = omp_get_thread_num()
! **** note: the former !$omp parallel do is removed; every thread runs the full loop
do i=1,N
   bin=nint(matrix(i)/dr)
   if(iThread == mod(bin, nThreads)) hist(bin)=hist(bin)+1
enddo
! **** the former !$omp end parallel do is removed
!$omp end parallel
print*, 'end'
If the actual code is much larger, you might consider something else.
Report back what you get with the above change.
Jim Dempsey
For the above code, each thread redundantly performs the work that the other implementation divided among threads. However, this exchanges the overhead of the omp set+unset lock against the time to fetch the same element, in aggregate:
one of the threads from RAM
some of the threads from L3 cache
some of the threads from L2 cache
some of the threads from L1 cache
Jim Dempsey
An alternate is:
!$omp parallel do private(bin) reduction(+:hist)
do i=1,N
bin=nint(matrix(i)/dr)
hist(bin)=hist(bin)+1
enddo
!$omp end parallel do
Jim Dempsey
Now that you have seen some hints at how to improve performance, can we do better?
Perhaps. The reduction in #14 may be best, as it reduces reads (from cache), but it adds a reduction phase of (nThreads * Nbins) read-modify-write operations.
Consider modifying the #12 method to restrict each thread's updates to a cache line instead of a bin:
integer, parameter :: CacheLineSizeBytes = 64 ! This may change for future CPUs
integer, parameter :: nBinsPerCacheLine = CacheLineSizeBytes / sizeof(hist(1)) ! consider adding hist_t
...
!$omp parallel private(nThreads, iThread, bin)
nThreads = omp_get_num_threads()
iThread = omp_get_thread_num()
do i=1,N
   bin=nint(matrix(i)/dr)
   if(iThread == mod(bin / nBinsPerCacheLine, nThreads)) hist(bin)=hist(bin)+1
enddo
!$omp end parallel
Note, from your post #1 you have 4 hardware threads. Therefore your system will, for even distribution of bin hits, experience 4 reads for each read/modify/write. Larger numbers of threads will have nThreads reads per read/modify/write and may not necessarily benefit from the redundant read technique of this method. An 8-socket Xeon Platinum could provide for 448 threads, and thus the redundant read might not be effective.
An additional technique might be:
!$omp parallel private(nThreads, iThread, bin, PartitionSize)
nThreads = omp_get_num_threads()
iThread = omp_get_thread_num()
PartitionSize = max(Nbins / nThreads, 1)
do i=1,N
   bin=nint(matrix(i)/dr)
   if(iThread == mod(bin / PartitionSize, nThreads)) hist(bin)=hist(bin)+1
enddo
!$omp end parallel
The whole point of this lengthy discussion is not simply to show you the most efficient way, but rather to teach you how to discover the most efficient way for the system under test.
Jim Dempsey
Thank you for all the help. I will play around with all of these. You are indeed right: in questions like this, it's better to show people (me) how to solve the problem than to just present the solution as if by magic.
As with any sketch code presented on these forums, some things are assumed of the reader. In my sketch code above, it is assumed that the array hist is aligned to a cache line boundary. When (if) you run a test:
!DIR$ ATTRIBUTES ALIGN: 64:: hist
Generally one places the alignment directive immediately preceding the variable declaration.
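A minimal placement sketch (Intel Fortran directive; hist is assumed allocatable as described earlier in the thread):

```fortran
!DIR$ ATTRIBUTES ALIGN: 64 :: hist   ! align hist's data to a 64-byte cache line
integer, allocatable :: hist(:)
allocate(hist(Nbins))
```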
Jim Dempsey