Hi all,
Last time I posted a query, the problem was solved by fusing all the OpenMP regions.
This time the problem seems to be scalability with OpenMP: the code doesn't scale well.
The speedup is only about 3X on 16 Xeon cores (Intel(R) Xeon(R) E5-2650 v2).
How can I improve the scalability?
Here is the code:
do k=1,km-1
  do kk=1,2
    starttime = omp_get_wtime()
    !$OMP PARALLEL PRIVATE(I) DEFAULT(SHARED)
    !$omp do
    do j=1,ny_block
      do i=1,nx_block
        LMASK(i,j) = TLT%K_LEVEL(i,j,bid) == k .and. &
                     TLT%K_LEVEL(i,j,bid) < KMT(i,j,bid) .and. &
                     TLT%ZTW(i,j,bid) == 1
        if ( LMASK(i,j) ) then
          WORK1(i,j,kk) = KAPPA_THIC(i,j,kbt,k,bid) &
                        * SLX(i,j,kk,kbt,k,bid) * dz(k)
          WORK2(i,j,kk) = c2 * dzwr(k) * ( WORK1(i,j,kk) &
                        - KAPPA_THIC(i,j,ktp,k+1,bid) * SLX(i,j,kk,ktp,k+1,bid) &
                        * dz(k+1) )
          WORK2_NEXT(i,j) = c2 * ( &
                        KAPPA_THIC(i,j,ktp,k+1,bid) * SLX(i,j,kk,ktp,k+1,bid) - &
                        KAPPA_THIC(i,j,kbt,k+1,bid) * SLX(i,j,kk,kbt,k+1,bid) )
          WORK3(i,j,kk) = KAPPA_THIC(i,j,kbt,k,bid) &
                        * SLY(i,j,kk,kbt,k,bid) * dz(k)
          WORK4(i,j,kk) = c2 * dzwr(k) * ( WORK3(i,j,kk) &
                        - KAPPA_THIC(i,j,ktp,k+1,bid) * SLY(i,j,kk,ktp,k+1,bid) &
                        * dz(k+1) )
          WORK4_NEXT(i,j) = c2 * ( &
                        KAPPA_THIC(i,j,ktp,k+1,bid) * SLY(i,j,kk,ktp,k+1,bid) - &
                        KAPPA_THIC(i,j,kbt,k+1,bid) * SLY(i,j,kk,kbt,k+1,bid) )
        endif
        if ( LMASK(i,j) .and. abs( WORK2_NEXT(i,j) ) < abs( WORK2(i,j,kk) ) ) then
          WORK2(i,j,kk) = WORK2_NEXT(i,j)
        endif
        if ( LMASK(i,j) .and. abs( WORK4_NEXT(i,j) ) < abs( WORK4(i,j,kk) ) ) then
          WORK4(i,j,kk) = WORK4_NEXT(i,j)
        endif
        LMASK(i,j) = TLT%K_LEVEL(i,j,bid) == k .and. &
                     TLT%K_LEVEL(i,j,bid) < KMT(i,j,bid) .and. &
                     TLT%ZTW(i,j,bid) == 2
        if ( LMASK(i,j) ) then
          WORK1(i,j,kk) = KAPPA_THIC(i,j,ktp,k+1,bid) &
                        * SLX(i,j,kk,ktp,k+1,bid)
          WORK2(i,j,kk) = c2 * ( WORK1(i,j,kk) &
                        - ( KAPPA_THIC(i,j,kbt,k+1,bid) &
                          * SLX(i,j,kk,kbt,k+1,bid) ) )
          WORK1(i,j,kk) = WORK1(i,j,kk) * dz(k+1)
          WORK3(i,j,kk) = KAPPA_THIC(i,j,ktp,k+1,bid) &
                        * SLY(i,j,kk,ktp,k+1,bid)
          WORK4(i,j,kk) = c2 * ( WORK3(i,j,kk) &
                        - ( KAPPA_THIC(i,j,kbt,k+1,bid) &
                          * SLY(i,j,kk,kbt,k+1,bid) ) )
          WORK3(i,j,kk) = WORK3(i,j,kk) * dz(k+1)
        endif
        LMASK(i,j) = LMASK(i,j) .and. TLT%K_LEVEL(i,j,bid) + 1 < KMT(i,j,bid)
        if (k.lt.km-1) then ! added to avoid out of bounds access
          if ( LMASK(i,j) ) then
            WORK2_NEXT(i,j) = c2 * dzwr(k+1) * ( &
              KAPPA_THIC(i,j,kbt,k+1,bid) * SLX(i,j,kk,kbt,k+1,bid) * dz(k+1) - &
              KAPPA_THIC(i,j,ktp,k+2,bid) * SLX(i,j,kk,ktp,k+2,bid) * dz(k+2) )
            WORK4_NEXT(i,j) = c2 * dzwr(k+1) * ( &
              KAPPA_THIC(i,j,kbt,k+1,bid) * SLY(i,j,kk,kbt,k+1,bid) * dz(k+1) - &
              KAPPA_THIC(i,j,ktp,k+2,bid) * SLY(i,j,kk,ktp,k+2,bid) * dz(k+2) )
          endif
        end if
        if ( LMASK(i,j) .and. abs( WORK2_NEXT(i,j) ) < abs( WORK2(i,j,kk) ) ) &
          WORK2(i,j,kk) = WORK2_NEXT(i,j)
        if ( LMASK(i,j) .and. abs( WORK4_NEXT(i,j) ) < abs( WORK4(i,j,kk) ) ) &
          WORK4(i,j,kk) = WORK4_NEXT(i,j)
      enddo
    enddo
    !$omp end do
    !$OMP END PARALLEL
    endtime = omp_get_wtime()
    total = total + (endtime - starttime)
  enddo
enddo
Also attached is a standalone working version I created from the larger code, so that you can run it on your machines if you wish.
Here are the results by thread count:
Threads | Time (s) |
1 | 1.10E-02 |
2 | 8.27E-03 |
4 | 6.37E-03 |
8 | 3.67E-03 |
16 | 2.65E-03 |
This looks like code posted elsewhere.
The runtime of the parallel region is too short to make effective use of parallelization. Your parallel region sits inside the do k=1,km-1 and do kk=1,2 loops, i.e. (60-1)*2 = 118 iterations. The time you display is the sum of the runtimes of the enclosed parallel region, so each parallel region runs for roughly 1/118th of the listed time:
1 9.32203E-05
2 7.00847E-05
4 5.39831E-05
8 3.11017E-05
16 2.16949E-05
Starting and ending a parallel region has some expense that has to be amortized. Restructure your code to locate the parallel region further out (if possible at the do k= level).
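As a rough way to quantify that expense, here is a minimal self-contained sketch (not from the original post) that times empty parallel regions; comparing its result against the ~1E-04 s per-region times above indicates how much of each region is fork/join overhead:

program region_overhead
  use omp_lib
  implicit none
  integer, parameter :: niter = 1000
  integer :: iter
  double precision :: t0, t1
  t0 = omp_get_wtime()
  do iter = 1, niter
    !$omp parallel
    ! empty region: only the fork/join cost is measured
    !$omp end parallel
  enddo
  t1 = omp_get_wtime()
  print *, 'average parallel region entry/exit time (s): ', (t1 - t0) / niter
end program region_overhead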
Your code is not incrementing the variable bid.
Your LMASK array can be replaced with a scalar.
Your outputs are being overwritten (missing code or bug).
Jim Dempsey
As I told you, changed.F90 is part of a huge file which you will not be able to run; it is just a subroutine of CESM.
The k loop isn't parallel, and therefore neither is the kk loop. Look at the updates to WORK1 and you will see why.
Variables are overwritten? If you mean that WORK1(1,1,1) is changed from the current iteration to the next, that is true; the code is meant to work like that.
Last time the issue was with speedup; this time it is with scaling.
You can increase nx_block and ny_block to large values to lengthen the runtime and demonstrate speedup as well.
I was more concerned with false sharing. Isn't that a huge overhead in my code, considering all the shared variables?
bid is a constant per entry to the CESM subroutine, so it won't change.
My code may look strange, but I can't explain why; that is how it is.
If you replace the LMASK array with an LMASK scalar (possible because each thread references only one cell per iteration), then the LMASK setting won't write to memory and won't false-share on roughly the first 8 and last 8 cells of each thread's zone of the LMASK array. Even aside from the false sharing, this saves the write to RAM and, to a lesser extent, the read-modify-write of the cache line containing LMASK(i,j).
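For illustration only, a minimal sketch of that scalar replacement (LMASKs is a hypothetical name; it needs a logical :: LMASKs declaration and a spot in the PRIVATE clause):

!$OMP PARALLEL PRIVATE(I, LMASKs) DEFAULT(SHARED)
!$omp do
do j=1,ny_block
  do i=1,nx_block
    ! thread-private scalar replaces the LMASK(i,j) array element
    LMASKs = TLT%K_LEVEL(i,j,bid) == k .and. &
             TLT%K_LEVEL(i,j,bid) < KMT(i,j,bid) .and. &
             TLT%ZTW(i,j,bid) == 1
    if ( LMASKs ) then
      ! ... same WORK1/WORK2/WORK3/WORK4 updates as in the original loop ...
    endif
  enddo
enddo
!$omp end do
!$OMP END PARALLEL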
To reduce the number of entries to and exits from parallel regions (assuming the missing code permits this), consider:
starttime = omp_get_wtime()
!$OMP PARALLEL PRIVATE(I, k, kk) DEFAULT(SHARED)
! note, all threads redundantly executing these two loops
do k=1,km-1
  do kk=1,2
    ! slice the inner range
    !$omp do
    do j=1,ny_block
      do i=1,nx_block
        ...
      enddo
    enddo
    !$omp end do
    ! The above !$OMP has an implied barrier
    ! this will keep the k and kk loops in sync
  enddo
enddo
!$OMP END PARALLEL
endtime = omp_get_wtime()
total = total + (endtime - starttime)
Jim Dempsey
I did try the LMASK change; I forgot to include it in the standalone code I provided.
It offered a small benefit: about 1.2-1.5X over the existing speedup we got.
I will try the new method mentioned above. It seems good in theory; I must check whether it works well.
Although the compiler should optimize this:
WORK1(i,j,kk) = KAPPA_THIC(i,j,ktp,k+1,bid) &
              * SLX(i,j,kk,ktp,k+1,bid)
WORK2(i,j,kk) = c2 * ( WORK1(i,j,kk) &
              - ( KAPPA_THIC(i,j,kbt,k+1,bid) &
                * SLX(i,j,kk,kbt,k+1,bid) ) )
WORK1(i,j,kk) = WORK1(i,j,kk) * dz(k+1)
consider using this instead:
WORK1temp = KAPPA_THIC(i,j,ktp,k+1,bid) &
          * SLX(i,j,kk,ktp,k+1,bid)
WORK2(i,j,kk) = c2 * ( WORK1temp &
              - ( KAPPA_THIC(i,j,kbt,k+1,bid) &
                * SLX(i,j,kk,kbt,k+1,bid) ) )
WORK1(i,j,kk) = WORK1temp * dz(k+1)
And make similar use of temporaries where applicable.
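For instance, the analogous SLY block could use a temporary in the same way (WORK3temp is an illustrative name; like WORK1temp it must be thread-private):

WORK3temp = KAPPA_THIC(i,j,ktp,k+1,bid) &
          * SLY(i,j,kk,ktp,k+1,bid)
WORK4(i,j,kk) = c2 * ( WORK3temp &
              - ( KAPPA_THIC(i,j,kbt,k+1,bid) &
                * SLY(i,j,kk,kbt,k+1,bid) ) )
WORK3(i,j,kk) = WORK3temp * dz(k+1)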
Jim Dempsey
You might want to experiment with changing:
if (k.lt.km-1) then ! added to avoid out of bounds access
if( LMASK(i,j) ) then
to
if (k.eq.km-1) LMASK(i,j) = .false. ! added to avoid out of bounds access (remove endif)
if( LMASK(i,j) ) then
The reasoning is that the former code might not vectorize, whereas the latter may have a better chance of vectorizing.
Jim Dempsey
I tried combining the parallel region at the k level as suggested.
Interestingly, there isn't much improvement; it is still approximately 2.7E-03.
I am checking whether using local variables helps.
Using a local variable like worktemp seems to be a bad idea; performance has degraded.
Could it be because the loop is no longer vectorizing?
If WORKtemp knocked out vectorization (compared to using the array), then that is a lost optimization opportunity for the compiler. It may be that the major loop under optimization has too many things to keep track of.
Try the following (although I think it may be redundant):
! slice the inner range
!$omp do
do j=1,ny_block
  !$omp simd
  do i=1,nx_block
Jim Dempsey
Or
! slice the inner range
!$omp do simd collapse(2)
do j=1,ny_block
  do i=1,nx_block
Jim Dempsey
I meant that we could replace worktemp with worktemp(1:10) and work on vector lengths per iteration.
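If that is the idea, a rough sketch of strip-mining the inner loop (VLEN, ib and worktemp are illustrative names, not from the original code):

integer, parameter :: VLEN = 8          ! assumed vector length
real :: worktemp(VLEN)                  ! thread-private strip of temporaries
...
do j=1,ny_block
  do ib=1,nx_block,VLEN                 ! process the i loop in strips of VLEN
    do i=ib,min(ib+VLEN-1,nx_block)
      worktemp(i-ib+1) = KAPPA_THIC(i,j,ktp,k+1,bid) * SLX(i,j,kk,ktp,k+1,bid)
    enddo
    ! ... consume worktemp(1:) for the WORK1/WORK2 updates of this strip ...
  enddo
enddo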
Try the simd clause with the scalar temp; then, if necessary, go back to the original array format, also with the simd clause (try both ways).
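A minimal sketch of the first variant, assuming scalar temporaries such as WORK1temp/WORK3temp from the posts above (names are illustrative):

!$omp do simd collapse(2) private(WORK1temp, WORK3temp)
do j=1,ny_block
  do i=1,nx_block
    ! ... loop body using the scalar temporaries ...
  enddo
enddo
!$omp end do simd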
Jim Dempsey
That is not helping either.
It seems KAPPA and SLX have unaligned accesses; we may have to copy them into temporaries and work on those, which should help.
Also, if you notice, WORK1, WORK2 and WORK2_NEXT go hand in hand; I think breaking the loop up to accommodate these dependencies will help.
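As a rough sketch of the copy-to-temporaries idea (KT_kbt_k and SLX_kbt_k are hypothetical names; the kind/precision should follow the original declarations), done once per k/kk iteration outside the i/j loops:

real, dimension(nx_block,ny_block) :: KT_kbt_k, SLX_kbt_k   ! contiguous 2-D copies
...
KT_kbt_k (:,:) = KAPPA_THIC(:,:,kbt,k,bid)
SLX_kbt_k(:,:) = SLX(:,:,kk,kbt,k,bid)
! the i/j loop then reads KT_kbt_k(i,j) and SLX_kbt_k(i,j)
! instead of the strided multi-dimensional slices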
Anyway, I have been following your QuickThread programming site. Do you publish papers, or have you published any papers on these models?
I haven't done much with the QuickThread threading toolkit since about 2009; I couldn't get enough followers to support continued development. That is surprising, seeing that it resolves(ed) virtually all of the threading issues from small single-CPU multi-core systems to large multi-CPU, many-core (and NUMA) systems. I've even suggested to Intel that I incorporate the thread-pool design concept into OpenMP (both C++ and Fortran) as an experimental extension. This could then be used as a test bed to either prove or disprove the merits of the extension. I am a strong proponent of making the pudding (proof of the pudding) and then letting the reviewers have a taste. Intel has experimented with non-standard extensions to OpenMP (e.g. CREW), and I am well familiar with the OpenMP limitations with regard to thread teaming.
Consider a problem similar to yours, only much larger, run on a multi-node NUMA system, where you know you could gain (significant) performance provided you could control cache locality. This is extremely hard to do with current OpenMP features. It is especially true when one portion of your code favors one organization of thread teams and a different portion favors a different thread-teaming arrangement, and you do not want to oversubscribe or undersubscribe threads. There is no way to do this with current OpenMP features.
**** HYPOTHETICAL OpenMP extension ****
! Outer loop distributed by NUMA node
!$OMP PARALLEL DO TEAM(OneEachM0$) collapse(2) PRIVATE(I, k, kk) DEFAULT(SHARED)
do k=1,km-1
  do kk=1,2
    ! slice the inner range
    !$omp parallel do TEAM(OneEachL2WithinM0$) ! slice by core within NUMA node
    do j=1,ny_block
      !$omp parallel do TEAM(L2$) ! slice by threads within core
      do i=1,nx_block
        ...
      enddo
      !$omp end parallel do
    enddo
    !$omp end parallel do
  enddo
enddo
!$OMP END PARALLEL DO
**** end HYPOTHETICAL OpenMP extension ****
Notes:
There is no requirement for any convoluted KMP_AFFINITY or OMP_PLACE_THREADS (which may not be ideal for all portions of the application).
There is no requirement for the application to know anything about the topology. On a small 1 CPU system the outer loop has a team of 1.
Unlike standard OpenMP nested parallelism, the proposed system constitutes a single thread pool: affinity-pinned compute threads plus non-pinned, higher-priority threads for I/O. The single pool is organizable in any manner you choose at the point of call (!$OMP...TEAM(...)), and it does so without creating a new pool (thus avoiding oversubscription). This organization is not constrained by an arrangement set up via environment variables (other than those used to restrict the application to a subset of the available logical processors).
As a further extension, assume that at some level of the above nested loop structure you need to output results data. With standard OpenMP, how would you do this without disrupting the running of the nested loop? About the only avenues you have are: a) create an independent thread using pthreads (or similar); b) use an OpenMP TASK, but your nested loop is fully subscribed, so this will either oversubscribe or, more likely, perform the task immediately, effectively blocking the current thread for the duration of any I/O wait; or c) perform the I/O directly in the thread, possibly within a critical section, thus blocking all compute threads wishing to do I/O while another thread's I/O is pending. None of these three "solutions" is particularly appealing.
A better way would be to borrow the QuickThread feature of having an I/O thread or pool of threads.
**** HYPOTHETICAL OpenMP extension ****
! Outer loop distributed by NUMA node
!$OMP PARALLEL DO TEAM(OneEachM0$) PRIVATE(I, k, kk) DEFAULT(SHARED)
do k=1,km-1
  do kk=1,2
    ! slice the inner range
    !$omp parallel do TEAM(OneEachL2$) ! slice by core
    do j=1,ny_block
      !$omp parallel do TEAM(L2$) ! slice by threads within core
      do i=1,nx_block
        ...
      enddo
      !$omp end parallel do
      !$omp task team(IO$) private(args) ! enqueue to I/O thread
      call yourIOroutine(args)
      !$omp end task
    enddo
    !$omp end parallel do
  enddo
enddo
!$OMP END PARALLEL DO
**** end HYPOTHETICAL OpenMP extension ****
The above new statement would enqueue the task into a higher-priority thread queue whose threads are typically not affinity-pinned (i.e. they will run on any logical processor, preferably a waiting one). Most importantly, while the higher-priority thread is blocked performing I/O, the lower-priority compute thread that was preempted is permitted to resume. You can also optionally restrict the I/O threads to one or more logical processors (e.g. those of the last core on a Xeon Phi, which are otherwise not used, or those on the socket that is directly connected to an I/O device, or a thread that "owns" the GPU).
Any comments by the community as to the desirability of features like this would be welcome.
Jim Dempsey
I think we've said it before: the alignments which matter are those of the arrays which are stored. If all the work arrays are set up under -align array32byte and the leading dimension is a multiple of 32 bytes, you can assert those arrays aligned. It does look as if your scaling is limited by memory bandwidth.
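For instance (a sketch, assuming the work arrays are compiled with -align array32byte; the exact variable list and directive placement depend on the actual declarations):

!dir$ assume_aligned WORK1:32, WORK2:32, WORK3:32, WORK4:32
do j=1,ny_block
  do i=1,nx_block
    ...
  enddo
enddo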
Experiment with associate. Something like this (be careful):
associate ( TLT_K_LEVELa => TLT%K_LEVEL(:,:,bid), &
          & KMTa         => KMT(:,:,bid), &
          & TLT_ZTWa     => TLT%ZTW(:,:,bid), &
          & ... )                               ! end of outer level associate
do k=1,km-1
  associate ( dza  => dz(k), &
            & dza1 => dz(k+1), &
            & ... )                             ! end of k level associate
  do kk=1,2
    starttime = omp_get_wtime()
    ! be careful of the kbt, ktp, k+1, ... associates
    associate ( WORK1a      => WORK1(:,:,kk), &
              & WORK2a      => WORK2(:,:,kk), &
              & WORK3a      => WORK3(:,:,kk), &
              & WORK4a      => WORK4(:,:,kk), &
              & KAPPA_THICa => KAPPA_THIC(:,:,kbt,k,bid), &
              & SLXa        => SLX(:,:,kk,kbt,k,bid), &
              & SLXa1       => SLX(:,:,kk,ktp,k+1,bid), &
              & ... )                           ! end of kk level associate
    !$OMP PARALLEL
    !$omp do
    do j=1,ny_block
      do i=1,nx_block
        LMASK(i,j) = TLT_K_LEVELa(i,j) == k .and. &
                     TLT_K_LEVELa(i,j) < KMTa(i,j) .and. &
                     TLT_ZTWa(i,j) == 1
        if ( LMASK(i,j) ) then
          WORK1a(i,j) = KAPPA_THICa(i,j) &
                      * SLXa(i,j) * dza
          WORK2a(i,j) = c2 * dzwra * ( WORK1a(i,j) &
                      - KAPPA_THICa1(i,j) * SLXa1(i,j) &
                      * dza1 )
          ...
        endif
      enddo ! i
    enddo ! j
    !$omp end do
    !$OMP END PARALLEL
    end associate ! kk level
    endtime = omp_get_wtime()
    total = total + (endtime - starttime)
  enddo
  end associate ! k level
enddo
end associate ! outer level
Jim Dempsey
Associate may or may not lose alignment attributes. You may need a !DIR$ to re-assert known alignments.
In the case of the arrays, associate will have to construct an appropriate array descriptor (the same as if you made a CALL). I cannot attest to whether this would be faster with associate or not; it depends on whether the optimizer had trouble reducing the number of subscript references.
Jim Dempsey
