Open MP faster on dual core, slower on quad core?

peterklaver · ‎12-29-2009

I am working with a Fortran code that includes a numerical integration scheme. The problem parallelizes well, in theory at least, and by adding some Open MP directives I have seen satisfying decreases in run time on a dual-core system (Windows XP with various Intel Core 2 Duo chips). However, when I run the same executable on a quad-core system (Windows XP, Intel Core 2 Quad Q9550) the run time is longer than with the serial version. This is puzzling, because I tend to think that if the code is sensitive to common issues like overhead and scheduling then I wouldnt see as much performance improvement with the dual core system. But maybe I am missing something.

Here is a skeletonized version of the code to show its basic structure:

[plain]program [name]
use [modules]
implicit none
[declarations]
[initializing routines]

do while (T < Tsim)

!$omp parallel private([arguments])
!$omp do
  do j=1,Nreaches
    call CalcFluxSources([arguments])
    call CalcCellFluxes([arguments])
    call UpdateAQ([arguments])
    call TPACalc([arguments])
  end do
!$omp end do nowait
!$omp end parallel

  [other routines]

end do

[finishing routines]

end program
[/plain]

The number of reaches varies but is on the order of dozens. The four subroutines called within the parallel region each consist of one or more loops that operate on the individual cells of each reach. As the number of cells in each reach can vary considerably, I suspected that scheduling may have been an issue. I therefore replaced an !$omp parallel do directive with the multiple directives so I could add the nowait clause, although it made no noticeable difference. The other routines includes code that links the reaches to one another, which cannot be done in parallel, so the parallel region has to end before those routines are called. My understanding is that the threads are not destroyed by doing this.

I am using Intel Visual Fortran (11.1.046) integrated with Visual Studio 2008 (9.0.21022.8). If anyone has any suggestions as to what to look at, or what to try, I would greatly appreciate them. I've been seeing ~30% decrease in run time on a dual core and would very much like to scale that to more cores. Im sure Ive left something important out in this post, so feel free to ask for additional information. Thanks.

--->peter

TimP · ‎12-29-2009

A possible suspect is that your threads are attempting to share cache, and there is some partial false sharing effect when running on the CPU with split L2 cache. If you run 2 threads, with both threads affinitized to the same cache, and then with affinity to separate cache, this might give experimental confirmation. The problem would show up if one thread writes to the same cache line which another thread fetches with the intent to read. Supposing, for example, that one thread writes one cache line ahead of where the other reads, it may perform well only when the threads are on the same L2 cache.
With the default "alternate sector" prefetch, each 64-byte cache line fetch for read brings in its companion, wasting memory transfers if the other thread writes there. The strided hardware prefetch mode tries to read about 3 cache lines ahead, and again there is wasted data transfer, if that thread is following closely behind the other.
Applications which must work this way often use the MSR facilities to disable hardware prefetch. Such applications might work well with no effort on a current quad core (Core i7 or newer) but not scale well to multiple sockets.

peterklaver · ‎12-29-2009

Thanks for the reply. I am a hydrodynamic modeler and cannot pretend to completely follow what you say, but I can tell you that prefetch insertion is disabled in my current release build. Here is the command line:

/nologo /QxSSE4.1 /Qopenmp /Qopenmp-report1 /Qvec-report2 /module:"Release" /object:"Release" /libs:static /threads /c

I could experiment with prefetching but I don't think that's what you're recommending. Are there Fortran intrinsics for setting thread affinity? Sorry for asking what may be an obvious question, but this is a step ot two beyond what I've worked with previously. Alternatively I could build this for 64-bit Windows 7 but it would still be the Core 2 Quad, not a Core i7. I'm not sure if that would make a difference, however.

TimP · ‎12-29-2009

As you're using ifort, you would require set kmp_affinity environment variable to make full use of your split cache CPU. Presumably, with
OMP_NUM_THREADS=2 you could use KMP_AFFINITY=compact to put both on a single cache, or KMP_AFFINITY=scatter to split it among 2 cache.
For the 4 thread case also, KMP_AFFINITY=compact would put adjacent threads on the same cache, rather than allowing them to float. In current ifort, there is -par-affinity command line option which you could use to force KMP_AFFINITY=compact, preventing the environment variable from taking effect.
I wasn't talking about software prefetch inserted by the compiler, although it might have the effect of over-riding the hardware prefetch.
64-bit Windows could improve 4 thread scaling, if your application is tight on virtual address space for 4 threads in 32-bit mode.

peterklaver · ‎12-30-2009

I experimented a little with the /Qpar-affinity switch and the results are interesting. My previous claim of a general slowdown on quad-core relative to serial version was not correct, as it turns out; I was not comparing the right results. To make a more useful comparison I ran the same test case on two machines with three different builds: 1) with Open MP directives disabled, 2) enabling Open MP, and 3) adding /Qpar-affinity:compact. The dual-core chip was a Core 2 Duo E8400 @ 3.0 GHz and the quad-core was a Core 2 Quad Q9550 @ 2.83 GHz. Run times in seconds (average of two runs) are as follows:

Case Duo Quad
----------------------- --------- -------------
w/o Open MP 521.3 567.2
w/ Open MP 372.1 423.3
w/ affinity:compact 372.0 402.2

So it appears that keeping threads on one cache is helpful for the quad core but has little, if any, benefit for the dual core. Ultimately, in this case at least, the quad achieves at best the same speedup as the duo. We will be getting a Core i7 machine in here soon and I will see what I can get out of it.

Your suggestions were helpful, Tim, thank you very much. Should I take from this that additional gains will depend more on system architecture and runtime environment, and less on optimal tuning within the confines of Open MP directives?

TimP · ‎12-30-2009

The affinity setting isn't important for the dual core, nor for Core i7, as all cores have equal access to the single last level cache. It becomes important in cases like these when the last level cache is split, as on the Core 2 Quad.
I'm not sure where you are trying to go with your generalization. We have demonstrated cases on dual socket 6 core and beyond where it is important to write OpenMP code which explicitly divides work evenly among threads, as well as setting affinity, avoiding data fragmentation across last level caches.
We are looking at increased future importance of programming models dealing with multiple levels of parallelism:
1) SIMD instruction level parallel (vector) within a thread
2) multiple threads (OpenMP, TBB, ....) on single (possibly dual) socket multi-core CPU
3) MPI across boundaries where shared memory is less efficient or impractical

jimdempseyatthecove · ‎12-30-2009

Peter

Can you experiment with changing the placement of the otput data (written data) within the parallel region such that each threads output data is not within the same cache line

! do not use
REAL, ALLOCATABLE:: TOTAL(:) ! one per thread
... ! additional arrays of one element per possible thread

! use something like
type td_t
REAL :: TOTAL
... ! other thread context data
INTEGER :: PADD(nn) ! nn sized to reach 64 byte boundry
end type td_t
type(td_t), ALLOCATABLE:: TD(:) ! one per thread

Jim Dempsey

peterklaver · ‎12-31-2009

Tim - That's okay, I'm not entirely sure where I was going with that generalization, either. Mostly I am in production mode with this model and interested in saving some run time, which is happening. In the future I intend to try balancing the work more evenly among threads, which should be straightforward because the amount of work for each "reach" is well-defined and does not change throughout the simulation. Most of the work done outside the parallel region involves boundary conditions in which the various reaches communicate with one another at their end cells (hence the need to join), and while this could also be done in a parallel region (because each reach end can only be connected to a single boundary condition), the amount of work associated with boundaries can vary considerably over the course of a simulation and load balancing would not be trivial.

Jim - I'm afraid I don't understand what you are suggesting. There is no "written" output in the parallel region, not to disk at least. If instead you mean written to memory, then the problem becomes complicated. The subroutines work on a goodly number of 2D arrays that are declared in a module, where the outer dimension represents reaches and the inner dimension represents cells within a given reach. In my mind I envision that each thread is working on specific columns of these arrays. I don't know how to go about changing the "placement" of the data, as you suggest, without some major revision of the code. Again, as a civil engineer I can only drop to so low a level in computer science before I get that glazed-over feeling. But I appreciate your taking the time to consider my problem.

jimdempseyatthecove · ‎12-31-2009

Would it be fair to assume that j is passed in as an argument?
And if so, j is one of the indexes in the 2D arrays?
And if so, which index is it?

If you are using Array(i,j) then you code may be optimal.
Whereas Array(j,i) is non-optimal as i, i+1 will NOT be in adjacent memory i, i+1 will stride down the array by the number of elements in the first index.In Fortran the adjacent cells in memory are thosethat differ by 1 in the left most index of a multi-dimension array. If yours are backwards, it might be relatively easy to swap around the DO i= and DO j= loops.

Jim Dempsey

peterklaver · ‎12-31-2009

Quoting - jimdempseyatthecove

Would it be fair to assume that j is passed in as an argument?
And if so, j is one of the indexes in the 2D arrays?
And if so, which index is it?

If you are using Array(i,j) then you code may be optimal.
Whereas Array(j,i) is non-optimal as i, i+1 will NOT be in adjacent memory i, i+1 will stride down the array by the number of elements in the first index.In Fortran the adjacent cells in memory are thosethat differ by 1 in the left most index of a multi-dimension array. If yours are backwards, it might be relatively easy to swap around the DO i= and DO j= loops.

Jim Dempsey

Yes, j is passed in and yes, all arrays are arranged like (i,j). That happens to be one little nugget of computer science I'm familiar with.

When you say "optimal", I assume that refers to the data placement. Scheduling may be another matter. Recall that the number of cells (the i index) in each reach can vary from dozens to thousands (if I had my way the variation would be from hundreds to thousands, but I don't get to design the applications in most cases). By using "nowait" with two threads, I imagine the ideal process has one thread getting the first j, the other thread getting the second j, with the next j going to whichever thread is finished first, down to the last j (j=Nreaches). At the end, one thread will probably have to wait a little for the other to finsh. With four threads, however, there could be three waiting for the fourth to finish. I wonder if this is why I don't see much scaling going from two to four threads.

Or is it instead the case that one thread gets reaches 1 through Nreaches/2, and the other thread gets reaches Nreaches/2+1 through Nreaches, and if the first set takes twice as long as the second set, tough luck? I can see that also not scaling well to four threads. If, on the other hand, I could determine four contiguous sets of reaches with about the same number of cells in total, then I could assign each set to a given thread and balance the work that way. Is this possible with Open MP? You don't have to explain how, I'm happy to teach myself from a book. I just wonder if you think it's worth pursuing.

Oh, and Happy New year!

TimP · ‎12-31-2009

With default static scheduling, which you appear to be using, the loops are divided as evenly as possible among the threads, with each thread getting a contiguous chunk, which is good for cache locality. This is good when the work is evenly distributed among loop iterations.
You could use the chunk option to change the default chunk size (downwards, if using static scheduling). If the amount of work per loop is known to vary, but in an unknown way, schedule(guided) may be appropriate. That starts off allocating about 50% of the loop iterations the same as default static, then goes to dynamic scheduling with progressively decreasing chunk size. If you know ahead of time a way to give each thread the same amount of work, in a way which preserves memory locality, that is likely to be better. I posted an example for triangular matrix operation a short time ago on this thread.
Memory locality issues aren't as serious when they involve only the cache split scheme of core 2 quad; they become more important where the RAM is local to each CPU and remote memory access is slower.

jimdempseyatthecove · ‎01-03-2010

Happy New Year to you too.

Try this

1) Create a user defined type containing aninteger array of size of the span of j indexing. e.g. ArrayOfJ(jMax). If this size is not known until run time then make this array allocatable. The type should also contain the number of valid indexes and an integer sum.

2) Create an allocatable array of these types and allocate to the number of threads.

3) initialize (allocate internal array if necessary), set valid index = 0, set sum = 0

4) iterate across j
determine span of i
iterateacross number of threads
locateentry in array created in 2) who's sum is lowest
end iterate
add span ofi to found entry's sum
incriment valid index in found entry
save j into ArrayOfJ(valid index) of found entry
end iterate across j

Essentially what you have done is a ballanced packing technique.
The code will be much easire that the description.

This initialization need be created once at beginning then again onlywhen the extentschange.

To use:

Replace the PARALLEL DO with PARALLEL
inside parallel region obtain thread number (bias by 1 if required)
then iterate across your arrayOfJ

The above technique assumes all threads will be available when you enter the parallel region and will work best when the threads are available.

peterklaver · ‎01-04-2010

Thanks, Jim. Your pseudocode makes sense to me. Other projects have intruded for now, and it may be a few weeks before I get a chance to properly test this (balanced versus unbalanced, with and without setting affinity). But I will share the results.

--->peter

jimdempseyatthecove · ‎01-04-2010

Peter,

The packing technique outlined earlier should get you started. You can improve upon it as you get your code working. I wanted to get your mind directed towards a solution and not necessarily provide you with the solution. Learning how to figure this out is more important in the long run than being told what to do.

The first improvement might be to produce a list containing {sizeOfIndexI, j}, then sort in descending order, then fill the packing bins in the order described earlier.

The second improvement would be starting with first improvement, permute the fill sequence such that you find the minimum largest sum of sizeOfIndexI for all threads (this will approximate the shortestcompletion time).

Jim Dempsey