low OpenMP speedup, embarassingly parallel problem

paulius · ‎10-24-2005

I'm having a tough time getting decent speedup with OpenMP on dual Xeon machine (HP xw8000). I use Intel's C++ compiler (8.1) with their OpenMP libraries. My test program finds the mean of a given array of integers.I have a function that takes a pointer and a count as parameters and returns the sum (which I later divide by n). To parallelize it, I use the following code:

Code:

#pragma omp parallel sections
{
#pragma omp section
   sum1=sum(a, n/2);
#pragma omp section
   sum2=sum(a+n/2, n/2);
}

Here's the problem: up to a certain array size I get speedup that's nearly 2. Afterwards the speedup is around 1.5 (this starts for inputs of 4,194,304 integers). For all sizes, if I comment either of the two parallel sections, I get exactly half of the full sequential running time (as expected). The two functions work on contiguous and disjoint memory blocks.

My guess is that it's some memory issue, though I don't think it's a case for false-sharing. Any similar experiences or ideas on what may be happening?

Message Edited by paulius on 10-23-2005 07:13 PM

Message Edited by paulius on 10-24-2005 07:36 AM

jim_dempsey · ‎10-24-2005

You may be experiencing an aliasing problem. Xeon has a cache that is sensitive to modulous 64KB addressing. If the two threads are processing data that is seperated by exactly64KB then the processors will experience an adverse interaction. Try using a do arround your around your sections where each call to sum processes 32KB in each gulp.

iGulp = 32768 / sizeof(a(1))

do i = 1,n,iGulp*2

#pragma omp parallel sections
{
#pragma omp section
   sum1=sum(a(i), iGulp);
#pragma omp section
   sum2=sum(a(i+iGulp), iGulp);
}

end do

*** then do the remainder (I will let you write that)

The other route to use is the P4 cache line size is 128 bytes. Divide by the size of the variables in the arrays and have each processor work on alternate 128 byte sections.

Remember to align your array on cache line size interval (128 for now, should use a function call to get the line size). If you are unable to allign the array then process the array up to the alignment point ouside the parallel sections ( 0 to 7 adds if a is real(8), or 0-15 if a is real(4)),then process in the parallel sections the two groups of cache line size8 or 16 vars per groupfor the number of whole paired groups for the 2 processors (should rewrite code to auto adjust to number of processors), then process the remainder outisde the loop. i.e. find alignment point computing partial sum if any on the waythen process whole groups in parallel, then process remainder if any.

Also read up on how to write code that takes advantage to SIMD instructions.

Jim Dempsey

paulius · ‎10-28-2005

Thanks for the reply.

Another question: in dual-CPU setup, do Xeons in effect have a unified cache (i.e. both CPUs have identical lines in their caches)?

I've run into 64kB aliasing in other sequential applications, so I wrote the OpenMP code to avoid that (to the best of my knowledge). Basically, if the two threads are executing in lock-step, then they shouldn't be concurrently accessing addresses that are a multiple of 64kB apart. I don't have thread profiler with my vTune, but I've tested the idea sequentially (having two array acesses per loop iteration, at the same difference as parallel code), and the speedup was 3-fold. I'll have to try your approach.

jim_dempsey · ‎10-28-2005

0 x 64KB is an integralmultiple of 64KB

The aliasing is a problem with shared cache such as from two threads in same core but different HT virtual processors. I am not familiar (would have to look up) with dual-core Xeon as to if it uses a shared cache (marketing literature seems to indicate not shared). But when you have 2 seperate chip carriers or 2 seperate chips on one chip carrier (something new) the caches are physicaly seperate and the 64KB aliasing should not be a problem. Xeons do come with HT as well. I believe there is a way to turn off the HT in the motherboard CMOS setup. You can also experiment with setting processor affinity and set the hard affinity to only run on one of the processors in the HT set. So if you have dual-CPU (2 chips) but see 4 processors in task manager then restrict the application threads to one of the 2 virtual processors. I am not certain but I belive processor numbers 0/1 are the processor numbers for the1st core of HT processor, 2/3 arethe processors numbers for the2nd core of HT processor, etc.. Now if you have HT enabled for 1 core and not the other core then I have no idea what the processor number sequence would be. Once you get the code inserted to set the processor affinity mask then it is easy to run some tests.

To set the processor affinity you will have to make a Win32 call. Google site:microsoft.com to find the function call.

Caviat: If HyperThreading is enabled OpenMP will not know that your app has called the set processor affinity. Therefore you may experience problems if say OpenMP schedules 4 processors but your app is blocking 2 of the 4 processors. This will take some experimentation.

Jim Dempsey

paulius · ‎11-16-2005

I'm not sure if we're talking about the same thing. My setup is two separate physical Xeon CPUs (single-core, HT not enabled). So, I'm talking about OpenMP threads running on separate CPUs, no hyperthreading.

When I run my applicatino sequentially, I see a linear increase in time, which is expected since the sum of n numbers is O(n). When I run two-threaded OpenMP version of the code, I see a nearly two-fold speedup up toa certain n, then speedup suddenly decreases to 1.5 or lower. The problematic n causes no aberations in observed time. So, my interpretation is that the problem is not caused by the size of the data set.

That's why I figured it's some kind of interaction between the caches. 0x64kB is a multiple of 64kB, but it also means that it's already sitting in the cache, so it should not cause eviction, unless I'm missing something. I'm still not clear on whether the two physical CPUs have the same contents in their cache. Can someone point me to a document that discussues that?

jim_dempsey · ‎11-16-2005

Assuming you use allocated arrays.

What happens when you allocate larger than the point where the slowdown occures. Then with the larger dataset run the test to determine where the slowdown is? i.e. keep the same memory placement for all tests.

Jim Dempsey