#pragma omp parallel sections
{
    #pragma omp section
    sum1 = sum(a, n/2);
    #pragma omp section
    sum2 = sum(a + n/2, n/2);
}
Here's the problem: up to a certain array size I get a speedup of nearly 2. Beyond that the speedup drops to around 1.5 (this starts at inputs of 4,194,304 integers). For all sizes, if I comment out either of the two parallel sections, I get exactly half of the full sequential running time (as expected). The two functions work on contiguous, disjoint memory blocks.
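For reference, sum() here is just a straight accumulation over a block of ints, roughly like this sketch (the exact signature is illustrative, not my actual code):

/* Illustrative: sum n ints starting at a. */
long long sum(const int *a, int n)
{
    long long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}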
You may be experiencing an aliasing problem. The Xeon has a cache that is sensitive to modulo-64KB addressing. If the two threads are processing data that is separated by exactly 64KB, the processors will experience an adverse interaction. Try putting a loop around your sections so that each call to sum processes 32KB per gulp:
iGulp = 32768 / sizeof(a[0]);          /* elements per 32KB gulp */
for (i = 0; i + 2*iGulp <= n; i += 2*iGulp) {
    #pragma omp parallel sections
    {
        #pragma omp section
        sum1 += sum(a + i, iGulp);         /* first 32KB gulp   */
        #pragma omp section
        sum2 += sum(a + i + iGulp, iGulp); /* next 32KB gulp    */
    }
}
Thanks for the reply.
Another question: in a dual-CPU setup, do Xeons in effect have a unified cache (i.e., do both CPUs have identical lines in their caches)?
I've run into 64kB aliasing in other sequential applications, so I wrote the OpenMP code to avoid it (to the best of my knowledge). Basically, if the two threads are executing in lock-step, they shouldn't be concurrently accessing addresses that are a multiple of 64kB apart. I don't have Thread Profiler with my VTune, but I've tested the idea sequentially (with two array accesses per loop iteration, separated by the same offset as in the parallel code), and the speedup was 3-fold. I'll have to try your approach.
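The sequential test looked roughly like the sketch below (names are illustrative): two reads per iteration, separated by the same offset the two parallel sections use, so any 64kB aliasing would show up even in a single thread:

/* Mimic the two threads' access pattern in one thread: read a[i] and
   a[i + n/2] each iteration, the same separation the parallel code uses. */
long long aliasing_test(const int *a, int n)
{
    long long sum1 = 0, sum2 = 0;
    for (int i = 0; i < n / 2; i++) {
        sum1 += a[i];
        sum2 += a[i + n / 2];
    }
    return sum1 + sum2;
}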
0 x 64KB is an integral multiple of 64KB.
The aliasing is a problem with a shared cache, such as two threads running in the same core on different HT virtual processors. I am not familiar (I would have to look it up) with whether the dual-core Xeon uses a shared cache (marketing literature seems to indicate it is not shared). But when you have two separate chip carriers, or two separate chips on one carrier (something new), the caches are physically separate and the 64KB aliasing should not be a problem.

Xeons do come with HT as well. I believe there is a way to turn off HT in the motherboard CMOS setup. You can also experiment with processor affinity and set a hard affinity so you only run on one of the processors in each HT set. So if you have a dual-CPU (2-chip) system but see 4 processors in Task Manager, restrict the application threads to one of the 2 virtual processors per chip. I am not certain, but I believe processor numbers 0/1 are the HT pair for the first physical processor, 2/3 the pair for the second, and so on. If you have HT enabled for one core and not the other, I have no idea what the processor number sequence would be. Once you get the code inserted to set the processor affinity mask, it is easy to run some tests.
To set the processor affinity you will have to make a Win32 call. Google site:microsoft.com to find the function call.
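For example, something like this sketch using SetProcessAffinityMask (the mask assumes the 0/1, 2/3 numbering described above, so verify that on your machine):

#include <windows.h>

/* Restrict the process to logical processors 0 and 2, i.e. one HT
   sibling from each physical package (assuming 0/1, 2/3 numbering). */
BOOL restrict_to_one_sibling_per_package(void)
{
    DWORD_PTR mask = ((DWORD_PTR)1 << 0) | ((DWORD_PTR)1 << 2);
    return SetProcessAffinityMask(GetCurrentProcess(), mask) != 0;
}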
Caveat: if Hyper-Threading is enabled, OpenMP will not know that your app has set the processor affinity. Therefore you may experience problems if, say, OpenMP schedules 4 processors but your app is restricting it to 2 of the 4. This will take some experimentation.
Jim Dempsey
I'm not sure if we're talking about the same thing. My setup is two separate physical Xeon CPUs (single-core, HT not enabled). So, I'm talking about OpenMP threads running on separate CPUs, no hyperthreading.
When I run my application sequentially, I see a linear increase in time, which is expected since summing n numbers is O(n). When I run the two-threaded OpenMP version of the code, I see a nearly two-fold speedup up to a certain n, then the speedup suddenly drops to 1.5 or lower. The problematic n causes no aberrations in the observed sequential time, so my interpretation is that the problem is not caused by the size of the data set itself.
That's why I figured it's some kind of interaction between the caches. 0 x 64kB is a multiple of 64kB, but it also means the line is already sitting in the cache, so it should not cause an eviction, unless I'm missing something. I'm still not clear on whether the two physical CPUs have the same contents in their caches. Can someone point me to a document that discusses that?
Assuming you use allocated arrays: what happens if you allocate larger than the point where the slowdown occurs, and then, with that larger allocation, run the tests to determine where the slowdown is? In other words, keep the same memory placement for all tests.
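Something like this sketch (run_parallel_sum, min_n, and max_n just stand in for your existing timed test):

#include <stdlib.h>

extern void run_parallel_sum(int *a, size_t n);  /* your existing timed OpenMP test */

/* Allocate once, past the size where the slowdown appears, so the
   buffer's placement and alignment stay fixed while n varies per run. */
void sweep_sizes(size_t min_n, size_t max_n)
{
    int *a = malloc(max_n * sizeof *a);
    for (size_t n = min_n; n <= max_n; n *= 2)
        run_parallel_sum(a, n);
    free(a);
}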
Jim Dempsey
