Here is my simple program.
#include <windows.h>
#include <stdio.h>

int main()
{
    LARGE_INTEGER f, t1, t2;
    int s = 0;
    QueryPerformanceFrequency( &f );
    QueryPerformanceCounter( &t1 );
    int i;
#pragma omp parallel for reduction(+:s)
    for ( i = 0; i < 1000000000; i++ )
    {
        s += i / 3178;
    }
    QueryPerformanceCounter( &t2 );
    printf( "%d %lf\n", s, static_cast<double>( t2.QuadPart - t1.QuadPart ) / f.QuadPart );
    return 0;
}
I am using VS2005. When I compile it with/without the omp pragma I get times of 1.38/1.94 respectively. I wonder why it speeds up by only 40% and not by 100%. My CPU is a Core 2 Duo E6300. I have seen the same behavior on an AMD X2...
Thank you. 8-)
Hi,
Hmm, it's not a bandwidth problem. I can only think that maybe the division operation you do does not take the same amount of time for each calculation. Division algorithms in hardware can vary in the time they take.
My suggestion to test this is to try schedule(dynamic, 1000) and see if it improves. Of course, if you use the Thread Profiler you'll get a concrete view of what the problem is.
Aaron
If you are adventuresome, open a disassembly window and see if the reduction operator +:s is being performed inside the loop or if a temp is used within the loop.
Jim
Caa,
Try adding schedule(static) to the #pragma.
#pragma omp parallel for schedule(static) reduction(+:s)
Jim Dempsey
Jim,
I think that, for a beginner, the general principles are the most important thing, and the details come later.
For a parallel architecture with a shared-memory model, the size of the cache and its competent use are important. For BLAS3 this is achieved by competent programming. For example, my I BLAS3 is much faster than the BLAS3 from Intel MKL on IA32.
For BLAS2 and other computational methods where the cache cannot be used effectively, what matters is both competent programming (my I BLAS2 is much faster than the BLAS2 from Intel MKL on IA32 and EM64T) and the speed of main memory. For example, when using BLAS2, Intel MKL manages to keep only one core busy.
(see my page: http://www.thesa-store.com/products)
Yurii
Yurii,
Please take the time to look at the first message in this thread. It contains a sample program with a FOR loop holding one integer expression statement. The FOR loop is parallelized (assuming 2 threads on my part) with a reduction operator. The index i, for each thread, should be registerized; the local copy of "s" should also be registerized; and the entire FOR loop should fit into the instruction cache (a few tens of bytes).
Therefore the user's question of why there was not a 2x speedup was quite valid. The instruction streams, once in cache, should not interfere with each other, and all the rest of the code should be using registers. Therefore a ~2x speedup would be expected... but was not observed.
There are a few things that could account for this.
1) Something else is sucking up processor resources (I think the user is savvy enough to eliminate this).
2) The threads interfere with one another. This could be due to how often the reduction occurs. If the #pragma contained "schedule(static,1)" then there would be 0.5E+9 potential collisions when performing the reduction on the sum s. However, if the #pragma contained "schedule(static)" then each thread's FOR loop has only one potential collision when performing the reduction on the sum s.
3) The threads do not interfere with one another, but the computational loads are not balanced between the threads. This can occur with unfavorable scheduling choices of various types. E.g., with 2 threads and 1000000000 iterations, schedule(static,333333333) would result in 2 processors at 100% for 50% of the time (the first 2/3 of the work) and one processor at 100% for the other 50% of the time (the remaining 1/3 of the work). The end result would be execution taking 66% of the single-thread time. schedule(dynamic[,chunk]) may have similar characteristics, as may the other forms of schedule.
The default schedule is implementation dependent and can be overridden with environment variables. The original post did not contain enough information to determine the schedule method.
This thread has nothing to do with BLAS - please refrain from plugging your I BLAS3 product as it is not helping this user.
Respectfully yours,
Jim Dempsey
Hi all,
so I ran the same code on my Dual Xeon 51xx (dual-core, 3.0 GHz) machine and I get perfect scalability, from 1.39 down to 0.39. The only problem I could catch was perhaps your calculation of the time. I don't know whether the static_cast you used was a typo or not, but below I used (double). Anyway, can you try this code and see if you get the same problem?
Here is the code I used, to be exact:
#include "stdafx.h"
#include <windows.h>
#include <stdio.h>

#define MAX 1000000000

int _tmain(int argc, _TCHAR* argv[])
{
    LARGE_INTEGER f, t1, t2;
    int s = 0;
    QueryPerformanceFrequency( &f );
    QueryPerformanceCounter( &t1 );
    int i;
#pragma omp parallel for reduction(+:s) schedule(static)
    for ( i = 0; i < MAX; i++ )
    {
        s += i / 3178;
    }
    QueryPerformanceCounter( &t2 );
    printf( "%d %lf\n", s, (double)( t2.QuadPart - t1.QuadPart ) / (double)f.QuadPart );
    return 0;
}
accoday,
I took the liberty of modifying your program to enclose the test in a loop that runs from 1 up to the number of cores on my system. This gives me results for 1, 2, 3, and 4 cores in one test run.
Changing the loop count to 2000000000 and running in 32-bit mode on my x64 server, I read:
cores   time        multiplier
1       4.687090
2       2.360260    1.9858363
3       1.579181    2.96805116
4       1.192740    3.92968291
Close to linear scaling.
Using MS VC++. Except for an immediate constant (the inverse of 3178), the other variables in the compute loop were registerized; i.e., there should be no cache conflicts until the loop ends.
Jim Dempsey
Caa,
#include <windows.h>
#include <stdio.h>
#include <omp.h>

#define NUM_THREADS 2 /* set to the thread count you want to test */

int main()
{
    LARGE_INTEGER f, t1, t2;
    int s = 0;
    QueryPerformanceFrequency( &f );
    QueryPerformanceCounter( &t1 );
    int i;
    omp_set_num_threads( NUM_THREADS );
#pragma omp parallel for reduction(+:s)
    for ( i = 0; i < 1000000000; i++ )
    {
        s += i / 3178;
    }
    QueryPerformanceCounter( &t2 );
    printf( "%d %f\n", s, (double)( t2.QuadPart - t1.QuadPart ) / (double)f.QuadPart );
    return 0;
}
