Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Why (in my simple case) doesn't OpenMP provide a 2x performance boost?

caa
Beginner
177 Views
Here is my simple program.

#include <windows.h>
#include <stdio.h>

int main()
{
    LARGE_INTEGER f, t1, t2;

    int s = 0;

    QueryPerformanceFrequency( &f );
    QueryPerformanceCounter( &t1 );

    int i;
    #pragma omp parallel for reduction(+:s)
    for ( i = 0; i < 1000000000; i++ )
    {
        s += i / 3178;
    }

    QueryPerformanceCounter( &t2 );

    printf( "%d %lf ", s, static_cast<double>( t2.QuadPart - t1.QuadPart ) / f.QuadPart );
}

I am using VS2005. When I compile it with/without the omp pragma I get times of 1.38/1.94 correspondingly. I wonder why it speeds up only by 40% and not by 100%. My CPU is a Core 2 Duo E6300. I've seen the same behavior on an AMD X2...
Thank you. 8 - )
11 Replies
Aaron_C_Intel
Employee

Hi,

Hmm, it's not a bandwidth problem. My only thought is that the division operation may not take the same amount of time for each calculation; division algorithms in hardware can vary in latency depending on the operands.

My suggestion to test this is to try schedule(dynamic, 1000) and see if it improves. Of course, if you use the Thread Profiler you'll get a concrete view of what the problem is.

Aaron

jimdempseyatthecove
Black Belt

If you are adventuresome, open a disassembly window and see if the reduction operator +:s is being performed inside the loop or if a temp is used within the loop.

Jim

abcd_qmost
Beginner
For the shared-memory model and your type of loop, that is a good result.
abcd_qmost
Beginner
If you wish to improve results for this type of calculation on a shared-memory machine, irrespective of the number of cores, increase the memory frequency (see: http://www.thesa-store.com/products)
Yurii
jimdempseyatthecove
Black Belt

Caa,

Try adding schedule(static) to the #pragma.

#pragma omp parallel for schedule(static) reduction(+:s)

Jim Dempsey

abcd_qmost
Beginner

Jim,

I think that for a beginner the most important thing is to grasp the general
principles first, and only then the details.
For a parallel architecture with the shared-memory
model, the size of the cache and its competent use are important.
For BLAS3 this is achieved by careful programming.
For example, my BLAS3 is much faster than the BLAS3 from Intel MKL on IA32.
For BLAS2 and other computational methods, where the cache cannot be used
effectively, both careful programming (my BLAS2 is much faster than the
BLAS2 from Intel MKL on IA32 and EM64T) and the frequency of main
memory are important. For example, Intel MKL uses only one core for BLAS2.
(see my page:
http://www.thesa-store.com/products)

Yurii

jimdempseyatthecove
Black Belt

Yruii,

Please take the time to look at the first message in this thread. It contains a sample program with a FOR loop holding one integer expression statement. The FOR loop is parallelized (assuming 2 threads on my part), with a reduction operator. The index i, for each thread, should be registerized; the local copy of "s" should also be registerized; and the entire for loop should fit into the instruction cache (a few tens of bytes).

Therefore the user's question of why not a 2x speedup was quite valid. The instruction streams, once in cache, should not interfere with each other, and all the rest of the code should be using registers. Therefore a ~2x speedup would be expected, but was not observed.

There are a few things to account for this.

1) Something else is sucking up processor resources (I think the user is savvy enough to eliminate this).

2) The threads interfere with one another. This could be due to how often the reduction occurs. If the #pragma contained "schedule(static,1)" there would be 0.5E+9 potential collisions when performing the reduction on the sum s. However, if the #pragma contained "schedule(static)", each thread's loop has only one potential collision when performing the reduction on s.

3) The threads do not interfere with one another, but the computational loads are not balanced between the threads. This can occur with unfavorable selections of scheduling of various types. E.g., with 2 threads and 1000000000 iterations, schedule(static,333333333) would result in 2 processors at 100% for 50% of the time (2/3rds of the work) and one processor at 100% for the other 50% of the time (the remaining 1/3rd of the work). The end result would be execution taking 66% of the single-thread time. schedule(dynamic[,chunk]) may have similar characteristics, as may the other forms of schedule.

The default schedule is implementation dependent and overridable with environment variables. The original post did not contain enough information to determine the schedule method.

This thread has nothing to do with BLAS - please refrain from plugging your I BLAS3 product as it is not helping this user.

Respectfully yours,

Jim Dempsey

abcd_qmost
Beginner

Jim,

I agree that BLAS3 cannot help with the given problem. I was comparing BLAS2 and BLAS3, and explaining that, unlike BLAS3, it does not make sense to run BLAS2 on more than one core; Intel MKL follows this same strategy. I was trying to explain that the original example fundamentally cannot get close to a 100% speedup.
Yurii
Aaron_C_Intel
Employee

Hi all,

so I ran the same code on my dual Xeon 51xx (dual-core, 3.0 GHz) machine and I get perfect scalability, from 1.39 down to 0.39. The only problem I could find was in your calculation of the time. I don't know whether the incomplete static_cast was a typo or not, but below I used (double) instead. Anyway, can you try this code and see if you get the same problem?

Here is the code I used to be exact:

#include "stdafx.h"
#include <windows.h>
#include <stdio.h>

#define MAX 1000000000
//#define MAX 1000000000

int _tmain(int argc, _TCHAR* argv[])
{
    LARGE_INTEGER f, t1, t2;

    int s = 0;

    QueryPerformanceFrequency( &f );
    QueryPerformanceCounter( &t1 );

    int i;
    #pragma omp parallel for reduction(+:s) schedule(static)
    for ( i = 0; i < MAX; i++ ) {
        s += i / 3178;
    }

    QueryPerformanceCounter( &t2 );

    printf( "%d %lf ", s, (double)( t2.QuadPart - t1.QuadPart ) / (double)f.QuadPart );
}

jimdempseyatthecove
Black Belt

Aaron,

I took the liberty to modify your program to enclose the test in a loop to process from 1 to number of cores on my system. This gives me results for 1, 2, 3, 4 cores in one test run.

Changing loop count to 2000000000 and running in 32-bit mode on my x64 server I read

cores   time        multiplier
1       4.687090
2       2.360260    1.9858363
3       1.579181    2.96805116
4       1.192740    3.92968291

Close to linear scaling.

Using MS VC++. Except for an immediate constant (the inverse of 3178), the other variables in the compute loop were registerized, i.e., there should be no cache conflicts until the loop ends.

Jim Dempsey

abcd_qmost
Beginner

Caa,

Your basic mistake: the number of cores (threads) is not specified.
One more mistake: s needs to hold more than the maximum integer,
which is why the result turns out negative.
#include <windows.h>
#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 2
int main() {
    LARGE_INTEGER f, t1, t2;
    int s = 0;
    QueryPerformanceFrequency( &f );
    QueryPerformanceCounter( &t1 );
    int i;
    omp_set_num_threads( NUM_THREADS );
    #pragma omp parallel for reduction(+:s)
    for ( i = 0; i < 1000000000; i++ ) {
        s += i / 3178;
    }
    QueryPerformanceCounter( &t2 );
    printf( "%d %f ", s, (double)( t2.QuadPart - t1.QuadPart ) / (double)f.QuadPart );
}
NUM_THREADS=1:
-2086857720 1.583106
NUM_THREADS=2:
-2086857720 0.791723
As you can see, the scaling is remarkable.
Yurii