Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

parallelized code that runs significantly slower

david_livshin
Beginner

The following code ( kernel #9 of the Livermore loops benchmark suite; all data types, except i and n, are double ) runs ( on a 2.8 GHz Pentium 4, compiled and fully optimized with gcc 4.2.3 ) for about 5 seconds.


for ( i=0 ; i<n ; i++ )
{
    px[i][ 0] = dm28*px[i][12] + dm27*px[i][11] + dm26*px[i][10] +
                dm25*px[i][ 9] + dm24*px[i][ 8] + dm23*px[i][ 7] +
                dm22*px[i][ 6] + c0*( px[i][ 4] + px[i][ 5] )
                + px[i][ 2];
}
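
( For anyone who wants to try it, a self-contained sketch of the kernel follows; the array dimensions, initial values, coefficient values and repetition count are assumptions loosely following the standard Livermore source, not my actual harness: )

#include <stdio.h>
#include <time.h>

#define N    101      /* assumed: largest n in the Livermore driver */
#define REPS 1000000  /* assumed: arbitrary repetition count */

static double px[N][25];   /* assumed dimensions from the standard source */

int main( void )
{
    int     i, j, r, n = N;
    double  c0 = 0.5, dm22 = 0.1, dm23 = 0.2, dm24 = 0.3,
            dm25 = 0.4, dm26 = 0.5, dm27 = 0.6, dm28 = 0.7;
    clock_t t0, t1;

    for ( i = 0 ; i < N ; i++ )            /* arbitrary initialization */
        for ( j = 0 ; j < 25 ; j++ )
            px[i][j] = (double)( i + j ) / ( N + 25 );

    t0 = clock();
    for ( r = 0 ; r < REPS ; r++ )         /* repeat to get a measurable time */
    {
        for ( i = 0 ; i < n ; i++ )
        {
            px[i][ 0] = dm28*px[i][12] + dm27*px[i][11] + dm26*px[i][10] +
                        dm25*px[i][ 9] + dm24*px[i][ 8] + dm23*px[i][ 7] +
                        dm22*px[i][ 6] + c0*( px[i][ 4] + px[i][ 5] )
                        + px[i][ 2];
        }
    }
    t1 = clock();

    printf( "px[0][0] = %g\n", px[0][0] );  /* keep the result live */
    printf( "total: %.2f s\n", (double)( t1 - t0 ) / CLOCKS_PER_SEC );
    return 0;
}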


Parallelizing the kernel:


#pragma omp parallel for private(i)
for ( i=0 ; i<n ; i++ )
{
    px[i][ 0] = dm28*px[i][12] + dm27*px[i][11] + dm26*px[i][10] +
                dm25*px[i][ 9] + dm24*px[i][ 8] + dm23*px[i][ 7] +
                dm22*px[i][ 6] + c0*( px[i][ 4] + px[i][ 5] )
                + px[i][ 2];
}


generates an executable that runs for 170 seconds!


What are the reasons for such a drastic slowdown?

Is there a reference ( document ) that discusses/lists the cases when parallelization should be avoided?

Thank you in advance,

David Livshin

www.dalsoft.com


TimP
Honored Contributor III
This seems extreme, but you haven't given enough information. The last time I ran this with gcc (not parallel), on an early Pentium-M, it took 5 milliseconds average, for the largest n (== 101). If you are running the standard Livermore loops driver, and allow it to run more repetitions in an attempt to get repeatable timing for the parallel version, you should be looking at the time per loop, not the overall time.
I never tried forcing parallelism here, as the Intel compilers don't see it as a candidate for -parallel. Long loop lengths are needed to benefit from threading, when each thread gets a fraction of the original loop, even when the threads have separate floating point units and other resources available. The considerations for adding threaded parallelism in an artificial benchmark like this, which hits the same cache lines over and over, are different from real applications.
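If you do want to keep the pragma in place, one sketch of a way to avoid forking threads for short trip counts is the OpenMP if clause; the threshold below is an arbitrary placeholder, not a tuned number:

/* only go parallel when the trip count can amortize the fork/join cost;
   10000 is an arbitrary placeholder, not a measured threshold */
#pragma omp parallel for private(i) if( n > 10000 )
for ( i = 0 ; i < n ; i++ )
{
    px[i][ 0] = dm28*px[i][12] + dm27*px[i][11] + dm26*px[i][10] +
                dm25*px[i][ 9] + dm24*px[i][ 8] + dm23*px[i][ 7] +
                dm22*px[i][ 6] + c0*( px[i][ 4] + px[i][ 5] )
                + px[i][ 2];
}

With n == 101 the clause keeps the loop effectively serial.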
Pentium 4 came in both hyperthread and non-hyperthread versions. Even if you have hyperthreading, a benchmark like this would not speed up with threading even under ideal conditions, as it measures only the rate at which parallel floating point operations can be issued, and hyperthreading doesn't give you any additional floating point units. It could slow down, either from the notorious "64k aliasing" of the early P4, or from the normal operation of hardware prefetch, if you haven't disabled it. Hardware prefetch would cause data to be continually shuffled between caches as one thread grabs data from the region required by the other, with a false sharing situation between the last cache line used by one and the first used by the other.
In fact, it may run faster with HT disabled, in which case the threaded version ideally could run nearly as fast as the non-threaded.
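A quick experiment that separates plain threading overhead from the cache effects: check how many threads the runtime actually uses, then rerun the parallel binary with the environment variable OMP_NUM_THREADS set to 1. A minimal sketch using only standard OpenMP calls:

#include <omp.h>
#include <stdio.h>

int main( void )
{
    /* omp_get_max_threads() reports the default team size;
       OMP_NUM_THREADS in the environment overrides it */
    printf( "max threads: %d\n", omp_get_max_threads() );

    #pragma omp parallel
    {
        #pragma omp master
        printf( "threads in the parallel region: %d\n",
                omp_get_num_threads() );
    }
    return 0;
}

If the 170 seconds collapses with a single thread, the cost is the per-invocation fork/join plus the interaction between the two hardware threads; if it stays high, something else changed in the code the compiler generated.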
david_livshin
Beginner
> This seems extreme, but you haven't given enough information. The last time I ran this with gcc (not parallel), on an early Pentium-M, it took 5 milliseconds average, for the largest n (== 101). If you are running the standard Livermore loops driver, and allow it to run more repetitions in an attempt to get repeatable timing for the parallel version, you should be looking at the time per loop, not the overall time.

I am running a modified version of the benchmark that doesn't perform calibration and reports the total execution time of every loop.
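
( An aside for anyone reproducing the measurement: for threaded code, clock() accumulates the CPU time of all threads while omp_get_wtime() reports elapsed wall-clock time, so the choice of timer matters. A toy sketch of the difference, with an arbitrary busy loop: )

#include <omp.h>
#include <stdio.h>
#include <time.h>

int main( void )
{
    double  s = 0.0, w0, w1;
    clock_t c0, c1;
    int     i;

    c0 = clock();
    w0 = omp_get_wtime();

    /* arbitrary busy work, just to keep several threads occupied */
    #pragma omp parallel for reduction(+:s)
    for ( i = 0 ; i < 200000000 ; i++ )
        s += 1.0 / ( (double)i + 1.0 );

    w1 = omp_get_wtime();
    c1 = clock();

    printf( "wall %.2f s, cpu %.2f s (sum %.3f)\n",
            w1 - w0, (double)( c1 - c0 ) / CLOCKS_PER_SEC, s );
    return 0;
}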

> I never tried forcing parallelism here, as the Intel compilers don't see it as a candidate for -parallel. Long loop lengths are needed to benefit from threading, when each thread gets a fraction of the original loop, even when the threads have separate floating point units and other resources available. The considerations for adding threaded parallelism in an artificial benchmark like this, which hits the same cache lines over and over, are different from real applications.

Why do you think the compiler failed to parallelize the code? There are no dependencies that would prevent it, and the compiler, I guess, didn't classify it as an "artificial benchmark" as opposed to a "real application". What might be the criteria ( heuristics ) for deciding whether to parallelize code? Is there a reference ( document ) that discusses/lists the cases when parallelization should be avoided?
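
Is there a way to make the auto-parallelizer show its reasoning for this loop? Something like the following, if I remember the option spellings for the Intel compiler correctly ( they may well differ in your version; kernel9.c is just a placeholder name ):

icc -O3 -parallel -par-report3 -par-threshold0 kernel9.c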

> Pentium 4 came in both hyperthread and non-hyperthread versions. Even if you have hyperthreading, a benchmark like this would not speed up with threading even under ideal conditions, as it measures only the rate at which parallel floating point operations can be issued, and hyperthreading doesn't give you any additional floating point units. It could slow down, either from the notorious "64k aliasing" of the early P4, or from the normal operation of hardware prefetch, if you haven't disabled it. Hardware prefetch would cause data to be continually shuffled between caches as one thread grabs data from the region required by the other, with a false sharing situation between the last cache line used by one and the first used by the other.
> In fact, it may run faster with HT disabled, in which case the threaded version ideally could run nearly as fast as the non-threaded.


I didn't expect the parallel code to run faster ( for this particular case ) - what shocked me is how much slower it became. Obviously something ( that I would like to understand ) is happening at the hardware level.

David Livshin

www.dalsoft.com

