The following code ( kernel #9 of the Livermore loops benchmark suite; all data types except i and n are double ) runs for about 5 seconds ( on a Pentium 4 2.8 GHz, compiled and fully optimized with gcc 4.2.3 ).
for ( i=0 ; i<n ; i++ ) {
    px[i][ 0] = dm28*px[i][12] + dm27*px[i][11] + dm26*px[i][10] +
                dm25*px[i][ 9] + dm24*px[i][ 8] + dm23*px[i][ 7] +
                dm22*px[i][ 6] + c0*( px[i][ 4] + px[i][ 5] ) + px[i][ 2];
}
Parallelizing it:
#pragma omp parallel for private(i)
for ( i=0 ; i<n ; i++ ) {
    px[i][ 0] = dm28*px[i][12] + dm27*px[i][11] + dm26*px[i][10] +
                dm25*px[i][ 9] + dm24*px[i][ 8] + dm23*px[i][ 7] +
                dm22*px[i][ 6] + c0*( px[i][ 4] + px[i][ 5] ) + px[i][ 2];
}
produces an executable that runs for about 170 seconds!
What are the reasons for such a drastic slowdown?
Is there a reference ( document ) that discusses or lists the cases in which parallelization should be avoided?
Thank you in advance,
David Livshin
I never tried forcing parallelism here, as the Intel compilers don't see it as a candidate for -parallel. Long loop lengths are needed to benefit from threading, when each thread gets a fraction of the original loop, even when the threads have separate floating point units and other resources available. The considerations for adding threaded parallelism in an artificial benchmark like this, which hits the same cache lines over and over, are different from real applications.
Pentium 4 came in both hyperthread and non-hyperthread versions. Even if you have hyperthreading, a benchmark like this would not speed up with threading even under ideal conditions, as it measures only the rate at which parallel floating point operations can be issued, and hyperthreading doesn't give you any additional floating point units. It could slow down, either from the notorious "64k aliasing" of the early P4, or from the normal operation of hardware prefetch, if you haven't disabled it. Hardware prefetch would cause data to be continually shuffled between caches as one thread grabs data from the region required by the other, with a false sharing situation between the last cache line used by one and the first used by the other.
In fact, it may run faster with HT disabled, in which case the threaded version ideally could run nearly as fast as the non-threaded.
I am running a modified version of the benchmark that doesn't perform calibration and reports total execution time of every loop.
> I never tried forcing parallelism here, as the Intel compilers don't see it as a candidate for -parallel. Long loop lengths are needed to benefit from threading, when each thread gets a fraction of the original loop, even when the threads have separate floating point units and other resources available. The considerations for adding threaded parallelism in an artificial benchmark like this, which hits the same cache lines over and over, are different from real applications.
Why do you think the compiler failed to parallelize the code? There are no dependencies that might prevent it, and the compiler, I guess, didn't classify it as an "artificial benchmark" rather than a "real application". What might be the criteria ( heuristics ) for code parallelization? Is there a reference ( document ) that discusses or lists the cases in which parallelization should be avoided?
> Pentium 4 came in both hyperthread and non-hyperthread versions. Even if you have hyperthreading, a benchmark like this would not speed up with threading even under ideal conditions, as it measures only the rate at which parallel floating point operations can be issued, and hyperthreading doesn't give you any additional floating point units. It could slow down, either from the notorious "64k aliasing" of the early P4, or from the normal operation of hardware prefetch, if you haven't disabled it. Hardware prefetch would cause data to be continually shuffled between caches as one thread grabs data from the region required by the other, with a false sharing situation between the last cache line used by one and the first used by the other.
> In fact, it may run faster with HT disabled, in which case the threaded version ideally could run nearly as fast as the non-threaded.
I didn't expect the parallel code to run faster ( for this particular case ) - what shocked me is how much slower it became. Obviously something is happening at the hardware level that I would like to understand.
David Livshin