Is it possible to program with both OpenMP and SSE3 to speed things up?
For example:

#include <stdio.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    int th_id;

    #pragma omp parallel private(th_id)
    {
        th_id = omp_get_thread_num();
        printf("Hello World from thread %d\n", th_id);
        #pragma omp barrier
        // SSE3 code
    }
    return 0;
}

Thanks.
Quoting - flydsp@hotmail.com
Is it possible to program with both OpenMP and SSE3 to speed things up?
I will certainly be following this thread.
~BR
Quoting - flydsp@hotmail.com
Is it possible to program with both OpenMP and SSE3 to speed things up?
One dilemma for the code fragment above is that it contains no work-sharing construct; without a loop construct or a sections construct, every thread in the parallel region simply executes the same code redundantly, so the work is never actually divided. In that context, the barrier construct doesn't accomplish much.
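For illustration, here is a minimal sketch (with a made-up array and loop bound) of a work-sharing loop construct that actually divides the iterations among the threads:

#include <stdio.h>
#include <omp.h>

#define N 1000
static float a[N];

int main(void)
{
    #pragma omp parallel
    {
        /* The for construct splits the iterations across the
           threads of the enclosing parallel region. */
        #pragma omp for
        for (int i = 0; i < N; ++i)
            a[i] = 2.0f * (float)i;
    }
    printf("a[%d] = %f\n", N - 1, a[N - 1]);
    return 0;
}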
Quoting - Robert Reed (Intel)
One dilemma for the code fragment above is that it contains no work-sharing construct; without a loop construct or a sections construct, every thread in the parallel region simply executes the same code redundantly, so the work is never actually divided. In that context, the barrier construct doesn't accomplish much.
I have successfully combined SSE intrinsics with OpenMP; simply use the intrinsics inside the parallel section. (As Bob already pointed out, your example doesn't actually divide any work among the threads.)
#pragma omp parallel for
for (int i=0; i
Beware that the speed-up from parallelizing is often lower when the (serial) code is already highly optimized. You might therefore see little or no benefit from OpenMP if your loop is very short or if you are already bandwidth-limited.
I would move the distinction between platforms outside of the loop to avoid replicating it across the parallel threads.
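To make that concrete, here is a minimal sketch of intrinsics inside an OpenMP loop (the arrays and sizes are made up for illustration, not my actual code):

#include <stdio.h>
#include <xmmintrin.h>   /* SSE */
#include <omp.h>

#define N 1024
static float a[N], b[N], c[N];

int main(void)
{
    for (int i = 0; i < N; ++i) { a[i] = (float)i; b[i] = 2.0f; }

    /* Each thread processes a chunk of iterations; within an
       iteration the intrinsics operate on 4 floats at a time. */
    #pragma omp parallel for
    for (int i = 0; i < N; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));
    }

    printf("c[N-1] = %f\n", c[N - 1]);
    return 0;
}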
Kind regards
Thomas
From my experience (~4 years of mixing OpenMP and SSE3), the preferred approach is to do both.
Take care to optimize your vector code first (SSE3), then optimize for multiple threads (OpenMP) second. Multi-threaded code works best when no single thread saturates the memory subsystem.
OpenMP parallelization works better as you move the start/stop of the parallel regions to the outer layers of the code. In code with relatively small loops, divide the work up so that each thread can work on different (non-loop) parts of the problem at the same time. In some cases, consider restructuring to perform the work in a pipelined manner.
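A rough sketch of what moving the region outward can look like (the phases and arrays here are placeholders): one parallel region encloses several work-shared loops, so the same thread team is reused instead of being forked and joined around each loop.

#include <omp.h>

#define N 100000
static float x[N], y[N];

void process(void)
{
    /* One parallel region opened at the outer level; the two
       work-shared loops below reuse the same thread team rather
       than paying the fork/join cost around each loop. */
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < N; ++i)
            x[i] = 0.5f * (float)i;       /* first phase */

        #pragma omp for
        for (int i = 0; i < N; ++i)
            y[i] = x[i] * x[i];           /* second phase, same team */
    }
}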
Jim Dempsey
Quoting - jimdempseyatthecove
From my experience (~4 years of mixing OpenMP and SSE3), the preferred approach is to do both.
Putting vectorization optimizations in place first, as Jim recommends, puts you on the road toward localizing the data in each thread.
I wasn't certain the poster wanted to hear again about this idea, which goes back at least 20 years, to the slogan "concurrent outer, vector inner."
On some of the older CPUs, my test cases show less than a 20% advantage for a vectorizing compiler over a non-vectorizing OpenMP compiler such as MSVC9, but the same cases show a much bigger combined speedup from vectorization plus threading on the latest CPUs. I recently posted an extreme example of OpenMP plus SSE intrinsics (SSE or SSE4, depending on the compiler switch) where the combined speedup is 15x on Core i7, though I assumed that was not of interest here. Even in that case, I set the OpenMP if clause to keep it to one thread until the inner and outer loop counts both exceed 100.
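For reference, an if clause of that kind looks roughly like this (the array, bounds, and loop body are placeholders, not my actual example):

#include <omp.h>

#define MAXN 512
static float grid[MAXN][MAXN];

/* Caller guarantees ni, nj <= MAXN. */
void sweep(int ni, int nj)
{
    /* The if clause keeps execution on a single thread unless both
       trip counts exceed 100, so small problems skip the threading
       overhead (threshold taken from the description above). */
    #pragma omp parallel for if (ni > 100 && nj > 100)
    for (int i = 0; i < ni; ++i)
        for (int j = 0; j < nj; ++j)
            grid[i][j] = (float)(i + j);
}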
Hi TimP,
Would you mind republishing that OpenMP code example using SSE on multiple cores, please (as described above)?
I have devised OpenMP code that should use the SSE units on every core of a Nehalem CPU, but it seems to be bottlenecking at the SSE. I suspected that I needed to address the SSE on each core differently, so I assigned different variable names for the SSE data in each thread, i.e. A in thread 1, B in thread 2, etc. Still a bottleneck.
This was not necessary with my MPI equivalent, which runs as it should and deploys the SSE units on each core of the CPU.
The icc and gcc compilers automatically promote certain intrinsics to SSE4 when the corresponding compiler switch is set. In the following example, it's not necessary to specify an SSE4 optimization of _mm_set_ps, but it is necessary to write an AVX-256 version explicitly. If you use intrinsics efficiently, it's certainly possible to encounter stalls from the sharing of floating-point units between hyperthreads on the same core, such that peak performance is reached at one thread per core.
This example has become extremely ugly with the addition of a conditional-compilation branch for AVX:
[bash]
#if defined __AVX__
#pragma omp parallel for if(i__2 > 103)
    for (i__ = 1; i__ <= i__2; i__ += 8) {
        int k = i__ * i__3 - i__3;
        __m256 tmp = _mm256_loadu_ps(&bb[i__ + bb_dim1]);
        for (int j = 2; j <= i__3; ++j) {
            __m256 tmp1 = _mm256_set_ps(cdata_1.array[k+7*i__3],
                cdata_1.array[k+6*i__3], cdata_1.array[k+5*i__3],
                cdata_1.array[k+4*i__3], cdata_1.array[k+3*i__3],
                cdata_1.array[k+2*i__3], cdata_1.array[k+1*i__3],
                cdata_1.array[k+0*i__3]);
            tmp = _mm256_add_ps(tmp, _mm256_mul_ps(tmp1,
                _mm256_loadu_ps(&cc[i__ + j * cc_dim1])));
            // this will break if 32-byte alignment isn't supported
            _mm256_store_ps(&bb[i__ + j * bb_dim1], tmp);
            ++k;
        }
    }
#else
#if defined __SSE2__
#pragma omp parallel for if(i__2 > 103)
    for (i__ = 1; i__ <= i__2; i__ += 4) {
        int k = i__ * i__3 - i__3;
        __m128 tmp = _mm_loadu_ps(&bb[i__ + bb_dim1]);
        for (int j = 2; j <= i__3; ++j) {
            __m128 tmp1 = _mm_set_ps(cdata_1.array[k+3*i__3],
                cdata_1.array[k+2*i__3], cdata_1.array[k+1*i__3],
                cdata_1.array[k+0*i__3]);
            __m128 tmp2 = _mm_loadu_ps(&cc[i__ + j * cc_dim1]);
            tmp = _mm_add_ps(tmp, _mm_mul_ps(tmp1, tmp2));
            _mm_store_ps(&bb[i__ + j * bb_dim1], tmp);
            ++k;
        }
    }
#else
#pragma omp parallel for if(i__2 > 103)
    for (i__ = 1; i__ <= i__2; ++i__) {
        int k = i__ * i__3 - i__3;
        for (int j = 2; j <= i__3; ++j)
            bb[i__ + j * bb_dim1] = bb[i__ + (j - 1) * bb_dim1]
                + cdata_1.array[k++] * cc[i__ + j * cc_dim1];
    }
#endif
#endif
[/bash]
This case might appear to leave open the possibility of a gain from HyperThreading, since the latency of the add instruction leaves plenty of cycles free for sharing the floating-point unit. However, on Sandy Bridge with AVX, with 8 scalar float loads plus one _mm256_loadu_ps that the hardware has to split when it crosses the 128-bit path from L2 (the _mm_store_ps is likewise split into 128-bit pieces on the path to the fill buffer), performance appears to be limited by data-rate bottlenecks. Since Nehalem can issue only one 32-bit load per cycle, this code is data-rate limited there as well, even with high L3 cache locality maintained by setting KMP_AFFINITY.
Thanks for reposting that sample. Where I differed was in where I had defined my variables; once they were more local, as per the example, it ran very well. I have included the speedups achieved by using SSE with OpenMP on each core of the Nehalem CPU. I assume the degradation in speedup as the core count increases is owing to memory-bandwidth saturation.
Cores   Speedup (OpenMP only / OpenMP with SSE)
  1       1.70
  2       1.65
  4       1.59
  6       1.50
  8       1.38
Using AVX in a similar manner, I will be knocking on the door of my GTX480's performance for this algorithm.
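As an aside on the variable-scoping point: any __m128 temporaries declared inside the parallel region (or loop body) are automatically private to each thread, so there is no need for distinct variable names per thread. A minimal sketch of that pattern, with made-up arrays, using a per-thread SSE3 accumulator:

#include <pmmintrin.h>   /* SSE3, for _mm_hadd_ps */
#include <omp.h>

#define N 8192
static float a[N], b[N];

float dot(void)
{
    float total = 0.0f;
    #pragma omp parallel reduction(+:total)
    {
        /* acc is declared inside the parallel region, so each thread
           owns its own private copy. */
        __m128 acc = _mm_setzero_ps();
        #pragma omp for
        for (int i = 0; i < N; i += 4)
            acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(&a[i]),
                                             _mm_loadu_ps(&b[i])));
        /* horizontal sum of the 4 lanes (SSE3 hadd) */
        acc = _mm_hadd_ps(acc, acc);
        acc = _mm_hadd_ps(acc, acc);
        total += _mm_cvtss_f32(acc);   /* combined by the reduction */
    }
    return total;
}

Compile with OpenMP and SSE3 enabled (e.g. -fopenmp -msse3 with gcc).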