OpenMP and SSE3

flydsp · 03-28-2009

Is it possible to program with both OpenMP and SSE3 to speed things up?
For example:

#include <omp.h>    /* omp_get_thread_num */
#include <stdio.h>  /* printf */

int main (int argc, char *argv[]) {
    int th_id, nthreads;
    #pragma omp parallel private(th_id)
    {
        th_id = omp_get_thread_num();
        printf("Hello World from thread %d\n", th_id);
        #pragma omp barrier
        // SSE3 code
    }
    return 0;
}

thanks

srimks · 03-29-2009

I asked the same question, in theoretical terms, at http://software.intel.com/en-us/forums/showthread.php?t=64152 but to date haven't received any correct answers or suggested approaches.

I will certainly follow your thread.

~BR

robert-reed · 03-29-2009

There's nothing theoretically that precludes the use of vector instructions within a parallel threaded region, but there may be practical considerations, in terms of the particular architecture you intend to use and the nature of the algorithm you implement, as to whether such a scheme would actually increase the performance or just increase the overhead. I haven't actually tried writing any code, but I could easily imagine an OpenMP work-sharing construct that contains Intel SSE intrinsics to drive vector computation. Will it run any faster? The practical complexity of combining an arbitrary algorithm with a range of architectures that may differ in their vector and memory performance often means that the simplest way to find the answer is to try it, verify that it's working the way you think it is, and measure whether it performs better than the alternatives.

One problem with the code fragment above is that it contains no work-sharing construct; without a loop construct or a sections construct, every thread in the team simply executes the same code inside the parallel region, so no work is divided among the threads. In that context, the barrier construct doesn't do a whole lot.
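
A minimal sketch of the kind of work-sharing construct described here, with SSE intrinsics in the loop body. The function name `scale_add` and its parameters are hypothetical, chosen only for illustration; unaligned loads and stores are used so that no alignment assumption is needed. Whether this beats the compiler's own vectorization is something to measure, not assume.

```c
#include <stddef.h>
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_loadu_ps, ... */

/* Hypothetical helper: out[i] = a[i] * scale + b[i] for i in [0, n). */
void scale_add(const float *a, const float *b, float *out,
               size_t n, float scale)
{
    const __m128 vscale = _mm_set1_ps(scale);  /* read-only, safe to share */
    const long nvec = (long)(n / 4);           /* number of full 4-float blocks */

    /* Work-sharing loop: the blocks are divided among the threads. */
    #pragma omp parallel for
    for (long blk = 0; blk < nvec; ++blk) {
        const size_t i = (size_t)blk * 4;
        __m128 va = _mm_loadu_ps(a + i);       /* declared in the body: private */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(_mm_mul_ps(va, vscale), vb));
    }

    /* Scalar tail for the last n % 4 elements. */
    for (size_t i = (size_t)nvec * 4; i < n; ++i)
        out[i] = a[i] * scale + b[i];
}
```

With GCC this would typically be built with `-fopenmp`; without that switch the pragma is ignored and the loop simply runs on one thread.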

Thomas_W_Intel · 03-30-2009

I have successfully combined SSE intrinsics with OpenMP. Simply use the intrinsic inside the parallel section. (As Bob already pointed out, your example is not executed in parallel.)

#pragma omp parallel for
for (int i=0; i result = (unsigned short) _mm_movemask_epi8(_mm_cmpeq_epi8(vector,CmpValue));

Beware that the speed-up of parallel code is often lower if the (serial) code is already highly optimized. You might therefore see little or no benefit from OpenMP if your loop is very short or if you are already bandwidth limited.

I would move the distinction of different platforms outside of the loop to avoid replicating it in parallel threads.
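
The loop in this post lost its bounds and part of its body, so the following is only a sketch of the same idea, not a reconstruction of the original code. It assumes the goal is to count bytes equal to a given value; `count_bytes` and its parameters are hypothetical names. The two intrinsics are the ones used in the snippet above.

```c
#include <stddef.h>
#include <emmintrin.h>   /* SSE2: _mm_cmpeq_epi8, _mm_movemask_epi8 */

/* Hypothetical helper: count the bytes in data[0..n) equal to value. */
size_t count_bytes(const unsigned char *data, size_t n, unsigned char value)
{
    const __m128i cmp_value = _mm_set1_epi8((char)value);
    const long nvec = (long)(n / 16);          /* full 16-byte blocks */
    long count = 0;

    /* Each thread accumulates a private count; OpenMP sums them at the end. */
    #pragma omp parallel for reduction(+:count)
    for (long blk = 0; blk < nvec; ++blk) {
        __m128i vec = _mm_loadu_si128(
            (const __m128i *)(data + (size_t)blk * 16));
        /* One mask bit per byte lane that compared equal. */
        unsigned mask =
            (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(vec, cmp_value));
        while (mask) {                         /* count the set bits */
            mask &= mask - 1;
            ++count;
        }
    }

    /* Scalar tail for the last n % 16 bytes. */
    for (size_t i = (size_t)nvec * 16; i < n; ++i)
        count += (data[i] == value);

    return (size_t)count;
}
```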

Kind regards
Thomas

jimdempseyatthecove · 03-30-2009

From my experience (~4 years of mixing OpenMP and SSE3) the preference for coding is to do both.

Optimize your vector code (SSE3) first, then optimize for multiple threads (OpenMP) second. Multi-threaded code works best when any one thread does not saturate the memory subsystem.

OpenMP parallelization works better as you move the start/stop of the parallel regions to outer layers of the code. In code with relatively small loops, divide the work up such that each thread can work on different (non-loop) parts of the problem at the same time. In some cases, consider restructuring the work so that it is performed in a pipelined manner.
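
As a rough illustration of giving each thread a different (non-loop) part of the problem, the sketch below runs two independent SSE routines in separate OpenMP sections. All function names are hypothetical, and the choice of a sum and a maximum is only an assumption made to keep the example small. Both sections read the same array, so on a bandwidth-limited machine the gain may be small.

```c
#include <stddef.h>
#include <xmmintrin.h>   /* SSE intrinsics */

/* Sum of x[0..n) using 4-wide adds, then a scalar tail. */
static float sse_sum(const float *x, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(x + i));
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float s = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < n; ++i)
        s += x[i];
    return s;
}

/* Maximum of x[0..n); requires n >= 1. */
static float sse_max(const float *x, size_t n)
{
    __m128 acc = _mm_set1_ps(x[0]);
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        acc = _mm_max_ps(acc, _mm_loadu_ps(x + i));
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float m = lanes[0];
    for (int k = 1; k < 4; ++k)
        if (lanes[k] > m) m = lanes[k];
    for (; i < n; ++i)
        if (x[i] > m) m = x[i];
    return m;
}

/* Each section is executed once, by whichever thread picks it up. */
void sum_and_max(const float *x, size_t n, float *sum_out, float *max_out)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        *sum_out = sse_sum(x, n);

        #pragma omp section
        *max_out = sse_max(x, n);
    }
}
```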

Jim Dempsey

TimP · 03-30-2009

When this post went up, I wondered why SSE3 was specified, with no indication of whether the code was suitable for SSE3 (as opposed to SSE2, SSE4, ...), and no indication of whether the idea was to avoid using a vectorizing compiler and, if so, what the motivation was.
Putting vectorization optimizations in place first, as Jim recommends, puts you on the road toward localizing the data in each thread.
I wasn't certain the poster wanted to hear again about this idea, which goes back at least 20 years, to the slogan "concurrent outer, vector inner."
On some of the older CPUs, my test cases show less than 20% advantage for a vectorizing compiler over a non-vectorizing OpenMP compiler such as MSVC9, but the same cases show a much bigger combined speedup for vectorization and threading on the latest CPUs. As I posted an extreme example recently of the use of OpenMP plus SSE intrinsics (SSE or SSE4, depending on the compiler switch), where the combined speedup is 15x on Core i7, I had to assume that was not of interest here. Even in that case, I set the OpenMP if clause to keep it down to 1 thread until the inner and outer loop counts both exceed 100.
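
A small sketch of "concurrent outer, vector inner" with an `if` clause, assuming a row-major matrix-vector product as the workload. This is not the example referred to in the post; `matvec` and its parameters are hypothetical, and the threshold of 100 is simply taken from the description above.

```c
#include <xmmintrin.h>   /* SSE intrinsics */

/* Hypothetical helper: y = A * x, where A is rows x cols in row-major order. */
void matvec(const float *A, const float *x, float *y, long rows, long cols)
{
    /* Outer loop is threaded, but only when both loop counts exceed 100;
       below that the region runs on a single thread. */
    #pragma omp parallel for if(rows > 100 && cols > 100)
    for (long r = 0; r < rows; ++r) {
        const float *row = A + r * cols;
        __m128 acc = _mm_setzero_ps();
        long c = 0;

        /* Inner loop is vectorized: 4 products per iteration. */
        for (; c + 4 <= cols; c += 4)
            acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(row + c),
                                             _mm_loadu_ps(x + c)));

        /* Horizontal sum of the 4 lanes, then the scalar tail. */
        float lanes[4];
        _mm_storeu_ps(lanes, acc);
        float s = lanes[0] + lanes[1] + lanes[2] + lanes[3];
        for (; c < cols; ++c)
            s += row[c] * x[c];

        y[r] = s;
    }
}
```

Each thread works on whole rows, so its accumulator and temporaries stay local to that thread.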

magicfoot · 07-15-2011

Hi TimP,

Would you mind republishing that OpenMP code example using multiple SSE units, please (as described above)?

I have devised OpenMP code that should use the SSE unit on every core of a Nehalem CPU, but I find that it seems to bottleneck at the SSE. I suspect that I need to address the SSE on each core in a different manner and have done so by assigning different variable names for the SSE in each thread, i.e. A in thread 1, B in thread 2, etc. There is still a bottleneck.

This was not necessary with my MPI equivalent, which is running as it should and deploying the SSE on each core within the CPU.

TimP · 07-16-2011

The icc and gcc compilers automatically promote certain intrinsics to SSE4 when the compiler switch is set. In the following example, it's not necessary to specify the optimization of _mm_set_ps for SSE4, but it is necessary to write an AVX-256 version explicitly. If you use intrinsics efficiently, it's certainly possible to encounter stalls in the sharing of floating point units between hyperthreads on the same core, such that peak performance is reached at 1 thread per core.
This example has become extremely ugly with addition of a conditional compilation branch for AVX:

[bash]#if defined __AVX__
#pragma omp parallel for if(i__2 > 103)
      for (i__ = 1; i__ <= i__2; i__ += 8) {
          int k = i__ * i__3 - i__3;
          __m256 tmp = _mm256_loadu_ps(&bb[i__ + bb_dim1]);
          for (int j = 2; j <= i__3; ++j){
              __m256 tmp1 = _mm256_set_ps(cdata_1.array[k+7*i__3],
                  cdata_1.array[k+6*i__3],cdata_1.array[k+5*i__3],
                  cdata_1.array[k+4*i__3],cdata_1.array[k+3*i__3],
                  cdata_1.array[k+2*i__3],cdata_1.array[k+1*i__3],
                  cdata_1.array[k+0*i__3]);
              tmp=_mm256_add_ps(tmp,_mm256_mul_ps(tmp1,
               _mm256_loadu_ps(&cc[i__ + j * cc_dim1])));
              // this will break if 32-byte alignment isn't supported
              _mm256_store_ps(&bb[i__ + j * bb_dim1],tmp);
              ++k;
              }
          }
#else
#if defined __SSE2__
#pragma omp parallel for if(i__2 > 103)
      for (i__ = 1; i__ <= i__2; i__ += 4) {
          int k = i__ * i__3 - i__3;
          __m128 tmp = _mm_loadu_ps(&bb[i__ + bb_dim1]);
          for (int j = 2; j <= i__3; ++j){
              __m128 tmp1 = _mm_set_ps(cdata_1.array[k+3*i__3],
                  cdata_1.array[k+2*i__3],cdata_1.array[k+1*i__3],
                  cdata_1.array[k+0*i__3]);
              __m128 tmp2 = _mm_loadu_ps(&cc[i__ + j * cc_dim1]);
              tmp=_mm_add_ps(tmp,_mm_mul_ps(tmp1,tmp2));
              _mm_store_ps(&bb[i__ + j * bb_dim1],tmp);
              ++k;
              }
          }
#else
#pragma omp parallel for if(i__2 > 103)
      for (i__ = 1; i__ <= i__2; ++i__ ) {
          int k = i__ * i__3 - i__3;
          for (int j = 2; j <= i__3; ++j)
              bb[i__ + j * bb_dim1] = bb[i__ + (j - 1) * bb_dim1] +
                cdata_1.array[k++] * cc[i__ + j * cc_dim1];
        }
#endif
#endif[/bash]

This case might appear to leave open the possibility of a gain from hyperthreading, as the latency of the add instruction leaves plenty of cycles open for sharing the floating point unit. However, on Sandy Bridge AVX, with 8 scalar float loads and 1 _mm256_loadu_ps which has to be split by the hardware when crossing the 128-bit path from L2 (_mm_store_ps is split for 128-bit path to fill buffer), performance appears to be limited by data rate bottlenecks. As Nehalem can issue only 1 32-bit load per cycle, this code is limited by data rate there as well, even with high L3 cache locality maintained by setting KMP_AFFINITY.
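
The comment in the AVX branch warns that the aligned store breaks when the address is not suitably aligned. One hedged way to guard against that, shown here for the 128-bit case only, is to test the address and fall back to the unaligned store. The helper names `is_aligned16` and `store4` are hypothetical and not part of the code above; the extra branch has a cost of its own, so this is a sketch to measure rather than a recommendation.

```c
#include <stdint.h>      /* uintptr_t */
#include <xmmintrin.h>   /* _mm_store_ps, _mm_storeu_ps */

/* Nonzero if p sits on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Store 4 floats, using the aligned form only when it is safe. */
void store4(float *dst, __m128 v)
{
    if (is_aligned16(dst))
        _mm_store_ps(dst, v);    /* faults on a misaligned address */
    else
        _mm_storeu_ps(dst, v);   /* works for any address */
}
```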

magicfoot · 07-19-2011

Thanks for reposting that sample. I differed in where I had defined my variables. Once they were more local, as per the example, it ran very well. I have included the speedups achieved with the use of SSE and OpenMP on each core of the Nehalem CPU. I assume that the drop in speedup as the core count increases is possibly due to memory bandwidth saturation.

Cores   Speedup (i.e. algorithm using OpenMP only / OpenMP with SSE)
1       1.70
2       1.65
4       1.59
6       1.50
8       1.38

With the use of the AVX in a similar manner I will be knocking on the door of my GTX480's performance for this algorithm.
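
One plausible reading of "more local" variables, sketched under that assumption: a vector temporary declared outside the parallel loop is shared between threads by default, whereas one declared inside the loop body, or listed in a `private` clause, gives each thread its own copy. The function names below are hypothetical and the squaring operation is only a placeholder for real work.

```c
#include <xmmintrin.h>   /* SSE intrinsics */

/* Variant 1: the temporary is declared inside the loop body,
   so every thread automatically has its own copy. */
void square_local(float *x, long n)
{
    #pragma omp parallel for
    for (long i = 0; i < n - 3; i += 4) {
        __m128 v = _mm_loadu_ps(x + i);
        _mm_storeu_ps(x + i, _mm_mul_ps(v, v));
    }
    for (long i = n - n % 4; i < n; ++i)   /* scalar tail */
        x[i] *= x[i];
}

/* Variant 2: the temporary is declared outside the loop, so it must be
   made private explicitly; without the clause it would be shared and
   the threads would overwrite each other's value. */
void square_private(float *x, long n)
{
    __m128 v;
    #pragma omp parallel for private(v)
    for (long i = 0; i < n - 3; i += 4) {
        v = _mm_loadu_ps(x + i);
        _mm_storeu_ps(x + i, _mm_mul_ps(v, v));
    }
    for (long i = n - n % 4; i < n; ++i)   /* scalar tail */
        x[i] *= x[i];
}
```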