Community
cancel
Showing results for 
Search instead for 
Did you mean: 
New Contributor I
230 Views

OpenMP SIMD

Jump to solution

Hi,

I'm having trouble implementing #pragma omp simd. I obtain different results with stubs turned on and off. My code is as follows and I'm using Intel C++ 19.1 through Visual Studio 2019 on Windows 10:

// Map Pattern
// saxpy_simd.cpp

#include <iostream>
#include <chrono>
#include <omp.h>

void reportTime(const char* msg, std::chrono::steady_clock::duration span);

constexpr int N = 100;

void saxpy(float a, const float* x, float* y, int n) {
#pragma omp parallel
{
#pragma omp simd
for (int i = 0; i < n; ++i) {
    y[i] = a * x[i] + y[i];
}
}
}

int main() {
float* x = new float[N];
float* y = new float[N];
float a = 2.0f;

// initialize
for (int i = 0; i < N; i++)
x[i] = y[i] = 1.0f;

std::chrono::steady_clock::time_point ts, te;
ts = std::chrono::steady_clock::now();
saxpy(a, x, y, N);
te = std::chrono::steady_clock::now();

// sum y[i]
float s = 0.0f;
for (int i = 0; i < N; i++)
s += y[i];

std::cout << "Sum y[i] = " << s << std::endl;
reportTime("saxpy", te - ts);

delete[] x;
delete[] y;
}

// reportTime inserts the duration (span) into standard output
//
void reportTime(const char* msg, std::chrono::steady_clock::duration span) {
auto ms = std::chrono::duration_cast<std::chrono::nanoseconds>(span);
std::cout << msg << ": " <<
ms.count() << " nanoseconds" << std::endl;
}

with Qopenmp_stubs my output is

Sum y[i] = 300
saxpy: 100 nanoseconds

with Qopenmp my output is

Sum y[i] = 1700
saxpy: 6141900 nanoseconds

It is as if y[i] changes from 1 to 17 instead of from 1 to 3. My code works well with omp for without vectorization.

What am I missing?

Chris

 

0 Kudos

Accepted Solutions
New Contributor I
186 Views

Hi,

I have found the answer to my question:

#pragma omp simd

  for (i = 0; i < n; ++i)

    y[i] = a * x[i] + y[i];

 

Thanks,

Chris

View solution in original post

0 Kudos
4 Replies
Moderator
217 Views

Hi Chris,

Here in the code, you are getting wrong results while using OpenMP since all the threads launched by omp parallel() will access the same memory (y[i]) concurrently and override the results.

All the threads launched by omp parallel will run the complete block concurrently. Hence, for loop inside the omp parallel is redundant.

Pragma
Description
omp parallel

Specifies that a structured block should be run in parallel by a team of threads.

 

 

To achieve the desired result using omp parallel you can refer the below code snippet.

void saxpy(float a, const float* x, float* y, int n) {

omp_set_num_threads(N);

#pragma omp parallel 

  {

     int i = omp_get_thread_num();

     y[i] = a * x[i] + y[i];

  }

}

 

Instead, if you wanted to use OpenMP for a single for loop. It is recommended to use omp parallel for and it should be immediately above the targeted for loop.

omp parallel for

Provides an abbreviated way to specify a parallel region containing a single FOR construct.

 

Please find the below code:

void saxpy(float a, const float* x, float* y, int n) {

omp_set_num_threads(N);

#pragma omp parallel for simd

for (int i = 0; i < n; ++i) {

    y[i] = a * x[i] + y[i];

   }
}

Let us know if you need any help or face any errors.

Regards

Prasanth

0 Kudos
New Contributor I
202 Views

Thank you for drawing my attention to this. I'm actually looking for a way to demonstre the difference in performance on saxpy using different OpenMP constructs. I have serial, SPMD, Work Sharing, Work Sharing with SIMD, but can't get SIMD alone right (vectorization only). With Cilk gone, I'm after a purely OpenMP solution. In light of your note, is this the simplest construct

#pragma omp parallel num_threads(1) {

  #pragma omp for simd

  for (i = 0; i < n; +ii)

    y[i] = a * x[i] + y[i];

}

or is there a simpler form?

0 Kudos
New Contributor I
187 Views

Hi,

I have found the answer to my question:

#pragma omp simd

  for (i = 0; i < n; ++i)

    y[i] = a * x[i] + y[i];

 

Thanks,

Chris

View solution in original post

0 Kudos
Moderator
139 Views

Hi Chris,

 

Glad you have found the answer. Since you are trying to achieve vectorization only, that is the simplest construct.

We will be closing this thread for now. If you require any additional assistance from intel please raise a new thread.

 

Regards

Prasanth

0 Kudos