Solved: OpenMP SIMD

Chris_Szalwinski · ‎10-08-2020

Hi,

I'm having trouble implementing #pragma omp simd. I obtain different results with stubs turned on and off. My code is as follows and I'm using Intel C++ 19.1 through Visual Studio 2019 on Windows 10:

// Map Pattern
// saxpy_simd.cpp

#include <iostream>
#include <chrono>
#include <omp.h>

void reportTime(const char* msg, std::chrono::steady_clock::duration span);

constexpr int N = 100;

void saxpy(float a, const float* x, float* y, int n) {
#pragma omp parallel
{
#pragma omp simd
for (int i = 0; i < n; ++i) {
y[i] = a * x[i] + y[i];
}
}
}

int main() {
float* x = new float[N];
float* y = new float[N];
float a = 2.0f;

// initialize
for (int i = 0; i < N; i++)
x[i] = y[i] = 1.0f;

std::chrono::steady_clock::time_point ts, te;
ts = std::chrono::steady_clock::now();
saxpy(a, x, y, N);
te = std::chrono::steady_clock::now();

// sum y[i]
float s = 0.0f;
for (int i = 0; i < N; i++)
s += y[i];

std::cout << "Sum y[i] = " << s << std::endl;
reportTime("saxpy", te - ts);

delete[] x;
delete[] y;
}

// reportTime inserts the duration (span) into standard output
//
void reportTime(const char* msg, std::chrono::steady_clock::duration span) {
auto ms = std::chrono::duration_cast<std::chrono::nanoseconds>(span);
std::cout << msg << ": " <<
ms.count() << " nanoseconds" << std::endl;
}

with Qopenmp_stubs my output is

Sum y[i] = 300
saxpy: 100 nanoseconds

with Qopenmp my output is

Sum y[i] = 1700
saxpy: 6141900 nanoseconds

It is as if y[i] changes from 1 to 17 instead of from 1 to 3. My code works well with omp for without vectorization.

What am I missing?

Chris

Chris_Szalwinski · ‎10-10-2020

Hi,

I have found the answer to my question:

#pragma omp simd

for (i = 0; i < n; ++i)

y[i] = a * x[i] + y[i];

Thanks,

Chris

View solution in original post

PrasanthD_intel · ‎10-09-2020

Hi Chris,

Here in the code, you are getting wrong results while using OpenMP since all the threads launched by omp parallel() will access the same memory (y[i]) concurrently and override the results.

All the threads launched by omp parallel will run the complete block concurrently. Hence, for loop inside the omp parallel is redundant.

Pragma	Description
omp parallel	Specifies that a structured block should be run in parallel by a team of threads.

To achieve the desired result using omp parallel you can refer the below code snippet.

void saxpy(float a, const float* x, float* y, int n) {

omp_set_num_threads(N);

#pragma omp parallel 

  {

     int i = omp_get_thread_num();

     y[i] = a * x[i] + y[i];

  }

}

Instead, if you wanted to use OpenMP for a single for loop. It is recommended to use omp parallel for and it should be immediately above the targeted for loop.

omp parallel for

Provides an abbreviated way to specify a parallel region containing a single FOR construct.

Please find the below code:

void saxpy(float a, const float* x, float* y, int n) {

omp_set_num_threads(N);

#pragma omp parallel for simd

for (int i = 0; i < n; ++i) {

    y[i] = a * x[i] + y[i];

   }
}

Let us know if you need any help or face any errors.

Regards

Prasanth

Chris_Szalwinski · ‎10-09-2020

Thank you for drawing my attention to this. I'm actually looking for a way to demonstre the difference in performance on saxpy using different OpenMP constructs. I have serial, SPMD, Work Sharing, Work Sharing with SIMD, but can't get SIMD alone right (vectorization only). With Cilk gone, I'm after a purely OpenMP solution. In light of your note, is this the simplest construct

#pragma omp parallel num_threads(1) {

#pragma omp for simd

for (i = 0; i < n; +ii)

y[i] = a * x[i] + y[i];

}

or is there a simpler form?

Chris_Szalwinski · ‎10-10-2020

Hi,

I have found the answer to my question:

#pragma omp simd

for (i = 0; i < n; ++i)

y[i] = a * x[i] + y[i];

Thanks,

Chris

PrasanthD_intel · ‎10-15-2020

Hi Chris,

Glad you have found the answer. Since you are trying to achieve vectorization only, that is the simplest construct.

We will be closing this thread for now. If you require any additional assistance from intel please raise a new thread.

Regards

Prasanth

OpenMP SIMD

Pragma