- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
I'm having trouble implementing #pragma omp simd. I obtain different results with stubs turned on and off. My code is as follows and I'm using Intel C++ 19.1 through Visual Studio 2019 on Windows 10:
// Map Pattern
// saxpy_simd.cpp
#include <iostream>
#include <chrono>
#include <omp.h>
void reportTime(const char* msg, std::chrono::steady_clock::duration span);
constexpr int N = 100;
void saxpy(float a, const float* x, float* y, int n) {
#pragma omp parallel
{
#pragma omp simd
for (int i = 0; i < n; ++i) {
y[i] = a * x[i] + y[i];
}
}
}
int main() {
float* x = new float[N];
float* y = new float[N];
float a = 2.0f;
// initialize
for (int i = 0; i < N; i++)
x[i] = y[i] = 1.0f;
std::chrono::steady_clock::time_point ts, te;
ts = std::chrono::steady_clock::now();
saxpy(a, x, y, N);
te = std::chrono::steady_clock::now();
// sum y[i]
float s = 0.0f;
for (int i = 0; i < N; i++)
s += y[i];
std::cout << "Sum y[i] = " << s << std::endl;
reportTime("saxpy", te - ts);
delete[] x;
delete[] y;
}
// reportTime inserts the duration (span) into standard output
//
void reportTime(const char* msg, std::chrono::steady_clock::duration span) {
auto ms = std::chrono::duration_cast<std::chrono::nanoseconds>(span);
std::cout << msg << ": " <<
ms.count() << " nanoseconds" << std::endl;
}
with Qopenmp_stubs my output is
Sum y[i] = 300
saxpy: 100 nanoseconds
with Qopenmp my output is
Sum y[i] = 1700
saxpy: 6141900 nanoseconds
It is as if y[i] changes from 1 to 17 instead of from 1 to 3. My code works well with omp for without vectorization.
What am I missing?
Chris
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
I have found the answer to my question:
#pragma omp simd
for (i = 0; i < n; ++i)
y[i] = a * x[i] + y[i];
Thanks,
Chris
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Chris,
Here in the code, you are getting wrong results while using OpenMP since all the threads launched by omp parallel() will access the same memory (y[i]) concurrently and override the results.
All the threads launched by omp parallel will run the complete block concurrently. Hence, for loop inside the omp parallel is redundant.
Pragma |
Description |
---|---|
omp parallel |
Specifies that a structured block should be run in parallel by a team of threads. |
To achieve the desired result using omp parallel you can refer the below code snippet.
void saxpy(float a, const float* x, float* y, int n) {
omp_set_num_threads(N);
#pragma omp parallel
{
int i = omp_get_thread_num();
y[i] = a * x[i] + y[i];
}
}
Instead, if you wanted to use OpenMP for a single for loop. It is recommended to use omp parallel for and it should be immediately above the targeted for loop.
omp parallel for |
Provides an abbreviated way to specify a parallel region containing a single FOR construct. |
Please find the below code:
void saxpy(float a, const float* x, float* y, int n) {
omp_set_num_threads(N);
#pragma omp parallel for simd
for (int i = 0; i < n; ++i) {
y[i] = a * x[i] + y[i];
}
}
Let us know if you need any help or face any errors.
Regards
Prasanth
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thank you for drawing my attention to this. I'm actually looking for a way to demonstre the difference in performance on saxpy using different OpenMP constructs. I have serial, SPMD, Work Sharing, Work Sharing with SIMD, but can't get SIMD alone right (vectorization only). With Cilk gone, I'm after a purely OpenMP solution. In light of your note, is this the simplest construct
#pragma omp parallel num_threads(1) {
#pragma omp for simd
for (i = 0; i < n; +ii)
y[i] = a * x[i] + y[i];
}
or is there a simpler form?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
I have found the answer to my question:
#pragma omp simd
for (i = 0; i < n; ++i)
y[i] = a * x[i] + y[i];
Thanks,
Chris
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Chris,
Glad you have found the answer. Since you are trying to achieve vectorization only, that is the simplest construct.
We will be closing this thread for now. If you require any additional assistance from intel please raise a new thread.
Regards
Prasanth
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page