Hi,
I am trying to get the "#pragma omp declare simd" construct to work, but I am running into problems.
I wrote a program that consists of two compilation units: main.cpp and f.cpp. The file f.cpp contains two functions. The function f_not_vectorized does not come with a vectorized version, while f_openmp is declared with "#pragma omp declare simd" so that OpenMP 4 generates a vector version of it. Both compute the cosine of a float.
I use main.cpp to time the results. Three tests are run: one with f_not_vectorized, one with f_openmp, and one with a direct call to std::cos. Unfortunately, I get the following result:
    Not vectorized: 8.370e-01 s
    OpenMP: 8.370e-01 s
    Inlined: 1.563e-01 s
which is not what I expected. I was expecting the OpenMP version to perform very close to the inlined version, since it should be vectorized. Here is the full code, compiled with icpc version 16.0.2 using the following commands:
    icpc -c -std=c++11 -O3 -xHost -ansi-alias -qopenmp main.cpp -o main.o
    icpc -c -std=c++11 -O3 -xHost -ansi-alias -qopenmp f.cpp -o f.o
    icpc -std=c++11 -O3 -xHost -ansi-alias -qopenmp main.o f.o -o main
The file f.cpp
    #include <cmath>

    float f_not_vectorized(float x) { return std::cos(x); }

    #pragma omp declare simd notinbranch simdlen(8)
    float f_openmp(float x) { return std::cos(x); }
And the file main.cpp
    #include <cstdio>
    #include <cmath>
    #include <chrono>

    float f_not_vectorized(float x);

    #pragma omp declare simd notinbranch simdlen(8)
    float f_openmp(float x);

    int main() {
      const int nb_times{1000000};
      const int array_length{128};
      float v[array_length];

      // Test 1: scalar function defined in another compilation unit.
      auto time_begin = std::chrono::high_resolution_clock::now();
      for (int k{0}; k < nb_times; ++k) {
        for (int i{0}; i < array_length; ++i) {
          v[i] = f_not_vectorized(v[i]);
        }
      }
      auto time_end = std::chrono::high_resolution_clock::now();
      double time_not_vectorized{
          1.0e-9 * std::chrono::duration_cast<std::chrono::nanoseconds>(
                       time_end - time_begin)
                       .count()};
      std::printf("Not vectorized: %7.3e s\n", time_not_vectorized);

      // Test 2: function declared with "omp declare simd", called from a simd loop.
      time_begin = std::chrono::high_resolution_clock::now();
      for (int k{0}; k < nb_times; ++k) {
    #pragma omp simd
        for (int i{0}; i < array_length; ++i) {
          v[i] = f_openmp(v[i]);
        }
      }
      time_end = std::chrono::high_resolution_clock::now();
      double time_openmp{
          1.0e-9 * std::chrono::duration_cast<std::chrono::nanoseconds>(
                       time_end - time_begin)
                       .count()};
      // Note: as posted, this prints time_not_vectorized rather than time_openmp,
      // so the "OpenMP" line in the output repeats the first measurement.
      std::printf(" OpenMP: %7.3e s\n", time_not_vectorized);

      // Test 3: direct call to std::cos, which the compiler can inline.
      time_begin = std::chrono::high_resolution_clock::now();
      for (int k{0}; k < nb_times; ++k) {
        for (int i{0}; i < array_length; ++i) {
          v[i] = std::cos(v[i]);
        }
      }
      time_end = std::chrono::high_resolution_clock::now();
      double time_inlined{
          1.0e-9 * std::chrono::duration_cast<std::chrono::nanoseconds>(
                       time_end - time_begin)
                       .count()};
      std::printf(" Inlined: %7.3e s\n", time_inlined);

      // Sum the results so the compiler cannot discard the loops above.
      float check{0.0f};
      for (int i{0}; i < array_length; ++i) {
        check += v[i];
      }
      std::printf("Check: %7.3e\n", check);

      return 0;
    }
Thanks for your help.
First, the loop not marked with #pragma omp simd is also auto-vectorized, so the first and second cases in your output perform the same. All three cases invoke __svml_cosf8. The only difference is that in the third case the function is inlined, which avoids the function call overhead. Functions with a body this small are better candidates for inlining than for vector functions.
Thanks and Regards
Anoop
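To illustrate the inlining point, here is a minimal sketch, assuming the small function can be moved into a header so that its body is visible at the call site. The names f_inline and apply_cos are hypothetical and do not appear in the original post. Whether the compiler actually inlines and vectorizes the loop depends on the compiler and flags, so it is worth checking the vectorization report.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helper (not from the original post): defined inline in a
// header so the compiler sees the body and can inline it into the loop.
inline float f_inline(float x) { return std::cos(x); }

// Hypothetical helper: applies f_inline to each element of v in place.
// The simd pragma asks the compiler to vectorize the loop; if OpenMP is
// not enabled, the pragma is ignored and the loop still runs correctly.
inline void apply_cos(float* v, std::size_t n) {
#pragma omp simd
  for (std::size_t i = 0; i < n; ++i) {
    v[i] = f_inline(v[i]);
  }
}
```

With this layout the call in main.cpp would become apply_cos(v, array_length), and no separate declare simd version of the function is needed because there is no call left across compilation units.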
