Intel® C++ Compiler

Problem with #pragma omp declare simd

velvia

Hi,

I am trying to get the "#pragma omp declare simd" construct to work, but I am running into problems.

I wrote a program that consists of two compilation units: main.cpp and f.cpp. The file f.cpp contains two functions. The function f_not_vectorized does not come with a vectorized version and f_openmp is asked to be vectorized with OpenMP 4. They both compute the cos of a float.

I use main.cpp to time the results. Three tests are done. One with f_not_vectorized, one with f_openmp and one with a direct call to std::cos. Unfortunately, I get the following result:

Not vectorized: 8.370e-01 s
        OpenMP: 8.370e-01 s
       Inlined: 1.563e-01 s

which is not what I expected. I was expecting the OpenMP version to perform very close to the inlined version, since it should be vectorized. Here is the full code, compiled with icpc version 16.0.2 using

icpc -c -std=c++11 -O3 -xHost -ansi-alias -qopenmp main.cpp -o main.o
icpc -c -std=c++11 -O3 -xHost -ansi-alias -qopenmp f.cpp -o f.o
icpc -std=c++11 -O3 -xHost -ansi-alias -qopenmp main.o f.o -o main

The file f.cpp

#include <cmath>

float f_not_vectorized(float x) { return std::cos(x); }

#pragma omp declare simd notinbranch simdlen(8)
float f_openmp(float x) { return std::cos(x); }

And the file main.cpp

#include <cstdio>
#include <cmath>
#include <chrono>

float f_not_vectorized(float x);

#pragma omp declare simd notinbranch simdlen(8)
float f_openmp(float x);

int main() {
  const int nb_times{1000000};
  const int array_length{128};
  float v[array_length] = {};  // zero-initialize so the loops never read indeterminate values

  auto time_begin = std::chrono::high_resolution_clock::now();
  for (int k{0}; k < nb_times; ++k) {
    for (int i{0}; i < array_length; ++i) {
      v[i] = f_not_vectorized(v[i]);
    }
  }
  auto time_end = std::chrono::high_resolution_clock::now();
  double time_not_vectorized{
      1.0e-9 *
      std::chrono::duration_cast<std::chrono::nanoseconds>(time_end -
                                                           time_begin)
          .count()};
  std::printf("Not vectorized: %7.3e s\n", time_not_vectorized);

  time_begin = std::chrono::high_resolution_clock::now();
  for (int k{0}; k < nb_times; ++k) {
#pragma omp simd
    for (int i{0}; i < array_length; ++i) {
      v[i] = f_openmp(v[i]);
    }
  }
  time_end = std::chrono::high_resolution_clock::now();
  double time_openmp{1.0e-9 *
                     std::chrono::duration_cast<std::chrono::nanoseconds>(
                         time_end - time_begin)
                         .count()};
  std::printf("        OpenMP: %7.3e s\n", time_openmp);

  time_begin = std::chrono::high_resolution_clock::now();
  for (int k{0}; k < nb_times; ++k) {
    for (int i{0}; i < array_length; ++i) {
      v[i] = std::cos(v[i]);
    }
  }
  time_end = std::chrono::high_resolution_clock::now();
  double time_inlined{1.0e-9 *
                      std::chrono::duration_cast<std::chrono::nanoseconds>(
                          time_end - time_begin)
                          .count()};
  std::printf("       Inlined: %7.3e s\n", time_inlined);

  float check{0.0f};
  for (int i{0}; i < array_length; ++i) {
    check += v[i];
  }
  std::printf("Check: %7.3e\n", check);

  return 0;
}
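
The three timing blocks in main.cpp repeat the same begin/end/duration_cast pattern, which makes it easy to print the wrong variable. A minimal sketch of a helper that could factor this out is given here; the name time_seconds is hypothetical and is not part of the code above, and std::chrono::steady_clock is swapped in for high_resolution_clock as an assumption, since it is guaranteed to be monotonic.

```cpp
#include <chrono>

// Hypothetical helper (not part of the original program): runs `body` once
// and returns the elapsed wall-clock time in seconds.
template <typename F>
double time_seconds(F&& body) {
  const auto begin = std::chrono::steady_clock::now();
  body();
  const auto end = std::chrono::steady_clock::now();
  // duration<double> has a period of one second, so count() is in seconds.
  return std::chrono::duration<double>(end - begin).count();
}
```

Each test could then be written as a lambda passed to time_seconds, with the returned value printed directly next to its label.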

Thanks for your help.

Anoop_M_Intel
Employee

Firstly, the loop that is not marked with #pragma omp simd is also auto-vectorized, so the first and second cases in your output perform the same. In all three cases the loop invokes __svml_cosf8. The only difference is that in the third case the function is inlined, which avoids the function call overhead. Functions with such a small body are better candidates for inlining than for vector functions.
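
To illustrate the inlining point, here is a minimal sketch that assumes the function can be moved into a header shared by the callers. The names f_inline and apply_cos are hypothetical and do not appear in the original program. Because the body is visible at the call site, the compiler is free to inline it into the loop; whether it actually does so, and which vector math routine it then calls, depends on the compiler and flags.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical header-style definition: the body is visible to every caller,
// so no out-of-line call is required inside the loop.
inline float f_inline(float x) { return std::cos(x); }

// Applies f_inline to each element of v in place.
inline void apply_cos(float* v, std::size_t n) {
#pragma omp simd
  for (std::size_t i = 0; i < n; ++i) {
    v[i] = f_inline(v[i]);
  }
}
```

With this layout no "#pragma omp declare simd" is needed for the small function, since no out-of-line vector variant has to be generated.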

Thanks and Regards
Anoop
