Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

unroll_and_jam pragma ignored but no reason specified

Abhinav_B_
Beginner

Hi,

Consider the following C++ code:

#include <malloc.h>
#include <cmath>
#include <complex>

int main(int argc, char **argv) {
    int N = 4000000;
    double * _arr_4_0;
  _arr_4_0 = (double *) (malloc((sizeof(double) * (unsigned long) (5331.0))));
  for (int _i0 = 0; (_i0 <= 5330); _i0 = (_i0 + 1))
  {
    _arr_4_0[_i0] = std::sin(_i0);
  }
    double * _arr_7_7;
  _arr_7_7 = (double *) (malloc((sizeof(double) * (unsigned long) (((0.1 * (double) (N)) + -66.0)))));
  #pragma omp parallel for schedule(static)
  #pragma ivdep
  for (int _i0 = 0; (_i0 < ((N / 10) - 66)); _i0 = (_i0 + 1))
  {
    _arr_7_7[_i0] = std::sqrt(_i0);
  }
    std::complex<double> * _arr_6_8;
  _arr_6_8 = (std::complex<double> *) (malloc((sizeof(std::complex<double>) * (unsigned long) (((0.1 * (double) (N)) + -5396.0)))));
  for (int o1 = 0; (o1 < (((N + 110) / 320) - 168)); o1 = (o1 + 1))
  {
    int _ct167 = ((((32 * o1) + 31) < ((N / 10) - 5397))? ((32 * o1) + 31): ((N / 10) - 5397));
    for (int o2 = (32 * o1); (o2 <= _ct167); o2 = (o2 + 1))
    {
      _arr_6_8[o2] = (0.0 + 0.0j);
    }
  }
  #pragma omp parallel for schedule(static)
  for (int o1 = 0; (o1 < (((N + 110) / 320) - 168)); o1 = (o1 + 1))
  {
    for (int o2 = 0; (o2 <= 166); o2 = (o2 + 1))
    {
      int _ct168 = ((((32 * o1) + 31) < ((N / 10) - 5397))? ((32 * o1) + 31): ((N / 10) - 5397));
      #pragma unroll_and_jam (6)
      for (int o3 = (32 * o1); (o3 <= _ct168); o3 = (o3 + 1))
      {
        int _ct169 = ((5330 < ((32 * o2) + 31))? 5330: ((32 * o2) + 31));
        #pragma ivdep
        for (int o4 = (32 * o2); (o4 <= _ct169); o4 = (o4 + 1))
        {
          _arr_6_8[o3] = (_arr_6_8[o3] + (_arr_7_7[((5330 - o4) + o3)] * _arr_4_0[o4]));
        }
      }
    }
  }
    return 0;
}

I compiled this using the following command (file saved as test.cpp):

icpc -O3 -qopenmp -qopt-report=5 -qopt-report-file=stdout test.cpp > optrpt

However, the compiler prints the following remark on stderr:

test.cpp(38): (col. 7) remark: unroll_and_jam pragma will be ignored due to 

There is no reason specified for why the pragma is being ignored. Could you please help me diagnose this?

icpc -V

gives

Intel(R) C++ Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 17.0.1.132 Build 20161005
Copyright (C) 1985-2016 Intel Corporation.  All rights reserved.

The same behavior is also present in

Intel(R) C++ Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 17.0.2.174 Build 20170213
Copyright (C) 1985-2017 Intel Corporation.  All rights reserved.

Any suggestions on how to debug this would be appreciated.

Thanks,
Abhinav

4 Replies
Abhinav_B_
Beginner

An update on the issue:

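The remark is issued only when the unroll_and_jam pragma is applied to the tiled loop nest. If lines 31-48 of test.cpp (the second parallel loop nest, from its "#pragma omp parallel for" down to its closing brace) are replaced with the untiled code below, no remark is emitted and the outer loop is unroll-and-jammed.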

#pragma omp parallel for schedule(static)
#pragma unroll_and_jam (16)
for (int _i0 = 0; (_i0 < ((N / 10) - 5396)); _i0 = (_i0 + 1))
{
    #pragma ivdep
    for (int _i1 = 0; (_i1 <= 5330); _i1 = (_i1 + 1))
    {
      _arr_6_8[_i0] = (_arr_6_8[_i0] + (_arr_7_7[((5330 + _i0) - _i1)] * _arr_4_0[_i1]));
    }
}
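
For readers unfamiliar with the transformation, here is a minimal sketch of what unroll-and-jam by a factor of 2 looks like when written by hand for the untiled loop nest above. The function names and the small problem sizes are hypothetical and chosen only for illustration; this is not the code the compiler generates, and the pragma in the post requests larger factors (6 and 16).

```cpp
#include <cstddef>
#include <vector>

// Reference version: the same access pattern as the untiled loop nest,
// out[i] += a[(K - 1 + i) - j] * b[j], with hypothetical sizes
// M = out.size() and K = b.size(). `a` must hold at least M + K - 1 elements.
void convolve_plain(std::vector<double>& out, const std::vector<double>& a,
                    const std::vector<double>& b) {
    const std::size_t M = out.size(), K = b.size();
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < K; ++j)
            out[i] += a[(K - 1 + i) - j] * b[j];
}

// Hand-written unroll-and-jam by 2: the outer loop is unrolled twice and the
// two resulting copies of the inner loop are fused ("jammed") into one,
// so each b[j] is loaded once and used for two output elements.
void convolve_unroll_jam2(std::vector<double>& out, const std::vector<double>& a,
                          const std::vector<double>& b) {
    const std::size_t M = out.size(), K = b.size();
    std::size_t i = 0;
    for (; i + 1 < M; i += 2) {
        for (std::size_t j = 0; j < K; ++j) {
            out[i]     += a[(K - 1 + i) - j] * b[j];
            out[i + 1] += a[(K + i) - j] * b[j];
        }
    }
    // Remainder iteration when M is odd.
    for (; i < M; ++i)
        for (std::size_t j = 0; j < K; ++j)
            out[i] += a[(K - 1 + i) - j] * b[j];
}
```

Each output element still accumulates its terms in the same order of j, so both versions should produce the same values; only the interleaving of work on different output elements changes.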

 

Igor_V_Intel
Employee

Hi Abhinav,

I will investigate and get back to you with an update shortly. This looks like a bug, and I will also check your test case with the 18.0 Beta compiler.


Regards,

Igor

Igor_V_Intel
Employee

The problem is still present in the 18.0 compiler. We should certainly correct the remark so that it states the reason. It looks like the loop was distributed into two chunks: the innermost loop of chunk 1 was vectorized, and chunk 2 was not. I will escalate this to the developers.
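
As a general illustration of what "distributed" means here, the sketch below shows loop distribution (also called loop fission) on a deliberately simple loop. It is only an assumption-free textbook example with hypothetical function names; it does not reproduce what the compiler actually did with the test case.

```cpp
#include <cstddef>
#include <vector>

// One loop containing two independent statements.
// x and y must be at least as large as in.
void fused_loop(std::vector<double>& x, std::vector<double>& y,
                const std::vector<double>& in) {
    for (std::size_t i = 0; i < in.size(); ++i) {
        x[i] = in[i] * 2.0;
        y[i] = in[i] + 1.0;
    }
}

// The same work after loop distribution: each statement gets its own loop
// ("chunk"), and each chunk can then be optimized, or not, on its own.
void distributed_loop(std::vector<double>& x, std::vector<double>& y,
                      const std::vector<double>& in) {
    for (std::size_t i = 0; i < in.size(); ++i)   // chunk 1
        x[i] = in[i] * 2.0;
    for (std::size_t i = 0; i < in.size(); ++i)   // chunk 2
        y[i] = in[i] + 1.0;
}
```

Because the two statements do not depend on each other, splitting them does not change the results.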

Thank you for reporting this problem.

Abhinav_B_
Beginner

Hi Igor,

Thanks for your response. In fact, as far as I can see, the loop should actually be unroll-and-jammed (maybe that's why no reason is given for ignoring the pragma). You're right in observing that the two loops in the original code (#2) have been tiled into chunks of 32 iterations each.
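
To make the tiling concrete, here is a minimal sketch of the clamped tile bounds used in test.cpp. The helper names are hypothetical; the first function only counts how often each index is visited, so one can check that the tiled loops cover the same iterations as the untiled ones.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Visit every index in [0, n) through tiles of `tile` iterations and count
// the visits. The last tile is clamped with std::min, which is the role the
// _ct167/_ct168/_ct169 temporaries play in test.cpp.
std::vector<int> visit_counts_tiled(int n, int tile) {
    std::vector<int> counts(static_cast<std::size_t>(n), 0);
    const int num_tiles = (n + tile - 1) / tile;  // ceiling division
    for (int t = 0; t < num_tiles; ++t) {
        const int last = std::min(tile * t + tile - 1, n - 1);
        for (int i = tile * t; i <= last; ++i)
            ++counts[static_cast<std::size_t>(i)];
    }
    return counts;
}

// Trip count of the o1 tile loop exactly as written in test.cpp.
int post_tile_count(int N) {
    return ((N + 110) / 320) - 168;
}
```

With N = 4000000 the untiled outer loop runs over 394604 indices, and the tile-loop bound from test.cpp matches the ceiling of 394604 / 32, so the tiled nest visits each index exactly once.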

Thanks,
Abhinav
