Strange behavior with icc, OpenMP and a dynamic library

marsupialtail · ‎07-15-2020

I posted this question on StackOverflow. Thought I'd repost it here.

I have a C++ function written using OpenMP. The function involves an outer loop which is parallelized statically with some private variables. But I believe the precise nature of the function might not be important. When I compile it into a dynamic library using this:

g++ -fopenmp -shared -fPIC -O3 -march=native testing.cpp -o test.so

everything works well downstream. When I compile it into a dynamic library using the equivalent command in icc:

icc -qopenmp -shared -fPIC -O3 -march=native testing.cpp -o test.so

the function executes in about the same time but gives the wrong result. Any ideas?

To reproduce: see https://github.com/marsupialtail/icc-problem. Run compile.sh with either icc or gcc and you can see that the results don't agree.
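
When comparing the output of two builds, it can help to check agreement within a tolerance, so that small rounding differences are not mistaken for genuinely wrong results. The following is only a minimal sketch: the helper name `first_mismatch` and the tolerance values are assumptions and are not taken from the repository.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the repository): returns the index of the
// first element where a and b differ by more than atol + rtol * |b|,
// or -1 if every element agrees within that tolerance.
// If the sizes differ, the length of the shorter vector is returned.
std::ptrdiff_t first_mismatch(const std::vector<float>& a,
                              const std::vector<float>& b,
                              float rtol = 1e-4f,
                              float atol = 1e-6f) {
    if (a.size() != b.size()) {
        return static_cast<std::ptrdiff_t>(a.size() < b.size() ? a.size() : b.size());
    }
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float diff = std::fabs(a[i] - b[i]);
        const float tol = atol + rtol * std::fabs(b[i]);
        // Written as !(diff <= tol) so that a NaN also counts as a mismatch.
        if (!(diff <= tol)) {
            return static_cast<std::ptrdiff_t>(i);
        }
    }
    return -1;
}
```

Calling this on the result buffers of the gcc build and the icc build would report where the two first diverge, which is often a useful hint about which loop iteration or thread is involved.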

AbhishekD_Intel · ‎07-16-2020

Hi,

Thank you for reporting this behavior. We need a few more details:

Does this behavior persist only with dynamic libraries, or also with normal executables?
Please send us a minimal reproducer that shows the problem described above, so that we can raise it as a bug.
Please also give us the processor and compiler version you used to test the reproducer, so that we can reproduce it on our end.

Please give us the above details.

Warm Regards,

Abhishek

jimdempseyatthecove · ‎07-16-2020

While I haven't attempted to build and run your program, perhaps you can explain:

..., const float * __restrict__ BC, ...
...
#pragma omp parallel for schedule(static) private(ACC,RC,val,zero)
for(int C_block = 0; C_block < 28; C_block ++){
  int C_offset = C_block * (12544 / 28);
  ...
  for(int lane =0; lane < Tsz; lane += 4){
     RC = _mm256_load_ps(&BC[0 + C_offset + lane]);
     ...

Your incoming arrays are float.
You are using _mm256 instructions, which have a SIMD vector width of 8 floats,
while your lane loop advances by 4 lanes?

Is this a programming error (in other words, you took SSE code and "converted" it to AVX code)?

Jim Dempsey
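
To make the concern above concrete, here is a small sketch of the arithmetic involved. It does not use the intrinsics themselves; it only computes which lane offsets would break the 32-byte alignment that `_mm256_load_ps` requires, assuming the base pointer is 32-byte aligned. The helper name `misaligned_offsets` is hypothetical.

```cpp
#include <cstddef>
#include <vector>

static_assert(sizeof(float) == 4, "sketch assumes 4-byte floats");

// One 256-bit AVX register holds 32 bytes; _mm256_load_ps (the aligned
// load) requires its address to be a multiple of 32 bytes.
constexpr std::size_t kAvxBytes = 32;
constexpr std::size_t kFloatsPerAvx = kAvxBytes / sizeof(float);  // 8 floats

// Hypothetical helper: assuming a 32-byte-aligned base pointer, return the
// lane offsets (counted in floats) at which an aligned 256-bit load would
// land on an address that is not a multiple of 32 bytes.
std::vector<std::size_t> misaligned_offsets(std::size_t total, std::size_t step) {
    std::vector<std::size_t> bad;
    for (std::size_t lane = 0; lane < total; lane += step) {
        if ((lane * sizeof(float)) % kAvxBytes != 0) {
            bad.push_back(lane);
        }
    }
    return bad;
}
```

With a step of 4, every other load is misaligned and consecutive loads overlap by 4 floats; with a step of 8, every load stays aligned and the loads do not overlap.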

marsupialtail · ‎07-20-2020

Hi,

In the code, the increment of 4 is for Arm, as defined by preprocessor directives. For x86 I am using an increment of 8. The code is correct when NOT executed as a dynamic shared library.

The reproducible example is included in the github link.

Thank you!
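
For readers following along, a preprocessor-selected step of the kind described above might look roughly like the sketch below. The names `kLaneStep` and `iterations` are hypothetical and the actual directives in the repository may differ; only the predefined macros `__AVX__` and `__ARM_NEON` are standard compiler-provided names.

```cpp
#include <cstddef>

// Hypothetical names; the repository may select the step differently.
#if defined(__AVX__)
constexpr std::size_t kLaneStep = 8;  // 256-bit registers hold 8 floats
#elif defined(__ARM_NEON)
constexpr std::size_t kLaneStep = 4;  // 128-bit NEON registers hold 4 floats
#else
constexpr std::size_t kLaneStep = 1;  // scalar fallback
#endif

// Number of loop iterations needed to cover n floats with the given step,
// rounding up when n is not a multiple of the step.
constexpr std::size_t iterations(std::size_t n, std::size_t step = kLaneStep) {
    return (n + step - 1) / step;
}
```

For the block size in the excerpt (12544 / 28 = 448 floats), a step of 8 gives 56 iterations and a step of 4 gives 112. Because the selection depends on which macros each compiler defines for the given flags, it is worth confirming that gcc and icc both take the same branch.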

jimdempseyatthecove · ‎07-20-2020

>>In the code, the increment of 4 is for Arm, as defined by preprocessor directives. For x86 I am using an increment of 8.

That is not what your code shows in the clip I excerpted. You are explicitly using += 4 (a literal) combined with the _mm256... intrinsic. It appears that you may have a copy/paste error.

Jim Dempsey

AbhishekD_Intel · ‎07-19-2020

Hi,

Please give us an update on your issue and also give us a minimal reproducer.

Thank you

AbhishekD_Intel · ‎08-03-2020

Hi,

Please give us an update on your issue. It seems Jim's observation is correct. So please give us relevant information to dig into this issue.

Warm Regards,

Abhishek

AbhishekD_Intel · ‎08-13-2020

Hi,

Please give us an update on your issue.

Thank you

AbhishekD_Intel · ‎08-17-2020

We have not heard back from you, so we will no longer be monitoring this thread. If you need further assistance, please post a new thread.

Thank you

Abhishek