Intel® C++ Compiler
Support and discussions for creating C++ code that runs on platforms based on Intel® processors.

Strange behavior with icc, OpenMP, and dynamic library

marsupialtail
Beginner
443 Views

I posted this question on StackOverflow. Thought I'd repost it here. 

I have a C++ function written using OpenMP. Its outer loop is parallelized with a static schedule and a few private variables, though I believe the precise nature of the function may not be important. When I compile it into a dynamic library with:

g++ -fopenmp -shared -fPIC -O3 -march=native testing.cpp -o test.so

everything works well downstream. When I compile it into a dynamic library with the equivalent icc command:

icc -qopenmp -shared -fPIC -O3 -march=native testing.cpp -o test.so

the function executes in about the same time but gives the wrong result. Any ideas?

To reproduce: https://github.com/marsupialtail/icc-problem. Run compile.sh with either icc or gcc, and you can see that the results don't agree.

8 Replies
AbhishekD_Intel
Moderator
428 Views

Hi,


Thank you for reporting this strange behavior. There are a few details we would like to know:

  1. Does this behavior persist only with dynamic libraries, or also with normal executables?
  2. Please send us a minimal reproducer exhibiting the above-mentioned problem so that we can investigate it further and, if needed, raise it as a bug.
  3. Also, give us the details of the processor and the compiler version you used, so that we can reproduce the issue on our end.


Please give us the above details.



Warm Regards,

Abhishek


jimdempseyatthecove
Black Belt
427 Views

While I haven't attempted to build and run your program, perhaps you can explain:

..., const float * __restrict__ BC, ...
...
#pragma omp parallel for schedule(static) private(ACC,RC,val,zero)
for (int C_block = 0; C_block < 28; C_block++) {
  int C_offset = C_block * (12544 / 28);
  ...
  for (int lane = 0; lane < Tsz; lane += 4) {
     RC = _mm256_load_ps(&BC[0 + C_offset + lane]);
     ...

Your incoming arrays are float, and you are using _mm256 intrinsics, which have a SIMD vector width of 8 floats, yet your lane loop is advancing by only 4 lanes.

Is this a programming error (in other words, did you take SSE code and "convert" it to AVX code)?

Jim Dempsey

marsupialtail
Beginner
400 Views

Hi,

In the code, the increment of 4 is for Arm, as defined by preprocessor directives. For x86 I am using an increment of 8. The code is correct when NOT executed as a dynamic shared library.

The reproducible example is included in the github link.

Thank you!

 

jimdempseyatthecove
Black Belt
397 Views

>>In the code, the increment of 4 is for Arm, as defined by preprocessor directives. For x86 I am using an increment of 8.

That is not what your code shows in the clip I excerpted above: you are explicitly using += 4 (a literal) together with a _mm256... intrinsic. It would appear that you have a copy/paste error.

Jim Dempsey

AbhishekD_Intel
Moderator
405 Views

Hi,

Please give us an update on your issue and also give us a minimal reproducer.


Thank you


AbhishekD_Intel
Moderator
364 Views

Hi,

Please give us an update on your issue. It seems Jim's observation is correct. So please give us relevant information to dig into this issue.



Warm Regards,

Abhishek


AbhishekD_Intel
Moderator
330 Views

Hi,

Please give us an update on your issue.


Thank you


AbhishekD_Intel
Moderator
317 Views

Since we have not heard back from you, we will no longer be monitoring this thread. If you need further assistance, please post a new thread.


Thank you

Abhishek

