Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

strange behavior with icc, openmp and dynamic library

marsupialtail
Beginner

I posted this question on StackOverflow. Thought I'd repost it here. 

I have a C++ function written using OpenMP. The function involves an outer loop that is parallelized with static scheduling and a few private variables, though I believe the precise nature of the function may not be important. When I compile it into a dynamic library with this command:

g++ -fopenmp -shared -fPIC -O3 -march=native testing.cpp -o test.so

everything works well downstream. But when I compile it into a dynamic library using the equivalent icc command:

icc -qopenmp -shared -fPIC -O3 -march=native testing.cpp -o test.so

the function executes in about the same time but gives the wrong result. Any ideas?

To reproduce: clone https://github.com/marsupialtail/icc-problem and run compile.sh with either icc or gcc; you will see that the results don't agree.

8 Replies
AbhishekD_Intel
Moderator

Hi,


Thank you for reporting this strange behavior. There are some details we would like to know:

  1. Does this behavior persist only with dynamic libraries, or also with normal executables?
  2. Please send us a minimal reproducer that exhibits the above-mentioned problem, so that we can get more details and raise it as a bug.
  3. Also, give us details of the processor and the compiler version you used, so that we can reproduce the issue on our end.



Warm Regards,

Abhishek


jimdempseyatthecove
Honored Contributor III

While I haven't attempted to build and run your program, perhaps you can explain:

..., const float * __restrict__ BC, ...
...
#pragma omp parallel for schedule(static) private(ACC,RC,val,zero)
for(int C_block = 0; C_block < 28; C_block ++){
  int C_offset = C_block * (12544 / 28);
  ...
  for(int lane = 0; lane < Tsz; lane += 4){
     RC = _mm256_load_ps(&BC[0 + C_offset + lane]);
     ...

Your incoming arrays are float, and you are using _mm256 intrinsics, which have a SIMD vector width of 8 floats, yet your lane-processing loop advances by only 4 lanes per iteration.

Is this a programming error (in other words, did you take SSE code and "convert" it to AVX code)?

Jim Dempsey

marsupialtail
Beginner

Hi,

In the code, the increment of 4 is for Arm, as defined by preprocessor directives. For x86 I am using an increment of 8. The code is correct when NOT executed as a dynamic shared library.

The reproducible example is included in the github link.

Thank you!

 

jimdempseyatthecove
Honored Contributor III

>>In the code, the increment of 4 is for Arm, as defined by preprocessor directives. For x86 I am using an increment of 8.

That is not what your code showed in the clip I excerpted: you are explicitly using += 4 (a literal) combined with an _mm256 intrinsic. It would appear that you may have a copy/paste error.

Jim Dempsey

AbhishekD_Intel
Moderator

Hi,

Please give us an update on your issue and also give us a minimal reproducer.


Thank you


AbhishekD_Intel
Moderator

Hi,

Please give us an update on your issue. It seems Jim's observation is correct, so please give us the relevant information so that we can dig into this issue.



Warm Regards,

Abhishek


AbhishekD_Intel
Moderator

Hi,

Please give us an update on your issue.


Thank you


AbhishekD_Intel
Moderator

Since we have not heard back from you, we will no longer be monitoring this thread. If you need further assistance, please post a new thread.


Thank you

Abhishek

