Beginner
229 Views

strange behavior with icc, openmp and dynamic library

I posted this question on StackOverflow. Thought I'd repost it here. 

I have a C++ function written using OpenMP. It has an outer loop that is parallelized with a static schedule and a few private variables, though I don't think the details of the function matter here. When I compile it into a dynamic library using this:

g++ -fopenmp -shared -fPIC -O3 -march=native testing.cpp -o test.so

everything works well downstream. When I compile it into a dynamic library using the equivalent icc command:

icc -qopenmp -shared -fPIC -O3 -march=native testing.cpp -o test.so

the function executes in about the same time but gives the wrong result. Any ideas?

To reproduce: clone https://github.com/marsupialtail/icc-problem and run compile.sh with either icc or gcc. The results from the two builds don't agree.
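One thing that may be worth ruling out when two compilers disagree on an OpenMP loop is how the private variables are initialized: a variable listed in private() is uninitialized inside the parallel region, while firstprivate() copies in the value it had before the region. The sketch below is not the kernel from the repository. The functions `scale_parallel`, `scale_serial`, and `max_abs_diff` and their arguments are hypothetical stand-ins, and the input length is assumed to be divisible by the block count. It only shows the pattern and a way to compare the parallel result against a serial reference.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in kernel (not the code from the repository).
// Assumes in.size() is divisible by `blocks`.
std::vector<float> scale_parallel(const std::vector<float>& in, int blocks, float k) {
    std::vector<float> out(in.size());
    const int block_len = static_cast<int>(in.size()) / blocks;
    float zero = 0.0f;  // initialized before the region, so it needs firstprivate
    float acc;          // written before it is read in the loop, so private is enough
    #pragma omp parallel for schedule(static) private(acc) firstprivate(zero)
    for (int b = 0; b < blocks; ++b) {
        const int offset = b * block_len;
        for (int i = 0; i < block_len; ++i) {
            acc = zero;                 // start from the copied-in value
            acc += in[offset + i] * k;
            out[offset + i] = acc;
        }
    }
    return out;
}

// Serial reference that computes the same values without OpenMP.
std::vector<float> scale_serial(const std::vector<float>& in, float k) {
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = in[i] * k;
    }
    return out;
}

// Largest element-wise absolute difference between two equally sized vectors.
float max_abs_diff(const std::vector<float>& a, const std::vector<float>& b) {
    float worst = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        worst = std::fmax(worst, std::fabs(a[i] - b[i]));
    }
    return worst;
}
```

Calling `max_abs_diff(scale_parallel(in, 28, k), scale_serial(in, k))` from both the gcc and the icc build of the library would show whether either build departs from the serial reference.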

8 Replies
Moderator
214 Views

Hi,


Thank you for reporting this behavior. We need a few details:

  1. Does this behavior occur only with dynamic libraries, or also with normal executables?
  2. Please send us a minimal reproducer for the problem so that we can raise it as a bug.
  3. Please also give us the processor and compiler version you used to test the reproducer, so that we can reproduce it on our end.


Please give us the above details.



Warm Regards,

Abhishek


213 Views

While I haven't attempted to build and run your program, perhaps you can explain:

..., const float * __restrict__ BC, ...
...
#pragma omp parallel for schedule(static) private(ACC,RC,val,zero)
for(int C_block = 0; C_block < 28; C_block ++){
  int C_offset = C_block * (12544 / 28);
  ...
  for(int lane =0; lane < Tsz; lane += 4){
     RC = _mm256_load_ps(&BC[0 + C_offset + lane]);
     ...

Your incoming arrays are float.
You are using _mm256 intrinsics, which have a SIMD vector width of 8 floats,
yet your lane-processing loop advances by 4 lanes?

Is this a programming error (in other words, did you take SSE code and "convert" it to AVX code)?
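To make the concern concrete, here is a plain scalar model of a loop that "loads" a fixed number of consecutive floats per iteration and then advances by a given stride. It uses no intrinsics and is not the poster's kernel. The names `Coverage` and `simulate_loads` are hypothetical. It only counts how often each element would be touched and how many reads would fall past the end of the array.

```cpp
#include <vector>

// Result of the simulation: per-element touch counts plus the number of
// element reads that would have fallen outside the array.
struct Coverage {
    std::vector<int> touches;
    int out_of_bounds = 0;
};

// Scalar model of a SIMD loop over n floats: each iteration covers `width`
// consecutive elements starting at `lane`, then `lane` advances by `stride`.
// Out-of-range accesses are counted, not performed.
Coverage simulate_loads(int n, int width, int stride) {
    Coverage c;
    c.touches.assign(n, 0);
    for (int lane = 0; lane < n; lane += stride) {
        for (int j = 0; j < width; ++j) {
            const int idx = lane + j;
            if (idx < n) {
                ++c.touches[idx];
            } else {
                ++c.out_of_bounds;
            }
        }
    }
    return c;
}
```

With width 8 and stride 8 every element is covered exactly once. With width 8 and stride 4 most elements are covered twice and the last iteration reaches 4 floats past the end. Whether that changes the result depends on what the loop body does with the loaded values, but the out-of-range read is a problem in any case.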

Jim Dempsey

Moderator
191 Views

Hi,

Please give us an update on your issue and also give us a minimal reproducer.


Thank you


Beginner
186 Views

Hi,

In the code, the increment of 4 is for Arm, as defined by preprocessor directives. For x86 I am using an increment of 8. The code is correct when NOT executed as a dynamic shared library.
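A minimal sketch of what selecting the increment by preprocessor directives could look like, assuming the selection is keyed on the compiler-defined macros `__AVX__` and `__ARM_NEON`. The constant `kLanes` and the helper `iterations` are hypothetical names for illustration and are not taken from the repository.

```cpp
// Pick the number of floats handled per loop iteration from the target ISA.
#if defined(__AVX__)
constexpr int kLanes = 8;   // a 256-bit AVX register holds 8 floats
#elif defined(__ARM_NEON)
constexpr int kLanes = 4;   // a 128-bit NEON register holds 4 floats
#else
constexpr int kLanes = 1;   // scalar fallback when neither macro is defined
#endif

// Number of iterations of `for (lane = 0; lane < tsz; lane += kLanes)`.
constexpr int iterations(int tsz) {
    return (tsz + kLanes - 1) / kLanes;
}
```

Using a single named constant for the loop increment keeps the step tied to the register width in one place, so a literal such as `+= 4` cannot end up next to a 256-bit intrinsic.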

The reproducible example is included in the github link.

Thank you!

 

183 Views

>>In the code, the increment of 4 is for Arm, as defined by preprocessor directives. For x86 I am using an increment of 8.

That is not what the excerpt I quoted from your code shows. You are explicitly using += 4 (a literal) combined with an _mm256... intrinsic. It would appear that you may have a copy/paste error.

Jim Dempsey

Moderator
150 Views

Hi,

Please give us an update on your issue. It seems Jim's observation is correct. So please give us relevant information to dig into this issue.



Warm Regards,

Abhishek


Moderator
116 Views

Hi,

Please give us an update on your issue.


Thank you


Moderator
103 Views

We have not heard back from you, so we will no longer be monitoring this thread. If you need further assistance, please post a new thread.


Thank you

Abhishek

