Hello,
Intel Inspector in Parallel Studio XE 2019 detects a data race when the atan2 function is used. Here is a sample code:
program test_arctan
   implicit none
   real*8 x(1000), y(1000), r(1000)
   integer i, n
   n = 1000
   x = 0.1d0
   y = 0.1d0
!$omp parallel do schedule(static,1)
   do i = 1, n
      r(i) = atan2(x(i), y(i))
   enddo
   do i = 1, n
      write(5000,*) r(i)
   enddo
end program test_arctan
The attached snapshot of the Inspector screen showing the read/write race is from a different, larger code, which the sample program here is meant to reproduce. I also tested explicitly declaring the arguments as thread-private, and that also got rid of the data-race error when using the atan2 function:
!$omp parallel do schedule(static,1) &
!$omp& firstprivate(x,y)
   do i = 1, n
      r(i) = atan2(x(i), y(i))
   enddo
Is it a false positive or do I have a problem using atan2 like that?
Thank you
Your sample code looks nothing like what is shown in the screenshot.
Yes, it is from the original code that I could not share, as I mentioned in the original message. The image was meant to illustrate the actual reported error. A similar one, for different variables, is generated for the sample program I shared.
But it's not at all similar. In the screenshot, the arguments to atan2 are scalars, whereas in your "sample" they are array elements indexed by the parallel loop.
I did, however, find an issue when I built the program as a release build with parallelization enabled. It appears to be inside SVML (the vector math library), where it initializes a "feature flag" based on the processor type (see the attached screenshot). This doesn't look right to me, and I suggest you report it to Intel for investigation.
I did some more thinking about the data race, and if it is doing what I think, it is harmless. The first time you call an optimized math routine, it checks the CPU type so that it can do "CPU dispatching" for best performance. Then it writes a code into a global memory location that it checks on future calls. In a multithreaded environment, it's always going to write the same code, so it doesn't matter if there are two threads trying to write it. The library could try to synchronize access, but that would be slow and unnecessary.
Thank you, Steve, for looking into it. Sorry if it was confusing.
In addition to the SVML issue Steve mentioned, the above code should not use static scheduling with a chunk size of 1. Doing so results in excessive cache-line evictions (false sharing) among the cores of your thread team. To correct this:
a) align arrays x, y, and r on cache-line boundaries (currently 64 bytes) and use a chunk size that is a multiple of the number of elements per cache line (64/sizeof(x(1))), or
b) use static scheduling without specifying a chunk size (and consider adding a simd clause too).
Additionally, x and y can be shared; the firstprivate clause currently causes an unnecessary copy operation.
Jim Dempsey
