Another possible OpenMP SIMD compiler bug

John_D_12 · 03-15-2019

We've been adding lots of OpenMP SIMD directives to our electronic structure code (http://elk.sourceforge.net/) and have successfully sped it up.

But we've also encountered a few potential compiler bugs along the way. The first was reported here: https://software.intel.com/en-us/forums/intel-fortran-compiler-for-linux-and-mac-os-x/topic/805677

I think there may be another. Here is the simplest code which still has the problem:

program test
use modbug
implicit none
integer i
complex(8) z1
complex(8), allocatable :: x(:),y(:)
complex(8) zf
external zf

n=10

allocate(r(n))
allocate(x(n),y(n))

r(:)=1
x(:)=1
y(:)=1

z1=zf(x,y)

print *,z1

end program

complex(8) function zf(x,y)
use modbug
implicit none
complex(8), intent(in) :: x(n)
complex(8), intent(in) :: y(n)
! local variables
integer i
zf=0.d0
!$OMP SIMD
do i=1,n
  zf=zf+r(i)*conjg(x(i))*y(i)
end do
return
end function

A module in a separate file is also needed:

module modbug

integer n
real(8), allocatable :: r(:)

end module

The code is compiled with

ifort -O3 -ip -axCORE-AVX2,AVX,SSE4.2 -qopenmp modbug.f90 test.f90

on our Intel Xeon E5-2680 cluster with Intel Fortran 18.0.0.

The correct output should be 10.0, but with the SIMD directive the code returns 5.0 instead.

If the module is placed in the same file as the rest of the code, the compiler reports:

test.f90(35): warning #15552: loop was not vectorized with "simd"

and the code works fine.

Juergen_R_R · 03-15-2019

I would say that this loop is not parallelizable: the complex variable zf to which the terms are added has a different value in each iteration, so the loop in fact needs serial execution. Note that all the examples one finds for the OMP SIMD directive are loops of the form

!$OMP SIMD
do i = 1, n
   a(i) = a(i) * b(i) + c(i)
end do

where you have elemental operations, but not an incrementing operation on a variable quasi-global to the loop.

jimdempseyatthecove · 03-15-2019

Try:

zf=0.d0
!$OMP SIMD REDUCTION(+:zf)
do i=1,n
  zf=zf+r(i)*conjg(x(i))*y(i)
end do

Jim Dempsey

John_D_12 · 03-15-2019

jimdempseyatthecove wrote:
Try:
zf=0.d0
!$OMP SIMD REDUCTION(+:zf)
do i=1,n
  zf=zf+r(i)*conjg(x(i))*y(i)
end do
Jim Dempsey

This is what we did originally. Unfortunately it yields:

catastrophic error: **Internal compiler error: segmentation violation signal raised** Please report this error along with the circumstances in which it occurred in a Software Problem Report.  Note: File and line given may not be explicit cause of this error.

in Intel Fortran version 17, but only for the more complicated version of the code in Elk. The simplified example compiles without this error, but the loop is not vectorized.

After removing the REDUCTION clause, we discovered the wrong result with Intel Fortran version 18 described in the original post.

John_D_12 · 03-15-2019

Juergen R. wrote:
I would say that this loop is not parallelizable: the complex variable zf to which the terms are added has a different value in each iteration, so the loop in fact needs serial execution. Note that all the examples one finds for the OMP SIMD directive are loops of the form
!$OMP SIMD
do i = 1, n
   a(i) = a(i) * b(i) + c(i)
end do
where you have elemental operations, but not an incrementing operation on a variable quasi-global to the loop.

It is permitted to update the same variable within a SIMD loop. As Jim mentioned, it's better to tell the compiler that it is a REDUCTION variable. Unfortunately, doing so triggered the internal compiler error in Intel Fortran version 17 for the more complicated Elk code from which the simple example above was reduced.

If all the variables in the above example are real (with or without REDUCTION), Intel Fortran 17/18 compiles it without the warning that the loop was not vectorized. This does not result in a measurable speed-up, but we've added it to Elk nevertheless.
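
One possible workaround that builds on the observation about real variables is to split the complex accumulator into two real REDUCTION variables. This is only a sketch and has not been tested against the Intel Fortran 17/18 builds discussed in this thread. The names zfsplit_mod and zf_split are hypothetical, and n and r are passed as arguments rather than taken from modbug, purely to keep the sketch self-contained.

```fortran
! Hypothetical module and function names; not part of Elk.
module zfsplit_mod
implicit none
contains

complex(8) function zf_split(n,r,x,y)
! returns the sum over i of r(i)*conjg(x(i))*y(i), accumulated in two reals
integer, intent(in) :: n
real(8), intent(in) :: r(n)
complex(8), intent(in) :: x(n),y(n)
! local variables
integer i
real(8) sr,si,xr,xi,yr,yi
sr=0.d0
si=0.d0
!$OMP SIMD PRIVATE(xr,xi,yr,yi) REDUCTION(+:sr,si)
do i=1,n
  xr=dble(x(i)); xi=aimag(x(i))
  yr=dble(y(i)); yi=aimag(y(i))
! conjg(x)*y = (xr*yr+xi*yi) + i*(xr*yi-xi*yr)
  sr=sr+r(i)*(xr*yr+xi*yi)
  si=si+r(i)*(xr*yi-xi*yr)
end do
! recombine the real and imaginary partial sums
zf_split=cmplx(sr,si,8)
end function

end module
```

Called as z1=zf_split(n,r,x,y) with the arrays from the original example (all elements equal to 1 and n=10), the mathematically correct result is (10.0,0.0). Whether the affected compiler versions vectorize this form correctly would still need to be checked.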

The original bug still stands: an OMP SIMD directive alone should not break code. At worst, it should result in no vectorization being performed.