Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Floating point exception in MSVCRT

Allen_Barnett
Beginner
776 Views

I'm using Version 17.0.1.143 Build 20161005 with visual studio 2013. When floating point exceptions are unmasked, the following program produces the INVALID floating point exception when compiled with the given command line:

program m
  integer, parameter :: n = 2
  real b(n)
  b = (/ 5.3359774e-5, 5.5001899e-5 /)
  call a( b, n )
end program m
subroutine a ( b, n )
  integer n
  real b(n)
  double precision t
  integer i
  do i = 1, n
     t = dble( -b(i) )
     b(i) = sngl( exp(t) )
  end do
end subroutine a

Compile with:

ifort /fp:strict /QxAVX /Qvec-threshold0  /fpe0 /O2  /Z7 /MD well.F90 /link /map /nodefaultlib:libmmd

And run it:

> well
forrtl: error (65): floating invalid
Image              PC                Routine            Line        Source

MSVCR120.dll       00007FFE1E27CADF  Unknown               Unknown  Unknown
well.exe           00007FF6CF18108D  Unknown               Unknown  Unknown
well.exe           00007FF6CF18258E  Unknown               Unknown  Unknown
well.exe           00007FF6CF183E87  Unknown               Unknown  Unknown
KERNEL32.DLL       00007FFE2B8913D2  Unknown               Unknown  Unknown
ntdll.dll          00007FFE2B9E54E4  Unknown               Unknown  Unknown

This set of options appears to cause two confounding things to happen: 1) the loop in subroutine a is vectorized and 2) instead of the implementation of exp() in the Intel libm, you get the exp() in MSVCRT. The vectorization of the loop involves loading two double values into an XMM register (say xmmX). Since exp() is not the vectorized version, it first loads the low quadword of xmmX into the argument register for exp() (namely xmm0). If there is more than one element in b, then it uses pshufd xmm0,xmmX,0eh  to put the high quadword of xmmX in xmm0. This may also put some non-zero bits in the high quadword of xmm0. The MSVCRT implementation of exp() doesn't really like this; at some point it (pointlessly) tries to convert the high quadword of xmm0 into a 32-bit integer, which, depending on the exact bit pattern, may produce an invalid integer.

I think the code in MSVCRT exp() is clearly wrong; it shouldn't be converting unused values. But, I don't know that the Fortran should be supplying the argument in xmm0 with a non-zero high quadword either. What do you think? Is this a bug in the Fortran?

Thanks,
Allen

0 Kudos
7 Replies
mecej4
Honored Contributor III
776 Views

Allen Barnett wrote:
1) the loop in subroutine a is vectorized and 2) instead of the implementation of exp() in the Intel libm, you get the exp() in MSVCRT.

The situation that I see is a bit different from the one in that description, and occurs only with a 64-bit target. The loop is (i) unrolled, and (ii) merged into the main program. There are lots of vector instructions, but they do things only up to the call to exp(), which itself occurs with exactly one double precision argument passed in the lower half (quad-word) of XMM0.

The code for the subroutine still exists in the OBJ and EXE, but there is no call to the subroutine, and a smart linker, supported by suitable user-specified options, could remove the dead code. Looking for references to exp in the OBJ file, I see:

s:\lang>dumpbin /disasm well.obj  | findstr exp
  000000000000005F: E8 00 00 00 00     call        exp
  000000000000006E: E8 00 00 00 00     call        exp    <<== causes exception in MSVCRT
  0000000000000111: E8 00 00 00 00     call        exp
  0000000000000120: E8 00 00 00 00     call        exp

The first two calls are in the main program. Looking at the traceback shows you that it is the second call that caused the exception ( the reported address ends with "073", which matches 06E + 5 bytes (the length of the instruction that caused the exception). The third and fourth calls to exp are in the dead code of the subroutine.

Interestingly, if the two values in your initializer for the array b are interchanged, the crash still happens with the second call. This finding supports the conjecture that there is some memory/register corruption in the MSVCRT code that is only loosely related to value of the argument to exp.

P.S.: The third operand of the VPSHUFD instruction before the second call to exp is 0Eh. Should that not be 4Eh, to swap the two 64-bit floats? Should it use some other way of putting a double precision value into bits 64-127 of the register -- a value whose integer part is not so large that an exception will occur when a CVTTxx2xx instruction is applied to this operand later?

0 Kudos
Allen_Barnett
Beginner
776 Views

Hi mecej4: I should have mentioned this is 64-bit; I didn't test it on 32-bits.

The code itself is just an excerpt from my production code; the routine is not in-lined in that case; but the basic unrolling and vectorization of the loop is the same. The old version of the compiler (14.0 or there abouts) did not try to vectorize this loop and consequently, every call to exp() had the high quadword of xmm0 zero'd out. So, as you observe, it's always the second call to exp() which produces the exception when exp() is invoked with the shuffled double precision value. (Well, I admit to finessing the arguments a bit to cause this result; but they're actual values computed by my production code where I observed this problem.)

Thanks,
Allen

0 Kudos
mecej4
Honored Contributor III
776 Views

Here are some specific details on why an exception occurs. In summary, some operations are performed on a portion of a XMM/YMM register that is passed to EXP without being zeroed. Within the MSVCR120 code for exp(), there is a pair of instructions that truncate a float to its integer value:

   cvttpd2dq  xmm2,xmm1

   cvttdq2pd xmm1,xmm2

The unused part of the argument is not zeroed by the caller; the debugger shows 0xE0000000 for that part, this corresponds to -3.7..E+19, which is too big for its integer part to fit into 32 bits. Since -fpe:0 was specified, an exception is taken when the first of these two instructions is executed.

The exp() in libmmd.dll does not run into this issue since it copies the true argument into the uninitialized part fairly early in the exp() code, with the instruction

    unpcklpd    xmm0,xmm0

However, I am not well-acquainted with the rules regarding who is responsible for fixing this: the caller or the callee. Had the uninitialized value not caused an exception, exp() would have returned a correct result, so MS could say, "zero out the unused part of the register before calling". Intel, on the other hand, could say, "the library routine should zero out the unused part of the register in advance so as to avoid unwarranted exceptions".  Nor is it clear what the C or Fortran programmer is responsible for doing in order to avoid this issue.

0 Kudos
Allen_Barnett
Beginner
776 Views

Yeah. That was basically my question: Is a non-zero value in the high quadword allowed by the MSVCRT ABI? This is really all I could find:

https://msdn.microsoft.com/en-us/library/zthk2dkh.aspx

It sort of suggests that the unused parts of a register will be ignored, but it only states that explicitly for the regular register (Rxx) arguments.

Thanks,
Allen

0 Kudos
Kevin_D_Intel
Employee
776 Views

Thank you both for the excellent discussion and analysis. I will share all the details with Development for them analyze and keep this thread updated about what I hear back.

(Internal tracking id: TBD)

0 Kudos
mecej4
Honored Contributor III
776 Views

I find that the problem also occurs with IFort 16.0.4 and 17.0.1 on an older system with a Core-2 Duo E8400 CPU, as well, with /QxAVX replaced by /Qxhost. On this system, IFort 13.1.3.198 does not have this problem (all are X64 versions).

Comparing the code generated by that old compiler with that from the current compiler, I noticed that the probable culprit is the generation of the CVTPS2PD instruction by IFort2017 prior to the call to exp(), in contrast to CVTSS2SD with the old compiler.

0 Kudos
TimP
Honored Contributor III
776 Views

The substitution of parallel for serial instruction might have been done to eliminate unexpected partial register stalls on some CPU model, but clearing the register with parallel XOR might work better.

0 Kudos
Reply