Solved: Access violation when trying to take MOD of 0.0 (64 bit real)

Karanta__Antti · ‎10-20-2020

I have, in a couple of places in our code, run into an access violation when using the AMOD / DMOD intrinsic functions to take a mod of 0.0 (64 bit reals), i.e. with 0.0 as the first argument. Has anyone else encountered this?

Unless I have my math entirely wrong, mod(0.0,n) where n /= 0.0 is 0. And even if it weren't, I would expect a NaN, not an access violation.

I have tried but been unable to write a minimal sample that reproduces this, so the cause could of course be e.g. memory corruption in our code. Still, I wonder: why MOD, and why only when the first argument is zero? This has also occurred in quite distinct parts of the code base (all using MOD).

Just plain single threaded execution, no openmp or other form of parallelism involved.

Environment: Win 10 (1909 build 19363.595), ifort 19.0.5.281 Build 20190815, targeting x64

IanH · ‎10-28-2020

Given the code I wildly guess in the absence of details .... you are using /Qfp-stack-check in your compile options. If so remove it - you are not using the x87 FPU.

Arjen_Markus · ‎10-20-2020

Do you have a small self-contained program that illustrates this? That would make it much easier to analyse the problem. (From a mathematical point of view you are right, of course. The answer should be zero)

Bernard · ‎10-20-2020

Do you have a minidump file?

Or can you run your program under windbg or VS debugger?

mecej4 · ‎10-20-2020

There is a bit of inconsistency in the wording of the problem report. The constant 0.0 is not a 64-bit real. In Ifort, whether targeting 32-bit or 64-bit Windows, that constant (or, presumably, a variable with that value) would be a 32-bit IEEE floating point number.

The two arguments to the MOD function are required to be of the same type. It is not clear whether the variable n has been declared to be of type REAL.
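
To see the distinction in practice, here is a minimal sketch. The program name is made up for illustration, and it assumes that no kind-changing options such as /4R8 are in effect. It prints the storage size of the two literal forms and then calls the generic MOD with two double precision arguments.

```fortran
program literal_kinds
    implicit none
    double precision :: a, p

    ! Assumption: compiled without /4R8 or similar kind-changing options.
    ! storage_size returns the size in bits of its argument's type and kind.
    print *, 'bits in 0.0   :', storage_size(0.0)
    print *, 'bits in 0.0d0 :', storage_size(0.0d0)

    a = 0.0d0
    p = 1.0d-3
    ! The generic MOD resolves on the argument kind; both arguments must agree.
    print *, 'mod(a, p) =', mod(a, p)
end program literal_kinds
```

With default settings the two literals typically report different sizes, which is why the kind of the variables (and any options that change default kinds) matters when reading a report like this one.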

JohnNichols · ‎10-20-2020

!  Console10.f90 
!
!  FUNCTIONS:
!  Console10 - Entry point of console application.
!

!****************************************************************************
!
!  PROGRAM: Console10
!
!  PURPOSE:  Entry point for the console application.
!
!****************************************************************************

    program Console10

    implicit none

    real*8 t, t1,t3
    ! Variables

    ! Body of Console10
    print *, 'Hello World'
    t = 5.0d0
    t1 = 0.0d0
    t3 = 3.0d0
    t = amod(t1,t3)
    write(*,*)t
    end program Console10

1. It will not compile if the t's are not all of the same type.

2. It runs with every combination of amod, dmod and mod, for both 32-bit and x64 targets.

IanH · ‎10-20-2020

For a little while now ifort with certain command line options (/fpe:0 /MD) will (often/always?) throw an access violation while trying to handle/report a floating point exception.

Check that n in the expression is actually non-zero. Also check that preceding floating point operations haven't done something that might have resulted in an exceptional value.

(Looks like this might be fixed in the current beta.)
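
The two checks suggested above could be sketched roughly as follows. This is only an illustration: the module name mod_guard and the function mod_is_safe are hypothetical, and the sketch assumes the compiler provides the standard ieee_arithmetic module.

```fortran
module mod_guard
    use, intrinsic :: ieee_arithmetic
    implicit none
contains
    ! Hypothetical helper: .true. only when both operands are finite
    ! (neither NaN nor infinite) and the divisor is non-zero.
    logical function mod_is_safe(a, p)
        double precision, intent(in) :: a, p

        mod_is_safe = .false.
        if (.not. ieee_is_finite(a)) return
        if (.not. ieee_is_finite(p)) return
        if (p == 0.0d0) return
        mod_is_safe = .true.
    end function mod_is_safe
end module mod_guard

program guard_demo
    use, intrinsic :: ieee_arithmetic
    use mod_guard
    implicit none
    logical :: raised(3)

    ! Report sticky flags left behind by earlier floating point operations.
    ! ieee_usual is [ieee_overflow, ieee_divide_by_zero, ieee_invalid].
    call ieee_get_flag(ieee_usual, raised)
    print *, 'overflow, divide-by-zero, invalid already raised:', raised

    ! Only call MOD when the operands pass the guard.
    if (mod_is_safe(0.0d0, 1.0d-3)) print *, 'mod =', mod(0.0d0, 1.0d-3)
    if (.not. mod_is_safe(1.0d-3, 0.0d0)) print *, 'zero divisor, MOD skipped'
end program guard_demo
```

Placing a call like the ieee_get_flag one just before the suspect MOD is one way to tell whether an earlier operation already raised an exception.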

Karanta__Antti · ‎10-21-2020

Thanks for the input, everyone who answered!

To address the questions posed:

Bernard:
I have debugged using Visual Studio 2019. I can produce a dump file, but it is 1.2GB. Zips to just under 300MB, but still quite huge. Not sure if I have that much space on any cloud drive so I could even upload it...

mecej4:
Sorry for being a bit imprecise. We are using the compiler options /4I8 /4R8 to default integers and reals to eight bytes, so real and real*8 should be equivalent. As I stated, the code was using 8 byte reals. The values weren't literals in the code; what I tried to say is that in an expression AMOD(A,B), A was 0.0 and B was 1.0d-3, i.e. nonzero (0x3F50624DD2F1A9FC in hex to be very precise, but I wouldn't think that makes a difference).
To be very exact, the expression is
QMOD = AMOD( ABS( U1 - U0 ), DUP )
where QMOD, U0, U1 and DUP are declared as REAL and U0 == U1 == 0.0 and DUP == 1.0d-3

Also tried DMOD instead of AMOD, but the result was the same.

John:
Thanks for the code sample. Very similar to what I tried when I tried to reproduce a minimal sample. Please see a variation below.

Ian:
Playing around, I was able to produce an access violation with compiler options /MD /fpe:0 and also separately with options /Qinit:snan /libs:dll using the following source:

! produces access violation when compiled either of the following ways:
! ifort /fpe:0 /MD /nologo modtest.f90
! ifort /Qinit:snan /libs:dll /nologo modtest.f90
program Modtest

implicit none

real :: r, r1, r2

r1 = 1.0d-3
r2 = 0.0d0
r = mod(r1,r2)
write(*,*) r

end program

However, this is not exactly the case I have. If I switch the values of r1 and r2, this minimal program does not produce an access violation, but our real program does.
I tried carefully debugging to see if there would be an illegal floating point operation somewhere before the AMOD/DMOD call, but could not yet find one.

But I can confirm that this access violation is indeed related to floating point error checking - if I turn off (i.e. do not turn on) floating point error checking in our program, the access violation goes away.

We do use the compiler options /threads and /libs:dll which, if I am reading the documentation correctly, amounts to the same thing as /MD.
We do not use /fpe:0, but instead (conditionally) turn on fp exception checking using _controlfp_s (https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/controlfp-s?view=vs-2019). Which I imagine produces the same end result.

Note: I have (in June this year) already filed a bug report about the /Qinit:snan /libs:dll combo case: https://supporttickets.intel.com/requestdetail?id=5000P00000qEjIVQA0&lang=null
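
For comparison, standard Fortran offers its own way to switch trapping on at run time. The sketch below is not what the code described above does (that uses _controlfp_s); it is a hedged counterpart using the ieee_exceptions module, with a hypothetical module fp_traps and function enable_invalid_trap, and it assumes the compiler supports halting control.

```fortran
module fp_traps
    use, intrinsic :: ieee_exceptions
    implicit none
contains
    ! Hypothetical helper: tries to enable halting on invalid operations
    ! and returns the halting mode that is in effect afterwards.
    logical function enable_invalid_trap()
        enable_invalid_trap = .false.
        if (.not. ieee_support_halting(ieee_invalid)) return
        ! Clear any stale flag first so that only new exceptions trap.
        call ieee_set_flag(ieee_invalid, .false.)
        call ieee_set_halting_mode(ieee_invalid, .true.)
        call ieee_get_halting_mode(ieee_invalid, enable_invalid_trap)
    end function enable_invalid_trap
end module fp_traps

program trap_demo
    use fp_traps
    implicit none

    print *, 'halting on invalid enabled:', enable_invalid_trap()
end program trap_demo
```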

Bernard · ‎10-21-2020

>>>But I can confirm that this access violation is indeed related to floating point error checking - if I turn off (i.e. do not turn on) floating point error checking in our program, the access violation goes away.>>>

Does the dump file, when opened in windbg, point you to the so-called "Faulting IP"?

If I remember correctly, the trap frame should be present; it contains the context collected by the system at the time of the access violation exception.

Karanta__Antti · ‎10-21-2020

Bernard - I have to admit not being familiar w/ doing postmortems using dump files and windbg.

I opened the dump file in windbg and ran the analysis. I don't notice anything out of the ordinary that I did not already know, but I may be missing something. If you want to have a look, please find the windbg output attached. It does not say anything about faulting ip.

Bernard · ‎10-22-2020

The eax register contains zero when the instruction at offset napaapi+0x543c8d2 executes. You should check the previous frame and its arguments. Start from the arguments of the function containing the faulting IP and work your way towards the faulting IP. The most important thing is to find the code that set the rax register to 0.

Of course, you should do this during live debugging (not postmortem analysis).

Karanta__Antti · ‎10-28-2020

I'm a bit out of my league here. I have little (more than none, less than adequate) experience in assembler and even that is a long time back.

Anyhow, here's an excerpt from the disassembly from VS debugger:

END IF

00007FFCFED2D659 mov rax,qword ptr [U0] 
00007FFCFED2D660 movsd xmm0,mmword ptr [U1] 
00007FFCFED2D665 movsd xmm1,mmword ptr [rax] 
00007FFCFED2D669 subsd xmm0,xmm1 
00007FFCFED2D66D andpd xmm0,xmmword ptr [GR_CDEV_X_WINDOW_SYSTEM+0ACh (07FFD06EA20F0h)] 
00007FFCFED2D675 mov rax,qword ptr [DUP] 
00007FFCFED2D67C movsd xmm1,mmword ptr [rax] 
00007FFCFED2D680 call fmod (07FFD06224B62h) 
00007FFCFED2D685 fldz 
00007FFCFED2D687 fldz 
00007FFCFED2D689 fldz 
00007FFCFED2D68B fldz 
00007FFCFED2D68D fldz 
00007FFCFED2D68F fldz 
00007FFCFED2D691 fldz 
00007FFCFED2D693 fldz 
00007FFCFED2D695 mov qword ptr [rbp+0E0h],rax 
00007FFCFED2D69C fnstsw ax 
00007FFCFED2D69E test ax,40h 
00007FFCFED2D6A2 je GR61+754h (07FFCFED2D6A8h) 
00007FFCFED2D6A4 xor eax,eax 
00007FFCFED2D6A6 mov dword ptr [rax],eax 
00007FFCFED2D6A8 mov rax,qword ptr [rbp+0E0h] 
00007FFCFED2D6AF fstp st(0) 
00007FFCFED2D6B1 fstp st(0) 
00007FFCFED2D6B3 fstp st(0) 
00007FFCFED2D6B5 fstp st(0) 
00007FFCFED2D6B7 fstp st(0) 
00007FFCFED2D6B9 fstp st(0) 
00007FFCFED2D6BB fstp st(0) 
00007FFCFED2D6BD fstp st(0) 
00007FFCFED2D6BF movsd mmword ptr [rbp+130h],xmm0 
00007FFCFED2D6C7 mov byte ptr [rbp+0BFh],1 
00007FFCFED2D6CE movsd xmm0,mmword ptr [rbp+130h] 
00007FFCFED2D6D6 movsd mmword ptr [QMOD],xmm0 
QMOD = DMOD( ABS( U1 - U0 ), DUP )

Debugging I can see that the line setting eax (and thus rax) to zero is this one:

00007FFCFED2D6A4 xor eax,eax

The access violation is raised by the next line, i.e.

00007FFCFED2D6A6 mov dword ptr [rax],eax

Seems a bit strange that the compiler would produce these lines, as (if I understand the lines correctly), they will always cause an access violation.

IanH · ‎10-28-2020

Given the code I wildly guess in the absence of details .... you are using /Qfp-stack-check in your compile options. If so remove it - you are not using the x87 FPU.

Karanta__Antti · ‎10-29-2020

Thanks Ian, this is it! Removing this option fixed the access violation.

From the documentation of /Qfp-stack-check I would think that it serves a purpose also on x64:

https://software.intel.com/content/www/us/en/develop/documentation/fortran-compiler-developer-guide-and-reference/top/compiler-reference/floating-point-operations/understanding-floating-point-operations/checking-the-floating-point-stack-state.html

Further: the documentation of this option says that when a problem is detected, there will be an access violation so the problem can be detected quicker. Could it be that the access violation was due to detecting some actual fp problem? If I understood correctly, the check on x64 is that one does not call a function returning a real using a wrong prototype.

IanH · ‎10-29-2020

You would have to be doing something pretty obscure. The x87 floating point stack is not used on Windows x64 for argument passing.

Bernard · ‎10-29-2020

@Karanta__Antti wrote:

Seems a bit strange that the compiler would produce these lines, as (if I understand the lines correctly), they will always cause an access violation.

Your reasoning is correct: rax holds 0x0 at that point, so the store targets address 0x0 and the CPU raises an access violation. At first I did not understand what the compiler's "intention" was in storing a 0x0 value at address 0x0.
Or maybe it was done on purpose, to trap on some invalid condition and point the programmer quickly to the location of the fault.

Upon further analysis it became clear that the instruction at 0x7FFCFED2D69C (fnstsw ax) stores the x87 FPU status word without first checking for pending unmasked floating-point exceptions. The code then tests whether the SF (stack fault) bit, mask 40h, is set.

00007FFCFED2D695 mov qword ptr [rbp+0E0h],rax 
00007FFCFED2D69C fnstsw ax 
00007FFCFED2D69E test ax,40h 
00007FFCFED2D6A2 je GR61+754h (07FFCFED2D6A8h)

I'm glad the issue was solved.
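
To make the bit test in that excerpt concrete, here is a small sketch that decodes two fields of an x87 status word. The module and function names are hypothetical, and the sample value 0041h is invented for illustration; it is not taken from the dump.

```fortran
module x87_status
    implicit none
    integer, parameter :: sf_bit = 6   ! stack fault flag, mask 40h
contains
    ! .true. when the SF (stack fault) bit is set in status word sw.
    pure logical function stack_fault(sw)
        integer, intent(in) :: sw
        stack_fault = btest(sw, sf_bit)
    end function stack_fault

    ! TOP field (bits 11..13): index of the current top of the x87 stack.
    pure integer function stack_top(sw)
        integer, intent(in) :: sw
        stack_top = ibits(sw, 11, 3)
    end function stack_top
end module x87_status

program decode_demo
    use x87_status
    implicit none
    integer :: sw

    sw = int(z'0041')   ! invented sample: IE (bit 0) and SF (bit 6) set
    print *, 'stack fault:', stack_fault(sw), ' top:', stack_top(sw)
end program decode_demo
```

The test ax,40h / je pair in the disassembly performs the same check as stack_fault here: the deliberate store to address 0 is skipped only when the SF bit is clear.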

IanH · ‎10-21-2020

/Qinit:snan also enables floating point exceptions.

Bernard · ‎10-23-2020

>>>It does not say anything about faulting ip.>>>

I suppose that this is a term coined by Microsoft.

It means the instruction pointer (IP) of the instruction that caused the exception. In the case of your windbg analysis, it is in the stored trap frame.