Floating invalid only when vectorizing

Andersson__Per · 08-12-2019

Dear all

First of all I must apologize: for legal reasons I can't show any actual code, and I have not been able to draft dummy code that reproduces the problem I am experiencing. The code in question does scientific HPC calculations and is of the order of 200 000 lines long. Core parts of it are F77, and my job is to update the code and turn everything into more modern Fortran. Serial performance is a very high priority as we are pushing the limits of parallelization at a few thousand cores. The core subroutines (serial; all parallelization is done at a higher level) are hit billions of times, and the code in those subroutines is optimized for vectorization. Again for legal reasons, I cannot copy any output or code from the system where the code is running, so I hope you can all see beyond any obvious typos.

The code used to run without any problems, but after an update of parts not related(?) to the core routines I am getting error (65), floating invalid, right off the bat in one of the core subroutines. It is obviously something I have done, but I cannot for the life of me figure it out, and the problem has turned out to be very difficult to debug. I only see the error when the compiler vectorizes the code. Anything I do that turns vectorization off (-novec, -O0 or -O1, adding any form of -check, or breaking up the vectorized part with a print statement) also "fixes" the problem, and the code runs without any errors. I use ifort 19.0.4.227 20190416.

The update I did was to change the definition of the local floats in a few supporting subroutines and functions from double precision to selected_real_kind(15,307). All floats in modules are still defined as double precision and the same is true for all local floats in the core routines. The code is compiled with the options " -mcmodel=large -align array64byte -xCORE-AVX512 -O3 -qopt-zmm-usage=low -fp-model fast=2 -g -traceback -qopt-report=5" and that has not been changed. The code will for some reason always crash with -qopt-zmm-usage=high and Intel's experts have not been able to diagnose that problem which of course is of some concern.

The run-time error and Intel Inspector point to a specific line in the code, and Inspector also indicates that something goes partially wrong when an array used in the aforementioned line is allocated. The array in question is defined in a module

double precision, dimension(:), allocatable :: A, B, C

and allocated as

allocate(A(1:nj))

Inspector reports an Invalid partial memory access when the array is allocated. When I add code to inquire the status of the allocation I get STAT=0. The array A is used without any problem earlier in the code, before the core routine where the code crashes. The segment of the core subroutine where the code crashes looks like this:

!DIR$ ASSUME_ALIGNED A(1):64
!DIR$ ASSUME_ALIGNED B(1):64
!DIR$ ASSUME_ALIGNED C(1):64

!DIR$ IVDEP

do j = 1, nj

  locarray1(j) = locvar1 + locvar2*B(j)/A(j) + locvar3*C(j)  ! CRASH HERE

end do

All local arrays and variables are defined as double precision and I can inspect A, B and C and all other variables outside of the loop and the bounds are well defined and all values are non-NaN and not zero (A ~ 0.5). Inspector reports Invalid memory access at the offending line.

I have tried to align the arrays A, B, and C "by hand" by using !DIR$ attributes align: 64 :: A, B,C in the module where the arrays are defined without any difference. In the optrpt-file the compiler reports

vectorization support: reference A(j) has aligned access

and the same for locarray1, B, and C. The loop in question is vectorized without a peel loop but with a small remainder loop.

Sorry for the wall of text and lack of actual code. I hope I could make my case anyway. It is frustrating to debug a code where any attempt to look closer at the problem makes it go away. Could it be that the KIND parameter used in the updated subroutines messes the alignment up in some global context? Is the "new" KIND parameter not compatible with -align array64byte?

Best regards

Per

Steve_Lionel · 08-12-2019

How you specify the KIND has no impact here. I am not familiar with the -qopt-zmm-usage option.

All I can offer is a generic suggestion that the set of symptoms you describe suggest data corruption earlier in the execution. Are you sure that array A is fully initialized before it is used?
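
For readers who want to confirm the KIND point on their own system, here is a minimal sketch. The program name and the constants dp_old and dp_new are made up for illustration. On compilers where double precision is IEEE binary64, which includes ifort, both expressions are expected to give the same kind value, in which case the two declarations denote the same type.

```fortran
program kind_check
  implicit none
  ! Hypothetical names: dp_old is the old declaration style, dp_new the updated one
  integer, parameter :: dp_old = kind(1.0d0)
  integer, parameter :: dp_new = selected_real_kind(15, 307)

  print *, 'kind(1.0d0)                =', dp_old
  print *, 'selected_real_kind(15,307) =', dp_new

  if (dp_old == dp_new) then
     print *, 'Same kind: both declarations denote the same type'
  else
     print *, 'Different kinds: check interfaces for mismatches'
  end if
end program kind_check
```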

Andersson__Per · 08-12-2019

Thank you for looking at my problem! The array is initialized and populated coming into the subroutine in question, and it is used without any problem earlier in the code. In an effort to figure out what was going on I printed all the incoming arrays at the top of the troublesome subroutine, before any vectorized loop, and there are no NaNs or zeroes and the bounds are correct. I have also tried compiling the code with -fpe0 (both with and without -no-ftz) and again printed the arrays. Same result: the code crashes at the same line and nowhere else. With vectorization turned off everything works fine and I get no error message or warning whatsoever from Inspector.

Is the message from Inspector about the Invalid partial memory access during allocation significant? As I understand the error message it indicates that one or more of the bytes of the 8 used for that double precision float is not logically valid. The offset reported by Inspector indicates a problem in the middle of the array and not at either end.

Per

jimdempseyatthecove · 08-12-2019

In your description in post #1, while you declare A, B and C as aligned, we have no information about the alignment of locarray1.

Is this aligned?
If not, add an alignment attribute.

Also, is locarray1 allocatable or on the stack?
If NOT allocatable, is heap_arrays set to allocate this from the heap?
If NOT, is the stack size large enough to accommodate stack allocation? (Note, you may not be informed of this until a first touch faults out.)

Jim Dempsey

jimdempseyatthecove · 08-12-2019

Additional test:

Allocate the arrays to a full set of cache lines

allocate(A((nj+7) - MOD(nj+7,8)))
... and likewise for B, C, and locarray1

Ensure that the pad elements of A, B and C are not NaN and that the pad elements of A are not 0.0D0

If this makes the symptoms go away, then it is likely a bug in the compiler where a partial vector was computed as a complete vector.
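
A minimal sketch of this test, assuming 8 doubles per 64-byte cache line. The program name, the constant vlen and the variable npad are made up for illustration; only A, B, C and nj come from the thread.

```fortran
program pad_demo
  implicit none
  integer, parameter :: vlen = 8      ! assumed: 8 doubles per 64-byte cache line
  integer :: nj, npad
  double precision, dimension(:), allocatable :: A, B, C

  nj = 110
  ! Round nj up to the next multiple of vlen
  npad = (nj + vlen - 1) - mod(nj + vlen - 1, vlen)

  allocate(A(npad), B(npad), C(npad))

  ! Fill the whole arrays, pad included, with harmless non-zero values
  A = 1.0d0
  B = 1.0d0
  C = 1.0d0

  ! ... populate elements 1:nj with the real data here ...

  print *, 'nj =', nj, ' padded size =', size(A)
end program pad_demo
```

The loop itself should still run over 1:nj only. The pad exists so that a full-width vector operation touching elements nj+1:npad reads valid, non-zero data.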

Jim Dempsey

Andersson__Per · 08-13-2019

Again, thank you for looking at my problem. And again, I am sorry for not being able to show actual source code.

locarray1 is locally defined as double precision locarray1(nj), and according to the original optrpt file the compiler treats it as aligned. I also tried adding an explicit directive in the same way as for A, B, and C and it made no difference: still crashing when vectorizing. I have also tried the two recommendations I got, the stack size limit and the padding. The problem persists after I increased the stack size by two orders of magnitude (but still within the bounds of the available RAM). Regarding the padding, I added it as recommended for the arrays used in the loop that causes problems, and ubound(A) now reports a number for which mod(ubound(A),8) = 0. It used to be 110 for my test input and the padding changed it to 112. The same is true for the other arrays in the loop. I have added an explicit initialization after the arrays have been allocated, populating the whole arrays with 1.0d0. When I print the arrays just before the loop everything looks exactly like it should, no NaNs or zeroes. The code contains of the order of 800 allocatable arrays and in total just under 2000 arrays. Do you think that I have to pad all arrays to test for the bug, or just the ones used in the problematic loop?

It would be interesting to know if the vectorized loop chokes on the first iteration/array index or somewhere else, but any attempt to print any information in the loop or attach a debugger turns the vectorization off and the problem goes away. I did the obvious test and by elimination identified which variable or array was causing the exception and as expected it was A, the divisor.

The obvious solution would be to run the code without vectorization, but the vectorized loops in the core subroutines account for over 90% of the runtime in serial mode, even when slimmed down and with the algorithms tuned, and without that factor of 8 in speed-up the code is more or less useless as we can't throw more cores at the problem. I really appreciate all the help I can get.

Per

Andersson__Per · 08-13-2019

Update: when I use !DIR$ VECTOR ALIGNED instead of !DIR$ IVDEP (which I was recommended to use at Intel vectorization training) together with the padding recommended above, the problem goes away! Without the padding the problem persists. According to both Inspector and Advisor the core subroutines are fully vectorized and there are no invalid floating-point operations any more. Any ideas about what could be going on here?
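
For reference, a minimal sketch of the combination that works, not the actual code. It assumes local allocatable arrays aligned with the ATTRIBUTES ALIGN directive rather than the real module arrays, and the program name, npad and the scalar values are made up for illustration. Compilers other than ifort treat the !DIR$ lines as plain comments.

```fortran
program vector_aligned_demo
  implicit none
  integer, parameter :: nj = 110
  integer :: npad, j
  double precision :: locvar1, locvar2, locvar3
  double precision, dimension(:), allocatable :: A, B, C, locarray1
  ! Request 64-byte alignment for the allocatable arrays (Intel directive)
  !DIR$ ATTRIBUTES ALIGN: 64 :: A, B, C, locarray1

  ! Pad the extent up to a multiple of 8 doubles (one 64-byte cache line)
  npad = (nj + 7) - mod(nj + 7, 8)
  allocate(A(npad), B(npad), C(npad), locarray1(npad))

  ! Illustrative values only; the pad of A must be non-zero and non-NaN
  A = 0.5d0
  B = 1.0d0
  C = 1.0d0
  locvar1 = 1.0d0
  locvar2 = 2.0d0
  locvar3 = 3.0d0

  ! Assert that every array reference in the loop is aligned
  !DIR$ VECTOR ALIGNED
  do j = 1, nj
     locarray1(j) = locvar1 + locvar2*B(j)/A(j) + locvar3*C(j)
  end do

  print *, 'first and last result:', locarray1(1), locarray1(nj)
end program vector_aligned_demo
```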

/Per

jimdempseyatthecove · 08-13-2019

Per,

!DIR$ IVDEP should have been superfluous ... unless the compiler could somehow have assumed that A, B, C and/or locarray1 might be aliased, while you know the loop is not affected by any such alias.

!DIR$ VECTOR ALIGNED should have been superfluous as well (but apparently not) due to the !DIR$ ASSUME_ALIGNED... (and locarray1 being declared locally as aligned). This would appear to be a compiler bug.

Also, the !DIR$ VECTOR ALIGNED "should" only affect (eliminate) the peel code tests, but not affect any remainder test and potential execution.

I suggest that you restore the stack size declaration to what it was, comment in the code at the loop as to what you discovered .AND. comment in a compendium of Quirks for this project.

Glad you have it working, and you can move on now.

If I were to make a "flying leap" of a guess as to what caused this bug to come to light: -O3 on V19.0.4... is enabling inter-procedural optimizations and was inlining the code and something got mucked up in the process. Did (does) your optimization report indicate the code was inlined?

Jim Dempsey

Andersson__Per · 08-13-2019

Well, the inside of such a complicated machinery as an optimizing compiler is probably best left alone by us mere mortals, but I noted that the code really was inlined, and that it runs with padding and VECTOR ALIGNED but crashes if either is removed. I have a solution that works for me, and the results for all parts of my test suite look good. Again, thank you for the recommendation to pad the arrays.

Per