I have encountered issues with the combination of vectorization and floating point errors. The code in question is rather complex, ~300 000 lines, where most of the code sets things up for a few subroutines that are hit very, very hard, over and over again for months, with a large set of input files needed. I have been working hard on tweaking the core subroutines to make better use of vectorization, since I have already done what I can with parallelization, but I have run into problems. I know that I will be asked to supply code, but I have not yet been able to produce a meaningful simplified example that reproduces the problem. I am working on it, but at this stage any hint in the right direction would be very helpful. The calculations are run on an offline system, so it is a bit difficult to show longer files and error messages. Sorry about that. For various reasons a lot of the code is still in f77 style. Please note that the code below is not the actual code but a much scaled-down reproduction with as much information as I can supply, and there could be typos from copying it by hand from the offline system.
When I compile and run the code with any compiler flag that disables vectorization of loops it works as it should, but as soon as I turn on auto-vectorization I get floating invalid (65) at the first, very simple vectorized loop. The loop should (?) be foolproof:
    module datamod
      integer, parameter :: nh=1000
      integer, parameter :: nz=360
      integer, parameter :: nnj=400
      double precision, dimension(:,:), allocatable :: mtab
      double precision, dimension(:), allocatable :: volume
    end module datamod

    subroutine allo_all
      use datamod
      allocate(mtab(nh,nz))
      allocate(volume(nnj))
      return
    end

    subroutine calcall(h,z)
      use datamod
      integer j,h,z
      double precision prefac2(nnj)
    !DIR$ ASSUME_ALIGNED mtab(1,1):64
    !DIR$ ASSUME_ALIGNED volume(1):64
    !DIR$ IVDEP
      do j = 1,nj
        prefac2(j) = mtab(h,z)/volume(j)
      end do
The optrpt file says that all variables are aligned and that the loop is vectorized.
I have checked both mtab and volume outside of the loop with vectorization enabled and they are as they should be, and the code runs just fine without vectorization, but with vectorization I get an error at the only line inside the loop. I have tried to turn off vectorization in a lot of different ways, like compiling with -O1 or -C or adding a write statement in the loop, all with the same result: the error disappears.
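As a possible starting point for a standalone reproducer, the snippet above can be wrapped into a complete program. This is only a sketch and is not known to reproduce the crash. The assumptions are: `nj` in the loop is a copying typo for `nnj`, `prefac2` is turned into an output argument so that the loop result is actually used, the fill values for `mtab` and `volume` are made-up stand-ins for the real input, and the program name `repro` is hypothetical. `implicit none` is added because it would turn an undeclared loop bound into a compile-time error, and a simple check for zero or non-finite entries in `volume` runs before the loop.

```fortran
module datamod
  implicit none
  integer, parameter :: nh  = 1000
  integer, parameter :: nz  = 360
  integer, parameter :: nnj = 400
  double precision, dimension(:,:), allocatable :: mtab
  double precision, dimension(:),   allocatable :: volume
end module datamod

subroutine calcall(h, z, prefac2)
  use datamod
  implicit none                       ! an undeclared name such as nj now fails to compile
  integer, intent(in) :: h, z
  double precision, intent(out) :: prefac2(nnj)   ! assumption: returned so the loop is not optimized away
  integer :: j
  ! The alignment assumptions only hold if the arrays really are 64-byte
  ! aligned, e.g. when compiled with -align array64byte as in the question.
  ! Compilers other than ifort treat these lines as plain comments.
  !DIR$ ASSUME_ALIGNED mtab(1,1):64
  !DIR$ ASSUME_ALIGNED volume(1):64
  !DIR$ IVDEP
  do j = 1, nnj                       ! assumption: nj in the post stands for nnj
     prefac2(j) = mtab(h,z) / volume(j)
  end do
end subroutine calcall

program repro
  use, intrinsic :: ieee_arithmetic, only: ieee_is_finite
  use datamod
  implicit none
  double precision :: prefac2(nnj)
  integer :: j, nbad

  allocate(mtab(nh,nz))
  allocate(volume(nnj))

  mtab = 2.0d0                        ! stand-in value, not the real input
  do j = 1, nnj
     volume(j) = dble(j)              ! stand-in values, all non-zero
  end do

  ! Count entries that would make the division invalid or infinite.
  nbad = 0
  do j = 1, nnj
     if (.not. ieee_is_finite(volume(j)) .or. volume(j) == 0.0d0) nbad = nbad + 1
  end do
  print *, 'suspicious volume entries:', nbad

  call calcall(10, 20, prefac2)
  print *, prefac2(1), prefac2(nnj)
end program repro
```

Building this once with the flags listed below and once with vectorization disabled would show whether the loop itself is enough to trigger the error. Aggressive floating-point models may optimize away NaN and infinity tests, so the input check is probably more trustworthy in a build without `-fp-model fast=2`.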
Compiler options used: -O3 -align array64byte -xCORE-AVX512 -qopt-zmm-usage=high -fp-model fast=2
Compiler: ifort 2019.0.0117
OS: CentOS 7.5.1804
CPU: Xeon Gold 6132
I understand that it is both frustrating and difficult to say anything with so little information and code, but there is little I can do there unfortunately. This is probably a simple noob mistake made by me, so any hint or idea is very much appreciated.
If you wish the vectorized divide to have the same numerical behavior as the scalar code, you must set -prec-div (and -prec-sqrt if sqrt is involved). Do you have a reason for setting fast=2?
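To get a feeling for the size of the effect, the sketch below compares a true division with a multiplication by the reciprocal, which is one transformation a compiler may apply when precise division is not requested. It is an illustration only: the program name, the array size and the test values are made up, and the approximations actually generated under `fast=2` may differ from this simple rewrite.

```fortran
program div_vs_recip
  implicit none
  integer, parameter :: n = 400       ! hypothetical size, matches nnj in the question
  double precision :: a, v(n), exact(n), approx(n), relerr
  integer :: j

  a = 3.0d0                           ! made-up numerator
  do j = 1, n
     v(j) = 1.0d0 + dble(j) / 7.0d0   ! made-up, non-zero denominators
  end do

  do j = 1, n
     exact(j)  = a / v(j)             ! one correctly rounded division
     approx(j) = a * (1.0d0 / v(j))   ! reciprocal, then multiply: two roundings
  end do

  relerr = maxval(abs(approx - exact) / abs(exact))
  print *, 'largest relative difference:', relerr
end program div_vs_recip
```

For the comparison to be meaningful this should be compiled with a precise floating-point model, otherwise the compiler may rewrite both expressions into the same instructions.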
Thank you for your comment, it is very much appreciated. I am aware of the difference and that is one of the things I would like to investigate. I am balancing between several different "evils" here. I know that the input data has inaccuracies and that I have had to truncate the physical model, disregarding higher-order corrections to get acceptable turn-around times with a grid fine enough to resolve relevant features. If I already have (acceptable) uncertainties in the model of, let's say, 1%, a numerical error of 10^-8 makes very little difference, but that has to be tested. For me the turn-around time is a hard limit. A perfect answer too late is of no help, unfortunately.
Update: the behaviour is connected to the use of -qopt-zmm-usage. When it is set to low the compiler vectorizes with a smaller gain (according to the optrpt file) but there are no floating invalids. When it is set to high the vectorization is a factor of two "better" but the code always crashes. I noted that the default for -xCORE-AVX512 is low, while Intel Advisor and the optrpt file both recommend high. Is this behaviour to be expected? Is there some compiler option or pragma that can be used to avoid the floating point errors with -qopt-zmm-usage=high, or do I just have to bite the bullet, accept the performance loss and use low?
Edit: the same thing happens if I force (as I understand it) the use of the zmm registers by using !$OMP SIMD SIMDLEN(8)
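For reference, this is roughly what the loop looks like with that directive. It is a sketch under assumptions: the subroutine name `scale_by_volume`, the driver program and the fill values are made up, and the scalar `a` stands in for `mtab(h,z)`. With 64-bit reals, a SIMD length of 8 corresponds to one 512-bit register.

```fortran
subroutine scale_by_volume(n, a, volume, prefac2)
  implicit none
  integer, intent(in) :: n
  double precision, intent(in)  :: a, volume(n)
  double precision, intent(out) :: prefac2(n)
  integer :: j

  ! Request 8 lanes per SIMD chunk: 8 x 64-bit doubles = 512 bits.
  !$omp simd simdlen(8)
  do j = 1, n
     prefac2(j) = a / volume(j)
  end do
end subroutine scale_by_volume

program simd_demo
  implicit none
  integer, parameter :: n = 400       ! hypothetical size, matches nnj in the question
  double precision :: volume(n), prefac2(n)
  integer :: j

  do j = 1, n
     volume(j) = dble(j)              ! made-up, non-zero values
  end do

  call scale_by_volume(n, 2.0d0, volume, prefac2)
  print *, prefac2(1), prefac2(n)
end program simd_demo
```

The directive is only honoured when OpenMP SIMD support is enabled (for example `-qopenmp-simd` with ifort or `-fopenmp-simd` with gfortran); otherwise it is treated as a comment.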