I am exploring the use of Intel Fortran's "go fast" flags to try and make my program "go fast". First, let me say that by default, we build our model with:
-O3 -qopt-report0 -ftz -align all -fno-alias -traceback -convert big_endian -fPIC -fpe0 -fp-model source -heap-arrays 32 -assume noold_maxminloc -align dcommons
These are fairly safe flags (-fp-model source, etc.) and give us reproducibility between runs, between like chips (*wells to *wells), and across MPI layouts: if we alter the layout, we get the same answer. Now, to try and eke out all the performance I can, on the Haswells I'm testing on I build with:
-O3 -xCORE-AVX2 -fma -qopt-report0 -ftz -align all -fno-alias -align array32byte -traceback -convert big_endian -fPIC -fpe3 -fp-model fast=2 -no-prec-div -g -fimf-use-svml -align dcommons
I've tried to turn on every flag I can think of for "make go fast". And this does help: it seems to be ~15% faster or so, which is nothing to sneeze at. However, one main issue is that with these flags, I seem to have killed the MPI layout reproducibility. Everything else is identical: both builds use Intel MPI 2018 as the MPI stack and Intel Fortran 2018 as the compiler. Only the flags differ.
Obviously, I used the "take everything and see what sticks" approach and just took every possible flag I could find for speeding up code. I have no doubt I've messed things up. But I'd like to try and maintain as much speedup as possible. I'm going to start exploring which flag(s) might be causing this, but I thought I'd leverage the experts' help. Each recompile of my program takes ~20 minutes, so it's not going to be a fast process, and any shortcut is welcome.
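For the flag hunt itself, bisection is one way to cut down the number of 20-minute rebuilds. Below is a minimal sketch in Python, assuming that exactly one change is responsible and that the changes act independently (neither is guaranteed). The names `find_culprit`, `is_reproducible` and `CANDIDATES` are hypothetical. `is_reproducible` stands in for whatever "rebuild with the safe flags plus these changes, run 4x24 and 3x18, compare output" script you would write.

```python
from typing import Callable, List, Sequence

# Changes between the safe and the fast flag sets quoted above.
# "-fpe3" and "-fp-model fast=2" replace "-fpe0" and "-fp-model source".
CANDIDATES = [
    "-xCORE-AVX2",
    "-fma",
    "-align array32byte",
    "-fpe3",
    "-fp-model fast=2",
    "-no-prec-div",
    "-g",
    "-fimf-use-svml",
]


def find_culprit(candidates: Sequence[str],
                 is_reproducible: Callable[[List[str]], bool]) -> str:
    """Binary-search for the one change that breaks layout reproducibility.

    is_reproducible(changes) is a user-supplied (hypothetical) callback that
    builds with the safe baseline plus `changes` and reports whether the two
    MPI layouts still give identical answers.
    Assumption: exactly one candidate is the culprit.
    """
    suspects = list(candidates)
    while len(suspects) > 1:
        half = suspects[:len(suspects) // 2]
        if is_reproducible(half):
            # First half is clean, so the culprit is in the second half.
            suspects = suspects[len(suspects) // 2:]
        else:
            suspects = half
    return suspects[0]
```

With eight candidates this needs three build-and-compare cycles instead of up to eight. If two flags only misbehave in combination, the single-culprit assumption fails and the result should be checked by hand.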
Note: I realized that my use of the word "layout" might be odd.
This is what I mean: if I run my executable on 96 (4x24) processors, I get a different answer than running on 54 (3x18) processors. It's like there is a reduction being done differently. However, most of the global sums, etc. in our model are pretty safe (often we cast to REAL64, do the work, and cast back to REAL32) and are not implemented via MPI_Allreduce because of this issue (we only use Allreduce in some calls to get a min or max).
Also, since I can run the same MPI stack 4x24 and 3x18 without the fast opts and get identical results, I don't really tend to blame Intel MPI. And this code has no MKL so MKL_CBWR shouldn't matter.
I guess I'm wondering if there is something the options are triggering that can cause a difference if, say, one processor has 6 processes in one layout but 14 in the other.
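To make the "reduction being done differently" suspicion concrete, here is a small Python sketch of how the same data can sum to different totals depending on how it is partitioned. It is an illustration only, not a diagnosis of this model: the data values and the function names (`split`, `layout_sum`, `fixed_order_sum`) are made up, and Python floats are 64-bit rather than REAL32.

```python
from typing import List, Sequence


def split(values: Sequence[float], nparts: int) -> List[List[float]]:
    """Split values into nparts contiguous chunks, one per hypothetical rank."""
    size, extra = divmod(len(values), nparts)
    chunks, start = [], 0
    for rank in range(nparts):
        stop = start + size + (1 if rank < extra else 0)
        chunks.append(list(values[start:stop]))
        start = stop
    return chunks


def layout_sum(values: Sequence[float], nparts: int) -> float:
    """Each 'rank' sums its own chunk; partial sums are combined in rank order."""
    total = 0.0
    for chunk in split(values, nparts):
        partial = 0.0
        for v in chunk:
            partial += v
        total += partial
    return total


def fixed_order_sum(values: Sequence[float], nparts: int) -> float:
    """Gather the chunks back into global order, then sum sequentially."""
    gathered = [v for chunk in split(values, nparts) for v in chunk]
    total = 0.0
    for v in gathered:
        total += v
    return total


# Made-up data chosen so that rounding depends on the grouping.
values = [1.0, 1.0e16, -1.0e16, 1.0]
for nparts in (2, 4):
    print(nparts, layout_sum(values, nparts), fixed_order_sum(values, nparts))
```

The per-chunk version gives different totals for 2 and 4 parts because floating-point addition is not associative, while the fixed-order version does not depend on the partition. The same grouping effect can in principle appear inside a single rank if the compiler is allowed to reassociate a sum, which value-unsafe settings such as -fp-model fast permit, since the grouping may then depend on the local array length. Whether that is what happens here is a guess that only testing the flags can confirm.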
Are the CPUs the same between 4x24 and 3x18 (iow would they be taking different code branches based on available instructions)?
Can you post the CPU model numbers?