We have added "-qopenmp" to the compilation and linking commands for a serial code, and numerical results changed slightly. That is, there are no OpenMP pragmas in the code, no OpenMP support functions, and omp.h is not included, but the results are slightly different. I need to track down the cause of the changes.
Can anybody please advise what other compiler arguments -qopenmp sets? I want to experiment with enabling/disabling some of these arguments to find the cause of the difference.
-qopenmp causes some initialization in the OpenMP and pthreads libraries. If you have vectorized operations that are sensitive to data alignment but have left the alignments unspecified, differences might occur there. Such changes are particularly likely in 32-bit mode or with AVX or AVX2.
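To illustrate why alignment can change a floating-point result at all, here is a minimal sketch in C. It is not what the Intel compiler actually generates; sum_peeled is a hypothetical stand-in for a vectorized reduction with a scalar peel loop, where the peel count would depend on where the array happens to start in memory. It assumes IEEE single precision evaluated without excess precision (typical for x86-64 SSE code without fast-math options).

```c
#include <stddef.h>

/* Plain left-to-right sum, as an unvectorized loop would compute it. */
float sum_scalar(const float *x, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += x[i];
    return s;
}

/* Hypothetical model of a vectorized reduction: `peel` leading elements
 * are added one at a time (as a peel loop would do to reach an aligned
 * address), the bulk is accumulated in 4 independent lanes, and any
 * leftover elements are added in a scalar remainder loop. */
float sum_peeled(const float *x, size_t n, size_t peel)
{
    float head = 0.0f, tail = 0.0f;
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    size_t i = 0;

    if (peel > n)
        peel = n;
    for (; i < peel; ++i)            /* peel loop */
        head += x[i];
    for (; i + 4 <= n; i += 4)       /* "vector" body, 4 lanes */
        for (size_t k = 0; k < 4; ++k)
            lane[k] += x[i + k];
    for (; i < n; ++i)               /* remainder loop */
        tail += x[i];

    /* Combine the partial sums; the order differs from sum_scalar. */
    return head + ((lane[0] + lane[1]) + (lane[2] + lane[3])) + tail;
}
```

Calling both functions on an array such as {16777216.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f} should give different sums, and changing the peel argument can change the sum again, because each small addend is rounded away or kept depending on which partial sum it lands in. That is the kind of effect a shifted alignment can produce under the default fp-model.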
Tim, thank you! This sounds like my case: compiling for AVX/AVX2 in 64-bit mode.
I compiled with -qopt-report=5 and went through the optimization reports. It seems that -qopenmp added some messages like this:
remark #34014: optimization advice for memcpy: increase the destination's alignment to 16 (and use __assume_aligned) to speed up library implementation
remark #34014: optimization advice for memcpy: increase the source's alignment to 16 (and use __assume_aligned) to speed up library implementation
remark #34026: call to memcpy implemented as a call to optimized library version
remark #34014: optimization advice for memset: increase the destination's alignment to 16 (and use __assume_aligned) to speed up library implementation
remark #34026: call to memset implemented as a call to optimized library version
Does this sound like the operations sensitive to alignment that you mentioned, or should I look somewhere else?
I should have mentioned that -fp-model source removes optimizations which are expected to be sensitive to data alignment, so that, together with -qopt-report=4, might help identify where such differences occur. If the differences occur in math libraries, -fimf-arch-consistency=true might have an effect. That option could be applied to specific math functions.
Your diagnostics about memcpy and memset advice appear to confirm that you have some odd alignments. I don't know why those messages would appear only under -qopenmp if you don't have any parallel regions. AVX and AVX2 may require 32-byte alignment to remove differences due to alignment (when leaving -fp-model at its default).
There's no reason to require more alignment for AVX2 than for AVX. MIC does need 64-byte alignment.
Certain CPUs, including Nehalem, could show a strong benefit for 32-byte alignment even though 16-byte alignment is sufficient for all aligned moves.
I suppose Microsoft expects you to use _aligned_malloc when you want 16-byte alignment in 32-bit mode or 32-byte alignment in 64-bit mode. ICL should give you more options, though unfortunately they are not fully portable.
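For reference, explicitly aligned allocation might look like the sketch below. It uses C11 aligned_alloc, which is an assumption about the toolchain: Microsoft's C runtime does not provide it, so on Windows you would use _aligned_malloc and _aligned_free instead. The helper names alloc_doubles_aligned and is_aligned are made up for illustration.

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocate `count` doubles starting on an `alignment`-byte boundary.
 * `alignment` must be a power of two supported by the implementation
 * (e.g. 32 for AVX, 64 for AVX512). Release the result with free(). */
double *alloc_doubles_aligned(size_t count, size_t alignment)
{
    size_t bytes = count * sizeof(double);
    /* C11 requires the size to be a multiple of the alignment,
     * so round up (and never request zero bytes). */
    size_t rounded = (bytes + alignment - 1) / alignment * alignment;
    if (rounded == 0)
        rounded = alignment;
    return aligned_alloc(alignment, rounded);
}

/* Return nonzero if `p` sits on an `alignment`-byte boundary. */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}
```

With the start address pinned this way, the vectorized loops see the same alignment on every run and with every set of link options, which removes one source of run-to-run variation. Checking is_aligned on the existing arrays is also a cheap way to see whether their alignment really differs between the two builds.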
I don't see how we can be expected to think of AVX512 when you say AVX2. I expect we will get some experience with AVX512 soon.
64-byte alignment is recommended for AVX512 (and may be required for MIC). It's not likely to hurt AVX-256 or AVX2-256, which do exhibit cases where 32-byte alignment is useful.
Compilers targeting AVX2 use a lot of split loads (as recommended for Sandy Bridge), in which case 16-byte alignment is sufficient. I still have the old Westmere and Nehalem boxes, where 64-bit split loads could be faster than 128-bit loads if the data aren't 32-byte aligned.
I don't think many Fortran programmers go beyond the recommendation of setting -align array32byte (or array64byte). Those alignments don't carry over to COMMON, although from the beginning it was recognized that they ought to align the start of a COMMON or equivalent structure.