What other arguments does -qopenmp enable?

Andrey_Vladimirov · ‎11-10-2015

We have added "-qopenmp" to the compilation and linking commands for a serial code, and numerical results changed slightly. That is, there are no OpenMP pragmas in the code, no OpenMP support functions, and omp.h is not included, but the results are slightly different. I need to track down the cause of the changes.

Can anybody please advise what other compiler arguments -qopenmp sets? I want to experiment with enabling/disabling some of these arguments to find the cause of the difference.

TimP · ‎11-10-2015

-qopenmp causes some initialization in the OpenMP and pthreads libraries. If you have vectorization with operations sensitive to data alignment but have left alignments unspecified, differences might occur there. Such changes are particularly likely in 32-bit mode or with AVX or AVX2.
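To illustrate why alignment can matter numerically: floating-point addition is not associative, so a vectorized reduction that peels a different number of leading elements (to reach an aligned address) combines the same values in a different order. The following is only a sketch of that effect, not of what the Intel compiler actually emits; the function names and the `peel`/`lanes` parameters are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain left-to-right sum, as a scalar loop would compute it.
float sum_sequential(const std::vector<float>& x) {
    float s = 0.0f;
    for (float v : x) s += v;
    return s;
}

// Mimics a vectorized reduction (hypothetical model): the first `peel`
// elements are added one by one, the remaining elements are distributed
// round-robin over `lanes` independent partial sums, and the partial sums
// are combined at the end.
float sum_lanes(const std::vector<float>& x, std::size_t peel, std::size_t lanes) {
    assert(lanes > 0 && lanes <= 16);
    float head = 0.0f;
    std::size_t i = 0;
    for (; i < peel && i < x.size(); ++i) head += x[i];

    float partial[16] = {};
    for (std::size_t k = 0; i < x.size(); ++i, ++k) partial[k % lanes] += x[i];

    float s = head;
    for (std::size_t l = 0; l < lanes; ++l) s += partial[l];
    return s;
}
```

For the input {1e8f, 1.0f, -1e8f, 1.0f}, and assuming IEEE single-precision evaluation without excess precision, sum_sequential gives 1, sum_lanes with peel 0 and 2 lanes gives 2, and sum_lanes with peel 1 and 2 lanes gives 0. Real data rarely deviates this much, but the mechanism is the same.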

Andrey_Vladimirov · ‎11-11-2015

Tim, thank you! This sounds like my case: compiling for AVX/AVX2 in 64-bit mode.

I compiled with -qopt-report=5 and went through the optimization reports. It seems that -qopenmp added some messages like this:

remark #34014: optimization advice for memcpy: increase the destination's alignment to 16 (and use __assume_aligned) to speed up library implementation
remark #34014: optimization advice for memcpy: increase the source's alignment to 16 (and use __assume_aligned) to speed up library implementation
remark #34026: call to memcpy implemented as a call to optimized library version
remark #34014: optimization advice for memset: increase the destination's alignment to 16 (and use __assume_aligned) to speed up library implementation
remark #34026: call to memset implemented as a call to optimized library version

Does this sound like the operations sensitive to alignment that you mentioned, or should I look somewhere else?

TimP · ‎11-11-2015

I should have mentioned that -fp-model source removes optimizations which are expected to be sensitive to data alignment, so that, together with -qopt-report=4, might help identify where such differences occur. If the differences occur in math libraries, -fimf-arch-consistency=true might have an effect. That option could be applied to specific math functions.

Your diagnostics about memcpy and memset advice appear to confirm that you have some odd alignments. I don't know why those messages would appear only under -qopenmp if you don't have any parallel regions. avx and avx2 may require 32-byte alignment to remove differences due to alignments (when leaving fp-model at default).
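A minimal sketch of requesting the 32-byte alignment mentioned above in standard C++11 and checking it at run time. The type `AlignedBlock` and the helper `misalignment` are hypothetical names used for illustration only; Intel-specific mechanisms such as __assume_aligned are not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical buffer type: alignas(32) requests 32-byte alignment,
// the value suggested above for AVX/AVX2 data.
struct alignas(32) AlignedBlock {
    double data[8];
};

// Returns how many bytes p lies past the previous `alignment`-byte
// boundary (0 means p is aligned).
std::size_t misalignment(const void* p, std::size_t alignment) {
    assert(alignment != 0);
    return static_cast<std::size_t>(
        reinterpret_cast<std::uintptr_t>(p) % alignment);
}
```

With an object `b` of type AlignedBlock, misalignment(&b, 32) is 0, while misalignment(b.data + 1, 32) is 8, because the second double starts 8 bytes past the boundary.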

SergeyKostrov · ‎02-10-2016

I'd like to follow up on that old thread. I'm not sure that these C++ compiler remarks...

>>1 remark #34014: optimization advice for memcpy: increase the destination's alignment to 16
>>(and use __assume_aligned) to speed up library implementation
>>2 remark #34014: optimization advice for memcpy: increase the source's alignment to 16
>>(and use __assume_aligned) to speed up library implementation
>>3 remark #34026: call to memcpy implemented as a call to optimized library version
>>4 remark #34014: optimization advice for memset: increase the destination's alignment to 16
>>(and use __assume_aligned) to speed up library implementation
>>5 remark #34026: call to memset implemented as a call to optimized library version

...are related to the deviating results of the calculations, since all of them concern memory-bound operations, not FPU-bound operations.

>>...and numerical results changed slightly...

I would ask two questions right away: what is the absolute error, and what is the relative error ( I mean between the deviating results )? If the relative error is less than 0.5% or 1%, it could be neglected. In reality, even in ISO 9001 certified X-ray imaging software, which may run on different computers with different Intel CPUs, the results of image post-processing always differ from computer to computer, and that is not a problem as long as the difference does not exceed a threshold defined in the specifications for the software.
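The two error measures asked about above can be computed as in the following sketch. The function names are hypothetical, and the tolerance is whatever the specification for the software in question defines; nothing here is taken from a particular product.

```cpp
#include <cassert>
#include <cmath>

// Absolute error between a reference result and a deviating result.
double absolute_error(double reference, double value) {
    return std::fabs(value - reference);
}

// Relative error with respect to the reference; the reference must be nonzero.
double relative_error(double reference, double value) {
    assert(reference != 0.0);
    return std::fabs(value - reference) / std::fabs(reference);
}

// True if the relative deviation does not exceed rel_tol
// (for example 0.005 for a 0.5% threshold).
bool within_tolerance(double reference, double value, double rel_tol) {
    return relative_error(reference, value) <= rel_tol;
}
```

For instance, a reference of 100.0 and a deviating result of 100.25 give an absolute error of 0.25 and a relative error of about 0.25%, which passes a 0.5% threshold.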

SergeyKostrov · ‎02-10-2016

>>Can anybody please advise what other compiler arguments -qopenmp sets?

I think only C++ compiler engineers could answer that. It also makes sense to look at all the C++ compiler options; in the case of GCC-like compilers these are as follows:

...
-Wopenmp-simd    Warn if a simd directive is overridden by the vectorizer cost model
...
-fopenmp         Enable OpenMP ( implies -frecursive in Fortran )
-fopenmp-simd    Enable OpenMP's SIMD directives
...

SergeyKostrov · ‎02-10-2016

>>...
>>1 remark #34014: optimization advice for memcpy: increase the destination's alignment to 16
>>(and use __assume_aligned) to speed up library implementation
>>...

Here is an example of how it has to be used in code for a GCC-like C++ compiler:

...
template < class T > _RTINLINE RTvoid _MatrixMulProcessing( T * _RTRESTRICT ptA, T * _RTRESTRICT ptB, T * _RTRESTRICT ptC, RTssize_t iM, RTssize_t iK, RTssize_t iN, RTint iNumOfThreads )
{
    ...
    _RTALIGNED T *ptA2 = ( T * )__builtin_assume_aligned( ptA, _RTDEFAULT_ALIGNMENT );
    _RTALIGNED T *ptB2 = ( T * )__builtin_assume_aligned( ptB, _RTDEFAULT_ALIGNMENT );
    _RTALIGNED T *ptC2 = ( T * )__builtin_assume_aligned( ptC, _RTDEFAULT_ALIGNMENT );
    ...
}
...

_RTALIGNED is:

...
#define _RTALIGN04 __attribute__( ( aligned( 4 ) ) )
#define _RTALIGN08 __attribute__( ( aligned( 8 ) ) )
#define _RTALIGN16 __attribute__( ( aligned( 16 ) ) )
#define _RTALIGN32 __attribute__( ( aligned( 32 ) ) )
#define _RTALIGN64 __attribute__( ( aligned( 64 ) ) )
#define _RTALIGNED _RTALIGN64
...

In the case above, _RTALIGNED is configured for Intel ISAs up to AVX2. _RTDEFAULT_ALIGNMENT is:

...
#define _RTDEFAULT_ALIGNMENT _RTPU_CACHELINE_SIZE
...

SergeyKostrov · ‎02-10-2016

>>...avx and avx2 may require 32-byte alignment...

Simply to note that:
- in case of AVX ISA a 32-byte alignment is needed, and
- in case of AVX2 ISA a 64-byte alignment is needed.

It is very easy to verify what alignment is needed: just look at how an __mNNN ( like __m128 or __m256 ) intrinsic data type is declared.

TimP · ‎02-11-2016

There's no reason to require more alignment for AVX2 than for AVX. MIC does need 64-byte alignment.

Certain CPUs, including Nehalem, could show a strong benefit for 32-byte alignment even though 16-byte alignment is sufficient for all aligned moves.

SergeyKostrov · ‎02-11-2016

>>...There's no reason to require more alignment for AVX2 than for AVX...

It contradicts what the Intel Optimization Guides recommend. This is because all 512-bit intrinsic data types, like __m512, __m512d and __m512i, are declared as unions with alignment already "forced" to a 64-byte boundary. I would follow what is officially recommended and, of course, I always verify how data are aligned with a set of very simple macros:

...
/* Evaluates to 1 if p is aligned on an n-byte boundary, 0 otherwise */
#define _RTISALIGNED( p, n ) ( ( ( RTusize_t )( p ) % ( n ) == 0 ) ? 1 : 0 )

/* Prints a diagnostic if p is not aligned on an n-byte boundary */
#define _RTASSERT_ISALIGNED( p, n ) \
{ \
    if( ( ( RTusize_t )( p ) ) % ( n ) != 0 ) \
    { \
        CrtPrintfA( "Pointer is Not aligned on %ld-byte boundary\n", ( long )( n ) ); \
    } \
}

[ Just Note ] A recent problem I had to deal with ( about 2 days ago ) was related to an 8-byte alignment of a memory block allocated by the CRT function malloc in a Debug configuration with the Microsoft C++ compiler, when legacy SSE2 intrinsic functions are called. As you know, they need 16-byte alignment, and an exception is thrown during processing if __m128-like intrinsic data types are only 8-byte aligned.

TimP · ‎02-11-2016

I suppose Microsoft expects you to use _aligned_malloc when you want 16 byte alignment in 32 bit mode or 32 byte alignment in 64 bit mode. Icl should give you more options, unfortunately not fully portable.

I don't see that we can be expected to think avx512 when you say avx2. I expect we will get some experience of avx512 soon.

SergeyKostrov · ‎02-11-2016

That was a nice deviation from the problem... Now,

>>...
>>I want to experiment with enabling/disabling some of these arguments to find the cause of the difference.
>>...

I would take a look at the Floating Point Model options, that is, /fp:fast, /fp:strict and /fp:precise. For a long time I've been using the /fp:fast option for the Floating Point Model ( in all Release configurations ) and the results of computations are very consistent across many platforms with different CPUs.

SergeyKostrov · ‎02-11-2016

>>...I suppose Microsoft expects you to use _aligned_malloc...

That is something I do not use at all; I use Intel's intrinsic _mm_malloc instead. If _mm_malloc is not supported, I use my own malloc with alignment functionality, and it is absolutely portable down to 16-bit platforms, like MS-DOS, PTS-DOS, embedded OSs, etc.
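The poster's own allocator is not shown in the thread. As an illustration only, here is a common way to build aligned allocation on top of plain malloc: over-allocate, round the address up, and keep the original pointer just below the aligned block. The names `aligned_malloc_sketch` and `aligned_free_sketch` are hypothetical, and the sketch assumes std::uintptr_t is available and that `size + alignment` does not overflow.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Allocates `size` bytes aligned to `alignment` (a power of two).
// The pointer returned by malloc is stored in the bytes just below the
// aligned block so that aligned_free_sketch can recover it.
void* aligned_malloc_sketch(std::size_t size, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    void* raw = std::malloc(size + alignment + sizeof(void*));
    if (raw == nullptr) return nullptr;

    // Leave room for the stored pointer, then round up to the boundary.
    std::uintptr_t start = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void*);
    std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
    std::uintptr_t aligned = (start + mask) & ~mask;

    // memcpy avoids a misaligned pointer store for small alignments.
    std::memcpy(reinterpret_cast<unsigned char*>(aligned) - sizeof(void*),
                &raw, sizeof raw);
    return reinterpret_cast<void*>(aligned);
}

// Frees a block obtained from aligned_malloc_sketch; null is ignored.
void aligned_free_sketch(void* p) {
    if (p == nullptr) return;
    void* raw;
    std::memcpy(&raw, static_cast<unsigned char*>(p) - sizeof(void*), sizeof raw);
    std::free(raw);
}
```

A block from aligned_malloc_sketch must be released with aligned_free_sketch, never with free directly, because the returned address is generally not the one malloc produced.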

Bernard · ‎03-05-2016

>>...in case of AVX2 ISA a 64-byte alignment is needed...

Why do you need 64-byte alignment for a union data type that maps directly to a 32-byte-wide YMM register?

TimP · ‎03-05-2016

64-byte alignment is recommended for AVX512 (and may be required for MIC). It's not likely to hurt AVX-256 or AVX2-256, which do exhibit cases where 32-byte alignment is useful.

Compilers targeting AVX2 use a lot of split loads (as recommended for Sandy Bridge), in which case 16-byte alignment is sufficient. I still have the old Westmere and Nehalem boxes, where 64-bit split loads could be faster than 128-bit loads if the data aren't 32-byte aligned.

I don't think many Fortran programmers go beyond the recommendation of setting -align array32byte (or array64byte). Those alignments don't carry over to COMMON, although from the beginning it was recognized that they ought to align the start of a COMMON or equivalent structure.

Bernard · ‎03-06-2016

Actually I use 32-byte alignment for AVX/AVX2 data types and 16-byte alignment for SSE data types.