-march=corei7 slower than -march=core2 on Intel Haswell processor.

gn164 · 07-15-2018

With ifort 18.0, compiling with -march=corei7 makes one of the critical loops of my program around 25% slower on Haswell than compiling with -march=core2.

Do both options generate code that is optimized for Intel processors?

According to the help for -march:

corei7

Generates code for processors that support Intel® SSE4 Efficient Accelerated String and Text Processing instructions. May also generate code for Intel® SSE4 Vectorizing Compiler and Media Accelerator, Intel® SSE3, SSE2, SSE, and SSSE3 instructions.

core2

Generates code for the Intel® Core™ 2 processor family.

Given this, am I right to assume that the core2 option is Intel processor-specific, while corei7 covers all processors that support the given instruction set?

mecej4 · 07-17-2018

Typically, -march options are intended to enable the compiler to generate faster machine code on the specified CPU family for a wide spectrum of user source code.

That -march=core2 generates faster code than -march=corei7 is, perhaps, an anomaly, but keep in mind that corei7 is not the same as haswell. For example, I have a laptop with a Haswell i5; I would not expect all corei7 features to be available on the i5. The start-up code of your application may detect this kind of issue, and choose generic code paths for the rest of the program run, which would understandably run slower than with -march=core2.

TimP · 07-18-2018

If you're interested in this subject, you may wish to compare performance and examine opt-report for those options, and for -xHost.

Your application may see little benefit from the choice of SSE4 instructions. You may be vectorizing loops which are too short, if the compiler doesn't have sufficient information to know this, or needs -align=array32byte. One option may default to an unroll factor more suitable for your case. You have facilities to control these locally with directives as well as by changing compile line options.

I don't know whether the -march options are well tested or optimized, or whether they translate to specific choices of -x options such as -xSSE4.1 and -xSSSE3.

I wouldn't expect the -march options to generate processor-dependent code, although library functions will retain the behavior Sham described.

gn164 · 08-07-2018

Hi,

Thank you for your comments. I ran a couple of tests with the -march, -mtune and -x/-ax options to see how these affect vectorization.

One observation is that the -ax options appear to perform better than -mtune.

For the following piece of code (ifort 18, -O3):

         complex(kind=4)  :: sum

         complex(kind=4), allocatable :: local(:,:)
         real(kind=4), allocatable :: ang(:)
         integer nx,ny,nz,ix,iy,iz,i

         nx = 100
         ny = 100
         allocate(local(ny,nx))
         allocate(ang(nx))

         local(1:ny,1:nx) = cmplx(0.5,0.5)
         ang(1:nx) = (/ (real(i), i = 1, nx) /)

         sum = 0.0

         do iy = 1 , ny
            do ix = 1 , nx
               sum = sum +local(iy,ix)*exp(cmplx(0.0,ang(ix)))
            enddo
         enddo

         print *, sum

Vec report for -mtune=core-avx2:

LOOP BEGIN at mtune_test.f90(23,13)
   remark #15542: loop was not vectorized: inner loop was already vectorized

   LOOP BEGIN at mtune_test.f90(22,10)
      remark #15389: vectorization support: reference LOCAL(iy,ix) has unaligned access   [ mtune_test.f90(24,27) ]
      remark #15381: vectorization support: unaligned access used inside loop body
      remark #15305: vectorization support: vector length 4
      remark #15309: vectorization support: normalized vectorization overhead 0.009
      remark #15418: vectorization support: number of FP down converts: double precision to single precision 1   [ mtune_test.f90(20,10) ]
      remark #15301: PERMUTED LOOP WAS VECTORIZED
      remark #15450: unmasked unaligned unit stride loads: 1 
      remark #15475: --- begin vector cost summary ---
      remark #15476: scalar cost: 229 
      remark #15477: vector cost: 84.500  
      remark #15478: estimated potential speedup: 2.700 
      remark #15488: --- end vector cost summary ---
   LOOP END
LOOP END

Vec report for -axcore-avx2:

   LOOP BEGIN at mtune_test.f90(22,10)
      remark #15388: vectorization support: reference LOCAL(iy,ix) has aligned access   [ mtune_test.f90(24,27) ]
      remark #15305: vectorization support: vector length 8
      remark #15399: vectorization support: unroll factor set to 4
      remark #15309: vectorization support: normalized vectorization overhead 0.007
      remark #15418: vectorization support: number of FP down converts: double precision to single precision 1   [ mtune_test.f90(20,10) ]
      remark #15301: PERMUTED LOOP WAS VECTORIZED
      remark #15442: entire loop may be executed in remainder
      remark #15448: unmasked aligned unit stride loads: 1 
      remark #15475: --- begin vector cost summary ---
      remark #15476: scalar cost: 217 
      remark #15477: vector cost: 42.000 
      remark #15478: estimated potential speedup: 4.410 
      remark #15488: --- end vector cost summary ---
   LOOP END

So for the -mtune case it looks like no AVX is actually used, since the vector length is 4.

The -march=core-avx2 flag does appear to generate AVX code, though, and its report matches the one from the -x option.

According to the documentation for the mtune flag:

Performs optimizations for specific processors but does not cause extended instruction sets to be used (unlike -march).

So is it right to say that -mtune=core-avx2 does not actually produce AVX code?

If this is right, isn't it always preferable to use the -axavx options over -mtune, as -axavx can also produce both targeted and generic code paths?

TimP · 08-08-2018

Yes. Although I wasn't aware that ifort now supports -mtune, the gcc documentation (which also applies to gfortran) says that -mtune doesn't permit newer instruction sets; it only allows optimizations which aren't expected to benefit older CPUs.

The reason for loop permutation is obvious, once the compiler points it out; the expensive exp() function may then be hoisted out of the inner loop. I prefer to write such loops such that the permutation isn't needed. Among the reasons is to avoid unexpected variations as a side effect of changes elsewhere.
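As a minimal sketch of writing the loop nest so that no permutation is needed, the version below puts ix in the outer loop and evaluates exp() once per column by hand. It assumes the same array shapes and values as the test posted above; the program name and the variables total and phase are made up for illustration. Note that the summation order differs from the original nest, so the last digits of the result may differ.

```fortran
program hoist_exp
   implicit none
   integer, parameter :: nx = 100, ny = 100
   complex(kind=4), allocatable :: local(:,:)
   real(kind=4), allocatable :: ang(:)
   complex(kind=4) :: total, phase
   integer :: ix, iy, i

   allocate(local(ny,nx), ang(nx))
   local = cmplx(0.5, 0.5)
   ang = (/ (real(i), i = 1, nx) /)

   total = (0.0, 0.0)
   do ix = 1, nx
      ! exp() depends only on ix, so compute it once per outer iteration
      phase = exp(cmplx(0.0, ang(ix)))
      ! iy is the first index of local, so the inner loop is unit stride
      do iy = 1, ny
         total = total + local(iy,ix)*phase
      end do
   end do

   print *, total
end program hoist_exp
```

Whether this is actually faster than the compiler-permuted loop would have to be measured; the point is only that the loop order and the hoisting no longer depend on the optimizer.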

The comment about the possibility of using only the remainder loop is worth considering. With unroll 4 and vector length 8, if there were no reduction, you might expect 3 iterations of the fully optimized loop, consuming 96 elements, and 4 iterations of a scalar remainder loop. At this vector length, a vectorized remainder loop would then be skipped. As you have a reduction, ifort will try to riffle the loop, accumulating multiple sums which must be combined after the loop. With a loop length of only 100, that would be counterproductive, and could result in only a scalar remainder loop being executed. I suppose ifort won't take the length 100 in the allocate as a clue to avoid the optimizations which can't be used at short loop lengths. This might even make the generation of an AVX loop counter-productive. gfortran doesn't attempt riffling, and so might out-perform ifort in such cases. You might try removing the unroll from ifort, if you don't expect to spend much time on long loops. If the "potential speedup" is reduced only slightly, that will surely speed up short loops.
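The iteration arithmetic for the no-reduction case can be checked with a few lines of Fortran. This is only a sketch: the trip count of 100, vector length 8 and unroll factor 4 come from the test code and the -axcore-avx2 report above, and the program and variable names are invented for illustration.

```fortran
program remainder_count
   implicit none
   integer, parameter :: n = 100      ! loop trip count (nx in the test code)
   integer, parameter :: vlen = 8     ! vector length from the opt-report
   integer, parameter :: unroll = 4   ! unroll factor from the opt-report
   integer :: chunk, main_iters, main_elems, leftover

   chunk      = vlen*unroll           ! elements consumed per unrolled vector iteration
   main_iters = n/chunk               ! integer division gives the full iterations
   main_elems = main_iters*chunk      ! elements handled by the main loop
   leftover   = n - main_elems        ! elements left for the remainder loop

   print '(a,i0)', 'elements per main-loop iteration: ', chunk
   print '(a,i0)', 'main-loop iterations:             ', main_iters
   print '(a,i0)', 'elements in main loop:            ', main_elems
   print '(a,i0)', 'remainder elements:               ', leftover
   print '(a,i0)', 'full vectors in remainder:        ', leftover/vlen
end program remainder_count
```

With these inputs the program reports 32 elements per iteration, 3 main-loop iterations covering 96 elements, and 4 remainder elements, which is fewer than one full vector of 8.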

The double to single precision down-converts are probably the result of compiling with more conservative options than complex limited range, which would be implied by -fast=2, although I prefer to try it by itself. Particularly with exp(), the implicit up- and down-conversions are likely to avoid troubles with over- and under-flow, which would be data dependent. These conversions undoubtedly are expensive, not only in the conversions themselves, but in the effective reduction of vector length, and likely in the form of exp() employed. You should note that for gfortran, -ffast-math, which is required for simd optimization of reductions, defaults to cx-limited-range. This is a frequent reason for thinking gfortran performs better on complex.

TimP · 08-08-2018

As to the question about supporting multiple instruction sets, it's nearly certain, as you may be implying, that generating both avx and avx2 paths will be counter-productive. In many cases, the compiler would see it and refrain from generating both paths. Among the differences implied by avx vs. avx2 would be different treatment of possibly unaligned loops.

As core2 appears to be the oldest CPU you intend to support, if you choose -axavx, you will need to add the option to upgrade the default path to core2, or sse or possibly sse4.1 depending on the age of your core2. Most core2 CPUs still around are the variety which supports sse4.1; all support sse3 which is extremely important for complex. sse3 will often be faster than avx, at least for short loops, on account of its specific support for complex. The opt-report may give you some clues, particularly if there are remainder loops. You must take into account stuff like the more frequent execution of remainder loops for avx and that sse3 loops promoted to double precision will not be reported as vectorized even with full simd optimization.

As you can see, the questions you raise about optimization of complex are more "complex" than most people care to deal with.

Steve_Lionel · 08-08-2018

I am sure Tim has more experience with this than I do (given I have never used it myself), but the !DIR$ LOOP COUNT directive can give hints to the compiler as to how long the loop is. This may help in its vectorization.

TimP · 08-09-2018

Steve makes a good point. The loop count directive should guide the compiler to use that as the target length for optimization. At one time, the default target length was supposed to be like !dir$ loop count avg=100, but it appears to be larger in your quotation. You might expect the opt-report to show a reduced unroll factor if the loop count directive has taken effect.

You should look up the loop count documentation. If your loop count varies, you should take advantage of at least the avg option. At least in the past, specifying loop count without avg could reduce performance unnecessarily for counts even slightly different from the one specified.
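A sketch of how the directive might be placed on the inner loop of the test code is shown below. The avg=100 spelling follows the form quoted above; the exact accepted syntax should be checked against the ifort documentation for your version, and the program name and the variable total are made up for illustration. Compilers that do not recognize the directive treat the line as an ordinary comment.

```fortran
program loop_count_hint
   implicit none
   complex(kind=4) :: total
   complex(kind=4), allocatable :: local(:,:)
   real(kind=4), allocatable :: ang(:)
   integer :: nx, ny, ix, iy, i

   nx = 100
   ny = 100
   allocate(local(nx,ny), ang(nx))
   local = cmplx(0.5, 0.5)
   ang = (/ (real(i), i = 1, nx) /)

   total = (0.0, 0.0)
   do iy = 1, ny
      ! Hint that the inner loop typically runs about 100 times.
      ! The directive applies to the DO statement that follows it.
      !DIR$ LOOP COUNT AVG=100
      do ix = 1, nx
         total = total + local(ix,iy)*exp(cmplx(0.0, ang(ix)))
      end do
   end do

   print *, total
end program loop_count_hint
```

Comparing the opt-report with and without the directive (for example the reported unroll factor) is one way to see whether the hint has taken effect.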

The -gap compiler option is supposed to give you hints about vectorization directives, such as suggesting where setting a larger loop count directive may enable vectorization, but it isn't necessarily a help for cases like yours.

If you use module arrays (or, if your code is f77 style with COMMON), the compiler should optimize automatically for the declared size of the arrays. Then the loop count directive would be useful only where you expect the actual loop lengths to be much smaller.
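For illustration, the test arrays could be declared with fixed sizes in a module roughly as follows. This is a sketch under the assumption that the sizes are known at compile time; the module name field_data, the program name and the variable total are hypothetical, and whether the compiler actually tunes the loop for the declared extent would need to be confirmed in the opt-report.

```fortran
module field_data
   implicit none
   ! Extents are compile-time constants, so they are visible to the optimizer
   integer, parameter :: nx = 100, ny = 100
   complex(kind=4) :: local(nx,ny)
   real(kind=4)    :: ang(nx)
end module field_data

program module_arrays
   use field_data
   implicit none
   complex(kind=4) :: total
   integer :: ix, iy, i

   local = cmplx(0.5, 0.5)
   ang = (/ (real(i), i = 1, nx) /)

   total = (0.0, 0.0)
   do iy = 1, ny
      do ix = 1, nx
         total = total + local(ix,iy)*exp(cmplx(0.0, ang(ix)))
      end do
   end do

   print *, total
end program module_arrays
```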

gn164 · 08-10-2018

Hi All,

Thank you for your comments; a number of interesting points have been raised.

Dwelling for a minute on the implicit up and down FP conversions that show up when exp() is used:

The following piece of code (the one posted above, rewritten to avoid the strided memory access so that permutation does not get in the way)

         do iy = 1 , ny
            do ix = 1 , nx
               sum = sum +local(ix,iy)*exp(cmplx(0.0,ang(ix)))
            enddo
         enddo

gives the following report:

LOOP BEGIN at mtune_test.f90(27,10)

   remark #15542: loop was not vectorized: inner loop was already vectorized

   LOOP BEGIN at mtune_test.f90(28,13)
      remark #15388: vectorization support: reference ANG(ix) has aligned access   [ mtune_test.f90(29,54) ]
      remark #15388: vectorization support: reference ANG(ix) has aligned access   [ mtune_test.f90(29,54) ]
      remark #15389: vectorization support: reference LOCAL(ix,iy) has unaligned access   [ mtune_test.f90(29,27) ]
      remark #15381: vectorization support: unaligned access used inside loop body
      remark #15305: vectorization support: vector length 8
      remark #15309: vectorization support: normalized vectorization overhead 0.020
      remark #15417: vectorization support: number of FP up converts: single precision to double precision 1   [ mtune_test.f90(29,40) ]
      remark #15418: vectorization support: number of FP down converts: double precision to single precision 1   [ mtune_test.f90(24,10) ]
      remark #15300: LOOP WAS VECTORIZED
      remark #15448: unmasked aligned unit stride loads: 1
      remark #15450: unmasked unaligned unit stride loads: 1
      remark #15475: --- begin vector cost summary ---
      remark #15476: scalar cost: 217
      remark #15477: vector cost: 31.870
      remark #15478: estimated potential speedup: 5.510
      remark #15482: vectorized math library calls: 2
      remark #15487: type converts: 2
      remark #15488: --- end vector cost summary ---
   LOOP END

   LOOP BEGIN at mtune_test.f90(28,13)
   <Remainder loop for vectorization>
   LOOP END
LOOP END

For the following code (with the variable a declared as a real),

         a = 0.0
         do iy = 1 , ny
            do ix = 1 , nx
               sum = sum +local(ix,iy)*exp(cmplx(a,ang(ix)))
            enddo
         enddo

the report looks as follows:

LOOP BEGIN at mtune_test.f90(28,10)
   remark #15542: loop was not vectorized: inner loop was already vectorized

   LOOP BEGIN at mtune_test.f90(29,13)
      remark #15388: vectorization support: reference ANG(ix) has aligned access   [ mtune_test.f90(30,52) ]
      remark #15389: vectorization support: reference LOCAL(ix,iy) has unaligned access   [ mtune_test.f90(30,27) ]
      remark #15381: vectorization support: unaligned access used inside loop body
      remark #15305: vectorization support: vector length 4
      remark #15309: vectorization support: normalized vectorization overhead 0.045
      remark #15300: LOOP WAS VECTORIZED
      remark #15448: unmasked aligned unit stride loads: 1
      remark #15450: unmasked unaligned unit stride loads: 1
      remark #15475: --- begin vector cost summary ---
      remark #15476: scalar cost: 112
      remark #15477: vector cost: 22.250
      remark #15478: estimated potential speedup: 5.020
      remark #15482: vectorized math library calls: 1
      remark #15488: --- end vector cost summary ---
   LOOP END
LOOP END

That is, there are no FP conversions, the vector length is reduced, and the number of vectorized math library calls is different.

Compilation flags and compiler used for both codes:

Intel(R) Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 18.0.2.199 Build 20180210

Compiler options: -O3 -qopt-report-file=vecreport -qopt-report-phase=vec -qopt-report=5 -xcore-avx2

gn164 · 08-14-2018

Hi,

Could someone help me understand the behavior described in my previous comment?

I would not expect using a variable instead of a floating-point literal to affect the vectoriser that much.

Thanks