All,
A colleague here found an interesting performance regression with Intel Fortran 14. It's probably related to this C regression, but I thought I'd post it here in case there is a more Intel Fortran-specific fix.
To wit, the simple program to look at is:
[fortran]program dotProduct
   implicit none
   integer, parameter :: npts=100000
   integer, parameter :: nobs=1000
   integer, parameter :: nanals=64
   real, allocatable, dimension(:,:) :: analysis_chunk, analysis_ob, dotProd
   integer :: i, k, j
   real :: t1, t2, tmpsum
   integer(kind=8) :: cr, cm
   real(kind=8) :: rate
   integer(kind=8) :: random_start, random_end
   integer(kind=8) :: dot_start, dot_end
   real(kind=8) :: random_time, dot_time

   call system_clock(count_rate=cr)
   call system_clock(count_max=cm)
   rate = real(cr,kind=8)

   allocate(analysis_chunk(npts,nanals))
   allocate(analysis_ob(nobs,nanals))
   allocate(dotProd(nobs,npts))

   call system_clock(random_start)
   call random_number(analysis_ob)
   call random_number(analysis_chunk)
   call system_clock(random_end)
   random_time = (random_end-random_start)/rate
   write (*,111) 'Random generation time:', random_time

   call system_clock(dot_start)
   do i=1,nobs
      do k=1,npts
         dotProd(i,k) = sum(analysis_chunk(k,:)*analysis_ob(i,:))/float(nanals-1)
      enddo
   enddo
   call system_clock(dot_end)
   dot_time = (dot_end-dot_start)/rate
   write (*,111) "dotProd time:", dot_time
   write (*,111) 'dotProd(npts,nobs):', dotProd(npts,nobs)

   deallocate(analysis_chunk)
   deallocate(analysis_ob)
   deallocate(dotProd)

111 format (1X,A,T20,F12.8)
end program dotProduct[/fortran]
Now if I compile with Intel 13:
$ ifort -V
Intel(R) Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 13.1.3.192 Build 20130607
Copyright (C) 1985-2013 Intel Corporation. All rights reserved.
(1091) $ ifort -O3 dot.f90 -o dot.ifort13.exe
(1092) $ ./dot.ifort13.exe
Random generation 0.10362200
dotProd time: 0.82572200
dotProd(npts,nobs) 0.21707585
And now Intel 14:
$ ifort -V
Intel(R) Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 14.0.1.106 Build 20131008
Copyright (C) 1985-2013 Intel Corporation. All rights reserved.
(1065) $ ifort -O3 dot.f90 -o dot.ifort14.exe
(1066) $ ./dot.ifort14.exe
Random generation 0.10825000
dotProd time: 8.31167000
dotProd(npts,nobs) 0.21707585
As you can see, the Intel 14 version is roughly 10x slower. Hmm. Looking at the output of -opt-report1 -vec-report1, we can see the difference:
Intel 13:
HPO VECTORIZER REPORT (MAIN__) LOG OPENED ON Mon Dec 16 11:56:12 2013
<dot.f90;-1:-1;hpo_vectorization;MAIN__;0>
HPO Vectorizer Report (MAIN__)
dot.f90(37): (col. 66) remark: PERMUTED LOOP WAS VECTORIZED.
dot.f90(37:66-37:66):VEC:MAIN__: PERMUTED LOOP WAS VECTORIZED
dot.f90(37): (col. 66) remark: PERMUTED LOOP WAS VECTORIZED.
PERMUTED LOOP WAS VECTORIZED
vectorization support: unroll factor set to 2
dot.f90(37): (col. 66) remark: PERMUTED LOOP WAS VECTORIZED.
PERMUTED LOOP WAS VECTORIZED
vectorization support: unroll factor set to 4
dot.f90(37): (col. 66) remark: PARTIAL LOOP WAS VECTORIZED.
PARTIAL LOOP WAS VECTORIZED
Intel 14:
HPO VECTORIZER REPORT (MAIN__) LOG OPENED ON Mon Dec 16 11:56:30 2013
<dot.f90;-1:-1;hpo_vectorization;MAIN__;0>
HPO Vectorizer Report (MAIN__)
dot.f90(27:9-27:9):VEC:MAIN__: loop was not vectorized: unsupported data type
loop was not vectorized: not inner loop
dot.f90(28:9-28:9):VEC:MAIN__: loop was not vectorized: unsupported data type
loop was not vectorized: not inner loop
dot.f90(37:25-37:25):VEC:MAIN__: loop was not vectorized: vectorization possible but seems inefficient
dot.f90(36): (col. 7) remark: OUTER LOOP WAS VECTORIZED
dot.f90(36:7-36:7):VEC:MAIN__: OUTER LOOP WAS VECTORIZED
dot.f90(37:25-37:25):VEC:MAIN__: loop was not vectorized: vectorization possible but seems inefficient
dot.f90(36:7-36:7):VEC:MAIN__: loop was not vectorized: low trip count
dot.f90(37:66-37:66):VEC:MAIN__: loop was not vectorized: not inner loop
So it looks like the Intel 14 optimizer isn't doing the right thing here. Taking a cue from the "PERMUTED" remarks in the Intel 13 report, one can try interchanging the loops in the calculation by hand (a sketch of the interchanged kernel follows the timings below), and if we do we get:
Random generation 0.09767500
dotProd time: 2.87279900
dotProd(npts,nobs) 0.21707585
Much better, but still about 4x slower than Intel 13.
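For reference, the hand-interchanged kernel (what becomes dot.permute.f90 below) would look something like this; a sketch, assuming nothing else in the program changes:
[fortran]   call system_clock(dot_start)
   ! Loops interchanged by hand: i now varies fastest, so dotProd(i,k) is written with unit stride
   do k=1,npts
      do i=1,nobs
         dotProd(i,k) = sum(analysis_chunk(k,:)*analysis_ob(i,:))/float(nanals-1)
      enddo
   enddo
   call system_clock(dot_end)[/fortran]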
Also, as an aside, it looks like compiling with -xHost on Intel 14 leads to a slowdown even with the permuted code:
(Intel 13) $ ifort -O3 -xHost dot.permute.f90 -o dot.ifort13.permute.xHost.exe
(Intel 13) $ ./dot.ifort13.permute.xHost.exe
Random generation 0.09036300
dotProd time: 0.80169300
dotProd(npts,nobs) 0.21707587
(Intel 14) $ ifort -O3 -xHost dot.permute.f90 -o dot.ifort14.permute.xHost.exe
(Intel 14) $ ./dot.ifort14.permute.xHost.exe
Random generation 0.09545500
dotProd time: 3.34055000
dotProd(npts,nobs) 0.21707590
Finally, -fast really doesn't like this code with either compiler:
(Intel 13) $ ifort -fast dot.permute.f90 -o dot.ifort13.permute.fast.exe
ipo: remark #11001: performing single-file optimizations
ipo: remark #11006: generating object file /gpfsm/dnb31/tdirs/pbs/slurm.448506.mathomp4/ipo_iforta6D8B9.o
(Intel 13) $ ./dot.ifort13.permute.fast.exe
Random generation 0.09720600
dotProd time: 3.20100500
dotProd(npts,nobs) 0.21707590
(Intel 14) $ ifort -fast dot.permute.f90 -o dot.ifort14.permute.fast.exe
ipo: remark #11001: performing single-file optimizations
ipo: remark #11006: generating object file /gpfsm/dnb31/tdirs/pbs/slurm.448505.mathomp4/ipo_iforthmRKBV.o
(Intel 14) $ ./dot.ifort14.permute.fast.exe
Random generation 0.10952500
dotProd time: 3.42286700
dotProd(npts,nobs) 0.21707590
(That said, the compilers are even for once.)
Is that meant to be a test of the compiler's ability to recognize more convoluted forms of
dotProd = matmul( analysis_ob , transpose( analysis_chunk ) ) / (nanals-1)
?
If so, there seems every reason to encourage the compiler to make the opt-matmul substitution, or even to write the equivalent gemm BLAS call yourself if the compiler should fail to do so (with the transpose eliminated by folding it into gemm). You could also incorporate the multiplier 1.0/(nanals-1) into the BLAS call so as to permit optimum placement.
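A minimal sketch of what that explicit gemm call might look like for the arrays in the test program (default real, hence sgemm; a BLAS library such as MKL, linked with e.g. -mkl, is assumed):
[fortran]! Sketch only: computes dotProd = matmul(analysis_ob, transpose(analysis_chunk)) / (nanals-1)
! in one gemm call; the transpose is expressed via 'T' and the scaling is folded into alpha.
call sgemm('N', 'T', nobs, npts, nanals,   &  ! C(nobs,npts) = alpha * A * B**T
           1.0/real(nanals-1),             &  ! alpha carries the 1/(nanals-1) multiplier
           analysis_ob, nobs,              &  ! A(nobs,nanals) and its leading dimension
           analysis_chunk, npts,           &  ! B(npts,nanals) and its leading dimension
           0.0, dotProd, nobs)                ! beta = 0; C and its leading dimension[/fortran]
No explicit transpose is ever formed, and the division by (nanals-1) costs nothing extra because it rides along in alpha.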
I would have thought the explicit BLAS call a reasonable thing to do in C, too.
Note that Fortran now has a real64 kind, not that I expect to see it used in preference to real(8). Past Fortran compilers have been known to snap at those who divided double precision by the float() intrinsic; the Fortran 90 equivalent would be real(nanals-1, kind(1d0)).
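Applied to the kernel line of the test program, that suggestion would read something like the sketch below (note it makes the divisor double precision, which is the case being cautioned about, while the arrays in the test program are default real):
[fortran]   dotProd(i,k) = sum(analysis_chunk(k,:)*analysis_ob(i,:)) / real(nanals-1, kind(1d0))[/fortran]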
Tim Prince wrote:
Is that meant to be a test of the compiler's ability to recognize more convoluted forms of
dotProd = matmul( analysis_ob , transpose( analysis_chunk ) ) / (nanals-1)
?
I don't think it was; it was just a test case a colleague ran across. My concern was that Intel 14 was producing significantly slower code than an older version of the compiler. Since this seems to be an issue where the optimizer isn't recognizing the code in the same way (and, comparatively, to the code's detriment), I thought I would ask here on the forum whether that is an expected result of upgrading to Intel 14 and whether there is a new flag I should be experimenting with; I usually expect code built with newer compilers to keep or improve its performance.
That said, while Intel 14 does better with your suggestion, it is still slower than the "convoluted" code's performance with Intel 13. To wit:
[fortran]   call system_clock(dot_start)
   dotProd = matmul(analysis_ob, transpose(analysis_chunk)) / (nanals-1)
   call system_clock(dot_end)
   dot_time = (dot_end-dot_start)/rate[/fortran]
Compiling with -O3, I get the following, where the "permute" exe is my original code with the i and k loops permuted by hand:
(Intel 13) $ ./dot.ifort13.permute.exe
Random generation 0.10292100
dotProd time: 0.82688300
dotProd(npts,nobs) 0.21707585
(Intel 13) $ ./dot.ifort13.matmul.exe
Random generation 0.10603800
dotProd time: 2.91957300
dotProd(npts,nobs) 0.21707585
(Intel 14) $ ./dot.ifort14.permute.exe
Random generation 0.10511600
dotProd time: 2.94216000
dotProd(npts,nobs) 0.21707585
(Intel 14) $ ./dot.ifort14.matmul.exe
Random generation 0.10599400
dotProd time: 1.12875200
dotProd(npts,nobs) 0.21707585
So it looks like Intel 14 does handle matmul better (hooray!), but it's still ~36% slower than what Intel 13 does with the original code. Also, with the matmul version, -fast does seem to help both compilers compared to -O3 (unlike with the original code):
(Intel 13) $ ./dot.ifort13.matmul.fast.exe
Random generation 0.10580900
dotProd time: 1.06494500
dotProd(npts,nobs) 0.21707587
(Intel 14) $ ./dot.ifort14.matmul.fast.exe
Random generation 0.09832200
dotProd time: 1.05861200
dotProd(npts,nobs) 0.21707587
Always good to see -fast work like one hopes it would given its name!
Compiler 14.0.1 introduced more capability for outer loop vectorization (strip mining), but that doesn't make it a good solution for a small number of threads when loop permutation can produce effectively inner loop vectorization, particularly when it's still possible to parallelize.
So I guess the outer loop vectorization comment is a red flag when you know there are better alternatives.
No, this is not expected behavior. The innermost loop has non-unit stride memory access. The 13.1 compiler does loop interchange followed by loop blocking to improve data locality, whereas 14.0 does not. (You can see this directly by adding -opt-report-phase hlo ). This is not a consequence of outer loop vectorization, which occurs at a later stage after this opportunity has been missed. This has been escalated to the compiler developers to investigate.
There are various ways to improve performance, as you saw. One possible advantage of using matmul for large matrices, (or writing code that the compiler can identify as matmul), is that if you use -O3 -parallel, or -O3 -opt-matmul, the compiler may replace this with a call to a threaded library function.
-fast implies -xhost (in addition to -O3 -no-prec-div -ipo -static), so it's expected to have similar behavior to -xhost directly. However, for your original example with the 13.1 compiler, -ipo also seemed to interfere with the loop optimizations.
Finally, if you are running on a system that supports Intel(R) AVX, you may get an additional benefit from aligning your data with -align array32byte.
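Putting those suggestions together, the compile lines might look something like the following (an untested sketch; the file name is hypothetical, and the flags are the ones named above):
$ ifort -O3 -opt-matmul -align array32byte dot.matmul.f90 -o dot.matmul.exe
$ ifort -O3 -parallel -align array32byte dot.matmul.f90 -o dot.matmul.par.exe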
Martyn Corden (Intel) wrote:
No, this is not expected behavior. The innermost loop has non-unit stride memory access. The 13.1 compiler does loop interchange followed by loop blocking to improve data locality, whereas 14.0 does not. (You can see this directly by adding -opt-report-phase hlo ). This is not a consequence of outer loop vectorization, which occurs at a later stage after this opportunity has been missed. This has been escalated to the compiler developers to investigate.
There are various ways to improve performance, as you saw. One possible advantage of using matmul for large matrices, (or writing code that the compiler can identify as matmul), is that if you use -O3 -parallel, or -O3 -opt-matmul, the compiler may replace this with a call to a threaded library function.
-fast implies -xhost (in addition to -O3 -no-prec-div -ipo -static), so it's expected to have similar behavior to -xhost directly. However, for your original example with the 13.1 compiler, -ipo also seemed to interfere with the loop optimizations.
Finally, if you are running on a system that supports Intel(R) AVX, you may get an additional benefit from aligning your data with -align array32byte.
Thanks to Martyn for initiating an investigation into why the compiler chooses outer loop vectorization, which is less effective for this case than loop interchange.
Last I checked, -O3 implied -opt-matmul. If you had a case too small for -opt-matmul (not the case cited here), the combination -O3 -no-opt-matmul could be quite effective. -opt-matmul can also be set independently of -O3 so as to get the MKL internal library function (with MKL threading depending on the MKL link option, i.e., not using -mkl=sequential).
I have found -align array32byte even more effective for SSE2 on Core i7 than for later CPUs that support AVX. Still, the point about AVX is well taken; even when calling C++ code built with MSVC++ /arch:AVX from Fortran, the 32-byte alignment can be quite effective (primarily for rank-1 arrays).
