Function returning an array of large size: performance issue with Intel Fortran

FortranFan · ‎05-23-2017

I am helping a colleague who has written an engineering program that involves a function returning an array of large size (0.5 to 2 GB typically on Windows x64 OS with 16 GB RAM) and who is noticing significantly slower performance with Intel Fortran compared to gfortran. We are trying to understand why this is the case.

Toward this, consider the following simplification of the program, which reduces to a single function call of the form y = f(x), where x and y are arrays of significant size; in this example, each is 0.5 GB. The function itself has been made trivial for this example: a simple assignment, y = x.

What one notices is that, with Intel Fortran and /O2 optimization, the function call takes significantly longer than simply copying all the data. Expressed as a ratio of the two timings, the program built with Intel Fortran shows the function call to be more than 4 times slower than the plain copy, which is far worse than the exact same code built with gfortran.

Can someone from the Intel Fortran team please provide some insight and guidance with this?

Thanks,

program p

   use, intrinsic :: iso_fortran_env, only : R8 => real64

   implicit none

   integer, parameter :: N = 2**27
   integer, parameter :: NUM_SAMPLES = 100
   integer :: i
   real :: x(N)  ! 0.5 GB storage size
   real :: y(N)
   real(R8) :: t1
   real(R8) :: t2
   real(R8) :: t_assign
   real(R8) :: t(NUM_SAMPLES)

   call random_number( x )

   print *, "Checking y=x Assignment:"
   print "(a,t10,a,t25,a)", "i", "x", "y"
   do i = 1, NUM_SAMPLES
      if ( mod(i,20) == 0 ) x = x + epsilon(x) ! Keep the compiler from optimizing away the sampling loop
      call cpu_time( t1 )
      y = x
      call cpu_time( t2 )
      t(i) = t2 - t1
      if ( mod(i,20) == 0 ) then
         print "(g0,t10,g0,t25,g0)", i, x(i), y(i)
      end if
   end do
   t_assign = sum(t)
   print "(*(g0,1x))", "Average CPU Time = ", sum(t)/real(NUM_SAMPLES,kind=R8)
   print *

   print *, "Checking y=equals(x) Function Call:"
   print "(a,t10,a,t25,a)", "i", "x", "y"
   do i = 1, NUM_SAMPLES
      if ( mod(i,20) == 0 ) x = x + epsilon(x) ! Keep the compiler from optimizing away the sampling loop
      call cpu_time( t1 )
      y = equals( x )
      call cpu_time( t2 )
      t(i) = t2 - t1
      if ( mod(i,20) == 0 ) then
         print "(g0,t10,g0,t25,g0)", i, x(i), y(i)
      end if
   end do
   print "(*(g0,1x))", "Average CPU Time = ", sum(t)/real(NUM_SAMPLES,kind=R8)
   print *

   print "(g0,f10.2)", "Ratio of the two instructions: ", sum(t)/t_assign

   stop

contains

   pure function equals( a ) result( r )

      real, intent(in) :: a(:)
      ! Function result
      real :: r( size(a) )

      r = x

      return

   end function equals

end program p

Compilation and execution:

C:\Fortran>ifort /heap-arrays:0 /standard-semantics p.f90
Intel(R) Visual Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 18.0.0.065 Beta Build 20170320
Copyright (C) 1985-2017 Intel Corporation.  All rights reserved.

ifort: NOTE: The Beta evaluation period for this product ends on 12-oct-2017 UTC.
Microsoft (R) Incremental Linker Version 14.00.24215.1
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:p.exe
-subsystem:console
p.obj

C:\Fortran>p.exe
 Checking y=x Assignment:
i        x              y
20       .9765872E-01   .9765872E-01
40       .4800636       .4800636
60       .6310107       .6310107
80       .8096844       .8096844
100      .4880919       .4880919
Average CPU Time =  .4383628099999999E-01

 Checking y=equals(x) Function Call:
i        x              y
20       .9765931E-01   .9765931E-01
40       .4800642       .4800642
60       .6310113       .6310113
80       .8096850       .8096850
100      .4880925       .4880925
Average CPU Time =  .2020212949999999

Ratio of the two instructions:       4.61

C:\Fortran>

On the Windows-based computer system I tried, the ratio shown above is consistently around 4.5. Note that using /O3 with Intel Fortran makes only a small difference. With the same program compiled with gfortran (using its -O2 or -O3 optimization option) on the same system, the ratio is typically below 1.3.

P.S.> I've taken a peek at the assembler instructions and something in the Intel Fortran output looks bothersome, but I'll keep my thoughts to myself and let the Intel team follow up on this.

jimdempseyatthecove · ‎05-23-2017

What happens with:

   pure subroutine equalsSub( r, a )
      real, intent(in) :: a(:)
      real, intent(out) :: r( size(a) )  ! a pure subroutine must declare the intent of its dummy arguments
      r = x
      return
   end subroutine equalsSub

And performing call equalsSub(y, x) in your timed loop.
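
For reference, a minimal self-contained sketch of the subroutine variant in a timed loop might look like the following. The program name p_sub, the reduced N and NUM_SAMPLES, the allocatable arrays, and the self-check are assumptions made so the sketch runs quickly; they are not part of the original program. The subroutine here copies from its dummy argument a and declares intent(out) on r.

```fortran
program p_sub

   use, intrinsic :: iso_fortran_env, only : R8 => real64

   implicit none

   integer, parameter :: N = 2**20         ! assumption: 4 MB arrays, far smaller than the original
   integer, parameter :: NUM_SAMPLES = 10  ! assumption: fewer samples than the original
   integer :: i
   real, allocatable :: x(:), y(:)
   real(R8) :: t1, t2, t_total

   allocate( x(N), y(N) )
   call random_number( x )

   t_total = 0.0_R8
   do i = 1, NUM_SAMPLES
      x = x + epsilon(x)           ! change the input on every pass
      call cpu_time( t1 )
      call equalsSub( y, x )       ! result is written directly into y
      call cpu_time( t2 )
      t_total = t_total + (t2 - t1)
      if ( any( y /= x ) ) error stop "equalsSub did not copy x into y"
   end do

   print "(*(g0,1x))", "Average CPU Time =", t_total/real(NUM_SAMPLES,kind=R8)

contains

   pure subroutine equalsSub( r, a )
      real, intent(out) :: r( size(a) )
      real, intent(in)  :: a(:)
      r = a
   end subroutine equalsSub

end program p_sub
```

Because the caller passes y itself as the output argument, nothing in this form requires separate storage for a function result.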

Jim Dempsey

FortranFan · ‎05-23-2017

jimdempseyatthecove wrote:

What happens with .. performing call equalsSub(y, x) in your timed loop.

Jim Dempsey

Jim,

Yes, replacing the function subprogram with a subroutine clearly helps with Intel Fortran: it brings the ratio I show above to around unity. This implies that, with optimization in effect, the subroutine subprogram is effectively inlined, leaving essentially no procedure invocation overhead. This is indeed an option we have already considered; it's just that refactoring the code will be a lot of change for my colleague.

But, of course, the question for the Intel team is why such a difference relative to gfortran for the code in the original post!

P.S.> With gfortran, there was hardly any change between function and subroutine subprograms (I think I either need some other compiler option or a different compiler version to notice a difference).

jimdempseyatthecove · ‎05-23-2017

Can you tell if an array temporary is created in the function call...
... or if the generated code is using scalar copy versus vector copy?

With a really old version of IVF I had an issue of scalar array copy being performed when vector copy should have been chosen. The fix was to use

!DIR$ VECTOR ALWAYS
r = x

You might give that a try. (There are other clauses to VECTOR that may be of interest too.)
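
As a rough sketch of where the directive would sit, here it is placed inside the function from the original post. The program name p_vec, the small N, and the self-check are assumptions for illustration, and the function body copies from the dummy argument a. Compilers other than Intel Fortran should simply treat the !DIR$ line as a comment, and whether it changes the generated code at all is not something this sketch demonstrates.

```fortran
program p_vec

   implicit none

   integer, parameter :: N = 2**16   ! assumption: small size so the function result stays small
   real, allocatable :: x(:), y(:)

   allocate( x(N), y(N) )
   call random_number( x )

   y = equals( x )

   if ( any( y /= x ) ) error stop "equals did not return a copy of x"
   print *, "copy ok"

contains

   pure function equals( a ) result( r )

      real, intent(in) :: a(:)
      ! Function result
      real :: r( size(a) )

      ! Intel-specific request to vectorize the assignment that follows
      !DIR$ VECTOR ALWAYS
      r = a

   end function equals

end program p_vec
```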

Jim Dempsey

Kevin_D_Intel · ‎05-24-2017

It appears ifort allocates/deallocates an array temporary for the function variant but not for the array assignment or subroutine variants. I submitted this to Development for their analysis.

(Internal tracking id: CMPLRS-43227)
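
To make the cost of such a temporary concrete, here is a hand-written analogue of the two code paths. This only illustrates what "allocate a temporary, fill it, copy it, deallocate it" means at the source level; it is an assumption about the general shape of the work, not the code ifort actually generates. The names p_temp and tmp and the small N are hypothetical.

```fortran
program p_temp

   implicit none

   integer, parameter :: N = 2**16   ! assumption: small size for a quick run
   real, allocatable :: x(:), y(:), tmp(:)

   allocate( x(N), y(N) )
   call random_number( x )

   ! (1) Plain assignment: one pass over the data, no extra storage.
   y = x
   if ( any( y /= x ) ) error stop "direct copy failed"

   ! (2) Analogue of y = equals(x) when the result goes through a temporary.
   y = 0.0
   allocate( tmp(N) )   ! storage for the function result
   tmp = x              ! the function body fills the result (first copy)
   y = tmp              ! the result is then copied into y (second copy)
   deallocate( tmp )    ! the temporary is released after the assignment
   if ( any( y /= x ) ) error stop "copy through temporary failed"

   print *, "both variants give y == x"

end program p_temp
```

The second path touches the data twice and adds an allocation and deallocation per call, which for arrays in the 0.5 to 2 GB range is consistent with the function variant being noticeably slower than the plain assignment.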

FortranFan · ‎05-25-2017

Kevin D (Intel) wrote:

It appears ifort allocates/deallocates an array temporary for the function variant but not for the array assignment or subroutine variants. I submitted this to Development for their analysis.

(Internal tracking id: CMPLRS-43227)

Thanks much, Kevin. That's exactly what I noticed, and it seems to be what degrades performance. My hope is that Intel Development will follow up soon with an approach that greatly improves performance, making it as good as or better than gfortran for the use case shown in the original post. I look forward to your feedback from Development's analysis.