
I am helping a colleague who has written an engineering program involving a function that returns a large array (typically 0.5 to 2 GB, on Windows x64 with 16 GB RAM) and who is seeing significantly slower performance with Intel Fortran than with gfortran. We are trying to understand why this is the case.

Toward this, consider the following simplification of the program, which reduces to a function reference equivalent to the formula y = f(x), where x and y are large arrays; in this example, each is 0.5 GB. The function has been made trivial for this example: a simple assignment, y = x.

What one notices is that, with Intel Fortran at /O2 optimization, the function call takes significantly longer than simply copying all the data. Expressed as a ratio of function-call time to plain-assignment time, the Intel Fortran build comes out above 4, far higher than the ratio for the exact same code built with gfortran.

Can someone from the Intel Fortran team please provide some insight and guidance with this?

Thanks,

    program p

       use, intrinsic :: iso_fortran_env, only : R8 => real64

       implicit none

       integer, parameter :: N = 2**27
       integer, parameter :: NUM_SAMPLES = 100
       integer :: i
       real :: x(N) ! 0.5 GB storage size
       real :: y(N)
       real(R8) :: t1
       real(R8) :: t2
       real(R8) :: t_assign
       real(R8) :: t(NUM_SAMPLES)

       call random_number( x )

       print *, "Checking y=x Assignment:"
       print "(a,t10,a,t25,a)", "i", "x", "y"
       do i = 1, NUM_SAMPLES
          if ( mod(i,20) == 0 ) x = x + epsilon(x) ! For compiler not to optimize away the sampling loop
          call cpu_time( t1 )
          y = x
          call cpu_time( t2 )
          t(i) = t2 - t1
          if ( mod(i,20) == 0 ) then
             print "(g0,t10,g0,t25,g0)", i, x(i), y(i)
          end if
       end do
       t_assign = sum(t)
       print "(*(g0,1x))", "Average CPU Time = ", sum(t)/real(NUM_SAMPLES,kind=R8)

       print *
       print *, "Checking y=equals(x) Function Call:"
       print "(a,t10,a,t25,a)", "i", "x", "y"
       do i = 1, NUM_SAMPLES
          if ( mod(i,20) == 0 ) x = x + epsilon(x) ! For compiler not to optimize away the sampling loop
          call cpu_time( t1 )
          y = equals( x )
          call cpu_time( t2 )
          t(i) = t2 - t1
          if ( mod(i,20) == 0 ) then
             print "(g0,t10,g0,t25,g0)", i, x(i), y(i)
          end if
       end do
       print "(*(g0,1x))", "Average CPU Time = ", sum(t)/real(NUM_SAMPLES,kind=R8)

       print *
       print "(g0,f10.2)", "Ratio of the two instructions: ", sum(t)/t_assign

       stop

    contains

       pure function equals( a ) result( r )

          real, intent(in) :: a(:)
          ! Function result
          real :: r( size(a) )

          r = x

          return

       end function equals

    end program p

Compilation and execution:

    C:\Fortran>ifort /heap-arrays:0 /standard-semantics p.f90
    Intel(R) Visual Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 18.0.0.065 Beta Build 20170320
    Copyright (C) 1985-2017 Intel Corporation. All rights reserved.
    ifort: NOTE: The Beta evaluation period for this product ends on 12-oct-2017 UTC.

    Microsoft (R) Incremental Linker Version 14.00.24215.1
    Copyright (C) Microsoft Corporation. All rights reserved.

    -out:p.exe
    -subsystem:console
    p.obj

    C:\Fortran>p.exe
     Checking y=x Assignment:
    i        x              y
    20       .9765872E-01   .9765872E-01
    40       .4800636       .4800636
    60       .6310107       .6310107
    80       .8096844       .8096844
    100      .4880919       .4880919
    Average CPU Time =  .4383628099999999E-01

     Checking y=equals(x) Function Call:
    i        x              y
    20       .9765931E-01   .9765931E-01
    40       .4800642       .4800642
    60       .6310113       .6310113
    80       .8096850       .8096850
    100      .4880925       .4880925
    Average CPU Time =  .2020212949999999

    Ratio of the two instructions:       4.61

    C:\Fortran>

On the Windows-based system I tried, the ratio shown above is consistently on the order of 4.5. Note that using /O3 with Intel Fortran makes only a small difference. By contrast, when the same program is compiled with gfortran (with its -O2 or -O3 optimization option) on the same system, the ratio is typically below 1.3.

P.S.> I've taken a peek at the assembler instructions and there is something with Intel Fortran that appears bothersome, but I'll hold my thoughts to myself and allow the Intel team to follow up on this.


What happens with:

    pure subroutine equalsSub( r, a )
       real, intent(in) :: a(:)
       real, intent(out) :: r( size(a) ) ! a pure subroutine must declare the intent of each dummy argument
       r = x
       return
    end subroutine equalsSub

And performing call equalsSub(y, x) in your timed loop.

Jim Dempsey


jimdempseyatthecove wrote:

What happens with .. performing call equalsSub(y, x) in your timed loop.

Jim Dempsey

Jim,

Yes, replacing the function subprogram with a subroutine clearly helps with Intel Fortran: it brings the ratio I show above to around unity, implying that with optimization in effect the subroutine is effectively inlined, with essentially no procedure invocation overhead. This is indeed an option we have already considered; it's just that refactoring the code will be a lot of change for my colleague.

But, of course, the question for the Intel team is why such a difference relative to gfortran for the code in the original post!

P.S.> With gfortran, there was hardly any change between function and subroutine subprograms (I think I either need some other compiler option or a different compiler version to notice a difference).
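
For readers following along, the function-to-subroutine change discussed in this thread can be sketched as below. This is a minimal sketch and not the colleague's actual code: the module name `copy_variants` and the subroutine name `equals_sub` are made up for illustration, and both procedures copy their dummy argument `a` rather than the host variable `x` used in the posted program.

```fortran
module copy_variants
   implicit none
   private
   public :: equals, equals_sub
contains

   ! Function form: the result r is an object of its own, which the
   ! caller then assigns to its variable in "y = equals( x )".
   pure function equals( a ) result( r )
      real, intent(in) :: a(:)
      real :: r( size(a) )
      r = a
   end function equals

   ! Subroutine form (hypothetical name): the caller's array is the
   ! actual argument and is filled directly by "call equals_sub( y, x )".
   pure subroutine equals_sub( r, a )
      real, intent(out) :: r(:)
      real, intent(in)  :: a(:)
      r = a
   end subroutine equals_sub

end module copy_variants
```

In the timed loop of the posted program, `y = equals( x )` would become `call equals_sub( y, x )`, and every other call site of the function would need the same change, which is the refactoring cost mentioned above.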


Can you tell if an array temporary is created in the function call...

... or if the generated code is using scalar copy versus vector copy?

With a really old version of IVF I had an issue of scalar array copy being performed when vector copy should have been chosen. The fix was to use

    !DIR$ VECTOR ALWAYS
    r = x

You might give that a try. (There are other clauses to VECTOR that may be of interest too.)

Jim Dempsey
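
One way the directive might be placed inside the function is sketched below. Treat it as a sketch under assumptions: the module name `copy_vec` and the function name `equals_vec` are hypothetical, and the array assignment has been written as an explicit DO loop so that the directive sits directly in front of a loop. `!DIR$ VECTOR ALWAYS` is an Intel compiler directive; a compiler that does not recognize it reads the line as an ordinary comment. Nothing in this thread establishes that it changes the timings for the case at hand.

```fortran
module copy_vec
   implicit none
contains

   ! Hypothetical variant of the posted function with a vectorization hint.
   pure function equals_vec( a ) result( r )
      real, intent(in) :: a(:)
      real :: r( size(a) )
      integer :: i

      ! Intel-specific hint on the next line; other compilers see a comment.
      !DIR$ VECTOR ALWAYS
      do i = 1, size(a)
         r(i) = a(i)
      end do
   end function equals_vec

end module copy_vec
```

The call site is unchanged, for example `y = equals_vec( x )`, so this experiment does not require touching the callers.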


It appears ifort allocates/deallocates an array temporary for the function variant but not for the array assignment or subroutine variants. I submitted this to Development for their analysis.

(Internal tracking id: CMPLRS-43227)
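
To illustrate what an array temporary means for this case, the hand-written sketch below spells out the two data paths in source form. It is an illustration under assumptions, not the code ifort actually generates: the names `temp_demo`, `copy_direct` and `copy_via_temp` are hypothetical. The point is only that the temporary path adds an allocation, a second pass over the data, and a deallocation on top of the single copy.

```fortran
module temp_demo
   implicit none
contains

   ! Direct path: one pass over the data, as in the plain "y = x".
   subroutine copy_direct( y, x )
      real, intent(out) :: y(:)
      real, intent(in)  :: x(:)
      y = x
   end subroutine copy_direct

   ! Temporary path, written out by hand: the function result is built
   ! in separate storage and only then copied into y.
   subroutine copy_via_temp( y, x )
      real, intent(out) :: y(:)
      real, intent(in)  :: x(:)
      real, allocatable :: tmp(:)

      allocate( tmp( size(x) ) )   ! storage for the function result
      tmp = x                      ! first pass: the body of the function
      y = tmp                      ! second pass: the assignment to y
      deallocate( tmp )            ! release the temporary
   end subroutine copy_via_temp

end module temp_demo
```

Both routines leave the same values in `y`; for arrays of 0.5 GB the extra allocation and the second pass are where additional time could plausibly go.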


Kevin D (Intel) wrote:

It appears ifort allocates/deallocates an array temporary for the function variant but not for the array assignment or subroutine variants. I submitted this to Development for their analysis.

(Internal tracking id: CMPLRS-43227)

Thanks much, Kevin. That's exactly what I noticed, and it seems to be what degrades performance. My hope is that Intel Development will follow up soon with an approach that greatly improves performance, making it as good as or better than gfortran for the use case shown in the original post. I look forward to your feedback from Development's analysis.
