Run time performance, for_alloc_assign_v2 and for_alloc_copy

Simon_Richards1 · ‎12-01-2022

We have a production code that performs significantly worse with recent versions of ifort (including oneAPI 2022.1 and ifort version 19.0) than with ifort version 16.0.

I have used VTune to identify a single subroutine responsible for this performance slowdown in a particular test problem.

When the code is built with ifort version 16.0 this routine takes just under 4 seconds.

When the code is built with oneAPI 2022 (ifort) this routine takes more than 21 seconds. This additional CPU time is almost entirely spent in for_alloc_assign_v2 (9.7 sec) and for_alloc_copy (6.2 sec). I don't see any time spent in these routines at all in the ifort 16.0 case.

The subroutine in question is a complicated piece of legacy code and I have not yet managed to reproduce the same behaviour in a minimal example. It may be possible to refactor the code to avoid repeated allocation and deallocation, but first I want to understand why the same code built with different ifort versions shows such different behaviour and performance.

Thanks.

Simon_Richards1 · ‎12-08-2022

I just wanted to note here that I posted results from a modified case which hopefully removes the optimization from the equation, in case it is lost in the threading on the forum.

Ron_Green · ‎12-12-2022

it is curious. I'll have a look at this sometime this week.

jimdempseyatthecove · ‎12-12-2022

FWIW

  SUBROUTINE sub()  
    laws(1)%a(1) = laws(1)%a(1) + 1.0    
    jxrange(1) = nc_range      
  END SUBROUTINE sub

The token "laws(1)%a(1)" should use scalar instructions to generate a reference via the array descriptors. It should not require the generation of temporary objects, which is what the excessive time consumption suggests is happening.

This said, it should be noted that in this example, sub is a contained procedure .and. laws, jxrange and nc_range are declared in the containing scope. If I were to guess, the code generator is generating a copy-in/copy-out method as opposed to reference-in.

Jim Dempsey
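For readers without the earlier pages of the thread, here is a minimal sketch of what a test case around sub might look like. Only the names data_mod, laws, jxrange, nc_range and sub come from the posts; every type definition, the call count n and the timing code are assumptions made for illustration, not the original example.

```fortran
! Hypothetical reproducer sketch. Only the names data_mod, laws, jxrange,
! nc_range and sub come from the thread; all type definitions are assumed.
module data_mod
  implicit none

  type :: law_base_t
    real, allocatable :: a(:)
  end type law_base_t

  type, extends(law_base_t) :: law_t        ! assumed extended type
    integer :: id = 0
  end type law_t

  type :: range_base_t
    integer :: lo = 0
  end type range_base_t

  type, extends(range_base_t) :: range_t    ! assumed extended type
    integer :: hi = 0
  end type range_t

  type(law_t),   allocatable :: laws(:)
  type(range_t), allocatable :: jxrange(:)
  type(range_t)              :: nc_range
  integer                    :: ncalls = 0  ! call counter (assumed helper)
end module data_mod

! External subroutine, matching the production layout described in the thread.
subroutine sub()
  use data_mod
  implicit none
  laws(1)%a(1) = laws(1)%a(1) + 1.0
  jxrange(1) = nc_range
  ncalls = ncalls + 1
end subroutine sub

program repro
  use, intrinsic :: iso_fortran_env, only: int64
  use data_mod
  implicit none
  external :: sub
  integer, parameter :: n = 1000000         ! assumed call count
  integer :: i
  integer(int64) :: t0, t1, rate

  allocate(laws(1), jxrange(1))
  allocate(laws(1)%a(1))
  laws(1)%a = 0.0
  nc_range%lo = 1
  nc_range%hi = 10

  call system_clock(t0, rate)
  do i = 1, n
    call sub()
  end do
  call system_clock(t1)

  ! Self-checks: every call happened and the assignment copied the data.
  if (ncalls /= n) error stop 'sub was not called n times'
  if (jxrange(1)%hi /= 10) error stop 'assignment lost data'

  print '(a,i0)', 'calls ', ncalls
  print '(a,f8.3,a)', 'elapsed ', real(t1 - t0) / real(rate), ' s'
end program repro
```

Building this with two compiler versions and comparing the elapsed time (or profiling for for_alloc_assign_v2 and for_alloc_copy) is one way to check whether a given set of types hits the slow path. Whether these particular assumed types do so has not been verified.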

Simon_Richards1 · ‎12-13-2022

Good point about sub being a contained procedure in the example code. I only did that for convenience and in fact the affected subroutine in the production code is an external subroutine. I have modified my example to make sub an external subroutine. The results are the same: ifort 16.0 - 2.6 seconds; oneAPI 2022 (ifort) - 48 seconds.

Just for completeness I also tried making sub a contained procedure in the data_mod module. The results are the same again.

jimdempseyatthecove · ‎12-13-2022

The disassembly code would be instructive for the Intel developers to look at.

Seeing that the results of the example code are correct, it would appear that something on the rhs is making the compiler think that one or more temporary objects need to be created (and destroyed afterwards) in order to extract a scalar and complete the expression.

While your case seems to be the first report, your data structures are not unusual, and thus I suspect this is happening to others as well. And due to the severity of performance impact, this warrants attention.

Jim Dempsey

Ron_Green · ‎12-13-2022

I also moved sub to the module: same result. I put a call counter in and disproved that gfortran and old ifort optimized all but 1 call - no, there are a lot of calls to sub in all compilers.

Jim is on the right track, from what I've seen so far. This appears to be a change from the 17.6 compiler to the 18.0 compiler. So whatever it is, it ain't new. Something about these extended derived types is tripping up the front end, I think. Not sure yet, but pretty sure. It's interesting that this has been around in all compilers since 18.0.0 and no one has noticed it to date. Makes me think it's a little used code path at fault. OR others just don't have as many calls to low-work subroutines (overhead becomes noise in a sub with lots of actual work).

I really do NOT think it's sloppy code in for_* routines, else EVERYONE would be impacted. This is why I think Jim is on the correct track - the front end is thinking it needs a temp when it doesn't.

I'll wrap up a bug report and get that over to the front end people to analyze. Thanks for sending this to us.

ron

Simon_Richards1 · ‎12-13-2022

Thank you @Ron_Green

We’ve been releasing with version 16.0 of the compiler because we needed to support users with older operating systems. We’ve been building and testing with newer compilers for a while but not really focusing on performance, which is why we’ve only recently uncovered this issue. Now that we’re dropping support for older operating systems we’re ready to move to later compilers and we therefore care much more about their performance.

The routine in the production code obviously does a lot more work than sub in the example, so the performance overhead is less dramatic, but it is definitely significant, and way above being just noise. The routine is called a very large number of times, so needs to be quick.

It’s interesting that so far we’ve only noticed this in one routine. We could probably refactor that routine to remove the issue but the worry is that other code paths might be similarly affected.

I will be very interested to hear what the front end developers have to say.

jimdempseyatthecove · ‎12-14-2022

As a sketch for a possible work around:

  SUBROUTINE sub()
    associate(a=>laws(1)%a) ! or a(1) if using only index=1
      a(1) = a(1) + 1.0    
      jxrange(1) = nc_range   
    end associate   
  END SUBROUTINE sub

Note, this is a work around that need not be removed later.

Jim Dempsey
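A self-contained version of the ASSOCIATE sketch above, for anyone who wants to try it in isolation. The module name assoc_demo_mod, the type law_t and the driver program are hypothetical, and the jxrange assignment is left out to keep the example short.

```fortran
! Hypothetical stand-alone version of the ASSOCIATE workaround.
! assoc_demo_mod, law_t and the driver are assumptions for illustration.
module assoc_demo_mod
  implicit none

  type :: law_t
    real, allocatable :: a(:)
  end type law_t

  type(law_t), allocatable :: laws(:)

contains

  subroutine sub()
    ! Resolve laws(1)%a once, then work through the associate name.
    associate (a => laws(1)%a)
      a(1) = a(1) + 1.0
    end associate
  end subroutine sub

end module assoc_demo_mod

program assoc_demo
  use assoc_demo_mod
  implicit none
  integer :: i

  allocate(laws(1))
  allocate(laws(1)%a(3))
  laws(1)%a = 0.0

  do i = 1, 5
    call sub()
  end do

  ! Five increments of 1.0 starting from 0.0 are exact in single precision.
  if (laws(1)%a(1) /= 5.0) error stop 'unexpected value in laws(1)%a(1)'
  print *, laws(1)%a
end program assoc_demo
```

The associate name a refers to the same storage as laws(1)%a, so the update is visible after the construct ends. Whether this changes the generated code for a given compiler version would need to be checked with a profiler or the disassembly.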

Simon_Richards1 · ‎12-14-2022

Thanks Jim, good suggestion.

I did have a go at replacing the assignments with associates but the production code is a lot more complicated, with various loops and IF blocks and even GOTOs (legacy FORTRAN77 style routine). It became challenging to make the associate blocks work with the existing program flow. I think we’d have to refactor the routine significantly to make it work. That kind of refactoring would have other benefits of course, but we’re talking here about a legacy code route which is maintained for back compatibility but no longer actively developed.
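Where ASSOCIATE blocks are awkward to fit around GOTO-based flow, one alternative that might be worth trying is a local pointer alias, since it needs no block structure. This is only a sketch under assumptions: ptr_demo_mod, law_t and the limit argument are hypothetical, the data must carry the TARGET attribute, and it is not known whether this avoids the for_alloc_* calls.

```fortran
! Hypothetical pointer-alias variant for routines with unstructured flow.
! ptr_demo_mod, law_t and the argument "limit" are assumptions.
module ptr_demo_mod
  implicit none

  type :: law_t
    real, allocatable :: a(:)
  end type law_t

  ! TARGET is required so that a pointer may alias the component.
  type(law_t), allocatable, target :: laws(:)

contains

  subroutine sub(limit)
    real, intent(in) :: limit
    real, pointer :: a(:)

    a => laws(1)%a                 ! alias set once, valid across GOTOs
    if (a(1) >= limit) goto 100    ! legacy-style branch
    a(1) = a(1) + 1.0
100 continue
  end subroutine sub

end module ptr_demo_mod

program ptr_demo
  use ptr_demo_mod
  implicit none
  integer :: i

  allocate(laws(1))
  allocate(laws(1)%a(1))
  laws(1)%a = 0.0

  do i = 1, 5
    call sub(3.0)
  end do

  ! Three increments happen, then the branch skips the remaining two calls.
  if (laws(1)%a(1) /= 3.0) error stop 'unexpected value in laws(1)%a(1)'
  print *, laws(1)%a(1)
end program ptr_demo
```

Note that TARGET and pointer aliasing may limit other compiler optimizations, so any gain would have to be measured in the real routine.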

Ron_Green · ‎12-14-2022

Bug ID is CMPLRLLVM-42717

Steve_Lionel · ‎12-14-2022

I’d be very interested in whatever technical details become available about the difference.

Simon_Richards1 · ‎01-26-2023

I'd be very interested to hear if there has been any progress on this issue, or any further information.

Ron_Green · ‎05-16-2024

@Simon_Richards1 we have a fix for this in the pipeline. It did not make the Update 2 compiler that will release in late June or early July. It is targeted for the 2025.0 package version, which is ifx 2025.0.0.

Runtime for this testcase on my test server went from 20 seconds to 0.26 seconds.

The fix in ifx will come with the 2025.0.0 version. This is due to release in the fall, around November or early December.

With the current compiler and the 2024.2 rerelease build
 starting.  calling sub many times
 done with calls finally
 calls     50000000

real	0m20.528s
user	0m20.480s
sys	0m0.010s


With the code branch for 2025.0
 starting.  calling sub many times
 done with calls finally
 calls     50000000

real	0m0.272s
user	0m0.252s
sys	0m0.004s

Simon_Richards1 · ‎05-16-2024

Thanks for the update @Ron_Green .

What was the cause? Was it unnecessary temporaries, as you suspected?

Ron_Green · ‎05-16-2024

No. For extended type assignments we were calling into the RTL, hence the "for_" function overhead you saw. We changed the front end to tag these as implicit allocatable, so they can be inlined, which removes the overhead of the call into the FRTL.
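For anyone still on an affected compiler, a small timing sketch like the one below could be used to compare a whole-object assignment of an extended type with a component-by-component copy. The types, the module name ext_assign_mod and the loop count are assumptions, and it has not been verified that these particular types take the RTL path described above.

```fortran
! Hypothetical timing sketch: whole-object vs component-wise assignment
! of an extended type. All names and types here are assumptions.
module ext_assign_mod
  implicit none

  type :: range_base_t
    integer :: lo = 0
  end type range_base_t

  type, extends(range_base_t) :: range_t
    integer :: hi = 0
  end type range_t
end module ext_assign_mod

program ext_assign
  use, intrinsic :: iso_fortran_env, only: int64
  use ext_assign_mod
  implicit none
  integer, parameter :: n = 1000000          ! assumed iteration count
  type(range_t), allocatable :: jxrange(:)
  type(range_t) :: nc_range
  integer :: i
  integer(int64) :: t0, t1, t2, rate

  allocate(jxrange(1))
  nc_range%lo = 1

  call system_clock(t0, rate)
  do i = 1, n
    nc_range%hi = i
    jxrange(1) = nc_range                    ! intrinsic whole-object assignment
  end do
  call system_clock(t1)
  if (jxrange(1)%hi /= n) error stop 'whole-object copy gave wrong result'

  do i = 1, n
    nc_range%hi = -i
    jxrange(1)%lo = nc_range%lo              ! component-wise copy
    jxrange(1)%hi = nc_range%hi
  end do
  call system_clock(t2)
  if (jxrange(1)%hi /= -n) error stop 'component-wise copy gave wrong result'

  print '(a,f8.3,a)', 'whole-object   ', real(t1 - t0) / real(rate), ' s'
  print '(a,f8.3,a)', 'component-wise ', real(t2 - t1) / real(rate), ' s'
end program ext_assign
```

An optimizing compiler may simplify loops this small, so the numbers are only indicative; profiling the real routine remains the more reliable check.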