Intel® Fortran Compiler

Run time performance, for_alloc_assign_v2 and for_alloc_copy

Simon_Richards1

We have a production code that shows significantly worse performance with recent versions of ifort (including oneAPI 2022.1 and ifort version 19.0) than with ifort version 16.0.

 

I have used VTune to identify a single subroutine responsible for this performance slowdown in a particular test problem.

 

When the code is built with ifort version 16.0 this routine takes just under 4 seconds. 

 

When the code is built with oneAPI 2022 (ifort) this routine takes more than 21 seconds. This additional CPU time is almost entirely spent in for_alloc_assign_v2 (9.7 sec) and for_alloc_copy (6.2 sec). I don't see any time spent in these routines at all in the ifort 16.0 case.

 

The subroutine in question is a complicated piece of legacy code and I have not yet managed to reproduce the same behaviour in a minimal example. It may be possible to refactor the code to avoid repeated allocation and deallocation, but first I want to understand why the same code built with different ifort versions shows such different behaviour and performance.

 

Thanks. 

 

 

Simon_Richards1

I just wanted to note here that I posted results from a modified case which hopefully removes the optimization from the equation, in case it is lost in the threading on the forum.

 

Ron_Green
Moderator

It is curious. I'll have a look at this sometime this week.

 

jimdempseyatthecove
Honored Contributor III

FWIW

  SUBROUTINE sub()  
    laws(1)%a(1) = laws(1)%a(1) + 1.0    
    jxrange(1) = nc_range      
  END SUBROUTINE sub

The token "laws(1)%a(1)" should compile to scalar instructions that generate a reference via the array descriptors. It should not require the generation of temporary objects, which would account for the excessive time consumption.

 

This said, it should be noted that in this example, sub is a contained procedure .and. laws, jxrange and nc_range are declared in the containing scope. If I were to guess, the code generator is generating a copy-in/copy-out method as opposed to reference-in.

 

Jim Dempsey

Simon_Richards1

Good point about sub being a contained procedure in the example code. I only did that for convenience and in fact the affected subroutine in the production code is an external subroutine. I have modified my example to make sub an external subroutine. The results are the same: ifort 16.0 - 2.6 seconds; oneAPI 2022 (ifort) - 48 seconds.

 

Just for completeness I also tried making sub a contained procedure in the data_mod module. The results are the same again. 

jimdempseyatthecove
Honored Contributor III

The disassembly code would be instructive for the Intel developers to look at.

Seeing that the results of the example code are correct, it would appear that something on the rhs is making the compiler think one or more temporary objects need to be created in order to extract a scalar and complete the expression (and then destroyed afterwards).

While your case seems to be the first report, your data structures are not unusual, and thus I suspect this is happening to others as well. And due to the severity of performance impact, this warrants attention.

 

Jim Dempsey

Ron_Green
Moderator

I also moved sub to the module: same result. I added a call counter to test whether gfortran and the old ifort were optimizing away all but one call. They are not: there are a lot of calls to sub with all compilers.

 

Jim is on the right track, from what I've seen so far. This appears to be a change from the 17.6 compiler to the 18.0 compiler. So whatever it is, it ain't new. Something about these extended derived types is tripping up the front end, I think. Not sure yet, but pretty sure. It's interesting that this has been around in all compilers since 18.0.0 and no one has noticed it to date. Makes me think it's a little-used code path at fault, or that others just don't have as many calls to low-work subroutines (the overhead becomes noise in a sub with lots of actual work).

I really do NOT think it's sloppy code in for_* routines, else EVERYONE would be impacted.  This is why I think Jim is on the correct track - the front end is thinking it needs a temp when it doesn't.

I'll wrap up a bug report and get that over to the front end people to analyze.  Thanks for sending this to us.

 

ron

Simon_Richards1

Thank you @Ron_Green 

 

We’ve been releasing with version 16.0 of the compiler because we needed to support users with older operating systems. We’ve been building and testing with newer compilers for a while but not really focusing on performance, which is why we’ve only recently uncovered this issue. Now that we’re dropping support for older operating systems we’re ready to move to later compilers and we therefore care much more about their performance.

 

The routine in the production code obviously does a lot more work than sub in the example, so the performance overhead is less dramatic, but it is definitely significant, and way above being just noise. The routine is called a very large number of times, so it needs to be quick.

It’s interesting that so far we’ve only noticed this in one routine. We could probably refactor that routine to remove the issue but the worry is that other code paths might be similarly affected.

 

I will be very interested to hear what the front end developers have to say. 

jimdempseyatthecove
Honored Contributor III

As a sketch for a possible workaround:

  SUBROUTINE sub()
    associate(a=>laws(1)%a) ! or a(1) if using only index=1
      a(1) = a(1) + 1.0    
      jxrange(1) = nc_range   
    end associate   
  END SUBROUTINE sub

Note, this is a workaround that need not be removed later.
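To compare the two forms across compiler versions, a small timing harness along the following lines might help. It is only a sketch: the thread does not show how laws, jxrange, or nc_range are declared, so the types law_t and range_t, the init routine, and the call count are all assumptions made for illustration, and it may or may not trigger the same for_alloc_* overhead as the original example.

```fortran
! Timing sketch. The declarations of laws, jxrange and nc_range are NOT
! shown in the thread; the types below are assumptions for illustration.
module data_mod
  implicit none

  type :: law_t                        ! hypothetical type
    real, allocatable :: a(:)
  end type law_t

  type :: range_t                      ! hypothetical type
    integer, allocatable :: bounds(:)
  end type range_t

  type(law_t),   allocatable :: laws(:)
  type(range_t), allocatable :: jxrange(:)
  type(range_t)              :: nc_range

contains

  ! Allocate everything explicitly so no reallocation-on-assignment
  ! option is needed.
  subroutine init()
    allocate(laws(1), jxrange(1))
    allocate(laws(1)%a(1), nc_range%bounds(2))
    laws(1)%a = 0.0
    nc_range%bounds = [1, 10]
  end subroutine init

  ! Same two statements as sub in the thread's example.
  subroutine sub_direct()
    laws(1)%a(1) = laws(1)%a(1) + 1.0
    jxrange(1) = nc_range
  end subroutine sub_direct

  ! Same statements wrapped in the suggested ASSOCIATE construct.
  subroutine sub_assoc()
    associate (a => laws(1)%a)
      a(1) = a(1) + 1.0
      jxrange(1) = nc_range
    end associate
  end subroutine sub_assoc

end module data_mod

program time_sub
  use iso_fortran_env, only: int64
  use data_mod
  implicit none
  integer, parameter :: ncalls = 5000000   ! arbitrary call count
  integer :: i
  integer(int64) :: t0, t1, t2, rate

  call init()

  call system_clock(t0, rate)
  do i = 1, ncalls
    call sub_direct()
  end do
  call system_clock(t1)
  do i = 1, ncalls
    call sub_assoc()
  end do
  call system_clock(t2)

  print '(a,f8.3,a)', 'direct:    ', real(t1 - t0) / real(rate), ' s'
  print '(a,f8.3,a)', 'associate: ', real(t2 - t1) / real(rate), ' s'
  ! Print a result so the loops cannot be discarded as dead code.
  print *, 'check value: ', laws(1)%a(1)
end program time_sub
```

Building the same file with each compiler version and comparing the two timings, ideally alongside a VTune profile, would show whether the ASSOCIATE form avoids the time spent in for_alloc_assign_v2 and for_alloc_copy for a given set of declarations.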

 

Jim Dempsey

Simon_Richards1

Thanks Jim, good suggestion.

 

I did have a go at replacing the assignments with associates but the production code is a lot more complicated, with various loops and IF blocks and even GOTOs (legacy FORTRAN77 style routine). It became challenging to make the associate blocks work with the existing program flow. I think we’d have to refactor the routine significantly to make it work. That kind of refactoring would have other benefits of course, but we’re talking here about a legacy code route which is maintained for backward compatibility but no longer actively developed.

Ron_Green
Moderator

Bug ID is CMPLRLLVM-42717


Steve_Lionel
Honored Contributor III

I’d be very interested in whatever technical details become available about the difference. 

Simon_Richards1

I'd be very interested to hear if there has been any progress on this issue, or any further information.
