Intel® Fortran Compiler

IFX 2024.0 Performance degradation caused by bogus "Serialized function calls"

RobertBollig
Novice

Hi, 

When comparing recent ifort and ifx performance, I notice a performance degradation particularly in subroutines where a large number of "remark #15485: serialized function calls" entries are reported in the SIMD optimization reports.

This seems to happen when a non-local module array or derived-type member array is accessed by index, à la type%array(i,j,k).
This might be caused by ifx mistaking the indexed array access for a function call in the optimization frontend.
The code still gives the same numerical results as ifort, so it still translates to assembly correctly, but the vectorization becomes suboptimal.
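
For illustration, a minimal sketch of the access pattern I mean (a hypothetical reproducer, not the real code):

module coeffs_mod
  type :: coeffs_t
     real(8), allocatable :: kappa(:,:,:)
  end type
  type(coeffs_t) :: coeffs   ! non-local module variable
end module

subroutine apply(x, n, j, k)
  use coeffs_mod
  integer, intent(in) :: n, j, k
  real(8), intent(inout) :: x(n)
  integer :: i
  !$omp simd
  do i = 1, n
     x(i) = x(i) * coeffs%kappa(i,j,k)   ! type%array(i,j,k) access by index
  end do
end subroutine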

I cannot try with ifx 2024.2.0 because that version causes segfaults and internal compiler errors galore.

Is such a problem known to the IFX development team?

Ulrich_M_
New Contributor I

Yet another performance loss relative to ifort for something super basic, and yet Intel no longer even wants to provide ifort? I understand that resources are limited and that the Intel Fortran team cannot develop two compilers in parallel. But why not at a minimum continue to provide the frozen version of ifort as a download for longer? Given these comments, I for one will try to hang in there with my current version of ifort for as long as I can.

Ron_Green
Moderator

First, ifx and ifort parse your program identically. Everything downstream, the optimizations and vectorization, is completely different and may need some tuning.

 

The warning message indicates your loop looks something like this:

do i=1,N
   ... do some calculations
   call mysubroutine( X, i )  !X is an array, or an array component in a type
   !or
   X(i) = myfunc( X(i) )
   ... do some stuff
end do

 

To get the call to 'mysubroutine' to vectorize, one of two things has to happen:

1) The compiler inlines the call or function. ifort had very aggressive inlining by default. ifx uses LLVM for this, and to get it to kick in you need the option -ipo or -flto.

Are you compiling your code with -ipo or -flto? ifort often did this automatically at higher optimization levels; ifx does not.

 

2) You decorate the definition of your procedure with

subroutine mysubroutine( X, i )
!$omp declare simd(mysubroutine)

and then the loop where it is called needs an omp simd directive:

!$omp simd
do i=1,N
   ... do some calculations
   call mysubroutine( X, i )  !X is an array, or an array component in a type
   ... do some stuff
end do
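
For example, a complete minimal sketch of the pattern (hypothetical names, just to show where the directives go):

module simd_demo
contains
  subroutine scale_elem( x, i, factor )
    !$omp declare simd(scale_elem) uniform(x, factor) linear(i)
    real(8), intent(inout) :: x(*)
    integer, intent(in)    :: i
    real(8), intent(in)    :: factor
    x(i) = x(i) * factor      ! each SIMD lane handles one i
  end subroutine scale_elem
end module simd_demo

program test_simd
  use simd_demo
  integer, parameter :: n = 1024
  real(8) :: x(n)
  integer :: i
  x = 1.0d0
  !$omp simd
  do i = 1, n
     call scale_elem( x, i, 2.0d0 )   ! vectorized via the SIMD clone
  end do
  print *, x(1), x(n)
end program test_simd

With both directives in place, the compiler can call the vector version of the routine instead of serializing the calls.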

 

What version of the compiler are you using?  Can you send the code and input to reproduce?

RobertBollig
Novice

Hello Ron,

Here is an excerpt of the code. Note that there is not a single function call in this code block; every type member is an array.

ifx (IFX) 2024.0.2 20231213
Copyright (C) 1985-2023 Intel Corporation. All rights reserved.

 

    !$OMP SIMD
    do i=0,spectralCut%lastActive(max(ie-1,1),is)
       ip1=min(i+1,config%imaxp+1)
       i2=2*i
       i2p1=i2+1
       if (ie.le.spectralCut%ierandb(i,is)+1) then
! d(Momeq_xw)/dY_e
!J
          df_ms(1,2)%data(i2) = radialGrid%dv(i)                                *      &
                                backquants%alpq(i)                              *   (  &
                               -matter_coefficients%opac    %kappa_aq(i,ie,is)   *     &
                                matter_coefficients%dOpacdYe%bplaq(i,ie,is,iNumEN)  -  &
                                matter_coefficients%dOpacdYe%kappa_aq(i,ie,is)   * (   &
                                matter_coefficients%opac    %bplaq(i,ie,is,iNumEn)-    &
                                xw(i2,ie,is)                                       )   &
                                                                                    )

!H
          df_ms(1,1)%data(i2p1) = radialGrid%polrq1(i)                          *      &
                                  radialGrid%dvq(i) * backquants%alp(i)         *   (  &
                                  matter_coefficients%dOpacdYe%kappa_sq(i,ie,is)  *    &
                                  xw(i2p1,ie,is)                                       &
                                                                                    )

          df_ms(1,3)%data(i2p1) = radialGrid%polrq(i)                            *     &
                                  radialGrid%dvq(i) * backquants%alp(i)          *  (  &
                                  matter_coefficients%dOpacdYe%kappa_sq(ip1,ie,is) *   &
                                  xw(i2p1,ie,is)                                       &
                                                                                    )
! d(Momeq_xw)/dE
!J
          df_ms(2,2)%data(i2) = radialGrid%dv(i)                                *      &
                                backquants%alpq(i)                              *   (  &
                               -matter_coefficients%opac   %kappa_aq(i,ie,is)    *     &
                                matter_coefficients%dOpacdE%bplaq(i,ie,is,iNumEn)   -  &
                                matter_coefficients%dOpacdE%kappa_aq(i,ie,is)    * (   &
                                matter_coefficients%opac   %bplaq(i,ie,is,iNumEn)    - &
                                xw(i2,ie,is)                                      )    &
                                                                                    )

!H
          df_ms(2,1)%data(i2p1) = radialGrid%polrq1(i)                          *      &
                                  radialGrid%dvq(i) * backquants%alp(i)         *   (  &
                                  matter_coefficients%dOpacdE%kappa_sq(i,ie,is)   *    &
                                  xw(i2p1,ie,is)                                       &
                                                                                    )


          df_ms(2,3)%data(i2p1) = radialGrid%polrq(i)                           *      &
                                  radialGrid%dvq(i) * backquants%alp(i)         *   (  &
                                  matter_coefficients%dOpacdE%kappa_sq(ip1,ie,is) *    &
                                  xw(i2p1,ie,is)                                       &
                                                                                    )
       endif !(ie.le.spectralCut%ierandb(i,is)
    enddo !i

 

And the corresponding optimization report

 

LOOP BEGIN at momeq_jacobian.f90 (70, 5)
    remark #25530: Stmt at line 73 sinked after loop using last value computation
    remark #25530: Stmt at line 72 sinked after loop using last value computation
    remark #15301: SIMD LOOP WAS VECTORIZED
    remark #15305: vectorization support: vector length 16
    remark #15475: --- begin vector loop cost summary ---
    remark #15476: scalar cost: 79.000000 
    remark #15477: vector cost: 20.890625 
    remark #15478: estimated potential speedup: 3.656250 
    remark #15309: vectorization support: normalized vectorization overhead 0.140625
    remark #15485: serialized function calls: 8
    remark #15488: --- end vector loop cost summary ---
    remark #15447: --- begin vector loop memory reference summary ---
    remark #15450: unmasked unaligned unit stride loads: 1 
    remark #15456: masked unaligned unit stride loads: 14 
    remark #15458: masked indexed (or gather) loads: 14 
    remark #15459: masked indexed (or scatter) stores: 6 
    remark #15567: Gathers are generated due to non-unit stride index of the corresponding loads.
    remark #15568: Scatters are generated due to non-unit stride index of the corresponding stores.
    remark #15474: --- end vector loop memory reference summary ---
LOOP END

 

If I remove the !$OMP SIMD directive at the beginning, it won't even vectorize. There are no dependencies in that piece of code.

LOOP BEGIN at momeq_jacobian.f90 (70, 5)
    remark #15344: Loop was not vectorized: vector dependence prevents vectorization
    remark #15346: vector dependence: assumed FLOW dependence between (79:11) and (84:62) 
    remark #15346: vector dependence: assumed ANTI dependence between (92:64) and (79:11) 
    remark #15346: vector dependence: assumed FLOW dependence between (90:11) and (84:62) 
    remark #15346: vector dependence: assumed OUTPUT dependence between (90:11) and (79:11) 
    remark #15346: vector dependence: assumed FLOW dependence between (90:11) and (92:64) 
    remark #15346: vector dependence: assumed ANTI dependence between (100:64) and (79:11) 
    remark #15346: vector dependence: assumed ANTI dependence between (100:64) and (90:11) 
    remark #15346: vector dependence: assumed FLOW dependence between (98:11) and (84:62) 
    remark #15346: vector dependence: assumed OUTPUT dependence between (98:11) and (79:11) 
    remark #15346: vector dependence: assumed FLOW dependence between (98:11) and (92:64) 
LOOP END


 

Ron_Green
Moderator

@Ulrich_M_ I fully understand. When we moved the compiler from Digital/Compaq to Intel we had the same growing pains. You may remember that period. A lot of people stuck with DVF/CVF for many years until they finally moved to ifort. In fact, we still get questions on DVF/IVF (and occasionally DEC Fortran on VAX or ifort on Itanium!).

 

We are well on our way with ifx. But there will be certain loop patterns, or data structures and types, that will need tuning and tweaking over time. As we get reports AND code to reproduce the performance degradation, we can take action on addressing those. ifort took a lot of years of tuning to get where it is today; I expect the same for ifx. The issue reported in this thread really does look related to inlining, or rather a lack thereof, that prevents ifx from vectorizing the code. We can try some compiler options or the OMP DECLARE SIMD and SIMD combo.

 

That users need to consider downloading and saving a copy of ifort is why I wrote the blog post and put notices on this forum: to give everyone a heads-up to save off a copy of ifort if ifx is not quite ready for your code OR you don't have time to put into the porting and tuning right now. For most people ifx and ifort perform roughly the same. If you are one of those for whom this is NOT true, it may seem like ifx "is not ready".

 

Each update release is getting roughly 200 fixes, some of them performance related, so with each update it gets a little better. We're working on the 2025.0 release, and it is looking like a good solid compiler with a number of fixes from the vectorizer and optimizer team. If you can't send us code, hopefully another user has sent us something actionable that will also address your issue. One advantage we have here is that the very large user base increases the probability that any issue you are seeing will eventually have another user submit something we can fix. So I would encourage those who are seeing slower code with ifx not to give up. Rather, try the updates as they come out. Sooner or later a fix for another customer could also unlock the performance in your code.

jimdempseyatthecove
Honored Contributor III

FWIW

I think you have two statements (lines 3 and 6 of your excerpt, i.e. the ip1=min(...) assignment and the if test) that are interfering with getting good optimizations.

Line 3 can be omitted by changing the do i loop so that it conforms to ip1==i+1, then handling the exception case either before or after the loop; see the sketch below.

Line 6 will require information about the frequency and placement in the iteration space. If on most of the iterations the line 6 evaluation is .true., then it may be most effective to save the df_ms(1:2,1:3)%data arrays into temporary arrays (save_df_ms(1:2,1:3)%data), then run a loop afterwards restoring the values that satisfy if (ie.gt.spectralCut%ierandb(i,is)+1).
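
Along these lines, a simplified, self-contained analogue of the line-3 peel-off (untested; hypothetical arrays a and b standing in for yours):

program peel_demo
  implicit none
  integer, parameter :: n = 100, imaxp = 90
  real(8) :: a(0:n+1), b(0:n)
  integer :: i, ip1, nmax
  a = 1.0d0
  ! main body: ip1 == i+1 holds for every i up to imaxp, so no min() is needed
  nmax = min(n, imaxp)
  !$omp simd
  do i = 0, nmax
     ip1 = i + 1
     b(i) = a(i) + a(ip1)
  end do
  ! exception case: the iterations where min() would have clamped ip1
  do i = nmax + 1, n
     ip1 = imaxp + 1
     b(i) = a(i) + a(ip1)
  end do
  print *, sum(b)
end program peel_demo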

 

Jim Dempsey

RobertBollig
Novice

Hi Jim,
thanks for the hints.
Line 3 is normally taken care of automatically by the compiler in the remainder loop; at least I don't see any hint in the optimization reports of it generating gathers for that particular variable indexing.

Line 6 is generally .true. for nearly all values due to the spectralCut%lastActive loop iteration bounds.
Generally there is first a block of contiguous .true. values, followed by either a contiguous block of .false. or, in transient cases, islands of .true. inside a block of .false.. lastActive is the index of the last .true. value; alternatively there is also an "allActive" index that marks the element before the first .false. value, which I used for the multistage approach below.

A typical pattern would be "T,T,T,T,T,T,T,...,T,T,T,T,T,T,F,F,F,F,F,F,T,T,F,F,F,...,F,F,F", where the last T of the initial contiguous block marks the allActive index and the final T marks the lastActive index.

Early on I used a multistage approach, where I reimplemented the code without the if statement for the known block of contiguous .true. values, followed by the conditional remainder loop in the example above.

There were minimal performance differences, because the AVX512 mask registers have little overhead and the if statement evaluates to .true. in 99% of cases. The only effective difference is probably just the optimization heuristics being slightly different, but as long as one forces vectorization using !$OMP SIMD it should not make much of a performance difference.


Reimplementing the multistage approach in all loops of this style (there are many more) would clutter up the code. Alternatively I would have to create elemental SIMD subroutines that encapsulate the loop body and hope that the compiler's inlining is good enough.
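
(For the record, a sketch of what such an encapsulation might look like; the helper is hypothetical, the call site uses the names from the excerpt above:)

subroutine jac_j( out, dv, alpq, kappa_a, dbplaq, dkappa_a, bplaq, x )
  !$omp declare simd(jac_j)
  real(8), intent(out) :: out
  real(8), intent(in)  :: dv, alpq, kappa_a, dbplaq, dkappa_a, bplaq, x
  ! mirrors the !J expression: dv*alpq*( -kappa_a*dbplaq - dkappa_a*(bplaq - x) )
  out = dv * alpq * ( -kappa_a*dbplaq - dkappa_a*(bplaq - x) )
end subroutine jac_j

and at the call site:

!$omp simd
do i=0,spectralCut%lastActive(max(ie-1,1),is)
   ...
   call jac_j( df_ms(1,2)%data(2*i), radialGrid%dv(i), backquants%alpq(i),  &
               matter_coefficients%opac%kappa_aq(i,ie,is),                  &
               matter_coefficients%dOpacdYe%bplaq(i,ie,is,iNumEN),          &
               matter_coefficients%dOpacdYe%kappa_aq(i,ie,is),              &
               matter_coefficients%opac%bplaq(i,ie,is,iNumEn),              &
               xw(2*i,ie,is) )
   ...
enddo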

 

The complete code as it stands now, after roughly a decade of optimization, gets close to 50% of theoretical peak double-precision flops on CPUs, so there are very few avenues of optimization left (as long as the compiler vectorizes reliably and efficiently).

jimdempseyatthecove
Honored Contributor III

Try something along the lines of this (untested code):

 

 

! index arrays sized to the loop extent; an 8-byte integer kind keeps the
! gather/scatter indices the same width as the 8-byte reals
integer(8), dimension(spectralCut%lastActive(max(ie-1,1),is)+1) :: &
  idx_i, idx_ip1, idx_i2, idx_i2p1
integer :: n_idx
...
n_idx = 0
do i=0,spectralCut%lastActive(max(ie-1,1),is)
  if (ie.le.spectralCut%ierandb(i,is)+1) then
    n_idx = n_idx+1
    idx_i(n_idx)    = i
    idx_ip1(n_idx)  = min(i+1,config%imaxp+1)
    idx_i2(n_idx)   = 2*i
    idx_i2p1(n_idx) = 2*i+1
  endif
end do
if(n_idx > 0) then

! note: these are array assignments with vector subscripts; the compiler
! can gather the rhs and scatter the lhs directly, no !$OMP SIMD needed
  df_ms(1,2)%data(idx_i2(1:n_idx)) = &
    radialGrid%dv(idx_i(1:n_idx)) * &
    backquants%alpq(idx_i(1:n_idx)) * ( &
      -matter_coefficients%opac%kappa_aq(idx_i(1:n_idx),ie,is) * &
       matter_coefficients%dOpacdYe%bplaq(idx_i(1:n_idx),ie,is,iNumEN) - &
       matter_coefficients%dOpacdYe%kappa_aq(idx_i(1:n_idx),ie,is) * ( &
         matter_coefficients%opac%bplaq(idx_i(1:n_idx),ie,is,iNumEn) - &
         xw(idx_i2(1:n_idx),ie,is) &
       ) &
    )

!H
  df_ms(1,1)%data(idx_i2p1(1:n_idx)) = &
    radialGrid%polrq1(idx_i(1:n_idx)) * &
    radialGrid%dvq(idx_i(1:n_idx)) * backquants%alp(idx_i(1:n_idx)) * ( &
    matter_coefficients%dOpacdYe%kappa_sq(idx_i(1:n_idx),ie,is) * &
    xw(idx_i2p1(1:n_idx),ie,is) &
    )

  df_ms(1,3)%data(idx_i2p1(1:n_idx)) = &
    radialGrid%polrq(idx_i(1:n_idx)) * &
    radialGrid%dvq(idx_i(1:n_idx)) * backquants%alp(idx_i(1:n_idx)) * ( &
    matter_coefficients%dOpacdYe%kappa_sq(idx_ip1(1:n_idx),ie,is) * &
    xw(idx_i2p1(1:n_idx),ie,is) &
    )

! d(Momeq_xw)/dE
!J
  df_ms(2,2)%data(idx_i2(1:n_idx)) = &
    radialGrid%dv(idx_i(1:n_idx)) * &
    backquants%alpq(idx_i(1:n_idx)) * ( &
    -matter_coefficients%opac%kappa_aq(idx_i(1:n_idx),ie,is) * &
    matter_coefficients%dOpacdE%bplaq(idx_i(1:n_idx),ie,is,iNumEn) - &
    matter_coefficients%dOpacdE%kappa_aq(idx_i(1:n_idx),ie,is) * ( &
    matter_coefficients%opac%bplaq(idx_i(1:n_idx),ie,is,iNumEn) - &
    xw(idx_i2(1:n_idx),ie,is) ) &
    )

!H
  df_ms(2,1)%data(idx_i2p1(1:n_idx)) = &
    radialGrid%polrq1(idx_i(1:n_idx)) * &
    radialGrid%dvq(idx_i(1:n_idx)) * backquants%alp(idx_i(1:n_idx)) * ( &
    matter_coefficients%dOpacdE%kappa_sq(idx_i(1:n_idx),ie,is) * &
    xw(idx_i2p1(1:n_idx),ie,is) &
    )

  df_ms(2,3)%data(idx_i2p1(1:n_idx)) = &
    radialGrid%polrq(idx_i(1:n_idx)) * &
    radialGrid%dvq(idx_i(1:n_idx)) * backquants%alp(idx_i(1:n_idx)) * ( &
    matter_coefficients%dOpacdE%kappa_sq(idx_ip1(1:n_idx),ie,is) * &
    xw(idx_i2p1(1:n_idx),ie,is) &
    )
endif !(n_idx>0)

 

On AVX512 systems, the compiler should be able to gather the rhs and scatter the lhs of each statement.

Note: I may have introduced copy/paste errors in the above. Try this out on one of your sections of code; it should be relatively easy to implement and test.

 

Jim Dempsey

 

jimdempseyatthecove
Honored Contributor III

Note: if the number of points is large (n_idx > ???), consider parallelizing the code:

!$omp parallel do simd
do i=1,n_idx
  df_ms(1,2)%data(idx_i2(i)) = &
  ...
end do

 

Also, when adapting this, double-check the parentheses on the store statements; it is easy to drop a close ) in the copy/paste.

 

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

RE: parallelization

The CPU core has a limited number of registers, and you have six expressions and stores; this may exceed the register capacity.

Therefore, through experimentation, try using more than one loop (2, 3, or 6 loops).

And, depending on the size of n_idx, you may want to use a parallel region with 6 sections, each executing one scatter/gather statement. See the toy example below.
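
A self-contained toy of the sections idea (two stand-in statements instead of your six, hypothetical arrays):

program sections_demo
  implicit none
  integer, parameter :: n = 1000
  real(8) :: a(2*n+2), b(2*n+2), src(n)
  integer :: idx_i2(n), idx_i2p1(n), i, n_idx
  n_idx = n
  do i = 1, n
     idx_i2(i)   = 2*i
     idx_i2p1(i) = 2*i+1
     src(i)      = real(i,8)
  end do
  a = 0; b = 0
  ! each section owns one independent scatter statement and its own lhs
  !$omp parallel sections
  !$omp section
  a(idx_i2(1:n_idx))   = 2.0d0 * src(1:n_idx)
  !$omp section
  b(idx_i2p1(1:n_idx)) = 3.0d0 * src(1:n_idx)
  !$omp end parallel sections
  print *, a(2), b(3)
end program sections_demo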

 

Some experimentation will find the sweet spot.

 

Also, it sometimes helps register pressure and/or the optimizer to use either an associate construct or a call to a contained procedure with the slices of the arrays:

block
  ...
  associate( &
    kappa_sq => matter_coefficients%dOpacdE%kappa_sq(:,ie,is), &
    ... => ...)
 
  ...
  ...kappa_sq(1:n_idx)...
  ...
  end associate
end block

Jim Dempsey
