Notes: RA scaled back due to excessive number of temporaries

kinni · ‎07-13-2021

I have the following problem:

Platform: Linux / Windows

Compiler version: Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 19.1.3.311 Build 20201010_000000

as well as earlier versions.

I compile it as follows (one of these, problem remains the same):

ifort /Qip /O3 /Qprec-div /Qprec /QaxCORE-AVX512 /Qopenmp /Qopt-report tmp.for

ifort /O3 /Qip /Qprec-div /Qprec /QxHost /Qopenmp /Qopt-report tmp.for

ifort -ipo -O3 -no-prec-div -fp-model fast=2 -xHost -qopenmp -qopenmp-link=static tmp.for

ifort -ipo -O3 -prec-div -mp1 -xHost -qopenmp -qopenmp-link=static tmp.for

ifort -ipo -O3 -prec-div -mp1 -axCORE-AVX512 -qopenmp -qopenmp-link=static -qopt-report tmp.for

As the program grew, at some point I started getting the following additional note at the end of the *.optrpt file, and the program slowed down by about 40%.

The program is pretty large and contains many OpenMP loops.

(The program contains 66 INCLUDE statements, which pull in various routines.)

C:\tmp2\tmp.for(16,7):remark #34051: REGISTER ALLOCATION : [MAIN__.A] C:\tmp2\tmp.for:16

....

Hardware registers
Reserved : 2[ rsp rip]
Available : 39[ rax rdx rcx rbx rbp rsi rdi r8-r15 mm0-mm7 zmm0-zmm15]
Callee-save : 18[ rbx rbp rsi rdi r12-r15 xmm6-xmm15]
Assigned : 31[ rax rdx rcx rbx rbp rsi rdi r8-r15 zmm0-zmm15]

Routine temporaries
Total : 30410
Global : 7699
Local : 22711
Regenerable : 7213
Spilled : 5865

Routine stack
Variables : 10081 bytes*
Reads : 807 [8.57e-03 ~ 0.0%]
Writes : 1903 [6.77e-01 ~ 0.7%]
Spills : 52928 bytes*
Reads : 10808 [2.89e+01 ~ 28.9%]
Writes : 7700 [1.32e+01 ~ 13.2%]

Notes
RA scaled back due to excessive number of temporaries

*Non-overlapping variables and spills may share stack space,
so the total stack size might be less than this.
-------------
My comments:
I think this note causes the problem and makes the code slow down:

"Notes
RA scaled back due to excessive number of temporaries"

tmp.for
...
(16) USE omp_lib

Does anybody know what it means and how to overcome this problem?

Thanks for helping.

jimdempseyatthecove · ‎07-13-2021

For your AVX512 builds, you might try adding: /Qopt-zmm-usage:high

Also, register pressure can (may) be reduced by NOT inlining everything (either explicitly or implicitly via IPO).

Jim Dempsey

kinni · ‎07-13-2021

Thanks for your suggestions!

I tried to compile with ipo disabled (/Qipo-), but the problem still remains.

Also adding: /Qopt-zmm-usage:high did not help.

The "note" appears in the cg section (Report from: Code generation optimizations [cg]) and not in the ipo section of the report, so I am not sure if it has something to do with ipo.

jimdempseyatthecove · ‎07-13-2021

If your procedure is very long and contains loops in different sections, you can localize the "temporary" variables by enclosing the loop(s) in BLOCK/END BLOCK and declaring the "temporaries" within the BLOCK.

subroutine foo(...)
... ! data
... ! code
BLOCK
    REAL :: SomeTemporary ! known only inside this block
    SomeTemporary = 0.0
    DO i=1,n
        ...
        SomeTemporary = SomeTemporary + Expression
        ...
    END DO
    NonBlockVariable = SomeTemporary
END BLOCK
    ... ! more code with additional blocks
end subroutine

It would help if you can share the code.

Jim Dempsey

kinni · ‎07-14-2021

Thanks for your help!

I'll try that.

jimdempseyatthecove · ‎07-14-2021

Please report back if this helps (or doesn't help) as this will aid others with similar concerns.

Jim Dempsey

kinni · ‎07-15-2021

I have now tried different compiler options (/O2, /O3, /fast,..., /QxHost, ... ), but nothing helps.

I do not have any subroutines or BLOCKs; everything is placed in one big program unit (using INCLUDE).

I do not have temporary arrays.

If I exclude OpenMP, then I get the following message:

"Notes
RA scaled back due to excessive number of interferences"

(Compiled with: ifort /O1 /QaxCORE-AVX2 /Qoverride-limits /Qopt-zmm-usage:high /Qopt-report tmp.for)

So I think the problem is not only to do with temporaries?!

kinni · ‎07-15-2021

By the way, what does this mean:

"RA scaled back due to ..."

jimdempseyatthecove · ‎07-15-2021

zmm usage is only available when using AVX512. AVX512 has double the registers of AVX2 (and earlier).

I suspect that with OpenMP each parallel region is compiled as what amounts to a separate procedure, in which the PRIVATE variables are localized to that procedure and the compiler's optimization focus is constrained to the parallel region. IOW register usage does not span into, nor out of, the parallel region.

For your non-OpenMP configuration, you might see a similar effect (benefit) by placing the loop(s) that can be parallelized (but now run serially) into CONTAINS subroutines. You may also want to place/duplicate the variables that may benefit from being registerized in the contained procedures (loop control variables, temporaries, etc...).
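As a minimal sketch of that idea (not the poster's actual code): the loop below is moved into a contained subroutine, and its loop index and scratch variable are declared there instead of in the main program. The names scale_fields, array1, array2, fi_1 and data_1 are illustrative assumptions loosely based on the variable names used in this thread, and whether this actually relieves register pressure depends on the compiler not inlining the procedure back into the caller.

```fortran
program contained_loops
  implicit none
  integer, parameter :: n = 1000
  real :: array1(n), array2(n), fi_1(n)
  real :: data_1
  integer :: k

  ! Set up some illustrative data in the host program
  data_1 = 9.81
  do k = 1, n
    array1(k) = real(k)
    array2(k) = 2.0
  end do

  ! The loop now lives in its own procedure
  call scale_fields(1, n)

  print *, 'fi_1(1) =', fi_1(1)

contains

  ! Hypothetical contained procedure: the arrays are reached through
  ! host association, the working variables are local to it.
  subroutine scale_fields(is, ie)
    integer, intent(in) :: is, ie
    integer :: ll      ! loop index, local to this procedure
    real    :: coef    ! temporary, local to this procedure
    do ll = is, ie
      coef = data_1 * array1(ll)
      fi_1(ll) = coef * array2(ll)
    end do
  end subroutine scale_fields

end program contained_loops
```

The arrays stay where they are; only the short-lived variables move, so the main program has fewer candidates competing for registers.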

Jim Dempsey

kinni · ‎07-16-2021

Thanks for your comments, but I really did not understand what you mean or what I should do to overcome the problem.

jimdempseyatthecove · ‎07-16-2021

RA could be considered to mean Register Assignment (or Register Allocation), though the author of the annotation may have had something else in mind. There is a limited number of general purpose registers (8/16/32/64-bit) used for integers or addresses, and a limited number of SIMD registers (8, 16 or 32 of them, depending on architecture) used for floating point scalars and/or SIMD vectors.

When you (or some auto code generator) write your code as a very large number of statements, the working variables are declared (possibly implicitly) once at the procedure entry point. The compiler's optimizer then needs to determine, if possible, which variables, if any, will produce better performing code when held in registers (GP and SIMD). The problem with very long and complicated code sections is that there are too many "good" candidates to choose from when deciding which ones get registerized. By using BLOCKs, contained procedures, external procedures (and/or parallel regions), you reduce the scope and lifetime of the local variables, which in turn reduces the number of potential candidates for use in registers (at the expense of some overhead for the call, or the equivalent of a call in the case of OpenMP).
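To make the scoping point concrete, here is a small hedged sketch with two consecutive BLOCKs (BLOCK is a Fortran 2008 construct). Each BLOCK declares its own i, dx and dy, so those names have no lifetime outside the loop that uses them. The arrays x and y and the result variables are made up for illustration, and this only shows the language mechanism, not a guaranteed fix for the "RA scaled back" note.

```fortran
program block_scoping
  implicit none
  integer, parameter :: n = 100
  real :: x(n), y(n)
  real :: total_len, max_dx
  integer :: k

  do k = 1, n
    x(k) = real(k)
    y(k) = 0.5 * real(k)
  end do

  total_len = 0.0
  max_dx = 0.0

  ! First section: its temporaries exist only inside this BLOCK
  block
    integer :: i
    real :: dx, dy
    do i = 1, n - 1
      dx = x(i+1) - x(i)
      dy = y(i+1) - y(i)
      total_len = total_len + sqrt(dx**2 + dy**2)
    end do
  end block

  ! Second section: the same names are declared again and are
  ! unrelated to the variables of the first BLOCK
  block
    integer :: i
    real :: dx
    do i = 1, n - 1
      dx = x(i+1) - x(i)
      max_dx = max(max_dx, dx)
    end do
  end block

  print *, 'total length =', total_len
  print *, 'largest dx   =', max_dx
end program block_scoping
```

Results that must survive the section (here total_len and max_dx) stay declared in the enclosing scope; everything else can be block-local.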

Again, if you show the code, there may be something entirely different from (lack of) register optimization that is at the root of your performance issue.

Jim Dempsey

kinni · ‎07-19-2021

Unfortunately I cannot share the code across the internet, so I'll try to describe how it works as closely as possible.
The code is also pretty large (approx. 10000 lines).

The code is sequential;
only the DO loops are parallelized with "!$OMP PARALLEL DO" (see ROUTINEs below).
There are no arrays in the PRIVATE clause!

The note "RA scaled back ..." appears no matter what change I make, for example
if I put some additional code lines inside a DO loop, or just change the input
of some data at the beginning (Input*.for) or some output data at the end (Output.for).

To me it seems the problem really could be with registers, as you described.
But how do I organize the code in BLOCKs as you suggested?
---------------------------------------------

Here are some details of what the code looks like:

USE omp_lib

INCLUDE 'Parameters.for' !Set some parameters and define arrays as allocatable
INCLUDE 'Input1.for' !Read in some input parameters needed to allocate memory

INCLUDE 'Allocate_mem.for' ! Allocate memory

INCLUDE 'Input2.for' ! Read in input parameters & variables

INCLUDE 'Setup.for' ! Set up variables & arrays & Initialise data

!$OMP PARALLEL
nthreads = OMP_GET_NUM_THREADS()
!$OMP END PARALLEL

INCLUDE 'Continue_run.for' ! Reads in data from previous run if necces.

c*************************************************************

c BEGINN CYCLEs

call cpu_time(t1)
call system_clock(count_rate=iclock_rate) !Find the time rate
call system_clock(count=iclock_start) !Start Timer

cycle = 0
100 cycle = cycle + 1

INCLUDE 'Routine1.for' ! Routine1
INCLUDE 'Routine2.for' ! Routine2
...

INCLUDE 'Routine20.for' ! Routine20

if ( cycle .lt. cycle_max ) goto 100

c END CYCLEs
c*************************************************************

call cpu_time(t2)
print *, "CPU time(seconds): ", t2 - t1

call system_clock(count=iclock_stop) !End Timer
s_time = real( iclock_stop - iclock_start ) / real(iclock_rate)
print *, "system_clock time: ", s_time

INCLUDE 'Output.for' ! Output data
c-------------------------------------------------------

END
c===================================

General structure of the ROUTINEs:

is = 1
ie = ijmax1

data_1 = 9.81

!$OMP PARALLEL DO
!$OMP+ DEFAULT (SHARED)

!$OMP+ PRIVATE(ll,I,IP,IJ,IPJ,... ! there are no arrays inside private
!$OMP+ ,DY, DX, FLX, coef, ...
!$OMP+ ,...
!$OMP+ ,...
!$OMP+ ,p1, p2, p3 )

do ll = is, ie
ij = Nodes(ll)
Q(ij) = 0.
coef = data_1 * array1(ij)
fi_1(ij) = coef * array2(ij)
fi_2(ij) = coef * array3(ij)
enddo

!$OMP END PARALLEL DO
c----------
is = 1
ie = ijmax2

p1 = 20.

!$OMP PARALLEL DO
!$OMP+ DEFAULT (SHARED)
!$OMP+ PRIVATE(ll,I,IP,IJ,IPJ,... ! there are no arrays inside private
!$OMP+ ,DY, DX, FLX, coef, ...
!$OMP+ ,...
!$OMP+ ,...
!$OMP+ ,p1, p2, p3 )

DO ll = is, ie

ij = Pointer(ll)

I = array(1,ij)
IP = array(2,ij)

DX = x(ip) - x(i)
DY = y(ip) - y(i)

coef = sqrt ( dx**2 + dy**2 )

----

FLX = ...

Q(i) = Q(i) - FLX * coef * p1
Q(ip) = Q(ip) + FLX * coef * p1

enddo

!$OMP END PARALLEL DO

---------------------------------------

I hope this will help.

jimdempseyatthecove · ‎07-20-2021

Since this seems to be a concern of yours, I will assume you are willing to do a little bit of cleanup work...

1) Move the parameters and common variables that are currently brought in by your INCLUDE 'xxx.for' source files into a module in its own source file, preferably .f90, and replace the INCLUDE 'Parameters.for' (and INCLUDE 'CommonVariables.for') with USE YourNameOfChoice

2) Replace the inline code introduced using INCLUDE 'whatever.for' with a CALL whatever(). Note, each of these will be enclosed in SUBROUTINE whatever/END SUBROUTINE and contain USE YourNameOfChoice to obtain the common parameters and variables. "whatever" could be Routine1, Routine2, ... as the case may be.

Note, by doing this you have partitioned one large 10000-ish line single program containing 26-some INCLUDE-ed sources into a relatively small PROGRAM plus 26-some relatively small subroutines. And this in turn will aid the compiler in optimization (of register usage).
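A rough sketch of what steps 1) and 2) could look like, in free-form source. The module name shared_data stands in for "YourNameOfChoice", and the array names, sizes and the counter icycle are assumptions for illustration only; the real program would move its own declarations and loop bodies over. Without the OpenMP compile option the !$omp lines are treated as plain comments.

```fortran
! Step 1: the shared parameters and variables live in a module
module shared_data            ! hypothetical name
  implicit none
  integer :: ijmax1 = 0
  real :: data_1 = 9.81
  real, allocatable :: array1(:), array2(:), fi_1(:)
end module shared_data

! Step 2: what used to be INCLUDE 'Routine1.for' becomes a subroutine
subroutine Routine1()
  use shared_data
  implicit none
  integer :: ll      ! working variables are now local to Routine1
  real    :: coef
  !$omp parallel do default(shared) private(ll, coef)
  do ll = 1, ijmax1
    coef = data_1 * array1(ll)
    fi_1(ll) = coef * array2(ll)
  end do
  !$omp end parallel do
end subroutine Routine1

! The main program shrinks to setup, the cycle loop and output
program main
  use shared_data
  implicit none
  integer :: icycle, cycle_max

  ijmax1 = 1000
  allocate(array1(ijmax1), array2(ijmax1), fi_1(ijmax1))
  array1 = 1.0
  array2 = 2.0
  cycle_max = 3

  do icycle = 1, cycle_max
    call Routine1()
    ! call Routine2(), ... would follow here
  end do

  print *, 'fi_1(1) =', fi_1(1)
end program main
```

Each former INCLUDE file then has its own short list of local variables, while everything that must be visible across routines is declared once in the module.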

Jim Dempsey

kinni · ‎07-28-2021

OK, thanks! I'll split the program up into several subroutines and report back on whether this helps.