I have the following problem:
Platform: Linux / Windows
Compiler version: Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 19.1.3.311 Build 20201010_000000
as well as earlier versions.
I compile it with one of the following command lines (the problem is the same in every case):
ifort /Qip /O3 /Qprec-div /Qprec /QaxCORE-AVX512 /Qopenmp /Qopt-report tmp.for
ifort /O3 /Qip /Qprec-div /Qprec /QxHost /Qopenmp /Qopt-report tmp.for
ifort -ipo -O3 -no-prec-div -fp-model fast=2 -xHost -qopenmp -qopenmp-link=static tmp.for
ifort -ipo -O3 -prec-div -mp1 -xHost -qopenmp -qopenmp-link=static tmp.for
ifort -ipo -O3 -prec-div -mp1 -axCORE-AVX512 -qopenmp -qopenmp-link=static -qopt-report tmp.for
As the program grows, at some point I get the following additional note at the end of the *.optrpt file, and the program slows down by about 40%.
The program is pretty large and contains many OpenMP loops.
(Program contains 66 INCLUDE commands, including diverse routines.)
C:\tmp2\tmp.for(16,7):remark #34051: REGISTER ALLOCATION : [MAIN__.A] C:\tmp2\tmp.for:16
....
Hardware registers
Reserved : 2[ rsp rip]
Available : 39[ rax rdx rcx rbx rbp rsi rdi r8-r15 mm0-mm7 zmm0-zmm15]
Callee-save : 18[ rbx rbp rsi rdi r12-r15 xmm6-xmm15]
Assigned : 31[ rax rdx rcx rbx rbp rsi rdi r8-r15 zmm0-zmm15]
Routine temporaries
Total : 30410
Global : 7699
Local : 22711
Regenerable : 7213
Spilled : 5865
Routine stack
Variables : 10081 bytes*
Reads : 807 [8.57e-03 ~ 0.0%]
Writes : 1903 [6.77e-01 ~ 0.7%]
Spills : 52928 bytes*
Reads : 10808 [2.89e+01 ~ 28.9%]
Writes : 7700 [1.32e+01 ~ 13.2%]
Notes
RA scaled back due to excessive number of temporaries
*Non-overlapping variables and spills may share stack space,
so the total stack size might be less than this.
-------------
My comments:
I think this note points to the cause of the problem and explains why the code slows down:
"Notes
RA scaled back due to excessive number of temporaries"
tmp.for
...
(16) USE omp_lib
Does anybody know what it means and how to overcome this problem?
Thanks for helping.
For your AVX512 builds, you might try adding: /Qopt-zmm-usage:high
Also, register pressure may be reduced by NOT inlining everything (either explicitly or implicitly via IPO).
Jim Dempsey
Thanks for your suggestions!
I tried to compile with ipo disabled (/Qipo-), but the problem still remains.
Adding /Qopt-zmm-usage:high did not help either.
The "note" appears in the cg section (Report from: Code generation optimizations [cg]) and not in the ipo section of the report, so I am not sure if it has something to do with ipo.
If your procedure is very long and contains loops in different sections, you can localize the "temporary" variables by enclosing the loop(s) in BLOCK / END BLOCK and declaring the "temporaries" inside the BLOCK.
subroutine foo(...)
  ... ! data
  ... ! code
  BLOCK
    REAL :: SomeTemporary ! known only inside this block
    SomeTemporary = 0.0
    DO i=1,n
      ...
      SomeTemporary = SomeTemporary + Expression
      ...
    END DO
    NonBlockVariable = SomeTemporary
  END BLOCK
  ... ! more code with additional blocks
end subroutine
It would help if you can share the code.
Jim Dempsey
Thanks for your help!
I'll try that.
Please report back if this helps (or doesn't help) as this will aid others with similar concerns.
Jim Dempsey
I have now tried different compiler options (/O2, /O3, /fast,..., /QxHost, ... ) but nothing helps.
I do not have any subroutines or BLOCKs; everything is placed in one big code (using INCLUDE).
I do not have temporary arrays.
If I exclude OpenMP, then I get the following message:
"Notes
RA scaled back due to excessive number of interferences"
(Compiled with: ifort /O1 /QaxCORE-AVX2 /Qoverride-limits /Qopt-zmm-usage:high /Qopt-report tmp.for)
So I think the problem is not only about temporaries?!
By the way, what does this mean:
"RA scaled back due to ..."
zmm usage is only available when using AVX512. AVX512 has double the registers of AVX2 (and earlier).
I suspect that, when using OpenMP, each parallel region is compiled into what amounts to a separate procedure, in which the PRIVATE variables are localized to that procedure and the compiler's optimization focus is confined to the parallel region. In other words, register usage does not span into, or out of, the parallel region.
For your non-OpenMP configuration, you might see a similar effect (benefit) by placing the loop(s) that can be parallelized (but now run serially) into contained (CONTAINS) subroutines. You may also want to place/duplicate the variables that may benefit from being kept in registers inside the contained procedures (loop control variables, temporaries, etc.).
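A minimal sketch of the contained-subroutine idea, assuming a serial loop similar in shape to the ones described in this thread. The names contained_demo, compute_q, x, y and q are made up for illustration and are not taken from the actual program. Whether this removes the "RA scaled back" note has to be checked in the opt report, since the compiler may decide to inline the contained procedure again.

```fortran
      program contained_demo
      implicit none
      integer, parameter :: n = 1000
      real :: x(n), y(n), q(n)
      integer :: k

      do k = 1, n
         x(k) = real(k)
         y(k) = 2.0 * real(k)
      end do

!     The loop that used to sit inline now lives in its own procedure.
      call compute_q()
      print *, 'q(1) =', q(1), ' q(n) =', q(n)

      contains

      subroutine compute_q()
!     The loop index and the temporaries are local to this procedure,
!     so their lifetimes end at the return. The arrays x, y, q and the
!     parameter n are reached through host association.
      integer :: i
      real :: dx, dy, coef
      do i = 1, n
         dx = x(i)
         dy = y(i)
         coef = sqrt(dx**2 + dy**2)
         q(i) = coef
      end do
      end subroutine compute_q

      end program contained_demo
```

The statements start in column 7 and use "!" comments, so the sketch should be accepted both as fixed-form (.for) and as free-form source.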
Jim Dempsey
Thanks for your comments, but I really did not understand what you mean. What should I do to overcome the problem?
RA could be considered to mean Register Assignment (or Register Allocation, which is how that section of the report is headed), though the author of the annotation may have had something else in mind. There is a limited number of general-purpose registers (8/16/32/64-bit), used for integers or addresses, and of SIMD registers (8/16/32 depending on architecture), used for floating-point scalars and/or SIMD vectors.
When you (or some automatic code generator) write your code as a very large number of statements, the working variables are declared (possibly implicitly) once, at the procedure entry point. The compiler's optimizer then has to determine which variables, if any, will produce better-performing code when they are kept in registers (GP and SIMD). The problem with very long and complicated code sections is that there are too many "good" candidates to choose from. By using BLOCKs, contained procedures, external procedures (and/or parallel regions), you reduce the scope and lifetime of the local variables. This in turn reduces the number of candidates for registers, at the expense of some overhead for the call (or the equivalent of a call in the case of OpenMP).
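To illustrate the scope-and-lifetime point, here is a small self-contained sketch with two BLOCKs. The program and variable names (block_demo, a, acc, total, peak) are invented for the example. It only shows how the temporaries are confined to each section; whether the register allocator benefits in a given program is something the opt report has to confirm.

```fortran
      program block_demo
      implicit none
      integer, parameter :: n = 100
      real :: a(n), total, peak
      integer :: k

      do k = 1, n
         a(k) = real(k)
      end do

!     First section: i and acc exist only between BLOCK and END BLOCK.
      block
         integer :: i
         real :: acc
         acc = 0.0
         do i = 1, n
            acc = acc + a(i)
         end do
         total = acc
      end block

!     Second section: the same names can be reused, but these are new
!     variables whose lifetimes do not overlap with the first section.
      block
         integer :: i
         real :: acc
         acc = a(1)
         do i = 2, n
            acc = max(acc, a(i))
         end do
         peak = acc
      end block

      print *, 'total =', total, ' peak =', peak
      end program block_demo
```

Only the results that are needed later (total and peak here) are declared at program level; everything else stays inside the BLOCK that uses it.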
Again, if you show the code, there may be something entirely different from (lack of) register optimization that is at the root of your performance issue.
Jim Dempsey
Unfortunately I cannot share the code over the internet, so I'll try to describe how it works as closely as possible.
The code is also pretty large (approx. 10000 lines).
The code is sequential, but
only DO Loops are parallelized with "!$OMP PARALLEL DO " (see ROUTINEs below)
There are no arrays inside the private statement!
The note "RA scaled back ..." appears no matter what change I make, for example
if I put some additional code lines inside a DO loop, or just change the input
of some data at the beginning (Input*.for) or some output data at the end (Output.for).
It seems to me that the problem really could be with registers, as you described.
But how do I organize the code in BLOCKs as you suggested?
---------------------------------------------
Here are some details of what the code looks like:
      USE omp_lib
      INCLUDE 'Parameters.for' !Set some parameters and define arrays as allocatable
      INCLUDE 'Input1.for' !Read in some input parameters needed to allocate memory
      INCLUDE 'Allocate_mem.for' ! Allocate memory
      INCLUDE 'Input2.for' ! Read in input parameters & variables
      INCLUDE 'Setup.for' ! Set up variables & arrays & Initialise data
!$OMP PARALLEL
      nthreads = OMP_GET_NUM_THREADS()
!$OMP END PARALLEL
      INCLUDE 'Continue_run.for' ! Reads in data from previous run if necces.
c*************************************************************
c BEGINN CYCLEs
      call cpu_time(t1)
      call system_clock(count_rate=iclock_rate) !Find the time rate
      call system_clock(count=iclock_start) !Start Timer
      cycle = 0
  100 cycle = cycle + 1
      INCLUDE 'Routine1.for' ! Routine1
      INCLUDE 'Routine2.for' ! Routine2
      ...
      INCLUDE 'Routine20.for' ! Routine20
      if ( cycle .lt. cycle_max ) goto 100
c END CYCLEs
c*************************************************************
      call cpu_time(t2)
      print *, "CPU time(seconds): ", t2 - t1
      call system_clock(count=iclock_stop) !End Timer
      s_time = real( iclock_stop - iclock_start ) / real(iclock_rate)
      print *, "system_clock time: ", s_time
      INCLUDE 'Output.for' ! Output data
c-------------------------------------------------------
      END
c===================================
General structure of the ROUTINEs:
      is = 1
      ie = ijmax1
      data_1 = 9.81
!$OMP PARALLEL DO
!$OMP+ DEFAULT (SHARED)
!$OMP+ PRIVATE(ll,I,IP,IJ,IPJ,... ! there are no arrays inside private
!$OMP+ ,DY, DX, FLX, coef, ...
!$OMP+ ,...
!$OMP+ ,...
!$OMP+ ,p1, p2, p3 )
      do ll = is, ie
         ij = Nodes(ll)
         Q(ij) = 0.
         coef = data_1 * array1(ij)
         fi_1(ij) = coef * array2(ij)
         fi_2(ij) = coef * array3(ij)
      enddo
!$OMP END PARALLEL DO
c----------
      is = 1
      ie = ijmax2
      p1 = 20.
!$OMP PARALLEL DO
!$OMP+ DEFAULT (SHARED)
!$OMP+ PRIVATE(ll,I,IP,IJ,IPJ,... ! there are no arrays inside private
!$OMP+ ,DY, DX, FLX, coef, ...
!$OMP+ ,...
!$OMP+ ,...
!$OMP+ ,p1, p2, p3 )
      DO ll = is, ie
         ij = Pointer(ll)
         I = array(1,ij)
         IP = array(2,ij)
         DX = x(ip) - x(i)
         DY = y(ip) - y(i)
         coef = sqrt ( dx**2 + dy**2 )
         ----
         FLX = ...
         Q(i) = Q(i) - FLX * coef * p1
         Q(ip) = Q(ip) + FLX * coef * p1
      enddo
!$OMP END PARALLEL DO
---------------------------------------
I hope this will help.
Since this seems to be a concern of yours, I will assume you are willing to do a little bit of cleanup work...
1) Move the parameters and common variables that are now supplied by your INCLUDE 'xxx.for' source files into a module in its own source file, preferably .f90, and replace INCLUDE 'Parameters.for' (and INCLUDE 'CommonVariables.for') with USE YourNameOfChoice.
2) Replace the inline code introduced by INCLUDE 'whatever.for' with CALL whatever(). Each of these will be enclosed in SUBROUTINE whatever / END SUBROUTINE and contain USE YourNameOfChoice to obtain the common parameters and variables. "whatever" could be Routine1, Routine2, ... as the case may be.
Note, by doing this you have partitioned one large 10000-ish line single program containing 26-some INCLUDE-ed sources into a relatively small PROGRAM plus 26-some relatively small subroutines. And this in turn will aid the compiler in optimization (of register usage).
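Steps 1) and 2) might look roughly like the sketch below. It assumes a first loop like the one shown under "General structure of the ROUTINEs"; the module name shared_data, the program name main_demo, the tiny test data and the loop counter icycle are placeholders invented for the example, and the real Parameters.for will hold far more declarations.

```fortran
      module shared_data
!     Placeholder module standing in for Parameters.for.
      implicit none
      integer :: ijmax1
      real :: data_1
      integer, allocatable :: nodes(:)
      real, allocatable :: q(:), array1(:), array2(:), fi_1(:)
      end module shared_data

      subroutine routine1()
!     Former contents of Routine1.for, now a separate procedure.
!     Shared data comes from the module; the loop temporaries are local.
      use shared_data
      implicit none
      integer :: ll, ij
      real :: coef
!$omp parallel do default(shared) private(ll, ij, coef)
      do ll = 1, ijmax1
         ij = nodes(ll)
         q(ij) = 0.
         coef = data_1 * array1(ij)
         fi_1(ij) = coef * array2(ij)
      end do
!$omp end parallel do
      end subroutine routine1

      program main_demo
      use shared_data
      implicit none
      integer :: k, icycle
!     Tiny made-up data set, just enough to run the sketch.
      ijmax1 = 4
      data_1 = 9.81
      allocate (nodes(ijmax1), q(ijmax1), array1(ijmax1))
      allocate (array2(ijmax1), fi_1(ijmax1))
      do k = 1, ijmax1
         nodes(k) = k
         array1(k) = 1.0
         array2(k) = real(k)
      end do
!     The cycle loop now only contains calls.
      do icycle = 1, 3
         call routine1()
      end do
      print *, 'fi_1 =', fi_1
      end program main_demo
```

The sketch assumes that the entries of nodes are unique, so that no two iterations write to the same element. The PRIVATE list also shrinks to the variables the loop actually uses, because each subroutine declares only its own temporaries.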
Jim Dempsey
OK, thanks! I'll split the program up into several subroutines and report whether this helped.
