To avoid any further confusion I am starting a new thread. This is a sequel to
performance break-down in conjunction with generic assignment
There were actually two quite different cases where ifort and ifx generate writes to shared data, which are obviously wrong. I checked with the recently published oneAPI compilers (ifort 2021.10.0 20230609 and ifx 2023.2.0 20230622), and the bugs are still present in both compilers for all testcases. However, in one case ifx is now able to optimise away the offending write-only code at -O2, providing good performance. Without optimisation, the problematic writes are still visible.
For an analysis of why performance breaks down, see the linked thread. As noted there, this looks like a front-end issue, since ifort and ifx both generate very similar shared-data writes, as seen in the assembly output.
Hopefully, the following instructions will help to get an acceptable reproducer this time. First the easy one (triggered by an allocatable type component and recursion), which breaks down with both compilers at -O2.
Compile with "ifx -O2 -qopenmp alloc_rec.f90 -o alloc_rec.x" and run with either 1 or 2 threads (export OMP_NUM_THREADS=1 or 2):
module mod
  implicit none
  public
  type :: t
    integer, dimension(:), allocatable :: a
  end type t
contains
  recursive subroutine rec(n, x, i)
    integer, intent(in) :: n
    type(t), intent(inout) :: x
    integer, intent(inout) :: i
    type(t) :: y
    if (n > 0) then
      call rec(n-1, y, i)
      i = i + 1
    end if
  end subroutine rec
end module mod

program alloc_rec
  use mod
  implicit none
  type(t) :: x
  integer :: i, j
  integer, parameter :: N = 10000000
  j = 0
  !$omp parallel default(shared) private(i, x) reduction(+:j)
  !$omp do schedule(dynamic,1000)
  do i = 1,N
    call rec(10, x, j)
  end do
  !$omp end do
  !$omp end parallel
  print *, j, N
end program alloc_rec
The loop should parallelise and scale just perfectly. However, going from 1 to just 2 threads, execution time increases by a factor of about 20(!) on my computer.
Out of curiosity I also compared with gfortran (version 13). With one thread, gfortran was more than 50(!) times faster than ifx with one thread. This is not really surprising considering the many useless writes generated by ifx. With two threads this becomes a performance difference of about 1000.
(Classic ifort looks similarly deplorable.)
Now the second case (triggered by an unused type-bound generic assignment declaration). As remarked above, this one has to be compiled without optimisation, as ifx has become quite clever at optimising away the shared-data writes; I have not been able to make this testcase complicated enough to dodge that optimisation. Still, there should not be such a big performance break-down at -O0 when going from one to two threads (in fact, there should not be any break-down at all, as two threads should run almost perfectly in parallel).
Compile with "ifx -O0 -qopenmp generic_assignment.f90 -o generic_assignment.x" and run with either 1 or 2 threads (export OMP_NUM_THREADS=1 or 2):
module str
  implicit none
  private
  type, public :: s
    character(len=:), allocatable :: a
  contains
    procedure :: assign
    generic :: assignment(=) => assign
  end type s
contains
  subroutine assign(self, x)
    class(s), intent(inout) :: self
    class(s), intent(in) :: x
    self%a = x%a
  end subroutine assign
end module str

module mod
  use str
  implicit none
  private
  type, abstract, public :: t
    type(s) :: x
  contains
    procedure :: foo
    procedure(bar_ifc), deferred :: bar
  end type t
  abstract interface
    function bar_ifc(self) result(i)
      import t
      class(t), intent(in) :: self
      integer :: i
    end function bar_ifc
  end interface
  type, extends(t), public :: r
  contains
    procedure :: bar
  end type r
contains
  function foo(self) result(m)
    class(t), intent(in) :: self
    integer :: m
    m = self%bar()
  end function foo
  function bar(self) result(i)
    class(r), intent(in) :: self
    integer :: i
    i = len(self%x%a)
  end function bar
end module mod

program generic_assignment
  use mod
  implicit none
  integer, parameter :: N = 10000000
  integer :: i, c
  class(t), pointer :: u
  c = 0
  !$omp parallel default(shared) private(i, u) reduction(+:c)
  allocate(r :: u)
  u%x%a = '**+*'
  !$omp do schedule(dynamic,1000)
  do i = 1, N
    c = c + u%foo()
  end do
  !$omp end do
  deallocate(u)
  !$omp end parallel
  print *, c, N
end program generic_assignment
Runtime increases by a factor of about 25 going from 1 to 2 threads on my computer.
PS: All tests were done on Linux.
Let me explore this. For the first case, I don't see quite the same slowdown between 1 and 2 threads. I see ~3x. That's still not a good thing!
And 2 and 4 threads run in the same amount of time!
The performance loss depends on the processor and even on thread placement. My original numbers were actually from an AMD Zen system, which is more susceptible to this cache issue and which is what originally helped me find these problems in the real code. On one of our Intel CPU systems (an 18-core processor) I get a slowdown of about 8-10, and similar on other Intel systems. In all cases it flattens out: runtime remains roughly the same with more than 2 threads.
Just to convince you that the code should scale just fine: replace the allocatable attribute of the component a in the first testcase by a pointer attribute. This is the line in question:
integer, dimension(:), allocatable :: a
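After the change (the same pointer variant appears verbatim in the analysis further down), the line reads:
integer, dimension(:), pointer :: a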
Now runtime scales almost perfectly on my 18-core Intel system up to 18 threads (though I had to increase the counter N by a factor of 100; the runtime was otherwise too short for meaningful measurement). Note that this component is not even used in the code!
To give you another data point: using 8 threads (only 8, because otherwise the runtime varies too much) on the 18-core Intel system and "N=100000000", I get the following runtimes:
- 0.3s for the pointer variant
- 135s for the allocatable variant
So the pointer variant is roughly 500 times faster for this loop count and number of threads!
I filed a bug report, CMPLRLLVM-49991, for this negative performance gain using the first test in this thread.
For the first case, CMPLRLLVM-49991, OpenMP is a red herring: even the non-OpenMP case shows a 10x runtime difference between the pointer and allocatable component variants. I have attached what I used for my analysis, mm2.f90 and mm2.ptr.f90. It has nothing to do with temporaries either.
The difference is due to this declaration in rec:
type(t) :: y
I'll state the obvious that you already know: y is never allocated or used. It's simply passed around in recursive calls to rec.
Let's take the fast case, mm2.ptr.f90, where the type of y is declared as:
type :: t
  integer, dimension(:), pointer :: a
end type t
Background: in both cases at O0 we create a struct for y. Then we drop into the 10 recursive calls, passing it around even though it's not used. In the slow case, mm2.f90, the type is instead:
type :: t
  integer, dimension(:), allocatable :: a
end type t
At the subroutine return, the front end has to determine how to deallocate non-save locals. This is where the 2 cases differ dramatically for cleaning up y.
Let's take the easy case of mm2.ptr.f90, the pointer component. Since t has just one component, we are free to represent y as just a single pointer to data of type int32, rank 1. The status of y is undefined; it is never associated. I think we can all write a simple bit of code for the epilog: if undefined or disassociated, simply toss the pointer. That is what we do for mm2.ptr.f90. In this case, the front end uses inline logic for pointers, does a fast check on the pointer status, and if there is no associated data, does nothing. A super fast, simple return. And this code, remember, is run 10M times. Now, if y actually got associated, the logic would be vastly different. In that case we'd have to call a cleanup routine for the target data, making sure the reference count was just 1, etc. That would have required a function call for the cleanup. The smart inline logic in the front end prevents the call to the cleanup routine. Most compilers use this inline optimization for pointer cleanup.
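To illustrate, here is a minimal Fortran sketch of what that inline epilog amounts to, as I read the description above. The names t_ptr and cleanup are invented, and the real logic is emitted directly by the front end rather than written in source; this is only a source-level model of the fast path:

module epilog_sketch
  implicit none
  type :: t_ptr
    ! default-initialised so the status check below is well defined
    integer, dimension(:), pointer :: a => null()
  end type t_ptr
contains
  subroutine cleanup(y)
    type(t_ptr), intent(inout) :: y
    if (associated(y%a)) then
      ! slow path: a runtime cleanup call would be needed here
      ! (ownership/reference-count checks on the target, freeing, ...)
      continue
    end if
    ! fast path: pointer never associated, simply toss it and return
  end subroutine cleanup
end module epilog_sketch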
For types with allocatable components it's more complicated. The type is represented by a structure, and this structure holds the number of components. For each component of allocatable type (or of more complex types) we use a struct to describe that component; in this case we have a struct for type integer, kind int32, allocatable, rank 1. If the type had more components, more structs would be needed.
Now, at return, how do you clean up y in this case? You might say: run a loop over the number of components, access the struct for each component, branch, and do the right cleanup (if any) for each. You can see that the permutation of types, kinds, extended types, etc. can produce a sizable amount of code. So instead of repeating this common code in every epilog, we put it in a runtime routine that does the cleanup; ours is named for_dealloc_all_nocheck.
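As a rough illustration of that combinatorial growth (all names here are invented), every additional allocatable or nested derived-type component is one more case an inline epilog would have to handle, which is what motivates putting the common code behind a single runtime call:

module many_components
  implicit none
  type :: inner
    real, allocatable :: buf(:)
  end type inner
  type :: many_t
    integer,          allocatable :: a(:)      ! integer, int32, rank 1
    character(len=:), allocatable :: msg       ! deferred-length character
    type(inner),      allocatable :: nested(:) ! requires recursive cleanup
  end type many_t
end module many_components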
So at O0 what you are getting is 10M calls to for_dealloc_all_nocheck. gfortran and NAG have a clever inline "optimization" that recognizes that y was never used and, like the pointer case, simply tosses the struct for y and returns. Good on them. We don't currently have this inline optimization for this case. We will take a look at a possible inline check; our developer for this front-end code has an idea that may help, but not until fall 2024 at the earliest. This is O0, after all, and O0 usually means "I am testing and I don't care so much about speed". As you noted, at O2 and above the optimization phases identify unused variables and code and remove the calls to clean up y by removing y altogether. And yes, LLVM is better at removing unused vars and code paths than our Classic compiler. This fantastic dead-var and dead-code elimination has also tripped us up a number of times, when it removed loops and such because the data calculated therein was never reused. I am getting back into the habit of printing results after any calculations, to make sure the optimizer knows that, yes, I really do use them.
As for OpenMP scaling, I think it's just a red herring. Take the overhead of 10M calls to for_dealloc_all_nocheck and scale it up, and you get all sorts of cache false sharing and extra memory overhead from all the parallel stack allocs/deallocs. Besides, the logic in our deallocation routine is complex, since it has to handle all possible data cases. You can test the theory that OpenMP is a red herring, like I did: comment out the declaration of y in rec and, instead of passing y in the recursion, simply pass x (see the sketch below). The slowdown goes away. Same for the serial case: comment out y, pass x in the recursive calls. This case is all about how y is cleaned up.
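Concretely, the modified rec for that experiment would look like this (the change described above applied as a drop-in replacement inside module mod of the first testcase; everything else stays as posted):

recursive subroutine rec(n, x, i)
  integer, intent(in) :: n
  type(t), intent(inout) :: x
  integer, intent(inout) :: i
  ! local y removed: pass x itself, so nothing needs cleaning up on return
  if (n > 0) then
    call rec(n-1, x, i)
    i = i + 1
  end if
end subroutine rec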
Thanks for the detailed explanation. A few years ago I actually dived into the scalable allocator implementation (using the TBB sources and gdb) due to a false positive in valgrind, so I am well aware of the intricacies of for_dealloc_all_nocheck. However, I do not think your analysis is correct here, for two reasons:
for_dealloc_all_nocheck on a local variable of a derived type with a non-allocated component, sitting on a thread-private stack, might take some time, but it should not cause any false sharing. If shared data is accessed by for_dealloc_all_nocheck, then I would consider that a bug. It would essentially mean that variables of derived types with allocatable components must be avoided within OpenMP regions, because in general such variables are used and dead-code elimination is not possible. Example: some derived type with a character(len=:), allocatable component for storing an errmsg that is never touched in regular runs, while the other components are used for computational purposes.
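A sketch of that pattern (type and component names invented for illustration): dead-code elimination cannot remove such a variable, because val is genuinely used, yet the unused errmsg component alone forces the generic cleanup path:

module result_mod
  implicit none
  type :: result_t
    real :: val = 0.0                        ! used for actual computation
    character(len=:), allocatable :: errmsg  ! only ever allocated on error
  end type result_t
end module result_mod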
I just re-checked with ifx 2024.0.0 20231017: I compiled my variant with OpenMP but reduced N to 1000000 (this collects sufficient data for perf), built with "ifx -O0 -qopenmp", and ran with "perf record -g -F 1000". With one thread, almost 100% of the runtime (about 0.4s) is spent in do_deallocate_all (called by for_dealloc_all_nocheck). Considering the handful of assembler statements in the subroutine rec itself, this is not surprising. But if I use 18 threads on 18 physical cores, for_dealloc_all_nocheck takes only about 4% of the overall runtime! Most time is spent in an entry block of subroutine rec, where the internal global variables var$5 and var$6 are written to; see the assembler output of "ifx -O0 -qopenmp -S" below (starting at line 19 of the listing). Namely, var$5 and var$6 are just initialised. Actually, var$5+24 is read from, but it always contains the constant 1091, written just a couple of instructions before. In my opinion this is clearly a compiler bug; such code should not be produced. The compiler cannot rely on optimisation to remove such bad code, right? I hope I am not reading the assembler and the annotated perf report wrongly.
edit 1:
PS: I just noticed that for the first case, dead-code elimination does not work, and ifx (and Classic ifort as well) does not scale; or rather, it breaks down badly with many threads. So this is not an -O0 issue: the example can be compiled with -O2 and still shows the same behaviour (going from 1 to 18 threads increases runtime by a factor of 6 on a 9980XE).
edit 2:
To be thorough, I just ran the pointer variant (mm2.ptr.f90, but with => null() default initialisation, which causes some additional memcpy) and it scales perfectly, minus CPU frequency scaling; to be precise, with N=1000000000 I get a speedup factor of 14 comparing 1 and 18 threads.
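For clarity, the default initialisation mentioned here is just:
integer, dimension(:), pointer :: a => null()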
So from your reply it sounds like the Intel compiler team thinks it is fine that an unused allocatable component can completely break down OpenMP scaling in code where no shared user data is accessed? Remember, this is a boiled-down testcase; I have seen exactly this in real code, where I could only see the false sharing in the assembler, not in the Fortran source!
        .globl mod_mp_rec_
        .p2align 4, 0x90
        .type mod_mp_rec_,@function
mod_mp_rec_:
        .cfi_startproc
        pushq %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset %rbp, -16
        movq %rsp, %rbp
        .cfi_def_cfa_register %rbp
        subq $192, %rsp
        movq %rdi, -72(%rbp)
        movq -72(%rbp), %rax
        movq %rax, -176(%rbp)
        movq %rsi, -80(%rbp)
        movq %rdx, -88(%rbp)
        movq -88(%rbp), %rax
        movq %rax, -184(%rbp)
        movq $1091, var$5+24
        movq $72, var$5+8
        movq $0, var$5+32
        movq $0, var$5+16
        movq $0, var$5
        movq var$5+24, %rax
        orq $1, %rax
        movq %rax, var$5+24
        movq $0, var$5+16
        movabsq $_DYNTYPE_RECORD0, %rax
        movq %rax, var$5+48
        movq $0, var$5+56
        movq $0, var$5+80
        movq $0, var$5+96
        movq $0, var$5+88
        movq $1248, var$6+24
        movq $1, var$6+32
        movq $0, var$6+16
        movabsq $_DYNTYPE_RECORD1, %rax
        movq %rax, var$6+72
        movq $0, var$6+80
        movq $0, var$6+104
        movq $0, var$6+120
        movq $0, var$6+112
        movq $0, var$6+96
        movq $0, var$6+128
        movq $0, var$6+136
        movabsq $_ALLOC_RECORD_LIST_VAR_0, %rax
        movq %rax, var$5+72
        movabsq $_INFO_LIST_VAR_0, %rax
        movq %rax, var$5+104
        movq $0, var$5+112
        leaq -160(%rbp), %rdi
        movabsq $mod_mp_rec_$blk.var$3, %rsi
        movl $72, %edx
        callq memcpy@PLT
        movq -176(%rbp), %rax
        cmpl $0, (%rax)
        setg %al
        andb $1, %al
        movzbl %al, %eax
        testl $1, %eax
        je .LBB1_2
        movq -184(%rbp), %rdx
        movq -176(%rbp), %rax
        movl (%rax), %eax
        subl $1, %eax
        movl %eax, -164(%rbp)
        leaq -164(%rbp), %rdi
        leaq -160(%rbp), %rsi
        callq mod_mp_rec_@PLT
        movq -184(%rbp), %rax
        movl (%rax), %ecx
        addl $1, %ecx
        movl %ecx, (%rax)
        jmp .LBB1_3
.LBB1_2:
        jmp .LBB1_3
.LBB1_3:
        movabsq $var$5, %rdi
        leaq -160(%rbp), %rsi
        movl $262144, %edx
        callq for_dealloc_all_nocheck@PLT
        addq $192, %rsp
        popq %rbp
        .cfi_def_cfa %rsp, 8
        retq
.Lfunc_end1:
        .size mod_mp_rec_, .Lfunc_end1-mod_mp_rec_
        .cfi_endproc
...
        .type var$5,@object
        .local var$5
        .comm var$5,128,16
        .type _DYNTYPE_RECORD0,@object
        .p2align 3, 0x0
_DYNTYPE_RECORD0:
        .quad strlit
        .quad 0
        .size _DYNTYPE_RECORD0, 16
        .type strlit,@object
        .section .rodata.str1.1,"aMS",@progbits,1
strlit:
        .asciz "MOD#T"
        .size strlit, 6
        .type var$6,@object
        .local var$6
        .comm var$6,152,16
        .type _DYNTYPE_RECORD1,@object
        .data
        .p2align 3, 0x0
_DYNTYPE_RECORD1:
        .quad strlit.1
        .quad _DYNTYPE_RECORD2
        .size _DYNTYPE_RECORD1, 16
I filed a second bug, CMPLRLLVM-49995, for the second reproducer. I'll let the compiler developers decide if they are related.
