To avoid any further confusion I am starting a new thread. This is a sequel to
performance break-down in conjunction with generic assignment
There were actually two quite different cases where ifort and ifx generate writes to shared data, which are obviously wrong. I checked with the recently published oneAPI compilers (ifort 2021.10.0 20230609 and ifx 2023.2.0 20230622), and the bugs are still present in both compilers for all testcases. However, in one case ifx is now able to optimise away the offending write-only code at -O2, providing good performance. Without optimisation, the problematic writes are still visible.
For an analysis of why performance breaks down, see the linked thread. As noted there, this looks like a front-end issue, since ifort and ifx both generate very similar shared-data writes, as seen in the assembly output.
Hopefully, the following instructions will help to get an acceptable reproducer this time. First the easy one (triggered by an allocatable type component and recursion), which breaks down with both compilers at -O2.
Compile with "ifx -O2 -qopenmp alloc_rec.f90 -o alloc_rec.x" and run with either 1 or 2 threads (export OMP_NUM_THREADS=1 or 2):
module mod
  implicit none
  public
  type :: t
    integer, dimension(:), allocatable :: a
  end type t
contains
  recursive subroutine rec(n, x, i)
    integer, intent(in) :: n
    type(t), intent(inout) :: x
    integer, intent(inout) :: i
    type(t) :: y
    if (n > 0) then
      call rec(n-1, y, i)
      i = i + 1
    end if
  end subroutine rec
end module mod

program alloc_rec
  use mod
  implicit none
  type(t) :: x
  integer :: i, j
  integer, parameter :: N = 10000000
  j = 0
  !$omp parallel default(shared) private(i, x) reduction(+:j)
  !$omp do schedule(dynamic,1000)
  do i = 1,N
    call rec(10, x, j)
  end do
  !$omp end do
  !$omp end parallel
  print *, j, N
end program alloc_rec
The loop should parallelise and scale just perfectly. However, going from 1 to just 2 threads, execution time increases by a factor of about 20(!) on my computer.
Out of curiosity I also compared with gfortran (version 13). With one thread, gfortran was more than 50(!) times faster than ifx with one thread. This is not really surprising considering the many useless writes generated by ifx. With two threads this becomes a performance difference of about 1000.
(Classic ifort looks similarly deplorable.)
Now the second case (triggered by an unused type-bound generic assignment declaration). As remarked above, this one has to be compiled without optimisation, as ifx has become quite clever at optimising away the shared-data writes; I have not been able to make this testcase complicated enough to dodge that optimisation. Still, there should not be such a big performance break-down at -O0 when going from one to two threads (in fact, there should not be any break-down at all, as two threads should run almost perfectly in parallel).
Compile with "ifx -O0 -qopenmp generic_assignment.f90 -o generic_assignment.x" and run with either 1 or 2 threads (export OMP_NUM_THREADS=1 or 2):
module str
  implicit none
  private
  type, public :: s
    character(len=:), allocatable :: a
  contains
    procedure :: assign
    generic :: assignment(=) => assign
  end type s
contains
  subroutine assign(self, x)
    class(s), intent(inout) :: self
    class(s), intent(in) :: x
    self%a = x%a
  end subroutine assign
end module str

module mod
  use str
  implicit none
  private
  type, abstract, public :: t
    type(s) :: x
  contains
    procedure :: foo
    procedure(bar_ifc), deferred :: bar
  end type t
  abstract interface
    function bar_ifc(self) result(i)
      import t
      class(t), intent(in) :: self
      integer :: i
    end function bar_ifc
  end interface
  type, extends(t), public :: r
  contains
    procedure :: bar
  end type r
contains
  function foo(self) result(m)
    class(t), intent(in) :: self
    integer :: m
    m = self%bar()
  end function foo
  function bar(self) result(i)
    class(r), intent(in) :: self
    integer :: i
    i = len(self%x%a)
  end function bar
end module mod

program generic_assignment
  use mod
  implicit none
  integer, parameter :: N = 10000000
  integer :: i, c
  class(t), pointer :: u
  c = 0
  !$omp parallel default(shared) private(i, u) reduction(+:c)
  allocate(r :: u)
  u%x%a = '**+*'
  !$omp do schedule(dynamic,1000)
  do i = 1, N
    c = c + u%foo()
  end do
  !$omp end do
  deallocate(u)
  !$omp end parallel
  print *, c, N
end program generic_assignment
Runtime increases by a factor of about 25 going from 1 to 2 threads on my computer.
PS: All tests were done on Linux.
Let me explore this. For the first case, I don't see quite the same slowdown between 1 and 2 threads. I see ~3x. That's still not a good thing!
And 2 and 4 threads run in the same amount of time!
The performance loss depends on the processor and even on thread placement. My original numbers were actually from an AMD Zen system, which is more susceptible to this cache issue and which is what originally helped me find these problems in the real code. On one of our Intel CPU systems (an 18-core processor) I get a slowdown of about 8-10, and similar on other Intel systems. In all cases it flattens out: runtime remains roughly the same with more than 2 threads.
Just to convince you that the code should scale just fine: replace the allocatable attribute of the component a in the first testcase by a pointer attribute. This is the line in question:
integer, dimension(:), allocatable :: a
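After the change (the same pointer variant appears verbatim in the analysis further down), the line reads:
integer, dimension(:), pointer :: a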
Now runtime scales almost perfectly on my 18-core Intel system up to 18 threads (though I had to increase the counter N by a factor of 100; the runtime was otherwise too short for meaningful measurement). Note that this component is not even used in the code!
To give you another data point: using 8 threads (only 8, because otherwise the runtime varies too much) on the 18-core Intel system and "N=100000000", I get the following runtimes:
- 0.3s for the pointer variant
- 135s for the allocatable variant
So the pointer variant is roughly 500 times faster for this loop count and number of threads!
I filed a bug report, CMPLRLLVM-49991, for this negative performance gain using the first test in this thread.
For the first case, CMPLRLLVM-49991, OpenMP is a red herring: even the non-OpenMP case shows a 10x runtime difference between the pointer and allocatable component variants. I have attached what I used for my analysis, mm2.f90 and mm2.ptr.f90. It has nothing to do with temporaries either.
The difference is due to this declaration in rec:
type(t) :: y
I'll state the obvious that you already know: y is never allocated or used. It's simply passed around in recursive calls to rec.
Let's take the fast case, mm2.ptr.f90, where the type of y is declared as:
type :: t
  integer, dimension(:), pointer :: a
end type t
Background: in both cases at O0 we create a struct for y. Then we drop into the 10 recursive calls, passing it around even though it's not used. In the slow case, mm2.f90, the type is instead:
type :: t
  integer, dimension(:), allocatable :: a
end type t
At the subroutine return, the front end has to determine how to deallocate non-save locals. This is where the 2 cases differ dramatically for cleaning up y.
Let's take the easy case of mm2.ptr.f90, the pointer component. Since t has just one component, we are free to represent y as just a single pointer to data of type int32, rank 1. The status of y is undefined; it is never associated. I think we can all write a simple bit of code for the epilog: if undefined or disassociated, simply toss the pointer. That is what we do for mm2.ptr.f90. In this case, the front end uses inline logic for pointers, does a fast check on the pointer status, and if there is no associated data, does nothing. A super fast, simple return. And this code, remember, is run 10M times. Now, if y actually got associated, the logic would be vastly different. In that case we'd have to call a cleanup routine for the target data, making sure the reference count was just 1, etc. That would have required a function call for the cleanup. The smart inline logic in the front end prevents the call to the cleanup routine. Most compilers use this inline optimization for pointer cleanup.
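To illustrate, here is a minimal Fortran sketch of what that inline epilog amounts to, as I read the description above. The names t_ptr and cleanup are invented, and the real logic is emitted directly by the front end rather than written in source; this is only a source-level model of the fast path:

module epilog_sketch
  implicit none
  type :: t_ptr
    ! default-initialised so the status check below is well defined
    integer, dimension(:), pointer :: a => null()
  end type t_ptr
contains
  subroutine cleanup(y)
    type(t_ptr), intent(inout) :: y
    if (associated(y%a)) then
      ! slow path: a runtime cleanup call would be needed here
      ! (ownership/reference-count checks on the target, freeing, ...)
      continue
    end if
    ! fast path: pointer never associated, simply toss it and return
  end subroutine cleanup
end module epilog_sketch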
For types with allocatable components it's more complicated. The type is represented by a structure, and this structure holds the number of components. For each component of allocatable type (or of more complex types) we use a struct to describe that component; in this case we have a struct for type integer, kind int32, allocatable, rank 1. If the type had more components, more structs would be needed.
Now, at return, how do you clean up y in this case? You might say: run a loop over the number of components, access the struct for each component, branch, and do the right cleanup (if any) for each. You can see that the permutation of types, kinds, extended types, etc. can produce a sizable amount of code. So instead of repeating this common code in every epilog, we put it in a runtime routine that does the cleanup; ours is named for_dealloc_all_nocheck.
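As a rough illustration of that combinatorial growth (all names here are invented), every additional allocatable or nested derived-type component is one more case an inline epilog would have to handle, which is what motivates putting the common code behind a single runtime call:

module many_components
  implicit none
  type :: inner
    real, allocatable :: buf(:)
  end type inner
  type :: many_t
    integer,          allocatable :: a(:)      ! integer, int32, rank 1
    character(len=:), allocatable :: msg       ! deferred-length character
    type(inner),      allocatable :: nested(:) ! requires recursive cleanup
  end type many_t
end module many_components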
So at O0 what you are getting is 10M calls to for_dealloc_all_nocheck. gfortran and NAG have a clever inline "optimization" that recognizes that y was never used and, like the pointer case, simply tosses the struct for y and returns. Good on them. We don't currently have this inline optimization for this case. We will take a look at a possible inline check; our developer for this front-end code has an idea that may help, but not until fall 2024 at the earliest. This is O0, after all, and O0 usually means "I am testing and I don't care so much about speed". As you noted, at O2 and above the optimization phases identify unused variables and code and remove the calls to clean up y by removing y altogether. And yes, LLVM is better at removing unused vars and code paths than our Classic compiler. This fantastic dead-var and dead-code elimination has also tripped us up a number of times, when it removed loops and such because the data calculated therein was never reused. I am getting back into the habit of printing results after any calculations, to make sure the optimizer knows that, yes, I really do use them.
As for OpenMP scaling, I think it's just a red herring. Take the overhead of 10M calls to for_dealloc_all_nocheck and scale it up, and you get all sorts of cache false sharing and extra memory overhead from all the parallel stack allocs/deallocs. Besides, the logic in our deallocation routine is complex, since it has to handle all possible data cases. You can test the theory that OpenMP is a red herring, like I did: comment out the declaration of y in rec and, instead of passing y in the recursion, simply pass x (see the sketch below). The slowdown goes away. Same for the serial case: comment out y, pass x in the recursive calls. This case is all about how y is cleaned up.
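Concretely, the modified rec for that experiment would look like this (the change described above applied as a drop-in replacement inside module mod of the first testcase; everything else stays as posted):

recursive subroutine rec(n, x, i)
  integer, intent(in) :: n
  type(t), intent(inout) :: x
  integer, intent(inout) :: i
  ! local y removed: pass x itself, so nothing needs cleaning up on return
  if (n > 0) then
    call rec(n-1, x, i)
    i = i + 1
  end if
end subroutine rec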
Thanks for the detailed explanation. A few years ago I actually dived into the scalable allocator implementation (using the TBB sources and gdb) due to a false positive in valgrind, so I am well aware of the intricacies of for_dealloc_all_nocheck. However, I do not think your analysis is correct here, for two reasons:
for_dealloc_all_nocheck on a local variable of a derived type with a non-allocated component, sitting on a thread-private stack, might take some time, but it should not cause any false sharing. If shared data is accessed by for_dealloc_all_nocheck, then I would consider that a bug. It would essentially mean that variables of derived types with allocatable components must be avoided within OpenMP regions, because in general such variables are used and dead-code elimination is not possible. Example: some derived type with a character(len=:), allocatable component for storing an errmsg that is never touched in regular runs, while the other components are used for computational purposes.
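A sketch of that pattern (type and component names invented for illustration): dead-code elimination cannot remove such a variable, because val is genuinely used, yet the unused errmsg component alone forces the generic cleanup path:

module result_mod
  implicit none
  type :: result_t
    real :: val = 0.0                        ! used for actual computation
    character(len=:), allocatable :: errmsg  ! only ever allocated on error
  end type result_t
end module result_mod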
I just re-checked with ifx 2024.0.0 20231017: I compiled my variant with OpenMP but reduced N to 1000000 (this collects sufficient data for perf), built with "ifx -O0 -qopenmp", and ran with "perf record -g -F 1000". With one thread, almost 100% of the runtime (about 0.4s) is spent in do_deallocate_all (called by for_dealloc_all_nocheck). Considering the handful of assembler statements in the subroutine rec itself, this is not surprising. But if I use 18 threads on 18 physical cores, for_dealloc_all_nocheck takes only about 4% of the overall runtime! Most time is spent in an entry block of subroutine rec, where the internal global variables var$5 and var$6 are written to; see the assembler output of "ifx -O0 -qopenmp -S" below (starting at line 19 of the listing). Namely, var$5 and var$6 are just initialised. Actually, var$5+24 is read from, but it always contains the constant 1091, written just a couple of instructions before. In my opinion this is clearly a compiler bug; such code should not be produced. The compiler cannot rely on optimisation to remove such bad code, right? I hope I am not reading the assembler and the annotated perf report wrongly.
edit 1:
PS: I just noticed that for the first case, dead-code elimination does not work, and ifx (and Classic ifort as well) does not scale; or rather, it breaks down badly with many threads. So this is not an -O0 issue: the example can be compiled with -O2 and still shows the same behaviour (going from 1 to 18 threads increases runtime by a factor of 6 on a 9980XE).
edit 2:
To be thorough, I just ran the pointer variant (mm2.ptr.f90, but with => null() default initialisation, which causes some additional memcpy) and it scales perfectly, minus CPU frequency scaling; to be precise, with N=1000000000 I get a speedup factor of 14 comparing 1 and 18 threads.
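For clarity, the default initialisation mentioned here is just:
integer, dimension(:), pointer :: a => null()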
So from your reply it sounds like the Intel compiler team thinks it is fine that an unused allocatable component can completely break down OpenMP scaling in code where no shared user data is accessed? Remember, this is a boiled-down testcase; I have seen exactly this in real code, where I could only see the false sharing in the assembler, not in the Fortran source!
        .globl mod_mp_rec_
        .p2align 4, 0x90
        .type mod_mp_rec_,@function
mod_mp_rec_:
        .cfi_startproc
        pushq %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset %rbp, -16
        movq %rsp, %rbp
        .cfi_def_cfa_register %rbp
        subq $192, %rsp
        movq %rdi, -72(%rbp)
        movq -72(%rbp), %rax
        movq %rax, -176(%rbp)
        movq %rsi, -80(%rbp)
        movq %rdx, -88(%rbp)
        movq -88(%rbp), %rax
        movq %rax, -184(%rbp)
        movq $1091, var$5+24
        movq $72, var$5+8
        movq $0, var$5+32
        movq $0, var$5+16
        movq $0, var$5
        movq var$5+24, %rax
        orq $1, %rax
        movq %rax, var$5+24
        movq $0, var$5+16
        movabsq $_DYNTYPE_RECORD0, %rax
        movq %rax, var$5+48
        movq $0, var$5+56
        movq $0, var$5+80
        movq $0, var$5+96
        movq $0, var$5+88
        movq $1248, var$6+24
        movq $1, var$6+32
        movq $0, var$6+16
        movabsq $_DYNTYPE_RECORD1, %rax
        movq %rax, var$6+72
        movq $0, var$6+80
        movq $0, var$6+104
        movq $0, var$6+120
        movq $0, var$6+112
        movq $0, var$6+96
        movq $0, var$6+128
        movq $0, var$6+136
        movabsq $_ALLOC_RECORD_LIST_VAR_0, %rax
        movq %rax, var$5+72
        movabsq $_INFO_LIST_VAR_0, %rax
        movq %rax, var$5+104
        movq $0, var$5+112
        leaq -160(%rbp), %rdi
        movabsq $mod_mp_rec_$blk.var$3, %rsi
        movl $72, %edx
        callq memcpy@PLT
        movq -176(%rbp), %rax
        cmpl $0, (%rax)
        setg %al
        andb $1, %al
        movzbl %al, %eax
        testl $1, %eax
        je .LBB1_2
        movq -184(%rbp), %rdx
        movq -176(%rbp), %rax
        movl (%rax), %eax
        subl $1, %eax
        movl %eax, -164(%rbp)
        leaq -164(%rbp), %rdi
        leaq -160(%rbp), %rsi
        callq mod_mp_rec_@PLT
        movq -184(%rbp), %rax
        movl (%rax), %ecx
        addl $1, %ecx
        movl %ecx, (%rax)
        jmp .LBB1_3
.LBB1_2:
        jmp .LBB1_3
.LBB1_3:
        movabsq $var$5, %rdi
        leaq -160(%rbp), %rsi
        movl $262144, %edx
        callq for_dealloc_all_nocheck@PLT
        addq $192, %rsp
        popq %rbp
        .cfi_def_cfa %rsp, 8
        retq
.Lfunc_end1:
        .size mod_mp_rec_, .Lfunc_end1-mod_mp_rec_
        .cfi_endproc
...
        .type var$5,@object
        .local var$5
        .comm var$5,128,16
        .type _DYNTYPE_RECORD0,@object
        .p2align 3, 0x0
_DYNTYPE_RECORD0:
        .quad strlit
        .quad 0
        .size _DYNTYPE_RECORD0, 16
        .type strlit,@object
        .section .rodata.str1.1,"aMS",@progbits,1
strlit:
        .asciz "MOD#T"
        .size strlit, 6
        .type var$6,@object
        .local var$6
        .comm var$6,152,16
        .type _DYNTYPE_RECORD1,@object
        .data
        .p2align 3, 0x0
_DYNTYPE_RECORD1:
        .quad strlit.1
        .quad _DYNTYPE_RECORD2
        .size _DYNTYPE_RECORD1, 16
I filed a second bug, CMPLRLLVM-49995, for the second reproducer. I'll let the compiler developers decide if they are related.
