ifort/ifx Performance break-down in conjunction with generic assignment

martinmath · ‎04-18-2023

After further trying some variations of the issue I asked about in Performance issue in strange subroutine entry block I think I can report this as a pretty obvious bug, present both in classical ifort (2021.8.0 20221119 and older version) as well as ifx (2023.0.0 20221201). However, ifx seems to be able to optimise away some of the bad code. The example below is the shortest version I could come up with, showing the problem with classical ifort. With ifx it can be avoided by optimisation -O2, but the offending code block becomes visible with -O0. To also see the problem with ifx and optimisations turned on, the slightly more complex version provided in the topic linked above should be used. Then ifx breaks down as well.

Description of the problem: If derived-type "s" with a generic assignment is used as component "type(s) :: x" in another derived type with two type-bound procedures, one (foo) calling the other (bar), then a useless write-to-shared-data block is created at the start of foo. The more threads are calling the function the more performance breaks down, presumably due to cache coherency issues.

I think this is a pretty serious issue for modern fortran code, as encapsulating character(len=:),allocatable in a derived-type for easier string handling is not uncommon. And even declaring but not using such a component can break down performance in critical loops. And it does not need to be a character component, just generic assignment suffices to trigger it.

If the module (without the program part) is compiled with "ifort -S -qopenmp -O2 ...", the offending code block at the start of foo can be easily spotted.

Options qopenmp or O2 do not make a difference, but both should be used for performance benchmarks. If the unused(!) generic assignment is removed, execution time scales well with thread-count. If generic assignment is present, then performance decreases considerably (getting worse with increasing thread count).

module mod

implicit none
private

type, public :: s
contains
   procedure :: assign
   generic :: assignment(=) => assign
end type s

type, public :: t
   type(s) :: x
contains
   procedure :: foo
   procedure :: bar
end type t

contains

subroutine assign(self, x)
   class(s), intent(inout) :: self
   class(s), intent(in) :: x
end subroutine assign

function foo(self) result(m)
   integer :: m
   class(t), intent(in) :: self
   m = self%bar()
end function foo

function bar(self) result(i)
   integer :: i
   class(t), intent(in) :: self
   i = 2
end function bar

end module mod



program test

use mod
implicit none

integer, parameter :: N = 100000000
integer :: i, c
class(t), pointer :: u

c = 0
!$omp parallel default(shared) private(i, u) reduction(+:c)
allocate(t :: u)
!$omp do schedule(dynamic,1000)
do i = 1, N
   c = c + u%foo()
end do
!$omp end do
deallocate(u)
!$omp end parallel

print *, c, N

end program test

Barbara_P_Intel · ‎04-20-2023

Let me investigate this.

martinmath · ‎04-21-2023

Thanks for looking into this.

I was just checking segfault_with_ifx_with_openmp_recursion_and_allocatable with recently released ifx and also looked at the assembler only to see a similar pattern as above. Maybe the segfault there is related to the write-to-shared data code. Anyway, here is an example derived from the code linked above with a derived type containing a component with allocatable attribute, but without generic assignment. So this is not a necessary ingredient. Note that local variable "y" of type(t) is declared and passed to recursively called subroutine rec, but otherwise it is not used. Replacing allocatable by pointer for "a" in type "t" seems to be sufficient, to resolve the issue. Of course, this code runs with classical ifort but segfaults with ifx (which also shows the write-to-shared data blocks). Compilation should be done with -qopenmp (and -O2 or -O3 or similar).

module mod

implicit none
public

type :: t
   integer, dimension(:), allocatable :: a
end type t

contains

recursive subroutine rec(n, x, i)
   integer, intent(in) :: n
   type(t), intent(inout) :: x
   integer, intent(inout) :: i
   type(t) :: y
   if (n > 0) then
      call rec(n-1, y, i)
      i = i + 1
   end if
end subroutine rec

end module mod


program alloc_rec

use mod
implicit none

type(t) :: x
integer :: i, j
integer, parameter :: N = 1000000

j = 0

!$omp parallel default(shared) private(i, x) reduction(+:j)
!$omp do schedule(dynamic,1000)
do i = 1,N
   call rec(10, x, j)
end do
!$omp end do
!$omp end parallel

print *, j, N

end program alloc_rec

Barbara_P_Intel · ‎04-21-2023

I just added a comment on your earlier thread, segfault_with_ifx_with_openmp_recursion_and_allocatable. It's fixed in a future release.

Barbara_P_Intel · ‎04-21-2023

Since the performance of ifx is better for this source file, can you use ifx for this usage in your application? You can mix and match object files, libraries, and .mod files between ifx and ifort.

Remember that ifort and ifx share the same front end, the part of the compiler that interprets your source code. The optimization steps are completely different. It's nice to see a performance improvement with ifx.

martinmath · ‎04-23-2023

Currently I cannot check ifx with the real code due to the segfault_with_ifx_with_openmp_recursion_and_allocatable issue. However, I have resolved the issue by declaring components of a derived-type with generic assignment using class instead of type keyword (i.e. in the example below this would be "class(s), pointer :: x => null" instead of "type(s)..."). The variant with recursion and allocatable component posted above can be avoided by using pointer instead of allocatable arrays.

But generally I would think that ifx shows the same performance problem in any real code. Only the most simple testcase (the one at the start of this thread) runs fine with ifx and O2. If I add just one unused array allocatable component in this testcase, runtime easily decreases by a factor of 100-1000 (with versus without generic assignment line). Of course real code has more well-running code and performance does not decrease as much, but still very noticably. For the sake of completeness, here is the simple testcase with the unused array component added:

module mod

implicit none
private

type, public :: s
   integer, dimension(:), allocatable :: u
contains
   procedure :: assign
   generic :: assignment(=) => assign
end type s

type, public :: t
   type(s) :: x
contains
   procedure :: foo
   procedure :: bar
end type t

contains

subroutine assign(self, x)
   class(s), intent(inout) :: self
   class(s), intent(in) :: x
end subroutine assign

function foo(self) result(m)
   integer :: m
   class(t), intent(in) :: self
   m = self%bar()
end function foo

function bar(self) result(i)
   integer :: i
   class(t), intent(in) :: self
   i = 2
end function bar

end module mod


program test

use mod
implicit none

integer, parameter :: N = 10000000
integer :: i, c
class(t), pointer :: u

c = 0
!$omp parallel default(shared) private(i, u) reduction(+:c)
allocate(t :: u)
!$omp do schedule(dynamic,1000)
do i = 1, N
   c = c + u%foo()
end do
!$omp end do
deallocate(u)
!$omp end parallel

print *, c, N

end program test

And this is the complete ifx -qopenmp -O2 -S output for the module part. It is the var$4 shared memory part, which breaks down performance (line 28-36). The small memory block is never used.

	.text
	.file	"entry_block.f90"
	.globl	mod._
	.p2align	4, 0x90
	.type	mod._,@function
mod._:
	.cfi_startproc
	retq
.Lfunc_end0:
	.size	mod._, .Lfunc_end0-mod._
	.cfi_endproc

	.globl	mod_mp_assign_
	.p2align	4, 0x90
	.type	mod_mp_assign_,@function
mod_mp_assign_:
	.cfi_startproc
	retq
.Lfunc_end1:
	.size	mod_mp_assign_, .Lfunc_end1-mod_mp_assign_
	.cfi_endproc

	.globl	mod_mp_foo_
	.p2align	4, 0x90
	.type	mod_mp_foo_,@function
mod_mp_foo_:
	.cfi_startproc
	movq	$1248, var$4+24(%rip)
	movq	$1, var$4+32(%rip)
	movq	$0, var$4+16(%rip)
	movq	$_DYNTYPE_RECORD1, var$4+72(%rip)
	movq	$0, var$4+80(%rip)
	xorps	%xmm0, %xmm0
	movaps	%xmm0, var$4+96(%rip)
	movaps	%xmm0, var$4+112(%rip)
	movaps	%xmm0, var$4+128(%rip)
	movq	56(%rdi), %rax
	movq	8(%rax), %rcx
	xorl	%eax, %eax
	jmpq	*%rcx
.Lfunc_end2:
	.size	mod_mp_foo_, .Lfunc_end2-mod_mp_foo_
	.cfi_endproc

	.globl	mod_mp_bar_
	.p2align	4, 0x90
	.type	mod_mp_bar_,@function
mod_mp_bar_:
	.cfi_startproc
	movl	$2, %eax
	retq
.Lfunc_end3:
	.size	mod_mp_bar_, .Lfunc_end3-mod_mp_bar_
	.cfi_endproc

	.type	var$4,@object
	.local	var$4
	.comm	var$4,152,16
	.type	_DYNTYPE_RECORD1,@object
	.data
	.p2align	3, 0x0
_DYNTYPE_RECORD1:
	.quad	strlit.1
	.quad	_DYNTYPE_RECORD2
	.size	_DYNTYPE_RECORD1, 16

	.type	strlit.1,@object
	.section	.rodata.str1.1,"aMS",@progbits,1
strlit.1:
	.asciz	"intr#int#4"
	.size	strlit.1, 11

	.type	_DYNTYPE_RECORD2,@object
	.data
	.p2align	4, 0x0
_DYNTYPE_RECORD2:
	.quad	strlit.2
	.quad	0
	.quad	_DYNTYPE_RECORD3
	.size	_DYNTYPE_RECORD2, 24

	.type	strlit.2,@object
	.section	.rodata.str1.1,"aMS",@progbits,1
strlit.2:
	.asciz	"intr#int#4"
	.size	strlit.2, 11

	.type	_DYNTYPE_RECORD3,@object
	.data
	.p2align	3, 0x0
_DYNTYPE_RECORD3:
	.quad	strlit.3
	.quad	0
	.size	_DYNTYPE_RECORD3, 16

	.type	strlit.3,@object
	.section	.rodata.str1.1,"aMS",@progbits,1
strlit.3:
	.asciz	"intr#int#"
	.size	strlit.3, 10

	.section	".note.GNU-stack","",@progbits

Barbara_P_Intel · ‎04-24-2023

@martinmath, I'm a bit confused. There are 2 different reproducers in this thread. Which one should I take a deeper look at?

martinmath · ‎04-24-2023

I am very sorry if I have confused you. Just take a look at the testcase posted yesterday (two posts above). It shows the problem with both ifort as well as ifx (current release: ifx (IFX) 2023.1.0 20230320). And it has the advantage that the assembler output from the compiler is already posted. If you can read the assembler you can spot the problem right away at line 28-36. Such code is an absolute no-go and hints at some internally created shared variable.

But as I have noticed with the first testcase in this thread, ifx might be able to remove the write-to-shared data block by some optimisation steps, but seemingly only for very simple testcases.

As a coder myself I very well know the difficulty and intricacies of having a (small) proper reproducer. That is why I posted the variants. The one at the end of Performance issue in strange subroutine entry block is the most complex one and closest to real code, but the assembler output is quite a bit larger.

Please note that the second testcase posted here (with allocatable and recursion) is very different but shows the same pattern in performance breakdown as well as in the assembler, so has probably the same source. Presumably in the front-end, as ifx and ifort share that part, and both compilers produce very similar write-to-shared data blocks.

Barbara_P_Intel · ‎04-26-2023

Thanks for clarifying which reproducer I should use. I see a performance improvement using ifx.

ifx is the next generation of Intel's Fortran Compiler. I know you are reporting issues. Thank you for helping improve the compiler. The stability of ifx is improving with each release.

As I mentioned before you can mix and match object files, .mod files and libraries between the two compilers to make progress with compiling and running your application.

jimdempseyatthecove · ‎04-27-2023

Barbara,

The important part of martinmath's post is that exposes (potential) non-thread-safe activity by the code generation. What was shown was what appears to be a copy-to/initialization-of a procedure SAVE variable, regardless of that it wasn't referenced later. The performance issue is something someone can grudgingly live with. However, in other code (not shown), IIF the false SAVE variable becomes referenced, you would then have non-thread-safe code.

Jim Dempsey

martinmath · ‎05-02-2023

@jimdempseyatthecove Thanks pointing that out. I stumbled upon this due to the performance anomaly, so that was my focus and I was not willing to accept it grudgingly. I actually did worry about data race errors, but as I could not find any hint that the shared data was ever read I assumed that thread safety was not an issue. But who knows.

@Barbara_P_Intel : There is no reproducer demonstrating failure due to the thread safety concerns as far as I know, but the assembler code patterns naturally raise the question. Any write access to shared data must be guarded (or reasoned about by strong memory ordering), which clearly is not the case. I hope that your assurance that the next release of ifx fixes the issue. But I do actually wonder as this bug is shared by ifort and ifx (hence a frontend issue?), so why should ifx fix it, but not ifort. Did you compare new ifx with and without the generic assignment line (line 10)

   generic :: assignment(=) => assign

(removed or commented out). If the bug is still present, then execution time on a processor with lots of cores (I checked it on 18 core system) should vary by 2-3 orders.

Barbara_P_Intel · ‎04-28-2023

I understand, @jimdempseyatthecove. Do you have a reproducer?

Barbara_P_Intel · ‎05-03-2023

I'll be happy to report a bug on the possible race condition, but I do need a reproducer, a code that demonstrates a runtime failure. The compiler development team requires that.