- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
After further trying some variations of the issue I asked about in Performance issue in strange subroutine entry block I think I can report this as a pretty obvious bug, present both in classical ifort (2021.8.0 20221119 and older version) as well as ifx (2023.0.0 20221201). However, ifx seems to be able to optimise away some of the bad code. The example below is the shortest version I could come up with, showing the problem with classical ifort. With ifx it can be avoided by optimisation -O2, but the offending code block becomes visible with -O0. To also see the problem with ifx and optimisations turned on, the slightly more complex version provided in the topic linked above should be used. Then ifx breaks down as well.
Description of the problem: If derived-type "s" with a generic assignment is used as component "type(s) :: x" in another derived type with two type-bound procedures, one (foo) calling the other (bar), then a useless write-to-shared-data block is created at the start of foo. The more threads are calling the function the more performance breaks down, presumably due to cache coherency issues.
I think this is a pretty serious issue for modern fortran code, as encapsulating character(len=:),allocatable in a derived-type for easier string handling is not uncommon. And even declaring but not using such a component can break down performance in critical loops. And it does not need to be a character component, just generic assignment suffices to trigger it.
If the module (without the program part) is compiled with "ifort -S -qopenmp -O2 ...", the offending code block at the start of foo can be easily spotted.
Options qopenmp or O2 do not make a difference, but both should be used for performance benchmarks. If the unused(!) generic assignment is removed, execution time scales well with thread-count. If generic assignment is present, then performance decreases considerably (getting worse with increasing thread count).
module mod
implicit none
private
type, public :: s
contains
procedure :: assign
generic :: assignment(=) => assign
end type s
type, public :: t
type(s) :: x
contains
procedure :: foo
procedure :: bar
end type t
contains
subroutine assign(self, x)
class(s), intent(inout) :: self
class(s), intent(in) :: x
end subroutine assign
function foo(self) result(m)
integer :: m
class(t), intent(in) :: self
m = self%bar()
end function foo
function bar(self) result(i)
integer :: i
class(t), intent(in) :: self
i = 2
end function bar
end module mod
program test
use mod
implicit none
integer, parameter :: N = 100000000
integer :: i, c
class(t), pointer :: u
c = 0
!$omp parallel default(shared) private(i, u) reduction(+:c)
allocate(t :: u)
!$omp do schedule(dynamic,1000)
do i = 1, N
c = c + u%foo()
end do
!$omp end do
deallocate(u)
!$omp end parallel
print *, c, N
end program test
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Let me investigate this.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks for looking into this.
I was just checking segfault_with_ifx_with_openmp_recursion_and_allocatable with recently released ifx and also looked at the assembler only to see a similar pattern as above. Maybe the segfault there is related to the write-to-shared data code. Anyway, here is an example derived from the code linked above with a derived type containing a component with allocatable attribute, but without generic assignment. So this is not a necessary ingredient. Note that local variable "y" of type(t) is declared and passed to recursively called subroutine rec, but otherwise it is not used. Replacing allocatable by pointer for "a" in type "t" seems to be sufficient, to resolve the issue. Of course, this code runs with classical ifort but segfaults with ifx (which also shows the write-to-shared data blocks). Compilation should be done with -qopenmp (and -O2 or -O3 or similar).
module mod
implicit none
public
type :: t
integer, dimension(:), allocatable :: a
end type t
contains
recursive subroutine rec(n, x, i)
integer, intent(in) :: n
type(t), intent(inout) :: x
integer, intent(inout) :: i
type(t) :: y
if (n > 0) then
call rec(n-1, y, i)
i = i + 1
end if
end subroutine rec
end module mod
program alloc_rec
use mod
implicit none
type(t) :: x
integer :: i, j
integer, parameter :: N = 1000000
j = 0
!$omp parallel default(shared) private(i, x) reduction(+:j)
!$omp do schedule(dynamic,1000)
do i = 1,N
call rec(10, x, j)
end do
!$omp end do
!$omp end parallel
print *, j, N
end program alloc_rec
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I just added a comment on your earlier thread, segfault_with_ifx_with_openmp_recursion_and_allocatable. It's fixed in a future release.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Since the performance of ifx is better for this source file, can you use ifx for this usage in your application? You can mix and match object files, libraries, and .mod files between ifx and ifort.
Remember that ifort and ifx share the same front end, the part of the compiler that interprets your source code. The optimization steps are completely different. It's nice to see a performance improvement with ifx.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Currently I cannot check ifx with the real code due to the segfault_with_ifx_with_openmp_recursion_and_allocatable issue. However, I have resolved the issue by declaring components of a derived-type with generic assignment using class instead of type keyword (i.e. in the example below this would be "class(s), pointer :: x => null" instead of "type(s)..."). The variant with recursion and allocatable component posted above can be avoided by using pointer instead of allocatable arrays.
But generally I would think that ifx shows the same performance problem in any real code. Only the most simple testcase (the one at the start of this thread) runs fine with ifx and O2. If I add just one unused array allocatable component in this testcase, runtime easily decreases by a factor of 100-1000 (with versus without generic assignment line). Of course real code has more well-running code and performance does not decrease as much, but still very noticably. For the sake of completeness, here is the simple testcase with the unused array component added:
module mod
implicit none
private
type, public :: s
integer, dimension(:), allocatable :: u
contains
procedure :: assign
generic :: assignment(=) => assign
end type s
type, public :: t
type(s) :: x
contains
procedure :: foo
procedure :: bar
end type t
contains
subroutine assign(self, x)
class(s), intent(inout) :: self
class(s), intent(in) :: x
end subroutine assign
function foo(self) result(m)
integer :: m
class(t), intent(in) :: self
m = self%bar()
end function foo
function bar(self) result(i)
integer :: i
class(t), intent(in) :: self
i = 2
end function bar
end module mod
program test
use mod
implicit none
integer, parameter :: N = 10000000
integer :: i, c
class(t), pointer :: u
c = 0
!$omp parallel default(shared) private(i, u) reduction(+:c)
allocate(t :: u)
!$omp do schedule(dynamic,1000)
do i = 1, N
c = c + u%foo()
end do
!$omp end do
deallocate(u)
!$omp end parallel
print *, c, N
end program test
And this is the complete ifx -qopenmp -O2 -S output for the module part. It is the var$4 shared memory part, which breaks down performance (line 28-36). The small memory block is never used.
.text
.file "entry_block.f90"
.globl mod._
.p2align 4, 0x90
.type mod._,@function
mod._:
.cfi_startproc
retq
.Lfunc_end0:
.size mod._, .Lfunc_end0-mod._
.cfi_endproc
.globl mod_mp_assign_
.p2align 4, 0x90
.type mod_mp_assign_,@function
mod_mp_assign_:
.cfi_startproc
retq
.Lfunc_end1:
.size mod_mp_assign_, .Lfunc_end1-mod_mp_assign_
.cfi_endproc
.globl mod_mp_foo_
.p2align 4, 0x90
.type mod_mp_foo_,@function
mod_mp_foo_:
.cfi_startproc
movq $1248, var$4+24(%rip)
movq $1, var$4+32(%rip)
movq $0, var$4+16(%rip)
movq $_DYNTYPE_RECORD1, var$4+72(%rip)
movq $0, var$4+80(%rip)
xorps %xmm0, %xmm0
movaps %xmm0, var$4+96(%rip)
movaps %xmm0, var$4+112(%rip)
movaps %xmm0, var$4+128(%rip)
movq 56(%rdi), %rax
movq 8(%rax), %rcx
xorl %eax, %eax
jmpq *%rcx
.Lfunc_end2:
.size mod_mp_foo_, .Lfunc_end2-mod_mp_foo_
.cfi_endproc
.globl mod_mp_bar_
.p2align 4, 0x90
.type mod_mp_bar_,@function
mod_mp_bar_:
.cfi_startproc
movl $2, %eax
retq
.Lfunc_end3:
.size mod_mp_bar_, .Lfunc_end3-mod_mp_bar_
.cfi_endproc
.type var$4,@object
.local var$4
.comm var$4,152,16
.type _DYNTYPE_RECORD1,@object
.data
.p2align 3, 0x0
_DYNTYPE_RECORD1:
.quad strlit.1
.quad _DYNTYPE_RECORD2
.size _DYNTYPE_RECORD1, 16
.type strlit.1,@object
.section .rodata.str1.1,"aMS",@progbits,1
strlit.1:
.asciz "intr#int#4"
.size strlit.1, 11
.type _DYNTYPE_RECORD2,@object
.data
.p2align 4, 0x0
_DYNTYPE_RECORD2:
.quad strlit.2
.quad 0
.quad _DYNTYPE_RECORD3
.size _DYNTYPE_RECORD2, 24
.type strlit.2,@object
.section .rodata.str1.1,"aMS",@progbits,1
strlit.2:
.asciz "intr#int#4"
.size strlit.2, 11
.type _DYNTYPE_RECORD3,@object
.data
.p2align 3, 0x0
_DYNTYPE_RECORD3:
.quad strlit.3
.quad 0
.size _DYNTYPE_RECORD3, 16
.type strlit.3,@object
.section .rodata.str1.1,"aMS",@progbits,1
strlit.3:
.asciz "intr#int#"
.size strlit.3, 10
.section ".note.GNU-stack","",@progbits
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
@martinmath, I'm a bit confused. There are 2 different reproducers in this thread. Which one should I take a deeper look at?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I am very sorry if I have confused you. Just take a look at the testcase posted yesterday (two posts above). It shows the problem with both ifort as well as ifx (current release: ifx (IFX) 2023.1.0 20230320). And it has the advantage that the assembler output from the compiler is already posted. If you can read the assembler you can spot the problem right away at line 28-36. Such code is an absolute no-go and hints at some internally created shared variable.
But as I have noticed with the first testcase in this thread, ifx might be able to remove the write-to-shared data block by some optimisation steps, but seemingly only for very simple testcases.
As a coder myself I very well know the difficulty and intricacies of having a (small) proper reproducer. That is why I posted the variants. The one at the end of Performance issue in strange subroutine entry block is the most complex one and closest to real code, but the assembler output is quite a bit larger.
Please note that the second testcase posted here (with allocatable and recursion) is very different but shows the same pattern in performance breakdown as well as in the assembler, so has probably the same source. Presumably in the front-end, as ifx and ifort share that part, and both compilers produce very similar write-to-shared data blocks.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks for clarifying which reproducer I should use. I see a performance improvement using ifx.
ifx is the next generation of Intel's Fortran Compiler. I know you are reporting issues. Thank you for helping improve the compiler. The stability of ifx is improving with each release.
As I mentioned before you can mix and match object files, .mod files and libraries between the two compilers to make progress with compiling and running your application.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Barbara,
The important part of martinmath's post is that exposes (potential) non-thread-safe activity by the code generation. What was shown was what appears to be a copy-to/initialization-of a procedure SAVE variable, regardless of that it wasn't referenced later. The performance issue is something someone can grudgingly live with. However, in other code (not shown), IIF the false SAVE variable becomes referenced, you would then have non-thread-safe code.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
@jimdempseyatthecove Thanks pointing that out. I stumbled upon this due to the performance anomaly, so that was my focus and I was not willing to accept it grudgingly. I actually did worry about data race errors, but as I could not find any hint that the shared data was ever read I assumed that thread safety was not an issue. But who knows.
@Barbara_P_Intel : There is no reproducer demonstrating failure due to the thread safety concerns as far as I know, but the assembler code patterns naturally raise the question. Any write access to shared data must be guarded (or reasoned about by strong memory ordering), which clearly is not the case. I hope that your assurance that the next release of ifx fixes the issue. But I do actually wonder as this bug is shared by ifort and ifx (hence a frontend issue?), so why should ifx fix it, but not ifort. Did you compare new ifx with and without the generic assignment line (line 10)
generic :: assignment(=) => assign
(removed or commented out). If the bug is still present, then execution time on a processor with lots of cores (I checked it on 18 core system) should vary by 2-3 orders.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I'll be happy to report a bug on the possible race condition, but I do need a reproducer, a code that demonstrates a runtime failure. The compiler development team requires that.

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page