- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Profiling strange performance issues seen with ifort (current oneapi release) using the linux perf tool pointed me to a strange code blocks I cannot make sense of, and which looks rather suspicous. Here is one such example which was generated using the -S compiler option in linux to obtain annotated assembler output:
movq $_DYNTYPE_PACK_35, 48+var$16618(%rip) # subroutine line
movq %r8, 56+var$16618(%rip) # subroutine line
movq %r8, 80+var$16618(%rip) # subroutine line
movq %r8, 96+var$16618(%rip) # subroutine line
movq %r8, 88+var$16618(%rip) # subroutine line
movq %r8, 72+var$16618(%rip) # subroutine line
movq %r8, 104+var$16618(%rip) # subroutine line
movq %r8, 112+var$16618(%rip) # subroutine line
movq %r8, 32+var$16606(%rip) # subroutine line
movq %r8, var$16606(%rip) # subroutine line
movq %r8, 16+var$16606(%rip) # subroutine line
movq %r11, 48+var$16606(%rip) # subroutine line
movq %r11, 48+var$16638(%rip) # subroutine line
movq %r8, 80+var$16606(%rip) # subroutine line
movq %r8, 96+var$16606(%rip) # subroutine line
movq %r8, 88+var$16606(%rip) # subroutine line
The "# subroutine line" actually points to the "subroutine name(arguments)" source line of the related subroutine the assembler code block belongs to. The subroutine itself appears as procedure in a derived-type.
What I can see here is that register values in %r8 and %r11 are written to an address relative to the current instruction pointer! As this routine is called from within an openmp parallel region, all threads seem to write to the same address in parallel.
If my assessment is correct, then this cannot make much sense? It would cause cache coherency traffic, and could explain why the profiler points to some of these instructions. Right? Moreover, that data is nowhere used in the assembler listing for the module, as far as I can see. So why write it? I have tried various command line options (-g0, -notraceback etc.) but it does not make any difference.
At the moment I am just confused and not even sure, whether my understanding of the assembler code is correct, I am not that fluent in assembler (in particular with the at&t versus intel syntax confusion, so seems to be at&t syntax).
The DYNTYPE_PACK reference points to some OOP magic, the concerned subroutine has class-arguments but no type arguments.
Can anyone help me to understand what is happening here and why the profiler complains about just these lines?
PS: The var$16618$ appears at the end of the assembler listing and looks like
var$16618:
.type var$16618,@object
.size var$16618,128
.space 128 # pad
.align 32
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Is the specific UDT declared as SAVE? Or is a module variable and the procedure is contained in the same module.
Note, older Fortran standards had default SAVE for procedure local UDTs and arrays.
-openmp should have made these stack based (as well as the newer language standards make them stack based).
Can you show your code?
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Passing a TYPE to a CLASS was my first thought - this is done very inefficiently in the Intel compiler, with hundreds of instructions to set up a temporary descriptor.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
My first thought also, but all arguments or local variables are either plain values or declared as class. And even if not, writing to a shared memory block instead of (thread-local) stack does not make sense. I have looked at these type-descriptor creation and it can slow down execution a bit, but not terribly. What I see here are really bad execution times, on an AMD processor more than on an Intel one, presumably due to different cache architectures. At least it feels like that the perf-output is consistent with what I would expect if all threads are hammering the same few memory addresses with writes.
It actually reminds me of a (still open) bug in gfortran, where type-bound IO leads internally to a SAVE attribute. If a derived type variable is used within openmp region or in a recursion, then the code segfaults.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Here we go, the following reproducer shows the problem. Compile with
"ifort -O2 -qopenmp entry_block.f90 -o entry_block" for the executable and
"ifort -O2 -qopenmp -S entry_block.f90 -o entry_block.s" for the assembler code, where the strange code blocks can be seen.
Run with as many threads as there are real cores and with "perf record -g -F 10000 ./entry_block" to collect detailed profiling data. The code block in question shows up with "perf report -g". Execution spends almost all time there. Running with one thread takes about 0.1s and with 18 threads abound 0.8s (with a 32 core threadripper I get 5.5s). However, this code should scale almost perfectly, instead of slowing down by a factor of 8 or even 50.
I have not tried hard to reduce it further, but the following ingredients seem necessary. A type "s" with generic assignment(=). A type which has a component of type "s" and a deferred procedure. The code itself does not do anything meaningful, just demonstrates the problem, of course.
module str
implicit none
private
type, public :: s
character(len=:), allocatable :: a
contains
procedure :: assign
generic :: assignment(=) => assign
end type s
contains
subroutine assign(self, x)
class(s), intent(out) :: self
class(s), intent(in) :: x
self%a = x%a
end subroutine assign
end module str
module mod
use str
implicit none
private
type, abstract, public :: t
type(s) :: x
contains
procedure :: foo
procedure(bar_ifc), deferred :: bar
end type t
abstract interface
function bar_ifc(self) result(i)
import t
class(t), intent(in) :: self
integer :: i
end function bar_ifc
end interface
type, extends(t), public :: r
contains
procedure :: bar
end type r
contains
function foo(self) result(m)
class(t), intent(in) :: self
integer :: m
m = self%bar()
end function foo
function bar(self) result(i)
class(r), intent(in) :: self
integer :: i
i = len(self%x%a)
end function bar
end module mod
program test
use mod
implicit none
integer, parameter :: N = 10000000
integer :: i, c
class(t), pointer :: u
c = 0
!$omp parallel default(shared) private(i, u) reduction(+:c)
allocate(r :: u)
u%x%a = '**'
!$omp do schedule(dynamic,1000)
do i = 1, N
c = c + u%foo()
end do
!$omp end do
deallocate(u)
!$omp end parallel
print *, c, N
end program test
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
PS: ifx assembler output looks quite different. And execution time does not suffer as much as with ifort. But there are still similar writes which show up in perf consuming almost 100% execution time. So ifx has the same problem, but not as bad as classical ifort.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
PPS: Sorry, I just saw that in order to see the relevant assembler listing, the compilation with "-S" needs to be done without the program part, just the two modules. Otherwise only the program part is written out.

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page