Performance issue in strange subroutine entry block

martinmath · ‎04-03-2023

Profiling strange performance issues seen with ifort (current oneapi release) using the linux perf tool pointed me to a strange code blocks I cannot make sense of, and which looks rather suspicous. Here is one such example which was generated using the -S compiler option in linux to obtain annotated assembler output:

        movq      $_DYNTYPE_PACK_35, 48+var$16618(%rip) # subroutine line
        movq      %r8, 56+var$16618(%rip)               # subroutine line
        movq      %r8, 80+var$16618(%rip)               # subroutine line
        movq      %r8, 96+var$16618(%rip)               # subroutine line
        movq      %r8, 88+var$16618(%rip)               # subroutine line
        movq      %r8, 72+var$16618(%rip)               # subroutine line
        movq      %r8, 104+var$16618(%rip)              # subroutine line
        movq      %r8, 112+var$16618(%rip)              # subroutine line
        movq      %r8, 32+var$16606(%rip)               # subroutine line
        movq      %r8, var$16606(%rip)                  # subroutine line
        movq      %r8, 16+var$16606(%rip)               # subroutine line
        movq      %r11, 48+var$16606(%rip)              # subroutine line
        movq      %r11, 48+var$16638(%rip)              # subroutine line
        movq      %r8, 80+var$16606(%rip)               # subroutine line
        movq      %r8, 96+var$16606(%rip)               # subroutine line
        movq      %r8, 88+var$16606(%rip)               # subroutine line

The "# subroutine line" actually points to the "subroutine name(arguments)" source line of the related subroutine the assembler code block belongs to. The subroutine itself appears as procedure in a derived-type.

What I can see here is that register values in %r8 and %r11 are written to an address relative to the current instruction pointer! As this routine is called from within an openmp parallel region, all threads seem to write to the same address in parallel.

If my assessment is correct, then this cannot make much sense? It would cause cache coherency traffic, and could explain why the profiler points to some of these instructions. Right? Moreover, that data is nowhere used in the assembler listing for the module, as far as I can see. So why write it? I have tried various command line options (-g0, -notraceback etc.) but it does not make any difference.

At the moment I am just confused and not even sure, whether my understanding of the assembler code is correct, I am not that fluent in assembler (in particular with the at&t versus intel syntax confusion, so seems to be at&t syntax).

The DYNTYPE_PACK reference points to some OOP magic, the concerned subroutine has class-arguments but no type arguments.

Can anyone help me to understand what is happening here and why the profiler complains about just these lines?

PS: The var$16618$ appears at the end of the assembler listing and looks like

var$16618:
	.type	var$16618,@object
	.size	var$16618,128
	.space 128	# pad
	.align 32

jimdempseyatthecove · ‎04-03-2023

Is the specific UDT declared as SAVE? Or is a module variable and the procedure is contained in the same module.

Note, older Fortran standards had default SAVE for procedure local UDTs and arrays.

-openmp should have made these stack based (as well as the newer language standards make them stack based).

Can you show your code?

Jim Dempsey

martinmath · ‎04-03-2023

I have not been able to find a small reproducer. But I will further try. I just wanted to know whether these code blocks ring any bells. Saved local variables are not involved. The dyn_type_pack can be found here in this forum 12 years ago. It looks like it has something to do with type descriptors (type<->class).

Steve_Lionel · ‎04-03-2023

Passing a TYPE to a CLASS was my first thought - this is done very inefficiently in the Intel compiler, with hundreds of instructions to set up a temporary descriptor.

martinmath · ‎04-03-2023

My first thought also, but all arguments or local variables are either plain values or declared as class. And even if not, writing to a shared memory block instead of (thread-local) stack does not make sense. I have looked at these type-descriptor creation and it can slow down execution a bit, but not terribly. What I see here are really bad execution times, on an AMD processor more than on an Intel one, presumably due to different cache architectures. At least it feels like that the perf-output is consistent with what I would expect if all threads are hammering the same few memory addresses with writes.

It actually reminds me of a (still open) bug in gfortran, where type-bound IO leads internally to a SAVE attribute. If a derived type variable is used within openmp region or in a recursion, then the code segfaults.

martinmath · ‎04-04-2023

Here we go, the following reproducer shows the problem. Compile with

"ifort -O2 -qopenmp entry_block.f90 -o entry_block" for the executable and

"ifort -O2 -qopenmp -S entry_block.f90 -o entry_block.s" for the assembler code, where the strange code blocks can be seen.

Run with as many threads as there are real cores and with "perf record -g -F 10000 ./entry_block" to collect detailed profiling data. The code block in question shows up with "perf report -g". Execution spends almost all time there. Running with one thread takes about 0.1s and with 18 threads abound 0.8s (with a 32 core threadripper I get 5.5s). However, this code should scale almost perfectly, instead of slowing down by a factor of 8 or even 50.

I have not tried hard to reduce it further, but the following ingredients seem necessary. A type "s" with generic assignment(=). A type which has a component of type "s" and a deferred procedure. The code itself does not do anything meaningful, just demonstrates the problem, of course.

module str

implicit none
private

type, public :: s
   character(len=:), allocatable :: a
contains
   procedure :: assign
   generic :: assignment(=) => assign
end type s

contains

subroutine assign(self, x)
   class(s), intent(out) :: self
   class(s), intent(in) :: x
   self%a = x%a
end subroutine assign

end module str



module mod

use str
implicit none
private

type, abstract, public :: t
   type(s) :: x
contains
   procedure :: foo
   procedure(bar_ifc), deferred :: bar
end type t

abstract interface
   function bar_ifc(self) result(i)
      import t
      class(t), intent(in) :: self
      integer :: i
   end function bar_ifc
end interface

type, extends(t), public :: r
contains
   procedure :: bar
end type r

contains

function foo(self) result(m)
   class(t), intent(in) :: self
   integer :: m
   m = self%bar()
end function foo

function bar(self) result(i)
   class(r), intent(in) :: self
   integer :: i
   i = len(self%x%a)
end function bar

end module mod



program test

use mod
implicit none

integer, parameter :: N = 10000000
integer :: i, c
class(t), pointer :: u

c = 0
!$omp parallel default(shared) private(i, u) reduction(+:c)
allocate(r :: u)
u%x%a = '**'
!$omp do schedule(dynamic,1000)
do i = 1, N
   c = c + u%foo()
end do
!$omp end do
deallocate(u)
!$omp end parallel

print *, c, N

end program test

martinmath · ‎04-04-2023

PS: ifx assembler output looks quite different. And execution time does not suffer as much as with ifort. But there are still similar writes which show up in perf consuming almost 100% execution time. So ifx has the same problem, but not as bad as classical ifort.

martinmath · ‎04-04-2023

PPS: Sorry, I just saw that in order to see the relevant assembler listing, the compilation with "-S" needs to be done without the program part, just the two modules. Otherwise only the program part is written out.