Solved: Sorry for the late response.

Juan_Pablo_S_1 · ‎11-27-2014

Hi everyone. I've been profiling the attached code to measure the performance impact of using virtual procedures in Fortran 2003. As shown in the profile log, I've found that the polymorphic variable "af" actually consumes less time than the static one "sf". Am I doing something wrong in this example? Or is it possible to get better performance with polymorphic variables in some cases?

-juan

Test setup: compiler: ifort 13.1.1; compiler flags: -ipo -O3 -inline-forceinline; OS: Debian Linux; execution script: mpirun -np 1 main.out

Module TestOOP
  implicit none
  public
  type :: pointerfunction
    procedure(a), pointer :: f
  end type
  type, abstract :: abstractfunction
  contains
    procedure(b), deferred, public :: f
  end type
  type, extends(abstractfunction) :: concretefunction
  contains
    procedure, public :: f => fconcrete
  end type
  abstract interface
    pure subroutine a(this,i)
      import :: pointerfunction
      class(pointerfunction), intent(inout) :: this
      integer, intent(in) :: i
    end subroutine
    pure subroutine b(this,i)
      import :: abstractfunction
      class(abstractfunction), intent(inout) :: this
      integer, intent(in) :: i
    end subroutine
  end interface
  type staticfunction
  contains
    procedure, public :: f => fstatic
  end type
contains
  pure subroutine fpointer(this,i)
    implicit none
    class(pointerfunction), intent(inout) :: this
    integer, intent(in) :: i
  end subroutine
  pure subroutine fstatic(this,i)
    implicit none
    class(staticfunction), intent(inout) :: this
    integer, intent(in) :: i
  end subroutine
  pure subroutine fconcrete(this,i)
    implicit none
    class(concretefunction), intent(inout) :: this
    integer, intent(in) :: i
  end subroutine
End Module

Program Main
  use TestOOP
  use base_parallel_mod
  implicit none
  integer :: i, j
  integer, parameter :: N = 2000
  type(pointerfunction) :: pf
  type(staticfunction) :: sf
  class(abstractfunction), pointer :: af
  type(concretefunction) :: cf
  call pumainit()  !wrapper to mpi_initialize
  pf%f => fpointer
  allocate(concretefunction::af)
  do j = 1, N
    do i = 1, N
      call pf%f(i)
    end do
  end do
  do j = 1, N
    do i = 1, N
      call af%f(i)
    end do
  end do
  do j = 1, N
    do i = 1, N
      call cf%f(i)
    end do
  end do
  do j = 1, N
    do i = 1, N
      call sf%f(i)
    end do
  end do
  call pumaend() ! wrapper to mpi_finalize
End Program

Juan_Pablo_S_1 · ‎11-27-2014

reattach loop profile data

pbkenned1 · ‎12-01-2014

Thanks for the report. I'm looking into this.

Patrick

pbkenned1 · ‎12-01-2014

Please provide the code for module base_parallel_mod. I tried commenting out the 'use base_parallel_mod', but the MPI wrapper functions are unresolved.

[U536538]$ ifort Main.o TestOOP.o
Main.o: In function `MAIN__':
Main.f90:(.text+0x38): undefined reference to `pumainit_'
Main.f90:(.text+0x33c): undefined reference to `pumaend_'

Patrick

Juan_Pablo_S_1 · ‎12-02-2014

I attach the code for the required module

module base_parallel_mod
  use mpi
  implicit none
  public
contains
  subroutine pumainit()
    integer :: ierr
    call mpi_init(ierr)
  end subroutine
  subroutine pumaend()
    integer :: ierr
    call mpi_finalize(ierr)
  end subroutine
end module

pbkenned1 · ‎12-03-2014

Sorry for the late response. Thanks for providing the module code. I'm looking into this now.

Patrick

pbkenned1 · ‎12-03-2014

The loop using polymorphic variable 'af' gets more than a 2x speed up on my test machine, compared to the loop using the staticfunction type. Can we just say "Hoorah! Polymorphism!" and leave it at that? All joking aside, I'll look into the reason.

Patrick

[U536538]$ ./TestOOP-ifort.x
pf loop took 2.076601982116699E-002
af loop took 9.101152420043945E-003
cf loop took 2.075791358947754E-002
sf loop took 2.075791358947754E-002
[U536538]$

jimdempseyatthecove · ‎12-03-2014

Patrick, with your current code as a baseline, can you add some functionality to each of the subroutines so that the compiler will not optimize out the CALL and/or the loops?

Also, longer runs than .01 seconds are required for any valid timing.

I am sure you are aware that the compiler is quite smart in finding useless code, so you may have to insert a print or something of the results at the end.

Jim Dempsey
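
To illustrate the points above, here is a minimal sketch of a timing harness in which each call does real work and the accumulated result is printed, so the compiler cannot treat the calls or loops as dead code. It is not the code Patrick ran: the module name work_mod, the interface name work_iface, the extra acc argument and the sin() workload are all assumptions made for illustration, and n would need to be raised until each loop runs well beyond 0.01 seconds.

```fortran
module work_mod
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none

  type, abstract :: abstractfunction
  contains
    procedure(work_iface), deferred :: f
  end type

  type, extends(abstractfunction) :: concretefunction
  contains
    procedure :: f => fconcrete
  end type

  type :: staticfunction
  contains
    procedure :: f => fstatic
  end type

  abstract interface
    subroutine work_iface(this, i, acc)
      import :: abstractfunction, real64
      class(abstractfunction), intent(inout) :: this
      integer, intent(in) :: i
      real(real64), intent(inout) :: acc
    end subroutine
  end interface

contains

  ! Hypothetical workload: the result flows into acc, which is printed later.
  subroutine fconcrete(this, i, acc)
    class(concretefunction), intent(inout) :: this
    integer, intent(in) :: i
    real(real64), intent(inout) :: acc
    acc = acc + sin(real(i, real64))
  end subroutine

  subroutine fstatic(this, i, acc)
    class(staticfunction), intent(inout) :: this
    integer, intent(in) :: i
    real(real64), intent(inout) :: acc
    acc = acc + sin(real(i, real64))
  end subroutine

end module work_mod

program timing_sketch
  use, intrinsic :: iso_fortran_env, only: int64, real64
  use work_mod
  implicit none
  integer, parameter :: n = 2000        ! assumed size; increase for longer runs
  class(abstractfunction), allocatable :: af
  type(staticfunction) :: sf
  integer(int64) :: t0, t1, rate
  real(real64) :: acc_af, acc_sf
  integer :: i, j

  allocate(concretefunction :: af)
  call system_clock(count_rate=rate)

  ! Polymorphic (dynamically dispatched) call
  acc_af = 0.0_real64
  call system_clock(t0)
  do j = 1, n
    do i = 1, n
      call af%f(i, acc_af)
    end do
  end do
  call system_clock(t1)
  print *, 'af loop took', real(t1 - t0, real64) / real(rate, real64), &
           ' acc =', acc_af

  ! Call through a non-polymorphic variable
  acc_sf = 0.0_real64
  call system_clock(t0)
  do j = 1, n
    do i = 1, n
      call sf%f(i, acc_sf)
    end do
  end do
  call system_clock(t1)
  print *, 'sf loop took', real(t1 - t0, real64) / real(rate, real64), &
           ' acc =', acc_sf
end program timing_sketch
```

Printing acc_af and acc_sf after the loops is what keeps the results "used"; without it an optimizer is free to drop the work entirely.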

pbkenned1 · ‎12-04-2014

Jim, thanks for the feedback, point well taken. Indeed I may need to beef up the loop workloads by orders of magnitude to obtain valid results. I was a bit anxious to just get the code running, insert some timing, and see if I could quickly validate Juan's claim.

Patrick

jimdempseyatthecove · ‎12-04-2014

We've seen plenty of "proofs" here that ended up with the compiler optimizing away code whose results were never used. We've also seen "proofs" that failed to take "first touch" overhead into account, in other words, that assumed the allocation is complete when new/allocate returns. The real cost is paid the first time the program touches (reads or writes) a virtual memory page: this is where the mapping of virtual address space to RAM and page file occurs, and it may also include a wipe of the RAM to 0's. First touch would not be a factor in the above NOP subroutines.

Jim Dempsey
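
The "first touch" effect can be seen with a small sketch like the following, which times two passes over a freshly allocated array. The program name, array size and variable names are hypothetical, and how large the difference is (if any) depends on the OS and the allocator.

```fortran
program first_touch_sketch
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
  integer, parameter :: n = 10000000    ! assumed size, about 80 MB of real64
  real(real64), allocatable :: a(:)
  integer(int64) :: t0, t1, t2, rate
  integer :: i

  call system_clock(count_rate=rate)
  allocate(a(n))          ! may return before the pages are actually mapped

  call system_clock(t0)
  do i = 1, n             ! first pass: pays any page-mapping (first touch) cost
    a(i) = 1.0_real64
  end do
  call system_clock(t1)
  do i = 1, n             ! second pass: the pages have already been touched
    a(i) = a(i) + 1.0_real64
  end do
  call system_clock(t2)

  print *, 'first pass  (s):', real(t1 - t0, real64) / real(rate, real64)
  print *, 'second pass (s):', real(t2 - t1, real64) / real(rate, real64)
  print *, 'checksum =', sum(a)   ! use the data so the loops are not removed
end program first_touch_sketch
```

In a benchmark, the usual remedy is to touch the data once before starting the timer, so that only the second kind of pass is measured.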

pbkenned1 · ‎12-09-2014

It took me longer than I expected to analyze this, but the results are interesting and a bit unexpected.

First, I substantially beefed up the functions to add some real work, increased the loop upper bound to 20000, added some 'first touch' logic ahead of the kernel loops, and extracted the 'af' (polymorphic) case and 'sf' case (static) into separate test cases.

I consistently get at least an 8% speedup with the polymorphic case:

c:\ISN_Forums\U536538\DPD200364168\af>Main-af-no-mpi.exe
af loop took 4.84769248962402
jasin = 0.9465396

c:\ISN_Forums\U536538\DPD200364168\sf>Main-sf-no-mpi.exe
sf loop took 5.26771163940430
jasin = 0.9465396

c:\ISN_Forums\U536538\DPD200364168\sf>

Next, I ran a Vtune analysis on each case. The summary of the hot spots of interest:

'af' case:

TESTOOP_mp_FCONCRETE 0.774 sec
MAIN 0.261 sec

'sf' case:
MAIN 0.895 sec
TESTOOP_mp_FSTATIC 0.636 sec

So while the implementation of the static function itself is actually faster compared to the polymorphic version, it's the calling overhead in MAIN for the static version that contributes most to the slowdown.

The MAIN routine for the 'sf' case was compiled with ifort -c Main-sf-no-mpi.f90 -Zi -O2

The generated code is rather cumbersome:

$LN21:
lea r9, QWORD PTR [560+rsp]            ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
mov rcx, r13                           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
mov rdx, rbp                           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
lea r8, QWORD PTR [__NLITPACK_2.0.1]   ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
mov QWORD PTR [440+rsp], r14           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
mov QWORD PTR [464+rsp], r14           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
mov QWORD PTR [432+rsp], r15           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
mov QWORD PTR [448+rsp], r14           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
mov QWORD PTR [480+rsp], rsi           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
mov QWORD PTR [488+rsp], rdi           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
mov QWORD PTR [496+rsp], r14           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
mov QWORD PTR [512+rsp], r14           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
mov QWORD PTR [528+rsp], r14           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
mov QWORD PTR [520+rsp], r14           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
mov QWORD PTR [504+rsp], r14           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
mov QWORD PTR [536+rsp], r14           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
mov QWORD PTR [544+rsp], r14           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
mov QWORD PTR [456+rsp], 3             ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
call TESTOOP_mp_FSTATIC

Compare to the 'af' case (same compiler options).

mov DWORD PTR [180+rsp], 1             ;C:\U536538\af\Main-af-no-mpi.f90:21.6
jmp .B1.8 ; Prob 100%                  ;C:\U536538\af\Main-af-no-mpi.f90:21.6
mov DWORD PTR [180+rsp], ecx           ;C:\U536538\af\Main-af-no-mpi.f90:21.6
mov DWORD PTR [180+rsp], ecx           ;C:\U536538\af\Main-af-no-mpi.f90:21.6
lea r9, QWORD PTR [176+rsp]            ;C:\U536538\af\Main-af-no-mpi.f90:22.11
mov r10, QWORD PTR [MAIN$AF.0.1+56]    ;C:\U536538\af\Main-af-no-mpi.f90:22.11
lea rcx, QWORD PTR [MAIN$AF.0.1]       ;C:\U536538\af\Main-af-no-mpi.f90:22.11
mov rdx, rbp                           ;C:\U536538\af\Main-af-no-mpi.f90:22.11
lea r8, QWORD PTR [__NLITPACK_2.0.1]   ;C:\U536538\af\Main-af-no-mpi.f90:22.11
call QWORD PTR [r10]                   ;C:\U536538\af\Main-af-no-mpi.f90:22.11

I filed this as a feature request to improve the generated code for the static call case (tracking ID DPD200364168).

Patrick

jimdempseyatthecove · ‎12-09-2014

Patrick,

Could you show the entire loop(s) for both af and sf (disassembly)

The moves of r14 to the stack look to me like once-only code (program initialization), with optimization having placed your actual loop elsewhere. Do you have call traceback enabled? If so, this could be pushing (writing) a signature frame onto the stack.

Also when testing like this, make different runs with different loop sequence order. After the first set of runs, copy the first series of nested loops to the last (making that test last). Repeat the run, rotate the test loops again, ...

Jim Dempsey
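
The rotation described above can be sketched as follows. Note that this is a variant: instead of editing the source and re-running between rotations, it rotates the order inside a single run. The kernels kernel_a, kernel_b and kernel_c are hypothetical stand-ins for the test loops being compared, and n is an assumed workload size.

```fortran
program rotate_order_sketch
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
  integer, parameter :: ntests = 3
  integer, parameter :: n = 2000000     ! assumed workload size
  integer :: shift, k, test
  integer(int64) :: t0, t1, rate
  real(real64) :: acc

  call system_clock(count_rate=rate)
  do shift = 0, ntests - 1
    print '(a,i0)', 'rotation ', shift
    do k = 0, ntests - 1
      ! Order is 1,2,3 then 2,3,1 then 3,1,2: the first test moves to the end.
      test = mod(k + shift, ntests) + 1
      acc = 0.0_real64
      call system_clock(t0)
      select case (test)
      case (1)
        call kernel_a(n, acc)
      case (2)
        call kernel_b(n, acc)
      case (3)
        call kernel_c(n, acc)
      end select
      call system_clock(t1)
      ! Printing acc keeps the kernel results in use.
      print '(a,i0,a,f10.6,a,es14.6)', '  test ', test, ' took ', &
            real(t1 - t0, real64) / real(rate, real64), ' s, acc = ', acc
    end do
  end do

contains

  subroutine kernel_a(m, s)
    integer, intent(in) :: m
    real(real64), intent(inout) :: s
    integer :: i
    do i = 1, m
      s = s + sin(real(i, real64))
    end do
  end subroutine

  subroutine kernel_b(m, s)
    integer, intent(in) :: m
    real(real64), intent(inout) :: s
    integer :: i
    do i = 1, m
      s = s + cos(real(i, real64))
    end do
  end subroutine

  subroutine kernel_c(m, s)
    integer, intent(in) :: m
    real(real64), intent(inout) :: s
    integer :: i
    do i = 1, m
      s = s + sqrt(real(i, real64))
    end do
  end subroutine

end program rotate_order_sketch
```

If a test's time changes noticeably depending on its position in the sequence, the measurement is picking up warm-up or ordering effects rather than the cost of the code under test.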

pbkenned1 · ‎12-10-2014

>>>Could you show the entire loop(s) for both af and sf (disassembly)

The 'sf' loop is at lines 20-22 in the source code. Complete disassembly is:

;;;      do i = 1, N
;;;        call sf%f(i,N,jasin)
;;;      end do
;;;    end do
$LN78:
$LN79:
0015b c7 84 24 34 02
        00 00 01 00 00
        00               mov DWORD PTR [564+rsp], 1             ;C:\U536538\sf\Main-sf-no-mpi.f90:20.6
$LN80:
00166 eb 08            jmp .B1.7 ; Prob 100%                  ;C:\U536538\sf\Main-sf-no-mpi.f90:20.6
$LN81:
                                ; LOE rbx rbp rsi rdi r13 r14 r15 r12d xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14
.B1.9::                         ; Preds .B1.8
$LN82:
00168 44 89 84 24 34
        02 00 00         mov DWORD PTR [564+rsp], r8d           ;C:\U536538\sf\Main-sf-no-mpi.f90:20.6
$LN83:
                                ; LOE rbx rbp rsi rdi r13 r14 r15 r12d xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14
.B1.7::                         ; Preds .B1.9 .B1.6
$LN84:
$LN85:
00170 4c 8d 8c 24 30
        02 00 00         lea r9, QWORD PTR [560+rsp]            ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
$LN86:
00178 4c 89 e9         mov rcx, r13                           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
$LN87:
0017b 48 89 ea         mov rdx, rbp                           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
$LN88:
0017e 4c 8d 05 3c 00
        00 00            lea r8, QWORD PTR [__NLITPACK_2.0.1]   ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
$LN89:
00185 4c 89 b4 24 b8
        01 00 00         mov QWORD PTR [440+rsp], r14           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
$LN90:
0018d 4c 89 b4 24 d0
        01 00 00         mov QWORD PTR [464+rsp], r14           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
$LN91:
00195 4c 89 bc 24 b0
        01 00 00         mov QWORD PTR [432+rsp], r15           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
$LN92:
0019d 4c 89 b4 24 c0
        01 00 00         mov QWORD PTR [448+rsp], r14           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
$LN93:
001a5 48 89 b4 24 e0
        01 00 00         mov QWORD PTR [480+rsp], rsi           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
$LN94:
001ad 48 89 bc 24 e8
        01 00 00         mov QWORD PTR [488+rsp], rdi           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
$LN95:
001b5 4c 89 b4 24 f0
        01 00 00         mov QWORD PTR [496+rsp], r14           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
$LN96:
001bd 4c 89 b4 24 00
        02 00 00         mov QWORD PTR [512+rsp], r14           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
$LN97:
001c5 4c 89 b4 24 10
        02 00 00         mov QWORD PTR [528+rsp], r14           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
$LN98:
001cd 4c 89 b4 24 08
        02 00 00         mov QWORD PTR [520+rsp], r14           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
$LN99:
001d5 4c 89 b4 24 f8
        01 00 00         mov QWORD PTR [504+rsp], r14           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
$LN100:
001dd 4c 89 b4 24 18
        02 00 00         mov QWORD PTR [536+rsp], r14           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
$LN101:
001e5 4c 89 b4 24 20
        02 00 00         mov QWORD PTR [544+rsp], r14           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
$LN102:
001ed 48 c7 84 24 c8
        01 00 00 03 00
        00 00            mov QWORD PTR [456+rsp], 3             ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
$LN103:
001f9 e8 fc ff ff ff   call TESTOOP_mp_FSTATIC                ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
$LN104:
                                ; LOE rbx rbp rsi rdi r13 r14 r15 r12d xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14
.B1.8::                         ; Preds .B1.7
$LN105:
001fe 44 8b 84 24 34
        02 00 00         mov r8d, DWORD PTR [564+rsp]           ;C:\U536538\sf\Main-sf-no-mpi.f90:22.6
$LN106:
00206 41 ff c0         inc r8d                                ;C:\U536538\sf\Main-sf-no-mpi.f90:22.6
$LN107:
00209 41 81 f8 20 4e
        00 00            cmp r8d, 20000                         ;C:\U536538\sf\Main-sf-no-mpi.f90:22.6
$LN108:
00210 0f 8e 52 ff ff
        ff               jle .B1.9 ; Prob 99%                   ;C:\U536538\sf\Main-sf-no-mpi.f90:22.6

The 'af' loop is at lines 21-23 in the source. Complete disassembly:

;;;      do i = 1, N
;;;      call af%f(i,N,jasin)
;;;      end do
;;;    end do
$LN58:
$LN59:
0012a c7 84 24 b4 00
        00 00 01 00 00
        00               mov DWORD PTR [180+rsp], 1             ;C:\U536538\af\Main-af-no-mpi.f90:21.6
$LN60:
00135 eb 07            jmp .B1.8 ; Prob 100%                  ;C:\U536538\af\Main-af-no-mpi.f90:21.6
$LN61:
                                ; LOE rbx rbp rsi rdi r12 r13 r14 r15d xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14
.B1.10::                        ; Preds .B1.9
$LN62:
00137 89 8c 24 b4 00
        00 00            mov DWORD PTR [180+rsp], ecx           ;C:\U536538\af\Main-af-no-mpi.f90:21.6
$LN63:
                                ; LOE rbx rbp rsi rdi r12 r13 r14 r15d xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14
.B1.8::                         ; Preds .B1.10 .B1.7
$LN64:
$LN65:
0013e 4c 8d 8c 24 b0
        00 00 00         lea r9, QWORD PTR [176+rsp]            ;C:\U536538\af\Main-af-no-mpi.f90:22.11
$LN66:
00146 4c 8b 15 38 00
        00 00            mov r10, QWORD PTR [MAIN$AF.0.1+56]    ;C:\U536538\af\Main-af-no-mpi.f90:22.11
$LN67:
0014d 48 8d 0d 00 00
        00 00            lea rcx, QWORD PTR [MAIN$AF.0.1]       ;C:\U536538\af\Main-af-no-mpi.f90:22.11
$LN68:
00154 48 89 ea         mov rdx, rbp                           ;C:\U536538\af\Main-af-no-mpi.f90:22.11
$LN69:
00157 4c 8d 05 6c 00
        00 00            lea r8, QWORD PTR [__NLITPACK_2.0.1]   ;C:\U536538\af\Main-af-no-mpi.f90:22.11
$LN70:
0015e 41 ff 12         call QWORD PTR [r10]                   ;C:\U536538\af\Main-af-no-mpi.f90:22.11
$LN71:
                                ; LOE rbx rbp rsi rdi r12 r13 r14 r15d xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14
.B1.9::                         ; Preds .B1.8
$LN72:
00161 8b 8c 24 b4 00
        00 00            mov ecx, DWORD PTR [180+rsp]           ;C:\U536538\af\Main-af-no-mpi.f90:23.6
$LN73:
00168 ff c1            inc ecx                                ;C:\U536538\af\Main-af-no-mpi.f90:23.6
$LN74:
0016a 81 f9 20 4e 00
        00               cmp ecx, 20000                         ;C:\U536538\af\Main-af-no-mpi.f90:23.6
$LN75:
00170 7e c5            jle .B1.10 ; Prob 99%                  ;C:\U536538\af\Main-af-no-mpi.f90:23.6

>>>Also when testing like this, make different runs with different loop sequence order....

I can't spend any more time investigating this issue. It doesn't appear to be of much interest to the community, and it's in the developer's court now. But I'll take your suggestion as a good BKM for investigating similar issues in future, thanks!

Patrick

virtual method faster than static method ?