Hi everyone. I've been profiling the attached code to measure the performance impact of using virtual (type-bound) procedures in Fortran 2003. As shown in the profile log, I found that the polymorphic variable "af" actually consumes less time than the static one "sf". Am I doing something wrong in this example? Or is it possible to get better performance with polymorphic variables in some cases?
-juan
Test data: compiler: ifort 13.1.1; compiler flags: -ipo -O3 -inline-forceinline; OS: Debian Linux; execution script: mpirun -np 1 main.out
Module TestOOP
  implicit none
  public

  type :: pointerfunction
    procedure(a), pointer :: f
  end type

  type, abstract :: abstractfunction
  contains
    procedure(b), deferred, public :: f
  end type

  type, extends(abstractfunction) :: concretefunction
  contains
    procedure, public :: f => fconcrete
  end type

  abstract interface
    pure subroutine a(this,i)
      import :: pointerfunction
      class(pointerfunction), intent(inout) :: this
      integer, intent(in) :: i
    end subroutine
    pure subroutine b(this,i)
      import :: abstractfunction
      class(abstractfunction), intent(inout) :: this
      integer, intent(in) :: i
    end subroutine
  end interface

  type staticfunction
  contains
    procedure, public :: f => fstatic
  end type

contains

  pure subroutine fpointer(this,i)
    implicit none
    class(pointerfunction), intent(inout) :: this
    integer, intent(in) :: i
  end subroutine

  pure subroutine fstatic(this,i)
    implicit none
    class(staticfunction), intent(inout) :: this
    integer,intent(in) :: i
  end subroutine

  pure subroutine fconcrete(this,i)
    implicit none
    class(concretefunction), intent(inout) :: this
    integer, intent(in) :: i
  end subroutine

End Module
Program Main
  use TestOOP
  use base_parallel_mod
  implicit none
  integer :: i, j
  integer, parameter :: N = 2000
  type(pointerfunction) :: pf
  type(staticfunction) :: sf
  class(abstractfunction), pointer :: af
  type(concretefunction) :: cf

  call pumainit() ! wrapper to mpi_initialize
  pf%f => fpointer
  allocate(concretefunction::af)

  do j = 1, N
    do i = 1, N
      call pf%f(i)
    end do
  end do

  do j = 1, N
    do i = 1, N
      call af%f(i)
    end do
  end do

  do j = 1, N
    do i = 1, N
      call cf%f(i)
    end do
  end do

  do j = 1, N
    do i = 1, N
      call sf%f(i)
    end do
  end do

  call pumaend() ! wrapper to mpi_finalize
End Program
We've seen plenty of "proofs" here that ended up with compiler optimization removing code whose results were never used. We've also seen "proofs" that failed to take into account "first touch" overhead, in other words assuming that the allocation is complete when new/allocate returns, rather than when the program first touches (reads or writes) each virtual memory page. That first touch is where the mapping of virtual address space to RAM and page file occurs, and it may also include wiping the RAM to zeros. This would not be the case in the above NOP subroutines.
Jim Dempsey
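To make the "first touch" point concrete, here is a minimal sketch, not code from this thread. It times the same fill routine twice on a freshly allocated array, so any difference between the two passes is mostly page-mapping cost. The names touch_mod, timed_fill and the array size n are assumptions chosen for illustration, and how large the gap is depends on the OS and allocator.

```fortran
module touch_mod
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
contains
  ! Writes every element of a and reports the wall-clock time taken.
  subroutine timed_fill(a, seconds)
    real(real64), intent(out) :: a(:)
    real(real64), intent(out) :: seconds
    integer(int64) :: t0, t1, rate
    integer :: i
    call system_clock(t0, rate)
    do i = 1, size(a)
      a(i) = real(i, real64)
    end do
    call system_clock(t1)
    seconds = real(t1 - t0, real64) / real(rate, real64)
  end subroutine timed_fill
end module touch_mod

program first_touch_demo
  use touch_mod
  implicit none
  integer, parameter :: n = 20000000   ! hypothetical size, about 160 MB
  real(real64), allocatable :: a(:)
  real(real64) :: t_first, t_second

  allocate(a(n))                  ! may return before any page is mapped
  call timed_fill(a, t_first)     ! first pass also pays for page mapping
  call timed_fill(a, t_second)    ! same stores, pages already mapped
  print *, 'first pass  (s):', t_first
  print *, 'second pass (s):', t_second
  print *, 'checksum:', a(1) + a(n)   ! use the data so the stores stay live
end program first_touch_demo
```

If the first pass is clearly slower than the second, a benchmark that starts its timer right after allocate is measuring the OS as well as the kernel.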
Reattaching the loop profile data.
Thanks for the report. I'm looking into this.
Patrick
Please provide the code for module base_parallel_mod. I tried commenting out the 'use base_parallel_mod', but the MPI wrapper functions are unresolved.
[U536538]$ ifort Main.o TestOOP.o
	Main.o: In function `MAIN__':
	Main.f90:(.text+0x38): undefined reference to `pumainit_'
	Main.f90:(.text+0x33c): undefined reference to `pumaend_'
Patrick
	 
I attach the code for the required module
module base_parallel_mod
  use mpi
  implicit none
  public
contains
  subroutine pumainit()
    integer :: ierr
    call mpi_init(ierr)
  end subroutine
  subroutine pumaend()
    integer :: ierr
    call mpi_finalize(ierr)
  end subroutine
end module
Sorry for the late response. Thanks for providing the module code. I'm looking into this now.
Patrick
The loop using polymorphic variable 'af' gets more than a 2x speed up on my test machine, compared to the loop using the staticfunction type. Can we just say "Hoorah! Polymorphism!" and leave it at that? All joking aside, I'll look into the reason.
Patrick
[U536538]$ ./TestOOP-ifort.x
	 pf loop took  2.076601982116699E-002
	 af loop took  9.101152420043945E-003
	 cf loop took  2.075791358947754E-002
	 sf loop took  2.075791358947754E-002
	[U536538]$
	 
Patrick, with your current code as a baseline, can you add some functionality to each of the subroutines so that the compiler will not optimize out the CALL and/or the loops?
Also, runs longer than 0.01 seconds are required for any valid timing.
I am sure you are aware that the compiler is quite smart in finding useless code, so you may have to insert a print or something of the results at the end.
Jim Dempsey
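As a rough illustration of that advice, the sketch below gives the called routine real work, accumulates a result across calls, and prints it after the timed loops. It is not the code used in this thread: kernel_mod, work and acc are hypothetical names, and the loop bound is only an example. A compiler may still transform the loops, so the generated assembly is worth checking.

```fortran
module kernel_mod
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
contains
  ! Hypothetical kernel: real arithmetic whose result escapes through acc,
  ! so the call has an observable effect.
  pure subroutine work(i, acc)
    integer, intent(in) :: i
    real(real64), intent(inout) :: acc
    acc = acc + sin(real(i, real64))
  end subroutine work
end module kernel_mod

program timing_demo
  use kernel_mod
  implicit none
  integer, parameter :: n = 20000      ! loop bound; raise it if the run is too short
  integer(int64) :: t0, t1, rate
  real(real64) :: acc
  integer :: i, j

  acc = 0.0_real64
  call system_clock(t0, rate)
  do j = 1, n
    do i = 1, n
      call work(i, acc)
    end do
  end do
  call system_clock(t1)

  print *, 'loop took (s):', real(t1 - t0, real64) / real(rate, real64)
  print *, 'acc =', acc               ! printing the result keeps the work live
end program timing_demo
```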
Jim, thanks for the feedback, point well taken. Indeed I may need to beef up the loop workloads by orders of magnitude to obtain valid results. I was a bit anxious to just get the code running, insert some timing, and see if I could quickly validate Juan's claim.
Patrick
It took me longer than I expected to analyze this, but the results are interesting and a bit unexpected.
First, I substantially beefed up the functions to add some real work, increased the loop upper bound to 20000, added some 'first touch' logic ahead of the kernel loops, and extracted the 'af' (polymorphic) case and 'sf' case (static) into separate test cases.
I consistently get at least an 8% speedup with the polymorphic case:
c:\ISN_Forums\U536538\DPD200364168\af>Main-af-no-mpi.exe
	 af loop took   4.84769248962402
	    jasin =  0.9465396
c:\ISN_Forums\U536538\DPD200364168\sf>Main-sf-no-mpi.exe
	 sf loop took   5.26771163940430
	    jasin =  0.9465396
c:\ISN_Forums\U536538\DPD200364168\sf>
Next, I ran a VTune analysis on each case. The summary of the hot spots of interest:
'af' case:
	   TESTOOP_mp_FCONCRETE   0.774 sec
	   MAIN                   0.261 sec
'sf' case:
	   MAIN                   0.895 sec
	   TESTOOP_mp_FSTATIC     0.636 sec
So while the implementation of the static function itself is actually faster compared to the polymorphic version, it's the calling overhead in MAIN for the static version that contributes most to the slowdown.
The MAIN routine for the 'sf' case was compiled with ifort -c Main-sf-no-mpi.f90 -Zi -O2
The generated code is rather cumbersome:
$LN21:
  00170 4c 8d 8c 24 30 02 00 00              lea r9, QWORD PTR [560+rsp]           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
  00178 4c 89 e9                             mov rcx, r13                          ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
  0017b 48 89 ea                             mov rdx, rbp                          ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
  0017e 4c 8d 05 3c 00 00 00                 lea r8, QWORD PTR [__NLITPACK_2.0.1]  ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
  00185 4c 89 b4 24 b8 01 00 00              mov QWORD PTR [440+rsp], r14          ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
  0018d 4c 89 b4 24 d0 01 00 00              mov QWORD PTR [464+rsp], r14          ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
  00195 4c 89 bc 24 b0 01 00 00              mov QWORD PTR [432+rsp], r15          ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
  0019d 4c 89 b4 24 c0 01 00 00              mov QWORD PTR [448+rsp], r14          ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
  001a5 48 89 b4 24 e0 01 00 00              mov QWORD PTR [480+rsp], rsi          ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
  001ad 48 89 bc 24 e8 01 00 00              mov QWORD PTR [488+rsp], rdi          ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
  001b5 4c 89 b4 24 f0 01 00 00              mov QWORD PTR [496+rsp], r14          ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
  001bd 4c 89 b4 24 00 02 00 00              mov QWORD PTR [512+rsp], r14          ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
  001c5 4c 89 b4 24 10 02 00 00              mov QWORD PTR [528+rsp], r14          ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
  001cd 4c 89 b4 24 08 02 00 00              mov QWORD PTR [520+rsp], r14          ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
  001d5 4c 89 b4 24 f8 01 00 00              mov QWORD PTR [504+rsp], r14          ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
  001dd 4c 89 b4 24 18 02 00 00              mov QWORD PTR [536+rsp], r14          ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
  001e5 4c 89 b4 24 20 02 00 00              mov QWORD PTR [544+rsp], r14          ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
  001ed 48 c7 84 24 c8 01 00 00 03 00 00 00  mov QWORD PTR [456+rsp], 3            ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
  001f9 e8 fc ff ff ff                       call TESTOOP_mp_FSTATIC
Compare to the 'af' case (same compiler options).
  0012a c7 84 24 b4 00 00 00 01 00 00 00  mov DWORD PTR [180+rsp], 1            ;C:\U536538\af\Main-af-no-mpi.f90:21.6
  00135 eb 07                             jmp .B1.8 ; Prob 100%                 ;C:\U536538\af\Main-af-no-mpi.f90:21.6
  00137 89 8c 24 b4 00 00 00              mov DWORD PTR [180+rsp], ecx          ;C:\U536538\af\Main-af-no-mpi.f90:21.6
  0013e 4c 8d 8c 24 b0 00 00 00           lea r9, QWORD PTR [176+rsp]           ;C:\U536538\af\Main-af-no-mpi.f90:22.11
  00146 4c 8b 15 38 00 00 00              mov r10, QWORD PTR [MAIN$AF.0.1+56]   ;C:\U536538\af\Main-af-no-mpi.f90:22.11
  0014d 48 8d 0d 00 00 00 00              lea rcx, QWORD PTR [MAIN$AF.0.1]      ;C:\U536538\af\Main-af-no-mpi.f90:22.11
  00154 48 89 ea                          mov rdx, rbp                          ;C:\U536538\af\Main-af-no-mpi.f90:22.11
  00157 4c 8d 05 6c 00 00 00              lea r8, QWORD PTR [__NLITPACK_2.0.1]  ;C:\U536538\af\Main-af-no-mpi.f90:22.11
  0015e 41 ff 12                          call QWORD PTR [r10]                  ;C:\U536538\af\Main-af-no-mpi.f90:22.11
I filed this as a feature request to improve the generated code for the static call case (tracking ID DPD200364168).
Patrick
Patrick,
Could you show the entire loop(s) for both af and sf (disassembly)?
The moves of r14 to the stack look to me like some once-only code (program initialization), with optimization having placed your actual loop elsewhere. Do you have call traceback enabled? If so, this could be pushing (writing) a signature frame onto the stack.
Also, when testing like this, make different runs with a different loop sequence order. After the first set of runs, move the first series of nested loops to the end (making that test last). Repeat the run, rotate the test loops again, and so on.
Jim Dempsey
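One possible way to automate the rotation idea is sketched below. It is a variation on the suggestion above: instead of editing the source between runs, the test order is rotated inside a single executable. Everything here (rotate_mod, rotated_id, run_test, the three placeholder workloads and the size n) is hypothetical and only stands in for the real af/sf/cf/pf loops.

```fortran
module rotate_mod
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
  integer, parameter :: n = 20000000   ! hypothetical work size per test
contains
  ! Maps slot k (0-based) of a pass to a test number after rotating by shift.
  pure integer function rotated_id(k, shift, ntests)
    integer, intent(in) :: k, shift, ntests
    rotated_id = mod(k + shift, ntests) + 1
  end function rotated_id

  ! Runs one placeholder workload and reports its wall-clock time and result.
  subroutine run_test(id, seconds, res)
    integer, intent(in) :: id
    real(real64), intent(out) :: seconds, res
    integer(int64) :: t0, t1, rate
    integer :: i
    res = 0.0_real64
    call system_clock(t0, rate)
    select case (id)
    case (1)
      do i = 1, n
        res = res + sin(real(i, real64))
      end do
    case (2)
      do i = 1, n
        res = res + cos(real(i, real64))
      end do
    case default
      do i = 1, n
        res = res + sqrt(real(i, real64))
      end do
    end select
    call system_clock(t1)
    seconds = real(t1 - t0, real64) / real(rate, real64)
  end subroutine run_test
end module rotate_mod

program rotate_order_demo
  use rotate_mod
  implicit none
  integer, parameter :: ntests = 3
  integer :: shift, k, id
  real(real64) :: seconds, res

  do shift = 0, ntests - 1            ! each pass starts with a different test
    print '(a,i0)', 'pass with rotation ', shift
    do k = 0, ntests - 1
      id = rotated_id(k, shift, ntests)
      call run_test(id, seconds, res)
      print '(a,i0,a,f8.4,a,es12.4)', '  test ', id, ': ', seconds, ' s, result ', res
    end do
  end do
end program rotate_order_demo
```

If a test's time changes noticeably with its position in the pass, the measurement is picking up ordering effects (warm-up, caches, page mapping) rather than the cost of the call itself.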
>>>Could you show the entire loop(s) for both af and sf (disassembly)
The 'sf' loop is at lines 20-22 in the source code. Complete disassembly is:
;;;      do i = 1, N
	;;;        call sf%f(i,N,jasin)
	;;;      end do
	;;;    end do
	$LN78:
	$LN79:
	  0015b c7 84 24 34 02
	        00 00 01 00 00
	        00               mov DWORD PTR [564+rsp], 1             ;C:\U536538\sf\Main-sf-no-mpi.f90:20.6
	$LN80:
	  00166 eb 08            jmp .B1.7 ; Prob 100%                  ;C:\U536538\sf\Main-sf-no-mpi.f90:20.6
	$LN81:
	                                ; LOE rbx rbp rsi rdi r13 r14 r15 r12d xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14
	.B1.9::                         ; Preds .B1.8
	$LN82:
	  00168 44 89 84 24 34
	        02 00 00         mov DWORD PTR [564+rsp], r8d           ;C:\U536538\sf\Main-sf-no-mpi.f90:20.6
	$LN83:
	                                ; LOE rbx rbp rsi rdi r13 r14 r15 r12d xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14
	.B1.7::                         ; Preds .B1.9 .B1.6
	$LN84:
	$LN85:
	  00170 4c 8d 8c 24 30
	        02 00 00         lea r9, QWORD PTR [560+rsp]            ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
	$LN86:
	  00178 4c 89 e9         mov rcx, r13                           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
	$LN87:
	  0017b 48 89 ea         mov rdx, rbp                           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
	$LN88:
	  0017e 4c 8d 05 3c 00
	        00 00            lea r8, QWORD PTR [__NLITPACK_2.0.1]   ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
	$LN89:
	  00185 4c 89 b4 24 b8
	        01 00 00         mov QWORD PTR [440+rsp], r14           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
	$LN90:
	  0018d 4c 89 b4 24 d0
	        01 00 00         mov QWORD PTR [464+rsp], r14           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
	$LN91:
	  00195 4c 89 bc 24 b0
	        01 00 00         mov QWORD PTR [432+rsp], r15           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
	$LN92:
	  0019d 4c 89 b4 24 c0
	        01 00 00         mov QWORD PTR [448+rsp], r14           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
	$LN93:
	  001a5 48 89 b4 24 e0
	        01 00 00         mov QWORD PTR [480+rsp], rsi           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
	$LN94:
	  001ad 48 89 bc 24 e8
	        01 00 00         mov QWORD PTR [488+rsp], rdi           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
	$LN95:
	  001b5 4c 89 b4 24 f0
	        01 00 00         mov QWORD PTR [496+rsp], r14           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
	$LN96:
	  001bd 4c 89 b4 24 00
	        02 00 00         mov QWORD PTR [512+rsp], r14           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
	$LN97:
	  001c5 4c 89 b4 24 10
	        02 00 00         mov QWORD PTR [528+rsp], r14           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
	$LN98:
	  001cd 4c 89 b4 24 08
	        02 00 00         mov QWORD PTR [520+rsp], r14           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
	$LN99:
	  001d5 4c 89 b4 24 f8
	        01 00 00         mov QWORD PTR [504+rsp], r14           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
	$LN100:
	  001dd 4c 89 b4 24 18
	        02 00 00         mov QWORD PTR [536+rsp], r14           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
	$LN101:
	  001e5 4c 89 b4 24 20
	        02 00 00         mov QWORD PTR [544+rsp], r14           ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
	$LN102:
	  001ed 48 c7 84 24 c8
	        01 00 00 03 00
	        00 00            mov QWORD PTR [456+rsp], 3             ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
	$LN103:
	  001f9 e8 fc ff ff ff   call TESTOOP_mp_FSTATIC                ;C:\U536538\sf\Main-sf-no-mpi.f90:21.13
	$LN104:
	                                ; LOE rbx rbp rsi rdi r13 r14 r15 r12d xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14
	.B1.8::                         ; Preds .B1.7
	$LN105:
	  001fe 44 8b 84 24 34
	        02 00 00         mov r8d, DWORD PTR [564+rsp]           ;C:\U536538\sf\Main-sf-no-mpi.f90:22.6
	$LN106:
	  00206 41 ff c0         inc r8d                                ;C:\U536538\sf\Main-sf-no-mpi.f90:22.6
	$LN107:
	  00209 41 81 f8 20 4e
	        00 00            cmp r8d, 20000                         ;C:\U536538\sf\Main-sf-no-mpi.f90:22.6
	$LN108:
	  00210 0f 8e 52 ff ff
	        ff               jle .B1.9 ; Prob 99%                   ;C:\U536538\sf\Main-sf-no-mpi.f90:22.6
The 'af' loop is at lines 21-23 in the source. Complete disassembly:
;;;      do i = 1, N
	;;;      call af%f(i,N,jasin)
	;;;      end do
	;;;    end do
	$LN58:
	$LN59:
	  0012a c7 84 24 b4 00
	        00 00 01 00 00
	        00               mov DWORD PTR [180+rsp], 1             ;C:\U536538\af\Main-af-no-mpi.f90:21.6
	$LN60:
	  00135 eb 07            jmp .B1.8 ; Prob 100%                  ;C:\U536538\af\Main-af-no-mpi.f90:21.6
	$LN61:
	                                ; LOE rbx rbp rsi rdi r12 r13 r14 r15d xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14
	.B1.10::                        ; Preds .B1.9
	$LN62:
	  00137 89 8c 24 b4 00
	        00 00            mov DWORD PTR [180+rsp], ecx           ;C:\U536538\af\Main-af-no-mpi.f90:21.6
	$LN63:
	                                ; LOE rbx rbp rsi rdi r12 r13 r14 r15d xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14
	.B1.8::                         ; Preds .B1.10 .B1.7
	$LN64:
	$LN65:
	  0013e 4c 8d 8c 24 b0
	        00 00 00         lea r9, QWORD PTR [176+rsp]            ;C:\U536538\af\Main-af-no-mpi.f90:22.11
	$LN66:
	  00146 4c 8b 15 38 00
	        00 00            mov r10, QWORD PTR [MAIN$AF.0.1+56]    ;C:\U536538\af\Main-af-no-mpi.f90:22.11
	$LN67:
	  0014d 48 8d 0d 00 00
	        00 00            lea rcx, QWORD PTR [MAIN$AF.0.1]       ;C:\U536538\af\Main-af-no-mpi.f90:22.11
	$LN68:
	  00154 48 89 ea         mov rdx, rbp                           ;C:\U536538\af\Main-af-no-mpi.f90:22.11
	$LN69:
	  00157 4c 8d 05 6c 00
	        00 00            lea r8, QWORD PTR [__NLITPACK_2.0.1]   ;C:\U536538\af\Main-af-no-mpi.f90:22.11
	$LN70:
	  0015e 41 ff 12         call QWORD PTR [r10]                   ;C:\U536538\af\Main-af-no-mpi.f90:22.11
	$LN71:
	                                ; LOE rbx rbp rsi rdi r12 r13 r14 r15d xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14
	.B1.9::                         ; Preds .B1.8
	$LN72:
	  00161 8b 8c 24 b4 00
	        00 00            mov ecx, DWORD PTR [180+rsp]           ;C:\U536538\af\Main-af-no-mpi.f90:23.6
	$LN73:
	  00168 ff c1            inc ecx                                ;C:\U536538\af\Main-af-no-mpi.f90:23.6
	$LN74:
	  0016a 81 f9 20 4e 00
	        00               cmp ecx, 20000                         ;C:\U536538\af\Main-af-no-mpi.f90:23.6
	$LN75:
	  00170 7e c5            jle .B1.10 ; Prob 99%                  ;C:\U536538\af\Main-af-no-mpi.f90:23.6
>>>Also when testing like this, make different runs with different loop sequence order....
I can't spend any more time investigating this issue. It doesn't appear to be of much interest to the community, and it's in the developer's court now. But I'll take your suggestion as a good BKM for investigating similar issues in future, thanks!
Patrick
 
					
				
				
			
		