Intel® Fortran Compiler

overhead when using subroutine - how to use inline

Lars_Jakobsen
Beginner
1,058 Views

Hi all,

I wanted to test whether calling a subroutine carries overhead compared with writing the same code in place. My impression was that for small procedures the compiler inlines automatically, so that a subroutine or function call costs nothing extra. However, my test example shows otherwise (or perhaps I am not using the right compiler settings). The test is this (see also the attached source):

do i=1,nLoop
  call MySubroutine(Var1,Var2,...Var20)
end do

vs

do i=1,nLoop
  Var1 = Var1 + 1.5
  Var2 = Var2 + 1.5
  ...
  Var20 = Var20 + 1.5
end do

where MySubroutine performs the same calculations as the second loop. The code that is not placed in a subroutine takes only about 50% of the time of the code that calls the subroutine.
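A minimal sketch of the comparison described above, not the attached source. It assumes four variables instead of twenty, and the names `scalar_update_mod`, `compare_call_overhead`, and the loop count are made up for illustration. Each loop works on its own variables and the results are printed, so neither loop can be discarded as unused; an optimizing compiler may still simplify such simple loops.

```fortran
module scalar_update_mod
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Same updates as the in-place loop, but behind a call with scalar arguments.
  subroutine MySubroutine(v1, v2, v3, v4)
    real(dp), intent(inout) :: v1, v2, v3, v4
    v1 = v1 + 1.5_dp
    v2 = v2 + 1.5_dp
    v3 = v3 + 1.5_dp
    v4 = v4 + 1.5_dp
  end subroutine MySubroutine
end module scalar_update_mod

program compare_call_overhead
  use scalar_update_mod
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  integer, parameter :: nLoop = 10000000   ! hypothetical loop count
  real(dp) :: a1, a2, a3, a4               ! updated through the subroutine
  real(dp) :: b1, b2, b3, b4               ! updated in place
  integer(int64) :: t0, t1, t2, rate
  integer :: i

  a1 = 0.0_dp; a2 = 0.0_dp; a3 = 0.0_dp; a4 = 0.0_dp
  b1 = 0.0_dp; b2 = 0.0_dp; b3 = 0.0_dp; b4 = 0.0_dp

  call system_clock(t0, rate)
  do i = 1, nLoop
    call MySubroutine(a1, a2, a3, a4)
  end do
  call system_clock(t1)
  do i = 1, nLoop
    b1 = b1 + 1.5_dp
    b2 = b2 + 1.5_dp
    b3 = b3 + 1.5_dp
    b4 = b4 + 1.5_dp
  end do
  call system_clock(t2)

  ! Both variants must produce the same values.
  if (a1 /= b1 .or. a4 /= b4) error stop 'variants disagree'

  print '(a,f8.3,a,f14.1)', 'Subroutine         = ', &
        real(t1 - t0, dp) / real(rate, dp), ' s, Var1 = ', a1
  print '(a,f8.3,a,f14.1)', 'Without subroutine = ', &
        real(t2 - t1, dp) / real(rate, dp), ' s, Var1 = ', b1
end program compare_call_overhead
```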

I have tried /Ob2, /Qinline-forceinline, /Qipo and other settings, but the difference in run time remains.

Is this general behaviour, and is the lesson that you should not place code in subroutines if you want the fastest code, or am I doing something wrong?

I am using Intel(R) Visual Fortran Compiler XE 13.0.0.089 [IA-32]

Regards

Lars

9 Replies
mecej4
Honored Contributor III
Your test is deficient in one regard: when you call the subroutine you pass it twenty arguments where a single array argument vec(1:20) comprising vec1,...,vec20 would do. Making that change, compiling with /fast and running on a laptop with an i7-2720QM CPU, I get

Subroutine                    = 2.31 2.32
Without subroutine            = 2.29 2.28
L2Norm in seperate subroutine = 0.98 0.98
L2Norm inline                 = 0.97 0.98

The conclusion I would draw is that subroutine/function argument lists should be kept as short as possible, rather than that subroutines should be avoided altogether.
This becomes even more important if the ABI involves passing arguments through registers rather than on the stack. On AMD/Intel X64, for example, it is impossible to pass twenty double-precision reals through the XMM registers. The code to perform a read-modify-write cycle on the twenty arguments would consist of a substantial part that simply moves values between memory and the register file.
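The single-array-argument idea can be sketched as follows. This is only an illustration, assuming an explicit-shape array of length 20; the names `array_update_mod`, `MySubroutineVec`, and the loop count are hypothetical and do not come from the attached source.

```fortran
module array_update_mod
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nVars = 20
contains
  ! One explicit-shape array argument replaces twenty scalar arguments.
  subroutine MySubroutineVec(vec)
    real(dp), intent(inout) :: vec(nVars)
    vec = vec + 1.5_dp
  end subroutine MySubroutineVec
end module array_update_mod

program array_argument_demo
  use array_update_mod
  implicit none
  integer, parameter :: nLoop = 1000   ! hypothetical loop count
  real(dp) :: vec(nVars)               ! updated through the subroutine
  real(dp) :: ref(nVars)               ! updated in place, for comparison
  integer :: i

  vec = 0.0_dp
  ref = 0.0_dp

  do i = 1, nLoop
    call MySubroutineVec(vec)
  end do
  do i = 1, nLoop
    ref = ref + 1.5_dp
  end do

  ! Both variants must produce the same values.
  if (any(vec /= ref)) error stop 'array version disagrees with in-place loop'
  print '(a,f10.1)', 'vec(1) after the loop = ', vec(1)
end program array_argument_demo
```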
Lars_Jakobsen
Beginner
Hi mecej4,

Thanks for the reply. I intentionally wanted to test with a large number of values and not an array (or derived type for that matter); however, as you show, there might be some advantage in using arrays. I also see a difference in L2Norm of a factor of two when I run it (and there is almost no difference in your run). Did you change anything in the call to L2Norm?

Edit: I recompiled using the /fast compiler option and now my results are:

Subroutine                    = 8.98 8.75
Without subroutine            = 7.25 7.05
L2Norm in seperate subroutine = 2.38 2.30
L2Norm inline                 = 1.02 0.98

Regards
Lars
TimP
Honored Contributor III
When you are depending on in-lining for loop optimization, it may be helpful to use internal subroutines, or follow the ancient principle of pushing the inner loop inside the subroutine. If you've looked at the ifort documentation, there are so many options with ATTRIBUTES and command line options to raise threshold limits that you may well conclude that the old methods are preferable.
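Both "old methods" can be sketched in one small program. This is an illustration under assumptions, not code from the thread: the names `add_once` and `add_many` and the sizes are hypothetical. Variant 1 keeps the loop in the caller and calls an internal subroutine, whose body is visible in the same program unit; variant 2 pushes the loop inside the subroutine so that only one call is made.

```fortran
program loop_placement_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nVars = 20, nLoop = 1000   ! hypothetical sizes
  real(dp) :: a(nVars), b(nVars)
  integer :: i

  a = 0.0_dp
  b = 0.0_dp

  ! Variant 1: loop in the caller, one call per iteration to an internal subroutine.
  do i = 1, nLoop
    call add_once(a)
  end do

  ! Variant 2: loop pushed inside the subroutine, a single call in total.
  call add_many(b, nLoop)

  ! Both variants must produce the same values.
  if (any(a /= b)) error stop 'variants disagree'
  print '(a,f10.1)', 'a(1) = b(1) = ', a(1)

contains

  ! Internal subroutine: dp and nVars are available by host association.
  subroutine add_once(vec)
    real(dp), intent(inout) :: vec(nVars)
    vec = vec + 1.5_dp
  end subroutine add_once

  ! Same update, with the repeat loop inside the subroutine.
  subroutine add_many(vec, nRepeat)
    real(dp), intent(inout) :: vec(nVars)
    integer, intent(in) :: nRepeat
    integer :: k
    do k = 1, nRepeat
      vec = vec + 1.5_dp
    end do
  end subroutine add_many

end program loop_placement_demo
```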
jimdempseyatthecove
Honored Contributor III
Lars,

I think some of the optimizations are not performed unless you enable IPO (Interprocedural Optimization). This has to be enabled in both the compiler and the linker. If this improves your situation, then please report back so others reading this thread can be informed.

Jim Dempsey
jimdempseyatthecove
Honored Contributor III
Also consider:

do i=1,nLoop
  call MySubroutine((/Var1,var2,...Var20/))
end do
...
subroutine MySubroutine( args )
  real :: args(20)
  ...

.OR.

do i=1,nLoop
  argsOfYourType = YourType(iVar,fVar,dVar,'TextVal'...)
  call MySubroutine(argsOfYourType )
end do

Jim Dempsey
mecej4
Honored Contributor III
"did you change anything in the call to L2Norm?"

I merely changed Var = L2Norm(vec) to Var(1) = L2Norm(vec).
Lars_Jakobsen
Beginner
Hi Jim,

Thank you for the replies. I tried to use /Qipo in both the compiler and the linker, but this does not change my results. I think, however, that my original test case is flawed in two respects:

1) In the code that I posted initially, the value computed by my function L2Norm was not used for anything. When I rewrote the code to print the value to the screen, I got almost the same time consumption for the subroutine and for the manually "inlined" code.

2) The order in which I do the computations also seems to influence the results, i.e. I interchanged the loop that calls MySubroutine and the loop that does the same thing without the subroutine, and my results changed. I then changed the test so that different variables are used for each test case (previously the same variables were reused) and now the results are more consistent.

My latest results are

O V E R A L L   T I M E   L O G
                                WC-time  CPU-time
Subroutine                    =   11.75     11.73
Without subroutine            =   11.73     11.72
L2Norm inline                 =    2.61      2.61
L2Norm in seperate subroutine =    3.05      3.05

There is still a small difference between computing the L2Norm "inline" and using a subroutine (in the above approximately 17%).

Regards
Lars
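The two fixes described in this post can be sketched as below: every computed norm is accumulated and printed, so the work cannot be removed as unused, and each test case has its own data. This is a rough illustration, not the attached project; the module and program names, the array length, and the loop count are assumptions. Wall-clock and CPU times are taken with the standard `system_clock` and `cpu_time` intrinsics, mirroring the two columns of the time log.

```fortran
module l2norm_timing_mod
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  function L2Norm(a)
    real(dp), dimension(:), intent(in) :: a
    real(dp) :: L2Norm
    L2Norm = sqrt(dot_product(a, a))
  end function L2Norm
end module l2norm_timing_mod

program consume_result_demo
  use l2norm_timing_mod
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  integer, parameter :: n = 1000, nLoop = 100000   ! hypothetical sizes
  real(dp) :: vecCall(n), vecInline(n)             ! separate data per test case
  real(dp) :: sumCall, sumInline                   ! accumulated results
  real(dp) :: cpu0, cpu1, cpu2
  integer(int64) :: wc0, wc1, wc2, rate
  integer :: i

  vecCall = 1.0_dp
  vecInline = 1.0_dp
  sumCall = 0.0_dp
  sumInline = 0.0_dp

  call system_clock(wc0, rate)
  call cpu_time(cpu0)
  do i = 1, nLoop
    vecCall(1) = real(i, dp)                  ! vary the input each iteration
    sumCall = sumCall + L2Norm(vecCall)       ! result is used, not discarded
  end do
  call system_clock(wc1)
  call cpu_time(cpu1)
  do i = 1, nLoop
    vecInline(1) = real(i, dp)
    sumInline = sumInline + sqrt(dot_product(vecInline, vecInline))
  end do
  call system_clock(wc2)
  call cpu_time(cpu2)

  ! Printing the sums forces both loops to be carried out.
  print '(a,2f8.2,a,es16.8)', 'L2Norm in separate subroutine = ', &
        real(wc1 - wc0, dp) / real(rate, dp), cpu1 - cpu0, '  sum = ', sumCall
  print '(a,2f8.2,a,es16.8)', 'L2Norm inline                 = ', &
        real(wc2 - wc1, dp) / real(rate, dp), cpu2 - cpu1, '  sum = ', sumInline
end program consume_result_demo
```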
jimdempseyatthecove
Honored Contributor III
Lars,

I downloaded your project and had an issue with MS VS:

1>ipo: error #11034: Il version for C:\Downloads\speedtestsubroutinecall\SpeedTestSubroutineCall\Release\Subroutine.obj (216458) does not match compiler's il version (213490), please regenerate

I created a new solution file and this fixed the issue.

Noticing that L2Norm uses assumed shape to pass the array, I took the liberty of creating L2NormN:

[fortran]
! pass arg dimension(:)
function L2Norm(a)
  real(8), dimension(:), intent(in) :: a
  real(8) :: L2Norm
  L2Norm = sqrt(dot_product(a,a))
end function L2Norm

! add n, pass arg dimension(n)
function L2NormN(a, n)
  integer :: n
  real(8), dimension(n), intent(in) :: a
  real(8) :: L2NormN
  L2NormN = sqrt(dot_product(a,a))
end function L2NormN
[/fortran]

Running the test (x64 Release):

Subroutine                     = 11.21 11.20
Without subroutine             =  5.75  5.74
L2Norm in seperate subroutine  =  6.85  6.85
L2NormN in seperate subroutine =  5.90  5.90
L2Norm inline                  =  5.90  5.91

You can see that L2NormN in a separate subroutine runs at the same speed as the inline L2Norm. The difference lies in how you pass the arguments.
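For reference, a self-contained sketch that calls both variants. Placing them in a module gives them the explicit interface that an assumed-shape dummy argument requires. The third function, `L2NormC`, is a hypothetical addition that is not part of the thread: it uses the Fortran 2008 CONTIGUOUS attribute, which tells the compiler that the assumed-shape array has unit stride. Whether that closes the gap measured above is an assumption and would need to be timed with the compiler at hand.

```fortran
module l2norm_variants_mod
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Assumed shape: the array arrives with a descriptor and may be strided.
  function L2Norm(a)
    real(dp), dimension(:), intent(in) :: a
    real(dp) :: L2Norm
    L2Norm = sqrt(dot_product(a, a))
  end function L2Norm

  ! Explicit shape: the length is passed separately and the data is contiguous.
  function L2NormN(a, n)
    integer, intent(in) :: n
    real(dp), dimension(n), intent(in) :: a
    real(dp) :: L2NormN
    L2NormN = sqrt(dot_product(a, a))
  end function L2NormN

  ! Hypothetical variant: assumed shape declared CONTIGUOUS (Fortran 2008).
  function L2NormC(a)
    real(dp), dimension(:), contiguous, intent(in) :: a
    real(dp) :: L2NormC
    L2NormC = sqrt(dot_product(a, a))
  end function L2NormC
end module l2norm_variants_mod

program l2norm_variants_demo
  use l2norm_variants_mod
  implicit none
  real(dp) :: v(4)

  v = [1.0_dp, 2.0_dp, 2.0_dp, 4.0_dp]   ! 1 + 4 + 4 + 16 = 25, so the norm is 5

  ! All three variants compute the same value; only the argument passing differs.
  print '(3f8.3)', L2Norm(v), L2NormN(v, size(v)), L2NormC(v)
end program l2norm_variants_demo
```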
Lars_Jakobsen
Beginner
Hi Jim,

So in this case the overhead is due to the assumed shape rather than an issue of inlining the function.

Thanks
Lars