Intel® Fortran Compiler

subroutine with many arguments - how to ensure maximum performance?

gregfi04
Beginner
638 Views

Hello,

I have a performance-critical piece of code that currently sits in a nested loop, e.g.:

do kk=1,k_max
  do jj=1,j_max
    do ii=1,i_max
      function_with_11_inputs_4_outputs
    enddo
  enddo
enddo

For readability, I'd like this to appear in a separate subroutine. (This gets rid of all of the array indices and makes things look much cleaner.) Assuming that I do this, to achieve maximum performance, do I need to create a subroutine with 15 arguments and only operate on the direct inputs and outputs? Or, can I do something like the following and count on the compiler to get rid of the intermediate steps?

real :: inputs(11), outputs(4)

do kk=1,k_max
  do jj=1,j_max
    do ii=1,i_max
      inputs(1) = varA
      inputs(2) = varB
      inputs(3) = varC
      ...

      call function(inputs, outputs)   ! "function" stands in for the extracted subroutine

      varX = outputs(1)
      varY = outputs(2)
      varZ = outputs(3)
      ...
    enddo
  enddo
enddo

(Presumably, there would be a similar translation inside the function itself.) Alternatively, is there any other "cleaner" way to pass a lot of arguments to a subroutine efficiently and still achieve maximum performance?

Thanks,
Greg

4 Replies
TimP
Honored Contributor III
The traditional way to get maximum performance is to push enough of the inner loops inside the function that the call overhead is amortized and the compiler can optimize the loop body as a whole. Otherwise, you depend on inlining, which ifort attempts by default when permitted, subject to size limits that many options can adjust. A vectorization report showing that the loops vectorized is an excellent first step. Current CPU generations are less susceptible than past ones to the fill buffer thrashing associated with passing too many arguments. If your cases are marginal, you will likely need to profile the actual cases under VTune.
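
As a rough illustration of pushing the loops inside the routine, here is a minimal sketch. The module name kernels_mod, the routine update_field, the arrays a, b and x, and the arithmetic are all hypothetical stand-ins rather than the original poster's code; only two inputs and one output are shown instead of 11 and 4.

```fortran
module kernels_mod
  implicit none
contains
  ! Hypothetical kernel: the whole loop nest lives inside the routine,
  ! so the call overhead is paid once rather than once per grid point.
  subroutine update_field(a, b, x)
    real, intent(in)  :: a(:,:,:), b(:,:,:)
    real, intent(out) :: x(:,:,:)
    integer :: ii, jj, kk

    do kk = 1, size(a, 3)
      do jj = 1, size(a, 2)
        do ii = 1, size(a, 1)      ! unit-stride innermost loop
          x(ii,jj,kk) = 2.0*a(ii,jj,kk) + b(ii,jj,kk)   ! placeholder arithmetic
        end do
      end do
    end do
  end subroutine update_field
end module kernels_mod

program demo_loops_inside
  use kernels_mod, only: update_field
  implicit none
  integer, parameter :: i_max = 8, j_max = 4, k_max = 2
  real :: a(i_max,j_max,k_max), b(i_max,j_max,k_max), x(i_max,j_max,k_max)

  a = 1.0
  b = 3.0
  call update_field(a, b, x)      ! one call for the whole grid
  if (any(x /= 5.0)) error stop "unexpected result"
  print *, "ok"
end program demo_loops_inside
```

With this layout the argument list still names every array, but the array indices stay inside the routine, and the compiler sees the complete loop nest without relying on inlining.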
jimdempseyatthecove
Honored Contributor III

There was a similar query here in which a user-defined type was tried. It yielded lower performance.

Each argument in the list "costs", on call, an LEA and a PUSH (load effective address, then push that address on the stack). This is highly efficient. Copying the argument into an array first "costs" the equivalent of an LEA plus a read plus a write plus a PUSH. In effect, passing the argument directly saves a read and a write, and it avoids potential cache line evictions.

On the receiving end (called subroutine) the costs are about the same.

When the call returns, copying the outputs back out of the intermediate array is additional overhead before they can be used.

It looks like passing the args directly is the faster way to go, unless the args are already in a user-defined type (in which case, pass a reference to the type).

Jim Dempsey
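
To make the two cases above concrete, here is a minimal sketch, assuming three inputs and two outputs stand in for the 11 and 4 in the question. The names point_kernel_mod, cell_t, kernel_args and kernel_type, and the arithmetic, are hypothetical.

```fortran
module point_kernel_mod
  implicit none

  ! Hypothetical derived type, relevant only if the data already lives in one.
  type :: cell_t
    real :: a, b, c      ! inputs
    real :: x, y         ! outputs
  end type cell_t
contains
  ! Direct form: array elements are passed as actual arguments
  ! (typically by reference), with no staging copies.
  pure subroutine kernel_args(a, b, c, x, y)
    real, intent(in)  :: a, b, c
    real, intent(out) :: x, y
    x = a + b*c          ! placeholder arithmetic
    y = a - b*c
  end subroutine kernel_args

  ! Type form: a single object is passed; nothing is copied in or out.
  pure subroutine kernel_type(cell)
    type(cell_t), intent(inout) :: cell
    cell%x = cell%a + cell%b*cell%c
    cell%y = cell%a - cell%b*cell%c
  end subroutine kernel_type
end module point_kernel_mod

program demo_direct_args
  use point_kernel_mod
  implicit none
  integer, parameter :: n = 4
  real :: a(n), b(n), c(n), x(n), y(n)
  type(cell_t) :: cell
  integer :: ii

  a = 1.0; b = 2.0; c = 3.0
  do ii = 1, n
    call kernel_args(a(ii), b(ii), c(ii), x(ii), y(ii))
  end do
  if (any(x /= 7.0) .or. any(y /= -5.0)) error stop "kernel_args mismatch"

  cell = cell_t(1.0, 2.0, 3.0, 0.0, 0.0)
  call kernel_type(cell)
  if (cell%x /= 7.0 .or. cell%y /= -5.0) error stop "kernel_type mismatch"
  print *, "ok"
end program demo_direct_args
```

Placing the routines in a module gives the caller an explicit interface, which is also what gives the compiler a chance to inline the call.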

Roman1
New Contributor I

Can you declare all the variables inside a MODULE, and then USE this module inside the subroutines? You would not have to pass any arguments to the subroutine. This might improve performance.

Roman
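
A minimal sketch of the module approach, under the assumption that the kernel works on scalar values; state_mod, kernel_noargs and the arithmetic are hypothetical, and varA, varB and varX borrow the placeholder names from the question.

```fortran
module state_mod
  implicit none
  ! Hypothetical shared state: the kernel reads and writes these directly.
  real :: varA, varB
  real :: varX
contains
  subroutine kernel_noargs()
    varX = varA*varB     ! placeholder arithmetic
  end subroutine kernel_noargs
end module state_mod

program demo_module_vars
  use state_mod
  implicit none
  integer, parameter :: n = 4
  real :: a(n), b(n), x(n)
  integer :: ii

  a = 2.0
  b = [1.0, 2.0, 3.0, 4.0]
  do ii = 1, n
    varA = a(ii)         ! per-point values still have to be staged
    varB = b(ii)         ! into the module variables before the call
    call kernel_noargs()
    x(ii) = varX
  end do
  if (any(x /= [2.0, 4.0, 6.0, 8.0])) error stop "mismatch"
  print *, "ok"
end program demo_module_vars
```

Note that when the per-point values come from arrays, as in this sketch, the module variables have to be filled before each call, so whether this beats passing arguments directly would need to be measured.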

 

jimdempseyatthecove
Honored Contributor III

Roman

Threadprivate variables are an additional option to investigate.

Jim Dempsey
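
For reference, a minimal sketch of module variables made threadprivate with OpenMP. The names tp_state_mod and kernel_tp and the arithmetic are hypothetical. It assumes the code is built with OpenMP enabled (for example -qopenmp with the Intel compilers or -fopenmp with gfortran); without it the directives are treated as comments and the loop runs serially.

```fortran
module tp_state_mod
  implicit none
  ! Hypothetical per-thread state: each thread gets its own copy.
  real, save :: varA, varX
  !$omp threadprivate(varA, varX)
contains
  subroutine kernel_tp()
    varX = 2.0*varA      ! placeholder arithmetic
  end subroutine kernel_tp
end module tp_state_mod

program demo_threadprivate
  use tp_state_mod
  implicit none
  integer, parameter :: n = 1000
  real :: a(n), x(n)
  integer :: ii

  do ii = 1, n
    a(ii) = real(ii)
  end do

  !$omp parallel do
  do ii = 1, n
    varA = a(ii)         ! written and read by the same thread only
    call kernel_tp()
    x(ii) = varX
  end do
  !$omp end parallel do

  if (any(x /= 2.0*a)) error stop "mismatch"
  print *, "ok"
end program demo_threadprivate
```

Without the threadprivate directive, all threads would share varA and varX, and the parallel loop would contain a data race.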
