Solved: inlined subroutine still slow

Guanfeng_Z_ · ‎04-06-2018

Hi everyone,

We have some legacy F77 code containing many math functions, such as matrix and/or vector multiplication, vector copies, and initialization of vectors and matrices. All of this F77 code is optimized (for example, by unrolling).

From the optimization report, I can see that these functions are inlined and the operations are all vectorized (estimated potential speedup: about 1.6).

However, if I replace these F77 function calls with F90 code, for example (a matrix-vector product here)

c(:) = matmul(a(:,:),b(:))

I save about 50% of the time spent in these matrix and vector operations.

Does this mean I still have overhead even though these functions are inlined?

Could anyone give me an explanation and suggestions on how to optimize this code? Thank you in advance!

jimdempseyatthecove · ‎04-07-2018

Inlining removes the function call overhead, including passing arguments on the stack and/or in registers and any saving/restoring of registers on the stack. For a function such as matrix multiply, you are comparing your F77/F90 implementation against the code called by the newer compiler (principally Intel's MKL). For anything other than small matrices, MKL will likely be much faster than anything you can write.

By the way, the MKL call is not inlined.
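To make that comparison concrete, here is a minimal sketch that times an F77-style explicit-loop matrix-vector product against the MATMUL intrinsic and checks that both give the same result. The program name, the size n, and the repetition count nrep are hypothetical choices for illustration, not values from the original code. The timings depend entirely on compiler, flags, and problem size, so treat the numbers as something to measure, not something to expect.

```fortran
program matvec_compare
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
  integer, parameter :: n = 1000     ! hypothetical problem size
  integer, parameter :: nrep = 200   ! hypothetical repetition count
  real(real64), allocatable :: a(:,:), b(:), c_loop(:), c_mm(:)
  integer(int64) :: t0, t1, rate
  integer :: i, j, k

  allocate(a(n,n), b(n), c_loop(n), c_mm(n))
  call random_number(a)
  call random_number(b)

  ! F77-style matrix-vector product written as explicit loops
  call system_clock(t0, rate)
  do k = 1, nrep
     b(1) = real(k, real64)          ! change the input so each pass does real work
     c_loop = 0.0_real64
     do j = 1, n
        do i = 1, n                  ! inner loop runs down a column (unit stride)
           c_loop(i) = c_loop(i) + a(i,j) * b(j)
        end do
     end do
  end do
  call system_clock(t1)
  print '(a,f8.3,a)', 'explicit loops: ', real(t1 - t0, real64) / real(rate, real64), ' s'

  ! Same operation through the F90 intrinsic
  call system_clock(t0)
  do k = 1, nrep
     b(1) = real(k, real64)
     c_mm = matmul(a, b)
  end do
  call system_clock(t1)
  print '(a,f8.3,a)', 'matmul:         ', real(t1 - t0, real64) / real(rate, real64), ' s'

  ! Both versions should agree up to rounding error
  if (maxval(abs(c_loop - c_mm)) > 1.0e-9_real64 * maxval(abs(c_mm))) then
     error stop 'results differ'
  end if
  print *, 'max abs difference:', maxval(abs(c_loop - c_mm))
end program matvec_compare
```

Building this at the same optimization level as the real application and varying n should show where, if anywhere, the library-backed MATMUL starts to pull ahead of the inlined loops.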

Jim Dempsey

Guanfeng_Z_ · ‎04-09-2018

Thanks for your reply, Jim.

So matmul uses MKL.

Is the following code (vector multiplication) also computed using MKL?

c(i) = sum((a(:,i) * b(:)))

Thanks,

GZ

jimdempseyatthecove · ‎04-10-2018

Don't know. You can find out by running VTune on the Release version (with debug symbols).

You should be able to see MKL references (assuming the compute load in MKL is large enough to be sampled by VTune).

Bottom-Up should be able to show the call stack.

Jim Dempsey

TimP · ‎04-10-2018

Besides what Jim said, you could use nm to see whether you have linked MKL. For most purposes, sum(a*b) should be equivalent to dot_product(a,b), but it's not obvious what the requirements for automatic MKL substitution might be. Since you are able to change the source, I think writing MATMUL explicitly and using ifort's opt-matmul option (-qopt-matmul; included in -O3) would be best (gfortran has an equivalent).
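As a rough illustration of "writing MATMUL explicitly" for the loop in question, the sketch below computes c(i) three ways: with sum(a(:,i)*b(:)) as posted, with DOT_PRODUCT, and with a single vector-matrix MATMUL over all columns. The program name and the sizes n and m are hypothetical. It only checks that the three forms agree numerically; whether any of them ends up in a library call still has to be verified with the tools mentioned in this thread.

```fortran
program dot_forms
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  integer, parameter :: n = 200, m = 100   ! hypothetical sizes
  real(real64) :: a(n,m), b(n)
  real(real64) :: c_sum(m), c_dot(m), c_mm(m)
  real(real64) :: tol
  integer :: i

  call random_number(a)
  call random_number(b)

  do i = 1, m
     c_sum(i) = sum(a(:,i) * b(:))        ! form used in the question
     c_dot(i) = dot_product(a(:,i), b)    ! equivalent intrinsic for real data
  end do

  ! The whole loop as one vector-matrix product:
  ! matmul(b, a) yields c(i) = sum over k of b(k) * a(k,i)
  c_mm = matmul(b, a)

  ! All three forms should agree up to rounding error
  tol = 1.0e-10_real64 * maxval(abs(c_sum))
  if (maxval(abs(c_sum - c_dot)) > tol) error stop 'sum and dot_product differ'
  if (maxval(abs(c_sum - c_mm))  > tol) error stop 'sum and matmul differ'
  print *, 'all three forms agree'
end program dot_forms
```

The MATMUL form hands the compiler the entire operation at once instead of one column per iteration, which is the shape a matmul substitution option can recognize.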

jimdempseyatthecove · ‎04-10-2018

TimP,

A link dependency on MKL only indicates that MKL is linked into the application. It does not indicate whether

c(i) = sum((a(:,i) * b(:)))

calls MKL.

VTune is one way to get this information (as indicated in #4). Another way is to set a debugger breakpoint at the statement (which may be difficult with full optimizations) and then use the Disassembly window.

Jim Dempsey

Steve_Lionel · ‎04-10-2018

As far as I know, the only thing the compiler calls into MKL on its own for is MATMUL (when certain optimizations are enabled.) But I'll admit that my knowledge here is a bit stale.

Guanfeng_Z_ · ‎04-11-2018

Thank you all for the detailed information!

GZ