
dkokron · ‎11-10-2016

This is a follow-up to the now-closed topic

https://software.intel.com/en-us/forums/intel-fortran-compiler-for-linux-and-mac-os-x/topic/270998

I'm seeing the same issue with the 2017 initial release on Linux. Can anyone tell me how to get the compiler to inline my dense_sse_mul?

icc (ICC) 17.0.0 20160721
Intel(R) Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 17.0.0.098 Build 20160721

The underlying gcc is
gcc (GCC) 4.7.0

Text from ipo_out.optrpt

-> (NOFORCE): (141,12) dense_sse_mul (isz = 22) (sz = 31)
[[ Unable to inline callsite <1>]]

The dense_sse_mul symbol is declared as

void inline dense_sse_mul(const double* A, const double* B, double* C) {
__asm__ __volatile__(" .........................

and is compiled with

icc -c -O3 -xSSE4.2 -ipo -restrict -DNDEBUG sse_5_5_5_DP.c..

The main routine is Fortran, contains an interface for dense_sse_mul, and is compiled with

ifort -O3 -ipo -inline-factor=1000 -align array32byte

    INTERFACE
      SUBROUTINE dense_sse_mul(a, b, mm) BIND(C)
        !dir$ attributes forceinline :: dense_sse_mul
        USE, INTRINSIC :: ISO_C_BINDING, ONLY: C_DOUBLE
        IMPLICIT NONE
        real(C_DOUBLE), dimension(5,5), intent(in   ) :: a
        real(C_DOUBLE), dimension(5,5), intent(in   ) :: b
        real(C_DOUBLE), dimension(5,5), intent(  out) :: mm
      END SUBROUTINE dense_sse_mul
    END INTERFACE

jimdempseyatthecove · ‎11-10-2016

I do not believe Fortran IPO is capable of inlining C/C++ code, with or without embedded assembler.

Have you checked the assembler output of writing your inlineable subroutine in Fortran (and placed in module)?

Also, for Fortran, consider aligning your arrays and declaring the dummy arguments, in a module, with their alignment requirements. I see you use -align array32byte, but this applies to arrays you declare, and not necessarily to array dummies declared in subroutines.
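The alignment point has a rough C-side analogue: a routine that receives only a pointer cannot tell from the type whether the caller's array is 32-byte aligned. A minimal sketch of a runtime check follows. The helper name is_aligned is hypothetical; it does not come from this thread or from any library.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the thread or any library).
   Returns 1 if p is aligned to `alignment` bytes, 0 otherwise.
   Assumes `alignment` is a nonzero power of two. */
static int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p & (alignment - 1)) == 0;
}
```

A check like is_aligned(A, 32) at the call site, during debugging, would show whether the arrays actually passed to the kernel have the alignment the compile flag is expected to give them.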

Also, since you are targeting SSE instead of AVX, consider dimensioning your arrays as (6,6) and zero-filling the extra row and column...
or use (6,5) for a and b, and (5,6) for mm.
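To illustrate the (6,6) padding idea: an SSE register holds two doubles, so a column of 6 splits evenly into register-sized pieces while a column of 5 leaves a remainder. The sketch below, in C with column-major storage to match Fortran, copies a 5x5 matrix into a zero-filled 6x6 buffer and uses a plain triple loop as the multiply. The function names are hypothetical and the loop is only a stand-in, not the libxsmm kernel.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper names; not part of libxsmm or the original post. */

/* Copy a column-major 5x5 matrix into a zero-filled column-major 6x6 buffer. */
static void pad_5x5_to_6x6(const double src[25], double dst[36])
{
    memset(dst, 0, 36 * sizeof(double));
    for (size_t col = 0; col < 5; ++col)
        memcpy(&dst[col * 6], &src[col * 5], 5 * sizeof(double));
}

/* C = A * B for column-major n x n matrices with leading dimension n. */
static void matmul_colmajor(size_t n, const double *a, const double *b, double *c)
{
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < n; ++i) {
            double sum = 0.0;
            for (size_t k = 0; k < n; ++k)
                sum += a[i + k * n] * b[k + j * n];  /* A(i,k) * B(k,j) */
            c[i + j * n] = sum;
        }
    }
}
```

Because the extra row and column are zero, the upper-left 5x5 block of the padded product equals the unpadded product, and the extra row and column of the result stay zero.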

I recommend that you not target SSE, as AVX and AVX2 are now more prevalent.

Jim Dempsey

dkokron · ‎11-10-2016

Jim,

Thanks for the suggestions. I have tested an AVX version of the dense_sse_mul routine and found it to be slightly slower than the SSE version, probably due to the slower clock used when the AVX pipes are active. Just to clarify, the underlying code for dense_*_mul is coming from the libxsmm code generator. I'm aware that padding and alignment can impact performance, but I was going to leave those to another round of optimizing.

I'm not exactly sure what "Have you checked the assembler output of writing your inlineable subroutine in Fortran (and placed in module)?" means.

The disassembly does show calls to the dense_*_mul routines.

objdump -D a.out | grep dense
404124:   e8 a7 1f 00 00          callq 4060d0 <dense_sse_mul>
40420c:   e8 2f 1a 00 00          callq 405c40 <dense_avx_mul>
0000000000405c40 <dense_avx_mul>:
405dac:   0f 8c d2 fe ff ff       jl     405c84 <dense_avx_mul+0x44>
405ecb:   0f 8c e1 fe ff ff       jl     405db2 <dense_avx_mul+0x172>
405ee1:   0f 8c 92 fd ff ff       jl     405c79 <dense_avx_mul+0x39>
405fca:   0f 8c 22 ff ff ff       jl     405ef2 <dense_avx_mul+0x2b2>
40609e:   0f 8c 2c ff ff ff       jl     405fd0 <dense_avx_mul+0x390>
4060b4:   0f 8c 2d fe ff ff       jl     405ee7 <dense_avx_mul+0x2a7>
00000000004060d0 <dense_sse_mul>:
406356:   0f 8c b8 fd ff ff       jl     406114 <dense_sse_mul+0x44>
40648a:   0f 8c cc fe ff ff       jl     40635c <dense_sse_mul+0x28c>
4064a0:   0f 8c 63 fc ff ff       jl     406109 <dense_sse_mul+0x39>
406640:   0f 8c 6b fe ff ff       jl     4064b1 <dense_sse_mul+0x3e1>
406722:   0f 8c 1e ff ff ff       jl     406646 <dense_sse_mul+0x576>
406738:   0f 8c 68 fd ff ff       jl     4064a6 <dense_sse_mul+0x3d6>

jimdempseyatthecove · ‎11-11-2016

>>I'm not exactly sure what "Have you checked the assembler output of writing your inlineable subroutine in Fortran (and placed in module)?" means.

SUBROUTINE dense_Fortran_mul(a, b, mm)
   !dir$ attributes forceinline :: dense_Fortran_mul
   USE, INTRINSIC :: ISO_C_BINDING, ONLY: C_DOUBLE
   IMPLICIT NONE
   real(C_DOUBLE), dimension(5,5), intent(in   ) :: a
   real(C_DOUBLE), dimension(5,5), intent(in   ) :: b
   real(C_DOUBLE), dimension(5,5), intent(  out) :: mm
   mm = matmul(a,b)
END SUBROUTINE dense_Fortran_mul

Build targeting either SSE or AVX/AVX2, then modify for use with aligned arrays.

The better route would be to use MATMUL directly in the application.
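For comparison on the C side, the same idea can be sketched as a plain-C kernel with no inline assembly, which leaves the compiler free to vectorize and inline it within C code. The name dense_ref_mul is hypothetical; the argument order follows the dense_sse_mul declaration quoted earlier, and column-major 5x5 storage is assumed to match the Fortran interface.

```c
/* Hypothetical reference kernel; the name dense_ref_mul is not from the thread.
   Computes C = A * B for 5x5 column-major matrices (Fortran storage order).
   The restrict qualifiers assume c does not overlap a or b. */
static inline void dense_ref_mul(const double *restrict a,
                                 const double *restrict b,
                                 double *restrict c)
{
    enum { N = 5 };
    for (int j = 0; j < N; ++j) {
        for (int i = 0; i < N; ++i) {
            double sum = 0.0;
            for (int k = 0; k < N; ++k)
                sum += a[i + k * N] * b[k + j * N];  /* A(i,k) * B(k,j) */
            c[i + j * N] = sum;
        }
    }
}
```

A kernel like this can also serve as a correctness check for the generated assembly version: run both on the same inputs and compare the results.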

Jim Dempsey

jimdempseyatthecove · ‎11-11-2016

BTW your C/C++ code is not showing the entry and exit point code.

Jim Dempsey
