
Marcin_K_ · ‎01-24-2018

I am having trouble understanding the output of ICC (18.0.1.163) when generating vectorized functions with pragma omp declare simd. Consider the following simple code for a vectorized pow10 function:

#include <math.h>
#pragma omp declare simd simdlen(4)
double pow10v(double x)
{
  return exp(2.3025850929940459*x);
}

I compile this for an AVX2 capable CPU:

icc -std=c++11 -qopenmp -xCORE-AVX2 -O3 -qopt-report-phase=vec -qopt-report=5 -c micro.c -o micro.o

The compiler generates two vectorized functions (masked and nonmasked). The vectorization report for the nonmasked version says that XMM registers are used, which I confirmed by looking at the assembly code:

Begin optimization report for: pow10v..xN4v(double)

    Report from: Vector optimizations [vec]

remark #15347: FUNCTION WAS VECTORIZED with xmm, simdlen=4, unmasked, formal parameter types: (vector) 
remark #15305: vectorization support: vector length 4
remark #15475: --- begin vector cost summary ---
remark #15482: vectorized math library calls: 1 
remark #15488: --- end vector cost summary ---
===========================================================================

_ZGVxN4v_pow10v:
# parameter 1: %xmm0
# parameter 2: %xmm1
[...]
        vinsertf128 $1, %xmm1, %ymm0, %ymm2                     #5.1
        vmulpd    .L_2il0floatpacket.0(%rip), %ymm2, %ymm0      #6.33
        call      *__svml_exp4_l9@GOTPCREL(%rip)                #6.10
                                # LOE rbx r12 r13 r14 r15 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15 ymm0
                                # Execution count [1.00e+00]
        vextractf128 $1, %ymm0, %xmm1                           #6.10
        vzeroupper                                              #6.10
[...]

So it seems that the argument is passed to the SVML exp4 routine in an AVX (YMM) register, but the generated function itself takes its parameters in two XMM registers and then reassembles them into a YMM register.

Looking at the Vector ABI specification, _ZGVxN4v_pow10v denotes an SSE function. First, this is not entirely correct, since the function uses AVX instructions and calls an AVX-enabled exp implementation. But then why does ICC not generate the (IMO requested) AVX version in the first place?

Can somebody give me a hint as to what I am doing wrong?

Thanks a lot!

Marcin_K_ · ‎01-24-2018

OK, I did RTFM now. From the document at https://software.intel.com/en-us/articles/vector-simd-function-abi:

For other IA-32 and Intel®64 processors, when the processor clause is not specified, the default target
processor is the “pentium_4” for Windows* and Linux*, and the “pentium_4_sse3” for macOS*; the ISA class
for those target processors is “XMM”. The only way to affect ISA class selection is through the processor
clause. The command line processor flag has no impact on ISA class selection for the vector function ABI.

So for others who may end up having this problem, the processor clause (an Intel extension) is required in the omp declare simd pragma.

Not really sure why this is so. It makes life harder when one wants to write GCC-compatible code.

AVX and omp simd vectorization of functions