Section 5.15.12 of the

TimP · ‎06-23-2014

I've been trying to understand what the implicit_index intrinsic may be intended for. It's tricky to get adequate performance from it, and apparently not possible in some of the more obvious contexts (unless the goal is only to get a positive vectorization report).

It seems to be competitive for the usage of setting up an identity matrix.
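For readers unfamiliar with the idiom, here is a minimal plain C sketch of what the identity-matrix setup amounts to. The function name set_identity, the float element type, and the flat row-major layout are assumptions for illustration, and the array-notation form quoted in the comment is only my guess at how the idiom is usually written, not code from the benchmark.

```c
#include <stddef.h>

/* Plain C equivalent of what is presumably the array-notation idiom
 *   a[0:n][0:n] = (__sec_implicit_index(0) == __sec_implicit_index(1));
 * Hypothetical helper: the name, element type, and row-major layout
 * are assumptions, not taken from the original benchmark. */
void set_identity(float *a, size_t n)
{
    for (size_t i = 0; i < n; ++i)          /* plays the role of implicit index 0 */
        for (size_t j = 0; j < n; ++j)      /* plays the role of implicit index 1 */
            a[i * n + j] = (i == j) ? 1.0f : 0.0f;
}
```

The two loop counters stand in for the two implicit indices, so the array notation saves the nested for() and nothing else.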

In the context of dividing its result by 2, different treatments are required on MIC and host:

#ifdef __MIC__
a[2:i__2-1] = b[2:i__2-1] + c[((unsigned)__sec_implicit_index(0)>>1)+1] * d__[2:i__2-1];
#else
a[2:i__2-1] = b[2:i__2-1] + c[__sec_implicit_index(0)/2+1] * d__[2:i__2-1];
#endif

That is, the unsigned right shift is several times as fast as the divide on MIC (and not much slower than plain C code), while the signed divide by 2 is up to 60% faster on host (but not as fast as C code).
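As a rough sketch of what the two variants compute, the plain C loops below spell out the array section a[2:len], with the loop counter k standing in for __sec_implicit_index(0), which counts from 0 within the section. The function names, the float element type, and the parameter len (corresponding to i__2-1) are assumptions for illustration.

```c
#include <stdint.h>

/* Host variant: signed divide of the 64-bit index by 2. */
void update_signed(float *a, const float *b, const float *c,
                   const float *d, int len)
{
    for (intptr_t k = 0; k < len; ++k)
        a[2 + k] = b[2 + k] + c[k / 2 + 1] * d[2 + k];
}

/* MIC variant: narrow the index to 32-bit unsigned, then shift right. */
void update_unsigned(float *a, const float *b, const float *c,
                     const float *d, int len)
{
    for (intptr_t k = 0; k < len; ++k)
        a[2 + k] = b[2 + k] + c[((unsigned)k >> 1) + 1] * d[2 + k];
}
```

Because k is never negative and stays well below 2^32 here, both forms select the same element of c; they differ only in the instructions the compiler can generate.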

The only advantage in it seems to be the elimination of a for(), if in fact that is considered to be an advantage.

I didn't see it documented anywhere that it is of int data type, although the opt-report shows it as such. I can't see how it could be anything other than a non-negative integer, so the (unsigned) cast seems valid. I guess >>1U would have the same effect with less space taken up compared with (unsigned). The notation is already cryptic from my point of view.

ARCH_R_Intel · ‎06-24-2014

Section 5.15.12 of the current Cilk Plus specification documents __sec_implicit_index as returning an intptr_t. That's a 64-bit integer type on MIC, which lacks support for vectorizing 64-bit integer arithmetic. (Well, the compiler "vectorizes" it, but through a remarkably circuitous sequence of instructions.) The cast to unsigned gets it down to 32 bits, for which there is more direct vectorization support.

TimP · ‎06-24-2014

Thanks for the useful hint. I updated the benchmarks (Fortran, C, C++, Cilk(tm) Plus) at https://github.com/tprince/lcd after finding that all my uses of (int)__sec_implicit_index work well on all targets without target-dependent variations. It's fairly well known that MIC (like ia32 mode) doesn't have full native int64_t instruction-level support, but I hadn't guessed this would be an issue for implicit_index. Even on host, an implicit cast from unsigned long long to float is ridiculously slow.
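To illustrate the (int) narrowing in plain C terms, the sketch below fills an array with its own index values, converting the 64-bit counter to int before the conversion to float. The function name fill_ramp is hypothetical, and the array-notation form in the comment is an assumption about how the same thing would be written, not a line from the benchmarks.

```c
#include <stdint.h>

/* Plain C equivalent of what would presumably be written as
 *   a[0:n] = (float)(int)__sec_implicit_index(0);
 * Hypothetical helper. The (int) cast narrows the 64-bit index
 * first, so the int-to-float conversion works on 32-bit values. */
void fill_ramp(float *a, int n)
{
    for (intptr_t k = 0; k < n; ++k)
        a[k] = (float)(int)k;
}
```

The narrowing is value-preserving as long as the section length fits in an int, which is the case for any array a 32-bit index can address.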