Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Can icc generate permutation instructions without using intrinsics?

BradleyKuszmaul
Beginner
Is there any way I can get icc to generate the permutation instructions without using intrinsics?

For example, icc -xAVX vectorizes the block in f1, but not the one in f2:

[cpp]struct d4 {
    double d[4] __attribute__((aligned(32)));
};

void f1 (struct d4 *a, struct d4 *b, struct d4 *__restrict__ c) {
    c[0].d[0]+=a[0].d[0]*b[0].d[0];
    c[0].d[1]+=a[0].d[1]*b[0].d[1];
    c[0].d[2]+=a[0].d[2]*b[0].d[2];
    c[0].d[3]+=a[0].d[3]*b[0].d[3];
}

void f2 (struct d4 *a, struct d4 *b, struct d4 *__restrict__ c) {
    c[0].d[0]+=a[0].d[1]*b[0].d[0];
    c[0].d[1]+=a[0].d[0]*b[0].d[1];
    c[0].d[2]+=a[0].d[3]*b[0].d[2];
    c[0].d[3]+=a[0].d[2]*b[0].d[3];
}[/cpp]
f1 comes out as
[plain]        vmovupd   (%rdi), %ymm0          #33.26
        vmulpd    (%rsi), %ymm0, %ymm1   #33.26
        vaddpd    (%rdx), %ymm1, %ymm2   #33.5
        vmovupd   %ymm2, (%rdx)          #33.5[/plain]
And I'd like f2 to come out as
[plain]        vmovupd   (%rdi), %ymm0          #33.26
        vpermilpd $0x5, %ymm0, %ymm0
        vmulpd    (%rsi), %ymm0, %ymm1   #33.26
        vaddpd    (%rdx), %ymm1, %ymm2   #33.5
        vmovupd   %ymm2, (%rdx)          #33.5[/plain]
but I cannot figure out how to get the compiler to vectorize it that way. I can do it with intrinsics, as follows, but I'd rather avoid intrinsics, since I'm trying to explain this to someone without first having to teach them the intrinsics.

[cpp]void f3 (struct d4 *a, struct d4 *b, struct d4 *__restrict__ c, int n) {
    __m256d *av = (__m256d*)a;
    __m256d *bv = (__m256d*)b;
    __m256d *cv = (__m256d*)c;
    *cv = _mm256_add_pd(*cv, _mm256_mul_pd(_mm256_permute_pd(*av, 5), *bv));
}[/cpp]
I'd appreciate any help from people who have experience convincing icc to generate these instructions from ordinary C code with no intrinsics.

-Bradley


Georg_Z_Intel
Employee
Hello,

Unfortunately, there's no way to force the compiler to (auto-)vectorize using certain instructions. The only way to do so is to either use intrinsics or write the assembly manually.

The reason our compiler does not use packed/AVX instructions (like the permutations from your example) here is that the result would be slower. I verified that by creating the assembly manually and measuring the execution time with "rdtsc"; there's roughly a 30% difference in runtime.
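For reference, here is a minimal sketch of that kind of "rdtsc" measurement (illustrative only, not the exact harness used above; warm-up and serialization are omitted for brevity):

[cpp]/* Illustrative rdtsc timing harness. f2() and struct d4 are the ones from
   the question and are assumed to be compiled in a separate translation
   unit so the calls are not optimized away. */
#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc(); icc on Linux accepts this header too */

struct d4 { double d[4] __attribute__((aligned(32))); };
void f2 (struct d4 *a, struct d4 *b, struct d4 *__restrict__ c);

int main(void)
{
    struct d4 a = {{1.0, 2.0, 3.0, 4.0}};
    struct d4 b = {{5.0, 6.0, 7.0, 8.0}};
    struct d4 c = {{0.0, 0.0, 0.0, 0.0}};
    const long iters = 100000000L;

    unsigned long long t0 = __rdtsc();
    for (long i = 0; i < iters; i++)
        f2(&a, &b, &c);
    unsigned long long t1 = __rdtsc();

    printf("~%.2f cycles per call\n", (double)(t1 - t0) / (double)iters);
    return 0;
}[/cpp]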

The reason for this is that the execution units of the CPU are best utilized with the code the compiler generates now (software pipelining). The throughput is higher than for the packed version, and the interleaved load/operate/store pattern lets the computations start earlier, keeping in mind the super-scalar, out-of-order architecture we have.

Neither compact code nor packed instructions are a guarantee of best performance per se. Our compiler has lots of heuristics to find the best instruction sequences; some of them might be the opposite of what one would expect, but they are worth it.

We appreciate your feedback and kindly ask everyone to let us know about inefficient patterns you might encounter.

Thank you & best regards,

Georg Zitzlsberger

P.S.: This is what we currently generate for "f2(...)":
[plain]f2:
        vmovsd    8(%rdi), %xmm0            #13.16
        vmovsd    (%rdi), %xmm3             #14.16
        vmovsd    24(%rdi), %xmm6           #15.16
        vmovsd    16(%rdi), %xmm9           #16.16
        vmulsd    (%rsi), %xmm0, %xmm1      #13.26
        vmulsd    8(%rsi), %xmm3, %xmm4     #14.26
        vmulsd    16(%rsi), %xmm6, %xmm7    #15.26
        vaddsd    (%rdx), %xmm1, %xmm2      #13.5
        vmulsd    24(%rsi), %xmm9, %xmm10   #16.26
        vaddsd    8(%rdx), %xmm4, %xmm5     #14.5
        vaddsd    16(%rdx), %xmm7, %xmm8    #15.5
        vaddsd    24(%rdx), %xmm10, %xmm11  #16.5
        vmovsd    %xmm2, (%rdx)             #13.5
        vmovsd    %xmm5, 8(%rdx)            #14.5
        vmovsd    %xmm8, 16(%rdx)           #15.5
        vmovsd    %xmm11, 24(%rdx)          #16.5
        ret                                 #17.1[/plain]
The software pipelining pattern is clearly visible; the code:
  • loads the %xmm[0|3|6|9] registers independently,
  • performs the multiplications (vmulsd) on each of those %xmm[0|3|6|9] registers, and
  • because of this independence, can execute in an interleaved way
    (e.g. while %xmm[3|6|9] are still being loaded, the vmulsd using %xmm0 can already be executed by another execution unit).
Georg_Z_Intel
Employee
Hello,

I'd like to add that using Intel Cilk Plus Array Notations here can provide faster code for you:
[cpp]struct d4 {
    double d[4] __attribute__((aligned(32)));
};

void f1 (struct d4 *a, struct d4 *b, struct d4 *__restrict__ c) {
    c[0].d[0]+=a[0].d[0]*b[0].d[0];
    c[0].d[1]+=a[0].d[1]*b[0].d[1];
    c[0].d[2]+=a[0].d[2]*b[0].d[2];
    c[0].d[3]+=a[0].d[3]*b[0].d[3];
}

void f1_cilk (struct d4 *a, struct d4 *b, struct d4 *__restrict__ c) {
    c[0].d[:]+=a[0].d[:]*b[0].d[:];
}

void f2 (struct d4 *a, struct d4 *b, struct d4 *__restrict__ c) {
    c[0].d[0]+=a[0].d[1]*b[0].d[0];
    c[0].d[1]+=a[0].d[0]*b[0].d[1];
    c[0].d[2]+=a[0].d[3]*b[0].d[2];
    c[0].d[3]+=a[0].d[2]*b[0].d[3];
}

void f2_cilk (struct d4 *a, struct d4 *b, struct d4 *__restrict__ c) {
    unsigned int perm[4] = {1, 0, 3, 2};
    c[0].d[:]+=a[0].d[perm[:]]*b[0].d[:];
}[/cpp]
The *_cilk versions are making use of the Array Notations. Both "f1(...)" and "f1_cilk(...)" produce the same assembly. However, the assembly produced for "f2_cilk(...)" is more efficient (~2% on my system; yours might differ) than the one for "f2(...)".

The assembly created for "f2_cilk(...)" is this:
[plain]f2_cilk:
        vmovsd      8(%rdi), %xmm0
        vmovsd      24(%rdi), %xmm1
        vmovhpd     (%rdi), %xmm0, %xmm2
        vmovhpd     16(%rdi), %xmm1, %xmm3
        vinsertf128 $1, %xmm3, %ymm2, %ymm4
        vmulpd      (%rsi), %ymm4, %ymm5
        vaddpd      (%rdx), %ymm5, %ymm6
        vmovupd     %ymm6, (%rdx)
        vzeroupper
        ret[/plain]
It's still not using the permutation operation, though.

In any case, using the Array Notations gives you certainty that the underlying code is vectorized.
A nice side-effect is that the implementations are much easier to read now.

Best regards,

Georg Zitzlsberger


TimP
Honored Contributor III
Cilk+ array notation apparently implies -ansi-alias, __restrict__, and #pragma vector always, so it gives the compiler extra shots at finding optimizations. A few compiler experts consider it a bug if the compiler can't optimize equivalent plain C for() loop code with the aid of those options. In this case, introducing the perm[] vector should help the compiler with plain C as well as with Cilk+ code (see the sketch at the end of this post).
Array notation is like a foot in the door toward having the compiler require and take advantage of the standard compliance aspects of -ansi-alias.
The compiler likely would not use the AVX-256 instructions if it did not have the __attribute__((aligned(32))) qualifier. The code may be valid regardless of alignment, but could be much slower on Sandy Bridge than AVX-128 code if it were not aligned. The compiler never uses vmovapd even though it would be valid for the cases where the compiler chooses vmovupd %ymm...
From what I've seen so far, the compiler doesn't choose AVX-256 over AVX-128 on account of Ivy Bridge compile options.
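As an illustration of that perm[] idea in plain C, here is a minimal sketch (the function name is made up, struct d4 is the aligned type from the question, and icc is not guaranteed to turn this into a vpermilpd):

[cpp]/* Sketch only: plain-C version of f2 written with an index array, in the
   spirit of the perm[] suggestion above. Whether icc vectorizes this with
   a permute instruction depends on the compiler version and options. */
struct d4 { double d[4] __attribute__((aligned(32))); };

void f2_perm (struct d4 *a, struct d4 *b, struct d4 *__restrict__ c) {
    static const unsigned int perm[4] = {1, 0, 3, 2};
    int i;
    for (i = 0; i < 4; i++)
        c[0].d[i] += a[0].d[perm[i]]*b[0].d[i];
}[/cpp]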
BradleyKuszmaul
Beginner
Are there any cases where the compiler *does* produce these vector permutation instructions? I'm hacking on more complex code than this, and I'd rather not have to write intrinsics for the permutations.
Georg_Z_Intel
Employee
Hello,

There are; for example:
[bash]void perm(double * __restrict__ dp, double *sp, int n)
{
    int i;
    __assume_aligned(dp, 32);
    __assume_aligned(sp, 32);
    for(i = 0; i < n; i++){
        dp[2 * i]     = sp[2 * i + 1];
        dp[2 * i + 1] = sp[2 * i];
    }
}[/bash]
...produces this for the loop (-xAVX):

[bash]..B2.4:                         # Preds ..B2.4 ..B2.3
        lea       (%rcx,%rcx), %r8d                     #17.21
        addl      $16, %ecx                             #16.5
        movslq    %r8d, %r8                             #18.21
        vmovupd   (%rsi,%r8,8), %ymm0                   #12.6
        vmovupd   32(%rsi,%r8,8), %ymm1                 #12.6
        vmovupd   64(%rsi,%r8,8), %ymm4                 #12.6
        vmovupd   96(%rsi,%r8,8), %ymm5                 #12.6
        vperm2f128 $32, %ymm1, %ymm0, %ymm2             #17.21
        vperm2f128 $49, %ymm1, %ymm0, %ymm3             #17.21
        vunpcklpd %ymm3, %ymm2, %ymm9                   #17.21
        vunpckhpd %ymm3, %ymm2, %ymm8                   #17.21
        vperm2f128 $32, %ymm5, %ymm4, %ymm6             #17.21
        vperm2f128 $49, %ymm5, %ymm4, %ymm7             #17.21
        vmovupd   128(%rsi,%r8,8), %ymm4                #12.6
        vmovupd   160(%rsi,%r8,8), %ymm5                #12.6
        vunpcklpd %ymm9, %ymm8, %ymm10                  #18.9
        vunpckhpd %ymm9, %ymm8, %ymm11                  #18.9
        vmovupd   192(%rsi,%r8,8), %ymm8                #12.6
        vmovupd   224(%rsi,%r8,8), %ymm9                #12.6
        vunpcklpd %ymm7, %ymm6, %ymm15                  #17.21
        vunpckhpd %ymm7, %ymm6, %ymm14                  #17.21
        vperm2f128 $32, %ymm11, %ymm10, %ymm12          #18.9
        vperm2f128 $49, %ymm11, %ymm10, %ymm13          #18.9
        vunpcklpd %ymm15, %ymm14, %ymm0                 #18.9
        vunpckhpd %ymm15, %ymm14, %ymm1                 #18.9
        vperm2f128 $32, %ymm5, %ymm4, %ymm6             #17.21
        vperm2f128 $49, %ymm5, %ymm4, %ymm7             #17.21
        vperm2f128 $32, %ymm9, %ymm8, %ymm10            #17.21
        vperm2f128 $49, %ymm9, %ymm8, %ymm11            #17.21
        vmovupd   %ymm12, (%rdi,%r8,8)                  #12.6
        vmovupd   %ymm13, 32(%rdi,%r8,8)                #12.6
        vperm2f128 $32, %ymm1, %ymm0, %ymm2             #18.9
        vperm2f128 $49, %ymm1, %ymm0, %ymm3             #18.9
        vunpcklpd %ymm7, %ymm6, %ymm13                  #17.21
        vunpckhpd %ymm7, %ymm6, %ymm12                  #17.21
        vunpcklpd %ymm11, %ymm10, %ymm0                 #17.21
        vunpckhpd %ymm11, %ymm10, %ymm10                #17.21
        vmovupd   %ymm2, 64(%rdi,%r8,8)                 #12.6
        vmovupd   %ymm3, 96(%rdi,%r8,8)                 #12.6
        vunpcklpd %ymm13, %ymm12, %ymm14                #18.9
        vunpckhpd %ymm13, %ymm12, %ymm15                #18.9
        vunpcklpd %ymm0, %ymm10, %ymm1                  #18.9
        vunpckhpd %ymm0, %ymm10, %ymm2                  #18.9
        vperm2f128 $32, %ymm15, %ymm14, %ymm12          #18.9
        vperm2f128 $49, %ymm15, %ymm14, %ymm13          #18.9
        vperm2f128 $32, %ymm2, %ymm1, %ymm3             #18.9
        vperm2f128 $49, %ymm2, %ymm1, %ymm4             #18.9
        vmovupd   %ymm12, 128(%rdi,%r8,8)               #12.6
        vmovupd   %ymm13, 160(%rdi,%r8,8)               #12.6
        vmovupd   %ymm3, 192(%rdi,%r8,8)                #12.6
        vmovupd   %ymm4, 224(%rdi,%r8,8)                #12.6
        cmpl      %eax, %ecx                            #16.5
        jb        ..B2.4        # Prob 82%              #16.5[/bash]

I got this example from our compiler engineers. This kind of reverse search for specific patterns that produce certain instructions should not be taken as guaranteed; the compiler's optimization algorithms might change with every (bigger) update.
I'm only providing it as a demonstration that the permute instructions are in fact used by the compiler.
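If you want to reproduce such a listing yourself, one way is to write out the assembly with -S and search it for the permute/unpack instructions (the source file name below is just a placeholder):

[bash]# Hypothetical source file containing perm(); -S writes the listing to perm.s
icc -O2 -xAVX -S perm.c
grep -E 'vperm|vunpck' perm.s[/bash]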

Does this answer your question?

Best regards,

Georg Zitzlsberger