Enabling the use of the PSADBW instruction during autovectorization

encoder · 06-08-2012

The C loop below computes the sum of absolute differences between a reference buffer pointed to by pu1_buff1 and a current buffer pointed to by pu1_buff2:

for(i = 0 ; i < 16; i++)
{
    for(j = 0; j < 16; j++)
    {
        temp = pu1_buff1[j] - pu1_buff2[j];
        if(temp<0) temp= -temp;
        u4_sad += temp;
    }
    pu1_buff1 += u4_buf1_width;
    pu1_buff2 += u4_buf2_width;
}

The compiler auto-vectorizes this loop, but without using PSADBW, which is an available SSE SIMD instruction. Instead, it subtracts 8 elements at a time from the two buffers and then takes the two's complement of the packed registers to get the absolute values. PSADBW would clearly be more efficient, since the loop is a sum of absolute differences, but the compiler does not seem to recognize that.
This function is called a great many times in my code, so even a small optimization here would save a considerable amount of time.
Is there any way to rewrite this code so that the compiler recognizes that PSADBW is the better choice? Or is coding it with intrinsics the only way around this?
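
For reference, the intrinsics route mentioned above could look roughly like the sketch below. It is only an illustration under stated assumptions: the function names `sad_16x16_scalar` and `sad_16x16_psadbw` and the stride parameters are hypothetical, the block is fixed at 16x16 bytes, and the loads are assumed to be unaligned. `_mm_sad_epu8` is the SSE2 intrinsic that maps to PSADBW; when SSE2 is not available at compile time, the sketch simply falls back to the scalar loop.

```c
#include <stddef.h>
#include <stdlib.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>          /* SSE2 intrinsics, including _mm_sad_epu8 */
#define SAD_HAVE_SSE2 1
#endif

/* Scalar reference: same computation as the loop in the question. */
unsigned int sad_16x16_scalar(const unsigned char *buf1, size_t stride1,
                              const unsigned char *buf2, size_t stride2)
{
    unsigned int sad = 0;
    for (int i = 0; i < 16; i++) {
        for (int j = 0; j < 16; j++)
            sad += (unsigned int)abs((int)buf1[j] - (int)buf2[j]);
        buf1 += stride1;
        buf2 += stride2;
    }
    return sad;
}

/* Hypothetical intrinsics version: one PSADBW per 16-byte row. */
unsigned int sad_16x16_psadbw(const unsigned char *buf1, size_t stride1,
                              const unsigned char *buf2, size_t stride2)
{
#ifdef SAD_HAVE_SSE2
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < 16; i++) {
        /* Unaligned loads: no alignment of the buffers is assumed. */
        __m128i r1 = _mm_loadu_si128((const __m128i *)buf1);
        __m128i r2 = _mm_loadu_si128((const __m128i *)buf2);
        /* PSADBW leaves one partial sum in each 64-bit half. */
        acc = _mm_add_epi32(acc, _mm_sad_epu8(r1, r2));
        buf1 += stride1;
        buf2 += stride2;
    }
    /* Fold the upper 64-bit half onto the lower one. */
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 8));
    return (unsigned int)_mm_cvtsi128_si32(acc);
#else
    /* No SSE2 at compile time: fall back to the scalar loop. */
    return sad_16x16_scalar(buf1, stride1, buf2, stride2);
#endif
}
```

Comparing the result against the scalar function on a few test blocks is a cheap way to check such a routine before relying on it.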

Thanks.

Georg_Z_Intel · 07-13-2012

Hello,

with ICC 12.1.4.319 and this example

[cpp]
#define N 16

unsigned int foo(unsigned char *pu1_buff1, unsigned char *pu1_buff2)
{
    int i, j;
    int temp;
    unsigned int u4_sad = 0;
    unsigned int u4_buf1_width = 1, u4_buf2_width = 1;

    for(i = 0 ; i < N; i++)
    {
        for(j = 0; j < N; j++)
        {
            temp = pu1_buff1[j] - pu1_buff2[j];
            if(temp<0) temp= -temp;
            u4_sad += temp;
        }
        pu1_buff1 += u4_buf1_width;
        pu1_buff2 += u4_buf2_width;
    }
    return u4_sad;
}

int main(int argc, char **argv)
{
    /* Array sizes were lost in the original post; 2 * N is an assumption
       that covers the N + N - 1 bytes foo reads with a width of 1. */
    unsigned char a[2 * N], b[2 * N];
    unsigned int res = 0; // Trick compiler to not optimize away "foo"
#pragma noinline // Just to easily find "foo" in assembly
    res = foo(a, b);
    return res;
}
[/cpp]

I'm seeing PSADBW. Command line used:

$ icc perf.c -O2 -S

perf.s:
[plain]
foo:
# parameter 1: %eax
# parameter 2: %edx
..B2.1:                         # Preds ..B2.0
        movl      4(%esp), %eax                                 #3.1
        movl      8(%esp), %edx                                 #3.1
        .globl foo.
foo.:
        xorl      %ecx, %ecx                                    #
        pxor      %xmm0, %xmm0                                  #
                                # LOE eax edx ecx ebx ebp esi edi xmm0
..B2.2:                         # Preds ..B2.2 ..B2.1
        movdqu    (%ecx,%eax), %xmm2                            #15.13
        movdqu    (%ecx,%edx), %xmm1                            #15.13
        psadbw    %xmm1, %xmm2                                  #15.13
        incl      %ecx                                          #9.5
        paddd     %xmm2, %xmm0                                  #17.13
        cmpl      $16, %ecx                                     #9.5
        jb        ..B2.2        # Prob 93%                      #9.5
                                # LOE eax edx ecx ebx ebp esi edi xmm0
..B2.3:                         # Preds ..B2.2
        movdqa    %xmm0, %xmm1                                  #6.25
        psrldq    $8, %xmm1                                     #6.25
        paddd     %xmm1, %xmm0                                  #6.25
        movdqa    %xmm0, %xmm2                                  #6.25
        psrldq    $4, %xmm2                                     #6.25
        paddd     %xmm2, %xmm0                                  #6.25
        movd      %xmm0, %eax                                   #6.25
        ret                                                     #24.12
[/plain]
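
To make the listing easier to follow, here is a small scalar model in C of what the loop at ..B2.2 and the reduction at ..B2.3 compute. It is a sketch for illustration, not compiler output: the type `xmm_lanes` and the functions `psadbw_model` and `sad_model` are hypothetical names, and the model uses 64-bit additions where the assembly uses PADDD on 32-bit elements. The point it shows is that a 128-bit PSADBW produces two partial sums, one per 64-bit half, which is why the result has to be folded together after the loop.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper type: the two 64-bit halves of an XMM register. */
typedef struct { uint64_t lane[2]; } xmm_lanes;

/* Scalar model of one 128-bit PSADBW: each 64-bit half receives the
 * sum of absolute differences of its own 8 byte pairs. */
xmm_lanes psadbw_model(const uint8_t *x, const uint8_t *y)
{
    xmm_lanes r = { { 0, 0 } };
    for (int k = 0; k < 16; k++)
        r.lane[k / 8] += (uint64_t)abs((int)x[k] - (int)y[k]);
    return r;
}

/* Model of the whole routine: accumulate one PSADBW result per row
 * (the loop at ..B2.2), then add the two halves (the tail at ..B2.3). */
unsigned int sad_model(const uint8_t *p1, const uint8_t *p2,
                       int rows, int width1, int width2)
{
    xmm_lanes acc = { { 0, 0 } };
    for (int i = 0; i < rows; i++) {
        xmm_lanes s = psadbw_model(p1, p2);
        acc.lane[0] += s.lane[0];
        acc.lane[1] += s.lane[1];
        p1 += width1;   /* advance to the next row of buffer 1 */
        p2 += width2;   /* advance to the next row of buffer 2 */
    }
    return (unsigned int)(acc.lane[0] + acc.lane[1]);
}
```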

Which compiler version are you using?

Best regards,

Georg Zitzlsberger