The C loop given below computes the sum of absolute differences (SAD) between a reference buffer pointed to by pu1_buff1 and a current buffer pointed to by pu1_buff2:
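    for(i = 0; i < 16; i++)
    {
        for(j = 0; j < 16; j++)
        {
            temp = pu1_buff1[j] - pu1_buff2[j];  /* signed difference of two samples */
            if(temp < 0) temp = -temp;           /* absolute value */
            u4_sad += temp;
        }
        pu1_buff1 += u4_buf1_width;              /* advance both buffers to the next row */
        pu1_buff2 += u4_buf2_width;
    }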
This loop has been auto-vectorized, but without using PSADBW, which is an available SSE SIMD instruction. Instead, the compiler subtracts 8 elements at a time from the two buffers and then takes the two's complement in the vector registers. PSADBW would obviously be far more efficient, since this is exactly a sum of absolute differences, but the compiler does not seem to realise that.
This function is called a great many times in my code, so even a small optimization here would save a considerable amount of time.
Is there any way I can rewrite this code so that the compiler recognises that PSADBW is the better choice? Or is coding it in intrinsics the only way around this?
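For reference, the intrinsics fallback would look roughly like the sketch below. It is only a minimal sketch under assumptions: SSE2 is available, the buffers hold unsigned 8-bit samples, each row is exactly 16 bytes wide, and the widths are strides in bytes. The function name sad_16x16_sse2 and the parameter types are hypothetical and not taken from the real code. _mm_sad_epu8 is the intrinsic that maps to PSADBW.

```c
#include <emmintrin.h>  /* SSE2 intrinsics, including _mm_sad_epu8 (PSADBW) */
#include <stdint.h>

/* Hypothetical helper: SAD over a 16x16 block of unsigned 8-bit samples.
 * Assumes each row is 16 bytes and the widths are row strides in bytes. */
static uint32_t sad_16x16_sse2(const uint8_t *pu1_buff1, uint32_t u4_buf1_width,
                               const uint8_t *pu1_buff2, uint32_t u4_buf2_width)
{
    __m128i acc = _mm_setzero_si128();
    int i;

    for (i = 0; i < 16; i++) {
        /* Unaligned loads, since nothing is known about buffer alignment. */
        __m128i a = _mm_loadu_si128((const __m128i *)pu1_buff1);
        __m128i b = _mm_loadu_si128((const __m128i *)pu1_buff2);

        /* PSADBW: one partial sum per 64-bit half of the register. */
        acc = _mm_add_epi64(acc, _mm_sad_epu8(a, b));

        pu1_buff1 += u4_buf1_width;
        pu1_buff2 += u4_buf2_width;
    }

    /* Fold the upper 64-bit partial sum into the lower one. */
    acc = _mm_add_epi64(acc, _mm_srli_si128(acc, 8));

    /* The total is at most 256 * 255, so the low 32 bits hold it all. */
    return (uint32_t)_mm_cvtsi128_si32(acc);
}
```

A call such as sad_16x16_sse2(pu1_buff1, u4_buf1_width, pu1_buff2, u4_buf2_width) would replace the whole double loop, with the result added to u4_sad. I would still prefer a plain C formulation that the auto-vectorizer turns into the same instruction, if one exists.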