The C code of a loop is given below. It computes the sum of absolute differences between a reference buffer pointed to by pu1_buff1 and a current buffer pointed to by pu1_buff2:
for(i = 0 ; i < 16; i++)
{
    for(j = 0; j < 16; j++)
    {
        temp = pu1_buff1[j] - pu1_buff2[j];
        if(temp < 0) temp = -temp;
        u4_sad += temp;
    }
    pu1_buff1 += u4_buf1_width;
    pu1_buff2 += u4_buf2_width;
}
This has been auto-vectorized, but without using the PSADBW instruction, which is an available SSE SIMD instruction. Instead, the compiler subtracts 8 elements at a time from the two buffers and then takes the two's complement in the vector registers. PSADBW would obviously be much more efficient, since this is a sum of absolute differences, but the compiler does not seem to realise that.
This function is called a great many times in my code, so even a small optimization here would save a considerable amount of time.
Is there any way I can rewrite this code so that the compiler recognises that PSADBW is the better choice? Or is the only way around this to code it with intrinsics?
Thanks.
With ICC 12.1.4.319 and this example
[cpp]
#define N 16
unsigned int foo(unsigned char *pu1_buff1, unsigned char *pu1_buff2)
{
    int i, j;
    int temp;
    unsigned int u4_sad = 0;
    unsigned int u4_buf1_width = 1, u4_buf2_width = 1;
    for(i = 0 ; i < N; i++)
    {
        for(j = 0; j < N; j++)
        {
            temp = pu1_buff1
[/cpp]
I'm seeing PSADBW. Command line used:
$ icc perf.c -O2 -S
perf.s:
[plain]
foo:
# parameter 1: %eax
# parameter 2: %edx
..B2.1:                         # Preds ..B2.0
        movl      4(%esp), %eax                 #3.1
        movl      8(%esp), %edx                 #3.1
        .globl foo.
foo.:
        xorl      %ecx, %ecx                    #
        pxor      %xmm0, %xmm0                  #
                                # LOE eax edx ecx ebx ebp esi edi xmm0
..B2.2:                         # Preds ..B2.2 ..B2.1
        movdqu    (%ecx,%eax), %xmm2            #15.13
        movdqu    (%ecx,%edx), %xmm1            #15.13
        psadbw    %xmm1, %xmm2                  #15.13
        incl      %ecx                          #9.5
        paddd     %xmm2, %xmm0                  #17.13
        cmpl      $16, %ecx                     #9.5
        jb        ..B2.2        # Prob 93%      #9.5
                                # LOE eax edx ecx ebx ebp esi edi xmm0
..B2.3:                         # Preds ..B2.2
        movdqa    %xmm0, %xmm1                  #6.25
        psrldq    $8, %xmm1                     #6.25
        paddd     %xmm1, %xmm0                  #6.25
        movdqa    %xmm0, %xmm2                  #6.25
        psrldq    $4, %xmm2                     #6.25
        paddd     %xmm2, %xmm0                  #6.25
        movd      %xmm0, %eax                   #6.25
        ret                                     #24.12
[/plain]
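If the compiler in use does not produce PSADBW on its own, the loop in the listing above can be approximated by hand with SSE2 intrinsics. The following is only a rough sketch under some assumptions: the function names `sad16x16_scalar` and `sad16x16_sse2` and the `stride1`/`stride2` parameters are made up for illustration (they stand in for `u4_buf1_width` and `u4_buf2_width`), each row is assumed to be 16 bytes wide, and the scalar version reflects my reading of the posted loop as a per-byte difference `buf1[j] - buf2[j]`. `_mm_sad_epu8` is the intrinsic that maps to PSADBW.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; _mm_sad_epu8 maps to PSADBW */
#include <stddef.h>

/* Scalar reference (assumed reading of the posted loop): sum of absolute
   differences over a 16x16 block of unsigned bytes. */
unsigned int sad16x16_scalar(const unsigned char *buf1, size_t stride1,
                             const unsigned char *buf2, size_t stride2)
{
    unsigned int sad = 0;
    for (int i = 0; i < 16; i++) {
        for (int j = 0; j < 16; j++) {
            int d = buf1[j] - buf2[j];
            sad += (unsigned int)(d < 0 ? -d : d);
        }
        buf1 += stride1;   /* advance to the next row */
        buf2 += stride2;
    }
    return sad;
}

/* Hand-written SSE2 sketch: one PSADBW per 16-byte row. */
unsigned int sad16x16_sse2(const unsigned char *buf1, size_t stride1,
                           const unsigned char *buf2, size_t stride2)
{
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < 16; i++) {
        /* Unaligned loads, like the movdqu instructions in the listing. */
        __m128i a = _mm_loadu_si128((const __m128i *)buf1);
        __m128i b = _mm_loadu_si128((const __m128i *)buf2);
        /* PSADBW leaves two partial sums, one in the low 16 bits of each
           64-bit half of the register; accumulate them with PADDD. */
        acc = _mm_add_epi32(acc, _mm_sad_epu8(a, b));
        buf1 += stride1;
        buf2 += stride2;
    }
    /* Fold the upper 64-bit half onto the lower one, then extract the sum. */
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 8));
    return (unsigned int)_mm_cvtsi128_si32(acc);
}
```

Both functions should return the same value for the same inputs, so the scalar one can serve as a check when trying the intrinsic version. The largest possible result for a 16x16 block is 256 * 255 = 65280, so the 32-bit accumulation cannot overflow.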
Which compiler version are you using?
Best regards,
Georg Zitzlsberger