Code generation bug, IA64 intrinsics

lpkruger · ‎02-20-2006

I am reporting a code generation bug in the IA64 C compiler. The generated code produces the correct value, but it does so with a serious and unnecessary performance penalty. Here is a test case with analysis.

The following function rotates each of two packed 32-bit integers left by 13 bits:

long rot13(long x) {
    int n = 13;
    __m64 t1, t2;
    __m64 xx;
    xx = _m_from_int64(x);
    t1 = _m_pslldi(xx, n);
    t2 = _m_psrldi(xx, 32-n);
    return _m_to_int64(_m_paddd(t1, t2));
}
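
For comparison, here is a minimal portable sketch of what the function is expected to compute, written without intrinsics. The names rotl32 and rot13_ref are hypothetical and only for illustration, and uint64_t stands in for the 64-bit long used above. Because the left-shifted and right-shifted halves never have overlapping bits, combining them with OR gives the same result as the packed add in the original.

```c
#include <stdint.h>

/* Rotate a single 32-bit value left by n bits (n taken modulo 32). */
static uint32_t rotl32(uint32_t v, unsigned n)
{
    n &= 31;
    /* The second mask avoids an undefined shift by 32 when n == 0. */
    return (v << n) | (v >> ((32 - n) & 31));
}

/* Hypothetical reference: rotate each 32-bit half of x left by 13 bits. */
static uint64_t rot13_ref(uint64_t x)
{
    uint32_t lo = (uint32_t)x;          /* low packed element  */
    uint32_t hi = (uint32_t)(x >> 32);  /* high packed element */
    return ((uint64_t)rotl32(hi, 13) << 32) | (uint64_t)rotl32(lo, 13);
}
```

A reference like this can be used to check the intrinsic version's output; it says nothing about which instructions the compiler should pick.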

Here is the generated asm code, using -O3 optimization. The problem is that the pshr4 and pshl4 instructions should be using their immediate forms, since the shift amounts are constants known at compile time. Instead, the generated code wastes 4 registers on preparing the shift amounts. (There is also a secondary problem. Even if we wanted to use registers, the zxt4 instructions are useless, since the constants loaded into r11 and r10 already have their high bits clear.)

rot13??unw:
{ .mii
    alloc r14=ar.pfs,1,0,0,0        //0: {15:20} 61
    add r11=19,r0                   //0: {21:9} 49
    add r10=13,r0 ;;                //0: {20:9} 46
}
{ .mii
    nop.m 0
    zxt4 r9=r11                     //1: {21:9} 50
    zxt4 r8=r10 ;;                  //1: {20:9} 47
}
{ .mii
    nop.m 0
    pshr4.u r3=r32,r9               //4: {21:9} 51
    pshl4 r2=r32,r8 ;;              //4: {20:9} 48
}
{ .mib
    padd4 r8=r2,r3                  //6: {22:23} 52
    nop.i 0
    // Block 1: exit Pred: 0 Succ: -GO
    // Freq 1.0e+00
    br.ret.sptk.many b0 ;;          //6: {22:11} 55
}