Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

HOWTO reduce register pressure for SSE intrinsic functions

jimdempseyatthecove
Honored Contributor III
This is a duplicate of a post in Platform Technologies - AVX...
It may be more appropriate for Intel C++ Compiler forum

Intel Compiler XE 12.0.2.55, compiling under Windows XP x64 (as an x64 app).

The system has 16 xmm registers.

I have a function that uses 8 xmm registers as an accumulator (declared as eight __m128d variables).

I have a single loop containing four scratch __m128d variables, for a total of 12 xmm registers declared.

This leaves 4 xmm registers available for the compiler to reassign variables during optimization of the code.

I've structured the code within the loop to pull data from memory into three of the four loop-internal temp registers, then manipulate those temps with the fourth internal loop temporary. The code, as written, should not consume any more than the 12 specified registers. However, the compiler optimization is consuming multiple additional xmm registers for one or more of the declared loop temporaries. The end result is that the register pressure exceeds the 16 available, which then results in additional memory **writes** and reads. Although each memory write/read will be fast from L1 cache (~4 clocks per hit), it is not as fast as a register-to-register operation (potentially less than 1 clock tick), and I wish not to trash cache coherency with unwarranted memory writes.
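
For concreteness, a minimal sketch of the loop structure just described; the function and variable names (Accumulate8, At, Bt, acc0..acc7, t0..t2, temp) and the arithmetic are hypothetical placeholders, not the actual code:

[cpp]#include <emmintrin.h>  // SSE2: __m128d, _mm_mul_pd, _mm_add_pd

// Sketch: eight accumulators + four loop-scratch variables = 12 xmm registers,
// assuming the compiler maps each __m128d to exactly one register.
void Accumulate8(const __m128d* At, const __m128d* Bt, __m128d* out, int n)
{
    __m128d acc0 = _mm_setzero_pd(), acc1 = _mm_setzero_pd();
    __m128d acc2 = _mm_setzero_pd(), acc3 = _mm_setzero_pd();
    __m128d acc4 = _mm_setzero_pd(), acc5 = _mm_setzero_pd();
    __m128d acc6 = _mm_setzero_pd(), acc7 = _mm_setzero_pd();

    for (int i = 0; i < n; ++i) {
        __m128d t0 = At[i*2+0];   // three temps loaded from memory...
        __m128d t1 = At[i*2+1];
        __m128d t2 = Bt[i*2+0];
        __m128d temp;             // ...manipulated via the fourth temp

        temp = _mm_mul_pd(t0, t2);  acc0 = _mm_add_pd(acc0, temp);
        temp = _mm_mul_pd(t1, t2);  acc1 = _mm_add_pd(acc1, temp);
        // ... remaining accumulators updated with the same pattern
    }

    out[0] = acc0; out[1] = acc1; out[2] = acc2; out[3] = acc3;
    out[4] = acc4; out[5] = acc5; out[6] = acc6; out[7] = acc7;
}
[/cpp]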

Note, the code was originally written to use all 16 available xmm registers (with a larger accumulator using 12 xmm registers). When that failed to produce optimal code, I reduced the accumulator's working set from 12 to 8 registers. While I can write this as a .ASM file, I would rather stay in .CPP using SSE intrinsic functions (and later AVX with its wider registers).

What I am looking for is a pragma or other option that informs the compiler to use a strict interpretation of the _mm_ intrinsic functions with respect to register declarations. For example, a way to mark an __m128d variable as assigned once to a register. Note, I've tried "register __m128d temp;" and this does not pin temp to a single register; temp still experiences register reassignment, forcing the working register set to exceed those available.

Any input on this would be appreciated.

Jim Dempsey

TimP
Honored Contributor III
In case it's relevant, we've noticed that disabling interprocedural optimization (unfortunately, not the same option spelling for Windows as for Linux: /Qip-?) may improve the fidelity of intrinsics code to programmer intention.
I assume that no compiler will attempt unrolling in such a situation; if it were a problem, you could set #pragma unroll(0).
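For illustration, a minimal sketch of where that pragma would go, following the suggestion above; the loop body and names (sum, a, acc, n) are placeholders:

[cpp]#include <emmintrin.h>

__m128d sum(const __m128d* a, int n)
{
    __m128d acc = _mm_setzero_pd();
#pragma unroll(0)   // ask the compiler not to unroll, so the
                    // temporaries are not replicated per iteration
    for (int i = 0; i < n; ++i) {
        __m128d t = a[i];           // explicit load
        acc = _mm_add_pd(acc, t);   // accumulate
    }
    return acc;
}
[/cpp]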
jimdempseyatthecove
Honored Contributor III
Tim,

Thanks for the suggestion. I've already tried nounroll and noinline.

I had seen this problem earlier in a different section of code and had erroneously thought that the problem was due to register reservation (n registers reserved for whatever).

I've reworked the code; the stack temporaries are now eliminated...

But now some of my explicit loads into __m128d variables are not performed as expressed, resulting in multiple reads from RAM, e.g.:

[plain];;; 	{
;;; 		__m128d At_00_00;
;;; 		__m128d At_10_10;
;;; 		__m128d At_20_20;
;;; 		__m128d At_30_30;
;;; 		__m128d Bt_nm_xy;
;;; 		__m128d temp;
;;; 
;;; 		At_10_10 = At[i*2+0];
;;; 		At_00_00 = _mm_movedup_pd(At_10_10);
;;; 		At_10_10 = _mm_shuffle_pd(At_10_10, At_10_10, 3);
;;; 
;;; 		At_30_30 = At[i*2+1];
;;; 		At_20_20 = _mm_movedup_pd(At_30_30);
;;; 		At_30_30 = _mm_shuffle_pd(At_30_30, At_30_30, 3);
;;; 
;;; 		// Work down first two columns of B		// (low, high)
;;; 		Bt_nm_xy = Bt[i*2+0];					// (B00, B01)
        movaps    xmm9, XMMWORD PTR [rax+rdx]                   ;130.3
        inc       r10                                           ;112.31
        movddup   xmm11, QWORD PTR [rax+rcx]                    ;122.3
;;; 		temp = _mm_mul_pd(At_00_00, Bt_nm_xy);		// (A00xB00, A00xB01)
        movaps    xmm13, xmm9                                   ;131.3
        mulpd     xmm13, xmm11                                  ;131.3
        movaps    xmm10, XMMWORD PTR [rax+rcx]                  ;121.3
        unpckhpd  xmm10, xmm10                                  ;123.3
[/plain]

Note, [rax+rcx] is loaded twice:

[plain]movddup   xmm11, QWORD PTR [rax+rcx]
...
movaps    xmm10, XMMWORD PTR [rax+rcx]
unpckhpd  xmm10, xmm10
[/plain]

versus what was intended:

[plain]movaps    xmm10, XMMWORD PTR [rax+rcx]
movddup   xmm11, xmm10
unpckhpd  xmm10, xmm10
...
[/plain]

Although the 2nd read will come from L1 cache (~4 clocks), it will not be as fast as register-to-register.
Note, the compiler is smart enough to pack something in between the movaps and movddup.
Under the best of conditions, [rax+rcx] will be in L1 cache:

The compiler-generated code takes 4+4+1 clocks; the code as written takes 4+1+1 clocks.

For now, I have rewritten the code to use an __asm {...} block, and this works out quite well (thanks, Intel, for supporting x64 inline assembly with full SSE/AVX).
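
For reference, a rough sketch of what the __asm rewrite of the load pattern looks like; the register choices and addressing are illustrative only, not the actual code:

[plain]__asm {
    movaps   xmm10, XMMWORD PTR [rax+rcx]  ; single 16-byte load of the pair
    movddup  xmm11, xmm10                  ; duplicate low element, register-to-register
    unpckhpd xmm10, xmm10                  ; duplicate high element in place
    ; ... multiplies and accumulates as intended ...
}
[/plain]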

Jim Dempsey
TimP
Honored Contributor III
Jim,
From the fragment you showed, it does look like an inefficiency in the compiler's register assignment for intrinsics.
Mark_S_Intel1
Employee
Jim,

If you provide a summary of the current state of the problem(s) you would like to be resolved along with a test case that can be compiled to reproduce the problem, we will look into the issue.

Thanks,
--mark