HOWTO reduce register pressure _mm_intrinsics

jimdempseyatthecove · ‎11-10-2010

Intel Compiler XE 12.0.2.55 compiling under x64 Windows XP (as x64 app)

The system has 16 xmm registers

I have a function that uses 8 xmm registers as an accumulator (declared as eight __m128d variables)

I have a single loop containing four scratch __m128d variables, for a total of 12 xmm registers.

That leaves 4 xmm registers available for the compiler to reassign variables during optimization of the code.

I've structured the code within the loop to pull data from memory into three of the four loop-internal temporaries, then manipulate those temps with the fourth loop-internal temporary. The code, as written, should not consume any more than the 12 specified registers. However, the compiler's optimizer is using multiple xmm registers for one or more of the declared loop temporaries. The end result is that register pressure exceeds the 16 available registers, which then results in additional memory **writes** and reads. Although these memory writes/reads will be served from fast L1 cache (~4 clocks per hit), that is not as fast as a register-to-register operation (potentially less than 1 clock tick), and I am not trashing the cache coherency with memory writes.
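For readers who want something concrete, here is a minimal sketch of the loop shape described above: eight __m128d accumulators plus four loop temporaries (three loaded from memory, the fourth used as scratch). The kernel itself (`accumulate8`, the helper `hsum`, the three input arrays and the particular sums) is hypothetical and only illustrates the register budget; it is not the actual code from the original post, and nothing in it forces the compiler to keep to 12 registers.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h>  // SSE2: __m128d and the _mm_*_pd intrinsics

// Hypothetical helper: add the two lanes of an __m128d.
static double hsum(__m128d v)
{
    double lanes[2];
    _mm_storeu_pd(lanes, v);
    return lanes[0] + lanes[1];
}

// Hypothetical kernel: eight running sums over three input arrays.
// n must be even; unaligned loads are used, so no alignment is assumed.
void accumulate8(const double* a, const double* b, const double* c,
                 std::size_t n, double out[8])
{
    assert(n % 2 == 0);

    // Eight accumulators: ideally eight xmm registers for the whole loop.
    __m128d acc0 = _mm_setzero_pd(), acc1 = _mm_setzero_pd();
    __m128d acc2 = _mm_setzero_pd(), acc3 = _mm_setzero_pd();
    __m128d acc4 = _mm_setzero_pd(), acc5 = _mm_setzero_pd();
    __m128d acc6 = _mm_setzero_pd(), acc7 = _mm_setzero_pd();

    for (std::size_t i = 0; i < n; i += 2) {
        // Three temporaries pulled in from memory...
        __m128d t0 = _mm_loadu_pd(a + i);
        __m128d t1 = _mm_loadu_pd(b + i);
        __m128d t2 = _mm_loadu_pd(c + i);
        // ...and a fourth used as scratch for every product.
        __m128d t3;

        t3 = _mm_mul_pd(t0, t1); acc0 = _mm_add_pd(acc0, t3);
        t3 = _mm_mul_pd(t0, t2); acc1 = _mm_add_pd(acc1, t3);
        t3 = _mm_mul_pd(t1, t2); acc2 = _mm_add_pd(acc2, t3);
        t3 = _mm_mul_pd(t0, t0); acc3 = _mm_add_pd(acc3, t3);
        t3 = _mm_mul_pd(t1, t1); acc4 = _mm_add_pd(acc4, t3);
        t3 = _mm_mul_pd(t2, t2); acc5 = _mm_add_pd(acc5, t3);
        acc6 = _mm_add_pd(acc6, t0);
        acc7 = _mm_add_pd(acc7, t1);
    }

    out[0] = hsum(acc0); out[1] = hsum(acc1);
    out[2] = hsum(acc2); out[3] = hsum(acc3);
    out[4] = hsum(acc4); out[5] = hsum(acc5);
    out[6] = hsum(acc6); out[7] = hsum(acc7);
}
```

As written, the source names 12 vector variables, but the optimizer is free to give each reuse of t3 its own register so that the multiplies can overlap. That is the kind of expansion that can push a tighter kernel past 16 registers and into spills.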

Note, the code was originally written to use all 16 available xmm registers (with a larger accumulator using 12 xmm registers). When that failed to produce optimal code, I reduced the working set for the accumulator from 12 to 8. While I can write this as a .ASM file, I would rather stay in .CPP using SSE (and later AVX) intrinsic functions.

What I am looking for is a pragma or other option that tells the compiler to use a strict interpretation of the _mm_ intrinsic functions, or a way to mark an __m128d variable as assigned once to a register.

Any input on this would be appreciated.

Jim Dempsey

Aubrey_W_ · ‎11-15-2010

Hello Jim,

Since you are looking for compiler options, I'll move this to the Intel C++ Compiler forum where our Technical Consulting Engineers can have a look at it.

Best regards,

==
Aubrey W.
Intel Software Network Support

JenniferJ · ‎11-16-2010

Hi Jim,
This is difficult to answer. It depends on the specific test case. Is it possible to attach a test? Or maybe file a ticket with Premier Support, or use a private response.

Thanks,
Jennifer

levicki · ‎12-16-2010

Hi Jim, not sure if you solved this and how, but I want to tell you that in my experience, such critical code should be written in assembler.

Rationale:
Compiler intrinsics are not guaranteed to have the same output in future compiler versions, not to mention between compilers from different vendors.

I know that it complicates maintenance and that it is a PITA to do right when you need both 32-bit and 64-bit code, but I am afraid that is your only option if you want your code not only to be fast but to stay fast over time.

jimdempseyatthecove · ‎12-16-2010

Thanks Igor,

I am currently writing a C++ shell function containing something like

void foo(__m128d* A, __m128d* B, __m128d* C, intptr_t len)
{
__asm {
#define A rcx
#define B rdx
#define C r8
#define len r9
...
} // __asm
} // void foo

This works; however, a lot of attention has to be paid to the processor pipeline latencies and the throughput of the instructions in the enclosed code.

When using the _mm_ intrinsics, the compiler optimizations will re-order instructions and assign/reassign temporary xmm registers. So this is a major plus for using _mm_ intrinsics. On the minus side, when the enclosed code requires the maximum number of registers, the compiler optimizations will at times choose to flush the register-mapped temporaries out to memory, as opposed to choosing a route that consumes fewer registers but experiences a pipeline stall.

Most of the time, the use of the _mm_ intrinsics does not require all of the registers and the compiler produces exceptionally good code. It is these borderline cases that come up from time to time. And for these cases, the difference in execution time can be dramatic when the wrong choice is made.

I think, what would be cool here, is a #pragma optimize keywordHere that instructs the optimizer that the following scoped section of code is problematic for optimization and that you would like the compiler to take additional time if necessary to perform a regression analysis of different optimization strategies of the section, then choose the better route. Some of the routes could not be determined at compile time (cache locality), so some run-time instrumentation might be necessary (incorporated into VTune, PTU, etc).

Or, there may be a utility for this...

A program that integrates into the Disassembly window, and/or reads your ASM code, and provides a chart of the instruction latencies, ports, throughput, etc. for a selected processor design (together with totals). This would help the developer re-sequence instructions to fill pipeline stall points. Do you know of such a tool?

Jim Dempsey

levicki · ‎12-17-2010

Unfortunately no, I am not aware of such a tool.

What I know is that in cases with high register pressure it may be beneficial to split the loop into multiple passes. This is what I have done for one such case that required zillions of registers. Sometimes it is faster to do three simple operations in three passes than one complex operation in one pass.
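A minimal sketch of the loop-splitting idea, under stated assumptions: the function names (`sums_one_pass`, `sums_two_pass`), the `Sums` struct and the particular sums are hypothetical and chosen only to keep the example short. The single-pass version keeps four accumulators live at once; the split version runs two passes with two accumulators each, at the cost of reading the inputs twice. In a real high-pressure kernel the register counts would be much larger, and whether the split wins has to be measured.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h>  // SSE2: __m128d and the _mm_*_pd intrinsics

struct Sums { double xy, xx, yy, x; };  // hypothetical result type

// Hypothetical helper: add the two lanes of an __m128d.
static double hsum(__m128d v)
{
    double lanes[2];
    _mm_storeu_pd(lanes, v);
    return lanes[0] + lanes[1];
}

// One pass: four accumulators and two temporaries are live together.
Sums sums_one_pass(const double* x, const double* y, std::size_t n)
{
    assert(n % 2 == 0);
    __m128d axy = _mm_setzero_pd(), axx = _mm_setzero_pd();
    __m128d ayy = _mm_setzero_pd(), ax  = _mm_setzero_pd();
    for (std::size_t i = 0; i < n; i += 2) {
        __m128d vx = _mm_loadu_pd(x + i);
        __m128d vy = _mm_loadu_pd(y + i);
        axy = _mm_add_pd(axy, _mm_mul_pd(vx, vy));
        axx = _mm_add_pd(axx, _mm_mul_pd(vx, vx));
        ayy = _mm_add_pd(ayy, _mm_mul_pd(vy, vy));
        ax  = _mm_add_pd(ax, vx);
    }
    return Sums{hsum(axy), hsum(axx), hsum(ayy), hsum(ax)};
}

// Two passes: each loop keeps only two accumulators live,
// but x and y are read from memory twice.
Sums sums_two_pass(const double* x, const double* y, std::size_t n)
{
    assert(n % 2 == 0);

    // Pass 1: sum(x*y) and sum(x).
    __m128d axy = _mm_setzero_pd(), ax = _mm_setzero_pd();
    for (std::size_t i = 0; i < n; i += 2) {
        __m128d vx = _mm_loadu_pd(x + i);
        __m128d vy = _mm_loadu_pd(y + i);
        axy = _mm_add_pd(axy, _mm_mul_pd(vx, vy));
        ax  = _mm_add_pd(ax, vx);
    }

    // Pass 2: sum(x*x) and sum(y*y).
    __m128d axx = _mm_setzero_pd(), ayy = _mm_setzero_pd();
    for (std::size_t i = 0; i < n; i += 2) {
        __m128d vx = _mm_loadu_pd(x + i);
        __m128d vy = _mm_loadu_pd(y + i);
        axx = _mm_add_pd(axx, _mm_mul_pd(vx, vx));
        ayy = _mm_add_pd(ayy, _mm_mul_pd(vy, vy));
    }
    return Sums{hsum(axy), hsum(axx), hsum(ayy), hsum(ax)};
}
```

Both versions compute the same four sums. The trade-off is extra memory traffic in exchange for fewer simultaneously live values, so the split tends to pay off only when the single-pass loop would otherwise spill.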