Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Good practices and design choices for intrinsics

al_p_
Beginner
967 Views

Hi,

I think I have a good background on how a CPU and memory work; I know the usual things about CPUs, especially Intel CPUs: the cache line that is usually 64 bytes, each CPU core having its own SSE and/or AVX registers, and so on. I'm also fairly familiar with the common practices used to improve performance and memory usage, avoid cache misses, and in general write good software for concurrency and parallelism in its various forms (mainly multicore and SIMD).


But when it comes to intrinsics there is very little to no documentation about design, patterns, and good practices.

My question is about a series of scenarios that I have encountered and don't know how to evaluate correctly, or how they should be tackled to get the most out of the CPU:

* all the examples are in pseudocode to simplify the writing

  1. Is it a good idea to use a register that is an operand of the function as the target for storing the result, or is it better to use a third __m128 register? Ex. r1 = _mm_add ( r1, r2 ) or r3 = _mm_add ( r1, r2 )
  2. Is there a substantial difference, as far as the hardware is concerned, between doing some type punning and using a store instruction when returning a value (or values) from a function? Ex. return ((int32*)(&r1))[0] + ((int32*)(&r1))[1] + ((int32*)(&r1))[2] + ((int32*)(&r1))[3] ( the function being some int foo(...){} ) or _mm_store ( buffer, r1 ) ( at which point the function is void foo(...){} ) ( see the sketch after this list )
  3. I assume there is no difference between writing a for loop that iterates over X registers doing the same operation X times, and writing the operation X times in the body of your function, once for each register. Correct?
  4. Is there a way to know, programmatically, how many and which registers are available on a given CPU/architecture? For example, how many xmm registers a given machine has in each CPU core, plus all the other registers.
  5. What is supposed to happen if I use more registers than my machine actually has? For example, what if I use 33 xmm registers on a CPU that only has 32 SSE registers?
  6. There is a cheap trick to avoid writing a load instruction, which consists of casting ( C-style cast ) the pointer to your data directly to ( __m128i * ). Is this well-defined behaviour? Are there pros and cons? For example __m128i* r1 = (__m128i *) ( &ptr[0] ) ( also covered in the sketch after this list )
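
To make points 2 and 6 concrete in real (non-pseudo) code, here is a minimal sketch using the integer intrinsics; the function names are mine and hypothetical:

    #include <immintrin.h>
    #include <cstdint>

    // Point 2: returning a horizontal sum -- explicit store, then plain reads.
    int32_t sum4_store(__m128i r1)
    {
        alignas(16) int32_t buf[4];
        _mm_store_si128((__m128i*)buf, r1);       // store, then read as ints: well defined
        return buf[0] + buf[1] + buf[2] + buf[3];
    }

    // Point 6: explicit load intrinsic vs. the cast-and-dereference trick.
    __m128i load_both_ways(const int32_t* ptr)
    {
        __m128i a = _mm_loadu_si128((const __m128i*)ptr); // explicit, works unaligned
        __m128i b = *(const __m128i*)ptr;                 // the "cheap trick": usually compiles
                                                          // to the same load, but assumes
                                                          // 16-byte alignment
        return _mm_add_epi32(a, b);
    }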

I'd appreciate a clarification about how to approach each point in the list. I find that intrinsics (and registers too) are not that well documented in terms of behaviour and patterns; the only real suggestion about any possible pattern is the load-compute-store pattern, which says that you load all the registers first, do your computation, and only at the end perform your store(s). But this is too generic and it doesn't answer my doubts.

Thank you for your time

0 Kudos
9 Replies
Bernard
Valued Contributor I
967 Views

Regarding question #1

In your first example, r1 = _mm_add ( r1, r2 ), the variable r1 will be overwritten by the result of the vector addition; in the second example it will be preserved. Put simply, it depends on your code or algorithm design. Regarding the usage of the word "register", bear in mind that the variable r1 is of type __m128, which is typically implemented as a union, and the compiler will make a best effort to load the content of r1 into an XMM register from memory. The emphasis is on the word memory, because there is no way to create immediate vector values in an XMM/YMM register.
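
As a minimal illustration with the real single-precision intrinsic (the function itself is hypothetical), note that both forms are just assignments to variables, and the register allocator decides what actually lives in XMM registers:

    #include <immintrin.h>

    __m128 example(__m128 r1, __m128 r2)
    {
        __m128 r3 = _mm_add_ps(r1, r2);  // old value of r1 is preserved
        r1 = _mm_add_ps(r1, r2);         // old value of r1 is dead from here on
        return _mm_mul_ps(r1, r3);
    }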

Regarding question #4

I do not know of a specific machine instruction for querying the number of available vector registers. Usually in long mode (64-bit mode) you have the following SIMD registers available (assuming an AVX-capable CPU): XMM0-XMM15 and YMM0-YMM15; these registers are software readable/writeable. For internal use, a CPU like Haswell has approximately 168 physical vector registers.
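
What you can query programmatically is the ISA level, via CPUID, and the architectural register count follows from that (8 XMM registers in 32-bit mode, 16 in 64-bit mode with SSE/AVX). A minimal sketch, assuming a GCC/Clang-style <cpuid.h>:

    #include <cpuid.h>
    #include <cstdio>

    int main()
    {
        unsigned int eax, ebx, ecx, edx;
        if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
            int sse2 = (edx >> 26) & 1;  // SSE2 -> XMM0-XMM15 usable in 64-bit mode
            int avx  = (ecx >> 28) & 1;  // AVX  -> YMM0-YMM15 usable in 64-bit mode
            printf("SSE2: %d, AVX: %d\n", sse2, avx);
        }
        return 0;
    }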

0 Kudos
McCalpinJohn
Honored Contributor III
967 Views

It was certainly not obvious to me at first, but after working with the intrinsics for a while it became clear that the compiler does not take these as literally as I originally expected. 

Sometimes this is good news -- for example a VMOVUPS load followed by a VADDPS or other arithmetic instruction will usually compile to a single instruction with a memory argument (saving a register name).
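
For example, something like the following (a trivial sketch, AVX assumed) typically compiles to a single vaddps with a memory operand rather than a separate vmovups:

    #include <immintrin.h>

    __m256 accumulate(__m256 acc, const float* data)
    {
        __m256 v = _mm256_loadu_ps(data);  // usually folded into the add below, i.e.
        return _mm256_add_ps(acc, v);      // vaddps ymm, ymm, [mem]
    }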

On the other hand, sometimes you want to be able to specify which values should be held in registers (for re-use), and there does not appear to be a way to force this with intrinsics.

Even worse, sometimes the compiler completely rewrites intrinsic-based codes.  In one example, a loop with 48 VFMAs was compiled into code with 36 VFMAs, 12 VADDs, and 12 VMULs.   The generated code also had a lot more memory references than were required, because the compiler refused to keep re-used values in registers and instead re-loaded them (using vector arithmetic instructions with memory arguments) at each use.

Overall, I have found that the compiler does what I want with simple loops -- a handful of intrinsic calls and no register pressure.   With more complex loops I have not been pleased with the results.

0 Kudos
Bernard
Valued Contributor I
967 Views

 @John

Maybe in the case where you saw a total of 36 VFMAs, 12 VADDs, and 12 VMULs, the compiler reloaded values because of pointer aliasing? Did you try to use static arrays?
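
For example, restrict-qualified pointers (a minimal sketch; the names are hypothetical) tell the compiler that the arrays do not overlap, so a loaded value can legally stay in a register instead of being re-read:

    // Without __restrict, coef[0] must be reloaded every iteration,
    // because a store through dst might alias coef.
    void scale(float* __restrict dst, const float* __restrict src,
               const float* __restrict coef, int n)
    {
        for (int i = 0; i < n; ++i)
            dst[i] = src[i] * coef[0];  // coef[0] can be hoisted into a register
    }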

0 Kudos
Bernard
Valued Contributor I
967 Views

>>>What is supposed to happen if I use more registers than my machine actually has? For example, what if I use 33 xmm registers on a CPU that only has 32 SSE registers?>>>

You can only use the architectural registers, i.e. those that are software readable/writeable; you cannot access more than that. The additional registers are internal to the CPU and are used for register renaming, holding constants and/or intermediate results, etc. Note that with intrinsics you never name registers directly anyway: if your code keeps more values live than there are register names, the compiler will simply spill the excess to the stack.
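
A minimal sketch of code that will likely force such spills on a machine with 16 XMM names (an optimizer may interleave the two loops and avoid some of them, but the principle stands: excess live values go to the stack, not to extra registers):

    #include <immintrin.h>

    __m128 many_live_values(const float* p)
    {
        __m128 v[20];                         // 20 live values > 16 XMM names
        for (int i = 0; i < 20; ++i)
            v[i] = _mm_loadu_ps(p + 4 * i);   // all v[i] kept live...
        __m128 s = _mm_setzero_ps();
        for (int i = 0; i < 20; ++i)
            s = _mm_add_ps(s, v[i]);          // ...until they are summed here
        return s;
    }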

0 Kudos
TimP
Honored Contributor III
967 Views
John showed us that case in the past. I don't know whether the compiler might sometimes be justified in making such a transformation on the grounds that it opens up more out-of-order opportunities. While it doesn't appear to apply to John's case, sometimes splitting an FMA shortens the critical path latency (see the sketch below). In comparing gcc vs. icc I've seen duplicate loads in one or the other without being able to pin down a performance difference.
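
What I mean is something like this (a sketch assuming AVX2/FMA and n a multiple of 4): in a serial reduction the fused form chains through the ~5-cycle FMA latency, while the split form chains only through the ~3-cycle add, with the multiplies off the critical path:

    #include <immintrin.h>

    __m256d dot_fused(const double* a, const double* b, int n)
    {
        __m256d s = _mm256_setzero_pd();
        for (int i = 0; i < n; i += 4)   // dependent chain: one FMA per iteration
            s = _mm256_fmadd_pd(_mm256_loadu_pd(a + i), _mm256_loadu_pd(b + i), s);
        return s;
    }

    __m256d dot_split(const double* a, const double* b, int n)
    {
        __m256d s = _mm256_setzero_pd();
        for (int i = 0; i < n; i += 4) {
            __m256d p = _mm256_mul_pd(_mm256_loadu_pd(a + i), _mm256_loadu_pd(b + i));
            s = _mm256_add_pd(s, p);     // dependent chain is just the add
        }
        return s;
    }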
0 Kudos
Bernard
Valued Contributor I
967 Views

I think that in the case of reloading, the compiler could have had an opportunity to exploit out-of-order execution when the per-iteration computation was not dependent on the memory access (the variable reloading).

0 Kudos
Bernard
Valued Contributor I
967 Views

>>>I find that intrinsics (and registers too) are not that well documented in terms of behaviour and patterns>>>

You may read the Intel® 64 and IA-32 Architectures Optimization Reference Manual, where you can find interesting information about SSE/AVX programming.

0 Kudos
McCalpinJohn
Honored Contributor III
967 Views

In the optimized DGEMM code that I was trying to implement using intrinsics, there was no aliasing problem -- the problem seems to be that the compiler really did not want to use all 16 AVX registers.   It is important to remember that when using intrinsics to load into an __m256 variable, the __m256 variable is not a register name -- it is a variable name.  The compiler can choose whether to execute a standalone load (VMOVUPS) or integrate that load into the arithmetic instruction where the data is subsequently used.

When attempting to achieve full speed with DGEMM on a Haswell core, the 5-cycle dependent-operation latency on the 2 FMA pipelines means that there must be at least 10 independent partial sum vectors (each 4 64-bit elements wide).  The amount of data re-use in the kernel depends on the shape of the register blocking as well as the size, with "square" shapes providing more re-use than "rectangular" shapes.  For Haswell, it looks like the only register blocking strategy that will work is 4x3 (x4), giving 12 (4-wide) partial sums.  So 12 of the 16 named AVX registers are locked up before we even start to try to re-use data.  For this register blocking strategy, there are four sets of 3 (4-element, contiguous) elements that can be loaded once and used 4 times each, so I really want those to be in registers as well.  This accounts for 15 of the 16 AVX register names, leaving only one register for everything else.   The final variable used is loaded in a transposed format, which is implemented by loading a 64-bit element and broadcasting it across the 4 positions of the 256-bit AVX register.  This requires using the last of the 16 register names, since there is no option to do a "load with broadcast" as an input argument to an arithmetic instruction.  The value that is loaded with broadcast is used three times in consecutive statements, then never re-used, so a single register can be used to hold the transposed data for all 48 FMAs in the register-blocked inner loop.

If my analysis is correct, this approach is the best way to get 2 FMAs per cycle while executing fewer than 2 loads per cycle (28 loads + several software prefetches in 24 cycles) and while using no more than 16 registers.   The 12 accumulators are loaded before the loop and saved afterward, so they don't impact the load bandwidth.  The 16 values that are loaded with broadcast have to use at least one register name, but any rearrangement of the code will cause them either to be loaded more than once (each) or to require more than one register name to hold more than one of these values at a time.   Similarly, the four sets of 3 (4-element, contiguous) elements that are loaded once and used 4 times (each) can only fit into 3 registers if the code is not rearranged.  Any significant rearrangement will require either that the data be loaded multiple times or that more than 3 AVX registers be used (up to 12, since that is the total number of these elements used).

I have gotten a lot of different results from different compiler versions and optimization levels, but none have actually used all 16 registers for this code, so all of them have significantly increased load/store traffic that ends up being the performance limiter.
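
For reference, here is a stripped-down sketch of the kernel shape described above (not my actual code: array layout, leading dimensions, and function name are placeholders, the software prefetches are omitted, and the k loop would be unrolled by 4 to get the 48 FMAs). The register budget is 12 accumulators + 3 B vectors + 1 broadcast = 16:

    #include <immintrin.h>

    void dgemm_4x3_kernel(const double* A, const double* B, double* C,
                          int K, int lda, int ldb, int ldc)
    {
        // 12 four-wide partial sums: 4 rows x 3 column-vectors of C
        __m256d c00 = _mm256_loadu_pd(&C[0*ldc+0]), c01 = _mm256_loadu_pd(&C[0*ldc+4]), c02 = _mm256_loadu_pd(&C[0*ldc+8]);
        __m256d c10 = _mm256_loadu_pd(&C[1*ldc+0]), c11 = _mm256_loadu_pd(&C[1*ldc+4]), c12 = _mm256_loadu_pd(&C[1*ldc+8]);
        __m256d c20 = _mm256_loadu_pd(&C[2*ldc+0]), c21 = _mm256_loadu_pd(&C[2*ldc+4]), c22 = _mm256_loadu_pd(&C[2*ldc+8]);
        __m256d c30 = _mm256_loadu_pd(&C[3*ldc+0]), c31 = _mm256_loadu_pd(&C[3*ldc+4]), c32 = _mm256_loadu_pd(&C[3*ldc+8]);

        for (int k = 0; k < K; ++k) {
            // 3 contiguous B vectors, each re-used 4 times (once per row)
            __m256d b0 = _mm256_loadu_pd(&B[k*ldb+0]);
            __m256d b1 = _mm256_loadu_pd(&B[k*ldb+4]);
            __m256d b2 = _mm256_loadu_pd(&B[k*ldb+8]);
            // one broadcast register, re-used for each of the 4 rows of A
            __m256d a = _mm256_broadcast_sd(&A[0*lda+k]);
            c00 = _mm256_fmadd_pd(a, b0, c00); c01 = _mm256_fmadd_pd(a, b1, c01); c02 = _mm256_fmadd_pd(a, b2, c02);
            a = _mm256_broadcast_sd(&A[1*lda+k]);
            c10 = _mm256_fmadd_pd(a, b0, c10); c11 = _mm256_fmadd_pd(a, b1, c11); c12 = _mm256_fmadd_pd(a, b2, c12);
            a = _mm256_broadcast_sd(&A[2*lda+k]);
            c20 = _mm256_fmadd_pd(a, b0, c20); c21 = _mm256_fmadd_pd(a, b1, c21); c22 = _mm256_fmadd_pd(a, b2, c22);
            a = _mm256_broadcast_sd(&A[3*lda+k]);
            c30 = _mm256_fmadd_pd(a, b0, c30); c31 = _mm256_fmadd_pd(a, b1, c31); c32 = _mm256_fmadd_pd(a, b2, c32);
        }

        // accumulators are loaded before the loop and stored after it
        _mm256_storeu_pd(&C[0*ldc+0], c00); _mm256_storeu_pd(&C[0*ldc+4], c01); _mm256_storeu_pd(&C[0*ldc+8], c02);
        _mm256_storeu_pd(&C[1*ldc+0], c10); _mm256_storeu_pd(&C[1*ldc+4], c11); _mm256_storeu_pd(&C[1*ldc+8], c12);
        _mm256_storeu_pd(&C[2*ldc+0], c20); _mm256_storeu_pd(&C[2*ldc+4], c21); _mm256_storeu_pd(&C[2*ldc+8], c22);
        _mm256_storeu_pd(&C[3*ldc+0], c30); _mm256_storeu_pd(&C[3*ldc+4], c31); _mm256_storeu_pd(&C[3*ldc+8], c32);
    }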

0 Kudos
al_p_
Beginner
967 Views

Thanks to you all for the comments.

Regarding using all the available registers, I had the same issues with gcc; that compiler also seems to refuse this kind of access. It usually performs what is apparently known in the field as "folding": merging a load into the memory operand of the arithmetic instruction instead of keeping the value in a named register. It's the cheapest kind of load the machine can give you, but it still hurts parallel SIMD execution when the value is re-used.

Clang is a little bit different and apparently generates better assembly; clang also annotates spills (with the number of bytes spilled) in its asm output, so when you generate assembly from your intrinsics you know whether you are going to spill just by looking at the asm file.

I have also noticed that for loops are not as trivial as they look: on Stack Overflow you can find people describing subtle bugs with loops and inlining, which means a for loop doesn't always stick to the "simple and obvious" semantics. I think I will just write my iterations out step by step.

The only thing I'm really curious about, for now, is prefetching; I would like to know whether prefetching is a good idea, at least for the functions that are used the least (and whose data is therefore unlikely to already be in cache).
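
The kind of thing I have in mind is _mm_prefetch (a minimal sketch, assuming sequential streaming and n a multiple of 4; the distance of 8 cache lines ahead is just a guess to be tuned):

    #include <immintrin.h>

    void square_all(const float* src, float* dst, int n)
    {
        for (int i = 0; i < n; i += 4) {
            _mm_prefetch((const char*)(src + i + 128), _MM_HINT_T0); // ~8 lines ahead
            __m128 v = _mm_loadu_ps(src + i);
            _mm_storeu_ps(dst + i, _mm_mul_ps(v, v));
        }
    }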

0 Kudos