Processing of data in SSE/AVX/AVX2

Samuel_Š_ · ‎10-30-2014

Hello!

I'm working on my project and I'm looking for an answer to this question:

When I'm processing 256 bits of data on one core, is it better to use one whole YMMx register, or to split the data into 2x128 bits and process them in 2 XMMx registers on different ports, hence on different SSE/AVX units (in Sandy Bridge there are 3 ports per core for AVX)? Which option is faster?

I would also be glad if any of you could point me to documents about SSE1-5/AVX/AVX2 that go into detail. I have many documents from this official site, and there are many unofficial sites about these extensions, but I need more specific information (e.g. implementation limits, reducing code size by using more operands, some examples of packed matrix calculations, and so on).

Thank you!

TimP · ‎10-30-2014

For moves to and from memory, a single AVX256 access is preferred if the data are expected to be aligned, although the hardware actually splits moves to memory on Sandy and Ivy Bridge. Unaligned moves on Sandy Bridge are much faster when split into 128-bit move instructions. The performance difference is smaller on Ivy Bridge, so as to improve performance of applications written with AVX256 unaligned moves, but compilers still choose to split them. For Haswell, AVX256 is generally preferred.

Most documentation I've seen on this point is somewhat stale; probably not many developers are still targeting Sandy Bridge with low level code, since the compilers do a good job of making the adjustments. Presumably, there have been compiler writers' guides which have been closely held.

SSE5 was never proposed for Intel CPUs. Intel compilers no longer offer the ability to target original SSE (Pentium III compatible CPUs), which I guess is what you mean by SSE1. Compilers which target SSE2 are likely to split unaligned 128-bit moves. This sometimes offered an advantage on CPUs as recent as Westmere, although this fact was in conflict with public documentation.

A somewhat related obscure question is the choice of instructions such as palignr for the case where data are accessed both with and without alignment. The palignr choice seems to be preferred for SSE4.1 but could offer an advantage on Westmere as well.

Current Intel compilers offer a pragma for the programmer to specify a preference between code with peeling for alignment or not.

McCalpinJohn · ‎10-30-2014

Unfortunately the performance depends on at least 5 factors that I can think of immediately:

1. ISA: AVX vs SSE
2. Alignment: 32 Byte vs 16 Byte vs anything else (3 choices)
3. Data location: L1 cache vs L2 cache vs L3 cache vs Memory (4 choices)
4. Direction: Loads vs Stores (2 choices)
5. Processor Model: Sandy Bridge/Ivy Bridge vs Haswell (2 choices)

So assuming you want to compare the two options in 1, it is easy to come up with 3*4*2*2 = 48 different cases for items 2-5.
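As a rough sketch of how that test matrix could be enumerated (the labels below are informal shorthand chosen for illustration, not names from any Intel tool):

```python
from itertools import product

# Factors 2-5 from the list above; the string labels are illustrative only.
alignments = ["32-byte", "16-byte", "other"]
locations = ["L1", "L2", "L3", "memory"]
directions = ["load", "store"]
processors = ["Sandy/Ivy Bridge", "Haswell"]

# Every combination of factors 2-5.
cases = list(product(alignments, locations, directions, processors))
print(len(cases))  # 3 * 4 * 2 * 2 = 48

# Each case is measured once per ISA choice (factor 1).
isas = ["SSE", "AVX"]
runs = [(isa, *case) for isa in isas for case in cases]
print(len(runs))  # 2 * 48 = 96
```

Each tuple in `runs` then identifies one benchmark configuration to time.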

There are plenty of counter-intuitive results when comparing performance across these 48 cases, making it challenging to come up with a simple set of guidelines. The more you can narrow the scope of the question, the easier it is to summarize the results.

bronxzv · ‎10-30-2014

Samuel Š. wrote:

When I'm processing 256 bits of data on one core, is it better to use one whole YMMx register, or to split the data into 2x128 bits and process them in 2 XMMx registers on different ports, hence on different SSE/AVX units (in Sandy Bridge there are 3 ports per core for AVX)? Which option is faster?

ignoring the subtleties of unaligned loads/stores addressed by Tim, processing 256-bit of data at once (i.e. using VEX.256 AVX/AVX2) is generally faster than using 128-bit instructions, it's the most important point to motivate an AVX port

note that the number of issue ports is the same for 128-bit and 256-bit code (*1), for example you can issue a 256-bit VMULPS to one port and a 256-bit VADDPS to another port in the same clock (or even two 256-bit FMA with AVX2 under Haswell) just like you do with the 128-bit variant

the take home message is that the theoretical peak throughput is doubled with 256-bit instructions vs. their 128-bit variant, actual speedups aren't 2x but you can count on 1.5x or more provided that your code isn't memory constrained (i.e. it is L1D cache friendly)

*1: speaking of Intel products only

Samuel_Š_ · ‎10-30-2014

Thank you all!

@John - I'm glad you listed all these cases, because I will probably have to test all of them within my project, so now I know how to combine these factors. I expect AVX2 (with FMA) to be faster than AVX with 32-byte data, but I'm not sure whether AVX and AVX2 will be faster than SSE with 16-byte data.

@bronxzv - I'm not sure I understand your 2nd paragraph. Let's say I have 256-bit code and I want to split it into 2x128-bit code.

1) What will Haswell architecture do?

2) What will Sandy/Ivy Bridge do?

Here is a simple image: " http://oi59.tinypic.com/14e27wh.jpg ". So the questions for 1) and 2) are: How will this 2x128-bit code be processed? Will it go through port 0 only (the whole 256 bits), or port 1 only, or will it be split between port 1 and port 0?

bronxzv · ‎10-30-2014

Samuel Š. wrote:

@bronxzv - I'm not sure I understand your 2nd paragraph. Let's say I have 256-bit code and I want to split it into 2x128-bit code.

1) What will Haswell architecture do?

2) What will Sandy/Ivy Bridge do?

Here is a simple image: " http://oi59.tinypic.com/14e27wh.jpg ". So the questions for 1) and 2) are: How will this 2x128-bit code be processed? Will it go through port 0 only (the whole 256 bits), or port 1 only, or will it be split between port 1 and port 0?

example for Sandy Bridge, Ivy Bridge and Haswell:

if you have some code with balanced VMULPS/VADDPS the peak issue rate is one VMULPS on port 0 and one VADDPS on port 1 in the same clock cycle, this is true for the 256-bit and 128-bit variants of the instructions

if you split it in 2x128-bit it will serialize one 128-bit VMULPS after the other on port 0, and the same for the split VADDPS on port 1 (i.e. the high 128-bit "lane" will be unused in port 0 and 1)

so peak throughput is doubled with 256-bit vs 128-bit (ignoring load/store)
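a back-of-the-envelope model of that argument, as a sketch only: it assumes one multiply port and one add port, each accepting one instruction per cycle whatever the vector width, and it ignores loads, stores and dependencies (the function name is made up for illustration)

```python
def peak_cycles(n_floats, vector_bits):
    """Cycles to issue n_floats multiplies and n_floats adds at peak rate.

    Assumption: one multiply port and one add port, each taking one
    instruction per cycle for both 128-bit and 256-bit instructions.
    """
    lanes = vector_bits // 32             # single-precision floats per register
    instructions = -(-n_floats // lanes)  # ceiling division
    # Multiplies and adds issue in parallel on separate ports, so the
    # cycle count equals the instruction count on either port.
    return instructions


n = 1024
c128 = peak_cycles(n, 128)
c256 = peak_cycles(n, 256)
print(c128, c256, c128 / c256)  # 256 128 2.0
```

the 2x ratio is the theoretical peak only, real code will see less once memory traffic is included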

Bernard · ‎10-31-2014

In addition to the issues addressed by @bronxzv, you must take into account that at the HW level, in the SIMD execution stack, the FP adder and multiplier units are pipelined, which itself *dictates the instruction reciprocal throughput.

* Of course there are additional factors, such as pipeline stalls.

bronxzv · ‎10-31-2014

iliyapolak wrote:

you must take into account that at the HW level, in the SIMD execution stack, the FP adder and multiplier units are pipelined, which itself dictates the instruction reciprocal throughput.

the number of execution pipeline stages (*1) is the same when executing 256-bit and 128-bit instructions (the very few exceptions include VSQRTPx,VDIVPx) so I don't think it's something to take into account for selecting 128-bit code vs. 256-bit code as the OP is asking

*1: see the latency values in the Intrinsics Guide (https://software.intel.com/sites/landingpage/IntrinsicsGuide/)

Bernard · ‎10-31-2014

>>>the number of execution pipeline stages (*1) is the same when executing 256-bit and 128-bit instructions (the very few exceptions include VSQRTPx,VDIVPx)>>>

I know that; I wrote only about the execution stage of the pipeline. As far as I was able to understand, mainly by going through various Intel patent docs, execution units like the adder and multiplier also have a pipelined design. I agree that this is more relevant to the various uops being scheduled for execution and waiting for their operands to be processed by the execution units.

Regarding VSQRTPx and VDIVPx the different execution stages could be due to microcode assists.

bronxzv · ‎10-31-2014

iliyapolak wrote:

I agree that this is more relevant to the various uops being scheduled for execution and waiting for their operands to be processed by the execution units.

trying to make it clearer for the OP: pipelining means you do not have to wait for previous instructions to complete execution before starting the execution of new ones, for example you can have 5 distinct VMULPS instructions executing together on one FMUL unit, each instruction at a different stage of this particular pipeline

the throughput value in the Intrinsics Guide tells you how many cycles must pass before another instruction of the same type may enter the proper execution pipeline, this is by far the most important metric when doing instruction selection

the number of pipeline stages (aka latency) is important if there are dependencies between several instructions in the critical path, and this is less of an issue with 2 threads running on the same core since the "apparent latency" is halved thanks to hyperthreading
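a small sketch of the difference between the two metrics, assuming a latency of 5 cycles and a reciprocal throughput of 1 cycle (values picked to match the 5-stage VMULPS example above, check the Intrinsics Guide for the actual numbers on your target); the function names are made up for illustration

```python
def independent_cycles(n, latency, recip_throughput):
    """n independent instructions streaming through one pipelined unit.

    A new instruction enters every recip_throughput cycles; the last one
    still needs the full latency to produce its result.
    """
    return latency + (n - 1) * recip_throughput


def dependent_cycles(n, latency):
    """n instructions where each one needs the result of the previous one."""
    return n * latency


LATENCY = 5  # assumed pipeline depth in cycles
RECIP = 1    # assumed reciprocal throughput in cycles

print(independent_cycles(100, LATENCY, RECIP))  # 104
print(dependent_cycles(100, LATENCY))           # 500
```

with independent instructions the cost per instruction approaches the reciprocal throughput, with a dependency chain it is the latency, which is why unrolling with several accumulators helps in reductions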

iliyapolak wrote:

Regarding VSQRTPx and VDIVPx the different execution stages could be due to microcode assists.

these are only partially pipelined (improved in recent cores, even more on Broadwell) but are executed by 128-bit units, thus the better throughput (IPC, not FLOPS) for the 128-bit variants

McCalpinJohn · ‎10-31-2014

Note that on Sandy Bridge and Ivy Bridge the 256-bit loads take two cycles on one port (either port 2 or port 3), while 128-bit loads take one cycle on one port (again, either port 2 or port 3). So these have the same throughput, but the 128-bit versions are a bit more flexible in scheduling.

256-bit stores take one cycle on either port 2 or 3 to generate the address, and then two cycles on port 4 to store the data. 128-bit stores take one cycle on either port 2 or 3 to generate the address, then one cycle on port 4 to store the data.

This means that you cannot execute two 128-bit loads and one 128-bit store per cycle using 128-bit (or smaller) memory references, because that would need three addresses per cycle, and there are only two address generation units (ports 2 and 3). Using 256-bit memory references allows full speed because the data transfers take 2 cycles, and the two address generation units can easily produce three addresses in two cycles.
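The arithmetic behind that claim can be sketched with a deliberately simplified model. It assumes, as a reading of the description above, that each access needs one address slot on port 2 or 3, that each load port moves 16 bytes per cycle, and that port 4 stores 16 bytes per cycle; the function and constant names are hypothetical, and real hardware has more constraints than this.

```python
from fractions import Fraction

# Simplified Sandy Bridge / Ivy Bridge assumptions (see lead-in).
AGU_PORTS = 2               # ports 2 and 3 generate addresses
LOAD_PORTS = 2              # the same ports 2 and 3 carry load data
LOAD_BYTES_PER_PORT = 16    # bytes per cycle per load port
STORE_BYTES_PER_CYCLE = 16  # bytes per cycle on port 4


def bytes_per_cycle(width_bytes, loads=2, stores=1):
    """Sustained bytes/cycle for a repeating group of loads and stores."""
    # One address slot per access, shared between the two AGU ports.
    address_cycles = Fraction(loads + stores, AGU_PORTS)
    # Cycles the load ports and the store port are busy moving data.
    load_cycles = Fraction(loads * width_bytes, LOAD_PORTS * LOAD_BYTES_PER_PORT)
    store_cycles = Fraction(stores * width_bytes, STORE_BYTES_PER_CYCLE)
    # The slowest resource sets the time per group.
    group_cycles = max(address_cycles, load_cycles, store_cycles)
    return (loads + stores) * width_bytes / group_cycles


print(bytes_per_cycle(16))  # 32: address generation is the bottleneck
print(bytes_per_cycle(32))  # 48: data transfer is the bottleneck
```

In this model the 128-bit version is limited by address generation (3 addresses needed per 1.5 cycles), while the 256-bit version spreads 3 addresses over 2 cycles and reaches the full 48 bytes per cycle.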

Bernard · ‎10-31-2014

>>>trying to make it clearer for the OP: pipelining allows to not wait for previous instructions complete execution before starting the execution of new ones, for example you can have 5 distinct VMULPS instructions executing together on one FMUL unit, each instruction at a different stage of this particular pipeline>>>

So at the lowest level, probably a Cuop (not sure about the exact name; there was also an Auop) will be loaded into currently available physical registers with its decomposed floating-point operands (sign, exponent, mantissa), waiting for the control signal to be transferred over the muxed 256-bit-wide data path to the FMUL unit, where the floating-point addition will be performed, and the result will be "seen" by the software in the architectural registers. Of course this description is simplified.

bronxzv · ‎10-31-2014

iliyapolak wrote:

So at the lowest level, probably a Cuop (not sure about the exact name; there was also an Auop) will be loaded into currently available physical registers with its decomposed floating-point operands (sign, exponent, mantissa), waiting for the control signal to be transferred over the muxed 256-bit-wide data path to the FMUL unit, where the floating-point addition will be performed, and the result will be "seen" by the software in the architectural registers. Of course this description is simplified.

you can find an example of an actual pipelined 32-bit FMA FPU in "A Fully-Pipelined Single-Precision Floating Point Unit in the Synergistic Processor Element of a CELL Processor", available here: http://www.oocities.org/de/christian_jacobi/publications/omj05_cellfpu.pdf

this particular implementation is 6-stage deep, from the overall diagram (Fig. 1) and the physical layout (Fig. 7) one can easily figure how the operands flow through the pipeline for the 4 SIMD lanes

Bernard · ‎10-31-2014

For OP: link to description of SIMD unit https://www.google.com/patents/US6694426

Bernard · ‎10-31-2014

@bronxzv

Thanks for the link.

Samuel_Š_ · ‎11-04-2014

Thank you all for comments and for the documents!

@John D. McCalpin - Thank you for your comments. Can you point me to any document where I can find what you said? I will have to cite the documents I used in my project, so I need official documents.

McCalpinJohn · ‎11-05-2014

Most of this information is contained (at least implicitly) in the "Intel 64 and IA-32 Architectures Optimization Reference Manual" (Intel document 248966, revision 030, September 2014). Chapter 2 contains the important microarchitectural features for various Intel processor cores, with section 2.2 describing the Sandy Bridge implementation.

Another set of references that are extremely useful for understanding processor implementation and performance are from Agner Fog.

http://www.agner.org/optimize/microarchitecture.pdf -- describes the microarchitecture of almost all Intel, AMD, and VIA x86 microprocessors. I found it very helpful to read about the microarchitecture of Intel processors in historical order, starting with the simpler designs and adding complexity with each generation.
http://www.agner.org/optimize/instruction_tables.pdf -- includes tables of how x86 instructions map to micro-ops on the various ports and lists the latency and throughput of each instruction. Like the microarchitecture document, it covers almost all Intel, AMD, and VIA x86 processors. Some of this information is contained in Appendix C of the Intel Optimization Reference manual, but Agner Fog's documentation is more complete and includes a lot more historical information.

Samuel_Š_ · ‎11-05-2014

Thank you John!

Bernard · ‎11-05-2014

@Samuel

You can also browse the Google Patents repository; there is plenty of technically advanced information there related to the implementation of CPU mechanisms.

Samuel_Š_ · ‎11-05-2014

All right, thank you @iliyapolak!

Bernard · ‎11-08-2014

@Samuel

You are welcome.