I'm working on a project and I'm looking for an answer to the following:
When I'm processing 256 bits of data on one core, is it better to use one whole YMMx register, or to split the data into 2 x 128 bits and process it in two XMMx registers on different ports, and hence on different SSE/AVX units (Sandy Bridge has 3 ports per core for AVX)? Which option is faster?
I would also be glad if any of you could point me to documents about SSE1-5/AVX/AVX2 that go into the details. I have many documents from this official site, and there are many non-official sites about these extensions, but I need more specific information (e.g. implementation limits, reducing code by using more operands, some examples of packed matrix calculations, and so on).
Although there are three execution ports that can handle AVX instructions, many of the instructions are limited to a specific port.
There is a very good overview of the details of AVX and SSE implementations in the "Intel 64 and IA-32 Architectures Optimization Reference Manual" (document 248966, Revision 030, September 2014). Section 2.1 covers the Haswell microarchitecture, with Figure 2-1 showing (at a high level) the port assignments. For comparison, Section 2.2 covers the Sandy Bridge microarchitecture, with Figure 2-4 showing the port assignments.
There are additional detailed discussions of the implementations in Agner Fog's excellent publications. The two that are most relevant are the microarchitecture writeup (http://www.agner.org/optimize/microarchitecture.pdf) and the instruction tables (http://www.agner.org/optimize/instruction_tables.pdf). The microarchitecture document provides overviews of how each processor works, while the instruction tables provide the latency, the throughput, and the specific ports used by each instruction.
The fastest way is to ensure 32-byte data alignment. Then there is no advantage in splitting into 128-bit chunks.
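To make the alignment point concrete, here is a minimal sketch in C of one way to get 32-byte-aligned buffers. It assumes a C11 library that provides `aligned_alloc` (glibc does; MSVC does not). The names `YMM_ALIGN`, `is_ymm_aligned`, and `alloc_ymm_floats` are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Width of one YMM register in bytes (256 bits). */
#define YMM_ALIGN 32u

/* Returns nonzero if p sits on a 32-byte boundary. */
static int is_ymm_aligned(const void *p) {
    return ((uintptr_t)p % YMM_ALIGN) == 0;
}

/* Hypothetical helper: allocate room for n floats on a 32-byte boundary.
 * C11 aligned_alloc wants the size to be a multiple of the alignment,
 * so the byte count is rounded up. Returns NULL on failure; release
 * the buffer with free(). */
static float *alloc_ymm_floats(size_t n) {
    size_t bytes = n * sizeof(float);
    size_t rounded = (bytes + YMM_ALIGN - 1) / YMM_ALIGN * YMM_ALIGN;
    if (rounded == 0) {
        rounded = YMM_ALIGN;
    }
    float *p = aligned_alloc(YMM_ALIGN, rounded);
    assert(p == NULL || is_ymm_aligned(p));
    return p;
}
```

For static or stack data, C11 `_Alignas(32)` on the array declaration should serve the same purpose. Note that an offset of 16 bytes into such a buffer is still 16-byte aligned (fine for XMM) but no longer 32-byte aligned.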
On Sandy Bridge, with misaligned data, there is a huge advantage in splitting the data.
On either Sandy Bridge or Ivy Bridge, there isn't much advantage in storing 256-bit chunks to memory, even with alignment.
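As a rough illustration of the two options, the sketch below adds eight floats either with whole 256-bit unaligned loads and stores, or with each 256-bit access split into two 128-bit halves. The function names `add8_whole` and `add8_split` are invented for this example, and the `target("avx")` attribute assumes GCC or Clang on x86 (otherwise build with AVX enabled, e.g. `-mavx`). Both variants compute the same result; which one is faster on a given CPU is something to measure, not something the code decides.

```c
#include <immintrin.h>

/* Whole-register variant: one 256-bit unaligned load per input,
 * one 256-bit unaligned store for the result. */
__attribute__((target("avx")))
void add8_whole(const float *a, const float *b, float *out) {
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    _mm256_storeu_ps(out, _mm256_add_ps(va, vb));
}

/* Split variant: every 256-bit memory access becomes two 128-bit
 * accesses. The arithmetic itself still uses a full YMM register. */
__attribute__((target("avx")))
void add8_split(const float *a, const float *b, float *out) {
    /* Low half goes in via a cast, high half via insertf128. */
    __m256 va = _mm256_insertf128_ps(
        _mm256_castps128_ps256(_mm_loadu_ps(a)), _mm_loadu_ps(a + 4), 1);
    __m256 vb = _mm256_insertf128_ps(
        _mm256_castps128_ps256(_mm_loadu_ps(b)), _mm_loadu_ps(b + 4), 1);

    __m256 sum = _mm256_add_ps(va, vb);

    /* Store the low and high 128-bit halves separately. */
    _mm_storeu_ps(out, _mm256_castps256_ps128(sum));
    _mm_storeu_ps(out + 4, _mm256_extractf128_ps(sum, 1));
}
```

Calling either function requires a CPU with AVX; with GCC or Clang, `__builtin_cpu_supports("avx")` can be used as a runtime check before the call.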
As the earlier responses indicated, if you are covering all the CPU variants and don't rely on compilers to take care of it, you must read the detailed documents for each.