I am writing something about SSE and AVX. For AVX there are quite good documents here on the site, but what about SSE? It would be nice to have something that shows which kinds of instructions each SSE version contributed. Not every instruction needs to be described; a theoretical overview that names the groups of related instructions would be enough.
// EDIT: For SSE4 I found: http://software.intel.com/sites/default/files/m/9/4/2/d/5/17971-intel_20sse4_20programming_20referen...
Thanks in advance!
It all started with MMX really. It offers 64-bit integer vector operations. It reuses the x87 register stack, so you can't freely mix MMX code with x87 floating-point instructions (you need an EMMS when switching between them).
SSE then added 128-bit floating-point vector operations (and a couple more MMX instructions). It uses a separate register set.
SSE2 added 128-bit versions of the MMX operations, making MMX largely obsolete.
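To make the SSE2 point a bit more concrete, here is a minimal sketch using compiler intrinsics. The function name `add_i16x8_saturate` is made up for this illustration, and it assumes an x86-64 target, where SSE2 is part of the baseline. It adds eight 16-bit integers in one instruction, the 128-bit counterpart of MMX's four-lane PADDSW:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: adds eight signed 16-bit integers at once.
 * Results that would overflow are clamped (saturated) to the int16 range. */
void add_i16x8_saturate(const int16_t *a, const int16_t *b, int16_t *out)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);  /* unaligned 128-bit load */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i sum = _mm_adds_epi16(va, vb);              /* PADDSW on an XMM register */
    _mm_storeu_si128((__m128i *)out, sum);
}
```

Unlike the MMX version, this works on the XMM registers, so no EMMS is needed afterwards.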
SSE3 mainly added 'horizontal' floating-point vector operations, useful for complex numbers. Somewhat later, with the Core 2 architecture (which also introduced SSSE3), Intel switched from 64-bit execution units (which needed 2 cycles to process the 128-bit operations) to full-width 128-bit execution units.
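As an illustration of why SSE3 helps with complex numbers, here is a sketch of a packed complex multiply. The function name `complex_mul2` and the interleaved `[re, im, re, im]` layout are assumptions of this example, and the `target` attribute is GCC/Clang specific (compiling with `-msse3` would do the same job). It relies on the SSE3 instructions MOVSLDUP, MOVSHDUP and ADDSUBPS:

```c
#include <pmmintrin.h>  /* SSE3 intrinsics */

/* Hypothetical helper: multiplies two pairs of complex numbers.
 * Each array holds [re0, im0, re1, im1]. */
__attribute__((target("sse3")))
void complex_mul2(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);

    __m128 b_re = _mm_moveldup_ps(vb);  /* [br0, br0, br1, br1] */
    __m128 b_im = _mm_movehdup_ps(vb);  /* [bi0, bi0, bi1, bi1] */
    /* Swap real and imaginary parts of a: [ai0, ar0, ai1, ar1] */
    __m128 a_sw = _mm_shuffle_ps(va, va, _MM_SHUFFLE(2, 3, 0, 1));

    __m128 t1 = _mm_mul_ps(va, b_re);   /* [ar*br, ai*br, ...] */
    __m128 t2 = _mm_mul_ps(a_sw, b_im); /* [ai*bi, ar*bi, ...] */

    /* ADDSUBPS subtracts in even lanes and adds in odd lanes:
     * [ar*br - ai*bi, ai*br + ar*bi, ...] */
    _mm_storeu_ps(out, _mm_addsub_ps(t1, t2));
}
```

Without ADDSUBPS you would need an extra sign-flip (for example an XOR with a sign mask) before the final add.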
SSSE3 mainly added horizontal integer vector operations, and a generic byte shuffling instruction.
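The generic byte shuffle (PSHUFB) is probably the most widely used SSSE3 instruction. A minimal sketch, assuming GCC or Clang for the `target` attribute; the function name `reverse_bytes16` is invented for this example. Each byte of the index vector selects which source byte ends up in that position:

```c
#include <tmmintrin.h>  /* SSSE3 intrinsics */
#include <stdint.h>

/* Hypothetical helper: reverses the order of 16 bytes with one PSHUFB. */
__attribute__((target("ssse3")))
void reverse_bytes16(const uint8_t *in, uint8_t *out)
{
    /* Output byte i takes input byte 15 - i. */
    const __m128i idx = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8,
                                      7, 6, 5, 4, 3, 2, 1, 0);
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)out, _mm_shuffle_epi8(v, idx));
}
```

Because the index vector is an ordinary register, the same instruction can do arbitrary permutations, small table lookups, endianness swaps and so on.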
SSE4.1 added blend, min/max, rounding, sign/zero extension, a few instructions that filled some long-standing 'gaps', and a couple of instructions aimed at video processing.
SSE4a is a small set of four AMD-specific instructions (EXTRQ, INSERTQ, MOVNTSD, MOVNTSS); AMD did not support SSE4.1 at that time.
SSE4.2 added string processing instructions, plus CRC32 and a 64-bit integer compare.
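Two of the SSE4.1 groups mentioned here, sign extension and the 32-bit integer min/max, can be shown in a few lines. This is only a sketch: the function name `widen_and_clamp` is made up, and the `target` attribute assumes GCC or Clang (otherwise compile with `-msse4.1`).

```c
#include <smmintrin.h>  /* SSE4.1 intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: sign-extends four int8 values to int32 and
 * clamps each of them to the range [lo, hi]. */
__attribute__((target("sse4.1")))
void widen_and_clamp(const int8_t *in, int32_t lo, int32_t hi, int32_t *out)
{
    int32_t packed;
    memcpy(&packed, in, sizeof packed);          /* grab 4 bytes safely */
    __m128i bytes = _mm_cvtsi32_si128(packed);   /* bytes in the low 32 bits */
    __m128i wide = _mm_cvtepi8_epi32(bytes);     /* PMOVSXBD: sign extension */
    wide = _mm_max_epi32(wide, _mm_set1_epi32(lo));  /* PMAXSD */
    wide = _mm_min_epi32(wide, _mm_set1_epi32(hi));  /* PMINSD */
    _mm_storeu_si128((__m128i *)out, wide);
}
```

Before SSE4.1, both steps needed multi-instruction workarounds (unpack plus shift for the sign extension, compare plus mask for the 32-bit min/max).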
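For a feel of what the SSE4.2 string instructions do, here is a small sketch built on PCMPISTRI. The function name `first_vowel_index` is invented, it only looks at the first 15 characters of the input, and the `target` attribute assumes GCC or Clang (or use `-msse4.2`). It returns the position of the first character that belongs to a given set, or 16 when there is none:

```c
#include <nmmintrin.h>  /* SSE4.2 intrinsics */
#include <string.h>

/* Hypothetical helper: index of the first vowel within the first
 * 15 characters of s, or 16 if no vowel is found. */
__attribute__((target("sse4.2")))
int first_vowel_index(const char *s)
{
    char set_buf[16] = "aeiou";   /* character set, zero padded */
    char str_buf[16] = {0};       /* zero-terminated copy of the input */
    strncpy(str_buf, s, 15);

    __m128i set = _mm_loadu_si128((const __m128i *)set_buf);
    __m128i str = _mm_loadu_si128((const __m128i *)str_buf);

    /* "Equal any": a byte of str matches if it equals any byte of set.
     * Both operands are treated as implicit-length (NUL-terminated). */
    return _mm_cmpistri(set, str,
                        _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY |
                        _SIDD_LEAST_SIGNIFICANT);
}
```

The same instruction family also offers range matching, substring search and mask output, selected through the immediate operand.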
AVX extends the SSE registers to 256-bit, and offers 256-bit floating-point operations. Integer operations are still limited to 128-bit. Also, Sandy/Ivy Bridge lack the cache bandwidth to double the throughput in practice. AVX also introduced a new, highly extensible instruction encoding format called VEX.
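A minimal sketch of the 256-bit floating-point operations, processing eight floats per iteration. The function name `add_arrays_avx` is made up for this example, and the `target` attribute assumes GCC or Clang (compiling with `-mavx` is the alternative). The CPU and OS must support AVX at run time.

```c
#include <immintrin.h>  /* AVX intrinsics */
#include <stddef.h>

/* Hypothetical helper: out[i] = a[i] + b[i] for n floats. */
__attribute__((target("avx")))
void add_arrays_avx(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
    /* Main loop: eight floats per 256-bit YMM register. */
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
    /* Scalar tail for the remaining 0..7 elements. */
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

A loop this simple is limited by loads and stores rather than arithmetic, which is exactly where the cache bandwidth remark above applies.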
Last but not least, AVX2 is a massive leap forward. It offers 256-bit integer operations, fused multiply-add (strictly speaking a separate FMA extension that arrives alongside it), and to top it off, gather support! The Haswell architecture that will introduce it doubles the cache bandwidth, and it even adds more execution ports to free up the vector execution ports and improve Hyper-Threading performance.
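Gather is easiest to understand from an example: it loads eight elements from arbitrary indices in one instruction. The sketch below is only illustrative; the function name `gather_f32x8` is invented, and the `target` attribute assumes GCC or Clang (or compile with `-mavx2`). It needs an AVX2-capable CPU at run time.

```c
#include <immintrin.h>  /* AVX2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: out[i] = table[indices[i]] for i = 0..7. */
__attribute__((target("avx2")))
void gather_f32x8(const float *table, const int32_t *indices, float *out)
{
    __m256i idx = _mm256_loadu_si256((const __m256i *)indices);
    /* Scale 4 = sizeof(float): indices count elements, not bytes. */
    __m256 v = _mm256_i32gather_ps(table, idx, 4);
    _mm256_storeu_ps(out, v);
}
```

Before AVX2 the same thing took eight scalar loads plus insert instructions, which made table lookups hard to vectorize.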
There is no information on what comes after AVX2, but the Xeon Phi MIC uses 512-bit vector instructions which use an encoding format called MVEX, which might be a clue. I also have some suggestions of my own.
If you need to check the availability of particular instructions, the Intel Intrinsics Guide is a great reference, organized by SSE/AVX version.
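Availability can also be checked at run time. A small sketch, assuming GCC or Clang on x86, whose `__builtin_cpu_supports` builtin wraps the CPUID checks; the function name `best_simd_level` is made up for this example:

```c
#include <stddef.h>

/* Hypothetical helper: returns the name of the newest extension
 * (of those discussed in this thread) that the running CPU reports. */
const char *best_simd_level(void)
{
    __builtin_cpu_init();  /* harmless if already initialized */
    if (__builtin_cpu_supports("avx2"))   return "AVX2";
    if (__builtin_cpu_supports("avx"))    return "AVX";
    if (__builtin_cpu_supports("sse4.2")) return "SSE4.2";
    if (__builtin_cpu_supports("sse4.1")) return "SSE4.1";
    if (__builtin_cpu_supports("ssse3"))  return "SSSE3";
    if (__builtin_cpu_supports("sse3"))   return "SSE3";
    if (__builtin_cpu_supports("sse2"))   return "SSE2";
    if (__builtin_cpu_supports("sse"))    return "SSE";
    if (__builtin_cpu_supports("mmx"))    return "MMX";
    return "none";
}
```

A typical use is to pick between several implementations of a hot function once at startup, based on the returned level.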