As far as I can see from the preliminary documents, most of the extended instructions either operate only on the lower half (integer arithmetic, for example) or do the same thing on the two halves separately. To me it seems that what we are going to get is not double throughput (as the jump from MMX to SSE/SSE2 meant) but eight additional xmm registers, plus the trouble of managing them because they are tied to their lower halves. There are hardly any instructions that cross the boundary between the two halves. Shuffle the components? Not possible in one step. Want to add two sets of integers together? Not even possible in two steps! These operations could be carried out much more easily if those upper register parts were discrete xmm registers. The new instruction encoding, 3+ operands, and some of the floating-point instructions that take ymm arguments are nice, but I can't see the advantage of having 256-bit registers over twice the number of 128-bit registers when the instruction set is not extended well enough to support them.
Yes, the practical implications of some of these issues appear in need of resolution.
gcc uses only 128-bit wide instructions, with performance potential coming from the increased support of unaligned operations, the ability to perform 2 128-bit loads per cycle, avoidance of register to register copy instructions, and the later fused multiply-add.
Likewise, automatic translation of programs where high level source code isn't available seems likely to involve a lot of same data size copying.
It is acknowledged that few non-floating-point operations, such as memcpy(), will gain significantly from AVX on the initial implementation. Compiler generation of the vmovaps instruction with 256-bit operands will have to be avoided, except where 32-byte alignment can be assured. As long as the hardware splits memory operands into 128-bit chunks, vmovaps doesn't necessarily offer more performance.
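To illustrate the 32-byte alignment condition mentioned above, here is a minimal sketch of a run-time check that code could make before choosing an aligned 256-bit path. The helper name is_aligned_32 is hypothetical, and the pointer-to-integer cast assumes a flat address space, which holds on x86.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if p sits on a 32-byte boundary,
   which is what an aligned 256-bit load or store would require.
   Assumes uintptr_t preserves the low address bits (true on x86). */
static int is_aligned_32(const void *p)
{
    return ((uintptr_t)p & 31u) == 0;
}
```

When the check fails, the caller would fall back to unaligned moves or to a 128-bit path.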
C, C++, and Fortran compilers would be capable of generating both SSE2 and AVX code paths (Intel option /QaxAVX), but they will have a built-in decision to make on whether to generate a 256-bit version when it is possible to make a 128-bit SSE2 version. In view of possible continuing primary emphasis on 32-bit Windows, the decision may be to avoid generating additional code, so that full AVX performance would be available only in an AVX-only build.
The quoted expected gain for AVX over SSE2 in fully vectorized floating point code is typically 1.7 times throughput. For an application currently spending 50% of its time executing such code, Amdahl's law suggests roughly 25% overall speedup: 1 / (0.5 + 0.5/1.7) is about 1.26.
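For anyone who wants to plug in their own numbers, the Amdahl's law estimate can be sketched as a small function. The name amdahl_speedup and its parameters are only illustrative; the 1.7 factor and the 50% fraction are the figures quoted in the post.

```c
#include <assert.h>

/* Overall speedup when a fraction p of the run time (0..1)
   is accelerated by a factor s, per Amdahl's law. */
static double amdahl_speedup(double p, double s)
{
    assert(p >= 0.0 && p <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}
```

Calling amdahl_speedup(0.5, 1.7) gives a value close to 1.26, which is where the figure of roughly 25% comes from.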
Certainly, there are numerous advancements in the instruction set, but it's just not going to be useful for image processing without full-width integer arithmetic support. Color components only need 16-bit integers: 8 bits for the useful part, and the rest is enough for calculations. We can already process 8 x 16 bits in 128 bits; with AVX there is still room for only 8 floats, so there is no improvement in throughput, and there will be the problem of converting integers to floats and vice versa, because there is no extended pmovz* or packus*. I'm not the one who designs CPU architectures, just a "user" who tries to make use of them, and while trying I keep bumping into these difficulties.
It is true that the announced AVX extensions target floating point codes primarily, and allow improvement of throughput on many FP algorithms. Having two 128-bit parts with replicated behavior is a deliberate approach that allows a very efficient micro-architectural implementation; it primarily targets throughput improvement.
Using wider registers has numerous microarchitectural advantages. Increasing the number of registers does not translate to throughput improvement; it is extending the execution stack that does, and making registers and execution units wider is the most beneficial from a performance/power perspective.
We are definitely looking into the opportunity to extend AVX in the future to support 256-bit integer operations. We would appreciate your feedback showing examples of your code benefiting from the same kind of throughput improvement for 256-bit integer operations as for floating point in AVX.
Well, since you ask, here is a simple example: linear interpolation on 16 color components instead of 8. a + (b - a) * f becomes psubw, pmulhrsw, paddw. I tried to implement it with floats first, but the throughput was half with the same number of instructions (3).
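For readers who want to see what that three-instruction sequence computes, here is a scalar C model of one 128-bit step (8 lanes of 16 bits). It is only a sketch, not the poster's code: the function names are made up for illustration, f is assumed to be a single Q15 weight (0..32767 standing for 0..1) broadcast to all lanes, and the pmulhrsw model relies on arithmetic right shift of negative values, which C leaves implementation-defined but mainstream x86 compilers provide.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 8  /* 8 x 16-bit lanes in one 128-bit register */

/* psubw / paddw: 16-bit subtract and add with wraparound. */
static int16_t sub16(int16_t x, int16_t y)
{
    return (int16_t)(uint16_t)((uint16_t)x - (uint16_t)y);
}

static int16_t add16(int16_t x, int16_t y)
{
    return (int16_t)(uint16_t)((uint16_t)x + (uint16_t)y);
}

/* pmulhrsw: signed multiply, keep the high bits with rounding,
   i.e. ((x * y >> 14) + 1) >> 1. Assumes arithmetic right shift. */
static int16_t mulhrs16(int16_t x, int16_t y)
{
    int32_t p = (int32_t)x * (int32_t)y;
    p = (p >> 14) + 1;
    return (int16_t)(p >> 1);
}

/* out[i] = a[i] + (b[i] - a[i]) * f, with f in Q15 fixed point. */
static void lerp_q15(const int16_t a[LANES], const int16_t b[LANES],
                     int16_t f, int16_t out[LANES])
{
    assert(f >= 0);  /* weight is assumed to lie in 0..32767 */
    for (int i = 0; i < LANES; ++i) {
        int16_t d = sub16(b[i], a[i]);   /* psubw    */
        int16_t m = mulhrs16(d, f);      /* pmulhrsw */
        out[i] = add16(a[i], m);         /* paddw    */
    }
}
```

With f = 16384 (one half in Q15), blending 0 and 255 yields 128 in this model. A 256-bit integer version would run the same loop body over 16 lanes, which is the throughput gain being asked for.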