we have already included a new path for Haswell targets (using FMA and AVX2 instructions).
Thanks to the early support in the Intel compiler and the SDE, I was able to port and validate the code using FMA and the 256-bit packed integer instructions very quickly. A cool feature of the Intel C++ compiler is that legacy code using MUL + ADD intrinsics (such as _mm256_mul_ps / _mm256_add_ps) is compiled to FMA instructions wherever possible when built with the "/QxCORE-AVX2" flag. It's a great time saver, and we can keep exactly the same source code for all paths (legacy SSE & AVX and the new FMA + AVX2). Also, since we use wrapper classes around intrinsics, the source code is still very readable. For example,
res = a*x + b*y +c;
is far more readable than if we had to introduce FMA functions such as
res = madd(a,x,madd(b,y,c));
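To make the wrapper idea concrete, here is a minimal sketch of the two styles side by side. It is not our actual code: Vec8f, blend_operators and blend_madd are hypothetical names, and a plain array of eight floats stands in for __m256 so that the sketch builds without any AVX flags. In a real wrapper the operators would forward to _mm256_mul_ps / _mm256_add_ps, and an explicit madd would forward to an FMA intrinsic such as _mm256_fmadd_ps.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

// Hypothetical stand-in for a wrapper around __m256: eight float lanes.
// Assumption: a real wrapper would hold a __m256 member instead.
struct Vec8f {
    std::array<float, 8> lane{};
    Vec8f() = default;
    explicit Vec8f(float v) { lane.fill(v); }  // broadcast one value to all lanes
};

// Lane-wise multiply (a real wrapper would call _mm256_mul_ps here).
inline Vec8f operator*(const Vec8f& a, const Vec8f& b) {
    Vec8f r;
    for (std::size_t i = 0; i < 8; ++i) r.lane[i] = a.lane[i] * b.lane[i];
    return r;
}

// Lane-wise add (a real wrapper would call _mm256_add_ps here).
inline Vec8f operator+(const Vec8f& a, const Vec8f& b) {
    Vec8f r;
    for (std::size_t i = 0; i < 8; ++i) r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

// Explicit fused multiply-add, a*b + c per lane, rounded once by std::fma.
inline Vec8f madd(const Vec8f& a, const Vec8f& b, const Vec8f& c) {
    Vec8f r;
    for (std::size_t i = 0; i < 8; ++i)
        r.lane[i] = std::fma(a.lane[i], b.lane[i], c.lane[i]);
    return r;
}

// Operator style: reads like the math, and the compiler decides where to fuse.
inline Vec8f blend_operators(const Vec8f& a, const Vec8f& x,
                             const Vec8f& b, const Vec8f& y, const Vec8f& c) {
    return a * x + b * y + c;
}

// Explicit FMA style: same expression, with the fusion spelled out by hand.
inline Vec8f blend_madd(const Vec8f& a, const Vec8f& x,
                        const Vec8f& b, const Vec8f& y, const Vec8f& c) {
    return madd(a, x, madd(b, y, c));
}
```

Both functions compute the same expression. Since an FMA rounds only once, the two can differ in the last bits for general inputs, but they agree exactly whenever every intermediate value is representable as a float.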
More optimization opportunities are still there, for example using any-to-any permute and gather. I suppose I'll wait for the real chips for these.