It looks like when your slides say "Measured CPU cycles" they are in fact just theoretical IACA values, aren't they?
For example, Slide 23 claims a 2x speedup for AVX vs SSE SAXPY (0.6 / 0.3), which is impossible to get on actual SNB hardware even with a 100% L1D$ hit rate (due to the 32B/cycle L1D load bandwidth limitation, among other things).
0.3 * 8 is a bit more than 2 cycles per 8-element AVX iteration (which was the IACA prediction).
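For reference, this is the kernel being measured. A minimal scalar sketch (the function name and signature are mine, not from the slides; the SSE/AVX variants vectorize this same loop):

```c
#include <stddef.h>

/* Reference SAXPY kernel: y[i] += a * x[i].
   The speedup numbers above are cycles per element for the
   vectorized (SSE 4-wide / AVX 8-wide) versions of this loop. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```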
So in two cycles, with two 128-bit load ports, I can load four 128-bit registers, i.e. 16 floating-point numbers: that's 8 for x and 8 for y. Given that I should be able to multiply 8 and add 8 in one cycle, it still feels unfortunate to me that we're only using half of the available flops here.
The way it was compiled and the way it ran, one of the ports was issuing 2 uops per cycle and was hence the bottleneck. No doubt the 0.6 could have been lower (with custom coding effort or perhaps better compilation). I mention in an earlier (serial) slide, before the SIMD section, the loop unroll that brings the serial loop iteration time to less than 2 cycles; with one unroll it's 3-ish cycles (or 1.6 per element).
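The unroll mentioned above might look like this in scalar C. This is a hypothetical sketch, not the talk's actual listing; processing two elements per iteration halves the loop overhead, which is how one unroll gets the serial version to roughly 1.6 cycles per element:

```c
#include <stddef.h>

/* 2-way unrolled scalar SAXPY (sketch). Two elements are handled
   per loop iteration, amortizing the increment/compare/branch
   overhead across both. */
void saxpy_unroll2(size_t n, float a, const float *x, float *y)
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {   /* two elements per iteration */
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
    }
    if (i < n)                     /* remainder element for odd n */
        y[i] += a * x[i];
}
```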
This talk was not trying to give people any 2x expectations for AVX-256 over AVX-128 on SNB. In fact, I verbally dismissed SAXPY as a cookie-cutter example and not something you can make a game about. In the concrete examples I was able to squeeze in, I mention about 70% on vanilla Mv (constant M), but only 40% on a (single-bone) skinning sample, explaining that the ports are 128-bit and load-heavy computations can't expect as significant a benefit.
I don't know if you watched the video or just glanced over the slides. It sounds like you might already be well beyond the skill level of the intended audience. :)
Thanks for the quick feedback. Indeed, it definitely looks like the 128-bit variant can be optimized further; in my own experiments with SAXPY-like kernels the actual speedups are more in the 1.4x - 1.5x range for L1D-blocked cases, and 1.2x - 1.3x for real-world usage (data in L2 and LLC).
>I don't know if you watched the video or just glanced over the slides
I have now watched it in its entirety, though I posted my remark about the SAXPY 128-to-256 scaling as soon as I saw Slide 23, because I told myself "come on! that's too good to be true". Now I understand that the goal is to sell AVX to developers, and IMHO you have done a good job at it given the time you had.
To achieve (over) 2x performance with AVX we need gather/scatter. I believe it's actually cheap to implement and saves power, because you don't need pairs of extract/insert instructions for each element.
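For clarity, a gather is just a set of indexed loads; expressed in scalar C it looks like the sketch below (names are mine). With AVX2's `vgatherdps` this becomes a single instruction, whereas on AVX1 you emulate it with scalar loads plus insert instructions, which is where the extra uops and power go:

```c
/* What an 8-wide gather does, in scalar form: out[k] = base[idx[k]].
   Hardware gather performs these eight indexed loads as one
   instruction instead of a chain of scalar load + insert steps. */
void gather8(const float *base, const int *idx, float *out)
{
    for (int k = 0; k < 8; k++)
        out[k] = base[idx[k]];
}
```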
Other than that, AVX is obviously lacking 256-bit integer operations. If you have code with mixed types, it's faster to stick with SSE. I wouldn't mind if early 256-bit integer implementations took two uops.