It looks like when your slides say "Measured CPU cycles" they are in fact just theoretical IACA values, aren't they?
For example, Slide 23 claims a 2x speedup for AVX vs SSE SAXPY (0.6 / 0.3), which is impossible to get on actual SNB hardware even with a 100% L1D$ hit rate (due to the 32B/cycle L1D load bandwidth limitation, among other things).
0.3 * 8 is a bit more than 2 cycles per 8-element AVX iteration (which was the IACA prediction).
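For reference, this is the kernel being measured. A minimal scalar sketch (the function name and signature are mine, not from the slides; the SSE/AVX variants vectorize this same loop):

```c
#include <stddef.h>

/* Reference SAXPY kernel: y[i] += a * x[i].
   The speedup numbers above are cycles per element for the
   vectorized (SSE 4-wide / AVX 8-wide) versions of this loop. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```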
So in two cycles, with two 128-bit load ports, I can load four 128-bit registers, i.e. 16 floating-point numbers: that's 8 for x and 8 for y. Given that I should be able to multiply 8 and add 8 in one cycle, it still feels unfortunate to me that we're only using half of the available flops here.
The way it was compiled and the way it ran, one of the ports was issuing 2 uops per cycle and was hence the bottleneck. No doubt the 0.6 could have been lower (with custom coding effort or perhaps better compilation). I mention in an earlier (serial) slide, before the SIMD section, the loop unroll that brings the serial loop iteration time to less than 2 cycles; with one unroll it's 3-ish cycles (or 1.6 per element).
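The unroll mentioned above might look like this in scalar C. This is a hypothetical sketch, not the talk's actual listing; processing two elements per iteration halves the loop overhead, which is how one unroll gets the serial version to roughly 1.6 cycles per element:

```c
#include <stddef.h>

/* 2-way unrolled scalar SAXPY (sketch). Two elements are handled
   per loop iteration, amortizing the increment/compare/branch
   overhead across both. */
void saxpy_unroll2(size_t n, float a, const float *x, float *y)
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {   /* two elements per iteration */
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
    }
    if (i < n)                     /* remainder element for odd n */
        y[i] += a * x[i];
}
```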
This talk was not trying to give people any 2x expectations for AVX-256 over AVX-128 on SNB. In fact, I verbally dismissed SAXPY as a cookie-cutter example and not something you can make a game about. In the concrete examples I was able to squeeze in, I mention about 70% on vanilla Mv (constant M), but only 40% on a (single-bone) skinning sample, explaining that the ports are 128-bit and load-heavy computations can't expect as significant a benefit.
I don't know if you watched the video or just glanced over the slides. It sounds like you might already be well beyond the skill level of the intended audience. :)
Thanks for the quick feedback. Indeed, it definitely looks like the 128-bit variant can be optimized further; in my own experiments with SAXPY-like kernels the actual speedups are more in the 1.4x - 1.5x range for L1D-blocked cases, and 1.2x - 1.3x for real-world usage (data in L2 and LLC).
>I don't know if you watched the video or just glanced over the slides
I have now watched it in its entirety, though I posted my remark about the SAXPY 128-to-256 scaling as soon as I saw Slide 23, because I told myself "come on! that's too good to be true". Now I understand that the goal is to sell AVX to developers, and IMHO you have done a good job at it given the time you had.
To achieve (over) 2x performance with AVX we need gather/scatter. I believe it's actually cheap to implement and saves power, because you don't need pairs of extract/insert instructions for each element.
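For clarity, a gather is just a set of indexed loads; expressed in scalar C it looks like the sketch below (names are mine). With AVX2's `vgatherdps` this becomes a single instruction, whereas on AVX1 you emulate it with scalar loads plus insert instructions, which is where the extra uops and power go:

```c
/* What an 8-wide gather does, in scalar form: out[k] = base[idx[k]].
   Hardware gather performs these eight indexed loads as one
   instruction instead of a chain of scalar load + insert steps. */
void gather8(const float *base, const int *idx, float *out)
{
    for (int k = 0; k < 8; k++)
        out[k] = base[idx[k]];
}
```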
Other than that, AVX is obviously lacking 256-bit integer operations. If you have code with mixed types, it's faster to stick with SSE. I wouldn't mind if early 256-bit integer implementations took two uops.