Intel® ISA Extensions

Educational video on to-the-metal x86 optimization, including AVX

Stanley_M_Intel
Employee
At this year's Game Developers Conference, we presented a session on optimizing code for Intel's latest Sandy Bridge processors, including AVX, obviously.
The session is now publicly available online at gdcvault: http://www.gdcvault.com/play/1014645
Note, the presentation was intended for a game, graphics, and simulation developer audience with a wide range of background experience in performance tuning.
Stan Melax
bronxzv
New Contributor II
It looks like when your slides say "Measured CPU cycles", those are in fact just theoretical IACA values, aren't they?

For example, Slide 23 claims a 2x speedup for AVX vs SSE SAXPY (0.6 vs 0.3 cycles per element), which is impossible to get on actual SNB hardware even with a 100% L1D$ hit rate (due to the 32B/cycle L1D load bandwidth limitation, among others).
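The load-bandwidth argument can be checked with a little arithmetic. The sketch below is only a back-of-the-envelope model, not a measurement: the function names are made up for illustration, and it assumes the 32B/cycle L1D load figure quoted above and that SAXPY (y[i] = a*x[i] + y[i]) loads one float from x and one from y per element. It ignores stores, port pressure, and everything else.

```c
#include <assert.h>
#include <stddef.h>

/* Lower bound on cycles per element from load bandwidth alone.
   All names here are hypothetical helpers, not from the slides. */
double load_bound_cycles_per_element(size_t loads_per_element,
                                     size_t bytes_per_load,
                                     double l1d_load_bytes_per_cycle) {
    assert(l1d_load_bytes_per_cycle > 0.0);
    return (double)(loads_per_element * bytes_per_load)
           / l1d_load_bytes_per_cycle;
}

/* SAXPY on floats: two 4-byte loads per element (x[i] and y[i]).
   Assumes the 32 bytes/cycle L1D load bandwidth quoted in the thread. */
double saxpy_load_bound_cycles_per_element(void) {
    return load_bound_cycles_per_element(2, sizeof(float), 32.0);
}

/* Speedup implied by two cycles-per-element figures. */
double speedup(double cycles_before, double cycles_after) {
    assert(cycles_after > 0.0);
    return cycles_before / cycles_after;
}
```

With 4-byte floats this gives a floor of 0.25 cycles per element, and that floor does not depend on whether the loads are issued as 128-bit or 256-bit instructions. Under this simple model, the 0.3 figure is already close to the load limit, so a 2x ratio says more about the 0.6 baseline than about vector width.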
Stanley_M_Intel
Employee
measured.

256 bit:
0.3 cycles per element * 8 elements is a bit more than 2 cycles per iteration (2 was the IACA prediction).
So in two cycles, with two 128-bit load ports, I can load four 128-bit registers, or 16 floating-point numbers. That's 8 for x and 8 for y. Given I should be able to multiply 8 and add 8 in 1 cycle, it still feels unfortunate to me that we're only using half of the available flops here.

128 bit:

The way it was compiled and the way it ran, one of the ports was issuing 2 uops per cycle, hence the bottleneck. No doubt the 0.6 could have been less (with custom coding effort or perhaps better compilation). In an earlier (serial) slide, before the SIMD section, I mention the loop unroll that brings the serial loop iteration time to less than 2 cycles. With one unroll it's 3ish cycles (or 1.6 per element).

This talk was not trying to give people any 2x expectations for AVX 256 over AVX 128 on SNB. In fact, I verbally dismissed SAXPY as a cookie example and not something you can make a game about. In the concrete examples I was able to squeeze in, I mention about 70% on vanilla Mv (constant M), but only 40% on a (single-bone) skinning sample, explaining that the ports are 128 bit and load-heavy computations can't expect as significant a benefit.

I don't know if you watched the video or just glanced over the slides. It sounds like you might already be well beyond the skill level of the intended audience. :)
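For readers who have not seen the kernel under discussion, here is a minimal sketch of a 256-bit SAXPY, not the code from the slides. Each iteration does the two 256-bit loads (8 floats of x, 8 of y), one multiply, one add, and one store described above. The function names are made up for illustration. It assumes GCC or Clang (for the `target` attribute and `__builtin_cpu_supports`) and falls back to scalar code elsewhere.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define SAXPY_HAVE_X86 1
#endif

/* Scalar reference: y[i] = a * x[i] + y[i]. */
void saxpy_scalar(size_t n, float a, const float *x, float *y) {
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

#ifdef SAXPY_HAVE_X86
/* 256-bit AVX version: 8 floats per iteration, unaligned loads/stores.
   The target attribute (GCC/Clang) enables AVX for this function only. */
static __attribute__((target("avx")))
void saxpy_avx(size_t n, float a, const float *x, float *y) {
    const __m256 va = _mm256_set1_ps(a);      /* broadcast a to 8 lanes */
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);   /* load 8 floats of x */
        __m256 vy = _mm256_loadu_ps(y + i);   /* load 8 floats of y */
        vy = _mm256_add_ps(_mm256_mul_ps(va, vx), vy);
        _mm256_storeu_ps(y + i, vy);          /* store 8 results */
    }
    for (; i < n; ++i)                        /* leftover elements */
        y[i] = a * x[i] + y[i];
}
#endif

/* Dispatch: use AVX when the CPU reports it, else the scalar loop. */
void saxpy(size_t n, float a, const float *x, float *y) {
    assert(x != NULL && y != NULL);
#ifdef SAXPY_HAVE_X86
    if (__builtin_cpu_supports("avx")) {
        saxpy_avx(n, a, x, y);
        return;
    }
#endif
    saxpy_scalar(n, a, x, y);
}
```

A real measurement would also unroll the loop and use aligned data. This sketch only shows the instruction mix per iteration.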
bronxzv
New Contributor II
Thanks for the quick feedback. Indeed, it definitely looks like the 128-bit variant could be optimized more. In my own experiments with SAXPY-like kernels, the actual speedups are more in the 1.4x - 1.5x range for L1D-blocked cases and 1.2x - 1.3x for real-world usage (data in L2 and LLC).

>I don't know if you watched the video or just glanced over the slides

I have watched it entirely now, though I posted my remark about SAXPY 128-to-256 scaling as soon as I saw Slide 23, because I told myself "come on! that's too good to be true". Now I understand that the goal is to sell AVX to developers, and IMHO you have done a good job of it given the time you had.
capens__nicolas
New Contributor I
To achieve (over) 2x performance with AVX we need gather/scatter. I believe it's actually cheap to implement and saves power because you don't need pairs of extract/insert instructions for each element.
Other than that it's obviously lacking 256-bit integer operations. If you have code with mixed types, it's faster to stick to SSE. I wouldn't mind if early 256-bit integer implementations took two uops.
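To make the gather point concrete, here is a rough sketch of what code has to do when no gather instruction is available: fetch each lane separately using its own index. The function name and signature are hypothetical. In vectorized code, each iteration of this loop typically corresponds to extracting one index and inserting one loaded value, which is the per-element pair mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/* Emulated 8-lane gather: out[lane] = base[idx[lane]].
   Hypothetical helper for illustration; a hardware gather would do
   this in one instruction instead of one scalar load per lane. */
void gather8_emulated(const float *base, const int idx[8], float out[8]) {
    assert(base != NULL && idx != NULL && out != NULL);
    for (int lane = 0; lane < 8; ++lane) {
        /* One index read and one value written per lane. */
        out[lane] = base[idx[lane]];
    }
}
```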