At this year's Game Developers Conference, we presented a session on optimizing code for Intel's latest Sandy Bridge processors, including AVX of course.
The session is now publicly available online at GDC Vault: http://www.gdcvault.com/play/1014645
Note that the presentation was intended for a game, graphics, and simulation developer audience with a wide range of background experience in performance tuning.
Stan Melax
4 Replies
It looks like when your slides say "Measured CPU cycles", the numbers are in fact just theoretical IACA values, aren't they?
For example, Slide 23 claims a 2x speedup for AVX vs SSE SAXPY (0.6 / 0.3), which is impossible to get on actual SNB hardware even with a 100% L1D$ hit rate (due to the 32B/cycle L1D load bandwidth limitation, among others).
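As a rough illustration of the bandwidth argument, here is a minimal sketch of the SAXPY kernel together with a deliberately simplified load-bound model. The function names are hypothetical, and the model is an assumption of this sketch: it counts only the bytes loaded per element and ignores stores, address generation, and instruction issue.

```c
#include <stddef.h>

/* Reference SAXPY: y[i] = a * x[i] + y[i].
   Per element: two 4-byte loads (x[i], y[i]) and one 4-byte store (y[i]). */
void saxpy(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Lower bound on cycles per element from load bandwidth alone.
   Simplified model (assumption): stores and all other limits are ignored. */
double load_bound_cycles_per_element(double bytes_loaded_per_element,
                                     double load_bytes_per_cycle)
{
    return bytes_loaded_per_element / load_bytes_per_cycle;
}
```

With 8 bytes loaded per element and the 32B/cycle figure quoted above, this model gives a floor of 0.25 cycles per element whatever the vector width, which is the sense in which a kernel already close to that floor at 128 bits cannot double its speed at 256 bits.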
Measured.
256-bit:
0.3 * 8 is 2.4, a bit more than 2 (which was the IACA prediction). So in two cycles, with two 128-bit load ports, I can load four 128-bit registers, or 16 floating-point numbers: that's 8 for x and 8 for y. Given that I should be able to multiply 8 and add 8 in one cycle, it still feels unfortunate to me that we're only using half of the available flops here.
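The per-iteration accounting in the paragraph above can be spelled out as a small sketch. The function names are hypothetical, and the inputs (two 16-byte load ports, 8 multiplies plus 8 adds per cycle) are the figures quoted in this post, not values re-measured here.

```c
/* Cycles for one vector iteration, from a cycles-per-element figure. */
double cycles_per_iteration(double cycles_per_element, int lanes)
{
    return cycles_per_element * lanes;
}

/* Number of 4-byte floats the load ports can deliver in `cycles` cycles. */
int floats_loaded(int load_ports, int bytes_per_port_per_cycle, int cycles)
{
    return load_ports * bytes_per_port_per_cycle * cycles / 4;
}

/* Fraction of peak floating-point throughput actually used. */
double flop_utilization(int flops_used, int peak_flops_per_cycle, int cycles)
{
    return (double)flops_used / ((double)peak_flops_per_cycle * cycles);
}
```

For example, `floats_loaded(2, 16, 2)` is 16 (8 values of x and 8 of y), and one 8-wide SAXPY iteration performs 16 flops against a two-cycle peak of 32, so `flop_utilization(16, 16, 2)` is 0.5.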
128-bit:
The way it was compiled and the way it ran, one of the ports was issuing 2 uops per cycle and hence was the bottleneck. No doubt the 0.6 could have been lower (with custom coding effort or perhaps better compilation). In an earlier slide on the serial code (before the SIMD section), I mention the loop unroll that brings the serial loop iteration time to less than 2 cycles. With one unroll it's 3ish cycles (or 1.6 per element).
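For readers who have not seen the slides, unrolling the serial loop once might look like the following sketch. This is not the code from the talk; `saxpy_unroll2` is a hypothetical name, and the cycle counts quoted above are not reproduced or verified by it.

```c
#include <stddef.h>

/* Scalar SAXPY with the loop body unrolled once (two elements per trip).
   Illustrative only: the loop overhead (increment, compare, branch) is
   paid once per two elements instead of once per element. */
void saxpy_unroll2(size_t n, float a, const float *x, float *y)
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        y[i]     = a * x[i]     + y[i];
        y[i + 1] = a * x[i + 1] + y[i + 1];
    }
    if (i < n)                      /* leftover element when n is odd */
        y[i] = a * x[i] + y[i];
}
```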
This talk was not trying to give people any 2x expectations for avx256 over avx128 on SNB. In fact, I verbally dismissed SAXPY as a cookie example and not something you can make a game about. In the concrete examples I was able to squeeze in, I mention about 70% on vanilla Mv (constant M), but only 40% on a (single-bone) skinning sample, explaining that the ports are 128-bit and load-heavy computations can't expect as significant a benefit.
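To make the "vanilla Mv (constant M)" case concrete, here is a scalar sketch of that kind of kernel. The layout (row-major 4x4 matrix, tightly packed 4-component vectors) and the name `transform_vectors` are assumptions of this sketch, not details taken from the talk.

```c
#include <stddef.h>

/* out[k] = M * in[k] for `count` packed 4-component vectors.
   m is a row-major 4x4 matrix that is the same for every vector, so only
   the vectors themselves have to be streamed from memory.
   `in` and `out` must not overlap. */
void transform_vectors(const float m[16], size_t count,
                       const float *in, float *out)
{
    for (size_t k = 0; k < count; ++k) {
        const float *v = in + 4 * k;
        float *r = out + 4 * k;
        for (int row = 0; row < 4; ++row) {
            r[row] = m[4 * row + 0] * v[0] + m[4 * row + 1] * v[1]
                   + m[4 * row + 2] * v[2] + m[4 * row + 3] * v[3];
        }
    }
}
```

A skinning loop differs in that the matrix typically changes from vertex to vertex, which plausibly adds loads per output and fits the "load-heavy" remark above.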
I don't know if you watched the video or just glanced over the slides. It sounds like you might already be well beyond the skill level of the intended audience. :)
Thanks for the quick feedback. Indeed, it definitely looks like the 128-bit variant could be optimized further. In my own experiments with SAXPY-like kernels, the actual speedups are more in the 1.4x - 1.5x range for L1D-blocked cases and 1.2x - 1.3x for real-world usage (data in L2 and LLC).
>I don't know if you watched the video or just glanced over the slides
I have watched it entirely now, though I posted my remark about SAXPY 128 to 256 scaling as soon as I saw Slide 23, because I told myself "come on! that's too good to be true". Now I understand that the goal is to sell AVX to developers, and IMHO you have done a good job of it given the time you had.
To achieve (over) 2x performance with AVX we need gather/scatter. I believe it's actually cheap to implement and saves power, because you don't need pairs of extract/insert instructions for each element.
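For anyone unfamiliar with the terms, the semantics of gather and scatter can be sketched in scalar C as below. The names `gather_f32` and `scatter_f32` are hypothetical; this shows only what the operations compute, not how hardware would implement them.

```c
#include <stddef.h>
#include <stdint.h>

/* Gather: dst[i] = table[index[i]] for each lane.
   Without a hardware gather, each lane is a separate scalar load plus an
   insert into the vector register. */
void gather_f32(float *dst, const float *table,
                const int32_t *index, size_t lanes)
{
    for (size_t i = 0; i < lanes; ++i)
        dst[i] = table[index[i]];
}

/* Scatter: table[index[i]] = src[i] for each lane.
   If two lanes share an index, the later lane wins in this sketch. */
void scatter_f32(float *table, const int32_t *index,
                 const float *src, size_t lanes)
{
    for (size_t i = 0; i < lanes; ++i)
        table[index[i]] = src[i];
}
```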
Other than that it's obviously lacking 256-bit integer operations. If you have code with mixed types, it's faster to stick to SSE. I wouldn't mind if early 256-bit integer implementations took two uops.