capens__nicolas
New Contributor I
758 Views

Sandy Bridge: SSE performance and AVX gather/scatter

Hi all,

I'm curious how the two symmetric 128-bit vector units on Sandy Bridge affect SSE performance. What are the peak and sustainable throughputs for legacy SSE instructions?

I also wonder when parallel gather/scatter instructions will finally be supported. AVX is great in theory, but in practice parallelizing a loop requires the ability to load/store elements from (slightly) divergent memory locations. Serially inserting and extracting elements was still somewhat acceptable for SSE, but with 256-bit AVX it becomes a serious bottleneck, which partially cancels its theoretical benefits.
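The kind of loop I mean can be sketched like this (a minimal scalar sketch, illustrative names): every iteration is independent, yet without gather the eight divergent loads per vector have to be done one element at a time:

```cpp
#include <cstddef>

// Scalar loop with an indirect load: out[i] = table[idx[i]] * scale.
// Each element comes from a (slightly) divergent address, so with
// SSE/AVX alone, vectorizing this means serially extracting each index
// and inserting each loaded element; a hardware gather would perform
// all the loads as a single vector operation.
void lookup_scale(const float* table, const int* idx,
                  float scale, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = table[idx[i]] * scale;  // divergent address per element
}
```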

Sandy Bridge's CPU cores are actually more powerful than its GPU, but the lack of gather/scatter will limit the use of all this computing power.

Cheers,

Nicolas
0 Kudos
125 Replies
bronxzv
New Contributor II
575 Views



Serially inserting and extracting elements was still somewhat acceptable for SSE, but with 256-bit AVX it becomes a serious bottleneck,


For "slightly divergent locations", i.e. most elements in the same 64B cache line, AFAIK with SSE the best solution was indirect jumps to a series of static shuffles (controls as immediates) in order to maximize 128-bit loads/stores. Now with AVX we can use dynamic shuffles (controls in YMM registers) using VPERMILPS. Based on IACA, the new AVX solution is more than 2x faster than legacy SSE; I suppose it will be even more than 2x faster on real hardware, since the main issue with the indirect jump solution is the high branch miss rate
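For reference, the semantics of the variable-control VPERMILPS can be modeled in scalar code like this (a sketch of the behavior, not the actual encoding):

```cpp
#include <cstdint>

// Scalar model of VPERMILPS (variable-control form) on a 256-bit
// register: each 32-bit element selects one of the four elements of
// its OWN 128-bit lane, using the low two bits of the corresponding
// control element. Elements never cross the 128-bit lane boundary.
void vpermilps_model(const float src[8], const std::uint32_t ctrl[8],
                     float dst[8]) {
    for (int i = 0; i < 8; ++i) {
        int lane = (i / 4) * 4;            // 0 for low lane, 4 for high
        dst[i] = src[lane + (ctrl[i] & 3)];
    }
}
```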
capens__nicolas
New Contributor I
575 Views

Quoting bronxzv
For "slightly divergent locations", i.e. most elements in the same 64B cache line, AFAIK with SSE the best solution was indirect jumps to a series of static shuffles (controls as immediates) in order to maximize 128-bit loads/stores. Now with AVX we can use dynamic shuffles (controls in YMM registers) using VPERMILPS. Based on IACA, the new AVX solution is more than 2x faster than legacy SSE; I suppose it will be even more than 2x faster on real hardware, since the main issue with the indirect jump solution is the high branch miss rate


VPERMILPS can only permute elements within one vector register. It's of no use for real gather/scatter.

Since gather/scatter is the parallel equivalent of load/store, it would allow parallelizing almost any loop, even when it contains indirect loads and stores (the only limitations being that the loop iterations should not alias and that the indexing uses 32-bit offsets at most - both of which are easy to guarantee for multimedia applications).

Since all software contains many performance critical loops, just imagine how much faster things would be when four or eight iterations can execute simultaneously!

Sandy Bridge already has two 128-bit load units, so I don't think it would take a lot of extra logic to turn them into two 128-bit gather units, Larrabee style. With AVX the bottleneck of sequentially inserting/extracting elements is just too big, and it's only getting worse with FMA and future 512-bit and 1024-bit vectors. Amdahl's Law becomes a performance wall if load/store isn't parallelized as well.

If the cost of an optimal gather/scatter implementation is too big to justify it (e.g. if it adds another cycle of latency to L1 accesses), I believe it is still critical to add these instructions sooner rather than later, initially using a cheaper implementation. For instance, each of the load units could collect one element per cycle, meaning a gather operation would take just four cycles (throughput), which is already way better than extracting the offsets from one vector and inserting elements into another vector (16 cycles). This would allow developers to start using these instructions early on, and later architectures (with increased transistor budget) could improve the performance of existing vectorized software with a faster gather/scatter implementation.

It would finally make the SIMD instruction set complete, giving every scalar operation a parallel equivalent...

capens__nicolas
New Contributor I
575 Views

Much to my astonishment I just found out that AVX doesn't support integer operations on ymm registers. Frankly, this makes it useless for a large range of multimedia applications. And to boot, vinsertps doesn't support ymm registers either. This means that even for floating-point applications, loading/storing all elements is even slower than I expected.

I can understand that completely duplicating the 128-bit SSE units would have been expensive, but it's beyond me why the engineers didn't add the 256-bit instructions anyway and execute them in two cycles on a 128-bit unit (the same way SSE used to be executed on 64-bit units on the Pentium 3/4). This means we're stuck with the following roadmap:

AVX2 - FMA, half-float support
AVX3 - integer support
AVX4 - gather/scatter support

Frankly this looks like it's going to become the same mess as SSE1-SSE4.2, with the same slow adoption issues. Supporting all these different extensions is a nasty software issue (even today lots of applications have an SSE2 path and a scalar path, and don't bother complicating things with other extensions).

With 200+ SP GFLOPS, Sandy Bridge looks good on paper, but in practice it will have limited use. I guess I'll stick to SSE after all. Hopefully the extra add and multiply units still offer a tiny bit of performance improvement...

Please Intel, add integer and gather/scatter support sooner rather than later! The first implementation doesn't need to be optimal, but at least the instructions should be available so developers can actually start using them!
bronxzv
New Contributor II
575 Views


And to boot, vinsertps doesn't support ymm registers either. This means that even for floating-point applications, loading/storing all elements is even slower than I expected.

you just need one extra vinsertf128 for 8 vinsertps and 8 32-bit loads; the performance impact should be very low (< 5%), and even negligible (< 1%) if you have even moderate L1D$ misses
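As a sketch of what that emulation sequence computes (a scalar model with illustrative names; the real code would use 8 scalar loads feeding vinsertps into two 128-bit halves, then one vinsertf128 to merge them):

```cpp
struct Vec8 { float e[8]; };  // stand-in for a ymm register

// Scalar model of the 256-bit gather emulation discussed above:
// 8 loads + 8 vinsertps build two 128-bit halves, and a single
// vinsertf128 merges the upper half -- only one extra instruction
// compared to running the 128-bit SSE sequence twice.
Vec8 gather8_model(const float* base, const int idx[8]) {
    float lo[4], hi[4];
    for (int i = 0; i < 4; ++i) lo[i] = base[idx[i]];      // loads + inserts, low half
    for (int i = 0; i < 4; ++i) hi[i] = base[idx[4 + i]];  // loads + inserts, high half
    Vec8 r{};
    for (int i = 0; i < 4; ++i) {                          // the vinsertf128 merge
        r.e[i] = lo[i];
        r.e[4 + i] = hi[i];
    }
    return r;
}
```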
capens__nicolas
New Contributor I
575 Views

Quoting bronxzv
you just need one extra vinsertf128 for 8 vinsertps and 8 32-bit loads; the performance impact should be very low (< 5%), and even negligible (< 1%) if you have even moderate L1D$ misses


So what you're saying is, because it's horrendously slow to emulate a gather operation with extract/insert anyway, it's OK to make it even slower with AVX? Please note that with FMA this will mean you'll be able to do 16 operations per cycle, but you'll still only be able to emulate a gather operation in 18 uops. In other words, it's Amdahl's Law at its worst. Fast vector operations are useless when the memory accesses are sequential.

Anyway, since insert/extract will become practically redundant with gather/scatter support, it's probably best to leave them as is. But I seriously hope Intel's intention is to make the AVX instruction set complete by adding 256-bit integer operations and gather/scatter support.

bronxzv
New Contributor II
575 Views

>So what you're saying is, because it's horrendously slow to emulate a gather operation with extract/insert anyway, it's ok to make it even slower with AVX?

no, I'm just saying that the fact that there is no 256-bit variant of vinsertps doesn't matter, since in real-world use cases the impact will be negligible

>that with FMA this will mean you'll be able to do 16 operations per cycle

hint: we are already able to do 16 flops per clock with balanced vmulps / vaddps
capens__nicolas
New Contributor I
575 Views

Quoting bronxzv
no, I'm just saying that the fact that there is no 256-bit variant of vinsertps doesn't matter, since in real-world use cases the impact will be negligible


It does matter, not because it would have that much of an impact on real-world performance, but because it would make it easier for developers to convert their SSE code to AVX. Together with the lack of integer instructions, not having 256-bit insert/extract instructions makes AVX less attractive.

It's not that developers are too lazy to emulate the 256-bit operation with some extra instructions; the problem is that these instructions are expected to be added sooner or later anyway. So developers will have to rewrite/update their software again and again. It's very expensive for software companies to make use of all the latest extensions (rearchitecting, implementation, QA, marketing, support, etc.).

So what will happen instead is that lots of developers won't even look at AVX and will stick to the more complete SSE instruction set. For Intel this means it takes even longer for these extra transistors to pay off.

Note once again that software developers like myself aren't asking for optimal implementations right away. If integer AVX instructions were executed in two cycles on a 128-bit unit, that would still make it worth starting to rewrite the software right away. The extra register space helps hide memory latencies, so it would already be slightly faster. Later implementations could then have true 256-bit execution units for all AVX instructions, and the software would run a lot faster without requiring a rewrite. That's a big incentive for consumers to buy that next generation, since there would already be software making use of these instructions!
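To illustrate the point with a scalar model (illustrative name, modeled on the SSE2 packed-dword add): a 256-bit integer add could be specified now and executed as two 128-bit halves on current hardware, with later hardware simply widening the unit and requiring no software change:

```cpp
#include <cstdint>

struct I256 { std::uint32_t e[8]; };  // stand-in for a ymm register

// Scalar model of a hypothetical 256-bit packed 32-bit integer add,
// executed as two sequential 128-bit "halves" (the way a narrow
// implementation would crack it into uops). Addition wraps around,
// matching SSE2 paddd semantics per 32-bit element.
I256 paddd256_model(const I256& a, const I256& b) {
    I256 r;
    for (int h = 0; h < 2; ++h)                 // two 128-bit halves
        for (int i = 0; i < 4; ++i)
            r.e[h * 4 + i] = a.e[h * 4 + i] + b.e[h * 4 + i];
    return r;
}
```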

So it's in everyone's interest that instruction sets be as complete as possible, as early as possible. I was really hoping AVX would put an end to the mess created by all the different SSE extensions, but it has gotten off to a disappointing start...

bronxzv
New Contributor II
575 Views

From my POV it's not very important whether the instructions are in the ISA or not, since I can afford to recompile my code for new targets, and I don't program in assembly or even with the intrinsics, but with higher-level wrapper classes, to enjoy far more readable and maintainable code thanks to C++ operator overloading. I typically already have a 256-bit packed integer class, for example, and functions like Scatter/Gather/Compress/Swizzle/Deswizzle/... Only actual timings are important for the final users, and only the quality of the source code should be important for the coders, well, IMHO.

The *exact same source code* can be compiled to target SSE or AVX, for example, just by changing some headers (in fact just a compilation flag), and generally I can't see much potential for improvement in the ASM dumps, so it's IMO the best solution to cope with changes in the ISA and to start writing and *validating* code before the CPUs are available. That's of paramount importance, since software development cycles are longer than ISA enhancements (with roughly each year some changes in the ISA for x86)
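A minimal sketch of that wrapper-class idea (illustrative names; the scalar fallback is shown here, and an AVX build would wrap __m256 behind the same interface):

```cpp
// Small vector wrapper class: kernel code is written once against this
// interface, and only the class body changes per target. This is the
// portable scalar fallback; an AVX specialization would implement the
// same operators with _mm256_add_ps / _mm256_mul_ps.
struct vec8f {
    float e[8];
    friend vec8f operator+(vec8f a, vec8f b) {
        vec8f r;
        for (int i = 0; i < 8; ++i) r.e[i] = a.e[i] + b.e[i];
        return r;
    }
    friend vec8f operator*(vec8f a, vec8f b) {
        vec8f r;
        for (int i = 0; i < 8; ++i) r.e[i] = a.e[i] * b.e[i];
        return r;
    }
};

// Kernel code stays identical across scalar/SSE/AVX builds:
inline vec8f mul_add(vec8f a, vec8f b, vec8f c) { return a * b + c; }
```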

capens__nicolas
New Contributor I
575 Views

Quoting bronxzv

From my POV it's not very important whether the instructions are in the ISA or not, since I can afford to recompile my code for new targets, and I don't program in assembly or even with the intrinsics, but with higher-level wrapper classes, to enjoy far more readable and maintainable code thanks to C++ operator overloading. I typically already have a 256-bit packed integer class, for example, and functions like Scatter/Gather/Compress/Swizzle/Deswizzle/... Only actual timings are important for the final users, and only the quality of the source code should be important for the coders, well, IMHO.

The *exact same source code* can be compiled to target SSE or AVX, for example, just by changing some headers (in fact just a compilation flag), and generally I can't see much potential for improvement in the ASM dumps, so it's IMO the best solution to cope with changes in the ISA and to start writing and *validating* code before the CPUs are available. That's of paramount importance, since software development cycles are longer than ISA enhancements (with roughly each year some changes in the ISA for x86)


Taking optimal advantage of new ISA extensions with merely a recompile is a luxury the majority of software developers don't have. There are lots of different extensions, so you need multiple code paths to have optimal code for each. Managing multiple paths is very messy. It's bad enough to have to manage your own releases with varying features; having to deal with varying implementations within a release can become infeasible. Most developers opt for an SSE2 path and a C path and don't bother with the rest. AVX could have been an interesting third path, but it's not complete yet (it's like SSE without SSE2), so most will skip it.

Also note that new extensions can have far-reaching consequences for the software architecture. When SSE2 was introduced it was possible to convert code which previously used MMX for integer operations and SSE for floating-point to use only SSE2. But as a consequence you only had 8 registers for storing both integer and floating-point data. If you previously had nicely tuned MMX+SSE code which didn't need to spill any registers to the stack, SSE2 required you to rework all of that so the register pressure wouldn't cancel the benefit of SSE2 integer instructions. Ironically, x64 then solved that, but lots of people still have 32-bit operating systems. So MMX+x87, MMX+SSE, SSE2, x64, etc. all have different optimal usages, and compilers are of very little help.

And that's just the development phase. Debugging, code maintenance, feature extensions, customer support: it all gets a lot more complicated with a highly fragmented ISA. And it means it gets adopted a lot more slowly than it would if an extension were complete (even if sub-optimal).

I fully understand that CPU designers can't introduce it all at once, but at least with AVX they had the opportunity not to make some of the same mistakes again. It seems to me they could have easily extended all SSE2 integer operations to 256-bit AVX instructions by executing them in two sequential 128-bit chunks. It would eliminate at least one additional code path for those who make use of every extension, and for others it would make it more attractive to start coding for it right away instead of waiting for AVX3. It doesn't matter much whether the integer execution units are extended to 256-bit in two years or four; the software will be ready to instantly take advantage of the hardware improvements.

The AVX emulator may have allowed developers to validate their code early, but it didn't allow them to evaluate whether it's worth the trouble without integer operation support. The solution to the fact that "software development cycles are longer than ISA enhancements" is not to offer early emulation of the extensions alone, but to offer more complete extensions over longer cycles. Which in turn can be done at a manageable transistor cost by implementing less critical instructions in the most straightforward way.

So once again I'd also like to ask the Intel engineers to add gather/scatter instructions sooner rather than later. Even if initially they're just microcoded as sequential load/store operations, developers can actually use them in practice, and Intel can evaluate when the time is right to give them more optimal hardware support. By then the software making use of the instructions will already be on the market, and the speedup will be instant when people upgrade. So by helping software developers Intel helps itself sell newer CPUs.

Note that AMD is already in a position to extend SSE2 integer instructions to 256-bit instructions and have them executed as a single 256-bit operation, since each Bulldozer module seems to have a pair of fully symmetric 128-bit SSE2-capable execution units. And NVIDIA's 'Project Dover' may quickly become an interesting multimedia platform if they apply their SIMD experience to ARM architectures. Because gather/scatter allows most loops to be auto-vectorized by the compiler, it can result in very high performance/Watt even if other parts of the chip are still slightly inferior.

Having a SIMD instruction set with a parallel equivalent of every scalar instruction is just as important as the multi-core revolution. ILP, TLP, DLP: you need all of them to maximize performance/Watt for the architecture that will dominate the future of computer technology.

bronxzv
New Contributor II
575 Views

Taking optimal advantage of new ISA extensions with merely a recompile is a luxury the majority of software developers don't have. There are lots of different extensions, so you need multiple code paths to have optimal code for each. Managing...

well, that's not a mere recompile but the careful design of the variants of your building blocks (for ex. your beloved Gather, 256-bit packed integers) that will be inlined on the new specialized code path (and yes, you need multiple paths for your hotspots if you want high-performance code)

after that you simply work at the higher level and recompile all your paths from the same source code; it's quite easily manageable, I'll say, after many years of doing just that
capens__nicolas
New Contributor I
575 Views

Quoting bronxzv
well, that's not a mere recompile but the careful design of the variants of your building blocks (for ex. your beloved Gather, 256-bit packed integers) that will be inlined on the new specialized code path (and yes, you need multiple paths for your hotspots if you want high-performance code)

after that you simply work at the higher level and recompile all your paths from the same source code; it's quite easily manageable, I'll say, after many years of doing just that


Look, some extensions target specific applications and make them significantly faster, while other extensions help a wide range of applications but only by a small amount. There's nothing wrong with that in itself; I welcome every incremental improvement. But there's the potential for an extension which speeds up a large range of applications significantly. It may still take years for such a 'complete' extension to be implemented optimally, but since it already holds the promise of speedups, developers won't have to wait for a set of smaller extensions to form one complete whole. It's a very attractive idea to know you can invest effort into developing a new code path and see your application become faster over the course of several years without having to worry about changing it over and over again. And like I said, it also helps the CPU manufacturer, because it gives consumers a reason to buy newer CPUs which will significantly speed up existing applications, instead of having to wait for applications to appear which make use of an incremental new extension for which support was just added.

AVX with integer operations and gather/scatter would be such a complete extension. I'm not saying incremental extensions are bad; I'm just saying complete extensions are better. I'm happy for you that for your application you seem to be fine with incremental extensions, but I'm sure that for the majority of developers a complete extension would have been very welcome. AVX with integer operations and gather/scatter makes it possible to parallelize nearly every loop, in many cases automatically. I'm talking about optimizing the hotspots in pretty much every application, not just those which take considerable effort to make use of incremental extensions.

Apparently for your application area you found a way to make it manageable by abstracting the instructions into "building blocks". That's great, but I wouldn't call it a "simple" solution. And for the record, I use a powerful abstraction too: LLVM. It supports generic vector operations which are JIT-compiled to make use of whatever extensions your CPU supports. But even with LLVM I need to make extensive use of its support for intrinsics to get the best possible performance. And even when no intrinsics are needed, it's still important to have different code paths at the level of the abstract vector operations. For instance, it would be unwise to use a 256-bit packed integer "building block" if the CPU doesn't support SSE2, because with MMX you can only store two such vectors in registers. The latency from having to spill registers to the stack and read them back a couple of instructions later makes it a lot slower than using 64-bit packed integers and keeping things in registers. So LLVM made things a little more manageable for me, but it took considerable effort to make use of it in the first place, and it's not a silver bullet.

So despite some help from software abstractions, I'm still joining the large group of developers who'd like to see AVX support integer operations and gather/scatter as soon as possible, so it becomes widely applicable and worth the coding effort...

bronxzv
New Contributor II
575 Views

So despite some help from software abstractions, I'm still joining the large group of developers who'd like to see AVX support integer operations and gather/scatter as soon as possible, so it becomes widely applicable and worth the coding effort...

I will also welcome full support in AVX for 256-bit packed integers, though for my own application field (realtime 3D rendering) the most important 256-bit integer-related instructions are already here: 8 x 32-bit int<->float conversions

It will also be nice to have LRBni-like vgather/vscatter, though for 3D applications it's generally better (i.e. faster, since it minimizes L1 DCache misses and replaces a series of 32-bit loads by 128-bit or 256-bit loads) to keep data in AoS form for the cases requiring gathers (and that's the only case where SoA/hybrid SoA are to be avoided, I'll say); then a series of gathers is replaced by a swizzle operation
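A scalar sketch of that AoS-plus-swizzle alternative (illustrative names): instead of four gathers from SoA arrays, load four contiguous xyzw structs (few cache lines touched) and transpose them in registers:

```cpp
struct Vtx { float x, y, z, w; };  // one AoS element (128 bits)

// Instead of 16 scalar gathers into SoA registers, do 4 contiguous
// 128-bit loads of AoS data and a 4x4 in-register transpose (on real
// hardware: shuffle/unpack ops). Shown scalar for clarity.
void swizzle4(const Vtx v[4],
              float x[4], float y[4], float z[4], float w[4]) {
    for (int i = 0; i < 4; ++i) {  // the "transpose" step
        x[i] = v[i].x;
        y[i] = v[i].y;
        z[i] = v[i].z;
        w[i] = v[i].w;
    }
}
```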


ILevi1
Valued Contributor I
575 Views

I have to side with Nicolas here.

Gather/Scatter are sorely missed from the ISA for many purposes.

AVX not having FMA as promised initially is a disappointment I can live with.

However, AVX not having integer operations is detrimental for its adoption by developers including me. Integer operations would greatly benefit video coding/decoding and other media applications.

Intel has once again released a half-baked instruction set extension (with the exception of the three-operand syntax), even though they said "we won't do that anymore", "orthogonality is important", etc. AVX may eventually become complete over several iterations, but if the trend of incremental additions continues, the code we write will spend more time checking which features the CPU supports than doing useful work with them, and Intel's CPU documentation in the year 2020 may read "to determine whether your CPU supports AVX10.11b, check bit 255 of the XMM7 register after executing CPUID with EAX=0x7FFFFFFF".
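The dispatch boilerplate this leads to looks something like the following sketch (using the GCC/Clang __builtin_cpu_supports builtin; the function name and path labels are illustrative): every piecemeal extension means another runtime check and another code path to maintain:

```cpp
// Pick the best available code path at runtime. Each incremental ISA
// extension adds another branch here and another specialized kernel
// to write, test, and support.
const char* pick_path() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    if (__builtin_cpu_supports("avx2")) return "avx2";
    if (__builtin_cpu_supports("avx"))  return "avx";
    if (__builtin_cpu_supports("sse2")) return "sse2";
#endif
    return "scalar";  // portable fallback
}
```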

Coupled with Sandy Bridge's unjustified socket change to force us to buy new mainboards (1156 to 1155 pins with no significant interface changes such as PCIe 3.0, more memory controller channels, or support for faster DDR3), Intel's CPU and chipset business seems more and more marketing-driven instead of innovation- and performance-driven.

I am disappointed with the first Sandy Bridge CPUs, to the point that I will not upgrade (with an i7-920 @ 3.4GHz I have the same performance as I would have with a 2600K anyway), and I do not intend to support AVX, at least not this unfinished one, because it is simply not worth the effort.
bronxzv
New Contributor II
575 Views

your mileage may vary; my #1 request for the future will be more L1 DCache bandwidth. The true limiter of performance for 256-bit code on Sandy Bridge, IMHO (and the source of disappointing SSE to AVX speedups), is the limit of one 256-bit load per clock (vs. 2 128-bit loads per clock with SSE); it's simply not a good match for the 16 flops per clock we can theoretically get
ILevi1
Valued Contributor I
575 Views

When I was at the IDF last year, I specifically asked whether SB would have 256-bit paths throughout the chip or whether Intel would cut corners like they did with the Pentium 4. I was told that it would be a fully fledged 256-bit chip, but it turns out that my concern was justified after all: they just combined two 128-bit loads into a single 256-bit one instead of expanding them both to 256 bits. Now I am starting to doubt whether FPMUL/FPADD also work as 2x 128-bit and thus cannot bring any speedup in this first AVX-capable CPU generation.
bronxzv
New Contributor II
575 Views

the speedups are there, though disappointing when compared to the IACA estimates
TimP
Black Belt
575 Views

Many applications which depend on the mid-level cache (L2) see limited gains for AVX-256 over 128-bit vectorized code, due to the 128-bit path to L2 in the current implementation. The MKL [DS]GEMM showed substantial gains after months of hand coding to gain L1 locality while implementing AVX-256. I suppose IACA doesn't consider cache locality.
Public presentations allude to the lack of hardware support for fast mis-aligned 256-bit access in the current implementation; that is dealt with by explicitly splitting into 128-bit moves (at the instruction level; compilers do it automatically), which take advantage of hardware support for 128-bit moves on 4- and 8-byte alignments.

bronxzv
New Contributor II
575 Views

yes, I've noticed that for the 256-bit unaligned moves the Intel compiler now expands things like _mm256_storeu_ps into 2 VEX-128 vmovups, while aligned moves such as _mm256_store_ps now generate a 256-bit vmovups (instead of a vmovaps in previous versions). I was a big fan of vmovaps, since it was handy for catching missing 32B alignment

I'm not sure about the L1-to-L2 bandwidth limitation, since I get slightly better speedups with HT enabled (and thus only ~16 KB L1D per thread)

Though in a lot of critical loops I have something like one 256-bit load every 3 instructions, so I suppose this is a serious bottleneck; the 2nd load port makes 128-bit code fly, thus decreasing the SSE to AVX speedup

ILevi1
Valued Contributor I
575 Views

Tim,

Can you share with us whether the current implementation of AVX is capable of executing ADD/MUL/DIV/AND/OR/XOR on 8 pairs of single-precision FP values in a single operation, or is it split internally into two 128-bit operations? Also, what about the upcoming LGA2011 version (if you are allowed to comment, of course) -- will there be any architectural differences apart from support for PCIE3.0 and a four-channel memory controller?

That is my biggest concern about Sandy Bridge and AVX, apart from the L2 cache path width.
TimP
Black Belt
159 Views

The AVX-256 parallel operations perform the operations on 8 single precision operands in parallel, not sequenced/pipelined with the second 128-bits following right after the first 128 bits, except for div/sqrt, which are sequenced. You could still argue that those simpler operations are split (as evidenced by restrictions on shuffling), but the FP resources are expanded to double the parallelism going from 128-bit to 256-bit data. I'm not qualified to comment on effect of socket/PCI/memory controller changes, but I don't see them influencing the FP unit internals.
Reply