Intel® ISA Extensions

Sandy Bridge: SSE performance and AVX gather/scatter

capens__nicolas
New Contributor I
5,936 Views
Hi all,

I'm curious how the two symmetric 128-bit vector units on Sandy Bridge affect SSE performance. What are its peak and sustainable throughputs for legacy SSE instructions?

I also wonder when parallel gather/scatter instructions will finally be supported. AVX is great in theory, but in practice parallelizing a loop requires the ability to load/store elements from (slightly) divergent memory locations. Serially inserting and extracting elements was still somewhat acceptable for SSE, but with 256-bit AVX it becomes a serious bottleneck, which partially cancels its theoretical benefits.

Sandy Bridge's CPU cores are actually more powerful than its GPU, but the lack of gather/scatter will limit the use of all this computing power.

Cheers,

Nicolas
0 Kudos
125 Replies
capens__nicolas
New Contributor I
629 Views
Quoting bronxzv

> NVIDIA's presentation on G80's SFU (http://aith.polito.it/foils/11_2.pdf) can give you an indication of the size of tables that are required for high accuracy: betw

they use 2nd order polynomials, so what I was calling M is 3 in this case; with 3rd order polynomials M = 4, which is a better fit for 128-bit loads. If the table is big it's even more important to use an AoS layout, i.e. (c0,c1,c2) packed together in your example


With AVX (N = 8), you'd need 8 of these 128-bit loads, and then transpose this 4x8 matrix. If I'm not mistaken, that's 16 shuffle instructions. Extracting the individual addresses takes 9 instructions too. And I'm not even counting the spilling instructions.

Just to be clear here: That's at least 33 instructions before you can do any useful work! With FMA the polynomial itself takes only 4 instructions.

With a gather instruction, these table lookups would only take 4 instructions.
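
For reference, here is a minimal sketch of what the emulation boils down to if you spill through memory instead of using insert/extract (the helper name is hypothetical):

#include <immintrin.h>

/* Emulated 8-wide gather: spill the packed indices, do 8 serial
   scalar loads, reassemble. This is exactly the per-element overhead
   a hardware gather instruction would remove. */
static inline __m256 gather8_ps(const float *base, __m256i vindex)
{
    int idx[8];
    float elem[8];
    _mm256_storeu_si256((__m256i *)idx, vindex);  /* spill indices */
    for (int i = 0; i < 8; ++i)
        elem[i] = base[idx[i]];                   /* 8 serial loads */
    return _mm256_loadu_ps(elem);                 /* reassemble     */
}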

0 Kudos
bronxzv
New Contributor II
629 Views

well, I think that we basically all agree; I simply think we can start writing code today with the "complete" SIMD philosophy in mind. FYI, I just posted a request on the Intel C++ forum:

http://software.intel.com/en-us/forums/showthread.php?t=80085


at the moment I'm very sceptical about new instructions, since sometimes not only are the speedups not there, we even get slowdowns when using them. One example from the past was BLENDVPS, which was slower than the ANDPS/ANDNPS/ORPS equivalent when introduced (though now it's way faster; it screams on SNB thanks to the dual blend units). A current example is the fact that using 2 128-bit VMOVUPS is faster than a single 256-bit VMOVUPS (for unaligned moves), thus the optimal code for SNB will be up to 2x slower in the future when we eventually get 256-bit datapaths

here is a post of mine about this very topic:

http://www.realworldtech.com/forums/index.cfm?action=detail&id=115959&threadid=115645&roomid=2

the best solution is probably to use the intrinsic, since it generates two instructions for SNB and will generate a single instruction for future targets
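
for illustration, a minimal sketch of why the intrinsic is the safer bet (function names are invented):

#include <immintrin.h>

/* portable form: the compiler can split this into two 128-bit
   vmovups on SNB, and emit one 256-bit move on future targets */
static inline __m256 load8(const float *p)
{
    return _mm256_loadu_ps(p);
}

/* the explicit SNB-friendly split, hard-coded: what you'd write
   in ASM today, and what becomes suboptimal later */
static inline __m256 load8_split(const float *p)
{
    __m128 lo = _mm_loadu_ps(p);
    __m128 hi = _mm_loadu_ps(p + 4);
    return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);
}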

0 Kudos
levicki
Valued Contributor I
629 Views
Sigh...

I said "when you have no GPRs to spare" and you answered with "you reuse one GPR 8x". Which part of "no GPRs to spare" you did not understand?!?

There are situations (algorithms) when you don't have even that one GPR free -- you have to spill one to memory and reload it later, and often you have to do that inside of a loop.

Furthermore, no access pattern optimization is possible if indices are calculated during runtime.

Finally, inlining 18+ instructions each time you use gather/scatter increases the code size, which in turn reduces decoding throughput and thus IPC, not to mention that it also prevents the Loop Stream Detector from kicking in.

So no, I wouldn't say we all agree.

0 Kudos
bronxzv
New Contributor II
629 Views

>So no, I wouldn't say we all agree.

you were the one stating that "initial implementation of gather does not have to be faster than the current alternative as far as I am concerned", so I fail to see where we disagree


>Furthermore, no access pattern optimization is possible if indices are calculated during runtime.

huh? a lot of optimizations are possible with AoS layouts, where each individual gather is from the same base address + single element offsets (and exactly the same packed indices computed dynamically); we are talking about 128-bit moves instead of 32-bit moves in the cases I have in mind
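
for example, with an AoS coefficient table each lookup is a single 128-bit move (field names assumed for illustration):

#include <immintrin.h>

/* AoS layout: one table entry = (c0,c1,c2) plus padding, 16 bytes,
   so a lookup is one 128-bit load instead of three 32-bit gathers
   (align the table to 16 bytes to allow aligned moves) */
struct coeffs { float c0, c1, c2, pad; };

static inline __m128 load_entry(const struct coeffs *table, int i)
{
    return _mm_loadu_ps(&table[i].c0);  /* single 128-bit move */
}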

now, well, if you really think that spilling a single GPR to the L1 DCache will consume "bus traffic" (sic), I suppose there isn't much left to discuss as far as real-world timings are concerned

FYI, many of the LSD optimizations are gone on SNB (the decoded icache makes it redundant for high performance code); in fact loop unrolling is more important than before (I get pretty good speedups by toying with the #pragma unroll)

one goal of the game with NHM was to try loop fission to maximize LSD usage; now with SNB the goal is just reversed: loop fusion, to avoid like the plague the useless loads/stores required by multi-pass loop fission
0 Kudos
capens__nicolas
New Contributor I
629 Views
Quoting bronxzv
Amen to what?

if you use high-level constructs like inlined vscatter(), vcompress(), etc. functions, I don't see why generating a single instruction instead of several will make the source code clearer

maybe you are talking about the ASM


It's not about making the source code clearer, not even the assembly code (although it's helpful when debugging).

It's really about knowing that the single instruction has the potential of becoming faster than the instruction sequence.

Software development cycles can be quite long, and customers don't upgrade their software the minute you release a new version. So to speed up the adoption rate (that's ROI for whoever is paying the bills), it's important to have early access to new instructions, even if initially they're not (much) faster than a sequence of instructions.

Gather/scatter is most likely already on the roadmap. But instead of waiting for the transistor budget and engineering budget for a high-performance implementation, after which it still takes many years for developers to make use of it and get the applications into the hands of customers, they could add a cheap implementation in the near future; by the time the high-performance implementation is ready there would be an instant speedup for the applications customers are already using. It's a big incentive for people to buy the new hardware.

Again, it's faster ROI. Everyone wins.

0 Kudos
bronxzv
New Contributor II
629 Views
>It's not about making the source code clearer,

well, you said "Amen" after this comment of Igor's:


"-- at least it will make code more clear"

0 Kudos
capens__nicolas
New Contributor I
629 Views
Quoting bronxzv

well, I think that we basically all agree; I simply think we can start writing code today with the "complete" SIMD philosophy in mind. FYI, I just posted a request on the Intel C++ forum:

http://software.intel.com/en-us/forums/showthread.php?t=80085


at the moment I'm very sceptical about new instructions, since sometimes not only are the speedups not there, we even get slowdowns when using them. One example from the past was BLENDVPS, which was slower than the ANDPS/ANDNPS/ORPS equivalent when introduced (though now it's way faster; it screams on SNB thanks to the dual blend units). A current example is the fact that using 2 128-bit VMOVUPS is faster than a single 256-bit VMOVUPS (for unaligned moves), thus the optimal code for SNB will be up to 2x slower in the future when we eventually get 256-bit datapaths

here is a post of mine about this very topic:

http://www.realworldtech.com/forums/index.cfm?action=detail&id=115959&threadid=115645&roomid=2

the best solution is probably to use the intrinsic, since it generates two instructions for SNB and will generate a single instruction for future targets


In theory, developers can indeed start writing fully parallel SIMD code today. In practice, there's a lot more involved. If I tell my superiors we should start investing time and money into rewriting SSE code into AVX code, using (abstracted) gather/scatter operations which will have an influence on the entire architecture, they'll want a justification for that. Right now, sticking to SSE and benefiting from the extra 128-bit execution units sounds like a better plan. So, like I said before, it will still take many years for the majority of multimedia software development companies to consider using AVX, and even longer for Intel to see a return on its investment.

BLENDVPS is a good example... of doing it all wrong. Adding instructions which are initially slower will obviously not get them adopted any faster. You can expect developers to check for AVX support before attempting to use BLENDVPS. So the adoption is delayed by three years, and what's worse, it cost transistors. Either way, this failure should not make you sceptical about gather/scatter. It merely shows that the initial implementation has to be at least as fast as the instruction sequence that emulates it.

In a way it even makes me hopeful that we'll see gather/scatter sooner rather than later. Maybe BLENDVPS was added this early just for code density reasons (both assembly and binary code). If that's enough of a reason, then surely gather/scatter must look awesome. OK, maybe there was a pinch of sarcasm there, but still, I honestly can't think of a reason not to add gather/scatter instructions at the earliest opportunity.

0 Kudos
levicki
Valued Contributor I
629 Views
Spilling registers to L1D, which on Sandy Bridge has a best-case latency of 4 cycles, has a considerable performance impact and is even discouraged in the latest optimization reference manual.

Furthermore, I really hate when I have to quote myself because someone is putting words in my mouth:

You will be spilling registers to memory and reloading them, generating additional cache/bus traffic for an already memory-intensive operation with poor data locality. How will that help?

So what you wrote above is not only an incorrect quote of my post, it also implies that I am incompetent, and that is pure malice on your part. If you cannot use facts instead of ad hominem attacks, then perhaps I should just ignore the rest of your rambling and use the report button instead.

Regarding "at least it will make code more clear" quote -- again you are taking what I said out of context to suit your purpose of attacking people you debate with. There is a continuation to that sentence that says "enable compiler to auto-vectorize more loops, and pave a way for future hardware implementations which will be considerably faster.".

We do not agree because:

- We believe that we should use one new instruction

To get better performance in our case:

a) User has to buy a new CPU

- You believe that we should use an intrinsic function

To get better performance in your case:

a) Developer must buy a new compiler which emits faster intrinsic function
b) Developer must recompile, QA test, and release
c) User has to buy a new CPU
d) User has to pay for new software version due to our development costs
e) User has to spend time reinstalling software

Which one of those gives an end user better incentive to upgrade?

0 Kudos
capens__nicolas
New Contributor I
629 Views
Quoting bronxzv
>It's not about making the source code clearer,

well you said "Amen" after this Igor's comment:


"-- at least it will make code more clear"


I said amen to the entire comment. Clearer (assembly) code is a welcome bonus, but certainly not the main reason to want gather/scatter, even if initially it isn't faster.

0 Kudos
bronxzv
New Contributor II
629 Views
>If I tell my superiors we should start investing time and money into rewriting SSE code into AVX code, using (abstracted) gather/scatter operations

something you can say is that other pure software renderers, like my Kribi 3D stuff tested here *before any tuning*, are already on a fast track to AVX optimizations

http://www.lostcircuits.com/mambo//index.php?option=com_content&task=view&id=99&Itemid=1&limit=1&limitstart=6


0 Kudos
bronxzv
New Contributor II
629 Views

honestly I feel like we are going nowhere; I can't come up with a more concrete idea than requesting intrinsics that will map to instructions if they are available at some point in the future

in your "We" vs "You" argument there isseveralelephants in the room:
- how willthe backward compatibility with legacy targets be assured using only one instruction, isn't it also a good idea to optimize for the installed base ?
- which one can producevalidated applications today andwhich ones are just waiting still for a better future ?

0 Kudos
levicki
Valued Contributor I
629 Views
Legacy targets do not have AVX. We are discussing the lack of gather in AVX.

Optimization is happening right now, for this Sandy Bridge.

If this Sandy Bridge had a gather instruction and we used it now in our applications, then the next Sandy Bridge with a faster gather instruction could still run the same software we are optimizing now, which would automatically get faster by means of a CPU upgrade alone.

Is that concept so difficult to grasp or what?
0 Kudos
capens__nicolas
New Contributor I
629 Views
Quoting bronxzv
something you can say is that other pure software renderers, like my Kribi 3D stuff tested here *before any tuning*, are already on a fast track to AVX optimizations

http://www.lostcircuits.com/mambo//index.php?option=com_content&task=view&id=99&Itemid=1&limit=1&limitstart=6


Cool!

May I ask what your expectations are after tuning? 11% sounds like only a minor improvement for something that's supposed to double the computing power (although obviously the SSE path also benefits from Sandy Bridge's extra execution units).

For SwiftShader I found Sandy Bridge to be 30% faster clock-for-clock compared to Nehalem. Unless a significant part of that is due to things other than the execution units, this means that in theory the use of AVX could speed things up by at most 50%.

So are you expecting to get closer to that 50%, or are you limited by the lack of integer AVX instructions, bandwidth, or dare I ask... load/store and swizzle?

0 Kudos
bronxzv
New Contributor II
629 Views
>11% sounds like only a minor improvement for something that's supposed

sure, I was very sad when M.S. told me the scores he was getting, and I got a 2600K PC only a few days later so I wasn't able to profile anything

I now hope for something like 20% vs SSE on Sandy Bridge, or around 50% better IPC overall vs Nehalem

Some loops had no speedups, or even slowdowns (now fixed), in the version tested by Michael S.

when testing with a single thread and turbo off, I made these measurements:

- no speedup for aligned copies of arrays or set/clear buffers (16B/clock L1D cache write bottleneck)
- slowdowns for unaligned copies of arrays, due to an issue with the implementation in Sandy Bridge: 2 128-bit vmovups are faster than a single 256-bit vmovups
- poor speedups (1.15x overall) for the L2-cache blocked case
- common speedup around 1.3x for the L1D-cache blocked case with normal load/store
- best observed speedup so far: 1.82x (L1D-cache blocked, with lower than average load/store)
- generally speaking I get more than the overall speedup for all the swizzle/compress/gather stuff, thanks to new instructions like VPERMILPS (and PEXTRD/VINSERTPS, which are SSE4.1 stuff that I'm not using in the fallback SSE2 path); maybe that explains why I'm not seeing Amdahl's law as hard at work at the moment as some other people...
- VBLENDVPS is clearly a killer instruction: I get 1.6x speedups in loops with heavy VBLENDVPS, and it's in all cases way faster than masked moves; masked moves allow for cleaner and shorter ASM, they're just way slower (see the sketch after this list)
- the lack of 256-bit integers isn't very important in my case, since the most useful instructions, like packed conversions between floats and ints, are here at full speed

- all in all I expect at most a 1.3x overall speedup with turbo off and a single thread, which should amount to roughly a 1.2x speedup with 8 threads (8x more LLC/memory bandwidth requirements) and turbo on
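
here is that sketch: the branchless per-lane select behind those VBLENDVPS speedups, next to the masked-move alternative (helper names invented):

#include <immintrin.h>

/* per-lane select: the sign bit of mask picks 'a', otherwise 'b' */
static inline __m256 select8(__m256 mask, __m256 a, __m256 b)
{
    return _mm256_blendv_ps(b, a, mask);  /* compiles to vblendvps */
}

/* masked-move alternative: cleaner, shorter ASM, but way slower on SNB */
static inline void select8_store(float *dst, __m256 mask, __m256 a)
{
    _mm256_maskstore_ps(dst, _mm256_castps_si256(mask), a);
}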


>although obviously the SSE path also benefits from Sandy Bridge's extra execution units).

my understanding is that the 2nd load port and the decoded icache are very effective for maximizing legacy SSE code throughput, but AFAIK the only extra execution unit for SSE is the 2nd blend unit; which execution units do you have in mind?



0 Kudos
bronxzv
New Contributor II
629 Views
>Optimization is happening right now, for this Sandy Bridge.

sure, and this good Sandy has no gather instruction; it's a hard fact of life. Hey man, not even the SDE (which already has FMA3 and half-floats, btw) has a gather instruction. That's why I use an inlined gather() function that generates optimal (or pretty close to optimal) AVX code, and also optimized fallback paths like SSE2, from the same source code

now let's imagine some alternate reality where Sandy has a gather instruction; it's even there in the specs since day 1 and wasn't removed, unlike FMA4 for example. It's all cool and dandy and we are all happy with our "complete SIMD ISA". Still, our customers mostly have SSEn-enabled machines; a lot still run XP, which will probably never even support AVX, or run Seven and will not install SP1 very fast (yes, there will be hundreds of thousands of Sandy machines sold with no AVX support due to the lack of support in Seven at the moment). What can we do about it? We will *need* to generate at least 2 code paths, and a sensible way is to use a single intrinsic allowing the compiler to generate optimized code for all targets
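
a minimal sketch of what I mean by a single source-level gather (names hypothetical; the AVX body is itself emulation, since Sandy has no gather instruction):

#include <immintrin.h>

/* one call site in the source; each target gets its own code path */
static inline __m256 vgather8(const float *base, const int idx[8])
{
    /* AVX path (emulated on SNB) */
    __m128 lo = _mm_set_ps(base[idx[3]], base[idx[2]], base[idx[1]], base[idx[0]]);
    __m128 hi = _mm_set_ps(base[idx[7]], base[idx[6]], base[idx[5]], base[idx[4]]);
    return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);
    /* an SSE2 build would produce two __m128 halves from the same
       source, selected via #ifdef __AVX__ or compiler dispatch */
}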

Is that concept so difficult to grasp or what?
0 Kudos
levicki
Valued Contributor I
629 Views
In that "alternate reality" customers would have an incenitve to upgrade hardware because speedup would be immediate and automatic. Gather instruction would map to an intrinsic so no difference there for you and you could still have two code paths -- with gather instruction and with SSE emulation.
0 Kudos
TimP
Honored Contributor III
629 Views
Quoting bronxzv

- VBLENDVPS is clearly a killer instruction: I get 1.6x speedups in loops with heavy VBLENDVPS, and it's in all cases way faster than masked moves; masked moves allow for cleaner and shorter ASM, they're just way slower

my understanding is that the 2nd load port and the decoded icache are very effective for maximizing legacy SSE code throughput


The compilers should choose vblend instructions for vectorization whenever possible (likely requiring the VECTOR ALIGNED pragma). Masked move is useful as a last resort when it enables auto-vectorization. I see 2.0x speedups comparing AVX-256 vectorization with masked move against non-vector SSE code.
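
For illustration, a loop of this shape (a sketch; it assumes the Intel compiler's pragma and 32-byte aligned arrays, and the function name is invented) is the kind the vectorizer can turn into a compare plus vblendvps:

/* conditional assignment: both arms are computed, then blended
   per lane on the compare mask */
void blend_loop(float *a, const float *b, const float *c, int n)
{
    #pragma vector aligned
    for (int i = 0; i < n; ++i)
        a[i] = (b[i] > 0.0f) ? b[i] + c[i] : b[i] - c[i];
}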

The 2nd load port doubles the speed of existing single thread code which gathers operands into a packed operand. In my tests, old SSE code does as well as AVX-256 in that case. In the multi-threaded case, there might be an advantage for hardware which could perform gather operations across cores without replicating all cached data. A first micro-coded implementation of a gather instruction would likely not deal with the cache line replication.

The decoded icache is supposed to avoid the performance obstacles encountered when alignment and unroll optimizations for loop stream detection are missing.

Your numbers for L1D and L2 cache blocking are interesting.
0 Kudos
bronxzv
New Contributor II
629 Views
Tim, FYI here is an ASM dump of our kernel with the best speedup so far; the AVX version is 1.82x faster than the equivalent SSE path (2600K at 3.4 GHz, turbo off, single thread, L1D$ blocked, 100% 32-B alignment)


it computes a 3D bounding box from XYZ data in SoA form, so it's arguably useful code with a great speedup

; LOE eax edx ecx ebx esi edi ymm0 ymm1 ymm2 ymm3 ymm4 ymm5
.B16.14:                                        ; Preds .B16.14 .B16.13
        vmovups ymm6, YMMWORD PTR [esi+edi*4]   ;325.22
        vmovups ymm7, YMMWORD PTR [ebx+edi*4]   ;326.22
        vminps  ymm4, ymm4, ymm6                ;325.5
        vmaxps  ymm1, ymm1, ymm6                ;325.5
        vminps  ymm5, ymm5, ymm7                ;326.5
        vmaxps  ymm0, ymm0, ymm7                ;326.5
        vmovups ymm6, YMMWORD PTR [edx+edi*4]   ;327.22
        add     edi, 8                          ;323.29
        vminps  ymm2, ymm2, ymm6                ;327.5
        cmp     edi, eax                        ;323.21
        vmaxps  ymm3, ymm3, ymm6                ;327.5
        jb      .B16.14                         ; Prob 82% ;323.21

0 Kudos
Thomas_W_Intel
Employee
621 Views
Bronxzv, Igor,

Following your discussion, I think both of you have made your points very clear. However, you are coming to different conclusions, as your willingness to write and support different code paths and software versions differs. Your estimations of how likely customers are to buy new hardware or software also differ a lot. As both questions depend on the industry and company, there is probably no clear answer. Furthermore, this is actually not a technical but a business question. I therefore suggest that you leave this argument as it is. Your high-quality contributions are highly appreciated and it would be a pity to spoil this otherwise interesting thread with a flame war.

Kind regards
Thomas
0 Kudos
levicki
Valued Contributor I
621 Views
Thomas,

If you had been following my contributions so far, you would know that I have always been ready to support different code paths. That has not changed. I just don't think that intrinsics are the solution to every problem -- I always prefer a hardware implementation over software.

Regarding customers, I have built many systems for many people, including myself, and I have upgraded many more. I cannot speak about the USA market, but there are other markets I am familiar with, such as Eastern and Central Europe, Russia, China, India, etc., which prefer simple upgrades that bring noticeable performance improvements. The simplest upgrade is to replace just the CPU -- you don't even have to reinstall the operating system for that. My estimation is also based on the fact that with this economic crisis the majority of people cannot afford to pony up for a full hardware and software upgrade every year or even sooner.

Finally, if implementation of gather and scatter is indeed reduced to some business decision, instead of driven by the need for innovation and enabling, then I am truly disappointed in Intel.

I will never mention those again, and I will refrain from other suggestions as well from now on -- I don't want them ending up in someone's drawer waiting for the marketing team to figure out how to pitch them to the illiterate masses.
0 Kudos
bronxzv
New Contributor II
621 Views

>you are coming to different conclusions

I don't think so; we basically want the same thing: a gather instruction in AVX, much like VGATHER in LRBni. It will map to one (or several) intrinsic(s), as all other AVX instructions do, and I'm sure Igor isn't against that (see his last post on the subject)

Now my request is merely for a new intrinsic, much like _mm256_set_ps (which is already a variant of gather, btw) but with a base address and 8 packed indices in a __m256i argument instead of the 8 floats of set_ps
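
something with this shape, say (the name is hypothetical; the body just spells out the reference semantics via emulation):

#include <immintrin.h>

/* requested intrinsic shape: result[i] = base[vindex[i]] */
static inline __m256 mm256_gather_ps(const float *base, __m256i vindex)
{
    int idx[8];
    _mm256_storeu_si256((__m256i *)idx, vindex);  /* unpack the 8 indices */
    return _mm256_set_ps(base[idx[7]], base[idx[6]], base[idx[5]], base[idx[4]],
                         base[idx[3]], base[idx[2]], base[idx[1]], base[idx[0]]);
}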

0 Kudos