Intel® ISA Extensions

Sandy Bridge: SSE performance and AVX gather/scatter

capens__nicolas
New Contributor I
Hi all,

I'm curious how the two symmetric 128-bit vector units on Sandy Bridge affect SSE performance. What are the peak and sustainable throughputs for legacy SSE instructions?

I also wonder when parallel gather/scatter instructions will finally be supported. AVX is great in theory, but in practice parallelizing a loop requires the ability to load/store elements from (slightly) divergent memory locations. Serially inserting and extracting elements was still somewhat acceptable for SSE, but with 256-bit AVX it becomes a serious bottleneck, which partially cancels its theoretical benefits.
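To make the bottleneck concrete, here is a sketch (function and variable names are mine, purely illustrative) of the kind of loop that resists AVX vectorization today: every lane needs a load from a data-dependent address, i.e. a gather.

```cpp
// A loop that vectorizes trivially except for the indexed load: AVX has no
// instruction that fetches table[index[i..i+7]] into one YMM register, so
// the load has to be emulated element by element.
void scale_lookup(float* out, const float* table,
                  const int* index, const float* scale, int n)
{
    for (int i = 0; i < n; ++i)
        out[i] = table[index[i]] * scale[i];   // the "gather"
}
```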

Sandy Bridge's CPU cores are actually more powerful than its GPU, but the lack of gather/scatter will limit the use of all this computing power.

Cheers,

Nicolas
levicki
Valued Contributor I
Tim,

What I wanted to know when I asked about the socket is whether some of those architectural limitations you are mentioning will be removed, since it will after all be a different die with different lithography masks due to the socket change. Will Intel use that chance to further improve AVX performance with the new socket in Q3 2011, or will we have to wait for the next CPU generation to realize its full potential?

Also, do you have any data on how much of an improvement three-operand syntax can bring without other code changes?

Furthermore, when you say DIVPS is pipelined, I presume that MULPS with 1/X is still preferable to DIVPS with X?

Finally, are those restrictions on shuffling caused by the lack of a 256-bit ALU? Is that the reason why Intel did not even attempt to implement the integer part of AVX, even pipelined?

Sorry for so many questions, but we developers really need to know this, and so far Sandy Bridge reviews only offer marketing hype.
TimP
Honored Contributor III
Three-operand syntax often gives more than a 10% reduction in the number of instructions required. Of course, the net effect on performance is far less. I doubt this will be sufficient incentive to change many scalar applications over to AVX.
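As a minimal illustration (variable names are mine; exact code generation depends on the compiler), the saving comes from dropping the register copy that the destructive two-operand SSE encoding forces whenever an input stays live:

```cpp
#include <immintrin.h>

// Computing x*y + x: with SSE, x is still needed after the multiply, so the
// compiler must copy it first:
//     movaps xmm2, xmm0        ; save x
//     mulps  xmm2, xmm1        ; xmm2 = x*y
//     addps  xmm0, xmm2        ; x + x*y
// The AVX three-operand form is non-destructive, so the copy disappears:
//     vmulps xmm2, xmm0, xmm1
//     vaddps xmm0, xmm0, xmm2
__m128 mul_add(__m128 x, __m128 y)
{
    return _mm_add_ps(x, _mm_mul_ps(x, y));
}
```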
In my personal view, the invert-and-multiply scheme as the default implementation for single-precision vectorization seemed a relic of the weak divide performance of early SSE CPUs, made unnecessary by the strong divide performance of recent production CPUs. As you suggest, the AVX implementation, with little improvement in divide and sqrt, may push us back in that direction.
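For reference, a sketch of the invert-and-multiply scheme with AVX intrinsics (an illustrative helper of mine, not production code): RCPPS gives a ~12-bit reciprocal estimate, and one Newton-Raphson step brings the result close to, but not exactly, IEEE single-precision division.

```cpp
#include <immintrin.h>

// Approximate y / x with RCPPS plus one Newton-Raphson refinement:
// r' = r * (2 - x*r) roughly doubles the number of correct bits.
static inline __m256 div_via_rcp(__m256 y, __m256 x)
{
    __m256 r   = _mm256_rcp_ps(x);                                  // ~12-bit 1/x
    __m256 two = _mm256_set1_ps(2.0f);
    r = _mm256_mul_ps(r, _mm256_sub_ps(two, _mm256_mul_ps(x, r)));  // ~23-bit
    return _mm256_mul_ps(y, r);
}
```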
Restricting most new instruction introduction to floating-point operations seems to be a consequence of prioritizing an increased number of cores and improved multi-core scaling as a more effective way to gain both integer and floating-point performance, postponing additional integer instructions until after another process shrink.
levicki
Valued Contributor I
Tim,

What I don't understand from your answer is how Intel engineers expect to gain both integer and floating-point performance from multi-core scaling if they do not scale the integer units by the same amount as the floating-point units?

Wouldn't doubling integer performance (and thus the performance of many multimedia applications -- image processing, video coding/decoding, etc.) by supporting wider integer vectors in AVX be low-hanging fruit?

Since new integer instructions will be added to AVX anyway, I presume that the cost of decoding them is not the factor that made Intel decide not to support them yet.

With that in mind, is the cost of a 256-bit ALU in terms of silicon real estate really so prohibitive that Intel had to give up on instruction set orthogonality and a full-fledged shuffle, not to mention the chance to improve the performance of integer applications?

I also have to wonder why Intel has wasted resources on QuickSync instead of investing them into improving integer performance (by building a 256-bit ALU and integer AVX), which would allow a wide variety of software (including, for example, the x264 encoder) to get a 2x performance gain. What good is such a specialized transcoding engine when the output image quality has been sacrificed for speed?
TimP
Honored Contributor III
I suppose a number of applications which were considered important in the decision didn't support integer vectorization, but were expected to scale well with threading. This is speculation on my part, with no inside knowledge. SPEC 2006 is still considered important for marketing, and has practically no integer vectorization. SPECrate, at least, benefits from multiple cores. Evolution toward a wider group of applications for which platforms will be designed is slow.
Hardware designers recognize the need to prepare for the applications which will be important 5 years in the future, in time to make many of these decisions, but there's a lack of useful data.
levicki
Valued Contributor I
I must say that I disagree on the "lack of useful data" -- the data is out there, but it seems that the wrong people are looking at it.

Intel is trumpeting multimedia performance, but by not having integer AVX instructions it is not doing any favors to multimedia application developers or users alike.

What I have in mind is the following:

1. Image processing (Photoshop, Gimp, etc)
2. Video processing (Premiere, VirtualDub, AVISynth, etc)
3. Video codecs (x264, VP8, XviD, etc)
4. Sound processing (SoundForge, Logic Audio, Cubase, Sonar, etc)
5. String processing (all text processing applications/libraries that are using SSE4.2 instructions)

All of those applications would benefit from wider integer vectors and better integer performance, and quickly, because they are actively developed with short update cycles. Please do not tell me that Intel engineers are unable to notice that those applications will not go away -- their performance can only become more important in the future as the amount of data we need to process and store keeps growing at a staggering rate.

Moreover, it is time to stop chasing benchmarks, because they are useless in real life and are misleading your customers. For example, in Sandra's Dhrystone benchmark the Sandy Bridge 2600K has a 30% advantage over a Core i7-920 clocked at the same speed, but once you compare integer performance in real applications this advantage melts down to a miserable 6%.

Finally, I was under the impression that Intel asked us for our opinion and, based on what we said, decided to give us early access to new features so we could pave the way in our code for the real performance boost that would come later with new CPU revisions. With yet another incomplete instruction set extension released, we are again unable to do that for you. How do you expect integer vectorization to take off if the hardware support is not there?
capens__nicolas
New Contributor I
Quoting bronxzv
I will also welcome full support in AVX for 256-bit packed integers, though for my own application field (realtime 3D rendering) the most important 256-bit integer related instructions are already here: 8 x 32-bit int <-> float conversions

It will also be nice to have LRBni-like vgather/vscatter, though for 3D applications it's generally better (i.e. faster, since it minimizes L1 DCache misses and replaces a series of 32-bit loads with 128-bit or 256-bit loads) to keep data in AoS form for the cases requiring gathers (and that's the only case where SoA/hybrid SoA are to be avoided, I'd say), and then a series of gathers is replaced by a swizzle operation


That's interesting, because my main application field is also real-time 3D rendering (I'm the lead developer of SwiftShader). I can't think of a single operation that would help 3D rendering more than gather/scatter though. In particular it would help speed up texture sampling, which requires fetching many 32-bit texels at various memory locations. Even though a 2x2 footprint of texels can wrap/clamp/mirror, in the general case they are close to each other, so it's perfect for a gather instruction as implemented by Larrabee. It would also substantially speed up vertex attribute fetch and transcendental functions.

Note that a 240 mm² 6-core CPU with FMA would actually have the same computing power as the 240 mm² GeForce GTS 450. So CPUs and GPUs appear to be on a collision course, but the lack of gather/scatter still makes the CPU hopelessly inefficient at some workloads.

Another example is ray-tracing. There hasn't been a real breakthrough yet because GPUs are not good at recursion (too much stack space per thread, and GPUs need thousands of threads to achieve good utilization). But while CPUs are good at recursion, they're not good at ray-tracing, because the rays may slightly diverge and need the ability to access multiple memory locations in parallel.

Giving the CPU gather/scatter capabilities would allow dropping the IGP and replacing it with generic CPU cores instead. This unification results in an architecture which is more powerful overall and allows developers to create something entirely new and achieve great performance.

bronxzv
New Contributor II

As I said, I agree with you, and I have no better idea for a useful instruction than vgather (vscatter is less important for my stuff), though the most important step before that is a true high-performance 256-bit implementation. At the moment (on Sandy) we can load two XMM registers per clock with SSE but only one YMM register per clock; that's a strong limit for my renderer, where in a lot of kernels the load:store ratio is well above 2:1 and thus sustained load bandwidth from the L1 DCache is a key limiter.

>In particular it would help speed up texture sampling, which requires fetching many 32-bit texels at various memory locations.

with bilinear interpolation that will be 64-bit loads (LDRI) or 128-bit loads (HDRI), since you always access two adjacent texels in a row; with gradient data packed together (for bump mapping, for example) there is even more data to access in parallel, thus my Swizzle-instead-of-Gather comment in a previous post

>a gather instruction as implemented by Larrabee.

the only thing I know about the implementation in Larrabee is Abrash's comment that it will not be at top performance (I read it as microcoded and slow) in the first batch of chips; maybe you know more than that?

>It would also substantially speed up vertex attribute fetch

if you have, for example, a mesh topology, each vertex will be shared by many faces, so you are better off keeping vertex data in AoS form (XYZUVW) and then using Swizzle instead of Gather

>transcendental functions.

if you use a LUT-based implementation you access multiple items in parallel (the coefficients of the spline); the Swizzle argument also applies here

levicki
Valued Contributor I
The point is that we already have swizzle in the form of shuffle instructions (SHUFPS/PSHUFD). What we don't have is a parallel load from different locations indexed by the elements of another vector.

What we are trying to say is that:

1. A parallel load is a prerequisite for a swizzle, not vice versa.
2. A swizzle can be implemented as part of a parallel load instruction if necessary, not vice versa.
bronxzv
New Contributor II

be assured I understand the behavior of vgather and vscatter. Whatever their implementation (in code, in microcode, or with monster crossbars), maximizing data locality will always be a good idea; a bad data layout with a true hardware vgather will give you less performance than plain AVX code doing 256-bit loads (masked if required) and then swizzling with the *existing* instructions

in most of the cases I'm aware of for 3D rendering you not only access N elements in parallel (N = 8 with AVX packed floats) to use all the SIMD computation slots, but you generally need to gather M distinct values with the same integer vector of indices

for example, for vertex data (X,Y,Z,U,V,W) M = 6, for FP32 color data (R,G,B,A) M = 4, for 3rd-order polynomial approximations M = 4, and so on

with an SoA layout you will have to use M gather operations with different base addresses and the same index vector (M*N distinct memory locations)

with an AoS layout you will have to use N swizzle operations from N distinct memory locations
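a minimal SSE sketch of the AoS case (my own names and layout, N = 4 here, M = 4 with RGBA color data): one 128-bit load per element replaces M scalar gathers, and a 4x4 transpose does the swizzle

```cpp
#include <immintrin.h>

struct Pixel { float r, g, b, a; };   // AoS: the M = 4 values packed together

// Fetch four RGBA pixels at arbitrary indices with one 128-bit load each,
// then swizzle AoS -> SoA with existing shuffle instructions.
static inline void load4_rgba_soa(const Pixel* px, const int idx[4],
                                  __m128& R, __m128& G, __m128& B, __m128& A)
{
    __m128 p0 = _mm_loadu_ps(&px[idx[0]].r);   // r0 g0 b0 a0
    __m128 p1 = _mm_loadu_ps(&px[idx[1]].r);
    __m128 p2 = _mm_loadu_ps(&px[idx[2]].r);
    __m128 p3 = _mm_loadu_ps(&px[idx[3]].r);
    _MM_TRANSPOSE4_PS(p0, p1, p2, p3);         // 4x4 transpose via shuffles
    R = p0; G = p1; B = p2; A = p3;
}
```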

bronxzv
New Contributor II

Tim, you talk about a "128-bit path to L2 in the current implementation", yet in the optimization manual http://www.intel.com/Assets/PDF/manual/248966.pdf, page 2-16, the per-core bandwidth to the L2 cache is documented as "1 x 32 bytes per cycle". Who is right?

capens__nicolas
New Contributor I
Quoting bronxzv
>In particular it would help speed up texture sampling, which requires fetching many 32-bit texels at various memory locations.

with bilinear interpolation that will be 64-bit loads (LDRI) or 128-bit loads (HDRI), since you always access two adjacent texels in a row; with gradient data packed together (for bump mapping, for example) there is even more data to access in parallel, thus my Swizzle-instead-of-Gather comment in a previous post

>a gather instruction as implemented by Larrabee.

the only thing I know about the implementation in Larrabee is Abrash's comment that it will not be at top performance (I read it as microcoded and slow) in the first batch of chips; maybe you know more than that?

>It would also substantially speed up vertex attribute fetch

if you have, for example, a mesh topology, each vertex will be shared by many faces, so you are better off keeping vertex data in AoS form (XYZUVW) and then using Swizzle instead of Gather

>transcendental functions.

if you use a LUT-based implementation you access multiple items in parallel (the coefficients of the spline); the Swizzle argument also applies here


You can't just load two adjacent texels. Due to texture addressing modes like wrap/mirror/clamp (and the different possibilities in multiple dimensions), the texels you need can be in quite different locations than the typical 2x2 footprint.

Furthermore, if you want to improve the cache hit ratio by using texture swizzling (i.e. fitting 2D texel blocks into cache lines by swapping addressing bits around), the address of each texel can vary even more (while still improving overall locality). With a gather instruction, this would be no problem at all.
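A minimal sketch of such bit swapping (the classic Morton/Z-order scheme; the helper names are mine): interleaving the x and y bits makes each 2D block of texels contiguous in memory, so neighboring texels tend to land on the same cache line.

```cpp
#include <cstdint>

// Spread the low 16 bits of v so they occupy the even bit positions.
static inline uint32_t part1by1(uint32_t v)
{
    v &= 0x0000FFFF;
    v = (v | (v << 8)) & 0x00FF00FF;
    v = (v | (v << 4)) & 0x0F0F0F0F;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

// Morton (Z-order) texel address: x bits in even, y bits in odd positions.
static inline uint32_t morton2(uint32_t x, uint32_t y)
{
    return part1by1(x) | (part1by1(y) << 1);
}
```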

Tom Forsyth's presentation (software.intel.com/file/15545) claims that for Larrabee "offsets referring to the same cache line can happen on the same clock". So basically the throughput would be the same as the number of unique cache lines that need to be accessed. Could you point me to the document where Abrash said it would not be at top performance for the first chips? I wonder if he was talking about early prototypes which load elements sequentially, or whether he was actually talking about the ability to improve the performance further by splitting the gather operation over multiple load units.

As for vertex data, if you read 8 xyzuvw structures and want to transpose them into 6 AVX registers, you need a ton of swizzle instructions. Furthermore, the attributes could be stored in separate streams. A gather instruction would solve both problems at once.

Tiny lookup tables can be implemented with swizzle instructions, but they are too small to give good accuracy. NVIDIA's presentation on G80's SFU (http://arith.polito.it/foils/11_2.pdf) can give you an indication of the size of the tables required for high accuracy: between 6.5 and 13 cache lines. Since high correlation can be expected, the gather instruction would only need to access a few cache lines, though. With a pair of 128-bit gather units, AVX would be able to do these parallel lookups in very few cycles.

Anyway, while I'm passionate about graphics with no limitations, I believe gather/scatter would be of great value beyond that as well. As Igor noted, there are other multimedia applications which would immediately benefit from it. But I also believe it would enable new applications to emerge: things which some people are currently trying to fit onto the GPU architecture (GPGPU applications) without much success, due to the GPU's limited programming model. A CPU with gather/scatter support would revolutionize throughput computing.

levicki
Valued Contributor I
What I really don't understand is why you are arguing against a gather instruction when you already have all you need for your purposes. Why don't you just keep doing your 256-bit masked loads and swizzling to your heart's content with your well-organized data, and leave the others here to discuss further improvements to what you are already satisfied with? It is not as if you are going to lose anything if we get what we want.

There are certain cases where you simply cannot rearrange the data, or where data rearranging is prohibitively expensive (read: large 3D datasets).

Finally, just take a look at the assembler code for gather emulation in the optimization manual linked above (page 505, Example 11-8), and then tell me again that real hardware wouldn't be faster.
bronxzv
New Contributor II
>you are arguing against a gather instruction
huh? where?

>It is not as if you are going to lose anything if we get what we want.
well, it will depend on the area it takes on the chip; I'd prefer 256-bit datapaths to the L1 DCache if I have to choose, but having both would be cool

>just take a look at the assembler code for gather emulation in the optimization manual (page 505, Example 11-8), and then tell me again that real hardware wouldn't be faster

yup, I remember reading it the other day; the indices are read from a buffer IIRC. In a more useful implementation you'd have the indices in a YMM register and move each individual index to a GPR with PEXTRD (after a VEXTRACTF128 for the high part), then INSERTPS from memory; that's 18 instructions including the final VINSERTF128. The throughput is better than "single instructions" like VSQRTPS or VDIVPS for workloads fitting in the L1 DCache. I have no idea of the speedup a hardware implementation would provide, though as I stated several times I will welcome the instruction even if the speedup is modest.
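Spelled out in intrinsics, that sequence looks roughly like this (my own arrangement, assuming 32-bit float elements; Example 11-8 in the manual differs in detail):

```cpp
#include <immintrin.h>

// Emulated 8-wide gather with the indices already in a YMM register:
// VEXTRACTF128 + 8 index extractions + 8 INSERTPS-from-memory + VINSERTF128,
// 18 instructions in the compiled form.
static inline __m256 gather8(const float* base, __m256i vindex)
{
    __m128i ilo = _mm256_castsi256_si128(vindex);       // low 4 indices
    __m128i ihi = _mm256_extractf128_si256(vindex, 1);  // VEXTRACTF128

    __m128 lo = _mm_load_ss(base + _mm_cvtsi128_si32(ilo));
    lo = _mm_insert_ps(lo, _mm_load_ss(base + _mm_extract_epi32(ilo, 1)), 0x10);
    lo = _mm_insert_ps(lo, _mm_load_ss(base + _mm_extract_epi32(ilo, 2)), 0x20);
    lo = _mm_insert_ps(lo, _mm_load_ss(base + _mm_extract_epi32(ilo, 3)), 0x30);

    __m128 hi = _mm_load_ss(base + _mm_cvtsi128_si32(ihi));
    hi = _mm_insert_ps(hi, _mm_load_ss(base + _mm_extract_epi32(ihi, 1)), 0x10);
    hi = _mm_insert_ps(hi, _mm_load_ss(base + _mm_extract_epi32(ihi, 2)), 0x20);
    hi = _mm_insert_ps(hi, _mm_load_ss(base + _mm_extract_epi32(ihi, 3)), 0x30);

    return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);  // VINSERTF128
}
```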
bronxzv
New Contributor II

> NVIDIA's presentation on G80's SFU (http://arith.polito.it/foils/11_2.pdf) can give you an indication of the size of tables that are required for high accuracy: between 6.5 and 13 cache lines

they use 2nd-order polynomials, so what I was calling M is 3 in this case; with 3rd-order polynomials M = 4, which is a better fit for 128-bit loads. If the table is big it's even more important to use an AoS layout, i.e. (c0,c1,c2) packed together in your example


I'll see if I can find a pointer to Abrash's comment and I will post it here
bronxzv
New Contributor II
>So basically the throughput would be the same as the number of unique cache lines that need to be accessed.

I'm afraid you are way too optimistic

Forsyth also says (slide 47)

"
Gather/scatter limited by cache speed

L1$ can only handle a few accesses per clock, not 16 different ones
Address generation and virtual->physical are expensive
Speed may change between different processors
"

still searching for the Abrash reference...
bronxzv
New Contributor II

I just found the Abrash reference:

http://www.drdobbs.com/high-performance-computing/216402188;jsessionid=UXUQLDWA2IBWJQE1GHRSKH4ATMY32JVN?pgno=5

"
Finally, note that in the initial version of the hardware, a few aspects of the Larrabee architecture -- in particular vcompress, vexpand, vgather, vscatter, and transcendentals and other higher math functions -- are implemented as pseudo-instructions, using hardware-assisted instruction sequences, although this will change in the future.

"

levicki
Valued Contributor I
Adding gather without 256-bit datapaths is out of the question; gather would not be too useful without them.

So, in the best case you will use 18 instructions to do a single parallel load instead of one?

1. What if register pressure is high, and you don't have GPRs to spare?

You will be spilling registers to memory and reloading them, generating additional cache/bus traffic for an already memory-intensive operation with poor data locality. How will that help?

2. How will the compiler "learn" to perform such a parallel load in order to be able to vectorize loops where such a load is needed?

The best you can do is write it on your own each time you need it, using intrinsics or inline assembler. Instead of writing one intrinsic/instruction, or letting the compiler take care of it, you will have to write 18 or more.

As I said multiple times, the initial implementation of gather does not have to be faster than the current alternative as far as I am concerned -- at the very least it will make code clearer, enable the compiler to auto-vectorize more loops, and pave the way for future hardware implementations which will be considerably faster.
capens__nicolas
New Contributor I
Quoting Igor Levicki
As I said multiple times, the initial implementation of gather does not have to be faster than the current alternative as far as I am concerned -- at the very least it will make code clearer, enable the compiler to auto-vectorize more loops, and pave the way for future hardware implementations which will be considerably faster.


Amen.

bronxzv
New Contributor II
>1. What if register pressure is high, and you don't have GPRs to spare?

you re-use a *single* architected GPR 8 times; that's a non-issue


>2. How will the compiler "learn" to perform such a parallel load in order to be able to vectorize loops where such a load is needed?

much like if there were a vscatter instruction, but by simply instantiating the optimized code; the compiler will be free to optimize across multiple scatters (unlike if it were a hardware instruction). In most of the cases I'm aware of you use a series of scatters with the same packed indices, so the access pattern can be optimized when considering all the scatters together, and that's of paramount importance since the bottleneck is clearly the L1 DCache (the number of ports is the #1 limitation) and always will be
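a scalar sketch of that cross-operation optimization (my own illustration, shown for gathers with M = 3 SoA streams; the same idea applies to scatters): each index is extracted once per lane and feeds all three streams, something a single hardware instruction could not amortize

```cpp
// Three SoA gathers (X, Y, Z) driven by one shared index vector: the
// per-lane index extraction (PEXTRD in the vectorized form) happens once
// per lane instead of once per gather.
void gather8x3(const float* X, const float* Y, const float* Z,
               const int idx[8], float x[8], float y[8], float z[8])
{
    for (int lane = 0; lane < 8; ++lane) {
        const int j = idx[lane];   // extracted once
        x[lane] = X[j];            // reused for all three streams
        y[lane] = Y[j];
        z[lane] = Z[j];
    }
}
```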

bronxzv
New Contributor II
Amen to what?

if you use high-level constructs like inlined vscatter(), vcompress(), etc. functions, I don't see why generating a single instruction instead of several would make the source code any clearer

maybe you are talking about the ASM
capens__nicolas
New Contributor I
Quoting bronxzv
>So basically the throughput would be the same as the number of unique cache lines that need to be accessed.

I'm afraid you are way too optimistic

Forsyth also says (slide 47)

"
Gather/scatter limited by cache speed

L1$ can only handle a few accesses per clock, not 16 different ones
Address generation and virtual->physical are expensive
Speed may change between different processors
"


He's merely saying that there should be some coherence to achieve a good speedup with gather/scatter. The worst case is when each of the elements is on a different cache line, which will take 16 L1 cache accesses. But for all the applications already mentioned here there will be high coherence between the addresses of the elements being loaded/stored, so typically it will be much faster.

It's interesting that Forsyth mentions that the L1 can handle "a few" accesses. With two read ports the worst case for gather would be just 8 cycles. In the case of AVX it's a mere 4 cycles (versus the 18 instructions needed to emulate it). And again, that's the worst case; the typical case is probably 1-2 cycles!

Another option is to organize the L1 cache into banks. Basically, each of the load/store units would have its own L1 cache bank (or a pair of load/store units could share a multi-ported cache bank, so there could be four load/store units in total with merely two banks). Since there can be duplicate data in each of the banks, it may be necessary to increase the total L1 cache size to ensure good temporal coherency, though. But note that this is actually already how a multi-core architecture works. Obviously it's a lot cheaper to double the L1 size than to double the number of cores. Anyway, cache banking might only be worth it when expecting very high gather/scatter performance even with incoherent data locations. Given that Larrabee has wider vectors and is definitely running lots of SIMD code, my guess is that Abrash and Forsyth are talking about the possibility of further improving gather/scatter performance, closer to how GPUs perform. Quoting the rest of Forsyth's slide 47:

"Offsets referring to the same cache line canhappen on the same clock

A gather where all offsets point to the same cache line willbe much faster than one where they point to 16 differentcache lines

Gather/scatter allows SOA/AOS mixing, but data layout design is still important for top speed"

So it's all just clarifying that a compromise was made to keep the logic size reasonable. With a sensible data layout this compromise doesn't stand in the way of achieving high performance.

There are clearly many options, ranging from the trivial microcoded implementation to multi-banking + multi-porting + multiple load/store units. But the point is that it's a risk-free investment. I think it's perfectly fine if the first implementation takes 4 cycles by using the two load units. And if after several years Intel benchmarks the applications which use these instructions and finds that it's not worth the transistors to reduce that to 1 cycle, that's fine too. If they find that gather/scatter is widely used and that a faster implementation makes a significant difference, great!
