Intel® ISA Extensions

Sandy Bridge: SSE performance and AVX gather/scatter

capens__nicolas
New Contributor I
7,900 Views
Hi all,

I'm curious how the two symmetric 128-bit vector units on Sandy Bridge affect SSE performance. What are the peak and sustainable throughputs for legacy SSE instructions?

I also wonder when parallel gather/scatter instructions will finally be supported. AVX is great in theory, but in practice parallelizing a loop requires the ability to load/store elements from (slightly) divergent memory locations. Serially inserting and extracting elements was still somewhat acceptable for SSE, but with 256-bit AVX it becomes a serious bottleneck, which partially cancels its theoretical benefits.
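
To illustrate what I mean by serial emulation, here is a minimal sketch in C with AVX intrinsics (the names are mine, for illustration only):

#include <immintrin.h>

/* Emulated "gather": load 8 floats from divergent addresses into one
   YMM register, one scalar element at a time. */
static inline __m256 gather_ps_emulated(const float *base, const int idx[8])
{
    return _mm256_set_ps(base[idx[7]], base[idx[6]], base[idx[5]], base[idx[4]],
                         base[idx[3]], base[idx[2]], base[idx[1]], base[idx[0]]);
}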

Sandy Bridge's CPU cores are actually more powerful than its GPU, but the lack of gather/scatter will limit the use of all this computing power.

Cheers,

Nicolas
0 Kudos
125 Replies
capens__nicolas
New Contributor I
825 Views
Quoting bronxzv
my understanding is that the 2nd load port and the decoded icache are very effective to maximize legacy SSE code throughput, but AFAIK the only extra execution unit for SSE is the 2nd blend unit, which execution units do you have in mind ?


I must have incorrectly interpreted these 'Execution Cluster' images: http://www.anandtech.com/show/3922/intels-sandy-bridge-architecture-exposed/3

I was under the impression that two ports were now each able to take 128-bit ADD and MUL instructions. Apparently they got extended in the other direction though, overlapping logic from the integer pipelines.

Frankly I have to say this makes AMD's Bulldozer architecture look really interesting, at least for AVX in its current form.

On the other hand, it means that whatever is responsible for the 30% performance increase for SwiftShader is really impressive! I've only been able to play with a Sandy Bridge demo system for 15 minutes, so I don't have a detailed analysis, but the dual load ports are likely a big help for the sequential load hotspots (texture sampling, vertex fetch, lookup tables, etc.).

Still, given that a gather instruction would ideally be able to perform 8 load operations and a matrix transposition in a single cycle, I think that would help even more than 30%. With integer AVX support the throughput could theoretically double, so it's important not to make it data-starved with sequential loads/stores. There's plenty of cache-line coherence, which can be exploited with gather/scatter.

So my only hope is that Intel doesn't leave things half-done. As I've mentioned before, in the future developers will need architectures with a balanced mix of ILP, DLP and TLP. It looks like Sandy Bridge's dual load ports are a step forward in ILP, and AVX is an attempt at increasing DLP, but the potential of more than doubling the performance is held back by the lack of integer operations and gather/scatter. I fully realize these things take time, but it would have been incredibly useful to already have access to these instructions, even if not implemented optimally.

As for TLP, I believe the number of cores should keep increasing, but not at the expense of completing the AVX instruction set. Here's why: it will still take many years before compilers and tools can assist in or automate multi-threaded development with good scaling behavior. So for now it's easier to achieve higher performance in a wide range of software by parallelizing performance-critical loops, rather than attempting to split the work into threads.

But that's just my take. I'm curious what Intel's vision of the long-term future is and how they plan to achieve synergy with software development.

0 Kudos
Thomas_W_Intel
Employee
825 Views
Igor,

I am aware that you already support different code paths, but my impression was that you would prefer fewer code paths than bronxzv does. In any case, you certainly have a valid point that requiring customers to upgrade hardware and software at the same time puts an extra burden on them, and I don't think it matters whether you label this as "business" or "technical".

I would like to stress that I cannot speak on behalf of Intel. Personally, however, I highly appreciate your technical insights and see tremendous value in your suggestions and feedback. In fact, I had already pointed out this thread to Intel architects who are working on future instruction sets, and they read it with great interest.

Kind regards
Thomas
0 Kudos
levicki
Valued Contributor I
825 Views
Thomas,

So it was a misunderstanding. I am not against multiple code paths.

Regarding customers having to upgrade both hardware and software, please bear in mind that if something is done only with intrinsics, it also forces developers to upgrade software (compiler) to be able to pass the benefit down to the end user.

That is not only costly for developers in terms of money, but it requires a lot of work on our side just to be sure that upgrading the compiler doesn't break build or backward compatibility, or that it does not introduce some regressions elsewhere. On larger projects, the cost of such an effort often blocks the compiler upgrade initiative.

When it comes to hardware, apart from adding some missing instructions such as HMINPS/HMAXPS/HMINPOSPS/HMAXPOSPS/etc, what I would love to see implemented in hardware as soon as possible is:

- GATHER
- SCATTER
- FMA
- LERP (linear interpolation)

The first two I already explained.

FMA is useful for DSP tasks (and has many other uses as well). The first implementation does not need to be faster than MUL+ADD -- it just has to provide the extra accuracy that comes from the lack of intermediate rounding.

LERP (linear interpolation) is also useful for all sorts of DSP, scientific, and medical imaging tasks, and with CPUs and GPUs converging, adding it would be a logical step forward -- with three- and four-operand non-destructive syntax available now, I don't see any issues with adding it.

If you had GATHER and SCATTER instructions, it would also be useful to have chunky (packed) to planar and planar to chunky conversion instructions. You could fetch 16 8-bit pixels, and write them out to 8 different memory locations (planes) with SCATTER as 16 consecutive bits belonging to each plane.

An example:

C2PBW XMM0, XMM1

Would reorder:

XMM1 (src) (numbers mean byte.bit)
15.07 15.06 15.05 15.04 15.03 15.02 15.01 15.00 14.07 ... 00.07 ... 00.00

To:

XMM0 (dst)
15.07 14.07 ... 00.07 15.06 14.06 ... 00.06 ... 15.00 ... 00.00

Hypothetical P2CWB would do the reverse transformation.

Of course, this "bit shuffle" could be generalized with additional parameter(s) to become useful beyond the packed to planar conversion which I had in mind.

I have SIMD code written for this purpose, but it takes a lot of instructions to do this seemingly simple transformation. It could be useful for image and video manipulation software.
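
For reference, the chunky-to-planar direction can at least be emulated fairly compactly today with SSE2 (a sketch; the generalized bit shuffle I have in mind would go beyond this):

#include <emmintrin.h>
#include <stdint.h>

/* Convert 16 chunky 8-bit pixels into 8 bit-planes of 16 bits each.
   PMOVMSKB collects the MSB of every byte; a byte-wise add then shifts
   the next-lower bit into the MSB position for the following plane. */
static void chunky_to_planar_16px(__m128i pixels, uint16_t planes[8])
{
    for (int p = 7; p >= 0; --p) {
        planes[p] = (uint16_t)_mm_movemask_epi8(pixels);
        pixels = _mm_add_epi8(pixels, pixels); /* per-byte shift left by 1 */
    }
}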
0 Kudos
capens__nicolas
New Contributor I
825 Views
By the way, even for floating-point applications integer vector operations are important. For example, see my 2^x implementation here: http://www.devmaster.net/forums/showpost.php?p=43569&postcount=10
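
The core idea, as a rough sketch (crude illustrative coefficients, and it assumes x >= 0 and well inside float range), is to build 2^i with integer ops on the exponent field and multiply by a small polynomial for 2^f:

#include <emmintrin.h>

/* Sketch of a vectorized 2^x: split x into integer part i and fraction f,
   build 2^i by integer ops on the float exponent field (PADDD + PSLLD),
   and approximate 2^f with a small polynomial. */
static inline __m128 exp2_ps_sketch(__m128 x)
{
    __m128i i = _mm_cvttps_epi32(x);               /* CVTTPS2DQ         */
    __m128  f = _mm_sub_ps(x, _mm_cvtepi32_ps(i)); /* CVTDQ2PS + SUBPS  */
    /* 2^i: add the exponent bias (127), shift into the exponent field. */
    __m128  p2i = _mm_castsi128_ps(
        _mm_slli_epi32(_mm_add_epi32(i, _mm_set1_epi32(127)), 23));
    /* 2^f ~= 1 + 0.6565*f + 0.3435*f^2 on [0,1) -- illustration only.  */
    __m128  p2f = _mm_add_ps(_mm_set1_ps(1.0f),
                  _mm_mul_ps(f, _mm_add_ps(_mm_set1_ps(0.6565f),
                                _mm_mul_ps(f, _mm_set1_ps(0.3435f)))));
    return _mm_mul_ps(p2i, p2f);
}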
0 Kudos
capens__nicolas
New Contributor I
825 Views
Quoting Igor Levicki
LERP (linear interpolation) is also usefull for all sorts of DSP, scientific, and medical imaging tasks, and with CPUs and GPUs converging adding it would be a logical step forward -- with three and four operand non-destructive syntax available now I don't see any issues with adding it.

I believe NVIDIA implements LERP using only FMA execution units (i.e. using multiple instructions).

I don't think it makes sense to add hardware support for LERP. First of all, would it have to be implemented as a*x+b*(1-x), or as a+(b-a)*x? There can be significant differences in the result due to rounding, denormals, or NaNs. Either way you're looking at adding another multiplier or adder, which increases latency. And it won't be used very often (except maybe in very select applications), so there's not a good return in performance for the transistor investment.

I believe two FMA units per core would offer a much better tradeoff between area and performance.
0 Kudos
bronxzv
New Contributor II
825 Views
note that the common E(x) (integer part) idiom

cvtps2dq xmm2, xmm1
cvtdq2ps xmm1, xmm2

can be directly extended to AVX-256 since there is a 256-bit variant of these two goodies

only your paddd and pslld will require 2 instructions instead of 1, it doesn't matter much since they aren't in the critical path
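
for illustration, this is the kind of 2-instruction split I mean for a 256-bit integer add on AVX (a sketch with illustrative names; the extract/insert shuffles around the two adds are extra overhead):

#include <immintrin.h>

/* 256-bit integer add emulated on AVX1: split into two 128-bit halves,
   apply the SSE2 paddd to each, and reassemble. */
static inline __m256i add_epi32_avx1(__m256i a, __m256i b)
{
    __m128i lo = _mm_add_epi32(_mm256_castsi256_si128(a),
                               _mm256_castsi256_si128(b));
    __m128i hi = _mm_add_epi32(_mm256_extractf128_si256(a, 1),
                               _mm256_extractf128_si256(b, 1));
    return _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1);
}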

I'm sure you know that, but if you want to use this function in a loop you are better off keeping your constants (C0, C1, ...) in registers as much as possible; it isn't that important for 128-bit code, but too many loads are a big limiter for SSE to AVX-256 scalability

all in all this is a wonderful example to put AVX-256 in a good light
0 Kudos
levicki
Valued Contributor I
825 Views
a + (b - a) * x is what I believe to be the most commonly used variant. I don't think it would be too hard to implement. I don't mind if it gets implemented using more than one uop internally, as long as it performs the same as or better than the mix of instructions needed to perform the same operation.

I have also been suggesting an instruction that returns just the fractional part of float to int conversion (hypothetical FRACPS).

For example I have 1.55f. To get 0.55f I have to:

CVTTPS2DQ(1.55f) = 1
CVTDQ2PS(1) = 1.0f
SUBPS(1.55f - 1.0f) = 0.55f

That is also a pretty common operation. Even better if the instruction could return both the fractional and integer parts.
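
In intrinsics form, that three-instruction idiom is (a sketch; only valid for values inside the int32 range, truncating toward zero):

#include <emmintrin.h>

/* frac(x) via truncate-and-subtract: CVTTPS2DQ + CVTDQ2PS + SUBPS. */
static inline __m128 frac_ps(__m128 x)
{
    __m128i i = _mm_cvttps_epi32(x); /* CVTTPS2DQ: truncate to int */
    __m128  t = _mm_cvtepi32_ps(i);  /* CVTDQ2PS: back to float    */
    return _mm_sub_ps(x, t);         /* SUBPS: x - trunc(x)        */
}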

0 Kudos
capens__nicolas
New Contributor I
825 Views
Quoting bronxzv
only your paddd and pslld will require 2 instructions instead of 1, it doesn't matter much since they aren't in the critical path

I'm sure you know that, but if you want to use this function in a loop you are better off keeping your constants (C0, C1, ...) in registers as much as possible; it isn't that important for 128-bit code, but too many loads are a big limiter for SSE to AVX-256 scalability

all in all this is a wonderful example to put AVX-256 in a good light


Indeed there will be a nice performance improvement, but my point was that even for floating-point applications you need integer operations. So a new code path will be required once the 256-bit instructions appear.

Anyway, I respect your opinion that for you this doesn't matter much. For other projects it's a big deal to go through another development cycle though. And my only point with this example was that it will likely also affect applications which are highly floating-point oriented.

Of course nobody can change anything about that now, but if for example it's possible to reasonably easily add the 256-bit packed integer instructions to Ivy Bridge (by executing them as two 128-bit operations), instead of waiting for Haswell, that would be beneficial for everyone. Or if Haswell is only planned to add FMA, and 256-bit packed integer support was planned even later (when full 256-bit execution units are feasible), I think it would still be better to expose the integer instructions in Haswell.

It's clear to me that the execution core has to be extended in stages. It doesn't make sense for Intel to invest a lot of transistors into something that won't be widely used for several years, and for which the usage pattern isn't clear yet. But that doesn't mean the instructions aren't useful yet, even if not executed at full rate. In fact it might make sense to leave it that way...

Just look at NVIDIA's Fermi architecture. Some implementations have 4 FMA units for every SFU unit, while other implementations have 6 FMA units for every SFU unit. Also for some implementations every pair of FMA units can execute a double-precision floating-point operation, while other implementations appear to use the SFU for that. The instructions haven't changed, they just have different latency. This allows them to adjust the hardware to the 'mix' of instructions applications use.

For AVX this means it might really make sense to only have 128-bit packed integer execution units for many years to come. But the 256-bit instructions are needed so developers can use them and Intel can in turn analyze their usage and evolve the hardware accordingly. It allows them to very accurately determine when it makes sense to extend the integer execution units to 256-bit, if ever. Analyzing the SSE packed integer instructions is not entirely the same, because developers may decide to use a different implementation due to register pressure and the additional instructions needed to get data to and from the upper half of the YMM registers.

The 128-bit execution of 256-bit DIV and SQRT instructions is a prime example of how to do this the right way. And it shows that 256-bit packed integer instructions were within reach for Sandy Bridge, which makes me hopeful that they'll be added to Ivy Bridge or Haswell at the latest.

0 Kudos
capens__nicolas
New Contributor I
825 Views
Quoting Igor Levicki
a + (b - a) * x is what I believe to be the most commonly used variant. I don't think it would be too hard to implement. I don't mind if it gets implemented using more than one uop internally, as long as it performs the same as or better than the mix of instructions needed to perform the same operation.

I have also been suggesting an instruction that returns just the fractional part of float to int conversion (hypothetical FRACPS).

For example I have 1.55f. To get 0.55f I have to:

CVTTPS2DQ(1.55f) = 1
CVTDQ2PS(1) = 1.0f
SUBPS(1.55f - 1.0f) = 0.55f

That is also a pretty common operation. Even better if the instruction could return both the fractional and integer parts.


a + (b - a) * x is susceptible to precision issues. If a = 1.0f and b = 1.0e-24f, then (b - a) = -1.0f and when x = 1.0f the result is 0.0f. This may lead to a division by zero. So generally a * (1 - x) + b * x is preferred, which doesn't suffer from this issue. However, it's clearly even more expensive to add hardware support for it.
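
A quick scalar demonstration (plain C; the fast variant prints 0 where b would be the correct answer):

#include <stdio.h>

/* The two common lerp formulations, and the rounding hazard at x = 1. */
static float lerp_fast(float a, float b, float x) { return a + (b - a) * x; }
static float lerp_safe(float a, float b, float x) { return a * (1.0f - x) + b * x; }

int main(void)
{
    float a = 1.0f, b = 1.0e-24f;
    printf("fast: %g\n", lerp_fast(a, b, 1.0f)); /* 0 -- a later division by it faults */
    printf("safe: %g\n", lerp_safe(a, b, 1.0f)); /* 1e-24, exactly b                   */
    return 0;
}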

Note that neither DirectX 10 nor OpenCL defines a LERP instruction or macro. It's just too tricky to know what the developer expects. And I haven't even touched the issue of what should happen with the rounding bits. Clearly it's better to just let the developer write the lerp variant he wants explicitly. I don't think there's any realistic potential for optimizing it in hardware, except for making use of FMA.

As for a fraction operation, SSE4 features the ROUNDPS instruction, which can perform four different rounding operations (FLOOR, CEIL, ROUND and TRUNC), depending on the immediate operand. So a fraction operation only takes two instructions, which I think is already quite good. And it works outside the range of 32-bit integers as well. Anyway, the upper five bits of the immediate operand are reserved, so they might extend it to include a FRAC mode too (the integer part obviously corresponds to your choice of FLOOR, CEIL or TRUNC). My most important use of ROUNDPS is to compute the fractional part, so it would be welcome indeed.
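
With SSE4.1 that two-instruction sequence is simply (a sketch, using the FLOOR flavor):

#include <smmintrin.h>

/* frac(x) = x - floor(x), via ROUNDPS + SUBPS. */
static inline __m128 frac_floor_ps(__m128 x)
{
    return _mm_sub_ps(x, _mm_floor_ps(x));
}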

0 Kudos
levicki
Valued Contributor I
825 Views
Regarding lerp(), I just checked, GPU hardware uses this form:

dest = src0 * src1 + (1 - src0) * src2

Furthermore, various shader compilers often replace the LRP instruction with FMA when possible nowadays, probably because hardware implementations have changed considerably since LRP was first introduced (i.e. I don't believe GPUs had FMA back then).
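
For instance, a two-FMA expansion of that form (FMA3 intrinsics; a sketch of what such a compiler might emit):

#include <immintrin.h>

/* dest = src0*src1 + (1 - src0)*src2, expanded into two fused ops. */
static inline __m256 lrp_fma(__m256 s0, __m256 s1, __m256 s2)
{
    __m256 t = _mm256_fnmadd_ps(s0, s2, s2); /* s2 - s0*s2 = (1 - s0)*s2 */
    return _mm256_fmadd_ps(s0, s1, t);       /* s0*s1 + t                */
}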

Note that I did list LERP after FMA because I was aware that they partially overlap. What I was trying to say is that if you have one, you can have the other as well without breaking the budget.

Regarding ROUNDPS, yes, I know about that instruction, but other than being able to pick any rounding mode for the integer (as float) result, there is no advantage in my case -- I need both the integer (as int) result and the fraction (as float) result. So no, unfortunately I cannot reduce the number of instructions with ROUNDPS.
0 Kudos
bronxzv
New Contributor II
825 Views
>big deal to go through another development cycle though. And my only point with this

it's not a new "development cycle" if you work at a high level, just a recompilation; the point I was trying to make all along is that development cycles are longer than ISA refresh cycles, and we already have to plan for FMA3 anyway (and then 512-bit vectors, etc.). Btw, imagine what we would have to do if Larrabee were here as yet another "x86" target! So it's no longer practical to work in assembly, or even directly with the intrinsics, for any multi-man-year project; otherwise you spend all your human resources optimizing zillions of different code paths instead of adding features and improving your algorithms

>And it shows that 256-bit packed integer instructions were within reach for Sandy Bridge, which makes me hopeful that they'll be added to Ivy Bridge or Haswell at the latest.

given that "post 32-nm" instructions like the FP16 conversions and even FMA3 are already supported by the Intel C++ compiler and the SDE, but the 256-bit integer instructions are not, I'll not count on them ATM for any project
0 Kudos
levicki
Valued Contributor I
825 Views
>it's not a new "development cycle" if you work at a high level, just a recompilation...

Well, that depends heavily on the project size, company size and organisational structure.

If it is a "one man band" (i.e. if you are working alone), then yes, it is possible just to recompile whenever you want, although I sincerely hope that you at least perform functional and performance regression testing.

I had a situation where a new version of the Intel C++ compiler, together with recompilation, reduced overall performance by 10% -- I had to change compilation options and restructure parts of the code just to stay at the previous performance level, not to mention that intrinsics code output (and thus its performance) can also differ between two compiler versions.

And what about the situation where you want to have multiple code paths, and a new path is introduced while an old one you still need is removed at the same time?

For example, how will you keep supporting Pentium 3, which no longer has a code path in IPP 7.0, alongside everything else up to and including Sandy Bridge? You have to keep using IPP 6.1 for that Pentium 3, and IPP 7.0 for Sandy Bridge.

Another example: compiler options change, and with the next compiler and another recompile you can no longer generate specialized code paths for specific architectures, even though in some cases you had considerable benefits from doing so.

So no, it is not possible to do it all "at a high level", and there is a reason why you implement things in hardware and write low-level code. Software implementations are way too fluid and introduce too many variables, not to mention additional cost.

Regarding 256-bit packed integer support versus FP16 and FMA3 support -- to me it is clear that Intel is playing catch-up with GPUs.

They will fail at that, simply because GPUs have a much shorter release cycle and are advancing performance and functionality at a much faster rate than CPUs.

GPUs have an industry that drives the change (gaming/multimedia), while CPUs still benchmark their base performance on synthetic workloads from five years ago, which is IMO a rather pathetic goal given how much R&D money is being burned every year.

0 Kudos
bronxzv
New Contributor II
825 Views
>sincerely hope that you at least perform functional

sure we do


>and performance regression testing.

yep we do, we explain it here

http://www.inartis.com/Products/Kribi%203D%20Engine/Default.aspx

"

The essential task in the final phase of the development of a 3D rendering engine, much like for a racing car engine, is tuning for the best possible performances on actual machines. After each small change in the program code, very precise timings show us the amount of speedup achieved (if any). For this purpose, we time with a stopwatch the rendering of a sequence of images, the laps of our racecourse.


"

>Pentium 3 which no longer has a code path in IPP 7.0

IIRC they have reverted some new limitations (i.e. removed paths) after users complained, in the XE 2011 release (IPP 7 update 1); not sure about the Katmai path though

0 Kudos
capens__nicolas
New Contributor I
825 Views
Quoting Igor Levicki
Furthermore, various shader compilers often replace the LRP instruction with FMA when possible nowadays, probably because hardware implementations have changed considerably since LRP was first introduced (i.e. I don't believe GPUs had FMA back then).

Note that I did list LERP after FMA because I was aware that they partially overlap. What I was trying to say is that if you have one, you can have the other as well without breaking the budget.


GPUs had FMA support from the very beginning, and LERP has always been a macro which expanded to multiple instructions. Nothing has changed there as far as I'm aware. In fact like I said LERP is gone from DirectX 10 and OpenCL.

So I don't see the point in adding a LERP instruction to the CPU either. It's just additions and multiplies, so you need adders and multipliers. If you want LERP to be faster, you just need FMA execution units, and if you want it to be faster still, you need wider vectors. Both FMA and wider vectors help all other arithmetic too. But there's no way you can gain anything from a LERP instruction (unless you want to waste die space).

I think it's very helpful to have early access to new instructions if there's the potential they will become faster in later generations, but for the case of LERP I just don't see that potential.

0 Kudos
levicki
Valued Contributor I
825 Views
Well, LERP uses a constant (1.0) which you need to load from memory or keep in a register. A hardcoded instruction could use an internal constant, like it was possible with the FPU. That would either remove one cache/memory access or free one SIMD register, compared with the sequence of instructions that does the same thing.
0 Kudos
jan_v_
New Contributor I
825 Views

Anyone already experimented with AVX2 gather?

I've been using it for texture interpolation. Unfortunately I've seen only small speedups in my code ported from SSE2 to AVX2 -- less than 10%. The gather has the rather annoying property that it resets the mask. From a software point of view this is totally useless and requires more code. A gather without a mask would also have been nice, to save a register.
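
To illustrate the quirk, a minimal sketch (AVX2 intrinsics, illustrative names); because the instruction zeroes the mask register, an all-ones mask has to be re-materialized before every gather in a loop:

#include <immintrin.h>

/* Masked gather of 8 floats; vgatherdps clears the mask on completion. */
static inline __m256 gather_masked(const float *base, __m256i idx)
{
    __m256 src  = _mm256_setzero_ps();                        /* masked-off lanes */
    __m256 mask = _mm256_castsi256_ps(_mm256_set1_epi32(-1)); /* all lanes active */
    return _mm256_mask_i32gather_ps(src, base, idx, mask, 4); /* scale = 4 bytes  */
}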

From what I read, the gather is executed as microcode, producing many uops, which explains the poor performance... At least it's there, and hopefully it will get better later.

0 Kudos
TimP
Honored Contributor III
825 Views

Guessing at what you're referring to: gather with mask is used to avoid potential access failures when using uninitialized pointers in the masked-off elements.  Did you check whether -opt-assume-safe-padding is applicable to your compiler choices?

I don't know whether there would be a 32-bit signed vs. unsigned index vs. 64-bit pointer performance issue, in part since I can't see your code from here and anyway am not an expert in your subject.

As you indicate, the initial implementation of gather was not advertised as a performance improvement over equivalent compiler-simulated gather.  But you didn't say whether you tried SSE4 with simulated gather, for example.

0 Kudos
jan_v_
New Contributor I
843 Views

Having the mask is fine. The mask indicates which elements to load; the annoying thing is that after the gather has finished, the mask is set to 0.  With the VS2012 compiler, using the gather intrinsic, the compiler doesn't even take into account that the mask is set to 0, leading to wrong code. I already filed a bug for that with Microsoft.

The reason I see no performance improvement is, imho, simply because the cache cannot be loaded in parallel from up to 8 addresses, and everything is serialized there...

0 Kudos
TimP
Honored Contributor III
843 Views

OK; as you say, I haven't seen VS2012 generate simulated (and certainly not AVX2) gather, and AVX is the only /arch: option there which could be used.

I agree, the 8 separate addresses could each touch a different cache line, and the accesses may be limited to 2 vector elements per clock cycle, which you could probably achieve with SSE2.

0 Kudos
bronxzv
New Contributor II
843 Views

jan v. wrote:
Anyone already experimented with AVX2 gather?

I have ported a big project to AVX2 well before being able to test it on actual hardware (it was simply validated with the SDE). As soon as I could start tuning my code on a retail Core i7 4770K setup, my first change was to comment out all my specialized AVX2 gather code paths. Software-synthesized gather (as used in my AVX code path) is faster, so the gather "optimization" was in fact a performance regression!

As you can see in the Optimization Manual https://www-ssl.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf page C-5, gather instructions have high reciprocal throughput (~10 clocks for 256-bit gather instructions). Hint: there is some advice on pages 11-59/11-60 about gather optimizations.

Also of note :
- VTune Amplifier reports a lot of performance warning events when using the gather instructions.
- There are a number of errata related to gather in the Haswell specification update http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/4th-gen-core-family-desktop-specification-update.pdf.

My impression so far is that if someone gets significant speedups from gather instructions, it says more about how poorly optimized his/her software-synthesized gather path (required for legacy targets) is than about how fast the gather instructions are. Btw, there is a common myth that gather instructions are somehow required for proper vectorization; that's clearly not true, since these instructions are slower than (or, in some corner cases, maybe slightly faster than) their emulation with other basic instructions, so there are no more and no fewer vectorization opportunities.

All in all, I'll say that AVX2 gather instructions aren't ready for prime time at the moment. 

0 Kudos
Bernard
Valued Contributor I
843 Views

>>>gather instructions have high reciprocal throughput (~10 clocks for 256-bit gather instructions). Hint: there is some advice on pages 11-59/11-60 about gather optimizations.>>>

The optimization manual also states that the CPI for gather instructions is measured when the memory references are cached in L1. If that is not the case, CPI will rise due to the latency of memory access.

0 Kudos
Reply