Intel® ISA Extensions

Converging AVX and LRBni

capens__nicolas
New Contributor I
2,080 Views
Hi all,

With Larrabee being canned as a discrete GPU, I was wondering whether it makes sense to actually let the CPU take the role of GPU and high-throughput computing device.

Obviously power consumption is a big issue, but since AVX is specified to process up to 1024-bit registers, the CPU could execute such wide operations on SNB's existing 256-bit execution units over four cycles (throughput). Since it's one instruction, this takes a lot less power than four 256-bit instructions. Basically you get the power-efficiency benefit of in-order execution within an out-of-order architecture.
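To make the idea concrete, here's a minimal software analogy (purely illustrative; no AVX-1024 hardware exists, and the v1024 type and v1024_add helper are made-up names): hold a logical 1024-bit vector as four 256-bit quarters and sequence one "operation" as four passes over the existing 256-bit units.

#include <immintrin.h>

/* Software analogy only: a hypothetical 1024-bit "register" stored as four
   256-bit quarters, with one logical operation sequenced over the existing
   256-bit hardware in four steps, the way a single AVX-1024 instruction
   might be pumped through SNB's execution units. */
typedef struct { __m256 q[4]; } v1024;

static inline v1024 v1024_add(v1024 a, v1024 b)
{
    v1024 r;
    for (int i = 0; i < 4; ++i)        /* four passes over a 256-bit unit */
        r.q[i] = _mm256_add_ps(a.q[i], b.q[i]);
    return r;
}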

The only other thing that would be missing to be able to get rid of the IGP (and replace it with generic cores) is support for gather/scatter instructions. Since SNB already has two 128-bit load units, it seems possible to me to achieve a throughput of one 256-bit gather every cycle, or 1024-bit every four cycles. In my experience (as lead SwiftShader developer) this makes software texture sampling perfectly feasible, while also offering massive benefits in all other high-throughput tasks.

Basically you'd get Larrabee in a CPU socket, without compromising any single-threaded or scalar performance!

Thoughts?

Nicolas
0 Kudos
62 Replies
Matthias_Kretz
New Contributor I
1,078 Views
Have you missed the MIC announcement? I agree that it would be nice to have LRBni features in the CPU. But at least Intel hasn't completely given up on the Larrabee developments yet.
0 Kudos
capens__nicolas
New Contributor I
1,078 Views
Quoting Matthias_Kretz

Have you missed the MIC announcement? I agree that it would be nice to have LRBni features in the CPU. But at least Intel hasn't completely given up on the Larrabee developments yet.

No, I haven't missed the MIC announcement. It looks like Knights Corner will be quite impressive and could be very successful in the HPC market (NVIDIA's Tesla chips are selling like mad for supercomputers, despite the lack of a fully generic/flexible programming model and relatively low effective performance on complex tasks).

That said, the MIC product seems to exist mainly to recoup the investment in Larrabee as a discrete GPU. Frankly, it's too soon for high-end graphics to become fully programmable. There's potential to do things more efficiently, but the software market isn't going to radically change course overnight, and Larrabee can't match the performance of the competition's cards, which have been fully dedicated to Direct3D/OpenGL rendering for ages. Larrabee could have been great as a game console chip, but it looks like Intel lost that opportunity as well.

Low-end graphics doesn't have that disadvantage though. The competition isn't nearly as fierce, and a CPU with high throughput performance would be welcome in a very wide variety of markets. It can deliver adequate low-end graphics with limitless features, but it would also be awesome at other tasks. By increasing the number of cores they can even conquer the mid-range graphics market and eventually the high-end market. This bottom-up approach seems a lot more realistic and low-risk to me than Larrabee's unsuccessful top-down attack on the market. Imagine hardcore gamers buying 256-core CPUs by the end of this decade, and other markets having other core counts to satisfy their computing needs. I don't think heterogeneous architectures are the future; developers detest them (i.e. they're limited by their features), and a homogeneous architecture which combines the best of both worlds is within reach.

Intel could leverage what it learned from Larrabee to obtain graphics market dominance through its CPU product line. Executing 1024-bit instructions on 256-bit execution units can help keep the power consumption in check (and help hide memory latency), while gather/scatter is essential for the variety of memory access patterns and data ordering operations needed by graphics and many other computing tasks. Both of these features seem perfectly feasible, without hurting performance for legacy workloads.

0 Kudos
capens__nicolas
New Contributor I
1,078 Views
Could any Intel engineer/scientist tell me what the expected power efficiency would be for executing AVX-1024 on 256-bit execution units?

Since out-of-order execution takes a lot of power, it seems to me that replacing four AVX-256 instructions with one AVX-1024 instruction, without widening the execution unit, could be a fairly significant power saving, close to that of in-order execution (like Larrabee or other throughput-oriented architectures). That would combine the advantages of both the GPU and the CPU.

Or am I overlooking something that makes heterogeneous architectures more attractive? Then why does Sandy Bridge have more GFLOPS in its CPU than in its IGP? All that's lacking to get good efficiency is gather/scatter and some technology to lower the power consumption...
0 Kudos
capens__nicolas
New Contributor I
1,078 Views
I just realized I might be able to partially estimate the power consumption impact myself, using existing instructions which operate on registers wider than the actual execution units.

One of these is the vdivps instruction. It's executed on a 128-bit execution unit, so vdivps essentially replaces two divps instructions. I wonder, though, whether vdivps is split into multiple uops which are scheduled separately, or whether it's one uop and the execution unit takes care of the sequencing? Unfortunately it looks like it's the former, since Agner Fog's documents say it takes 3 uops (essentially two divps and a vinsertf128).

Can anyone think of other instructions or perhaps an entirely different approach to estimate the power consumption impact of executing 1024-bit AVX instructions on (sequencing) 256-bit execution units? Is this even a viable idea at all?
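One crude experiment along those lines (a sketch only, and it measures cycles rather than power): time a dependent chain of divps against a dependent chain of vdivps. If the 256-bit form costs roughly twice the 128-bit form back to back, that's at least consistent with it being sequenced through the same 128-bit divider. Compile with -mavx; the loop count and constants are arbitrary.

#include <immintrin.h>
#include <x86intrin.h>   /* __rdtsc (GCC/Clang) */
#include <stdint.h>
#include <stdio.h>

#define ITER 10000000L

int main(void)
{
    __m128 a4 = _mm_set1_ps(1.0f), b4 = _mm_set1_ps(0.9999999f);
    __m256 a8 = _mm256_set1_ps(1.0f), b8 = _mm256_set1_ps(0.9999999f);

    uint64_t t0 = __rdtsc();
    for (long i = 0; i < ITER; ++i)
        a4 = _mm_div_ps(a4, b4);        /* dependent chain of 128-bit divides */
    uint64_t t1 = __rdtsc();
    for (long i = 0; i < ITER; ++i)
        a8 = _mm256_div_ps(a8, b8);     /* dependent chain of 256-bit divides */
    uint64_t t2 = __rdtsc();

    printf("divps : %.1f cycles per divide\n", (double)(t1 - t0) / ITER);
    printf("vdivps: %.1f cycles per divide\n", (double)(t2 - t1) / ITER);

    /* keep the results live so the compiler doesn't delete the loops */
    volatile float sink = _mm_cvtss_f32(a4) + _mm256_cvtss_f32(a8);
    (void)sink;
    return 0;
}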
0 Kudos
bronxzv
New Contributor II
1,078 Views

this makes software texture sampling perfectly feasible

pure software texture sampling is already clearly feasible with today's AVX, look at this example

http://www.inartis.com/Products/Kribi%203D%20Player/FeaturesLab/page_example/Materials_bump_map.aspx

the dummies have a diffuse map, a bump map and a reflection map, and the scene is rendered (with per-sample lighting and sample-exact shadow casting) at 40+ fps @ 1920 x 1080 on a 2600K at stock frequency
0 Kudos
IDZ_A_Intel
Employee
1,078 Views
That's not good enough. Don't get me wrong, you've got an impressive software renderer. But let's face it, that scene is pretty simple. There are only two low-poly objects, the lighting is simple by today's standards (no long shaders), it appears to be using per-polygon mipmapping instead of per-quad mipmapping, I can't spot any trilinear or anisotropic filtering, and based on previous conversations we had you're packing multiple textures together. The latter is a clever trick to compensate somewhat for the lack of gather/scatter support, but unfortunately it's not generally applicable.
To really converge the GPU and CPU into a superior homogeneous architecture, higher effective performance is required, with lower power consumption. With FMA support already on the way, I think gather/scatter will be absolutely critical to be able to make efficient use of all this computing power. And executing 1024-bit instructions on 256-bit execution units should help keep the power consumption in check.
0 Kudos
bronxzv
New Contributor II
1,078 Views

There's only two low-poly objects

sure, indeed, and that's why it's an interesting example when talking about texture sampling: a significant share of the CPU budget goes to texture operations rather than scene graph traversal, geometry, etc. By the way, the 3 textures (diffuse map, bump gradient map and reflection map) are fully independent (3 distinct mipmap pyramids)

with high poly count models such as:

http://www.inartis.com/Company/Lab/KribiBenchmark/KB_Robots.aspx
or
http://www.inartis.com/Company/Lab/KribiBenchmark/KB_Skyline_City.aspx

less than 5% of the time is spent on textures

0 Kudos
capens__nicolas
New Contributor I
1,078 Views
Quoting bronxzv

sure, indeed, and that's why it's an interesting example when talking about texture sampling: a significant share of the CPU budget goes to texture operations rather than scene graph traversal, geometry, etc.

Which is exactly why it's not good enough. Real 3D applications are far more complex, so you need the extra efficiency gather/scatter would bring (not just for texturing but for other tasks as well), and you also need the lower power consumption of executing 1024-bit instructions on 256-bit execution units (for integer operations even 128-bit would probably work well).

I can't see why you'd argue against that. I hate to tell you, but WebGL and Flash Molehill (featuring SwiftShader support) could quickly make Kribi obsolete. The only way the software renderer can still prove to be useful is if it's actually consistently faster and more flexible than an IGP. You absolutely need gather/scatter to achieve that.

0 Kudos
capens__nicolas
New Contributor I
1,078 Views
By the way, the value of converging AVX and LRBni goes way beyond graphics. So gather/scatter support is likely to be added to increase SIMD efficiency anyhow, regardless of whether you see much need for it in your software renderer. The other big thing that separates the CPU from Larrabee is performance/Watt, due to the out-of-order architecture, but that might be fixable by reducing the instruction rate...

Unless anyone sees any reasons why such features are not likely to be feasible?
0 Kudos
bronxzv
New Contributor II
1,078 Views

Unless anyone sees any reasons why such features are not likely to be feasible?


as a matter of fact neither gather/scatter nor 1024-bit vectors have been announced yet for AVX, so we can't plan for these

in both cases I think they don't match well with SMT; 4-way SMT is a more likely future IMHO
0 Kudos
capens__nicolas
New Contributor I
1,078 Views
It's not about what's announced yet or not. It's about what makes sense. FMA will be added sooner or later, and at that point the bottleneck from irregular data access patterns will be unbearable. To make all those transistors spent on SIMD really count, they should add the one thing to make it complete: gather/scatter. It's only a matter of time before Intel or AMD realizes how big of a win that would be (making every CPU capable of efficient and flexible high-throughput computing like Larrabee).

1024-bit vectors haven't been announced yet either, but they are already part of the AVX spec! So it seems relatively simple to implement them by executing the instructions on 256-bit execution units. If it significantly improves performance/Watt, it makes sense and will be added at some point.

I also can't help but think Intel already has long-term plans for AVX. It must have cost a lot of transistors and a lot of design hours to implement it in Sandy Bridge. FMA and AVX-1024 clearly indicate they intend to invest even more time and resources into it, and that they don't think the IGP is suitable for anything beyond legacy graphics. Perhaps the plans beyond FMA haven't solidified yet, so I don't think it hurts to discuss what I think makes the most sense, on an Intel forum.

Besides, your product will become obsolete unless it can outperform the IGP and offer superior features. So you'd better join the discussion on what makes sense, and evaluate the cost/gain of all the options. It would give you very valuable insights once some of these things do get announced. In fact you could lead the custom software rendering revolution by offering frameworks with unique features.

I don't think 4-way SMT is worthwhile. First of all, 2-way SMT offers about 30% speedup at best, so 4-way SMT will likely offer much less. Also, I believe the thread count should be kept as low as possible. It's not unlikely that the gains from 4-way SMT are nullified by the synchronization overhead of managing twice the number of threads. Core counts will keep increasing anyway, so there's no need to make things even harder with more threads per core.

Correct me if I'm wrong, but I think what you're really after is 'strands' not threads; synchronous execution like on a GPU. I believe this is exactly what can be achieved by executing 1024-bit instructions on 256-bit execution units. It offers similar latency hiding advantages, and doesn't suffer from thread synchronization overhead. What's more, unlike 4-way SMT it reduces the instruction rate, offering power savings not unlike a GPU. Larrabee's high power consumption must have been partly due to 4-way SMT, and made it miss its goals. We don't want Intel to make that same mistake twice. For the CPU I can't really think of a reason not to implement AVX-1024 on the existing 256-bit and 128-bit execution units...
0 Kudos
bronxzv
New Contributor II
1,078 Views
>Correct me if I'm wrong, but I think what you're really after is 'strands' not threads; synchronous execution like on a GPU.


no, no, I mean 4 thread contexts as in POWER7 or Larrabee


>I believe this is exactly what can be achieved by executing 1024-bit instructions on 256-bit execution units. It offers similar latency hiding advantages,

not at all; in case of a cache miss the microcoded execution of the 4 sub-parts will be stalled, unlike with SMT where at least one thread among four will do useful work most of the time

in my experience (i.e. the graphics workloads I'm accustomed to; it may not apply to others) it's easier to achieve good scalability with more cores than with wider vectors; a reasonably well optimized renderer will avoid synchronization between threads within the computation of a frame. Also, stacked DRAM is around the corner, so the bandwidth/latency gap will widen shortly, maybe by more than 2x, and having more thread contexts will be welcome just to keep this 30% speedup

0 Kudos
capens__nicolas
New Contributor I
1,078 Views
I know you meant 4 thread contexts, but I don't think it offers the best cost/gain balance.

With my AVX-1024 proposal, execution won't easily stall because of a cache miss. Independent instructions take four times longer to execute (but also perform four times more work), so by the time only dependent instructions are left, the cache miss has very likely been resolved. So it really does help hide miss latency. And in the case of RAM accesses, there's still 2-way SMT to cover for that. In total this solution is twice as effective at hiding memory access latency as 4-way SMT alone, and then there's the thread synchronization and power consumption advantage.

Indeed it's currently hard to achieve good scalability with wider vectors, but that's exactly because of the lack of gather/scatter!
0 Kudos
Thomas_W_Intel
Employee
1,078 Views

The latest Intel microarchitecture, code-named "Sandy Bridge", comes with a decoded instruction cache. If instructions are executed from this cache, the power consumption of decoding them is avoided altogether.

0 Kudos
capens__nicolas
New Contributor I
1,078 Views

Quoting Thomas_W_Intel

The latest Intel microarchitecture, code-named "Sandy Bridge", comes with a decoded instruction cache. If instructions are executed from this cache, the power consumption of decoding them is avoided altogether.


I know. And that's obviously a nice power-saving feature, but the CPU's peak performance/Watt is still lower than that of the IGP. However, the IGP's absolute performance is lower, and it's not useful for anything other than legacy graphics (e.g. I can't imagine it efficiently performing physics calculations while also rendering graphics, with all of the CPU's computing power left unused). The CPU will pull ahead even further when support for FMA instructions is added! To make the IGP catch up with that and make it more flexible for GPGPU tasks, it would have to become a lot bigger and a lot more complex. But that seems pointless to me given that we've already got all this flexible processing power in the CPU part.

So it seems more worthwhile to me to think of how to (further) lower the CPU's power consumption instead. As far as I'm aware, out-of-order instruction scheduling is still responsible for much of the power consumption. Since getting rid of out-of-order execution itself is obviously not an option, the amount of arithmetic work per instruction should be increased instead. Executing 1024-bit AVX operations on the existing execution units over multiple cycles seems to me like the perfect way to achieve that.

Gather/scatter support would also massively improve performance/Watt. Currently it takes 18 extract/insert instructions to perform a 256-bit gather of 32-bit elements. Each of them occupies an ALU pipeline and moves around 128 bits of data. In total that's a lot of data movement and thus a lot of heat. Not to mention each of these 18 instructions is scheduled separately and there's a long (false) dependency chain. Performing gather/scatter in the load/store units instead would free up the ALU pipelines, drastically reduce the data movement, and turn it into a single instruction. Even a relatively straightforward implementation where a gather operation is executed as two sets of four load operations would increase the peak throughput to one gather instruction every four cycles. The combination of lower power consumption and higher performance should make it pretty irresistible.
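For reference, here's roughly what that extract/insert dance looks like with today's instructions (a minimal sketch, assuming AVX1-class hardware and 32-bit indices already split into two 128-bit halves; gather_ps_emulated is just an illustrative name):

#include <immintrin.h>

/* Emulated 8-wide 32-bit gather on AVX1-class hardware: extract each index,
   do a scalar load, and re-insert the element -- the extract/insert sequence
   described above, occupying the shuffle port for every insert. */
static __m256 gather_ps_emulated(const float *base, __m128i idx_lo, __m128i idx_hi)
{
    __m128 lo = _mm_load_ss(base + _mm_extract_epi32(idx_lo, 0));
    lo = _mm_insert_ps(lo, _mm_load_ss(base + _mm_extract_epi32(idx_lo, 1)), 0x10);
    lo = _mm_insert_ps(lo, _mm_load_ss(base + _mm_extract_epi32(idx_lo, 2)), 0x20);
    lo = _mm_insert_ps(lo, _mm_load_ss(base + _mm_extract_epi32(idx_lo, 3)), 0x30);

    __m128 hi = _mm_load_ss(base + _mm_extract_epi32(idx_hi, 0));
    hi = _mm_insert_ps(hi, _mm_load_ss(base + _mm_extract_epi32(idx_hi, 1)), 0x10);
    hi = _mm_insert_ps(hi, _mm_load_ss(base + _mm_extract_epi32(idx_hi, 2)), 0x20);
    hi = _mm_insert_ps(hi, _mm_load_ss(base + _mm_extract_epi32(idx_hi, 3)), 0x30);

    return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);
}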

0 Kudos
levicki
Valued Contributor I
1,078 Views
Well, gather and scatter instructions are something I have been arguing for ever since the Pentium 4 days.

The CPU still has a long road ahead to reach some of the performance points which are commonplace for today's GPUs:

1. Memory bandwidth (today's high-end video card has >1GB RAM with 192GB/sec bandwidth)
2. High thread count
3. Dedicated tessellation, texturing and video decoding/processing hardware

Etc, etc. On the other hand, look how fast GPUs are getting CPU-related functionality:
http://www.nvidia.com/docs/IO/100940/GeForce_GTX_580_Datasheet.pdf

Did you ever think that C++ on a GPU will be possible?

Last year the CPU's main selling points for HPC were double precision and 64-bit memory addressing. How about now, when the GPU has both, and when the GPU release cycle is every 6 months while new CPUs take 2 years to reach the market?

Not to mention that to get a better GPU you just buy another video card, while with a CPU you need to change everything because CPU designers don't care about hardware compatibility; and yet they bring the smallest increases in performance and functionality, while almost every new GPU is a revolution in itself.
0 Kudos
bronxzv
New Contributor II
1,078 Views
>Each of them occupies ALU pipelines and moves around 128-bit of data

Nope, they are mostly 32-bit moves from/to the 256-bit PRF

>each of these 18 instructions is scheduled separately

which is great since you can interleave them with instructions from the other thread; it's faster overall in case of cache misses than a fat serializing monolithic instruction
0 Kudos
capens__nicolas
New Contributor I
1,078 Views
Quoting bronxzv
>Each of them occupies ALU pipelines and moves around 128-bit of data

Nope, they are mostly 32-bit moves from/to the 256-bit PRF

>each of these 18 instructions is scheduled separately

which is great since you can interleave them with instructions from the other thread; it's faster overall in case of cache misses than a fat serializing monolithic instruction

extractps is a port 5 uop, not coincidentally the same port used for all types of shuffle operations. 128 bits of data go in, and 32 bits get extracted. insertps is also a port 5 uop, and if the inserted data comes out of memory (like for gather) there's an extra port 2/3 uop for the load. 128 bits + 32 bits go in, and 128 bits come out.

The reason it's hauling around all this data is that the result has to be carried back to the top of the ALU by the forwarding network in case the next instruction needs it as input. The PRF is hardly involved as long as you're executing dependent instructions. Since the CPU doesn't know you're planning to overwrite each element, it's needlessly pushing around lots of data.

There's absolutely nothing great about scheduling 18 instructions which aren't even intended to be dependent. Gather could simply be a pair of port 2/3 uops instead. That doesn't mean it's a fat monolithic instruction. Each of the four loads for each 128-bit half can finish in any order, it doesn't occupy port 5, your thread can advance faster, and there's less of a chance that the second thread also runs out of work and the whole core stalls.

So really, there's nothing but advantages to having hardware support for gather/scatter.
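As a point of comparison (added in hindsight; no such instruction existed when this was posted), the single-instruction form being argued for here is essentially what AVX2 later provided as vgatherdps:

#include <immintrin.h>

/* AVX2's vgatherdps: one instruction gathers eight 32-bit floats through a
   vector of indices, with no per-element shuffle-port traffic (requires
   AVX2-capable hardware; shown only to illustrate the proposed semantics). */
static inline __m256 gather8(const float *base, __m256i idx)
{
    return _mm256_i32gather_ps(base, idx, 4);   /* scale = 4 bytes per float */
}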

0 Kudos
sirrida
Beginner
1,078 Views
It would be fine if at least the gather command were there; in my experience the scatter command is not used as often.

Programming such endless chains of insertxx commands is ugly, stupid, slow, space-inefficient and really should be unnecessary. All these commands must be decoded and executed, and needlessly fill the memory and uop caches. A gather command could (for a first implementation) at least generate the flood of uops by itself, and the out-of-order mechanism would take care of proper interleaving with other commands that can be executed in parallel (superscalar). I know that such a command will have a *long* latency.

...and things get even worse if one wants to gather small entities such as bytes. I have often missed commands such as "rep xlat" ("jecxz end; redo: lodsb; xlat; stosb; loop redo; end:" with al preserved). If gather was there, it could be simulated with some code around "gather.byte xmm1,[ea]" where the bytes of xmm1 hold unsigned offsets to ea (effective memory address).
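For the small-table case there is at least one vector shortcut today (a sketch, assuming SSSE3 and a table of at most 16 entries; xlat16 is just an illustrative name): pshufb acts as a 16-entry byte gather, a vector cousin of xlat.

#include <tmmintrin.h>   /* SSSE3 */

/* Translate 16 bytes through a 16-entry table in one instruction.
   Each result byte i becomes table[indices[i] & 0x0F]; an index with its
   high bit set zeroes that lane instead. Larger tables still need the
   scalar or extract/insert fallback discussed above. */
static inline __m128i xlat16(__m128i table, __m128i indices)
{
    return _mm_shuffle_epi8(table, indices);
}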

By the way: I've never understood why e.g. the handy commands "rep stosb", "rep movsb", "jecxz", "loop", "enter" (level 0) and "leave" can be outperformed by any replacement stuff. Modern CPUs ought to be smart enough to execute such dedicated commands much faster. It's a shame that they don't!
0 Kudos
capens__nicolas
New Contributor I
981 Views
Igor, while I share your love for gather/scatter, I don't think your other expectations are very realistic.

1. High memory bandwidth at low latency is very expensive. GPUs owe their high bandwidth to GDDR memory with high latency, but that would be unacceptable for a CPU since it doesn't tolerate high latency. Lots of software relies on low RAM latency, so you don't want to compromise its performance. Besides, the CPU greatly compensates for lower RAM bandwidth by having large caches. Future denser cache technology like T-RAM could make RAM bandwidth even less of an issue.
2. Having lots of threads is not a good thing. It comes with extra synchronization overhead, and some software just doesn't scale well to a high thread count. If you can achieve the same peak throughput with fewer threads, that's always a good thing in practice. Besides, GPUs do not have a high thread count. The terminology is confusing, but truly independent instruction streams are called kernels on the GPU, and only very recently have they managed to get a few kernels to execute concurrently. What GPU manufacturers like to call threads are really dependent strands within the same kernel which run on tightly coupled SIMD lanes. In this sense a quad-core Hyper-Threaded CPU with two 256-bit AVX execution units processing 1024-bit vectors could be considered to have 512 of these "threads" in flight (8 hardware threads × 2 AVX units × 32 lanes per 1024-bit vector), while only really running 8 independent "kernels". Anyhow, again, the CPU isn't as far behind as it might appear.
3. You don't need dedicated tessellation or texturing units to outperform the IGP. A homogeneous architecture has more area available, and with dynamic code generation you only pay for the features that are actually in use. Also, tessellation and even texturing are very likely to become fully programmable at some point in the future anyhow. Furthermore, dedicated hardware is only useful for legacy graphics. For anything else it would be idle silicon (read: a waste). Gather/scatter and AVX-1024 would also benefit lots of other high-throughput applications.

C++ on the GPU is still a bit of a gimmick really. Each thread (strand) only has access to a 1 kB stack, so you can forget about deep function calls and long recursion. It even starts spilling data to slow memory long before this limit is reached. The performance per strand is also really low, so you'd better have thousands of them to compensate for the latency. And last but not least, you can't launch kernels from within kernels. So basically you shouldn't expect a quick port of your existing C++ code to run efficiently (or even run at all).

That said, the clock is indeed ticking since GPU manufacturers also see the benefits of evolving their architecture into something more CPU-like. Fortunately Intel merely has to add FMA, gather/scatter and power saving features like executing AVX-1024 on execution units of lesser width to regain dominance in the HPC market (and far beyond).
0 Kudos
Reply