No I haven't missed the MIC announcement. It looks like Knight's Corner will be quite impressive and could be very successful in the HPC market (NVIDIA's Tesla chips are selling like mad for supercomputers, despite the lack of a fully generic/flexible programming model and relatively low effective performance at complex tasks).
That said, the MIC product mostly seems meant to recoup the investment in Larrabee as a discrete GPU. Frankly it's too soon for high-end graphics to become fully programmable. There's potential to do things more efficiently but the software market isn't going to radically change course overnight, and Larrabee can't match the performance of the competition's cards which have been fully dedicated to Direct3D/OpenGL rendering for ages. Larrabee could have been great as a game console chip, but it looks like Intel lost that opportunity as well.
Low-end graphics doesn't have that disadvantage though. The competition isn't nearly as fierce, and a CPU with high throughput performance would be welcome for a very wide variety of markets. It can deliver adequate low-end graphics with limitless features, but it would also be awesome at other tasks. By increasing the number of cores they can even conquer the mid-end graphics market and eventually the high-end market. This bottom-up approach seems a lot more realistic and low-risk to me than Larrabee's unsuccessful top-down attack on the market. Imagine hardcore gamers buying 256-core CPUs by the end of this decade, and other markets having other core counts to satisfy their computing needs. I don't think heterogeneous architectures are the future; developers detest them (i.e. are limited by their features) and a homogeneous architecture which combines the best of both worlds is within reach.
Intel could leverage what it learned from Larrabee to obtain graphics market dominance through its CPU product line. Executing 1024-bit instructions on 256-bit execution units can help keep the power consumption in check (and help hide memory latency), while gather/scatter is essential for the variety of memory access patterns and data ordering operations needed by graphics and many other computing tasks. Both of these features seem perfectly feasible, without hurting performance for legacy workloads.
pure software texture sampling is already clearly feasible with today's AVX, look at this example
this makes software texture sampling perfectly feasible
the dummies have a diffuse map, a bump map and a reflection map, and the scene is rendered (with per-sample lighting and sample-exact shadow casting) at 40+ fps @ 1920 x 1080 on a 2600K at stock frequency
There's only two low-poly objects
sure indeed, that's why it's an interesting example when talking about texture sampling since a significant CPU budget is for texture operations not for scene graph traversal, geometry, etc. btw the 3 textures (diffuse map, bump gradient map and reflection map) are fully independent (3 distinct mipmap pyramids)
with high poly count model such as :
less than 5% of the time is spent for textures
Which is exactly why it's not good enough. Real 3D applications are far more complex so you need the extra efficiency gather/scatter would bring (not just for texturing but other tasks as well), and you also need the lower power consumption of executing 1024-bit instructions on 256-bit execution units (for integer operations even 128-bit would probably work well).
I can't see why you'd argue against that. I hate to tell you but WebGL and Flash Molehill (featuring SwiftShader support) could quickly make Kribi obsolete. The only way the software renderer can still prove to be useful is if it's actually consistently faster and more flexible than an IGP. You absolutely need gather/scatter to achieve that.
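To make the texturing argument concrete, here is a minimal sketch of bilinear filtering written as a scalar Python model. It is not taken from Kribi or SwiftShader; the function name `bilinear_sample`, the flat row-major single-channel texture layout and the clamp-to-edge addressing are all assumptions made for illustration. The point it shows is that every sample needs four texel loads at addresses computed from the coordinates, which is exactly the access pattern a gather instruction would serve.

```python
import math

def bilinear_sample(texture, width, height, u, v):
    """Sample a single-channel texture stored as a flat row-major list.

    u and v are normalized coordinates in [0, 1]. Addressing clamps to the
    texture edge (an assumption; real samplers offer several wrap modes).
    """
    # Map normalized coordinates to texel space, texel centers at +0.5.
    x = u * width - 0.5
    y = v * height - 0.5
    x0 = math.floor(x)
    y0 = math.floor(y)
    fx = x - x0  # horizontal blend weight
    fy = y - y0  # vertical blend weight

    def clamp(i, hi):
        return max(0, min(i, hi))

    xa, xb = clamp(x0, width - 1), clamp(x0 + 1, width - 1)
    ya, yb = clamp(y0, height - 1), clamp(y0 + 1, height - 1)

    # Four data-dependent addresses per sample: this is the gather pattern.
    addresses = [ya * width + xa, ya * width + xb,
                 yb * width + xa, yb * width + xb]
    t00, t10, t01, t11 = (texture[a] for a in addresses)

    # Blend horizontally, then vertically.
    top = t00 + (t10 - t00) * fx
    bottom = t01 + (t11 - t01) * fx
    return top + (bottom - top) * fy
```

When eight pixels are processed per 256-bit vector, each of the four fetches above turns into eight independent loads, so a single bilinear lookup per texture amounts to 32 scalar loads that currently have to be inserted into the vector one element at a time.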
Unless anyone sees any reasons why such features are not likely to be feasible?
>Unless anyone sees any reasons why such features are not likely to be feasible?
as a matter of fact neither gather/scatter nor 1024-bit vectors are announced yet for AVX so we can't plan for these
in both cases I think they are not matching well with SMT, 4-way SMT is a more likely future IMHO
1024-bit vectors are not announced yet either, but they are already part of the AVX spec! So it seems relatively simple to implement it by executing the instructions on 256-bit execution units. If it significantly improves performance/Watt, it makes sense and will be added at some point.
I also can't help but think Intel already has long-term plans with AVX. It must have cost a lot of transistors and a lot of design hours to implement it in Sandy Bridge. FMA and AVX-1024 clearly indicate they intend to invest even more time and resources in it, and that they don't think the IGP is suitable for anything beyond legacy graphics. Perhaps the plans beyond FMA haven't solidified yet, so I don't think it hurts to discuss what I think makes most sense, on an Intel forum.
Besides, your product will become obsolete unless it can outperform the IGP and offer superior features. So you'd better join the discussion on what makes sense, and evaluate the cost/gain of all the options. It would give you very valuable insights once some of these things do get announced. In fact you could lead the custom software rendering revolution by offering frameworks with unique features.
I don't think 4-way SMT is worthwhile. First of all, 2-way SMT offers about 30% speedup at best, so 4-way SMT will likely offer much less. Also, I believe the thread count should be kept as low as possible. It's not unlikely that the gains from 4-way SMT are nullified by the synchronization overhead of managing twice the number of threads. Core counts will keep increasing anyway, so there's no need to make things even harder with more threads per core.
Correct me if I'm wrong, but I think what you're really after is 'strands' not threads; synchronous execution like on a GPU.
no, no, I mean 4 thread contexts as in Power 7 or Larrabee
>I believe this is exactly what can be achieved by executing 1024-bit instructions on 256-bit execution units. It offers similar latency hiding advantages,
not at all, in case of a cache miss the microcoded execution of the 4 sub-parts will be stalled unlike with SMT where at least one thread among four will do useful work most of the time
in my experience (i.e. the graphics workloads I'm accustomed to, it may not apply to others) it's easier to achieve a good scalability with more cores than with wider vectors, a reasonably well optimized renderer will avoid synchronization between threads within the computation of a frame. Also stacked DRAM is around the corner so the bandwidth/latency gap will be widened shortly, maybe more than 2x and having more thread contexts will be welcome just to keep this 30% speedup
With my AVX-1024 proposal execution won't easily stall because of a cache miss. Independent instructions take four times longer to execute (but also perform four times more work), so by the time there are only dependent instructions left, the cache miss has very likely been resolved. So it really does help hide miss latency. And in the case of RAM accesses, there's still 2-way SMT to cover for that. In total this solution is two times more effective at hiding memory access latency than 4-way SMT alone, and then there's the thread synchronization and power consumption advantage.
Indeed it's currently hard to achieve good scalability with wider vectors, but that's exactly because of the lack of gather/scatter!
The latest Intel microarchitecture code-named "Sandy Bridge" comes with a decoded instruction cache. If the decoded instructions are executed from cache, in particular the power consumption of the decoding is avoided altogether.
I know. And that's obviously a nice power saving feature, but the CPU's peak performance/Watt is still lower than that of the IGP. However, the IGP's absolute performance is lower, and it's not useful for anything other than legacy graphics (e.g. I can't imagine efficiently performing physics calculations on it while also rendering graphics, while all of the CPU's computing power is left unused). The CPU will even pull ahead further when support for FMA instructions is added! To make the IGP catch up with that and make it more flexible for GPGPU tasks, it would have to become a lot bigger and a lot more complex. But that seems pointless to me given that we've already got all this flexible processing power in the CPU part.
So it seems more worthwhile to me to think of how to (further) lower the CPU's power consumption instead. As far as I'm aware out-of-order instruction scheduling is still responsible for much of the power consumption. Since getting rid of out-of-order execution itself is obviously not an option, the amount of arithmetic work per instruction should be increased instead. Executing 1024-bit AVX operations on the existing execution units in multiple cycles seems to me like the perfect way to achieve that.
Gather/scatter support would also massively improve performance/Watt. Currently it takes 18 extract/insert instructions to perform a 256-bit gather operation of 32-bit elements. Each of them occupies ALU pipelines and moves around 128 bits of data. In total that's a lot of data movement and thus lots of heat. Not to mention each of these 18 instructions is scheduled separately and there's a long (false) dependency chain. Performing gather/scatter in the load/store units instead would free up the ALU pipelines, drastically reduce the data movement, and turn it into a single instruction. Even a relatively straightforward implementation where a gather operation is executed as two sets of four load operations would increase the peak throughput to one gather instruction every four cycles. The combination of both lower power consumption and higher performance should make it pretty irresistible.
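The throughput figure above can be checked with a small scalar model. This is only a sketch of the semantics being asked for, not of any announced instruction: the names `gather` and `gather_cycles` are hypothetical, and the cycle estimate assumes one load uop per element, two load ports that each accept one load per cycle (the port 2/3 arrangement mentioned elsewhere in this thread), and no cache misses.

```python
from math import ceil

def gather(memory, base, indices):
    """Scalar model of a gather: one independent load per index.

    The loads have no dependency on each other, so hardware would be free
    to complete them in any order.
    """
    return [memory[base + i] for i in indices]

def gather_cycles(num_elements, load_ports):
    """Best-case cycles for one gather, assuming each element is a single
    load uop and every load port accepts one load per cycle."""
    return ceil(num_elements / load_ports)
```

For eight 32-bit elements and two load ports, `gather_cycles(8, 2)` gives four cycles, which matches the "one gather instruction every four cycles" estimate, with port 5 left untouched.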
The CPU still has a long road to reach some of the performance points which are commodity for today's GPU:
1. Memory bandwidth (today's high-end video card has >1GB RAM with 192GB/sec bandwidth)
2. High thread count
3. Dedicated tessellation, texturing and video decoding/processing hardware
Etc, etc. On the other hand, look how fast GPUs are getting CPU-related functionality:
Did you ever think that C++ on a GPU will be possible?
Last year the CPU's main selling point for HPC was double precision and 64-bit memory addressing. How about now, when the GPU has both and when the GPU release cycle is every 6 months while new CPUs take 2 years to reach the market?
Not to mention that to get a better GPU you just have to get another video card, while with a CPU you need to change everything because CPU designers don't care about hardware compatibility, and yet they bring the smallest increases in performance and functionality while almost every new GPU is a revolution in itself.
Nope, they are mostly 32-bit moves from/to the 256-bit PRF
>each of these 18 instructions is scheduled separately
which is great since you can interleave them with instructions from the other thread, it's faster overall in case of cache misses than a fat serializing monolithic instruction
extractps is a port 5 uop, not coincidentally the same port for all types of shuffle operations. 128 bits of data go in, and 32 bits get extracted. insertps is also a port 5 uop, and if the inserted data comes out of memory (like for gather) there's an extra port 2/3 uop for the load. 128 + 32 bits go in, and 128 bits come out.
The reason it's hauling around all this data is because the result has to be carried back to the top of the ALU by the forwarding network in case the next instruction needs it as input. The PRF is hardly involved as long as you're executing dependent instructions. Since the CPU doesn't know you're planning on overwriting each element, it's needlessly pushing around lots of data.
There's absolutely nothing great about scheduling 18 instructions which aren't even intended to be dependent. Gather could simply be a pair of port 2/3 uops instead. That doesn't mean it's a fat monolithic instruction. Each of the four loads for every 128-bit half can finish in any order, it doesn't occupy port 5, your thread can advance faster, and there's less of a chance that the second thread also runs out of work and the whole core stalls.
So really, there's nothing but advantages to having hardware support for gather/scatter.
Programming such endless chains of insertxx commands is ugly, stupid, slow, space-inefficient and really should be unnecessary. All these commands must be decoded and executed, and needlessly fill the memory and op caches. Gather commands could (for the first implementation) at least generate the flood of ops by themselves, and the out-of-order mechanism would take care of properly interleaving them with other commands which can be executed in parallel (superscalar). I know that such a command will have a *long* latency.
...and things get even worse if one wants to gather small entities such as bytes. I have often missed commands such as "rep xlat" ("jecxz end; redo: lodsb; xlat; stosb; loop redo; end:" with al preserved). If gather was there, it could be simulated with some code around "gather.byte xmm1,[ea]" where the bytes of xmm1 hold unsigned offsets to ea (effective memory address).
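For readers who don't think in x86 string instructions, the "rep xlat" loop quoted above is a byte-wise table translation. A minimal Python sketch of what it computes, with the hypothetical name `rep_xlat` and without modeling the "al preserved" register detail:

```python
def rep_xlat(table, src):
    """Translate every byte of src through a 256-entry lookup table.

    Mirrors the loop body: lodsb loads a byte, xlat replaces it with
    table[byte], stosb stores the result. An empty src corresponds to the
    jecxz early exit.
    """
    if len(table) != 256:
        raise ValueError("table must have exactly 256 entries")
    return bytes(table[b] for b in src)
```

Each output byte depends only on the matching input byte, so a byte-granular gather could in principle process a whole vector of lookups at once, which is the use case being described. Python's built-in `bytes.translate` performs the same operation.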
By the way: I've never understood why e.g. the handy commands "rep stosb", "rep movsb", "jecxz", "loop", "enter" (level 0) and "leave" can be outperformed by any replacement stuff. Modern CPUs ought to be smart enough to execute such dedicated commands much faster. It's a shame that they don't!
C++ on the GPU is still a bit of a gimmick really. Each thread (strand) only has access to a 1 kB stack. So you can forget about deep function calls and long recursion. It even starts spilling data to slow memory long before this limit is reached. The performance per strand is also really low, so you'd better have thousands of them to compensate for the latency. And last but not least you can't launch kernels from within kernels. So basically you shouldn't expect a quick port of your existing C++ code to run efficiently (or even run at all).
That said, the clock is indeed ticking since GPU manufacturers also see the benefits of evolving their architecture into something more CPU-like. Fortunately Intel merely has to add FMA, gather/scatter and power saving features like executing AVX-1024 on execution units of lesser width to regain dominance in the HPC market (and far beyond).