No, I haven't missed the MIC announcement. It looks like Knight's Corner will be quite impressive and could be very successful in the HPC market (NVIDIA's Tesla chips are selling like mad for supercomputers, despite the lack of a fully generic/flexible programming model and relatively low effective performance at complex tasks).
That said, the MIC product mostly seems intended to recoup the investment in Larrabee as a discrete GPU. Frankly, it's too soon for high-end graphics to become fully programmable. There's potential to do things more efficiently, but the software market isn't going to radically change course overnight, and Larrabee can't match the performance of the competition's cards, which have been fully dedicated to Direct3D/OpenGL rendering for ages. Larrabee could have been great as a game console chip, but it looks like Intel lost that opportunity as well.
Low-end graphics doesn't have that disadvantage though. The competition isn't nearly as fierce, and a CPU with high throughput performance would be welcome in a very wide variety of markets. It can deliver adequate low-end graphics with limitless features, but it would also be awesome at other tasks. By increasing the number of cores they can even conquer the mid-range graphics market and eventually the high-end market. This bottom-up approach seems a lot more realistic and low-risk to me than Larrabee's unsuccessful top-down attack on the market. Imagine hardcore gamers buying 256-core CPUs by the end of this decade, and other markets having other core counts to satisfy their computing needs. I don't think heterogeneous architectures are the future; developers detest them (i.e. are limited by their features) and a homogeneous architecture which combines the best of both worlds is within reach.
Intel could leverage what it learned from Larrabee to obtain graphics market dominance through its CPU product line. Executing 1024-bit instructions on 256-bit execution units can help keep the power consumption in check (and help hide memory latency), while gather/scatter is essential for the variety of memory access patterns and data ordering operations needed by graphics and many other computing tasks. Both of these features seem perfectly feasible, without hurting performance for legacy workloads.
pure software texture sampling is already clearly feasible with today's AVX, look at this example
this makes software texture sampling perfectly feasible
There's only two low-poly objects
sure indeed, that's why it's an interesting example when talking about texture sampling: a significant part of the CPU budget goes to texture operations rather than scene graph traversal, geometry, etc. btw the 3 textures (diffuse map, bump gradient map and reflection map) are fully independent (3 distinct mipmap pyramids)
with high poly count model such as :
less than 5% of the time is spent for textures
Which is exactly why it's not good enough. Real 3D applications are far more complex so you need the extra efficiency gather/scatter would bring (not just for texturing but other tasks as well), and you also need the lower power consumption of executing 1024-bit instructions on 256-bit execution units (for integer operations even 128-bit would probably work well).
I can't see why you'd argue against that. I hate to tell you but WebGL and Flash Molehill (featuring SwiftShader support) could quickly make Kribi obsolete. The only way the software renderer can still prove to be useful is if it's actually consistently faster and more flexible than an IGP. You absolutely need gather/scatter to achieve that.
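To make the texturing point concrete, here is a rough C sketch (my own illustration, not code from Kribi or SwiftShader) of bilinear sampling for a batch of 8 pixels. The `texture` struct, `sample_bilinear8` and the single-channel, single-mip-level layout are all hypothetical simplifications. The thing to notice is that every pixel computes its own texel addresses, so a vectorized version needs four gathers where this scalar loop does four indexed loads.

```c
#include <assert.h>

#define LANES 8 /* 8 pixels per batch, i.e. one 256-bit vector of floats */

/* Hypothetical single-channel texture, one mip level only. */
typedef struct {
    const float *texels;
    int width;
    int height;
} texture;

static int clampi(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Bilinear sample for 8 pixels; u and v are assumed to be in [0, 1]. */
static void sample_bilinear8(float out[LANES], const texture *t,
                             const float u[LANES], const float v[LANES])
{
    assert(t->width > 0 && t->height > 0);
    for (int i = 0; i < LANES; ++i) {
        float x = u[i] * (float)(t->width - 1);
        float y = v[i] * (float)(t->height - 1);
        /* x and y are non-negative here, so truncation acts as floor */
        int x0 = clampi((int)x, 0, t->width - 1);
        int y0 = clampi((int)y, 0, t->height - 1);
        int x1 = clampi(x0 + 1, 0, t->width - 1);
        int y1 = clampi(y0 + 1, 0, t->height - 1);
        float fx = x - (float)x0;
        float fy = y - (float)y0;
        /* Four data-dependent loads per pixel: each would be one
           gather across the 8 lanes in a vectorized renderer. */
        float t00 = t->texels[y0 * t->width + x0];
        float t10 = t->texels[y0 * t->width + x1];
        float t01 = t->texels[y1 * t->width + x0];
        float t11 = t->texels[y1 * t->width + x1];
        float top = t00 + fx * (t10 - t00);
        float bot = t01 + fx * (t11 - t01);
        out[i] = top + fy * (bot - top);
    }
}
```

The arithmetic (address computation and the three lerps) vectorizes trivially; only the four texel fetches need per-lane addressing.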
Unless anyone sees any reasons why such features are not likely to be feasible?
Correct me if I'm wrong, but I think what you're really after is 'strands' not threads; synchronous execution like on a GPU.
no, no, I mean 4 thread contexts as in Power 7 or Larrabee
>I believe this is exactly what can be achieved by executing 1024-bit instructions on 256-bit execution units. It offers similar latency hiding advantages,
not at all, in case of a cache miss the microcoded execution of the 4 sub-parts will be stalled, unlike with SMT where at least one thread among four will do useful work most of the time
in my experience (i.e. the graphics workloads I'm accustomed to, it may not apply to others) it's easier to achieve good scalability with more cores than with wider vectors; a reasonably well optimized renderer will avoid synchronization between threads within the computation of a frame. Also stacked DRAM is around the corner, so the bandwidth/latency gap will be widened shortly, maybe more than 2x, and having more thread contexts will be welcome just to keep this 30% speedup
The latest Intel microarchitecture code-named "Sandy Bridge" comes with a decoded instruction cache. When instructions are executed from that cache, the power consumption of decoding in particular is avoided altogether.
I know. And that's obviously a nice power saving feature, but the CPU's peak performance/Watt is still lower than that of the IGP. However, the IGP's absolute performance is lower, and it's not useful for anything other than legacy graphics (e.g. I can't imagine efficiently performing physics calculations on it while also rendering graphics, while all of the CPU's computing power is left unused). The CPU will even pull ahead further when support for FMA instructions is added! To make the IGP catch up with that and make it more flexible for GPGPU tasks, it would have to become a lot bigger and a lot more complex. But that seems pointless to me given that we've already got all this flexible processing power in the CPU part.
So it seems more worthwhile to me to think of how to (further) lower the CPU's power consumption instead. As far as I'm aware out-of-order instruction scheduling is still responsible for much of the power consumption. Since getting rid of out-of-order execution itself is obviously not an option, the amount of arithmetic work per instruction should be increased instead. Executing 1024-bit AVX operations on the existing execution units in multiple cycles seems to me like the perfect way to achieve that.
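As a toy illustration of that idea, here is a minimal C model; it is a sketch under my own assumptions, not a description of how any real CPU would implement it. One 1024-bit add over 32 floats is issued once and then sequenced over a 256-bit unit in four passes. The names `add_1024_on_256`, `WIDE_LANES` and `UNIT_LANES` are made up for illustration.

```c
#include <assert.h>

#define WIDE_LANES 32 /* 1024 bits of 32-bit floats */
#define UNIT_LANES 8  /* width of one 256-bit execution unit */

/* Models a single "1024-bit" add: scheduled once, then executed in
   several passes over a 256-bit unit. Returns the number of passes,
   which stands in for cycles in this toy model. */
static int add_1024_on_256(float dst[WIDE_LANES],
                           const float a[WIDE_LANES],
                           const float b[WIDE_LANES])
{
    assert(WIDE_LANES % UNIT_LANES == 0);
    int passes = 0;
    for (int base = 0; base < WIDE_LANES; base += UNIT_LANES) {
        /* one pass: the 256-bit unit handles 8 lanes */
        for (int lane = 0; lane < UNIT_LANES; ++lane)
            dst[base + lane] = a[base + lane] + b[base + lane];
        ++passes;
    }
    return passes;
}
```

In this model the scheduler sees one instruction for four passes of arithmetic, so the fixed per-instruction cost (fetch, rename, scheduling) is spread over four times as much work.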
Gather/scatter support would also massively improve performance/Watt. Currently it takes 18 extract/insert instructions to perform a 256-bit gather operation of 32-bit elements. Each of them occupies ALU pipelines and moves around 128 bits of data. In total that's a lot of data movement and thus lots of heat. Not to mention each of these 18 instructions is scheduled separately and there's a long (false) dependency chain. Performing gather/scatter in the load/store units instead would free up the ALU pipelines, drastically reduce the data movement, and turn it into a single instruction. Even a relatively straightforward implementation, where a gather operation is executed as two sets of four load operations, would increase the peak throughput to one gather instruction every four cycles. The combination of both lower power consumption and higher performance should make it pretty irresistible.
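A rough C sketch of the two cases, for anyone who wants to see where the numbers come from. The breakdown of the 18 instructions into 8 index extracts, 8 element inserts and 2 operations to split and recombine the 128-bit halves is my assumption about how the count adds up. The struct and function names are hypothetical, and the cycle estimate assumes two load ports that each accept one load per cycle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 8 /* 256 bits of 32-bit elements */

/* Hypothetical tally of the instructions an emulated gather issues. */
typedef struct {
    int extracts;   /* index pulled out of the index vector      */
    int inserts;    /* loaded element merged into the result     */
    int half_moves; /* split/recombine of the two 128-bit halves */
} gather_cost;

/* Emulated gather: dst[i] = table[idx[i]], one lane at a time. */
static gather_cost gather8_emulated(float dst[LANES], const float *table,
                                    size_t table_len,
                                    const int32_t idx[LANES])
{
    gather_cost cost = {0, 0, 0};
    cost.half_moves++; /* assumed: split the index vector into halves */
    for (int i = 0; i < LANES; ++i) {
        assert(idx[i] >= 0 && (size_t)idx[i] < table_len);
        int32_t index = idx[i]; /* models one extract */
        cost.extracts++;
        dst[i] = table[index];  /* models one insert fed from memory */
        cost.inserts++;
    }
    cost.half_moves++; /* assumed: recombine the result halves */
    return cost;
}

static int gather_cost_total(gather_cost c)
{
    return c.extracts + c.inserts + c.half_moves;
}

/* Peak cycles for a gather done purely as loads in the load units. */
static int gather_hw_cycles(int loads, int load_ports)
{
    assert(load_ports > 0);
    return (loads + load_ports - 1) / load_ports; /* round up */
}
```

With 8 loads and 2 load ports, `gather_hw_cycles(8, 2)` gives 4, which matches the one-gather-every-four-cycles figure; `gather_cost_total` on the emulated path gives 18.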
extractps is a port 5 uop, not coincidentally the same port used for all types of shuffle operations. 128 bits of data go in, and 32 bits get extracted. insertps is also a port 5 uop, and if the inserted data comes out of memory (like for gather) there's an extra port 2/3 uop for the load. 128 + 32 bits go in, and 128 bits come out.
The reason it's hauling around all this data is because the result has to be carried back to the top of the ALU by the forwarding network in case the next instruction needs it as input. The PRF is hardly involved as long as you're executing dependent instructions. Since the CPU doesn't know you're planning on overwriting each element, it's needlessly pushing around lots of data.
There's absolutely nothing great about scheduling 18 instructions which aren't even intended to be dependent. Gather could simply be a pair of port 2/3 uops instead. That doesn't mean it's a fat monolithic instruction. Each of the four loads for every 128-bit half can finish in any order, it doesn't occupy port 5, your thread can advance faster, and there's less of a chance that the second thread also runs out of work and the whole core stalls.
So really, there's nothing but advantages to having hardware support for gather/scatter.