Intel® ISA Extensions

Converging AVX and LRBni

capens__nicolas
New Contributor I
2,085 Views
Hi all,

With Larrabee being canned as a discrete GPU, I was wondering whether it makes sense to actually let the CPU take the role of GPU and high-throughput computing device.

Obviously power consumption is a big issue, but since AVX is specified to process up to 1024-bit registers, it could execute such wide operations using SNB's existing 256-bit execution units in four cycles (throughput). Since it's one instruction this takes a lot less power than four 256-bit instructions. Basically you get the benefit of in-order execution within an out-of-order architecture.

The only other thing missing before the IGP could be dropped (and replaced with generic cores) is support for gather/scatter instructions. Since SNB already has two 128-bit load units, it seems possible to me to achieve a throughput of one 256-bit gather every cycle, or 1024 bits every four cycles. In my experience (as lead SwiftShader developer) this makes software texture sampling perfectly feasible, while also offering massive benefits in all other high-throughput tasks.

Basically you'd get Larrabee in a CPU socket, without compromising any single-threaded or scalar performance!

Thoughts?

Nicolas
0 Kudos
62 Replies
bronxzv
New Contributor II
626 Views
1>A gather/scatter based solution would perform parallel table lookups for each of the coefficients.

yes, for a 3rd order polynomial and FP32 coefficients it will be 4 distinct gathers of unrelated 32-bit values


2>could also look up all coefficients at once and transpose the elements

yes, this is my favorite solution; for a 3rd order polynomial you will move only aligned 128-bit values from the LUT


IMHO (2) is faster in all cases since a hardware implementation will typically miss the 128-bit moves optimization when processing the series of gathers in (1). Also, for (1) each cache line will be visited 4x more times than with (2).

more generally, the access pattern to the cache/memory hierarchy has more impact on performance and power consumption than the particular implementation of gather; software-synthesized gather can *optimize across multiple gathers*, an opportunity that a hardware implementation will typically lack
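
For concreteness, here is a minimal sketch of approach (2) with SSE intrinsics; the LUT layout, struct and function names are assumptions for illustration, not code from this thread:

#include <xmmintrin.h>  // SSE

// Hypothetical LUT: each entry holds the 4 coefficients c0..c3 of a cubic,
// 16-byte aligned so a single 128-bit load fetches a whole entry.
struct alignas(16) CubicEntry { float c[4]; };

// Evaluate 4 cubics in parallel: one aligned 128-bit load per lane,
// then transpose the 4x4 block so c0..c3 end up in separate vectors (SoA).
static inline __m128 eval4(const CubicEntry* lut, const int idx[4], __m128 x)
{
    __m128 r0 = _mm_load_ps(lut[idx[0]].c);   // c0 c1 c2 c3 of lane 0
    __m128 r1 = _mm_load_ps(lut[idx[1]].c);
    __m128 r2 = _mm_load_ps(lut[idx[2]].c);
    __m128 r3 = _mm_load_ps(lut[idx[3]].c);
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);        // r0 = all c0, r1 = all c1, ...

    // Horner: ((c3*x + c2)*x + c1)*x + c0
    __m128 y = _mm_add_ps(_mm_mul_ps(r3, x), r2);
    y = _mm_add_ps(_mm_mul_ps(y, x), r1);
    return _mm_add_ps(_mm_mul_ps(y, x), r0);
}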
0 Kudos
capens__nicolas
New Contributor I
626 Views
Sandy Bridge has HT. Run some performance test code that uses two threads, those of HT siblings. Have one thread be the master thread and the other a slave thread. Have both threads set up MONITOR to monitor a list of addresses used for a mailbox. The master thread, when it knows it will require a gather some time in the future (a fairly long time), writes the pointer to the gather list into the mailbox; the slave thread MWAITing on the mailbox gets the gather list, gathers the data and writes it as a 256-bit item into the mailbox gather results. With sufficient advance notice this could complete prior to the master thread needing the data. The master thread should be able to read the gathered data in one gulp if ready (or issue MWAIT till ready, or go get the data itself in the event the slave thread got preempted). The slave thread could serve as a data pipeline prefetcher and post-storer. This will "waste" one thread, one that you do not want using the AVX anyway.

Not all Sandy Bridge processors have Hyper-Threading.

Anyway, assuming all future processors will feature HT, I still don't think your suggestion would work well in practice. First of all, passing data between the threads has a considerable bandwidth and synchronization overhead. Also, very few algorithms can deal with such high latency, or they would require restructuring which counteracts any potential advantage. But most importantly, I don't think performing data gathering in another thread on the same core really helps increase the total instruction rate. You might as well do it in the same thread: Sandy Bridge has a substantial out-of-order instruction window, so just place the extract/insert instructions early enough and let the hardware take care of the rest. Even if a miss causes a stall, you still have the second thread to issue more work.

Also in my experience you can gain about 30% performance from Hyper-Threading, even for SIMD heavy code. I seriously doubt that for practical algorithms you can gain more from splitting the work unevenly.

What we really need is higher throughput, lower latency, fewer uops, lower register pressure, lower power consumption, etc. All of these things can be achieved with a minor amount of logic. It would enable revolutionary new software, which might then make it worth to invest into an advanced implementation with even more advantages.
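
For reference, this is roughly the extract/insert style sequence under discussion, written with SSE4.1/AVX intrinsics; the function name and the explicit index array are illustrative assumptions, not an established API:

#include <immintrin.h>  // AVX (includes SSE4.1)

// Software "gather" of 8 floats; the out-of-order core can overlap the
// 8 scalar loads if the sequence is placed early enough.
static inline __m256 gather8_ps(const float* base, const int idx[8])
{
    __m128 lo = _mm_load_ss(base + idx[0]);
    lo = _mm_insert_ps(lo, _mm_load_ss(base + idx[1]), 0x10);
    lo = _mm_insert_ps(lo, _mm_load_ss(base + idx[2]), 0x20);
    lo = _mm_insert_ps(lo, _mm_load_ss(base + idx[3]), 0x30);
    __m128 hi = _mm_load_ss(base + idx[4]);
    hi = _mm_insert_ps(hi, _mm_load_ss(base + idx[5]), 0x10);
    hi = _mm_insert_ps(hi, _mm_load_ss(base + idx[6]), 0x20);
    hi = _mm_insert_ps(hi, _mm_load_ss(base + idx[7]), 0x30);
    return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);
}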

0 Kudos
sirrida
Beginner
626 Views
> A gather/scatter ... but overall it doesn't seem worthwhile to me.
> Assembly cleanliness certainly isn't much of a convincing argument.

Would you think the same for a future gather command like "gather.byte zmm0,[eax]" reading 64 bytes, which will expand into at least 128 commands costing about 800 bytes of code if fully expanded (emulated)?
In my opinion even the effort of 18 commands to emulate 1 is far too much.
And as I must debug my hand-written code I appreciate any cleanliness as long as it does not cost too much performance.

We process e.g. a lot of ISO A0 sized images at 600 dpi, true color.
Most operations are at byte or word level. I miss the byte/integer operations for AVX very much!
A gather command for bytes and/or words has been sorely missing from SSE/AVX for ages.
See also my comment for the suggested "rep xlat".
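
As a rough point of comparison, here is a minimal sketch of what emulating a 16-byte table lookup looks like today with SSE2; the table layout and function name are illustrative assumptions:

#include <emmintrin.h>  // SSE2
#include <stdint.h>

// Emulated byte "gather": translate 16 bytes through a 256-entry table,
// i.e. the per-byte lookup described above, done with scalar loads.
static inline __m128i xlat16(const uint8_t table[256], __m128i indices)
{
    uint8_t idx[16], out[16];
    _mm_storeu_si128((__m128i*)idx, indices);   // spill indices to memory
    for (int i = 0; i < 16; ++i)                // 16 scalar table lookups
        out[i] = table[idx[i]];
    return _mm_loadu_si128((const __m128i*)out);
}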
0 Kudos
bronxzv
New Contributor II
626 Views
>Would you think the same for a future gather command like "gather.byte zmm0,[eax]" reading 64 bytes, which will expand into at least 128 commands costing about 800 bytes of code if fully expanded (emulated)?

I don't think it's really possible to "expand" it as you think, i.e. to emit more than 100 uops; it would take ages each time the x86 instruction is decoded. For example, the Sandy Bridge complex decoder can emit at most 4 uops:

http://www.realworldtech.com/page.cfm?ArticleID=RWT091810191937&p=4

So it will most probably run from microcode, like the x87 transcendental instructions; the latency/throughput figures will be monstrous, beating FCOS, FPTAN and such. In case of many cache misses (there can be 32 misses with byte indices and an 8x scaling, such as the maximum scaling for LRBni VGATHERD) this single instruction will completely stall a core (L2 misses) or choke the whole CPU (LLC misses/TLB misses)
0 Kudos
capens__nicolas
New Contributor I
626 Views
Quoting sirrida
> A gather/scatter ... but overall it doesn't seem worthwhile to me.
> Assembly cleanliness certainly isn't much of a convincing argument.

Would you think the same for a future gather command like "gather.byte zmm0,[eax]" reading 64 bytes, which will expand into at least 128 commands costing about 800 bytes of code if fully expanded (emulated)?
In my opinion even the effort of 18 commands to emulate 1 is far too much.
And as I must debug my hand-written code I appreciate any cleanliness as long as it does not cost too much performance.

Yes, it can be annoying to debug hand-tuned code, but you have to weigh that against the benefits, and if necessary develop tools to assist you (ranging from a simple macro to a full-blown compiler). Try to look at this from the perspective of Intel and its customers. What's the value of adding such a microcoded instruction to the average person buying a computer? Also, where does this stop? Intel won't just add any instruction an assembly programmer might find convenient. So seriously, it absolutely has to add sizable benefit before it will be taken into consideration.

An actual gather instruction which is faster than the sequential code sequence is the only thing I can realistically imagine Intel to find valuable.

By the way, note the comment for Figure 12 in this article: A First look at the Larrabee New Instructions (LRBni). It appears to imply that Intel thinks it's feasible to have a single non-microcoded gather instruction for collecting 16 dwords. So I'm fairly confident that getting Sandy Bridge's pair of load units to each manage 4 loads is well within reach.

0 Kudos
capens__nicolas
New Contributor I
626 Views
Quoting bronxzv
>Would you think the same for a future gather command like "gather.byte zmm0,[eax]" reading 64 bytes, which will expand into at least 128 commands costing about 800 bytes of code if fully expanded (emulated)?

I don't think it's really possible to "expand" it as you think, i.e. to emit more than 100 uops; it would take ages each time the x86 instruction is decoded. For example, the Sandy Bridge complex decoder can emit at most 4 uops:

http://www.realworldtech.com/page.cfm?ArticleID=RWT091810191937&p=4

So it will most probably run from microcode, like the x87 transcendental instructions; the latency/throughput figures will be monstrous, beating FCOS, FPTAN and such. In case of many cache misses (there can be 64 misses...) this single instruction will completely stall a core (L2 misses) or choke the whole CPU (LLC misses/TLB misses)

There can't be that many misses for byte translate. The offsets can only access 4 cache lines. The bigger problem, again, is extracting each of the indices and inserting each of the loaded values. The only way to prevent that from taking extra ALU cycles and constantly moving the whole vector is with an actual gather instruction executed by the LSU.

In my opinion byte gather is less valuable than dword gather though, and it could impede an advanced implementation, since gathering up to 16 bytes from a single cache line is much harder than gathering 4 dwords. A mixed implementation might work, but personally I'd already be ecstatic if the first version just got us a dword gather. Byte (and word) gather have some value too but can also be achieved with a few dword gathers. Especially with the advanced implementation it would still be blazing fast (since cache misses should be rare).
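
A minimal sketch of the "byte gather via dword gather" idea, written with the (then future) AVX2 gather intrinsic; names and layout are assumptions for illustration, and note the possible over-read of up to 3 bytes past the end of the table:

#include <immintrin.h>  // AVX2
#include <stdint.h>

// Synthesize an 8-byte lookup from one dword gather: fetch the 32-bit words
// that contain table[idx[i]], then keep only the wanted low byte of each.
static inline __m256i gather8_bytes_as_dwords(const uint8_t* table, __m256i idx)
{
    // Unaligned dword reads starting at table + idx[i] (scale = 1).
    // Caution: this reads up to 3 bytes past table[255] for the last entries.
    __m256i d = _mm256_i32gather_epi32((const int*)table, idx, 1);
    // Keep only the low byte of each loaded dword (little endian).
    return _mm256_and_si256(d, _mm256_set1_epi32(0xFF));
}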
0 Kudos
bronxzv
New Contributor II
626 Views
sorry but it looks like you answered an older version of my post (see my remark about the 8x scaling that I have added after posting the 1st version)


>There can't be that many misses for byte translate. The offsets can only access 4 cache lines.

nope, with an 8x scaling (the max for VGATHERD) the address range will be 256*8 = 2048 bytes, or 32 cache lines (assuming 64B lines)

0 Kudos
bronxzv
New Contributor II
626 Views
>note the comment for Figure 12 in this article:

"
Figure 12:
vgatherd v1 {k1}, [rbx + v2*4]. This is a simplified representation of what is currently a hardware-assisted multi-instruction sequence, but will become a single instruction in the future.
"

good catch, it's more evidence, besides Abrash's comment at the end of the paper (the one we discussed the other day *1), that Knights Ferry hasn't full hardware support for VGATHERD; the way it's phrased sounds as if the compiler expands an intrinsic to a series of instructions, much like an inlined function

hopefully the 1st MIC product will have true hardware support for the instruction, it will be fun to test
btw is there any vector computer ever produced with a generic gather, LRBni style? AFAIK the classical vector supercomputers (see also the Tarantula EV8 proposal) provide support for non-unit-stride gather but not a vector indirect addressing mode


*1 : "Finally, note that in the initial version of the hardware, a few aspects of the Larrabee architecture -- in particular vcompress, vexpand, vgather, vscatter, and transcendentals and other higher math functions -- are implemented as pseudo-instructions, using hardware-assisted instruction sequences, although this will change in the future."
0 Kudos
sirrida
Beginner
626 Views
Most of us cry for a gather operation. I definitely need it for bytes and words.

> So I'm fairly confident that getting Sandy Bridge's pair of load units
> to each manage 4 loads is well within reach.

This would also mean that my sketched worst case for ZMM (64 bytes) would cost 16 cycles (16*4 byte lookup) plus memory latency.
I think 16 cycles (+ memory latency) for such a mighty command is not too much and will outperform the replacement coding for sure. Several other commands have a higher latency. For XMM it will cost only 4 cycles plus memory latency which is quite fast even in absolute terms.
For byte lookups never more than a block of 256 adjacent bytes will be read since all the indexes/offsets are in the range 0..255.
For words there will be only 8*4 word lookups and 32 cache misses for ZMM in the index range 0..65535 (offset range 0..128 Ki). This will probably be the real worst case.

I know how to use macros and compilers, but hand-tuned code is often much faster because I can manually schedule the commands at the cost of maintainability.
0 Kudos
jimdempseyatthecove
Honored Contributor III
626 Views
>>First of all passing data between the threads has a considerable bandwidth and synchronization overhead.
You do not pass data element by element from thread to thread. Depending on the design of the HT core, HT siblings share L1 and/or L2 cache. One of the threads (the slave) takes the hit for memory latency in performing the read gathers and the write to RAM (which additionally writes to L1 and L2, assuming not a non-temporal store). This would be (should be) beneficial when the pre-gathered data to be processed tends to not reside in L1/L2 cache .AND. when the code combines this with SSE/AVX instructions on other data that is currently in L1/L2. The code path length, independent of the gather, would have to take more clocks than the gather (4, 8 or more clocks per 4/8-way gather). While this will add 2 cycles to the throughput (per cache line written by the slave and read by the master), should the code path after the gather consume more than 6/10 cycles you should see a net gain. However, I would guess that you need more than ~30% additional clock cycles (8/11). An additional concern is the overall effect of preemption by the O/S on one of the pair of threads (as this will defeat the goal of priming the cache).

Note that the use of two (or more) HT siblings in this manner effectively produces a pipeline stream, IOW code structured as a parallel pipeline.
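
A minimal sketch of such a slave-side pipeline stage in portable C++, using an atomic flag instead of MONITOR/MWAIT (which is not usable from user mode on these parts); all names and sizes are illustrative assumptions:

#include <atomic>

// Mailbox for the HT "slave gathers, master computes" scheme discussed above.
struct GatherMailbox {
    alignas(64) std::atomic<int> state{0};   // 0 = empty, 1 = request, 2 = ready
    const float* base = nullptr;
    const int*   indices = nullptr;          // 8 indices per request
    alignas(32) float gathered[8];           // 256 bits of gathered data
};

void slave_loop(GatherMailbox& mb, std::atomic<bool>& quit)
{
    while (!quit.load(std::memory_order_relaxed)) {
        if (mb.state.load(std::memory_order_acquire) != 1)
            continue;                        // nothing requested yet
        for (int i = 0; i < 8; ++i)          // the actual gather work
            mb.gathered[i] = mb.base[mb.indices[i]];
        mb.state.store(2, std::memory_order_release);  // results ready
    }
}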

Jim Dempsey
0 Kudos
levicki
Valued Contributor I
626 Views
Jim, I am just speculating, but I am afraid that if the master thread is doing some useful work, then prefetching by the slave thread at the same time would evict useful data from L1 and L2 and thus actually decrease performance.
0 Kudos
capens__nicolas
New Contributor I
626 Views
This would be (should be) beneficial when the pre-gathered data to be processed tends to not reside in L1/L2 cache...

Gather/scatter is intended to improve throughput when the data does reside in L1/L2. If it's higher up then I expect the CPU will pile up several misses and then the second thread (executing similar code) should simply take over and cover (most of) the latency.

Having lots of misses means there are plenty of cycles to do the extract/insert sequentially, so gather/scatter wouldn't help in that case. But since the L1 cache typically has a hit rate of around 95-99%, hardware gather/scatter can often speed things up considerably. Besides, if you do expect lots of misses you should use software prefetching to increase the likelihood that the data will be close when you perform the actual gather (or scatter) operation. Note that LRBni includes gather/scatter prefetch instructions. The same could be added to AVX, so you don't need 18 instructions for the prefetch and 18 instructions for the actual gather. There should even be plenty of elements located within the same cache line, so that an advanced gather implementation can collect them all at once and the throughput is further increased.

Here's a document which shows how badly CPUs need gather/scatter to maximize SIMD performance:
Debunking the 100X GPU vs. CPU myth.

With AVX and soon FMA, sequentially loading elements which reside in L1/L2 cache is a massive bottleneck. Storing the data in AoS format requires lots of shuffle instructions to allow efficient SoA-style processing. In other words, there's a growing discrepancy between arithmetic processing power and the ability to get data into the pipelines, even if it's stored in only mildly irregular locations. So I don't think doing sequential extract/insert in a second thread solves anything. We need dedicated gather/scatter support.
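
To illustrate the shuffle overhead of AoS-to-SoA conversion mentioned above, here is a minimal AVX sketch for 8 float4 records; the struct and function names are assumptions for illustration:

#include <immintrin.h>  // AVX

// Converting 8 AoS float4 records into SoA form costs 12 shuffle/permute
// operations on top of the 4 loads.
struct Vec4 { float x, y, z, w; };

static inline void aos_to_soa8(const Vec4* in,
                               __m256& X, __m256& Y, __m256& Z, __m256& W)
{
    __m256 r0 = _mm256_loadu_ps(&in[0].x);   // x0 y0 z0 w0 x1 y1 z1 w1
    __m256 r1 = _mm256_loadu_ps(&in[2].x);   // x2 ... w3
    __m256 r2 = _mm256_loadu_ps(&in[4].x);   // x4 ... w5
    __m256 r3 = _mm256_loadu_ps(&in[6].x);   // x6 ... w7

    __m256 t0 = _mm256_permute2f128_ps(r0, r2, 0x20);  // x0y0z0w0 x4y4z4w4
    __m256 t1 = _mm256_permute2f128_ps(r1, r3, 0x20);  // x2y2z2w2 x6y6z6w6
    __m256 t2 = _mm256_permute2f128_ps(r0, r2, 0x31);  // x1y1z1w1 x5y5z5w5
    __m256 t3 = _mm256_permute2f128_ps(r1, r3, 0x31);  // x3y3z3w3 x7y7z7w7

    __m256 u0 = _mm256_unpacklo_ps(t0, t2);  // x0x1y0y1 x4x5y4y5
    __m256 u1 = _mm256_unpackhi_ps(t0, t2);  // z0z1w0w1 z4z5w4w5
    __m256 u2 = _mm256_unpacklo_ps(t1, t3);  // x2x3y2y3 x6x7y6y7
    __m256 u3 = _mm256_unpackhi_ps(t1, t3);  // z2z3w2w3 z6z7w6w7

    X = _mm256_shuffle_ps(u0, u2, _MM_SHUFFLE(1, 0, 1, 0));  // x0..x7
    Y = _mm256_shuffle_ps(u0, u2, _MM_SHUFFLE(3, 2, 3, 2));  // y0..y7
    Z = _mm256_shuffle_ps(u1, u3, _MM_SHUFFLE(1, 0, 1, 0));  // z0..z7
    W = _mm256_shuffle_ps(u1, u3, _MM_SHUFFLE(3, 2, 3, 2));  // w0..w7
}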

0 Kudos
bronxzv
New Contributor II
626 Views

>if the master thread is doing some usefull work, then prefetching by the slave thread at the same time would evict usefull data from L1 and L2 and thus actually decrease performance.

good point, though there are probably some buffer sizes where it will make sense (1 KB - 1 MB once packed); after all, people get speedups with software speculative precomputation in some cases

even if the packed data does not fit entirely in L1D and you are "gathering" very sparse data (it doesn't apply to the use cases discussed with c0d1f1ed, though), you may increase spatial coherence by 10x or more, thus decreasing by 10x or more (*1) the bandwidth required to move it from L2 to L1D the 2nd time. You'll also evict 10x fewer L1D lines the 2nd time, and the latency will be more than 10x better since the streamer will kick in (loads with regular strides the 2nd time, random addresses the 1st time)

*1: if there is a single float per cache line in the source data, it will be 64/sizeof(float) = 16x less bandwidth hungry the 2nd time; this is a common case in linear algebra with big matrices, for example
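
A minimal sketch of this pack-once-then-stream idea, with illustrative names; the first pass pays the sparse misses, and every later pass reads the dense buffer with regular strides:

#include <cstddef>

// First pass: gather one float per (potentially far-apart) cache line into a
// dense buffer so that subsequent passes read contiguous, prefetch-friendly
// memory instead of re-walking the sparse source.
void pack_sparse(const float* src, const std::size_t* idx,
                 float* packed, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)   // sparse, cache-missing reads
        packed[i] = src[idx[i]];          // dense, streamable writes
}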

0 Kudos
jimdempseyatthecove
Honored Contributor III
626 Views
>>Gather/scatter is intended to improve throughput when the data does reside in L1/L2. If it's higher up then I expect the CPU will pile up several misses and then the second thread (executing similar code) should simply take over and cover (most of) the latency

Yes, but since we are talking about HT sibling threads, under many circumstances the second thread would cause unnecessary eviction from the smaller L1/L2 (core shared) caches. As to if cooperative HT programming helps or hinders, this would depend on what is being done. IOW use the technique when appropriate.

Jim Dempsey
0 Kudos
Mark_B_Intel1
Employee
626 Views

We just posted information on the gather instructions we plan to support in Intel microarchitecture Haswell. See http://software.intel.com/file/36945. Blog coming soon.

Regards,
Mark
0 Kudos
levicki
Valued Contributor I
626 Views
Heh... finally some serious flexibility -- after gather I especially like shifts with independent counts per element.
How long will we have to wait for Haswell, and how many kidneys will we need to sell for the platform upgrade this time? :D
0 Kudos
bronxzv
New Contributor II
626 Views
>See http://software.intel.com/file/36945

Thanks for a very exciting read; the 1st word that comes to mind is "WOW!"
0 Kudos
TimP
Honored Contributor III
626 Views
Quoting bronxzv

as a matter of fact neither gather/scatter nor 1024-bit vectors are announced yet for AVX so we can't plan for these

in both cases I think they are not matching well with SMT, 4-way SMT is a more likely future IMHO

Reading the above, I would think you're not aware of the future AVX or current MIC gather instructions.
How does this square with http://software.intel.com/en-us/forums/showthread.php?t=83399&o=a&s=lr

Even if the initial version is only micro-coded, without a performance advantage over current AVX code, it opens the door for you to show the extent of your intended use and provide examples for optimization of future hardware.

Likewise, you might provide more detail on how you expect to show an advantage for 4-way SMT. I understand that it has shown a measurable advantage over 2 or 3 threads per core for linpack benchmarks, but not in the context of gather/scatter.
0 Kudos
sirrida
Beginner
596 Views
We just posted information on the gather instructions we plan to support in Intel microarchitecture Haswell.
See http://software.intel.com/file/36945.
Blog coming soon.
I'm very happy to see that most of the integer commands have been promoted to YMM.
Unfortunately the gather command only works on entities of 32 bits or greater; as described in my reply #53, we really need it also for bytes and words. I understand that there are some obstacles to implementing it if there are page faults during execution of such a command.
0 Kudos
bronxzv
New Contributor II
596 Views
>Reading the above, I would think you're not aware of the future AVX

Hey! Consider the timing: I'm neither an insider nor under a CNDA with Intel, so I wasn't aware of them on May 23, 2011 (the date I posted this), though I'm well aware of them now, as you can see here: http://software.intel.com/en-us/forums/showpost.php?p=152492

>or current MIC gather instruction

I was talking about *AVX*; since the whole thread subject is "Converging AVX and LRBni", they are two distinct x86 ISAs in the context of this thread. Anyway, be assured I've been well aware of the VGATHERD instruction in LRBni since the day the 1st Abrash paper was published (more than 2 years ago); I have zero fresh news about the MIC products, though.

>Likewise, you might provide more detail on how you expect to show an advantage for 4-way SMT.

well, exactly like I enjoy nice speedups (up to 1.3x on Sandy Bridge) with 2 threads: simply allocate a pool of threads with 1 thread per hardware thread context, minimize synchronization as much as possible, etc. Since the canceled EV8, POWER7 and Larrabee 1 all feature 4 hardware threads, there are obviously some good reasons (i.e. simulations have validated the idea) to have more than 2 hw threads, particularly when considering the wider and wider RAM bandwidth/latency gap


>but not in the context of gather/scatter.

exactly, thus my comment "they are not matching well with SMT". I said that because I suspect that at least the first implementation will not be able to track more than one gather instruction at a time; if it has high latency (huge in case of cache misses) and is serializing like VSQRTPS, it can choke code that uses VGATHERDPS intensively in critical loops due to threads fighting for the gather resource. Now, I'd love to be surprised and find that the execution of several gathers is allowed to overlap.

0 Kudos
capens__nicolas
New Contributor I
596 Views
Quoting c0d1f1ed
Could any Intel engineer/scientist tell me what the expected power efficiency would be for executing AVX-1024 on 256-bit execution units?

I've done some research and I can probably answer my own question now...

Each execution port has its own scheduler (reservation station), so when an AVX-1024 uop is issued on a 256-bit execution unit, the corresponding scheduler could be clock gated for 3 out of 4 cycles. While the scheduler is asleep a small and power efficient piece of sequencing logic can keep feeding in the 256-bit registers. NVIDIA uses similar register stringing (and Intel has a patent agreement with them) so I'm confident it's feasible to implement AVX-1024 as a single uop executed in 4 cycles.

Higher up in the pipeline, dispatch, register rename and decoding stages can also be clock gated when the buffers are full. Can anyone tell me to which degree these things have already been implemented?

With AVX2 already offering gather and FMA support, AVX-1024 would be the only thing remaining to turn the CPU into a power-efficient high-throughput device, without sacrificing sequential performance. AVX-1024 implemented this way would not only have the potential to drastically reduce power consumption during DLP-limited workloads, but in practice it would also allow better covering of long cache or RAM access latencies. Out-of-order execution enables higher ILP than Larrabee and GPUs offer, which reduces cache contention (2-way Hyper-Threading still seems like a valuable compromise though).

So does anyone know of any obstacles which would cause you not to expect AVX-1024 support for the Skylake timeframe?
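
Purely as an illustration of the sequencing idea (not a description of any actual hardware or ISA), here is a software model of a 1024-bit operation executed as four 256-bit passes; all names are hypothetical:

#include <immintrin.h>  // AVX

// Conceptual model: one "AVX-1024" add issued as a single operation but
// executed as four 256-bit passes over register quarters.
struct avx1024 { __m256 part[4]; };          // hypothetical 1024-bit register

static inline avx1024 add1024(const avx1024& a, const avx1024& b)
{
    avx1024 r;
    for (int pass = 0; pass < 4; ++pass)     // 4 cycles of a 256-bit unit
        r.part[pass] = _mm256_add_ps(a.part[pass], b.part[pass]);
    return r;
}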

0 Kudos
Reply