Intel® ISA Extensions

Converging AVX and LRBni

capens__nicolas
New Contributor I
2,299 Views
Hi all,

With Larrabee being canned as a discrete GPU, I was wondering whether it makes sense to actually let the CPU take the role of GPU and high-throughput computing device.

Obviously power consumption is a big issue, but since AVX is specified to process up to 1024-bit registers, it could execute such wide operations using SNB's existing 256-bit execution units in four cycles (throughput). Since it's one instruction this takes a lot less power than four 256-bit instructions. Basically you get the benefit of in-order execution within an out-of-order architecture.

The only other thing missing to be able to get rid of the IGP (and replace it with generic cores) is support for gather/scatter instructions. Since SNB already has two 128-bit load units it seems possible to me to achieve a throughput of one 256-bit gather every cycle, or 1024-bit every four cycles. In my experience (as lead SwiftShader developer) this makes software texture sampling perfectly feasible, while also offering massive benefits in all other high-throughput tasks.

Basically you'd get Larrabee in a CPU socket, without compromising any single-threaded or scalar performance!

Thoughts?

Nicolas
0 Kudos
62 Replies
bronxzv
New Contributor II
369 Views
you may be right for the vinsertps (6/8 loads), though I'd love a confirmation by someone in the know
for the low elements you can use 32-bit vmovss (2/8 loads), and it looks like an easy optimization to simply clear the 96 MSBs instead of "moving" 128 bits; here again I'll welcome an explanation of the actual workings beyond our guesswork

for extracting the 32-bit indices from the very same XMM
vmovd edi, xmm2
...
vpextrd edi, xmm2, 1
...
vpextrd edi, xmm2, 2
...
vpextrd edi, xmm2, 3
...
it looks rather odd to move 128 bits each time; at least nothing in the ISA asks for it, so it can be optimized in forthcoming chips if it's really as bad as you say
0 Kudos
bronxzv
New Contributor II
369 Views
>Programming such endless chains of insertxx commands is ugly, stupid, slow

Programming gather is no more effort than calling a Gather() inlined function (*1) or using the array notation for gather, A[B[:]], if you use Intel C++
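For illustration, the array notation form is a one-liner (a sketch in Intel Cilk Plus section syntax; res, table, idx and n are made-up names):

res[0:n] = table[idx[0:n]];   // gathers n FP32 values from table at the positions held in idx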


>Gather commands could at least generate the flood of ops

I can't see how it will be a real improvement, since the uop cache pressure will be the same as with a software-synthesized gather and execution will be on par with the current situation; only the x86 code density will be better, but that's really not an important point for high-performance code with a lot of inlining/unrolling


*1: Examples of generic gather functions (4 & 8 FP32 elements)

#define INLINE __forceinline

INLINE __m128 Gather(const float *base, const __m128i &indices)
{
  __m128 res = _mm_load_ss(base+_mm_cvtsi128_si32(indices));
  res = _mm_insert_ps(res,_mm_load_ss(base+_mm_extract_epi32(indices,1)),_MM_MK_INSERTPS_NDX(0,1,0));
  res = _mm_insert_ps(res,_mm_load_ss(base+_mm_extract_epi32(indices,2)),_MM_MK_INSERTPS_NDX(0,2,0));
  res = _mm_insert_ps(res,_mm_load_ss(base+_mm_extract_epi32(indices,3)),_MM_MK_INSERTPS_NDX(0,3,0));
  return res;
}

INLINE __m256 Gather(const float *base, const __m256i &indices)
{
  const __m128 low = Gather(base,_mm256_extractf128_si256(indices,0)),
               high = Gather(base,_mm256_extractf128_si256(indices,1));
  return _mm256_insertf128_ps(_mm256_castps128_ps256(low),high,1);
}
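A minimal usage sketch of the 8-element variant (the table and index values below are just placeholders):

float table[256] = {};                                        // hypothetical lookup table
const __m256i idx = _mm256_set_epi32(7, 6, 5, 4, 3, 2, 1, 0); // indices to fetch
const __m256 v = Gather(table, idx);                          // 8 independent FP32 loads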


compiling the 256-bit variant generates the 18 instructions discussed with c0d1f1ed:

vmovd edi, xmm2
vextractf128 xmm6, ymm2, 1
vmovss xmm0, DWORD PTR [ecx+edi*4]
vpextrd edi, xmm2, 1
vinsertps xmm1, xmm0, DWORD PTR [ecx+edi*4], 16
vpextrd edi, xmm2, 2
vinsertps xmm3, xmm1, DWORD PTR [ecx+edi*4], 32
vpextrd edi, xmm2, 3
vinsertps xmm0, xmm3, DWORD PTR [ecx+edi*4], 48
vmovd edi, xmm6
vmovss xmm4, DWORD PTR [ecx+edi*4]
vpextrd edi, xmm6, 1
vinsertps xmm5, xmm4, DWORD PTR [ecx+edi*4], 16
vpextrd edi, xmm6, 2
vinsertps xmm7, xmm5, DWORD PTR [ecx+edi*4], 32
vpextrd edi, xmm6, 3
vinsertps xmm1, xmm7, DWORD PTR [ecx+edi*4], 48
vinsertf128 ymm2, ymm0, xmm1, 1

0 Kudos
sirrida
Beginner
369 Views
All these commands cost code space and many more ops than necessary for a naive implementation of a gather command.
Also, all these commands must be decoded, and they trash my precious registers (in your example edi and ymm0..7).
Would you create a whopping bunch of 32 or more commands for my example of gathering bytes in xmm registers?
For the assumed AVX extension for integer ops on ymm, things get worse...
0 Kudos
bronxzv
New Contributor II
369 Views
>many more ops than necessary

it's pretty difficult to tell since the ISA of the uops isn't disclosed; I don't see where it can be significantly simplified: you need to extract the integer indices from vector registers to GPRs, compute the base + index addresses, load the values and insert them in vector registers
the hot spot is clearly the loads anyway, limited by the cache hierarchy, not the ISA

to improve code density, and maybe also execution speed, an interesting middle ground (between the current situation and a fat indivisible vgather instruction that may trigger multiple cache misses) would be to introduce a new addressing mode where we can use a single element of an ymm register as offset

ASM code may look like:
vinsertps xmm1, xmm1, DWORD PTR [ecx+ymm0[DWORD 1]*4], 16
vinsertps xmm1, xmm1, DWORD PTR [ecx+ymm0[DWORD 2]*4], 32


>they trash my precious registers (in your example edi and ymm0..7)

yes, good point, though it's compiler-generated code; the code is different if register pressure is higher, the insertions may then look like vinsertps xmm0, xmm0, DWORD PTR [ecx+edi*4], 16, for example, wasting fewer registers



0 Kudos
capens__nicolas
New Contributor I
369 Views
Quoting bronxzv
you may be right for the vinsertps (6/8 loads), though I'd love a confirmation by someone in the know
for the low elements you can use 32-bit vmovss (2/8 loads), and it looks like an easy optimization to simply clear the 96 MSBs instead of "moving" 128 bits; here again I'll welcome an explanation of the actual workings beyond our guesswork

for extracting the 32-bit indices from the very same XMM
vmovd edi, xmm2
...
vpextrd edi, xmm2, 1
...
vpextrd edi, xmm2, 2
...
vpextrd edi, xmm2, 3
...
it looks rather odd to move 128 bits each time; at least nothing in the ISA asks for it, so it can be optimized in forthcoming chips if it's really as bad as you say

It's not guesswork. Every modern pipelined processor uses result forwarding to eliminate read-after-write hazards. Trust me, I'm in the know. I have a master's degree in computer science and engineering (and a minor in embedded systems). You can also read about the added latency for bypassing results between execution domains in Intel's Optimization Reference Manual.

Forwarding also affects extract instructions. Take the following code sequence:

paddd xmm0, xmm1

pextrd eax, xmm0, 3
sub eax, 123

You might think the sub could directly use the fourth element of the result of the paddd right after it finishes executing (eliminating the pextrd), but this would complicate the forwarding network in multiple places, adding gate delay. That's delay and complication right where you don't want it. So making extract instructions more efficient would compromise everything else. Instead the full 128 bits are forwarded as-is, and the next cycle is used to execute the pextrd, after which the result is forwarded to the sub instruction.

But while there's nothing that can be done in the above case, a gather operation really doesn't need any of this forwarding; it shouldn't even involve the ALU pipelines at all! It's also fine if a gather instruction takes extra latency (it will still be much faster than 18 instructions, and throughput is far more critical for SIMD code anyway). It can also use a weaker memory consistency model. These things should make it feasible to prevent it from affecting the performance of regular load operations.

0 Kudos
sirrida
Beginner
369 Views
...and as we have seen in "Copy and modify", forwarding does not work as well as it should, at least on i7 and Atom...
0 Kudos
capens__nicolas
New Contributor I
369 Views
Quoting bronxzv
I can't see how it will be a real improvement, since the uop cache pressure will be the same as with a software-synthesized gather and execution will be on par with the current situation; only the x86 code density will be better

No, a misaligned load instruction is still one uop, even if it has to access two cache lines. So likewise a 128-bit gather instruction can be just a single uop even if it has to access four cache lines. There's no benefit at all in having multiple uops and scheduling them individually. Dependent instructions can't commence anyway until all data has been loaded. So whether it's an aligned load, a misaligned load, or a gather, it can be treated as one uop which has either finished or not. It's the load unit's responsibility to fetch each portion of the data. Even a vmovaps load instruction is a single uop, but it issues on both port 2 and 3.

This doesn't just free up lots of uop cache space (from 18 fused uops down to 1), but also avoids all of the power consumption related to scheduling and register renaming and such.

0 Kudos
bronxzv
New Contributor II
369 Views
>It's not guesswork.

sure it is; I'd be interested to have the input from an insider, though
0 Kudos
bronxzv
New Contributor II
369 Views
>So likewise a 128-bit gather instruction can be just a single uop

Sorry, but I was answering sirrida's comment "Gather commands could at least generate the flood of ops"; I'm sure sirrida meant it as an instruction decoded into multiple uops, like SSE on the PIII/P4 or vdivps/pd and vsqrtps/pd on SNB. It looks interesting since it will match well with SMT, unlike a fat vgather

now if it's really possible to implement a single-uop generic vgather, and a thread can issue a vgather before the other thread(s)' vgather(s) retire, potentially thousands of cycles later (i.e. if vgather is not serializing like vdivps, for example), be assured I'll use it from day one; otherwise it will clearly be the #1 source of stalls and the #1 instruction to avoid, well IMHO



0 Kudos
sirrida
Beginner
369 Views
I meant "Gather commands could at least generate the flood of ops" as a means to easily get a first implementation. Later CPU generations will surely have a better implementation.
0 Kudos
bronxzv
New Contributor II
369 Views
>I meant "Gather commands could at least generate the flood of ops" as a means to easily get a first
>implementation. Later CPU generations will surely have a better implementation.


yes, you were clear about it, and I suppose it will be quite simple to implement the multi-uop solution, though I'm not sure it will really provide concrete speedups (since the bottlenecks are elsewhere: number of load ports, cache and memory hierarchies); the incentive to use it will be low, as not many ISVs want one more code path without a significant speedup
0 Kudos
sirrida
Beginner
369 Views
At least the 108 bytes of your solution will become about 5 bytes and writing, reading and debugging will be much easier at assembly level.
OK, there's one more code path, but this is the price to pay.
The anticipated speedup might come later...
0 Kudos
capens__nicolas
New Contributor I
369 Views
Quoting bronxzv
to improve code density, and maybe also execution speed, an interesting middle ground (between the current situation and a fat indivisible vgather instruction that may trigger multiple cache misses) would be to introduce a new addressing mode where we can use a single element of an ymm register as offset

I'm sorry, but it's pretty pointless and even wasteful to add an instruction which will be superseded by gather/scatter at some point in the near or far future. It's also not obvious how to encode your instruction in the first place. And you're sending a large vector to the load unit for each of these instructions, while only using a minor portion (the same issue as the forwarding that takes place with the extract instructions). And last but not least, each of your insert instructions is still dependent and needlessly carries lots of data around.

It could shave off a few cycles but remains flawed, so in my opinion it would be a lot more worthwhile to invest the transistors into an actual gather implementation.

0 Kudos
bronxzv
New Contributor II
369 Views

>And you're sending a large vector to the load unit for each of these instructions, while only using a minor portion

here again I don't see why it would not be possible to move only 32 bits between the PRF and the AGU using a multiplexer; it looks far simpler than a fully functional vgather, and it's a concrete step that will fit nicely with SMT. One more time, it would be nice to have the input from someone nearer to the action than we are

0 Kudos
bronxzv
New Contributor II
369 Views
>writing, reading and debugging will be much easier at assembly level

so true, though in my experience readable ASM and fast code are orthogonal issues at best; most of the time faster code is less readable in the ASM dump, like when we get a significant speedup using a "#pragma unroll" directive, or when using two 128-bit vmovups is way faster than a single 256-bit vmovups with unaligned arrays
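For the unaligned-array case, the two forms look like this (a sketch; p is assumed to point to FP32 data that is not 32-byte aligned):

__m256 a = _mm256_loadu_ps(p);               // single 256-bit unaligned load
__m256 b = _mm256_insertf128_ps(             // same result from two 128-bit unaligned loads,
    _mm256_castps128_ps256(_mm_loadu_ps(p)), // often faster on SNB when p is misaligned
    _mm_loadu_ps(p + 4), 1);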
0 Kudos
capens__nicolas
New Contributor I
369 Views
Quoting bronxzv
>It's not guesswork.

sure it is, I'll be interested to have the input from an insider, though

Did you read the documents I linked? Forwarding is the only way a fully pipelined architecture can execute dependent instructions back-to-back (which has been possible for every Intel chip since the 486). Reading and writing the register file simply takes too many cycles, so results are directly looped back in case the next instruction requires the new value instead of the old value it read a couple of cycles earlier. Here's a good explanation: modern microprocessors (figure 4).

0 Kudos
bronxzv
New Contributor II
369 Views

hey Nicolas, I don't see anything at your link going against the fact that it's possible to move only 32 bits (or only 64 bits / 128 bits, *not always 256 bits*) to an input that doesn't need more than 32 bits

with a P6-like design (Nehalem) the move can be from a result buffer (in case of forwarding) or from the RRF when the result is no longer available in a result buffer

my understanding is that with Sandy Bridge it's simply always from the unified PRF; it can forward as soon as the result is available, before the RAT update is completed

http://www.realworldtech.com/page.cfm?ArticleID=RWT091810191937&p=5

"Allocation, renaming, scheduling and retiring are all different for Sandy Bridge, and minimize the movement and replication of data within the processor"

"PRF-based renaming and scheduling is substantially more power efficient because it eliminates movement of 32, 64, 128 or 256-bit data values"

0 Kudos
jimdempseyatthecove
Honored Contributor III
369 Views
Bronxzv,

I do not have my Sandy Bridge system yet so I cannot try this out.

Sandy Bridge has HT. Run some performance test code that uses two threads, HT siblings. Have one thread be the master thread and the other a slave thread. Have both threads set up MONITOR to monitor a list of addresses used as a mailbox. The master thread, when it knows it will require a gather some time in the future (a fairly long time), writes the pointer to the gather list into the mailbox; the slave thread MWAITing on the mailbox gets the gather list, gathers the data and writes it as a 256-bit item into the mailbox gather results. With sufficient advance notice this could complete prior to the master thread needing the data. The master thread should be able to read the gathered data in one gulp if ready (or issue MWAIT till ready, or go get the data itself in the event the slave thread got preempted). The slave thread could serve as a data pipeline prefetcher and post-storer. This will "waste" one thread, one that you do not want using AVX anyway.
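A minimal sketch of this mailbox scheme (MONITOR/MWAIT are privileged instructions, so this substitutes a spinning atomic flag; all names, sizes and the 8-element gather are illustrative):

#include <immintrin.h>
#include <atomic>

// Mailbox shared by the two HT siblings (illustrative layout).
struct Mailbox {
    const float *base = nullptr;              // table to gather from
    std::atomic<const int*> indices{nullptr}; // gather list posted by the master
    alignas(32) float result[8];              // 256-bit gathered result
    std::atomic<bool> ready{false};
};

// Slave thread: wait for work, gather, publish (spin-wait instead of MONITOR/MWAIT).
void gather_slave(Mailbox &mb) {
    const int *idx;
    while (!(idx = mb.indices.load(std::memory_order_acquire)))
        _mm_pause();                          // be polite to the sibling thread
    for (int i = 0; i < 8; ++i)
        mb.result[i] = mb.base[idx[i]];
    mb.ready.store(true, std::memory_order_release);
}

// Master thread: post the gather list early, do other work, then read the data in one gulp.
__m256 master_side(Mailbox &mb, const float *table, const int *gather_list) {
    mb.base = table;
    mb.indices.store(gather_list, std::memory_order_release);
    // ... useful independent work goes here ...
    while (!mb.ready.load(std::memory_order_acquire))
        _mm_pause();                          // fallback: could gather the data ourselves
    return _mm256_load_ps(mb.result);
}

In practice the slave would be pinned to the HT sibling of the master's core, e.g. with SetThreadAffinityMask or pthread_setaffinity_np.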

Jim Dempsey
0 Kudos
bronxzv
New Contributor II
369 Views

Jim,

It's an interesting idea, along the lines of software speculative precomputation. It will be particularly effective when "gathering" a big chunk of data (i.e. you pass it an array of indices and it sets an array of values) with a high cache miss rate, and if the master thread is able to do other useful work in the meantime.

>one thread, one that you do not want using AVX anyway.
In my use cases I typically work with a pool of threads (1 thread per logical processor), so each hardware thread needs AVX support; btw the speedups from hyperthreading are slightly better for AVX code

0 Kudos
capens__nicolas
New Contributor I
380 Views
Quoting bronxzv

hey Nicolas, I don't see anything at your link going against the fact that it's possible to move only 32 bits (or only 64 bits / 128 bits, *not always 256 bits*) to an input that doesn't need more than 32 bits

with a P6-like design (Nehalem) the move can be from a result buffer (in case of forwarding) or from the RRF when the result is no longer available in a result buffer

my understanding is that with Sandy Bridge it's simply always from the unified PRF; it can forward as soon as the result is available, before the RAT update is completed

http://www.realworldtech.com/page.cfm?ArticleID=RWT091810191937&p=5

"Allocation, renaming, scheduling and retiring are all different for Sandy Bridge, and minimize the movement and replication of data within the processor"

"PRF-based renaming and scheduling is substantially more power efficient because it eliminates movement of 32, 64, 128 or 256-bit data values"

Although Sandy Bridge eliminates storing data in the ROB, the forwarding network and register file accesses are performance critical. The use of a PRF doesn't change that. Extracting 32-bit data from a vector takes several gate delays, and you can't squeeze that in there without affecting timings. So the only sensible solution to insert/extract parts of a register is to send the entire vector to the execution unit and take a full clock cycle to perform the operation as a separate instruction.

Gather/scatter on the other hand wouldn't affect those critical parts of arithmetic instruction execution at all. They are parallel load/store operations, so they only affect the load/store units. And I think it can be implemented without impacting regular load/store performance. First of all it can take advantage of the fact that the index values can be extracted sequentially; the first element can be used directly, and for subsequent indices it has a full cycle to shift the vector to the right. This merely requires sending the index vector once, and doesn't require any control bits to select which element to extract. Furthermore, in case of gather the elements can be inserted in arbitrary order. This can be handled by a separate piece of logic which collects the elements into a dedicated gather register the next cycle, sending it to the PRF when complete. This adds a cycle of latency to gather, but should leave regular load latency unaffected.

An advanced implementation could check which indices point to the same cache line, and gather up to four elements in parallel. Although it's more challenging to prevent this from affecting regular load/store performance, it seems well worth it to me to further improve performance/Watt of throughput computing (and beyond). Tons of algorithms contain loops which could be successfully parallelized with gather/scatter support. Of course it makes sense to only invest in this advanced implementation once software already makes use of gather/scatter. So I think the above cheap implementation with a maximum throughput of one gather every four cycles makes most sense for the near future.

0 Kudos
capens__nicolas
New Contributor I
380 Views
Quoting sirrida
At least the 108 bytes of your solution will become about 5 bytes and writing, reading and debugging will be much easier at assembly level.
OK, there's one more code path, but this is the price to pay.
The anticipated speedup might come later...

A gather/scatter instruction which expands into ~18 (fused) uops could indeed save code bytes and reduce register pressure, but overall it doesn't seem worthwhile to me. Assembly cleanliness certainly isn't much of a convincing argument.

The real issue with it, though, is that a custom code sequence might be faster. Take for instance the example of approximating a function with piecewise polynomials. A gather/scatter-based solution would perform parallel table lookups for each of the coefficients. But you could also look up all coefficients at once and transpose the elements (from AoS to SoA). Despite requiring many shuffle operations, the latter would be faster if the gather/scatter implementation is a straight expansion into extract/insert operations.
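A sketch of that second approach for a 4-wide lookup of cubic coefficients (assuming the table stores the four FP32 coefficients of each interval contiguously; all names are illustrative):

#include <smmintrin.h>  // SSE4.1

struct Coeffs4 { __m128 c0, c1, c2, c3; };  // SoA: one vector per coefficient

inline Coeffs4 LookupCoeffs(const float *table, const __m128i &idx)
{
    // one 128-bit load per lane fetches that lane's whole coefficient record (AoS)
    __m128 r0 = _mm_loadu_ps(table + 4*_mm_cvtsi128_si32(idx));
    __m128 r1 = _mm_loadu_ps(table + 4*_mm_extract_epi32(idx, 1));
    __m128 r2 = _mm_loadu_ps(table + 4*_mm_extract_epi32(idx, 2));
    __m128 r3 = _mm_loadu_ps(table + 4*_mm_extract_epi32(idx, 3));
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);      // shuffle into SoA: all c0s, all c1s, ...
    const Coeffs4 c = { r0, r1, r2, r3 };
    return c;
}

The polynomial can then be evaluated entirely in SoA form, e.g. with Horner's scheme on the four coefficient vectors.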

So in my humble opinion we really need a gather/scatter implementation which offers a sizable advantage from the get-go. A 256-bit gather operation with a throughput of 4 cycles can't be beaten by custom code and offers many other advantages thanks to only taking a single uop, and seems quite feasible to me. I don't see any point in asking for anything less.

0 Kudos