Intel® ISA Extensions

AVX in Sandy Bridge

bronxzv
New Contributor II
Some new public information about the 1st AVX incarnation in Sandy Bridge is in the SF09_ARCS002_FIN.pdf document, available here:

http://www.intel.com/idf/technology-tracks/ -> "32 nm Implementation of... " -> "See sessions within this track" -> ARCS002 PDF Icon

1) slide 6, one "AVX HIGH" unit on port 0 and one "AVX LOW" unit on port 1 => it looks like 256-bit AVX will have the same throughput as 128-bit SSE on current cores: 2 clocks for one 256-bit vmulps/pd + one 256-bit vaddps/pd, instead of 1 clock for one 128-bit mulps/pd + one 128-bit addps/pd (i.e. the same peak sp/dp flops per clock with balanced add/mul). So if Sandy Bridge can't issue a mul and an add in the same clock, unlike Conroe, Penryn and Nehalem, it will actually be less efficient than these previous cores on a lot of legacy 128-bit SSE code? It looks rather odd => can someone in the know confirm this?

2) slide 6, only 128-bit paths from the L1D cache to the execution units (I was hoping for full-featured 256-bit paths), a few consequences:
- the extra load port will help legacy 128-bit SSE or 128-bit AVX as much as 256-bit AVX, same 48 B/clock maximum L1 bandwidth
- loop fission will probably no longer be a good optimization if intermediate results are stored in L1D, probably better to overflow the LSD than the L1D, particularly with multiple threads fighting for L1D access
- more incentive to use 64-bit code, with 16 ymm registers instead of 8, to minimize L1D accesses

3) slide 54, 64 B cache lines (unchanged), so:
- aligning memory is still important (more important than on Nehalem), otherwise 1 out of 2 accesses will incur a cache-line split

4) slide 58, masked moves considered harmful, so replace vmaskmovps by vblendvps + vmovaps just like in legacy SSE4 code?
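
To make point 4 concrete, here is the kind of substitution I have in mind - a minimal sketch with raw intrinsics rather than my actual code; v is assumed 32 B aligned and the mask an all-ones/all-zeros per-lane pattern:

#include <immintrin.h>

// masked-store variant (the one slide 58 warns about)
void CondStoreMask(float *v, __m256 a, __m256 mask)
{
    _mm256_maskstore_ps(v, _mm256_castps_si256(mask), a);
}

// blend + full-width store variant, just like legacy SSE4 code
void CondStoreBlend(float *v, __m256 a, __m256 mask)
{
    __m256 old = _mm256_load_ps(v);                      // v assumed 32 B aligned
    _mm256_store_ps(v, _mm256_blendv_ps(old, a, mask));  // take a where mask is set
}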
TimP
Honored Contributor III
I'm technically challenged as to how to actually see the slides, so I'm relying on your description.
"efficiency" for 128-bit parallel code should improve over current CPUs in the case where it can benefit from 2 128-bit cache read accesses per cycle, otherwise should be unchanged, thus there is a strong emphasis on recompilation and finding/fixing obstacles to 256-bit operations. As you hint, many applications which could benefit from AVX could see a measurable improvement without recompilation.
Loop fission (distribution) is already a problem where it breaks cache and register locality. No doubt you're right that the importance of this will increase. This is recognized in compiler development, which aims to avoid some of the problems with distribution, including support of mixed 32- and 64-bit data types by employing (in effect) 128- and 256-bit parallel instructions in the same loop.
The strong support in Nehalem for misaligned 128-bit operands continues but, as you say, doesn't extend to 256-bit operands, even though they appear to be composed of 128-bit pairs. So the emphasis on alignment returns, with greatly increased importance of 32-byte alignment. Compilers will split loads explicitly so as to minimize this problem.
As far as I know, there hasn't been any advantage in masked moves beyond the superficial one of removing one line of asm code, so I doubt many people will get worked up over one or two tiny moves back in the RISC direction.
As you should expect, Amdahl's law comes into play to a much greater extent than the publicity slides tend to acknowledge. Only a handful of current major applications spend as much as 60% of their time executing parallel instructions. Even when those instructions are addressed by AVX (many of them memory-speed limited, perhaps seeing the benefit of increasing from 3 to 4 channels), it's not possible to expect a 50% gain overall.
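To put rough numbers on it (my own back-of-the-envelope, not from the slides): with 60% of the time in vector instructions and AVX doubling the speed of just that portion, Amdahl's law gives 1 / (0.4 + 0.6/2) ≈ 1.43x, i.e. about a 43% overall gain, and that is before memory limits shave it further.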
bronxzv
New Contributor II
Quoting - tim18
"efficiency" for 128-bit parallel code should improve over current CPUs in the case where it can benefit from 2 128-bit cache read accesses per cycle, otherwise should be unchanged,



AFAIK on Nehalem (reference: http://www.realworldtech.com/page.cfm?ArticleID=RWT040208182719&p=6) 128-bit SSE FP instructions can be issued to port 0 (4 x FP32 or 2 x FP64 multipliers, shuffle), port 1 (4 x FP32 or 2 x FP64 adders) and port 5 (misc, shuffle), so it can issue a packed mul in the same clock as a packed add, for a peak of 8 FP32 / 4 FP64 flops per cycle

Now from the Sandy Bridge picture above (the slide I was referring to) it looks like SSE is dispatched by only a single port on any given cycle, though it's probably just something that is oversimplified (like an "SSE" label missing in the leftmost execution units block)

The AVX "HIGH" and "LOW" labels make me think there is a 128-bit FMUL unit *and* a 128-bit FADD unit for both blocks at the left (attached orthogonally to ports 0, 1 and 5 if I understand it well), so hopefully it can issue a 128-bit fadd/fmul pair in the same clock like Nehalem, though it can generate only one 256-bit result per clock with 256-bit fadd/fmul. In other words, peak flops with well-balanced fadd/fmul, as is pretty common in vector algebra (I definitely have 3D rendering in mind), will be the same with 128-bit code and 256-bit code
SHIH_K_Intel
Employee

It seems point 1) may have assumed that monolithic 256-bit hardware is required to achieve 1-cycle throughput for 256-bit AVX instructions. That's not true.
From what I know of the 256-bit AVX instructions, common operations such as add have a 1-cycle throughput.
bronxzv
New Contributor II

It seems point 1) may have assumed that monolithic 256-bit hardware is required to achieve 1-cycle throughput for 256-bit AVX instructions. That's not true.
From what I know of the 256-bit AVX instructions, common operations such as add have a 1-cycle throughput.

my point is that with Nehalem we have a 1-cycle throughput for (128-bit) ADDPS/PD or MULPS/PD and both can be issued in parallel, so the *throughput of a packed 128-bit add/mul pair is 1 clock*

from the diagram above it looks like we have indeed a 1-cycle throughput for (256-bit) VADDPS/PD or VMULPS/PD considered separately, but both can't be issued in parallel (and yes, I understand it's possible to issue the MSBs and the LSBs in two different cycles, as Willamette and its followers were doing, up to but not including Conroe), so the *throughput of a packed 256-bit add/mul pair is 2 clocks*. In other words, peak flops are the same (not 2x higher as previously said)



Max_L
Employee
well, I can see how the picture might be a bit confusing...
but in fact, as was said previously, Sandy Bridge will double peak FP throughput compared to the past few Intel Core implementations - it will be able to start 1 256-bit FP ADD and 1 256-bit FP MUL every cycle.
-Max
bronxzv
New Contributor II
well, I can see how the picture might be a bit confusing...
but in fact, as was said previously, Sandy Bridge will double peak FP throughput compared to the past few Intel Core implementations - it will be able to start 1 256-bit FP ADD and 1 256-bit FP MUL every cycle.
-Max

thanks a lot for the information, it's very good news since my code already uses more than 90% 256-bit instructions in all its hot spots (i.e. 95% of the flops are with 256-bit instructions)

so indeed the IDF diagram and its labeling are quite confusing IMO
Mark_B_Intel1
Employee

Great questions - some more details to add to the response Max gave.

1) The chart is wrong, we will fix it. Sandy Bridge has true 256-bit FP execution units (mul, add, shuffle). They are on exactly the same execution ports as the 128-bit versions. You can get a 256-bit multiply (on port 0), a 256-bit add (on port 1) and a 256-bit shuffle (on port 5) every cycle. 256-bit FP add and multiply bandwidth is therefore 2X the flops of 128-bit. See IACA for the ports on an instruction-by-instruction basis.
2) The chart doesn't mention 16-byte paths. We have true 32-byte loads (i.e. each load only uses one AGU resource, and we have 2 AGUs), but only 48 bytes/cycle total is supported to the L1 each cycle. You can't get 48 bytes per cycle to the DCU using 128-bit operations (only 2 AGUs). This is why a simple memory-limited kernel like matrix add (load, load, add, store) measures a 1.42X speedup (we would have predicted 1.5X with the current architecture in the limit, vs. 1.0X if we had double pumped). See the first sketch after point 4.
3) Alignment for 128-bit loads/stores is similar to Nehalem. The alignment penalty for 256-bit loads/stores is somewhat worse - that's due to line splits and page splits. You are much more likely to split with wider loads, so alignment is much more important. That's why, especially if you can guarantee 16-byte alignment but not 32-byte alignment, it often pays off to do load128/insertf128 instead of load256. Previous guidance to favor aligning stores (when you get a choice to align either a load or a store stream) still holds - store page splits are worse than load page splits. See the second sketch after point 4.
4) Masked moves are not harmful - they are proving extremely useful. But they are designed for a specific problem: when the exception safety of non-masked loads/stores can't be guaranteed. They burn a blend resource, and they aren't going to disambiguate as well as normal loads and stores, so I don't use them when I don't need them. If you are a vectorizing compiler, they're great for peeling and remainder operations, for vectorizing code with an if protecting a possible exception, etc. If you are a human coder, I doubt you'll need them: a bit of data-overrun padding (often coupled with alignment) pays dividends in speed. You mention doing the blend yourself. Note that a variable blend requires 2 port-5 shuffles, so in shuffle-limited code this doesn't always win.
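
Two minimal sketches for points 2 and 3 (illustrative code, not from our libraries; alignment and trip-count assumptions are in the comments):

#include <immintrin.h>

// point 2: memory-limited matrix-add kernel - load, load, add, store
// per 8 floats (a, b, c assumed 32 B aligned, n a multiple of 8)
void MatAdd(float *c, const float *a, const float *b, int n)
{
    for (int i = 0; i < n; i += 8)
        _mm256_store_ps(c + i, _mm256_add_ps(_mm256_load_ps(a + i),
                                             _mm256_load_ps(b + i)));
}

// point 3: a 32-byte load done as two 16-byte halves, for data that is
// 16 B aligned but not 32 B aligned (helper name is illustrative)
__m256 Load256Via128(const float *p)  // p assumed 16 B aligned
{
    const __m256 lo = _mm256_castps128_ps256(_mm_load_ps(p));
    return _mm256_insertf128_ps(lo, _mm_load_ps(p + 4), 1);
}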


Mark Buxton
bronxzv
New Contributor II
Mark, thanks a lot for all the details

>2) The chart doesn't mention 16-byte paths.

sure, I was deeply confused by the "AVX HIGH" (made me think of the 128 MSBs) and "AVX LOW" (128 LSBs) labels


>3) Alignment for 128-bit loads/stores is similar to Nehalem. The alignment penalty for 256-bit loads/stores is
>somewhat worse - that's due to line splits and page splits. You are much more likely to split with wider loads, so
>alignment is much more important. That's why, especially if you can guarantee 16 byte alignment but not 32-byte
>alignment, it often pays off to do load128/insertf128 instead of load256. Previous guidance to favor aligning stores
>(when you get a choice to align either a load or a store stream) still holds - store page splits are worse than load
>page splits.

In my case I'd say more than 95% of moves are aligned to 32 B, I use VMOVAPS wherever possible, and SDE nicely crashes (I really mean it) if the address isn't aligned; btw LRBni requires strict 64 B alignment so it's an important practice for multi-path code anyway
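
For reference, a minimal sketch of how I get that guarantee (_mm_malloc is one portable way to obtain 32 B aligned buffers; the helper names are illustrative, not from my code):

#include <immintrin.h>
#include <cstddef>

// 32 B aligned allocation so VMOVAPS-style accesses never fault
float *AllocVec(std::size_t n)
{
    return static_cast<float *>(_mm_malloc(n * sizeof(float), 32));
}

void FreeVec(float *p) { _mm_free(p); }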


>4) Masked moves are not harmful,

Sure, but most of my kernels have 10-40 iterations, and slide 58 states that "it may be beneficial to not use masked stores for very small loops (< 30 iterations)"

here is an excerpt of my code, it will be just a matter of a recompile to select the best option:

INLINE OctoFloat Select (const OctoMask &mask, const OctoFloat &a, const OctoFloat &b)
{
  return _mm256_blendv_ps(b.m, a.m, mask.m);
}

INLINE void CondStore (float *v, const OctoFloat &a, const OctoMask &m)
{
  Store(v, Select(m, a, OctoFloat(v)));
}

/*
INLINE void CondStore (float *v, const OctoFloat &a, const OctoMask &m) // REVAVX: test if faster than the Select variant on real AVX HW
{
  _mm256_maskstore_ps(v, _mm256_castps_si256(m.m), a.m);
}
*/
