>2) The chart doesn't mention 16-byte paths.
Sure, I was deeply confused by the "AVX HIGH" and "AVX LOW" labels (they made me think of the 128 MSBs and the 128 LSBs, respectively)
>3) Alignment for 128-bit loads/stores is similar to Nehalem. The alignment penalty for 256-bit loads/stores is
>somewhat worse - that's due to line splits and page splits. You are much more likely to split with wider loads, so
>alignment is much more important. That's why, especially if you can guarantee 16 byte alignment but not 32-byte
>alignment, it often pays off to do load128/insertf128 instead of load256. Previous guidance to favor aligning stores
>(when you get a choice to align either a load or a store stream) still holds - store page splits are worse than load page splits.
In my case I'd say more than 95% of moves are aligned to 32 B. I use VMOVAPS wherever possible, and SDE nicely crashes (I really mean it) if the address isn't aligned. Btw, LRBni requires strict 64 B alignment, so it's an important practice for multi-path code anyway
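For the remaining unaligned accesses, the load128/insertf128 pattern he describes would look something like this (just a sketch; Load256Via128 is an illustrative name, assuming p is 16 B- but not 32 B-aligned):

#include <immintrin.h>

// 256-bit load done as two aligned 128-bit halves, avoiding a 32 B line/page split
static inline __m256 Load256Via128 (const float *p) // p must be 16 B-aligned
{
    __m256 v = _mm256_castps128_ps256(_mm_load_ps(p));     // low 128 bits
    return _mm256_insertf128_ps(v, _mm_load_ps(p + 4), 1); // high 128 bits
}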
>4) Masked moves are not harmful,
Sure, but most of my kernels have 10-40 iterations, and slide 58 states that "it may be beneficial to not use masked stores for very small loops (< 30 iterations)"
here is an excerpt of my code, it will just be a matter of recompiling to select the best option:
INLINE OctoFloat Select (const OctoMask &mask, const OctoFloat &a, const OctoFloat &b);
// two CondStore variants share this signature; only one is compiled in at a time:
INLINE void CondStore (float *v, const OctoFloat &a, const OctoMask &m); // Select variant (blend + full store)
INLINE void CondStore (float *v, const OctoFloat &a, const OctoMask &m); // masked-store variant, REVAVX: test if faster than the Select variant on real AVX HW
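For reference, a minimal sketch of both variants (assuming OctoFloat/OctoMask are plain wrappers over __m256 with all-ones/all-zeros mask lanes; USE_MASKED_STORE is a hypothetical switch flipped at recompile):

#include <immintrin.h>
#define INLINE inline

struct OctoFloat { __m256 v; };
struct OctoMask  { __m256 m; }; // each lane all-ones (selected) or all-zeros

// pick a where the mask is set, b elsewhere (vblendvps)
INLINE OctoFloat Select (const OctoMask &mask, const OctoFloat &a, const OctoFloat &b)
{
    OctoFloat r = { _mm256_blendv_ps(b.v, a.v, mask.m) };
    return r;
}

#ifdef USE_MASKED_STORE
// masked-store variant: vmaskmovps writes only the selected lanes
INLINE void CondStore (float *v, const OctoFloat &a, const OctoMask &m)
{
    _mm256_maskstore_ps(v, _mm256_castps_si256(m.m), a.v);
}
#else
// Select variant: load the old values, blend, then one full 256-bit store
INLINE void CondStore (float *v, const OctoFloat &a, const OctoMask &m)
{
    OctoFloat old = { _mm256_load_ps(v) };
    _mm256_store_ps(v, Select(m, a, old).v);
}
#endif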