What are the alignment restrictions on the new Haswell AVX2 VGATHER instructions?
I'm looking at the AVX programming reference. The new Haswell instructions include some eagerly awaited "gather" loads. However, I can't figure out what the alignment restrictions are on the indexed data items. Section 2.5 "Memory alignment" of the reference seems like it ought to list the various VGATHER* instructions in one of tables 2.4 or 2.5... but it doesn't.
Background: although the gather instructions only support 4- and 8-byte data sizes, my application could benefit from gather-loading pairs of adjacent 16-bit values into DWORDs. Odd indices with a 2-byte scale produce 4-byte loads that are only 2-byte aligned, and it's not clear to me from the manual whether these will fault or otherwise fail to work as intended. (I rather suspect I'm out of luck, given that all the instructions supporting unaligned accesses seem to have a 'U' in them.)
I don't expect there are any alignment restrictions.
A gather operation is essentially a set of load operations. So a relatively straightforward implementation would just use the existing load unit(s), which fetch cache lines and shift the data to extract the (aligned or unaligned) word. Words that straddle a cache line boundary require two fetches, and the logic to merge the two pieces is already in place as well. So supporting unaligned gather doesn't require any essential changes.
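To make the "set of loads" view concrete, here is a minimal scalar sketch in C of what a dword gather computes for the case in the question. It is only a model of the semantics, not how the hardware does it; the function name `gather_u32_model` and its parameters are made up for illustration. Each element is read from `base + idx[i] * scale`, and with a scale of 2 an odd index gives a 4-byte load that is only 2-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of a dword gather:
 * dst[i] = the 4 bytes found at base + idx[i] * scale.
 * The scale is restricted to 1, 2, 4 or 8, as in the VGATHER encodings. */
static void gather_u32_model(uint32_t *dst, const void *base,
                             const int32_t *idx, size_t n, int scale)
{
    const unsigned char *p = (const unsigned char *)base;

    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);

    for (size_t i = 0; i < n; ++i) {
        /* memcpy expresses a possibly unaligned 4-byte load
         * without undefined behaviour in portable C. */
        memcpy(&dst[i], p + (ptrdiff_t)idx[i] * scale, sizeof dst[i]);
    }
}
```

Called with a `uint16_t` array as `base` and a scale of 2, index `k` yields the pair `data[k], data[k+1]` packed into one 32-bit value, whether `k` is even or odd. The corresponding AVX2 intrinsic would be `_mm256_i32gather_epi32` with a scale argument of 2; whether it behaves the same way for odd indices is exactly the question being asked here.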
A more advanced implementation would have multiple shifters in the load unit, so that if two words are located within the same cache line, they can be loaded simultaneously. I believe eight byte-granularity log shifters are not that expensive (in terms of area), but they have somewhat higher latency than barrel shifters. That doesn't have to be much of an issue if only the second load unit supports gather and the first load unit still has minimal latency. Another approach is perhaps to have one fast barrel shifter and seven slower log shifters. That way both load units could support gather without sacrificing latency for regular loads. And then there's also the option to equip the multi-banked caches with more ports so more cache lines can be fetched simultaneously...
Anyway, there are many options and I'm very curious which one was chosen for Haswell. In any case, unaligned accesses seem to be fully supported.