Segmentation fault caused by misalignment

yuyinyang · ‎10-10-2014

I'm trying to write some assembly programs for MIC by hand, but I'm really stuck on the forced alignment.

For example, assume there's a double array whose elements start at the address addr, and it is not 64-byte aligned (it may be only 8-byte aligned). I try to load the elements of the array into a zmm register with the instruction:

vmovapd %zmm0, addr{k1}; (value in k1 is 0xFFFF, because I want to load 8 doubles per instruction)

As a result, the instruction above will cause a segmentation fault.

Is there any instruction that can move data between an unaligned memory address and the zmm registers? Or is there some other way to solve my problem (given that the address cannot be aligned)?

TimP · ‎10-10-2014

I'd start by copying the way a compiler does it, either by splitting the load or by vgather. I don't believe there's a way to move a split cache line in a single instruction.

Handling misalignment at a low level seems counter-productive to me.

jimdempseyatthecove · ‎10-10-2014

If you know the address is misaligned, you also know that the addresses immediately preceding the misaligned data, back to the beginning of the data's cache line, are guaranteed to be mapped to a page within the virtual address space. IOW, a load of addr&~0x3F will never segfault. Once loaded, you can shuffle the elements.

Jim Dempsey
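
The idea above can be sketched in portable C. This is only a rough model, not KNC code: the helper names (line_base, lane_offset, load8_via_aligned) are hypothetical, memcpy stands in for an aligned 64-byte vector load, and the final loop stands in for the register shuffle. It assumes the pointer is at least 8-byte aligned, as in the question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 64u  /* cache-line (and zmm) width in bytes */
#define LANES 8u        /* doubles per 64-byte line */

/* Start of the 64-byte line containing p, i.e. addr & ~0x3F. */
static const double *line_base(const double *p)
{
    return (const double *)((uintptr_t)p & ~(uintptr_t)0x3F);
}

/* How many doubles p sits past the start of its line (0..7). */
static size_t lane_offset(const double *p)
{
    return (size_t)((uintptr_t)p & 0x3F) / sizeof(double);
}

/* Hypothetical helper, not a real intrinsic: fetch the 8 doubles starting
 * at p using only 64-byte-aligned line reads followed by a lane shift. */
static void load8_via_aligned(const double *p, double out[LANES])
{
    assert((uintptr_t)p % sizeof(double) == 0);  /* assumed 8-byte aligned */

    const double *base = line_base(p);
    size_t off = lane_offset(p);
    double lo[LANES];
    double hi[LANES] = {0};

    memcpy(lo, base, LINE_BYTES);              /* aligned read of the first line */
    if (off != 0)
        memcpy(hi, base + LANES, LINE_BYTES);  /* second line is needed only if misaligned */

    /* The "shuffle": lanes off..7 of lo, then lanes 0..off-1 of hi. */
    for (size_t i = 0; i < LANES; i++)
        out[i] = (i + off < LANES) ? lo[i + off] : hi[i + off - LANES];
}
```

Both reads touch only cache lines that contain at least one of the requested elements, so neither can fault if the eight doubles themselves are mapped. The second line is skipped when the pointer is already aligned, because in that case it holds none of the requested data and might not be mapped.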

yuyinyang · ‎10-10-2014

So does that mean Xeon Phi has no data movement instruction for unaligned memory addresses?

For instance, vpackstorelpd can pack and store the low elements of a float64 vector to an unaligned address. Is there any insight on this?

McCalpinJohn · ‎10-11-2014

For 64-bit floating-point data the compiler generates a VLOADUNPACKLPD and a VLOADUNPACKHPD for each cache line. Similar instructions exist for the other supported data types. These are the "load" counterparts to the VPACKSTORELPD instruction you mentioned.
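
As a rough illustration of how such a low/high pair cooperates, here is a portable C model. It is a sketch under stated assumptions, not the instructions themselves: it assumes a full write mask, no up-conversion, and an 8-byte-aligned pointer, and the emu_* and load8_unaligned names are hypothetical. The "low" step fills lanes from the pointer up to the end of its cache line; the "high" step, given the address plus 64 bytes, fills the remaining lanes from the start of the next line.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Doubles from p up to (not including) the next 64-byte boundary: 1..8. */
static size_t doubles_left_in_line(const double *p)
{
    return 8 - (size_t)((uintptr_t)p & 0x3F) / sizeof(double);
}

/* Model of the "low" load: fill lanes 0..n-1 with the elements that lie
 * in the same cache line as p. Other lanes of v are left unchanged. */
static void emu_loadunpacklo_pd(double v[8], const double *p)
{
    assert((uintptr_t)p % sizeof(double) == 0);  /* assumed 8-byte aligned */
    size_t n = doubles_left_in_line(p);
    for (size_t i = 0; i < n; i++)
        v[i] = p[i];
}

/* Model of the "high" load: called with p + 8 doubles (address + 64 bytes).
 * Fills lanes n..7 with the elements that spilled into the next cache line;
 * does nothing when p itself was 64-byte aligned. */
static void emu_loadunpackhi_pd(double v[8], const double *p_plus_64)
{
    const double *p = p_plus_64 - 8;
    assert((uintptr_t)p % sizeof(double) == 0);
    size_t n = doubles_left_in_line(p);
    for (size_t i = n; i < 8; i++)
        v[i] = p[i];
}

/* The pair together behaves like one unaligned 8-double load. */
static void load8_unaligned(double v[8], const double *p)
{
    emu_loadunpacklo_pd(v, p);
    emu_loadunpackhi_pd(v, p + 8);
}
```

Each half touches a single cache line, which is why two instructions are needed when the eight doubles straddle a line boundary.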

I recommend that you spend time studying the compiler output for simple kernels to see how these are managed -- the wording in the documentation can be confusing.

The impact of using these pairs of unaligned load instructions depends primarily on where the data is located:

If all the data is in the L1 cache, then these instructions have a relatively high overhead. For example, a single vector arithmetic instruction with an aligned memory argument has to be expanded to at least three instructions (2 loads and a register-based vector arithmetic instruction), taking three times as many vector issue slots.
For data in the L2 cache the impact is smaller and sometimes negligible. With optimal code, the hardware can sustain approximately 8 loads in 25 cycles, so there are enough issue slots for 3*8=24 vector instructions as long as you are using at least 2 threads.
For data beyond the L2 the overhead of using the unaligned load instructions is generally unmeasurable, since there are so many idle cycles in which to issue the additional required instructions.
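
The L2 arithmetic above can be checked with a small sketch. The function names are hypothetical, and the issue model is an assumption not stated in the thread: a core issues at most one vector instruction per cycle, and a single thread can issue at most every other cycle.

```c
#include <assert.h>

/* Vector issue slots available on one core over `cycles` cycles.
 * Assumption: each thread issues at most every other cycle, and the
 * core as a whole issues at most one vector instruction per cycle. */
static int issue_slots(int cycles, int threads)
{
    assert(cycles >= 0 && threads >= 1);
    int per_thread = cycles / 2;
    int total = per_thread * threads;
    return total < cycles ? total : cycles;
}

/* Vector instructions needed for `loads` memory operands when each one
 * costs two unaligned-load instructions plus one arithmetic instruction. */
static int slots_needed(int loads)
{
    assert(loads >= 0);
    return 3 * loads;
}
```

Under this model, 8 loads need 24 slots. One thread gets only 12 slots in 25 cycles, while two threads get 24, which matches the "at least 2 threads" condition.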