Confusion in behavior of _mm256_load_ps and _mm256_loadu_ps intrinsics

Aketh_T_1 · ‎06-21-2018

Hi all,

I performed a quick test to understand the behavior of the _mm256_load_ps and _mm256_loadu_ps SIMD intrinsics, and the result is quite unexpected.

I am wondering if this is a bug by any chance?

When I load a register from an unaligned address with _mm256_load_ps, I expect to encounter a general-protection exception, whereas _mm256_loadu_ps should accept any address.
However, I see no exception when using the aligned load intrinsic. For instance, in the code below I would clearly expect an exception on the second iteration.

for (i = 0; i < size; i += 1)
{
        t0 = _mm256_load_ps(&a[i]);
        t1 = _mm256_load_ps(&b[i]);
        t2 = _mm256_add_ps(t0, t1);
        _mm256_store_ps(&c[i], t2);
}

This seems to be the case irrespective of whether the a, b, c arrays are aligned or unaligned.

Is there any documentation I could refer to that explains this behavior and the performance implications of such unaligned accesses?

Attached below is the full code

Thanks,

Aketh

McCalpinJohn · ‎06-21-2018

You need to check the assembly code to be sure, but I have noticed that the intrinsics for aligned accesses actually generate the version of the instruction that does not require alignment. When the accesses are actually aligned, there is no difference in performance between VMOVAPS and VMOVUPS, so the only reason to use the VMOVAPS instruction is if you want to generate an exception.
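If the goal is to have a misaligned address reported regardless of which move instruction the compiler emits, one option is an explicit check in the source. The following is only a minimal sketch: the helper names is_aligned_32 and require_aligned_32 are hypothetical, and the check relies on assert, so it aborts the program with a diagnostic (in builds without NDEBUG) rather than raising a hardware exception.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if p is a multiple of 32 bytes, which is the alignment
   VMOVAPS requires for a 256-bit memory operand. */
static inline int is_aligned_32(const void *p)
{
    return ((uintptr_t)p % 32u) == 0;
}

/* Hypothetical helper: stops the program when p is not 32-byte aligned.
   Compiled out when NDEBUG is defined. */
static inline void require_aligned_32(const void *p)
{
    assert(is_aligned_32(p));
}
```

For a float array whose base is 32-byte aligned, element 1 sits 4 bytes past the base, so is_aligned_32 returns 0 for it; only every eighth element (0, 8, 16, ...) passes the check.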

Another issue that comes up with intrinsics is that the compiler treats them as suggestions, not as inline assembly. In this case, you have specified two _mm256_load_ps operations, but the compiler probably only generates one, with the other memory reference as an input argument to the VADDPS instruction. According to my reading of the description of the VADDPS instruction in Volume 2 of the Intel Architectures Software Developer's Manual, the 256-bit version of this instruction never requires alignment on the memory operand.
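Given the points above, a version of the loop that does not depend on alignment might look like the sketch below. It is an illustration under assumptions, not the original poster's code: the function name add_floats is hypothetical, the loop advances 8 floats per iteration (one 256-bit register) instead of 1, and it falls back to a plain scalar loop when the compiler is not targeting AVX (for example, when -mavx is not passed to GCC or Clang).

```c
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Hypothetical helper: c[i] = a[i] + b[i] for i in [0, n).
   The pointers may have any alignment. */
void add_floats(const float *a, const float *b, float *c, size_t n)
{
    size_t i = 0;
#if defined(__AVX__)
    /* Process 8 floats (32 bytes) per iteration. The unaligned
       intrinsics accept any address. */
    for (; i + 8 <= n; i += 8) {
        __m256 t0 = _mm256_loadu_ps(&a[i]);
        __m256 t1 = _mm256_loadu_ps(&b[i]);
        __m256 t2 = _mm256_add_ps(t0, t1);
        _mm256_storeu_ps(&c[i], t2);
    }
#endif
    /* Scalar tail for the leftover elements, or the whole range when
       AVX is not enabled at compile time. */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```

Inspecting the generated assembly (for example with the compiler's -S option) is still the way to confirm which move instructions were actually emitted and whether a load was folded into VADDPS.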