Solved: Aligned loads + shift vs. unaligned loads vs. vgather

Mark_D_9 · 05-18-2016

What would you recommend as the best approach for this stencil on a Xeon Phi? The idea is:

Given a 1-D array A: [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18]

And three scalars: probUp, probMid, probDown

You need to compute a 1-D array B where:

B[0] = probDown*A[0] + probMid*A[1] + probUp*A[2]

B[1] = probDown*A[1] + probMid*A[2] + probUp*A[3]

B[2] = probDown*A[2] + probMid*A[3] + probUp*A[4]

etc…

Everything is double precision.
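For reference, the stencil above can be written as a plain scalar loop. This is only a minimal sketch for checking vectorized versions against; the function name `stencil_ref` and the convention that `n` is the length of A (so B gets `n - 2` elements) are assumptions, not part of the original post.

```c
#include <stddef.h>

/* Scalar reference for the stencil:
 *   B[i] = probDown*A[i] + probMid*A[i+1] + probUp*A[i+2]
 * Assumption: n is the number of elements in A, and B has room
 * for n - 2 results. Does nothing when n < 3. */
void stencil_ref(const double *A, double *B, size_t n,
                 double probDown, double probMid, double probUp)
{
    for (size_t i = 0; i + 2 < n; ++i) {
        B[i] = probDown * A[i] + probMid * A[i + 1] + probUp * A[i + 2];
    }
}
```

With A = [0 1 2 3 4] and (probDown, probMid, probUp) = (1, 2, 3), this gives B = [8 14 20].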

The approach I’m going with is to load three vector registers, a0, a1, and a2 (where a0 is aligned to the loop iterator, a1 is shifted by 1 element, and a2 is shifted by 2 elements), do muls and fmadds to smoosh them together with the scalars, and then store.

The only question I have is what Intel would say is the most efficient way to load the a* vector registers from memory. I’m pretty sure it’s best to do two loads and some shifts (as I’m doing below) as opposed to doing unaligned loads or using gather intrinsics, but I’d be curious if Intel has a different opinion.
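The "two aligned loads plus shifts" idea might look roughly like the sketch below. This is not the poster's original code: it assumes the AVX-512F intrinsics (`_mm512_load_pd`, `_mm512_alignr_epi64`, `_mm512_fmadd_pd`), and other Xeon Phi generations may need a different alignr variant. The function name `stencil_valign` and the padding and alignment requirements are assumptions made for illustration. When AVX-512F is not enabled at compile time, a plain-C fallback models the same data movement so the sketch still builds and runs.

```c
#include <stddef.h>
#include <string.h>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

#define VLEN 8 /* doubles per 512-bit register */

/* Assumptions: A and B are 64-byte aligned; A holds at least
 * (nvec + 1) * VLEN readable doubles (padded); B holds nvec * VLEN. */
void stencil_valign(const double *A, double *B, size_t nvec,
                    double probDown, double probMid, double probUp)
{
    if (nvec == 0) return;
#if defined(__AVX512F__)
    const __m512d vDown = _mm512_set1_pd(probDown);
    const __m512d vMid  = _mm512_set1_pd(probMid);
    const __m512d vUp   = _mm512_set1_pd(probUp);

    /* First aligned load; later iterations reuse the previous "hi". */
    __m512i lo = _mm512_castpd_si512(_mm512_load_pd(A));
    for (size_t k = 0; k < nvec; ++k) {
        /* One new aligned load per iteration. */
        __m512i hi = _mm512_castpd_si512(_mm512_load_pd(A + (k + 1) * VLEN));

        /* alignr(hi, lo, s) yields elements s..s+7 of the pair lo:hi. */
        __m512d a0 = _mm512_castsi512_pd(lo);
        __m512d a1 = _mm512_castsi512_pd(_mm512_alignr_epi64(hi, lo, 1));
        __m512d a2 = _mm512_castsi512_pd(_mm512_alignr_epi64(hi, lo, 2));

        __m512d b = _mm512_mul_pd(vDown, a0);
        b = _mm512_fmadd_pd(vMid, a1, b);
        b = _mm512_fmadd_pd(vUp, a2, b);
        _mm512_store_pd(B + k * VLEN, b);

        lo = hi; /* carry the load over to the next iteration */
    }
#else
    /* Portable model of the same scheme: keep two "registers" and
     * read shifted windows out of their concatenation. */
    double cat[2 * VLEN];
    memcpy(cat, A, VLEN * sizeof(double));
    for (size_t k = 0; k < nvec; ++k) {
        memcpy(cat + VLEN, A + (k + 1) * VLEN, VLEN * sizeof(double));
        for (size_t j = 0; j < VLEN; ++j) {
            B[k * VLEN + j] = probDown * cat[j] + probMid * cat[j + 1]
                            + probUp * cat[j + 2];
        }
        memcpy(cat, cat + VLEN, VLEN * sizeof(double));
    }
#endif
}
```

Carrying `hi` into the next iteration as `lo` means each block of eight results costs one aligned load instead of two. Handling a tail that is not a multiple of eight elements is left out of the sketch.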

Loc_N_Intel · 05-23-2016

Hi Mark,

I forwarded your question to the intrinsics experts. Here are their opinions:

1. They said you are correct that “on Xeon Phi it’s better to do loads with subsequent shuffles than doing unaligned loads or using gather intrinsics” and that “It is better to use aligned loads + valign for stencil accesses”.

2. They believed your intrinsics code is the best implementation of the desired functionality (from the performance point of view).

3. They also noted that the upcoming Intel compiler release, due later this year and currently in Beta testing (as Intel Parallel Studio 2017 Beta), is able to make such transformations as well, but it currently cannot pipeline the loads from one iteration to another (this is what they saw in your code).

I hope this helps. Thank you.
