
Aligned loads + shift vs. unaligned loads vs. vgather

Mark_D_9
What would you recommend as the best approach for this stencil on a Xeon Phi? The idea is:
 
Given a 1-D array A: [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18] 
 
And three scalars: probUp, probMid, probDown
 
You need to compute a 1-D array B where:
 
B[0] = probDown*A[0] + probMid*A[1] + probUp*A[2]
B[1] = probDown*A[1] + probMid*A[2] + probUp*A[3]
B[2] = probDown*A[2] + probMid*A[3] + probUp*A[4]
etc…
 
Everything is double precision.
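
For reference, the recurrence above can be written as a plain scalar loop. This is only a baseline sketch for checking results, not the vectorized code under discussion; the function name `stencil_ref` and the parameter `n_out` (number of B elements to produce) are assumptions made for illustration, and A is assumed to hold at least `n_out + 2` elements.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference for the three-point stencil:
   b[i] = probDown*a[i] + probMid*a[i+1] + probUp*a[i+2].
   Assumption: a holds at least n_out + 2 doubles, b holds n_out doubles. */
void stencil_ref(const double *a, double *b, size_t n_out,
                 double probDown, double probMid, double probUp)
{
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n_out; ++i) {
        b[i] = probDown * a[i] + probMid * a[i + 1] + probUp * a[i + 2];
    }
}
```

With the 19-element A shown above, `n_out` would be 17.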
 
The approach I’m going with is to load three vector registers, a0, a1, and a2 (where a0 is aligned to the loop iterator, a1 is shifted by 1 element, and a2 is shifted by 2 elements), do muls and fmadds to smoosh them together with the scalars, and then store.
 
The only question I have is which way of getting the a* vector registers loaded from memory is the most efficient. I’m pretty sure it’s best to do two loads and some shifts (as I’m doing below) as opposed to doing unaligned loads or using gather intrinsics, but I’d be curious if Intel has a different opinion.
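
The "two loads and some shifts" idea can be illustrated in portable C. This is not the poster's intrinsics code; it is a hedged emulation in which the type `vec8d` and the helpers `load8`, `align_right`, and `stencil_valign` are hypothetical stand-ins for a 512-bit register, an aligned load, and an align/shift instruction. It assumes `n_out` is a multiple of 8 and that A is padded so that `n_out + 8` doubles are readable.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define VLEN 8  /* doubles per 512-bit vector */

/* Hypothetical stand-in for a 512-bit vector register of doubles. */
typedef struct { double lane[VLEN]; } vec8d;

/* Stand-in for an aligned vector load of VLEN doubles. */
static vec8d load8(const double *p)
{
    vec8d v;
    memcpy(v.lane, p, sizeof v.lane);
    return v;
}

/* Stand-in for an align/shift instruction: treat hi:lo as 16 lanes
   and return the 8 lanes starting at lane `shift` of lo. */
static vec8d align_right(vec8d hi, vec8d lo, int shift)
{
    vec8d r;
    for (int k = 0; k < VLEN; ++k) {
        int src = k + shift;
        r.lane[k] = (src < VLEN) ? lo.lane[src] : hi.lane[src - VLEN];
    }
    return r;
}

/* Assumptions: n_out % VLEN == 0; a holds n_out + VLEN readable doubles. */
void stencil_valign(const double *a, double *b, size_t n_out,
                    double probDown, double probMid, double probUp)
{
    assert(n_out % VLEN == 0);
    if (n_out == 0) return;

    vec8d lo = load8(a);                       /* a0 for the first block */
    for (size_t i = 0; i < n_out; i += VLEN) {
        vec8d hi = load8(a + i + VLEN);        /* the only new load per iteration */
        vec8d a1 = align_right(hi, lo, 1);     /* a[i+1 .. i+8] */
        vec8d a2 = align_right(hi, lo, 2);     /* a[i+2 .. i+9] */
        for (int k = 0; k < VLEN; ++k) {
            b[i + k] = probDown * lo.lane[k]
                     + probMid  * a1.lane[k]
                     + probUp   * a2.lane[k];
        }
        lo = hi;                               /* reuse this load as the next a0 */
    }
}
```

Carrying `hi` over as the next iteration's `lo` means each block of A is loaded once; the shifted operands are built in registers rather than fetched with unaligned loads or gathers.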
 
1 Solution
Loc_N_Intel
Employee

Hi Mark,

I forwarded your question to the intrinsics experts; here are their opinions:

1. They said you are correct that “on Xeon Phi it’s better to do loads with subsequent shuffles than doing unaligned loads or using gather intrinsics” and that “it is better to use aligned loads + valign for stencil accesses”.

2. They believe your intrinsics code is the best implementation of the desired functionality from a performance point of view.

3. They also noted that the upcoming Intel compiler release later this year, now in Beta testing (as Intel Parallel Studio 2017 Beta), is able to make such transformations as well, but it currently cannot pipeline the loads from one iteration to another (this is what they saw in your code).

I hope this helps. Thank you.
