Hi Mark,
I forwarded your question to the intrinsics experts. Here are their opinions:
1. They said you are correct that "on Xeon Phi it's better to do loads with subsequent shuffles than doing unaligned loads or using gather intrinsics" and that "it is better to use aligned loads + valign for stencil accesses".
2. They believe your intrinsics code is the best implementation of the desired functionality from a performance point of view.
3. They also noted that the upcoming Intel compiler release due later this year, now in beta testing as Intel Parallel Studio 2017 Beta, is able to make such transformations as well. However, it currently cannot pipeline the loads from one iteration to the next, which is what they saw your code doing.
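For anyone reading along, here is a rough scalar model in plain C of what "aligned loads + valign" with the loads carried across iterations might look like for a 3-point stencil. This is only a sketch and not Mark's code: `vec16`, `load_aligned`, `valign` and `stencil3` are names I made up for illustration, and the 3-point sum is an assumed example stencil. In real code the helpers would presumably map to `_mm512_load_ps` and `_mm512_alignr_epi32`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define VLEN 16 /* floats in one 512-bit vector */

/* Hypothetical stand-in for a 512-bit register holding 16 floats. */
typedef struct { float lane[VLEN]; } vec16;

/* Models an aligned vector load. Callers only pass offsets that are
   multiples of VLEN; the model itself does not check pointer alignment. */
static vec16 load_aligned(const float *p) {
    vec16 v;
    memcpy(v.lane, p, sizeof v.lane);
    return v;
}

/* Models valign: view hi:lo as one 32-lane vector (lo in the low lanes),
   shift right by `shift` lanes, and keep the low 16 lanes. */
static vec16 valign(vec16 hi, vec16 lo, int shift) {
    assert(shift >= 0 && shift <= VLEN);
    vec16 r;
    for (int i = 0; i < VLEN; i++) {
        int j = i + shift;
        r.lane[i] = (j < VLEN) ? lo.lane[j] : hi.lane[j - VLEN];
    }
    return r;
}

/* Assumed example stencil: out[i] = in[i-1] + in[i] + in[i+1] for
   i in [VLEN, n - VLEN). The first and last blocks are left untouched. */
static void stencil3(const float *in, float *out, size_t n) {
    assert(n % VLEN == 0 && n >= 2 * VLEN);
    vec16 prev = load_aligned(in);
    vec16 curr = load_aligned(in + VLEN);
    for (size_t b = VLEN; b + VLEN < n; b += VLEN) {
        /* Only one new aligned load per iteration. */
        vec16 next = load_aligned(in + b + VLEN);
        vec16 left = valign(curr, prev, VLEN - 1); /* in[b-1 .. b+14] */
        vec16 right = valign(next, curr, 1);       /* in[b+1 .. b+16] */
        for (int i = 0; i < VLEN; i++)
            out[b + i] = left.lane[i] + curr.lane[i] + right.lane[i];
        /* Carry the loaded blocks into the next iteration. */
        prev = curr;
        curr = next;
    }
}
```

The point of the structure is that each input block is loaded exactly once, with an aligned load, and the shifted neighbours are built from registers rather than from unaligned loads or gathers.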
I hope this helps. Thank you.