Intel® ISA Extensions

SSE4 Register-Handling

adrian_s_3
Beginner

I'm working on a stereo algorithm that computes a disparity map, so I need to calculate a lot of SAD (sum of absolute differences) values.

To improve performance I want to use SSE4, especially the "_mm_mpsadbw_epu8" intrinsic.

I came across this Intel document. Section F, "Intel® SSE4 – Optimized Function for 16x16 Blocks", contains a SAD calculation example for a 16x16 block. I used this snippet in my code and performance improved a lot, but it is still not enough. Is it possible to boost performance further by using all 16 SSE registers instead of 8, or is there some kind of constraint?

Best Regards

Jambalaja
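
For readers unfamiliar with the intrinsic mentioned in the question, here is a minimal sketch of what a single "_mm_mpsadbw_epu8" call computes: eight SADs of one 4-byte reference block against eight consecutive byte offsets in the source. This is not the 16x16 routine from the Intel document. The names sad4_scalar, sad4_x8_sse41 and TARGET_SSE41 are hypothetical and exist only for illustration, and the sketch assumes an x86 CPU with SSE4.1 and at least 16 readable bytes behind each pointer.

```c
#include <smmintrin.h>  /* SSE4.1 intrinsics, including _mm_mpsadbw_epu8 */
#include <stdint.h>
#include <stdlib.h>

/* Assumption: GCC/Clang need the target attribute when the file is not
   built with -msse4.1; other compilers (e.g. MSVC) do not need it. */
#if defined(__GNUC__)
#define TARGET_SSE41 __attribute__((target("sse4.1")))
#else
#define TARGET_SSE41
#endif

/* Scalar reference: SAD between src[0..3] and ref[0..3]. */
static unsigned sad4_scalar(const uint8_t *src, const uint8_t *ref)
{
    unsigned sum = 0;
    for (int k = 0; k < 4; ++k)
        sum += (unsigned)abs((int)src[k] - (int)ref[k]);
    return sum;
}

/* Computes out[i] = SAD(src[i..i+3], ref[0..3]) for i = 0..7 with one
   MPSADBW. Both src and ref must point to at least 16 readable bytes,
   because full 128-bit unaligned loads are used. */
TARGET_SSE41
static void sad4_x8_sse41(const uint8_t *src, const uint8_t *ref,
                          uint16_t out[8])
{
    __m128i s = _mm_loadu_si128((const __m128i *)src);
    __m128i r = _mm_loadu_si128((const __m128i *)ref);
    /* imm8 = 0: source window starts at byte 0, reference block is bytes 0..3 */
    __m128i sums = _mm_mpsadbw_epu8(s, r, 0);
    _mm_storeu_si128((__m128i *)out, sums);
}
```

In a disparity search, the eight results would correspond to eight neighbouring candidate positions for the same 4-pixel reference; wider blocks are built by repeating the call with other immediate values and rows and adding the results.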

21 Replies
SergeyKostrov
Valued Contributor II
By the way, [ UnRolled - 8-in-1 ] is ~75% faster than [ Rolled - 1-in-1 ].