Intel® ISA Extensions

SSE4 Register-Handling

adrian_s_3
Beginner

I'm working on a stereo algorithm that computes a disparity map, so I need to calculate a lot of SAD (sum of absolute differences) values.

To improve performance I want to use SSE4, especially the "_mm_mpsadbw_epu8" intrinsic (the MPSADBW instruction).

I stumbled over this Intel document. Section F, "Intel® SSE4 – Optimized Function for 16x16 Blocks", contains a SAD calculation example for a 16x16 block. I used this snippet in my code and the performance improved a lot, but it is still not enough. Is it possible to boost the performance further by using all 16 SSE (XMM) registers instead of 8, or is there some kind of constraint?
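
For readers who have not used the intrinsic, here is a minimal sketch of what a pair of "_mm_mpsadbw_epu8" calls computes: the SADs of one 8-pixel reference row against eight consecutive horizontal offsets in the search row, which is the inner step of a disparity search. This is not the code from the Intel document. The function names are hypothetical, and the `target("sse4.1")` attribute assumes GCC or Clang (other compilers would need their own way of enabling SSE4.1).

```c
#include <stdint.h>
#include <stdlib.h>
#include <smmintrin.h>  /* SSE4.1 intrinsics, including _mm_mpsadbw_epu8 */

/* Scalar reference (hypothetical helper):
 * out[d] = sum over k = 0..7 of |search[d + k] - ref[k]|, for d = 0..7.
 * Reads search[0..14] and ref[0..7]. */
static void sad8_offsets_scalar(const uint8_t *search, const uint8_t *ref,
                                uint16_t out[8])
{
    for (int d = 0; d < 8; ++d) {
        unsigned sum = 0;
        for (int k = 0; k < 8; ++k)
            sum += (unsigned)abs((int)search[d + k] - (int)ref[k]);
        out[d] = (uint16_t)sum;
    }
}

/* Same result with two MPSADBW operations (hypothetical helper).
 * Loads 16 bytes from search (only the first 15 are used) and 8 from ref. */
__attribute__((target("sse4.1")))
static void sad8_offsets_sse41(const uint8_t *search, const uint8_t *ref,
                               uint16_t out[8])
{
    __m128i s = _mm_loadu_si128((const __m128i *)search);
    __m128i r = _mm_loadl_epi64((const __m128i *)ref);

    /* imm8 = 0: search bytes d..d+3 against ref bytes 0..3, d = 0..7 */
    __m128i lo = _mm_mpsadbw_epu8(s, r, 0);
    /* imm8 = 5: bit 2 shifts the search window by 4 bytes,
     * bits 1:0 = 1 select ref bytes 4..7 */
    __m128i hi = _mm_mpsadbw_epu8(s, r, 5);

    /* Eight 16-bit SADs, one per horizontal offset */
    _mm_storeu_si128((__m128i *)out, _mm_add_epi16(lo, hi));
}
```

Accumulating `out` over the rows of a block with `_mm_add_epi16` gives the block SAD for eight candidate disparities at once. A 16-pixel-wide block would need a second 16-byte load per row for the right half.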

Best Regards

Jambalaja

21 Replies
SergeyKostrov
Valued Contributor II
By the way, [ UnRolled - 8-in-1 ] is ~75% faster than [ Rolled - 1-in-1 ].