adrian_s_3 · ‎07-10-2013

I'm working on a stereo algorithm that computes a disparity map, so I need to calculate a lot of SAD (sum of absolute differences) values.
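For reference, the SAD between two pixel blocks can be sketched in plain C as below. This is only an illustrative scalar baseline, not code from the thread; the function name `sad_block` and the assumption that both blocks share one `stride` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences between two w x h blocks of 8-bit pixels.
   `stride` is the distance in bytes between consecutive rows
   (assumed here to be the same for both blocks). */
uint32_t sad_block(const uint8_t *a, const uint8_t *b,
                   int w, int h, int stride)
{
    assert(w > 0 && h > 0 && stride >= w);
    uint32_t sum = 0;
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x)
            sum += (uint32_t)abs((int)a[y * stride + x] -
                                 (int)b[y * stride + x]);
    }
    return sum;
}
```

A stereo matcher would typically evaluate such a sum once per pixel and per candidate disparity, which is why the inner loop dominates the run time.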

To improve performance I want to use SSE4, especially the "_mm_mpsadbw_epu8" intrinsic (the MPSADBW instruction).

I came across this Intel document. Section F, "Intel® SSE4 – Optimized Function for 16x16 Blocks", contains an example SAD calculation for a 16x16 block. I used this snippet in my code and the performance improved a lot, but it is still not enough. Is it possible to boost performance further by using all 16 SSE registers instead of 8, or is there some kind of constraint?

Best Regards

Jambalaja
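To make the question above concrete, here is a minimal sketch of a 16x16 SAD kernel built on `_mm_mpsadbw_epu8`. It is not the snippet from the Intel document; the function name `sad16x16_x8`, its parameters, and the scalar fallback for builds without SSE4.1 are assumptions made for illustration. It computes the SAD of one reference block against eight candidate blocks at horizontal offsets 0 to 7, which is roughly the access pattern of a disparity search.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#if defined(__SSE4_1__)
#include <smmintrin.h>  /* SSE4.1 intrinsics, including _mm_mpsadbw_epu8 */
#endif

/* SADs of one 16x16 reference block against 8 candidate blocks whose left
   edges are at search+0 .. search+7. Every search row must have at least
   24 readable bytes. out[d] receives the SAD for horizontal offset d.
   The maximum possible SAD is 16*16*255 = 65280, so 16 bits suffice. */
void sad16x16_x8(const uint8_t *ref, int ref_stride,
                 const uint8_t *search, int search_stride,
                 uint16_t out[8])
{
    assert(ref_stride >= 16 && search_stride >= 24);
#if defined(__SSE4_1__)
    __m128i acc = _mm_setzero_si128();
    for (int y = 0; y < 16; ++y) {
        const uint8_t *rrow = ref + y * ref_stride;
        const uint8_t *srow = search + y * search_stride;
        __m128i r  = _mm_loadu_si128((const __m128i *)rrow);
        __m128i s0 = _mm_loadu_si128((const __m128i *)srow);        /* bytes 0..15 */
        __m128i s1 = _mm_loadu_si128((const __m128i *)(srow + 8));  /* bytes 8..23 */
        /* imm8 bit 2 selects a 0- or 4-byte offset into the first operand,
           imm8 bits 1:0 select which 4-byte group of the second operand. */
        acc = _mm_add_epi16(acc, _mm_mpsadbw_epu8(s0, r, 0)); /* ref[0..3]   */
        acc = _mm_add_epi16(acc, _mm_mpsadbw_epu8(s0, r, 5)); /* ref[4..7]   */
        acc = _mm_add_epi16(acc, _mm_mpsadbw_epu8(s1, r, 2)); /* ref[8..11]  */
        acc = _mm_add_epi16(acc, _mm_mpsadbw_epu8(s1, r, 7)); /* ref[12..15] */
    }
    _mm_storeu_si128((__m128i *)out, acc);
#else
    /* Scalar fallback so the sketch still builds without SSE4.1. */
    for (int d = 0; d < 8; ++d) {
        unsigned sum = 0;
        for (int y = 0; y < 16; ++y)
            for (int x = 0; x < 16; ++x)
                sum += (unsigned)abs((int)ref[y * ref_stride + x] -
                                     (int)search[y * search_stride + x + d]);
        out[d] = (uint16_t)sum;
    }
#endif
}
```

With intrinsics the compiler assigns the XMM registers itself, so the source code does not choose between 8 and 16 of them. The registers xmm8 to xmm15 exist only in 64-bit mode, so a 32-bit build is limited to xmm0 to xmm7 regardless of how the code is written.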

SergeyKostrov · ‎07-18-2013

By the way, [ UnRolled - 8-in-1 ] is ~75% faster than [ Rolled - 1-in-1 ].
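The code behind these two labels is not shown on this page. Assuming they refer to ordinary loop unrolling, the difference could be sketched as follows, with one absolute difference per iteration in the rolled version and eight in the unrolled one. The function names and the use of two partial sums are hypothetical, and the sketch makes no claim about reproducing the ~75% figure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Rolled ("1-in-1"): one absolute difference per loop iteration. */
uint32_t sad_rolled(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += (uint32_t)abs((int)a[i] - (int)b[i]);
    return sum;
}

/* Unrolled ("8-in-1"): eight absolute differences per loop iteration,
   split over two partial sums. n must be a multiple of 8. */
uint32_t sad_unrolled8(const uint8_t *a, const uint8_t *b, size_t n)
{
    assert(n % 8 == 0);
    uint32_t s0 = 0, s1 = 0;
    for (size_t i = 0; i < n; i += 8) {
        s0 += (uint32_t)abs((int)a[i + 0] - (int)b[i + 0]);
        s1 += (uint32_t)abs((int)a[i + 1] - (int)b[i + 1]);
        s0 += (uint32_t)abs((int)a[i + 2] - (int)b[i + 2]);
        s1 += (uint32_t)abs((int)a[i + 3] - (int)b[i + 3]);
        s0 += (uint32_t)abs((int)a[i + 4] - (int)b[i + 4]);
        s1 += (uint32_t)abs((int)a[i + 5] - (int)b[i + 5]);
        s0 += (uint32_t)abs((int)a[i + 6] - (int)b[i + 6]);
        s1 += (uint32_t)abs((int)a[i + 7] - (int)b[i + 7]);
    }
    return s0 + s1;
}
```

Both functions return the same value for any input whose length is a multiple of 8. Whether the unrolled form is faster depends on the compiler and its optimization settings, since many compilers unroll or vectorize the rolled loop on their own.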

SSE4 Register-Handling