I'm working on a stereo algorithm that computes a disparity map, so I need to calculate a large number of SAD (sum of absolute differences) values.
To improve performance I want to use SSE4, in particular the "_mm_mpsadbw_epu8" intrinsic.
I came across this Intel document. Section F, "Intel® SSE4 – Optimized Function for 16x16 Blocks", contains an example SAD calculation for a 16x16 block. I used this snippet in my code and performance improved a lot, but it is still not enough. Is it possible to boost performance further by using all 16 SSE registers instead of 8, or is there some kind of constraint?
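For readers without the Intel document at hand, here is a minimal sketch of the kind of 16x16 SAD kernel being discussed. It is not the snippet from Section F. The function names, the stride parameters, and the choice to evaluate 8 horizontally adjacent candidate positions at once are assumptions made for illustration. A scalar reference is included so the intrinsic version can be checked against it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <smmintrin.h>  /* SSE4.1: _mm_mpsadbw_epu8 */

/* Assumption: on GCC/Clang, enable SSE4.1 for this one function so the file
   builds without -msse4.1. Other compilers get an empty macro. */
#if defined(__GNUC__) || defined(__clang__)
#define SAD_TARGET_SSE41 __attribute__((target("sse4.1")))
#else
#define SAD_TARGET_SSE41
#endif

/* Scalar reference: out[d] = SAD of the 16x16 block at cur against the
   16x16 block at ref shifted right by d pixels, for d = 0..7. */
void sad16x16_x8_scalar(const uint8_t *cur, int cur_stride,
                        const uint8_t *ref, int ref_stride,
                        uint16_t out[8])
{
    for (int d = 0; d < 8; ++d) {
        unsigned sum = 0;
        for (int y = 0; y < 16; ++y)
            for (int x = 0; x < 16; ++x)
                sum += (unsigned)abs((int)cur[y * cur_stride + x] -
                                     (int)ref[y * ref_stride + x + d]);
        out[d] = (uint16_t)sum;  /* at most 16*16*255 = 65280, fits in 16 bits */
    }
}

/* Hypothetical SSE4.1 version. Each ref row must have 24 readable bytes. */
SAD_TARGET_SSE41
void sad16x16_x8_sse41(const uint8_t *cur, int cur_stride,
                       const uint8_t *ref, int ref_stride,
                       uint16_t out[8])
{
    assert(cur_stride >= 16 && ref_stride >= 24);

    /* Two independent accumulators, each holding eight 16-bit partial sums. */
    __m128i acc0 = _mm_setzero_si128();
    __m128i acc1 = _mm_setzero_si128();

    for (int y = 0; y < 16; ++y) {
        const __m128i c  = _mm_loadu_si128((const __m128i *)(cur + y * cur_stride));
        const __m128i r0 = _mm_loadu_si128((const __m128i *)(ref + y * ref_stride));
        const __m128i r1 = _mm_loadu_si128((const __m128i *)(ref + y * ref_stride + 8));

        /* imm8 bits 1:0 pick the 4-byte group of c, bit 2 adds 4 to the
           start of the sliding window in the first operand. */
        acc0 = _mm_add_epi16(acc0, _mm_mpsadbw_epu8(r0, c, 0)); /* cur columns 0..3   */
        acc0 = _mm_add_epi16(acc0, _mm_mpsadbw_epu8(r0, c, 5)); /* cur columns 4..7   */
        acc1 = _mm_add_epi16(acc1, _mm_mpsadbw_epu8(r1, c, 2)); /* cur columns 8..11  */
        acc1 = _mm_add_epi16(acc1, _mm_mpsadbw_epu8(r1, c, 7)); /* cur columns 12..15 */
    }

    _mm_storeu_si128((__m128i *)out, _mm_add_epi16(acc0, acc1));
}
```

In a disparity search, out[0] through out[7] would correspond to 8 neighbouring candidate disparities for one block. A wider search range would call the function again with ref advanced by 8 pixels.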
Best Regards
Jambalaja
21 Replies
By the way, [ UnRolled - 8-in-1 ] is ~75% faster than [ Rolled - 1-in-1 ].
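The code behind this measurement is not shown in the thread. As a rough illustration of what "rolled" versus "8-in-1 unrolled" can mean, here is a plain C sketch of a SAD loop in both forms. The function names and the use of four separate accumulators are assumptions, and the sketch makes no claim about reproducing the ~75% figure.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Rolled (1-in-1): one absolute difference per loop iteration. */
unsigned sad_rolled(const uint8_t *a, const uint8_t *b, size_t n)
{
    unsigned sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += (unsigned)abs((int)a[i] - (int)b[i]);
    return sum;
}

/* Unrolled (8-in-1): eight absolute differences per loop iteration,
   spread over four accumulators to shorten the dependency chain. */
unsigned sad_unrolled8(const uint8_t *a, const uint8_t *b, size_t n)
{
    unsigned s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    for (; i + 8 <= n; i += 8) {
        s0 += (unsigned)abs((int)a[i + 0] - (int)b[i + 0]);
        s1 += (unsigned)abs((int)a[i + 1] - (int)b[i + 1]);
        s2 += (unsigned)abs((int)a[i + 2] - (int)b[i + 2]);
        s3 += (unsigned)abs((int)a[i + 3] - (int)b[i + 3]);
        s0 += (unsigned)abs((int)a[i + 4] - (int)b[i + 4]);
        s1 += (unsigned)abs((int)a[i + 5] - (int)b[i + 5]);
        s2 += (unsigned)abs((int)a[i + 6] - (int)b[i + 6]);
        s3 += (unsigned)abs((int)a[i + 7] - (int)b[i + 7]);
    }

    /* Remainder loop for lengths that are not a multiple of 8. */
    for (; i < n; ++i)
        s0 += (unsigned)abs((int)a[i] - (int)b[i]);

    return s0 + s1 + s2 + s3;
}
```

Both functions return the same value for any input. Any speed difference depends on the compiler, the optimization flags, and the CPU, so it should be measured rather than assumed.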