missing__zlw
Beginner
79 Views

SSE4 instructions are not faster?

I am seeing performance issues with some SSE4 intrinsics, such as _mm_dp_ps and _mm_blend_ps.

The problem is that when I move from my old combination of _mm_add_ps, _mm_mul_ps and _mm_shuffle_ps to _mm_dp_ps and _mm_blend_ps, the overall performance is about the same, or slightly worse.

This happens even though the total instruction count has gone down in my example. Before, I had 18 multiplies, 13 adds, 1 movehl and 1 shuffle. Now I have 14 dot products, 3 adds, 6 blends and 2 shuffles.
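To make the comparison concrete, here is a minimal sketch of the two styles for a single 4-element dot product. The poster's actual kernel is not shown in the thread, so the function names `dot4_mul_add` and `dot4_dpps` are hypothetical, and the `target("sse4.1")` attribute assumes GCC or Clang. A lower instruction count does not by itself imply fewer cycles: instruction tables for these microarchitectures generally list dpps as several micro-ops with a latency well above that of a single mulps or addps, so replacing a multiply plus a few adds with one dot product may not save execution resources.

```c
#include <smmintrin.h>  /* SSE4.1 (_mm_dp_ps); also pulls in the earlier SSE headers */

/* Older style: one multiply followed by a shuffle-based horizontal add. */
static float dot4_mul_add(__m128 a, __m128 b)
{
    __m128 prod  = _mm_mul_ps(a, b);                 /* p0 p1 p2 p3 */
    __m128 hi    = _mm_movehl_ps(prod, prod);        /* p2 p3 p2 p3 */
    __m128 sum2  = _mm_add_ps(prod, hi);             /* p0+p2, p1+p3, ... */
    __m128 lane1 = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
    return _mm_cvtss_f32(_mm_add_ss(sum2, lane1));   /* (p0+p2) + (p1+p3) */
}

/* SSE4.1 style: a single dot-product instruction.
 * Mask 0xF1: the high nibble multiplies all four lanes,
 * the low nibble writes the sum to lane 0 only.
 * The target attribute (GCC/Clang, an assumption here) lets this
 * compile without passing -msse4.1 for the whole file. */
__attribute__((target("sse4.1")))
static float dot4_dpps(__m128 a, __m128 b)
{
    return _mm_cvtss_f32(_mm_dp_ps(a, b, 0xF1));
}
```

Both functions return the same value for the same inputs, so timing them in the real loop (rather than counting instructions) is the way to see which one the hardware actually prefers.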

This is done in a heavily repeated loop, and the instruction counts are per iteration. I accumulate 6 values, so I need to use two registers.

The tests were run on Linux, on SNB (Sandy Bridge), NHLM (Nehalem) and Westmere machines. They all show the same behavior.
Any help? Thanks.
2 Replies
jimdempseyatthecove
Black Belt

Examine your code to see how the ports are utilized. Intel has a tool that attempts to do this. Although your new code "uses two registers", it may run faster using more registers, as you can often do additional work during latencies.
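The "more registers" suggestion can be sketched as follows. This is only an illustration under stated assumptions, not the poster's loop: the function names `dot_1acc`, `dot_4acc` and `hsum_ps` are hypothetical, and `n` is assumed to be a multiple of 16. With one accumulator, every add has to wait for the previous add to finish. With four independent accumulators, the adds form four separate dependency chains, so the CPU can overlap them while earlier results are still in flight.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_mul_ps, _mm_add_ps, _mm_loadu_ps */

/* Horizontal sum of the four lanes of v. */
static float hsum_ps(__m128 v)
{
    __m128 hi    = _mm_movehl_ps(v, v);
    __m128 sum2  = _mm_add_ps(v, hi);
    __m128 lane1 = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
    return _mm_cvtss_f32(_mm_add_ss(sum2, lane1));
}

/* One accumulator: each add depends on the result of the previous add. */
static float dot_1acc(const float *a, const float *b, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
    return hsum_ps(acc);
}

/* Four accumulators: four independent add chains per iteration.
 * Assumes n is a multiple of 16 (no tail handling in this sketch). */
static float dot_4acc(const float *a, const float *b, size_t n)
{
    __m128 acc0 = _mm_setzero_ps(), acc1 = _mm_setzero_ps();
    __m128 acc2 = _mm_setzero_ps(), acc3 = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 16) {
        acc0 = _mm_add_ps(acc0, _mm_mul_ps(_mm_loadu_ps(a + i),      _mm_loadu_ps(b + i)));
        acc1 = _mm_add_ps(acc1, _mm_mul_ps(_mm_loadu_ps(a + i + 4),  _mm_loadu_ps(b + i + 4)));
        acc2 = _mm_add_ps(acc2, _mm_mul_ps(_mm_loadu_ps(a + i + 8),  _mm_loadu_ps(b + i + 8)));
        acc3 = _mm_add_ps(acc3, _mm_mul_ps(_mm_loadu_ps(a + i + 12), _mm_loadu_ps(b + i + 12)));
    }
    /* Combine the partial sums only once, after the loop. */
    return hsum_ps(_mm_add_ps(_mm_add_ps(acc0, acc1), _mm_add_ps(acc2, acc3)));
}
```

Note that the two versions add the terms in a different order, so with general floating-point data the results can differ in the last bits. Whether the four-accumulator form is actually faster depends on the loop and the machine, so it needs to be measured.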

Jim Dempsey
missing__zlw
Beginner

Thanks. Could you let me know the name of this Intel tool?