Intel® C++ Compiler
Support and discussions for creating C++ code that runs on platforms based on Intel® processors.

Performance of SSE4 instructions is not faster?

missing__zlw
Beginner
I am seeing performance issues with some SSE4 intrinsics, such as _mm_dp_ps and _mm_blend_ps.

The problem is that when I move from my old combination of _mm_add_ps, _mm_mul_ps and _mm_shuffle_ps to _mm_dp_ps and _mm_blend_ps, the overall performance is about the same, or slightly worse.

This happens even though the total instruction count in my example has gone down. Before, I had 18 multiplies, 13 adds, 1 movehl and 1 shuffle. Now I have 14 dot products, 3 adds, 6 blends and 2 shuffles.

This is done in a highly repeated loop. The instruction counts are per iteration. I accumulate 6 values, so I need to use two registers.

The tests were run on Linux, on Sandy Bridge (SNB), Nehalem (NHLM) and Westmere machines. They all show the same behavior.
Any help? Thanks.
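
For readers following along, here is a minimal sketch of the two styles being compared, reduced to a single 4-element dot product. The function names `dot4_sse` and `dot4_sse41` are hypothetical and this is not the poster's actual code, which was never shown. It assumes GCC or Clang on x86-64; the `target("sse4.1")` attribute is used so the file builds without `-msse4.1`, which is an illustration choice rather than a requirement. One likely reason the instruction count misleads is that `dpps` is generally listed in instruction tables as decoding into several micro-ops with a latency of roughly a dozen cycles on CPUs of this generation, so one `_mm_dp_ps` is not one "cheap" operation. Treat those figures as ballpark and check a table for your exact CPU.

```cpp
#include <xmmintrin.h>  // SSE: _mm_mul_ps, _mm_add_ps, _mm_movehl_ps, _mm_shuffle_ps
#include <smmintrin.h>  // SSE4.1: _mm_dp_ps

// Pre-SSE4 style: one multiply, then a horizontal sum built from
// movehl + add + shuffle + add.
static inline float dot4_sse(__m128 a, __m128 b) {
    __m128 prod = _mm_mul_ps(a, b);           // p0, p1, p2, p3
    __m128 hi   = _mm_movehl_ps(prod, prod);  // p2, p3, p2, p3
    __m128 sum2 = _mm_add_ps(prod, hi);       // p0+p2, p1+p3, ...
    __m128 odd  = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
    __m128 sum  = _mm_add_ss(sum2, odd);      // lane 0 = p0+p2+p1+p3
    return _mm_cvtss_f32(sum);
}

// SSE4.1 style: a single dpps. Mask 0xF1 means "multiply all four
// lanes (high nibble 0xF) and write the sum to lane 0 (low nibble 0x1)".
// The target attribute is an assumption for GCC/Clang builds without -msse4.1.
__attribute__((target("sse4.1")))
static inline float dot4_sse41(__m128 a, __m128 b) {
    return _mm_cvtss_f32(_mm_dp_ps(a, b, 0xF1));
}
```

Both functions return the same value. Which one is faster inside a real loop depends on how the surrounding instructions compete for execution ports, so it has to be measured rather than inferred from the instruction count.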
jimdempseyatthecove
Black Belt
Examine your code to see how the ports are utilized. Intel has a tool that attempts to do this. Although your new code "uses two registers", it may run faster using more registers, as you can often do additional work during latencies.

Jim Dempsey
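
To illustrate the point about using more registers to overlap latencies, here is a rough sketch with hypothetical function names (`dot_one_acc`, `dot_two_acc`). It is not taken from the thread. It assumes the array length is a multiple of 8 and uses only baseline SSE intrinsics. With one accumulator, every add has to wait for the previous add to finish; with two independent accumulators, the CPU can keep two dependency chains in flight at once.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cstddef>

// Horizontal sum of the four lanes of v.
static inline float hsum(__m128 v) {
    __m128 hi  = _mm_movehl_ps(v, v);
    __m128 s   = _mm_add_ps(v, hi);
    __m128 odd = _mm_shuffle_ps(s, s, _MM_SHUFFLE(1, 1, 1, 1));
    return _mm_cvtss_f32(_mm_add_ss(s, odd));
}

// One accumulator: each iteration's add depends on the previous one,
// so the loop is limited by add latency. Assumes n is a multiple of 4.
float dot_one_acc(const float* x, const float* y, std::size_t n) {
    __m128 acc = _mm_setzero_ps();
    for (std::size_t i = 0; i < n; i += 4) {
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(x + i),
                                         _mm_loadu_ps(y + i)));
    }
    return hsum(acc);
}

// Two accumulators: the two chains are independent, so their latencies
// can overlap. Assumes n is a multiple of 8.
float dot_two_acc(const float* x, const float* y, std::size_t n) {
    __m128 acc0 = _mm_setzero_ps();
    __m128 acc1 = _mm_setzero_ps();
    for (std::size_t i = 0; i < n; i += 8) {
        acc0 = _mm_add_ps(acc0, _mm_mul_ps(_mm_loadu_ps(x + i),
                                           _mm_loadu_ps(y + i)));
        acc1 = _mm_add_ps(acc1, _mm_mul_ps(_mm_loadu_ps(x + i + 4),
                                           _mm_loadu_ps(y + i + 4)));
    }
    return hsum(_mm_add_ps(acc0, acc1));  // combine the chains once, at the end
}
```

Note that splitting the sum changes the order of floating-point additions, so results can differ in the last bits for general inputs. Whether the extra accumulators actually help depends on the loop and the CPU, so this is something to measure.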
missing__zlw
Beginner
Thanks. Could you let me know the name of this Intel tool?
