SSE4 instructions (_mm_dp_ps, _mm_blend_ps) are not faster than older SSE code?

missing__zlw · ‎11-08-2011

I am seeing performance issues with some SSE4.1 intrinsics, such as _mm_dp_ps and _mm_blend_ps.

When I move from my old combination of _mm_add_ps, _mm_mul_ps and _mm_shuffle_ps to _mm_dp_ps and _mm_blend_ps, the overall performance is about the same, or slightly worse.

This is despite the total instruction count going down in my example. Before, I had 18 multiplies, 13 adds, 1 movehl and 1 shuffle. Now I have 14 dot products, 3 adds, 6 blends and 2 shuffles.

This is done in a heavily repeated loop. The instruction counts are per iteration. I accumulate 6 values, so I need to use two registers.
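For readers following along, here is a minimal sketch of the two styles being compared, reduced to a single 4-element dot product. It is not the poster's actual loop; the function names `dot4_sse` and `dot4_dpps` are made up for illustration, and the `target` attribute assumes GCC or Clang (with other compilers, enable SSE4.1 through the usual compiler option instead).

```c
#include <smmintrin.h>  /* SSE4.1 intrinsics; also pulls in the older SSE headers */

/* Older style: one vertical multiply, then a horizontal sum
   built from movehl, shuffle and add. (Hypothetical helper.) */
float dot4_sse(const float *a, const float *b)
{
    __m128 prod = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)); /* p0 p1 p2 p3 */
    __m128 hi   = _mm_movehl_ps(prod, prod);                    /* p2 p3 p2 p3 */
    __m128 sum2 = _mm_add_ps(prod, hi);                         /* lane0 = p0+p2, lane1 = p1+p3 */
    __m128 odd  = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1)); /* broadcast lane1 */
    return _mm_cvtss_f32(_mm_add_ss(sum2, odd));                /* (p0+p2) + (p1+p3) */
}

/* SSE4.1 style: a single dot-product instruction. (Hypothetical helper.)
   The attribute is a GCC/Clang extension so this compiles without -msse4.1. */
__attribute__((target("sse4.1")))
float dot4_dpps(const float *a, const float *b)
{
    /* Mask 0xF1: high nibble F multiplies all four lanes,
       low nibble 1 writes the sum to lane 0 only. */
    __m128 r = _mm_dp_ps(_mm_loadu_ps(a), _mm_loadu_ps(b), 0xF1);
    return _mm_cvtss_f32(r);
}
```

Both functions return the same value for the same inputs. The second one is shorter in source form, but _mm_dp_ps does the multiply and the horizontal add internally, so a lower instruction count does not by itself imply less work for the CPU.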

Any help? Thanks.

TimP · ‎11-08-2011

This may depend on several factors, including, but not limited to, which CPU it is and your data locality.
_mm_dp_ps is very much a specialized instruction, not usually advertised as a competitive solution where the older parallel instructions may be used effectively, e.g. by a compiler.
The most recent CPUs appeared to have improved the performance of shuffle to such an extent that it matches performance of more recent alternative instructions.
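As an illustration of "the older parallel instructions used effectively", here is a rough sketch that computes four independent dot products at once with only vertical multiplies and adds. It assumes a structure-of-arrays layout (all x components in one array, all y components in another, and so on), which may or may not match the poster's data; the function name `dot4x4_soa` and its parameter names are hypothetical, and an x86-64 target (where SSE is always available) is assumed.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Four 4-component dot products in parallel (hypothetical helper).
   ax[i], ay[i], az[i], aw[i] are the components of vector a_i,
   likewise for b. out[i] receives dot(a_i, b_i) for i = 0..3. */
void dot4x4_soa(const float *ax, const float *ay,
                const float *az, const float *aw,
                const float *bx, const float *by,
                const float *bz, const float *bw,
                float *out)
{
    /* Each lane accumulates one dot product; no shuffles or
       horizontal operations are needed with this layout. */
    __m128 acc = _mm_mul_ps(_mm_loadu_ps(ax), _mm_loadu_ps(bx));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(ay), _mm_loadu_ps(by)));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(az), _mm_loadu_ps(bz)));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(aw), _mm_loadu_ps(bw)));
    _mm_storeu_ps(out, acc);
}
```

That is 4 multiplies and 3 adds for four results, with every lane doing useful work. Whether this beats a sequence of _mm_dp_ps calls in a given loop depends on the CPU and on how the data is laid out, so it is worth measuring both.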

missing__zlw · ‎11-08-2011

I have tested on Nehalem, Westmere and Sandy Bridge (SNB). They all give similar results. My data resides in several ordinary float arrays.

I am not sure I understand your last statement. Could you elaborate? "The most recent CPUs appeared to have improved the performance of shuffle to such an extent that it matches performance of more recent alternative instructions."

Thanks.

TimP · ‎11-08-2011

Well, yes, I've had code with shuffles running as fast as SSE4 or AVX alternatives on SNB. I doubt that it's relevant to your case, except to point out that my code with shuffles may not resemble yours, given how wide a variety of situations might reasonably be coded with either shuffles or SSE4 or newer alternatives.