This may depend on several factors, including, but not limited to, the CPU model and your data locality. _mm_dp_ps is a very specialized instruction, not usually advertised as a competitive solution where the older parallel instructions can be used effectively, e.g. by a compiler. The most recent CPUs appear to have improved the performance of shuffle to such an extent that it matches the performance of more recent alternative instructions.
I have tested on Nehalem, Westmere and Sandy Bridge (SNB). They all give similar results. My data resides in several ordinary float arrays.
I am not sure I understand your last statement. Could you elaborate? "The most recent CPUs appeared to have improved the performance of shuffle to such an extent that it matches performance of more recent alternative instructions."
Well, yes, I've had code with shuffles running as fast as SSE4 or AVX alternatives on SNB. I doubt that it's relevant to your comment, except to point out that my code with shuffles may not resemble yours, given how wide a variety of situations might reasonably be coded with either shuffles or SSE4 or newer alternatives.
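For anyone following along, here is a minimal sketch of the two styles being compared: a 4-element dot product done with a multiply plus shuffles, and the same thing with _mm_dp_ps. This is not code from either poster. The function names dot4_shuffle and dot4_dpps are made up for illustration, and the target attribute assumes GCC or Clang (with other compilers you would enable SSE4.1 through a compiler option instead).

```c
#include <xmmintrin.h>  /* SSE: _mm_mul_ps, _mm_shuffle_ps, _mm_movehl_ps */
#include <smmintrin.h>  /* SSE4.1: _mm_dp_ps */

/* Hypothetical helper: dot product of two 4-float vectors using only
   SSE multiply, shuffle and add. */
static float dot4_shuffle(const float *a, const float *b)
{
    /* prod = { a0*b0, a1*b1, a2*b2, a3*b3 } */
    __m128 prod = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    /* swap neighbouring lanes: { p1, p0, p3, p2 } */
    __m128 swap = _mm_shuffle_ps(prod, prod, _MM_SHUFFLE(2, 3, 0, 1));
    /* sums = { p0+p1, p0+p1, p2+p3, p2+p3 } */
    __m128 sums = _mm_add_ps(prod, swap);
    /* bring the upper pair of sums down so lane 0 holds p2+p3 */
    __m128 high = _mm_movehl_ps(swap, sums);
    /* lane 0 = (p0+p1) + (p2+p3) */
    return _mm_cvtss_f32(_mm_add_ss(sums, high));
}

/* Hypothetical helper: the same dot product with the SSE4.1 intrinsic.
   Assumption: GCC/Clang target attribute, so no -msse4.1 flag is needed. */
__attribute__((target("sse4.1")))
static float dot4_dpps(const float *a, const float *b)
{
    /* Mask 0xF1: high nibble F multiplies all four lanes,
       low nibble 1 writes the sum to lane 0 only. */
    __m128 r = _mm_dp_ps(_mm_loadu_ps(a), _mm_loadu_ps(b), 0xF1);
    return _mm_cvtss_f32(r);
}
```

Both functions should return the same value for the same inputs, up to floating-point rounding order. Which one is faster depends on the CPU and on how the surrounding code is arranged, so timing both in your own loop is the only reliable answer.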