I am seeing performance issues with some SSE4.1 intrinsics, such as _mm_dp_ps and _mm_blend_ps.
When I move from my old combination of _mm_add_ps, _mm_mul_ps and _mm_shuffle_ps to _mm_dp_ps and _mm_blend_ps, the overall performance is about the same, or slightly worse.
This is despite the total instruction count going down in my example. Before, I had 18 multiplies, 13 adds, 1 movehl and 1 shuffle. Now I have 14 dot products, 3 adds, 6 blends and 2 shuffles.
This is done in a heavily repeated loop, and the instruction counts are per iteration. I accumulate 6 values, so I need to use two registers.
Any help? Thanks.
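For reference, the two styles being compared can be sketched for a single 4-element dot product as below. This is only a minimal illustration, not the actual loop from the question: the names `dot4_sse`, `dot4_dpps`, `check_dot4` and the `TARGET_SSE41` macro are hypothetical, unaligned loads are assumed, and the target attribute assumes GCC or Clang on x86 (other compilers would need their own SSE4.1 switch).

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE: _mm_mul_ps, _mm_add_ps, _mm_shuffle_ps, _mm_movehl_ps */
#include <smmintrin.h>  /* SSE4.1: _mm_dp_ps */

/* Assumption: GCC/Clang. Enables SSE4.1 for one function without a global flag. */
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_SSE41 __attribute__((target("sse4.1")))
#else
#define TARGET_SSE41
#endif

/* Older style: one vertical multiply, then a horizontal sum
   built from movehl + shuffle + adds. */
static float dot4_sse(const float *a, const float *b)
{
    __m128 p  = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)); /* p0 p1 p2 p3 */
    __m128 hi = _mm_movehl_ps(p, p);                          /* p2 p3 p2 p3 */
    __m128 s  = _mm_add_ps(p, hi);                            /* p0+p2 p1+p3 .. .. */
    __m128 s1 = _mm_shuffle_ps(s, s, _MM_SHUFFLE(1, 1, 1, 1)); /* lane 1 in every lane */
    return _mm_cvtss_f32(_mm_add_ss(s, s1));                  /* (p0+p2) + (p1+p3) */
}

/* SSE4.1 style: mask 0xF1 multiplies all four lanes (high nibble)
   and writes the sum to lane 0 only (low nibble). */
TARGET_SSE41
static float dot4_dpps(const float *a, const float *b)
{
    __m128 d = _mm_dp_ps(_mm_loadu_ps(a), _mm_loadu_ps(b), 0xF1);
    return _mm_cvtss_f32(d);
}

/* Sanity check on small integer-valued inputs, where both orders of
   summation are exact in single precision. */
static void check_dot4(void)
{
    const float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    const float b[4] = {5.0f, 6.0f, 7.0f, 8.0f};
    assert(dot4_sse(a, b) == dot4_dpps(a, b));
}
```

The second version is one instruction instead of five, but the multiply and the horizontal add still have to happen inside that one instruction, so a lower instruction count need not translate into fewer cycles. Timing both variants in the real loop is the only reliable comparison.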
3 Replies
This may depend on several factors, including, but not limited to, which CPU it is and your data locality.
_mm_dp_ps is very much a specialized instruction. It is not usually advertised as a competitive solution in cases where the older parallel instructions can be used effectively, e.g. by a compiler.
The most recent CPUs appear to have improved the performance of shuffle to such an extent that it matches the performance of the newer alternative instructions.
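As an illustration of what "using the older parallel instructions effectively" can look like, here is a sketch that computes four independent dot products at once with only vertical multiplies and adds, so no horizontal step is needed. It assumes the data can be laid out structure-of-arrays (row k holds component k of four different vectors), which may or may not fit the loop in question; the names `dot4_batch_soa` and `check_batch` are hypothetical.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _mm_mul_ps, _mm_add_ps, _mm_storeu_ps */

/* a[k] and b[k] each hold component k of four independent vectors.
   out[j] receives the dot product of vector j of a with vector j of b. */
static void dot4_batch_soa(const float a[4][4], const float b[4][4], float out[4])
{
    __m128 acc = _mm_mul_ps(_mm_loadu_ps(a[0]), _mm_loadu_ps(b[0]));
    for (int k = 1; k < 4; ++k) {
        /* One multiply and one add per component, covering all four vectors. */
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(a[k]), _mm_loadu_ps(b[k])));
    }
    _mm_storeu_ps(out, acc);
}

/* Small exact example: vector j of a is (j+1, j+1, j+1, j+1),
   every vector of b is (1, 2, 3, 4), so out[j] = 10 * (j+1). */
static void check_batch(void)
{
    const float a[4][4] = {{1, 2, 3, 4}, {1, 2, 3, 4}, {1, 2, 3, 4}, {1, 2, 3, 4}};
    const float b[4][4] = {{1, 1, 1, 1}, {2, 2, 2, 2}, {3, 3, 3, 3}, {4, 4, 4, 4}};
    float out[4];
    dot4_batch_soa(a, b, out);
    assert(out[0] == 10.0f && out[1] == 20.0f && out[2] == 30.0f && out[3] == 40.0f);
}
```

Here four results cost four multiplies and three adds in total. Whether this beats one `_mm_dp_ps` per result depends on whether the inputs are already in, or can cheaply be brought into, this layout.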
I have tested on Nehalem, Westmere and Sandy Bridge (SNB). They all give similar results. My data resides in several ordinary float arrays.
I am not sure I understand your last statement. Could you elaborate? "The most recent CPUs appeared to have improved the performance of shuffle to such an extent that it matches performance of more recent alternative instructions."
Thanks.
Well, yes, I've had code with shuffles running as fast as SSE4 or AVX alternatives on SNB. I doubt that it's relevant to your comment, except to point out that my code with shuffles may not resemble yours, given how wide a variety of situations might reasonably be coded with either shuffles or SSE4 or newer alternatives.