The following example:
if( a < b )
{
    a = b;
}
else
{
    a = c;
}
can be mapped to the following intrinsics:
mask = _mm_cmplt_pd (a, b);
a = _mm_blendv_pd (c, b, mask);
In this case:
- The comparison is mapped to the first intrinsic operation
- The conditional assignment is mapped to the second intrinsic operation
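A minimal sketch of this mapping as a complete function, assuming `a`, `b`, and `c` are `__m128d` values holding two doubles each. The helper name `select_lt` is hypothetical. Note that the intrinsic accepting a run-time vector mask is `_mm_blendv_pd` (SSE4.1); `_mm_blend_pd` expects a compile-time integer immediate instead. The `target` attribute is a GCC/Clang convenience assumed here so the sketch builds without `-msse4.1`.

```c
#include <emmintrin.h>  /* SSE2: __m128d, _mm_cmplt_pd */
#include <smmintrin.h>  /* SSE4.1: _mm_blendv_pd */

/* Per lane: result = (a < b) ? b : c.
   Hypothetical helper name. The target attribute (GCC/Clang) enables
   SSE4.1 for this function only, as an alternative to -msse4.1. */
__attribute__((target("sse4.1")))
static __m128d select_lt(__m128d a, __m128d b, __m128d c)
{
    __m128d mask = _mm_cmplt_pd(a, b);   /* all-ones in lanes where a < b */
    return _mm_blendv_pd(c, b, mask);    /* take b where mask is set, else c */
}
```

Each of the two lanes is decided independently, so one call can take the "if" path in one lane and the "else" path in the other.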
Now consider the case without an else branch:
if( a < b )
{
    a = b;
}
If the condition isn't met, the regular C code skips the assignment operation.
But with the following optimized code:
mask = _mm_cmplt_pd (a, b);
a = _mm_blendv_pd (a, b, mask);
the assignment to a is always performed.
Is there a way to skip the assignment when it is not needed (i.e. if the condition isn't met)?
Thanks
Hello,
You can avoid the blend instruction only if the condition is false for both comparisons that _mm_cmplt_pd performs. Furthermore, a conditional jump costs at least 1 cycle, and _mm_blend_pd needs only 1 cycle. This already looks optimal to me.
Kind regards
Thomas
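To make the trade-off concrete, here is a hedged sketch of what skipping the blend would look like. It assumes `_mm_movemask_pd` is used to test whether any lane passed the comparison; the helper name `assign_if_less_skip` is hypothetical, and the variable-mask blend is written as `_mm_blendv_pd` (SSE4.1). The `target` attribute is a GCC/Clang convenience assumed for building without `-msse4.1`.

```c
#include <emmintrin.h>  /* SSE2: _mm_cmplt_pd, _mm_movemask_pd */
#include <smmintrin.h>  /* SSE4.1: _mm_blendv_pd */

/* Per lane: a = (a < b) ? b : a, but skip the blend when the
   comparison is false in both lanes. Hypothetical helper name. */
__attribute__((target("sse4.1")))
static __m128d assign_if_less_skip(__m128d a, __m128d b)
{
    __m128d mask = _mm_cmplt_pd(a, b);
    /* _mm_movemask_pd packs the sign bit of each lane into bits 0..1 */
    if (_mm_movemask_pd(mask) == 0)
        return a;                        /* neither lane needs updating */
    return _mm_blendv_pd(a, b, mask);
}
```

The skip path adds a mask extraction and a conditional branch in front of the blend, which is the extra cost the reply above refers to.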
MAD willhal: You can only avoid the blend instruction if the condition is false for both comparisons that _mm_cmplt_pd does.
Is the avoidance done automatically if the condition is false for both comparisons?
tim18: The blend with unconditional store is probably optimum, for cases where branch prediction isn't effective.
Can you explain what you mean by "blend with unconditional store"? Are you referring to a case where the condition is true for only one of the two mask elements? Also, what do you mean by "is probably optimum"?
MAD willhal: Furthermore, you will need (at least) 1 cycle for a conditional jump and _mm_blend_pd needs only 1 cycle.
Is there a reference that specifies the number of cycles that each of the _mm operations requires (or do they all require one cycle)?
a = _mm_blendv_pd (a, b, mask);
stores a regardless of whether the value changes. This avoids any dependence on branch prediction, but it could increase latency compared with code compiled to a predictable branch that frequently skips the store.
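As an illustration of the unconditional store in a loop, here is a minimal sketch over two arrays. The function name `raise_to_at_least`, the even-length assumption, and the use of unaligned loads are all assumptions for this example. It uses only SSE2, expressing the variable blend as and/andnot/or, so no SSE4.1 support is needed.

```c
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 only */

/* Hypothetical helper: for each i, a[i] = (a[i] < b[i]) ? b[i] : a[i].
   Assumes n is even. Every iteration stores to a[], whether or not any
   lane changed, so the loop body has no data-dependent branch. */
static void raise_to_at_least(double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i += 2) {
        __m128d va   = _mm_loadu_pd(a + i);
        __m128d vb   = _mm_loadu_pd(b + i);
        __m128d mask = _mm_cmplt_pd(va, vb);
        /* SSE2 equivalent of a variable blend: (mask & vb) | (~mask & va) */
        __m128d r = _mm_or_pd(_mm_and_pd(mask, vb), _mm_andnot_pd(mask, va));
        _mm_storeu_pd(a + i, r);         /* unconditional store */
    }
}
```

Whether this beats a branchy version depends on how predictable the comparison is for the actual data, so it is worth measuring both.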
Is the avoidance done automatically if the condition is false for both comparisons?
No, it is not avoided. However, avoiding it does not make sense. The blend instruction needs only a single clock cycle, regardless of how many values are copied (0, 1, or 2). You can hardly do better than that.
If you introduced some additional code to skip the blend, it would need at least one instruction for the jump. Plus, the branch prediction might be wrong, which would waste additional cycles.
With an out-of-order engine, it is very difficult to predict whether a piece of code is optimal, but in this case it very likely is :)
The latency and throughput of instructions can be found in Appendix C of the "Intel 64 and IA-32 Architectures Optimization Reference Manual" (http://www.intel.com/design/processor/manuals/248966.pdf)
Kind regards
Thomas