Hi,
consider SSE/AVX code that operates on XMM/YMM registers 99% of the time but needs to keep track of masks (control-flow to data-flow conversion using BLENDVPS operations).
Example in C (using intrinsics for clarity):
__m256 x = ...
__m256 y = ...
__m256 mask0 = ...
__m256 mask1 = _mm256_cmp_ps(x, y, _CMP_LT_OS); /* x < y per lane; AVX has no _mm256_cmplt_ps */
__m256 mask2 = _mm256_and_ps(mask0, mask1);
__m256 res = _mm256_blendv_ps(x, y, mask2);
Now, because both of the disjoint control-flow paths are executed, the number of live variables required for the blending operations increases, and so does register pressure.
The idea is now to store the masks in GPRs instead of vector registers in order to free up some vector registers (operations like and/or/xor can just as well be executed in the scalar unit).
This would result in code like this:
__m256 x = ...
__m256 y = ...
unsigned mask0 = ...
__m256 mask1v = _mm256_cmp_ps(x, y, _CMP_LT_OS);
unsigned mask1 = _mm256_movemask_ps(mask1v);
unsigned mask2 = mask0 & mask1;
__m256 mask2v = ?
__m256 res = _mm256_blendv_ps(x, y, mask2v);
Now the question is: can anybody help me out on the question mark? :)
However, I could imagine that people who are more experienced with such code would advise against attempting this because of other performance issues. Is that the case?
Kind regards,
Ralf
P.S. I saw postings in a different thread (http://software.intel.com/en-us/forums/showthread.php?t=80452&o=a&s=lr ) that went in a similar direction, but I felt my question was a little bit off-topic there.
[...]
const __m256 mask2v = LUT[mask2];
Now, I don't think it will be beneficial in your example: the extra latency from the vmovmskps + LUT access will be worse than that of the spills/fills caused by your lack of registers, and it will sit on the critical path, which the spills/fills most probably do not.
NB: I use it personally to expand masks stored in packed form (8-bit) for multi-pass algorithms: the easy 32-to-1 compression obviously minimizes cache misses, and I have measured actual speedups vs. storing the 256-bit masks (for data sets bigger than the L2$ capacity).
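For completeness, here is a minimal sketch of how such a table could be filled once at startup; the table name LUT and the init function are illustrative, not from the posts above. It simply precomputes, for each of the 256 possible vmovmskps results, the 256-bit mask with all-ones or all-zeros per 32-bit lane:

#include <immintrin.h>

/* hypothetical lookup table: one 256-bit blend mask per 8-bit movemask value */
static __m256 LUT[256];

static void init_mask_lut(void)
{
    for (int m = 0; m < 256; ++m) {
        /* bit i of m selects all-ones (-1) or all-zeros for 32-bit lane i */
        __m256i v = _mm256_setr_epi32(
            (m & 1)  ? -1 : 0, (m & 2)  ? -1 : 0,
            (m & 4)  ? -1 : 0, (m & 8)  ? -1 : 0,
            (m & 16) ? -1 : 0, (m & 32) ? -1 : 0,
            (m & 64) ? -1 : 0, (m & 128) ? -1 : 0);
        LUT[m] = _mm256_castsi256_ps(v);
    }
}

With that in place, the lookup const __m256 mask2v = LUT[mask2]; is a single aligned 32-byte load.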
Hi bronxzv,
thanks for your answer.
That solution obviously works, and I will give it a try. However, I was hoping there was a way to compute the 256-bit mask on the fly with some clever instructions; so far, I have not come up with anything useful.
Best,
Ralf
Only a long instruction sequence will fit the bill, and it will clearly be slower than the single vmovaps (with a very low cache-miss ratio) of the LUT solution.
Anyway, even if there were such an instruction, it would not speed up your solution much: the 8-bit to 256-bit conversion occurs only once, but the vmovmskps is required after each packed compare. All this added latency on the critical path, just to free a single YMM register, is most probably a bad tradeoff.
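Just to illustrate what such a sequence could look like: the sketch below assumes AVX2 integer instructions on ymm registers (broadcast, and, compare), which AVX-only hardware does not have; with plain AVX the same idea would have to be done twice on 128-bit halves and recombined with vinsertf128, which is exactly the kind of long sequence meant above. The helper name expand_mask8 is made up for the example:

#include <immintrin.h>

/* expand an 8-bit movemask value back into a 256-bit blend mask (requires AVX2) */
static inline __m256 expand_mask8(unsigned mask)
{
    const __m256i bits = _mm256_setr_epi32(1, 2, 4, 8, 16, 32, 64, 128);
    __m256i v = _mm256_set1_epi32((int)mask);  /* broadcast the packed mask */
    v = _mm256_and_si256(v, bits);             /* isolate one bit per 32-bit lane */
    v = _mm256_cmpeq_epi32(v, bits);           /* all-ones where the bit was set */
    return _mm256_castsi256_ps(v);
}

Even then, this is three dependent instructions plus a constant load on the critical path, versus one load with the LUT.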