I'm working on a stereo algorithm to compute a disparity map, which means I need to calculate a lot of SAD values.
To improve performance I want to use SSE4, especially the "_mm_mpsadbw_epu8" intrinsic.
I stumbled upon this Intel document. Section F, "Intel® SSE4 – Optimized Function for 16x16 Blocks", contains a SAD calculation example for a 16x16 block. I used this snippet in my code and the performance improved a lot, but it is not enough. Is it possible to boost the performance by using all 16 SSE registers instead of 8, or is there some kind of constraint?
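For reference, here is a minimal sketch (not the snippet from the Intel document; the helper name is made up) of what a single _mm_mpsadbw_epu8 call computes:
[cpp]
#include <smmintrin.h>  // SSE4.1

// Hypothetical helper: one _mm_mpsadbw_epu8 yields eight 16-bit SADs between a
// 4-byte block of the second operand (selected by imm bits 1:0) and eight
// overlapping 4-byte windows of the first operand (starting at byte 0 or 4,
// selected by imm bit 2).
static inline __m128i partial_sads(const unsigned char* src_row,
                                   const unsigned char* ref_row)
{
    __m128i src = _mm_loadu_si128((const __m128i*)src_row);
    __m128i ref = _mm_loadu_si128((const __m128i*)ref_row);
    // imm = 0: block = ref bytes 0..3, windows = src bytes 0..3, 1..4, ..., 7..10
    return _mm_mpsadbw_epu8(src, ref, 0);
}
[/cpp]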
Best Regards
Jambalaja
Hi Sergey,
Thanks for your reply. So if I use AVX instructions (which operate on the same registers as SSE, just 128 bits wider), I can probably reach twice the performance. But can I also reach twice the performance if I use all 16 registers instead of 8? Or are there any dependencies, like "you cannot operate on register 1 and register 9 in parallel" or something like that?
best regards
> But can I also reach twice the performance if I use all 16 registers instead of 8?
Operating on more registers doesn't boost performance by itself. But it does help if your data-processing context does not fit in 8 registers and your algorithm is friendly to instruction-level parallelism (i.e. when multiple instructions within a single thread can be issued in parallel by the CPU), provided you're not already saturating the execution units. In my experience, most algorithms are memory- or computation-bound.
> Or are there any dependencies, like "you cannot operate on register 1 and register 9 in parallel" or something like that?
No, at least not to my knowledge. All xmm/ymm registers are independent and interchangeable. However, there are dependency stalls, which basically happen when the CPU cannot execute subsequent instructions before their input data is ready (this is what works against instruction-level parallelism).
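To illustrate the instruction-level-parallelism point with a toy example (the functions below are made up for this post, not taken from your code): a single accumulator forms one serial dependency chain, while two independent accumulators let the CPU overlap the additions.
[cpp]
#include <emmintrin.h>  // SSE2

// One accumulator: every add depends on the previous one (a serial chain).
__m128i sum_one_chain(const __m128i* v, int n)
{
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < n; ++i)
        acc = _mm_adds_epu16(acc, v[i]);
    return acc;
}

// Two independent accumulators: the adds of the two chains can issue in parallel.
__m128i sum_two_chains(const __m128i* v, int n)   // assumes n is even
{
    __m128i acc0 = _mm_setzero_si128();
    __m128i acc1 = _mm_setzero_si128();
    for (int i = 0; i < n; i += 2) {
        acc0 = _mm_adds_epu16(acc0, v[i]);
        acc1 = _mm_adds_epu16(acc1, v[i + 1]);
    }
    return _mm_adds_epu16(acc0, acc1);
}
[/cpp]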
Adrian,
I think it would be productive for you to post a complete function indicating where the choke point is located. Trying to give advice in generalities is often counterproductive. What is usually effective is to interleave the memory loads and stores with computations, as opposed to packing all the registers with loads, then performing all the register-to-register calculations, and then dumping the result registers of interest back to memory.
Jim Dempsey
My code is posted below.
As you can see, I'm not using the registers in a smart way; I don't even have to use s8-s14.
Maybe I should load all registers first, then compute, load, compute, ... like Jim said. I'll try it.
left_ptr = left->ptr(x);
right_ptr = right->ptr(y);
right_shift_ptr = right->ptr(z);
// row1
s0 = _mm_loadu_si128((__m128i*)left_ptr);
s1 = _mm_loadu_si128((__m128i*)right_ptr);
s2 = _mm_loadu_si128((__m128i*)right_shift_ptr);
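// each _mm_mpsadbw_epu8 below yields eight 16-bit SADs: one 4-byte block of the
// left row (selected by imm bits 1:0) against eight sliding 4-byte windows of the
// right / shifted-right row (starting at byte 0 or 4 per imm bit 2)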
s3 = _mm_adds_epu16(s3, _mm_mpsadbw_epu8(s1, s0, 0));
s4 = _mm_adds_epu16(s4, _mm_mpsadbw_epu8(s1, s0, 5));
s5 = _mm_adds_epu16(s5, _mm_mpsadbw_epu8(s2, s0, 2));
s6 = _mm_adds_epu16(s6, _mm_mpsadbw_epu8(s2, s0, 7));
left_ptr+=const;
right_ptr+=const;
right_shift_ptr+=const;
// row2
s8 = _mm_loadu_si128((__m128i*)left_ptr);
s9 = _mm_loadu_si128((__m128i*)right_ptr);
s10 = _mm_loadu_si128((__m128i*)right_shift_ptr);
s11 = _mm_adds_epu16(s11, _mm_mpsadbw_epu8(s9, s8, 0));
s12 = _mm_adds_epu16(s12, _mm_mpsadbw_epu8(s9, s8, 5));
s13 = _mm_adds_epu16(s13, _mm_mpsadbw_epu8(s10, s8, 2));
s14 = _mm_adds_epu16(s14, _mm_mpsadbw_epu8(s10, s8, 7));
left_ptr+=const;
right_ptr+=const;
right_shift_ptr+=const;
// row3
s0 = _mm_loadu_si128((__m128i*)left_ptr);
s1 = _mm_loadu_si128((__m128i*)right_ptr);
s2 = _mm_loadu_si128((__m128i*)right_shift_ptr);
s3 = _mm_adds_epu16(s3, _mm_mpsadbw_epu8(s1, s0, 0));
s4 = _mm_adds_epu16(s4, _mm_mpsadbw_epu8(s1, s0, 5));
s5 = _mm_adds_epu16(s5, _mm_mpsadbw_epu8(s2, s0, 2));
s6 = _mm_adds_epu16(s6, _mm_mpsadbw_epu8(s2, s0, 7));
left_ptr+=const;
right_ptr+=const;
right_shift_ptr+=const;
//row4
s8 = _mm_loadu_si128((__m128i*)left_ptr);
s9 = _mm_loadu_si128((__m128i*)right_ptr);
s10 = _mm_loadu_si128((__m128i*)right_shift_ptr);
s11 = _mm_adds_epu16(s11, _mm_mpsadbw_epu8(s9, s8, 0));
s12 = _mm_adds_epu16(s12, _mm_mpsadbw_epu8(s9, s8, 5));
s13 = _mm_adds_epu16(s13, _mm_mpsadbw_epu8(s10, s8, 2));
s14 = _mm_adds_epu16(s14, _mm_mpsadbw_epu8(s10, s8, 7));
left_ptr+=const;
right_ptr+=const;
right_shift_ptr+=const;
Adrian,
You misread my statement; read it again. You have it backwards. You want to interleave the memory fetches and stores to some extent with register-to-register instructions, and you want to defer using the destination register of a memory fetch for some time after the fetch instruction. Using Sergey's uploaded code as an example:
Original code (edited slightly to reformat for this forum message):
[cpp]
for(iX = 0; iX <= (iRight - iLeft); iX++){
sum = _mm_xor_si128(sum, sum); // Clear accumulator
sum2 = _mm_xor_si128(sum2, sum2); // Clear accumulator
row2 = _mm_loadu_si128((__m128i *)pucR);
row4 = _mm_loadu_si128((__m128i *)(pucR + iWidth));
row6 = _mm_loadu_si128((__m128i *)(pucR + 2*iWidth));
row8 = _mm_loadu_si128((__m128i *)(pucR + 3*iWidth));
row1 = _mm_load_si128((__m128i *) pucC);
row3 = _mm_load_si128((__m128i *) (pucC + iWidth));
row5 = _mm_load_si128((__m128i *) (pucC + 2*iWidth));
row7 = _mm_load_si128((__m128i *) (pucC + 3*iWidth));
row1 = _mm_sad_epu8(row1, row2);
row3 = _mm_sad_epu8(row3, row4);
sum = _mm_add_epi16(sum, row1);
sum2 = _mm_add_epi16(sum2, row3);
row5 = _mm_sad_epu8(row5, row6);
row7 = _mm_sad_epu8(row7, row8);
sum = _mm_add_epi16(sum, row5);
sum2 = _mm_add_epi16(sum2, row7);
row2 = _mm_loadu_si128((__m128i *)(pucR + 4*iWidth));
row4 = _mm_loadu_si128((__m128i *)(pucR + 5*iWidth));
row6 = _mm_loadu_si128((__m128i *)(pucR + 6*iWidth));
row8 = _mm_loadu_si128((__m128i *)(pucR + 7*iWidth));
row1 = _mm_load_si128((__m128i *) (pucC + 4*iWidth));
row3 = _mm_load_si128((__m128i *) (pucC + 5*iWidth));
row5 = _mm_load_si128((__m128i *) (pucC + 6*iWidth));
row7 = _mm_load_si128((__m128i *) (pucC + 7*iWidth));
row1 = _mm_sad_epu8(row1, row2);
row3 = _mm_sad_epu8(row3, row4);
sum = _mm_add_epi16(sum, row1);
sum2 = _mm_add_epi16(sum2, row3);
row5 = _mm_sad_epu8(row5, row6);
row7 = _mm_sad_epu8(row7, row8);
sum = _mm_add_epi16(sum, row5);
sum2 = _mm_add_epi16(sum2, row7);
[/cpp]
Notice how all the loads are congregated together. A re-arrangement of the statements will produce more efficient code.
[cpp]
for(iX = 0; iX <= (iRight - iLeft); iX++){
// Get SAD for block pair
row1 = _mm_load_si128((__m128i *) pucC); // first memory fetch
row2 = _mm_loadu_si128((__m128i *)pucR); // second memory fetch (first fetch in-flight)
sum = _mm_xor_si128(sum, sum); // Clear accumulator (first and second fetch in-flight)
sum2 = _mm_xor_si128(sum2, sum2); // Clear accumulator (first and second fetch in-flight)
row3 = _mm_load_si128((__m128i *) (pucC + iWidth)); // third memory fetch (first and second fetch in-flight)
row4 = _mm_loadu_si128((__m128i *)(pucR + iWidth)); // fourth memory fetch (first, second, and third fetch in-flight)
row1 = _mm_sad_epu8(row1, row2); // operation dependent on first and second in-flight memory fetches (hopefully complete by now)
row5 = _mm_load_si128((__m128i *) (pucC + 2*iWidth));
row6 = _mm_loadu_si128((__m128i *)(pucR + 2*iWidth));
row3 = _mm_sad_epu8(row3, row4); // operation dependent on third and fourth in-flight memory fetches (hopefully complete by now)
row7 = _mm_load_si128((__m128i *) (pucC + 3*iWidth));
row8 = _mm_loadu_si128((__m128i *)(pucR + 3*iWidth));
sum = _mm_add_epi16(sum, row1);
sum2 = _mm_add_epi16(sum2, row3);
row5 = _mm_sad_epu8(row5, row6);
row7 = _mm_sad_epu8(row7, row8);
row1 = _mm_load_si128((__m128i *) (pucC + 4*iWidth));
sum = _mm_add_epi16(sum, row5);
row2 = _mm_loadu_si128((__m128i *)(pucR + 4*iWidth));
sum2 = _mm_add_epi16(sum2, row7);
row3 = _mm_load_si128((__m128i *) (pucC + 5*iWidth));
row4 = _mm_loadu_si128((__m128i *)(pucR + 5*iWidth));
row1 = _mm_sad_epu8(row1, row2);
row5 = _mm_load_si128((__m128i *) (pucC + 6*iWidth));
row6 = _mm_loadu_si128((__m128i *)(pucR + 6*iWidth));
row3 = _mm_sad_epu8(row3, row4);
row7 = _mm_load_si128((__m128i *) (pucC + 7*iWidth));
row8 = _mm_loadu_si128((__m128i *)(pucR + 7*iWidth));
sum = _mm_add_epi16(sum, row1);
sum2 = _mm_add_epi16(sum2, row3);
row5 = _mm_sad_epu8(row5, row6);
row7 = _mm_sad_epu8(row7, row8);
sum = _mm_add_epi16(sum, row5);
sum2 = _mm_add_epi16(sum2, row7);
...
[/cpp]
The above is a (partial) rewrite of Sergey's sample program. The important point is that the processor can have multiple memory fetches in flight at a time, and as long as you do not use the result of a fetch while it is still in flight, the processor won't stall. In other words, you want to fill the fetch latencies with other useful (non-fetch-dependent) instructions.
Jim Dempsey
Hi Jim and Sergey,
Thank you very much for your code examples. I will think about them, write the code, and check the performance.
best regards
adrian
I changed my code, and it looks like I got a slight performance boost. I'm trying to avoid calculations that have to "wait" for data by filling the "holes" with something else (like other calculations or pointer increments).
// load row1
s0 = _mm_loadu_si128((__m128i*)left_ptr);
s1 = _mm_loadu_si128((__m128i*)right_ptr);
s2 = _mm_loadu_si128((__m128i*)right_shift_ptr);
s3 = _mm_setzero_si128();
s4 = _mm_setzero_si128();
s5 = _mm_setzero_si128();
s6 = _mm_setzero_si128();
// 1-2
left_ptr+=const;
right_ptr+=const;
right_shift_ptr+=const;
// load row2
s8 = _mm_loadu_si128((__m128i*)left_ptr);
s9 = _mm_loadu_si128((__m128i*)right_ptr);
s10 = _mm_loadu_si128((__m128i*)right_shift_ptr);
s11 = _mm_setzero_si128();
s12 = _mm_setzero_si128();
s13 = _mm_setzero_si128();
s14 = _mm_setzero_si128();
//calc row1
s3 = _mm_adds_epu16(s3, _mm_mpsadbw_epu8(s1, s0, 0));
s4 = _mm_adds_epu16(s4, _mm_mpsadbw_epu8(s1, s0, 5));
s5 = _mm_adds_epu16(s5, _mm_mpsadbw_epu8(s2, s0, 2));
s6 = _mm_adds_epu16(s6, _mm_mpsadbw_epu8(s2, s0, 7));
//2-3
left_ptr+=const;
right_ptr+=const;
right_shift_ptr+=const;
// load row3
s0 = _mm_loadu_si128((__m128i*)left_ptr);
s1 = _mm_loadu_si128((__m128i*)right_ptr);
s2 = _mm_loadu_si128((__m128i*)right_shift_ptr);
//calc row2
s11 = _mm_adds_epu16(s11, _mm_mpsadbw_epu8(s9, s8, 0));
s12 = _mm_adds_epu16(s12, _mm_mpsadbw_epu8(s9, s8, 5));
s13 = _mm_adds_epu16(s13, _mm_mpsadbw_epu8(s10, s8, 2));
s14 = _mm_adds_epu16(s14, _mm_mpsadbw_epu8(s10, s8, 7));
//3-4
left_ptr+=const;
right_ptr+=const;
right_shift_ptr+=const;
// load row4
s8 = _mm_loadu_si128((__m128i*)left_ptr);
s9 = _mm_loadu_si128((__m128i*)right_ptr);
s10 = _mm_loadu_si128((__m128i*)right_shift_ptr);
//calc row3
s3 = _mm_adds_epu16(s3, _mm_mpsadbw_epu8(s1, s0, 0));
s4 = _mm_adds_epu16(s4, _mm_mpsadbw_epu8(s1, s0, 5));
s5 = _mm_adds_epu16(s5, _mm_mpsadbw_epu8(s2, s0, 2));
s6 = _mm_adds_epu16(s6, _mm_mpsadbw_epu8(s2, s0, 7));
//4-5
left_ptr+=const;
right_ptr+=const;
right_shift_ptr+=const;
//load row5
s0 = _mm_loadu_si128((__m128i*)left_ptr);
s1 = _mm_loadu_si128((__m128i*)right_ptr);
s2 = _mm_loadu_si128((__m128i*)right_shift_ptr);
// calc row4
s11 = _mm_adds_epu16(s11, _mm_mpsadbw_epu8(s9, s8, 0));
s12 = _mm_adds_epu16(s12, _mm_mpsadbw_epu8(s9, s8, 5));
s13 = _mm_adds_epu16(s13, _mm_mpsadbw_epu8(s10, s8, 2));
s14 = _mm_adds_epu16(s14, _mm_mpsadbw_epu8(s10, s8, 7));
//5-6
left_ptr+=const;
right_ptr+=const;
right_shift_ptr+=const;
//load row6
s8 = _mm_loadu_si128((__m128i*)left_ptr);
s9 = _mm_loadu_si128((__m128i*)right_ptr);
s10 = _mm_loadu_si128((__m128i*)right_shift_ptr);
//calc row5
s3 = _mm_adds_epu16(s3, _mm_mpsadbw_epu8(s1, s0, 0));
s4 = _mm_adds_epu16(s4, _mm_mpsadbw_epu8(s1, s0, 5));
s5 = _mm_adds_epu16(s5, _mm_mpsadbw_epu8(s2, s0, 2));
s6 = _mm_adds_epu16(s6, _mm_mpsadbw_epu8(s2, s0, 7));
//6-7
left_ptr+=const;
right_ptr+=const;
right_shift_ptr+=const;
// load row7
s0 = _mm_loadu_si128((__m128i*)left_ptr);
s1 = _mm_loadu_si128((__m128i*)right_ptr);
s2 = _mm_loadu_si128((__m128i*)right_shift_ptr);
//calc row 6
s11 = _mm_adds_epu16(s11, _mm_mpsadbw_epu8(s9, s8, 0));
s12 = _mm_adds_epu16(s12, _mm_mpsadbw_epu8(s9, s8, 5));
s13 = _mm_adds_epu16(s13, _mm_mpsadbw_epu8(s10, s8, 2));
s14 = _mm_adds_epu16(s14, _mm_mpsadbw_epu8(s10, s8, 7));
//7-8
left_ptr+=const;
right_ptr+=const;
right_shift_ptr+=const;
// load row8
s8 = _mm_loadu_si128((__m128i*)left_ptr);
s9 = _mm_loadu_si128((__m128i*)right_ptr);
s10 = _mm_loadu_si128((__m128i*)right_shift_ptr);
// calc row7
s3 = _mm_adds_epu16(s3, _mm_mpsadbw_epu8(s1, s0, 0));
s4 = _mm_adds_epu16(s4, _mm_mpsadbw_epu8(s1, s0, 5));
s5 = _mm_adds_epu16(s5, _mm_mpsadbw_epu8(s2, s0, 2));
s6 = _mm_adds_epu16(s6, _mm_mpsadbw_epu8(s2, s0, 7));
//calc row8
s11 = _mm_adds_epu16(s11, _mm_mpsadbw_epu8(s9, s8, 0));
s12 = _mm_adds_epu16(s12, _mm_mpsadbw_epu8(s9, s8, 5));
s13 = _mm_adds_epu16(s13, _mm_mpsadbw_epu8(s10, s8, 2));
s14 = _mm_adds_epu16(s14, _mm_mpsadbw_epu8(s10, s8, 7));
thanks again and best regards
adrian
Adrian,
If you are running a 32-bit build, reduce the number of "registerized" variables (s0 to s14).
If you are running a 64-bit build, examine the disassembly of the Release build by inserting a breakpoint and then opening the disassembly window. ICC does not let you generate an .ASM listing file for a Release build.
Examine the code to make sure that all your s0-s14 variables are indeed in registers.
Note that although the x64 CPU has 16 XMM registers, your intrinsics will generally require some scratch (unused) registers.
s14 = _mm_adds_epu16(s14, _mm_mpsadbw_epu8(s10, s8, 7));
The mpsadbw will require a temporary register.
Even in the event that xmm15 is available
s13 = _mm_adds_epu16(s13, _mm_mpsadbw_epu8(s10, s8, 2));
s14 = _mm_adds_epu16(s14, _mm_mpsadbw_epu8(s10, s8, 7));
would likely stall waiting for the adds to finish.
You may need to free up some of these variables so you have more unused SSE registers (so the compiler can do its magic).
Jim Dempsey
You should take into account that the compiler usually does a great job of reordering instructions and allocating registers in the most efficient way. I mean, it doesn't hurt to arrange your code that way, but don't expect that if you don't, the resulting asm will carry out expressions the way you wrote them in the high-level language. For example, the compiler is free to map s0-s2 and s8-s10 to the same three xmm registers, since their values' lifetimes don't overlap. It can also reorder _mm_adds_epu16 and _mm_mpsadbw_epu8 to reduce dependency stalls. On the other hand, the compiler may need to insert some other instructions, e.g. to perform operations in a non-destructive manner.
My point is that you should generally inspect the code generated by your compiler to see what is actually executed and to find bottlenecks. Then, if you find any, try helping the compiler by adjusting your code. You may fail to do so if the compiler is stubborn enough to ignore your attempts; in those cases, good old asm is the answer.
adrian s. wrote:
I changed my code, and it looks like I got a slight performance boost. I'm trying to avoid calculations that have to "wait" for data by filling the "holes" with something else (like other calculations or pointer increments).
This is generally not an effective optimization method (a lot of effort for very low or no speedup), since the out-of-order execution engine in modern cores already does a very good job of "filling the holes", as you say (not to mention that with two threads running per core this kind of reasoning no longer holds). Also, the compiler may decide on a completely different schedule than the one in your source code; you can often see that the ASM dumps don't change even if you reorder the source, since the compiler still uses the same (optimal per its heuristics) register allocation and scheduling order.
adrian s. wrote:
[adrian's hand-unrolled 8-row kernel from the previous post, quoted in full; snipped here]
If you use the Intel compiler, I encourage you to try the #pragma unroll(N) notation. It will allow you to rewrite this as a simple loop (8x shorter), and I'm quite sure you will get the same level of performance without all this duplicated code. It will be faster to write and a lot easier to maintain, and maintenance may include new optimizations that you can't afford to try with the copy-and-paste version. The end result is a win-win: faster code overall and more code produced.
I would also suggest replacing the three incremented pointers with a single induction variable, as in:
[cpp]
s8 = _mm_loadu_si128((__m128i*)left_ptr + i);
s9 = _mm_loadu_si128((__m128i*)right_ptr + i);
s10 = _mm_loadu_si128((__m128i*)right_shift_ptr + i);
i += increment; // and I wonder how you can use "const" as a variable name?
[/cpp]
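As a rough sketch of the suggestion above (variable names from adrian's post; `stride` stands in for his `const` increment, the row pointers are assumed to be unsigned char*, and the second accumulator set s11-s14 is folded into s3-s6 purely for brevity):
[cpp]
// Hypothetical rewrite of the 8-row kernel as one plain loop the compiler can unroll.
#pragma unroll(8)   // Intel compiler unrolling hint
for (int row = 0; row < 8; ++row) {
    __m128i s0 = _mm_loadu_si128((const __m128i*)(left_ptr        + row * stride));
    __m128i s1 = _mm_loadu_si128((const __m128i*)(right_ptr       + row * stride));
    __m128i s2 = _mm_loadu_si128((const __m128i*)(right_shift_ptr + row * stride));
    s3 = _mm_adds_epu16(s3, _mm_mpsadbw_epu8(s1, s0, 0));
    s4 = _mm_adds_epu16(s4, _mm_mpsadbw_epu8(s1, s0, 5));
    s5 = _mm_adds_epu16(s5, _mm_mpsadbw_epu8(s2, s0, 2));
    s6 = _mm_adds_epu16(s6, _mm_mpsadbw_epu8(s2, s0, 7));
}
[/cpp]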
As bronxv indicated, compilers often generate better unrolled code if you use their options for that purpose. This should make it easier to find an optimum. Besides the icc #pragma unroll() you have options such as gcc -funroll-loops --param max-unroll-times=4 (since unrolling by 4 may be enough, unless you have a very early Core 2 CPU which may not have a fully effective Loop Stream Detector). The micro-op cache on newer CPUs also smooths out performance artifacts associated with the degree of unrolling. If you are benchmarking on an out-of-production CPU, you should weigh the benefit of small performance increments which won't apply to a current one. This forum is about optimization on more recent CPUs.
The unroll factor as used by both Intel and gnu compilers is on top of the parallelism implied by the simd register width.
In 32-bit mode, it's particularly important to minimize the number of loop-carried pointers, and it could help compiler optimization if you explicitly show that all memory accesses can be done with invariant offsets from a single pointer.
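A hedged sketch of that last point, using the pointers from adrian's code and assuming right_ptr and right_shift_ptr point into the same image buffer (so their distance is a well-defined, loop-invariant offset):
[cpp]
// One loop-carried base pointer for the right image; the shifted row is addressed
// with a fixed offset computed once, outside the loop (ptrdiff_t from <cstddef>).
const ptrdiff_t shift_off = right_shift_ptr - right_ptr;   // loop-invariant
for (int row = 0; row < 8; ++row) {
    __m128i r  = _mm_loadu_si128((const __m128i*)(right_ptr             + row * stride));
    __m128i rs = _mm_loadu_si128((const __m128i*)(right_ptr + shift_off + row * stride));
    /* ... SAD accumulation as before ... */
}
[/cpp]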
As to using more than the minimum number of registers, the hardware will take care of it by register renaming, possibly more efficiently than you can do in source code, as long as you don't have implicit partial register dependence.
Sergey Kostrov wrote:
Just a couple of days ago I had some issues with unrolling, and I strongly recommend testing as thoroughly as possible. In my case there was a conflict between a piece of code with manual unrolling (N-in-1), a #pragma unroll(N) directive, and the command-line option /Qunroll:N (all three were used!). So use just one method and don't mix them, since the negative performance impact can be greater than ~5%.
Good point; that's why I advise resorting to manual unrolling as little as possible. Once you have a manually unrolled loop, it's much more difficult to experiment with new optimizations, including extra compiler unrolling. Manual unrolling is pretty much like freezing the source code: in case of refactoring you must first get rid of the manual unrolling and rewrite it as a vanilla loop, then, if you have the time, redo the manual unrolling. That's not manageable for a big project with hundreds of loops.
As you say, testing is key for top performance. The value of N in #pragma unroll(N) can have a big impact (from a 10% slowdown to more than a 20% speedup in some cases I have experimented with), particularly on short loops. Typical optimal values in my cases are N=4 and N=8, but sometimes N=3. This is easy to test with the assistance of the compiler but would be a nightmare to experiment with via manual unrolling, so manual unrolling in practice is not only much more painful but will generally not use the optimal unrolling factor: a lose-lose situation.
Note that, AFAIK, a local #pragma unroll(N) always takes precedence over the global /Qunroll setting.