Hi,
The application I'm trying to optimise makes intensive use of shift-by-CL operations (shifts with a variable count) to pack bits together, for example:
packed_bits = (packed_bits << n) | new_bits;
(where n is a variable specifying the size of new_bits)
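For reference, the packing step can be sketched as a small C helper. This is only an illustration of the pattern described here: the name pack_bits, the 64-bit accumulator, and the masking of new_bits are assumptions, not details taken from the application.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: append the low n bits of new_bits to packed_bits.
   n must be in 1..63, because shifting a 64-bit value by 64 or more
   is undefined behaviour in C. */
static inline uint64_t pack_bits(uint64_t packed_bits, uint64_t new_bits, unsigned n)
{
    assert(n >= 1 && n < 64);
    /* Assumption: mask defensively so stray high bits of new_bits
       cannot corrupt the bits that were packed earlier. */
    uint64_t mask = (UINT64_C(1) << n) - 1;
    /* Variable-count shift: this is what typically compiles to shl reg, cl. */
    return (packed_bits << n) | (new_bits & mask);
}
```

Calling pack_bits(5, 1, 2) shifts 5 left by two positions and merges in 1, giving 21.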
The target platform is Sandy Bridge. VTune reports a high number of Flags Merge Stalls, which is consistent with the description on this page: http://software.intel.com/sites/products/documentation/doclib/iss/2013/amplifier/lin/ug_docs/GUID-02B56687-59FA-4C7B-8697-5106FB705ECD.htm
Avoiding shift-by-CL operations altogether sounds like quite a big limitation. I was wondering whether anyone could suggest a way to work around this issue, considering that the flags are not actually needed for packing bits (i.e. the bits that are shifted out are not relevant).
Thanks.
Another example where Flags Merge Stalls seem to cause an unjustified performance decrease is shown in the attached screenshot. As mentioned before, the expression "(code >> i) & 1" needs no flag management at all; its only purpose is to extract single bits in a specific order.
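Since the screenshot is not reproduced here, the following is a minimal sketch of the kind of loop that contains this expression. The function name, the output buffer, and the most-significant-bit-first order are assumptions made for illustration; the original code may walk the bits differently.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: write the low nbits bits of code into out[],
   most significant bit first. out must have room for nbits entries. */
static void extract_bits_msb_first(uint32_t code, unsigned nbits, uint8_t *out)
{
    assert(nbits <= 32);
    for (unsigned k = 0; k < nbits; ++k) {
        unsigned i = nbits - 1 - k;           /* bit index, counting down */
        out[k] = (uint8_t)((code >> i) & 1);  /* variable-count shift, then mask */
    }
}
```

Only the extracted bit is used; none of the flags produced by the shift are consumed by the C code.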
The SAR instruction moves the last bit shifted out into CF, and the compiler decided to use sar cl followed by a test instruction, which also updates the flags register; hence, I suppose, the so-called Flags Merge Stalls. Agner Fog's instruction tables show that the sar instruction has a latency of 2 cycles and test has a latency of 1 cycle.
According to the optimization manual:
Flags Merge micro-op ratio:
%FLAGS.MERGE.UOP =
100 * PARTIAL_RAT_STALLS.FLAGS_MERGE_UOP_CYCLES /
CPU_CLK_UNHALTED.THREAD;
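As a rough illustration, the ratio above can be evaluated from the two raw event counts as follows. The function name is hypothetical; only the two event names come from the formula.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: percentage of unhalted thread cycles attributed to
   flags-merge micro-ops, following the %FLAGS.MERGE.UOP formula.
   flags_merge_uop_cycles: PARTIAL_RAT_STALLS.FLAGS_MERGE_UOP_CYCLES count
   clk_unhalted_thread:    CPU_CLK_UNHALTED.THREAD count */
static double flags_merge_uop_percent(uint64_t flags_merge_uop_cycles,
                                      uint64_t clk_unhalted_thread)
{
    assert(clk_unhalted_thread != 0);  /* avoid division by zero */
    return 100.0 * (double)flags_merge_uop_cycles / (double)clk_unhalted_thread;
}
```

With made-up counts of 50 flags-merge cycles out of 1000 unhalted cycles, the function returns 5.0, i.e. 5%.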
Get the event counts from the GE report (set "Show data as" to number) to see how much this affects overall performance.
Try building with Intel C/C++ Composer XE 13.0 using the advanced options "-xHost -O3", then test again with VTune. That's just my opinion.
Another option is to split your statement
packed_bits = (packed_bits << n) | new_bits;
into two statements, which I hope will avoid the RAT stall.
Hi Peter
If I understood correctly, the Flags Merge issue can be caused by issuing two flag-modifying instructions consecutively inside the loop.
Yes, iliyapolak. That is why I hope the Intel C++ compiler can help. We could also change the code to:
packed_bits = (packed_bits << n);
do_others;
packed_bits |= new_bits;
inside the loop, to reduce RAT resource pressure and improve performance.
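A minimal sketch of that split inside a loop might look like the following. The function name, the arrays, and the running bit count standing in for do_others are all hypothetical. Whether the compiler actually keeps other work between the shift and the OR depends on the optimizer, so the generated assembly would need to be checked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: pack count fields into one 64-bit word.
   The caller must ensure each value fits in its width and that the
   widths sum to at most 64 bits. */
static uint64_t pack_fields(const uint32_t *values, const uint8_t *widths,
                            size_t count, unsigned *total_bits)
{
    uint64_t packed_bits = 0;
    unsigned used = 0;
    for (size_t k = 0; k < count; ++k) {
        unsigned n = widths[k];
        assert(n >= 1 && n < 64);
        packed_bits = packed_bits << n;  /* first statement: the shift */
        used += n;                       /* stand-in for do_others: independent work */
        packed_bits |= values[k];        /* second statement: merge the new bits */
    }
    *total_bits = used;
    return packed_bits;
}
```

Packing the values 5 and 1 with widths 3 and 2 yields 21 and a total of 5 bits.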
I recall that Intel C++ Composer has intrinsic functions that implement shifts; you could ask on the Composer forum.
It seems that GCC was used to compile that code. It would be interesting to run the code responsible for the Flags Merge stalls outside the loop, to see whether the stalls arise because sar cl and test are executed inside the loop.