Valued Contributor II

Throughput of binary codes generated by Intel C++ compiler is 42% slower

28 Replies
Valued Contributor II

[ Abstract ]

Throughput of code generated by the Intel C++ compiler is 42% lower than throughput of code generated by the MinGW, Microsoft and legacy Watcom C++ compilers. This is because the Intel C++ compiler re-ordered a sequence of instructions, wrongly assuming that this would improve performance through pipelining in the processing unit. Intensive testing of some C code showed that the re-ordering of instructions did not improve performance.

Modern C++ compilers do lots of optimizations "behind the scenes", but C/C++ software engineers should have greater control over all actions related to re-ordering of generated instructions, that is, binary code.

I would consider three options:
- A warning message at the /W5 level ( not at other levels ) displayed when there is a re-ordering
- Introduction of a '#pragma no-reordering' directive for a piece of critical code to prevent any re-ordering
- A command line compiler option similar to the Watcom C++ compiler option '-or' ( Re-order instructions to avoid stalls )
Valued Contributor II

[ Test Case - CrtClflush ]

...
RTint piAddress[10][16] =
{
    { 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11 }, // 0
    { 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22 }, // 1
    { 0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0x33 }, // 2
    { 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44 }, // 3
    { 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77 }, // 4
    { 0x88, 0x88, 0x88, 0x88, 0x88, 0x88, 0x88, 0x88, 0x88, 0x88, 0x88, 0x88, 0x88, 0x88, 0x88, 0x88 }, // 5
    { 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44 }, // 6
    { 0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0x33 }, // 7
    { 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22 }, // 8
    { 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11 }, // 9
};

IrtClflush( &piAddress[0][0] );
CrtClflush( &piAddress[1][0] );

CrtSetThreadPriority( THREADPRIORITY_REALTIME );

CrtPrefetchData( ( RTchar * )&piAddress[0][0] );    // All prefetches are T0-type
CrtPrefetchData( ( RTchar * )&piAddress[1][0] );
CrtPrefetchData( ( RTchar * )&piAddress[2][0] );
CrtPrefetchData( ( RTchar * )&piAddress[3][0] );
CrtPrefetchData( ( RTchar * )&piAddress[4][0] );
CrtPrefetchData( ( RTchar * )&piAddress[5][0] );
CrtPrefetchData( ( RTchar * )&piAddress[6][0] );
CrtPrefetchData( ( RTchar * )&piAddress[7][0] );
CrtPrefetchData( ( RTchar * )&piAddress[8][0] );
CrtPrefetchData( ( RTchar * )&piAddress[9][0] );

RTuint64 uiClock1 = CrtRdtsc();
CrtClflush( &piAddress[0][0] );
CrtClflush( &piAddress[1][0] );
CrtClflush( &piAddress[2][0] );
CrtClflush( &piAddress[3][0] );
CrtClflush( &piAddress[4][0] );
CrtClflush( &piAddress[5][0] );
CrtClflush( &piAddress[6][0] );
CrtClflush( &piAddress[7][0] );
CrtClflush( &piAddress[8][0] );
CrtClflush( &piAddress[9][0] );
RTuint64 uiClock2 = CrtRdtsc();

CrtPrintf( RTU("[ CrtClflush ] - Executed in %u clock cycles\n"), ( RTuint )( uiClock2 - uiClock1 ) / 10 );

CrtSetThreadPriority( THREADPRIORITY_NORMAL );

CrtPrintf( RTU("IrtClflush & CrtClflush\n") );
...
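The Crt* wrappers are not shown in the thread. As a rough sketch only, assuming they map onto the standard x86 intrinsics (`_mm_clflush`, `_mm_prefetch` with `_MM_HINT_T0`, and `__rdtsc`), the measurement could look like the following. The `sketch_*` names and `cycles_per_flush` are hypothetical, and the flushes here run in a loop rather than being unrolled as in the test case, so the loop itself adds some overhead to the result.

```c
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>   /* _mm_clflush (SSE2) */
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */
#if defined(_MSC_VER)
#include <intrin.h>      /* __rdtsc */
#else
#include <x86intrin.h>   /* __rdtsc */
#endif

/* Hypothetical stand-ins for the thread's Crt* wrappers (assumption:
   the real wrappers are thin layers over these intrinsics). */
static inline void sketch_clflush(const void *p)
{
    _mm_clflush(p);
}

static inline void sketch_prefetch_t0(const void *p)
{
    _mm_prefetch((const char *)p, _MM_HINT_T0);
}

static inline uint64_t sketch_rdtsc(void)
{
    return __rdtsc();
}

/* Prefetch `count` lines spaced `stride` bytes apart, then time the
   flushes and return the average number of TSC ticks per flush. */
uint64_t cycles_per_flush(const void *base, size_t stride, size_t count)
{
    const char *p = (const char *)base;
    size_t i;
    uint64_t t1, t2;

    if (count == 0)
        return 0;

    for (i = 0; i < count; ++i)
        sketch_prefetch_t0(p + i * stride);

    t1 = sketch_rdtsc();
    for (i = 0; i < count; ++i)
        sketch_clflush(p + i * stride);
    t2 = sketch_rdtsc();

    return (t2 - t1) / count;
}
```

For an array declared like `piAddress[10][16]`, a call would be `cycles_per_flush(piAddress, sizeof piAddress[0], 10)`. Flushing a line does not change the data it holds; it only evicts it from the cache hierarchy.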
Valued Contributor II

[ Run-Time testing - Extended Tracing - No ]

[ Intel C++ compiler ]
...
[ CrtClflush ] - Executed in 20 clock cycles
[ CrtClflush ] - Executed in 23 clock cycles
[ CrtClflush ] - Executed in 24 clock cycles
[ CrtClflush ] - Executed in 24 clock cycles
[ CrtClflush ] - Executed in 20 clock cycles
[ CrtClflush ] - Executed in 19 clock cycles
[ CrtClflush ] - Executed in 19 clock cycles
[ CrtClflush ] - Executed in 22 clock cycles
[ CrtClflush ] - Executed in 19 clock cycles
[ CrtClflush ] - Executed in 18 clock cycles
...

The question is why it is slower than with the Microsoft or Watcom C++ compilers. Here is the generated binary code:

...
0040365C rdtsc
0040365E clflush [ebp-8B8h]
00403665 mov ecx, eax
00403667 clflush [ebp-878h]
0040366E clflush [ebp-838h]
00403675 clflush [ebp-7F8h]
0040367C clflush [ebp-7B8h]
00403683 clflush [ebp-778h]
0040368A clflush [ebp-738h]
00403691 clflush [ebp-6F8h]
00403698 clflush [ebp-6B8h]
0040369F clflush [ebp-678h]
004036A6 rdtsc
...

1. The Intel C++ compiler re-ordered the sequence of instructions.
2. 'mov ecx, eax' is placed after the 1st 'clflush [ebp-8B8h]' in order to save the value returned by 'RDTSC' in the 'eax' general purpose register.
3. It is possible that pipelining is affected ( Very Likely! ).
4. Take a look at the perfectly generated binary code from the Watcom C++ compiler ( see below ).
5. Almost the same re-ordering is done by the Microsoft C++ compiler.
Valued Contributor II

[ Run-Time testing - Extended Tracing - No ]

[ MinGW C++ compiler ]
...
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
...

Here is the generated binary code:

...
0040265B rdtsc
0040265D mov esi, eax
0040265F clflush [ebp-2B8h]
00402666 clflush [ebp-278h]
0040266D clflush [ebp-238h]
00402674 clflush [ebp-1F8h]
0040267B clflush [ebp-1B8h]
00402682 clflush [ebp-178h]
00402689 clflush [ebp-138h]
00402690 clflush [ebp-0F8h]
00402697 clflush [ebp-0B8h]
0040269E clflush [ebp-78h]
004026A2 rdtsc
...

Perfect binary code generation.
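The thread does not say how the 42% in the title was computed. One plausible reading, sketched below, takes the mean of the ten Intel samples (20.8 cycles) against the 12-cycle figure and treats throughput as proportional to 1/cycles. The function names are illustrative only.

```c
#include <stddef.h>

/* Mean of n per-instruction cycle counts. */
double mean_cycles(const unsigned *samples, size_t n)
{
    double sum = 0.0;
    size_t i;

    for (i = 0; i < n; ++i)
        sum += samples[i];
    return n ? sum / (double)n : 0.0;
}

/* Throughput is proportional to 1/cycles, so the relative throughput
   loss of the slow case versus the fast case is 1 - fast/slow. */
double throughput_loss_percent(double slow_cycles, double fast_cycles)
{
    return (1.0 - fast_cycles / slow_cycles) * 100.0;
}

/* The same gap expressed as extra cycles per instruction,
   relative to the fast case. */
double extra_cycles_percent(double slow_cycles, double fast_cycles)
{
    return (slow_cycles / fast_cycles - 1.0) * 100.0;
}
```

With the Intel samples 20, 23, 24, 24, 20, 19, 19, 22, 19, 18 the mean is 20.8 cycles, which gives a throughput loss of about 42.3% against 12 cycles. Stated the other way round, each CLFLUSH takes about 73% more cycles, so the two percentages should not be confused.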
Valued Contributor II

[ Run-Time testing - Extended Tracing - No ]

[ Watcom C++ compiler v2.0 ( aka Open Watcom ) ]
...
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
...

Here is the generated binary code:

...
00403791 rdtsc
00403793 mov ecx, eax
00403795 lea eax, [ebp-8AEh]
0040379B clflush [eax]
0040379E lea eax, [ebp-86Eh]
004037A4 clflush [eax]
004037A7 lea eax, [ebp-82Eh]
004037AD clflush [eax]
004037B0 lea eax, [ebp-7EEh]
004037B6 clflush [eax]
004037B9 lea eax, [ebp-7AEh]
004037BF clflush [eax]
004037C2 lea eax, [ebp-76Eh]
004037C8 clflush [eax]
004037CB lea eax, [ebp-72Eh]
004037D1 clflush [eax]
004037D4 lea eax, [ebp-6EEh]
004037DA clflush [eax]
004037DD lea eax, [ebp-6AEh]
004037E3 clflush [eax]
004037E6 lea eax, [ebp-66Eh]
004037EC clflush [eax]
004037EF rdtsc
...

Perfect binary code generation.
Valued Contributor II

[ Run-Time testing - Extended Tracing - No ]

[ Microsoft C++ compiler ]
...
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
...

Here is the generated binary code:

...
00244486 rdtsc
00244488 clflush [ebp-300h]
0024448F clflush [ebp-240h]
00244496 clflush [ebp-180h]
0024449D mov dword ptr [ebp-48h], eax
002444A0 clflush [ebp-340h]
002444A7 clflush [ebp-280h]
002444AE clflush [ebp-1C0h]
002444B5 clflush [ebp-100h]
002444BC mov dword ptr [ebp-44h], edx
002444BF clflush [ebp-2C0h]
002444C6 clflush [ebp-200h]
002444CD clflush [ebp-140h]
002444D4 rdtsc
...

1. Even though the Microsoft C++ compiler also re-ordered the binary code, this does not affect the performance and throughput of a block of code with a set of CLFLUSH instructions. Performance and throughput were as expected, that is, 12 clock cycles per instruction executed.
2. Also, the Microsoft C++ compiler stores the values from the general purpose registers EDX and EAX, that is, the number of clock cycles returned by the RDTSC instruction, to a variable created on the stack. For comparison, the Intel C++ compiler copies the value in the EAX register to the ECX register. Theoretically that should be faster than storing the value to RAM, that is, to a variable created on the stack. In reality, the test results show the opposite.
Employee

Hi, Sergey

What's your targeting processor? And the optimization option you used? Is it default -mSSE2?

Besides the clflush cycles, have you counted the rdtsc latency?

Compiler optimization may re-order instructions based on instruction latency/throughput targeting different micro-architecture.

Thanks.


Sergey,

Comparing #4 and #5. Regardless of the instruction reorder, I notice that subtracting the address of the first rdtsc from the second produces an instruction byte count (hex) of 0x4A for the Intel, and 0x47 for the MinGW. IOW there are 3 extra bytes not accounted for.

There is an option to display the instruction byte codes, can you enable that?

Also, there may be a minor flaw in your test program. Prior to your first rdtsc, you are issuing a series of prefetches. From your code, it is not clear as to:

a) if an alignment issue causes the array to spill over an extra cache line in one scenario and not the other(s).
b) (possibly more important) if the prefetches are still in flight when the clflush is issued in one scenario and not the other.

As for b) I suggest you manipulate the prefetched data in a manner that assures the data has reached L1 before you start your timed run of clflushes.

Jim Dempsey

Valued Contributor II

>>...there are 3 extra bytes not accounted for...

There is no problem here because clflush [ebp-offset] is a 7-byte instruction.

[ Intel CLFLUSH instruction Opcodes ]

0F AE 38................clflush [eax]
0F AE 3B................clflush [ebx]
0F AE 39................clflush [ecx]
0F AE 3A................clflush [edx]
0F AE BD [offset].....clflush [ebp-offset]

where [offset] is 'XX XX XX XX' ( 4 more bytes ).
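The 3-byte difference can be reconstructed from the addresses in the two listings. This is a sketch based on instruction lengths read off those addresses: RDTSC and the register-to-register MOV take 2 bytes each, CLFLUSH with a 32-bit displacement takes 7, and the last MinGW instruction, `clflush [ebp-78h]`, occupies only 4 bytes (0040269E to 004026A2), which suggests an 8-bit displacement encoding. The enum and function names are made up for illustration.

```c
/* Instruction lengths in bytes, as read off the listing addresses. */
enum {
    LEN_RDTSC          = 2, /* 0F 31 */
    LEN_MOV_REG_REG    = 2, /* e.g. mov ecx, eax */
    LEN_CLFLUSH_DISP32 = 7, /* 0F AE BD + 4-byte displacement */
    LEN_CLFLUSH_DISP8  = 4  /* 0F AE 7D + 1-byte displacement */
};

/* Distance in bytes from the first RDTSC to the second RDTSC:
   the first RDTSC itself, one MOV, and the CLFLUSH instructions. */
unsigned rdtsc_to_rdtsc_bytes(unsigned n_disp32, unsigned n_disp8)
{
    return LEN_RDTSC + LEN_MOV_REG_REG
         + n_disp32 * LEN_CLFLUSH_DISP32
         + n_disp8 * LEN_CLFLUSH_DISP8;
}
```

Ten 32-bit-displacement flushes give 74 bytes (0x4A, the Intel listing), while nine of them plus one 8-bit-displacement flush give 71 bytes (0x47, the MinGW listing). An offset of -78h fits in a signed byte and -0B8h does not, so the 3 bytes would come from the shorter encoding of that one instruction rather than from any extra instruction.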
Valued Contributor II

[ A workaround ]

The problem has two parts, that is:
- The RDTSC instruction was not aligned on a 16-byte boundary in the Intel C++ compiler's output
- Pipelining of a series of CLFLUSH instructions is affected when a MOV instruction is inserted after the 1st CLFLUSH instruction

By the way, the Watcom C++ compiler's binary code is not aligned on a 16-byte boundary and it doesn't have any problems!

So, I decided to use a workaround by forcing an alignment on a 16-byte boundary ( _DEFAULT_CODEALIGN16 is a macro based on the _asm ALIGN 16 assembler directive ).

...
[ Codes Skipped ]
...
_DEFAULT_CODEALIGN16;
RTuint64 uiClock1 = CrtRdtsc();
CrtClflush( &piAddress[0][0] );
CrtClflush( &piAddress[1][0] );
CrtClflush( &piAddress[2][0] );
CrtClflush( &piAddress[3][0] );
CrtClflush( &piAddress[4][0] );
CrtClflush( &piAddress[5][0] );
CrtClflush( &piAddress[6][0] );
CrtClflush( &piAddress[7][0] );
CrtClflush( &piAddress[8][0] );
CrtClflush( &piAddress[9][0] );
RTuint64 uiClock2 = CrtRdtsc();
...

Here are the statistics for the memory address of the 1st RDTSC instruction:

MSC - 00244490 % 0x10 = 0 - Aligned on 16-byte boundary? - Yes ( forced by _DEFAULT_CODEALIGN16 )
ICC - 00403660 % 0x10 = 0 - Aligned on 16-byte boundary? - Yes ( forced by _DEFAULT_CODEALIGN16 )
MGW - 00402490 % 0x10 = 0 - Aligned on 16-byte boundary? - Yes ( forced by _DEFAULT_CODEALIGN16 )
BCC - 0040417A % 0x10 = 10 - Aligned on 16-byte boundary? - No ( workaround is N/A )
WCC - 00403791 % 0x10 = 1 - Aligned on 16-byte boundary? - No ( workaround is N/A )

[ Run-Time testing - Extended Tracing - No ]

[ Intel C++ compiler - AFTER Alignment was applied! ]
...
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
...

This is what I wanted to see before I move on to another task.
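The alignment figures above can be checked mechanically: the remainder modulo 0x10 is simply the last hexadecimal digit of the address. A minimal sketch, with illustrative function names:

```c
#include <stdint.h>

/* Offset of an address within its 16-byte block (0..15). */
unsigned offset_in_16(uint32_t address)
{
    return (unsigned)(address & 0xFu);
}

/* Non-zero when the address sits on a 16-byte boundary. */
int is_aligned_16(uint32_t address)
{
    return (address & 0xFu) == 0;
}
```

For the addresses listed, this gives 0 for 0x00244490, 0x00403660 and 0x00402490, 10 (0xA) for 0x0040417A, and 1 for 0x00403791.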
Valued Contributor II

[ Feature Request - Re-ordering statistics and control ]

- A warning message at the /W5 level ( not at other levels ) displayed when there is a re-ordering
- Introduction of a '#pragma no-reordering' directive for a piece of critical code to prevent any re-ordering
- A command line compiler option to control re-ordering of instructions ( similar to the Watcom C++ compiler option '-or', Re-order instructions to avoid stalls )
Valued Contributor II

[ MinGW C++ compiler command line options ]

For example, these are the command line options for different types of re-ordering supported by the MinGW C++ compiler:

...
-Wreorder - Warn when the compiler reorders code ( in GCC this refers to the order of C++ member initializers, not to instruction scheduling ).
-freorder-blocks - Reorder basic blocks to improve code placement.
-freorder-blocks-algorithm= - -freorder-blocks-algorithm=[simple|stc] Set the used basic block reordering algorithm.
-freorder-blocks-and-partition - Reorder basic blocks and partition into hot and cold sections.
-freorder-functions - Reorder functions to improve code placement.
-fprofile-reorder-functions - Enable function reordering that improves code placement.
-ftoplevel-reorder - Reorder top level functions, variables, and asms.
...
Valued Contributor II

Hi Yuan,

>>What's your targeting processor? And the optimization option you used? Is it default -mSSE2?

I consider the problem with re-ordering to be a fundamental one, and we shouldn't have any constraints related to CPUs, ISAs, etc.

>>Besides the clflush cycles, have you counted the rdtsc latency?

No, because it is not needed in this case. Overall, there are three possible cases of using RDTSC ( the same applies to any timing API ):

Case 1:
- One RDTSC instruction is called after some block of code and returns the number of clock cycles when the block has completed processing
- RDTSC's latency could be taken into account by subtracting its overhead ( ~80 clock cycles )

Case 2:
- One RDTSC instruction is called before some block of code and returns the number of clock cycles when the block is about to start processing
- RDTSC's latency could be taken into account by adding its overhead ( ~80 clock cycles )

Case 3:
- Two RDTSC instructions are called, before ( T1 ) and after ( T2 ) some block of code, and the difference ( T2 - T1 ) is the number of clock cycles it took to complete processing
- No corrections need to be taken into account

( Note: the 3rd case is the most common, as you know )

>>Compiler optimization may re-order instructions based on instruction latency/throughput targeting different micro-architecture.

I understand that, but my point is: the Intel C++ compiler should give us greater control in cases similar to mine. If a software engineer has some specs and knows how some processing needs to be done ( its order, number of instructions, estimated number of clock cycles to complete the processing, etc ), then the Intel C++ compiler should not interfere with the software engineer's code. Of course, an implementation in assembler solves all these problems, but it is more time consuming to implement and it breaks the portability of C/C++ source code.
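Whether the RDTSC overhead really drops out of T2 - T1 is disputed later in the thread. One common way to check, sketched below, is to time an empty interval (two back-to-back counter reads) and subtract it from the measured delta. This is a sketch under assumptions, not the method used in the test case: the counter is passed in as a function pointer, and `fake_ticks` and `fake_block` are deterministic stand-ins so the code runs anywhere. On x86 one would pass a small wrapper around `__rdtsc()` instead.

```c
#include <stdint.h>

typedef uint64_t (*tick_fn)(void);   /* hypothetical counter source */
typedef void (*block_fn)(void *ctx); /* code under test */

/* Smallest delta seen between two back-to-back counter reads. */
uint64_t min_read_overhead(tick_fn ticks, int trials)
{
    uint64_t best = UINT64_MAX;
    int i;

    for (i = 0; i < trials; ++i) {
        uint64_t t1 = ticks();
        uint64_t t2 = ticks();
        if (t2 - t1 < best)
            best = t2 - t1;
    }
    return best;
}

/* Case 3: T2 - T1 around the block, minus the calibrated empty
   interval (clamped at zero). */
uint64_t time_block(tick_fn ticks, block_fn block, void *ctx,
                    uint64_t overhead)
{
    uint64_t t1, t2, delta;

    t1 = ticks();
    block(ctx);
    t2 = ticks();
    delta = t2 - t1;
    return delta > overhead ? delta - overhead : 0;
}

/* Deterministic stand-ins (assumption): every counter read costs 80
   ticks, and the block costs the number of ticks stored in *ctx. */
static uint64_t fake_now;

uint64_t fake_ticks(void)
{
    fake_now += 80;
    return fake_now;
}

void fake_block(void *ctx)
{
    fake_now += *(const uint64_t *)ctx;
}
```

With the stand-in counter, a block costing 120 ticks shows a raw delta of 200, and subtracting the calibrated 80-tick empty interval recovers 120. On real hardware the empty interval is not a fixed constant, because RDTSC can overlap with neighbouring instructions, so the calibration is only an estimate.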
Valued Contributor II

>>...And the optimization option you used?

Yuan, for additional technical details please take a look at the thread "Latency and Throughput of Intel CPUs 'clflush' instruction" in the Watercooler-catchall IDZ forum.

>>Is it default -mSSE2?

Yes, it was SSE2 for the Microsoft, Intel and MinGW C++ compilers, and No for the Borland and Watcom C++ compilers ( neither supports SSE2 or even SSE; Watcom has limited support for MMX ).

PS: Personally, I'm very impressed by how the legacy Watcom C++ compiler generated its binary code.
Valued Contributor II

>>...a) if an alignment issue causes the array to spill over an extra cache line in one scenario and not the other(s).

Jim, as you can see, the workaround ( _asm ALIGN 16 before the 1st RDTSC instruction ) solved that problem for the Intel C++ compiler. But, as I found when investigating, there is no 16-byte alignment before the 1st RDTSC instruction in the case of the Watcom C++ compiler, and it was still able to "execute" every CLFLUSH instruction in 12 clock cycles!

Anyway, thanks to everybody who responded.

Sergey,

if you make the first rdtsc located at the end of a cache line (and clflushes begin in next line), .AND. if you place your performance test code in a loop, what is the timing excluding the first trip through the test code? And what is the timing of say the 10'th iteration. IOW after you are assured the code sequence is in the L1 Instruction Cache. Note, code preceding and following the timed interval must not evict the instructions from the L1 Instruction Cache.
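A minimal sketch of the bookkeeping this suggestion implies: collect one timing per loop trip, then summarize only the trips after a warm-up count, so that the first pass (cold instruction cache) does not distort the result. The function names and the idea of storing per-trip timings in an array are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Smallest timing among trials[warmup .. n-1]; 0 if none remain. */
uint64_t min_after_warmup(const uint64_t *trials, size_t n, size_t warmup)
{
    uint64_t best;
    size_t i;

    if (warmup >= n)
        return 0;
    best = trials[warmup];
    for (i = warmup + 1; i < n; ++i)
        if (trials[i] < best)
            best = trials[i];
    return best;
}

/* Mean timing over trials[warmup .. n-1]; 0.0 if none remain. */
double mean_after_warmup(const uint64_t *trials, size_t n, size_t warmup)
{
    double sum = 0.0;
    size_t i;

    if (warmup >= n)
        return 0.0;
    for (i = warmup; i < n; ++i)
        sum += (double)trials[i];
    return sum / (double)(n - warmup);
}
```

Passing `warmup = 1` excludes the first trip through the test code, and the timing of the 10th iteration is simply `trials[9]`.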

Jim Dempsey

Black Belt

I am pretty sure that the overheads of multiple calls to RDTSC can't "cancel out" as suggested in message 14 above (https://software.intel.com/en-us/forums/intel-c-compiler/topic/697062#comment-1885846).   This would require that the first call to RDTSC return the cycle count at the end of its execution, while the second call to RDTSC would have to return the cycle count at the beginning of its execution.  This does not make sense.

I looked at the overlap of RDTSC and RDTSCP instructions with user code in some detail in a new post at https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring...

Valued Contributor II

>>...if you place your performance test code in a loop, what is the timing excluding the first trip through the test code?

Jim, here are the performance results after the Serial-Test-Case was converted to a 10-iteration For-Loop-Test-Case. Performance results from the best to the worst:

[ MinGW C++ compiler ]
...
[ Sub-Test002.21.B - Processing of 10 calls ] - Executed in 120 clock cycles
[ Sub-Test002.21.B - For-Loop Overhead ] - Executed in 84 clock cycles
[ Sub-Test002.21.B - CrtClflush ] - Executed in 3 clock cycles
...

[ Intel C++ compiler ]
...
[ Sub-Test002.21.B - Processing of 10 calls ] - Executed in 196 clock cycles
[ Sub-Test002.21.B - For-Loop Overhead ] - Executed in 152 clock cycles
[ Sub-Test002.21.B - CrtClflush ] - Executed in 4 clock cycles
...

[ Watcom C++ compiler ]
...
[ Sub-Test002.21.B - Processing of 10 calls ] - Executed in 212 clock cycles
[ Sub-Test002.21.B - For-Loop Overhead ] - Executed in 128 clock cycles
[ Sub-Test002.21.B - CrtClflush ] - Executed in 8 clock cycles
...

[ Microsoft C++ compiler ]
...
[ Sub-Test002.21.B - Processing of 10 calls ] - Executed in 192 clock cycles
[ Sub-Test002.21.B - For-Loop Overhead ] - Executed in 88 clock cycles
[ Sub-Test002.21.B - CrtClflush ] - Executed in 10 clock cycles
...

[ Borland C++ compiler ]
...
[ Sub-Test002.21.B - Processing of 10 calls ] - Executed in 964 clock cycles
[ Sub-Test002.21.B - For-Loop Overhead ] - Executed in 264 clock cycles
[ Sub-Test002.21.B - CrtClflush ] - Executed in 70 clock cycles
...

The results are very reproducible, and I see that with the MinGW and Intel C++ compilers I was able to achieve the lower-bound numbers for the CLFLUSH instruction stated by Intel in:

Intel 64 and IA-32 Architectures Optimization Reference Manual
Order Number: 248966-033
June 2016
...
Chapter: INSTRUCTION LATENCY AND THROUGHPUT
Table C-17. General Purpose Instructions ( Page C-17 )
...
CLFLUSH throughputs for different CPUs are ~2 to 50, ~3 to 50, ~3 to 50 and ~5 to 50 clock cycles.
...
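The per-call figures are consistent with subtracting the For-Loop overhead from the total and dividing by the 10 calls using integer division. The thread does not show this computation, so the sketch below is an assumption about how the numbers were derived; the function name is illustrative.

```c
/* Cycles per call after removing the measured loop overhead.
   Integer division truncates, e.g. (120 - 84) / 10 = 3. */
unsigned per_call_cycles(unsigned total, unsigned loop_overhead,
                         unsigned calls)
{
    if (calls == 0 || total < loop_overhead)
        return 0;
    return (total - loop_overhead) / calls;
}
```

Applied to the results above: MinGW (120 - 84) / 10 = 3, Intel (196 - 152) / 10 = 4, Watcom (212 - 128) / 10 = 8, Microsoft (192 - 88) / 10 = 10 and Borland (964 - 264) / 10 = 70. Because of the truncation, the MinGW value is really 3.6 cycles and the Intel value 4.4.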
Valued Contributor II

John,

Taking into account that RDTSC is not a serializing instruction, and that some instructions scheduled for execution after RDTSC are pipelined and could be executed out of order, it does not make sense to take into account a constant overhead of the RDTSC instruction, which is reproducible with an accuracy of +/- 1 clock cycle on the many hardware platforms I've tested.