Integration of Open Watcom C++ compiler - details, performance evaluation, etc - Page 4

SergeyKostrov · ‎01-10-2016

*** Integration of Open Watcom C++ compiler - details, performance evaluation, etc ***

Welcome Back, Open Watcom C++ compiler!

At the end of 2015 a decision was made to integrate Open Watcom C++ compiler v1.9 with a project I've been working on since 2009. I used the Watcom C++ compiler in the mid-1990s ( last century! ) and I know how good it is when it comes to optimization of C and C++ code. Honestly, I was concerned about the timing of the integration, that is, the end of the year with Christmas almost knocking at the door ( just two weeks before December 24th ). However, a significant portion of the integration was completed in about 6 hours, and I managed to compile the C/C++ sources and execute some test-cases.

Even though work is still in progress on stabilizing the code and solving some small technical problems, I can say that the legendary Watcom C++ compiler is Not at the top of the list of modern optimizing C/C++ compilers. First of all, version 1.9 is 32-bit only and does Not fully support, or does Not support at all, some of the latest technologies. There is No support for SSE 2.x, SSE 4.x, AVX, AVX2, FMA instructions, OpenMP, Intel intrinsic functions, etc.

But don't be too frustrated: the Open Watcom C++ compiler team is still working, this is an Open Source project now, and I hope that a new version of Open Watcom C++ compiler will be released in the future.

I will follow up later with more technical details and performance evaluation numbers on a set of scientific algorithms. I will demonstrate how good Open Watcom C++ compiler is compared to Borland, MinGW, Microsoft, Intel and Turbo C++ compilers.

SergeyKostrov · ‎01-16-2016

... - OW is ported to 64-bit hosts ( WIN64, Linux X64 ) ...

Linux X64? There is No installer for it. Take a look at:

sourceforge.net/projects/openwatcom/files/current-build

There is only an installer for the Linux 32-bit x86 platform.

SergeyKostrov · ‎01-16-2016

Some time in 2015 I downloaded the 'open-watcom-2_0-f77-win-x86.exe' file, and I don't remember any 64-bit versions on http://sourceforge.net. Just take a look at this:

http://iweb.dl.sourceforge.net/project/openwatcom/current-build/readme.txt

sums.md5............................md5 sums for each file
ow-snapshot.7z......................full binary image, it contains all files for full Fortran and C/C++ installation
open-watcom-2_0-c-win-x86.exe.......C/C++ installer for 32-bit Windows host
open-watcom-2_0-c-win-x64.exe.......C/C++ installer for 64-bit Windows host
open-watcom-2_0-c-os2.exe...........C/C++ installer for OS/2 host
open-watcom-2_0-c-linux-x86.........C/C++ installer for Linux 32-bit x86 host
open-watcom-2_0-c-dos.exe...........C/C++ installer for DOS/16-bit Windows host
open-watcom-2_0-f77-win-x86.exe.....Fortran installer for 32-bit Windows host
open-watcom-2_0-f77-win-x64.exe.....Fortran installer for 64-bit Windows host
open-watcom-2_0-f77-os2.exe.........Fortran installer for OS/2 host
open-watcom-2_0-f77-linux-x86.......Fortran installer for Linux 32-bit x86 host
open-watcom-2_0-f77-dos.exe.........Fortran installer for DOS/16-bit Windows host

SergeyKostrov · ‎01-16-2016

>>>>See also a final statement of the previous post. Application of Vectorization is Not:
>>>>
>>>>'Panacea-Of-All-Performance-Problems'.
>>
>>That is understood as far as data set or specific domain is not easily vectorisable.

That is right, because Vectorization of Codes and Data Intensive Computing are very different things. It looks like Intel has locked itself into that advertising hype about Vectorization and can't move on to something else.

SergeyKostrov · ‎01-16-2016

>>Seems that VS Compiler inserted unconditional jump to CRuntimeSet::RunTest+1C0h (243470h) , I suppose that this branch
>>(not present in Watcom) generated machine code can be the reason for the slower performance of MS Compiler.

The assembler code I posted is based on a simple for-loop ( also posted ) with about 16M iterations. There is No way that any C/C++ compiler would run a for-loop with so many iterations without some kind of jump instruction back to the top of the loop ( in the posted listings it is a conditional JL or JNE, and the Watcom listing has one too ). The check is: has the counter been incremented so that it equals the "exit-value"? If Yes, then exit the for-loop. If the number of iterations is very small, let's say only 12, then a very good optimizing C/C++ compiler would automatically unroll the for-loop into a 12-in-1 Loop-Unrolling-Schema.

I don't have more time for a really detailed analysis of why the Watcom C++ compiler easily outperformed all the other modern C/C++ compilers, but I clearly see that the two multiplication operations ( FMUL ) are done with FPU registers instead of memory variables created on the stack. Just take a look again at all the assembler listings and you will see it.

By the way, Borland does even "worse" code generation because it adds one extra 'mov' instruction, and it looks like it is Not needed at all. MinGW and Intel absolutely failed with optimization based on usage of instructions with XMM registers. Haven't you noticed it?

SergeyKostrov · ‎01-16-2016

>>...why Watcom C++ compiler easily outperformed all the rest modern C/C++ compilers but I clearly see that two multiplication
>>operations ( FMUL ) are done with FPU registers instead of memory variables created on the stack

[ Watcom C++ compiler ]

...
00402AED fst st(2)
00402AEF fmul st, st(2)
00402AF1 fmul st, st(2)
00402AF3 fstp st(1)
...

The core part of the calculations is done with FPU registers!

SergeyKostrov · ‎01-17-2016

The question was asked why the Watcom C++ compiler completed the test-case 20% faster than the modern C++ compilers. So, this is a very simple test-case in C language:

...
for( int t = 0; t < 16777216; t += 1 )
{
    volatile float x = ( float )t;
    volatile float y = x * x * x;
}
...

Description of steps is as follows:

1. Create a for-loop of 16777216 iterations
2. Convert iterator 't' of type 'int' to a variable 'x' of type 'float'
3. Calculate cube of 'x'
4. Store the result in a variable 'y' of type 'float'
5. Increment 't' by 1 and check for exit condition
6. Repeat steps 2, 3, 4, and 5 while 't' is less than 16777216

SergeyKostrov · ‎01-17-2016

Every C++ compiler generated a different set of assembler instructions, but take a look at how Watcom generated the most compact ( only 20 bytes ) and the most efficient code for the test-case ( it does all multiplications using FPU registers and doesn't use the stack for them ). Below is the assembler code generated by the five C++ compilers:

[ Watcom C++ compiler ]

...
00402AE7 mov dword ptr [ebp+72h], eax
00402AEA fild dword ptr [ebp+72h]
00402AED fst st(2)
00402AEF fmul st, st(2)
00402AF1 fmul st, st(2)
00402AF3 fstp st(1)
00402AF5 inc eax
00402AF6 cmp eax, 1000000h
00402AFB jl 00402AE7
...

20 bytes between 0x00402AFB and 0x00402AE7

Note: FPU-based concept of computations proved its efficiency ( ~20% faster ).

[ Microsoft C++ compiler ]

...
00243470 fild dword ptr [ebp-4]
00243473 add eax, 1
00243476 cmp eax, 1000000h
0024347B fstp dword ptr [ebp-4]
0024347E fld dword ptr [ebp-4]
00243481 fmul dword ptr [ebp-4]
00243484 fmul dword ptr [ebp-4]
00243487 fstp dword ptr [ebp-4]
0024348A mov dword ptr [ebp-4], eax
0024348D jl CRuntimeSet::RunTest+1C0h (243470h)
...

29 bytes between 0x0024348D and 0x00243470

[ Borland C++ compiler ]

...
00403044 mov dword ptr [ebp-320h], eax
0040304A fild dword ptr [ebp-320h]
00403050 fstp dword ptr [ebp-0A8h]
00403056 fld dword ptr [ebp-0A8h]
0040305C fmul dword ptr [ebp-0A8h]
00403062 fmul dword ptr [ebp-0A8h]
00403068 fstp dword ptr [ebp-0ACh]
0040306E inc eax
0040306F cmp eax, 1000000h
00403074 jl 00403044
...

48 bytes between 0x00403074 and 0x00403044

[ MinGW C++ compiler ]

...
00401910 pxor xmm2, xmm2
00401914 cvtsi2ss xmm2, esi
00401918 add esi, 1
0040191B cmp esi, 1000000h
00401921 movss dword ptr [ebp-8Ch], xmm2
00401929 movss xmm6, dword ptr [ebp-8Ch]
00401931 movss xmm7, dword ptr [ebp-8Ch]
00401939 mulss xmm6, xmm7
0040193D movss xmm0, dword ptr [ebp-8Ch]
00401945 mulss xmm6, xmm0
00401949 movss dword ptr [ebp-88h], xmm6
00401951 jne _ZN11CRuntimeSet7RunTestEv+300h (401910h)
...

65 bytes between 0x00401951 and 0x00401910

Note: XMM-based concept of computations failed ( ~20% slower ).

[ Intel C++ compiler ]

...
0040204A cvtsi2ss xmm0, eax
0040204E movss dword ptr [ebp-30h], xmm0
00402053 inc eax
00402054 movss xmm3, dword ptr [ebp-30h]
00402059 cmp eax, 1000000h
0040205E movss xmm1, dword ptr [ebp-30h]
00402063 mulss xmm3, xmm1
00402067 movss xmm2, dword ptr [ebp-30h]
0040206C mulss xmm3, xmm2
00402070 movss dword ptr [ebp-2Ch], xmm3
00402075 jl CRuntimeSet::RunTest+1BAh (40204Ah)
...

43 bytes between 0x00402075 and 0x0040204A

Note: XMM-based concept of computations failed ( ~20% slower ).

SergeyKostrov · ‎01-17-2016

Test Results:

>>...
>>1. Watcom C++ compiler - Test Executed in: 120,554,024 clock cycles
>>2. Intel C++ compiler - Test Executed in: 150,922,384 clock cycles
>>3. MinGW C++ compiler - Test Executed in: 158,981,392 clock cycles
>>4. Microsoft C++ compiler - Test Executed in: 186,046,772 clock cycles
>>5. Borland C++ compiler - Test Executed in: 188,474,452 clock cycles
>>...

Summary:

Winner: Watcom
- uses FPU-registers only for FMUL without using variables on the stack
- length of the codes is 20 bytes - smallest size of generated codes

Place #2: Intel
- uses XMM-registers and MULSS with access to variables on the stack
- two MOVSS instructions after MULSS instructions
- length of the codes is 43 bytes

Place #3: MinGW
- uses XMM-registers and MULSS with access to variables on the stack
- two MOVSS instructions after MULSS instructions
- length of the codes is 65 bytes - largest size of generated codes

Place #4: Microsoft
- uses FPU-registers for FMUL with access to variables on the stack
- length of the codes is 29 bytes - generated codes are only 9 bytes longer than Watcom codes

Place #5: Borland
- uses FPU-registers for FMUL with access to variables on the stack
- length of the codes is 48 bytes

SergeyKostrov · ‎01-17-2016

[ To Alexander ]

>>...
>>...You sure that won't be a waste of time and other resources?

Alexander,

Q1: I wonder if you could use the C code of the test I've posted to create a test-case for the LLVM C++ compiler?
Q2: Could you upload the assembler instructions generated by the LLVM C++ compiler?

Let me know if you need any help.

SergeyKostrov · ‎01-17-2016

I will also provide details about versions of all C++ compilers, used to complete that test-case, and their command line options ( compiler and linker ).

Bernard · ‎01-18-2016

Answering post #20:

Yes, now I see what you mean. I was under the assumption that the results were skewed, and I paid less attention to the code analysis ( I did not have time for a detailed analysis ). It seems that Watcom generated an instruction ( fild? ) which converts int to float in microcode ( probably as part of the x87 FPU stack handling ), thus somehow speeding up execution. I think that the CPU scheduler dispatched the loop control statement for computation on Port6 ( forward branch ); it could also have been fused with the jl instruction. As you mentioned in your comment, the Watcom compiler did not use the stack for accessing the multiplier value, and the whole computation was performed entirely "inside" the x87 FPU unit.

ICL probably produced unoptimized code, with the multiplication operations interleaved with the constant stack accesses needed to reload the running product multiplier. If that multiplier kept the same value, I wonder whether it was cached because of the high temporal frequency of its usage. Constant usage of the stack-resident multiplier at ebp-30h probably slowed down the computation. ILP exploitation of the loop control statement probably had zero impact on computation speed-up when compared to the Watcom code.

Bernard · ‎01-18-2016

Sergey Kostrov wrote:

>>...why Watcom C++ compiler easily outperformed all the rest modern C/C++ compilers but I clearly see that two multiplication
>>operations ( FMUL ) are done with FPU registers instead of memory variables created on the stack

[ Watcom C++ compiler ]

...
00402AED fst st(2)
00402AEF fmul st, st(2)
00402AF1 fmul st, st(2)
00402AF3 fstp st(1)
...

The core part of the calculations is done with FPU registers!

Besides the core calculation, I think that the int->float conversion is also handled internally by the x87 instruction at the uop level. I think that FILD is responsible for that, and this can have an impact on the loop computation speed.

Bernard · ‎01-18-2016

>>>Even if STL is a good library C++ overheads affect performance significantly and it is clearly seen when it comes to processing with boosted thread priorities, like Above Normal or Time Critical, and measurements of time intervals with microseconds, or less ( hundred of nanoseconds ) accuracy.>>>

Do you mean in highly portable DSP-like applications running on Windows desktop OS?

I ask this because in my own projects I tend to rely heavily on STL std::valarray and std::vector for scientific computation. I prefer to lean on the resource deallocation ( destructor calls ) of those containers.

SergeyKostrov · ‎02-07-2016

>>Beside the core calculation I think that also an int->float conversion is managed by the x87 instruction internally
>>at uop level. I think that FILD is responsible for that, but this can have
>>impact on the loop computation speed.

Exactly! This is what I say from time to time: do not forget that the CPU and the FPU work in parallel, and this has been true since the days of the Intel 80386 CPU.

SergeyKostrov · ‎02-07-2016

>>>>Even if STL is a good library C++ overheads affect performance significantly and it is clearly seen when it comes to
>>>>processing with boosted thread priorities, like Above Normal or Time Critical, and measurements of time intervals
>>>>with microseconds, or less ( hundred of nanoseconds ) accuracy...
>>
>>Do you mean in highly portable DSP-like applications running on Windows desktop OS?

To some degree, No. I wanted to say that even if the priority of some STL-based processing ( a function, or a set of functions, etc ) is boosted to Above Normal or Time Critical, it can not outperform the same processing ( a function, or a set of functions, etc ) implemented in pure C language. This is because C code is more easily optimizable when compared to standard, or advanced, C++ code. Not to mention code with heavy usage of C++11, or C++14, etc, features of the C++ language.

SergeyKostrov · ‎02-07-2016

>>I will also provide details about versions of all C++ compilers, used to complete that test-case, and their command line
>>options ( compiler and linker ).

Here they are:

[ Watcom C++ compiler and linker options ]

-5r -fp5 -fpi87 -wx -d0 -s -oabil+mprt -xd -D_WIN32_WCC -DNDEBUG -feWccTestApp.exe -k33554432 -i"C:\WorkLib\NKL\Include" -"libpath C:\WorkLib\MKL\Lib\Ia32Wcc" -wcd=007 -wcd=008 -wcd=013 -wcd=014 -wcd=086 -wcd=188 -wcd=367 -wcd=368 -wcd=369 -wcd=387 -wcd=389 -wcd=549 -wcd=628 -wcd=689 -wcd=716 -wcd=725 -wcd=726 -wcd=735

SergeyKostrov · ‎02-07-2016

[ Intel C++ compiler and linker options ]

[ Compiler options ]

/c /O3 /Ob1 /Oi /Ot /Oy /Qipo /I "..\..\Include" /D "WIN32" /D "_CONSOLE" /D "NDEBUG" /D "_WIN32_ICC" /D "INTEL_SUITE_VERSION=PE121_300" /D "_VC80_UPGRADE=0x0710" /D "_UNICODE" /D "UNICODE" /GF /MT /GS- /fp:fast=2 /GR- /Yu"Stdphf.h" /Fp"Release\IccTestApp.pch" /Fo"Release/" /W5 /nologo /Wp64 /Zi /Gd /TP /Qdiag-disable:2012 /Qdiag-disable:2013 /Qdiag-disable:2014 /Qdiag-disable:2015 /Qdiag-disable:2017 /Qdiag-disable:2021 /Qdiag-disable:2022 /Qdiag-disable:2304 /U "_WIN32_MSC" /U "_WINCE_MSC" /U "WIN32_PLATFORM_PSPC" /U "WIN32_PLATFORM_WFSP" /U "WIN32_PLATFORM_WM50" /U "_WIN32_MGW" /U "_WIN32_BCC" /U "_COS16_TCC" /U "_WIN32_WCC" /Qopenmp /Qfp-speculation:fast /Qopt-matmul /Qparallel /Qstd=c++0x /Qrestrict /Qdiag-disable:111,673,10121 /Wport /Qeffc++ /QxSSE2 /Qansi-alias /Qvec-report=0 /Qfma /Qunroll:8 /Qunroll-aggressive /Qopt-streaming-stores:always /Qopt-block-factor:128 /Qopt-mem-layout-trans:2

[ Linker options ]

kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib uuid.lib odbc32.lib odbccp32.lib /OUT:"Release/IccTestApp.exe" /INCREMENTAL:NO /nologo /MANIFEST /MANIFESTFILE:"Release\IccTestApp.exe.intermediate.manifest" /NODEFAULTLIB:"../../Bin/Release/ScaLib.lib" /TLBID:1 /SUBSYSTEM:CONSOLE /STACK:268435456 /LARGEADDRESSAWARE /MACHINE:X86 /qdiag-disable:111,673,10121

SergeyKostrov · ‎02-07-2016

[ MinGW C++ compiler and linker options ]

-O3 -m32 -msse2 -fpeel-loops -ffast-math -ftree-vectorizer-verbose=0 -ftree-vectorize -fvect-cost-model -mprfchw -flto -fwhole-program -w -fomit-frame-pointer -I "C:/WorkLib/ICC2011/Composer XE/Mkl/Include" -B "../../AppsSca" -DNDEBUG -o Release/MgwTestApp%3.exe MgwTestApp.cpp "C:/WorkLib/ICC2011/Composer XE/Mkl/Lib/Ia32/mkl_rt.lib" -Xlinker --stack=67108864 -fopenmp

SergeyKostrov · ‎02-07-2016

[ Microsoft C++ compiler and linker options ]

[ Compiler options ]

/O2 /Ob1 /Oi /Ot /Oy /GL /I "..\..\Include" /D "WIN32" /D "_CONSOLE" /D "NDEBUG" /D "_WIN32_MSC" /D "_VC80_UPGRADE=0x0710" /D "_UNICODE" /D "UNICODE" /GF /Gm /MT /GS- /fp:fast /GR- /openmp /Yu"Stdphf.h" /Fp"Release\MscTestApp.pch" /Fo"Release/" /Fd"Release/" /W4 /nologo /c /Wp64 /Zi /Gd /TP /wd4005 /U "_WINCE_MSC" /U "WIN32_PLATFORM_PSPC" /U "WIN32_PLATFORM_WFSP" /U "WIN32_PLATFORM_WM50" /U "_WIN32_MGW" /U "_WIN32_BCC" /U "_COS16_TCC" /U "_WIN32_ICC" /U "_WIN32_WCC" /errorReport:prompt /arch:SSE2

[ Linker options ]

/OUT:"Release/MscTestApp.exe" /INCREMENTAL:NO /NOLOGO /MANIFEST /MANIFESTFILE:"Release\MscTestApp.exe.intermediate.manifest" /NODEFAULTLIB:"../../Bin/Release/ScaLib.lib" /SUBSYSTEM:CONSOLE /STACK:268435456 /LARGEADDRESSAWARE /LTCG /MACHINE:X86 /ERRORREPORT:PROMPT kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib uuid.lib odbc32.lib odbccp32.lib "..\..\bin\release\scalib.lib"

SergeyKostrov · ‎02-07-2016

[ Borland C++ compiler and linker options ]

-d -O2 -w -D_WIN32_BCC -DNDEBUG -5 -nRelease -eBccTestApp.exe -I"C:\WorkLib\ICC2011\Composer XE\Mkl\Include" -L"C:\WorkLib\ICC2011\Composer XE\Mkl\Lib\Ia32Bcc" -lS:268435456 BccTestApp.cpp HrtALLib.asm

SergeyKostrov · ‎02-07-2016

To LLVM C++ compiler developers

Q1: I wonder if you could use the C code of the test I've posted to create a test-case for the LLVM C++ compiler?
Q2: Could you upload the assembler instructions generated by the LLVM C++ compiler?