
christophe_b_ · ‎06-10-2017

I have a dual-boot setup on my computer because I prefer to code on Linux. On Windows I use Intel Composer XE 2013, and to compare I installed the same version on Linux (at first I used Intel compiler 17 on Linux, but I went back to 2013 in order to compare properly with Windows).

My program is mostly computation (a fixed-point algorithm). I compile it on both systems with similar flags (-openmp -fast, plus the Boost library), but the Windows executable ends up being about 1.6 times faster.

More detail about the program: at its core I'm using the Boost library, with this kind of computation (which I call millions of times):

boost::math::lognormal distrib(mu, sigma);
value = exp(boost::math::lgamma(N))*pow(cdf(distrib, x), 5);

I already noticed on Linux (and I don't know why) that icpc tends to be faster than g++ for any call to a Boost function (like lgamma). However, these calls are faster still on Windows than on Linux, and I don't know why (this is most probably the reason the Windows build is faster overall).
(I also use OpenMP to parallelize loops, but even without it the difference between the two OSs remains.)

What is the explanation? Maybe some built-in differences in compilation flags between the two platforms? Or is Windows simply faster with ICC? That would be odd, because I have never heard of such a thing (gcc, for example, tends to always be faster on Linux on the same computer...).

Not sure it's relevant, but on Windows I compile with the following .bat:
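
To check whether the gap comes from Boost or from the compiler's math library, the computation above can be reduced to a standalone kernel and timed on both systems. The following is only a sketch under stated assumptions, not the original program: `std::lgamma` stands in for `boost::math::lgamma`, the lognormal CDF is written out with `std::erfc` instead of going through `boost::math::lognormal`, and the names `lognormal_cdf`, `kernel` and `time_kernel` as well as the parameter values are hypothetical.

```cpp
#include <chrono>
#include <cmath>

// Lognormal CDF written with std::erfc; an assumed stand-in for
// cdf(boost::math::lognormal(mu, sigma), x). Valid for x > 0.
double lognormal_cdf(double x, double mu, double sigma) {
    return 0.5 * std::erfc(-(std::log(x) - mu) / (sigma * std::sqrt(2.0)));
}

// Same shape as the expression in the post, with std::lgamma in place
// of boost::math::lgamma.
double kernel(double n, double x, double mu, double sigma) {
    return std::exp(std::lgamma(n)) * std::pow(lognormal_cdf(x, mu, sigma), 5);
}

// Calls the kernel `calls` times and returns the elapsed time in seconds.
// The running sum is written to *sink so the loop has an observable
// result and is not removed by the optimizer.
double time_kernel(long calls, double* sink) {
    auto start = std::chrono::steady_clock::now();
    double sum = 0.0;
    for (long i = 0; i < calls; ++i) {
        // Vary x a little (between 0.5 and 1.5) so the call is not hoisted.
        double x = 0.5 + 1e-6 * static_cast<double>(i % 1000000);
        sum += kernel(5.0, x, 0.0, 1.0);
    }
    auto stop = std::chrono::steady_clock::now();
    *sink = sum;
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling `time_kernel` with a few million iterations from `main` and printing the result on both systems would show whether the 1.6x gap survives once Boost is removed. If it disappears, the Boost code path is the place to look; if it stays, the math library or the code generation options are the more likely cause.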

@call "C:\Program Files (x86)\Intel\Composer XE 2013\bin\iclvars.bat" ia32 
icl /openmp /I "C:\boost_1_61_0" /fast program.cpp

and when I launch it, it says:

Intel(R) composer XE 2013 (package 089)
Setting environment for using Microsoft Visual Studio 2010 x86 tools.
Intel(R) C++ Compiler XE for applications running on IA-32. Version 13.0.0.089 Build 20120731
...
Microsoft (R) Incremental Linker Version 10.00.30319.01
-

Notice that I use ia32 even though I'm on an intel64 system (x86_64). As a consequence, on Linux I'm also using ia32, with:

source /media/usr/intel2013/bin/iccvars.sh intel64
icpc -o program program.cpp -std=c++11 -openmp -fast

Notice that on Linux I have to add -std=c++11 for it to compile without errors (while on Windows I don't think it actually uses C++11 when compiling).

I would like to switch to Linux for good, but I need my program to run at least as fast as on Windows. I have spent about 5 days looking for a solution, tweaking flags and so on, but I'm pretty new to compiler performance optimization and I have been unable to find the reason for the difference.
Any help would be greatly appreciated.

Edit: as for my hardware: Intel(R) Core(TM) i7-4710HQ CPU @ 2.50GHz, 8 CPU, 4 cores, 2 threads per core, architecture x86_64. Windows is 8.1 and Linux 4.8.0-54-generic #57~16.04.1-Ubuntu

jimdempseyatthecove · ‎06-15-2017

Your Linux source command is specifying 64-bit build (intel64). Replace intel64 with ia32.

Also, specify the appropriate code generation option so that the same instruction set is used on both systems. One system may be using SSE/AVX while the other may be using the FPU.

And one system may be using fast-transcendentals and the other not.
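
One way to confirm that both builds really target the same bitness and instruction set is to have each executable report what the compiler was told to generate. This is a rough sketch only: the function names are hypothetical, and the predefined macros used (`__AVX2__`, `__AVX__`, `__SSE2__`, `_M_X64`, `_M_IX86_FP`) are the usual GCC and Microsoft-style ones. Which of them a particular icpc or icl version defines is an assumption that should be checked.

```cpp
#include <string>

// Pointer width of this build: 32 for an ia32 build, 64 for an intel64 build.
int pointer_bits() {
    return static_cast<int>(sizeof(void*) * 8);
}

// Highest SIMD level the compiler was asked to target, judged only from
// predefined macros (assumed to be set by the code generation options).
std::string simd_level() {
#if defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
    return "SSE2";
#else
    return "x87 or unknown";
#endif
}

// One-line summary, for example to print at program start on both systems.
std::string describe_build() {
    return std::string(pointer_bits() == 64 ? "64-bit" : "32-bit") + ", " + simd_level();
}
```

If the two executables print different summaries, the comparison is not like for like yet. Note that this only describes the code compiled from the program's own source files, not any prebuilt library it calls into.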

Jim Dempsey

SergeyKostrov · ‎06-16-2017

>>...What is the explanation?

There is already a term for such differences: Performance Portability. It is not always possible to get the same numbers when comparing the performance of some processing on different OSs.

>>...maybe some pre-build differences in compilation flags between the two platforms?

I don't think so.

>>Or is windows simply faster with ICC ?

See my first comment. Overall, try to investigate whether there are differences in the Boost code.

christophe_b_ · ‎06-19-2017

jimdempseyatthecove wrote:

Your Linux source command is specifying 64-bit build (intel64). Replace intel64 with ia32.

Also, specify the appropriate code generation option so that the same instruction set is used on both systems. One system may be using SSE/AVX while the other may be using the FPU.

And one system may be using fast-transcendentals and the other not.

Jim Dempsey

Maybe I forgot to mention it, but I also tried compiling with Intel Composer XE 2013 and the ia32 option on Linux (as on Windows), without any better result.
The -fast flag should include -xHost. I have the impression it is using xAVX or xAVX2 on Linux, but in any case it should use the same instruction set on both OSs, right? How do you activate the FPU to test? (Isn't it slower anyway?)
I tried fast-transcendentals on both systems too, without any real change.

>>...What is the explanation?

There is a term already for such differences: Performance Portability. So, it is Not always possible to get the same numbers when comparing performance of some processing on different OSs.

So you're basically saying that it might just be "normal"? Well I'm ok with that, just a bit surprised since I thought linux was generally better.

TimP · ‎06-19-2017

I wouldn't consider it "normal" to compare compilers and libraries without taking into account target ISA (default x87 for gcc 32-bit, default SSE2 for icpc) and the ISA for which your libraries are built. Linux doesn't provide any math library for g++ other than the x87 one, and calls into that from ISA other than x87 will be slow. As Jim said, you would want to use consistent code generation as much as possible.
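
A quick, partial check of whether a given build evaluates floating-point expressions x87-style or SSE-style is the standard `FLT_EVAL_METHOD` macro from `<cfloat>`. This is a sketch with hypothetical function names. It only reflects how the calling code was compiled, not how the math library it calls was built, and the "x87-style" and "SSE-style" labels are the typical interpretation, not a guarantee.

```cpp
#include <cfloat>

// FLT_EVAL_METHOD (C++11, <cfloat>):
//    0 -> float/double expressions are evaluated in their declared type
//         (typical for SSE2 code generation)
//    1 -> float expressions are evaluated as double
//    2 -> expressions are evaluated as long double
//         (typical for x87 code generation)
//   -1 -> indeterminable
int float_eval_method() {
#ifdef FLT_EVAL_METHOD
    return FLT_EVAL_METHOD;
#else
    return -1;  // macro not provided by this toolchain
#endif
}

// Human-readable form of the value above.
const char* float_eval_description() {
    switch (float_eval_method()) {
        case 0:  return "evaluates in the declared type (SSE-style)";
        case 1:  return "evaluates float as double";
        case 2:  return "evaluates in long double (x87-style)";
        default: return "indeterminable";
    }
}
```

Printing `float_eval_description()` from the g++, icpc and icl builds gives a first hint of whether they are being compared on the same footing before looking at how Boost and the math library themselves were built.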

-fast-transcendentals enables the use of svml vector math calls even if you set options like icpc -fp-model source. It would not be surprising if this can't happen in the context you quote with boost operands (but you should tell us about it).

If you must make x87 math calls with boost operands the way you show, it may even help to build boost with corresponding options, although that isn't pretty. You certainly can't make a generalization from such context to the idea that linux is generally slow.

SergeyKostrov · ‎06-19-2017

>>...So you're basically saying that it might just be "normal"? Well I'm ok with that, just a bit surprised since I thought linux was generally better...

A 60% difference (1.6x slower) can't be considered normal. You need to reduce the processing code to a small test case and then analyze it with Intel VTune. When it comes to Performance Portability, only differences of 5% to 10% can be considered normal. All cases above that threshold are not normal.

Bastian_B_ · ‎05-02-2018

Are you sure that most of the time is really spent in your code (the code you compiled with the Intel Compiler)? Or is it possible that a large part of the CPU time is spent in code from another library (boost?)? Because then the question is how the person who compiled that library for your OS compiled it, and so on.

In general I would recommend running Intel VTune Amplifier XE on your program and analyzing the results of a basic 'hotspot' analysis. On Linux it is as simple as:

amplxe-cl -c hotspots ./foobar
amplxe-gui

You can also use the 'perf' tool on Linux. Either tool should show more precisely where the differences between Linux and Windows are, and in which parts of the code and in which libraries the time is spent. For representative results, use at least -O2 and add -g to generate debugging information. I also usually add -fno-omit-frame-pointer for performance analysis, as that makes stack unwinding easier, but it may have a small performance impact in certain cases.

ICC faster on Windows than on Linux