I'd like to share the results of a performance evaluation of Matrix Transpose ( MT ) algorithms built with different C++ compilers.
- Optimization used: 'Optimize For Speed'
- Single-Precision data type ( float ) for a matrix
- Intel C++ compiler version 188.8.131.521 ( IA-32 / Update 7 / Optimization: /O2 )
- Microsoft C++ compiler version 14.00.50727.762 ( IA-32 / Visual Studio 2005 / Optimization: /O2 )
- MinGW C++ compiler version 3.4.2 ( IA-32 / Optimization: -O2 )
- Borland C++ compiler version 5.5.1 ( IA-32 / Optimization: -O2 )
- Nothing was done to give any C++ compiler an advantage or disadvantage over another; all other settings are defaults
- Operating system: Windows XP 32-bit Professional SP3
- Computer with a single-core Pentium 4 CPU
- Results on a computer with Intel Core i7-3840QM ( Ivy Bridge ) CPU could be provided ( if somebody is interested )
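For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of what a basic single-precision transpose kernel and a timing harness might look like. It is not the code used in the evaluation: the names transpose_naive and time_transpose_seconds, the row-major layout, and the use of std::clock for timing are all assumptions made for illustration. It is written in C++98 style so that the older compilers in the list could plausibly build it.

```cpp
#include <cassert>
#include <cstddef>
#include <ctime>
#include <vector>

// Hypothetical kernel: out-of-place transpose of a rows x cols
// row-major matrix of floats into a cols x rows row-major matrix.
void transpose_naive(const float* src, float* dst,
                     std::size_t rows, std::size_t cols)
{
    assert(src != dst); // in-place transposition needs a different algorithm
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            dst[j * rows + i] = src[i * cols + j];
}

// Hypothetical harness: transposes an n x n matrix 'repeats' times and
// returns the elapsed time in seconds as reported by std::clock.
double time_transpose_seconds(std::size_t n, int repeats)
{
    assert(n > 0);
    std::vector<float> a(n * n), b(n * n);
    for (std::size_t k = 0; k < a.size(); ++k)
        a[k] = static_cast<float>(k);

    std::clock_t t0 = std::clock();
    for (int r = 0; r < repeats; ++r)
        transpose_naive(&a[0], &b[0], n, n);
    std::clock_t t1 = std::clock();

    // Read one result through a volatile so the optimizer is less
    // likely to discard the transposes as dead stores.
    volatile float sink = b[b.size() - 1];
    (void)sink;

    return static_cast<double>(t1 - t0) / CLOCKS_PER_SEC;
}
```

Building the same file with each compiler at its 'Optimize For Speed' setting and calling time_transpose_seconds with a matrix large enough to exceed the cache would give numbers comparable in spirit, though not in detail, to the ones discussed here.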
Hi Sergey, I can run some tests with an i7 2600K ( Sandy Bridge ) but I do not have the same compilers you have. I can try with ICL 184.108.40.206, MSVC 2008 and MinGW (although MinGW is hard for me to use). I also have the same ICL in an Ubuntu virtual machine (just installed, only out of curiosity; it is not our production platform).
If you only use /O2 for Intel, I think your trial ignores several advantages of that compiler that make a big difference, such as auto-parallelization, O3 with additional optimizations, and /fp:fast=2, to mention only the most important in my personal experience.
I have run several tests with ICL and other compilers, always using the maximum optimization level of each compiler. In most cases Intel finished as the winner when floating point was the main task. The compiler with the best results so far, excluding Intel, was MSVC 2010.
MinGW gcc 4.8 compilers have become fairly widespread and are currently supported. In any case, g++ has the option -ffast-math, which is comparable to icpc -fp-model fast=2 in case you are looking for something which implies -fcx-limited-range, consistent with icpc -complex-limited-range. I don't see that these options should impact matrix transposition. Obviously, if you are comparing an old compiler without auto-vectorization against a new one with vectorization, that's hardly a meaningful competition, but I don't know why you wouldn't want these tests vectorized.
For several years now, gcc has invoked auto-vectorization either as part of -O3 or (often better) explicitly with -O2 -ftree-vectorize, best used with -march=native or the like. Neither of those turns on additional unrolling, for which gcc may need detailed options on Intel Nehalem and newer, e.g. -funroll-loops --param max-unroll-times=4.
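To make the distinction concrete, here is a small sketch of two loops and the gcc command lines mentioned above. The function names gather_strided and sum_floats and the file name loops.cpp are made up for illustration, and whether a given gcc version actually vectorizes either loop is something to check on your own build rather than take from this example.

```cpp
#include <cassert>
#include <cstddef>

// Command lines using the options named above (loops.cpp is a placeholder):
//   g++ -O3 -march=native -c loops.cpp
//   g++ -O2 -ftree-vectorize -march=native -c loops.cpp
//   g++ -O2 -ftree-vectorize -march=native -funroll-loops
//       --param max-unroll-times=4 -c loops.cpp

// Pure data movement, like a transpose: there is no floating-point
// arithmetic here, so -ffast-math has nothing to relax.
void gather_strided(const float* src, float* dst,
                    std::size_t n, std::size_t stride)
{
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i * stride];
}

// Sum reduction: vectorizing it reorders the additions, which changes
// rounding, so gcc typically needs -ffast-math (or the narrower
// -fassociative-math) before it will vectorize this loop.
float sum_floats(const float* x, std::size_t n)
{
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        s += x[i];
    return s;
}
```

The first loop is the kind of code a transpose benchmark exercises, which is why the fast-math style options should not matter much there; the second shows where they usually do.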
In my current tests, comparing the latest GNU and Intel compilers on the classic "LCD" benchmark, gcc out-performs icc -ansi-alias by at most 50% on cases which aren't vectorizable. With the aid of pragmas, icc can at least match gcc on vectorizable cases.
One of the peculiarities of the Intel pragmas in recent releases is that the new #pragma omp simd safelen(32) applies for -mmic and -mavx, but the older form #pragma simd vectorlength(64) is needed for the older architectures. gcc can vectorize most cases quite well with __restrict pointers and no pragmas, and Microsoft VS2012 may do a fair job with /arch:AVX.
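As a rough illustration of the two approaches, the sketch below writes the same loop once with only __restrict qualifiers and once with the OpenMP SIMD pragma added. The function names axpy_restrict and axpy_omp_simd are hypothetical, and whether the pragma is honoured depends on the compiler and on its OpenMP SIMD support being enabled; a compiler that does not recognize it should simply ignore it, possibly with a warning.

```cpp
#include <cassert>
#include <cstddef>

// Variant 1: no pragma. __restrict promises the compiler that the three
// arrays do not overlap, which removes the aliasing question that often
// blocks auto-vectorization.
void axpy_restrict(float* __restrict out, const float* __restrict x,
                   const float* __restrict y, float a, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a * x[i] + y[i];
}

// Variant 2: the same loop with the OpenMP 4.0 style pragma quoted above.
// safelen(32) asserts that iterations up to 32 apart may safely run
// together; here every iteration is independent, so the claim holds.
void axpy_omp_simd(float* out, const float* x, const float* y,
                   float a, std::size_t n)
{
    #pragma omp simd safelen(32)
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a * x[i] + y[i];
}
```

Both variants compute the same result; the difference is only in how much the programmer tells the compiler, which is exactly the kind of ground rule that shifts a compiler comparison.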
So it is quite easy to set ground rules which will change the balance when comparing optimization across compilers. At one time, people would compare defaults (-O0 for gcc), but marketing based on whose defaults are better has thankfully declined. It used to be that adherence to standards ruled out #pragma optimization, but the new pragmas gradually introduced in icc are covered by the future OpenMP 4.0. All #pragma simd directives override any standards compliance settings such as /fp:source, but not /ftz- settings, which can't be controlled locally without a large time penalty.