Performance Evaluation of Matrix Transpose ( MT ) algorithms with different C++ compilers

SergeyKostrov · ‎05-15-2013

I'd like to share the results of a performance evaluation of Matrix Transpose ( MT ) algorithms built with different C++ compilers.

Notes:

- Optimization used: 'Optimize For Speed'
- Single-Precision data type ( float ) for a matrix

- Intel C++ compiler version 12.1.7.371 ( IA-32 / Update 7 / Optimization: /O2 )
- Microsoft C++ compiler version 14.00.50727.762 ( IA-32 / Visual Studio 2005 / Optimization: /O2 )
- MinGW C++ compiler version 3.4.2 ( IA-32 / Optimization: -O2 )
- Borland C++ compiler version 5.5.1 ( IA-32 / Optimization: -O2 )

- Nothing was done to give any C++ compiler an advantage or disadvantage over another; all other settings are defaults

- Operating system: Windows XP 32-bit Professional SP3
- Computer with Pentium 4 CPU single core
- Results on a computer with Intel Core i7-3840QM ( Ivy Bridge ) CPU could be provided ( if somebody is interested )

SergeyKostrov · ‎05-15-2013

Matrix Size: 1024 x 1024 ( number of tests is 64 )

[ Classic MT algorithm ]

- MinGW C++ compiler - Completed in 9953 ticks - best time
- Intel C++ compiler - Completed in 9969 ticks - 0.16% slower
- Microsoft C++ compiler - Completed in 9984 ticks - 0.32% slower
- Borland C++ compiler - Completed in 10031 ticks - 0.78% slower

[ Diagonal MT algorithm ]

- MinGW C++ compiler - Completed in 5484 ticks - best time
- Microsoft C++ compiler - Completed in 5500 ticks - 0.29% slower
- Intel C++ compiler - Completed in 5547 ticks - 1.15% slower
- Borland C++ compiler - Completed in 5594 ticks - 2.01% slower

[ Eklundh MT algorithm ]

- MinGW C++ compiler - Completed in 3657 ticks - best time
- Microsoft C++ compiler - Completed in 3766 ticks - 2.98% slower
- Intel C++ compiler - Completed in 3843 ticks - 5.09% slower
- Borland C++ compiler - Completed in 4547 ticks - 24.34% slower
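For readers unfamiliar with the three algorithm names used in these results, here is a minimal sketch of what each one typically means. The thread does not show the tested implementations, so everything below is an assumption: the function names, the row-major `std::vector<float>` layout, and the reading of "Diagonal" as an in-place swap across the main diagonal are all hypothetical. The Eklundh variant shown is the textbook bit-by-bit exchange and requires a power-of-two matrix size.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Row-major square matrix of floats; element (r, c) is stored at index r * n + c.
typedef std::vector<float> Matrix;

// "Classic" out-of-place transpose: dst(c, r) = src(r, c).
void transposeClassic(const Matrix& src, Matrix& dst, std::size_t n) {
    for (std::size_t r = 0; r < n; ++r)
        for (std::size_t c = 0; c < n; ++c)
            dst[c * n + r] = src[r * n + c];
}

// "Diagonal" in-place transpose: swap every element above the main
// diagonal with its mirror image below it.
void transposeDiagonal(Matrix& m, std::size_t n) {
    for (std::size_t r = 0; r < n; ++r)
        for (std::size_t c = r + 1; c < n; ++c)
            std::swap(m[r * n + c], m[c * n + r]);
}

// Eklundh-style in-place transpose; n must be a power of two.
// Stage s exchanges bit s of the row index with bit s of the column index:
// element (r, c | s) trades places with element (r | s, c).
void transposeEklundh(Matrix& m, std::size_t n) {
    for (std::size_t s = 1; s < n; s <<= 1) {
        for (std::size_t r = 0; r < n; ++r) {
            if (r & s) continue;          // handle each pair once, from the row with bit s clear
            for (std::size_t c = 0; c < n; ++c) {
                if (c & s) continue;      // same for the column index
                std::swap(m[r * n + (c | s)], m[(r | s) * n + c]);
            }
        }
    }
}
```

All three functions should produce the same result on a power-of-two matrix; they differ in memory traffic and access pattern, which is what a benchmark like this one measures.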

SergeyKostrov · ‎05-15-2013

Matrix Size: 2048 x 2048 ( number of tests is 64 )

[ Classic MT algorithm ]

- MinGW C++ compiler - Completed in 39766 ticks - best time
- Intel C++ compiler - Completed in 39937 ticks - 0.43% slower
- Borland C++ compiler - Completed in 40000 ticks - 0.59% slower
- Microsoft C++ compiler - Completed in 40359 ticks - 1.49% slower

[ Diagonal MT algorithm ]

- MinGW C++ compiler - Completed in 22156 ticks - best time
- Borland C++ compiler - Completed in 22187 ticks - 0.14% slower
- Intel C++ compiler - Completed in 22219 ticks - 0.28% slower
- Microsoft C++ compiler - Completed in 22344 ticks - 0.85% slower

[ Eklundh MT algorithm ]

- MinGW C++ compiler - Completed in 15422 ticks - best time
- Microsoft C++ compiler - Completed in 15953 ticks - 3.44% slower
- Intel C++ compiler - Completed in 16172 ticks - 4.86% slower
- Borland C++ compiler - Completed in 20093 ticks - 30.29% slower

SergeyKostrov · ‎05-15-2013

Matrix Size: 4096 x 4096 ( number of tests is 64 )

[ Classic MT algorithm ]

- MinGW C++ compiler - Completed in 162140 ticks - best time
- Microsoft C++ compiler - Completed in 162469 ticks - 0.20% slower
- Borland C++ compiler - Completed in 163234 ticks - 0.67% slower
- Intel C++ compiler - Completed in 165172 ticks - 1.87% slower

[ Diagonal MT algorithm ]

- Borland C++ compiler - Completed in 90406 ticks - best time
- MinGW C++ compiler - Completed in 99203 ticks - 9.73% slower
- Intel C++ compiler - Completed in 101282 ticks - 12.03% slower
- Microsoft C++ compiler - Completed in 102640 ticks - 13.53% slower

[ Eklundh MT algorithm ]

- MinGW C++ compiler - Completed in 75391 ticks - best time
- Microsoft C++ compiler - Completed in 78219 ticks - 3.75% slower
- Intel C++ compiler - Completed in 78390 ticks - 3.98% slower
- Borland C++ compiler - Completed in 91188 ticks - 20.95% slower
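The "% slower" figures in these results appear to be the relative increase of each tick count over the best tick count for the same algorithm and matrix size. That is an inference from the reported numbers, not something stated in the thread, and the helper name `percentSlower` below is hypothetical. A small sketch of that calculation:

```cpp
// Relative slowdown, in percent, of `ticks` against the best (smallest)
// tick count measured for the same algorithm and matrix size.
double percentSlower(long ticks, long bestTicks) {
    return 100.0 * static_cast<double>(ticks - bestTicks)
                 / static_cast<double>(bestTicks);
}
```

For example, `percentSlower(4547, 3657)` evaluates to roughly 24.34, which matches the Borland entry for the Eklundh algorithm at 1024 x 1024.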

Armando_Lazaro_Alami · ‎05-18-2013

Hi Sergey, I can do some tests with an i7 2600K ( Ivy Bridge), but I do not have the same compilers you have. I can try with ICL 13.1.0.149, MSVC 2008 and MinGW (although MinGW is hard for me to use). I also have the same ICL in an Ubuntu virtual machine (just installed out of curiosity; it is not our production platform).

If you only use /O2 for Intel, I think your trial ignores several advantages of that compiler that make a big difference, such as auto-parallelization, /O3 with additional optimizations, and /fp:fast=2, to mention only the most important in my personal experience.

I have made several tests with ICL and other compilers. I always try the maximum level of optimization of each compiler. In most of the cases Intel finished as the winner when floating point is the main task. The compiler with best results, excluding Intel, was MSVC 2010 so far.

SergeyKostrov · ‎05-18-2013

Hi Armando,

>>...If you only use /O2 for Intel, I think that your trial ignores several advantages of that compiler that makes a big differences
>>like : auto-parallelization, O3 with additional optimizations and /fp:fast=2, to mention only the most important in my personal
>>experience.

You're right that the /O3 and /fp:fast=2 options are very powerful. However, if I used these options for the Intel C++ compiler, it would create an "unfair competition environment" with regard to the MinGW ( v3.4.2 is ~9 years old ) and Borland ( v5.5.1 is ~15+ years old ) C++ compilers.

So, the purposes of these tests are as follows:

- Check-ups of some classic and proprietary algorithms in order to identify implementation issues or problems
- Demonstration of the capabilities and efficiency of code generation (!) of legacy C++ compilers when compared to modern C++ compilers, like Intel and Microsoft

If you have some test results, please post them; I really would like to bring the attention of Intel software engineers to this thread.

TimP · ‎05-19-2013

MinGW gcc 4.8 compilers have become fairly widespread, with current support. In any case, g++ has option -ffast-math which is comparable to icpc -fast=2 in case you are looking for something which implies -fcx-limited-range, consistent with icpc -complex-limited-range. I don't see that these options should impact matrix transposition. Obviously, if you are comparing an old compiler without auto-vectorization against a new one with vectorization, that's hardly a meaningful competition, but I don't know why you wouldn't want these tests vectorized.

SergeyKostrov · ‎05-19-2013

>>...g++ has option -ffast-math which is comparable to icpc -fast=2...

Thanks for the note, Tim; I'll try to evaluate that option. Here is an interesting note from the documentation:

...
-ffast-math - This switch lacks documentation
...

TimP · ‎05-19-2013

For several years now, gcc invoked auto-vectorization either as part of -O3, or (often better) explicitly by -O2 -ftree-vectorize, best used with -march=native or the like. Neither of those turn on additional unrolling, for which gcc may need detailed options for Intel Nehalem and newer, e.g. -funroll-loops --param max-unroll-times=4

In my tests now, comparing latest gnu and Intel compilers on the classic "LCD" benchmark, gcc out-performs icc -ansi-alias by at most 50% on cases which aren't vectorizable. With the aid of pragmas, icc can at least match gcc on vectorizable cases.

One of the peculiarities of the Intel pragmas for recent releases is that the new #pragma omp simd safelen(32) applies for -mmic and -mavx, but the older version #pragma simd vectorlength(64) is needed for the older architectures. gcc can vectorize most cases quite well with __restrict pointers without pragmas, and Microsoft VS2012 may do a fair job for /arch:AVX
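As an illustration of the `__restrict` point, here is a minimal sketch of the kind of loop a compiler can usually auto-vectorize without pragmas. The function name `scaleAdd` and its signature are hypothetical, and `__restrict` is a compiler extension (accepted by gcc, icc and Microsoft compilers), not standard C++. Whether a given compiler version actually vectorizes the loop depends on its options, for example the `-O2 -ftree-vectorize -march=native` combination mentioned earlier for gcc.

```cpp
#include <cstddef>

// dst[i] += k * src[i] for i in [0, n).
// __restrict promises the compiler that dst and src never overlap, which
// removes the aliasing concern that would otherwise block or complicate
// vectorization of this loop.
void scaleAdd(float* __restrict dst, const float* __restrict src,
              float k, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] += k * src[i];
}
```

The promise is the caller's responsibility: passing overlapping ranges to a function declared this way is undefined behavior.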

So it is quite easy to set ground rules which will change the balance when comparing optimization of various compilers. At one time, people would compare defaults (-O0 for gcc), but marketing based on whose defaults are better thankfully has declined. It used to be that adherence to standards ruled out #pragma optimization, but new pragmas gradually introduced in icc are covered by the future OpenMP 4.0. All #pragma simd over-rule any standards compliance settings such as /fp:source, but not /ftz- settings which can't be controlled locally without large time penalty.

SergeyKostrov · ‎05-21-2013

Thanks for the tip.

>>... In any case, g++ has option -ffast-math which is comparable to icpc -fast=2...

I tested the option on an algorithm with lots of multiplications and the performance improvement was ~1%, but that doesn't mean the option is useless. I will need to check it on a couple of systems with different CPUs, including Ivy Bridge.

SergeyKostrov · ‎05-22-2013

It is interesting that the -ffast-math option makes it possible to detect divide-by-zero cases (!):

...
In file included from MgwTestApp.cpp:77:
../../Common/PrtTests.cpp: In function `RTdouble FastSinV1(RTdouble)':
../../Common/PrtTests.cpp:21363: warning: division by zero in `0.0 / 0.'
../../Common/PrtTests.cpp:21368: warning: division by zero in `0.0 / 0.'
...
MgwTestApp - 0 error(s), 7 warning(s)
...