Simulation times MUCH higher with v. 12 than with v.11

cathleenmcguiness · ‎11-18-2011

Hi,

I've been doing some benchmarking, comparing the 64-bit Linux version of the Intel Fortran compiler v. 11.1.037 and all releases of v. 12 up to 12.1.6.233.

In some of my cases I allocate a double precision matrix of size x*y*z and in others I don't. When this matrix remains unallocated, CPU times are roughly the same for v. 11 and v. 12. However, when the matrix is used, CPU times get higher and higher in all iterations of v. 12, whereas in v. 11 they stay constant. With v. 12, CPU time increases non-linearly as x, y and z get larger.

If the matrix exists, I access it y times per simulation, storing x*z values.

Compiler flags are the same in all tests and are as follows:

-heap-arrays
-override-limits
-O3
-fp-model precise
-ip
-ipo
-ftz
-axAVX,SSE4.2,SSE4.1,SSSE3,SSE3,SSE2
-fno-alias

Any explanation as to why the CPU time skyrockets in v.12 and how to get around it would be much appreciated. For x = 600, y = 100 and z = 3, CPU times are already 10% higher in v.12 than in v.11 and for the simulations I need to run, this is a very small allocation size for the matrix in question.

Whether this is an issue on Windows as well, I have yet to check.
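
To put "very small" in numbers, here is a quick back-of-the-envelope sketch. It assumes the default 8 bytes per double precision element; the helper name matrix_bytes is made up for illustration.

```shell
# Hypothetical helper: footprint in bytes of an x*y*z double precision matrix.
matrix_bytes() {
    # $1 $2 $3 = x y z; 8 bytes per element is assumed
    echo $(( $1 * $2 * $3 * 8 ))
}

matrix_bytes 600 100 3   # prints 1440000
```

So the case that is already 10% slower holds 180,000 values, roughly 1.4 MB.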

TimP · ‎11-18-2011

Did you also test with a more reasonable number of architecture selections, e.g. just one of -xAVX or -xSSE4.1?
I suppose the 12.x compilers may have increased the limit on number of architecture dependent paths (once said to have been 3 for 11.1) or changed the criteria for which paths are chosen when too many have been specified.
I suppose the treatment of unaligned memory access may also have changed for certain of those architecture options. In my tests, the reduction in performance going from SSE4.1 to SSE4.2 on NHM and WSM (Nehalem and Westmere) is much larger than the compiler docs would lead you to expect. If you have very few unaligned accesses, SSE4.2 would likely be superior, but SSE4.1 may be faster when you have 50% unaligned accesses.
If small variations of performance are important, you may wish to test individual architecture selections on the CPU variations which are important to you, so as to avoid putting useless selections in your list.
Remember, if you want the compiler to choose a single architecture, the option is -xHost. If you always run on the same CPU architecture, that should always prove superior to a list of architectures including the same one.
I have seen the compiler fail to generate a vectorized code path for certain loops when multiple architectures are specified (presumably in an effort to control code size growth).
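
One way to try individual architecture selections, as suggested above, is to generate one build per option and time each separately. The following is only a dry-run sketch: the compiler command name, the source file sim.f90 and the sim_<arch> output names are assumptions, and the remaining flags are simply the ones listed in the original post without the -ax list.

```shell
#!/bin/sh
# Dry-run sketch: print one compile line per single-architecture option.
# FC, SRC and the sim_<arch> output names are assumptions, not from the thread.
FC=ifort
SRC=sim.f90
# The flags from the original post, minus the long -ax list.
COMMON="-heap-arrays -override-limits -O3 -fp-model precise -ip -ipo -ftz -fno-alias"

build_cmd() {
    # $1 = one architecture name, e.g. SSE2 or AVX
    echo "$FC $COMMON -x$1 $SRC -o sim_$1"
}

# Print the command for each candidate; nothing is compiled here.
for arch in SSE2 SSE4.1 SSE4.2 AVX; do
    build_cmd "$arch"
done
```

The printed lines can be reviewed and then run by hand. Keep in mind that a binary built with a given -x option is only expected to run on CPUs that support that instruction set, so each one has to be timed on suitable hardware.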

cathleenmcguiness · ‎11-21-2011

-xHost is not an option. I need the code to perform well on any machine. Cutting the list down to just -xSSE2 takes care of the problem though. Thanks!