Intel® Fortran Compiler

ifort optimization level on big cluster

savinovsv
Beginner
Dear all!

I have a problem.

I have ifort installed on a big cluster.

Linux access1 2.6.32-131.0.15.el6.x86_64 #1 SMP Sat Nov 12 15:11:58 CST 2011 x86_64 x86_64 x86_64 GNU/Linux

Intel Fortran Intel 64 Compiler XE for applications running on Intel 64, Version 12.0.3.174 Build 20110309


Here is the cpuinfo output:

Intel Xeon CPU X5670
===== Processor composition =====
Processors(CPUs) : 24
Packages(sockets) : 2
Cores per package : 6
Threads per core : 2
===== Processor identification =====
Processor Thread Id. Core Id. Package Id.
0 0 0 0
1 0 0 1
2 0 1 0
3 0 1 1
4 0 2 0
5 0 2 1
6 0 8 0
7 0 8 1
8 0 9 0
9 0 9 1
10 0 10 0
11 0 10 1
12 1 0 0
13 1 0 1
14 1 1 0
15 1 1 1
16 1 2 0
17 1 2 1
18 1 8 0
19 1 8 1
20 1 9 0
21 1 9 1
22 1 10 0
23 1 10 1
===== Placement on packages =====
Package Id. Core Id. Processors
0 0,1,2,8,9,10 (0,12)(2,14)(4,16)(6,18)(8,20)(10,22)
1 0,1,2,8,9,10 (1,13)(3,15)(5,17)(7,19)(9,21)(11,23)
===== Cache sharing =====
Cache Size Processors
L1 32 KB (0,12)(1,13)(2,14)(3,15)(4,16)(5,17)(6,18)(7,19)(8,20)(9,21)(10,22)(11,23)
L2 256 KB (0,12)(1,13)(2,14)(3,15)(4,16)(5,17)(6,18)(7,19)(8,20)(9,21)(10,22)(11,23)
L3 12 MB (0,2,4,6,8,10,12,14,16,18,20,22)(1,3,5,7,9,11,13,15,17,19,21,23)

What is really bothering me is that I cannot go to an optimization level higher than -O1. With -O2 or -O3 my program crashes with a segfault.

The program itself is perfectly reliable. It can be optimized at any level on another cluster:

Linux t60-2.parallel.ru 2.6.18-skif-rhel-alt13.M41.3 #1 SMP Tue Feb 2 12:09:59 MSK 2010 x86_64 GNU/Linux

Intel Fortran Intel 64 Compiler Professional for applications running on Intel 64, Version 11.1 Build 20091012 Package ID: l_cprof_p_11.1.059

Intel Xeon Processor (Intel64 Harpertown)
===== Processor composition =====
Processors(CPUs) : 8
Packages(sockets) : 2
Cores per package : 4
Threads per core : 1
===== Processor identification =====
Processor Thread Id. Core Id. Package Id.
0 0 0 0
1 0 0 1
2 0 1 0
3 0 1 1
4 0 2 0
5 0 2 1
6 0 3 0
7 0 3 1
===== Placement on packages =====
Package Id. Core Id. Processors
0 0,1,2,3 0,2,4,6
1 0,1,2,3 1,3,5,7
===== Cache sharing =====
Cache Size Processors
L1 32 KB no sharing
L2 6 MB (0,2)(1,3)(4,6)(5,7)

I can also use any optimization level on an i7 processor with the latest Composer 2011.

So, my question to the experts is:
What could the problem be?

Is it another quirk in the compiler, or is it something else?

Thanks in advance

S.Savinov

P.S.
I must admit that the program runs 2.5 times faster on the X5670 with -O1.
Ron_Green
Moderator
First, read this article, which is listed at the top of this forum: http://software.intel.com/en-us/articles/determining-root-cause-of-sigsegv-or-sigbus-errors/

Next, for clusters keep in mind that some batch systems will not propagate env settings like 'ulimit -s unlimited'. You may want to make that the default for the cluster in the system-wide /etc/profile.
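A quick way to check what your jobs actually inherit: the minimal probe below (my sketch, not from the article) puts a large automatic array on the stack, so it should segfault under a small 'ulimit -s' and print a sum under 'ulimit -s unlimited'. Submitting it through the batch system shows which limit the compute nodes really get.

! stackprobe.f90 -- hypothetical stack-limit probe
! build: ifort stackprobe.f90 -o stackprobe
program stackprobe
  implicit none
  call eat_stack(8000000)          ! 8e6 * 8 bytes = ~64 MB of stack
contains
  subroutine eat_stack(n)
    integer, intent(in) :: n
    real(8) :: buf(n)              ! automatic array: allocated on the stack
    buf = 1.0d0
    print *, 'stack survived, sum =', sum(buf)
  end subroutine eat_stack
end program stackprobe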

ron
savinovsv
Beginner
Quoting Ron_Green
First, read this article, which is listed at the top of this forum: http://software.intel.com/en-us/articles/determining-root-cause-of-sigsegv-or-sigbus-errors/

Next, for clusters keep in mind that some batch systems will not propagate env settings like 'ulimit -s unlimited'. You may want to make that the default for the cluster in the system-wide /etc/profile.

ron

Dear Ron!

Thank you for the reply. And I am sorry, I did not notice the article you mentioned.

I am aware of stack problems. I am sure I have enough stack and that it is not corrupted.
My program is the DFT package SIESTA, which is known to have some problems with ifort > 10.

My flags are:
FFLAGS=-O1 -xHost -fp-model precise -heap-arrays 100000 -mcmodel=medium (please notice the amount of stack)

My problem is slightly different from the one described there. I cannot precisely localize the segfault, so I cannot say for sure what kind of problem I have. BUT I have trouble ONLY on one system AND only at high optimization levels. It is not very likely that changing the optimization level leads to stack misuse. I have a feeling (a very subjective thing) that the real problem resides a lot deeper.

Sergei

jimdempseyatthecove
Honored Contributor III
With higher levels of optimization there is more inlining and more aggressive unrolling of loops. When the inlined and unrolled code generates (uses) stack-based temporaries, stack consumption goes up (possibly way up in the case where array temporaries are created).

I suggest you experiment with the options that control the extent of IPO, then loop unrolling, then inlining. Keep -O3 on. Do this globally; if this works, then start removing the restrictions. Once you find the offending group (IPO, loop unrolling, inlining), start working on groups of files within your project. IOW, search for the offending source files.
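For concreteness, one possible starting point along these lines (my sketch; verify the exact flag spellings against your compiler version with ifort -help) keeps the original options and -O3 while switching all three suspect transformations off:

FFLAGS=-O3 -xHost -fp-model precise -heap-arrays 100000 -mcmodel=medium -no-ipo -unroll0 -inline-level=0

If the program survives like this, re-enable IPO, unrolling, and inlining one at a time to identify the offending group.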

Eliminating the creation of array temporaries may be in order. There is a diagnostic option that triggers a once-only printout of where an array temporary is created in your code.
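As far as I know, the diagnostic Jim means is -check arg_temp_created, which makes the runtime report each place where a temporary had to be created for an argument. A minimal trigger (my sketch, not from SIESTA): passing a non-contiguous array section to an explicit-shape dummy forces a copy-in/copy-out temporary.

! temps.f90 -- hypothetical illustration of an argument temporary
! build: ifort -check arg_temp_created temps.f90 -o temps
! at run time ifort should warn that a temporary was created for argument #2
program temps
  implicit none
  real(8) :: a(100,100)
  a = 1.0d0
  call work(100, a(1,:))           ! row section is non-contiguous, so the
                                   ! compiler copies it into a temporary
  print *, 'copied back:', a(1,2)  ! prints 2.0: the temporary was copied out
contains
  subroutine work(n, v)
    integer, intent(in) :: n
    real(8), intent(inout) :: v(n) ! explicit-shape dummy needs contiguous data
    v = v + 1.0d0
  end subroutine work
end program temps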

Jim Dempsey
Lorri_M_Intel
Employee
May I clarify something please?

You said:

My flags are:

FFLAGS=-O1 -xHost -fp-model precise -heap-arrays 100000 -mcmodel=medium (please notice the amount of stack)


Are you saying that "-heap-arrays 100000" is setting your stack?

If so, I'm afraid there is a bit of confusion. The "-heap-arrays <size>" switch tells the compiler to put temporary arrays bigger than <size> on the heap, rather than on the stack. Setting that number so large only puts arrays bigger than 100,000 KB on the heap. You could still have several temporaries created on the stack, each of them smaller than that.

You should follow the suggestion to look at how many temporaries you are creating, and see if it's possible to limit those.
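To put numbers on this (my arithmetic, assuming the threshold is in kilobytes as the documentation states): -heap-arrays 100000 sets the cutoff at 100,000 KB, roughly 98 MB. A temporary of 8,000,000 real(8) elements occupies 8,000,000 * 8 bytes = 62,500 KB, well under that cutoff, so it would still be placed on the stack even with the switch in effect.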

--Lorri

TimP
Honored Contributor III
In fact, -heap-arrays 100000 will put only fixed-size allocations beyond that size on the heap. Chances are that all your big allocations are variable-size and go on the stack.
The global stack size is set by your shell. It is entirely normal to set it to "unlimited" (as Ron mentioned), which in practice is a reasonable system-dependent limit. If it is not reasonable to ask your sysadmin to make that the default for everyone, set it up to happen each time a shell opens on each node of the cluster.
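A small sketch of that distinction (mine, not TimP's; the placement comments reflect his description and my reading of the ifort 12-era documentation, so treat them as assumptions):

! heaps.f90 -- hypothetical contrast of fixed-size vs. variable-size temporaries
! build: ifort -heap-arrays 1000 heaps.f90 -o heaps
program heaps
  implicit none
  real(8) :: fixed(2, 2000000)     ! fixed-size local: static storage, not stack
  integer :: n
  fixed = 0.0d0
  ! The non-contiguous section below needs a copy-in/copy-out temporary whose
  ! size (2000000 * 8 bytes = 15625 KB) is known at compile time; it exceeds
  ! the 1000 KB threshold, so it should be allocated on the heap.
  call touch(2000000, fixed(1,:))
  print *, 'after touch:', fixed(1,1)
  ! A runtime-sized automatic array is never compared with the threshold and,
  ! per the description above, still lands on the stack (~4 MB here).
  n = 500000
  call runtime_sized(n)
contains
  subroutine touch(n, v)
    integer, intent(in) :: n
    real(8), intent(inout) :: v(n)
    v = v + 1.0d0
  end subroutine touch
  subroutine runtime_sized(n)
    integer, intent(in) :: n
    real(8) :: w(n)                ! automatic array: stack despite -heap-arrays
    w = 1.0d0
    print *, 'automatic array ok:', w(1)
  end subroutine runtime_sized
end program heaps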
savinovsv
Beginner
Quoting TimP (Intel)
In fact, -heap-arrays 100000 will put only fixed-size allocations beyond that size on the heap. Chances are that all your big allocations are variable-size and go on the stack.
The global stack size is set by your shell. It is entirely normal to set it to "unlimited" (as Ron mentioned), which in practice is a reasonable system-dependent limit. If it is not reasonable to ask your sysadmin to make that the default for everyone, set it up to happen each time a shell opens on each node of the cluster.

Thank you all for your comments!

Just to make things a bit clearer.

In my experience, explicit "-ipo" does not work with the DFT codes I use; ifort gets completely confused. I have tried this option many times in different combinations with no success. That is very sad, as it should speed up the process.

The huge (100M) program stack (ulimit -s 120000 on every platform) was set on purpose. DFT codes are not polished; there is some relatively SMALL temporary garbage inside, and this amount of stack is for that garbage. The main allocations typically go up to 500 MB per node, and they obviously reside on the heap.

I agree that increasing the loop-unrolling and inlining levels can stress the stack. But -O3 IS working everywhere except on the X5670-based cluster. You see, I am using processor-dependent compilation and I would like to keep it.
As far as I can see, the Intel compiler (plus Intel MPI) on an Intel-based cluster is really much faster than GNU or PGI; the same goes for the MVAPICH vs. Intel MPI comparison. I mean, I know ifort is very clever with optimization, and I suspect it can try to silently put something else on the stack for the X5670, say some local computational procedures. In such a case I have no lever to pull; I can only hope there is no bug.

At the moment I have no idea where to go.

I really appreciate your help. Thank you for sharing your time with me.

Sergei
TimP
Honored Contributor III
I am not surprised that -ipo has limitations in a project large enough to require -mcmodel. You are probably better off relying on same-file inlining for the most frequent calls.
Using a processor-dependent option such as -xHost doesn't have a clear connection to any of the problems you mention here. On Westmere X56xx I have never found -xHost running as well as -xSSE4.1 or possibly -xSSE2, but you haven't indicated any reason why you might see a problem associated with these choices. The difference between -O2 and -O1 should have much more impact on performance than the architecture option. There is nothing wrong with choosing a single architecture option that works for all your clusters.
Intel MPI has borrowed a number of features from MVAPICH, so it is not so much a competition as an effort to include as many useful facilities as possible.
savinovsv
Beginner
Dear experts! I have a few new comments on the problem.

Now my flags are: FFLAGS=-O2 -xSSE4.2 -fp-model precise -heap-arrays 100000
I am using ifort 13.0.

1. I can go to the -O3 level with the -xSSE4.2 option. This suggests there is some quirk in ifort. Besides, -xSSE4.2 runs faster.
2. I did not find a way to get -ipo working with my code.
3. The flag set above is the optimal choice. No -ip, no explicit loop unrolling. The difference is not too big, about 2-3%, but absolutely reproducible.
4. -O3 makes things worse; the program slows down.
5. Decreasing the -heap-arrays size notably decreases performance.

This is a kind of conclusion. I have found the answer to my question, and I just want to share my experience.

Sergei