Intel® Fortran Compiler

HOMME run-time failure on Altix, compiled with Intel 10.1

notahoo
Beginner

Hi,

I have compiled HOMME on an SGI Altix with Intel 10.1. Compilation succeeds, but the program fails at run time. I have posted the relevant part of the batch output and the gdb trace below. I would appreciate your insight in resolving the error.

HOMME is written in Fortran, MPI, some C, and NetCDF for I/O.

MPI: --------stack traceback-------
Internal Error: Can't read/write file "/dev/mmtimer", (errno = 22)
Internal Error: Can't read/write file "/dev/sgi_fetchop", (errno = 22)
MPI: Intel Debugger for applications running on IA-64, Version 10.1-32 , Build 20070829
MPI: Attaching to program: /homme/benchmark/preqx, process 23903
MPI: [New Thread 2305843009318652880 (LWP 23903)]
MPI:
MPI: #0 0xa000000000010621
MPI: #1 0x20000000062076f0 in __waitpid () in /lib/libc-2.4.so
MPI: #2 0x20000000001c6570 in MPI_SGI_stacktraceback () in /usr/lib/libmpi.so
MPI: #3 0x20000000001c7c20 in slave_sig_handler () in /usr/lib/libmpi.so
MPI: #4 0xa0000000000107e0
MPI: #5 0x4000000000183e60 in SCHEDULE_MOD::setcycle (schedule=, cycle= {...}, edge=) at schedule_mod.F90:1017
MPI: #6 0x4000000000182c60 in SCHEDULE_MOD::genedgesched (partnumber=, lschedule= {...}, metavertex=) at schedule_mod.F90:225
MPI: #7 0x40000000001e07e0 in PREQ_INIT_MOD::preq_init (edge2dv=, edge1= {...}, edge2=, edge3= {...}, edge3p1= {...}, edge4=, red= {...}, par= {...}, timer= {...}) at ../src/preq_init_mod.F90:256
MPI: #8 0x4000000000004ab0 in prim_main () at ../src/prim_main.F90:67

MPI: -----stack traceback ends-----
MPI: /homme/benchmark/preqx, Rank 2, Process 23903: Dumping core on signal SIGSEGV(11) into directory /homme/benchmark/little-endian
MPI: MPI_COMM_WORLD rank 2 has terminated without calling MPI_Finalize()
MPI: aborting job
MPI: Received signal 11

GNU gdb 6.6
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "ia64-suse-linux"...
Using host libthread_db library "/lib/libthread_db.so.1".

warning: .dynamic section for "/lib/libc.so.6.1" is not at the expected address (wrong library or version mismatch?)
Reading symbols from /usr/lib/libmpi.so...done.
Loaded symbols for /usr/lib/libmpi.so
Reading symbols from /opt/intel/mkl/10.0.3.020/lib/64/libmkl_gf_ilp64.so...done.
Loaded symbols for /opt/intel/mkl/10.0.3.020/lib/64/libmkl_gf_ilp64.so
Reading symbols from /opt/intel/mkl/10.0.3.020/lib/64/libmkl_core.so...done.
Loaded symbols for /opt/intel/mkl/10.0.3.020/lib/64/libmkl_core.so
Reading symbols from /opt/intel/mkl/10.0.3.020/lib/64/libmkl_sequential.so...done.
Loaded symbols for /opt/intel/mkl/10.0.3.020/lib/64/libmkl_sequential.so
Reading symbols from /opt/intel/fc/10.1.008/lib/libimf.so.6...done.
Loaded symbols for /opt/intel/fc/10.1.008/lib/libimf.so.6
Reading symbols from /lib/libm.so.6.1...done.
Loaded symbols for /lib/libm.so.6.1
Reading symbols from /lib/libc.so.6.1...done.
Loaded symbols for /lib/libc.so.6.1
Reading symbols from /lib/libgcc_s.so.1...done.
Loaded symbols for /lib/libgcc_s.so.1
Reading symbols from /lib/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/libunwind.so.7...done.
Loaded symbols for /lib/libunwind.so.7
Reading symbols from /lib/librt.so.1...done.
Loaded symbols for /lib/librt.so.1
Reading symbols from /lib/ld-linux-ia64.so.2...done.
Loaded symbols for /lib/ld-linux-ia64.so.2
Reading symbols from /lib/libpthread.so.0...done.
Loaded symbols for /lib/libpthread.so.0
Reading symbols from /usr/lib/libbitmask.so...done.
Loaded symbols for /usr/lib/libbitmask.so
Reading symbols from /usr/lib/libcpuset.so...done.
Loaded symbols for /usr/lib/libcpuset.so
Reading symbols from /usr/lib/libxpmem.so...done.
Loaded symbols for /usr/lib/libxpmem.so
Core was generated by `../preqx'.
Program terminated with signal 11, Segmentation fault.
#0 0x4000000000183e60 in schedule_mod_mp_setcycle_ ()
(gdb) bt
#0 0x4000000000183e60 in schedule_mod_mp_setcycle_ ()
#1 0x4000000000182c60 in schedule_mod_mp_genedgesched_ ()
#2 0x40000000001e07e0 in preq_init_mod_mp_preq_init_ ()
#3 0x4000000000004ab0 in prim_main () at ../src/prim_main.F90:67
#4 0x4000000000004790 in main ()
(gdb)

Martyn_C_Intel
Employee

Hi,

First, I'm quoting the replies that were already made in the Whatif forum:

______________________________________________-

Dear notahoo,

First of all, it seems you used the wrong forum - you should send this request to the Fortran compiler forum.

From what you wrote, it seems you have problems with the Intel Fortran compiler. According to your trace, there is a SIGSEGV signal in the following part of the HOMME code, at the line marked with an arrow. It looks like a compiler issue. Please send this request to the Intel Fortran Compiler for Linux and Mac OS* X forum (http://software.intel.com/en-us/forums/, section "Intel Software Development Products").

1015 #ifndef _PREDICT
1016 if(il .gt. 0) then
1017 elem(il)%desc%getmapV(face) = Edge%edgeptrV(i) + Cycle%ptrV - 1   <- SEGV is here !!
1018 elem(il)%desc%getmapP(face) = Edge%edgeptrP(i) + Cycle%ptrP - 1
1019 endif
1020 #endif

You can also try the following as possible workarounds: 1) remove -fno-alias if you are using it; 2) if you use -O3 optimization level, decrease it to -O2.

Thanks,
Alexander Semenov

_____________________________________


Thank you for your reply. I did not have -fno-alias or -O3 in my compilation. I sent the error to the Intel compiler forum as you mentioned.

___________________________________________________________________

continuing:

Please can you clarify the platform - is this an Itanium-based Altix? What OS? Which subversion of the 10.1 compiler? (You can obtain this with ifort -V). Is this the first time you tried to build HOMME with the Intel compiler?

Please could you also supply the full compiler command line that was used?

To help establish whether this might be a compiler issue, please could you try compiling the source file containing the failing routine at -O0, & rerunning? There is probably a source file for the whole module schedule_mod. Does it take long to reach the point of failure?

If this works, it is suggestive of a compiler problem in optimization, that we can try to narrow down. If it still fails, then we can try turning on more debugging options at -O0.

Regards,

Martyn

notahoo
Beginner

Compiler info:

%ifort -V
Intel Fortran IA-64 Compiler for applications running on IA-64, Version 10.1 Build 20070913 Package ID: l_fc_p_10.1.008
Copyright (C) 1985-2007 Intel Corporation. All rights reserved.

% uname -a
Linux ..... 2.6.16.60-0.29-default-netboot-lustre-1.6.5.1 #20 SMP Thu Oct 16 12:36:28 EDT 2008 ia64 ia64 ia64 GNU/Linux

% arch

ia64

- The system is an Altix with Itanium2 Montvale processors.

I had tried HOMME with Intel 10.1 on Intel Xeon before, and I did not encounter this same problem.

Compilation flags used:

FFLAGS=-I../src -O2 -nothreads -w90 -Vaxlib (in Makefile.Linux)

I applied your suggestion and compiled schedule_mod.F90 with -O0, and it resolved the segfault reported earlier.

Other successful flags are:

a. FFLAGS=-I../src -nothreads -O2 -w90 -Vaxlib (with -O0 for schedule_mod.F90)

b. FFLAGS=-I../src -nothreads -O2 -w90 -Vaxlib -ftz -align (no need for -O0 on schedule_mod.F90)

Failed flags:

c. FFLAGS=-I../src -nothreads -O2 -w90 -Vaxlib -ftz

d. FFLAGS=-I../src -nothreads -O3 -w90 -Vaxlib -ftz (fails even with -O0 for schedule_mod.F90)

The segfault was caused by the Intel Fortran optimization flag (-O2) applied to schedule_mod.F90. The error was resolved by compiling schedule_mod.F90 with -O0 and the rest of the routines with -O2.

Compiling all the routines with -O0 also works, but it generates an extremely slow binary.
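
For readers who want to reproduce the per-file workaround, here is a minimal dry-run sketch in shell. It only prints the compile commands rather than running them. The helper names opt_for and compile_cmd and the bare file names are hypothetical, not taken from the HOMME Makefile; the base flags are the ones quoted in this post.

```shell
#!/bin/sh
# Sketch: pick the optimization level per source file (dry run only).
# Assumption: schedule_mod.F90 is the only file that needs -O0.
BASE_FLAGS="-I../src -nothreads -w90 -Vaxlib"   # flags quoted in the post

opt_for() {
    # Print the optimization flag to use for the given source file.
    case "$(basename "$1")" in
        schedule_mod.F90) echo "-O0" ;;   # segfaults when built at -O2
        *)                echo "-O2" ;;   # everything else stays optimized
    esac
}

compile_cmd() {
    # Print the compile command instead of executing it.
    echo "ifort $BASE_FLAGS $(opt_for "$1") -c $1"
}

# Hypothetical file list; a real build would iterate over all sources.
for src in schedule_mod.F90 prim_main.F90; do
    compile_cmd "$src"
done
```

In a real Makefile the same effect is usually achieved with a file-specific rule or variable override; the shell form is used here only to keep the example self-contained.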

notahoo
Beginner

Hi,

I am resuming this topic because I am seeing a run-time failure after compiling HOMME in 'OpenMP mode'. I can run the binary successfully on one thread, but it dumps core when I run on more than one thread. The compilation details and the error are posted below FYI:

System information: the system is an SGI Altix 4700 shared-memory machine with Itanium2 Montvale 9130M dual-core processors.

Compiler: same as above

Compiler flags:
CFLAGS = -O2 -openmp -g
FFLAGS= -O2 -openmp

GDB trace:

gdb ../preqx core*
GNU gdb 6.6
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "ia64-suse-linux"...
Using host libthread_db library "/lib/libthread_db.so.1".

warning: Can't read pathname for load map: Input/output error.

warning: .dynamic section for "/lib/libc.so.6.1" is not at the expected address (wrong library or version mismatch?)
Reading symbols from /opt/intel/mkl/10.0.3.020/lib/64/libmkl_gf_ilp64.so...done.
Loaded symbols for /opt/intel/mkl/10.0.3.020/lib/64/libmkl_gf_ilp64.so
Reading symbols from /opt/intel/mkl/10.0.3.020/lib/64/libmkl_core.so...done.
Loaded symbols for /opt/intel/mkl/10.0.3.020/lib/64/libmkl_core.so
Reading symbols from /opt/intel/mkl/10.0.3.020/lib/64/libmkl_sequential.so...done.
Loaded symbols for /opt/intel/mkl/10.0.3.020/lib/64/libmkl_sequential.so
Reading symbols from /opt/intel/fc/10.1.008/lib/libimf.so.6...done.
Loaded symbols for /opt/intel/fc/10.1.008/lib/libimf.so.6
Reading symbols from /lib/libm.so.6.1...done.
Loaded symbols for /lib/libm.so.6.1
Reading symbols from /opt/intel/fc/10.1.008/lib/libguide.so...done.
Loaded symbols for /opt/intel/fc/10.1.008/lib/libguide.so
Reading symbols from /lib/libpthread.so.0...done.
Loaded symbols for /lib/libpthread.so.0
Reading symbols from /lib/libc.so.6.1...done.
Loaded symbols for /lib/libc.so.6.1
Reading symbols from /lib/libgcc_s.so.1...done.
Loaded symbols for /lib/libgcc_s.so.1
Reading symbols from /lib/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/libunwind.so.7...done.
Loaded symbols for /lib/libunwind.so.7
Reading symbols from /lib/ld-linux-ia64.so.2...done.
Loaded symbols for /lib/ld-linux-ia64.so.2
Core was generated by `../preqx'.
Program terminated with signal 6, Aborted.
#0 0xa000000000010620 in __kernel_syscall_via_break ()
(gdb) bt
#0 0xa000000000010620 in __kernel_syscall_via_break ()
#1 0x2000000000b5c1c0 in raise () from /lib/libc.so.6.1
#2 0x2000000000b5eb60 in abort () from /lib/libc.so.6.1
#3 0x2000000000a8d900 in __kmp_do_abort ()
from /opt/intel/fc/10.1.008/lib/libguide.so
#4 0x2000000000a7df10 in __kmp_wait_sleep ()
from /opt/intel/fc/10.1.008/lib/libguide.so
#5 0x2000000000a7d610 in __kmp_barrier ()
from /opt/intel/fc/10.1.008/lib/libguide.so
#6 0x2000000000a61970 in __kmpc_barrier ()
from /opt/intel/fc/10.1.008/lib/libguide.so
#7 0x400000000016bc60 in bndry_mod_mp_bndry_exchangev_thsave_time_ ()
#8 0x40000000001bbca0 in prim_advance_mod_mp_preq_advance_exp_ ()
#9 0x40000000001eb0c0 in prim_seam_mod_mp_preq_ ()
#10 0x4000000000005ad0 in L_MAIN___111__par_region0_2$0 ()
#11 0x2000000000aa1800 in __kmp_invoke_microtask ()
from /opt/intel/fc/10.1.008/lib/libguide.so
#12 0x2000000000a7c300 in __kmpc_invoke_task_func ()
from /opt/intel/fc/10.1.008/lib/libguide.so
#13 0x2000000000a79330 in __kmp_launch_thread ()
from /opt/intel/fc/10.1.008/lib/libguide.so
#14 0x2000000000aa1a20 in __kmp_hardware_timestamp ()
from /opt/intel/fc/10.1.008/lib/libguide.so
TimP
Honored Contributor III
Intel Thread Checker might be valuable, if it can be made to run. Typically, you must be able to build and link at -O0 -g, possibly using open source code rather than MKL.
It looks like you are using much newer hardware and software than what public docs indicate have been tested with this application. 10.1.008 was not as reliable as current 10.1 versions. If the application runs "correctly" at -O0 and shows no problems with thread checker, it's time to try a current compiler.
Martyn_C_Intel
Employee
Hi Notahoo,
The stacktrace shows the failure happening during a call to the OpenMP runtime library.
Catastrophic runtime failures with OpenMP can be caused by exceeding the stack size limits for the application or for the daughter threads.

To increase the stack allocation for daughter threads, export or set the environment variable KMP_STACKSIZE to a value in bytes (or MBytes, with the suffix m). Some experimentation may be needed; memory is actually allocated, so KMP_STACKSIZE should not be made vastly larger than needed. KMP_STACKSIZE only needs to be large enough for data for which a private copy is actually created for each thread. Default values are typically a few MB.
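
The experimentation could be scripted roughly as follows. This is only a sketch: the function name try_stack_sizes, the list of sizes, and the thread count in the usage comment are illustrative assumptions, and how the binary is actually launched (input files, batch system) is not shown.

```shell
#!/bin/sh
# Sketch: rerun a command with increasing per-thread OpenMP stack sizes.
# Assumption: the command exits non-zero (or is killed) when the stack is too small.
try_stack_sizes() {
    # usage: try_stack_sizes "<sizes>" command [args...]
    sizes=$1
    shift
    for size in $sizes; do
        # KMP_STACKSIZE is set only in the environment of this one run.
        if KMP_STACKSIZE=$size "$@"; then
            echo "succeeded with KMP_STACKSIZE=$size"
            return 0
        fi
        echo "failed with KMP_STACKSIZE=$size" >&2
    done
    return 1
}

# Hypothetical usage (sizes and thread count are examples only):
#   OMP_NUM_THREADS=4 try_stack_sizes "8m 16m 32m 64m" ../preqx
```

Starting small and stopping at the first size that works keeps the allocation close to what is needed, in line with the advice not to make KMP_STACKSIZE vastly larger than necessary.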

On an Altix, the shell stack size limit often defaults to unlimited. If it does not, then to increase it use:
ulimit -s unlimited (Bash) or limit stacksize unlimited (C shell). Sometimes, it may be better to give an explicit size (in KBytes) instead of unlimited. Some Linux distributions impose additional conditions on changing the stack limit. Simply recompiling an application with -openmp is likely to increase its stack usage, as data may be placed on the stack to simplify the creation of threadprivate copies, whether or not these are eventually needed. ulimit only sets a limit; additional memory is not allocated on the stack unless it is needed, so the limit can be generous. The limit needs to be large enough to accommodate all non-global data, whether or not private copies eventually get created in parallel regions.
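
A small Bash sketch of that step, for use at the top of a job script, might look like this. The function name raise_stack_limit and the 1048576 KByte fallback are assumptions chosen for illustration; the right explicit size depends on the application and on the limits the system allows.

```shell
#!/bin/bash
# Sketch: raise the shell stack limit, falling back to an explicit size.
# Assumption: 1048576 KBytes (1 GByte) is an arbitrary illustrative fallback.
raise_stack_limit() {
    fallback_kb=${1:-1048576}
    # Try "unlimited" first; some systems refuse it, so try an explicit size next.
    if ! ulimit -s unlimited 2>/dev/null; then
        ulimit -s "$fallback_kb" 2>/dev/null \
            || echo "could not raise the stack limit" >&2
    fi
    # Print the soft limit now in effect ("unlimited" or a size in KBytes).
    ulimit -s
}

# Hypothetical usage in a job script, before launching the binary:
#   raise_stack_limit
#   ../preqx
```

The limit applies to the shell that calls the function and to the processes it starts afterwards, so the call has to come before the program is launched.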


Sorry not to have suggested this earlier. Please let us know if you continue to have runtime problems.