Dear All,
There appears to be an obscure bug in version 15 of the Fortran compiler involving the combination of MPI, OpenMP and string passing.
We're running on Intel hardware under Linux. First the compiler:
mpif90 --version
ifort (IFORT) 15.0.3 20150407
Copyright (C) 1985-2015 Intel Corporation. All rights reserved.
... now the smallest code I could devise which still has the bug:
program test
  implicit none
  integer i
  !$OMP PARALLEL DEFAULT(SHARED)
  !$OMP DO
  do i=1,10
    call foo
  end do
  !$OMP END DO
  !$OMP END PARALLEL
end program

subroutine foo
  implicit none
  character(256) str
  call bar(str)
  return
end subroutine

subroutine bar(str)
  implicit none
  character(*), intent(out) :: str
  write(str,'(I1)') 1
  return
end subroutine
... compiling with:
mpif90 -openmp test.f90
... results in an executable that crashes randomly when OMP_NUM_THREADS is greater than one. This does not occur with version 14 of the compiler.
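To reproduce, run the resulting executable with more than one thread, for example:

export OMP_NUM_THREADS=4
./a.out

Since the crash is random, it may take a few runs to trigger.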
Regards,
John.
I built and ran the test case multiple times on RHEL 6.4 with mpif90 15.0.3 (my underlying MPI version is Intel MPI 5.0.1.035), including varying the number of threads, and I cannot reproduce any failure with more than one thread, so I must be missing an ingredient.
Can you post the output of the command which mpif90, along with your version of Linux?
Hi Kevin,
Here is some more information which may be helpful:
which mpif90
/cluster/openmpi/1.8.6/intel_15.0.3/bin/mpif90

uname -r
2.6.32-504.3.3.el6.x86_64

cat /etc/issue
CentOS release 6.6 (Final)
Kernel \r on an \m

cat /etc/*release
CentOS release 6.6 (Final)
LSB_VERSION=base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
CentOS release 6.6 (Final)
CentOS release 6.6 (Final)

lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                24
On-line CPU(s) list:   0-23
Thread(s) per core:    2
Core(s) per socket:    6
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 44
Stepping:              2
CPU MHz:               1600.000
BogoMIPS:              5333.20
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              12288K
NUMA node0 CPU(s):     0-5,12-17
NUMA node1 CPU(s):     6-11,18-23
This is just the login node of an 8000 core cluster, but I've tried it on various nodes and they all crash. Here is an example of the output from a crash:
./a.out
*** glibc detected *** ./a.out: double free or corruption (!prev): 0x00007fc6d0000c80 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3411e75e66]
/lib64/libc.so.6[0x3411e789b3]
/cluster/openmpi/1.8.6/intel_15.0.3/lib/libmpi_usempif08.so.0(for__free_vm+0x2a)[0x7fc6ef636a0a]
/cluster/openmpi/1.8.6/intel_15.0.3/lib/libmpi_usempif08.so.0(for__release_lun+0x2b4)[0x7fc6ef62ef74]
./a.out[0x40687d]
./a.out[0x403b7a]
./a.out[0x4030f5]
/cluster/intel_2015/composer_xe_2015.3.187/compiler/lib/intel64/libiomp5.so(__kmp_invoke_microtask+0x93)[0x7fc6eeb5bab3]
/cluster/intel_2015/composer_xe_2015.3.187/compiler/lib/intel64/libiomp5.so(+0x759d7)[0x7fc6eeb309d7]
/cluster/intel_2015/composer_xe_2015.3.187/compiler/lib/intel64/libiomp5.so(+0x750da)[0x7fc6eeb300da]
/cluster/intel_2015/composer_xe_2015.3.187/compiler/lib/intel64/libiomp5.so(+0xa0dad)[0x7fc6eeb5bdad]
/lib64/libpthread.so.0[0x34126079d1]
/lib64/libc.so.6(clone+0x6d)[0x3411ee88fd]
...
Compiling the same code without the MPI wrapper (i.e. with plain ifort version 15) works fine.
Regards,
John.
Thank you for the added information. Unfortunately, CentOS and openmpi are two ingredients that I do not have, although I have RHEL 6.6, so I will look into obtaining openmpi.
Out of curiosity, at your convenience, could you try compiling/running with the additional options: -g -traceback
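For example, with the same source file as above that would be something like:

mpif90 -openmp -g -traceback test.f90

With those options a runtime failure should report routine names and line numbers in the traceback.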
I have not moved to RHEL 6.6 yet, but I do have openmpi 1.8.6 built and running under RHEL 6.4, and I believe I have reproduced the issue. The failure signature I see is a segmentation fault, but the other characteristics seem to match. It only occurs periodically with OMP_NUM_THREADS >= 4, and my trace below (I'm compiling with -openmp -g -traceback -O0) shows similarities to yours.
I will provide updates as I learn more.
$ ./a.out
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
a.out              0000000000409DA1  Unknown               Unknown  Unknown
a.out              00000000004084F7  Unknown               Unknown  Unknown
libmpi_usempif08.  00007FC219C8A772  Unknown               Unknown  Unknown
libmpi_usempif08.  00007FC219C8A5C6  Unknown               Unknown  Unknown
libmpi_usempif08.  00007FC219C78E1C  Unknown               Unknown  Unknown
libmpi_usempif08.  00007FC219C59FB8  Unknown               Unknown  Unknown
libpthread.so.0    0000003F75A0F500  Unknown               Unknown  Unknown
libopen-pal.so.6   00007FC218BF14BD  Unknown               Unknown  Unknown
libopen-pal.so.6   00007FC218BF3E58  Unknown               Unknown  Unknown
libmpi_usempif08.  00007FC219C657E2  Unknown               Unknown  Unknown
libmpi_usempif08.  00007FC219C5E023  Unknown               Unknown  Unknown
libmpi_usempif08.  00007FC219C5DB49  Unknown               Unknown  Unknown
a.out              00000000004032A3  Unknown               Unknown  Unknown
a.out              000000000040320E  bar_                       23  u564266.f90
a.out              0000000000403161  foo_                       16  u564266.f90
a.out              000000000040311C  MAIN__                      7  u564266.f90
libiomp5.so        00007FC219177AB3  Unknown               Unknown  Unknown
libiomp5.so        00007FC21914C9D7  Unknown               Unknown  Unknown
libiomp5.so        00007FC21914E032  Unknown               Unknown  Unknown
libiomp5.so        00007FC219121FD5  Unknown               Unknown  Unknown
a.out              0000000000402FA3  MAIN__                      4  u564266.f90
a.out              0000000000402EAE  Unknown               Unknown  Unknown
libc.so.6          0000003F7521ECDD  Unknown               Unknown  Unknown
a.out              0000000000402DB9  Unknown               Unknown  Unknown
Hi Kevin,
Thanks for the update.
I think a memory violation is occurring at the line

write(str,'(I1)') 1

Fortran passes a hidden argument containing the length of a string, and the compiler may be messing that up.
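To illustrate, here is a sketch of what the callee relies on (assuming the usual convention in which the caller supplies the declared length alongside the address of str):

subroutine bar(str)
  implicit none
  character(*), intent(out) :: str
  ! len(str) is taken from the hidden length argument supplied by the
  ! caller; when called from foo above it should be 256.
  print *, len(str)
  write(str,'(I1)') 1
end subroutine

If that hidden length were corrupted, the internal write could end up outside the buffer.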
Regards,
John.
Just another update. From the variants of the test case I created, I could not conclude that there is a problem with the string passing. The failure does occur inside the call chain of Fortran RTL routines handling the internal write, as you noted, but what causes the fault is unclear at the moment. I have forwarded the test case and my openmpi 1.8.6 build to our RTL Development team (see the internal tracking id below) for further investigation and will keep you updated on our findings.
(Internal tracking id: DPD200374632)
One more thought. If you know, or could find out, how your openmpi 1.8.6 was built, that would be helpful. I built my copy based on the article Performance Tools for Software Developers - Building Open MPI* with the Intel® compilers.
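For comparison, an Intel-compiler build of Open MPI typically looks something like the following (the install prefix is only illustrative, and this may well differ from how your copy was configured):

./configure --prefix=/opt/openmpi-1.8.6-intel CC=icc CXX=icpc F77=ifort FC=ifort
make all
make install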
Thank you
From their examination, our Development team concluded that the underlying problem with openmpi 1.8.6 resulted from mixing out-of-date/incompatible Fortran RTLs. In short, older static Fortran RTL bodies were incorporated into the openmpi library, and mixing them with the newer Fortran RTL led to the failure. They found the issue is resolved in the newer openmpi-1.10.1rc2, and they recommend resolving it by using a newer openmpi release together with our 15.0 (or newer) compiler release.
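For anyone who wants to check their own installation, one way to see whether Fortran RTL bodies were absorbed into the Open MPI Fortran library is to look for defined for__ symbols in it, for example (using the library path from the backtrace above):

nm -D --defined-only /cluster/openmpi/1.8.6/intel_15.0.3/lib/libmpi_usempif08.so.0 | grep for__

If runtime routines such as for__free_vm or for__release_lun show up as defined symbols there, the library carries its own copy of the Fortran RTL.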
It turns out this issue was related to an earlier report here, https://software.intel.com/en-us/forums/intel-fortran-compiler-for-linux-and-mac-os-x/topic/540673
Hi Kevin,
Thanks for the update: good to know that the problem has been resolved.
I'll forward this to our system administrator and ask him to update the software on our cluster.
Regards,
John.
