Intel® Fortran Compiler

Bug in Intel MPI Fortran compiler version 15

John_D_12
Beginner

Dear All,

There appears to be an obscure bug in version 15 of the Fortran compiler that shows up only when MPI, OpenMP and string passing are all combined.

We're running on Intel hardware under Linux. First the compiler:

mpif90 --version

ifort (IFORT) 15.0.3 20150407
Copyright (C) 1985-2015 Intel Corporation.  All rights reserved.

... now the smallest code I could devise that still exhibits the bug:

program test
implicit none
integer i
!$OMP PARALLEL DEFAULT(SHARED)
!$OMP DO
do i=1,10
  call foo
end do
!$OMP END DO
!$OMP END PARALLEL
end program

subroutine foo
implicit none
character(256) str                 ! local buffer; its declared length (256) travels with it
call bar(str)
return
end subroutine

subroutine bar(str)
implicit none
character(*), intent(out) :: str   ! assumed-length dummy; length comes from the caller
write(str,'(I1)') 1                ! internal write into the caller's buffer
return
end subroutine

... compiling with:

mpif90 -openmp test.f90

... produces an executable that crashes randomly when OMP_NUM_THREADS is greater than one. The crash does not occur with version 14 of the compiler.
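For reference, this is how we typically run it (assuming a bash-like shell; the thread count of 4 is only an example, any value above one eventually crashes):

export OMP_NUM_THREADS=4
./a.out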

Regards,

John.

 

Kevin_D_Intel
Employee

I built and ran the test case multiple times on RHEL 6.4 with mpif90 15.0.3 (my underlying MPI version is Intel MPI 5.0.1.035), including varying the number of threads, and I cannot reproduce any failure with more than one thread, so I must be missing an ingredient.

Can you post the output of the command which mpif90, along with your version of Linux?

John_D_12
Beginner

Hi Kevin,

Here is some more information which may be helpful:

which mpif90

/cluster/openmpi/1.8.6/intel_15.0.3/bin/mpif90

 

uname -r

2.6.32-504.3.3.el6.x86_64

 

cat /etc/issue

CentOS release 6.6 (Final)
Kernel \r on an \m

 

cat /etc/*release

CentOS release 6.6 (Final)
LSB_VERSION=base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
CentOS release 6.6 (Final)
CentOS release 6.6 (Final)

 

lscpu

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                24
On-line CPU(s) list:   0-23
Thread(s) per core:    2
Core(s) per socket:    6
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 44
Stepping:              2
CPU MHz:               1600.000
BogoMIPS:              5333.20
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              12288K
NUMA node0 CPU(s):     0-5,12-17
NUMA node1 CPU(s):     6-11,18-23

 

This is just the login node of an 8000-core cluster, but I've tried it on various nodes and they all crash. Here is an example of the output from a crash:

./a.out 

*** glibc detected *** ./a.out: double free or corruption (!prev): 0x00007fc6d0000c80 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3411e75e66]
/lib64/libc.so.6[0x3411e789b3]
/cluster/openmpi/1.8.6/intel_15.0.3/lib/libmpi_usempif08.so.0(for__free_vm+0x2a)[0x7fc6ef636a0a]
/cluster/openmpi/1.8.6/intel_15.0.3/lib/libmpi_usempif08.so.0(for__release_lun+0x2b4)[0x7fc6ef62ef74]
./a.out[0x40687d]
./a.out[0x403b7a]
./a.out[0x4030f5]
/cluster/intel_2015/composer_xe_2015.3.187/compiler/lib/intel64/libiomp5.so(__kmp_invoke_microtask+0x93)[0x7fc6eeb5bab3]
/cluster/intel_2015/composer_xe_2015.3.187/compiler/lib/intel64/libiomp5.so(+0x759d7)[0x7fc6eeb309d7]
/cluster/intel_2015/composer_xe_2015.3.187/compiler/lib/intel64/libiomp5.so(+0x750da)[0x7fc6eeb300da]
/cluster/intel_2015/composer_xe_2015.3.187/compiler/lib/intel64/libiomp5.so(+0xa0dad)[0x7fc6eeb5bdad]
/lib64/libpthread.so.0[0x34126079d1]
/lib64/libc.so.6(clone+0x6d)[0x3411ee88fd]

...

 

Building the same code with the non-MPI compiler driver (plain ifort, version 15) works fine.
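By the non-MPI build I mean compiling the same source directly with the compiler driver, along these lines (the exact command here is only an illustration, not copied from our build system):

ifort -openmp test.f90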

Regards,

John.

Kevin_D_Intel
Employee

Thank you for the added information. Unfortunately, CentOS and openmpi are two ingredients that I do not have, although I do have RHEL 6.6, so I will look into obtaining openmpi.

Out of curiosity, and at your convenience, could you try compiling/running with the additional options -g -traceback?

Kevin_D_Intel
Employee

I have not moved to RHEL 6.6 yet, but I do have openmpi 1.8.6 built and running under RHEL 6.4, and I believe I have reproduced the issue. The failure signature I see is a segmentation fault, but the other characteristics seem to match. It occurs only intermittently with OMP_NUM_THREADS >= 4, and my traceback below (I'm compiling with -openmp -g -traceback -O0) shows similarities to yours.

I will provide updates as I learn more.

$ ./a.out
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source      
a.out              0000000000409DA1  Unknown               Unknown  Unknown
a.out              00000000004084F7  Unknown               Unknown  Unknown
libmpi_usempif08.  00007FC219C8A772  Unknown               Unknown  Unknown
libmpi_usempif08.  00007FC219C8A5C6  Unknown               Unknown  Unknown
libmpi_usempif08.  00007FC219C78E1C  Unknown               Unknown  Unknown
libmpi_usempif08.  00007FC219C59FB8  Unknown               Unknown  Unknown
libpthread.so.0    0000003F75A0F500  Unknown               Unknown  Unknown
libopen-pal.so.6   00007FC218BF14BD  Unknown               Unknown  Unknown
libopen-pal.so.6   00007FC218BF3E58  Unknown               Unknown  Unknown
libmpi_usempif08.  00007FC219C657E2  Unknown               Unknown  Unknown
libmpi_usempif08.  00007FC219C5E023  Unknown               Unknown  Unknown
libmpi_usempif08.  00007FC219C5DB49  Unknown               Unknown  Unknown
a.out              00000000004032A3  Unknown               Unknown  Unknown
a.out              000000000040320E  bar_                       23  u564266.f90
a.out              0000000000403161  foo_                       16  u564266.f90
a.out              000000000040311C  MAIN__                      7  u564266.f90
libiomp5.so        00007FC219177AB3  Unknown               Unknown  Unknown
libiomp5.so        00007FC21914C9D7  Unknown               Unknown  Unknown
libiomp5.so        00007FC21914E032  Unknown               Unknown  Unknown
libiomp5.so        00007FC219121FD5  Unknown               Unknown  Unknown
a.out              0000000000402FA3  MAIN__                      4  u564266.f90
a.out              0000000000402EAE  Unknown               Unknown  Unknown
libc.so.6          0000003F7521ECDD  Unknown               Unknown  Unknown
a.out              0000000000402DB9  Unknown               Unknown  Unknown

 

John_D_12
Beginner

Hi Kevin,

Thanks for the update.

I think a memory violation is occurring at the line

write(str,'(I1)') 1

Fortran passes a hidden argument containing the length of the string; the compiler may be messing that hidden argument up.
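Just to illustrate what I mean by the hidden length (the program below is a sketch of mine, not part of the original test case): for a character(*) dummy like the one in bar, the callee only knows the string's length through that hidden argument supplied by the caller, so corrupting it would plausibly break the internal write.

program length_demo
implicit none
character(256) :: buf
call bar(buf)
contains
subroutine bar(str)
character(*), intent(out) :: str
! len(str) is taken from the hidden length argument supplied by the caller
print *, 'length seen inside bar: ', len(str)   ! prints 256 here
write(str,'(I1)') 1
end subroutine
end program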

Regards,

John.

 

Kevin_D_Intel
Employee

Just another update. From the variants of the test case I created, I could not conclude that there is a problem with the string passing. The failure does occur inside the call chain of Fortran RTL routines handling the internal write, as you noted, but what causes the fault is unclear at the moment. I have forwarded the test case and my openmpi 1.8.6 build to our RTL Development team (see the internal tracking id below) for further investigation and will keep you updated on our findings.
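To give a sense of the kind of variant tried (this particular one is only an illustration, not a confirmed workaround), a version of bar that replaces the internal write with a plain assignment avoids the Fortran I/O runtime altogether:

subroutine bar(str)
implicit none
character(*), intent(out) :: str
! simple assignment instead of an internal write; str is blank-padded as usual
str = '1'
end subroutine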

(Internal tracking id: DPD200374632)

Kevin_D_Intel
Employee

One more thought: if you know, or could find out, it would be helpful to know how your openmpi 1.8.6 was built. I built my copy based on the article Performance Tools for Software Developers - Building Open MPI* with the Intel® compilers.
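For reference, that article essentially amounts to configuring openmpi with the Intel compilers selected, roughly along these lines (an illustrative sketch only; the prefix and options on your cluster will differ):

./configure --prefix=/opt/openmpi-1.8.6 CC=icc CXX=icpc FC=ifort
make all install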

Thank you

Kevin_D_Intel
Employee

From their examination, our Development team concluded that the underlying problem with openmpi 1.8.6 resulted from mixing out-of-date/incompatible Fortran RTLs. In short, older static Fortran RTL bodies were incorporated into the openmpi library, and mixing them with the newer Fortran RTL led to the failure. They found the issue is resolved in the newer openmpi-1.10.1rc2, and the recommended fix is to use a newer openmpi release with our 15.0 (or newer) compiler release.
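For anyone who wants to check an existing build, one way to see whether Fortran RTL bodies ended up inside the MPI library (a generic diagnostic suggestion on my part, not something from the Development team's report) is to search it for the for__ symbols that appear in John's backtrace, e.g.:

nm -D /cluster/openmpi/1.8.6/intel_15.0.3/lib/libmpi_usempif08.so.0 | grep for__

If routines such as for__free_vm and for__release_lun show up as defined ('T') symbols there, the library carries its own copy of the Fortran RTL and can conflict with the one shipped with the compiler.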

It turns out this issue was related to an earlier report here: https://software.intel.com/en-us/forums/intel-fortran-compiler-for-linux-and-mac-os-x/topic/540673

John_D_12
Beginner

Hi Kevin,

Thanks for the update: good to know that the problem has been resolved.

I'll forward this to our system administrator and ask him to update the software on our cluster.

Regards,

John.
