Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Module load in oneapi for coarray run

sverdrup
Beginner

Hi all,

I'm trying to compile and run a Fortran code with coarrays. The compilation of the code seems OK, but when I run it I obtain the following error:

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 721 RUNNING AT 41ca5757f459
= KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 1 PID 722 RUNNING AT 41ca5757f459
= KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================

I've also tried a hello world code with coarrays, but the result is the same, so it does not seem to be a memory error in my code! Is it possible that I have to load other modules?
These are my loaded modules:

Currently Loaded Modulefiles:
 1) mpi/2021.6.0   2) tbb/latest   3) compiler-rt/latest   4) oclfpga/latest   5) compiler/2022.1.0   6) init_opencl/2022.1.0   7) vtune/2022.2.0

and this is the list of the available modules from the oneAPI modulefiles:

--------------------------------------------------------------- /opt/intel/oneapi/modulefiles ---------------------------------------------------------------
advisor/2022.1.0 compiler-rt32/latest dev-utilities/2021.6.0 dnnl/latest inspector/2022.1.0 mkl/latest vpl/2022.1.0
advisor/latest compiler/2022.1.0 dev-utilities/latest dpl/2021.7.0 inspector/latest mkl32/2022.1.0 vpl/latest
ccl/2021.6.0 compiler/latest dnnl-cpu-gomp/2022.1.0 dpl/latest intel_ipp_intel64/2021.6.0 mkl32/latest vtune/2022.2.0
ccl/latest compiler32/2022.1.0 dnnl-cpu-gomp/latest icc/2022.1.0 intel_ipp_intel64/latest mpi/2021.6.0 vtune/latest
clck/2021.6.0 compiler32/latest dnnl-cpu-iomp/2022.1.0 icc/latest intel_ippcp_intel64/2021.6.0 mpi/latest
clck/latest dal/2021.6.0 dnnl-cpu-iomp/latest icc32/2022.1.0 intel_ippcp_intel64/latest oclfpga/2022.1.0
compiler-rt/2022.1.0 dal/latest dnnl-cpu-tbb/2022.1.0 icc32/latest itac/2021.6.0 oclfpga/latest
compiler-rt/latest debugger/2021.6.0 dnnl-cpu-tbb/latest init_opencl/2022.1.0 itac/latest tbb/2021.6.0
compiler-rt32/2022.1.0 debugger/latest dnnl/2022.1.0 init_opencl/latest mkl/2022.1.0 tbb/latest

Thanks in advance,

best regards

26 Replies
jimdempseyatthecove
Black Belt

Suggestions/Questions:

 

1) Does your program (and "Hello") run standalone (without MPI) on the development system?

2) Does your program run via MPI with 1 rank on the development system?

3) Does your program run via MPI with 2 ranks on the development system?

4) Does your program (and "Hello") run standalone (without MPI) on a remote system via an MPI launch from the development system?

 

The above will help isolate where the problem occurs.
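The four checks above can be sketched as commands (assuming the executable is called ./a.out, the Intel MPI launchers are on the PATH, and the remote hostname is a placeholder):

```shell
./a.out                                 # 1) standalone, no MPI
mpirun -n 1 ./a.out                     # 2) via MPI, 1 rank
mpirun -n 2 ./a.out                     # 3) via MPI, 2 ranks
mpirun -n 1 -hosts remote-host ./a.out  # 4) launched on a remote system
```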

 

Also, what is your command line launch method? (mpirun, mpiexec, hydra...., etc...)?

 

Jim Dempsey

 

sverdrup
Beginner

Hello Jim, and thanks for your help.

My command line launch method for the code is:

ifort -O3 -coarray=shared -coarray-num-images=6 mod_bathy_field.f90 bathy.f90 `nf-config --fflags --flibs` -o pippo

The standalone version (without coarrays) does not compile, because the code is parallelized via coarrays; if I try to compile without the -coarray flag, I get the following errors:

mod_bathy_field.f90(21): error #8516: Coarray declarations are not allowed when the -coarray compiler option is not specified.   [BFRAME]
real, allocatable,dimension(:,:)::aframe,bframe[:]
-----------------------------------------^
mod_bathy_field.f90(59): error #8516: Coarray declarations are not allowed when the -coarray compiler option is not specified.   [TEDGE]
real,allocatable, dimension(:,:),codimension[:]::tedge,sedge
-------------------------------------------------^
mod_bathy_field.f90(59): error #8516: Coarray declarations are not allowed when the -coarray compiler option is not specified.   [SEDGE]
real,allocatable, dimension(:,:),codimension[:]::tedge,sedge
-------------------------------------------------------^
mod_bathy_field.f90(191): error #8516: Coarray declarations are not allowed when the -coarray compiler option is not specified.   [BT]
real, allocatable,dimension(:,:),codimension[:] :: bt
---------------------------------------------------^
mod_bathy_field.f90(68): error #8510: If an allocate-coarray-spec is specified then the allocate-object must be an allocatable coarray.   [TEDGE]
allocate(tedge(nlons,2)[*],sedge(nlats,2)[*])
---------^
mod_bathy_field.f90(68): error #8510: If an allocate-coarray-spec is specified then the allocate-object must be an allocatable coarray.   [SEDGE]
allocate(tedge(nlons,2)[*],sedge(nlats,2)[*])
---------------------------^
mod_bathy_field.f90(75): error #8363: An entity is not a coarray.   [SEDGE]
self%gemo%z(0,1:nlats)=sedge(1:nlats,1)[jj]
-----------------------^
mod_bathy_field.f90(83): error #8363: An entity is not a coarray.   [SEDGE]
self%gemo%z(nlons+1,1:nlats)=sedge(1:nlats,1)[jj]
-----------------------------^
mod_bathy_field.f90(91): error #8363: An entity is not a coarray.   [TEDGE]
self%gemo%z(1:nlons,0)=tedge(1:nlons,1)[jj]
-----------------------^
mod_bathy_field.f90(99): error #8363: An entity is not a coarray.   [TEDGE]
self%gemo%z(1:nlons,nlats+1)=tedge(1:nlons,2)[jj]
-----------------------------^
mod_bathy_field.f90(204): error #8510: If an allocate-coarray-spec is specified then the allocate-object must be an allocatable coarray.   [BT]
allocate(bt(nlon,nlat)[*])
---------^
mod_bathy_field.f90(213): error #8363: An entity is not a coarray.   [BT]
         bt(i,j)[1]=bt(i,j)
---------^
compilation aborted for mod_bathy_field.f90 (code 1)
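One way to keep the coarray declarations compiling while taking the MPI layer out of the picture is a single-image build; ifort accepts -coarray=single for this (a sketch, reusing the poster's command line):

```shell
# build a single-image coarray executable: the coarray syntax is accepted,
# but the program runs as one image with no MPI launch underneath
ifort -O3 -coarray=single mod_bathy_field.f90 bathy.f90 `nf-config --fflags --flibs` -o pippo_single
./pippo_single
```

If this runs cleanly, the coarray code itself is likely fine and the problem sits in the MPI runtime or environment.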

For the "hello" code, the standalone compilation goes OK, but if I compile with the -coarray flag and run it, I get the same error.

I hope I have been clear enough,

thank you

Alessandro

jimdempseyatthecove
Black Belt

What happens when you compile with -coarray=shared -coarray-num-images=1?

 

Segmentation fault indicates an access (either code or data) to a virtual memory page (either resident or in page file) that is .NOT. part of your virtual address page. It may also occur when writing to a read-only page .OR. data read from an execute-only page.

 

Try inserting diagnostic

 

program
   ...
   print *,"debug 1" ! 1st statement of program
   ...
   print *,"debug 2" ! 1st statement following first coarray-esque statement.
   ...
  

Jim Dempsey

 

sverdrup
Beginner

I've inserted diagnostics at two different points in the program, the first one immediately after the beginning of the code.

If I run with -coarray=shared -coarray-num-images=1 I get the same error.

Is it possible the problem is related to the environment configuration?

I have these modules loaded at the moment:

Currently Loaded Modulefiles:
 1) intel_ipp_intel64/2021.6.0   3) compiler-rt/2022.1.0   5) compiler/2022.1.0   7) mkl/2022.1.0    9) dev-utilities/2021.6.0
 2) tbb/2021.6.0                 4) oclfpga/2022.1.0       6) mpi/2021.6.0        8) itac/2021.6.0

and the available modules are:

root@9d9bc7f4fe75:/home/data# module avail
--------------------------------------------------------------- /opt/intel/oneapi/modulefiles ---------------------------------------------------------------
advisor/2022.1.0        compiler/2022.1.0       dnnl-cpu-gomp/2022.1.0  icc/2022.1.0                intel_ippcp_intel64/2021.6.0  oclfpga/2022.1.0
ccl/2021.6.0            compiler32/2022.1.0     dnnl-cpu-iomp/2022.1.0  icc32/2022.1.0              itac/2021.6.0                 tbb/2021.6.0
clck/2021.6.0           dal/2021.6.0            dnnl-cpu-tbb/2022.1.0   init_opencl/2022.1.0        mkl/2022.1.0                  vpl/2022.1.0
compiler-rt/2022.1.0    debugger/2021.6.0       dnnl/2022.1.0           inspector/2022.1.0          mkl32/2022.1.0                vtune/2022.2.0
compiler-rt32/2022.1.0  dev-utilities/2021.6.0  dpl/2021.7.0            intel_ipp_intel64/2021.6.0  mpi/2021.6.0

-------------------------------------------------------------- /usr/share/modules/modulefiles ---------------------------------------------------------------

I have configured the whole system in a container to generate a trial environment before moving everything onto a physical machine; could this be the problem?

Thank you in advance!

jimdempseyatthecove
Black Belt

Run this test:

Console Fortran app

program HelloWorld
   print *, "Hello"
end program HelloWorld

Then, from the command line .AND. environment you use to test your app, issue:

   mpiexec ./HelloWorld     (or whatever and wherever name you call your executable)

 

This should run (assuming MPI is in LD_LIBRARY_PATH, as set up by the MPI environment script) with a default number of processes of 1 per core. If this does not run, then either the environment is not set correctly .OR. the installation failed. Coarrays use MPI.
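For reference, the environment script Jim mentions is the one shipped with oneAPI; assuming a default installation path, it is sourced as:

```shell
# set up compilers, MPI and libraries for the current shell (default install path)
source /opt/intel/oneapi/setvars.sh
# or, to set up only the MPI component:
source /opt/intel/oneapi/mpi/latest/env/vars.sh
```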

 

Jim Dempsey

 

 

sverdrup
Beginner

Hello Jim,

I've used a code with diagnostics as you suggested; the result is:

root@8c6937a9880a:/home/data# ifort hello.f90 -o a.out
root@8c6937a9880a:/home/data# mpiexec ./a.out
 debug 1
 Hello from image           1 of           1
 debug 2
 Goodbye from image           1 of           1
 debug 1
 Hello from image           1 of           1
 debug 2
 Goodbye from image           1 of           1
 debug 1
 Hello from image           1 of           1
 debug 2
 Goodbye from image           1 of           1
 debug 1
 Hello from image           1 of           1
 debug 2
 Goodbye from image           1 of           1
 debug 1
 Hello from image           1 of           1
 debug 2
 Goodbye from image           1 of           1
 debug 1
 Hello from image           1 of           1
 debug 2
 Goodbye from image           1 of           1

But if I try to compile with the -coarray flag and run it, the result is:


===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 6 PID 883 RUNNING AT 8c6937a9880a
=   KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 7 PID 886 RUNNING AT 8c6937a9880a
=   KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 8 PID 889 RUNNING AT 8c6937a9880a
=   KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 9 PID 892 RUNNING AT 8c6937a9880a
=   KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 10 PID 895 RUNNING AT 8c6937a9880a
=   KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 11 PID 898 RUNNING AT 8c6937a9880a
=   KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================

 Where is the mistake? I'm going crazy with this problem.

Thanks for your help, best regards

jimdempseyatthecove
Black Belt

Let me guess:

Your run command line for the coarray version was:

mpiexec ./a.out

 Note, coarray programs contain a shell that performs the mpiexec.

The proper way to launch this coarray program is:

  ./a.out

 

Jim Dempsey

 

sverdrup
Beginner

If I run with ./a.out I get:

root@f9f64a14d624:/home/data# ifort hello.f90 -o a.out
root@f9f64a14d624:/home/data# ./a.out
 debug 1
 Hello from image           1 of           1
 debug 2
 Goodbye from image           1 of           1

while if I run with mpiexec I get:

root@f9f64a14d624:/home/data# mpiexec ./a.out
 debug 1
 Hello from image           1 of           1
 debug 2
 Goodbye from image           1 of           1
 debug 1
 Hello from image           1 of           1
 debug 2
 Goodbye from image           1 of           1
 debug 1
 Hello from image           1 of           1
 debug 2
 Goodbye from image           1 of           1
 debug 1
 Hello from image           1 of           1
 debug 2
 Goodbye from image           1 of           1
 debug 1
 Hello from image           1 of           1
 debug 2
 Goodbye from image           1 of           1
 debug 1
 Hello from image           1 of           1
 debug 2
 Goodbye from image           1 of           1

Is it possible there is a missing library or something like that?

Is there another way to run Fortran code with coarrays on a Linux system?

Thanks in advance.

Best regards

jimdempseyatthecove
Black Belt

>>Is it possible there is a missing library or something like that?

Possibly. Or you may have an incorrect LD_LIBRARY_PATH setting that results in an incompatible version being loaded.

 

>>Is there another way to run Fortran code with coarrays on a Linux system?

 

Try making a shell main PROGRAM that calls the former PROGRAM source as a subroutine (in a separate source file). Compile all files except the new shell program with -O3 into a static library. Then compile the shell program with -O0 (and the other coarray options as before), now adding the new static library. If this runs, then suspect something with the LD_LIBRARY_PATH setting or some other corruption. Also, check whether multiple (conflicting) Intel library paths are in LD_LIBRARY_PATH in the release environment.
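A minimal sketch of the split Jim describes (file and procedure names here are hypothetical):

```fortran
! former_main.f90 -- the former PROGRAM body, repackaged as a subroutine;
! compiled into the static library
subroutine former_main()
   implicit none
   ! ... original program body, coarray code included ...
end subroutine former_main

! shell_main.f90 -- new shell program; built with the coarray options and
! linked against the static library
program shell_main
   implicit none
   call former_main()
end program shell_main
```

Built roughly as: ifort -O3 -coarray=shared -c former_main.f90; ar rcs libformer.a former_main.o; then ifort -O0 -coarray=shared shell_main.f90 libformer.a -o app (the -coarray option is kept on the library objects too, since coarray declarations require it).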

 

Jim Dempsey

sverdrup
Beginner

In my cluster environment, where the program runs without problems, I have to load some modules before compiling and running...

The modules are:

export HDF5_USE_FILE_LOCKING=FALSE
module load intel/oneAPI/2021.2/all-tools
module load netcdff/serial/4.5.4
module load intel/oneAPI/2021.2/hpc-sdk

But in my oneAPI installation's modulefiles I don't have any of these modules...

These are the only modules I have in my oneAPI modulefiles:

advisor  compiler       compiler32  dev-utilities  dnnl-cpu-iomp  icc          inspector            itac   mpi      vpl
ccl      compiler-rt    dal         dnnl           dnnl-cpu-tbb   icc32        intel_ipp_intel64    mkl    oclfpga  vtune
clck     compiler-rt32  debugger    dnnl-cpu-gomp  dpl            init_opencl  intel_ippcp_intel64  mkl32  tbb

any idea?

 

jimdempseyatthecove
Black Belt

Why did you wait to say the program runs on one system but not the other?

 

Did you install the oneAPI HPC kit on the oneAPI installation system that is having the problem? IOW, is the MPI installed on that system the oneAPI version of MPI?

 

Jim Dempsey

sverdrup
Beginner

No, the first system is a cluster with an old version of oneAPI; the system I'm trying to bring up is a workstation with Ubuntu.

Do you think the different oneAPI version could be the cause? And why doesn't any other Fortran code give me the same error?

The problem seems to be in the coarray "installation": all other Fortran code compiles and runs OK, but nothing that uses coarrays does...

jimdempseyatthecove
Black Belt

Does the application built with the ifx compiler suffer the same issues?

 

Can you install the older version of oneAPI on the problem system? (This may require uninstalling the newer oneAPI.)

If that works, you might then be able to determine if the issue resides with the compiler or with a library. Then attempt a mix.

 

Jim Dempsey

sverdrup
Beginner

Well, I've downgraded the oneAPI HPC kit to 2021.2.0-2997, and now when I run the test with my hello.f90 I get:

root@b002fe43068c:/home/data# ifort -coarray=shared hello.f90 -o a.out
root@b002fe43068c:/home/data# ./a.out
 debug 1
 Hello from image           1 of          12
 debug 1
 Hello from image           2 of          12
 debug 1
 Hello from image           3 of          12
 debug 1
 Hello from image           4 of          12
 debug 1
 Hello from image           5 of          12
 debug 1
 debug 1
 Hello from image           7 of          12
 debug 1
 Hello from image           8 of          12
 debug 1
 Hello from image           9 of          12
 debug 1
 Hello from image          10 of          12
 debug 1
 Hello from image          11 of          12
 debug 1
 Hello from image          12 of          12
 Hello from image           6 of          12
 debug 2
 Goodbye from image           2 of          12
 debug 2
 debug 2
 Goodbye from image           4 of          12
 debug 2
 Goodbye from image           7 of          12
 debug 2
 Goodbye from image          10 of          12
 debug 2
 Goodbye from image           1 of          12
 debug 2
 Goodbye from image           5 of          12
 debug 2
 Goodbye from image           6 of          12
 debug 2
 Goodbye from image           9 of          12
 debug 2
 Goodbye from image          11 of          12
 Goodbye from image           3 of          12
 debug 2
 Goodbye from image           8 of          12
 debug 2
 Goodbye from image          12 of          12

Now I have to recompile the netCDF library with the ifort and icc compilers to test my Fortran code, but at the moment this seems to be the best result; the problem seems to be due to the oneAPI version or a bad installation of the newest version. After the installation and tests I'll let you know if everything is OK.

Thanks, best regards.

sverdrup
Beginner

Ok I've compiled all requirements, and now:

root@7e44fbb70643:/home/data# ifort -O3 -coarray=shared -coarray-num-images=6 mod_bathy_field.f90 bathy.f90 `nf-config --fflags --flibs` -o pippon
root@7e44fbb70643:/home/data# ./pippon
 Find neighbor!
 Find neighbor!
 Find neighbor!
 Find neighbor!
 Find neighbor!
 Find neighbor!
           4 next left -->           6
           5 next left -->           1
           6 next left -->           1
           2 next left -->           4
           1 next left -->           3
           3 next left -->           5
           5 next left -->           1 passed
           4 next left -->           6 passed
           2 next left -->           4 passed
           6 next left -->           1 passed
           1 next left -->           3 passed
           3 next left -->           5 passed
           2 next right -->           6
           4 next right -->           2
           5 next right -->           3
           6 next right -->           4
           1 next right -->           6
           3 next right -->           1
    0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000
    0.0000
           2 next top -->           1
    0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000
    0.0000
           4 next top -->           3
    0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000
    0.0000
           6 next top -->           5
    0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000
    0.0000
           1 next top -->           1
    0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000
    0.0000
           5 next top -->           5
-1710.2000 -1724.3999 -1732.0000 -1734.3999 -1731.8000 -1722.2000 -1717.3999 -1719.5999 -1730.0000 -1747.0000 -1769.0000 -1795.0000 -1824.3999 -1856.2000 -1887.3999 -1916.0000 -1939.5999 -1939.5999 -1953.3999 -1973.5999
-1994.0000
           3 next top -->           3

All seems to work properly.

Thank you all for your help!

Best regards

sverdrup
Beginner

A new problem:

I've compiled the whole program with no errors, using the following command:

 

ifort -O3 -traceback -coarray=shared -coarray-num-images=6 mod_bathy_field.f90 bathy.f90 `nf-config --fflags --flibs` -o pippon

 

but if I try to run it, after a few seconds the program freezes. To run it I used the ./pippon command.

The -traceback option points me to line 42 of the main program:

[mpiexec@14a3e02b6a26] Press Ctrl-C again to force abort
forrtl: error (69): process interrupted (SIGINT)
In coarray image 6
Image              PC                Routine            Line        Source             
pippon             000000000042970B  Unknown               Unknown  Unknown
libc.so.6          00007FF498DF9520  Unknown               Unknown  Unknown
libc.so.6          00007FF498EBFCAB  __sched_yield         Unknown  Unknown
libmpi.so.12.0.0   00007FF4939FA8F4  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007FF493F8EB7E  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007FF493B7E0A7  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007FF493B7991E  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007FF493940A7C  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007FF493940388  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007FF4939B32F8  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007FF493985BE9  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007FF493965780  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007FF493A67495  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007FF493944444  MPI_Barrier           Unknown  Unknown
libicaf.so         00007FF498FE9263  for_rtl_ICAF_BARR     Unknown  Unknown
pippon             0000000000413794  MAIN__                     42  bathy.f90
pippon             000000000040B9E2  Unknown               Unknown  Unknown
libc.so.6          00007FF498DE0D90  Unknown               Unknown  Unknown
libc.so.6          00007FF498DE0E40  __libc_start_main     Unknown  Unknown
pippon             000000000040B8E5  Unknown               Unknown  Unknown

forrtl: error (69): process interrupted (SIGINT)
In coarray image 2
Image              PC                Routine            Line        Source             
pippon             000000000042970B  Unknown               Unknown  Unknown
libc.so.6          00007FAF3B1A1520  Unknown               Unknown  Unknown
libc.so.6          00007FAF3B284F9A  epoll_wait            Unknown  Unknown
libtcp-fi.so       00007FAEA5008241  Unknown               Unknown  Unknown
libtcp-fi.so       00007FAEA500EB3E  Unknown               Unknown  Unknown
librxm-fi.so       00007FAEA440B6CC  Unknown               Unknown  Unknown
librxm-fi.so       00007FAEA4416BD9  Unknown               Unknown  Unknown
librxm-fi.so       00007FAEA4416CE9  Unknown               Unknown  Unknown
librxm-fi.so       00007FAEA4431A3D  Unknown               Unknown  Unknown
librxm-fi.so       00007FAEA44319C7  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007FAF3622B3FE  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007FAF35DFA7A1  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007FAF3638EB7E  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007FAF35F7E0A7  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007FAF35F7991E  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007FAF35D40A7C  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007FAF35D40388  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007FAF35DB32F8  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007FAF35D85BE9  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007FAF35D65780  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007FAF35E67495  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007FAF35D44444  MPI_Barrier           Unknown  Unknown
libicaf.so         00007FAF3B391263  for_rtl_ICAF_BARR     Unknown  Unknown
pippon             0000000000413794  MAIN__                     42  bathy.f90
pippon             000000000040B9E2  Unknown               Unknown  Unknown
libc.so.6          00007FAF3B188D90  Unknown               Unknown  Unknown
libc.so.6          00007FAF3B188E40  __libc_start_main     Unknown  Unknown
pippon             000000000040B8E5  Unknown               Unknown  Unknown

Line 42 is a sync all, but if I comment out that line the error moves to line 43, where there is a call to a subroutine.

Is there any other information I can supply to help with this problem?

Best regards

jimdempseyatthecove
Black Belt

What is the interface to that subroutine on line 43?

IOW, if the calling parameters are incorrectly specified, or if the calling API is incorrect, an image may abend and thus bring down the other images. The traceback above shows the stacks of those other images, not the one that first abended.

What is the stack trace dump when you comment out line 42?

 

Another peculiarity of the above stack trace is that both __sched_yield and epoll_wait appear to be calling down to somewhere in pippon?

Has your program used SIGNALQQ to specify an interrupt signal handler?

 

Jim Dempsey

 

sverdrup
Beginner

Hello Jim,

the dump after commenting out the sync all at line 42 is:

 

forrtl: error (69): process interrupted (SIGINT)
In coarray image 2
Image              PC                Routine            Line        Source             
pippon             000000000042A79B  Unknown               Unknown  Unknown
libc.so.6          00007F2897396520  Unknown               Unknown  Unknown
libc.so.6          00007F289745CCAB  __sched_yield         Unknown  Unknown
libmpi.so.12.0.0   00007F2891FFA8F4  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007F289258EB7E  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007F2892171F43  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007F2891F45432  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007F2891F447C2  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007F28920668F4  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007F28920DD654  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007F28920CB8AB  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007F28920B83BE  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007F2892046D82  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007F28920659E2  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007F289259CA0C  MPI_Win_create        Unknown  Unknown
libicaf.so         00007F2897592431  for_rtl_ICAF_COIN     Unknown  Unknown
pippon             000000000040C214  mod_bathy_field_m         207  mod_bathy_field.f90
pippon             0000000000413A20  MAIN__                     43  bathy.f90
pippon             000000000040B9E2  Unknown               Unknown  Unknown
libc.so.6          00007F289737DD90  Unknown               Unknown  Unknown
libc.so.6          00007F289737DE40  __libc_start_main     Unknown  Unknown
pippon             000000000040B8E5  Unknown               Unknown  Unknown

forrtl: error (69): process interrupted (SIGINT)
In coarray image 6
Image              PC                Routine            Line        Source             
pippon             000000000042A79B  Unknown               Unknown  Unknown
libc.so.6          00007F78D4C2A520  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007F78CFBCF46D  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007F78CF7FA996  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007F78CFD8EB7E  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007F78CF97E0A7  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007F78CF97991E  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007F78CFC94C4B  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007F78CFC93094  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007F78CF866B1A  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007F78CF8DD654  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007F78CF8CB8AB  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007F78CF8B83BE  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007F78CF846D82  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007F78CF8659E2  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007F78CFD9CA0C  MPI_Win_create        Unknown  Unknown
libicaf.so         00007F78D4E26431  for_rtl_ICAF_COIN     Unknown  Unknown
pippon             000000000040C214  mod_bathy_field_m         207  mod_bathy_field.f90
pippon             0000000000413A20  MAIN__                     43  bathy.f90
pippon             000000000040B9E2  Unknown               Unknown  Unknown
libc.so.6          00007F78D4C11D90  Unknown               Unknown  Unknown
libc.so.6          00007F78D4C11E40  __libc_start_main     Unknown  Unknown
pippon             000000000040B8E5  Unknown               Unknown  Unknown

 

Line 48 in the dump is the call to the subroutine in question.

After commenting out the sync all, the process seems to go on a little further; I can tell because it prints blocks that it didn't print before, but always and only for 2 images.

exit from halo
           1 inner test------------
 exit from halo
           2 inner test------------
 exit from halo
           3 inner test------------
 exit from halo
           4 inner test------------
 exit from halo
           5 inner test------------
 exit from halo
           6 inner test------------
 exit from init
002497     3.3750 -2850.2000 -2847.8000 -2852.2000 -2846.2000 -2853.8000
002498     3.4167 -2851.2000 -2851.8000 -2850.8000 -2849.2000 -2853.8000
002499     3.4583 -2848.8000 -2851.8000 -2846.8000 -2846.2000 -2850.2000
002500     3.5000 -2834.2000 -2835.8000     0.0000 -2829.8000 -2839.2000
 exit from init
 -------------------------------------------
 6 /home/data/bat/D4.mnt          23.2490  6731 37.2740 24.9990  7200 40.0010 479.929
           6 write_bat test -----------------------
           6 inner test------------
           6 bbox=          974        1287           1         235
bbox exp=  23.2500   36.2917   30.1875   39.9375
 -------------------------------------------
 2 /home/data/bat/B4.mnt         -16.2510  9480  3.5010 24.9990  7200 40.0010 479.949
           2 write_bat test -----------------------
           2 inner test------------
           2 bbox=           26         500           1         235
bbox exp= -16.2500    3.5000   30.1875   39.9375
After that, the run is frozen.
And to answer your last question: no, I don't use SIGNALQQ.
This is the subroutine called from the line after the sync all in the main program:
subroutine write_bat(medb,outf)

   use iso_fortran_env, only: stdout => output_unit, &
                              stderr => error_unit
   class (Sbathy), intent(in) :: medb
   character(len=*), intent(in) :: outf
   character*40, parameter :: odir="/home/data/bathy_img/"
   character*12, parameter :: ofile='pippo1.bat'
   real, allocatable, dimension(:,:), codimension[:] :: bt
   integer :: i, j, ic, nlat, nlon
   integer, dimension(4) :: ib

   print *, this_image(), 'write_bat test -----------------------'
   nlat=medb%gmed%nlat
   nlon=medb%gmed%nlon
   ib= medb%get_inner()
   print*, this_image(), 'bbox= ',ib
   flush(stdout)
   write(*,'(a10,4(f8.4,2x))') 'bbox exp= ',medb%gmed%lon(ib(1)),medb%gmed%lon(ib(2)),&
                           medb%gmed%lat(ib(3)),medb%gmed%lat(ib(4))
   !if (this_image()==2) then
   !print*,this_image(), medb%gemo%box%p0%plat, medb%gemo%box%p1%plat ,medb%gmed%box%p0%plat
   !end if
   allocate(bt(nlon,nlat)[*])

   !local bathymetry
   bt=medb%gmed%z(1:nlon,1:nlat)
   print*, 'ciao 1'
   sync all
   print*, 'ciao 2'
   do ic=2,num_images()
      if (this_image().eq.ic) then
         do j=ib(3),ib(4)
            do i=ib(1),ib(2)
               bt(i,j)[1]=bt(i,j)
               print*, 'ciao 3'
            end do
         end do
      end if
      sync all
   end do

   open(stdout,file=outf,status='unknown',access='stream')
   if(this_image()==1) write(stdout) bt(:,:)
   close(stdout)

   !open(stdout,file=trim(odir)//trim(ofile),status='unknown',access='stream')
   !if (this_image()==2) write(stdout) medb%gmed%z
   !close(stdout)

end subroutine write_bat
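One detail worth knowing here: ALLOCATE of a coarray, as in the allocate(bt(nlon,nlat)[*]) above, is an image control statement, which is why the traceback ends inside MPI_Win_create. Every image must execute it, or the images that do reach it wait forever. A minimal sketch of that kind of hang (a hypothetical program, not the poster's code):

```fortran
program coarray_alloc_hang
   implicit none
   real, allocatable :: bt(:,:)[:]
   ! WRONG: only image 1 executes the ALLOCATE; since coarray allocation
   ! is collective, image 1 blocks in MPI_Win_create waiting for the
   ! images that never reach this statement.
   if (this_image() == 1) allocate(bt(10,10)[*])
   ! RIGHT would be to allocate unconditionally on all images:
   !   allocate(bt(10,10)[*])
   sync all
end program coarray_alloc_hang
```

The same rule applies to SYNC ALL: if write_bat is entered by only some of the images, its allocate and sync all statements will deadlock the images that got there.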
Thank you,
best regards
Ron_Green
Moderator

Could you attach both mod_bathy_field.f90 and bathy.f90 so we can try your code?

sverdrup
Beginner

Hello Ron,

I hope this help, thank you very much for the patience and the help.

best regards.
