Intel® oneAPI HPC Toolkit
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Not able to run an Intel MPI + Java program

Stefano_V
Beginner

Hi,

I wanted to ask for advice on a problem we are having with a cluster that uses SLURM + IntelMPI.

 

I'm developing and running an application that displays 3D parallel data using Java + VTK (vtk.org) + OSMesa + LLVM. In short, the application is written in Java and makes native calls to the VTK libraries, where MPI_Init is called.

Usually the application is compiled and run against OpenMPI, but now we also want to add support for IntelMPI.

We compiled the graphics libraries (VTK) against IntelMPI and the native part of the application works fine. But when we launch it through Java, it always crashes.

The command we launch is something like

# This is usually in the SLURM submission script
mpirun -n 2 java MyApp

Inside MyApp there is a line that initialises VTK objects and among them the MPI "controller"

// This is Java code
mpiController = new vtkMPIController();
mpiController.InitializeJava();

And finally in the VTK code we have

// This is C++ code
void vtkMPIController::InitializeJava()
{
  ...
  int provided;
  MPI_Init_thread(argc, argv, MPI_THREAD_MULTIPLE, &provided);
  ...
}

The crash seems to happen in the JVM (Java Virtual Machine) when threads are used. See, for example, the following stack trace:

[3D Server] # A fatal error has been detected by the Java Runtime Environment:
[3D Server] #
[3D Server] # SIGSEGV (0xb) at pc=0x0000153c64d2d8cb, pid=33752, tid=0x0000153c173f3700
[3D Server] #
[3D Server] # JRE version: OpenJDK Runtime Environment (8.0_282-b08) (build 1.8.0_282-b08)
[3D Server] # Java VM: OpenJDK 64-Bit Server VM (25.282-b08 mixed mode linux-amd64 compressed oops)
[3D Server] # Problematic frame:
[3D Server] # V [libjvm.so+0x4198cb] ciBytecodeStream::get_method(bool&, ciSignature**)+0x35b
[3D Server] #
[3D Server] # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
[3D Server] #
[3D Server] # An error report file with more information is saved as:
[3D Server] # /g100/home/userexternal/pgeremia/.HELYX/tmp/HELYX-Server-1B97438C/hs_err_pid33752.log
[3D Server] [thread 23347830134528 also had an error]
[3D Server] [3D Server] ==== backtrace (tid: 33984) ====
[3D Server] 0 0x0000000000055969 ucs_debug_print_backtrace() ???:0
[3D Server] 1 0x0000000000012b20 .annobin_sigaction.c() sigaction.c:0
[3D Server] 2 0x00000000009114a3 Monitor::wait() ???:0
[3D Server] 3 0x00000000004827d2 CompileQueue::get() ???:0
[3D Server] 4 0x000000000048be7b CompileBroker::compiler_thread_loop() ???:0
[3D Server] 5 0x0000000000aceb77 JavaThread::thread_main_inner() ???:0
[3D Server] 6 0x0000000000acfeca JavaThread::run() ???:0
[3D Server] 7 0x000000000095d732 java_start() ???:0
[3D Server] 8 0x000000000000814a start_thread() pthread_create.c:0
[3D Server] 9 0x00000000000fcdc3 __GI___clone() :0
[3D Server] =================================

After I disabled the JIT (just-in-time) Java compiler, things seemed better, but the application still crashes when the JVM "goes multithreaded". That last part is only my gut feeling.
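
The post does not say which option was used to disable the JIT. A minimal sketch of such a run, assuming the standard HotSpot `-Xint` flag (interpreter-only mode) together with the `ulimit -c unlimited` step that the crash log above suggests; `JAVA_OPTS` is just a variable name chosen for this example:

```shell
# Allow core dumps, as the JVM message in the crash log suggests.
# This can fail if the site enforces a lower hard limit.
ulimit -c unlimited || echo "could not raise the core dump limit" >&2

# -Xint makes HotSpot run in interpreter-only mode, so bytecode is
# never handed to the JIT compiler.
# JAVA_OPTS is a name chosen for this sketch, not something from the post.
JAVA_OPTS="-Xint"

# Launch only where mpirun is actually on the PATH.
if command -v mpirun >/dev/null 2>&1; then
    mpirun -n 2 java $JAVA_OPTS MyApp
fi
```

If the crash moves to a different frame under `-Xint`, comparing the new `hs_err_pid*.log` with the one above may help narrow down which threads are involved.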

Please find attached the JVM crash logs and the cpuinfo output.

 

Can you please help me to understand what's going wrong here?

 

Thanks

Stefano

 

 

SantoshY_Intel
Moderator

Hi,

 

Thank you for posting in Intel Communities.

 

We can see that you are using MPI_Init_thread(), which is one of the MPI-3 routines.

However, the Intel® MPI Library supports Java* MPI applications only as an experimental feature, and it provides Java bindings only for a subset of MPI-2 routines.

 

You can find all supported MPI-2 routines for Java in the URL below:

https://www.intel.com/content/www/us/en/develop/documentation/mpi-developer-reference-linux/top/misc...

 

From the above link, we can see that MPI_Init_thread() is not a supported MPI-2 routine for Java.

 

Thanks & Regards,

Santosh

 

Stefano_V
Beginner

Hi Santosh,

 

thanks for your answer.

We are not using the Java bindings for IntelMPI, because all MPI operations are delegated to the 3D native library, which is written in C++ and calls the functions declared in mpi.h directly.

 

Schematically:

┌────────────────────────────┐     ┌─────────────────────────────┐     ┌───────┐
│                            │     │                             │     │       │ ──► OpenMPI WORKS!
│       MyApplication        │ ──► │         3D library          │ ──► │ mpi.h │
│                            │     │            (VTK)            │     │       │ ──► IntelMPI CRASH
└────────────────────────────┘     └─────────────────────────────┘     └───────┘
             Java                                C++
 c = new vtkMPIController();    MPI_Init_thread(MPI_THREAD_MULTIPLE)

The crash happens inside MyApplication (specifically inside the JVM), apparently because IntelMPI interferes with the JVM's thread handling. Is that possible? What kind of tests can we run to understand what is happening?
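
One test that fits the backtrace in the first post: its first frame, `ucs_debug_print_backtrace()`, belongs to UCX, which suggests that a UCX error-signal handler is running inside the JVM process. HotSpot relies on its own handlers for signals such as SIGSEGV, so a run with the UCX handlers switched off may show whether the two conflict. A sketch, assuming the MPI library goes through UCX on this cluster (the post does not confirm that) and that the launcher forwards the environment to the ranks:

```shell
# Assumption: the MPI library uses UCX underneath on this cluster.
# An empty list asks UCX not to install its handlers for error signals
# (by default it handles signals such as SIGSEGV and prints a backtrace).
export UCX_ERROR_SIGNALS=""

# Launch only where mpirun is actually on the PATH.
if command -v mpirun >/dev/null 2>&1; then
    mpirun -n 2 java MyApp
fi
```

If the crash goes away with this setting, signal handling is a likely culprit; if nothing changes, the variable can simply be dropped again.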

 

Thanks,

Stefano

 

 

SantoshY_Intel
Moderator

Hi,


Could you please provide us with sample reproducer code and the steps to reproduce your issue on our end?


Thanks & Regards,

Santosh


SantoshY_Intel
Moderator

Hi,


We haven't heard back from you. Could you please provide us with sample reproducer code and the steps to reproduce your issue on our end?


Thanks & Regards,

Santosh



Stefano_V
Beginner

Hi Santosh,

Unfortunately, the problem is not reproducible with a small toy case.

I'm trying to add complexity to see at which level the problem starts to happen.

On another cluster, the same problem also arises with OpenMPI when UCX is enabled.

Is there a way to switch off UCX in IntelMPI?

 

Thanks

Stefano

SantoshY_Intel
Moderator

Hi,

 

Could you please provide your cluster details (if any)?

Which OFI (OpenFabrics Interfaces) providers are available on your cluster?

 

The MLX provider runs over UCX, which is currently available for Mellanox InfiniBand* hardware. So if you are using MLX as the OFI provider, UCX is used in the back end.

 

To disable UCX, you need to select an OFI provider other than MLX.

To list all the available OFI providers, use the following commands:

source /opt/intel/oneapi/setvars.sh
fi_info -l

Based on the output, you can then select any one of the available OFI providers.

Example:

export I_MPI_OFI_PROVIDER=<name>

(or)

export FI_PROVIDER=<name>
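
Put together, a provider switch in the SLURM submission script might look like the sketch below. The provider name `tcp` is only an assumption for illustration; use a name that `fi_info -l` actually lists on your cluster. `I_MPI_DEBUG` is set so that the startup output shows which libfabric provider Intel MPI picked.

```shell
# Assumption: "tcp" is among the providers reported by `fi_info -l`.
export I_MPI_OFI_PROVIDER=tcp

# Debug output at startup includes the libfabric provider in use,
# which confirms whether the override took effect.
export I_MPI_DEBUG=5

# Launch only where mpirun is actually on the PATH.
if command -v mpirun >/dev/null 2>&1; then
    mpirun -n 2 java MyApp
fi
```

A provider such as `tcp` will likely be slower than MLX on InfiniBand hardware, so treat this as a diagnostic run rather than a production setting.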

 

 

Thanks & Regards,

Santosh

 

Stefano_V
Beginner

Hi,

 

thanks for your patience and understanding.

The relevant information about the cluster is available here.

To summarise:

*******************************************************************************
* Welcome to GALILEO100 Cluster / *
* Linux Infiniband Cluster - CentOS 8.3 *
* *
* 554 compute nodes with 2 x CPU Intel CascadeLake 8260, *
* each with 24 cores, 2.4 GHz, 384GB RAM DDR4, divided in: *
* *
* - 340 standard nodes ("thin nodes") 480 GB SSD *
* - 180 data processing nodes ("fat nodes") 2TB SSD, 3TB Intel Optane *
* - 34 (visualization "viz" ) GPU nodes with 2x NVIDIA GPU V100 *
* *
* *
* Internal Network: Mellanox Infiniband HDR100 *
* SLURM 21.08 *
* *
* For a guide on GALILEO100: *
* https://wiki.u-gov.it/confluence/display/SCAIUS/UG3.3%3A+GALILEO100+UserGuide
* For support: superc@cineca.it *
******************************************************************************* 

Currently, our software is compiled with GNU gcc. Is it recommended to compile it with icc instead?

 

Regards,

Stefano

SantoshY_Intel
Moderator

Hi,

 

>>"Currently, our software is compiled with GNU gcc. Is it recommended to compile it with icc instead?"

Yes, we recommend you compile the software with the ICC compiler.

 

Thanks & Regards,

Santosh

 

SantoshY_Intel
Moderator

Hi,


We haven't heard back from you. Could you please provide us with any updates on your issue?


Thanks & regards,

Santosh


SantoshY_Intel
Moderator

Hi,


We have not heard back from you. This thread will no longer be monitored by Intel. If you need further assistance, please post a new question.


Thanks & Regards,

Santosh


Stefano_V
Beginner

Hi Santosh,

thanks, but unfortunately I'm not able to reproduce the problem with a smaller example.

Stefano
