Intel® oneAPI HPC Toolkit

Simple MPI hello world fails on a server with Mellanox ConnectX-6 infiniband card

samfux84

Hi,

 

I just installed the Intel oneAPI Base and HPC Toolkits 2022.1.2 on our HPC cluster. A simple MPI hello world example fails on servers that have a Mellanox ConnectX-6 InfiniBand card. Are these Mellanox InfiniBand cards not supported?

 

[sfux@eu-login-46 intelmpi]$ cat hello.c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(NULL, NULL);

    // Get the number of processes
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Get the rank of the process
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Get the name of the processor
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    // Print off a hello world message
    printf("Hello world from processor %s, rank %d out of %d processors\n",
           processor_name, world_rank, world_size);

    // Finalize the MPI environment.
    MPI_Finalize();
}

[sfux@eu-login-46 intelmpi]$ mpiicc --version
icc (ICC) 2021.5.0 20211109
Copyright (C) 1985-2021 Intel Corporation. All rights reserved.

[sfux@eu-login-46 intelmpi]$ mpiicc -o hello hello.c
[sfux@eu-login-46 intelmpi]$ ls
hello hello.c
[sfux@eu-login-46 intelmpi]$

 

I am trying to run the example on 4 cores (2 cores on each of two servers). Since some of our servers have Mellanox ConnectX-6 InfiniBand cards, I run this example with the settings

 

FI_PROVIDER=mlx

I_MPI_FABRICS=shm:ofi

 

and it fails with a segmentation fault:

 

Sender: LSF System <lsfadmin@eu-g1-020-1>
Subject: Job 207016280: <FI_PROVIDER=mlx I_MPI_FABRICS=shm:ofi mpirun ./hello> in cluster <euler> Exited

Job <FI_PROVIDER=mlx I_MPI_FABRICS=shm:ofi mpirun ./hello> was submitted from host <eu-login-46> by user <sfux> in cluster <euler> at Thu Mar  3 11:08:14 2022
Job was executed on host(s) <2*eu-g1-020-1>, in queue <normal.4h>, as user <sfux> in cluster <euler> at Thu Mar  3 11:08:49 2022
                            <2*eu-g1-024-1>
</cluster/home/sfux> was used as the home directory.
</cluster/home/sfux/test/intelmpi> was used as the working directory.
Started at Thu Mar  3 11:08:49 2022
Terminated at Thu Mar  3 11:09:04 2022
Results reported at Thu Mar  3 11:09:04 2022

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
FI_PROVIDER=mlx I_MPI_FABRICS=shm:ofi mpirun ./hello
------------------------------------------------------------

Exited with exit code 255.

Resource usage summary:

    CPU time :                                   44.00 sec.
    Max Memory :                                 14061 MB
    Average Memory :                             365.00 MB
    Total Requested Memory :                     8000.00 MB
    Delta Memory :                               -6061.00 MB
    Max Swap :                                   -
    Max Processes :                              9
    Max Threads :                                13
    Run time :                                   34 sec.
    Turnaround time :                            50 sec.

The output (if any) follows:

[eu-g1-024-1:14947:0:14947] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1c0)
[eu-g1-024-1:14948:0:14948] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1c0)
[eu-g1-020-1:103094:0:103094] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1c0)
[eu-g1-020-1:103095:0:103095] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1c0)
==== backtrace (tid: 103095) ====
 0 0x000000000004d455 ucs_debug_print_backtrace()  ???:0
 1 0x0000000000291441 MPIDIG_dequeue_unexp()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4r_unexp_hashtable.c:417
 2 0x00000000003d3cc0 MPIDIG_do_irecv()  /build/impi/_buildspace/release/../../src/mpid/ch4/generic/mpidig_recv.h:135
 3 0x000000000040356d MPIDIG_mpi_irecv()  /build/impi/_buildspace/release/../../src/mpid/ch4/generic/mpidig_recv.h:336
 4 0x000000000040356d MPIDI_POSIX_mpi_irecv()  /build/impi/_buildspace/release/../../src/mpid/ch4/shm/src/../src/../posix/posix_recv.h:60
 5 0x000000000040356d MPIDI_SHM_mpi_irecv()  /build/impi/_buildspace/release/../../src/mpid/ch4/shm/src/../src/shm_p2p.h:328
 6 0x000000000040356d MPIDI_irecv_unsafe()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_recv.h:235
 7 0x000000000040356d MPIDI_irecv_safe()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_recv.h:558
 8 0x000000000040356d MPID_Irecv()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_recv.h:791
 9 0x000000000040356d MPIC_Irecv()  /build/impi/_buildspace/release/../../src/mpi/coll/helper_fns.c:625
10 0x00000000003c1bee recv_nb()  /build/impi/_buildspace/release/../../src/mpid/common/hcoll/hcoll_rte.c:233
11 0x000000000001c908 comm_allreduce_hcolrte_generic()  common_allreduce.c:0
12 0x000000000001ce0b comm_allreduce_hcolrte()  ???:0
13 0x0000000000013a0b hmca_bcol_ucx_p2p_init_query.part.4()  bcol_ucx_p2p_component.c:0
14 0x00000000000ca47c hmca_bcol_base_init()  ???:0
15 0x0000000000049a08 hmca_coll_ml_init_query()  ???:0
16 0x00000000000bf627 hcoll_init_with_opts()  ???:0
17 0x00000000003c0493 hcoll_initialize()  /build/impi/_buildspace/release/../../src/mpid/common/hcoll/hcoll_init.c:106
18 0x00000000003c0493 hcoll_comm_create()  /build/impi/_buildspace/release/../../src/mpid/common/hcoll/hcoll_init.c:144
19 0x00000000006a0b01 MPIDI_OFI_mpi_comm_create_hook()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/ofi/ofi_comm.c:216
20 0x00000000001ca555 MPID_Comm_create_hook()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_comm.c:198
21 0x0000000000317bdb MPIR_Comm_commit_internal()  /build/impi/_buildspace/release/../../src/mpi/comm/commutil.c:353
22 0x0000000000317bdb MPIR_Comm_commit()  /build/impi/_buildspace/release/../../src/mpi/comm/commutil.c:530
23 0x0000000000210e4b init_builtin_comms()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_init.c:1113
24 0x0000000000210e4b MPID_Init()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_init.c:1359
"lsf.o207016280" 180L, 14469C 

 

Is this a known problem? Is it wrong to use the MLX FI provider for systems with Mellanox ConnectX-6 infiniband cards?
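
One detail in the backtrace above: frames 10 to 18 run through hcoll_rte.c and hcoll_init.c, so the crash happens while the Mellanox HCOLL collectives are being initialised inside MPI_Init. A possible experiment, sketched here under the assumption that this Intel MPI build honours I_MPI_COLL_EXTERNAL (0 is assumed to switch external collectives off), is to rerun the same job without HCOLL. The run wrapper is a made-up helper that only prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Hypothetical dry-run wrapper: prints the command by default,
# executes it only when DRY_RUN=0 is set in the environment.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Same settings as in the failing job, plus I_MPI_COLL_EXTERNAL=0
# (assumption: this disables the external HCOLL collectives).
run env I_MPI_FABRICS=shm:ofi FI_PROVIDER=mlx I_MPI_COLL_EXTERNAL=0 mpirun ./hello
```

If the segmentation fault disappears with this setting, that would point at the HCOLL/UCX installation on the nodes rather than at the mlx provider itself.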

 

Any help is appreciated.

 

Best regards

 

Sam

19 Replies
samfux84

I tried again with

 

I_MPI_FABRICS=shm:ofi

I_MPI_OFI_PROVIDER=mlx mpirun

 

and with 256 cores on two servers, each having 128 cores. Same result:

 

Sender: LSF System <lsfadmin@eu-g1-021-3>
Subject: Job 207019623: <I_MPI_FABRICS=shm:ofi I_MPI_OFI_PROVIDER=mlx mpirun ./hello> in cluster <euler> Exited

Job <I_MPI_FABRICS=shm:ofi I_MPI_OFI_PROVIDER=mlx mpirun ./hello> was submitted from host <eu-login-46> by user <sfux> in cluster <euler> at Thu Mar  3 11:52:57 2022
Job was executed on host(s) <128*eu-g1-021-3>, in queue <normal.4h>, as user <sfux> in cluster <euler> at Thu Mar  3 11:53:27 2022
                            <128*eu-g1-022-2>
</cluster/home/sfux> was used as the home directory.
</cluster/home/sfux/test/intelmpi> was used as the working directory.
Started at Thu Mar  3 11:53:27 2022
Terminated at Thu Mar  3 11:53:33 2022
Results reported at Thu Mar  3 11:53:33 2022

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
I_MPI_FABRICS=shm:ofi I_MPI_OFI_PROVIDER=mlx mpirun ./hello
------------------------------------------------------------

Exited with exit code 143.

Resource usage summary:

    CPU time :                                   320.00 sec.
    Max Memory :                                 9421 MB
    Average Memory :                             -
    Total Requested Memory :                     25600.00 MB
    Delta Memory :                               16179.00 MB
    Max Swap :                                   -
    Max Processes :                              -
    Max Threads :                                -
    Run time :                                   5 sec.
    Turnaround time :                            36 sec.

The output (if any) follows:

[eu-g1-022-2:5563 :0:5563] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b7915ba3006)
[eu-g1-022-2:5572 :0:5572] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b1e07498006)
[eu-g1-022-2:5574 :0:5574] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2aed54929006)
[eu-g1-022-2:5593 :0:5593] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b4357593006)
[eu-g1-022-2:5617 :0:5617] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b910765d006)
[eu-g1-022-2:5637 :0:5637] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ad132c02006)
[eu-g1-022-2:5642 :0:5642] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b49a43b5006)
[eu-g1-022-2:5517 :0:5517] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b649d636006)
[eu-g1-022-2:5521 :0:5521] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b34e4ccb006)
[eu-g1-022-2:5523 :0:5523] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ad3ba3cc006)
[eu-g1-022-2:5527 :0:5527] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b5fe34c8006)
[eu-g1-022-2:5529 :0:5529] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b0c04006006)
[eu-g1-022-2:5530 :0:5530] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b8a6df1c006)
[eu-g1-022-2:5541 :0:5541] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b7c7f4f7006)
[eu-g1-022-2:5544 :0:5544] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b03d1ef9006)
[eu-g1-022-2:5546 :0:5546] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b298e9fc006)
[eu-g1-022-2:5548 :0:5548] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2af048dba006)
[eu-g1-022-2:5579 :0:5579] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2aef6b6e7006)
[eu-g1-022-2:5609 :0:5609] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b19263b6006)
[eu-g1-022-2:5618 :0:5618] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b4df1fac006)
[eu-g1-022-2:5627 :0:5627] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b874051a006)
[eu-g1-022-2:5629 :0:5629] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b58a45b2006)
[eu-g1-022-2:5635 :0:5635] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b3d4aa13006)
[eu-g1-022-2:5644 :0:5644] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2aff0bbe0006)
[eu-g1-022-2:5515 :0:5515] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b65ec6c6006)
[eu-g1-022-2:5516 :0:5516] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ab781617006)
[eu-g1-022-2:5519 :0:5519] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ae61e7ba006)
[eu-g1-022-2:5522 :0:5522] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b60be873006)
[eu-g1-022-2:5564 :0:5564] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b698e6a6006)
[eu-g1-022-2:5591 :0:5591] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ac830af3006)
[eu-g1-022-2:5592 :0:5592] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b9a9b455006)
[eu-g1-022-2:5624 :0:5624] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2afe26569006)
[eu-g1-022-2:5631 :0:5631] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ba81af7b006)
[eu-g1-022-2:5632 :0:5632] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ad57f8e3006)
[eu-g1-022-2:5643 :0:5643] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2adb9289f006)
[eu-g1-022-2:5645 :0:5645] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b37a72ea006)
[eu-g1-022-2:5518 :0:5518] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b1de38fd006)
[eu-g1-022-2:5520 :0:5520] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b9ddcef2006)
[eu-g1-022-2:5525 :0:5525] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b87906bb006)
[eu-g1-022-2:5526 :0:5526] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b6e36dfe006)
[eu-g1-022-2:5528 :0:5528] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b2214a76006)
[eu-g1-022-2:5533 :0:5533] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ae0c435c006)
[eu-g1-022-2:5536 :0:5536] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ab91c9f3006)
[eu-g1-022-2:5542 :0:5542] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b2159a96006)
[eu-g1-022-2:5543 :0:5543] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ac7ea0ff006)
[eu-g1-022-2:5577 :0:5577] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ab9266e4006)
[eu-g1-022-2:5582 :0:5582] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b03b567b006)
[eu-g1-022-2:5598 :0:5598] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2aba35f44006)
[eu-g1-022-2:5634 :0:5634] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ac31fef9006)
[eu-g1-022-2:5639 :0:5639] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b04ff9db006)
[eu-g1-022-2:5524 :0:5524] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b99d7fc1006)
[eu-g1-022-2:5531 :0:5531] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ad5816f3006)
[eu-g1-022-2:5532 :0:5532] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b0835e53006)
[eu-g1-022-2:5534 :0:5534] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b600ae4c006)
[eu-g1-022-2:5537 :0:5537] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b0abf142006)
[eu-g1-022-2:5538 :0:5538] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b32897dd006)
[eu-g1-022-2:5540 :0:5540] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b5efa245006)
[eu-g1-022-2:5547 :0:5547] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b16be69d006)
[eu-g1-022-2:5555 :0:5555] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b8876c6e006)
[eu-g1-022-2:5558 :0:5558] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ba8207e8006)
[eu-g1-022-2:5562 :0:5562] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ad6f4026006)
[eu-g1-022-2:5565 :0:5565] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ac847f46006)
[eu-g1-022-2:5566 :0:5566] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ae4f6707006)
[eu-g1-022-2:5568 :0:5568] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b5e0a0d1006)
[eu-g1-022-2:5571 :0:5571] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ad5e4114006)
[eu-g1-022-2:5580 :0:5580] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b70a10e3006)
[eu-g1-022-2:5581 :0:5581] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b6e1514f006)
[eu-g1-022-2:5586 :0:5586] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2af715206006)
[eu-g1-022-2:5589 :0:5589] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b9a7fd9c006)
[eu-g1-022-2:5604 :0:5604] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ae12f9ca006)
[eu-g1-022-2:5605 :0:5605] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ab93a8f1006)
[eu-g1-022-2:5623 :0:5623] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ac00b109006)
[eu-g1-022-2:5545 :0:5545] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b0c6bf6c006)
[eu-g1-022-2:5549 :0:5549] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b3542c64006)
[eu-g1-022-2:5550 :0:5550] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ada2d3eb006)
[eu-g1-022-2:5552 :0:5552] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ac14a04a006)
[eu-g1-022-2:5554 :0:5554] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b8390672006)
[eu-g1-022-2:5557 :0:5557] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b539bd27006)
[eu-g1-022-2:5559 :0:5559] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2aedbafaf006)
[eu-g1-022-2:5560 :0:5560] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b7bece8f006)
[eu-g1-022-2:5561 :0:5561] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b79c0255006)
[eu-g1-022-2:5567 :0:5567] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b4f021ee006)
[eu-g1-022-2:5569 :0:5569] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b5edc12d006)
[eu-g1-022-2:5578 :0:5578] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ac40bf16006)
[eu-g1-022-2:5583 :0:5583] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ac861ff5006)
[eu-g1-022-2:5584 :0:5584] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b022e546006)
[eu-g1-022-2:5585 :0:5585] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2afaf0062006)
[eu-g1-022-2:5587 :0:5587] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2afabc5d8006)
[eu-g1-022-2:5588 :0:5588] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ab618b4f006)
[eu-g1-022-2:5590 :0:5590] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b425b7a7006)
[eu-g1-022-2:5594 :0:5594] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ba36c348006)
[eu-g1-022-2:5595 :0:5595] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b4d2ec22006)
[eu-g1-022-2:5596 :0:5596] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b5b06c33006)
[eu-g1-022-2:5599 :0:5599] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2af2161bb006)
[eu-g1-022-2:5600 :0:5600] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b39d7add006)
[eu-g1-022-2:5601 :0:5601] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b93d2e28006)
[eu-g1-022-2:5606 :0:5606] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b9bf7758006)
[eu-g1-022-2:5610 :0:5610] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ad7606f6006)
[eu-g1-022-2:5612 :0:5612] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b7899323006)
[eu-g1-022-2:5613 :0:5613] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b16ecc2a006)
[eu-g1-022-2:5630 :0:5630] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ac01944e006)
[eu-g1-022-2:5633 :0:5633] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b379f9db006)
[eu-g1-022-2:5636 :0:5636] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2acf93315006)
[eu-g1-022-2:5638 :0:5638] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b635401f006)
[eu-g1-022-2:5640 :0:5640] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ba5e5d1a006)
[eu-g1-022-2:5551 :0:5551] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b718a434006)
[eu-g1-022-2:5553 :0:5553] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b0689a35006)
[eu-g1-022-2:5570 :0:5570] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b006850f006)
[eu-g1-022-2:5575 :0:5575] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2af10ee14006)
[eu-g1-022-2:5576 :0:5576] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b8df73b6006)
[eu-g1-022-2:5602 :0:5602] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2af2b6b40006)
[eu-g1-022-2:5603 :0:5603] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b6e33161006)
[eu-g1-022-2:5607 :0:5607] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b7d9263d006)
[eu-g1-022-2:5608 :0:5608] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ae9201a0006)
[eu-g1-022-2:5611 :0:5611] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ba86ea0c006)
[eu-g1-022-2:5614 :0:5614] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ae3286e6006)
[eu-g1-022-2:5615 :0:5615] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2acbbed0e006)
[eu-g1-022-2:5616 :0:5616] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ba407a8d006)
[eu-g1-022-2:5619 :0:5619] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ba61d0ca006)
[eu-g1-022-2:5620 :0:5620] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ae513304006)
[eu-g1-022-2:5621 :0:5621] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b7fa3946006)
[eu-g1-022-2:5622 :0:5622] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2af2e9c7f006)
[eu-g1-022-2:5625 :0:5625] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b2a9dd81006)
[eu-g1-022-2:5626 :0:5626] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b5239efb006)
[eu-g1-022-2:5628 :0:5628] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b7efb75b006)
[eu-g1-022-2:5641 :0:5641] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2af3e83f0006)
[eu-g1-022-2:5597 :0:5597] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b75e6475006)
[eu-g1-022-2:5646 :0:5646] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b26e9c8f006)
[1646304812.267239] [eu-g1-021-3:35151:0]         address.c:988  UCX  ERROR address version mismatch: expected 0, actual 15
Abort(1090703) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(143)........:
MPID_Init(1310)..............:
MPIDI_OFI_mpi_init_hook(1974): OFI get address vector map failed
[1646304812.267234] [eu-g1-021-3:35215:0]         address.c:988  UCX  ERROR address version mismatch: expected 0, actual 15
Abort(1090703) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(143)........:
MPID_Init(1310)..............:
MPIDI_OFI_mpi_init_hook(1974): OFI get address vector map failed
[1646304812.267353] [eu-g1-021-3:35216:0]         address.c:988  UCX  ERROR address version mismatch: expected 0, actual 15
Abort(1090703) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(143)........:
MPID_Init(1310)..............:
MPIDI_OFI_mpi_init_hook(1974): OFI get address vector map failed
[1646304812.269680] [eu-g1-021-3:35248:0]         address.c:988  UCX  ERROR address version mismatch: expected 0, actual 15
Abort(1090703) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(143)........:
MPID_Init(1310)..............:
MPIDI_OFI_mpi_init_hook(1974): OFI get address vector map failed
[sfux@eu-login-46 intelmpi]$
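
The "UCX ERROR address version mismatch: expected 0, actual 15" lines may mean that the ranks on the two hosts do not load the same UCX library; that is an assumption, not something the log proves. A small sketch for comparing versions: the hypothetical helper check_same_version reads lines of the form "host version-string" and exits non-zero when more than one distinct version string shows up. Collecting the input with ssh and ucx_info -v is shown only as a comment, since it depends on the cluster setup.

```shell
#!/bin/sh
# Hypothetical helper: reads "host version-string" lines on stdin,
# prints each distinct version with the hosts that reported it,
# and exits with status 1 if more than one distinct version was seen.
check_same_version() {
    awk '
        {
            host = $1
            sub(/^[^ \t]+[ \t]+/, "")      # drop the host field, keep the rest
            seen[$0] = seen[$0] " " host
        }
        END {
            n = 0
            for (v in seen) { n++; printf "%s:%s\n", v, seen[v] }
            exit (n > 1)
        }'
}

# Possible way to feed it (not executed here; host names taken from the job above):
# for h in eu-g1-021-3 eu-g1-022-2; do
#     echo "$h $(ssh "$h" ucx_info -v | head -n 1)"
# done | check_same_version
```

The same comparison could be run from inside the batch job, so that the UCX library actually picked up by the MPI ranks is the one being inspected.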
samfux84

What is also surprising is that the batch system reports a memory usage of 12971 MB for a hello world program on 4 cores. This does not seem right.
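
For orientation, the debug output of this run contains the line "shm segment size (1068 MB per rank) * (2 local ranks) = 2136 MB total" for each host. The arithmetic below sets that against the LSF figure. Whether LSF counts these shared-memory segments in "Max Memory" at all is an assumption, so this is only a rough sketch of how much of the number they could explain.

```shell
#!/bin/sh
per_rank_mb=1068      # from the "MPI startup(): shm segment size" line
local_ranks=2         # ranks per host in this job
hosts=2               # hosts in the job
lsf_max_mb=12971      # "Max Memory" in the LSF report

per_host_mb=$((per_rank_mb * local_ranks))
job_shm_mb=$((per_host_mb * hosts))

echo "shm segments per host:   ${per_host_mb} MB"
echo "shm segments, whole job: ${job_shm_mb} MB"
echo "not covered by shm segments: $((lsf_max_mb - job_shm_mb)) MB"
```

Under these assumptions the shared-memory segments account for roughly a third of the reported maximum; the remainder would have to come from elsewhere.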

 

Another run with increased loglevel:

 

[sfux@eu-login-29 intelmpi]$ cat lsf.o207036445
Sender: LSF System <lsfadmin@eu-g1-019-4>
Subject: Job 207036445: <I_MPI_FABRICS=shm:ofi I_MPI_OFI_PROVIDER=mlx I_MPI_DEBUG=30 FI_LOG_LEVEL=debug mpirun ./hello> in cluster <euler> Exited

Job <I_MPI_FABRICS=shm:ofi I_MPI_OFI_PROVIDER=mlx I_MPI_DEBUG=30 FI_LOG_LEVEL=debug mpirun ./hello> was submitted from host <eu-login-29> by user <sfux> in cluster <euler> at Thu Mar 3 15:05:44 2022
Job was executed on host(s) <2*eu-g1-019-4>, in queue <normal.4h>, as user <sfux> in cluster <euler> at Thu Mar 3 15:07:25 2022
<2*eu-g1-021-4>
</cluster/home/sfux> was used as the home directory.
</cluster/home/sfux/test/intelmpi> was used as the working directory.
Started at Thu Mar 3 15:07:25 2022
Terminated at Thu Mar 3 15:07:41 2022
Results reported at Thu Mar 3 15:07:41 2022

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
I_MPI_FABRICS=shm:ofi I_MPI_OFI_PROVIDER=mlx I_MPI_DEBUG=30 FI_LOG_LEVEL=debug mpirun ./hello
------------------------------------------------------------

Exited with exit code 255.

Resource usage summary:

CPU time : 44.00 sec.
Max Memory : 12971 MB
Average Memory : 391.00 MB
Total Requested Memory : 8000.00 MB
Delta Memory : -4971.00 MB
Max Swap : -
Max Processes : 9
Max Threads : 13
Run time : 37 sec.
Turnaround time : 117 sec.

The output (if any) follows:

IPL WARN> Not all cpus are available, switch to I_MPI_PIN_ORDER=compact. (Total: 128 Available: 2)
IPL WARN> Not all cpus are available, switch to I_MPI_PIN_ORDER=compact. (Total: 128 Available: 2)
[0] MPI startup(): Intel(R) MPI Library, Version 2021.5 Build 20211102 (id: 9279b7d62)
[0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[2] MPI startup(): shm segment size (1068 MB per rank) * (2 local ranks) = 2136 MB total
[0] MPI startup(): shm segment size (1068 MB per rank) * (2 local ranks) = 2136 MB total
libfabric:94329:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:94329:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:94328:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:94328:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
[0] MPI startup(): libfabric version: 1.13.2rc1-impi
libfabric:33997:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:33997:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:33997:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:33997:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:33997:core:mr:ofi_default_cache_size():78<info> default cache size=2112943968
libfabric:33998:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:33998:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:33998:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:33998:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:33998:core:mr:ofi_default_cache_size():78<info> default cache size=2112943968
libfabric:94328:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:94328:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:94328:core:mr:ofi_default_cache_size():78<info> default cache size=2112943968
libfabric:94329:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:94329:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:94329:core:mr:ofi_default_cache_size():78<info> default cache size=2112943968
libfabric:33997:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:33997:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:33997:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:33997:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:33997:core:mr:ofi_default_cache_size():78<info> default cache size=2112943968
libfabric:33998:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:33998:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:33998:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:33998:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:33998:core:mr:ofi_default_cache_size():78<info> default cache size=2112943968
libfabric:94328:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:94328:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:94328:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:94328:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:94328:core:mr:ofi_default_cache_size():78<info> default cache size=2112943968
libfabric:94329:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:94329:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:94329:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:94329:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:94329:core:mr:ofi_default_cache_size():78<info> default cache size=2112943968
libfabric:33997:verbs:fabric:verbs_devs_print():883<info> list of verbs devices found for FI_EP_MSG:
libfabric:33997:verbs:fabric:verbs_devs_print():887<info> #1 mlx5_0 - IPoIB addresses:
libfabric:33997:verbs:fabric:verbs_devs_print():897<info> 10.205.73.96
libfabric:33997:verbs:fabric:verbs_devs_print():897<info> fe80::ba59:9f03:d1:f13e
libfabric:33998:verbs:fabric:verbs_devs_print():883<info> list of verbs devices found for FI_EP_MSG:
libfabric:33998:verbs:fabric:verbs_devs_print():887<info> #1 mlx5_0 - IPoIB addresses:
libfabric:33998:verbs:fabric:verbs_devs_print():897<info> 10.205.73.96
libfabric:33998:verbs:fabric:verbs_devs_print():897<info> fe80::ba59:9f03:d1:f13e
libfabric:94328:verbs:fabric:verbs_devs_print():883<info> list of verbs devices found for FI_EP_MSG:
libfabric:94328:verbs:fabric:verbs_devs_print():887<info> #1 mlx5_0 - IPoIB addresses:
libfabric:94328:verbs:fabric:verbs_devs_print():897<info> 10.205.73.104
libfabric:94328:verbs:fabric:verbs_devs_print():897<info> fe80::ba59:9f03:d2:4e2
libfabric:94329:verbs:fabric:verbs_devs_print():883<info> list of verbs devices found for FI_EP_MSG:
libfabric:94329:verbs:fabric:verbs_devs_print():887<info> #1 mlx5_0 - IPoIB addresses:
libfabric:94329:verbs:fabric:verbs_devs_print():897<info> 10.205.73.104
libfabric:94329:verbs:fabric:verbs_devs_print():897<info> fe80::ba59:9f03:d2:4e2
libfabric:94328:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:94329:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:94329:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:94328:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:94329:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:94329:core:core:ofi_register_provider():474<info> registering provider: verbs (113.20)
libfabric:94329:core:core:ofi_register_provider():502<info> "verbs" filtered by provider include/exclude list, skipping
libfabric:94328:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:94328:core:core:ofi_register_provider():474<info> registering provider: verbs (113.20)
libfabric:94328:core:core:ofi_register_provider():502<info> "verbs" filtered by provider include/exclude list, skipping
libfabric:94328:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:94328:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:94328:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ZE not supported
libfabric:94328:core:mr:ofi_default_cache_size():78<info> default cache size=2112943968
libfabric:94329:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:94329:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:94329:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ZE not supported
libfabric:94329:core:mr:ofi_default_cache_size():78<info> default cache size=2112943968
libfabric:33997:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:33998:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:94329:verbs:fabric:verbs_devs_print():883<info> list of verbs devices found for FI_EP_MSG:
libfabric:94329:verbs:fabric:verbs_devs_print():887<info> #1 mlx5_0 - IPoIB addresses:
libfabric:94329:verbs:fabric:verbs_devs_print():897<info> 10.205.73.104
libfabric:94329:verbs:fabric:verbs_devs_print():897<info> fe80::ba59:9f03:d2:4e2
libfabric:94328:verbs:fabric:verbs_devs_print():883<info> list of verbs devices found for FI_EP_MSG:
libfabric:94328:verbs:fabric:verbs_devs_print():887<info> #1 mlx5_0 - IPoIB addresses:
libfabric:94328:verbs:fabric:verbs_devs_print():897<info> 10.205.73.104
libfabric:94328:verbs:fabric:verbs_devs_print():897<info> fe80::ba59:9f03:d2:4e2
libfabric:94328:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:94329:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:94328:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:94329:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:94328:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:94328:core:core:ofi_register_provider():474<info> registering provider: verbs (113.20)
libfabric:94328:core:core:ofi_register_provider():502<info> "verbs" filtered by provider include/exclude list, skipping
libfabric:94329:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:94329:core:core:ofi_register_provider():474<info> registering provider: verbs (113.20)
libfabric:94329:core:core:ofi_register_provider():502<info> "verbs" filtered by provider include/exclude list, skipping
libfabric:94328:core:core:ofi_register_provider():474<info> registering provider: tcp (113.20)
libfabric:94328:core:core:ofi_register_provider():502<info> "tcp" filtered by provider include/exclude list, skipping
libfabric:94329:core:core:ofi_register_provider():474<info> registering provider: tcp (113.20)
libfabric:94329:core:core:ofi_register_provider():502<info> "tcp" filtered by provider include/exclude list, skipping
libfabric:94328:core:core:ofi_register_provider():474<info> registering provider: sockets (113.20)
libfabric:94328:core:core:ofi_register_provider():502<info> "sockets" filtered by provider include/exclude list, skipping
libfabric:94329:core:core:ofi_register_provider():474<info> registering provider: sockets (113.20)
libfabric:94329:core:core:ofi_register_provider():502<info> "sockets" filtered by provider include/exclude list, skipping
libfabric:94328:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:94328:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:94328:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ZE not supported
libfabric:94328:core:core:ofi_register_provider():474<info> registering provider: shm (113.20)
libfabric:94328:core:core:ofi_register_provider():502<info> "shm" filtered by provider include/exclude list, skipping
libfabric:94329:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:94329:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:94329:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ZE not supported
libfabric:94329:core:core:ofi_register_provider():474<info> registering provider: shm (113.20)
libfabric:94329:core:core:ofi_register_provider():502<info> "shm" filtered by provider include/exclude list, skipping
libfabric:33997:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:94328:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:94328:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:94328:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:94328:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:94328:core:core:ofi_register_provider():474<info> registering provider: ofi_rxm (113.20)
libfabric:94329:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:94329:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:94329:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:94329:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:94329:core:core:ofi_register_provider():474<info> registering provider: ofi_rxm (113.20)
libfabric:33998:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:94328:psm3:core:fi_prov_ini():680<info> build options: VERSION=1101.0=11.1.0.0, HAVE_PSM3_src=1, PSM3_CUDA=0
libfabric:94328:core:core:ofi_register_provider():474<info> registering provider: psm3 (1101.0)
libfabric:94328:core:core:ofi_register_provider():502<info> "psm3" filtered by provider include/exclude list, skipping
libfabric:94329:psm3:core:fi_prov_ini():680<info> build options: VERSION=1101.0=11.1.0.0, HAVE_PSM3_src=1, PSM3_CUDA=0
libfabric:94329:core:core:ofi_register_provider():474<info> registering provider: psm3 (1101.0)
libfabric:94329:core:core:ofi_register_provider():502<info> "psm3" filtered by provider include/exclude list, skipping
libfabric:94328:core:core:ofi_register_provider():474<info> registering provider: mlx (1.4)
libfabric:94329:core:core:ofi_register_provider():474<info> registering provider: mlx (1.4)
libfabric:94328:core:core:ofi_register_provider():474<info> registering provider: ofi_hook_noop (113.20)
libfabric:94328:core:core:fi_getinfo_():1138<info> Found provider with the highest priority mlx, must_use_util_prov = 0
libfabric:94328:mlx:core:mlx_getinfo():211<info> primary detected device: mlx5_0
libfabric:94329:core:core:ofi_register_provider():474<info> registering provider: ofi_hook_noop (113.20)
libfabric:94329:core:core:fi_getinfo_():1138<info> Found provider with the highest priority mlx, must_use_util_prov = 0
libfabric:94329:mlx:core:mlx_getinfo():211<info> primary detected device: mlx5_0
libfabric:94328:mlx:core:mlx_getinfo():254<info> used inject size = 1024
libfabric:94329:mlx:core:mlx_getinfo():254<info> used inject size = 1024
libfabric:33997:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:33997:core:core:ofi_register_provider():474<info> registering provider: verbs (113.20)
libfabric:33997:core:core:ofi_register_provider():502<info> "verbs" filtered by provider include/exclude list, skipping
libfabric:33997:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:33997:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:33997:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ZE not supported
libfabric:33997:core:mr:ofi_default_cache_size():78<info> default cache size=2112943968
libfabric:94328:mlx:core:mlx_getinfo():301<info> Loaded MLX version 1.11.1
libfabric:94328:mlx:core:mlx_getinfo():348<warn> MLX: spawn support 0
libfabric:94328:core:core:fi_getinfo_():1138<info> Found provider with the highest priority mlx, must_use_util_prov = 0
libfabric:94328:mlx:core:mlx_getinfo():211<info> primary detected device: mlx5_0
libfabric:94329:mlx:core:mlx_getinfo():301<info> Loaded MLX version 1.11.1
libfabric:94329:mlx:core:mlx_getinfo():348<warn> MLX: spawn support 0
libfabric:94329:core:core:fi_getinfo_():1138<info> Found provider with the highest priority mlx, must_use_util_prov = 0
libfabric:94329:mlx:core:mlx_getinfo():211<info> primary detected device: mlx5_0
libfabric:94329:mlx:core:mlx_getinfo():254<info> used inject size = 1024
libfabric:94328:mlx:core:mlx_getinfo():254<info> used inject size = 1024
libfabric:94328:mlx:core:mlx_getinfo():301<info> Loaded MLX version 1.11.1
libfabric:94328:mlx:core:mlx_getinfo():348<warn> MLX: spawn support 0
libfabric:94328:mlx:core:mlx_fabric_open():172<info>
libfabric:94328:core:core:fi_fabric_():1423<info> Opened fabric: mlx
libfabric:94328:mlx:core:ofi_check_rx_attr():786<info> Tx only caps ignored in Rx caps
libfabric:94328:mlx:core:ofi_check_tx_attr():884<info> Rx only caps ignored in Tx caps
libfabric:94329:mlx:core:mlx_getinfo():301<info> Loaded MLX version 1.11.1
libfabric:94329:mlx:core:mlx_getinfo():348<warn> MLX: spawn support 0
libfabric:94329:mlx:core:mlx_fabric_open():172<info>
libfabric:94329:core:core:fi_fabric_():1423<info> Opened fabric: mlx
libfabric:94329:mlx:core:ofi_check_rx_attr():786<info> Tx only caps ignored in Rx caps
libfabric:94329:mlx:core:ofi_check_tx_attr():884<info> Rx only caps ignored in Tx caps
libfabric:33998:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:33998:core:core:ofi_register_provider():474<info> registering provider: verbs (113.20)
libfabric:33998:core:core:ofi_register_provider():502<info> "verbs" filtered by provider include/exclude list, skipping
libfabric:33998:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:33998:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:33998:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ZE not supported
libfabric:33998:core:mr:ofi_default_cache_size():78<info> default cache size=2112943968
libfabric:94328:mlx:core:ofi_check_rx_attr():786<info> Tx only caps ignored in Rx caps
libfabric:94328:mlx:core:ofi_check_tx_attr():884<info> Rx only caps ignored in Tx caps
libfabric:94329:mlx:core:ofi_check_rx_attr():786<info> Tx only caps ignored in Rx caps
libfabric:94329:mlx:core:ofi_check_tx_attr():884<info> Rx only caps ignored in Tx caps
libfabric:94328:mlx:core:mlx_cm_getname_mlx_format():73<info> Loaded UCP address: [307]...
libfabric:94329:mlx:core:mlx_cm_getname_mlx_format():73<info> Loaded UCP address: [307]...
libfabric:33997:verbs:fabric:verbs_devs_print():883<info> list of verbs devices found for FI_EP_MSG:
libfabric:33997:verbs:fabric:verbs_devs_print():887<info> #1 mlx5_0 - IPoIB addresses:
libfabric:33997:verbs:fabric:verbs_devs_print():897<info> 10.205.73.96
libfabric:33997:verbs:fabric:verbs_devs_print():897<info> fe80::ba59:9f03:d1:f13e
libfabric:33998:verbs:fabric:verbs_devs_print():883<info> list of verbs devices found for FI_EP_MSG:
libfabric:33998:verbs:fabric:verbs_devs_print():887<info> #1 mlx5_0 - IPoIB addresses:
libfabric:33998:verbs:fabric:verbs_devs_print():897<info> 10.205.73.96
libfabric:33998:verbs:fabric:verbs_devs_print():897<info> fe80::ba59:9f03:d1:f13e
libfabric:33998:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:33997:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:33997:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:33998:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:33998:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:33998:core:core:ofi_register_provider():474<info> registering provider: verbs (113.20)
libfabric:33998:core:core:ofi_register_provider():502<info> "verbs" filtered by provider include/exclude list, skipping
libfabric:33998:core:core:ofi_register_provider():474<info> registering provider: tcp (113.20)
libfabric:33998:core:core:ofi_register_provider():502<info> "tcp" filtered by provider include/exclude list, skipping
libfabric:33998:core:core:ofi_register_provider():474<info> registering provider: sockets (113.20)
libfabric:33998:core:core:ofi_register_provider():502<info> "sockets" filtered by provider include/exclude list, skipping
libfabric:33997:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:33997:core:core:ofi_register_provider():474<info> registering provider: verbs (113.20)
libfabric:33997:core:core:ofi_register_provider():502<info> "verbs" filtered by provider include/exclude list, skipping
libfabric:33997:core:core:ofi_register_provider():474<info> registering provider: tcp (113.20)
libfabric:33997:core:core:ofi_register_provider():502<info> "tcp" filtered by provider include/exclude list, skipping
libfabric:33998:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:33998:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:33998:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ZE not supported
libfabric:33998:core:core:ofi_register_provider():474<info> registering provider: shm (113.20)
libfabric:33998:core:core:ofi_register_provider():502<info> "shm" filtered by provider include/exclude list, skipping
libfabric:33997:core:core:ofi_register_provider():474<info> registering provider: sockets (113.20)
libfabric:33997:core:core:ofi_register_provider():502<info> "sockets" filtered by provider include/exclude list, skipping
libfabric:33997:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:33997:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:33997:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ZE not supported
libfabric:33997:core:core:ofi_register_provider():474<info> registering provider: shm (113.20)
libfabric:33997:core:core:ofi_register_provider():502<info> "shm" filtered by provider include/exclude list, skipping
libfabric:33997:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:33997:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:33997:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:33997:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:33997:core:core:ofi_register_provider():474<info> registering provider: ofi_rxm (113.20)
libfabric:33998:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:33998:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:33998:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:33998:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:33998:core:core:ofi_register_provider():474<info> registering provider: ofi_rxm (113.20)
libfabric:33997:psm3:core:fi_prov_ini():680<info> build options: VERSION=1101.0=11.1.0.0, HAVE_PSM3_src=1, PSM3_CUDA=0
libfabric:33997:core:core:ofi_register_provider():474<info> registering provider: psm3 (1101.0)
libfabric:33997:core:core:ofi_register_provider():502<info> "psm3" filtered by provider include/exclude list, skipping
libfabric:33998:psm3:core:fi_prov_ini():680<info> build options: VERSION=1101.0=11.1.0.0, HAVE_PSM3_src=1, PSM3_CUDA=0
libfabric:33998:core:core:ofi_register_provider():474<info> registering provider: psm3 (1101.0)
libfabric:33998:core:core:ofi_register_provider():502<info> "psm3" filtered by provider include/exclude list, skipping
libfabric:33997:core:core:ofi_register_provider():474<info> registering provider: mlx (1.4)
libfabric:33998:core:core:ofi_register_provider():474<info> registering provider: mlx (1.4)
libfabric:33997:core:core:ofi_register_provider():474<info> registering provider: ofi_hook_noop (113.20)
libfabric:33997:core:core:fi_getinfo_():1138<info> Found provider with the highest priority mlx, must_use_util_prov = 0
libfabric:33997:mlx:core:mlx_getinfo():211<info> primary detected device: mlx5_0
libfabric:33998:core:core:ofi_register_provider():474<info> registering provider: ofi_hook_noop (113.20)
libfabric:33998:core:core:fi_getinfo_():1138<info> Found provider with the highest priority mlx, must_use_util_prov = 0
libfabric:33998:mlx:core:mlx_getinfo():211<info> primary detected device: mlx5_0
libfabric:33997:mlx:core:mlx_getinfo():254<info> used inject size = 1024
libfabric:33998:mlx:core:mlx_getinfo():254<info> used inject size = 1024
libfabric:33997:mlx:core:mlx_getinfo():301<info> Loaded MLX version 1.11.1
libfabric:33997:mlx:core:mlx_getinfo():348<warn> MLX: spawn support 0
libfabric:33997:core:core:fi_getinfo_():1138<info> Found provider with the highest priority mlx, must_use_util_prov = 0
libfabric:33997:mlx:core:mlx_getinfo():211<info> primary detected device: mlx5_0
libfabric:33998:mlx:core:mlx_getinfo():301<info> Loaded MLX version 1.11.1
libfabric:33998:mlx:core:mlx_getinfo():348<warn> MLX: spawn support 0
libfabric:33998:core:core:fi_getinfo_():1138<info> Found provider with the highest priority mlx, must_use_util_prov = 0
libfabric:33998:mlx:core:mlx_getinfo():211<info> primary detected device: mlx5_0
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): detected mlx provider, set device name to "mlx_hcoll"
libfabric:33997:mlx:core:mlx_getinfo():254<info> used inject size = 1024
libfabric:33997:mlx:core:mlx_getinfo():301<info> Loaded MLX version 1.11.1
libfabric:33997:mlx:core:mlx_getinfo():348<warn> MLX: spawn support 0
libfabric:33997:mlx:core:mlx_fabric_open():172<info>
libfabric:33997:core:core:fi_fabric_():1423<info> Opened fabric: mlx
libfabric:33997:mlx:core:ofi_check_rx_attr():786<info> Tx only caps ignored in Rx caps
libfabric:33997:mlx:core:ofi_check_tx_attr():884<info> Rx only caps ignored in Tx caps
libfabric:33998:mlx:core:mlx_getinfo():254<info> used inject size = 1024
libfabric:33998:mlx:core:mlx_getinfo():301<info> Loaded MLX version 1.11.1
libfabric:33998:mlx:core:mlx_getinfo():348<warn> MLX: spawn support 0
libfabric:33998:mlx:core:mlx_fabric_open():172<info>
libfabric:33998:core:core:fi_fabric_():1423<info> Opened fabric: mlx
libfabric:33998:mlx:core:ofi_check_rx_attr():786<info> Tx only caps ignored in Rx caps
libfabric:33998:mlx:core:ofi_check_tx_attr():884<info> Rx only caps ignored in Tx caps
[0] MPI startup(): max_ch4_vcis: 1, max_reg_eps 64, enable_sep 0, enable_shared_ctxs 0, do_av_insert 1
libfabric:33997:mlx:core:ofi_check_rx_attr():786<info> Tx only caps ignored in Rx caps
libfabric:33997:mlx:core:ofi_check_tx_attr():884<info> Rx only caps ignored in Tx caps
libfabric:33998:mlx:core:ofi_check_rx_attr():786<info> Tx only caps ignored in Rx caps
libfabric:33998:mlx:core:ofi_check_tx_attr():884<info> Rx only caps ignored in Tx caps
[0] MPI startup(): addrnamelen: 1024
libfabric:33997:mlx:core:mlx_cm_getname_mlx_format():73<info> Loaded UCP address: [307]...
libfabric:33998:mlx:core:mlx_cm_getname_mlx_format():73<info> Loaded UCP address: [307]...
libfabric:94328:mlx:core:mlx_av_insert():179<warn> Try to insert address #0, offset=0 (size=2) fi_addr=0x15f8fb0
libfabric:94329:mlx:core:mlx_av_insert():179<warn> Try to insert address #0, offset=0 (size=2) fi_addr=0x2020e00
libfabric:33997:mlx:core:mlx_av_insert():179<warn> Try to insert address #0, offset=0 (size=2) fi_addr=0x1c3bf80
libfabric:33998:mlx:core:mlx_av_insert():179<warn> Try to insert address #0, offset=0 (size=2) fi_addr=0x1d5be00
libfabric:33998:mlx:core:mlx_av_insert():189<warn> address inserted
libfabric:33998:mlx:core:mlx_av_insert():179<warn> Try to insert address #1, offset=1024 (size=2) fi_addr=0x1d5be00
libfabric:33997:mlx:core:mlx_av_insert():189<warn> address inserted
libfabric:33997:mlx:core:mlx_av_insert():179<warn> Try to insert address #1, offset=1024 (size=2) fi_addr=0x1c3bf80
libfabric:94328:mlx:core:mlx_av_insert():189<warn> address inserted
libfabric:94329:mlx:core:mlx_av_insert():189<warn> address inserted
libfabric:94329:mlx:core:mlx_av_insert():179<warn> Try to insert address #1, offset=1024 (size=2) fi_addr=0x2020e00
libfabric:94328:mlx:core:mlx_av_insert():179<warn> Try to insert address #1, offset=1024 (size=2) fi_addr=0x15f8fb0
libfabric:94328:mlx:core:mlx_av_insert():189<warn> address inserted
libfabric:94329:mlx:core:mlx_av_insert():189<warn> address inserted
libfabric:33997:mlx:core:mlx_av_insert():189<warn> address inserted
libfabric:33998:mlx:core:mlx_av_insert():189<warn> address inserted
libfabric:94328:mlx:core:mlx_av_insert():179<warn> Try to insert address #0, offset=0 (size=2) fi_addr=0x173e190
libfabric:94329:mlx:core:mlx_av_insert():179<warn> Try to insert address #0, offset=0 (size=2) fi_addr=0x2020e00
libfabric:33997:mlx:core:mlx_av_insert():179<warn> Try to insert address #0, offset=0 (size=2) fi_addr=0x19d2da0
libfabric:33997:mlx:core:mlx_av_insert():189<warn> address inserted
libfabric:33997:mlx:core:mlx_av_insert():179<warn> Try to insert address #1, offset=1024 (size=2) fi_addr=0x19d2da0
libfabric:33997:mlx:core:mlx_av_insert():189<warn> address inserted
libfabric:33998:mlx:core:mlx_av_insert():179<warn> Try to insert address #0, offset=0 (size=2) fi_addr=0x1d5be00
libfabric:33998:mlx:core:mlx_av_insert():189<warn> address inserted
libfabric:33998:mlx:core:mlx_av_insert():179<warn> Try to insert address #1, offset=1024 (size=2) fi_addr=0x1d5be00
libfabric:33998:mlx:core:mlx_av_insert():189<warn> address inserted
libfabric:94328:mlx:core:mlx_av_insert():189<warn> address inserted
libfabric:94328:mlx:core:mlx_av_insert():179<warn> Try to insert address #1, offset=1024 (size=2) fi_addr=0x173e190
libfabric:94328:mlx:core:mlx_av_insert():189<warn> address inserted
libfabric:94329:mlx:core:mlx_av_insert():189<warn> address inserted
libfabric:94329:mlx:core:mlx_av_insert():179<warn> Try to insert address #1, offset=1024 (size=2) fi_addr=0x2020e00
libfabric:94329:mlx:core:mlx_av_insert():189<warn> address inserted
[0] MPI startup(): Load tuning file: "/cluster/apps/nss/intel/oneapi/2022.1.2/mpi/2021.5.1/etc/tuning_generic_shm-ofi_mlx_hcoll.dat"
[eu-g1-019-4:33997:0:33997] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1c0)
[eu-g1-019-4:33998:0:33998] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1c0)
[eu-g1-021-4:94328:0:94328] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1c0)
[eu-g1-021-4:94329:0:94329] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1c0)
==== backtrace (tid: 33997) ====
0 0x0000000000291441 MPIDIG_dequeue_unexp() /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4r_unexp_hashtable.c:417
1 0x00000000003d3cc0 MPIDIG_do_irecv() /build/impi/_buildspace/release/../../src/mpid/ch4/generic/mpidig_recv.h:135
2 0x000000000040356d MPIDIG_mpi_irecv() /build/impi/_buildspace/release/../../src/mpid/ch4/generic/mpidig_recv.h:336
3 0x000000000040356d MPIDI_POSIX_mpi_irecv() /build/impi/_buildspace/release/../../src/mpid/ch4/shm/src/../src/../posix/posix_recv.h:60
4 0x000000000040356d MPIDI_SHM_mpi_irecv() /build/impi/_buildspace/release/../../src/mpid/ch4/shm/src/../src/shm_p2p.h:328
5 0x000000000040356d MPIDI_irecv_unsafe() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_recv.h:235
6 0x000000000040356d MPIDI_irecv_safe() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_recv.h:558
7 0x000000000040356d MPID_Irecv() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_recv.h:791
8 0x000000000040356d MPIC_Irecv() /build/impi/_buildspace/release/../../src/mpi/coll/helper_fns.c:625
9 0x00000000003c1bee recv_nb() /build/impi/_buildspace/release/../../src/mpid/common/hcoll/hcoll_rte.c:233
10 0x000000000001b544 comm_allreduce_hcolrte_generic() common_allreduce.c:0
11 0x000000000001bd1b comm_allreduce_hcolrte() ???:0
12 0x000000000001a9c1 hmca_bcol_ucx_p2p_init_query() ???:0
13 0x00000000000bd02c hmca_bcol_base_init() ???:0
14 0x0000000000066c88 hmca_coll_ml_init_query() ???:0
15 0x00000000000b3992 hcoll_init_with_opts() ???:0
16 0x00000000003c0493 hcoll_initialize() /build/impi/_buildspace/release/../../src/mpid/common/hcoll/hcoll_init.c:106
17 0x00000000003c0493 hcoll_comm_create() /build/impi/_buildspace/release/../../src/mpid/common/hcoll/hcoll_init.c:144
18 0x00000000006a0b01 MPIDI_OFI_mpi_comm_create_hook() /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/ofi/ofi_comm.c:216
19 0x00000000001ca555 MPID_Comm_create_hook() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_comm.c:198
20 0x0000000000317bdb MPIR_Comm_commit_internal() /build/impi/_buildspace/release/../../src/mpi/comm/commutil.c:353
21 0x0000000000317bdb MPIR_Comm_commit() /build/impi/_buildspace/release/../../src/mpi/comm/commutil.c:530
22 0x0000000000210e4b init_builtin_comms() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_init.c:1113
23 0x0000000000210e4b MPID_Init() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_init.c:1359
24 0x000000000052a1a3 MPIR_Init_thread() /build/impi/_buildspace/release/../../src/mpi/init/initthread.c:142
25 0x000000000052971b PMPI_Init() /build/impi/_buildspace/release/../../src/mpi/init/init.c:140
26 0x0000000000400eb3 main() ???:0
27 0x0000000000022555 __libc_start_main() ???:0
28 0x0000000000400db9 _start() ???:0
=================================
==== backtrace (tid: 33998) ====
0 0x0000000000291441 MPIDIG_dequeue_unexp() /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4r_unexp_hashtable.c:417
1 0x00000000003d3cc0 MPIDIG_do_irecv() /build/impi/_buildspace/release/../../src/mpid/ch4/generic/mpidig_recv.h:135
2 0x000000000040356d MPIDIG_mpi_irecv() /build/impi/_buildspace/release/../../src/mpid/ch4/generic/mpidig_recv.h:336
3 0x000000000040356d MPIDI_POSIX_mpi_irecv() /build/impi/_buildspace/release/../../src/mpid/ch4/shm/src/../src/../posix/posix_recv.h:60
4 0x000000000040356d MPIDI_SHM_mpi_irecv() /build/impi/_buildspace/release/../../src/mpid/ch4/shm/src/../src/shm_p2p.h:328
5 0x000000000040356d MPIDI_irecv_unsafe() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_recv.h:235
6 0x000000000040356d MPIDI_irecv_safe() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_recv.h:558
7 0x000000000040356d MPID_Irecv() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_recv.h:791
8 0x000000000040356d MPIC_Irecv() /build/impi/_buildspace/release/../../src/mpi/coll/helper_fns.c:625
9 0x00000000003c1bee recv_nb() /build/impi/_buildspace/release/../../src/mpid/common/hcoll/hcoll_rte.c:233
10 0x000000000001b544 comm_allreduce_hcolrte_generic() common_allreduce.c:0
11 0x000000000001bd1b comm_allreduce_hcolrte() ???:0
12 0x000000000001a9c1 hmca_bcol_ucx_p2p_init_query() ???:0
13 0x00000000000bd02c hmca_bcol_base_init() ???:0
14 0x0000000000066c88 hmca_coll_ml_init_query() ???:0
15 0x00000000000b3992 hcoll_init_with_opts() ???:0
16 0x00000000003c0493 hcoll_initialize() /build/impi/_buildspace/release/../../src/mpid/common/hcoll/hcoll_init.c:106
17 0x00000000003c0493 hcoll_comm_create() /build/impi/_buildspace/release/../../src/mpid/common/hcoll/hcoll_init.c:144
18 0x00000000006a0b01 MPIDI_OFI_mpi_comm_create_hook() /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/ofi/ofi_comm.c:216
19 0x00000000001ca555 MPID_Comm_create_hook() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_comm.c:198
20 0x0000000000317bdb MPIR_Comm_commit_internal() /build/impi/_buildspace/release/../../src/mpi/comm/commutil.c:353
21 0x0000000000317bdb MPIR_Comm_commit() /build/impi/_buildspace/release/../../src/mpi/comm/commutil.c:530
22 0x0000000000210e4b init_builtin_comms() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_init.c:1113
23 0x0000000000210e4b MPID_Init() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_init.c:1359
24 0x000000000052a1a3 MPIR_Init_thread() /build/impi/_buildspace/release/../../src/mpi/init/initthread.c:142
25 0x000000000052971b PMPI_Init() /build/impi/_buildspace/release/../../src/mpi/init/init.c:140
26 0x0000000000400eb3 main() ???:0
27 0x0000000000022555 __libc_start_main() ???:0
28 0x0000000000400db9 _start() ???:0
=================================
==== backtrace (tid: 94328) ====
0 0x0000000000291441 MPIDIG_dequeue_unexp() /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4r_unexp_hashtable.c:417
1 0x00000000003d3cc0 MPIDIG_do_irecv() /build/impi/_buildspace/release/../../src/mpid/ch4/generic/mpidig_recv.h:135
2 0x000000000040356d MPIDIG_mpi_irecv() /build/impi/_buildspace/release/../../src/mpid/ch4/generic/mpidig_recv.h:336
3 0x000000000040356d MPIDI_POSIX_mpi_irecv() /build/impi/_buildspace/release/../../src/mpid/ch4/shm/src/../src/../posix/posix_recv.h:60
4 0x000000000040356d MPIDI_SHM_mpi_irecv() /build/impi/_buildspace/release/../../src/mpid/ch4/shm/src/../src/shm_p2p.h:328
5 0x000000000040356d MPIDI_irecv_unsafe() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_recv.h:235
6 0x000000000040356d MPIDI_irecv_safe() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_recv.h:558
7 0x000000000040356d MPID_Irecv() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_recv.h:791
8 0x000000000040356d MPIC_Irecv() /build/impi/_buildspace/release/../../src/mpi/coll/helper_fns.c:625
9 0x00000000003c1bee recv_nb() /build/impi/_buildspace/release/../../src/mpid/common/hcoll/hcoll_rte.c:233
10 0x000000000001b544 comm_allreduce_hcolrte_generic() common_allreduce.c:0
11 0x000000000001bd1b comm_allreduce_hcolrte() ???:0
12 0x000000000001a9c1 hmca_bcol_ucx_p2p_init_query() ???:0
13 0x00000000000bd02c hmca_bcol_base_init() ???:0
14 0x0000000000066c88 hmca_coll_ml_init_query() ???:0
15 0x00000000000b3992 hcoll_init_with_opts() ???:0
16 0x00000000003c0493 hcoll_initialize() /build/impi/_buildspace/release/../../src/mpid/common/hcoll/hcoll_init.c:106
17 0x00000000003c0493 hcoll_comm_create() /build/impi/_buildspace/release/../../src/mpid/common/hcoll/hcoll_init.c:144
18 0x00000000006a0b01 MPIDI_OFI_mpi_comm_create_hook() /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/ofi/ofi_comm.c:216
19 0x00000000001ca555 MPID_Comm_create_hook() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_comm.c:198
20 0x0000000000317bdb MPIR_Comm_commit_internal() /build/impi/_buildspace/release/../../src/mpi/comm/commutil.c:353
21 0x0000000000317bdb MPIR_Comm_commit() /build/impi/_buildspace/release/../../src/mpi/comm/commutil.c:530
22 0x0000000000210e4b init_builtin_comms() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_init.c:1113
23 0x0000000000210e4b MPID_Init() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_init.c:1359
24 0x000000000052a1a3 MPIR_Init_thread() /build/impi/_buildspace/release/../../src/mpi/init/initthread.c:142
25 0x000000000052971b PMPI_Init() /build/impi/_buildspace/release/../../src/mpi/init/init.c:140
26 0x0000000000400eb3 main() ???:0
27 0x0000000000022555 __libc_start_main() ???:0
28 0x0000000000400db9 _start() ???:0
=================================
==== backtrace (tid: 94329) ====
0 0x0000000000291441 MPIDIG_dequeue_unexp() /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4r_unexp_hashtable.c:417
1 0x00000000003d3cc0 MPIDIG_do_irecv() /build/impi/_buildspace/release/../../src/mpid/ch4/generic/mpidig_recv.h:135
2 0x000000000040356d MPIDIG_mpi_irecv() /build/impi/_buildspace/release/../../src/mpid/ch4/generic/mpidig_recv.h:336
3 0x000000000040356d MPIDI_POSIX_mpi_irecv() /build/impi/_buildspace/release/../../src/mpid/ch4/shm/src/../src/../posix/posix_recv.h:60
4 0x000000000040356d MPIDI_SHM_mpi_irecv() /build/impi/_buildspace/release/../../src/mpid/ch4/shm/src/../src/shm_p2p.h:328
5 0x000000000040356d MPIDI_irecv_unsafe() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_recv.h:235
6 0x000000000040356d MPIDI_irecv_safe() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_recv.h:558
7 0x000000000040356d MPID_Irecv() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_recv.h:791
8 0x000000000040356d MPIC_Irecv() /build/impi/_buildspace/release/../../src/mpi/coll/helper_fns.c:625
9 0x00000000003c1bee recv_nb() /build/impi/_buildspace/release/../../src/mpid/common/hcoll/hcoll_rte.c:233
10 0x000000000001b544 comm_allreduce_hcolrte_generic() common_allreduce.c:0
11 0x000000000001bd1b comm_allreduce_hcolrte() ???:0
12 0x000000000001a9c1 hmca_bcol_ucx_p2p_init_query() ???:0
13 0x00000000000bd02c hmca_bcol_base_init() ???:0
14 0x0000000000066c88 hmca_coll_ml_init_query() ???:0
15 0x00000000000b3992 hcoll_init_with_opts() ???:0
16 0x00000000003c0493 hcoll_initialize() /build/impi/_buildspace/release/../../src/mpid/common/hcoll/hcoll_init.c:106
17 0x00000000003c0493 hcoll_comm_create() /build/impi/_buildspace/release/../../src/mpid/common/hcoll/hcoll_init.c:144
18 0x00000000006a0b01 MPIDI_OFI_mpi_comm_create_hook() /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/ofi/ofi_comm.c:216
19 0x00000000001ca555 MPID_Comm_create_hook() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_comm.c:198
20 0x0000000000317bdb MPIR_Comm_commit_internal() /build/impi/_buildspace/release/../../src/mpi/comm/commutil.c:353
21 0x0000000000317bdb MPIR_Comm_commit() /build/impi/_buildspace/release/../../src/mpi/comm/commutil.c:530
22 0x0000000000210e4b init_builtin_comms() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_init.c:1113
23 0x0000000000210e4b MPID_Init() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_init.c:1359
24 0x000000000052a1a3 MPIR_Init_thread() /build/impi/_buildspace/release/../../src/mpi/init/initthread.c:142
25 0x000000000052971b PMPI_Init() /build/impi/_buildspace/release/../../src/mpi/init/init.c:140
26 0x0000000000400eb3 main() ???:0
27 0x0000000000022555 __libc_start_main() ???:0
28 0x0000000000400db9 _start() ???:0
=================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 33997 RUNNING AT eu-g1-019-4
= KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 1 PID 33998 RUNNING AT eu-g1-019-4
= KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================
[sfux@eu-login-29 intelmpi]$
HemanthCH_Intel
Moderator

Hi,

 

Thanks for posting in Intel Communities.

 

>>"Are those infiniband cards from Mellanox not supported?"

The Mellanox ConnectX-6 InfiniBand card is supported by Intel MPI.

 

>>"I try to run the example on 4 cores (2 cores on each server)."

Could you please elaborate on this statement? By 2 servers, do you mean 2 nodes?

 

If you want to run the MPI program on a cluster, could you please try the command below:

 

$ mpirun -n <number-of-processes> -ppn <processes-per-node> -hosts host1,host2 ./myprog

 

In the above command, you can use "-f hostfile" instead of listing the nodes with "-hosts". For more information, refer to the link below: https://www.intel.com/content/www/us/en/develop/documentation/mpi-developer-guide-linux/top/running-...
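
As a rough sketch of the hostfile variant: the script below writes a two-line hostfile and builds the matching launch line. The names host1 and host2 are the placeholders from the command above, the file name hostfile.example and the RUN_MPI switch are made up for illustration, and 4 processes with 2 per node are assumed from the original question.

```shell
#!/bin/sh
# Sketch only: host1/host2 are placeholders, hostfile.example is a hypothetical name.
hostfile="./hostfile.example"

# One host name per line.
printf '%s\n' host1 host2 > "$hostfile"

# -f <hostfile> replaces the inline -hosts list; 4 ranks in total, 2 per node.
launch_cmd="mpirun -n 4 -ppn 2 -f $hostfile ./myprog"
echo "$launch_cmd"

# Launch only when explicitly requested (RUN_MPI=1) and mpirun is on PATH.
if [ "${RUN_MPI:-0}" = "1" ] && command -v mpirun >/dev/null 2>&1; then
    $launch_cmd
fi
```

By default the script only prints the command, so it can be checked on a login node before anything is launched.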

 

Could you please try setting I_MPI_FABRICS=ofi and try to run the application? Please let us know if it works.
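
When checking the result, it can help to confirm which libfabric provider was actually selected. The following is only a sketch: it assumes the job output was captured to a file from a run with I_MPI_DEBUG set, and the helper name selected_provider and the sample file name are made up for illustration. The sample lines mirror the "MPI startup()" lines that appear in the job output posted in this thread.

```shell
#!/bin/sh
# Sketch: extract the provider name from captured Intel MPI debug output.
# selected_provider is a hypothetical helper, not part of Intel MPI.
selected_provider() {
    sed -n 's/^\[0\] MPI startup(): libfabric provider: //p' "$1" | head -n 1
}

# Sample lines in the format printed when I_MPI_DEBUG is set.
cat > ./mpi_debug_sample.log <<'EOF'
[0] MPI startup(): libfabric version: 1.13.2rc1-impi
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): detected mlx provider, set device name to "mlx_hcoll"
EOF

selected_provider ./mpi_debug_sample.log
```

On a real run, the sample file would be replaced by the captured job output.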

 

Could you please provide the OS & CPU details?

 

Thanks & Regards,

Hemanth

 

samfux84
New Contributor I

Hi,

 

Thank you for your reply.

 


>>"I try to run the example on 4 cores (2 cores on each server)."

>>"Could you please elaborate on this statement? By 2 servers, do you mean 2 nodes?"


Our cluster has around 3000 compute nodes (several different hardware generations). I requested 4 CPU cores from the IBM LSF batch system with -R "span[ptile=2]" so that the 4 cores are allocated on two compute nodes (two cores on each node), since I want to test whether internode communication works.

 

The batch system automatically sets the -n <number of processors> option of mpirun and also passes the host list and the number of cores per host to Intel MPI. This has worked like a charm for more than 10 years across a wide variety of Intel MPI versions. So far it has not been necessary to explicitly set

 

I_MPI_HYDRA_BOOTSTRAP=lsf

 

I just reran the example with this variable set explicitly and got the same result as shown in my previous post.
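
For reference, the request described above could be written as an LSF job script roughly like the sketch below. The file name hello_job.lsf is made up for illustration, and the fabric and provider settings are the ones from the failing run. The actual jobs in this thread were submitted as a single command line, not with this script.

```shell
#!/bin/sh
# Sketch: generate an LSF job script for 4 cores, 2 per node (hypothetical file name).
job_script="./hello_job.lsf"

cat > "$job_script" <<'EOF'
#!/bin/sh
#BSUB -n 4
#BSUB -R "span[ptile=2]"
# Select the LSF bootstrap explicitly instead of relying on autodetection.
export I_MPI_HYDRA_BOOTSTRAP=lsf
# Fabric and provider settings from the failing run.
export I_MPI_FABRICS=shm:ofi
export FI_PROVIDER=mlx
mpirun ./hello
EOF

echo "wrote $job_script; submit with: bsub < $job_script"
```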

 


>>"Could you please try setting I_MPI_FABRICS=ofi and try to run the application? Please let us know if it works."


Please find below the logs (again with increased loglevel):

 

Sender: LSF System <lsfadmin@eu-g1-017-2>
Subject: Job 207169474: <I_MPI_FABRICS=ofi I_MPI_DEBUG=30 FI_LOG_LEVEL=debug mpirun ./hello> in cluster <euler> Done

Job <I_MPI_FABRICS=ofi I_MPI_DEBUG=30 FI_LOG_LEVEL=debug mpirun ./hello> was submitted from host <eu-login-12> by user <sfux> in cluster <euler> at Fri Mar  4 14:11:15 2022
Job was executed on host(s) <2*eu-g1-017-2>, in queue <normal.4h>, as user <sfux> in cluster <euler> at Fri Mar  4 14:11:54 2022
                            <2*eu-g1-015-3>
</cluster/home/sfux> was used as the home directory.
</cluster/home/sfux/test/intelmpi> was used as the working directory.
Started at Fri Mar  4 14:11:54 2022
Terminated at Fri Mar  4 14:12:37 2022
Results reported at Fri Mar  4 14:12:37 2022

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
I_MPI_FABRICS=ofi I_MPI_DEBUG=30 FI_LOG_LEVEL=debug mpirun ./hello
------------------------------------------------------------

Successfully completed.

Resource usage summary:

    CPU time :                                   2.64 sec.
    Max Memory :                                 798 MB
    Average Memory :                             -
    Total Requested Memory :                     8000.00 MB
    Delta Memory :                               7202.00 MB
    Max Swap :                                   -
    Max Processes :                              -
    Max Threads :                                -
    Run time :                                   23 sec.
    Turnaround time :                            82 sec.

The output (if any) follows:

IPL WARN> Not all cpus are available, switch to I_MPI_PIN_ORDER=compact. (Total: 128 Available: 2)
IPL WARN> Not all cpus are available, switch to I_MPI_PIN_ORDER=compact. (Total: 128 Available: 2)
[0] MPI startup(): Intel(R) MPI Library, Version 2021.5  Build 20211102 (id: 9279b7d62)
[0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation.  All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.13.2rc1-impi
libfabric:125166:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:125166:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:125167:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:125167:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:19989:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:19989:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:19988:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:19988:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:125166:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:125166:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:125166:core:mr:ofi_default_cache_size():78<info> default cache size=2112943968
libfabric:125167:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:125167:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:125167:core:mr:ofi_default_cache_size():78<info> default cache size=2112943968
libfabric:19988:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:19988:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:19988:core:mr:ofi_default_cache_size():78<info> default cache size=2112943968
libfabric:19989:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:19989:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:19989:core:mr:ofi_default_cache_size():78<info> default cache size=2112943968
libfabric:125166:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:125166:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:125166:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:125166:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:125166:core:mr:ofi_default_cache_size():78<info> default cache size=2112943968
libfabric:19988:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:19988:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:19988:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:19988:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:19988:core:mr:ofi_default_cache_size():78<info> default cache size=2112943968
libfabric:125167:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:125167:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:125167:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:125167:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:125167:core:mr:ofi_default_cache_size():78<info> default cache size=2112943968
libfabric:19989:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:19989:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:19989:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:19989:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:19989:core:mr:ofi_default_cache_size():78<info> default cache size=2112943968
libfabric:125166:verbs:fabric:verbs_devs_print():883<info> list of verbs devices found for FI_EP_MSG:
libfabric:125166:verbs:fabric:verbs_devs_print():887<info> #1 mlx5_0 - IPoIB addresses:
libfabric:125166:verbs:fabric:verbs_devs_print():897<info>      10.205.73.86
libfabric:125166:verbs:fabric:verbs_devs_print():897<info>      fe80::ba59:9f03:d2:346
libfabric:125167:verbs:fabric:verbs_devs_print():883<info> list of verbs devices found for FI_EP_MSG:
libfabric:125167:verbs:fabric:verbs_devs_print():887<info> #1 mlx5_0 - IPoIB addresses:
libfabric:125167:verbs:fabric:verbs_devs_print():897<info>      10.205.73.86
libfabric:125167:verbs:fabric:verbs_devs_print():897<info>      fe80::ba59:9f03:d2:346
libfabric:19988:verbs:fabric:verbs_devs_print():883<info> list of verbs devices found for FI_EP_MSG:
libfabric:19988:verbs:fabric:verbs_devs_print():887<info> #1 mlx5_0 - IPoIB addresses:
libfabric:19988:verbs:fabric:verbs_devs_print():897<info>       10.205.73.79
libfabric:19988:verbs:fabric:verbs_devs_print():897<info>       fe80::ba59:9f03:d2:426
libfabric:19989:verbs:fabric:verbs_devs_print():883<info> list of verbs devices found for FI_EP_MSG:
libfabric:19989:verbs:fabric:verbs_devs_print():887<info> #1 mlx5_0 - IPoIB addresses:
libfabric:19989:verbs:fabric:verbs_devs_print():897<info>       10.205.73.79
libfabric:19989:verbs:fabric:verbs_devs_print():897<info>       fe80::ba59:9f03:d2:426
libfabric:125166:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:125167:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:125167:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:19988:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:19989:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:125166:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:125167:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:19989:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:125167:core:core:ofi_register_provider():474<info> registering provider: verbs (113.20)
libfabric:125166:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:19988:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:125166:core:core:ofi_register_provider():474<info> registering provider: verbs (113.20)
libfabric:125166:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:125166:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:125166:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ZE not supported
libfabric:125166:core:mr:ofi_default_cache_size():78<info> default cache size=2112943968
libfabric:19989:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:125167:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:125167:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:125167:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ZE not supported
libfabric:125167:core:mr:ofi_default_cache_size():78<info> default cache size=2112943968
libfabric:19989:core:core:ofi_register_provider():474<info> registering provider: verbs (113.20)
libfabric:19988:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:19988:core:core:ofi_register_provider():474<info> registering provider: verbs (113.20)
libfabric:19988:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:19988:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:19988:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ZE not supported
libfabric:19988:core:mr:ofi_default_cache_size():78<info> default cache size=2112943968
libfabric:19989:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:19989:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:19989:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ZE not supported
libfabric:19989:core:mr:ofi_default_cache_size():78<info> default cache size=2112943968
libfabric:125166:verbs:fabric:verbs_devs_print():883<info> list of verbs devices found for FI_EP_MSG:
libfabric:125166:verbs:fabric:verbs_devs_print():887<info> #1 mlx5_0 - IPoIB addresses:
libfabric:125166:verbs:fabric:verbs_devs_print():897<info>      10.205.73.86
libfabric:125166:verbs:fabric:verbs_devs_print():897<info>      fe80::ba59:9f03:d2:346
libfabric:125167:verbs:fabric:verbs_devs_print():883<info> list of verbs devices found for FI_EP_MSG:
libfabric:125167:verbs:fabric:verbs_devs_print():887<info> #1 mlx5_0 - IPoIB addresses:
libfabric:125167:verbs:fabric:verbs_devs_print():897<info>      10.205.73.86
libfabric:125167:verbs:fabric:verbs_devs_print():897<info>      fe80::ba59:9f03:d2:346
libfabric:19988:verbs:fabric:verbs_devs_print():883<info> list of verbs devices found for FI_EP_MSG:
libfabric:19988:verbs:fabric:verbs_devs_print():887<info> #1 mlx5_0 - IPoIB addresses:
libfabric:19988:verbs:fabric:verbs_devs_print():897<info>       10.205.73.79
libfabric:19988:verbs:fabric:verbs_devs_print():897<info>       fe80::ba59:9f03:d2:426
libfabric:19989:verbs:fabric:verbs_devs_print():883<info> list of verbs devices found for FI_EP_MSG:
libfabric:19989:verbs:fabric:verbs_devs_print():887<info> #1 mlx5_0 - IPoIB addresses:
libfabric:19989:verbs:fabric:verbs_devs_print():897<info>       10.205.73.79
libfabric:19989:verbs:fabric:verbs_devs_print():897<info>       fe80::ba59:9f03:d2:426
libfabric:125167:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:125166:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:125166:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:125167:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:125166:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:125166:core:core:ofi_register_provider():474<info> registering provider: verbs (113.20)
libfabric:125167:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:19988:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:125167:core:core:ofi_register_provider():474<info> registering provider: verbs (113.20)
libfabric:19989:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:125166:core:core:ofi_register_provider():474<info> registering provider: tcp (113.20)
libfabric:125167:core:core:ofi_register_provider():474<info> registering provider: tcp (113.20)
libfabric:125166:core:core:ofi_register_provider():474<info> registering provider: sockets (113.20)
libfabric:125167:core:core:ofi_register_provider():474<info> registering provider: sockets (113.20)
libfabric:125166:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:125166:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:125166:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ZE not supported
libfabric:125166:core:core:ofi_register_provider():474<info> registering provider: shm (113.20)
libfabric:125167:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:125167:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:125167:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ZE not supported
libfabric:125167:core:core:ofi_register_provider():474<info> registering provider: shm (113.20)
libfabric:19988:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:19989:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:125166:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:125166:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:125166:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:125166:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:125166:core:core:ofi_register_provider():474<info> registering provider: ofi_rxm (113.20)
libfabric:125167:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:125167:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:125167:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:125167:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:125167:core:core:ofi_register_provider():474<info> registering provider: ofi_rxm (113.20)
libfabric:19988:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:19988:core:core:ofi_register_provider():474<info> registering provider: verbs (113.20)
libfabric:19989:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:19989:core:core:ofi_register_provider():474<info> registering provider: verbs (113.20)
libfabric:19988:core:core:ofi_register_provider():474<info> registering provider: tcp (113.20)
libfabric:19989:core:core:ofi_register_provider():474<info> registering provider: tcp (113.20)
libfabric:19988:core:core:ofi_register_provider():474<info> registering provider: sockets (113.20)
libfabric:19989:core:core:ofi_register_provider():474<info> registering provider: sockets (113.20)
libfabric:19988:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:19988:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:19988:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ZE not supported
libfabric:19988:core:core:ofi_register_provider():474<info> registering provider: shm (113.20)
libfabric:19989:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:19989:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:19989:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ZE not supported
libfabric:19989:core:core:ofi_register_provider():474<info> registering provider: shm (113.20)
libfabric:19988:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:19988:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:19988:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:19988:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:19988:core:core:ofi_register_provider():474<info> registering provider: ofi_rxm (113.20)
libfabric:19989:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:19989:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:19989:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:19989:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:19989:core:core:ofi_register_provider():474<info> registering provider: ofi_rxm (113.20)
libfabric:125166:psm3:core:fi_prov_ini():680<info> build options: VERSION=1101.0=11.1.0.0, HAVE_PSM3_src=1, PSM3_CUDA=0
libfabric:125166:core:core:ofi_register_provider():474<info> registering provider: psm3 (1101.0)
libfabric:125167:psm3:core:fi_prov_ini():680<info> build options: VERSION=1101.0=11.1.0.0, HAVE_PSM3_src=1, PSM3_CUDA=0
libfabric:125167:core:core:ofi_register_provider():474<info> registering provider: psm3 (1101.0)
libfabric:19988:psm3:core:fi_prov_ini():680<info> build options: VERSION=1101.0=11.1.0.0, HAVE_PSM3_src=1, PSM3_CUDA=0
libfabric:19988:core:core:ofi_register_provider():474<info> registering provider: psm3 (1101.0)
libfabric:19989:psm3:core:fi_prov_ini():680<info> build options: VERSION=1101.0=11.1.0.0, HAVE_PSM3_src=1, PSM3_CUDA=0
libfabric:19989:core:core:ofi_register_provider():474<info> registering provider: psm3 (1101.0)
libfabric:125166:core:core:ofi_register_provider():474<info> registering provider: mlx (1.4)
libfabric:125167:core:core:ofi_register_provider():474<info> registering provider: mlx (1.4)
libfabric:125166:core:core:ofi_register_provider():474<info> registering provider: ofi_hook_noop (113.20)
libfabric:125166:core:core:fi_getinfo_():1138<info> Found provider with the highest priority mlx, must_use_util_prov = 0
libfabric:125166:mlx:core:mlx_getinfo():211<info> primary detected device: mlx5_0
libfabric:125167:core:core:ofi_register_provider():474<info> registering provider: ofi_hook_noop (113.20)
libfabric:125167:core:core:fi_getinfo_():1138<info> Found provider with the highest priority mlx, must_use_util_prov = 0
libfabric:125167:mlx:core:mlx_getinfo():211<info> primary detected device: mlx5_0
libfabric:125166:mlx:core:mlx_getinfo():254<info> used inject size = 1024
libfabric:125167:mlx:core:mlx_getinfo():254<info> used inject size = 1024
libfabric:19988:core:core:ofi_register_provider():474<info> registering provider: mlx (1.4)
libfabric:19989:core:core:ofi_register_provider():474<info> registering provider: mlx (1.4)
libfabric:125166:mlx:core:mlx_getinfo():301<info> Loaded MLX version 1.11.1
libfabric:125166:mlx:core:mlx_getinfo():348<warn> MLX: spawn support 0
libfabric:125166:core:core:fi_getinfo_():1161<info> Since mlx can be used, psm3 has been skipped. To use psm3, please, set FI_PROVIDER=psm3
libfabric:125166:core:core:fi_getinfo_():1161<info> Since mlx can be used, verbs has been skipped. To use verbs, please, set FI_PROVIDER=verbs
libfabric:125166:core:core:fi_getinfo_():1161<info> Since mlx can be used, tcp has been skipped. To use tcp, please, set FI_PROVIDER=tcp
libfabric:125166:core:core:fi_getinfo_():1161<info> Since mlx can be used, sockets has been skipped. To use sockets, please, set FI_PROVIDER=sockets
libfabric:125166:core:core:fi_getinfo_():1161<info> Since mlx can be used, shm has been skipped. To use shm, please, set FI_PROVIDER=shm
libfabric:125166:core:core:fi_getinfo_():1138<info> Found provider with the highest priority mlx, must_use_util_prov = 0
libfabric:125166:mlx:core:mlx_getinfo():211<info> primary detected device: mlx5_0
libfabric:125167:mlx:core:mlx_getinfo():301<info> Loaded MLX version 1.11.1
libfabric:125167:mlx:core:mlx_getinfo():348<warn> MLX: spawn support 0
libfabric:125167:core:core:fi_getinfo_():1161<info> Since mlx can be used, psm3 has been skipped. To use psm3, please, set FI_PROVIDER=psm3
libfabric:125167:core:core:fi_getinfo_():1161<info> Since mlx can be used, verbs has been skipped. To use verbs, please, set FI_PROVIDER=verbs
libfabric:125167:core:core:fi_getinfo_():1161<info> Since mlx can be used, tcp has been skipped. To use tcp, please, set FI_PROVIDER=tcp
libfabric:125167:core:core:fi_getinfo_():1161<info> Since mlx can be used, sockets has been skipped. To use sockets, please, set FI_PROVIDER=sockets
libfabric:125167:core:core:fi_getinfo_():1161<info> Since mlx can be used, shm has been skipped. To use shm, please, set FI_PROVIDER=shm
libfabric:125167:core:core:fi_getinfo_():1138<info> Found provider with the highest priority mlx, must_use_util_prov = 0
libfabric:125167:mlx:core:mlx_getinfo():211<info> primary detected device: mlx5_0
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): detected mlx provider, set device name to "mlx_hcoll"
libfabric:125166:mlx:core:mlx_getinfo():254<info> used inject size = 1024
libfabric:125166:mlx:core:mlx_getinfo():301<info> Loaded MLX version 1.11.1
libfabric:125166:mlx:core:mlx_getinfo():348<warn> MLX: spawn support 0
libfabric:125166:core:core:fi_getinfo_():1161<info> Since mlx can be used, psm3 has been skipped. To use psm3, please, set FI_PROVIDER=psm3
libfabric:125166:core:core:fi_getinfo_():1161<info> Since mlx can be used, verbs has been skipped. To use verbs, please, set FI_PROVIDER=verbs
libfabric:125166:core:core:fi_getinfo_():1161<info> Since mlx can be used, tcp has been skipped. To use tcp, please, set FI_PROVIDER=tcp
libfabric:125166:core:core:fi_getinfo_():1161<info> Since mlx can be used, sockets has been skipped. To use sockets, please, set FI_PROVIDER=sockets
libfabric:125166:core:core:fi_getinfo_():1161<info> Since mlx can be used, shm has been skipped. To use shm, please, set FI_PROVIDER=shm
libfabric:125166:mlx:core:mlx_fabric_open():172<info>
libfabric:125166:core:core:fi_fabric_():1423<info> Opened fabric: mlx
libfabric:125166:mlx:core:ofi_check_rx_attr():786<info> Tx only caps ignored in Rx caps
libfabric:125166:mlx:core:ofi_check_tx_attr():884<info> Rx only caps ignored in Tx caps
libfabric:125167:mlx:core:mlx_getinfo():254<info> used inject size = 1024
libfabric:125167:mlx:core:mlx_getinfo():301<info> Loaded MLX version 1.11.1
libfabric:125167:mlx:core:mlx_getinfo():348<warn> MLX: spawn support 0
libfabric:125167:core:core:fi_getinfo_():1161<info> Since mlx can be used, psm3 has been skipped. To use psm3, please, set FI_PROVIDER=psm3
libfabric:125167:core:core:fi_getinfo_():1161<info> Since mlx can be used, verbs has been skipped. To use verbs, please, set FI_PROVIDER=verbs
libfabric:125167:core:core:fi_getinfo_():1161<info> Since mlx can be used, tcp has been skipped. To use tcp, please, set FI_PROVIDER=tcp
libfabric:125167:core:core:fi_getinfo_():1161<info> Since mlx can be used, sockets has been skipped. To use sockets, please, set FI_PROVIDER=sockets
libfabric:125167:core:core:fi_getinfo_():1161<info> Since mlx can be used, shm has been skipped. To use shm, please, set FI_PROVIDER=shm
libfabric:125167:mlx:core:mlx_fabric_open():172<info>
libfabric:125167:core:core:fi_fabric_():1423<info> Opened fabric: mlx
libfabric:125167:mlx:core:ofi_check_rx_attr():786<info> Tx only caps ignored in Rx caps
libfabric:125167:mlx:core:ofi_check_tx_attr():884<info> Rx only caps ignored in Tx caps
libfabric:19988:core:core:ofi_register_provider():474<info> registering provider: ofi_hook_noop (113.20)
libfabric:19988:core:core:fi_getinfo_():1138<info> Found provider with the highest priority mlx, must_use_util_prov = 0
libfabric:19988:mlx:core:mlx_getinfo():211<info> primary detected device: mlx5_0
libfabric:19989:core:core:ofi_register_provider():474<info> registering provider: ofi_hook_noop (113.20)
libfabric:19989:core:core:fi_getinfo_():1138<info> Found provider with the highest priority mlx, must_use_util_prov = 0
libfabric:19989:mlx:core:mlx_getinfo():211<info> primary detected device: mlx5_0
libfabric:19988:mlx:core:mlx_getinfo():254<info> used inject size = 1024
libfabric:19989:mlx:core:mlx_getinfo():254<info> used inject size = 1024
libfabric:19988:mlx:core:mlx_getinfo():301<info> Loaded MLX version 1.11.1
libfabric:19988:mlx:core:mlx_getinfo():348<warn> MLX: spawn support 0
libfabric:19988:core:core:fi_getinfo_():1161<info> Since mlx can be used, psm3 has been skipped. To use psm3, please, set FI_PROVIDER=psm3
libfabric:19988:core:core:fi_getinfo_():1161<info> Since mlx can be used, verbs has been skipped. To use verbs, please, set FI_PROVIDER=verbs
libfabric:19988:core:core:fi_getinfo_():1161<info> Since mlx can be used, tcp has been skipped. To use tcp, please, set FI_PROVIDER=tcp
libfabric:19988:core:core:fi_getinfo_():1161<info> Since mlx can be used, sockets has been skipped. To use sockets, please, set FI_PROVIDER=sockets
libfabric:19988:core:core:fi_getinfo_():1161<info> Since mlx can be used, shm has been skipped. To use shm, please, set FI_PROVIDER=shm
libfabric:19988:core:core:fi_getinfo_():1138<info> Found provider with the highest priority mlx, must_use_util_prov = 0
libfabric:19988:mlx:core:mlx_getinfo():211<info> primary detected device: mlx5_0
libfabric:19989:mlx:core:mlx_getinfo():301<info> Loaded MLX version 1.11.1
libfabric:19989:mlx:core:mlx_getinfo():348<warn> MLX: spawn support 0
libfabric:19989:core:core:fi_getinfo_():1161<info> Since mlx can be used, psm3 has been skipped. To use psm3, please, set FI_PROVIDER=psm3
libfabric:19989:core:core:fi_getinfo_():1161<info> Since mlx can be used, verbs has been skipped. To use verbs, please, set FI_PROVIDER=verbs
libfabric:19989:core:core:fi_getinfo_():1161<info> Since mlx can be used, tcp has been skipped. To use tcp, please, set FI_PROVIDER=tcp
libfabric:19989:core:core:fi_getinfo_():1161<info> Since mlx can be used, sockets has been skipped. To use sockets, please, set FI_PROVIDER=sockets
libfabric:19989:core:core:fi_getinfo_():1161<info> Since mlx can be used, shm has been skipped. To use shm, please, set FI_PROVIDER=shm
libfabric:19989:core:core:fi_getinfo_():1138<info> Found provider with the highest priority mlx, must_use_util_prov = 0
libfabric:19989:mlx:core:mlx_getinfo():211<info> primary detected device: mlx5_0
libfabric:19988:mlx:core:mlx_getinfo():254<info> used inject size = 1024
libfabric:19988:mlx:core:mlx_getinfo():301<info> Loaded MLX version 1.11.1
libfabric:19988:mlx:core:mlx_getinfo():348<warn> MLX: spawn support 0
libfabric:19988:core:core:fi_getinfo_():1161<info> Since mlx can be used, psm3 has been skipped. To use psm3, please, set FI_PROVIDER=psm3
libfabric:19988:core:core:fi_getinfo_():1161<info> Since mlx can be used, verbs has been skipped. To use verbs, please, set FI_PROVIDER=verbs
libfabric:19988:core:core:fi_getinfo_():1161<info> Since mlx can be used, tcp has been skipped. To use tcp, please, set FI_PROVIDER=tcp
libfabric:19988:core:core:fi_getinfo_():1161<info> Since mlx can be used, sockets has been skipped. To use sockets, please, set FI_PROVIDER=sockets
libfabric:19988:core:core:fi_getinfo_():1161<info> Since mlx can be used, shm has been skipped. To use shm, please, set FI_PROVIDER=shm
libfabric:19988:mlx:core:mlx_fabric_open():172<info>
libfabric:19988:core:core:fi_fabric_():1423<info> Opened fabric: mlx
libfabric:19988:mlx:core:ofi_check_rx_attr():786<info> Tx only caps ignored in Rx caps
libfabric:19988:mlx:core:ofi_check_tx_attr():884<info> Rx only caps ignored in Tx caps
libfabric:19989:mlx:core:mlx_getinfo():254<info> used inject size = 1024
libfabric:19989:mlx:core:mlx_getinfo():301<info> Loaded MLX version 1.11.1
libfabric:19989:mlx:core:mlx_getinfo():348<warn> MLX: spawn support 0
libfabric:19989:core:core:fi_getinfo_():1161<info> Since mlx can be used, psm3 has been skipped. To use psm3, please, set FI_PROVIDER=psm3
libfabric:19989:core:core:fi_getinfo_():1161<info> Since mlx can be used, verbs has been skipped. To use verbs, please, set FI_PROVIDER=verbs
libfabric:19989:core:core:fi_getinfo_():1161<info> Since mlx can be used, tcp has been skipped. To use tcp, please, set FI_PROVIDER=tcp
libfabric:19989:core:core:fi_getinfo_():1161<info> Since mlx can be used, sockets has been skipped. To use sockets, please, set FI_PROVIDER=sockets
libfabric:19989:core:core:fi_getinfo_():1161<info> Since mlx can be used, shm has been skipped. To use shm, please, set FI_PROVIDER=shm
libfabric:19989:mlx:core:mlx_fabric_open():172<info>
libfabric:19989:core:core:fi_fabric_():1423<info> Opened fabric: mlx
libfabric:19989:mlx:core:ofi_check_rx_attr():786<info> Tx only caps ignored in Rx caps
libfabric:19989:mlx:core:ofi_check_tx_attr():884<info> Rx only caps ignored in Tx caps
[0] MPI startup(): max_ch4_vcis: 1, max_reg_eps 64, enable_sep 0, enable_shared_ctxs 0, do_av_insert 1
libfabric:125166:mlx:core:ofi_check_rx_attr():786<info> Tx only caps ignored in Rx caps
libfabric:125166:mlx:core:ofi_check_tx_attr():884<info> Rx only caps ignored in Tx caps
libfabric:125167:mlx:core:ofi_check_rx_attr():786<info> Tx only caps ignored in Rx caps
libfabric:125167:mlx:core:ofi_check_tx_attr():884<info> Rx only caps ignored in Tx caps
libfabric:19988:mlx:core:ofi_check_rx_attr():786<info> Tx only caps ignored in Rx caps
libfabric:19988:mlx:core:ofi_check_tx_attr():884<info> Rx only caps ignored in Tx caps
libfabric:19989:mlx:core:ofi_check_rx_attr():786<info> Tx only caps ignored in Rx caps
libfabric:19989:mlx:core:ofi_check_tx_attr():884<info> Rx only caps ignored in Tx caps
[0] MPI startup(): addrnamelen: 1024
libfabric:125166:mlx:core:mlx_cm_getname_mlx_format():73<info> Loaded UCP address: [307]...
libfabric:125167:mlx:core:mlx_cm_getname_mlx_format():73<info> Loaded UCP address: [307]...
libfabric:19988:mlx:core:mlx_cm_getname_mlx_format():73<info> Loaded UCP address: [307]...
libfabric:19989:mlx:core:mlx_cm_getname_mlx_format():73<info> Loaded UCP address: [307]...
libfabric:125166:mlx:core:mlx_av_insert():179<warn> Try to insert address #0, offset=0 (size=2) fi_addr=0x1925720
libfabric:125167:mlx:core:mlx_av_insert():179<warn> Try to insert address #0, offset=0 (size=2) fi_addr=0x2884190
libfabric:19988:mlx:core:mlx_av_insert():179<warn> Try to insert address #0, offset=0 (size=2) fi_addr=0x2a79f70
libfabric:19989:mlx:core:mlx_av_insert():179<warn> Try to insert address #0, offset=0 (size=2) fi_addr=0x2875190
libfabric:19988:mlx:core:mlx_av_insert():189<warn> address inserted
libfabric:19988:mlx:core:mlx_av_insert():179<warn> Try to insert address #1, offset=1024 (size=2) fi_addr=0x2a79f70
libfabric:19989:mlx:core:mlx_av_insert():189<warn> address inserted
libfabric:19989:mlx:core:mlx_av_insert():179<warn> Try to insert address #1, offset=1024 (size=2) fi_addr=0x2875190
libfabric:19988:mlx:core:mlx_av_insert():189<warn> address inserted
libfabric:19989:mlx:core:mlx_av_insert():189<warn> address inserted
libfabric:125166:mlx:core:mlx_av_insert():189<warn> address inserted
libfabric:125167:mlx:core:mlx_av_insert():189<warn> address inserted
libfabric:125167:mlx:core:mlx_av_insert():179<warn> Try to insert address #1, offset=1024 (size=2) fi_addr=0x2884190
libfabric:125166:mlx:core:mlx_av_insert():179<warn> Try to insert address #1, offset=1024 (size=2) fi_addr=0x1925720
libfabric:125167:mlx:core:mlx_av_insert():189<warn> address inserted
libfabric:125166:mlx:core:mlx_av_insert():189<warn> address inserted
libfabric:125166:mlx:core:mlx_av_insert():179<warn> Try to insert address #0, offset=0 (size=2) fi_addr=0x19b1040
libfabric:125167:mlx:core:mlx_av_insert():179<warn> Try to insert address #0, offset=0 (size=2) fi_addr=0x2884190
libfabric:125166:mlx:core:mlx_av_insert():189<warn> address inserted
libfabric:125166:mlx:core:mlx_av_insert():179<warn> Try to insert address #1, offset=1024 (size=2) fi_addr=0x19b1040
libfabric:125167:mlx:core:mlx_av_insert():189<warn> address inserted
libfabric:125167:mlx:core:mlx_av_insert():179<warn> Try to insert address #1, offset=1024 (size=2) fi_addr=0x2884190
libfabric:125166:mlx:core:mlx_av_insert():189<warn> address inserted
libfabric:125167:mlx:core:mlx_av_insert():189<warn> address inserted
[0] MPI startup(): Load tuning file: "/cluster/apps/nss/intel/oneapi/2022.1.2/mpi/2021.5.1/etc/tuning_generic_ofi_mlx_hcoll.dat"
libfabric:19988:mlx:core:mlx_av_insert():179<warn> Try to insert address #0, offset=0 (size=2) fi_addr=0x2b48a10
libfabric:19988:mlx:core:mlx_av_insert():189<warn> address inserted
libfabric:19988:mlx:core:mlx_av_insert():179<warn> Try to insert address #1, offset=1024 (size=2) fi_addr=0x2b48a10
libfabric:19988:mlx:core:mlx_av_insert():189<warn> address inserted
libfabric:19989:mlx:core:mlx_av_insert():179<warn> Try to insert address #0, offset=0 (size=2) fi_addr=0x2875190
libfabric:19989:mlx:core:mlx_av_insert():189<warn> address inserted
libfabric:19989:mlx:core:mlx_av_insert():179<warn> Try to insert address #1, offset=1024 (size=2) fi_addr=0x2875190
libfabric:19989:mlx:core:mlx_av_insert():189<warn> address inserted
[0] MPI startup(): Rank    Pid      Node name    Pin cpu
[0] MPI startup(): 0       125166   eu-g1-017-2  {44}
[0] MPI startup(): 1       125167   eu-g1-017-2  {45}
[0] MPI startup(): 2       19988    eu-g1-015-3  {44}
[0] MPI startup(): 3       19989    eu-g1-015-3  {45}
[0] MPI startup(): I_MPI_ROOT=/cluster/apps/nss/intel/oneapi/2022.1.2/mpi/2021.5.1
[0] MPI startup(): I_MPI_MPIRUN=mpirun
[0] MPI startup(): I_MPI_HYDRA_RMK=lsf
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_FABRICS=ofi
[0] MPI startup(): I_MPI_DEBUG=30
[0] MPI startup(): threading: mode: direct
[0] MPI startup(): threading: vcis: 1
[0] MPI startup(): threading: app_threads: 1
[0] MPI startup(): threading: runtime: generic
[0] MPI startup(): threading: is_threaded: 0
[0] MPI startup(): threading: async_progress: 0
[0] MPI startup(): threading: num_pools: 64
[0] MPI startup(): threading: lock_level: global
[0] MPI startup(): threading: enable_sep: 0
[0] MPI startup(): threading: direct_recv: 1
[0] MPI startup(): threading: zero_op_flags: 1
[0] MPI startup(): threading: num_am_buffers: 1
[0] MPI startup(): threading: library is built with per-vci thread granularity
Hello world from processor eu-g1-017-2, rank 1 out of 4 processors
Hello world from processor eu-g1-017-2, rank 0 out of 4 processors
Hello world from processor eu-g1-015-3, rank 2 out of 4 processors
Hello world from processor eu-g1-015-3, rank 3 out of 4 processors
libfabric:125166:psm3:core:psmx3_fini():643<info>
libfabric:19988:psm3:core:psmx3_fini():643<info>
libfabric:125167:psm3:core:psmx3_fini():643<info>
libfabric:19989:psm3:core:psmx3_fini():643<info>
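For readers skimming a log like the one above: the rank table that I_MPI_DEBUG prints can be condensed into a ranks-per-node count. This is a minimal sketch, assuming the table layout shown in this thread (rank, pid, node name, pin set); `ranks_per_node` is a hypothetical helper name and the file name in the comment is a placeholder.

```shell
# Count MPI ranks per node from the "Rank Pid Node name Pin cpu" table
# printed by I_MPI_DEBUG. Reads the job output on stdin.
ranks_per_node() {
    # Table rows look like: [0] MPI startup(): 0  125166  eu-g1-017-2  {44}
    # Only rows whose 4th and 5th fields are numeric (rank and pid) are counted.
    awk '$2 == "MPI" && $3 == "startup():" && $4 ~ /^[0-9]+$/ && $5 ~ /^[0-9]+$/ && NF >= 7 {
             count[$6]++
         }
         END { for (node in count) print node, count[node] }' | sort
}

# Example (placeholder file name):
#   ranks_per_node < lsf.oJOBID
```

Every other line of the log (libfabric messages, threading settings) is ignored, so the full job output can be piped in unchanged.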

 

I also reran the same job without the increased log level. The memory consumption now looks much better and no error is shown any more:

 

Subject: Job 207170114: <I_MPI_FABRICS=ofi mpirun ./hello> in cluster <euler> Done

Job <I_MPI_FABRICS=ofi mpirun ./hello> was submitted from host <eu-login-12> by user <sfux> in cluster <euler> at Fri Mar  4 14:16:43 2022
Job was executed on host(s) <2*eu-g1-020-2>, in queue <normal.4h>, as user <sfux> in cluster <euler> at Fri Mar  4 14:17:14 2022
                            <2*eu-g1-013-4>
</cluster/home/sfux> was used as the home directory.
</cluster/home/sfux/test/intelmpi> was used as the working directory.
Started at Fri Mar  4 14:17:14 2022
Terminated at Fri Mar  4 14:17:52 2022
Results reported at Fri Mar  4 14:17:52 2022

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
I_MPI_FABRICS=ofi mpirun ./hello
------------------------------------------------------------

Successfully completed.

Resource usage summary:

    CPU time :                                   2.46 sec.
    Max Memory :                                 798 MB
    Average Memory :                             -
    Total Requested Memory :                     8000.00 MB
    Delta Memory :                               7202.00 MB
    Max Swap :                                   -
    Max Processes :                              -
    Max Threads :                                -
    Run time :                                   40 sec.
    Turnaround time :                            69 sec.

The output (if any) follows:

Hello world from processor eu-g1-020-2, rank 1 out of 4 processors
Hello world from processor eu-g1-020-2, rank 0 out of 4 processors
Hello world from processor eu-g1-013-4, rank 2 out of 4 processors
Hello world from processor eu-g1-013-4, rank 3 out of 4 processors

 

I was using shm:ofi, because

 

https://www.intel.com/content/www/us/en/develop/documentation/mpi-developer-reference-linux/top/envi...

 

states that this is the default for regular mode. Would you recommend using I_MPI_FABRICS=ofi in general for a setup like the one I am testing? And how is intranode communication handled when shm is not specified at all?

 


Could you please provide the OS & CPU details?



OS: CentOS Linux release 7.9.2009 (Core)
Kernel: Linux eu-login-12 3.10.0-1160.45.1.el7.x86_64 #1 SMP Wed Oct 13 17:20:51 UTC 2021 x86_64 GNU/Linux

The nodes that I am testing with have two AMD EPYC 7742 CPUs with 64 cores each.

 

Again, thank you very much for your help.

 

Best regards

 

Sam

qumale
Beginner

Dear Members,

 

I work on the cluster managed by Sam, who reported this issue. Setting I_MPI_FABRICS=ofi didn't solve the problem; see the attached logs with different levels of verbosity. The problem seems to be intermittent, because some of our jobs worked fine, like the one reported by Sam above.

 

Q

samfux84
New Contributor I

Just a comment regarding the specs of the server where @qumale was running the job that produced lsf_207702449.out.txt and lsf_207705698.err.txt:

 

Our servers with hostnames eu-a6-* have the following specs:

 

  • Two 12-core Intel Xeon Gold 5118 processors (2.3 GHz nominal, 3.2 GHz peak)
  • 96 GB of DDR4 memory clocked at 2400 MHz

And these are the InfiniBand network cards in those servers:

 

$ ibv_devinfo
hca_id:	mlx5_0
	transport:			InfiniBand (0)
	fw_ver:				12.23.1020
	node_guid:			9440:c9ff:ff71:48e8
	sys_image_guid:			9440:c9ff:ff71:48e8
	vendor_id:			0x02c9
	vendor_part_id:			4115
	hw_ver:				0x0
	board_id:			HPE2920111032
	phys_port_cnt:			1
		port:	1
			state:			PORT_ACTIVE (4)
			max_mtu:		4096 (5)
			active_mtu:		4096 (5)
			sm_lid:			3
			port_lid:		297
			port_lmc:		0x00
			link_layer:		InfiniBand

hca_id:	mlx5_1
	transport:			InfiniBand (0)
	fw_ver:				12.23.1020
	node_guid:			9440:c9ff:ff71:48e9
	sys_image_guid:			9440:c9ff:ff71:48e8
	vendor_id:			0x02c9
	vendor_part_id:			4115
	hw_ver:				0x0
	board_id:			HPE2920111032
	phys_port_cnt:			1
		port:	1
			state:			PORT_DOWN (1)
			max_mtu:		4096 (5)
			active_mtu:		4096 (5)
			sm_lid:			0
			port_lid:		65535
			port_lmc:		0x00
			link_layer:		InfiniBand
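
For readers who want the port states at a glance (here mlx5_0 is active and mlx5_1 is down), the output above can be reduced to one line per port. This is a minimal sketch that assumes the `ibv_devinfo` layout shown in this post; `port_states` is a hypothetical helper name.

```shell
# Print "<hca> <port state>" for every port listed by ibv_devinfo.
# Reads ibv_devinfo output on stdin.
port_states() {
    awk '$1 == "hca_id:" { hca = $2 }          # remember the current adapter
         $1 == "state:"  { print hca, $2 }'    # e.g. "mlx5_0 PORT_ACTIVE"
}

# Example:
#   ibv_devinfo | port_states
```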

 

HemanthCH_Intel
Moderator

Hi,


Could you please confirm which CPU you are using (two AMD EPYC 7742 CPUs or Intel Xeon Gold 5118)?


Thanks & Regards,

Hemanth.


samfux84
New Contributor I

@HemanthCH_Intel Thank you for your reply. We have several different hardware generations in our HPC cluster.

 

All the logs that I provided above are from jobs run on servers with two AMD EPYC 7742 CPUs, each of them having 64 cores. The logs that @qumale provided are from jobs run on servers with two Intel Xeon Gold 5118 CPUs.

 

We can also run the same example on other servers with two 18-core Intel Xeon Gold 6150 processors if this helps.

 

I think the problem does not depend on the CPU but rather on the InfiniBand network card.

 

Best regards

 

Sam

HemanthCH_Intel
Moderator

Hi Sam,


Could you please provide the sample reproducer code and steps to reproduce your issue at our end?


Thanks & Regards,

Hemanth.


samfux84
New Contributor I

Hi Hemanth,

 

The reproducer code is already given in the initial post of this thread (it is the hello.c code).

 

We used the following commands:

 

FI_PROVIDER=mlx I_MPI_FABRICS=shm:ofi mpirun ./hello

-> fails on the node with AMD EPYC 7742 CPUs

I_MPI_FABRICS=shm:ofi I_MPI_OFI_PROVIDER=mlx mpirun ./hello

-> fails on the node with AMD EPYC 7742 CPUs

I_MPI_FABRICS=ofi mpirun ./hello

-> worked when running on 4 AMD EPYC 7742 cores
-> worked when running on 4 Intel Xeon Gold 6150 cores
-> worked when running on 4 Intel Xeon Gold 5118 cores
-> failed when running on 256 AMD EPYC 7742 cores
-> failed when running on 256 Intel Xeon Gold 6150 cores
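
For anyone who wants to repeat this matrix, the three variants can be wrapped in a small helper that records whether each launch succeeded. This is only a sketch: `run_case` is a hypothetical helper, it discards the program output, and it only reports whatever exit status the launcher returns.

```shell
# run_case LABEL [VAR=value ...] COMMAND [ARGS ...]
# Runs COMMAND with the given environment settings and reports success or failure.
run_case() {
    label=$1
    shift
    # env applies the VAR=value assignments to this one command only.
    if env "$@" >/dev/null 2>&1; then
        echo "$label: ok"
    else
        echo "$label: FAILED (exit status $?)"
    fi
}

# The three variants from this post (uncomment inside a batch job):
# run_case "shm:ofi, FI_PROVIDER=mlx"        FI_PROVIDER=mlx I_MPI_FABRICS=shm:ofi mpirun ./hello
# run_case "shm:ofi, I_MPI_OFI_PROVIDER=mlx" I_MPI_FABRICS=shm:ofi I_MPI_OFI_PROVIDER=mlx mpirun ./hello
# run_case "ofi"                             I_MPI_FABRICS=ofi mpirun ./hello
```

Because the failures reported here are intermittent, each case probably needs to be run several times before drawing conclusions from a single "ok".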

 

When repeating some of the tests, the last example listed worked when running on

 

* 256 AMD EPYC 7742

* 256 AMD EPYC 7H12

* 256 AMD EPYC 7663

 

However, the memory consumption that the IBM LSF batch system reports for these three runs is between 70 GB and 100 GB, which seems wrong for a 256-core MPI hello world example.
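
As a rough cross-check of these numbers, the LSF totals can be divided by the rank count. The sketch below assumes 1 GB = 1024 MB and uses the 798 MB that LSF reported for the 4-rank I_MPI_FABRICS=ofi job earlier in this thread; `per_rank_mb` is a hypothetical helper name.

```shell
# per_rank_mb TOTAL_MB RANKS -> memory per rank in MB (integer, rounded down)
per_rank_mb() {
    echo $(( $1 / $2 ))
}

per_rank_mb 798 4                  # 4-rank job:             prints 199
per_rank_mb $(( 70 * 1024 )) 256   # 256 ranks, 70 GB total:  prints 280
per_rank_mb $(( 100 * 1024 )) 256  # 256 ranks, 100 GB total: prints 400
```

Seen this way, the per-rank figure is a few hundred MB in both the small and the large runs, so the large total may mostly reflect a per-rank cost multiplied by 256 ranks. This is only an arithmetic comparison and says nothing about where the memory is actually allocated.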

 

Is there any general recommendation for the settings for Intel MPI with Mellanox ConnectX-6 Infiniband cards for the variables listed below?

 

* I_MPI_FABRICS

* I_MPI_OFI_PROVIDER

 

Can the variable FI_PROVIDER still be used or is it already deprecated?

 

Is there a page for Intel MPI 2021.5 similar to this one for Intel MPI 2019?

 

https://www.intel.com/content/www/us/en/developer/articles/technical/mpi-library-2019-over-libfabric...

 

For the moment, we will not put the Intel oneAPI 2022.1.2 installation into production on our cluster until we can resolve the issues with Intel MPI.

 

Thank you for your help and best regards

 

Sam

HemanthCH_Intel
Moderator

Hi,

 

Could you please provide the output of the commands below from the 256 Intel Xeon Gold 6150 cores:

 

 

ucx_info -d | grep Transport
ucx_info -v

 

 

Could you please provide the complete debug log using the below command:

 

 

I_MPI_DEBUG=40 FI_PROVIDER=mlx mpirun -n 2 -ppn 2 ./a.out

 

 

Could you please try running the MPI program on the following configurations and let us know whether the issue occurs?

1. Two nodes containing 2 Intel Xeon Gold 6150 cores

2. Two nodes containing 128 Intel Xeon Gold 6150 cores

 

Use the below command for launching an MPI program on a cluster:

 

 

I_MPI_DEBUG=40 FI_PROVIDER=mlx mpirun -n 4 -ppn 2 -f nodefile ./a.out
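
The `nodefile` in the command above is a plain list of host names, one per line. Inside a batch job it could be derived from the scheduler's host list. The sketch below assumes that LSF exports the allocated hosts as a space-separated list in `LSB_HOSTS` (one entry per slot); `hosts_to_nodefile` is a hypothetical helper name.

```shell
# hosts_to_nodefile "HOST HOST ..." -> one unique host name per line
hosts_to_nodefile() {
    # $1 is deliberately unquoted so the list is split on whitespace.
    printf '%s\n' $1 | sort -u
}

# Inside an LSF job (assuming LSB_HOSTS is set by the scheduler):
#   hosts_to_nodefile "$LSB_HOSTS" > nodefile
#   I_MPI_DEBUG=40 FI_PROVIDER=mlx mpirun -n 4 -ppn 2 -f nodefile ./a.out
```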

 

 

Thanks & Regards,

Hemanth

 

samfux84
New Contributor I

Intel Xeon Gold 6150 has 18 cores and our eu-a6-* compute nodes have two of those CPUs per compute node, i.e., a total of 36 cores per compute node.

 

Please find below the output of the commands that you asked for:

 

 

 

[sfux@eu-a6-001-01 ~]$ module list

Currently Loaded Modules:
  1) StdEnv   2) intel/2022.1.2


[sfux@eu-a6-001-01 ~]$ ucx_info -d | grep Transport
#      Transport: posix
#      Transport: sysv
#      Transport: self
#      Transport: tcp
#      Transport: tcp
#      Transport: tcp
#      Transport: rc_verbs
#      Transport: rc_mlx5
#      Transport: dc_mlx5
#      Transport: ud_verbs
#      Transport: ud_mlx5
#      Transport: cma
#      Transport: knem
#      Transport: xpmem
[sfux@eu-a6-001-01 ~]$ ucx_info -v
# UCT version=1.11.1 revision c58db6b
# configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --enable-mt --disable-params-check --without-java --enable-cma --with-cuda --with-gdrcopy --with-verbs --with-knem --with-rdmacm --without-rocm --with-xpmem --without-fuse3 --without-ugni --with-cuda=/usr/local/cuda-11.2
[sfux@eu-a6-001-01 ~]$
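
Since `ucx_info -d` lists a transport once per device (tcp appears three times above), a deduplicated list can be easier to compare between nodes. This is a minimal sketch assuming the `#      Transport: <name>` line format shown above; `unique_transports` is a hypothetical helper name.

```shell
# Print each UCX transport name once, sorted. Reads `ucx_info -d` output on stdin.
unique_transports() {
    awk '$2 == "Transport:" { print $3 }' | sort -u
}

# Example:
#   ucx_info -d | unique_transports
```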

 

 

As requested, I submitted the job requesting 2 cores on a single node:

 

[sfux@eu-login-16 intelmpi]$ bsub -n 2 -R "span[ptile=2] select[model==XeonGold_6150] rusage[mem=500]" I_MPI_DEBUG=40 FI_PROVIDER=mlx mpirun -n 2 -ppn 2 ./hello
Generic job.
Job <211960558> is submitted to queue <hpc.4h>.
[sfux@eu-login-16 intelmpi]$

 

 Please find below the corresponding logs:

 

[sfux@eu-login-16 intelmpi]$ cat lsf.o211960558
Sender: LSF System <lsfadmin@eu-a6-011-11>
Subject: Job 211960558: <I_MPI_DEBUG=40 FI_PROVIDER=mlx mpirun -n 2 -ppn 2 ./hello> in cluster <euler> Done

Job <I_MPI_DEBUG=40 FI_PROVIDER=mlx mpirun -n 2 -ppn 2 ./hello> was submitted from host <eu-login-16> by user <sfux> in cluster <euler> at Thu Mar 31 09:42:49 2022
Job was executed on host(s) <2*eu-a6-011-11>, in queue <hpc.4h>, as user <sfux> in cluster <euler> at Thu Mar 31 09:43:07 2022
</cluster/home/sfux> was used as the home directory.
</cluster/home/sfux/test/intelmpi> was used as the working directory.
Started at Thu Mar 31 09:43:07 2022
Terminated at Thu Mar 31 09:43:12 2022
Results reported at Thu Mar 31 09:43:12 2022

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
I_MPI_DEBUG=40 FI_PROVIDER=mlx mpirun -n 2 -ppn 2 ./hello
------------------------------------------------------------

Successfully completed.

Resource usage summary:

    CPU time :                                   1.22 sec.
    Max Memory :                                 200 MB
    Average Memory :                             -
    Total Requested Memory :                     1000.00 MB
    Delta Memory :                               800.00 MB
    Max Swap :                                   -
    Max Processes :                              -
    Max Threads :                                -
    Run time :                                   12 sec.
    Turnaround time :                            23 sec.

The output (if any) follows:

IPL WARN> Not all cpus are available, switch to I_MPI_PIN_ORDER=compact. (Total: 72 Available: 4)
[0] MPI startup(): Intel(R) MPI Library, Version 2021.5  Build 20211102 (id: 9279b7d62)
[0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation.  All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): shm segment size (1307 MB per rank) * (2 local ranks) = 2615 MB total
[0] MPI startup(): libfabric version: 1.13.2rc1-impi
libfabric:72159:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:72159:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:72159:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:72159:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:72159:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:72159:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:72159:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:72159:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:72159:core:core:ofi_register_provider():474<info> registering provider: verbs (113.20)
libfabric:72159:core:core:ofi_register_provider():502<info> "verbs" filtered by provider include/exclude list, skipping
libfabric:72159:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:72159:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:72159:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ZE not supported
libfabric:72159:core:core:ofi_register_provider():474<info> registering provider: verbs (113.20)
libfabric:72159:core:core:ofi_register_provider():502<info> "verbs" filtered by provider include/exclude list, skipping
libfabric:72159:core:core:ofi_register_provider():474<info> registering provider: tcp (113.20)
libfabric:72159:core:core:ofi_register_provider():502<info> "tcp" filtered by provider include/exclude list, skipping
libfabric:72159:core:core:ofi_register_provider():474<info> registering provider: sockets (113.20)
libfabric:72159:core:core:ofi_register_provider():502<info> "sockets" filtered by provider include/exclude list, skipping
libfabric:72159:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:72159:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:72159:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ZE not supported
libfabric:72159:core:core:ofi_register_provider():474<info> registering provider: shm (113.20)
libfabric:72159:core:core:ofi_register_provider():502<info> "shm" filtered by provider include/exclude list, skipping
libfabric:72159:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:72159:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:72159:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:72159:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:72159:core:core:ofi_register_provider():474<info> registering provider: ofi_rxm (113.20)
libfabric:72159:psm3:core:fi_prov_ini():680<info> build options: VERSION=1101.0=11.1.0.0, HAVE_PSM3_src=1, PSM3_CUDA=0
libfabric:72159:core:core:ofi_register_provider():474<info> registering provider: psm3 (1101.0)
libfabric:72159:core:core:ofi_register_provider():502<info> "psm3" filtered by provider include/exclude list, skipping
libfabric:72159:core:core:ofi_register_provider():474<info> registering provider: mlx (1.4)
libfabric:72159:core:core:ofi_register_provider():474<info> registering provider: ofi_hook_noop (113.20)
libfabric:72159:core:core:fi_getinfo_():1138<info> Found provider with the highest priority mlx, must_use_util_prov = 0
libfabric:72159:core:core:fi_getinfo_():1138<info> Found provider with the highest priority mlx, must_use_util_prov = 0
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): detected mlx provider, set device name to "mlx"
libfabric:72159:core:core:fi_fabric_():1423<info> Opened fabric: mlx
[0] MPI startup(): max_ch4_vcis: 1, max_reg_eps 64, enable_sep 0, enable_shared_ctxs 0, do_av_insert 1
[0] MPI startup(): addrnamelen: 1024
[0] MPI startup(): File "/cluster/apps/nss/intel/oneapi/2022.1.2/mpi/2021.5.1/etc/tuning_skx_shm-ofi_mlx_10.dat" not found
[0] MPI startup(): Load tuning file: "/cluster/apps/nss/intel/oneapi/2022.1.2/mpi/2021.5.1/etc/tuning_skx_shm-ofi.dat"
Hello world from processor eu-a6-011-11, rank 1 out of 2 processors
[0] MPI startup(): Rank    Pid      Node name     Pin cpu
[0] MPI startup(): 0       72159    eu-a6-011-11  {0,36}
[0] MPI startup(): 1       72160    eu-a6-011-11  {9,45}
[0] MPI startup(): I_MPI_ROOT=/cluster/apps/nss/intel/oneapi/2022.1.2/mpi/2021.5.1
[0] MPI startup(): I_MPI_MPIRUN=mpirun
[0] MPI startup(): I_MPI_HYDRA_RMK=lsf
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_DEBUG=40
[0] MPI startup(): threading: mode: direct
[0] MPI startup(): threading: vcis: 1
[0] MPI startup(): threading: app_threads: 1
[0] MPI startup(): threading: runtime: generic
[0] MPI startup(): threading: is_threaded: 0
[0] MPI startup(): threading: async_progress: 0
[0] MPI startup(): threading: num_pools: 64
[0] MPI startup(): threading: lock_level: global
[0] MPI startup(): threading: enable_sep: 0
[0] MPI startup(): threading: direct_recv: 1
[0] MPI startup(): threading: zero_op_flags: 1
[0] MPI startup(): threading: num_am_buffers: 1
[0] MPI startup(): threading: library is built with per-vci thread granularity
Hello world from processor eu-a6-011-11, rank 0 out of 2 processors
[sfux@eu-login-16 intelmpi]$
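
At I_MPI_DEBUG=40 the few lines that identify the library version, the selected provider, the shm segment size and the tuning file are easy to miss among the libfabric messages. The filter below is a sketch that simply matches the message texts seen in this log; `startup_summary` is a hypothetical helper name and the file name in the comment is taken from the transcript above.

```shell
# Keep only the MPI startup lines that identify version, provider,
# shm segment size and tuning file. Reads the job output on stdin.
startup_summary() {
    grep -E 'MPI Library, Version|libfabric provider:|shm segment size|Load tuning file:'
}

# Example:
#   startup_summary < lsf.o211960558
```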

 

Job on 4 cores (2 per host):

 

[sfux@eu-login-16 intelmpi]$ bsub -n 4 -R "span[ptile=2] select[model==XeonGold_6150] rusage[mem=500]" I_MPI_DEBUG=40 FI_PROVIDER=mlx mpirun -n 4 -ppn 2 ./hello
Generic job.
Job <211966082> is submitted to queue <hpc.4h>.
[sfux@eu-login-16 intelmpi]$

 

 Please find the corresponding logs below:

 

[sfux@eu-login-16 intelmpi]$ cat lsf.o211966082
Sender: LSF System <lsfadmin@eu-a6-009-20>
Subject: Job 211966082: <I_MPI_DEBUG=40 FI_PROVIDER=mlx mpirun -n 4 -ppn 2 ./hello> in cluster <euler> Done

Job <I_MPI_DEBUG=40 FI_PROVIDER=mlx mpirun -n 4 -ppn 2 ./hello> was submitted from host <eu-login-16> by user <sfux> in cluster <euler> at Thu Mar 31 10:48:04 2022
Job was executed on host(s) <2*eu-a6-009-20>, in queue <hpc.4h>, as user <sfux> in cluster <euler> at Thu Mar 31 10:48:46 2022
                            <2*eu-a6-009-22>
</cluster/home/sfux> was used as the home directory.
</cluster/home/sfux/test/intelmpi> was used as the working directory.
Started at Thu Mar 31 10:48:46 2022
Terminated at Thu Mar 31 10:48:52 2022
Results reported at Thu Mar 31 10:48:52 2022

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
I_MPI_DEBUG=40 FI_PROVIDER=mlx mpirun -n 4 -ppn 2 ./hello
------------------------------------------------------------

Successfully completed.

Resource usage summary:

    CPU time :                                   6.00 sec.
    Max Memory :                                 392 MB
    Average Memory :                             1.00 MB
    Total Requested Memory :                     2000.00 MB
    Delta Memory :                               1608.00 MB
    Max Swap :                                   -
    Max Processes :                              1
    Max Threads :                                1
    Run time :                                   1 sec.
    Turnaround time :                            48 sec.

The output (if any) follows:

IPL WARN> Not all cpus are available, switch to I_MPI_PIN_ORDER=compact. (Total: 72 Available: 4)
IPL WARN> Not all cpus are available, switch to I_MPI_PIN_ORDER=compact. (Total: 72 Available: 4)
[0] MPI startup(): Intel(R) MPI Library, Version 2021.5  Build 20211102 (id: 9279b7d62)
[0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation.  All rights reserved.
[0] MPI startup(): library kind: release
[2] MPI startup(): shm segment size (1307 MB per rank) * (2 local ranks) = 2615 MB total
[0] MPI startup(): shm segment size (1307 MB per rank) * (2 local ranks) = 2615 MB total
[0] MPI startup(): libfabric version: 1.13.2rc1-impi
libfabric:54643:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:54643:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:54643:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:54643:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:54643:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:54643:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:54643:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:54643:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:54643:core:core:ofi_register_provider():474<info> registering provider: verbs (113.20)
libfabric:54643:core:core:ofi_register_provider():502<info> "verbs" filtered by provider include/exclude list, skipping
libfabric:54643:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:54643:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:54643:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ZE not supported
libfabric:54643:core:core:ofi_register_provider():474<info> registering provider: verbs (113.20)
libfabric:54643:core:core:ofi_register_provider():502<info> "verbs" filtered by provider include/exclude list, skipping
libfabric:54643:core:core:ofi_register_provider():474<info> registering provider: tcp (113.20)
libfabric:54643:core:core:ofi_register_provider():502<info> "tcp" filtered by provider include/exclude list, skipping
libfabric:54643:core:core:ofi_register_provider():474<info> registering provider: sockets (113.20)
libfabric:54643:core:core:ofi_register_provider():502<info> "sockets" filtered by provider include/exclude list, skipping
libfabric:54643:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:54643:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:54643:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ZE not supported
libfabric:54643:core:core:ofi_register_provider():474<info> registering provider: shm (113.20)
libfabric:54643:core:core:ofi_register_provider():502<info> "shm" filtered by provider include/exclude list, skipping
libfabric:54643:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:54643:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:54643:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:54643:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:54643:core:core:ofi_register_provider():474<info> registering provider: ofi_rxm (113.20)
libfabric:54643:psm3:core:fi_prov_ini():680<info> build options: VERSION=1101.0=11.1.0.0, HAVE_PSM3_src=1, PSM3_CUDA=0
libfabric:54643:core:core:ofi_register_provider():474<info> registering provider: psm3 (1101.0)
libfabric:54643:core:core:ofi_register_provider():502<info> "psm3" filtered by provider include/exclude list, skipping
libfabric:54643:core:core:ofi_register_provider():474<info> registering provider: mlx (1.4)
libfabric:54643:core:core:ofi_register_provider():474<info> registering provider: ofi_hook_noop (113.20)
libfabric:54643:core:core:fi_getinfo_():1138<info> Found provider with the highest priority mlx, must_use_util_prov = 0
libfabric:54643:core:core:fi_getinfo_():1138<info> Found provider with the highest priority mlx, must_use_util_prov = 0
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): detected mlx provider, set device name to "mlx"
libfabric:54643:core:core:fi_fabric_():1423<info> Opened fabric: mlx
[0] MPI startup(): max_ch4_vcis: 1, max_reg_eps 64, enable_sep 0, enable_shared_ctxs 0, do_av_insert 1
[0] MPI startup(): addrnamelen: 1024
[0] MPI startup(): File "/cluster/apps/nss/intel/oneapi/2022.1.2/mpi/2021.5.1/etc/tuning_skx_shm-ofi_mlx_10.dat" not found
[0] MPI startup(): Load tuning file: "/cluster/apps/nss/intel/oneapi/2022.1.2/mpi/2021.5.1/etc/tuning_skx_shm-ofi.dat"
Hello world from processor eu-a6-009-20, rank 1 out of 4 processors
Hello world from processor eu-a6-009-22, rank 3 out of 4 processors
[0] MPI startup(): Rank    Pid      Node name     Pin cpu
[0] MPI startup(): 0       54643    eu-a6-009-20  {16,52}
[0] MPI startup(): 1       54644    eu-a6-009-20  {17,53}
[0] MPI startup(): 2       38880    eu-a6-009-22  {0,36}
[0] MPI startup(): 3       38881    eu-a6-009-22  {1,37}
[0] MPI startup(): I_MPI_ROOT=/cluster/apps/nss/intel/oneapi/2022.1.2/mpi/2021.5.1
Hello world from processor eu-a6-009-22, rank 2 out of 4 processors
[0] MPI startup(): I_MPI_MPIRUN=mpirun
[0] MPI startup(): I_MPI_HYDRA_RMK=lsf
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_DEBUG=40
[0] MPI startup(): threading: mode: direct
[0] MPI startup(): threading: vcis: 1
[0] MPI startup(): threading: app_threads: 1
[0] MPI startup(): threading: runtime: generic
[0] MPI startup(): threading: is_threaded: 0
[0] MPI startup(): threading: async_progress: 0
[0] MPI startup(): threading: num_pools: 64
[0] MPI startup(): threading: lock_level: global
[0] MPI startup(): threading: enable_sep: 0
[0] MPI startup(): threading: direct_recv: 1
[0] MPI startup(): threading: zero_op_flags: 1
[0] MPI startup(): threading: num_am_buffers: 1
[0] MPI startup(): threading: library is built with per-vci thread granularity
Hello world from processor eu-a6-009-20, rank 0 out of 4 processors
[sfux@eu-login-16 intelmpi]$

I submitted the last job with 252 instead of 256 cores, because the nodes have 36 cores each and I want to use whole compute nodes (7 nodes × 36 cores = 252):

[sfux@eu-login-16 intelmpi]$ bsub -n 252 -R "span[ptile=36] select[model==XeonGold_6150] rusage[mem=500]" I_MPI_DEBUG=40 FI_PROVIDER=mlx mpirun -n 252 -ppn 36 ./hello
Generic job.
Job <211966180> is submitted to queue <hpc.4h>.
[sfux@eu-login-16 intelmpi]$
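
For reference, the choice of 252 ranks can be reproduced with a small shell sketch. The helper name `whole_node_ranks` is made up for illustration, and the memory figure assumes that `rusage[mem=500]` is counted per slot, which seems consistent with the "Total Requested Memory" value in the LSF report but may be configured differently on other clusters:

```shell
#!/bin/sh
# Hypothetical helper (not part of LSF or Intel MPI):
# largest rank count <= $1 that fills whole nodes of $2 cores each.
whole_node_ranks() {
    echo $(( ($1 / $2) * $2 ))
}

cores_per_node=36   # from the post: 36 cores per host
wanted=256          # the core count originally aimed for
mem_per_slot=500    # MB, from rusage[mem=500]; per-slot accounting is an assumption

ranks=$(whole_node_ranks "$wanted" "$cores_per_node")
nodes=$(( ranks / cores_per_node ))
total_mem=$(( ranks * mem_per_slot ))

# prints: ranks=252 nodes=7 total_mem_mb=126000
echo "ranks=$ranks nodes=$nodes total_mem_mb=$total_mem"
```

The resulting values correspond to `-n 252` and `ptile=36` / `-ppn 36` in the submission above, spread over 7 hosts.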

Please find the corresponding logs below:

[sfux@eu-login-16 intelmpi]$ cat lsf.o211966180
Sender: LSF System <lsfadmin@eu-a6-009-23>
Subject: Job 211966180: <I_MPI_DEBUG=40 FI_PROVIDER=mlx mpirun -n 252 -ppn 36 ./hello> in cluster <euler> Done

Job <I_MPI_DEBUG=40 FI_PROVIDER=mlx mpirun -n 252 -ppn 36 ./hello> was submitted from host <eu-login-16> by user <sfux> in cluster <euler> at Thu Mar 31 10:51:04 2022
Job was executed on host(s) <36*eu-a6-009-23>, in queue <hpc.4h>, as user <sfux> in cluster <euler> at Thu Mar 31 10:51:15 2022
                            <36*eu-a6-006-01>
                            <36*eu-a6-006-03>
                            <36*eu-a6-006-19>
                            <36*eu-a6-005-17>
                            <36*eu-a6-005-01>
                            <36*eu-a6-005-21>
</cluster/home/sfux> was used as the home directory.
</cluster/home/sfux/test/intelmpi> was used as the working directory.
Started at Thu Mar 31 10:51:15 2022
Terminated at Thu Mar 31 10:51:26 2022
Results reported at Thu Mar 31 10:51:26 2022

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
I_MPI_DEBUG=40 FI_PROVIDER=mlx mpirun -n 252 -ppn 36 ./hello
------------------------------------------------------------

Successfully completed.

Resource usage summary:

    CPU time :                                   1659.00 sec.
    Max Memory :                                 33622 MB
    Average Memory :                             3322.00 MB
    Total Requested Memory :                     126000.00 MB
    Delta Memory :                               92378.00 MB
    Max Swap :                                   -
    Max Processes :                              43
    Max Threads :                                81
    Run time :                                   18 sec.
    Turnaround time :                            22 sec.

The output (if any) follows:

[0] MPI startup(): Intel(R) MPI Library, Version 2021.5  Build 20211102 (id: 9279b7d62)
[0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation.  All rights reserved.
[0] MPI startup(): library kind: release
[72] MPI startup(): shm segment size (125 MB per rank) * (36 local ranks) = 4512 MB total
[144] MPI startup(): shm segment size (125 MB per rank) * (36 local ranks) = 4512 MB total
[180] MPI startup(): shm segment size (125 MB per rank) * (36 local ranks) = 4512 MB total
[0] MPI startup(): shm segment size (125 MB per rank) * (36 local ranks) = 4512 MB total
[36] MPI startup(): shm segment size (125 MB per rank) * (36 local ranks) = 4512 MB total
[108] MPI startup(): shm segment size (125 MB per rank) * (36 local ranks) = 4512 MB total
[216] MPI startup(): shm segment size (125 MB per rank) * (36 local ranks) = 4512 MB total
[0] MPI startup(): libfabric version: 1.13.2rc1-impi
libfabric:12737:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:12737:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:12737:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:12737:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:12737:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:12737:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:12737:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:12737:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:12737:core:core:ofi_register_provider():474<info> registering provider: verbs (113.20)
libfabric:12737:core:core:ofi_register_provider():502<info> "verbs" filtered by provider include/exclude list, skipping
libfabric:12737:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:12737:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:12737:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ZE not supported
libfabric:12737:core:core:ofi_register_provider():474<info> registering provider: verbs (113.20)
libfabric:12737:core:core:ofi_register_provider():502<info> "verbs" filtered by provider include/exclude list, skipping
libfabric:12737:core:core:ofi_register_provider():474<info> registering provider: tcp (113.20)
libfabric:12737:core:core:ofi_register_provider():502<info> "tcp" filtered by provider include/exclude list, skipping
libfabric:12737:core:core:ofi_register_provider():474<info> registering provider: sockets (113.20)
libfabric:12737:core:core:ofi_register_provider():502<info> "sockets" filtered by provider include/exclude list, skipping
libfabric:12737:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:12737:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:12737:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ZE not supported
libfabric:12737:core:core:ofi_register_provider():474<info> registering provider: shm (113.20)
libfabric:12737:core:core:ofi_register_provider():502<info> "shm" filtered by provider include/exclude list, skipping
libfabric:12737:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:12737:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:12737:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:12737:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:12737:core:core:ofi_register_provider():474<info> registering provider: ofi_rxm (113.20)
libfabric:12737:psm3:core:fi_prov_ini():680<info> build options: VERSION=1101.0=11.1.0.0, HAVE_PSM3_src=1, PSM3_CUDA=0
libfabric:12737:core:core:ofi_register_provider():474<info> registering provider: psm3 (1101.0)
libfabric:12737:core:core:ofi_register_provider():502<info> "psm3" filtered by provider include/exclude list, skipping
libfabric:12737:core:core:ofi_register_provider():474<info> registering provider: mlx (1.4)
libfabric:12737:core:core:ofi_register_provider():474<info> registering provider: ofi_hook_noop (113.20)
libfabric:12737:core:core:fi_getinfo_():1138<info> Found provider with the highest priority mlx, must_use_util_prov = 0
libfabric:12737:core:core:fi_getinfo_():1138<info> Found provider with the highest priority mlx, must_use_util_prov = 0
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): detected mlx provider, set device name to "mlx"
libfabric:12737:core:core:fi_fabric_():1423<info> Opened fabric: mlx
[0] MPI startup(): max_ch4_vcis: 1, max_reg_eps 64, enable_sep 0, enable_shared_ctxs 0, do_av_insert 1
[0] MPI startup(): addrnamelen: 1024
Hello world from processor eu-a6-005-17, rank 161 out of 252 processors
Hello world from processor eu-a6-005-17, rank 167 out of 252 processors
Hello world from processor eu-a6-005-17, rank 175 out of 252 processors
Hello world from processor eu-a6-005-17, rank 177 out of 252 processors
Hello world from processor eu-a6-005-17, rank 179 out of 252 processors
Hello world from processor eu-a6-005-17, rank 147 out of 252 processors
Hello world from processor eu-a6-005-17, rank 149 out of 252 processors
Hello world from processor eu-a6-005-17, rank 151 out of 252 processors
Hello world from processor eu-a6-005-17, rank 153 out of 252 processors
Hello world from processor eu-a6-005-17, rank 155 out of 252 processors
Hello world from processor eu-a6-005-17, rank 157 out of 252 processors
Hello world from processor eu-a6-005-17, rank 159 out of 252 processors
Hello world from processor eu-a6-005-17, rank 163 out of 252 processors
Hello world from processor eu-a6-005-17, rank 165 out of 252 processors
Hello world from processor eu-a6-005-17, rank 169 out of 252 processors
Hello world from processor eu-a6-005-17, rank 173 out of 252 processors
Hello world from processor eu-a6-005-17, rank 145 out of 252 processors
Hello world from processor eu-a6-005-17, rank 171 out of 252 processors
Hello world from processor eu-a6-005-17, rank 158 out of 252 processors
Hello world from processor eu-a6-005-17, rank 150 out of 252 processors
Hello world from processor eu-a6-005-17, rank 170 out of 252 processors
Hello world from processor eu-a6-005-17, rank 174 out of 252 processors
Hello world from processor eu-a6-005-17, rank 162 out of 252 processors
Hello world from processor eu-a6-005-17, rank 178 out of 252 processors
Hello world from processor eu-a6-005-17, rank 164 out of 252 processors
Hello world from processor eu-a6-005-17, rank 166 out of 252 processors
Hello world from processor eu-a6-005-17, rank 148 out of 252 processors
Hello world from processor eu-a6-005-17, rank 172 out of 252 processors
Hello world from processor eu-a6-005-17, rank 168 out of 252 processors
Hello world from processor eu-a6-005-17, rank 154 out of 252 processors
Hello world from processor eu-a6-005-17, rank 146 out of 252 processors
Hello world from processor eu-a6-005-17, rank 156 out of 252 processors
Hello world from processor eu-a6-005-17, rank 152 out of 252 processors
Hello world from processor eu-a6-006-01, rank 43 out of 252 processors
Hello world from processor eu-a6-006-01, rank 45 out of 252 processors
Hello world from processor eu-a6-006-01, rank 49 out of 252 processors
Hello world from processor eu-a6-006-01, rank 53 out of 252 processors
Hello world from processor eu-a6-006-01, rank 55 out of 252 processors
Hello world from processor eu-a6-006-01, rank 59 out of 252 processors
Hello world from processor eu-a6-006-01, rank 61 out of 252 processors
Hello world from processor eu-a6-006-01, rank 65 out of 252 processors
Hello world from processor eu-a6-006-01, rank 67 out of 252 processors
Hello world from processor eu-a6-006-01, rank 69 out of 252 processors
Hello world from processor eu-a6-006-01, rank 39 out of 252 processors
Hello world from processor eu-a6-006-01, rank 41 out of 252 processors
Hello world from processor eu-a6-006-01, rank 47 out of 252 processors
Hello world from processor eu-a6-006-01, rank 51 out of 252 processors
Hello world from processor eu-a6-006-01, rank 57 out of 252 processors
Hello world from processor eu-a6-006-01, rank 63 out of 252 processors
Hello world from processor eu-a6-006-01, rank 71 out of 252 processors
Hello world from processor eu-a6-006-19, rank 109 out of 252 processors
Hello world from processor eu-a6-006-19, rank 111 out of 252 processors
Hello world from processor eu-a6-006-19, rank 123 out of 252 processors
Hello world from processor eu-a6-006-19, rank 127 out of 252 processors
Hello world from processor eu-a6-006-19, rank 129 out of 252 processors
Hello world from processor eu-a6-006-19, rank 131 out of 252 processors
Hello world from processor eu-a6-006-19, rank 133 out of 252 processors
Hello world from processor eu-a6-006-19, rank 135 out of 252 processors
Hello world from processor eu-a6-006-19, rank 137 out of 252 processors
Hello world from processor eu-a6-006-19, rank 139 out of 252 processors
Hello world from processor eu-a6-006-19, rank 141 out of 252 processors
Hello world from processor eu-a6-006-19, rank 143 out of 252 processors
Hello world from processor eu-a6-006-19, rank 113 out of 252 processors
Hello world from processor eu-a6-006-19, rank 115 out of 252 processors
Hello world from processor eu-a6-006-19, rank 117 out of 252 processors
Hello world from processor eu-a6-006-19, rank 119 out of 252 processors
Hello world from processor eu-a6-006-19, rank 121 out of 252 processors
Hello world from processor eu-a6-006-19, rank 125 out of 252 processors
Hello world from processor eu-a6-006-01, rank 37 out of 252 processors
Hello world from processor eu-a6-006-01, rank 50 out of 252 processors
Hello world from processor eu-a6-006-01, rank 42 out of 252 processors
Hello world from processor eu-a6-006-01, rank 70 out of 252 processors
Hello world from processor eu-a6-006-01, rank 62 out of 252 processors
Hello world from processor eu-a6-006-19, rank 126 out of 252 processors
Hello world from processor eu-a6-006-19, rank 124 out of 252 processors
Hello world from processor eu-a6-006-01, rank 58 out of 252 processors
Hello world from processor eu-a6-006-19, rank 114 out of 252 processors
Hello world from processor eu-a6-006-01, rank 60 out of 252 processors
Hello world from processor eu-a6-006-01, rank 68 out of 252 processors
Hello world from processor eu-a6-005-17, rank 144 out of 252 processors
Hello world from processor eu-a6-006-01, rank 38 out of 252 processors
Hello world from processor eu-a6-006-01, rank 36 out of 252 processors
Hello world from processor eu-a6-006-01, rank 66 out of 252 processors
Hello world from processor eu-a6-006-01, rank 54 out of 252 processors
Hello world from processor eu-a6-006-01, rank 52 out of 252 processors
Hello world from processor eu-a6-006-19, rank 130 out of 252 processors
Hello world from processor eu-a6-006-19, rank 118 out of 252 processors
Hello world from processor eu-a6-006-19, rank 132 out of 252 processors
Hello world from processor eu-a6-006-19, rank 134 out of 252 processors
Hello world from processor eu-a6-006-19, rank 110 out of 252 processors
Hello world from processor eu-a6-006-19, rank 138 out of 252 processors
Hello world from processor eu-a6-006-19, rank 122 out of 252 processors
Hello world from processor eu-a6-006-19, rank 142 out of 252 processors
Hello world from processor eu-a6-006-19, rank 136 out of 252 processors
Hello world from processor eu-a6-006-19, rank 140 out of 252 processors
Hello world from processor eu-a6-006-19, rank 120 out of 252 processors
Hello world from processor eu-a6-006-01, rank 46 out of 252 processors
Hello world from processor eu-a6-006-01, rank 44 out of 252 processors
Hello world from processor eu-a6-006-19, rank 116 out of 252 processors
Hello world from processor eu-a6-006-19, rank 112 out of 252 processors
Hello world from processor eu-a6-006-01, rank 56 out of 252 processors
Hello world from processor eu-a6-005-21, rank 223 out of 252 processors
Hello world from processor eu-a6-005-21, rank 235 out of 252 processors
Hello world from processor eu-a6-005-21, rank 247 out of 252 processors
Hello world from processor eu-a6-005-21, rank 231 out of 252 processors
Hello world from processor eu-a6-005-21, rank 227 out of 252 processors
Hello world from processor eu-a6-005-21, rank 245 out of 252 processors
Hello world from processor eu-a6-005-21, rank 241 out of 252 processors
Hello world from processor eu-a6-005-21, rank 251 out of 252 processors
Hello world from processor eu-a6-005-21, rank 225 out of 252 processors
Hello world from processor eu-a6-005-21, rank 219 out of 252 processors
Hello world from processor eu-a6-005-21, rank 233 out of 252 processors
Hello world from processor eu-a6-005-21, rank 243 out of 252 processors
Hello world from processor eu-a6-005-21, rank 221 out of 252 processors
Hello world from processor eu-a6-005-21, rank 249 out of 252 processors
Hello world from processor eu-a6-005-21, rank 237 out of 252 processors
Hello world from processor eu-a6-005-21, rank 239 out of 252 processors
Hello world from processor eu-a6-005-21, rank 229 out of 252 processors
Hello world from processor eu-a6-005-21, rank 217 out of 252 processors
Hello world from processor eu-a6-005-01, rank 195 out of 252 processors
Hello world from processor eu-a6-005-01, rank 197 out of 252 processors
Hello world from processor eu-a6-005-01, rank 201 out of 252 processors
Hello world from processor eu-a6-005-01, rank 213 out of 252 processors
Hello world from processor eu-a6-005-01, rank 183 out of 252 processors
Hello world from processor eu-a6-005-01, rank 185 out of 252 processors
Hello world from processor eu-a6-005-01, rank 187 out of 252 processors
Hello world from processor eu-a6-005-01, rank 189 out of 252 processors
Hello world from processor eu-a6-005-01, rank 191 out of 252 processors
Hello world from processor eu-a6-005-01, rank 193 out of 252 processors
Hello world from processor eu-a6-005-01, rank 199 out of 252 processors
Hello world from processor eu-a6-005-01, rank 203 out of 252 processors
Hello world from processor eu-a6-005-01, rank 205 out of 252 processors
Hello world from processor eu-a6-005-01, rank 207 out of 252 processors
Hello world from processor eu-a6-005-01, rank 209 out of 252 processors
Hello world from processor eu-a6-005-01, rank 211 out of 252 processors
Hello world from processor eu-a6-005-01, rank 215 out of 252 processors
Hello world from processor eu-a6-005-01, rank 181 out of 252 processors
Hello world from processor eu-a6-005-21, rank 250 out of 252 processors
Hello world from processor eu-a6-005-21, rank 238 out of 252 processors
Hello world from processor eu-a6-005-21, rank 246 out of 252 processors
Hello world from processor eu-a6-005-21, rank 244 out of 252 processors
Hello world from processor eu-a6-005-21, rank 248 out of 252 processors
Hello world from processor eu-a6-005-21, rank 234 out of 252 processors
Hello world from processor eu-a6-005-21, rank 218 out of 252 processors
Hello world from processor eu-a6-005-21, rank 222 out of 252 processors
Hello world from processor eu-a6-005-21, rank 240 out of 252 processors
Hello world from processor eu-a6-005-21, rank 242 out of 252 processors
Hello world from processor eu-a6-005-21, rank 220 out of 252 processors
Hello world from processor eu-a6-005-21, rank 226 out of 252 processors
Hello world from processor eu-a6-005-21, rank 232 out of 252 processors
Hello world from processor eu-a6-005-21, rank 236 out of 252 processors
Hello world from processor eu-a6-005-01, rank 186 out of 252 processors
Hello world from processor eu-a6-005-01, rank 210 out of 252 processors
Hello world from processor eu-a6-005-01, rank 182 out of 252 processors
Hello world from processor eu-a6-005-01, rank 180 out of 252 processors
Hello world from processor eu-a6-005-21, rank 230 out of 252 processors
Hello world from processor eu-a6-005-21, rank 224 out of 252 processors
Hello world from processor eu-a6-005-21, rank 228 out of 252 processors
Hello world from processor eu-a6-005-21, rank 216 out of 252 processors
Hello world from processor eu-a6-005-01, rank 194 out of 252 processors
Hello world from processor eu-a6-005-01, rank 204 out of 252 processors
Hello world from processor eu-a6-005-01, rank 206 out of 252 processors
Hello world from processor eu-a6-005-01, rank 202 out of 252 processors
Hello world from processor eu-a6-005-01, rank 190 out of 252 processors
Hello world from processor eu-a6-005-01, rank 188 out of 252 processors
Hello world from processor eu-a6-005-17, rank 176 out of 252 processors
Hello world from processor eu-a6-005-01, rank 184 out of 252 processors
Hello world from processor eu-a6-005-17, rank 160 out of 252 processors
Hello world from processor eu-a6-005-01, rank 212 out of 252 processors
Hello world from processor eu-a6-005-01, rank 214 out of 252 processors
Hello world from processor eu-a6-005-01, rank 208 out of 252 processors
Hello world from processor eu-a6-005-01, rank 200 out of 252 processors
[0] MPI startup(): File "/cluster/apps/nss/intel/oneapi/2022.1.2/mpi/2021.5.1/etc/tuning_skx_shm-ofi_mlx_10.dat" not found
[0] MPI startup(): Load tuning file: "/cluster/apps/nss/intel/oneapi/2022.1.2/mpi/2021.5.1/etc/tuning_skx_shm-ofi.dat"
Hello world from processor eu-a6-005-01, rank 196 out of 252 processors
Hello world from processor eu-a6-005-01, rank 198 out of 252 processors
Hello world from processor eu-a6-005-01, rank 192 out of 252 processors
Hello world from processor eu-a6-006-19, rank 128 out of 252 processors
Hello world from processor eu-a6-009-23, rank 1 out of 252 processors
Hello world from processor eu-a6-009-23, rank 7 out of 252 processors
Hello world from processor eu-a6-009-23, rank 9 out of 252 processors
Hello world from processor eu-a6-009-23, rank 13 out of 252 processors
Hello world from processor eu-a6-009-23, rank 15 out of 252 processors
Hello world from processor eu-a6-009-23, rank 17 out of 252 processors
Hello world from processor eu-a6-009-23, rank 21 out of 252 processors
Hello world from processor eu-a6-009-23, rank 23 out of 252 processors
Hello world from processor eu-a6-009-23, rank 25 out of 252 processors
Hello world from processor eu-a6-009-23, rank 27 out of 252 processors
Hello world from processor eu-a6-009-23, rank 29 out of 252 processors
Hello world from processor eu-a6-009-23, rank 31 out of 252 processors
Hello world from processor eu-a6-009-23, rank 33 out of 252 processors
Hello world from processor eu-a6-009-23, rank 35 out of 252 processors
Hello world from processor eu-a6-009-23, rank 3 out of 252 processors
Hello world from processor eu-a6-009-23, rank 5 out of 252 processors
Hello world from processor eu-a6-009-23, rank 11 out of 252 processors
Hello world from processor eu-a6-009-23, rank 19 out of 252 processors
Hello world from processor eu-a6-006-03, rank 77 out of 252 processors
Hello world from processor eu-a6-006-03, rank 87 out of 252 processors
Hello world from processor eu-a6-006-03, rank 75 out of 252 processors
Hello world from processor eu-a6-006-03, rank 79 out of 252 processors
Hello world from processor eu-a6-006-03, rank 81 out of 252 processors
Hello world from processor eu-a6-006-03, rank 83 out of 252 processors
Hello world from processor eu-a6-006-03, rank 89 out of 252 processors
Hello world from processor eu-a6-006-03, rank 91 out of 252 processors
Hello world from processor eu-a6-006-03, rank 93 out of 252 processors
Hello world from processor eu-a6-006-03, rank 95 out of 252 processors
Hello world from processor eu-a6-006-03, rank 99 out of 252 processors
Hello world from processor eu-a6-006-03, rank 101 out of 252 processors
Hello world from processor eu-a6-006-03, rank 103 out of 252 processors
Hello world from processor eu-a6-006-03, rank 105 out of 252 processors
Hello world from processor eu-a6-006-03, rank 107 out of 252 processors
Hello world from processor eu-a6-006-03, rank 73 out of 252 processors
Hello world from processor eu-a6-006-03, rank 85 out of 252 processors
Hello world from processor eu-a6-006-03, rank 97 out of 252 processors
Hello world from processor eu-a6-009-23, rank 6 out of 252 processors
Hello world from processor eu-a6-006-19, rank 108 out of 252 processors
Hello world from processor eu-a6-006-03, rank 90 out of 252 processors
Hello world from processor eu-a6-006-03, rank 78 out of 252 processors
Hello world from processor eu-a6-009-23, rank 34 out of 252 processors
Hello world from processor eu-a6-006-03, rank 104 out of 252 processors
Hello world from processor eu-a6-006-03, rank 106 out of 252 processors
Hello world from processor eu-a6-006-03, rank 98 out of 252 processors
Hello world from processor eu-a6-006-01, rank 40 out of 252 processors
Hello world from processor eu-a6-006-01, rank 48 out of 252 processors
Hello world from processor eu-a6-006-03, rank 76 out of 252 processors
Hello world from processor eu-a6-009-23, rank 18 out of 252 processors
Hello world from processor eu-a6-009-23, rank 10 out of 252 processors
Hello world from processor eu-a6-006-03, rank 86 out of 252 processors
Hello world from processor eu-a6-006-03, rank 100 out of 252 processors
Hello world from processor eu-a6-006-03, rank 102 out of 252 processors
Hello world from processor eu-a6-006-03, rank 96 out of 252 processors
Hello world from processor eu-a6-009-23, rank 32 out of 252 processors
Hello world from processor eu-a6-006-03, rank 74 out of 252 processors
Hello world from processor eu-a6-006-03, rank 72 out of 252 processors
Hello world from processor eu-a6-006-03, rank 94 out of 252 processors
Hello world from processor eu-a6-006-03, rank 92 out of 252 processors
Hello world from processor eu-a6-006-03, rank 88 out of 252 processors
Hello world from processor eu-a6-009-23, rank 22 out of 252 processors
Hello world from processor eu-a6-009-23, rank 20 out of 252 processors
Hello world from processor eu-a6-006-03, rank 82 out of 252 processors
Hello world from processor eu-a6-009-23, rank 26 out of 252 processors
Hello world from processor eu-a6-006-03, rank 84 out of 252 processors
Hello world from processor eu-a6-006-03, rank 80 out of 252 processors
Hello world from processor eu-a6-006-01, rank 64 out of 252 processors
Hello world from processor eu-a6-009-23, rank 30 out of 252 processors
Hello world from processor eu-a6-009-23, rank 28 out of 252 processors
Hello world from processor eu-a6-009-23, rank 24 out of 252 processors
Hello world from processor eu-a6-009-23, rank 16 out of 252 processors
Hello world from processor eu-a6-009-23, rank 4 out of 252 processors
Hello world from processor eu-a6-009-23, rank 2 out of 252 processors
Hello world from processor eu-a6-009-23, rank 14 out of 252 processors
[0] MPI startup(): Rank    Pid      Node name     Pin cpu
[0] MPI startup(): 0       12737    eu-a6-009-23  {0,36}
Hello world from processor eu-a6-009-23, rank 8 out of 252 processors
Hello world from processor eu-a6-009-23, rank 12 out of 252 processors
[0] MPI startup(): 1       12738    eu-a6-009-23  {1,37}
[0] MPI startup(): 2       12739    eu-a6-009-23  {2,38}
[0] MPI startup(): 3       12740    eu-a6-009-23  {3,39}
[0] MPI startup(): 4       12741    eu-a6-009-23  {4,40}
[0] MPI startup(): 5       12742    eu-a6-009-23  {5,41}
[0] MPI startup(): 6       12743    eu-a6-009-23  {6,42}
[0] MPI startup(): 7       12744    eu-a6-009-23  {7,43}
[0] MPI startup(): 8       12745    eu-a6-009-23  {8,44}
[0] MPI startup(): 9       12746    eu-a6-009-23  {9,45}
[0] MPI startup(): 10      12747    eu-a6-009-23  {10,46}
[0] MPI startup(): 11      12748    eu-a6-009-23  {11,47}
[0] MPI startup(): 12      12749    eu-a6-009-23  {12,48}
[0] MPI startup(): 13      12750    eu-a6-009-23  {13,49}
[0] MPI startup(): 14      12751    eu-a6-009-23  {14,50}
[0] MPI startup(): 15      12752    eu-a6-009-23  {15,51}
[0] MPI startup(): 16      12753    eu-a6-009-23  {16,52}
[0] MPI startup(): 17      12754    eu-a6-009-23  {17,53}
[0] MPI startup(): 18      12755    eu-a6-009-23  {18,54}
[0] MPI startup(): 19      12756    eu-a6-009-23  {19,55}
[0] MPI startup(): 20      12757    eu-a6-009-23  {20,56}
[0] MPI startup(): 21      12758    eu-a6-009-23  {21,57}
[0] MPI startup(): 22      12759    eu-a6-009-23  {22,58}
[0] MPI startup(): 23      12760    eu-a6-009-23  {23,59}
[0] MPI startup(): 24      12761    eu-a6-009-23  {24,60}
[0] MPI startup(): 25      12762    eu-a6-009-23  {25,61}
[0] MPI startup(): 26      12763    eu-a6-009-23  {26,62}
[0] MPI startup(): 27      12764    eu-a6-009-23  {27,63}
[0] MPI startup(): 28      12765    eu-a6-009-23  {28,64}
[0] MPI startup(): 29      12766    eu-a6-009-23  {29,65}
[0] MPI startup(): 30      12767    eu-a6-009-23  {30,66}
[0] MPI startup(): 31      12768    eu-a6-009-23  {31,67}
[0] MPI startup(): 32      12769    eu-a6-009-23  {32,68}
[0] MPI startup(): 33      12770    eu-a6-009-23  {33,69}
[0] MPI startup(): 34      12771    eu-a6-009-23  {34,70}
[0] MPI startup(): 35      12772    eu-a6-009-23  {35,71}
[0] MPI startup(): 36      73584    eu-a6-006-01  {0,36}
[0] MPI startup(): 37      73585    eu-a6-006-01  {1,37}
[0] MPI startup(): 38      73586    eu-a6-006-01  {2,38}
[0] MPI startup(): 39      73587    eu-a6-006-01  {3,39}
[0] MPI startup(): 40      73588    eu-a6-006-01  {4,40}
[0] MPI startup(): 41      73589    eu-a6-006-01  {5,41}
[0] MPI startup(): 42      73590    eu-a6-006-01  {6,42}
[0] MPI startup(): 43      73591    eu-a6-006-01  {7,43}
[0] MPI startup(): 44      73592    eu-a6-006-01  {8,44}
[0] MPI startup(): 45      73593    eu-a6-006-01  {9,45}
[0] MPI startup(): 46      73594    eu-a6-006-01  {10,46}
[0] MPI startup(): 47      73595    eu-a6-006-01  {11,47}
[0] MPI startup(): 48      73596    eu-a6-006-01  {12,48}
[0] MPI startup(): 49      73597    eu-a6-006-01  {13,49}
[0] MPI startup(): 50      73598    eu-a6-006-01  {14,50}
[0] MPI startup(): 51      73599    eu-a6-006-01  {15,51}
[0] MPI startup(): 52      73600    eu-a6-006-01  {16,52}
[0] MPI startup(): 53      73601    eu-a6-006-01  {17,53}
[0] MPI startup(): 54      73602    eu-a6-006-01  {18,54}
[0] MPI startup(): 55      73603    eu-a6-006-01  {19,55}
[0] MPI startup(): 56      73604    eu-a6-006-01  {20,56}
[0] MPI startup(): 57      73605    eu-a6-006-01  {21,57}
[0] MPI startup(): 58      73606    eu-a6-006-01  {22,58}
[0] MPI startup(): 59      73607    eu-a6-006-01  {23,59}
[0] MPI startup(): 60      73608    eu-a6-006-01  {24,60}
[0] MPI startup(): 61      73609    eu-a6-006-01  {25,61}
[0] MPI startup(): 62      73611    eu-a6-006-01  {26,62}
[0] MPI startup(): 63      73612    eu-a6-006-01  {27,63}
[0] MPI startup(): 64      73613    eu-a6-006-01  {28,64}
[0] MPI startup(): 65      73614    eu-a6-006-01  {29,65}
[0] MPI startup(): 66      73615    eu-a6-006-01  {30,66}
[0] MPI startup(): 67      73616    eu-a6-006-01  {31,67}
[0] MPI startup(): 68      73617    eu-a6-006-01  {32,68}
[0] MPI startup(): 69      73618    eu-a6-006-01  {33,69}
[0] MPI startup(): 70      73619    eu-a6-006-01  {34,70}
[0] MPI startup(): 71      73620    eu-a6-006-01  {35,71}
[0] MPI startup(): 72      54076    eu-a6-006-03  {0,36}
[0] MPI startup(): 73      54077    eu-a6-006-03  {1,37}
[0] MPI startup(): 74      54078    eu-a6-006-03  {2,38}
[0] MPI startup(): 75      54079    eu-a6-006-03  {3,39}
[0] MPI startup(): 76      54080    eu-a6-006-03  {4,40}
[0] MPI startup(): 77      54081    eu-a6-006-03  {5,41}
[0] MPI startup(): 78      54082    eu-a6-006-03  {6,42}
[0] MPI startup(): 79      54083    eu-a6-006-03  {7,43}
[0] MPI startup(): 80      54084    eu-a6-006-03  {8,44}
[0] MPI startup(): 81      54085    eu-a6-006-03  {9,45}
[0] MPI startup(): 82      54086    eu-a6-006-03  {10,46}
[0] MPI startup(): 83      54087    eu-a6-006-03  {11,47}
[0] MPI startup(): 84      54088    eu-a6-006-03  {12,48}
[0] MPI startup(): 85      54089    eu-a6-006-03  {13,49}
[0] MPI startup(): 86      54090    eu-a6-006-03  {14,50}
[0] MPI startup(): 87      54091    eu-a6-006-03  {15,51}
[0] MPI startup(): 88      54092    eu-a6-006-03  {16,52}
[0] MPI startup(): 89      54093    eu-a6-006-03  {17,53}
[0] MPI startup(): 90      54094    eu-a6-006-03  {18,54}
[0] MPI startup(): 91      54095    eu-a6-006-03  {19,55}
[0] MPI startup(): 92      54096    eu-a6-006-03  {20,56}
[0] MPI startup(): 93      54097    eu-a6-006-03  {21,57}
[0] MPI startup(): 94      54099    eu-a6-006-03  {22,58}
[0] MPI startup(): 95      54100    eu-a6-006-03  {23,59}
[0] MPI startup(): 96      54101    eu-a6-006-03  {24,60}
[0] MPI startup(): 97      54103    eu-a6-006-03  {25,61}
[0] MPI startup(): 98      54104    eu-a6-006-03  {26,62}
[0] MPI startup(): 99      54105    eu-a6-006-03  {27,63}
[0] MPI startup(): 100     54106    eu-a6-006-03  {28,64}
[0] MPI startup(): 101     54107    eu-a6-006-03  {29,65}
[0] MPI startup(): 102     54108    eu-a6-006-03  {30,66}
[0] MPI startup(): 103     54109    eu-a6-006-03  {31,67}
[0] MPI startup(): 104     54110    eu-a6-006-03  {32,68}
[0] MPI startup(): 105     54111    eu-a6-006-03  {33,69}
[0] MPI startup(): 106     54112    eu-a6-006-03  {34,70}
[0] MPI startup(): 107     54113    eu-a6-006-03  {35,71}
[0] MPI startup(): 108     62164    eu-a6-006-19  {0,36}
[0] MPI startup(): 109     62165    eu-a6-006-19  {1,37}
[0] MPI startup(): 110     62166    eu-a6-006-19  {2,38}
[0] MPI startup(): 111     62167    eu-a6-006-19  {3,39}
[0] MPI startup(): 112     62168    eu-a6-006-19  {4,40}
[0] MPI startup(): 113     62169    eu-a6-006-19  {5,41}
[0] MPI startup(): 114     62170    eu-a6-006-19  {6,42}
[0] MPI startup(): 115     62171    eu-a6-006-19  {7,43}
[0] MPI startup(): 116     62172    eu-a6-006-19  {8,44}
[0] MPI startup(): 117     62173    eu-a6-006-19  {9,45}
[0] MPI startup(): 118     62174    eu-a6-006-19  {10,46}
[0] MPI startup(): 119     62175    eu-a6-006-19  {11,47}
[0] MPI startup(): 120     62176    eu-a6-006-19  {12,48}
[0] MPI startup(): 121     62177    eu-a6-006-19  {13,49}
[0] MPI startup(): 122     62178    eu-a6-006-19  {14,50}
[0] MPI startup(): 123     62179    eu-a6-006-19  {15,51}
[0] MPI startup(): 124     62180    eu-a6-006-19  {16,52}
[0] MPI startup(): 125     62181    eu-a6-006-19  {17,53}
[0] MPI startup(): 126     62182    eu-a6-006-19  {18,54}
[0] MPI startup(): 127     62183    eu-a6-006-19  {19,55}
[0] MPI startup(): 128     62184    eu-a6-006-19  {20,56}
[0] MPI startup(): 129     62185    eu-a6-006-19  {21,57}
[0] MPI startup(): 130     62186    eu-a6-006-19  {22,58}
[0] MPI startup(): 131     62187    eu-a6-006-19  {23,59}
[0] MPI startup(): 132     62188    eu-a6-006-19  {24,60}
[0] MPI startup(): 133     62189    eu-a6-006-19  {25,61}
[0] MPI startup(): 134     62190    eu-a6-006-19  {26,62}
[0] MPI startup(): 135     62191    eu-a6-006-19  {27,63}
[0] MPI startup(): 136     62192    eu-a6-006-19  {28,64}
[0] MPI startup(): 137     62194    eu-a6-006-19  {29,65}
[0] MPI startup(): 138     62195    eu-a6-006-19  {30,66}
[0] MPI startup(): 139     62196    eu-a6-006-19  {31,67}
[0] MPI startup(): 140     62197    eu-a6-006-19  {32,68}
[0] MPI startup(): 141     62198    eu-a6-006-19  {33,69}
[0] MPI startup(): 142     62199    eu-a6-006-19  {34,70}
[0] MPI startup(): 143     62200    eu-a6-006-19  {35,71}
[0] MPI startup(): 144     24096    eu-a6-005-17  {0,36}
[0] MPI startup(): 145     24097    eu-a6-005-17  {1,37}
[0] MPI startup(): 146     24098    eu-a6-005-17  {2,38}
[0] MPI startup(): 147     24099    eu-a6-005-17  {3,39}
[0] MPI startup(): 148     24100    eu-a6-005-17  {4,40}
[0] MPI startup(): 149     24101    eu-a6-005-17  {5,41}
[0] MPI startup(): 150     24102    eu-a6-005-17  {6,42}
[0] MPI startup(): 151     24103    eu-a6-005-17  {7,43}
[0] MPI startup(): 152     24104    eu-a6-005-17  {8,44}
[0] MPI startup(): 153     24105    eu-a6-005-17  {9,45}
[0] MPI startup(): 154     24106    eu-a6-005-17  {10,46}
[0] MPI startup(): 155     24107    eu-a6-005-17  {11,47}
[0] MPI startup(): 156     24108    eu-a6-005-17  {12,48}
[0] MPI startup(): 157     24109    eu-a6-005-17  {13,49}
[0] MPI startup(): 158     24110    eu-a6-005-17  {14,50}
[0] MPI startup(): 159     24111    eu-a6-005-17  {15,51}
[0] MPI startup(): 160     24112    eu-a6-005-17  {16,52}
[0] MPI startup(): 161     24113    eu-a6-005-17  {17,53}
[0] MPI startup(): 162     24114    eu-a6-005-17  {18,54}
[0] MPI startup(): 163     24115    eu-a6-005-17  {19,55}
[0] MPI startup(): 164     24116    eu-a6-005-17  {20,56}
[0] MPI startup(): 165     24117    eu-a6-005-17  {21,57}
[0] MPI startup(): 166     24118    eu-a6-005-17  {22,58}
[0] MPI startup(): 167     24119    eu-a6-005-17  {23,59}
[0] MPI startup(): 168     24120    eu-a6-005-17  {24,60}
[0] MPI startup(): 169     24121    eu-a6-005-17  {25,61}
[0] MPI startup(): 170     24122    eu-a6-005-17  {26,62}
[0] MPI startup(): 171     24123    eu-a6-005-17  {27,63}
[0] MPI startup(): 172     24124    eu-a6-005-17  {28,64}
[0] MPI startup(): 173     24125    eu-a6-005-17  {29,65}
[0] MPI startup(): 174     24126    eu-a6-005-17  {30,66}
[0] MPI startup(): 175     24127    eu-a6-005-17  {31,67}
[0] MPI startup(): 176     24128    eu-a6-005-17  {32,68}
[0] MPI startup(): 177     24129    eu-a6-005-17  {33,69}
[0] MPI startup(): 178     24130    eu-a6-005-17  {34,70}
[0] MPI startup(): 179     24131    eu-a6-005-17  {35,71}
[0] MPI startup(): 180     31407    eu-a6-005-01  {0,36}
[0] MPI startup(): 181     31408    eu-a6-005-01  {1,37}
[0] MPI startup(): 182     31409    eu-a6-005-01  {2,38}
[0] MPI startup(): 183     31410    eu-a6-005-01  {3,39}
[0] MPI startup(): 184     31411    eu-a6-005-01  {4,40}
[0] MPI startup(): 185     31412    eu-a6-005-01  {5,41}
[0] MPI startup(): 186     31413    eu-a6-005-01  {6,42}
[0] MPI startup(): 187     31414    eu-a6-005-01  {7,43}
[0] MPI startup(): 188     31415    eu-a6-005-01  {8,44}
[0] MPI startup(): 189     31416    eu-a6-005-01  {9,45}
[0] MPI startup(): 190     31417    eu-a6-005-01  {10,46}
[0] MPI startup(): 191     31418    eu-a6-005-01  {11,47}
[0] MPI startup(): 192     31419    eu-a6-005-01  {12,48}
[0] MPI startup(): 193     31420    eu-a6-005-01  {13,49}
[0] MPI startup(): 194     31421    eu-a6-005-01  {14,50}
[0] MPI startup(): 195     31422    eu-a6-005-01  {15,51}
[0] MPI startup(): 196     31423    eu-a6-005-01  {16,52}
[0] MPI startup(): 197     31424    eu-a6-005-01  {17,53}
[0] MPI startup(): 198     31425    eu-a6-005-01  {18,54}
[0] MPI startup(): 199     31426    eu-a6-005-01  {19,55}
[0] MPI startup(): 200     31427    eu-a6-005-01  {20,56}
[0] MPI startup(): 201     31429    eu-a6-005-01  {21,57}
[0] MPI startup(): 202     31430    eu-a6-005-01  {22,58}
[0] MPI startup(): 203     31431    eu-a6-005-01  {23,59}
[0] MPI startup(): 204     31432    eu-a6-005-01  {24,60}
[0] MPI startup(): 205     31433    eu-a6-005-01  {25,61}
[0] MPI startup(): 206     31435    eu-a6-005-01  {26,62}
[0] MPI startup(): 207     31436    eu-a6-005-01  {27,63}
[0] MPI startup(): 208     31437    eu-a6-005-01  {28,64}
[0] MPI startup(): 209     31438    eu-a6-005-01  {29,65}
[0] MPI startup(): 210     31439    eu-a6-005-01  {30,66}
[0] MPI startup(): 211     31440    eu-a6-005-01  {31,67}
[0] MPI startup(): 212     31441    eu-a6-005-01  {32,68}
[0] MPI startup(): 213     31442    eu-a6-005-01  {33,69}
[0] MPI startup(): 214     31443    eu-a6-005-01  {34,70}
[0] MPI startup(): 215     31444    eu-a6-005-01  {35,71}
[0] MPI startup(): 216     11802    eu-a6-005-21  {0,36}
[0] MPI startup(): 217     11803    eu-a6-005-21  {1,37}
[0] MPI startup(): 218     11804    eu-a6-005-21  {2,38}
[0] MPI startup(): 219     11805    eu-a6-005-21  {3,39}
[0] MPI startup(): 220     11806    eu-a6-005-21  {4,40}
[0] MPI startup(): 221     11807    eu-a6-005-21  {5,41}
[0] MPI startup(): 222     11808    eu-a6-005-21  {6,42}
[0] MPI startup(): 223     11809    eu-a6-005-21  {7,43}
[0] MPI startup(): 224     11810    eu-a6-005-21  {8,44}
[0] MPI startup(): 225     11811    eu-a6-005-21  {9,45}
[0] MPI startup(): 226     11813    eu-a6-005-21  {10,46}
[0] MPI startup(): 227     11814    eu-a6-005-21  {11,47}
[0] MPI startup(): 228     11815    eu-a6-005-21  {12,48}
[0] MPI startup(): 229     11816    eu-a6-005-21  {13,49}
[0] MPI startup(): 230     11817    eu-a6-005-21  {14,50}
[0] MPI startup(): 231     11818    eu-a6-005-21  {15,51}
[0] MPI startup(): 232     11819    eu-a6-005-21  {16,52}
[0] MPI startup(): 233     11820    eu-a6-005-21  {17,53}
[0] MPI startup(): 234     11821    eu-a6-005-21  {18,54}
[0] MPI startup(): 235     11822    eu-a6-005-21  {19,55}
[0] MPI startup(): 236     11823    eu-a6-005-21  {20,56}
[0] MPI startup(): 237     11825    eu-a6-005-21  {21,57}
[0] MPI startup(): 238     11826    eu-a6-005-21  {22,58}
[0] MPI startup(): 239     11827    eu-a6-005-21  {23,59}
[0] MPI startup(): 240     11828    eu-a6-005-21  {24,60}
[0] MPI startup(): 241     11829    eu-a6-005-21  {25,61}
[0] MPI startup(): 242     11830    eu-a6-005-21  {26,62}
[0] MPI startup(): 243     11831    eu-a6-005-21  {27,63}
[0] MPI startup(): 244     11832    eu-a6-005-21  {28,64}
[0] MPI startup(): 245     11833    eu-a6-005-21  {29,65}
[0] MPI startup(): 246     11834    eu-a6-005-21  {30,66}
[0] MPI startup(): 247     11835    eu-a6-005-21  {31,67}
[0] MPI startup(): 248     11836    eu-a6-005-21  {32,68}
[0] MPI startup(): 249     11837    eu-a6-005-21  {33,69}
[0] MPI startup(): 250     11838    eu-a6-005-21  {34,70}
[0] MPI startup(): 251     11839    eu-a6-005-21  {35,71}
[0] MPI startup(): I_MPI_ROOT=/cluster/apps/nss/intel/oneapi/2022.1.2/mpi/2021.5.1
[0] MPI startup(): I_MPI_MPIRUN=mpirun
[0] MPI startup(): I_MPI_HYDRA_RMK=lsf
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_DEBUG=40
[0] MPI startup(): threading: mode: direct
[0] MPI startup(): threading: vcis: 1
[0] MPI startup(): threading: app_threads: 1
[0] MPI startup(): threading: runtime: generic
[0] MPI startup(): threading: is_threaded: 0
[0] MPI startup(): threading: async_progress: 0
[0] MPI startup(): threading: num_pools: 64
[0] MPI startup(): threading: lock_level: global
[0] MPI startup(): threading: enable_sep: 0
[0] MPI startup(): threading: direct_recv: 1
[0] MPI startup(): threading: zero_op_flags: 1
[0] MPI startup(): threading: num_am_buffers: 1
[0] MPI startup(): threading: library is built with per-vci thread granularity
Hello world from processor eu-a6-009-23, rank 0 out of 252 processors
[sfux@eu-login-16 intelmpi]$
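
The rank table printed by I_MPI_DEBUG above is long, and it appears to place 36 ranks on each node. A minimal sketch for tallying ranks per node from such a log follows; the function name `ranks_per_node` and the regular expression are illustrative assumptions based on the row layout shown above, not part of Intel MPI.

```python
import re
from collections import Counter

# Matches pin-table rows such as
# "[0] MPI startup(): 36      73584    eu-a6-006-01  {0,36}"
# (rank, pid, node name, pinned CPUs). Layout assumed from the log above.
ROW = re.compile(r"MPI startup\(\):\s+(\d+)\s+(\d+)\s+(\S+)\s+\{([\d,]+)\}")

def ranks_per_node(log_text):
    """Return a Counter mapping node name -> number of ranks listed for it."""
    counts = Counter()
    for line in log_text.splitlines():
        m = ROW.search(line)
        if m:
            counts[m.group(3)] += 1
    return counts

sample = """\
[0] MPI startup(): 35      12772    eu-a6-009-23  {35,71}
[0] MPI startup(): 36      73584    eu-a6-006-01  {0,36}
[0] MPI startup(): 37      73585    eu-a6-006-01  {1,37}
[0] MPI startup(): I_MPI_DEBUG=40
"""

for node, n in sorted(ranks_per_node(sample).items()):
    print(node, n)
```

Lines that are not table rows (for example the `I_MPI_DEBUG=40` line) are skipped because they do not match the pattern.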

 

It seems that setting only FI_PROVIDER=mlx works fine on nodes with Intel Xeon Gold 6150 CPUs. I will also test the same command on our nodes with AMD EPYC 7742, 7H12 and 7763 CPUs and provide some logs.

 

samfux84

I repeated the 4-core job (2 cores per host) on our nodes with AMD CPUs. The results are below:

 

EPYC 7742:

[sfux@eu-g1-001-1 ~]$ ucx_info -d | grep Transport
#      Transport: posix
#      Transport: sysv
#      Transport: self
#      Transport: tcp
#      Transport: tcp
#      Transport: tcp
#      Transport: rc_verbs
#      Transport: rc_mlx5
#      Transport: dc_mlx5
#      Transport: ud_verbs
#      Transport: ud_mlx5
#      Transport: cma
#      Transport: knem
#      Transport: xpmem
[sfux@eu-g1-001-1 ~]$ ucx_info -v
# UCT version=1.11.1 revision c58db6b
# configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --enable-mt --disable-params-check --without-java --enable-cma --with-cuda --with-gdrcopy --with-verbs --with-knem --with-rdmacm --without-rocm --with-xpmem --without-fuse3 --without-ugni --with-cuda=/usr/local/cuda-11.2
[sfux@eu-g1-001-1 ~]$

Job logs:

[sfux@eu-login-16 intelmpi]$ cat lsf.o211967271
Sender: LSF System <lsfadmin@eu-g1-018-1>
Subject: Job 211967271: <I_MPI_DEBUG=40 FI_PROVIDER=mlx mpirun -n 4 -ppn 2 ./hello> in cluster <euler> Exited

Job <I_MPI_DEBUG=40 FI_PROVIDER=mlx mpirun -n 4 -ppn 2 ./hello> was submitted from host <eu-login-16> by user <sfux> in cluster <euler> at Thu Mar 31 11:11:13 2022
Job was executed on host(s) <2*eu-g1-018-1>, in queue <normal.4h>, as user <sfux> in cluster <euler> at Thu Mar 31 11:11:52 2022
                            <2*eu-g1-017-1>
</cluster/home/sfux> was used as the home directory.
</cluster/home/sfux/test/intelmpi> was used as the working directory.
Started at Thu Mar 31 11:11:52 2022
Terminated at Thu Mar 31 11:11:56 2022
Results reported at Thu Mar 31 11:11:56 2022

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
I_MPI_DEBUG=40 FI_PROVIDER=mlx mpirun -n 4 -ppn 2 ./hello
------------------------------------------------------------

Exited with exit code 143.

Resource usage summary:

    CPU time :                                   2.00 sec.
    Max Memory :                                 111 MB
    Average Memory :                             -
    Total Requested Memory :                     2000.00 MB
    Delta Memory :                               1889.00 MB
    Max Swap :                                   -
    Max Processes :                              -
    Max Threads :                                -
    Run time :                                   25 sec.
    Turnaround time :                            43 sec.

The output (if any) follows:

IPL WARN> Not all cpus are available, switch to I_MPI_PIN_ORDER=compact. (Total: 128 Available: 2)
IPL WARN> Not all cpus are available, switch to I_MPI_PIN_ORDER=compact. (Total: 128 Available: 2)
[0] MPI startup(): Intel(R) MPI Library, Version 2021.5  Build 20211102 (id: 9279b7d62)
[0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation.  All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.13.2rc1-impi
libfabric:57679:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:57679:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:57679:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:57679:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:57679:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:57679:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:57679:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ZE not supported
libfabric:57679:core:core:ofi_register_provider():474<info> registering provider: verbs (113.20)
libfabric:57679:core:core:ofi_register_provider():502<info> "verbs" filtered by provider include/exclude list, skipping
libfabric:57679:core:core:ofi_register_provider():474<info> registering provider: tcp (113.20)
libfabric:57679:core:core:ofi_register_provider():502<info> "tcp" filtered by provider include/exclude list, skipping
libfabric:57679:core:core:ofi_register_provider():474<info> registering provider: sockets (113.20)
libfabric:57679:core:core:ofi_register_provider():502<info> "sockets" filtered by provider include/exclude list, skipping
libfabric:57679:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:57679:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:57679:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ZE not supported
libfabric:57679:core:core:ofi_register_provider():474<info> registering provider: shm (113.20)
libfabric:57679:core:core:ofi_register_provider():502<info> "shm" filtered by provider include/exclude list, skipping
libfabric:57679:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:57679:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:57679:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:57679:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:57679:core:core:ofi_register_provider():474<info> registering provider: ofi_rxm (113.20)
libfabric:57679:psm3:core:fi_prov_ini():680<info> build options: VERSION=1101.0=11.1.0.0, HAVE_PSM3_src=1, PSM3_CUDA=0
libfabric:57679:core:core:ofi_register_provider():474<info> registering provider: psm3 (1101.0)
libfabric:57679:core:core:ofi_register_provider():502<info> "psm3" filtered by provider include/exclude list, skipping
libfabric:57679:core:core:ofi_register_provider():474<info> registering provider: mlx (1.4)
libfabric:57679:core:core:ofi_register_provider():474<info> registering provider: ofi_hook_noop (113.20)
libfabric:57679:core:core:fi_getinfo_():1138<info> Found provider with the highest priority mlx, must_use_util_prov = 0
libfabric:57679:core:core:fi_getinfo_():1138<info> Found provider with the highest priority mlx, must_use_util_prov = 0
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): detected mlx provider, set device name to "mlx_hcoll"
libfabric:57679:core:core:fi_fabric_():1423<info> Opened fabric: mlx
[0] MPI startup(): max_ch4_vcis: 1, max_reg_eps 64, enable_sep 0, enable_shared_ctxs 0, do_av_insert 1
[0] MPI startup(): addrnamelen: 1024
[1648717915.161174] [eu-g1-017-1:56545:0]         address.c:988  UCX  ERROR address version mismatch: expected 0, actual 12
Abort(1090703) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(143)........:
MPID_Init(1310)..............:
MPIDI_OFI_mpi_init_hook(1974): OFI get address vector map failed
[1648717915.161174] [eu-g1-017-1:56546:0]         address.c:988  UCX  ERROR address version mismatch: expected 0, actual 12
Abort(1090703) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(143)........:
MPID_Init(1310)..............:
MPIDI_OFI_mpi_init_hook(1974): OFI get address vector map failed
[sfux@eu-login-16 intelmpi]$

-> Job failed
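
In the failing log above, the UCX error lines carry the host name and PID of the process that reported them. A small sketch for pulling those fields out of a job log is shown below; the helper name `ucx_errors` and the pattern are assumptions derived from the two error lines in this log, and other UCX versions may format messages differently.

```python
import re

# Matches lines such as
# "[1648717915.161174] [eu-g1-017-1:56545:0]  address.c:988  UCX  ERROR address version mismatch: ..."
# Format assumed from the log above.
UCX_ERR = re.compile(
    r"\[(?P<host>[^\]:\s]+):(?P<pid>\d+):\d+\]\s+\S+\s+UCX\s+ERROR\s+(?P<msg>.*)"
)

def ucx_errors(log_text):
    """Return a list of (host, pid, message) tuples for each UCX ERROR line."""
    return [
        (m["host"], int(m["pid"]), m["msg"].strip())
        for m in UCX_ERR.finditer(log_text)
    ]

sample = """\
[0] MPI startup(): addrnamelen: 1024
[1648717915.161174] [eu-g1-017-1:56545:0]         address.c:988  UCX  ERROR address version mismatch: expected 0, actual 12
[1648717915.161174] [eu-g1-017-1:56546:0]         address.c:988  UCX  ERROR address version mismatch: expected 0, actual 12
"""

for host, pid, msg in ucx_errors(sample):
    print(host, pid, msg)
```

Here both errors come from eu-g1-017-1, so that node's UCX installation may be worth comparing against eu-g1-018-1.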

 

EPYC 7H12:

[sfux@eu-a2p-001 ~]$ module list

Currently Loaded Modules:
  1) StdEnv   2) intel/2022.1.2


[sfux@eu-a2p-001 ~]$ ucx_info -d | grep Transport
#      Transport: posix
#      Transport: sysv
#      Transport: self
#      Transport: tcp
#      Transport: tcp
#      Transport: tcp
#      Transport: rc_verbs
#      Transport: rc_mlx5
#      Transport: dc_mlx5
#      Transport: ud_verbs
#      Transport: ud_mlx5
#      Transport: rc_verbs
#      Transport: rc_mlx5
#      Transport: dc_mlx5
#      Transport: ud_verbs
#      Transport: ud_mlx5
#      Transport: cma
#      Transport: knem
#      Transport: xpmem
[sfux@eu-a2p-001 ~]$ ucx_info -v
# UCT version=1.11.1 revision c58db6b
# configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --enable-mt --disable-params-check --without-java --enable-cma --with-cuda --with-gdrcopy --with-verbs --with-knem --with-rdmacm --without-rocm --with-xpmem --without-fuse3 --without-ugni --with-cuda=/usr/local/cuda-11.2
[sfux@eu-a2p-001 ~]$
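
The EPYC 7H12 node lists the InfiniBand transports (rc_verbs, rc_mlx5, dc_mlx5, ud_verbs, ud_mlx5) twice, whereas the EPYC 7742 node lists them once. A rough sketch for tallying the `ucx_info -d | grep Transport` output, so that two nodes can be compared at a glance, might look like this; `transport_counts` is a hypothetical helper, not a UCX tool.

```python
from collections import Counter

def transport_counts(ucx_info_output):
    """Count how often each transport name appears in 'ucx_info -d' output."""
    counts = Counter()
    for line in ucx_info_output.splitlines():
        # Lines look like "#      Transport: rc_mlx5"; drop the leading "#" and spaces.
        line = line.lstrip("# ").strip()
        if line.startswith("Transport:"):
            counts[line.split(":", 1)[1].strip()] += 1
    return counts

sample = """\
#      Transport: tcp
#      Transport: rc_mlx5
#      Transport: rc_mlx5
#      Transport: cma
"""

for name, n in sorted(transport_counts(sample).items()):
    print(name, n)
```

Comparing the resulting counters from two hosts shows which transports differ in number or are missing on one side.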

Job logs:

[sfux@eu-login-16 intelmpi]$ cat lsf.o211967323
Sender: LSF System <lsfadmin@eu-a2p-077>
Subject: Job 211967323: <I_MPI_DEBUG=40 FI_PROVIDER=mlx mpirun -n 4 -ppn 2 ./hello> in cluster <euler> Done

Job <I_MPI_DEBUG=40 FI_PROVIDER=mlx mpirun -n 4 -ppn 2 ./hello> was submitted from host <eu-login-16> by user <sfux> in cluster <euler> at Thu Mar 31 11:11:31 2022
Job was executed on host(s) <2*eu-a2p-077>, in queue <normal.4h>, as user <sfux> in cluster <euler> at Thu Mar 31 11:11:52 2022
                            <2*eu-a2p-083>
</cluster/home/sfux> was used as the home directory.
</cluster/home/sfux/test/intelmpi> was used as the working directory.
Started at Thu Mar 31 11:11:52 2022
Terminated at Thu Mar 31 11:11:58 2022
Results reported at Thu Mar 31 11:11:58 2022

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
I_MPI_DEBUG=40 FI_PROVIDER=mlx mpirun -n 4 -ppn 2 ./hello
------------------------------------------------------------

Successfully completed.

Resource usage summary:

    CPU time :                                   7.00 sec.
    Max Memory :                                 1166 MB
    Average Memory :                             594.00 MB
    Total Requested Memory :                     2000.00 MB
    Delta Memory :                               834.00 MB
    Max Swap :                                   -
    Max Processes :                              9
    Max Threads :                                13
    Run time :                                   5 sec.
    Turnaround time :                            27 sec.

The output (if any) follows:

IPL WARN> Not all cpus are available, switch to I_MPI_PIN_ORDER=compact. (Total: 128 Available: 2)
IPL WARN> Not all cpus are available, switch to I_MPI_PIN_ORDER=compact. (Total: 128 Available: 2)
[0] MPI startup(): Intel(R) MPI Library, Version 2021.5  Build 20211102 (id: 9279b7d62)
[0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation.  All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.13.2rc1-impi
libfabric:65567:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:65567:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:65567:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:65567:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:65567:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:65567:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:65567:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:65567:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:65567:core:core:ofi_register_provider():474<info> registering provider: verbs (113.20)
libfabric:65567:core:core:ofi_register_provider():502<info> "verbs" filtered by provider include/exclude list, skipping
libfabric:65567:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:65567:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:65567:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ZE not supported
libfabric:65567:core:core:ofi_register_provider():474<info> registering provider: verbs (113.20)
libfabric:65567:core:core:ofi_register_provider():502<info> "verbs" filtered by provider include/exclude list, skipping
libfabric:65567:core:core:ofi_register_provider():474<info> registering provider: tcp (113.20)
libfabric:65567:core:core:ofi_register_provider():502<info> "tcp" filtered by provider include/exclude list, skipping
libfabric:65567:core:core:ofi_register_provider():474<info> registering provider: sockets (113.20)
libfabric:65567:core:core:ofi_register_provider():502<info> "sockets" filtered by provider include/exclude list, skipping
libfabric:65567:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:65567:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:65567:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ZE not supported
libfabric:65567:core:core:ofi_register_provider():474<info> registering provider: shm (113.20)
libfabric:65567:core:core:ofi_register_provider():502<info> "shm" filtered by provider include/exclude list, skipping
libfabric:65567:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:65567:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:65567:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:65567:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:65567:core:core:ofi_register_provider():474<info> registering provider: ofi_rxm (113.20)
libfabric:65567:psm3:core:fi_prov_ini():680<info> build options: VERSION=1101.0=11.1.0.0, HAVE_PSM3_src=1, PSM3_CUDA=0
libfabric:65567:core:core:ofi_register_provider():474<info> registering provider: psm3 (1101.0)
libfabric:65567:core:core:ofi_register_provider():502<info> "psm3" filtered by provider include/exclude list, skipping
libfabric:65567:core:core:ofi_register_provider():474<info> registering provider: mlx (1.4)
libfabric:65567:core:core:ofi_register_provider():474<info> registering provider: ofi_hook_noop (113.20)
libfabric:65567:core:core:fi_getinfo_():1138<info> Found provider with the highest priority mlx, must_use_util_prov = 0
libfabric:65567:core:core:fi_getinfo_():1138<info> Found provider with the highest priority mlx, must_use_util_prov = 0
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): detected mlx provider, set device name to "mlx_hcoll"
libfabric:65567:core:core:fi_fabric_():1423<info> Opened fabric: mlx
[0] MPI startup(): max_ch4_vcis: 1, max_reg_eps 64, enable_sep 0, enable_shared_ctxs 0, do_av_insert 1
[0] MPI startup(): addrnamelen: 1024
[0] MPI startup(): Load tuning file: "/cluster/apps/nss/intel/oneapi/2022.1.2/mpi/2021.5.1/etc/tuning_generic_shm-ofi_mlx_hcoll.dat"
[0] MPI startup(): Rank    Pid      Node name   Pin cpu
[0] MPI startup(): 0       65567    eu-a2p-077  {54}
[0] MPI startup(): 1       65568    eu-a2p-077  {55}
[0] MPI startup(): 2       64165    eu-a2p-083  {7}
[0] MPI startup(): 3       64166    eu-a2p-083  {11}
[0] MPI startup(): I_MPI_ROOT=/cluster/apps/nss/intel/oneapi/2022.1.2/mpi/2021.5.1
[0] MPI startup(): I_MPI_MPIRUN=mpirun
[0] MPI startup(): I_MPI_HYDRA_RMK=lsf
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_DEBUG=40
[0] MPI startup(): threading: mode: direct
[0] MPI startup(): threading: vcis: 1
[0] MPI startup(): threading: app_threads: 1
[0] MPI startup(): threading: runtime: generic
[0] MPI startup(): threading: is_threaded: 0
[0] MPI startup(): threading: async_progress: 0
[0] MPI startup(): threading: num_pools: 64
[0] MPI startup(): threading: lock_level: global
[0] MPI startup(): threading: enable_sep: 0
[0] MPI startup(): threading: direct_recv: 1
[0] MPI startup(): threading: zero_op_flags: 1
[0] MPI startup(): threading: num_am_buffers: 1
[0] MPI startup(): threading: library is built with per-vci thread granularity
Hello world from processor eu-a2p-077, rank 1 out of 4 processors
Hello world from processor eu-a2p-077, rank 0 out of 4 processors
Hello world from processor eu-a2p-083, rank 2 out of 4 processors
Hello world from processor eu-a2p-083, rank 3 out of 4 processors
[sfux@eu-login-16 intelmpi]$

-> Job did not fail
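
The libfabric debug lines in these logs show each provider being registered and, with FI_PROVIDER=mlx set, most of them being filtered out again. A minimal sketch that lists the providers left after filtering is given below; the function name `active_providers` and both patterns are assumptions based on the message wording in the logs above.

```python
import re

# Message wording assumed from the libfabric <info> lines in the logs above.
REGISTER = re.compile(r"registering provider: (\S+)")
FILTERED = re.compile(r'"([^"]+)" filtered by provider include/exclude list')

def active_providers(log_text):
    """Return registered providers that were not filtered, in first-seen order."""
    registered, filtered = [], set()
    for line in log_text.splitlines():
        m = REGISTER.search(line)
        if m and m.group(1) not in registered:
            registered.append(m.group(1))
        m = FILTERED.search(line)
        if m:
            filtered.add(m.group(1))
    return [p for p in registered if p not in filtered]

sample = """\
libfabric:65567:core:core:ofi_register_provider():474<info> registering provider: verbs (113.20)
libfabric:65567:core:core:ofi_register_provider():502<info> "verbs" filtered by provider include/exclude list, skipping
libfabric:65567:core:core:ofi_register_provider():474<info> registering provider: ofi_rxm (113.20)
libfabric:65567:core:core:ofi_register_provider():474<info> registering provider: mlx (1.4)
"""

print(active_providers(sample))  # ['ofi_rxm', 'mlx']
```

Running the same function over the failing and the successful log is a quick way to confirm that both jobs ended up with the same provider set.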

 

EPYC 7763:

[sfux@eu-a2p-400 ~]$ ucx_info -d | grep Transport
#      Transport: posix
#      Transport: sysv
#      Transport: self
#      Transport: tcp
#      Transport: tcp
#      Transport: tcp
#      Transport: rc_verbs
#      Transport: rc_mlx5
#      Transport: dc_mlx5
#      Transport: ud_verbs
#      Transport: ud_mlx5
#      Transport: cma
#      Transport: knem
#      Transport: xpmem
[sfux@eu-a2p-400 ~]$ ucx_info -v
# UCT version=1.11.1 revision c58db6b
# configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --enable-mt --disable-params-check --without-java --enable-cma --with-cuda --with-gdrcopy --with-verbs --with-knem --with-rdmacm --without-rocm --with-xpmem --without-fuse3 --without-ugni --with-cuda=/usr/local/cuda-11.2
[sfux@eu-a2p-400 ~]$

Job logs:

[sfux@eu-login-16 intelmpi]$ cat lsf.o211967427
Sender: LSF System <lsfadmin@eu-a2p-512>
Subject: Job 211967427: <I_MPI_DEBUG=40 FI_PROVIDER=mlx mpirun -n 4 -ppn 2 ./hello> in cluster <euler> Done

Job <I_MPI_DEBUG=40 FI_PROVIDER=mlx mpirun -n 4 -ppn 2 ./hello> was submitted from host <eu-login-16> by user <sfux> in cluster <euler> at Thu Mar 31 11:11:44 2022
Job was executed on host(s) <2*eu-a2p-512>, in queue <normal.4h>, as user <sfux> in cluster <euler> at Thu Mar 31 11:12:18 2022
                            <2*eu-a2p-482>
</cluster/home/sfux> was used as the home directory.
</cluster/home/sfux/test/intelmpi> was used as the working directory.
Started at Thu Mar 31 11:12:18 2022
Terminated at Thu Mar 31 11:12:22 2022
Results reported at Thu Mar 31 11:12:22 2022

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
I_MPI_DEBUG=40 FI_PROVIDER=mlx mpirun -n 4 -ppn 2 ./hello
------------------------------------------------------------

Successfully completed.

Resource usage summary:

    CPU time :                                   4.00 sec.
    Max Memory :                                 1136 MB
    Average Memory :                             -
    Total Requested Memory :                     2000.00 MB
    Delta Memory :                               864.00 MB
    Max Swap :                                   -
    Max Processes :                              -
    Max Threads :                                -
    Run time :                                   16 sec.
    Turnaround time :                            38 sec.

The output (if any) follows:

IPL WARN> Not all cpus are available, switch to I_MPI_PIN_ORDER=compact. (Total: 128 Available: 2)
IPL WARN> Not all cpus are available, switch to I_MPI_PIN_ORDER=compact. (Total: 128 Available: 2)
[0] MPI startup(): Intel(R) MPI Library, Version 2021.5  Build 20211102 (id: 9279b7d62)
[0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation.  All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.13.2rc1-impi
libfabric:87550:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:87550:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:87550:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:87550:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:87550:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:87550:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:87550:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:87550:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:87550:core:core:ofi_register_provider():474<info> registering provider: verbs (113.20)
libfabric:87550:core:core:ofi_register_provider():502<info> "verbs" filtered by provider include/exclude list, skipping
libfabric:87550:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:87550:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:87550:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ZE not supported
libfabric:87550:core:core:ofi_register_provider():474<info> registering provider: verbs (113.20)
libfabric:87550:core:core:ofi_register_provider():502<info> "verbs" filtered by provider include/exclude list, skipping
libfabric:87550:core:core:ofi_register_provider():474<info> registering provider: tcp (113.20)
libfabric:87550:core:core:ofi_register_provider():502<info> "tcp" filtered by provider include/exclude list, skipping
libfabric:87550:core:core:ofi_register_provider():474<info> registering provider: sockets (113.20)
libfabric:87550:core:core:ofi_register_provider():502<info> "sockets" filtered by provider include/exclude list, skipping
libfabric:87550:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:87550:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:87550:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ZE not supported
libfabric:87550:core:core:ofi_register_provider():474<info> registering provider: shm (113.20)
libfabric:87550:core:core:ofi_register_provider():502<info> "shm" filtered by provider include/exclude list, skipping
libfabric:87550:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:87550:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:87550:core:core:ze_hmem_dl_init():422<warn> Failed to dlopen libze_loader.so
libfabric:87550:core:core:ofi_hmem_init():214<warn> Failed to initialize hmem iface FI_HMEM_ZE: No data available
libfabric:87550:core:core:ofi_register_provider():474<info> registering provider: ofi_rxm (113.20)
libfabric:87550:psm3:core:fi_prov_ini():680<info> build options: VERSION=1101.0=11.1.0.0, HAVE_PSM3_src=1, PSM3_CUDA=0
libfabric:87550:core:core:ofi_register_provider():474<info> registering provider: psm3 (1101.0)
libfabric:87550:core:core:ofi_register_provider():502<info> "psm3" filtered by provider include/exclude list, skipping
libfabric:87550:core:core:ofi_register_provider():474<info> registering provider: mlx (1.4)
libfabric:87550:core:core:ofi_register_provider():474<info> registering provider: ofi_hook_noop (113.20)
libfabric:87550:core:core:fi_getinfo_():1138<info> Found provider with the highest priority mlx, must_use_util_prov = 0
libfabric:87550:core:core:fi_getinfo_():1138<info> Found provider with the highest priority mlx, must_use_util_prov = 0
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): detected mlx provider, set device name to "mlx_hcoll"
libfabric:87550:core:core:fi_fabric_():1423<info> Opened fabric: mlx
[0] MPI startup(): max_ch4_vcis: 1, max_reg_eps 64, enable_sep 0, enable_shared_ctxs 0, do_av_insert 1
[0] MPI startup(): addrnamelen: 1024
[0] MPI startup(): Load tuning file: "/cluster/apps/nss/intel/oneapi/2022.1.2/mpi/2021.5.1/etc/tuning_generic_shm-ofi_mlx_hcoll.dat"
[0] MPI startup(): Rank    Pid      Node name   Pin cpu
[0] MPI startup(): 0       87550    eu-a2p-512  {120}
[0] MPI startup(): 1       87551    eu-a2p-512  {121}
[0] MPI startup(): 2       80139    eu-a2p-482  {120}
[0] MPI startup(): 3       80140    eu-a2p-482  {121}
[0] MPI startup(): I_MPI_ROOT=/cluster/apps/nss/intel/oneapi/2022.1.2/mpi/2021.5.1
[0] MPI startup(): I_MPI_MPIRUN=mpirun
[0] MPI startup(): I_MPI_HYDRA_RMK=lsf
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_DEBUG=40
[0] MPI startup(): threading: mode: direct
[0] MPI startup(): threading: vcis: 1
[0] MPI startup(): threading: app_threads: 1
[0] MPI startup(): threading: runtime: generic
[0] MPI startup(): threading: is_threaded: 0
[0] MPI startup(): threading: async_progress: 0
[0] MPI startup(): threading: num_pools: 64
[0] MPI startup(): threading: lock_level: global
[0] MPI startup(): threading: enable_sep: 0
[0] MPI startup(): threading: direct_recv: 1
[0] MPI startup(): threading: zero_op_flags: 1
[0] MPI startup(): threading: num_am_buffers: 1
[0] MPI startup(): threading: library is built with per-vci thread granularity
Hello world from processor eu-a2p-512, rank 1 out of 4 processors
Hello world from processor eu-a2p-512, rank 0 out of 4 processors
Hello world from processor eu-a2p-482, rank 2 out of 4 processors
Hello world from processor eu-a2p-482, rank 3 out of 4 processors
[sfux@eu-login-16 intelmpi]$

-> Job did not fail
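
When comparing several such LSF output files, the interesting lines can be pulled out rather than read by eye. The following is a minimal sketch, assuming the log layout shown above (the LSF `Subject:` line ending in `Done` or `Exited`, the `libfabric provider:` line printed with `I_MPI_DEBUG`, and one `Hello world` line per rank). The function name `summarize_impi_log` is made up for illustration and is not part of Intel MPI or LSF.

```shell
#!/usr/bin/env bash
# Sketch only: summarize_impi_log is a hypothetical helper, not an Intel MPI or LSF tool.
# It assumes the log layout shown in the job output above.
summarize_impi_log() {
    local log="$1" status provider ranks

    # LSF subject line ends with the job state, e.g. "... in cluster <euler> Done"
    status=$(sed -n 's/^Subject: Job .* in cluster <[^>]*> \([A-Za-z]*\)[[:space:]]*$/\1/p' "$log" | head -n 1)

    # Rank 0 reports the selected provider: "[0] MPI startup(): libfabric provider: mlx"
    provider=$(sed -n 's/^\[0] MPI startup(): libfabric provider: //p' "$log" | head -n 1)

    # Count the ranks that actually printed their greeting
    ranks=$(grep -c '^Hello world from processor' "$log" || true)

    echo "status=${status:-unknown} provider=${provider:-unknown} ranks_reported=${ranks}"
}
```

Calling `summarize_impi_log lsf.o211967427` on the file above prints one line per log, which makes it easier to line up runs from different node types.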

 

The jobs failed on EPYC 7742, but not on EPYC 7H12 or EPYC 7763. Is there a setting (e.g. an environment variable) that gives consistent results on nodes with the same interconnect (Mellanox ConnectX-6) but different CPUs?
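
Since the outcome appears to depend on the CPU type, it may help to record the CPU model next to each run. The following is a rough sketch, assuming Linux nodes where `/proc/cpuinfo` carries a `model name` field. The helper names `cpu_model` and `run_tagged` are hypothetical, and the defaults simply repeat the `FI_PROVIDER=mlx` and `I_MPI_DEBUG=40` settings from the job above. It does not fix the failure; it only labels the output so runs can be compared.

```shell
#!/usr/bin/env bash
# Sketch only: cpu_model and run_tagged are hypothetical helper names.

# Print the first "model name" entry from a cpuinfo-style file
# (defaults to /proc/cpuinfo on Linux); prints "unknown" if none is found.
cpu_model() {
    local src="${1:-/proc/cpuinfo}" model
    model=$(sed -n 's/^model name[[:space:]]*:[[:space:]]*//p' "$src" 2>/dev/null | head -n 1)
    echo "${model:-unknown}"
}

# Print node name and CPU model, then run the given command with the
# settings used in the job above. Both variables can be overridden
# from the calling environment.
run_tagged() {
    echo "node=$(uname -n) cpu=$(cpu_model)"
    FI_PROVIDER="${FI_PROVIDER:-mlx}" I_MPI_DEBUG="${I_MPI_DEBUG:-40}" "$@"
}
```

For example, `run_tagged mpirun -n 4 -ppn 2 ./hello` would be submitted in place of the plain `mpirun` line. Note that this reports only the CPU of the node where the wrapper itself runs, which is the first execution host in the jobs shown here.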

 

Best regards

 

Sam

 

JyotsnaK_Intel
Moderator
2,496 Views

Hi Sam,


Thank you for your inquiry. We offer support for the hardware platforms that the Intel® oneAPI product supports. These include the Intel® Core™ processor family or higher, the Intel® Xeon® processor family, the Intel® Xeon® Scalable processor family, and others, which are listed here: Intel® oneAPI Base Toolkit System Requirements, Intel® oneAPI HPC Toolkit System Requirements, and Intel® oneAPI IoT Toolkit System Requirements.

If you wish to use oneAPI on hardware that is not listed at one of the sites above, we encourage you to visit and contribute to the open oneAPI specification - https://www.oneapi.io/spec/


qumale
Beginner
2,485 Views

Dear JyotsnaK_Intel,

 

please note that the attached logs were produced on Intel Xeon Gold 6150 processors, which are listed among the supported devices on the Intel® oneAPI HPC Toolkit System Requirements webpage.

 

Q

HemanthCH_Intel
Moderator
2,433 Views

Hi,

 

In the post above, Sam confirmed: "It seems that just setting FI_PROVIDER=mlx works fine on nodes with Intel Xeon Gold 6150 CPUs." Could you please confirm whether this issue is resolved? If it is not, could you please describe the scenario in which the error occurs?

 

Thanks & Regards,

Hemanth

 

HemanthCH_Intel
Moderator
2,361 Views

Hi,


We haven't heard back from you. Could you please provide any updates on your issue?


Thanks & Regards,

Hemanth



samfux84
New Contributor I
2,354 Views

Dear @HemanthCH_Intel ,

 

Thank you for your reply.

 

Please keep the issue open for some more time. I will ask @qumale to rerun the computations for which he provided the logs above, to check if he still gets the same errors. If this is the case, then we would have to investigate this issue further.

 

I can also fully understand that you don't support CPUs other than Intel's for Intel oneAPI. It looks like there is no setting that consistently avoids crashes across different AMD CPU types. I think we will then ask our users to switch from IntelMPI to OpenMPI, as OpenMPI runs fine on both Intel and AMD CPUs.

 

Best regards

 

Sam

DrAmarpal_K_Intel
1,953 Views

Hi Sam,

 

I am going ahead and closing this thread due to inactivity. As conveyed before, please reach us through the priority support channel once you have all the details.

 

Intel will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.

 

Best regards,

Amar

 
