Intel® MPI Library

mpiifort running fine on some nodes and showing "open_hca: device mlx4_0 not found" for others

Edrisse_C_
Beginner

Dear all,

Using mpiifort on a cluster results in "open_hca: device mlx4_0 not found" on some groups of nodes, while on others there is no error and it runs perfectly fine. All the nodes have the same hardware and software configuration. I have already looked at a similar topic:

https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/393416

I applied the solution proposed there, commenting out the ofa-v2-mlx4_0-1 and ofa-v2-mlx4_0-2 lines in /etc/dat.conf, but it did not solve the issue.
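For readers who want to try the same step, here is a minimal sketch of commenting out those two entries. It works on a scratch copy rather than on /etc/dat.conf itself, and the sample entries (including the `ofa-v2-ib0` line) are illustrative assumptions about a typical file, not taken from the cluster in question.

```shell
# Build a scratch file with illustrative dat.conf entries (assumed layout).
conf=$(mktemp)
cat > "$conf" <<'EOF'
ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib0 0" ""
EOF

# Prefix only the ofa-v2-mlx4_0-1 and ofa-v2-mlx4_0-2 entries with '#'.
sed -E 's/^(ofa-v2-mlx4_0-[12][[:space:]])/#\1/' "$conf" > "$conf.new"

# Show the result; other entries are left untouched.
cat "$conf.new"
```

On a real system the same `sed` expression would be applied to /etc/dat.conf with root privileges, after making a backup.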

Do you have any idea what might be wrong? I have attached the error log, and the ibstat output is below in case it helps:

$ ibstat

CA 'mlx4_0'
        CA type: MT4099
        Number of ports: 1
        Firmware version: 2.31.5050
        Hardware version: 1
        Node GUID: 0xf45214030090c050
        System image GUID: 0xf45214030090c053
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 56
                Base lid: 1
                LMC: 0
                SM lid: 1
                Capability mask: 0x0251486a
                Port GUID: 0xf45214030090c051
                Link layer: InfiniBand

Many thanks in advance,

Edrisse

Dmitry_S_Intel
Moderator

Hi,

Please test with I_MPI_FABRICS=shm:ofa

--

Dmitry
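A minimal sketch of how this setting might be applied before launching a job. `I_MPI_FABRICS=shm:ofa` comes from the answer above. The `I_MPI_DEBUG` line is an optional addition intended to make the library report the selected fabric at startup, and the launch line is a placeholder (`./my_app` and the rank count are hypothetical).

```shell
# Use shared memory within a node and the OFA fabric between nodes.
export I_MPI_FABRICS=shm:ofa

# Optional (assumption, not from the thread): ask Intel MPI to print
# startup details such as the data transfer mode in use.
export I_MPI_DEBUG=2

# Confirm what the launcher will inherit from this shell.
echo "I_MPI_FABRICS=${I_MPI_FABRICS}"

# Hypothetical launch line; replace the binary name and rank count:
# mpirun -n 4 ./my_app
```

Exporting the variable in the job script applies it to every node the job uses, which makes it easy to compare the nodes that failed with the ones that did not.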

Edrisse_C_
Beginner

Hi Dmitry,

Many thanks for your answer. I can confirm that setting I_MPI_FABRICS=shm:ofa solves the issue.

Best Regards,
Edrisse
