Edrisse_C_
Beginner

mpiifort running fine on some nodes and showing "open_hca: device mlx4_0 not found" for others

Dear all,

Running mpiifort on a cluster produces the error "open_hca: device mlx4_0 not found" on some groups of nodes, while on others there is no error and mpiifort runs perfectly fine. All the nodes have the same hardware and software configuration. I have already looked at a similar topic:

https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/393416

I applied the solution proposed there, commenting out the ofa-v2-mlx4_0-1 and ofa-v2-mlx4_0-2 lines in /etc/dat.conf, but it did not solve the issue.

Do you have any idea what might be wrong? I attach the error log and the ibstat output in case they help:

$ ibstat

CA 'mlx4_0'
        CA type: MT4099
        Number of ports: 1
        Firmware version: 2.31.5050
        Hardware version: 1
        Node GUID: 0xf45214030090c050
        System image GUID: 0xf45214030090c053
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 56
                Base lid: 1
                LMC: 0
                SM lid: 1
                Capability mask: 0x0251486a
                Port GUID: 0xf45214030090c051
                Link layer: InfiniBand

Many thanks in advance,

Edrisse

Dmitry_S_Intel
Employee

Hi,

Please test with I_MPI_FABRICS=shm:ofa

--

Dmitry
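
For readers who want to try the suggestion above, here is a minimal sketch of one way to set the variable before launching a job. It assumes an Intel MPI version that still accepts the ofa fabric value and a bash-like shell. The executable name ./my_app and the rank count of 4 are hypothetical placeholders, and I_MPI_DEBUG is only added so that the startup output reports which fabric was actually selected.

```shell
# Use shared memory inside a node and the OFA (verbs) fabric between nodes,
# instead of the DAPL provider that reports "open_hca: device mlx4_0 not found".
export I_MPI_FABRICS=shm:ofa

# Optional: ask Intel MPI to print startup details, including the fabric in use.
export I_MPI_DEBUG=5

# Hypothetical launch: ./my_app and the rank count are placeholders.
# The guard keeps this sketch harmless where mpirun or the binary is missing.
if command -v mpirun >/dev/null 2>&1 && [ -x ./my_app ]; then
    mpirun -n 4 ./my_app
fi
```

Exporting the variable in the job script keeps the setting local to that job, which may be preferable to editing /etc/dat.conf on every node.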

Edrisse_C_
Beginner

Hi Dmitry,

Many thanks for your answer. I confirm that setting I_MPI_FABRICS=shm:ofa solves the issue.

Best Regards,
Edrisse