Failed to start openibd for Mellanox IB card.

Xiao_peng_W_
Beginner

I was trying to install OFED on my MIC host node to enable the Mellanox IB card for the Xeon Phi node, but openibd could not be started after the installation:

# service openibd start
    Loading Mellanox MLX4_IB HCA driver:                       [FAILED]
    Loading Mellanox MLX4_EN HCA driver:                       [FAILED]
                                                           [FAILED]

The installation steps were:
  1. Install the common intel-mic-* RPMs from mpss_gold_update_3. After this, mpssd could be started and the MIC node booted successfully.
  2. Download OFED 1.5.4.1 from http://www.openfabrics.org/downloads/OFED/ofed-1.5.4/OFED-1.5.4.1.tgz and follow the steps in readme-en.txt to install OFED with install.pl. In this step I installed all the OFED packages first and then removed the kernel-ib* RPMs.
  3. Install the OFED RPM packages in ofed/intel-mic-ofed*.
  4. Restart mpssd: service mpss restart

From the log, the problem was that the mlx4_ib driver could not be loaded, so I tried to load it manually, but that failed with the following error message:
# modprobe mlx4_ib
    FATAL: Error inserting mlx4_ib (/lib/modules/2.6.32-220.el6.x86_64/updates/drivers/infiniband/hw/mlx4/mlx4_ib.ko): Unknown symbol in module, or unknown parameter (see dmesg)

In the dmesg output I saw a lot of messages like the following:
  mlx4_ib: disagrees about version of symbol mlx4_find_cached_vlan
  mlx4_ib: Unknown symbol mlx4_find_cached_vlan
  mlx4_ib: disagrees about version of symbol mlx4_buf_write_mtt
  mlx4_ib: Unknown symbol mlx4_buf_write_mtt
  mlx4_ib: disagrees about version of symbol mlx4_fmr_unmap
  mlx4_ib: Unknown symbol mlx4_fmr_unmap
  mlx4_ib: disagrees about version of symbol mlx4_unregister_interface
  mlx4_ib: Unknown symbol mlx4_unregister_interface
  mlx4_ib: Unknown symbol mlx4_qp_lookup_lock
  mlx4_ib: disagrees about version of symbol mlx4_write_mtt

Oddly, the error messages suggested that the mlx4_ib driver was not built against the correct kernel version, so I tried to rebuild intel-mic-ofed-kmod-6720-16.el6.src.rpm on my host node, but the build failed too:
     rpmbuild --rebuild ./src/intel-mic-ofed-kmod-6720-16.el6.src.rpm

The OS on my host node is Red Hat Enterprise Linux Server 6.2.
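
A "disagrees about version of symbol" message generally means the module being inserted was built against a different copy of the exporting module (here mlx4_core) than the one it is resolving symbols from. As a rough sketch, not taken from the original post, the helper below lists the affected symbols from dmesg text; the function name mismatched_symbols is hypothetical, and the commented commands are standard tools for comparing the module files with the running kernel.

```shell
#!/bin/sh
# List the distinct symbols that the kernel log reports as version-mismatched
# for one module. Reads log text on stdin; the module name is the first argument.
# (mismatched_symbols is a hypothetical helper name.)
mismatched_symbols() {
    grep "$1: disagrees about version of symbol " |
        awk '{ print $NF }' | LC_ALL=C sort -u
}

# Typical use on the host:
#   dmesg | mismatched_symbols mlx4_ib
#   modinfo -F filename mlx4_core   # which .ko file modprobe would load
#   modinfo -F vermagic mlx4_ib     # kernel version the module was built for
#   uname -r                        # running kernel
#   lsmod | grep mlx4               # mlx4 modules already resident
```

If lsmod shows an mlx4_core that is already loaded from a different package than the mlx4_ib under /lib/modules/.../updates, that alone could explain the mismatch.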

3 Replies
TaylorIoTKidd
New Contributor I

The experts I consulted suggest:

1. Try unloading any previously loaded device drivers first: service openibd restart

2. Sometimes just a reboot of the system helps.

If the above doesn't help, give me any additional information you can.

Regards
--
Taylor

 

 

Xiao_peng_W_
Beginner

Thank you very much for your reply.

Removing the drivers mlx4_en, mlx4_ib, and mlx4_core and then restarting the openibd service worked. I then ran 'service ofed-mic start', which reported success.
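
The sequence just described can be sketched as a small script. This is an assumption about the exact commands used (the post does not show them); the run and reload_ib_stack names and the DRY_RUN variable are hypothetical. With DRY_RUN=1 the commands are only printed, which makes it safe to review the order first.

```shell
#!/bin/sh
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Unload the mlx4 modules (dependents before mlx4_core), then restart
# openibd and start ofed-mic, in the order reported in the post.
reload_ib_stack() {
    for mod in mlx4_en mlx4_ib mlx4_core; do
        # A module that is not loaded or still in use is skipped, not fatal.
        run modprobe -r "$mod" || true
    done
    run service openibd restart &&
        run service ofed-mic start
}
```

Run it as root with `reload_ib_stack` after sourcing the file, or set DRY_RUN=1 beforehand to see the commands without executing them.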

Now on my MIC card I can display the IB device with the ibv_devinfo command, but I don't know how to verify that the communication path between the MIC card and the host node has been set up correctly. Could you show me how to verify the IB settings, or point me to a document?

# ibv_devinfo
hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.11.500
        node_guid:                      5cf3:fc00:0005:3d0b
        sys_image_guid:                 5cf3:fc00:0005:3d0e
        vendor_id:                      0x02c9
        vendor_part_id:                 4099
        hw_ver:                         0x0
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 1
                        port_lid:               84
                        port_lmc:               0x00
                        link_layer:             IB

                port:   2
                        state:                  PORT_DOWN (1)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             IB

hca_id: scif0
        transport:                      iWARP (1)
        fw_ver:                         0.0.1
        node_guid:                      261e:65ff:fed5:b54b
        sys_image_guid:                 261e:65ff:fed5:b54b
        vendor_id:                      0x8086
        vendor_part_id:                 0
        hw_ver:                         0x1
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               1001
                        port_lmc:               0x00
                        link_layer:             IB
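
Output like the above can be condensed to one line per port, which makes it easier to compare the host and the card. The following is a minimal sketch that only parses the text format shown here; port_states is a hypothetical helper name, not part of OFED.

```shell
#!/bin/sh
# Summarise `ibv_devinfo` text (read from stdin) as "<hca> port <n> <state>".
# (port_states is a hypothetical helper name.)
port_states() {
    awk '
        $1 == "hca_id:" { hca = $2 }               # remember the current device
        $1 == "port:"   { port = $2 }              # remember the current port
        $1 == "state:"  { print hca, "port", port, $2 }
    '
}

# On the host or on the coprocessor:
#   ibv_devinfo | port_states
```

For the listing above this would report mlx4_0 port 1 and scif0 port 1 as PORT_ACTIVE and mlx4_0 port 2 as PORT_DOWN.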

Loc_N_Intel
Employee

Hi Xiao,

I run ibv_devinfo on both the host and the MIC cards to check the SCIF link setup. To test whether a standard protocol (TCP) or SCIF is used, I normally run a simple program with I_MPI_DEBUG set to 3.

For example, I first run a sample program with I_MPI_DEVICE set to ssm; the output indicates that TCP was used to transfer data between the host and the MIC cards. Then I rerun the test with I_MPI_DEVICE set to rdma:ofa-v2-mlx4_0-1u; this time the output shows that ofa-v2-mlx4_0-1u is used. See below:

# mpirun -genv I_MPI_DEBUG 3 -genv I_MPI_DEVICE ssm -host localhost -n 1 ./test.host : -host mic0 -n 1 -wdir /tmp ./test.mic : -host mic1 -n 1 -wdir /tmp ./test.mic
[0] MPI startup(): shared memory and socket data transfer modes
[2] MPI startup(): shared memory and socket data transfer modes
[1] MPI startup(): shared memory and socket data transfer modes
[0] MPI startup(): shm and tcp data transfer modes
[2] MPI startup(): shm and tcp data transfer modes
[1] MPI startup(): shm and tcp data transfer modes
[0] MPI startup(): Rank    Pid      Node name
[0] MPI startup(): 0       9397     knightscorner4
[0] MPI startup(): 1       5024     knightscorner4-mic0
[0] MPI startup(): 2       5009     knightscorner4-mic1
Hello world: rank 0 of 3 running on knightscorner4
Hello world: rank 1 of 3 running on knightscorner4-mic0
Hello world: rank 2 of 3 running on knightscorner4-mic1

# mpirun -genv I_MPI_DEBUG 3 -genv I_MPI_DEVICE rdma:ofa-v2-mlx4_0-1u -host localhost -n 1 ./test.host : -host mic0 -n 1 -wdir /tmp ./test.mic : -host mic1 -n 1 -wdir /tmp ./test.mic
[0] MPI startup(): RDMA data transfer mode
[2] MPI startup(): RDMA data transfer mode
[1] MPI startup(): RDMA data transfer mode
[0] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1u
[2] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1u
[1] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1u
[0] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[0] MPI startup(): dapl data transfer mode
[1] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[1] MPI startup(): dapl data transfer mode
[2] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[2] MPI startup(): dapl data transfer mode
[0] MPID_nem_init_dapl_coll_fns(): User set DAPL collective mask = 0000
[0] MPID_nem_init_dapl_coll_fns(): Effective DAPL collective mask = 0000
[1] MPID_nem_init_dapl_coll_fns(): User set DAPL collective mask = 0000
[1] MPID_nem_init_dapl_coll_fns(): Effective DAPL collective mask = 0000
[2] MPID_nem_init_dapl_coll_fns(): User set DAPL collective mask = 0000
[2] MPID_nem_init_dapl_coll_fns(): Effective DAPL collective mask = 0000
[0] MPI startup(): Rank    Pid      Node name
[0] MPI startup(): 0       9413     knightscorner4
[0] MPI startup(): 1       5030     knightscorner4-mic0
[0] MPI startup(): 2       5015     knightscorner4-mic1
Hello world: rank 0 of 3 running on knightscorner4
Hello world: rank 1 of 3 running on knightscorner4-mic0
Hello world: rank 2 of 3 running on knightscorner4-mic1
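
To pick the transfer mode out of longer logs, the "MPI startup()" lines can be filtered. This is a sketch that assumes the log format shown in the two runs above; transfer_modes is a hypothetical helper name.

```shell
#!/bin/sh
# Print the distinct data transfer modes named in I_MPI_DEBUG startup
# output read from stdin. (transfer_modes is a hypothetical helper name.)
transfer_modes() {
    sed -n 's/^\[[0-9]*\] MPI startup(): \(.*\) data transfer modes*$/\1/p' |
        LC_ALL=C sort -u
}

# Example, with the mpirun arguments left as in the runs above:
#   mpirun -genv I_MPI_DEBUG 3 ... 2>&1 | transfer_modes
```

For the first run this would list the shared memory/socket and shm/tcp modes; for the second it would list RDMA and dapl.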

 
