I was trying to install OFED on my MIC host node to enable the Mellanox IB card for the Xeon Phi node, but openibd could not be started after the installation:
# service openibd start
Loading Mellanox MLX4_IB HCA driver: [FAILED]
Loading Mellanox MLX4_EN HCA driver: [FAILED]
[FAILED]
The installation steps were:
1. Install the common intel-mic-* rpms from mpss_gold_update_3. After this, mpssd could be started and the MIC node booted successfully.
2. Download ofed-1.5.4.1 from http://www.openfabrics.org/downloads/OFED/ofed-1.5.4/OFED-1.5.4.1.tgz and follow the steps in readme-en.txt, which install OFED with install.pl. In this step I installed all the OFED packages first and then removed the kernel-ib* rpms.
3. Install the OFED rpm packages in ofed/intel-mic-ofed*.
4. Restart mpssd with: service mpss restart
According to the log, the problem was that the mlx4_ib driver could not be loaded correctly, so I tried to load the driver manually, but that failed with the following error message:
# modprobe mlx4_ib
FATAL: Error inserting mlx4_ib (/lib/modules/2.6.32-220.el6.x86_64/updates/drivers/infiniband/hw/mlx4/mlx4_ib.ko): Unknown symbol in module, or unknown parameter (see dmesg)
In the dmesg output I saw a lot of messages like the following:
mlx4_ib: disagrees about version of symbol mlx4_find_cached_vlan
mlx4_ib: Unknown symbol mlx4_find_cached_vlan
mlx4_ib: disagrees about version of symbol mlx4_buf_write_mtt
mlx4_ib: Unknown symbol mlx4_buf_write_mtt
mlx4_ib: disagrees about version of symbol mlx4_fmr_unmap
mlx4_ib: Unknown symbol mlx4_fmr_unmap
mlx4_ib: disagrees about version of symbol mlx4_unregister_interface
mlx4_ib: Unknown symbol mlx4_unregister_interface
mlx4_ib: Unknown symbol mlx4_qp_lookup_lock
mlx4_ib: disagrees about version of symbol mlx4_write_mtt
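The kernel prints "disagrees about version of symbol" when the symbol checksum recorded in the module being loaded differs from the one exported by the module already providing that symbol. As a rough aid for reading output like this, the sketch below collects the affected symbol names from dmesg text on standard input. The function name mismatched_symbols is hypothetical and is not part of OFED or MPSS.

```shell
#!/bin/sh
# Hypothetical helper: list each symbol that dmesg reports as a
# version mismatch, once, in sorted order. Reads dmesg text on stdin.
mismatched_symbols() {
    sed -n 's/.*disagrees about version of symbol \([A-Za-z0-9_]*\).*/\1/p' | sort -u
}
```

On the host it could be used as `dmesg | mismatched_symbols`. If every listed symbol starts with mlx4_, as in the messages above, it may be worth comparing `modinfo -n mlx4_core` with the path of mlx4_ib.ko to see whether both modules come from the same package.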
This was strange: the error messages suggest that the mlx4_ib driver was not built against the correct kernel version. I then tried to rebuild intel-mic-ofed-kmod-6720-16.el6.src.rpm on my host node, but the build process failed too:
rpmbuild --rebuild ./src/intel-mic-ofed-kmod-6720-16.el6.src.rpm
The OS of my host node is Red Hat Enterprise Linux Server 6.2.
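One way to test the suspicion that a module was built for a different kernel is to compare the running kernel release with the module's vermagic string. The sketch below assumes the usual vermagic layout, in which the first word is the kernel release; the function name vermagic_matches is hypothetical. Note that a matching vermagic does not rule out the per-symbol version mismatches shown in dmesg, which are checked separately.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a vermagic string was built for the
# given kernel release.
#   $1: kernel release, e.g. the output of `uname -r`
#   $2: vermagic string, e.g. the output of `modinfo -F vermagic mlx4_ib`
vermagic_matches() {
    built_for=${2%% *}    # keep only the first word of the vermagic string
    [ "$built_for" = "$1" ]
}
```

A possible invocation on the host: `vermagic_matches "$(uname -r)" "$(modinfo -F vermagic mlx4_ib)" && echo match`.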
I consulted the experts, and they suggest:
1. Try unloading any previously loaded device drivers first: service openibd restart
2. Sometimes simply rebooting the system helps.
If neither of these helps, please send me any additional information you can.
Regards
--
Taylor
Thank you very much for your reply.
Removing the drivers mlx4_en, mlx4_ib and mlx4_core and then restarting the openibd service worked. I then ran 'service ofed-mic start', which printed success messages.
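The unload sequence described here can be sketched as a dry run. The helper below reads lsmod output on standard input and prints an rmmod command for each of the three mlx4 modules that is currently loaded, with the dependent modules (mlx4_en, mlx4_ib) before mlx4_core. The function name unload_plan is hypothetical, and the order is an assumption based on mlx4_core being the module the other two depend on.

```shell
#!/bin/sh
# Hypothetical helper: print rmmod commands for the loaded mlx4 modules,
# dependents first. Reads `lsmod` output on stdin; executes nothing itself.
unload_plan() {
    loaded=$(awk 'NR > 1 { print $1 }')    # skip the lsmod header line
    for mod in mlx4_en mlx4_ib mlx4_core; do
        if printf '%s\n' "$loaded" | grep -qx "$mod"; then
            echo "rmmod $mod"
        fi
    done
}
```

Running `lsmod | unload_plan` only previews the commands. After running them as root, `service openibd restart` and `service ofed-mic start` would follow as described in this post.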
Now on my MIC card I can display the IB devices with the ibv_devinfo command, but I do not know how to verify that the communication path between the MIC card and the host node has been set up correctly. Could you show me how to verify the IB settings, or point me to a document?
# ibv_devinfo
hca_id: mlx4_0
transport: InfiniBand (0)
fw_ver: 2.11.500
node_guid: 5cf3:fc00:0005:3d0b
sys_image_guid: 5cf3:fc00:0005:3d0e
vendor_id: 0x02c9
vendor_part_id: 4099
hw_ver: 0x0
phys_port_cnt: 2
port: 1
state: PORT_ACTIVE (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 1
port_lid: 84
port_lmc: 0x00
link_layer: IB
port: 2
state: PORT_DOWN (1)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: IB
hca_id: scif0
transport: iWARP (1)
fw_ver: 0.0.1
node_guid: 261e:65ff:fed5:b54b
sys_image_guid: 261e:65ff:fed5:b54b
vendor_id: 0x8086
vendor_part_id: 0
hw_ver: 0x1
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 1001
port_lmc: 0x00
link_layer: IB
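For a quick overview of a listing like the one above, a small sketch can reduce it to one line per port. It reads ibv_devinfo output on standard input and relies only on the hca_id, port and state fields shown above; the function name port_states is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print "<hca_id> <port> <state>" for every port
# found in `ibv_devinfo` output read from stdin.
port_states() {
    awk '
        $1 == "hca_id:" { hca = $2 }           # remember the current device
        $1 == "port:"   { port = $2 }          # remember the current port
        $1 == "state:"  { print hca, port, $2 }
    '
}
```

With the output above, `ibv_devinfo | port_states` would report port 1 of mlx4_0 and port 1 of scif0 as PORT_ACTIVE and port 2 of mlx4_0 as PORT_DOWN. This only shows that the ports are up; it does not by itself prove that traffic flows between the card and the host.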
Hi Xiao,
I use the ibv_devinfo command on both the host and the MIC cards to check the SCIF link setup. To test whether a standard protocol (TCP) or SCIF is used, I normally run a simple program with I_MPI_DEBUG set to 3.
For example, I first run a sample program with I_MPI_DEVICE set to ssm; the output indicates that TCP was used to transfer data between the host and the MIC cards. Then I rerun the test with I_MPI_DEVICE set to rdma:ofa-v2-mlx4_0-1u, and this time the output shows that ofa-v2-mlx4_0-1u is used. See below:
# mpirun -genv I_MPI_DEBUG 3 -genv I_MPI_DEVICE ssm -host localhost -n 1 ./test.host : -host mic0 -n 1 -wdir /tmp ./test.mic : -host mic1 -n 1 -wdir /tmp ./test.mic
[0] MPI startup(): shared memory and socket data transfer modes
[2] MPI startup(): shared memory and socket data transfer modes
[1] MPI startup(): shared memory and socket data transfer modes
[0] MPI startup(): shm and tcp data transfer modes
[2] MPI startup(): shm and tcp data transfer modes
[1] MPI startup(): shm and tcp data transfer modes
[0] MPI startup(): Rank Pid Node name
[0] MPI startup(): 0 9397 knightscorner4
[0] MPI startup(): 1 5024 knightscorner4-mic0
[0] MPI startup(): 2 5009 knightscorner4-mic1
Hello world: rank 0 of 3 running on knightscorner4
Hello world: rank 1 of 3 running on knightscorner4-mic0
Hello world: rank 2 of 3 running on knightscorner4-mic1
# mpirun -genv I_MPI_DEBUG 3 -genv I_MPI_DEVICE rdma:ofa-v2-mlx4_0-1u -host localhost -n 1 ./test.host : -host mic0 -n 1 -wdir /tmp ./test.mic : -host mic1 -n 1 -wdir /tmp ./test.mic
[0] MPI startup(): RDMA data transfer mode
[2] MPI startup(): RDMA data transfer mode
[1] MPI startup(): RDMA data transfer mode
[0] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1u
[2] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1u
[1] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1u
[0] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[0] MPI startup(): dapl data transfer mode
[1] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[1] MPI startup(): dapl data transfer mode
[2] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[2] MPI startup(): dapl data transfer mode
[0] MPID_nem_init_dapl_coll_fns(): User set DAPL collective mask = 0000
[0] MPID_nem_init_dapl_coll_fns(): Effective DAPL collective mask = 0000
[1] MPID_nem_init_dapl_coll_fns(): User set DAPL collective mask = 0000
[1] MPID_nem_init_dapl_coll_fns(): Effective DAPL collective mask = 0000
[2] MPID_nem_init_dapl_coll_fns(): User set DAPL collective mask = 0000
[2] MPID_nem_init_dapl_coll_fns(): Effective DAPL collective mask = 0000
[0] MPI startup(): Rank Pid Node name
[0] MPI startup(): 0 9413 knightscorner4
[0] MPI startup(): 1 5030 knightscorner4-mic0
[0] MPI startup(): 2 5015 knightscorner4-mic1
Hello world: rank 0 of 3 running on knightscorner4
Hello world: rank 1 of 3 running on knightscorner4-mic0
Hello world: rank 2 of 3 running on knightscorner4-mic1
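When the debug output gets long, the lines that matter for this check are the "data transfer mode" ones. The sketch below pulls them out of mpirun output read from standard input and prints the rank followed by the reported mode. It assumes one message per line in the format shown in this post, and the function name transfer_modes is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print "<rank> <mode>" for every
# "MPI startup(): ... data transfer mode(s)" line read from stdin.
transfer_modes() {
    sed -n 's/^\[\([0-9]*\)\] MPI startup(): \(.*\) data transfer modes\{0,1\}$/\1 \2/p'
}
```

It could be appended to either mpirun command with a pipe, for example `mpirun ... | transfer_modes`. A run over TCP would then show entries such as "0 shm and tcp", while a run over the DAPL provider would show "0 RDMA" and "0 dapl".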