Issue with Infiniband Qlogic IBA 7322

Quentin_B_1
Beginner
2,539 Views

Hi,

We have an issue with a QLogic InfiniBand card on the MIC.
We use Red Hat 6.3, the latest MPSS version, and 2 Phi cards on each node.
OFED 1.5.4.1 has been installed.
The IB card is OK on the head node.
We ran rpm -U on all the intel-mic-ofed packages and rebooted the head node,
but when we try to start ofed-mic,
we get the following error message, and only scif0 has started on mic0 and mic1:

"
ibpd: pid 4137 /dev/ibp1 started 4 threads
ibpd: pid 4159 /dev/ibp2 started 4 threads
kernel: ------------[ cut here ]------------
kernel: WARNING: at fs/sysfs/dir.c:512 sysfs_add_one+0xc9/0x130() (Not tainted)
kernel: Hardware name: S2600GZ
kernel: sysfs: cannot create duplicate filename '/devices/pci0000:80/0000:80:01.0/0000:81:00.0/infiniband/qib0/knx_node'
kernel: Modules linked in: ibscif(U) ibp_server(U) autofs4 nfs lockd fscache nfs_acl auth_rpcgss sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf bridge stp llc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_sa(U) ipv6 ib_uverbs(U) ib_umad(U) mlx4_ib(U) mlx4_en(U) mlx4_core(U) ib_mthca(U) sg microcode ib_qib(U) mic(U) ib_mad(U) ib_core(U) sb_edac edac_core i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ioatdma igb dca shpchp ext3 jbd mbcache sd_mod crc_t10dif isci libsas scsi_transport_sas ahci wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ibp_server]
kernel: Pid: 1908, comm: qib/mic0 Not tainted 2.6.32-279.el6.x86_64 #1
kernel: Call Trace:
kernel: [<ffffffff8106b747>] ? warn_slowpath_common+0x87/0xc0
kernel: [<ffffffff8106b836>] ? warn_slowpath_fmt+0x46/0x50
kernel: [<ffffffff811f3329>] ? sysfs_add_one+0xc9/0x130
kernel: [<ffffffff811f1612>] ? sysfs_add_file_mode+0x62/0xb0
kernel: [<ffffffff811f1671>] ? sysfs_add_file+0x11/0x20
kernel: [<ffffffff811f16a6>] ? sysfs_create_file+0x26/0x30
kernel: [<ffffffff8134ce79>] ? device_create_file+0x19/0x20
kernel: [<ffffffffa02bf82c>] ? qib_knx_server_listen+0x23c/0x720 [ib_qib]
kernel: [<ffffffffa02bf5f0>] ? qib_knx_server_listen+0x0/0x720 [ib_qib]
kernel: [<ffffffff81091d66>] ? kthread+0x96/0xa0
kernel: [<ffffffff8100c14a>] ? child_rip+0xa/0x20
kernel: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
kernel: ---[ end trace 640779d065f7b165 ]---
"
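
For anyone hitting the same trace, here is a minimal sketch of how the relevant lines could be pulled out of a kernel log. It only relies on the two message formats quoted above ("sysfs: cannot create duplicate filename '...'" and "Pid: ..., comm: ..."); the function names are made up for illustration, and the sample lines are copied from the trace in this post.

```python
import re

# Patterns match the kernel messages quoted in this thread.
DUPLICATE_RE = re.compile(r"sysfs: cannot create duplicate filename '([^']+)'")
COMM_RE = re.compile(r"\bcomm: (\S+)")

# Two lines copied from the trace above, used as sample input.
SAMPLE = [
    "kernel: sysfs: cannot create duplicate filename "
    "'/devices/pci0000:80/0000:80:01.0/0000:81:00.0/infiniband/qib0/knx_node'",
    "kernel: Pid: 1908, comm: qib/mic0 Not tainted 2.6.32-279.el6.x86_64 #1",
]


def duplicate_sysfs_paths(lines):
    """Return every sysfs path reported as a duplicate, in order of appearance."""
    paths = []
    for line in lines:
        match = DUPLICATE_RE.search(line)
        if match:
            paths.append(match.group(1))
    return paths


def reporting_threads(lines):
    """Return the `comm:` names from the 'Pid: ..., comm: ...' lines."""
    names = []
    for line in lines:
        match = COMM_RE.search(line)
        if match:
            names.append(match.group(1))
    return names


if __name__ == "__main__":
    for path in duplicate_sysfs_paths(SAMPLE):
        print("duplicate sysfs entry:", path)
    for name in reporting_threads(SAMPLE):
        print("reported by thread:", name)
```

To scan a real log, pass the lines of the log file (on RHEL this is usually /var/log/messages) instead of SAMPLE.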

Any idea?

Thanks,

Quentin Bouyer

14 Replies
Quentin_B_1
Beginner

If I shut down one MIC card, I can start ofed-mic
without any problem.
We will try downgrading to Red Hat 6.2 because we think
this release is more compatible with our QLogic IB cards.

Frances_R_Intel
Employee

In the install instructions in the readme file that comes with the MPSS, it says to not install the kernel-ib* packages from the standard ofed 1.5.4.1 release and instead install the ones from the MPSS release. In your case, since you brought up ofed on the host before installing the MPSS, that would have meant uninstalling those packages before installing the ofed files from the MPSS. The readme also gives very strict instructions about the order in which things must be brought up: for RHEL 6.4 mpss->rdma->opensmd->ofed-mic; for other releases mpss->openibd->opensmd->ofed-mic. You probably did this, but just in case, I thought I would mention it.
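
The ordering rule above can be written down as a small sketch. The service names are the ones quoted from the readme in this post; the helper names and the use of `service <name> start` to start them are assumptions for illustration, and nothing here actually runs a command.

```python
from typing import List

# Start orders as quoted in this thread from the MPSS readme.
START_ORDER_RHEL_6_4 = ["mpss", "rdma", "opensmd", "ofed-mic"]
START_ORDER_OTHER = ["mpss", "openibd", "opensmd", "ofed-mic"]


def start_order(rhel_release: str) -> List[str]:
    """Return the service start order for a RHEL release string such as '6.3'."""
    if rhel_release.strip() == "6.4":
        return list(START_ORDER_RHEL_6_4)
    return list(START_ORDER_OTHER)


def start_commands(rhel_release: str) -> List[List[str]]:
    """Build argv lists of the form ['service', name, 'start'].

    The `service ... start` form is an assumed invocation; the commands are
    only built here, not executed.
    """
    return [["service", name, "start"] for name in start_order(rhel_release)]


if __name__ == "__main__":
    for argv in start_commands("6.3"):
        print(" ".join(argv))
```

Printing the commands first, rather than running them, makes it easy to compare the order against the readme before touching the node.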

Quentin_B_1
Beginner
I've just reinstalled a node with Red Hat 6.2:

- With the QLogic driver from Intel (7.1.0.0.58): the driver is OK on the head node, rpm -e for kernel-ib and rpm -ivh ofed/intel-mic-ofed*.rpm are OK, but there is still only the scif0 interface for each MIC.
- With OFED 1.5.4.1: the install is OK without kernel-ib, and I installed all the intel-mic-ofed packages. Still the same problem as above, plus the same error message in /var/log/messages (my first post).

The IB card is an InfiniPath QLE7340. There is no problem with a Mellanox IB card on another node.

Any idea? Thanks.
Frances_R_Intel
Employee

OK, I am not (yet) very experienced with these drivers but I think (hope) I understand now.

In the readme file that comes with the MPSS, there are two sets of OFED driver install directions. One is for the True Scale InfiniBand drivers, the other is for all others. Mellanox definitely requires the drivers described in the "all others" section. When you said you use the 'QLogic driver from Intel (7.1.0.0.58)', it hit me - that release number is for the Intel(r) True Scale Fabric Suite. For those drivers, you need to use the True Scale install directions from the MPSS readme file. They recommend using a 7.2 release of the True Scale drivers (which is not publicly available yet but should be very shortly). But also, for this software to work with the coprocessor cards, you must install the drivers in the psm directory from the MPSS release. You cannot use these drivers with the RHEL 6.4 release but you can with 6.2 or 6.3.

Do you think that might be the problem? If so, could you try those other instructions on only the nodes using the Qlogic cards?

Quentin_B_1
Beginner
I've installed all the RPMs from the psm directory and still have the same problem: the same error message as in post 1, and only the scif0 interface is up on the MIC.

I use Red Hat 6.2, QLogicIB-Basic.RHEL6-x86_64.7.1.1.0.25 (the latest version available on the download center), and mpss_gold_update_3-2.1.6720-13--rhel-6.2.

I think our QLogic release is the source of our problem. I will wait for QLogic v7.2 to become available.

Thank you very much for your help.
Frances_R_Intel
Employee

7.2 is out there now. (Go to https://downloadcenter.intel.com/default.aspx and search on True Scale. I don't know if you will need the serial number from your adapter or not.) Do you want to give it a try?

Quentin_B_1
Beginner
I've downloaded it, thanks.

I've reinstalled a fresh Red Hat 6.3 on the head node + QLogic IB Basic 7.2 + MPSS gold update 3 + ofed-mic, but I still have the problem and the error message.

I've attached all the operations I've done.
Wendy__C_
Beginner

It looks like a bug? I don't have any QLogic card to try it out, but based on reading the source ... the server wants to create a new sysfs file. Since it now gets two MICs, my guess is that the file name gets duplicated (?). It is then stuck while trying to put the warning message into the log: "sysfs: cannot create duplicate filename ....".
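
To make that guess concrete, here is a toy model in Python. It is not the ib_qib driver code and not the actual fix; the class and function names are hypothetical. It only shows why registering the same attribute name ("knx_node", from the log) once per coprocessor under a single device directory works for one MIC and collides for two, which would match the report that ofed-mic starts when one card is shut down.

```python
class DuplicateName(Exception):
    """Raised when an attribute name already exists in the directory."""


class AttrDir:
    """Toy stand-in for one device's attribute directory (loosely like sysfs)."""

    def __init__(self, path):
        self.path = path
        self.names = set()

    def create_file(self, name):
        # A directory cannot hold two entries with the same name.
        if name in self.names:
            raise DuplicateName(
                "cannot create duplicate filename '%s/%s'" % (self.path, name)
            )
        self.names.add(name)


def register_fixed(attr_dir, mic_ids):
    """Hypothetical behaviour: every coprocessor registers the same name."""
    for _ in mic_ids:
        attr_dir.create_file("knx_node")


def register_per_mic(attr_dir, mic_ids):
    """One possible way to avoid the clash: include the MIC index in the name."""
    for mic in mic_ids:
        attr_dir.create_file("knx_node%d" % mic)


if __name__ == "__main__":
    qib0 = AttrDir("/devices/pci0000:80/0000:80:01.0/0000:81:00.0/infiniband/qib0")
    try:
        register_fixed(qib0, [0, 1])
    except DuplicateName as err:
        print("second MIC fails:", err)
```

With a single entry in mic_ids, register_fixed succeeds; with two, the second call to create_file raises.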

 

Frances_R_Intel
Employee

Quentin, I looked through the text file you attached and everything looks right up until it goes badly wrong. I think this should be reported at http://premier.intel.com as an MPSS issue. If you do not have access (you need to register through registrationcenter.intel.com), I will submit the issue for you.

Quentin_B_1
Beginner
Can you report this issue for me? I have an account, but I don't have the rights to do anything ... How can I follow this issue?

I noticed that it doesn't matter which version of the IB driver you use on the head node, because these drivers are deleted by the installation of intel-mic-ofed...rpm.

In fact, it appears there are 2 problems: one with ibp_client, because it doesn't allow ib_qib to plug into it (only mlx4 or mthca seem to be allowed), and a second problem with ib_qib itself.

Thanks again for your help.
Quentin_B_1
Beginner
I've seen in an Intel doc that IB on MIC is only supported for Mellanox HCAs for the moment. Maybe that explains our problem?
Frances_R_Intel
Employee

No, that document is out of date - do you remember where you saw it? This changed with the update 3 release last month.

Quentin_B_1
Beginner
OK. The doc I mentioned can be found here: http://software.intel.com/en-us/articles/system-administration-for-the-intel-xeon-phi-coprocessor (look for 650_Intel_R__Xeon_Phi_tm__Cluster_configuration-v081, page 34).
Frances_R_Intel
Employee

I was looking through some very old posts and realized I had left this issue hanging. Ultimately, the issue was a combination of poor documentation (for which I share part of the guilt), differences in the OFED installation process between True Scale and Mellanox adapters, and a bug. The documentation which is included in the MPSS release is much better these days (for which I cannot take the credit) and provides good directions for installing the True Scale version of the OFED software. The bug was fixed in MPSS 3.0.17461. So that's the story here.
