We have a strange problem on one of our servers using an Intel 82599 SRIOV NIC. The server had been working fine for almost 8 months, with the SRIOV PFs and VFs working correctly. Suddenly we ran into an issue where one of the PFs doesn't seem to be working. We need help determining whether the SRIOV PF has failed in hardware or whether this is a software problem.
Running the ethtool offline tests currently reports PASS, but leaves the dmesg message shown after the test output:
# ethtool -t eth103 offline
The test result is PASS
The test extra info:
Register test (offline) 0
Eeprom test (offline) 0
Interrupt test (offline) 0
Loopback test (offline) 0
Link test (on/offline) 0
[895552.667586] ixgbe: eth103: ixgbe_disable_rx_queue: RXDCTL.ENABLE on Rx queue 64 not cleared within the polling period
Also, show-ring shows:
# ethtool --show-ring eth103
Ring parameters for eth103:
RX Mini: 0
RX Jumbo: 0
Current hardware settings:
RX Mini: 0
RX Jumbo: 0
That is only 64 rings, whereas previously it showed 512 rings.
We have some VMs with SRIOV VFs from this bad SRIOV PF PCI-assigned to them. They run into the same issue. We added some debug prints to the ixgbevf driver and saw that ixgbevf_reset_hw_vf(), which gets called at init, fails at
ret_val = mbx->ops.read_posted(hw, msgbuf, IXGBE_VF_PERMADDR_MSG_LEN);
with the following error:
[ 3.484162] ixgbevf: read_posted retval:-100 (IXGBE_ERR_MBX)
The link status of the SRIOV PF seems to be fine:
# ip link show dev eth103
5: eth103: mtu 1500 qdisc mq state UP qlen 1000
link/ether 00:1b:21:a3:94:39 brd ff:ff:ff:ff:ff:ff
vf 0 MAC 02:17:3e:67:a0:f8
vf 1 MAC 02:17:3e:45:bf:4a
vf 2 MAC 02:17:3e:78:d2:d7
vf 3 MAC 02:17:3e:1a:fb:c6
vf 4 MAC 02:17:3e:58:35:8d
vf 5 MAC 02:17:3e:52:ae:4c
vf 6 MAC 02:17:3e:62:2d:b9
vf 7 MAC 02:17:3e:24:ae:e3
vf 8 MAC 02:17:3e:22:35:2b
vf 9 MAC 02:17:3e:59:86:40
vf 10 MAC 02:17:3e:6f:9c:de
vf 11 MAC 02:17:3e:13:0a:c1
vf 12 MAC 02:17:3e:24:b5:79
vf 13 MAC 02:17:3e:2d:e1:2a
vf 14 MAC 02:17:3e:0c:11:df
vf 15 MAC 02:17:3e:7b:82:d2
vf 16 MAC 02:17:3e:43:5c:8d
vf 17 MAC 02:17:3e:54:ed:b2
vf 18 MAC 02:17:3e:70:8f:53
vf 19 MAC 02:17:3e:55:8d:2f
vf 20 MAC 02:17:3e:72:18:20
vf 21 MAC 02:17:3e:12:ff:95
vf 22 MAC 02:17:3e:71:d8:4d
vf 23 MAC 02:17:3e:27:eb:9f
vf 24 MAC 02:17:3e:29:7a:ad
vf 25 MAC 02:17:3e:2c:e9:4e
vf 26 MAC 02:17:3e:15:ce:57
vf 27 MAC 02:17:3e:6d:61:2c
vf 28 MAC 02:17:3e:4c:24:4d
vf 29 MAC 02:17:3e:4c:ab:7e
vf 30 MAC 1e:f8:b3:79:75:b2
vf 31 MAC 02:02:2f:eb:73:1e
So, essentially, the mailbox and the tx/rx queues don't appear to work.
A dump of all registers on this PF, taken with ethtool, can be found here
# The physical servers run Ubuntu Natty (11.04) with Linux kernel 2.6.38-8-server. We are running ixgbe driver 3.2.9, which we compiled locally to disable MAC anti-spoofing (essentially, we always call hw->mac.ops.set_mac_anti_spoofing with the disabled flag). We did this to enable bonding of SRIOV VFs within VMs.
# At the physical server level we use ixgbevf 1.0.19-k0 and expose/use a couple of SRIOV VFs locally within the physical server for bonding. Essentially, we set up a Linux active-backup bond across SRIOV VFs from two different SRIOV PFs.
# We run several KVM VMs on these servers. They run Ubuntu Precise (12.04) with Linux kernel 3.2.0-25-generic and ixgbevf driver version 2.2.0-k. These VMs have SRIOV VFs PCI-attached to them, and they in turn set up active-backup bonds across VFs from different SRIOV PFs.
# We set up bonds primarily for failover, and at the same time use SRIOV for performance.
We don't know whether this problem will go away after a power cycle of the server. We are keeping this server in its current state in case more live state information is required. Please let us know if any more state information would help in isolating this problem.
Any help appreciated.
I was actually thinking along the lines of any OS/Kernel updates; however since you haven't rebooted in so long, that would be unlikely.
My experts are not sure what the source of your problem is. The error is, as you pointed out, that the mailbox communication stopped working. We can't tell from the description whether it is a software (driver or kernel) problem or a hardware problem.
All we can suggest is to save the kernel and dmesg logs and reboot. If the PF and VFs work after the reboot, we are more inclined to believe it is a software problem of some sort; otherwise, a hardware failure.
Also, before you reboot, could you dump the registers with the ethregs tool:
If you post it, I'll see if it provides any more useful information.
Wish I had a magic bullet for you.
Please find the output of ethregs in:
I just ran it on eth103 without any options; please let us know if you need us to re-run it with any particular options.
Thanks for your help,
I think the PF dump is also available in the attached file (it has all the VFs and then the PF). Otherwise, please let us know exactly which command you need us to run:
Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection
One more question on this. Is there a version compatibility requirement between ixgbe and ixgbevf?
We are currently on ixgbe 3.2.9 and we have two versions of ixgbevf interfacing with the same NIC: ixgbevf 1.0.19-k0 and ixgbevf 2.2.0-k. Would it be an issue to have differing versions of ixgbevf working simultaneously on the same NIC?
One thing we observed was that
ixgbevf 2.2.0-k has a new msg code 0x6
Though we don't use MAC VLAN, it clears the MAC VLAN configuration as below
hw->mac.ops.set_uc_addr(hw, 0, NULL);
However, ixgbe 3.2.9 doesn't understand IXGBE_VF_SET_MACVLAN and prints a message like this
[1020846.780262] ixgbe: eth103: ixgbe_rcv_msg_from_vf: Unhandled Msg 00000006
This happens very frequently (i.e. ixgbevf for some reason keeps doing this almost every 2 secs) and ixgbe keeps printing this message.
We don't know whether there are any other such incompatibilities that could result in this behaviour. Any insights appreciated.
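To put a number on "almost every 2 secs", the gaps between repeated kernel messages can be computed from the dmesg timestamps. The following is only a rough Python sketch: the function name message_intervals is hypothetical, and the sample lines use the message text from the log above with made-up timestamps chosen purely for illustration.

```python
import re

# Matches the "[seconds.micros]" prefix the kernel puts on each log line,
# including the padded form seen in the guest log, e.g. "[ 3.484162]".
TIMESTAMP_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")

def message_intervals(log_lines, needle):
    """Return the gaps in seconds between consecutive lines containing needle."""
    stamps = []
    for line in log_lines:
        match = TIMESTAMP_RE.match(line)
        if match and needle in match.group(2):
            stamps.append(float(match.group(1)))
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]

# Hypothetical sample: the message text is from the thread, but these
# timestamps are invented for illustration and are not real log data.
sample = [
    "[100.000000] ixgbe: eth103: ixgbe_rcv_msg_from_vf: Unhandled Msg 00000006",
    "[101.250000] some unrelated kernel message",
    "[102.000000] ixgbe: eth103: ixgbe_rcv_msg_from_vf: Unhandled Msg 00000006",
    "[104.500000] ixgbe: eth103: ixgbe_rcv_msg_from_vf: Unhandled Msg 00000006",
]
gaps = message_intervals(sample, "Unhandled Msg 00000006")
print(gaps)
```

In practice the input would be the lines of the saved dmesg output rather than the sample list.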
We are not sure why your PF seemed to freeze. We will keep an eye out for such behavior, thanks for bringing it to our attention.
As for your PF/VF alignment: they are fairly tightly coupled. The VF driver communicates with the PF driver through messages in the mailbox. If one side doesn't understand the other, such an error will occur.
I'd recommend updating both the PF and VF drivers to the latest versions available from our SourceForge site. URL below:
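As a rough illustration of that coupling, here is a toy Python model of a PF dispatching mailbox messages by opcode. It is a sketch under stated assumptions, not the driver's code: only opcode 0x6 (IXGBE_VF_SET_MACVLAN) and the "Unhandled Msg" log text come from this thread; the other opcodes, the handler names, the ACK/NACK strings, and the choice to treat the low 16 bits as the opcode are assumptions made for the example.

```python
# Opcode 0x06 (IXGBE_VF_SET_MACVLAN) is taken from the thread. The other
# table entries are illustrative assumptions, not copied from driver source.
OLD_PF_HANDLERS = {
    0x01: "reset",
    0x02: "set_mac_addr",
}
NEW_PF_HANDLERS = dict(OLD_PF_HANDLERS)
NEW_PF_HANDLERS[0x06] = "set_macvlan"

def pf_receive(handlers, msg_word):
    """Model a PF handling one mailbox word sent by a VF.

    The low 16 bits are treated as the opcode (an assumption of this model).
    Returns ("ACK", handler_name) or ("NACK", log_text).
    """
    opcode = msg_word & 0xFFFF
    handler = handlers.get(opcode)
    if handler is None:
        # An older PF has no entry for a newer opcode, so it can only reject it.
        return ("NACK", "Unhandled Msg %08x" % msg_word)
    return ("ACK", handler)

old_reply = pf_receive(OLD_PF_HANDLERS, 0x06)
new_reply = pf_receive(NEW_PF_HANDLERS, 0x06)
print(old_reply)
print(new_reply)
```

The point of the model is that the newer VF keeps sending a request the older PF has no handler for, so every attempt ends in a rejection and a log line on the PF side.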
PF Driver - latest ixgbe version is 3.10.16
VF Driver - latest ixgbevf version is 2.6.2
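When checking installed drivers against these versions, note that version strings should be compared numerically rather than as text: as plain strings, "3.2.9" sorts after "3.10.16". A minimal Python sketch follows, using only the version numbers quoted in this thread; the helper names parse_version and is_older are hypothetical, and dropping suffixes such as "-k" is a simplifying assumption.

```python
import re

def parse_version(text):
    """Turn a driver version such as '2.2.0-k' into a tuple of ints.

    Any suffix after the dotted numbers (e.g. '-k', '-k0') is ignored,
    which is a simplification made for this sketch.
    """
    match = re.match(r"(\d+(?:\.\d+)*)", text)
    if match is None:
        raise ValueError("no numeric version in %r" % text)
    return tuple(int(part) for part in match.group(1).split("."))

def is_older(installed, recommended):
    """True if the installed version is numerically below the recommended one."""
    return parse_version(installed) < parse_version(recommended)

# Versions quoted in this thread.
installed = {"ixgbe": "3.2.9", "ixgbevf (host)": "1.0.19-k0", "ixgbevf (guest)": "2.2.0-k"}
recommended = {"ixgbe": "3.10.16", "ixgbevf (host)": "2.6.2", "ixgbevf (guest)": "2.6.2"}

for name, version in installed.items():
    status = "older" if is_older(version, recommended[name]) else "current"
    print(name, version, "->", recommended[name], status)
```

The installed version of a given interface's driver can be read with `ethtool -i eth103`.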