idata
Community Manager
1,569 Views

SRIOV PF/VFs suddenly stopped working & tx/rx queues don't seem to be operational

We have a strange problem on one of our servers using an Intel 82599 SRIOV NIC. The server worked fine for almost 8 months, with the SRIOV PF/VFs working as expected. Suddenly we ran into an issue where one of the PFs doesn't seem to be working. We need help in isolating whether the SRIOV PF has failed in hardware or whether this is a software problem.

The ethtool offline tests currently pass, but running them leaves the dmesg message shown below the results:

# ethtool -t eth103 offline

The test result is PASS

The test extra info:

Register test (offline) 0

Eeprom test (offline) 0

Interrupt test (offline) 0

Loopback test (offline) 0

Link test (on/offline) 0

[895552.667586] ixgbe: eth103: ixgbe_disable_rx_queue: RXDCTL.ENABLE on Rx queue 64 not cleared within the polling period

Also show-ring shows

# ethtool --show-ring eth103

Ring parameters for eth103:

Pre-set maximums:

RX: 4096

RX Mini: 0

RX Jumbo: 0

TX: 4096

Current hardware settings:

RX: 64

RX Mini: 0

RX Jumbo: 0

TX: 64

The current RX/TX ring setting is only 64, whereas it previously showed 512.
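A check like this can be scripted so the drop is caught early. The following is a minimal sketch that parses the `ethtool --show-ring` text shown above and compares the current values against an expected size. The helper names (`parse_ring_settings`, `ring_mismatches`) and the expected value of 512 passed in are assumptions for illustration; the sample text is copied from the output in this post.

```python
SAMPLE = """Ring parameters for eth103:
Pre-set maximums:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
Current hardware settings:
RX: 64
RX Mini: 0
RX Jumbo: 0
TX: 64
"""

# Section headers as they appear in the ethtool output above.
SECTIONS = {"Pre-set maximums:": "max", "Current hardware settings:": "current"}


def parse_ring_settings(text):
    """Parse `ethtool --show-ring` text into {'max': {...}, 'current': {...}}."""
    result = {"max": {}, "current": {}}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # the forum paste has blank lines between entries
        if line in SECTIONS:
            section = SECTIONS[line]
            continue
        if section is None:
            continue  # e.g. the "Ring parameters for ..." title line
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            result[section][key.strip()] = int(value.strip())
    return result


def ring_mismatches(settings, expected):
    """Return {name: (current, expected)} for every value that differs."""
    current = settings["current"]
    return {k: (current.get(k), v) for k, v in expected.items() if current.get(k) != v}


if __name__ == "__main__":
    settings = parse_ring_settings(SAMPLE)
    print(ring_mismatches(settings, {"RX": 512, "TX": 512}))
```

In practice the text would come from running `ethtool --show-ring eth103` rather than from a hard-coded string.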

We have some VMs that have SRIOV VFs from this bad SRIOV PF PCI-assigned to them. They also run into the same issue. We added some debug prints to the ixgbevf driver and saw that ixgbevf_reset_hw_vf(), which gets called at init, fails at

ret_val = mbx->ops.read_posted(hw, msgbuf, IXGBE_VF_PERMADDR_MSG_LEN);

with the following error

[ 3.484162] ixgbevf: read_posted retval:-100 (IXGBE_ERR_MBX)
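For readers unfamiliar with what a "posted" mailbox read does, here is a toy model of the idea: the VF polls for a message from the PF a bounded number of times and gives up with an error if nothing arrives. This is a sketch under assumptions, not the driver's code. The callback names, the poll count, and the tuple return shape are hypothetical; only the -100 value is taken from the log line above.

```python
IXGBE_ERR_MBX = -100  # value as printed in the log line above


def read_posted(check_for_msg, read_msg, timeout=2000):
    """Toy model: poll until the PF posts a message, then read it.

    check_for_msg and read_msg are hypothetical callbacks standing in for
    the hardware mailbox accessors. Returns (status, message).
    """
    for _ in range(timeout):
        if check_for_msg():
            return 0, read_msg()
        # A real driver would delay briefly between polls here.
    return IXGBE_ERR_MBX, None


if __name__ == "__main__":
    # A PF that never answers: the read times out with the mailbox error.
    print(read_posted(lambda: False, lambda: [1, 2, 3], timeout=5))

    # A PF that answers on the third poll: the read succeeds.
    polls = iter([False, False, True])
    print(read_posted(lambda: next(polls), lambda: [1, 2, 3], timeout=5))
```

Under this model, a PF that has stopped servicing its mailbox produces exactly the symptom seen here: the VF's first read during reset never completes and the mailbox error is returned.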

The link status of the SRIOV PF seems to be fine

# ip link show dev eth103

5: eth103: mtu 1500 qdisc mq state UP qlen 1000

link/ether 00:1b:21:a3:94:39 brd ff:ff:ff:ff:ff:ff

vf 0 MAC 02:17:3e:67:a0:f8

vf 1 MAC 02:17:3e:45:bf:4a

vf 2 MAC 02:17:3e:78:d2:d7

vf 3 MAC 02:17:3e:1a:fb:c6

vf 4 MAC 02:17:3e:58:35:8d

vf 5 MAC 02:17:3e:52:ae:4c

vf 6 MAC 02:17:3e:62:2d:b9

vf 7 MAC 02:17:3e:24:ae:e3

vf 8 MAC 02:17:3e:22:35:2b

vf 9 MAC 02:17:3e:59:86:40

vf 10 MAC 02:17:3e:6f:9c:de

vf 11 MAC 02:17:3e:13:0a:c1

vf 12 MAC 02:17:3e:24:b5:79

vf 13 MAC 02:17:3e:2d:e1:2a

vf 14 MAC 02:17:3e:0c:11:df

vf 15 MAC 02:17:3e:7b:82:d2

vf 16 MAC 02:17:3e:43:5c:8d

vf 17 MAC 02:17:3e:54:ed:b2

vf 18 MAC 02:17:3e:70:8f:53

vf 19 MAC 02:17:3e:55:8d:2f

vf 20 MAC 02:17:3e:72:18:20

vf 21 MAC 02:17:3e:12:ff:95

vf 22 MAC 02:17:3e:71:d8:4d

vf 23 MAC 02:17:3e:27:eb:9f

vf 24 MAC 02:17:3e:29:7a:ad

vf 25 MAC 02:17:3e:2c:e9:4e

vf 26 MAC 02:17:3e:15:ce:57

vf 27 MAC 02:17:3e:6d:61:2c

vf 28 MAC 02:17:3e:4c:24:4d

vf 29 MAC 02:17:3e:4c:ab:7e

vf 30 MAC 1e:f8:b3:79:75:b2

vf 31 MAC 02:02:2f:eb:73:1e

So, essentially the mailbox and the tx/rx queues don't appear to work.

Dump of all registers with ethtool on this PF can be found here

https://docs.google.com/document/d/1u-QY4vwpri_l_NZii8mrnfB0bPy1OlprwICj9rM9S_o/edit

Our setup:

# Physical servers run ubuntu-natty (11.04) with linux-kernel 2.6.38-8-server. We are running ixgbe driver 3.2.9, which we compiled locally to disable MAC anti-spoofing (primarily, we always call hw->mac.ops.set_mac_anti_spoofing with the disabled flag). We did this to enable bonding of SRIOV VFs within VMs.

# At the physical server level we use ixgbevf 1.0.19-k0 and expose/use a couple of SRIOV VFs locally within the physical server for bonding. Primarily, we set up a Linux active-backup bond across SRIOV VFs from two different SRIOV PFs.

# We run several KVM VMs on these servers, running ubuntu-precise (12.04) with linux-kernel 3.2.0-25-generic and ixgbevf driver version 2.2.0-k. These VMs have SRIOV VFs PCI-attached, and they in turn set up active-backup bonds across VFs from different SRIOV PFs.

# We set up bonds primarily for failover, and at the same time use SRIOV for performance.

We don't know if this problem will go away upon a power-cycle of the server. We are keeping the server in its current state in case more live state information is required. Please let us know if any more state information would help in isolating this problem.

Any help appreciated.

Thanks

Shyam

9 Replies
Patrick_K_Intel1
Employee
75 Views

I will do some digging and see if I can find anything.

Did you happen to have any recent updates applied to your OS?

thanx,

Patrick

idata
Community Manager
75 Views

Thanks Patrick. No, we didn't apply any recent updates at the ixgbe driver level. We had some issues moving to 3.2.17 (sometimes SRIOV VFs were spawned without IRQs attached), so we moved back down to 3.2.9, which was stable.

Patrick_K_Intel1
Employee
75 Views

I was actually thinking along the lines of any OS/Kernel updates; however since you haven't rebooted in so long, that would be unlikely.

My experts are not sure what the source of your problem is. The error is, as you pointed out, that the mailbox communication stopped working. We can't tell from the description whether it is a software (driver or kernel) problem or a hardware problem.

All we can suggest is to save the kernel and dmesg logs and reboot. If the PF and VFs work after reboot, we are more inclined to believe it is a software problem of some sort; otherwise, a hardware failure.

Also, before you reboot, please dump the registers with the ethregs tool:

http://sourceforge.net/projects/e1000/files/Ethregs%20-%20Register%20Dump%20Tool/

If you post it, I'll see if it provides any more useful information.

Wish I had a magic bullet for you.

thanx,

Patrick

ALyak
Beginner
75 Views

Hi Patrick,

please find the output of ethregs in:

https://docs.google.com/open?id=0ByBy89zr3kJNdEFUMk9SV0dXdjA

I just ran it on eth103 without any options; please let us know if you need us to re-run it with any particular options.

Thanks for your help,

Alex.

Patrick_K_Intel1
Employee
75 Views

Can you provide the dump from the PF?

ALyak
Beginner
75 Views

Hi Patrick,

I think the PF dump is also available in the attached file (it has all the VFs and then the PF). Otherwise, please let us know exactly which command you need us to run:

.....

03:00.0 (8086:10fb)

Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection

Name Value

~~~~ ~~~~~

CTRL 00000000

STATUS 000c8000

CTRL_EXT 10010000

ESDP 00000876

I2CCTL 0000000f

FRTIMER 30bef1b6

TCPTIMER 00000000

PFVFLRE[1] 00000000

LEDCTL 45444140

PFVFLRE[0] 00000000

PFVFLREC[0] deadbeef

PFVFLREC[1] deadbeef

PFVFLREC[2] deadbeef

PFVFLREC[3] deadbeef

PFMBICR[0] 00000000

PFMBICR[1] 00000000

PFMBICR[2] 00000000

PFMBICR[3] 00000000

PFMBIMR[0] ffffffff

PFMBIMR[1] ffffffff

PFMBIMR[2] ffffffff

PFMBIMR[3] ffffffff

EICS 00000000

EIAC 4000ffff

EITR[000] 000001e8

EITR[001] 000003d0

EITR[002] 00000798

EITR[003] 00000000

EITR[004] 00000000

EITR[005] 00000000

EITR[006] 00000000

EITR[007] 00000000

EITR[008] 00000000

EITR[009] 00000000

EITR[010] 00000000

EITR[011] 00000000

EITR[012] 00000000

EITR[013] 00000000

EITR[014] 00000000

EITR[015] 00000000

EITR[016] 00000000

....

idata
Community Manager
75 Views

Hi Patrick,

We rebooted the server, and now the SRIOV PF/VFs are working fine. So it looks like it's a software issue. Can you please check whether the ethregs/ethtool dumps above provide any further info on the issue?

Thanks.

--Shyam

idata
Community Manager
75 Views

Hi Patrick,

One more question on this. Is there a version compatibility requirement between ixgbe & ixgbevf?

We are currently on ixgbe 3.2.9 and we have two versions of ixgbevf interfacing with the same NIC: ixgbevf 1.0.19-k0 & ixgbevf 2.2.0-k. Would it be an issue to have differing versions of ixgbevf working simultaneously on the same NIC?

One thing we observed was

ixgbevf 2.2.0-k has a new msg code 0x6

Though we don't use MAC VLAN, the driver clears MAC VLAN as shown below:

ixgbevf_set_uc_addr_vf (IXGBE_VF_SET_MACVLAN)

hw->mac.ops.set_uc_addr(hw, 0, NULL);

However, ixgbe 3.2.9 doesn't understand IXGBE_VF_SET_MACVLAN and prints a message like this:

[1020846.780262] ixgbe: eth103: ixgbe_rcv_msg_from_vf: Unhandled Msg 00000006

This happens very frequently (for some reason ixgbevf keeps doing this almost every 2 seconds), and ixgbe keeps printing this message.

We don't know whether there are other such incompatibilities that could result in this behaviour. Any insights appreciated.
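To put a number on "very frequently", the interval between these messages can be measured from a saved dmesg log. The sketch below is one way to do it, assuming the line format shown above. The first sample line is the one quoted in this post; the other two are synthetic lines made up for illustration, spaced 2 seconds apart. The function names are hypothetical.

```python
import re

# Matches lines in the format quoted above; the timestamp is seconds since boot.
UNHANDLED = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+ixgbe: (?P<dev>\S+): "
    r"ixgbe_rcv_msg_from_vf: Unhandled Msg (?P<code>[0-9a-fA-F]{8})"
)

SAMPLE_LINES = [
    "[1020846.780262] ixgbe: eth103: ixgbe_rcv_msg_from_vf: Unhandled Msg 00000006",
    # The two lines below are synthetic, added only to demonstrate the interval.
    "[1020848.780262] ixgbe: eth103: ixgbe_rcv_msg_from_vf: Unhandled Msg 00000006",
    "[1020850.780262] ixgbe: eth103: ixgbe_rcv_msg_from_vf: Unhandled Msg 00000006",
]


def unhandled_events(lines):
    """Return (timestamp, device, message code) for each matching dmesg line."""
    events = []
    for line in lines:
        match = UNHANDLED.match(line.strip())
        if match:
            events.append(
                (float(match.group("ts")), match.group("dev"), int(match.group("code"), 16))
            )
    return events


def intervals(events):
    """Seconds between consecutive events, rounded to microseconds."""
    times = [ts for ts, _, _ in events]
    return [round(later - earlier, 6) for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    events = unhandled_events(SAMPLE_LINES)
    print(events[0])
    print(intervals(events))
```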

Thanks

Shyam

Patrick_K_Intel1
Employee
75 Views

We are not sure why your PF seemed to freeze. We will keep an eye out for such behavior, thanks for bringing it to our attention.

As for your PF/VF alignment: they are fairly tightly coupled. The way the VF driver communicates with the PF driver is through messages in the mailbox. If one side doesn't understand the other, such an error will occur.

I'd recommend updating both the PF and VF drivers to the latest versions available from our SourceForge site. URL below:

http://sourceforge.net/projects/e1000/files/

PF Driver - latest ixgbe version is 3.10.16

VF Driver - latest ixgbevf version is 2.6.2
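The PF/VF coupling described in this thread can be illustrated with a toy model: a PF that only knows an older set of mailbox message codes receives a newer code from a VF and cannot handle it. This is a sketch under assumptions, not driver code. The class, the ACK/NACK strings, and the low-16-bit masking are hypothetical. The 0x6 code and the log format come from this thread; the other code values are to the best of my knowledge and should be checked against the driver headers.

```python
# Mailbox message codes. 0x06 is quoted in this thread; the others are
# assumptions that should be verified against the driver's mailbox header.
IXGBE_VF_RESET = 0x01
IXGBE_VF_SET_MAC_ADDR = 0x02
IXGBE_VF_SET_MULTICAST = 0x03
IXGBE_VF_SET_VLAN = 0x04
IXGBE_VF_SET_LPE = 0x05
IXGBE_VF_SET_MACVLAN = 0x06


class ToyPF:
    """Hypothetical stand-in for a PF driver's mailbox message handler."""

    def __init__(self, name, supported_codes):
        self.name = name
        self.supported = set(supported_codes)
        self.log = []

    def rcv_msg_from_vf(self, msg):
        """Acknowledge known codes; log and reject anything else."""
        code = msg & 0xFFFF  # assumption: the code sits in the low 16 bits
        if code in self.supported:
            return "ACK"
        self.log.append(
            f"ixgbe: {self.name}: ixgbe_rcv_msg_from_vf: Unhandled Msg {msg:08x}"
        )
        return "NACK"


if __name__ == "__main__":
    # An "older" PF that predates the MACVLAN message.
    old_pf = ToyPF("eth103", range(IXGBE_VF_RESET, IXGBE_VF_SET_LPE + 1))
    print(old_pf.rcv_msg_from_vf(IXGBE_VF_SET_MAC_ADDR))
    print(old_pf.rcv_msg_from_vf(IXGBE_VF_SET_MACVLAN))
    print(old_pf.log)
```

In this model the unknown message is simply rejected and logged, which matches the repeated "Unhandled Msg 00000006" lines reported above; whether such a mismatch can also contribute to a stuck mailbox is not established in this thread.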