
Failover not working on X520 NIC with ixgbevf driver

GORaw1
Beginner

Hi,

 

I asked this question previously, but the previous suggestion from https://access.redhat.com/solutions/27863 did not apply.

 

That answer covers bonding in the hypervisor for a NIC that is connected to a vSwitch, with VMs then attached to the vSwitch, i.e. it is a non-SR-IOV case.

 

Can you suggest any other possible resolution?

 

I have the following setup:

  • OpenStack virtual environment with an Intel X520 NIC
  • Hypervisor using the ixgbe driver
  • Virtual machine using the ixgbevf driver (version 4.6.1) on Red Hat Enterprise Linux 7.6
  • VM interfaces are bonded in active-standby mode on ingress and egress

 

In the normal state everything is fine and the bond interfaces are operational, e.g.:

 

$ more /proc/net/bonding/bond0

Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

 

Bonding Mode: fault-tolerance (active-backup)

Primary Slave: None

Currently Active Slave: eth2

MII Status: up

MII Polling Interval (ms): 0

Up Delay (ms): 0

Down Delay (ms): 0

 

Slave Interface: eth2

MII Status: up

Speed: 10000 Mbps

Duplex: full

Link Failure Count: 0

Permanent HW addr: fa:16:3e:07:55:40

Slave queue ID: 0

 

Slave Interface: eth3

MII Status: up

Speed: 10000 Mbps

Duplex: full

Link Failure Count: 0

Permanent HW addr: fa:16:3e:7c:dd:ab

Slave queue ID: 0

 

However, when the primary interface ens2f0 is taken down on the hypervisor, failover to the eth3 interface on the VM does not occur and traffic fails:

 

$ more /proc/net/bonding/bond0

Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

 

Bonding Mode: fault-tolerance (active-backup)

Primary Slave: None

Currently Active Slave: eth2

MII Status: up

MII Polling Interval (ms): 0

Up Delay (ms): 0

Down Delay (ms): 0

 

Slave Interface: eth2

MII Status: up

Speed: Unknown

Duplex: Unknown

Link Failure Count: 0

Permanent HW addr: fa:16:3e:07:55:40

Slave queue ID: 0

 

Slave Interface: eth3

MII Status: up

Speed: 10000 Mbps

Duplex: full

Link Failure Count: 0

Permanent HW addr: fa:16:3e:7c:dd:ab

Slave queue ID: 0

 

eth2 is still marked as UP by the bonding module, and remains the currently active slave, even though the link is down:

 

$ ethtool eth2 | grep detected

Link detected: no

 

There aren't any errors logged on either the VM or the hypervisor, except for the following:

 

On the VM:

kernel: ixgbevf 0000:00:06.0 eth2: NIC Link is Down

kernel: ixgbevf 0000:00:08.0 eth4: NIC Link is Down

 

On the hypervisor:

kernel: ixgbe 0000:37:00.0: removed PHC on ens2f0

 

When the ens2f0 interface on the hypervisor is brought up again, everything is fine; i.e. the bond appears to work only while the original primary interface is running.

 

Are there any known issues with this driver in this setup (it is the latest driver version), or could there be a configuration setting that is not being applied?

 

Thanks

 

Greg

 

 

Caguicla_Intel
Moderator

Hello GORaw1,

 

Thank you for posting in Intel Ethernet Communities. 

 

Please allow us to check on this further. Rest assured that we will provide an update within 1-3 business days.

 

Hoping for your patience.

 

Best regards,

Crisselle C

Intel Customer Support

A Contingent Worker at Intel

Caguicla_Intel
Moderator

Hello GORaw1,

 

Good day!

 

Kindly provide the Linux kernel logs from both host and guest. This will help us in further checking your request.

 

Looking forward to hearing from you.

 

Best regards,

Crisselle C

Intel Customer Support

A Contingent Worker at Intel

GORaw1
Beginner

Hi,

 

I have attached the /var/log/messages files. Please let me know if anything else is required.

 

Thanks

Greg

GORaw1
Beginner

The guest messages files are attached as well.

Caguicla_Intel
Moderator

Hello GORaw1,

 

Thank you for the reply. We will check the provided details and get back to you as soon as possible, but no later than 3 business days.

 

Hoping for your patience.

 

Best regards,

Crisselle C

Intel Customer Support

A Contingent Worker at Intel

GORaw1
Beginner

Thanks, please let me know of any updates.

 

Thanks

Greg

Caguicla_Intel
Moderator

Hello GORaw1,

 

Good day!

 

Please be informed that we are still checking on your request. We will get back to you within 1-3 business days.

 

Thank you for the patience and kind understanding on this matter. 

 

Best regards,

Crisselle C

Intel Customer Support

A Contingent Worker at Intel

Caguicla_Intel
Moderator

Hello GORaw1,

 

I hope you are doing great!

 

We sincerely apologize for the delay on this matter as we are thoroughly checking on it. Please give us more time to look into this. Rest assured that we will update you as soon as there are any findings, but no later than 5 business days.

 

Best regards,

Crisselle C

Intel Customer Support

A Contingent Worker at Intel

Caguicla_Intel
Moderator

Hello GORaw1,

 

Thank you very much for the patience on this matter.

 

After further checking with our engineering team, they have identified what might be the cause of the issue.

 

They see that the MII Polling Interval on the bond is set to 0 (this value can be set manually using a parameter called 'miimon'). This setting determines how often the link state of each slave is inspected for link failures, and a value of zero disables MII link monitoring. Alternatively, ARP monitoring could be used by setting the parameters arp_interval and arp_ip_target.

 

From the provided information, they don't see any ARP settings configured, so the disabled MII polling appears to be the issue. They have seen 100 ms recommended in the documentation as a standard value to use.

 

You may try this solution by running the command:

echo 100 > /sys/class/net/bond0/bonding/miimon
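
If you would prefer the ARP monitoring alternative mentioned above, a rough sketch using the same sysfs interface could look like the following (the 192.168.1.1 target is only a placeholder and must be an IP address reachable through the bond; note that MII and ARP monitoring are mutually exclusive):

# Sketch only; values are examples, not tested on your setup
echo 1000 > /sys/class/net/bond0/bonding/arp_interval        # probe the target every 1000 ms
echo +192.168.1.1 > /sys/class/net/bond0/bonding/arp_ip_target   # '+' adds a target; placeholder IP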

 

With this, we'd like to check whether you are willing to try this and share the result with us.

 

Looking forward to your response.

 

Best regards,

Crisselle C

Intel Customer Support

A Contingent Worker at Intel

GORaw1
Beginner

Hi,

 

Thanks for the reply.

 

I had miimon=100 configured in the bond ifcfg file, but this does not seem to take effect here.

 

# more ifcfg-bond0

DEVICE=bond0

BONDING_OPTS=mode=1 primary=eth2 miimon=100

TYPE=Bond

BONDING_MASTER=yes

BOOTPROTO=none

ONBOOT=yes

...

 

However, this is not reflected in the /sys/class/net/bond0/bonding/miimon file:

 

$  more /sys/class/net/bond0/bonding/miimon

0

 

When I write 100 to this file it works: MII monitoring starts and failover happens correctly.

 

The same issue happens when the "primary" attribute is set in the BONDING_OPTS parameter in the ifcfg file; here the value in the /sys/class file is empty:

 

x520_fe_new2-fe-0$ more /sys/class/net/bond0/bonding/primary

x520_fe_new2-fe-0$ 
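
I assume the primary attribute could likewise be written at runtime through sysfs, for example:

echo eth2 > /sys/class/net/bond0/bonding/primary   # not tested here; eth2 is the intended primary slave

but the point is that the value from the ifcfg file is not being picked up.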

 

Is there a known issue with these bonding attributes not being applied for this driver? It seems unlikely to be a problem in the bonding driver itself, otherwise bonding and failover would not work at all for any type of NIC.

 

Thanks

Greg

 

 

 

Caguicla_Intel
Moderator

Hello Greg,

 

You are welcome.

 

We are glad to hear that you have tested it and were able to give us an update on this matter. We will forward this to our engineers and get back to you within 3-5 business days.

 

Hoping for your patience.

 

Best regards,

Crisselle C

Intel Customer Support

A Contingent Worker at Intel

Caguicla_Intel
Moderator

Hello Greg,

 

Thank you for the patience on this matter.

 

We'd like to verify whether you have rebooted the system after making changes to the ifcfg-bond0 file. The settings in ifcfg-* files are applied only when the network service starts up.

 

Alternatively, you can restart the network service with the command 'systemctl restart network' if you don't want to reboot every time changes are made.

 

Additionally, we believe that the BONDING_OPTS list needs quotation marks around the parameters, as noted in RHEL examples. The line in the ifcfg-bond0 file should therefore look like:

 

BONDING_OPTS="mode=1 primary=eth2 miimon=100"

 

If this does not fix the behavior, please share your exact step-by-step process for configuring bonding so we can better troubleshoot the issue.
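
For reference, based on the file you shared earlier, the corrected ifcfg-bond0 might look like this (only the BONDING_OPTS line changes; the other lines are carried over from your configuration):

DEVICE=bond0
BONDING_OPTS="mode=1 primary=eth2 miimon=100"
TYPE=Bond
BONDING_MASTER=yes
BOOTPROTO=none
ONBOOT=yes
...

After a network service restart, the value should then be visible in sysfs, e.g.:

systemctl restart network
more /sys/class/net/bond0/bonding/miimon   # should now report 100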

 

Hoping to hear an update from you.

 

Best regards,

Crisselle C

Intel Customer Support

A Contingent Worker at Intel

GORaw1
Beginner

Hi,

 

The issue does appear to have been a configuration problem with the BONDING_OPTS parameter in the bond ifcfg file. Once the options are enclosed in double quotes, the "miimon" parameter is read correctly and monitoring of the slave interfaces starts. Bonding behaves correctly and the interfaces fail over as expected.
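
For anyone hitting this later, the value can be confirmed after a network restart with:

more /sys/class/net/bond0/bonding/miimon

which now shows 100 instead of 0, and /proc/net/bonding/bond0 reports the polling interval accordingly.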

 

Thanks for your help with this issue; it now appears to be resolved.

 

Thanks

Greg

Caguicla_Intel
Moderator

Hello Greg,

 

Good day!

 

We'd like to check whether you were able to read our previous post above. We would highly appreciate it if you could share an update regarding this request.

 

Looking forward to your reply.

 

Best regards,

Crisselle C

Intel Customer Support

A Contingent Worker at Intel

AlfredoS_Intel
Moderator

Hi GORaw1,

 

I am just sending another follow-up in case you still have questions or clarifications.

Since we have not heard back from you, I will close this inquiry now.

 

If you need further assistance, please post a new question.

 

Best regards,

Alfred S.

Intel Customer Support

A Contingent Worker at Intel
