FreeBSD ixl (XL710) driver & LAGG/LACP redux

DMarq1 · ‎04-27-2016

Okay, my original problem was:

I had marked it as resolved since I thought I had found the issue and I didn't want to waste anybody's time. Sadly, that turned out not to be the case. In my original issue I had tagged VLAN interfaces on top of the lagg. I completely removed those to make testing easier. So now I'm just left with the 4 ixl# interfaces (since I have the 4x10GE card) and the lagg on top. I've tried configuring the lagg in both lacp mode and loadbalance (Cisco EtherChannel) mode.

To make sure it wasn't the LACP bug I had previously read about (and suspected), I set the HPE switch to static LAGG and configured it to hash on source_ip + source_port + destination_ip + destination_port (instead of the default of source_mac + destination_mac). I then set the FreeBSD side to:

laggproto loadbalance lagghash l3,l4

To match.
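For anyone following along, here is a rough Python sketch of what an l3,l4 hash policy means in practice: the 4-tuple of a flow is hashed and the result picks one member port, so a given TCP connection should always leave on the same ixl port. This is only an illustration. The hash function (CRC32 here), the `pick_port` helper, and the sample addresses and ports are assumptions for the example, not what lagg(4) or the HPE switch actually compute.

```python
import ipaddress
import zlib

PORTS = ["ixl0", "ixl1", "ixl2", "ixl3"]  # lagg members from the config


def pick_port(src_ip, dst_ip, src_port, dst_port, ports=PORTS):
    """Hypothetical flow hash: map an L3+L4 4-tuple to one member port."""
    key = (
        ipaddress.ip_address(src_ip).packed
        + ipaddress.ip_address(dst_ip).packed
        + src_port.to_bytes(2, "big")
        + dst_port.to_bytes(2, "big")
    )
    # CRC32 stands in for the real hash; only the "hash modulo port count"
    # idea matters for this illustration.
    return ports[zlib.crc32(key) % len(ports)]


if __name__ == "__main__":
    # The same flow maps to the same port every time it is hashed.
    first = pick_port("192.168.4.101", "192.168.0.1", 50000, 443)
    again = pick_port("192.168.4.101", "192.168.0.1", 50000, 443)
    print(first, again)
```

The point of matching the policy on both ends is only that each side spreads flows by the same fields; the two sides do not have to pick the same physical link for a given flow.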

So the revised config is:

ifconfig_ixl0="mtu 9000 up"
ifconfig_ixl1="mtu 9000 up"
ifconfig_ixl2="mtu 9000 up"
ifconfig_ixl3="mtu 9000 up"
cloned_interfaces="lagg0 tap0 bridge0"
ifconfig_lagg0="laggproto loadbalance lagghash l3,l4 laggport ixl0 laggport ixl1 laggport ixl2 laggport ixl3 mtu 9000"
ifconfig_tap0="mtu 9000"
ifconfig_bridge0="inet 192.168.4.101/24 addm lagg0 addm tap0 mtu 9000"
defaultrouter="192.168.4.1"

The symptoms are somewhat similar to:

https://lists.freebsd.org/pipermail/freebsd-net/2015-June/042593.html

I say this because nodes on the same subnet can sometimes be pinged and sometimes can't. When they can't, adding a static ARP entry seems to fix it. There's a patch in that thread, but that patch is already in ixl-1.4.27.

Here's the weird part. Right now it's in a state where pinging the local subnet seems fine (so far), and even pinging the default gateway (the HPE switch) works. However, when I try to ping through to the core Extreme Networks switch, 2 of the interfaces (VLANs) on it respond and 1 doesn't:

E.g., 192.168.0.1 pings, 192.168.1.1 pings, 192.168.2.1 does NOT ping. However, another host with the exact same config except using ix instead of ixl (so X520 instead of XL710) doesn't have this problem. Since these IPs are outside the local subnet, the host never ARPs for them directly (only for the gateway, which pings fine), so ARP isn't the problem. It's not a route/return route problem either since, like I said, the other host works fine.
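To spell out the ARP reasoning, here is a small Python sketch using the addresses from the config above. It assumes the only routes are the connected 192.168.4.0/24 network and the default route, which is a simplification of a real routing table; `next_hop` is a hypothetical helper written for this example.

```python
import ipaddress

LOCAL_NET = ipaddress.ip_network("192.168.4.0/24")    # from ifconfig_bridge0
DEFAULT_ROUTER = ipaddress.ip_address("192.168.4.1")  # from defaultrouter


def next_hop(dst):
    """Return the address the host has to ARP for in order to reach dst."""
    dst = ipaddress.ip_address(dst)
    # On-link destinations are resolved directly; everything else goes
    # to the default router.
    return dst if dst in LOCAL_NET else DEFAULT_ROUTER


if __name__ == "__main__":
    for target in ("192.168.0.1", "192.168.1.1", "192.168.2.1"):
        print(target, "->", next_hop(target))
```

Under that assumption all three test addresses resolve to the same next hop, so whatever makes 192.168.2.1 unreachable is not an ARP entry specific to that address.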

Connections that do work don't stay working, e.g.:

# svnlite checkout https://svn.FreeBSD.org/base/head/ /usr/src

...
A sys/dev/hptmv/hptproc.c
A sys/dev/hptmv/mv.c
A sys/dev/hptmv/entry.c
A sys/dev/hptmv/osbsd.h
A sys/dev/hptmv/array.h
A sys/dev/hptmv/access601.h
A sys/dev/hptmv/hptintf.h
A sys/dev/hptmv/amd64-elf.raid.o.uu
svn: E000060: Error running context: Operation timed out

So it works for about 30 seconds or so and then just stops.

If I rekick this node using Ubuntu Linux 16.04 LTS or ESXi 6.0U2, everything works great without touching any configuration on the network side. So it really seems to be an issue with the FreeBSD driver (and not the card's NVM) when coupled with the lagg driver.