We have an MFSYS25 chassis with two switches and a full set of compute modules running Linux. I'm encountering a problem where the host interfaces corresponding to one of the switches (i.e. eth2 and eth3) flap up and down constantly, several times per second. This happens on all the compute modules, so I assume it's not just a bad interface card on one of them. Rebooting or re-seating the switch clears up the problem and causes all the interfaces to come up, but the next time they go down and come back up (e.g. the host is rebooted or an interface is reconfigured) they start flapping again and will not recover unless the switch is rebooted.
Nothing shows up in the port diagnostics or anywhere else on the switch that I'm aware of. I assumed this was a bad switch module and swapped it with a spare, but the problem persists. Perhaps an issue with the chassis back/mid-plane? Nothing shows up in the general communications diagnostics either...
Anyone encounter this? Anything I can do to diagnose?
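One way to put a number on the flapping from the host side is to poll the kernel's carrier flag for the affected interfaces and count how often it changes. This is a minimal sketch, assuming a Linux host that exposes `/sys/class/net/<iface>/carrier`; the function names, the 50 ms polling interval, and the 5 second window are illustrative choices and have nothing to do with the MFSYS25 tooling itself.

```python
import time
from pathlib import Path


def count_transitions(samples):
    """Count how many times consecutive samples differ."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)


def read_carrier(iface, sysfs_root="/sys/class/net"):
    """Return '1' (link up), '0' (link down), or '?' if the flag is unreadable.

    Reading 'carrier' fails while the interface is administratively down,
    so that case is reported as '?' instead of raising.
    """
    try:
        return (Path(sysfs_root) / iface / "carrier").read_text().strip()
    except OSError:
        return "?"


def sample_carrier(iface, duration_s=5.0, interval_s=0.05, reader=read_carrier):
    """Poll the carrier flag for duration_s seconds and return the samples."""
    samples = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        samples.append(reader(iface))
        time.sleep(interval_s)
    return samples
```

For example, `count_transitions(sample_carrier("eth2"))` gives a rough flap count for eth2 over the sampling window. Polling can miss transitions that happen faster than the interval, so the result is a lower bound; comparing it against an interface on the first switch (e.g. eth0) shows whether only the second switch's ports are affected.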
I assume you have four NICs on each compute module: two onboard NICs and two from the mezzanine card. Note that the two onboard NICs go to the first switch, and the two from the mezzanine card go to the second switch. You can configure fault tolerance between the NICs, or connect them to different networks/VLANs, but they should not be connected to the same network. What's your network configuration? Check whether there is a loop in the network; if there is, STP may be disabling the port to break the loop.
Yes, we have the NIC mezzanine card on each host hooked up to the second switch. The problem is only happening on the host interfaces connected to the second switch.
Both switches are hooked up to the same upstream switch (same VLAN/network), but not in such a way that would cause a bridge loop (i.e. the MFSYS switches are not hooked up to each other, and the hosts are not bridging across interfaces). When the problem is happening, the interface shows 'forwarding' as the STP status as expected. Also, if the interface were disabled by STP I would not expect it to flap up and down rapidly like this.
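The claim that the hosts are not bridging across interfaces can be double-checked on each host from sysfs. This is a rough sketch, assuming a Linux host where an enslaved interface has a `master` symlink and a bridge port has a `brport` directory under `/sys/class/net/<iface>/`; the helper names are made up for illustration. Note that a bonding master shows up through `master` as well, so a non-empty result is not necessarily a bridge.

```python
import os


def upper_device(iface, sysfs_root="/sys/class/net"):
    """Return the name of the device this interface is enslaved to, or None."""
    link = os.path.join(sysfs_root, iface, "master")
    if not os.path.islink(link):
        return None
    return os.path.basename(os.readlink(link))


def is_bridge_port(iface, sysfs_root="/sys/class/net"):
    """Return True if the interface is currently a port of a Linux bridge."""
    return os.path.isdir(os.path.join(sysfs_root, iface, "brport"))
```

If `is_bridge_port` is false for eth0 through eth3 on every compute module, the hosts themselves are not forwarding frames between the two switches.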
Furthermore, I just disabled (admin down) all external ports on the second switch and the problem still happens. Disabling port-fast on the host's ports likewise has no effect. Doesn't seem to be an issue with STP.
Any other suggestions?
I've reset switch 1 and it had no effect on the flapping interfaces on switch 2. Still unable to bring up the interface, and existing interfaces that were flapping continued to flap.
Physically removing the switch is a bit more involved since it's remote, and there's no way to shut the switch down from the management interface. I can do that if you think it will be helpful, but if this is just a way to determine whether it's an issue with a bridge loop, then I think resetting the switch should be equivalent...
I had the NOC physically remove the switch for a few minutes and there was no change. Interfaces on the second switch still flapped and would not come up, except those already up.
Replacing the switch is the first thing I tried... I swapped both switch 1 and switch 2 with replacements a few weeks back, and the issue remained on interfaces connected to switch 2.
I have not tried different driver versions. Seems unlikely that a particular driver would have problems only on one of two identical switches. Something to try if there are no better avenues, I guess...
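Before trying other driver versions, it may be worth recording which driver and version each interface is actually using, so the interfaces on the two switches can be compared. This is a minimal sketch, assuming the `ethtool` utility is installed on the host and that `ethtool -i <iface>` prints `key: value` lines; the function names are illustrative.

```python
import subprocess


def parse_ethtool_info(text):
    """Parse the 'key: value' lines printed by `ethtool -i` into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, since values such as bus-info
        # contain colons themselves.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def driver_info(iface):
    """Run `ethtool -i` for one interface and return the parsed fields."""
    result = subprocess.run(
        ["ethtool", "-i", iface],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_ethtool_info(result.stdout)
```

Comparing `driver_info("eth0")` with `driver_info("eth2")` shows whether the onboard and mezzanine NICs report the same `driver`, `version`, and `firmware-version` values.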