We are implementing a new Hyper-V cluster, and need some design help. In particular, we need guidance with the best way to accomplish switch redundancy using network teaming. The goal is to have the NIC team be plugged into two separate switches, so that if one switch goes down the networking for the virtual machines will continue to run.
Based on my own testing and a separate discussion thread, I noticed that if I create a VMLB team using two NICs plugged into two separate switches, one of the NICs is always marked as disabled in the Team Properties. My first question is about redundancy: does this still provide switch failover even though only one port is active? Should I use ALB (with RLB disabled) instead?
First of all, I should say that I don't have hands-on experience with VMLB teams, but I do have some understanding of how the team members behave.
I suspect that the broadcast probes used by default to detect the status of the other ports on the team are not making it across the network. If you disable probes, then both ports of the team should come active. You can configure probes on the Advanced tab of the adapter's Properties dialog in Windows Device Manager. First try unchecking Send probes. If all ports become active, then you have likely found the cause of one port being disabled.
By default, broadcast probes are sent out the primary team member port to the other member ports to make sure those ports are up. This works fairly well when the team member ports are connected to the same switch. If the probes are blocked or time out, then the secondary team members are disabled. If you disable probes, the teaming software instead watches for activity on the links to decide whether a port is active.
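To make the two monitoring modes concrete, here is a simplified sketch of the decision described above. This is purely illustrative; the function name and inputs are hypothetical, and the real Intel ANS driver logic is not public.

```python
# Hypothetical sketch of how a teaming driver might decide whether a
# secondary team member port is usable, based on the probe behavior
# described above. Not actual Intel ANS code.

def member_is_active(probes_enabled, probe_received, link_activity_seen):
    """Return True if a secondary team member port should stay active."""
    if probes_enabled:
        # With probes on, a secondary port is disabled unless the
        # broadcast/multicast probe from the primary port reaches it.
        return probe_received
    # With probes off, the driver falls back to watching link activity.
    return link_activity_seen

# A probe blocked between two separate switches disables the port:
print(member_is_active(True, False, True))   # False
# Disabling probes lets the same port stay active:
print(member_is_active(False, False, True))  # True
```

This is why the original poster sees one port disabled with two separate switches: the broadcast probe never crosses from one switch to the other.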
Probes can be configured for multicast as long as the multicast packets can make it to the other team members. You might be able to get that configuration to work if you want to keep probes enabled.
The VMLB team type exposes the MAC addresses of the Hyper-V virtual NICs to the outside and splits those NICs up among the available team member ports. All traffic for a given Hyper-V NIC is sent and received on the same team member port, so with a VMLB team you get to split up the receive traffic as well as the transmit traffic. With an ALB team, the team uses the physical adapter MAC addresses, and you won't be able to enable RLB. With either team mode, if one team member port becomes disabled because of a switch port, cable, or adapter failure, the Hyper-V NICs will fail over to a different team member port.
You will probably want to test failover with any team mode and probe configuration to make sure that the failover works as expected.
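For the failover test, something like the following harness can help quantify the outage: ping the VM once a second while you pull a cable or reboot a switch, then stop it with Ctrl+C to see the longest run of lost replies. The target address is an example, and the ping flags shown are the Windows ones.

```python
# Simple failover-test harness: ping a host continuously and report the
# worst observed outage (longest consecutive run of lost replies).
# The default target IP below is only an example.
import subprocess
import sys
import time

def longest_outage(results):
    """Given a list of True/False ping results (one per second),
    return the longest consecutive run of failures."""
    worst = run = 0
    for ok in results:
        run = 0 if ok else run + 1
        worst = max(worst, run)
    return worst

def ping_once(host):
    # Windows ping flags: -n 1 = one echo request, -w 1000 = 1 s timeout.
    return subprocess.call(["ping", "-n", "1", "-w", "1000", host],
                           stdout=subprocess.DEVNULL) == 0

if __name__ == "__main__" and len(sys.argv) > 1:
    host = sys.argv[1]
    results = []
    try:
        while True:
            results.append(ping_once(host))
            time.sleep(1)
    except KeyboardInterrupt:
        print("worst outage: about", longest_outage(results), "second(s)")
```

Run it as `python pingtest.py 192.168.1.50` (substituting a VM's address), perform the failover, and compare the reported outage across team modes.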
The Advanced Network Services Software white paper (http://www.intel.com/network/connectivity/resources/doc_library/white_papers/254031.pdf) has a lot of information about how the teams work and how they present MAC addresses. The paper is a little old, so the VMLB team type is not covered, but it is still very good at explaining how teaming works and can probably help you decide on the best teaming mode and configuration. "How do I use Teaming with Advanced Networking Services (ANS)?" (http://www.intel.com/support/network/sb/cs-009747.htm) is another resource for teaming information.
I hope you find this information helpful.
I am curious to hear back about what worked for you.
I'm currently working on a test lab with a similar goal, and ran into similar problems.
From the documentation (which is in the driver):
Virtual Machine Load Balancing (VMLB) provides transmit and receive traffic load balancing across Virtual Machines bound to the team interface, as well as fault tolerance in the event of switch port, cable, or adapter failure.
Note that switch port failure is covered, but not switch failure: although VMLB can handle port failures on a switch, all the team members have to be connected to the same switch. The only team type that does support switch redundancy is SFT (Switch Fault Tolerance), but it is limited to two team members with only one active at a time, so you effectively halve your potential bandwidth compared to VMLB, and lose even more if you would otherwise have used more than two ports.
I was trying two switches connected together with a LAG. Rebooting one switch could cause failed pings from one VM to another (on different hosts), and they wouldn't recover when the switch came back. So I'm going to try SFT instead.
If you have stacking switches, you may be able to get them to behave as if they are one switch, and then VMLB might work (I need to find a stacking cable for my switches to try this out) but then you have to worry about fault tolerance within the stack, especially if you have more than 2 switches stacked.
There may be features in more advanced switches than the ones I'm using for my lab (Netgear GS748TS) that let you connect team members to different switches, and those switches together, in a way that works with VMLB. If anyone reading this has experience, please post here.
Thanks Mark and Aitor for the input - I really appreciate it! I think I may have found both a temp fix and also a bug. I was following the steps from this Intel paper:
The first issue I discovered is that in order for the probes to work properly on the VMLB team, I had to configure at least one untagged VLAN on the switch port; I assume the probes travel only on an untagged VLAN.
After I got that fixed, the second issue I ran into was that I could send pings from the Hyper-V virtual machine to a physical host but never got any replies. The NIC inside the VM showed packets going out but none coming in. I confirmed the outbound traffic was arriving by checking the ARP table on the physical host I was pinging: it showed both the virtual machine's IP address and its MAC address.
I called Intel tech support, and what eventually worked was breaking the team, disabling Virtual Machine Queues (VMDq), and then re-forming the team and re-creating the virtual switch in Hyper-V. After that it started working. I still have further testing to do, but so far so good.
The tech created a ticket for me; someone will contact me after they have had a chance to research the issue and find a way to get the team working with VMDq enabled.
Thank you for the information. Please send me a private message with your service ticket number so I can follow up on the service ticket. Have a great day.
I've gotten a bit further with my testing. Unfortunately my network cards (well, LOMs actually) don't support VMDq, so I can't comment on those issues. I certainly haven't had to delete and rebuild a team to switch modes, or unbind the Hyper-V virtual switch from the team.
Anyway, I found that SFT works fine if you are connecting to two switches, and VMLB works fine with two switches if they are stacking switches (I had to dig out an HDMI (!) cable to stack the Netgears), since logically they then become one switch. I can pull out cables to simulate cable/port failure, and even pull the power on one of the units, with a continuous ping running from within one VM to another (on a different Hyper-V node). No problems - typically only a second or so of dropped pings, a few more if the stack master is powered down. The VMs were communicating over their own VLAN.
I guess this would also apply to the other non-SFT teaming modes - if you want switch redundancy, use stacking switches, not switches connected by LAGs.
Anyway, very happy, I can progress to the next stage of my lab...
I have two HP managed switches. They are trunked together via two 10 Gbps connections, but are not set up in "stacked" mode. So far so good on my testing: if I pull one cable, I drop at most one ping packet during failover. I am also doing some network performance testing from within the virtual machine and getting 800 Mbps, which is probably pretty good for a virtual machine running across a GigE physical connection. I haven't done any tuning, which might get me closer to 1 Gbps from within the virtual machine.
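For context on that 800 Mbps figure, a quick back-of-the-envelope check of how much of the GigE link's nominal rate that represents (ignoring protocol overhead, which would lower the achievable ceiling somewhat):

```python
# Fraction of a GigE link's nominal 1000 Mbps achieved from inside the
# VM, ignoring Ethernet/IP/TCP protocol overhead.
measured_mbps = 800
link_mbps = 1000
efficiency = measured_mbps / link_mbps
print(f"{efficiency:.0%} of line rate")  # prints "80% of line rate"
```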
Intel had me upload some debug files so that they can troubleshoot the issue with VMDq not working properly. I'll let you know once I have an update.