Is there a default rate limit for VFs created for 82599?
On our servers we create up to 32 VFs per NIC and assign them to VMs (PCI passthrough). The servers run Ubuntu with KVM.
We found that using a VF at the boot OS level gives only 2-3 Gbps of throughput. The same holds between VFs.
Measured using iperf running with single or multiple parallel threads.
In another of our environments we have X540 cards, and the same problem exists there as well.
Thanks for posting to the forum.
There is no rate limiting by default. If all VFs are in contention, each gets an equal share of the bandwidth. I have several blogs and papers detailing the rate-limiting goodness.
There are several factors when dealing with performance. One of the biggest is making sure the 10Gbps device is in an x8 PCIe slot. If you have it in anything less, your performance is going to suffer horribly.
After that, two VFs on the same port should communicate with each other at PCIe speeds, which should be in the neighborhood of 20+Gbps. I don't have the exact numbers committed to memory :-(.
The generation of processor/chipset you are using may also have a big impact.
Look at those things and let us know how it goes.
Thanks a lot for a quick response.
You are completely right about VF contention, but in this case the system is completely idle and only a few VFs are assigned to VMs.
If the cards are in x4 slots, the ixgbe driver doesn't allow us to create VFs at all, but I double-checked and all servers report x8 (PCI Express:5.0Gb/s:Width x8).
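For a rough sanity check of what that link report implies, here is a small sketch that estimates per-direction PCIe bandwidth from the line rate and lane count. It only accounts for line encoding (8b/10b for 2.5 and 5.0 GT/s links, 128b/130b for 8.0 GT/s) and ignores packet and protocol overhead, so real throughput will be lower. The function and table names are illustrative only.

```python
# Rough per-direction PCIe bandwidth, accounting only for line encoding.
# TLP/DLLP protocol overhead is ignored, so real payload rates are lower.
GENERATIONS = {
    "2.5GT/s": (2.5, 8 / 10),      # Gen1, 8b/10b encoding
    "5.0GT/s": (5.0, 8 / 10),      # Gen2, 8b/10b encoding
    "8.0GT/s": (8.0, 128 / 130),   # Gen3, 128b/130b encoding
}


def pcie_gbps(rate: str, lanes: int) -> float:
    """Return the encoded-data rate in Gbit/s, one direction, for a link."""
    gt_per_s, efficiency = GENERATIONS[rate]
    return gt_per_s * efficiency * lanes


if __name__ == "__main__":
    for lanes in (4, 8):
        rate = pcie_gbps("5.0GT/s", lanes)
        print(f"5.0GT/s x{lanes}: {rate:.1f} Gbit/s per direction")
```

By this estimate a 5.0 GT/s x8 link has about 32 Gbit/s available in each direction and an x4 link about 16 Gbit/s, so the slot itself should not be what caps a single 10G port at 2-3 Gbps.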
The test we were performing was between any 2 servers (not between VFs).
It is actually a bit more complicated, as we are using bonding as well.
For HA purposes we assign 2 VFs to each VM (one per PF) and bond them together (Ubuntu bonding, active/backup mode). Exactly the same is set up at the boot OS level as well.
In this test config all ports are connected to the same 10G switch, so there are no ISL issues.
CPUs are Intel(R) Xeon(R) CPU E5-2620...
That performance is more in line with an emulated path through DOM0 than with SR-IOV.
I assume you have seen the paper I published several months ago on SR-IOV and bonding:
I know I can do upwards of 6Gbps from a VF to another VF on a separate box, and that was on an older processor.
Before going down the path of interrupt alignment and such, can you do an experiment?
Can you remove all bonding, and not fire up any VMs? Then assign an IP address to a VF in the kernel, in a different subnet than any other of your eth devices. Do this on each side, then do some performance testing. Let's remove all the 'other stuff' and just talk VF to VF before adding back layers of the other stuff.
This is a great paper. Thanks!!!
I configured IPs on regular VFs, but with no luck: still the same 2-3Gbps. Strangely, it varies from run to run, even though there is no other traffic on the 10G switch.
I've not mentioned it before, but these servers run Ubuntu 11.04 with kernel 2.6.38-8.
The ip utility there reports version
ip utility, iproute2-ss100519
and it seems like it doesn't support rate limiting. At least when I try to set it up it returns:
# ip link set eth103 vf 22 rate 10000
RTNETLINK answers: Operation not supported
Our VMs, however, run Ubuntu 12.04.1 (3.2.0-25). The VFs are exposed directly to the VMs using PCI passthrough with KVM, and iperf gives the same results there as well:
On server side:
# iperf --server --len=64K --nodelay
On Client side:
# iperf --client xx.xx.xx.xx --len=64K --nodelay --parallel=4 --time=50 --interval=10
[ ID] Interval Transfer Bandwidth
[ 6] 0.0-10.0 sec 583 MBytes 489 Mbits/sec
[ 3] 0.0-10.0 sec 561 MBytes 471 Mbits/sec
[ 5] 0.0-10.0 sec 500 MBytes 420 Mbits/sec
[ 4] 0.0-10.0 sec 536 MBytes 450 Mbits/sec
[SUM] 0.0-10.0 sec 2.13 GBytes 1.83 Gbits/sec
[ 5] 10.0-20.0 sec 470 MBytes 394 Mbits/sec
[ 3] 10.0-20.0 sec 526 MBytes 442 Mbits/sec
[ 4] 10.0-20.0 sec 515 MBytes 432 Mbits/sec
[ 6] 10.0-20.0 sec 548 MBytes 459 Mbits/sec
[SUM] 10.0-20.0 sec 2.01 GBytes 1.73 Gbits/sec
[ 3] 20.0-30.0 sec 558 MBytes 468 Mbits/sec
[ 5] 20.0-30.0 sec 568 MBytes 476 Mbits/sec
[ 4] 20.0-30.0 sec 480 MBytes 403 Mbits/sec
[ 6] 20.0-30.0 sec 559 MBytes 469 Mbits/sec
[SUM] 20.0-30.0 sec 2.11 GBytes 1.82 Gbits/sec
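Since the numbers vary from run to run, it can help to pull the aggregate figures out of the iperf text and average them. The sketch below assumes the iperf output format shown above (lines starting with `[SUM]` and ending in `Kbits/sec`, `Mbits/sec` or `Gbits/sec`); the function names are made up for illustration.

```python
import re
from typing import List

# Matches iperf summary lines such as:
#   [SUM] 0.0-10.0 sec 2.13 GBytes 1.83 Gbits/sec
SUM_LINE = re.compile(
    r"^\[SUM\]\s+[\d.]+-\s*[\d.]+\s+sec\s+[\d.]+\s+\w+\s+([\d.]+)\s+([KMG])bits/sec"
)
UNIT_TO_GBIT = {"K": 1e-6, "M": 1e-3, "G": 1.0}


def sum_rates_gbps(text: str) -> List[float]:
    """Return the bandwidth of every [SUM] line, converted to Gbit/s."""
    rates = []
    for line in text.splitlines():
        match = SUM_LINE.match(line.strip())
        if match:
            rates.append(float(match.group(1)) * UNIT_TO_GBIT[match.group(2)])
    return rates


def mean(values: List[float]) -> float:
    """Arithmetic mean, or 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0
```

Feeding it the three intervals above gives per-interval sums of 1.83, 1.73 and 1.82 Gbit/s, i.e. an average of roughly 1.8 Gbit/s across the four parallel streams.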
Glad you like the paper :-)
You need to download the latest iproute2 tool to do the rate limiting and other cool stuff.
Have you tried PF to PF performance testing?
I suspect that it is more in the network stack than it is in the VF's and drivers.
To get the best performance for my demos and stuff, I turn off things like irqbalance and sometimes the network manager. For example, if the interrupt assigned to a VF (or PF) is handled by a core/package that is not on the same segment (I think it is called) that the PCIe connector is attached to, then the interrupt must not just go from one package (CPU) to another, but also across the QPI bus. This is a big hit on performance.
The latest generation of kernels does some interesting things for performance that seem to have unwanted side effects on network performance.
There is a script inside the source for our drivers (I can't recall the name at the moment - and I'm away at class without access to my lab) that will disable a bunch of this stuff AND try to align the driver/interrupts with the correct cores/packages. See if you can find this.
We've performed a set of tests and got very interesting results:
PF to PF without SR-IOV enabled ~9.5 Gbps
PF to PF with SR-IOV enabled ~5.0 Gbps
(we tried 32, 16 and 8 VFs per NIC - same results)
VF to VF ~5.0 Gbps
Bonded VF to Bonded VF ~3.5 Gbps
It was done under Ubuntu 11.04 with old iproute tools. The bonding was done like:
slaves fe1002 fe1003
Any ideas what might be slowing it down?
Hi Patrick, we retested this on the Ubuntu Precise kernel (3.2.0-29-generic #46) with the latest ixgbe driver (3.11.33):
PF to PF (SR-IOV enabled) and VF to PF – 9.5 Gbps
VF to PF and VF to VF – 5.5 Gbps
It appears that the limiting factor is the VF receive queue. We see that the interrupt assigned to it fires on a single CPU, and this CPU becomes 100% busy handling softirqs. We tried to spread the IRQs across several CPUs using /proc/irq/XXX/smp_affinity and /sys/class/net/XXX/queues/rx-0/rps_cpus (Receive Packet Steering), but all the interrupts are still handled on the same CPU. Is there a way to spread the interrupts for the receive queue across several CPUs?
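Both of those files take a hexadecimal CPU bitmask, written as comma-separated 32-bit groups on machines with more than 32 CPUs. As a minimal sketch (the helper name is made up here), this builds the mask string for a list of CPU numbers so it can be echoed into either file:

```python
from typing import Iterable


def cpus_to_affinity_mask(cpus: Iterable[int]) -> str:
    """Build the hex bitmask string for smp_affinity / rps_cpus.

    Bit N of the mask selects CPU N. Masks wider than 32 bits are split
    into comma-separated 32-bit groups, most significant group first.
    """
    bits = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError(f"invalid CPU number: {cpu}")
        bits |= 1 << cpu

    hex_digits = f"{bits:x}"
    groups = []
    while hex_digits:
        groups.append(hex_digits[-8:])   # take the lowest 32 bits (8 hex digits)
        hex_digits = hex_digits[:-8]
    return ",".join(reversed(groups))
```

For example, CPUs 0-3 give `f` and CPUs 1 and 5 give `22`. Keep in mind that RPS picks the target CPU from a hash of each flow, so a single TCP stream is still processed on one CPU even with a wide mask; any spreading only shows up with several parallel flows.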
Apart from that, we also did a bi-directional test: VF-to-VF reaches 10 Gbps in total, but PF-to-PF reaches 13-14 Gbps in total, which is strange. Is the 10 Gbps network bandwidth uni-directional or bi-directional?
Regarding the script: we tried it, but it is intended for multi-queue interfaces. When SR-IOV is enabled, each interface (VFs and PF) receives only one tx/rx queue pair (the PF receives a single TxRx queue), so it does not look relevant for the SR-IOV case.
Lastly, we disabled irqbalance, but didn't see any notable difference.
P.S.: I will also reply to other threads that we are having in parallel with you:) Thank you for being so responsive.
Vlad (and Alex I presume),
We have been scratching our heads about this, and our best guess is that this may be BIOS related. The latest BIOS for the R510 is 1.11.0 from 8/20/12. Can you try updating the BIOS and see what happens?
If that doesn't help, maybe do a dump to see which interrupts are assigned to the VF and which core/package they are handled on. If the traffic has to go across the QPI bus, there are some performance issues, so you need to make sure the interrupts for the VF are on the same package where the NIC is.
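One way to do that dump is to read /proc/interrupts and look at the per-CPU counters for the VF's vectors. The sketch below parses the text of that file and reports, for every IRQ line containing a given name, the per-CPU counts and the busiest CPU. The helper names and the sample interface name are assumptions for illustration; how the driver labels its vectors depends on the driver version.

```python
from typing import Dict, List


def irq_counts(proc_interrupts: str, name: str) -> Dict[str, List[int]]:
    """Map IRQ number -> per-CPU interrupt counts for lines containing `name`."""
    lines = proc_interrupts.splitlines()
    if not lines:
        return {}
    ncpus = len(lines[0].split())          # header row: CPU0 CPU1 ...
    result = {}
    for line in lines[1:]:
        if name not in line:
            continue
        fields = line.split()
        irq = fields[0].rstrip(":")
        result[irq] = [int(x) for x in fields[1:1 + ncpus]]
    return result


def busiest_cpu(counts: List[int]) -> int:
    """Return the index of the CPU that handled the most interrupts."""
    return max(range(len(counts)), key=counts.__getitem__)


# Example use on a live system (interface name is hypothetical):
#   with open("/proc/interrupts") as f:
#       for irq, counts in irq_counts(f.read(), "eth103").items():
#           print(irq, "busiest CPU:", busiest_cpu(counts))
```

The busiest CPU can then be compared against the NUMA node the NIC hangs off, which the kernel exposes in /sys/class/net/<dev>/device/numa_node (it reads -1 when the platform does not report one).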
My colleagues and I ran into the same performance problem with the PF functions when SR-IOV was enabled, but upgrading the ixgbe and ixgbevf drivers seems to have solved the problem. We ran with:
ixgbe driver 3.18.7
ixgbevf driver 2.11.3
Enea Linux with kernel 3.10
and got the same results with and without SR-IOV enabled.