Ethernet Products

Inconsistent throughput with SR-IOV VFs at 10 Gbps using Intel Corporation Ethernet Converged Network Adapter X710

JSasi1
Beginner

Hello,

 

I am running throughput tests between OpenStack virtual machines with SR-IOV interfaces using iperf3, and I'm getting widely varying results, seemingly at random, every time I start a new iperf3 client. Sometimes I get the maximum throughput I would theoretically expect (comparable to native and VirtIO throughput), but other times the results drop to as low as roughly 1 Gbps. The tests run between the SR-IOV interfaces of two virtual machines over a 10 Gbps link. Throughput with VirtIO interfaces over the same link is ~9.39 Gbps. I have six virtual machines with one SR-IOV interface each across four compute nodes, all with the same NIC model (Intel Corporation Ethernet Converged Network Adapter X710), and I see similar behavior in all of them.
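For reference, here is roughly how I run the tests; the interface names and addresses below are placeholders for my setup, not exact values:

```
# On the receiving VM (its SR-IOV interface has IP 10.0.0.2 in this example)
iperf3 -s

# On the sending VM: a TCP test towards the receiver's SR-IOV address,
# reporting every second
iperf3 -c 10.0.0.2 -t 10 -i 1
```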

 

 

Here are some excerpts from iperf3 tests, all executed within ~2 minutes (throughput can easily swing from 9.39 Gbps to 1 Gbps between consecutive runs, and vice versa):

 

[ ID] Interval       Transfer     Bandwidth       Retr  Cwnd
[  4] 0.00-1.00 sec   382 MBytes  3.20 Gbits/sec  238   489 KBytes
[  4] 1.00-2.00 sec  1.05 GBytes  9.04 Gbits/sec   12   629 KBytes
[  4] 2.00-3.00 sec  1.01 GBytes  8.69 Gbits/sec    0   669 KBytes
[  4] 3.00-4.00 sec   992 MBytes  8.32 Gbits/sec    2   529 KBytes
[  4] 4.00-5.00 sec  1.03 GBytes  8.86 Gbits/sec    0   539 KBytes

[ ID] Interval       Transfer     Bandwidth       Retr  Cwnd
[  4] 0.00-1.00 sec   765 MBytes  6.42 Gbits/sec  201   496 KBytes
[  4] 1.00-2.00 sec   179 MBytes  1.50 Gbits/sec    0   609 KBytes
[  4] 2.00-3.00 sec   149 MBytes  1.25 Gbits/sec    0   660 KBytes
[  4] 3.00-4.00 sec   150 MBytes  1.25 Gbits/sec    9   588 KBytes
[  4] 4.00-5.00 sec   118 MBytes   988 Mbits/sec    8   550 KBytes

[  4] 4.00-5.00 sec  1.09 GBytes  9.39 Gbits/sec    0   747 KBytes
[  4] 5.00-6.00 sec  1.09 GBytes  9.39 Gbits/sec    0   747 KBytes
[  4] 6.00-7.00 sec  1.09 GBytes  9.39 Gbits/sec    0   747 KBytes
[  4] 7.00-8.00 sec  1.09 GBytes  9.38 Gbits/sec    0  1.08 MBytes
[  4] 8.00-9.00 sec  1.09 GBytes  9.39 Gbits/sec    0  1.08 MBytes

[ ID] Interval       Transfer     Bandwidth       Retr  Cwnd
[  4] 0.00-1.00 sec   400 MBytes  3.36 Gbits/sec  1200  625 KBytes
[  4] 1.00-2.00 sec   232 MBytes  1.95 Gbits/sec    0   673 KBytes
[  4] 2.00-3.00 sec   298 MBytes  2.50 Gbits/sec    1  1.58 MBytes
[  4] 3.00-4.00 sec   449 MBytes  3.76 Gbits/sec    1  3.09 MBytes
[  4] 4.00-5.00 sec   487 MBytes  4.09 Gbits/sec    6  3.09 MBytes

The compute nodes run Ubuntu 16.04.6 LTS with kernel 4.4.0-170-generic. The VMs run Ubuntu 18.04.3 LTS with kernel 4.15.0-72-generic. I use the i40e and i40evf drivers, which according to the man page (https://docs.oracle.com/cd/E86824_01/html/E54777/i40evf-7d.html) support this NIC at 10G or 40G. I have SR-IOV, IOMMU, and NUMA topology enabled in the BIOS/GRUB of all compute nodes. The compute nodes are:

   Manufacturer: Supermicro

   Product Name: SYS-1029P-WTR
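In case it is useful, these are the commands I use to check the driver, firmware, and adapter on the compute nodes; enp94s0f1 is one of the X710 ports in my setup:

```
# Driver and NVM/firmware versions reported for the physical function
ethtool -i enp94s0f1

# Version of the i40e module available to the running kernel
modinfo i40e | grep -E '^(filename|version)'

# Confirm the adapter and its virtual functions on the PCI bus
lspci | grep -i ethernet
```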

 

I provide additional information in the post below.

 

 

Here are some things that I've tried:

 

- Using other tools to measure throughput: I got similar behavior with iperf2 and nuttcp

- Verifying the traffic arriving at the switch during the tests (in case the 10G switch was somehow the bottleneck): it matches the throughput reported by iperf3

- `sudo ethtool -s ens5 speed 10000`: Cannot set new settings: Operation not supported

- `sudo ip link set enp94s0f1 vf 4 rate 10000`, then sending through that VF: same behavior

- `sudo tc qdisc add dev ens5 root fq maxrate 10gbit` in VM: same behavior

- Turning TSO off in the VM: same behavior (the best-case throughput is now lower because I become CPU-bound above ~3 Gbps, but I still get runs of around 1 Gbps)
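I also ran a couple of host-side checks on the physical function to verify the VF configuration; these are just the things I looked at, not fixes:

```
# List the VFs attached to the PF, with their MAC, spoof-check and rate settings
ip link show dev enp94s0f1

# Driver statistics on the PF, filtered for anything that looks like drops or errors
ethtool -S enp94s0f1 | grep -iE 'error|drop|discard'
```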

 

 

I would appreciate any suggestions that might solve this issue or help me continue troubleshooting it.

 

Thanks in advance,

Jorge

 

JSasi1
Beginner

Hello,

Thanks for the reply. Below are the answers to the questions you asked.

  1. Attached to this post is the dmesg output of the host. Second 750 is when the VMs are launched and the VFs requested. Let me know if the dmesg output of the VM would be helpful as well.
  2. The VMs run Ubuntu 18.04 with kernel 4.15.0-66. CPU architecture and memory/disk resources vary depending on the case, but for these tests I used: 2 vCPUs with CPU pinning to host CPUs, 1 socket/NUMA node, 8 GB RAM, 40 GB storage. According to htop, CPU resources did not seem to be the bottleneck during the throughput tests through VFs.
  3. I'm creating seven VFs per interface, following the example in the OpenStack SR-IOV tutorial (a sketch of this step appears after this list). I haven't used all of them at once, but the problem already appears when only one is in use by a VM. According to `/sys/class/net/*/device/sriov_totalvfs`, the maximum number of supported VFs is 32.
  4. I haven't made any configuration changes or tuning to the interfaces or the drivers.
  5. Iperf3 tests between the host interfaces of two servers went as expected, with throughput above 9 Gbps. Throughput between the virtual machines using emulated VirtIO NICs was also above 9 Gbps. The degraded throughput seems to affect only the VFs.
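For completeness, this is roughly how the VFs are created on each port; the interface name is specific to my hosts, and OpenStack handles the rest of the wiring:

```
# Maximum number of VFs supported by this port
cat /sys/class/net/enp94s0f1/device/sriov_totalvfs

# Create seven VFs on the port (repeated on each compute node)
echo 7 | sudo tee /sys/class/net/enp94s0f1/device/sriov_numvfs

# The new VFs then show up under the PF
ip link show dev enp94s0f1
```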

Please let me know if you need any additional information.

Thank you,

Jorge

AlfredoS_Intel
Moderator

Hi JSasi1,

Thank you for providing that information. We would like to ask for your cooperation in providing the following:

1. Project Name:
2. Application:
3. Platform (e.g., name of the server platform):
4. Network Interface (e.g., 4x10G):
5. Project Status (e.g., Engineering Validation Test, Design Validation Test, Production Validation Test):
6. Manageability (e.g., none, NCSI, MCTP, SMBus):

We look forward to your reply post. Should we not hear from you, we will reach out to you after three business days.

Best Regards,

Alfred S

Intel® Customer Support

A Contingent Worker at Intel

AlfredoS_Intel
Moderator

Hi JSasi1,

According to the log that you sent, the i40e driver being used is the “in kernel” driver included with the Operating System.

"i40e: Intel(R) Ethernet Connection XL710 Network Driver - version 1.4.25-k"

The “in kernel” driver can include a reduced feature set that may limit performance. Please update to the latest release driver, which you can download from this link: https://downloadcenter.intel.com/download/24411/Intel-Network-Adapter-Driver-for-PCIe-40-Gigabit-Ethernet-Network-Conn

Please also review the X710/XL710 performance tuning guide located here: https://www.intel.com/content/dam/www/public/us/en/documents/reference-guides/xl710-x710-performance-tuning-linux-guide.pdf (specifically section 4.0, on driver tuning).

Additionally, try the set_irq_affinity script to align interrupt vectors to the local NUMA node: [path-to-i40epackage]/scripts/set_irq_affinity -x local ethX

Looking forward to your update. If we do not get your update, we will check back with you after three business days to give you ample time to test and observe our recommendation.

 

 

Best Regards,

Alfred S

Intel® Customer Support 

JSasi1
Beginner

Hello,

Thanks for the detailed suggestions.

 

The i40e driver shown in the dmesg output was a mistake on my end, sorry. I had used insmod to load the latest driver you pointed me to in a previous post, but since I didn't run "make install", the old driver was loaded again after rebooting. After running "make install" followed by modprobe, the new driver is now loaded automatically after rebooting. I tested again in case the driver had not been loaded correctly before, but I got the same results.
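For anyone else hitting the same mistake, this is roughly the sequence I ended up using to build and install the downloaded driver; the tarball name is a placeholder for whatever version you download:

```
# Build and install the out-of-tree i40e driver so it persists across reboots
tar xf i40e-<version>.tar.gz
cd i40e-<version>/src
make
sudo make install        # installs the module under /lib/modules for the running kernel

# Reload the driver and confirm the new version is active
sudo rmmod i40e && sudo modprobe i40e
ethtool -i enp94s0f1 | grep ^version
```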

 

I will look into the other suggestions you pointed out, as well as other possible tuning options, to see whether I can get an improvement.

 

Thanks,

Jorge

AlfredoS_Intel
Moderator

Hi JSasi1,

We appreciate the quick update.

To give you ample time, we will wait for the results of our other recommendations after three business days. Should we not hear from you, we will reach out to you to ask if you need more time.

 

 

Best Regards,

Alfred S

Intel® Customer Support 

JSasi1
Beginner

Hello,

Since your previous post I've been playing around with different kinds of tuning and configuration, and trying to analyze other areas such as hardware/software interrupts in both host and guest, CPU usage, and packet size in the tests, in the hope of identifying patterns under which the SR-IOV performance improves.

Changes to the Tx/Rx queue sizes, interrupt coalescing, or IRQ balance/affinity alone didn't seem to do much. IRQ affinity in particular didn't seem very useful because the VM only had two CPUs and the VF only had two Tx/Rx queues, so I didn't have much room to work with. Increasing the ring size had no effect on performance. As for interrupt moderation, I could see the interrupt rate change as I modified the settings, but neither increasing nor decreasing the interrupt rate with adaptive interrupt moderation off had any impact on performance.
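For reference, these are the kinds of commands I used inside the VM for that tuning; the values are just examples of what I tried, not recommendations:

```
# Inspect and grow the Tx/Rx ring sizes on the VF interface
ethtool -g ens5
sudo ethtool -G ens5 rx 1024 tx 1024

# Inspect interrupt coalescing, then disable adaptive moderation and set fixed rates
ethtool -c ens5
sudo ethtool -C ens5 adaptive-rx off adaptive-tx off rx-usecs 62 tx-usecs 62
```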

 

I did however notice two things that caught my attention:

 

  • With UDP tests, the larger the datagram length, the more stable the throughput. This does not seem to be a CPU issue, because I'm CPU-bound regardless of the datagram length (due to the lack of UDP offloading, unlike TCP offloading in the NIC). With a length of 65507 bytes the throughput stays stable at around 8 Gbps with minor packet loss and/or throughput decay events, but with a length of 1472 bytes the throughput fluctuates a lot between nearly 0 Mbps and the theoretical maximum of ~3 Gbps. Packet loss at the receiver also fluctuates, averaging around 25%. For intermediate lengths, the results are more stable the larger the datagram. I wasn't able to test a higher MSS with TCP because the network doesn't support it. I did disable TSO, and while the captured packets then had the regular MSS size, the results still didn't improve.
  • Something I did that considerably improved the TCP tests was the following (a sketch of the steps appears after the output below). I disabled all but one Tx/Rx queue on the guest's interface, so that I could use IRQ affinity to pin the interrupts of that queue to a specific CPU. I did this on both the receiver (iperf3 -s) and the sender (iperf3 -c). Then, surprisingly to me, I consistently got much better results by binding the iperf3 process to the CPU that the interrupts were not affinitized to. After doing that, the performance is still not ideal compared to the VirtIO/native results, but I rarely average less than 8 Gbps versus the expected ~9.39 Gbps from VirtIO/native. In any case, according to htop and mpstat, CPU resources never look like the bottleneck. Here's an example below:
[ 5] 30.00-31.00 sec  1.08 GBytes  9.30 Gbits/sec
[ 5] 31.00-32.00 sec  1.09 GBytes  9.39 Gbits/sec
[ 5] 32.00-33.00 sec   932 MBytes  7.82 Gbits/sec
[ 5] 33.00-34.00 sec  1.08 GBytes  9.25 Gbits/sec
[ 5] 34.00-35.00 sec   865 MBytes  7.26 Gbits/sec
[ 5] 35.00-36.00 sec  1.01 GBytes  8.69 Gbits/sec
[ 5] 36.00-37.00 sec  1.09 GBytes  9.39 Gbits/sec
[ 5] 37.00-38.00 sec  1.00 GBytes  8.61 Gbits/sec
[ 5] 38.00-39.00 sec  1.09 GBytes  9.39 Gbits/sec
[ 5] 39.00-40.00 sec   887 MBytes  7.44 Gbits/sec
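A rough sketch of the steps behind that second point; ens5 is the VF inside the VM, and the IRQ number, CPU IDs, and server address are placeholders specific to my setup:

```
# 1. Leave a single combined Tx/Rx queue on the VF
sudo ethtool -L ens5 combined 1

# 2. Find the IRQs used by the VF queue and pin them to CPU 1
grep ens5 /proc/interrupts
echo 2 | sudo tee /proc/irq/<irq-number>/smp_affinity   # mask 0x2 = CPU 1

# 3. Run iperf3 on the other vCPU (CPU 0), away from the interrupt CPU
taskset -c 0 iperf3 -c <server-ip> -t 30 -i 1
```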

I'm by no means an expert in these matters, so I'm not sure whether this information offers any hint about the cause of the problem, and I can't explain why the results were better in those two cases. If it does suggest anything to you, I would appreciate further suggestions; take as much time as you need.

 

Thanks,

Jorge.

AlfredoS_Intel
Moderator

Hi JSasi1,

We appreciate the time that you spent tinkering and experimenting with our recommendations.

We will try to check on the results that you have provided, and we will get back to you within three business days.

 

 

Best Regards,

Alfred S

Intel® Customer Support 

JSasi1
Beginner

Hello,

 

I would like to post an update: after further investigation, I think I've found the right setup, in which everything appears to work perfectly!

 

So far I had been using thread siblings when deploying virtual machines, as suggested in most of the places I consulted about configuring SR-IOV in a virtualized environment. This means that when I deployed a virtual machine with 2 vCPUs, they both belonged to the same physical core. After playing around with CPU affinity, I noticed that performance was essentially excellent when I affinitized the IRQs of the VF to the vCPU pinned to the host CPU with the highest number (which, to my understanding, is the second hyper-thread and should have the worse performance of the two, though I'm not entirely sure about this).

 

After deploying a virtual machine whose vCPUs are pinned to host CPUs that are not thread siblings, I came to the following conclusions about obtaining the best performance:

  • With thread siblings, I need to affinitize the IRQs to the vCPU that is pinned to the "second" thread in the host machine (higher CPU number). Then, I need to either remove all but one Tx/Rx queue and affinitize it to that vCPU, or affinitize all queues to that vCPU.
  • Without thread siblings, I need to affinitize the IRQs to a single vCPU, but it doesn't matter which one. So, again, I need to either remove all but one Tx/Rx queue and affinitize it to that vCPU, or affinitize all queues to that vCPU.

 

I'm unsure why having different Tx/Rx queues with IRQs firing on different CPUs could be causing trouble, but based on my investigation that seems to be the case. I have vCPUs pinned to physical CPUs so they don't float around, and I don't overcommit CPUs. Furthermore, the CPUs reserved for virtual machines are isolated from kernel processes on the host through isolcpus in GRUB.
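For context, this is roughly what the relevant part of my GRUB configuration looks like on the compute nodes; the CPU list is specific to my hosts and only illustrative:

```
# /etc/default/grub (excerpt): enable the IOMMU and keep the VM CPUs away from host tasks
GRUB_CMDLINE_LINUX="intel_iommu=on isolcpus=2-19"

# Apply the change and reboot
sudo update-grub
```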

 

In any case, thank you for all your help in troubleshooting these issues. For now, I'm going to stick to these configurations to obtain the best performance in my setup. If I experience additional issues or find further information that could be relevant I'll post it here.

 

Best regards,

Jorge.

AlfredoS_Intel
Moderator

Hi JSasi1,

Again, we deeply appreciate your initiative in looking for a configuration that works for you.

We will be temporarily closing the thread. In case you have further questions, just feel free to re-open the thread or create a new thread.

Best Regards,

Alfred S

Intel® Customer Support 
