PCIe performance drops on Haswell/Broadwell CPUs

Oblaukhov__Konstanti · ‎09-09-2018

Hi All,

We are developing a custom FPGA-based PCI Express board for low-latency video capture (4x300 MiB/s bandwidth, PCIe 2.0 x4).

On some platforms with Haswell/Broadwell CPUs (seen on LGA2011v3 parts: i7-5960X, i7-6950X, Xeon E5-1650v4, Xeon E5-2603v4) we have hit an issue very similar to https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/600913
The platform offers good full-duplex performance (1650 MiB/s on PCIe 2.0 x4). But occasionally (about once every several seconds) the host starts reporting that all posted data (PD) credits are consumed, and this continues for tens of microseconds. As a result, even fairly large internal board->host FIFOs (128 KiB) overflow roughly every 20-30 seconds.
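To put those numbers in perspective, here is a rough back-of-the-envelope sketch of how long such a FIFO can ride out a stall. It assumes (my assumption, not stated in the post) that the 128 KiB FIFO is shared by all four streams, that it is empty when the stall begins, and that the link accepts nothing at all during the stall. The constant and function names are made up for illustration.

```python
KIB = 1024
MIB = 1024 * KIB

FIFO_BYTES = 128 * KIB         # board->host FIFO size quoted in the post
INGEST_RATE = 4 * 300 * MIB    # 4 streams x 300 MiB/s, bytes per second
LINK_RATE = 1650 * MIB         # measured PCIe 2.0 x4 throughput, bytes per second

# PCIe 2.0: 5 GT/s per lane with 8b/10b encoding, x4 link, before packet overhead.
RAW_LINK_RATE = 4 * 5e9 * 8 / 10 / 8   # bytes per second


def fill_time_us(fifo_bytes, in_rate, out_rate=0.0):
    """Microseconds until an empty FIFO overflows at the given in/out rates."""
    return fifo_bytes / (in_rate - out_rate) * 1e6


def drain_time_us(fifo_bytes, in_rate, out_rate):
    """Microseconds to empty a full FIFO while data keeps arriving."""
    return fifo_bytes / (out_rate - in_rate) * 1e6


# Assumption: the link is fully stalled (no credits at all) for the whole interval.
stall_budget_us = fill_time_us(FIFO_BYTES, INGEST_RATE)
# Assumption: after the stall the link runs at the measured 1650 MiB/s again.
recovery_us = drain_time_us(FIFO_BYTES, INGEST_RATE, LINK_RATE)
link_efficiency = LINK_RATE / RAW_LINK_RATE

print(f"stall budget: {stall_budget_us:.1f} us")
print(f"recovery:     {recovery_us:.1f} us")
print(f"measured / raw link rate: {link_efficiency:.1%}")
```

Under these assumptions an empty FIFO absorbs roughly 100 us of complete stall and then needs a few hundred microseconds to drain. An overflow would therefore need a stall near the upper end of "tens of microseconds", several stalls in quick succession, or a FIFO that is already partly full when the stall starts.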

We've tried other CPUs and haven't seen this issue on older desktop Ivy Bridge, Skylake, or the newest Coffee Lake CPUs. Maybe that is because they are pure desktop CPUs with fewer PCIe links?

After some investigation, we noticed a suspicious (and almost undocumented) option in the server motherboard configuration (Supermicro): "Chipset Configuration -> North Bridge -> IIO Config -> Snoop Response Hold Off", which was set to 9 by default.

Searching gives us:

It’s a BIOS setup option to allow DPDK performance (VT-D) tuning for IIO snoop response hold off. It can be used to set snoop response hold-off timer. It controls the amount of time a posted prefetch will wait to respond to a snoop while waiting for its fetch to arrive, in hope to responding with modified data rather than dumping the prefetch. Set to 0 to disable the hold-off time.

(Supermicro)

This hidden parameter is accessible on Lenovo ThinkSystem servers that use the Intel Xeon Scalable Family processors, as well as System x servers with Intel Xeon E5 v3 or E5 v4 processors.

For some workloads in which throughput and latency are critical, it is better to constantly poll the status of an I/O device rather than use an interrupt. Network adapter device drivers commonly use a thread to continuously poll in a fast loop so that incoming requests can be handled as fast as possible.

This can create contention between a processor core running the polling thread and the processor’s Integrated I/O feature (IIO) for an I/O-owned line in cache. This contention can cause an I/O operation to lose ownership of the cache line it has just acquired. It must then spend more time reacquiring the cache line to write it back.

When there are a large number of network ports each servicing small packets, the system may not be able to achieve the full throughput required due to excessive I/O and core contentions of cache lines. For this situation, the I/O operation should delay its response to core snoops and hold onto its cache lines until it successfully completes its write.

The Snoop Response Hold Off parameter allows the I/O operation to delay its snoop response by a selected amount to achieve this delay. It is possible to adjust this parameter using Lenovo’s Advanced Settings Utility (ASU) or the OneCLI tool as follows.

(Lenovo)

According to Lenovo, 9 means 2048K cycles or 4 us.

After disabling this option (setting it to 0), the issue seems to disappear! There is no lack of credits, and in general all our FIFOs stay almost empty, without any "spikes".

It is pretty clear what Snoop Response Hold Off means, but it is unclear why it leads to such behavior. Can someone explain?

Also, consumer desktop motherboards (and CPUs?) of course don't have a "Snoop Response Hold Off" setting; at least we haven't found anything on the Asus X99-E WS + i7-5960X. So what should we do with such platforms?

McCalpinJohn · ‎09-12-2018

This is a very interesting set of words, but I am not sure that there is enough public information to understand all the implications....

This looks like it is related to Intel's Data Direct IO (https://www.intel.com/content/dam/www/public/us/en/documents/technology-briefs/data-direct-i-o-technology-brief.pdf), which is implemented in the Xeon E5 and E7 processors.

The idea of the IIO device acquiring "ownership" of a cache line is well beyond the simple ideas in the 2012 white paper linked above, but I have not followed this part of Intel's architecture. With an inclusive L3 cache, Data Direct IO, and an on-chip Integrated IO unit, it is not surprising that Intel would have continued integrating the IO functionality into the coherence protocol.

The idea of delaying a snoop response is not new. I was modeling this more than a decade ago for optimization of fine-grained interactions between CPU cores and IO devices. There are lots of tricky parts involved here, especially on systems with strong memory ordering models (such as Intel processors). Delaying the response to a snoop may require that you delay responses to all subsequent snoops, and that may require that other units (caches and/or cores) delay their snoop responses as well. It is common for an implementation to enforce ordering in situations where it is not formally required (e.g., forcing a single ordering across strongly ordered memory access transactions and strongly ordered IO transactions) -- that could result in the problem with PCIe posted data credits.
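For readers less familiar with PCIe flow control, the following is a toy sketch of how a receiver that stops returning credits shows up on the sender as "all posted data credits consumed". It is a heavily simplified model, not how any particular device or the IIO implements it: it ignores header credits and the modulo counters used by real hardware, and the class name, the 64-credit limit, and the 256-byte write size are all hypothetical. The one hard fact it relies on is that a PCIe data credit covers 4 DW (16 bytes) of payload.

```python
from dataclasses import dataclass

PD_CREDIT_BYTES = 16  # one posted-data credit covers 4 DW (16 bytes) of payload


@dataclass
class PostedDataCredits:
    """Toy sender-side view of PCIe posted-data (PD) flow control."""
    limit: int          # total PD credits advertised by the receiver so far
    consumed: int = 0   # PD credits used by write TLPs already sent

    def needed(self, payload_bytes: int) -> int:
        # Round the payload up to whole credits.
        return -(-payload_bytes // PD_CREDIT_BYTES)

    def try_send(self, payload_bytes: int) -> bool:
        need = self.needed(payload_bytes)
        if self.consumed + need > self.limit:
            return False          # out of credits: the write must wait in the FIFO
        self.consumed += need
        return True

    def update_fc(self, freed_credits: int) -> None:
        # The receiver drained its buffer and raised the advertised limit.
        self.limit += freed_credits


def demo():
    # Hypothetical numbers: 64 PD credits advertised, 256-byte memory writes.
    fc = PostedDataCredits(limit=64)
    # While the receiver is stalled it returns no credits, so only the
    # writes covered by the initial allowance go out.
    during_stall = [fc.try_send(256) for _ in range(5)]
    # Once the receiver frees buffer space, sending resumes.
    fc.update_fc(16)
    after_update = fc.try_send(256)
    return during_stall, after_update


print(demo())
```

In this model, anything that delays the receiver from retiring writes (for example, an ordering rule that makes it wait behind a held-off snoop response) delays the credit update as well, and the sender's FIFO fills in the meantime.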

In any case, I am glad that you found this write-up -- it gives me a lot to think about -- and that you fixed your IO buffer overflows!

foo__aaron · ‎09-25-2018

Thanks for sharing the finding. We don't have the IIO Snoop Response Hold Off setting in our BIOS. Is there an MSR to set this?

Our problem is somewhat similar: a PCIe link with a direct IIO connection at about 80% throughput (PCIe -> IIO -> DDR write) has no problem at all, then it hits some kind of wall and stalls like crazy. It can do ~95% of the PCIe link, but the latency... the stalls... enter double-digit microseconds!

McCalpinJohn · ‎09-26-2018

The Lenovo document is the only reference I have ever seen to this feature, but I have not spent a lot of time studying the IO subsystem. (It makes me happy to consider this to be Someone Else's Problem.)