Temporary PCIe bandwidth drops on Haswell-v3

Friedhelm_S_ · ‎11-24-2015

Hi All,

We have been developing HD video capture PCIe (Gen2 x8) cards, which are installed in HPC servers with Intel's dual-Xeon NUMA architecture. With the SandyBridge-v1/IvyBridge-v2 architecture everything worked fine. Now, with the new Haswell-v3 servers, we have the following problem:

The video streams (PCIe slot -> root complex) start stuttering every few seconds or minutes. When this happens, all Tx posted data (PD) credits have been consumed. We already observed this situation (all PD credits consumed) with the IvyBridge architecture; however, the system recovered quickly, and the temporary bandwidth drop was easily absorbed by the FIFOs in the Tx signal path (no visual degradation in the video streams). This is not the case with the Haswell architecture: sometimes the PD credits are returned quite slowly, even at times when no new Tx packets are being issued. Typically in this case we observe PD credits being freed up in small steps only: 0, 4, 8, 12, … It then takes tens of microseconds until the system has recovered. When everything is working as expected, the PD credits are freed up in much larger chunks. The described behavior is noticeable even at low Tx bandwidths (>= 2.2 GBit/s).
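For scale, the credit steps described above can be converted into bytes. This is only a rough sketch: the one fixed input is that a PCIe flow-control data credit covers 4 DW (16 bytes) of payload; the function names and the 256-byte payload size are illustrative assumptions, not values from this post.

```python
# One PCIe flow-control data credit covers 4 DW = 16 bytes of payload.
# The helper names below are illustrative, not part of any driver or tool.
PD_CREDIT_BYTES = 16


def credits_to_bytes(credits: int) -> int:
    """Payload bytes covered by a number of posted-data (PD) credits."""
    return credits * PD_CREDIT_BYTES


def credits_for_payload(payload_bytes: int) -> int:
    """PD credits one posted write TLP needs for a payload of this size."""
    return -(-payload_bytes // PD_CREDIT_BYTES)  # ceiling division


if __name__ == "__main__":
    # Credits coming back in steps of 4, as in the slow-recovery case.
    print(credits_to_bytes(4), "bytes released per step")
    # 256 bytes is an assumed payload size, not one stated in the post.
    print(credits_for_payload(256), "credits needed for a 256-byte write")
```

A step of 4 credits corresponds to 64 bytes, which is also the cache-line size on these CPUs. Whether that coincidence points at the cause is speculation.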

We stripped our software to a minimum to ensure that the data we capture is not processed at all - just transferred to memory via DMA. We double-checked the driver software and also made some tests with different memory allocation methods and DMA transfer setups.
We are using Linux and did the tests with kernel 3.7 (openSUSE 12.1) and 3.10 (CentOS 7.1). We also tried servers from ASUS and Supermicro.
None of these test scenarios helped us get rid of the problem or gave us a hint about what's going on.

Does anyone have an idea what causes such problems?
Is there a difference between IvyBridge-v2 and Haswell-v3 regarding PCIe credit handling (buffering, flow control)?
Are there tools from Intel that could help us find out what's going on?

Thanks and kind regards
Friedhelm Schanz

McCalpinJohn · ‎11-24-2015

There are some changes to the default cache coherence protocol in Haswell EP that might be related. Unfortunately the "Direct Cache Access" feature that is perhaps the most obvious thing to look at with PCIe DMA transactions is very minimally documented.

Some ideas of things to try while you are waiting to hear from someone who actually knows what is going on....

Since these are two-socket servers, the first question is whether the behavior is the same when the DMA target buffers are on the same chip as the PCIe card vs being located on the other chip -- and compare whatever patterns you see on Haswell EP to the behavior on Sandy Bridge EP and/or Ivy Bridge EP.
The default cache coherence policy on most Haswell EP systems is "home snoop", rather than the "source snoop" that was the default on Sandy Bridge EP and Ivy Bridge EP. I have not done a lot of IO testing, but for processor-initiated memory accesses, "source snoop" gives significantly lower memory latency (but also significantly lower QPI throughput).
I would also try running with the uncore frequency set to "maximum" in the BIOS (almost certainly not the default).
If none of this helps, and you are running on processors with more than 8 cores, I would try booting the machine in "Cluster On Die" mode. This will make each chip look like 2 NUMA nodes, but the resulting change(s) in the L3 address mapping may change the DCA behavior in a useful way.
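For the local-versus-remote comparison suggested above, Linux exposes the NUMA node of each PCI device in sysfs. The following is a minimal sketch that reads it; the function names and the `sysfs_root` parameter are illustrative (the parameter only exists so the code can be pointed at a test directory), and the device address in the usage note is hypothetical.

```python
from pathlib import Path


def pci_numa_node(bdf: str, sysfs_root: str = "/sys") -> int:
    """NUMA node Linux reports for a PCI device (-1 means 'unknown')."""
    path = Path(sysfs_root) / "bus" / "pci" / "devices" / bdf / "numa_node"
    return int(path.read_text().strip())


def node_cpulist(node: int, sysfs_root: str = "/sys") -> str:
    """CPU list string (e.g. '0-7,16-23') for a NUMA node."""
    path = (Path(sysfs_root) / "devices" / "system" / "node"
            / f"node{node}" / "cpulist")
    return path.read_text().strip()
```

Calling `pci_numa_node("0000:03:00.0")` with the capture card's real address in place of this placeholder gives the node to allocate the DMA buffers on for the "local" case; any other node gives the "remote" case that crosses QPI.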

Friedhelm_S_ · ‎11-26-2015

John, many thanks for your comments and hints. Please see my comment below.

Since these are two-socket servers, the first question is whether the behavior is the same when the DMA target buffers are on the same chip as the PCIe card vs being located on the other chip -- and compare whatever patterns you see on Haswell EP to the behavior on Sandy Bridge EP and/or Ivy Bridge EP.

We tried both memory located on the local NUMA node and memory located on the remote node: no noticeable differences

The default cache coherence policy on most Haswell EP systems is "home snoop", rather than the "source snoop" that was the default on Sandy Bridge EP and Ivy Bridge EP. I have not done a lot of IO testing, but for processor-initiated memory accesses, "source snoop" gives significantly lower memory latency (but also significantly lower QPI throughput).

We also tried "home snoop": no noticeable changes

I would also try running with the uncore frequency set to "maximum" in the BIOS (almost certainly not the default).

Unfortunately there's no related setting in the BIOS of the servers we use (or at least we can't find such a setting)

If none of this helps, and you are running on processors with more than 8 cores, I would try booting the machine in "Cluster On Die" mode. This will make each chip look like 2 NUMA nodes, but the resulting change(s) in the L3 address mapping may change the DCA behavior in a useful way.

Unfortunately we only have CPUs with just 8 cores

Maybe our problem is not related to cache management and/or NUMA architecture issues. By the way: we also tried different memory allocations on the v2 architecture and noticed slight performance differences (which is to be expected), but we never ran into the kind of 'malfunctions' we have with Haswell-v3. Maybe it's something related to PCIe credits and flow control?
Are there any other issues we could consider in our investigations?
Are there any means or tools available that would help us find the cause of our problems? Maybe Intel's PCM?

Thanks again and regards
Friedhelm Schanz

McCalpinJohn · ‎11-30-2015

Sounds like you need to find someone who knows how this actually works! The approaches you have tried cover pretty much everything I know....

Hmmm..... One other thing comes to mind.... If a Haswell notices that you are using 256-bit registers, it will take a ~10 microsecond stall to turn on the upper 128 bits of the pipelines (any data type -- not just FP). If you have not used the 256-bit registers for a full millisecond, then the hardware will turn off the "upper" 128-bit pipelines. This stall can pretty much only be detected by either extremely fine-grained measurements of the throughput of 256-bit operations or by looking at the difference between "Reference Cycles Not Halted" and "TSC Cycles".
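The second detection method comes down to simple arithmetic on two counter deltas taken over the same interval. A minimal sketch, assuming the "Reference Cycles Not Halted" counter ticks at the TSC rate while the core is running; the function names are illustrative and the counter reads themselves (perf, MSR access) are not shown:

```python
def halted_fraction(tsc_delta: int, ref_unhalted_delta: int) -> float:
    """Fraction of the interval during which the core was halted."""
    if tsc_delta <= 0:
        raise ValueError("tsc_delta must be positive")
    return max(0.0, 1.0 - ref_unhalted_delta / tsc_delta)


def halted_microseconds(tsc_delta: int, ref_unhalted_delta: int,
                        tsc_hz: float) -> float:
    """Halted time in microseconds, given the TSC frequency in Hz."""
    missing_cycles = max(0, tsc_delta - ref_unhalted_delta)
    return missing_cycles / tsc_hz * 1e6
```

On a busy core that never enters an idle state, a nonzero result would be consistent with stalls of the kind described above. On a core that also idles, ordinary halted time is mixed in, so the measurement only makes sense around a tight, always-running loop.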

Sandy Bridge and Ivy Bridge also turn off the upper 128 bits of the pipelines, but there is no stall when the upper pipeline is turned on --- the processor just runs 256-bit instructions at 1/2 speed for a few thousand cycles until the upper pipelines are ready.

You can avoid these stalls by compiling for SSE4.1 (instead of AVX or AVX2), but you still might run into 256-bit instructions in library routines.

Peter_L_3 · ‎12-18-2015

Any progress on this ?

TMeye5 · ‎03-16-2016

Looks like I'm not the only one experiencing this problem. I'm experiencing DMA "blockages" as described by Friedhelm Schanz above when writing from PCI-express to main memory.

What I see:

-No problems with Sandy/Ivy-Bridge XEON EP dual socket setups

-No problems with Haswell Desktop CPUs

With Haswell EP dual-socket setups the experience varies greatly. The worst case is having the card (PCIe slot) on a different socket from the one where the targeted DMA memory is located (so basically when doing DMA transfers over the QPI link). In this setup I see DMA "blockages" for up to 1-2 MILLISECONDS!!! That is a terrible thing for high-bandwidth devices (imagine a 10G network card or a video grabber) or devices without a big buffer.

Switching mainboard manufacturers doesn't seem to help (Supermicro, Asus, Asrock, HP, Dell); it definitely seems to be a problem caused by Intel.

What also helps is disabling all power-saving-related settings in the BIOS (does it have something to do with SMI BIOS interrupts?), but it's not a 100% solution; the frequency of the error events just drops considerably.

Any suggestions are appreciated, does somebody have a good link to Intel engineers?

Thanks & best regards

Thomas

Friedhelm_S_ · ‎03-16-2016

Here's an update on our investigations regarding the described problem:

In the meantime we've tested several CPU models with the Xeon Haswell-EP architecture and found that our problem mainly occurs on the CPU models based on the 8-core die (4, 6, 8 cores). CPUs with a 12-core die (10, 12 cores) seem to work much better. We expect the models based on the 18-core die to work even better.
We also tested some of the new Xeon Broadwell-EP CPUs (v4); some Supermicro servers with the latest BIOS already support those CPUs. Here even the low-range models seem to work much better in our environment than the corresponding Haswell-EP CPUs.

Still, we have setups where the 'larger' Haswell/Broadwell-EP CPUs 'behave worse' than the Sandy/Ivy Bridge-EP.

I also agree with Thomas that switching mainboard manufacturers doesn't help. We've also already disabled all power-saving settings in the BIOS, and we've spent a lot of time optimizing our memory allocation components in order to minimize traffic via the QPI link. All these measures improve the system behavior, but so far there's still no 100% solution.

Any suggestions are welcome.

Thanks and all regards
Friedhelm

TMeye5 · ‎03-18-2016

I can only guess at why the bigger (10+ core) setups work better than the smaller ones. The most likely difference I see is the bigger L3 cache on these CPUs, which, maybe in combination with DDIO (in my words: DMA cache allocation/update in L3), could lead to different behaviour. DDIO is also one function I suspect to be part of the problem, even though it should improve performance.

I'm still hoping for an Intel engineer who actually knows the problem... I'm sure the right guy @ Intel could say right away what the problem is. And I'm very sure other hardware has the same problem (e.g. network cards), just nobody sees it directly as a problem (retransmission of Ethernet packets).

Best regards

Thomas

Galim · ‎03-28-2016

Hello! Recently I've experienced device-to-host DMA bandwidth drops on a custom PCIe card connected to a dual-socket Haswell-EP machine. In our case the drops were caused by very long remote reads (SG DMA descriptor fetches from the host's system memory). In some cases DMA read completions were delayed by tens of microseconds, leading to descriptor starvation and subsequent stalls of the DMA write stream. The same, but less severe, read latency spikes were later observed on a desktop Skylake platform. No latency spikes this large were ever detected on previous Haswell platforms, both desktop and Xeon.

If your write stream depends somehow on read requests (descriptor fetch or something else) you may be facing the same effect.

TMeye5 · ‎04-04-2016

Hello all!

Thank you, Galim, for your reply. Tens of microseconds from read request to completion doesn't sound very good either. But in our case we don't even do SG DMA; we just linearly write a full image frame to memory (>1 Megabyte), so the write DMA doesn't depend on any read DMA accesses. At the moment we only see the issue with write DMA. I'm not sure if the same would happen if we reversed the data direction to read (I'm assuming the problem exists there too).

Best regards,

Thomas

aaron_f_1 · ‎04-17-2016

We observe the same thing on a dual-socket E5 v3 system. We see a lack of PCIe non-posted credits for durations of 10-20 usec when saturating PCIe device -> CPU local memory with writes (Gen3 x8) at a sustained ~56 Gbps. What's interesting is that the write throughput is entirely determined by the uncore frequency, which gets dynamically scaled up and down based on whatever metric Haswell-EP uses. Unfortunately for us there is no "Force UnCore to Max Freq" BIOS setting, so we're stuck with the dynamic scaling and have to ensure that the buffers on our PCIe device and in our software can soak up the delays during uncore frequency ramp-up. Kinda sucks.

Wish intel would release the register specs to control UnCore frequency scaling.
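The amount of buffering needed to ride out such a stall is easy to estimate from the figures quoted in this thread. This is a back-of-the-envelope sketch, assuming 1 Gbit = 10^9 bits and a stall during which no data at all drains from the buffer; the function name is illustrative.

```python
def buffer_bytes_for_stall(rate_gbit_s: float, stall_us: float) -> float:
    """Bytes that arrive at rate_gbit_s while the link is stalled."""
    bytes_per_second = rate_gbit_s * 1e9 / 8
    return bytes_per_second * stall_us * 1e-6


if __name__ == "__main__":
    # (rate in Gbit/s, stall in microseconds), figures quoted in this thread.
    cases = [(2.2, 50.0), (56.0, 20.0), (10.0, 2000.0)]
    for rate, stall in cases:
        kib = buffer_bytes_for_stall(rate, stall) / 1024
        print(f"{rate} Gbit/s for {stall} us: about {kib:.0f} KiB")
```

Tens of microseconds at video-capture rates fit in a modest FIFO, while a 1-2 ms blockage at 10 Gbit/s needs megabytes, which matches the complaint about devices without a big buffer.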

McCalpinJohn · ‎04-17-2016

My Dell R630 systems have a BIOS setting that allows me to change the uncore frequency from the default of "dynamic" to a value of "maximum". Reading the fixed-function cycle counter in the UBox confirms that this does allow the uncore to run at a fixed high frequency independent of the core frequency. If I recall the value correctly, this maximum uncore frequency is either 3.0 GHz or 3.1 GHz on my Xeon E5-2660 v3 (2.6 GHz nominal) systems.

Caveats:

- Running the uncore at maximum frequency increases the idle power consumption significantly.
  1. On my Xeon E5-2660 v3 (with my particular set of BIOS options), the RAPL-reported package idle power increased from ~9W to ~14W when I changed the uncore frequency from "dynamic" to "maximum".
- The fixed-function cycle counter in the UBox does not count while the processor is in a package C-state, so you have to be careful with measurements.
  1. The counter seems to remain active when the package is in the C1E state (on a system with deeper C-states disabled).
  2. A simple core-contained "spinner" program is enough to keep the package in C1 so that the uncore frequency can be measured accurately.
- The uncore frequency request of "maximum" is overridden when the system is power-throttled, and the uncore frequency is dropped to match the core frequency.
  1. In my tests all cores run at the same frequency when the chip is power-throttled.
  2. This behavior probably also happens with thermal throttling, but I have not tested that explicitly.

aaron_f_1 · ‎04-22-2016

Lucky for you, but the Intel motherboards don't have this setting, which is somewhat ironic..

I am maxing out the uncore clock at 3 GHz on a 2620 v3. The higher-end 2600 v3s probably go to 3.1 GHz.

Do you have any idea what calculation the uncore does to decide the frequency? Something like bus occupancy seems like a good guess. This dynamic behavior sucks major ass.. We've got plenty of CPU cycles free, so TDP is not an issue, but high *deterministic* PCIe and DDR4 bandwidth is really critical for latency-sensitive IO applications.

SyJong_C_Intel · ‎05-18-2016

Hi Guys,

If the BIOS does not allow you to change the uncore frequency to maximum, you can use msr-tools to change it:

https://01.org/msr-tools

Examples:-

To read the uncore frequency limits for socket 0, lcore 0:

# rdmsr -p 0 0x620

c1d

Here the result displayed is "c1d": "c" is the lowest uncore frequency and "1d" is the maximum. So, to change the socket 0 uncore frequency:

# wrmsr -p 0 0x620 0x1d1d

To change the frequency on socket 1, pass any lcore_id that belongs to socket 1:

# wrmsr -p <lcore_id in socket 1> 0x620 <Max Frequency in Hex 2 times>

You can use rdmsr to confirm the change is successful

# rdmsr -p 0 0x620

1d1d
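The "max frequency in hex, two times" rule can be derived from whatever rdmsr returns. The following sketch assumes the field layout implied by the example above (maximum ratio in bits 6:0, minimum ratio in bits 14:8); the function names are illustrative, and the code only computes the value, it does not touch any MSR.

```python
UNCORE_RATIO_LIMIT_MSR = 0x620  # MSR address used in the commands above


def split_uncore_limits(raw: int) -> tuple:
    """Return (min_ratio, max_ratio) from a raw MSR 0x620 value.

    Assumed layout: max ratio in bits 6:0, min ratio in bits 14:8.
    """
    return (raw >> 8) & 0x7F, raw & 0x7F


def pinned_value(raw: int) -> int:
    """Value that sets the minimum ratio equal to the current maximum."""
    _, max_ratio = split_uncore_limits(raw)
    return (max_ratio << 8) | max_ratio


if __name__ == "__main__":
    raw = 0xC1D  # the value rdmsr printed in the example above
    print(f"wrmsr -p 0 {UNCORE_RATIO_LIMIT_MSR:#x} {pinned_value(raw):#x}")
```

Keeping the original rdmsr output somewhere makes it possible to restore the default limits later by writing the old value back.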

Choi Sy Jong

aaron_f_1 · ‎07-09-2016

perfect, thank you!

Aaron

Friedhelm_S_ · ‎07-13-2016

Hello all,

Unfortunately, tuning the uncore frequency does not fix the problems on our system(s). We still have DMA 'blockages' lasting hundreds of microseconds.

Has anybody found new hints regarding the issue, or finally a solution for it?

Thanks and all regards

Friedhelm

McCalpinJohn · ‎07-13-2016

Have you checked to see if the CPUs are accumulating any "halted" cycles? This can be due to either p-state transitions or due to enabling the extra pipelines when 256-bit operations are used.

A paper that I read recently said that (unlike prior processors) Haswell server processors all change frequency at the same time, with requests batched up and executed by the PCU every 0.5 milliseconds or so. I don't know if the uncore also stalls during any of these transactions, but this is a fairly significant change in behavior that could lead to unexpected consequences....

Subbiah_K_ · ‎08-12-2016

I am trying to set the uncore frequency scaling to maximum on an E5-2640 v4 (Broadwell). I do have msr-tools. Does anyone know which MSRs control this on Broadwell? My BIOS doesn't allow me to change it.

Will_N_ · ‎08-26-2016

Subbiah - Is it not still 0x620?

Everyone - I was having a possibly related problem with Haswell v3s but could never pin it down. I posted a question here and John helped out: https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/606803

After abandoning the problem for a while I've just tried it on Broadwell EP(E5 v4) and found that it does not occur with our app on the 14 core E5 2680 v4 but does on the 6-core E5 1650 v4. One thing that has stayed constant is that we get FREQ_TRANS_CYCLES events (an uncore PCU event) whenever spikes occur. Are other people seeing the same thing?

Will

JJoha8 · ‎09-07-2016

Is there a list of supported Uncore frequencies on HSW? Do supported frequencies change between different HSW models? Any changes in BDW?

McCalpinJohn · ‎09-07-2016

The default contents of MSR 0x620 give the minimum and maximum uncore clock multipliers for Xeon E5 v3. Like the CPU core clock multipliers these are relative to the 100 MHz reference clock. Typical values are 0x0c (12d) for the minimum multiplier and 0x1e (30d) for the maximum multiplier, but these will vary by processor model number.
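As a rough illustration of that multiplier arithmetic, the sketch below lists the candidate uncore frequencies for a given [min, max] pair. It assumes every integer multiplier in the range is selectable, which the thread does not confirm; the function name is illustrative.

```python
REF_CLOCK_MHZ = 100  # reference clock the multipliers are relative to


def uncore_frequencies_mhz(min_mult: int, max_mult: int) -> list:
    """Candidate uncore frequencies in MHz for multipliers min..max."""
    if min_mult > max_mult:
        raise ValueError("min_mult must not exceed max_mult")
    return [mult * REF_CLOCK_MHZ for mult in range(min_mult, max_mult + 1)]


if __name__ == "__main__":
    # Typical Xeon E5 v3 values quoted above: 0x0c minimum, 0x1e maximum.
    freqs = uncore_frequencies_mhz(0x0C, 0x1E)
    print(f"{freqs[0]} MHz to {freqs[-1]} MHz in {len(freqs)} steps")
```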

Under "normal" circumstances, the Power Control Unit will choose an uncore multiplier in the range of [min,max] based on a variety of (undocumented) factors. A nice experimental survey of the behavior is:

https://tu-dresden.de/die_tu_dresden/zentrale_einrichtungen/zih/forschung/projekte/firestarter/2015_hackenberg_hppac.pdf