<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic temporary pcie bandwidth drops on Haswell-v3 in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/temporary-pcie-bandwidth-drops-on-Haswell-v3/m-p/1081298#M5468</link>
    <description>&lt;P&gt;Hi All,&lt;/P&gt;

&lt;P&gt;We have been developing HD video capture PCIe (Gen2 x8) cards, which are installed in HPC servers with Intel's dual-Xeon NUMA architecture. With the SandyBridge-v1/IvyBridge-v2 architecture everything worked fine. Now, with the new Haswell-v3 servers, we have the following problem:&lt;/P&gt;

&lt;P&gt;The video streams (PCIe slot -&amp;gt; RootComplex) start stuttering every few seconds or minutes. When this happens, all Tx posted data (PD) credits have been consumed. We already observed this situation (all PD credits consumed) with the IvyBridge architecture; however, the system recovered quickly and the temporary bandwidth drop was easily compensated for by the FIFOs in the Tx signal path (no visual degradation in the video streams). This is not the case with the Haswell architecture: sometimes the PD credits are returned quite slowly, even at times when no new Tx packets are being issued. Typically in this case we observe PD credits being freed up in small steps only: &amp;nbsp;0 – 4 – 8 – 12 – …&amp;nbsp; It then takes tens of microseconds until the system has recovered. When everything is working as expected, the PD credits are freed up in much larger chunks. The described behavior is noticeable even at low Tx bandwidths (&amp;gt;= 2.2 GBit/s).&lt;/P&gt;

&lt;P&gt;We stripped our software to a minimum to ensure that the data we capture is not processed at all - just transferred to memory via DMA. We double-checked the driver software and also made some tests with different memory allocation methods and DMA transfer setups.&lt;BR /&gt;
	We are using Linux and did the tests with kernel 3.7 (OpenSuse 12.1) and 3.10 (CentOS 7.1). We also tried servers from ASUS and Supermicro.&lt;BR /&gt;
	None of these different test scenarios helped us get rid of the problem or find a hint as to what's going on.&lt;BR /&gt;
	&lt;BR /&gt;
	Does anyone have an idea what the cause of such problems might be?&lt;BR /&gt;
	Is there a difference between IvyBridge-v2 and Haswell-v3 regarding PCIe credit handling (buffering, flow control)?&lt;BR /&gt;
	Are there tools from Intel that could help us find out what's going on?&lt;/P&gt;

&lt;P&gt;Thanks and kind regards&lt;BR /&gt;
	Friedhelm Schanz&lt;BR /&gt;
	&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Tue, 24 Nov 2015 12:34:48 GMT</pubDate>
    <dc:creator>Friedhelm_S_</dc:creator>
    <dc:date>2015-11-24T12:34:48Z</dc:date>
    <item>
      <title>temporary pcie bandwidth drops on Haswell-v3</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/temporary-pcie-bandwidth-drops-on-Haswell-v3/m-p/1081298#M5468</link>
      <description>&lt;P&gt;Hi All,&lt;/P&gt;

&lt;P&gt;We have been developing HD video capture PCIe (Gen2 x8) cards, which are installed in HPC servers with Intel's dual-Xeon NUMA architecture. With the SandyBridge-v1/IvyBridge-v2 architecture everything worked fine. Now, with the new Haswell-v3 servers, we have the following problem:&lt;/P&gt;

&lt;P&gt;The video streams (PCIe slot -&amp;gt; RootComplex) start stuttering every few seconds or minutes. When this happens, all Tx posted data (PD) credits have been consumed. We already observed this situation (all PD credits consumed) with the IvyBridge architecture; however, the system recovered quickly and the temporary bandwidth drop was easily compensated for by the FIFOs in the Tx signal path (no visual degradation in the video streams). This is not the case with the Haswell architecture: sometimes the PD credits are returned quite slowly, even at times when no new Tx packets are being issued. Typically in this case we observe PD credits being freed up in small steps only: &amp;nbsp;0 – 4 – 8 – 12 – …&amp;nbsp; It then takes tens of microseconds until the system has recovered. When everything is working as expected, the PD credits are freed up in much larger chunks. The described behavior is noticeable even at low Tx bandwidths (&amp;gt;= 2.2 GBit/s).&lt;/P&gt;

&lt;P&gt;We stripped our software to a minimum to ensure that the data we capture is not processed at all - just transferred to memory via DMA. We double-checked the driver software and also made some tests with different memory allocation methods and DMA transfer setups.&lt;BR /&gt;
	We are using Linux and did the tests with kernel 3.7 (OpenSuse 12.1) and 3.10 (CentOS 7.1). We also tried servers from ASUS and Supermicro.&lt;BR /&gt;
	None of these different test scenarios helped us get rid of the problem or find a hint as to what's going on.&lt;BR /&gt;
	&lt;BR /&gt;
	Does anyone have an idea what the cause of such problems might be?&lt;BR /&gt;
	Is there a difference between IvyBridge-v2 and Haswell-v3 regarding PCIe credit handling (buffering, flow control)?&lt;BR /&gt;
	Are there tools from Intel that could help us find out what's going on?&lt;/P&gt;

&lt;P&gt;Thanks and kind regards&lt;BR /&gt;
	Friedhelm Schanz&lt;BR /&gt;
	&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 24 Nov 2015 12:34:48 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/temporary-pcie-bandwidth-drops-on-Haswell-v3/m-p/1081298#M5468</guid>
      <dc:creator>Friedhelm_S_</dc:creator>
      <dc:date>2015-11-24T12:34:48Z</dc:date>
    </item>
    <item>
      <title>There are some changes to the</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/temporary-pcie-bandwidth-drops-on-Haswell-v3/m-p/1081299#M5469</link>
      <description>&lt;P&gt;There are some changes to the default cache coherence protocol in Haswell EP that might be related.&amp;nbsp; Unfortunately the "Direct Cache Access" feature that is perhaps the most obvious thing to look at with PCIe DMA transactions is very minimally documented.&lt;/P&gt;

&lt;P&gt;Some ideas of things to try while you are waiting to hear from someone who actually knows what is going on....&lt;/P&gt;

&lt;UL&gt;
	&lt;LI&gt;Since these are two-socket servers, the first question is whether the behavior is the same when the DMA target buffers are on the same chip as the PCIe card vs being located on the other chip -- and compare whatever patterns you see on Haswell EP to the behavior on Sandy Bridge EP and/or Ivy Bridge EP.&lt;/LI&gt;
	&lt;LI&gt;The default cache coherence policy on most Haswell EP systems is "home snoop", rather than the "source snoop" that was the default on Sandy Bridge EP and Ivy Bridge EP.&amp;nbsp;&amp;nbsp; I have not done a lot of IO testing, but for processor-initiated memory accesses, "source snoop" gives significantly lower memory latency (but also significantly lower QPI throughput).&lt;/LI&gt;
	&lt;LI&gt;I would also try running with the uncore frequency set to "maximum" in the BIOS (almost certainly not the default).&lt;/LI&gt;
	&lt;LI&gt;If none of this helps, and you are running on processors with more than 8 cores, I would try booting the machine in "Cluster On Die" mode.&amp;nbsp;&amp;nbsp; This will make each chip look like 2 NUMA nodes, but the resulting change(s) in the L3 address mapping may change the DCA behavior in a useful way.&lt;/LI&gt;
&lt;/UL&gt;
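&lt;P&gt;For the first check in the list above, one way to see which NUMA node Linux associates with the PCIe card is to read the device's numa_node attribute in sysfs. The following is only a minimal Python sketch: the function names are illustrative, any PCI address passed in is a placeholder rather than a value from this thread, and a result of -1 means the kernel has no locality information for the device.&lt;/P&gt;

```python
from pathlib import Path


def pci_numa_node(bdf, sysfs_root="/sys"):
    """Return the NUMA node Linux reports for a PCI device, or -1 if unknown."""
    # bdf is the PCI address as listed under /sys/bus/pci/devices,
    # for example "0000:03:00.0" (a placeholder used for illustration).
    path = Path(sysfs_root) / "bus" / "pci" / "devices" / bdf / "numa_node"
    return int(path.read_text().strip())


def is_local(buffer_node, device_node):
    """True when the DMA buffer's node matches the device's node."""
    # A device node of -1 (unknown) can never be confirmed as local.
    return device_node != -1 and buffer_node == device_node
```

&lt;P&gt;The node of the DMA buffer itself depends on how the driver allocates it; numactl --hardware lists the available nodes for comparison.&lt;/P&gt;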

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 24 Nov 2015 15:31:34 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/temporary-pcie-bandwidth-drops-on-Haswell-v3/m-p/1081299#M5469</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2015-11-24T15:31:34Z</dc:date>
    </item>
    <item>
      <title>John, many thanks for your</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/temporary-pcie-bandwidth-drops-on-Haswell-v3/m-p/1081300#M5470</link>
      <description>&lt;P&gt;&lt;BR /&gt;
	John, many thanks for your comments and hints. Please see my comments below.&lt;/P&gt;

&lt;BLOCKQUOTE&gt;
	&lt;UL&gt;
		&lt;LI&gt;Since these are two-socket servers, the first question is whether the behavior is the same when the DMA target buffers are on the same chip as the PCIe card vs being located on the other chip -- and compare whatever patterns you see on Haswell EP to the behavior on Sandy Bridge EP and/or Ivy Bridge EP.&lt;/LI&gt;
	&lt;/UL&gt;
&lt;/BLOCKQUOTE&gt;

&lt;P&gt;We tried both memory located on the local NUMA node and memory located on the remote node, with no noticeable difference.&lt;BR /&gt;
	&amp;nbsp;&lt;/P&gt;

&lt;BLOCKQUOTE&gt;
	&lt;UL&gt;
		&lt;LI&gt;The default cache coherence policy on most Haswell EP systems is "home snoop", rather than the "source snoop" that was the default on Sandy Bridge EP and Ivy Bridge EP.&amp;nbsp;&amp;nbsp; I have not done a lot of IO testing, but for processor-initiated memory accesses, "source snoop" gives significantly lower memory latency (but also significantly lower QPI throughput).&lt;/LI&gt;
	&lt;/UL&gt;
&lt;/BLOCKQUOTE&gt;

&lt;P&gt;We also tried "home snoop": no noticeable changes&lt;BR /&gt;
	&amp;nbsp;&lt;/P&gt;

&lt;BLOCKQUOTE&gt;
	&lt;UL&gt;
		&lt;LI&gt;I would also try running with the uncore frequency set to "maximum" in the BIOS (almost certainly not the default).&lt;/LI&gt;
	&lt;/UL&gt;
&lt;/BLOCKQUOTE&gt;

&lt;P&gt;Unfortunately there's no related setting in the BIOS of the servers we use (or at least we can't find such a setting)&lt;BR /&gt;
	&amp;nbsp;&lt;/P&gt;

&lt;BLOCKQUOTE&gt;
	&lt;UL&gt;
		&lt;LI&gt;If none of this helps, and you are running on processors with more than 8 cores, I would try booting the machine in "Cluster On Die" mode.&amp;nbsp;&amp;nbsp; This will make each chip look like 2 NUMA nodes, but the resulting change(s) in the L3 address mapping may change the DCA behavior in a useful way.&lt;/LI&gt;
	&lt;/UL&gt;
&lt;/BLOCKQUOTE&gt;

&lt;P&gt;Unfortunately, we only have 8-core CPUs.&lt;/P&gt;

&lt;P&gt;Maybe our problem is not related to cache management and/or NUMA architecture issues. By the way, we also tried different memory allocations on the v2 architecture and noticed slight performance differences (which is to be expected), but we never ran into the kind of 'malfunction' we have with Haswell-v3. Maybe it's something related to PCIe credits and flow control?&lt;BR /&gt;
	Any other issues we could consider in our investigations?&lt;BR /&gt;
	Are there any means/tools available helping us to find out what the causes of our problems are? Maybe Intel's PCM?&lt;/P&gt;

&lt;P&gt;Thanks again and regards&lt;BR /&gt;
	Friedhelm Schanz&lt;BR /&gt;
	&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 26 Nov 2015 15:34:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/temporary-pcie-bandwidth-drops-on-Haswell-v3/m-p/1081300#M5470</guid>
      <dc:creator>Friedhelm_S_</dc:creator>
      <dc:date>2015-11-26T15:34:29Z</dc:date>
    </item>
    <item>
      <title>Sounds like you need to find</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/temporary-pcie-bandwidth-drops-on-Haswell-v3/m-p/1081301#M5471</link>
      <description>&lt;P&gt;Sounds like you need to find someone who knows how this actually works!&amp;nbsp;&amp;nbsp; The approaches you have tried cover pretty much everything I know....&lt;/P&gt;

&lt;P&gt;Hmmm.....&amp;nbsp; One other thing comes to mind....&amp;nbsp;&amp;nbsp; If a Haswell notices that you are using 256-bit registers, it will take a ~10 microsecond stall to turn on the upper 128-bits of the pipelines (any data type -- not just FP).&amp;nbsp; &amp;nbsp; If you have not used the 256-bit registers for a full millisecond, then the hardware will turn off the "upper" 128-bit pipelines.&amp;nbsp;&amp;nbsp; This stall can pretty much only be detected by either extremely fine-grained measurements of the throughput of 256-bit operations or by looking at the difference between "Reference Cycles Not Halted" and "TSC Cycles".&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Sandy Bridge and Ivy Bridge also turn off the upper 128 bits of the pipelines, but there is no stall when the upper pipeline is turned on --- the processor just runs 256-bit instructions at 1/2 speed for a few thousand cycles until the upper pipelines are ready.&lt;/P&gt;

&lt;P&gt;You can avoid these stalls by compiling for SSE4.1 (instead of AVX or AVX2), but you still might run into 256-bit instructions in library routines.&lt;/P&gt;</description>
      <pubDate>Mon, 30 Nov 2015 19:30:30 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/temporary-pcie-bandwidth-drops-on-Haswell-v3/m-p/1081301#M5471</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2015-11-30T19:30:30Z</dc:date>
    </item>
    <item>
      <title>Any progress on this ?</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/temporary-pcie-bandwidth-drops-on-Haswell-v3/m-p/1081302#M5472</link>
      <description>&lt;P&gt;Any progress on this ?&lt;/P&gt;</description>
      <pubDate>Fri, 18 Dec 2015 14:30:32 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/temporary-pcie-bandwidth-drops-on-Haswell-v3/m-p/1081302#M5472</guid>
      <dc:creator>Peter_L_3</dc:creator>
      <dc:date>2015-12-18T14:30:32Z</dc:date>
    </item>
    <item>
      <title>Looks like I'm not the only</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/temporary-pcie-bandwidth-drops-on-Haswell-v3/m-p/1081303#M5473</link>
      <description>&lt;P&gt;Looks like I'm not the only one experiencing this problem. I'm experiencing DMA "blockages" as described by Friedhelm Schanz above when writing from PCI-express to main memory.&lt;/P&gt;

&lt;P&gt;What I see:&lt;/P&gt;

&lt;P&gt;-No problems with Sandy/Ivy-Bridge XEON EP dual socket setups&lt;/P&gt;

&lt;P&gt;-No problems with Haswell Desktop CPUs&lt;/P&gt;

&lt;P&gt;With Haswell EP dual-socket setups the experience varies greatly. The worst case is having the card (PCIe slot) on a different socket from the one where the addressed DMA memory is located (so basically when doing DMA transfers over the QPI link). In this setup I see DMA "blockages" of up to 1-2 MILLISECONDS! That is a terrible thing for high-bandwidth devices (imagine a 10G network card or a video grabber) or devices without a big buffer.&lt;/P&gt;
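
&lt;P&gt;To put such stall lengths in perspective, the amount of data a device has to buffer during a blockage is simply the stream rate multiplied by the stall duration. The following is a back-of-the-envelope Python sketch using figures mentioned in this thread (2.2 Gbit/s and 56 Gbit/s streams, stalls of 20 microseconds and 2 milliseconds); the function name is illustrative and protocol overhead is ignored.&lt;/P&gt;

```python
def fifo_bytes_needed(rate_bits_per_s, stall_us):
    """Bytes that arrive while the DMA path is blocked (rounded up)."""
    # bit/s times microseconds gives a count in units of 1e-6 bit.
    micro_bits = rate_bits_per_s * stall_us
    # Ceiling division by 8 bits per byte and 1e6 microseconds per second.
    return -(-micro_bits // (8 * 1_000_000))


if __name__ == "__main__":
    cases = [(2_200_000_000, 20), (2_200_000_000, 2000), (56_000_000_000, 20)]
    for rate, stall in cases:
        print(rate, "bit/s,", stall, "us stall:",
              fifo_bytes_needed(rate, stall), "bytes")
```

&lt;P&gt;A 2 ms stall at 2.2 Gbit/s already amounts to 550,000 bytes, which illustrates why millisecond-scale blockages are hard to absorb in on-card FIFOs.&lt;/P&gt;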

&lt;P&gt;Switching mainboard manufacturers doesn't seem to help (Supermicro, Asus, Asrock, HP, Dell); it definitely seems to be a problem caused by Intel.&lt;/P&gt;

&lt;P&gt;What does help is disabling all power-saving-related settings in the BIOS (does it have something to do with SMI BIOS interrupts?), but it's not a 100% solution; the frequency of the error events just drops considerably.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Any suggestions are appreciated, does somebody have a good link to Intel engineers?&lt;/P&gt;

&lt;P&gt;Thanks &amp;amp; best regards&lt;/P&gt;

&lt;P&gt;Thomas&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 16 Mar 2016 10:57:08 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/temporary-pcie-bandwidth-drops-on-Haswell-v3/m-p/1081303#M5473</guid>
      <dc:creator>TMeye5</dc:creator>
      <dc:date>2016-03-16T10:57:08Z</dc:date>
    </item>
    <item>
      <title>here's an update on our</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/temporary-pcie-bandwidth-drops-on-Haswell-v3/m-p/1081304#M5474</link>
      <description>&lt;P&gt;here's an update on our investigations regarding the described problem:&lt;/P&gt;

&lt;P&gt;In the meantime we've tested several CPU models with the Xeon Haswell-EP architecture and found that our problem mainly occurs on the CPU models based on the 8-core die (4, 6, 8 cores). CPUs with a 12-core die (10, 12 cores) seem to work much better. We also expect the models based on the 18-core die to work even better.&lt;BR /&gt;
	We also tested some of the new Xeon Broadwell-EP CPUs (v4); some Supermicro servers with the latest BIOS already support those CPUs. Here even the low-range models seem to work much better in our environment than the corresponding Haswell-EP CPUs.&lt;/P&gt;

&lt;P&gt;Anyway, we still have setups where the 'larger' Haswell/Broadwell-EP CPUs 'behave worse' than the Sandy/Ivy Bridge-EP.&lt;/P&gt;

&lt;P&gt;I also agree with Thomas that different mainboard manufacturers don't help. We've also already disabled all power-saving settings in the BIOS, and we've spent a lot of time optimizing our memory allocation components in order to minimize traffic via the QPI link. All these measures improve the system behavior, but so far there's still no 100% solution.&lt;/P&gt;

&lt;P&gt;Any suggestions are welcome.&lt;/P&gt;

&lt;P&gt;Thanks and all regards&lt;BR /&gt;
	Friedhelm&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 16 Mar 2016 13:33:41 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/temporary-pcie-bandwidth-drops-on-Haswell-v3/m-p/1081304#M5474</guid>
      <dc:creator>Friedhelm_S_</dc:creator>
      <dc:date>2016-03-16T13:33:41Z</dc:date>
    </item>
    <item>
      <title>I can only make assumptions</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/temporary-pcie-bandwidth-drops-on-Haswell-v3/m-p/1081305#M5475</link>
      <description>&lt;P&gt;I can only make assumptions as to why the bigger (10+ core) setups work better than the smaller ones. The most likely difference I see is the bigger L3 cache on these CPUs, which, maybe in combination with DDIO (in my words: DMA cache allocation/update in L3), could lead to different behaviour. DDIO is also one function I suspect to be part of the problem, even though it should improve performance.&lt;/P&gt;

&lt;P&gt;I'm still hoping for an Intel engineer who actually knows the problem... I'm sure the right guy @ Intel could say right away what the problem is. And I'm very sure other hardware has the same problem (e.g. network cards); it's just that nobody sees it directly as a problem (retransmission of Ethernet packets).&lt;/P&gt;

&lt;P&gt;Best regards&lt;/P&gt;

&lt;P&gt;Thomas&lt;/P&gt;</description>
      <pubDate>Fri, 18 Mar 2016 10:40:05 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/temporary-pcie-bandwidth-drops-on-Haswell-v3/m-p/1081305#M5475</guid>
      <dc:creator>TMeye5</dc:creator>
      <dc:date>2016-03-18T10:40:05Z</dc:date>
    </item>
    <item>
      <title>Hello! Recently I've</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/temporary-pcie-bandwidth-drops-on-Haswell-v3/m-p/1081306#M5476</link>
      <description>&lt;P&gt;Hello! Recently I've experienced device-to-host DMA bandwidth drops on a custom PCIe card connected to a dual-socket Haswell-EP machine. In our case the drops were caused by very long remote reads (SG DMA descriptor fetches from the host's system memory). In some cases DMA read completions were delayed by tens of microseconds, leading to descriptor starvation and subsequent stalls of the DMA write stream. The same, but less severe, read latency spikes were later observed on a desktop Skylake platform. No such big latency spikes were ever detected on previous Haswell platforms, both desktop and Xeon.&lt;/P&gt;

&lt;P&gt;If your write stream somehow depends on read requests (descriptor fetch or something else), you may be facing the same effect.&lt;/P&gt;</description>
      <pubDate>Mon, 28 Mar 2016 11:13:49 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/temporary-pcie-bandwidth-drops-on-Haswell-v3/m-p/1081306#M5476</guid>
      <dc:creator>Galim</dc:creator>
      <dc:date>2016-03-28T11:13:49Z</dc:date>
    </item>
    <item>
      <title>Hello all!</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/temporary-pcie-bandwidth-drops-on-Haswell-v3/m-p/1081307#M5477</link>
      <description>&lt;P&gt;Hello all!&lt;/P&gt;

&lt;P&gt;Thank you, Galim, for your reply. Tens of microseconds from read request to completion doesn't sound very good either. But in our case we don't even do SG DMA; we just linearly write a full image frame to memory (&amp;gt;1 Megabyte), so the write DMA doesn't depend on any read DMA accesses. At the moment we only see the issue with write DMA; I'm not sure whether the same would happen if we reversed the data direction to read (I'm assuming the problem exists there too).&lt;/P&gt;

&lt;P&gt;Best regards,&lt;/P&gt;

&lt;P&gt;Thomas&lt;/P&gt;</description>
      <pubDate>Mon, 04 Apr 2016 10:26:46 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/temporary-pcie-bandwidth-drops-on-Haswell-v3/m-p/1081307#M5477</guid>
      <dc:creator>TMeye5</dc:creator>
      <dc:date>2016-04-04T10:26:46Z</dc:date>
    </item>
    <item>
      <title>Observe the same thing, dual</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/temporary-pcie-bandwidth-drops-on-Haswell-v3/m-p/1081308#M5478</link>
      <description>&lt;P&gt;We observe the same thing on a dual-socket E5 v3 system. We see a lack of PCIe non-posted credits for durations of 10-20 usec when saturating PCIe device -&amp;gt; CPU local memory with writes (Gen3 x8) at a sustained ~56 Gbps. What's interesting is that the write throughput is entirely determined by the uncore frequency... which gets dynamically scaled up/down based on whatever metric Haswell-EP uses. Unfortunately for us there is no "Force UnCore to Max Freq" BIOS setting, so we're stuck with the dynamic scaling and thus have to ensure the buffers on our PCIe device and software app can soak up delays in uncore frequency ramp-up.. kinda sucks.&lt;/P&gt;

&lt;P&gt;Wish Intel would release the register specs to control uncore frequency scaling.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 17 Apr 2016 08:14:10 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/temporary-pcie-bandwidth-drops-on-Haswell-v3/m-p/1081308#M5478</guid>
      <dc:creator>aaron_f_1</dc:creator>
      <dc:date>2016-04-17T08:14:10Z</dc:date>
    </item>
    <item>
      <title>My Dell R630 systems have a</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/temporary-pcie-bandwidth-drops-on-Haswell-v3/m-p/1081309#M5479</link>
      <description>&lt;P&gt;My Dell R630 systems have a BIOS setting that allows me to change the uncore frequency from the default of "dynamic" to a value of "maximum".&amp;nbsp;&amp;nbsp; Reading the fixed-function cycle counter in the UBox confirms that this does allow the uncore to run at a fixed high frequency independent of the core frequency.&amp;nbsp;&amp;nbsp; If I recall the value correctly, this maximum uncore frequency is either 3.0 GHz or 3.1 GHz on my Xeon E5-2660 v3 (2.6 GHz nominal) systems.&amp;nbsp;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Caveats:&lt;/P&gt;

&lt;OL&gt;
	&lt;LI&gt;Running the uncore at maximum frequency increases the idle power consumption significantly.
		&lt;OL&gt;
			&lt;LI&gt;On my Xeon E5-2660 v3 (with my particular set of BIOS options), the RAPL-reported package idle power increased from ~9W to ~14W when I changed the uncore frequency from "dynamic" to "maximum".&lt;/LI&gt;
		&lt;/OL&gt;
	&lt;/LI&gt;
	&lt;LI&gt;The fixed-function cycle counter in the UBox does not count while the processor is in a package C-state, so you have to be careful with measurements.
		&lt;OL&gt;
			&lt;LI&gt;The counter seems to remain active when the package is in the C1E state (on a system with deeper C-states disabled).&lt;/LI&gt;
			&lt;LI&gt;A simple core-contained "spinner" program is enough to keep the package in C1 so that the uncore frequency can be measured accurately.&lt;/LI&gt;
		&lt;/OL&gt;
	&lt;/LI&gt;
	&lt;LI&gt;The uncore frequency request of "maximum" is overridden when the system is power-throttled, and the uncore frequency is dropped to match the core frequency.&amp;nbsp;
		&lt;OL&gt;
			&lt;LI&gt;In my tests all cores run at the same frequency when the chip is power-throttled.&lt;/LI&gt;
			&lt;LI&gt;This behavior probably also happens with thermal throttling, but I have not tested that explicitly.&lt;/LI&gt;
		&lt;/OL&gt;
	&lt;/LI&gt;
&lt;/OL&gt;</description>
      <pubDate>Sun, 17 Apr 2016 15:50:09 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/temporary-pcie-bandwidth-drops-on-Haswell-v3/m-p/1081309#M5479</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2016-04-17T15:50:09Z</dc:date>
    </item>
    <item>
      <title>Lucky for you, but the Intel</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/temporary-pcie-bandwidth-drops-on-Haswell-v3/m-p/1081310#M5480</link>
      <description>&lt;P&gt;Lucky for you, but the Intel motherboards don't have this setting, which is somewhat ironic..&lt;/P&gt;

&lt;P&gt;I am maxing out the uncore clock at 3 GHz on a 2620 v3. The higher-end 2600 v3s probably go to 3.1 GHz.&lt;/P&gt;

&lt;P&gt;Do you have any idea what calculation the uncore is doing to decide the frequency? Something like bus occupancy seems like a good guess. This dynamic behavior sucks major ass.. We've got plenty of CPU cycles free so TDP is not an issue, but high *deterministic* PCIe and DDR4 bandwidth is really critical for latency-sensitive IO applications.&lt;/P&gt;</description>
      <pubDate>Fri, 22 Apr 2016 10:45:22 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/temporary-pcie-bandwidth-drops-on-Haswell-v3/m-p/1081310#M5480</guid>
      <dc:creator>aaron_f_1</dc:creator>
      <dc:date>2016-04-22T10:45:22Z</dc:date>
    </item>
    <item>
      <title>Hi Guys,</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/temporary-pcie-bandwidth-drops-on-Haswell-v3/m-p/1081311#M5481</link>
      <description>&lt;P&gt;Hi Guys,&lt;/P&gt;

&lt;P&gt;If the BIOS does not allow you to change the uncore frequency to maximum, you can use msr-tools to change the uncore frequency:&lt;/P&gt;

&lt;P&gt;&lt;A href="https://01.org/msr-tools" target="_blank"&gt;https://01.org/msr-tools&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;Examples:-&lt;/P&gt;

&lt;P&gt;To read the uncore frequency for socket 0, lcore 0:&lt;/P&gt;

&lt;P&gt;&lt;STRONG&gt;# rdmsr -p 0 0x620&lt;/STRONG&gt;&lt;/P&gt;

&lt;P&gt;&lt;STRONG&gt;c1d&lt;/STRONG&gt;&lt;/P&gt;

&lt;P&gt;Here the result displayed is "c1d": "c" is the lowest uncore frequency and "1d" is the maximum. So, to change the socket 0 uncore frequency:&lt;/P&gt;

&lt;P&gt;&lt;STRONG&gt;# wrmsr -p 0 0x620 0x1d1d&lt;/STRONG&gt;&lt;/P&gt;

&lt;P&gt;To change the frequency for socket 1, use the id of any lcore on socket 1:&lt;/P&gt;

&lt;P&gt;&lt;STRONG&gt;# wrmsr -p &amp;lt;lcore_id in socket 1&amp;gt; 0x620 &amp;lt;Max Frequency in Hex 2 times&amp;gt;&lt;/STRONG&gt;&lt;/P&gt;

&lt;P&gt;You can use rdmsr to confirm that the change was successful:&lt;/P&gt;

&lt;P&gt;&lt;STRONG&gt;# rdmsr -p 0 0x620&lt;/STRONG&gt;&lt;/P&gt;

&lt;P&gt;&lt;STRONG&gt;1d1d&lt;/STRONG&gt;&lt;/P&gt;
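
&lt;P&gt;The raw value read from MSR 0x620 can be decoded along these lines. This is a minimal Python sketch, assuming the commonly described layout of this register on Haswell-EP (maximum ratio in bits 6:0, minimum ratio in bits 14:8) and a step of 100 MHz per ratio unit; neither the bit layout nor the step size is stated in this thread, and the function names are illustrative.&lt;/P&gt;

```python
def decode_uncore_ratio_limit(value):
    """Split a raw MSR 0x620 value into (min_ratio, max_ratio)."""
    max_ratio = value % 128            # bits 6:0 (assumed layout)
    min_ratio = (value // 256) % 128   # bits 14:8 (assumed layout)
    return min_ratio, max_ratio


def pinned_value(ratio):
    """Build the value that sets minimum and maximum to the same ratio."""
    return ratio * 256 + ratio


if __name__ == "__main__":
    lo, hi = decode_uncore_ratio_limit(0xC1D)
    # Assumes 100 MHz per ratio unit.
    print("min", lo * 100, "MHz, max", hi * 100, "MHz")
    print("value to write:", hex(pinned_value(hi)))
```

&lt;P&gt;For the example above, 0xc1d decodes to ratios 12 and 29, and pinning both fields to the maximum yields 0x1d1d, which matches the wrmsr command shown.&lt;/P&gt;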

&lt;P style="font-size: 13.008px; line-height: 19.512px;"&gt;&amp;nbsp;&lt;/P&gt;

&lt;P style="font-size: 13.008px; line-height: 19.512px;"&gt;Choi Sy Jong&lt;/P&gt;

&lt;P style="font-size: 13.008px; line-height: 19.512px;"&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 18 May 2016 13:12:20 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/temporary-pcie-bandwidth-drops-on-Haswell-v3/m-p/1081311#M5481</guid>
      <dc:creator>SyJong_C_Intel</dc:creator>
      <dc:date>2016-05-18T13:12:20Z</dc:date>
    </item>
    <item>
      <title>perfect, thank you!
 
Aaron</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/temporary-pcie-bandwidth-drops-on-Haswell-v3/m-p/1081312#M5482</link>
      <description>&lt;P&gt;perfect, thank you!&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Aaron&lt;/P&gt;</description>
      <pubDate>Sat, 09 Jul 2016 09:26:05 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/temporary-pcie-bandwidth-drops-on-Haswell-v3/m-p/1081312#M5482</guid>
      <dc:creator>aaron_f_1</dc:creator>
      <dc:date>2016-07-09T09:26:05Z</dc:date>
    </item>
    <item>
      <title> </title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/temporary-pcie-bandwidth-drops-on-Haswell-v3/m-p/1081313#M5483</link>
      <description>&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Hello all,&lt;/P&gt;

&lt;P&gt;Unfortunately, tuning the uncore frequency does not fix the problems on our system(s). We still have DMA 'blockages' lasting hundreds of microseconds.&lt;/P&gt;

&lt;P&gt;Has anybody found any new hints regarding the issue, or even a solution for it?&lt;/P&gt;

&lt;P&gt;Thanks and all regards&lt;BR /&gt;
	&lt;BR /&gt;
	Friedhelm&lt;/P&gt;</description>
      <pubDate>Wed, 13 Jul 2016 14:59:54 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/temporary-pcie-bandwidth-drops-on-Haswell-v3/m-p/1081313#M5483</guid>
      <dc:creator>Friedhelm_S_</dc:creator>
      <dc:date>2016-07-13T14:59:54Z</dc:date>
    </item>
    <item>
      <title>Have you checked to see if</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/temporary-pcie-bandwidth-drops-on-Haswell-v3/m-p/1081314#M5484</link>
      <description>&lt;P&gt;Have you checked to see if the CPUs are accumulating any "halted" cycles?&amp;nbsp;&amp;nbsp; This can be due to either p-state transitions or due to enabling the extra pipelines when 256-bit operations are used.&lt;/P&gt;
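
&lt;P&gt;One way to check this is to sample the TSC and the "reference cycles not halted" counter over the same interval and compare the deltas. The following is a minimal Python sketch of that arithmetic only: how the two counters are read (for example via perf or a PMU library) is left out, the function name is illustrative, and it assumes both counters tick at the same nominal rate.&lt;/P&gt;

```python
def halted_fraction(tsc_delta, ref_unhalted_delta):
    """Fraction of the interval during which the core was halted."""
    if tsc_delta == 0:
        raise ValueError("interval must span at least one TSC cycle")
    # Clamp at zero so small sampling skew cannot give a negative result.
    halted = max(0, tsc_delta - ref_unhalted_delta)
    return halted / tsc_delta
```

&lt;P&gt;A result close to zero on a core that is supposed to be busy suggests no stalls of this kind; a clearly non-zero value on such a core would be worth correlating with the DMA blockages.&lt;/P&gt;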

&lt;P&gt;A paper that I read recently said that (unlike prior processors) Haswell server processors all change frequency at the same time, with requests batched up and executed by the PCU every 0.5 milliseconds or so.&amp;nbsp; I don't know if the uncore also stalls during any of these transactions, but this is a fairly significant change in behavior that could lead to unexpected consequences....&lt;/P&gt;</description>
      <pubDate>Wed, 13 Jul 2016 19:38:10 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/temporary-pcie-bandwidth-drops-on-Haswell-v3/m-p/1081314#M5484</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2016-07-13T19:38:10Z</dc:date>
    </item>
    <item>
      <title>I am trying to do set the</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/temporary-pcie-bandwidth-drops-on-Haswell-v3/m-p/1081315#M5485</link>
      <description>&lt;P&gt;I am trying to set the uncore frequency scaling to maximum on an E5-2640 v4 (Broadwell). I do have msr-tools. Does anyone know the MSR registers on Broadwell to change this? My BIOS doesn't allow me to change it.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 12 Aug 2016 20:59:56 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/temporary-pcie-bandwidth-drops-on-Haswell-v3/m-p/1081315#M5485</guid>
      <dc:creator>Subbiah_K_</dc:creator>
      <dc:date>2016-08-12T20:59:56Z</dc:date>
    </item>
    <item>
      <title>Subbiah - Is it not still</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/temporary-pcie-bandwidth-drops-on-Haswell-v3/m-p/1081316#M5486</link>
      <description>&lt;P&gt;Subbiah - Is it not still 0x620?&lt;/P&gt;

&lt;P&gt;Everyone - I was having a possibly related problem with Haswell v3s but could never pin it down. I posted a question here and John helped out:&amp;nbsp;https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/606803&lt;/P&gt;

&lt;P&gt;After abandoning the problem for a while I've just tried it on Broadwell EP(E5 v4) and found that it does not occur with our app on the 14 core E5 2680 v4 but does on the 6-core E5 1650 v4. One thing that has stayed constant is that we get FREQ_TRANS_CYCLES events (an uncore PCU event) whenever spikes occur. Are other people seeing the same thing?&lt;/P&gt;

&lt;P&gt;Will&lt;/P&gt;</description>
      <pubDate>Fri, 26 Aug 2016 16:29:22 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/temporary-pcie-bandwidth-drops-on-Haswell-v3/m-p/1081316#M5486</guid>
      <dc:creator>Will_N_</dc:creator>
      <dc:date>2016-08-26T16:29:22Z</dc:date>
    </item>
    <item>
      <title>Is there a list of supported</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/temporary-pcie-bandwidth-drops-on-Haswell-v3/m-p/1081317#M5487</link>
      <description>&lt;P&gt;Is there a list of supported Uncore frequencies on HSW? Do supported frequencies change between different HSW models? Any changes in BDW?&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 07 Sep 2016 11:53:52 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/temporary-pcie-bandwidth-drops-on-Haswell-v3/m-p/1081317#M5487</guid>
      <dc:creator>JJoha8</dc:creator>
      <dc:date>2016-09-07T11:53:52Z</dc:date>
    </item>
  </channel>
</rss>

