<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Thanks John for your inputs.  in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/Make-sure-certain-PCIe-writes-are-64bytes-to-improve-the-bus/m-p/1073934#M5384</link>
    <description></description>
    <pubDate>Wed, 14 Sep 2016 14:39:34 GMT</pubDate>
    <dc:creator>Anil_A_1</dc:creator>
    <dc:date>2016-09-14T14:39:34Z</dc:date>
    <item>
      <title>Make sure certain PCIe writes are 64bytes to improve the bus performance</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Make-sure-certain-PCIe-writes-are-64bytes-to-improve-the-bus/m-p/1073931#M5381</link>
      <description>&lt;P&gt;I have a use case where the CPU (Xeon E5-2658, Haswell) has to write 64 bytes of data to a device connected over the PCIe bus. On the CPU side, a user-space application does a memcpy from a local buffer to the memory-mapped address of the device. I believe memcpy may be copying 8 bytes at a time, generating PCIe transaction-layer packets (TLPs) that each carry only 8 bytes of data plus header and control overhead.&lt;/P&gt;

&lt;P&gt;Is there a way to ensure that the 64 bytes of data are packed into a single PCIe TLP and written to the bus?&lt;/P&gt;</description>
      <pubDate>Tue, 13 Sep 2016 08:32:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Make-sure-certain-PCIe-writes-are-64bytes-to-improve-the-bus/m-p/1073931#M5381</guid>
      <dc:creator>Anil_A_1</dc:creator>
      <dc:date>2016-09-13T08:32:16Z</dc:date>
    </item>
    <item>
      <title>I have an Intel IPS account,</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Make-sure-certain-PCIe-writes-are-64bytes-to-improve-the-bus/m-p/1073932#M5382</link>
      <description>&lt;P&gt;I have an Intel IPS account, but I am not sure which product and technology category to file this question under.&lt;/P&gt;</description>
      <pubDate>Tue, 13 Sep 2016 08:33:06 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Make-sure-certain-PCIe-writes-are-64bytes-to-improve-the-bus/m-p/1073932#M5382</guid>
      <dc:creator>Anil_A_1</dc:creator>
      <dc:date>2016-09-13T08:33:06Z</dc:date>
    </item>
    <item>
      <title>There is no way to absolutely</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Make-sure-certain-PCIe-writes-are-64bytes-to-improve-the-bus/m-p/1073933#M5383</link>
      <description>&lt;P&gt;There is no way to absolutely guarantee a single 64-Byte packet, but if you use the Write-Combining (WC) memory type and issue a small number of consecutive writes (e.g., two 32-Byte AVX/AVX2 stores) covering 64 Bytes starting at a 64-Byte-aligned address, then you will get a single 64-Byte PCIe transaction *almost* all the time. (The reasons are complex, but ultimately not relevant -- the device must be able to handle partial-block transfers as well as the desired full 64-Byte transfers.)&lt;/P&gt;
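
&lt;P&gt;A minimal C sketch of the two-store approach, assuming an x86-64 CPU with AVX and GCC or Clang (for the target attribute and __builtin_cpu_supports). The names write64() and copy64_avx() are hypothetical, and dst is assumed to point into a region that has already been mapped as WC; on other memory types the data is still copied, but without write-combining.&lt;/P&gt;

<![CDATA[
```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Two 32-byte AVX stores covering one 64-byte-aligned block. In a WC
 * mapping they will usually be combined into a single 64-byte PCIe write,
 * but this is not guaranteed. target("avx") (GCC/Clang) enables AVX for
 * this function only, so the file builds without -mavx. */
__attribute__((target("avx")))
static void copy64_avx(void *dst, const void *src)
{
    /* Unaligned loads: the source buffer may have any alignment. */
    __m256i lo = _mm256_loadu_si256((const __m256i *)src);
    __m256i hi = _mm256_loadu_si256((const __m256i *)((const char *)src + 32));

    /* Aligned stores: both halves fall in the same 64-byte line. */
    _mm256_store_si256((__m256i *)dst, lo);
    _mm256_store_si256((__m256i *)((char *)dst + 32), hi);
}

/* Write 64 bytes from src to dst. Returns -1 if dst is not 64-byte
 * aligned, 0 otherwise. Falls back to memcpy when the CPU lacks AVX. */
int write64(void *dst, const void *src)
{
    if (((uintptr_t)dst & 63) != 0)
        return -1;
    if (__builtin_cpu_supports("avx"))
        copy64_avx(dst, src);
    else
        memcpy(dst, src, 64);
    return 0;
}
```
]]>

&lt;P&gt;The alignment check matters because both 32-Byte halves must land in the same 64-Byte write-combining buffer. Even then a single 64-Byte TLP is likely rather than guaranteed, so the device still has to accept partial writes.&lt;/P&gt;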

&lt;P&gt;Note that the memory type depends on both the MTRR and the PAT settings for the address in question. This is described in Chapter 11 of Volume 3 of the Intel 64 and IA-32 Architectures Software Developer's Manual (document 325384-059, June 2016). Table 11-7 shows six different combinations of MTRR and PAT values that result in the Write-Combining memory type. One or more of these may be more convenient than the others in your particular situation.&lt;/P&gt;

&lt;P&gt;Note also that the "streaming" or "non-temporal" store instructions won't generate streaming stores if the memory type is not WC (or WB, which is not allowed for MMIO regions).&amp;nbsp; The streaming/non-temporal store instructions are not required to generate write-combining if the memory type is WC, but they are more compact and this should reduce the probability of taking an interrupt in the middle of a sequence of stores that fill a 64-Byte write-combining buffer.&amp;nbsp; The discussion in Section 11.3.1 of Volume 3 of the SWDM needs to be read very carefully.&amp;nbsp; Some possible caveats:&lt;/P&gt;

&lt;UL&gt;
	&lt;LI&gt;Section 11.3.1 says that a full WC buffer will be written as a single burst, but it is not clear whether this applies to MMIO transactions.&lt;/LI&gt;
	&lt;LI&gt;Executing two 32-Byte stores takes at least 2 cycles on your Haswell processor, so it is possible that some external event will cause the write-combining buffer to be flushed after the first 32-Byte store but before the second.&lt;/LI&gt;
&lt;/UL&gt;
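
&lt;P&gt;For comparison, a sketch of the non-temporal variant of the same 64-Byte write, under the same assumptions (x86-64, AVX, GCC or Clang; stream64_avx() is a hypothetical name). It issues two 32-Byte VMOVNTDQ stores via _mm256_stream_si256 and then an SFENCE. As noted above, these instructions only act as streaming stores on WC (or WB) memory.&lt;/P&gt;

<![CDATA[
```c
#include <immintrin.h>
#include <stdint.h>

/* Non-temporal (streaming) version: two 32-byte VMOVNTDQ stores to a
 * 64-byte-aligned destination, followed by SFENCE. Returns -1 if dst is
 * not 64-byte aligned, 0 otherwise. The caller must check for AVX. */
__attribute__((target("avx")))
int stream64_avx(void *dst, const void *src)
{
    if (((uintptr_t)dst & 63) != 0)
        return -1;

    __m256i lo = _mm256_loadu_si256((const __m256i *)src);
    __m256i hi = _mm256_loadu_si256((const __m256i *)((const char *)src + 32));

    /* _mm256_stream_si256 requires a 32-byte-aligned address. */
    _mm256_stream_si256((__m256i *)dst, lo);
    _mm256_stream_si256((__m256i *)((char *)dst + 32), hi);

    /* Streaming stores are weakly ordered; SFENCE makes them globally
     * visible before any store that follows it (e.g. a doorbell write). */
    _mm_sfence();
    return 0;
}
```
]]>

&lt;P&gt;The SFENCE is there for ordering, not for combining: if the device is told to consume the data by a later MMIO write, the fence keeps that write from overtaking the streaming stores.&lt;/P&gt;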

&lt;P&gt;In theory, the PCIe controller could merge multiple consecutive 64-Byte transfers into a larger PCIe transfer (e.g., 128 Bytes or 256 Bytes, if allowed by the negotiated PCIe Max_Payload_Size), but I have not been able to find any documentation on whether such a feature exists or is controllable. I have only done performance measurements for write-combining MMIO on a small number of systems, and all were consistent with a 64-Byte payload size for processor-driven writes to MMIO.&lt;/P&gt;</description>
      <pubDate>Wed, 14 Sep 2016 14:15:43 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Make-sure-certain-PCIe-writes-are-64bytes-to-improve-the-bus/m-p/1073933#M5383</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2016-09-14T14:15:43Z</dc:date>
    </item>
    <item>
      <title>Thanks John for your inputs. </title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Make-sure-certain-PCIe-writes-are-64bytes-to-improve-the-bus/m-p/1073934#M5384</link>
      <description>&lt;P&gt;Thanks John for your inputs.&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Any idea how to achieve this on a Linux system, i.e., how to make an MMIO address range use the WC memory type?&lt;/P&gt;

&lt;P&gt;Regards, -Anil&lt;/P&gt;</description>
      <pubDate>Wed, 14 Sep 2016 14:39:34 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Make-sure-certain-PCIe-writes-are-64bytes-to-improve-the-bus/m-p/1073934#M5384</guid>
      <dc:creator>Anil_A_1</dc:creator>
      <dc:date>2016-09-14T14:39:34Z</dc:date>
    </item>
    <item>
      <title>The usual way to do this is</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Make-sure-certain-PCIe-writes-are-64bytes-to-improve-the-bus/m-p/1073935#M5385</link>
      <description>&lt;P&gt;The usual way to do this is in a device driver (which runs in the kernel). To set up the mappings for the kernel to use, just use the "ioremap_wc()" interface, and it will make sure that the MTRRs and PATs are set up correctly. I think that recent kernels use "remap_pfn_range()" to create a mapping for user-space access to the MMIO area, but I have a great deal of trouble following all of the changes in kernel function names and their ever-changing locations in the kernel source tree.&lt;/P&gt;
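
&lt;P&gt;As a user-space alternative to writing a driver, Linux exposes a prefetchable PCI memory BAR both as /sys/bus/pci/devices/BDF/resourceN and as resourceN_wc; mapping the _wc file with mmap() gives a write-combining mapping, roughly what a driver would build with pgprot_writecombine() and io_remap_pfn_range() in its mmap handler. This sketch assumes that the BAR is prefetchable and that the program runs as root; the BDF 0000:03:00.0 and the helper names bar_wc_path() and map_region() are hypothetical.&lt;/P&gt;

<![CDATA[
```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Build the sysfs path of the write-combining mapping file for a PCI BAR,
 * e.g. /sys/bus/pci/devices/0000:03:00.0/resource0_wc. The kernel creates
 * resourceN_wc only for prefetchable memory BARs. Returns -1 if the path
 * does not fit in buf, 0 otherwise. */
int bar_wc_path(char *buf, size_t len, const char *bdf, int bar)
{
    int n = snprintf(buf, len, "/sys/bus/pci/devices/%s/resource%d_wc", bdf, bar);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Map the first size bytes of path read/write and shared, so stores reach
 * the underlying object (the device BAR for a resourceN_wc file, which
 * normally requires root). Returns MAP_FAILED on error. */
void *map_region(const char *path, size_t size)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return MAP_FAILED;
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping remains valid after the descriptor is closed */
    return p;
}
```
]]>

&lt;P&gt;A caller would compare the result of map_region() against MAP_FAILED, write 64-Byte-aligned blocks into the mapping, and munmap() it when done. If the device has no prefetchable BAR, the resourceN_wc file will not exist and the driver route described above is still needed.&lt;/P&gt;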

&lt;P&gt;Aside: There are some weird comments in the Linux kernel documentation (Linux/Documentation/x86/mtrr.txt) about "phasing out" MTRRs -- this is grossly misleading and confusing. Linux can't "phase out" MTRRs -- they are part of the hardware and have to be programmed correctly. I think what they are trying to say is that they are "phasing out" the explicit use of the MTRR interface. This is perfectly reasonable -- the effective memory type is determined by the combination of the MTRR and PAT settings in a very complex way, and the kernel interfaces should be based on the desired memory type, with the MTRR and PAT handling done in a consistent manner at a lower level.&lt;/P&gt;</description>
      <pubDate>Thu, 15 Sep 2016 22:07:14 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Make-sure-certain-PCIe-writes-are-64bytes-to-improve-the-bus/m-p/1073935#M5385</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2016-09-15T22:07:14Z</dc:date>
    </item>
  </channel>
</rss>

