Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Sample code for PCIe Burst Transfer white paper by Intel?




I bumped into a white paper by Intel:

Is there sample code for Linux on a Xeon (E5-2600) processor that I can look at, instead of the general idea outlined in the paper?

Basically, the steps are:

1. Mark the memory region as WC

-   Any sample code for this?

2. Burst transfer

-  Sample code using the _mm256_store_si256() intrinsic

Any help is appreciated.



7 Replies
Honored Contributor III

Since this is intended for use with the memory-mapped IO associated with PCIe devices, much of the code is going to be device-specific and is likely to live in the device driver for the specific device of interest.

With that said, the basic calls that are used in the device driver are going to be very similar.  Unfortunately the deities of the Linux kernel world have changed the interfaces quite frequently (even within the same major version of the kernel), so there is an irritatingly large amount of homework required up front.  On the Xeon E3-1200 (v1 "Sandy Bridge") system that I used for device driver work, I was running RHEL 6.x, which is based on Linux 2.6.32-* kernels.

The device driver I used (for an FPGA) was pretty simple, but there are a lot of steps:

  1. pci_get_device() to find the desired device in the system
  2. pci_enable_device()
  3. pci_set_consistent_dma_mask()
  4. Get the BAR information for each of the device's exported memory ranges:
    1. pci_resource_start()
    2. pci_resource_len()
    3. pci_resource_flags()
  5. ioremap_wc() to map one of the BARs into the kernel's virtual address space using the Write-Combining attribute.
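As a rough illustration only (not compiled or tested here), the five steps above might look like the following in a 2.6.32-era driver; VENDOR_ID, DEVICE_ID, and the choice of BAR 0 are placeholders for the device of interest:

```c
/* Sketch of the enumeration/mapping steps above (2.6.32-era PCI API).
 * VENDOR_ID, DEVICE_ID, and BAR 0 are placeholders. */
#include <linux/pci.h>
#include <linux/dma-mapping.h>
#include <linux/io.h>

static void __iomem *map_bar0_wc(void)
{
    struct pci_dev *pdev;
    resource_size_t bar_start, bar_len;

    /* 1. Find the device */
    pdev = pci_get_device(VENDOR_ID, DEVICE_ID, NULL);
    if (!pdev)
        return NULL;

    /* 2. Enable it */
    if (pci_enable_device(pdev))
        return NULL;

    /* 3. Set the mask for coherent DMA allocations */
    pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(64));

    /* 4. Query BAR 0: start, length, and flags */
    bar_start = pci_resource_start(pdev, 0);
    bar_len   = pci_resource_len(pdev, 0);
    if (!(pci_resource_flags(pdev, 0) & IORESOURCE_MEM))
        return NULL;

    /* 5. Map it with the Write-Combining attribute */
    return ioremap_wc(bar_start, bar_len);
}
```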

Once this is done there are several ways to perform the write-combining stores:

  1. Create an ioctl in the device driver that includes the write-combining stores when activated.
    1. The white paper uses compiler intrinsics such as _mm256_store_si256()
    2. I use inline assembler to generate MOVNTDQ instructions.
  2. Create an "mmap" function in the device driver that uses io_remap_pfn_range() to create a virtual address range for the user process to use to access the MMIO space directly.

With the WC memory type, you don't really need to use the non-temporal store instructions -- all stores to a write-combining region are "write-combined" -- but it is a good way to remind yourself what is going on.



Hi Dr. Bandwidth,

Thanks for the reply. It is very useful.

I tried to compile some sample code I found online that uses AVX, just to get some practice with it. The compilation breaks at #include <stdlib.h>, as pasted below.  It seems this intrinsic set only works in user space, contrary to the white paper?

/usr/lib/gcc/x86_64-linux-gnu/4.8/include/mm_malloc.h:27:20: fatal error: stdlib.h: No such file or directory
 #include <stdlib.h>

The code:

#include <linux/module.h>
#include <linux/init.h>
#include <linux/types.h>
#include <asm/i387.h>      /* kernel_fpu_begin()/kernel_fpu_end() */
#include <x86intrin.h>

static void CopyWithAVX(uint8_t* dst, uint8_t* src, size_t size)
{
    size_t stride = 16 * sizeof(__m256i);
    while (size) {
        __m256i a = _mm256_load_si256((__m256i*)src + 0);
        __m256i b = _mm256_load_si256((__m256i*)src + 1);
        __m256i c = _mm256_load_si256((__m256i*)src + 2);
        __m256i d = _mm256_load_si256((__m256i*)src + 3);
        __m256i e = _mm256_load_si256((__m256i*)src + 4);
        __m256i f = _mm256_load_si256((__m256i*)src + 5);
        __m256i g = _mm256_load_si256((__m256i*)src + 6);
        __m256i h = _mm256_load_si256((__m256i*)src + 7);
        __m256i i = _mm256_load_si256((__m256i*)src + 8);
        __m256i j = _mm256_load_si256((__m256i*)src + 9);
        __m256i k = _mm256_load_si256((__m256i*)src + 10);
        __m256i l = _mm256_load_si256((__m256i*)src + 11);
        __m256i m = _mm256_load_si256((__m256i*)src + 12);
        __m256i n = _mm256_load_si256((__m256i*)src + 13);
        __m256i o = _mm256_load_si256((__m256i*)src + 14);
        __m256i p = _mm256_load_si256((__m256i*)src + 15);
        _mm256_store_si256((__m256i*)dst + 0, a);
        _mm256_store_si256((__m256i*)dst + 1, b);
        _mm256_store_si256((__m256i*)dst + 2, c);
        _mm256_store_si256((__m256i*)dst + 3, d);
        _mm256_store_si256((__m256i*)dst + 4, e);
        _mm256_store_si256((__m256i*)dst + 5, f);
        _mm256_store_si256((__m256i*)dst + 6, g);
        _mm256_store_si256((__m256i*)dst + 7, h);
        _mm256_store_si256((__m256i*)dst + 8, i);
        _mm256_store_si256((__m256i*)dst + 9, j);
        _mm256_store_si256((__m256i*)dst + 10, k);
        _mm256_store_si256((__m256i*)dst + 11, l);
        _mm256_store_si256((__m256i*)dst + 12, m);
        _mm256_store_si256((__m256i*)dst + 13, n);
        _mm256_store_si256((__m256i*)dst + 14, o);
        _mm256_store_si256((__m256i*)dst + 15, p);
        size -= stride;
        src += stride;
        dst += stride;
    }
}

static int __init mmiotest_init(void)
{
	// call CopyWithAVX() here, wrapped in kernel_fpu_begin()/kernel_fpu_end()

	return 0;
}

static void __exit mmiotest_exit(void)
{
}

module_init(mmiotest_init);
module_exit(mmiotest_exit);


The Makefile compiler switches (I added -mmmx and -msse because the compiler complained about those two being disabled):

obj-m        := mmiotest.o
KERN_SRC    := /lib/modules/$(shell uname -r)/build/
PWD            := $(shell pwd)

ccflags-y := -march=corei7-avx -mmmx -msse -mpreferred-stack-boundary=4



Honored Contributor III

Memory allocation in the kernel is different from memory allocation in user space.  I recommend the book "Linux Device Drivers" (3rd edition) as a good reference.  (Unfortunately many of the details change across kernel revisions, but the book is very useful for understanding the overall structure and limitations of working with device drivers.)

I have some performance-related notes on this topic in a technical report (free registration required):

I will attempt to make this (and other reports) available at my more official web site:



Thanks John.  


I have a feeling that the white paper implements the intrinsics in user-space code. That is why it compiles.



Hi John,

I realize this is a bit of an old thread, but it is very relevant to what I'm working on, and I figured I'd rather add to this thread than start a new one.  I've read the originally referenced Intel white paper on PCIe burst transfers and your referenced paper on processor-to-FPGA memory-mapped I/O.  Both papers seem to recommend using two AVX 32-byte aligned stores to write to the device-mapped memory, followed by an mfence, so that a single 64-byte Write TLP is sent to the device.  Your paper goes on to suggest actually using the non-temporal versions of these AVX stores.  However, I don't understand why 8 standard 8-byte stores followed by an mfence to a WC region wouldn't also result in a single 64-byte Write TLP.  Assuming the WC region (and first store) was on a 64-byte aligned address, I'd expect all the 8-byte stores to be held in the WC buffer, and then the entire cache line would be flushed out as a single 64-byte Write TLP upon issuance of the mfence (if not before that, when the buffer is full).  Can you maybe shed some light on the need for the wider AVX stores?  Not sure if it matters, but I'm actually mapping the memory into user space, which is where I'm issuing the store instructions from.



Honored Contributor III

I think your understanding is correct.  There is no need for SIMD stores, and there is no need for non-temporal stores, when writing to WC memory space -- as long as all of the stores to the cache line are "very close together", the majority of the bus transactions will be full-cache-line streaming stores.  (There is no need for an MFENCE unless you require ordering between WC stores and ordinary stores -- the act of filling the buffer is enough to cause it to flush.)

I use non-temporal stores as a reminder that these are write-combining stores.  It is not necessary, but I worry that if I did not use the non-temporal form I might come back to the code later and find myself confused.  (I have also used WP and WT types for memory-mapped IO, and wanted to make it obvious which version I was working on from simple inspection of the instruction types used in the code.)

Since I am using non-temporal stores, I use the SIMD versions simply because they are more familiar to me.  Completing the stores to all the bytes in the target cache line with as few stores as possible should reduce the probability of taking an interrupt in the middle of the sequence (which would cause the WC buffer to flush with one or more partial-cache-line writes).   I have not tested this recently, but there should be no harm in using the wider store instructions.  It is not possible to completely prevent partial-cache-line writes, but the performance analysis is easier if they occur infrequently enough that I can ignore them.  This is almost always the case if all 64 bytes of stores come from contiguous store instructions, but again this is a visual reminder that I am trying to encourage specific hardware behavior with these stores.


Hi John - Thank you for your quick reply.  That explains it and shores up my understanding.
