Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Can we use SSE intrinsics to write to memory-mapped PCI device memory?

Anil_A_
Beginner

I have a use case where the x86 CPU has to write 64 bytes of data to a PCIe slave device whose memory has been mmap'ed into user space. At the moment I use memcpy to do that, but it turns out to be very slow. Can we use Intel SSE intrinsics like _mm_stream_si128 to speed it up? Or is there some other mechanism, short of using DMA?

The objective is to pack all 64 bytes into a single TLP and send it over the PCIe bus to reduce the overhead.

The system config: a dual-socket Haswell system with a custom NIC connected via a x16 PCIe link.
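
For reference, this is roughly the kind of copy I have in mind (untested sketch; assumes the source buffer and the mmap'ed device pointer are both 16-byte aligned):

    #include <emmintrin.h>  /* SSE2: _mm_load_si128, _mm_stream_si128, _mm_sfence */

    /* Copy 64 bytes to (hopefully WC-mapped) device memory with
     * non-temporal stores.  dst and src must be 16-byte aligned. */
    static void copy64_nt(void *dst, const void *src)
    {
        __m128i *d = (__m128i *)dst;
        const __m128i *s = (const __m128i *)src;

        _mm_stream_si128(&d[0], _mm_load_si128(&s[0]));
        _mm_stream_si128(&d[1], _mm_load_si128(&s[1]));
        _mm_stream_si128(&d[2], _mm_load_si128(&s[2]));
        _mm_stream_si128(&d[3], _mm_load_si128(&s[3]));
        _mm_sfence();  /* make the stores visible to the device */
    }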

McCalpinJohn
Honored Contributor III

This will work if the memory is properly mapped as Write-Combining.

The Linux kernel folks keep changing the interfaces, but on one system the required kernel call was "ioremap_wc()". This sets up the combination of MTRRs and PATs required to get the write-combining type. This can be done in two different ways (as shown in Table 11-7 of Volume 3 of the SW Developer's Manual, document 325384-055), but I don't remember which approach was used. Performance was fine -- about 73% of peak, which is what I expected from a back-of-the-envelope estimate of the packet header overhead.
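
For example, if you can modify the driver, something like this in its mmap handler should request the WC type on the user mapping (illustrative sketch, error handling omitted; pgprot_writecombine() is the user-mapping analogue of ioremap_wc()):

    #include <linux/fs.h>
    #include <linux/mm.h>
    #include <linux/pci.h>

    static struct pci_dev *pdev;  /* saved at probe time (assumption) */

    static int my_mmap(struct file *filp, struct vm_area_struct *vma)
    {
        unsigned long pfn = pci_resource_start(pdev, 0) >> PAGE_SHIFT;

        /* Ask for the write-combining memory type on this mapping. */
        vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
        return io_remap_pfn_range(vma, vma->vm_start, pfn,
                                  vma->vm_end - vma->vm_start,
                                  vma->vm_page_prot);
    }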

In every version of the Linux kernel that I looked at, the "ioremap_cache()" call is silently converted to "ioremap_nocache()" for memory-mapped IO space. This is usually the correct thing to do, but it makes it difficult to experiment....

More details are available at:

https://www.researchgate.net/publication/266375644_Low_Level_Microbenchmarks_of_Processor_to_FPGA_Memory-Mapped_IO

Anil_A_
Beginner

Thanks, John, for your suggestion. Is there a way all of this can be done from user space? Mine is a user-space driver; I believe the ioremap_* functions are only available in kernel code.

Thanks

-Anil

Patrick_L_Intel
Employee

Anil A. wrote:

Thanks, John, for your suggestion. Is there a way all of this can be done from user space? Mine is a user-space driver; I believe the ioremap_* functions are only available in kernel code.

Thanks

-Anil

Hi Anil,

Please take a look at the Data Plane Development Kit (DPDK) library: http://dpdk.org/doc/guides/prog_guide/env_abstraction_layer.html#pci-access

DPDK contains a set of optimized C libraries that can accelerate packet processing on IA. It uses Linux's UIO framework to map the required memory space from kernel space to user space, so the user can simply open a UIO device to communicate with the NIC.
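
The user-space side of a UIO mapping is just open() plus mmap(). A minimal sketch (the "/dev/uio0" node and the 4 KiB size are assumptions that depend on your driver):

    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Map memory region 0 of a UIO device into this process.
     * With UIO, the mmap offset selects the region: region N is
     * mapped at offset N * page size. */
    void *map_uio_region0(void)
    {
        int fd = open("/dev/uio0", O_RDWR);
        if (fd < 0)
            return NULL;
        return mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);
    }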

Best Regards,

Patrick

McCalpinJohn
Honored Contributor III

The ioremap_*() calls do have to be done in the kernel -- I was assuming that you would be able to modify the existing device driver code that mapped the PCIe device into user space. 

Table 11-7 of Vol 3 of the SW Developer's Manual (document 325384) shows how the combination of MTRRs and PATs controls the caching mode. If the existing driver sets up any combination of MTRR and PAT values that map to UC, then you will not be able to perform a 64 Byte store. (The same should be true for WP or WT, though I have never seen them used in Linux. WB mode should not be used for MMIO.)

If you really only need to write 64 Bytes, you could try doing 128-bit or 256-bit stores. Intel cautions against using stores larger than 64 bits to MMIO, but (if I recall correctly) it is not guaranteed *not* to work, so you might get lucky? This would not give you a single 64 Byte store, but it might let you get away with fewer than eight 8-Byte stores.
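
If you want to try it, the AVX version would look something like this (untested sketch; assumes 32-byte alignment and a WC mapping, and the hardware is still free to split it into multiple transactions):

    #include <immintrin.h>  /* AVX: _mm256_load_si256, _mm256_stream_si256 */

    /* Write 64 bytes as two 256-bit non-temporal stores. */
    static void write64_avx(void *dst, const void *src)
    {
        __m256i *d = (__m256i *)dst;
        const __m256i *s = (const __m256i *)src;

        _mm256_stream_si256(&d[0], _mm256_load_si256(&s[0]));
        _mm256_stream_si256(&d[1], _mm256_load_si256(&s[1]));
        _mm_sfence();  /* flush the WC buffer toward the device */
    }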

Anyway, the point is that the hardware controls the size of the write transactions, and this control is via a combination of the MTRR and PAT values -- both of which can only be set in the kernel. The Data Plane Development Kit and the Linux UIO infrastructure don't change this -- they just make it easier to write a kernel device driver that allows the user to make the desired mapping requests.
