I have a use case where an x86 CPU has to write 64 bytes of data to a PCIe slave device whose memory has been mmap'ed into user space. Right now I use memcpy to do that, but it turns out to be very slow. Can we use Intel SSE intrinsics such as _mm_stream_si128 to speed it up? Or is there any other mechanism, other than DMA?
The objective is to pack all 64 bytes into one TLP and send it on the PCIe bus to reduce the overhead.
The system config is: a dual-socket Haswell with a custom NIC connected on an x16 PCIe bus.
This will work if the memory is properly mapped as Write-Combining.
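For reference, a minimal sketch of the 64-byte write asked about above, built from four _mm_stream_si128 stores. The name write64_stream is made up for illustration, and the sketch assumes dst is a 64-byte-aligned pointer into a region that the kernel has already mapped as Write-Combining. On a UC mapping the same code still runs, but the stores are not merged.

```c
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128, _mm_stream_si128, _mm_sfence */

/* Hypothetical helper: copy one 64-byte block with non-temporal stores.
 * Assumption: dst is 64-byte aligned and lies in a Write-Combining mapping,
 * so the four 16-byte stores can fill a single WC buffer. src may be
 * unaligned ordinary memory. */
static inline void write64_stream(void *dst, const void *src)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    /* Load all four 16-byte chunks first so the stores are issued back to back. */
    __m128i a = _mm_loadu_si128(s + 0);
    __m128i b = _mm_loadu_si128(s + 1);
    __m128i c = _mm_loadu_si128(s + 2);
    __m128i e = _mm_loadu_si128(s + 3);

    _mm_stream_si128(d + 0, a);
    _mm_stream_si128(d + 1, b);
    _mm_stream_si128(d + 2, c);
    _mm_stream_si128(d + 3, e);

    /* Order the streaming stores ahead of any later stores. */
    _mm_sfence();
}
```

Even on a WC mapping this is best effort: an interrupt or another event between the stores can flush a partly filled buffer, in which case the data goes out as more than one write.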
The Linux kernel folks keep changing the interfaces, but on one system the required kernel call was "ioremap_wc()". This set up the combination of MTRR and PAT values required to get the write-combining type. That combination can be produced in two different ways (as shown in Table 11-7 of Volume 3 of the SW Developer's Manual, 325384-055), but I don't remember which approach was used. Performance was fine -- about 73% of peak, which is what I expected from a back-of-the-envelope estimate of packet header overhead.
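To make the header-overhead estimate concrete, here is a small sketch of the arithmetic. The 24-byte per-TLP overhead is an assumption, not a figure from the post: roughly 4 bytes of framing and sequence number, a 16-byte header with a 64-bit address, and a 4-byte LCRC. It ignores DLLPs and flow-control traffic, and the function name tlp_efficiency is hypothetical.

```c
/* Fraction of the bytes on the link that carry payload for one write TLP.
 * overhead_bytes is an assumed per-packet cost (framing + header + LCRC). */
static double tlp_efficiency(unsigned payload_bytes, unsigned overhead_bytes)
{
    return (double)payload_bytes / (double)(payload_bytes + overhead_bytes);
}
```

With those assumed numbers, tlp_efficiency(64, 24) is 64/88, or about 0.73, while tlp_efficiency(8, 24) is 8/32 = 0.25. That gap is why a single 64-byte TLP is so much better than eight separate 8-byte writes.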
In every version of the Linux kernel that I looked at, the "ioremap_cache()" call is silently converted to "ioremap_nocache()" for memory-mapped IO space. This is usually the correct thing to do, but it makes it difficult to experiment....
More details are available at:
Thanks, John, for your suggestion. Is there a way all of this can be done from user space? Mine is a user-space driver, and I believe these ioremap_* functions are only available in kernel code.
Thanks
-Anil
Anil A. wrote:
Thanks, John, for your suggestion. Is there a way all of this can be done from user space? Mine is a user-space driver, and I believe these ioremap_* functions are only available in kernel code.
Thanks
-Anil
Hi Anil,
Please take a look at the Data Plane Development Kit (DPDK) library: http://dpdk.org/doc/guides/prog_guide/env_abstraction_layer.html#pci-access
DPDK contains a set of optimized C libraries that can accelerate packet processing on IA. It uses Linux's UIO framework to map the required memory space from kernel space to user space, so the user can simply open the uio device to communicate with the NIC.
Best Regards,
Patrick
The ioremap_*() calls do have to be done in the kernel -- I was assuming that you would be able to modify the existing device driver code that mapped the PCIe device into user space.
Table 11-7 of Vol 3 of the SW Developer's Manual (document 325384) shows how the combination of MTRRs and PATs controls the caching mode. If the existing driver sets up any combination of MTRR and PAT values that maps to UC, then you will not be able to perform a 64-Byte store. (The same should be true for WP or WT, though I have never seen them used in Linux. WB mode should not be used for MMIO.)
If you really only need to write 64 Bytes, you could try doing 128-bit or 256-bit stores. Intel cautions against using stores larger than 64 bits to MMIO, but (if I recall correctly) it is not guaranteed *not* to work, so you might get lucky? This would not give you a single 64-Byte store, but it might let you get away with fewer than eight 8-Byte stores.
Anyway, the point is that the hardware controls the size of the write transactions, and this control is via a combination of the MTRR and PAT values -- both of which can only be controlled in the kernel. The Data Plane Development Kit and the Linux UIO infrastructure don't change this -- they just make it easier to write a kernel device driver that allows the user to make the desired mapping requests.
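As one illustration of a kernel-provided mapping request that a user-space driver can make without a custom module: for prefetchable BARs, Linux can expose a resourceN_wc file in the device's sysfs directory, and mmap'ing that file gives a Write-Combining mapping set up by the kernel. The sketch below assumes such a file exists for the device in question (it will not if the BAR is not prefetchable or WC is unavailable) and that the process has permission to open it, which normally means root. The helper name map_bar_wc and the PCI address in the usage note are hypothetical.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map len bytes of a PCI BAR through a sysfs resource
 * file such as ".../resource0_wc". The memory type of the mapping is chosen
 * by the kernel, not by this code. Returns NULL on any failure. */
static void *map_bar_wc(const char *sysfs_path, size_t len)
{
    int fd = open(sysfs_path, O_RDWR);
    if (fd < 0)
        return NULL;

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping remains valid after the descriptor is closed */

    return (p == MAP_FAILED) ? NULL : p;
}
```

A call might look like map_bar_wc("/sys/bus/pci/devices/0000:03:00.0/resource0_wc", 4096), where the address 0000:03:00.0 and the length are placeholders for the real device and BAR size.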
