Intel® Processors, Tools, and Utilities

Performance of P2P DMA PCIe packets routed by CPU?

KKaiG
Beginner

Hello,

I would like to know whether P2P DMA packets routed by the CPU on the PCIe bus have narrower bandwidth than device-to-RAM transfers.

[SYMPTOM]

When I run a sequential data-transfer workload (reads of SSD data blocks) using peer-to-peer DMA from three Intel DC P4600 SSDs (striped with md-raid0) to an NVIDIA Tesla P40, it performed with worse throughput (7.1GB/s) than the theoretical value (9.6GB/s).

On the other hand, the same 3x Intel DC P4600 SSD configuration recorded 9.5GB/s throughput when we tried SSD-to-RAM DMA with the same kernel driver.

The GPU's device memory is mapped to the PCI BAR1 region using NVIDIA GPUDirect RDMA, so it is visible at physical addresses of the host system, and we can use these addresses as the destination address of an NVMe READ command.
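
For concreteness, here is a minimal kernel-side sketch of that mapping step, assuming the classic nv-p2p.h GPUDirect RDMA interface; the exact signature of nvidia_p2p_get_pages() has varied across driver releases, and helper names such as map_gpu_buffer are illustrative, not my actual driver code:

/*
 * Sketch: pin a CUDA device-memory range via GPUDirect RDMA and print the
 * bus (BAR1) addresses that can be used as DMA destinations.
 * gpu_vaddr and length must be 64KB-aligned; error paths are trimmed.
 */
#include <linux/module.h>
#include <linux/kernel.h>
#include "nv-p2p.h"              /* shipped with the NVIDIA driver sources */

struct gpu_map {
	u64 vaddr;
	struct nvidia_p2p_page_table *ptab;
};

/* called back by the NVIDIA driver if the pinned GPU pages are revoked */
static void gpu_mem_free_cb(void *data)
{
	struct gpu_map *gm = data;

	nvidia_p2p_free_page_table(gm->ptab);
	gm->ptab = NULL;
}

static int map_gpu_buffer(struct gpu_map *gm, u64 gpu_vaddr, u64 length)
{
	unsigned int i;
	int rc;

	gm->vaddr = gpu_vaddr;
	/* classic API: p2p_token and va_space are 0 */
	rc = nvidia_p2p_get_pages(0, 0, gpu_vaddr, length,
				  &gm->ptab, gpu_mem_free_cb, gm);
	if (rc)
		return rc;

	for (i = 0; i < gm->ptab->entries; i++)
		pr_info("GPU page %u -> bus address 0x%llx\n", i,
			(unsigned long long)gm->ptab->pages[i]->physical_address);

	/* normal teardown: nvidia_p2p_put_pages(0, 0, gm->vaddr, gm->ptab) */
	return 0;
}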

I wrote a Linux kernel driver that mediates direct data transfers between the NVMe SSDs and the GPU or host RAM.

It constructs NVMe READ commands that read particular SSD blocks and store them at the specified destination address (which may be GPU device memory), then enqueues the commands into the submission queue of the in-box nvme driver.
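
To illustrate what such a command looks like, here is a small self-contained sketch that lays out a READ submission-queue entry following the NVMe specification and points its first PRP entry at a GPU bus address; the struct, the build_read() helper, and the values in main() are illustrative, not taken from my driver:

/* Simplified, self-contained illustration of a 64-byte NVMe READ
 * submission-queue entry (field layout per the NVMe spec, not the
 * in-kernel nvme structs). */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct nvme_sqe {
	uint8_t  opcode;        /* CDW0[7:0]                      */
	uint8_t  flags;         /* CDW0[15:8]                     */
	uint16_t cid;           /* CDW0[31:16] command identifier */
	uint32_t nsid;          /* CDW1 namespace ID              */
	uint64_t rsvd;          /* CDW2-3                         */
	uint64_t mptr;          /* CDW4-5 metadata pointer        */
	uint64_t prp1;          /* CDW6-7 first PRP entry         */
	uint64_t prp2;          /* CDW8-9 second PRP / PRP list   */
	uint64_t slba;          /* CDW10-11 starting LBA          */
	uint16_t nlb;           /* CDW12[15:0] blocks, zero-based */
	uint16_t control;       /* CDW12[31:16]                   */
	uint32_t cdw13_15[3];
};

#define NVME_CMD_READ 0x02

/* Fill a READ whose destination is a GPU BAR1 page; gpu_bus_addr would
 * come from the GPUDirect RDMA page table shown above. */
static void build_read(struct nvme_sqe *sqe, uint32_t nsid,
		       uint64_t slba, uint16_t nblocks, uint64_t gpu_bus_addr)
{
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = NVME_CMD_READ;
	sqe->nsid   = nsid;
	sqe->prp1   = gpu_bus_addr;   /* for > 2 pages, prp2 points to a PRP list */
	sqe->slba   = slba;
	sqe->nlb    = nblocks - 1;    /* the field is zero-based */
}

int main(void)
{
	struct nvme_sqe sqe;

	/* hypothetical values purely for illustration */
	build_read(&sqe, 1, 0x1000, 32, 0x383800000000ULL);
	printf("READ slba=%llu nlb=%u prp1=0x%llx\n",
	       (unsigned long long)sqe.slba, (unsigned)sqe.nlb + 1,
	       (unsigned long long)sqe.prp1);
	return 0;
}

The real driver hands the command to the in-box nvme driver's submission queue rather than printing it; the point is the addressing, where the PRP entries carry the GPU's BAR1 bus addresses instead of host RAM addresses.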

Our Linux kernel module is likely not at fault, because it performs SSD-to-GPU P2P DMA at 6.3GB/s on a dual-SSD configuration. Each individual SSD performs at 3.2GB/s, equivalent to the catalog spec of the DC P4600.

[PLEASE TELL ME IF YOU KNOW]

  1. Does the Xeon E5-2650 v4 (Broadwell-EP) processor have a hardware limitation that caps peer-to-peer DMA packet routing below the PCIe specification?
  2. If the Broadwell-EP Xeon has such a limitation on P2P DMA routing, is it improved in Skylake-SP (Xeon Scalable)?

     

...and I would welcome folks' suggestions if you have any other ideas to check beyond the CPU's capability.

Best regards,

* SSD-to-RAM works as expected (9.5GB/s)

* SSD-to-GPU works slower than expected (7.1GB/s)

idata
Employee

Hello kkaigai,

Thanks for posting.

Based on your inquiries, we believe it's best to engage our communities to address this thread.

Best regards,

 

Eugenio F.
idata
Employee

Hello kkaigai,

Thank you for joining the Processors Community.

Please let me review this matter; I will update the thread as soon as possible.

Regards,

Amy C.

KKaiG
Beginner

Thanks for your follow-up.

Please don't hesitate to ask me for further information about our investigation.

idata
Employee

kkaigai, thank you for your patience.

Please find below the answers to your questions; I will copy each question and then add the answer.

1. Does the Xeon E5-2650 v4 (Broadwell-EP) processor have a hardware limitation that caps peer-to-peer DMA packet routing below the PCIe specification?

The root complex transmits packets out of its ports and receives packets on its ports, which it forwards to memory. A multi-port root complex may also route packets from one port to another, but it is NOT required by the specification to do so. The chipset supports peer-to-peer packet routing between PCI Express endpoints, PCI devices, memory, and graphics. It is yet to be determined whether first-generation PCI Express chipsets will support peer-to-peer packet routing between PCI Express endpoints. Remember that the specification does not require the root complex to support peer-to-peer packet routing between the multiple links associated with the root complex.
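
One practical way to check whether two endpoints share a PCIe switch or only meet at the root complex (in which case the processor must route any P2P TLPs between them) is to compare their sysfs bridge chains; a minimal sketch, where the bus/device/function addresses are supplied on the command line:

/* Print the sysfs bridge chain of each PCI function given on the
 * command line, e.g. the SSDs and the GPU. */
#include <limits.h>
#include <stdio.h>
#include <unistd.h>

static void print_chain(const char *bdf)
{
	char link[PATH_MAX], path[PATH_MAX];
	ssize_t n;

	snprintf(link, sizeof(link), "/sys/bus/pci/devices/%s", bdf);
	n = readlink(link, path, sizeof(path) - 1);
	if (n < 0) {
		perror(bdf);
		return;
	}
	path[n] = '\0';
	printf("%s -> %s\n", bdf, path);   /* lists every bridge above the device */
}

int main(int argc, char **argv)
{
	for (int i = 1; i < argc; i++)      /* e.g. 0000:01:00.0 0000:02:00.0 */
		print_chain(argv[i]);
	return 0;
}

If the printed chains only share the pci0000:00 root, every peer-to-peer TLP between those devices has to be routed by the processor's root complex; if they share a downstream switch, the switch itself can route them.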

2. If the Broadwell-EP Xeon has such a limitation on P2P DMA routing, is it improved in Skylake-SP (Xeon Scalable)?

It depends on which specific Scalable processor we are comparing. In general, Scalable processors have more PCIe lanes than the E5 v4 family.

Regards,

Amy C.

KKaiG
Beginner

2. If the Broadwell-EP Xeon has such a limitation on P2P DMA routing, is it improved in Skylake-SP (Xeon Scalable)?

It depends on which specific Scalable processor we are comparing. In general, Scalable processors have more PCIe lanes than the E5 v4 family.

Thanks for your reply. At this moment, we still cannot determine whether the E5-2650 v4 (Broadwell) processor has narrower P2P DMA bandwidth than device-to-host transfers.

Fortunately, we were able to obtain a new GPU (Tesla V100), and we are now arranging other hardware based on Skylake (Xeon Gold 6128T).

Once the new hardware is delivered, we will run the benchmark and report the results.

idata
Employee

kkaigai, sure, let us know the results.

Regards,

Amy C.

KKaiG
Beginner

On my new server with a Xeon Gold 6126T and an NVIDIA Tesla V100 / P40 (the latter as in the previous test), P2P DMA performance improved from 7.1GB/s to 8.5GB/s.

https://translate.google.com/translate?hl=ja&sl=auto&tl=en&u=http://kaigai.hatenablog.com/entry/2018/01/30/211342

One strange result was observed in the SSD-to-RAM DMA test (no GPU involved): it degraded to 8.5GB/s from the 9.6GB/s we saw on Broadwell.

I'm still investigating, so please let me know if you have any ideas of what to look at.

idata
Employee

Hello, kkaigai.

In this case, in order to continue assisting you with this issue, I would highly recommend that you open a thread with our Intel® Developer Zone (https://software.intel.com/en-us/home). Make sure you provide a clear explanation of your issue and we will be able to help you.

Antony S.
maybe0524
Beginner

1) How can I confirm from the feature list of the P4600 products whether they support CMB? As far as I know, P2P needs the EP (endpoint) to support CMB.

2) Is it possible to use P2P only if the kernel version is greater than 4.20?
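
On (1), one way to check is to read the controller's CMBLOC (offset 0x38) and CMBSZ (offset 0x3C) registers in BAR0: a CMBSZ of zero generally means no Controller Memory Buffer. A minimal userspace sketch, assuming root privileges; the sysfs path is only an example:

/* Check whether an NVMe controller reports a Controller Memory Buffer
 * by reading CMBLOC (0x38) and CMBSZ (0x3C) from the register BAR. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *res = argc > 1 ? argv[1]
		: "/sys/bus/pci/devices/0000:01:00.0/resource0"; /* example BDF */
	int fd = open(res, O_RDONLY);
	if (fd < 0) { perror("open"); return 1; }

	/* map the first 4KB of the controller register BAR (requires root) */
	volatile uint32_t *regs = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
	if (regs == MAP_FAILED) { perror("mmap"); return 1; }

	uint32_t cmbloc = regs[0x38 / 4];
	uint32_t cmbsz  = regs[0x3C / 4];
	printf("CMBLOC=0x%08x CMBSZ=0x%08x -> CMB %ssupported\n",
	       cmbloc, cmbsz, cmbsz ? "" : "not ");

	munmap((void *)regs, 4096);
	close(fd);
	return 0;
}

On (2), the in-kernel pci_p2pdma framework was merged in Linux 4.20, but the approach described earlier in this thread (GPUDirect RDMA plus a custom driver) predates it and does not rely on it.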
