Intel® Integrated Performance Primitives

Usage of Intel IPP ippsCopy functions for data transfer between DDR SDRAM and PCIe

Ilya_G_
Beginner

Hello,
I use a Denverton stepping B1 C3955 @ 2.10 GHz with BIOS 0015D96 on a Harcuvar CRB.
Since Denverton doesn’t have a DMA controller for transferring blocks of data between RAM and PCIe, I use the method suggested by John McCalpin at https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/538135, who writes the following:
“If the PCIe device does not have its own DMA controller, then the fastest way to copy data from system memory to that IO device is to use a processor core.   You would need to set up a memory-mapped IO range for the device with the write-combining attribute, then use a processor core (or thread) to read from (cacheable) system memory and write to the MMIO range using streaming stores”
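
As I understand the quote, a hand-written version of that core-driven copy would look roughly like this (a sketch only, not my actual code; it assumes 16-byte aligned pointers, a size that is a multiple of 64 bytes, and a destination already mapped write-combining):

    #include <emmintrin.h>   /* SSE2: _mm_load_si128, _mm_stream_si128, _mm_sfence */
    #include <stddef.h>

    /* Copy 'bytes' from cacheable DDR (src) to a write-combining MMIO window (dst)
       with 16-byte non-temporal stores, one 64-byte line per loop iteration. */
    static void wc_copy_stream(void *dst, const void *src, size_t bytes)
    {
        __m128i       *d = (__m128i *)dst;
        const __m128i *s = (const __m128i *)src;
        size_t i;
        for (i = 0; i < bytes / 16; i += 4) {
            _mm_stream_si128(&d[i + 0], _mm_load_si128(&s[i + 0]));
            _mm_stream_si128(&d[i + 1], _mm_load_si128(&s[i + 1]));
            _mm_stream_si128(&d[i + 2], _mm_load_si128(&s[i + 2]));
            _mm_stream_si128(&d[i + 3], _mm_load_si128(&s[i + 3]));
        }
        _mm_sfence();   /* flush the write-combining buffers before continuing */
    }
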
As the target PCIe region I use BAR 0 (video memory) of a Matrox Millennium G550 LP PCIe card installed on the Harcuvar CRB. This BAR 0 is mapped in the MMU as non-cacheable and write-combining.

I call ippInit() and ippiGetLibVersion(), which reports: “ippIP SSE4.2 (y8) 9.0.4 (r52811)”.
After that I call ippsCopy_64s to copy 16 MB of data from a local buffer in DDR SDRAM to BAR0.
Both the local buffer and BAR0 are 64-byte aligned.
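
Roughly, the sequence looks like this (a simplified sketch; the BAR0 mapping itself is omitted, copy_to_bar0 is just an illustrative wrapper, and the timing is simplified):

    #include <stdio.h>
    #include <time.h>
    #include <ipp.h>   /* ippInit, ippiGetLibVersion, ippsMalloc_8u, ippsCopy_64s */

    #define XFER_BYTES (16 * 1024 * 1024)

    /* bar0 is assumed to be the virtual address of the card's BAR 0, already
       mapped with the write-combining attribute (the mapping code is omitted). */
    void copy_to_bar0(Ipp64s *bar0)
    {
        ippInit();
        const IppLibraryVersion *v = ippiGetLibVersion();
        printf("%s %s\n", v->Name, v->Version);   /* library name and version */

        /* ippsMalloc_8u returns a 64-byte aligned buffer in DDR SDRAM */
        Ipp64s *src = (Ipp64s *)ippsMalloc_8u(XFER_BYTES);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        ippsCopy_64s(src, bar0, (int)(XFER_BYTES / sizeof(Ipp64s)));
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("DDR -> BAR0: %.1f MB/s\n", XFER_BYTES / sec / 1e6);

        ippsFree(src);
    }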

The throughput I get is 90 MB/s for copies from DDR SDRAM to PCIe and 10 MB/s for copies from PCIe to DDR SDRAM.
Q1. Do the above numbers make sense?
Q2. Is ippsCopy_64s the best option in the absence of a DMA engine? Is there any other method of transferring to/from PCIe that gives higher throughput?
Q3. I tried ippsCopy_32s, ippsCopy_16s, and ippsCopy_8u, but the results are the same as with ippsCopy_64s. Could you please explain?
Q4. I also tried ippiCopyManaged_8u_C1R with the flag IPP_NONTEMPORAL_STORE, as suggested in https://software.intel.com/en-us/articles/ippscopy-vs-ippicopymanaged (the call is sketched below); the result is still the same as with ippsCopy_64s. Could you please explain?
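
For reference, the ippiCopyManaged call is roughly the following (a sketch; copy_to_bar0_managed is just an illustrative wrapper, the 16 MB block is treated as a single-row ROI, and src/bar0 are the buffers from the sketch above):

    #include <ipp.h>

    /* Sketch only: the same 16 MB transfer expressed as a one-row "image" copy
       with non-temporal stores. */
    void copy_to_bar0_managed(const Ipp8u *src, Ipp8u *bar0, int bytes)
    {
        IppiSize roi = { bytes, 1 };              /* width in bytes, one row */
        ippiCopyManaged_8u_C1R(src, bytes,        /* srcStep = whole row     */
                               bar0, bytes,       /* dstStep = whole row     */
                               roi, IPP_NONTEMPORAL_STORE);
    }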

Thanks.
Ilya.

Jonghak_K_Intel
Employee

According to these pages, https://www.matrox.com/graphics/en/products/graphics_cards/g_series/g550pcie/ and https://www.matrox.com/graphics/media/pdf/products/graphics_cards/g_series/en_g550_guide.pdf, the graphics card is a PCIe x1 device and its bandwidth is up to 250 MB/s.
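
For reference, 250 MB/s is just the raw link rate of a PCIe 1.x x1 lane (assuming the G550 trains as Gen1 x1): 2.5 GT/s × 8/10 (8b/10b encoding) ÷ 8 bits per byte = 250 MB/s per direction, before packet and protocol overhead, so the achievable payload rate will be somewhat lower.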

Q1. Do the above numbers make sense? I can't tell from the numbers alone.
Q2. Is ippsCopy_64s the best option in the absence of a DMA engine? Is there any other method of transferring to/from PCIe that gives higher throughput? Memory-mapped I/O is known to be a suitable method for devices that require large data transfers, such as graphics cards.
Q3. I tried ippsCopy_32s, ippsCopy_16s, and ippsCopy_8u, but the results are the same as with ippsCopy_64s. Could you please explain? If the bottleneck is PCIe I/O, then they could all end up at the same speed. Can you measure how fast each function can write to local memory (see the sketch below)?
Q4. I also tried ippiCopyManaged_8u_C1R with the flag IPP_NONTEMPORAL_STORE, as suggested in https://software.intel.com/en-us/articles/ippscopy-vs-ippicopymanaged; the result is still the same as with ippsCopy_64s. Could you please explain? This will reduce caching overhead but won't speed up the copy through PCIe.
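
Something along these lines would give that local-memory baseline (a sketch only, assuming Linux and clock_gettime for timing; repeat it with ippsCopy_16s/_32s/_64s and ippiCopyManaged_8u_C1R to compare):

    #include <stdio.h>
    #include <time.h>
    #include <ipp.h>

    #define XFER_BYTES (16 * 1024 * 1024)

    /* The same 16 MB copy, but DDR -> DDR, to see how fast the core and IPP
       can go when PCIe is not in the path. */
    void baseline_ddr_copy(void)
    {
        Ipp8u *src = ippsMalloc_8u(XFER_BYTES);
        Ipp8u *dst = ippsMalloc_8u(XFER_BYTES);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        ippsCopy_8u(src, dst, XFER_BYTES);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("DDR -> DDR (ippsCopy_8u): %.1f MB/s\n", XFER_BYTES / sec / 1e6);

        ippsFree(src);
        ippsFree(dst);
    }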

Jonghak_K_Intel
Employee

Hi,

Do you need any further support? Otherwise, we would like to close this issue.

Thank you.

Ilya_G_
Beginner

Hi,

Thank you for the answers.

Ilya.