Intel® Integrated Performance Primitives

Usage of Intel IPP ippsCopy functions for data transfer between DDR SDRAM and PCIe

Ilya_G_
Beginner

Hello,
I use a Denverton stepping B1 C3955 @ 2.10 GHz with BIOS 0015D96 on a Harcuvar CRB.
Since Denverton doesn’t have a DMA controller for transferring blocks of data between RAM and PCIe, I use the method suggested by John McCalpin at https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/538135, who writes the following:
“If the PCIe device does not have its own DMA controller, then the fastest way to copy data from system memory to that IO device is to use a processor core.   You would need to set up a memory-mapped IO range for the device with the write-combining attribute, then use a processor core (or thread) to read from (cacheable) system memory and write to the MMIO range using streaming stores”
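
As I understand the quote, a hand-written version of that core-driven copy would look roughly like this (a sketch only, not my actual code; it assumes 16-byte aligned pointers, a size that is a multiple of 64 bytes, and a destination already mapped write-combining):

    #include <emmintrin.h>   /* SSE2: _mm_load_si128, _mm_stream_si128, _mm_sfence */
    #include <stddef.h>

    /* Copy 'bytes' from cacheable DDR (src) to a write-combining MMIO window (dst)
       with 16-byte non-temporal stores, one 64-byte line per loop iteration. */
    static void wc_copy_stream(void *dst, const void *src, size_t bytes)
    {
        __m128i       *d = (__m128i *)dst;
        const __m128i *s = (const __m128i *)src;
        size_t i;
        for (i = 0; i < bytes / 16; i += 4) {
            _mm_stream_si128(&d[i + 0], _mm_load_si128(&s[i + 0]));
            _mm_stream_si128(&d[i + 1], _mm_load_si128(&s[i + 1]));
            _mm_stream_si128(&d[i + 2], _mm_load_si128(&s[i + 2]));
            _mm_stream_si128(&d[i + 3], _mm_load_si128(&s[i + 3]));
        }
        _mm_sfence();   /* flush the write-combining buffers before continuing */
    }
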
As the target PCIe region I use BAR 0 (video memory) of a Matrox Millennium G550 LP PCIe card installed on the Harcuvar CRB. This BAR 0 is mapped in the MMU as non-cacheable and write-combining.

I call ippInit() and ippiGetLibVersion(), which reports: “ippIP SSE4.2 (y8) 9.0.4 (r52811)”.
After that I call ippsCopy_64s to copy 16 MB of data from a local buffer in DDR SDRAM to BAR0.
Both the local buffer and BAR0 are 64-byte aligned.
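
Roughly, the sequence looks like this (a simplified sketch; the BAR0 mapping itself is omitted, copy_to_bar0 is just an illustrative wrapper, and the timing is simplified):

    #include <stdio.h>
    #include <time.h>
    #include <ipp.h>   /* ippInit, ippiGetLibVersion, ippsMalloc_8u, ippsCopy_64s */

    #define XFER_BYTES (16 * 1024 * 1024)

    /* bar0 is assumed to be the virtual address of the card's BAR 0, already
       mapped with the write-combining attribute (the mapping code is omitted). */
    void copy_to_bar0(Ipp64s *bar0)
    {
        ippInit();
        const IppLibraryVersion *v = ippiGetLibVersion();
        printf("%s %s\n", v->Name, v->Version);   /* library name and version */

        /* ippsMalloc_8u returns a 64-byte aligned buffer in DDR SDRAM */
        Ipp64s *src = (Ipp64s *)ippsMalloc_8u(XFER_BYTES);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        ippsCopy_64s(src, bar0, (int)(XFER_BYTES / sizeof(Ipp64s)));
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("DDR -> BAR0: %.1f MB/s\n", XFER_BYTES / sec / 1e6);

        ippsFree(src);
    }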

The throughput I get is 90 MB/s for copies from DDR SDRAM to PCIe and 10 MB/s for copies from PCIe to DDR SDRAM.
Q1. Do the above numbers make sense?
Q2. Is ippsCopy_64s the best option in the absence of a DMA engine? Is there any other method of transferring to/from PCIe that gives higher throughput?
Q3. I tried ippsCopy_32s, ippsCopy_16s, and ippsCopy_8u, but the results are the same as with ippsCopy_64s. Could you please explain?
Q4. I also tried ippiCopyManaged_8u_C1R with the flag IPP_NONTEMPORAL_STORE, as suggested in https://software.intel.com/en-us/articles/ippscopy-vs-ippicopymanaged (the call is sketched below); the result is still the same as with ippsCopy_64s. Could you please explain?
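
For reference, the ippiCopyManaged call is roughly the following (a sketch; copy_to_bar0_managed is just an illustrative wrapper, the 16 MB block is treated as a single-row ROI, and src/bar0 are the buffers from the sketch above):

    #include <ipp.h>

    /* Sketch only: the same 16 MB transfer expressed as a one-row "image" copy
       with non-temporal stores. */
    void copy_to_bar0_managed(const Ipp8u *src, Ipp8u *bar0, int bytes)
    {
        IppiSize roi = { bytes, 1 };              /* width in bytes, one row */
        ippiCopyManaged_8u_C1R(src, bytes,        /* srcStep = whole row     */
                               bar0, bytes,       /* dstStep = whole row     */
                               roi, IPP_NONTEMPORAL_STORE);
    }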

Thanks.
Ilya.

Jonghak_K_Intel
Employee

According to these pages, https://www.matrox.com/graphics/en/products/graphics_cards/g_series/g550pcie/ and https://www.matrox.com/graphics/media/pdf/products/graphics_cards/g_series/en_g550_guide.pdf, the graphics card is a PCIe x1 device and its bandwidth is up to 250 MB/s.
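
For reference, 250 MB/s is just the raw link rate of a PCIe 1.x x1 lane (assuming the G550 trains as Gen1 x1): 2.5 GT/s × 8/10 (8b/10b encoding) ÷ 8 bits per byte = 250 MB/s per direction, before packet and protocol overhead, so the achievable payload rate will be somewhat lower.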

Q1. Do the above numbers make sense? I can't tell from the numbers alone.
Q2. Is ippsCopy_64s the best option in the absence of a DMA engine? Is there any other method of transferring to/from PCIe that gives higher throughput? Memory-mapped I/O is known to be a suitable method for devices that require large data transfers, such as graphics cards.
Q3. I tried ippsCopy_32s, ippsCopy_16s, and ippsCopy_8u, but the results are the same as with ippsCopy_64s. Could you please explain? If the bottleneck is PCIe I/O, then they could all end up at the same speed. Can you measure how fast each function can write to local memory (see the sketch below)?
Q4. I also tried ippiCopyManaged_8u_C1R with the flag IPP_NONTEMPORAL_STORE, as suggested in https://software.intel.com/en-us/articles/ippscopy-vs-ippicopymanaged; the result is still the same as with ippsCopy_64s. Could you please explain? This will reduce caching overhead but won't speed up the copy through PCIe.
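
Something along these lines would give that local-memory baseline (a sketch only, assuming Linux and clock_gettime for timing; repeat it with ippsCopy_16s/_32s/_64s and ippiCopyManaged_8u_C1R to compare):

    #include <stdio.h>
    #include <time.h>
    #include <ipp.h>

    #define XFER_BYTES (16 * 1024 * 1024)

    /* The same 16 MB copy, but DDR -> DDR, to see how fast the core and IPP
       can go when PCIe is not in the path. */
    void baseline_ddr_copy(void)
    {
        Ipp8u *src = ippsMalloc_8u(XFER_BYTES);
        Ipp8u *dst = ippsMalloc_8u(XFER_BYTES);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        ippsCopy_8u(src, dst, XFER_BYTES);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("DDR -> DDR (ippsCopy_8u): %.1f MB/s\n", XFER_BYTES / sec / 1e6);

        ippsFree(src);
        ippsFree(dst);
    }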

Jonghak_K_Intel
Employee

Hi,

Do you need any further support? Otherwise, we would like to close this issue.

Thank you.

Ilya_G_
Beginner

Hi,

Thank you for the answers.

Ilya.