Hi,
I am using Linux 2.6.32 with MMU and DDR SDRAM, and I've run into some performance issues copying data. Copying from DDR to DDR using memcpy in an application only gets me 6-10 MB/s, and copying from SRAM (via an mmapped buffer) is equally slow. Doing the same copy in the kernel is about 3 times faster; there, a memcpy on vmalloced or kmalloced buffers is about as fast as a DMA copy.

I am using the binary toolchain for MMU Linux. The CPU has 32 KiB data and instruction caches, 8 uTLB entries for data and instructions, and 256 TLB entries. Both the kernel and the application are compiled with -O2, and the process runs at real-time priority (SCHED_FIFO).

My tests:
#define BUFLEN 53600
char *source = (char*)malloc(BUFLEN);
char *dest = (char*)malloc(BUFLEN);
int i;
for (i = 0; i < BUFLEN; i++)
{
    source[i] = i % 100 + 20;
}
*tp = 1;
memcpy(dest, source, BUFLEN); // 5.3ms
*tp = 0;
char* source = kmalloc(53600, GFP_DMA|GFP_KERNEL);
char* dest = kmalloc(53600, GFP_DMA|GFP_KERNEL);
char* source_io = (char*)ioremap_nocache(virt_to_phys(source), 53600);
char* dest_io = (char*)ioremap_nocache(virt_to_phys(dest), 53600);
char* source_v = vmalloc(53600);
char* dest_v = vmalloc(53600);
volatile unsigned int* dma = (volatile unsigned int*)ioremap_nocache(DDR_TO_DDR_DMA_BASE, DDR_TO_DDR_DMA_SPAN);
int j;
dma[0] = 0;                                  // reset status (DONE bit)
dma[1] = (unsigned int)virt_to_phys(source); // read address
dma[2] = (unsigned int)virt_to_phys(dest);   // write address
dma[3] = 53600;                              // transfer length in bytes
dma[6] = 0x84;                               // control: WORD | LEEN, GO not set yet
for(j = 0; j < 53600; j++)
{
source[j] = j % 100 + 20;
}
*tp = 1;
memcpy(dest, source, 53600); // 1.6ms
*tp = 0;
mdelay(1);
*tp = 1;
memcpy(dest_v, source_v, 53600); // 1.8ms
*tp = 0;
mdelay(1);
*tp = 1;
memcpy(dest_io, source_io, 53600); // 4.7ms
*tp = 0;
mdelay(1);
*tp = 2;
dma[6] = 0x8C;          // control: WORD | GO | LEEN, start the transfer
while (!(dma[0] & 1));  // poll status DONE bit; 1.7ms
dma[6] = 0x84;          // clear GO
*tp = 0;
Measurements are done by putting a signal out on a GPIO pin (tp), using mmap on /dev/mem to do this from user space, so no file operations are included in the measurement. I would expect the user-space copy to perform like the vmalloc copy (with some loss compared to the kmalloc copy because the memory is not physically contiguous), but it is 3 times slower; in fact it is even slower than the uncached in-kernel copy. The results are consistent between runs. Any ideas that would explain the discrepancy?
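For reference, the test point is set up roughly like this (a minimal sketch; GPIO_PIO_BASE stands in for the actual PIO data register address on my board and is assumed to be page-aligned):

/* Minimal sketch of the test point setup (GPIO_PIO_BASE is a placeholder
 * for the real PIO data register address and is assumed page-aligned). */
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define GPIO_PIO_BASE 0x10000000u

static volatile uint32_t *map_testpoint(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    void *p;

    if (fd < 0)
        return NULL;
    p = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
             MAP_SHARED, fd, GPIO_PIO_BASE);
    close(fd);                /* the mapping stays valid after close() */
    if (p == MAP_FAILED)
        return NULL;
    return (volatile uint32_t *)p;
}

tp then points at the PIO data register, so *tp = 1 / *tp = 0 around the copy is a single uncached store each.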
3 Replies
The kernel uses a stream-lined memory access pattern, while the standard glibc memcpy doesn't. You could adapt arch/nios2/lib/memcpy.c for your application instead of using the glibc version.
- Hippo commit b3eaa1c911109e99c6cd06ec9b91a86a34929fa6: "nios2: memcpy with stream-lined memory access. As SDRAM has long access latency, the total access time can be reduced with sequential access in the same page. This patch unrolls the copy loop and stream-lines the memory access as (read x8 + write x8) for aligned source and destination, or (read x4 + write x4) for unaligned, using registers as the buffer."
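Roughly, the idea is this (a sketch of the aligned (read x8 + write x8) pattern only, not the actual arch/nios2/lib/memcpy.c code; the unaligned case and the leading/trailing bytes are left out):

#include <stddef.h>
#include <stdint.h>

/* Sketch: copy word-aligned buffers by reading 8 words into registers,
 * then writing 8 words. len must be a multiple of 32 here; the real
 * code also handles unaligned buffers and the remaining tail bytes. */
static void copy_words_x8(uint32_t *dst, const uint32_t *src, size_t len)
{
    size_t n = len / 32;                 /* number of 8-word groups */
    while (n--) {
        uint32_t a = src[0], b = src[1], c = src[2], d = src[3];
        uint32_t e = src[4], f = src[5], g = src[6], h = src[7];
        dst[0] = a; dst[1] = b; dst[2] = c; dst[3] = d;
        dst[4] = e; dst[5] = f; dst[6] = g; dst[7] = h;
        src += 8;
        dst += 8;
    }
}

All eight reads are issued back to back before any write, so the SDRAM keeps accessing the same page instead of alternating between source and destination on every word.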
--- Quote Start --- Measurements are done by putting a signal out on a GPIO pin (tp) --- Quote End --- Since I/O devices can't be accessed directly in user mode, are you sure that setting the GPIO bit does not trigger an exception and a switch to kernel mode and back, which would slow down the process? Testing with several different buffer sizes should make this clear. Moreover, cache issues might play some role here. -Michael
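Something along these lines would show whether the cost scales with the copy size or includes a large fixed overhead (just a sketch; it uses clock_gettime instead of the GPIO pin to stay self-contained, link with -lrt on older toolchains):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Sketch: time memcpy for several buffer sizes to see whether the
 * per-byte cost stays constant or a fixed overhead dominates. */
int main(void)
{
    static const size_t sizes[] = { 1024, 8192, 53600, 262144, 1048576 };
    struct timespec t0, t1;
    size_t i;

    for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
        char *src = malloc(sizes[i]);
        char *dst = malloc(sizes[i]);
        double ms;

        memset(src, 0x5a, sizes[i]);
        clock_gettime(CLOCK_MONOTONIC, &t0);
        memcpy(dst, src, sizes[i]);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
             (t1.tv_nsec - t0.tv_nsec) / 1e6;
        printf("%8lu bytes: %.3f ms (%.1f MB/s)\n",
               (unsigned long)sizes[i], ms, sizes[i] / (ms * 1e3));
        free(src);
        free(dst);
    }
    return 0;
}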
--- Quote Start --- As SDRAM has long access latency, the total access time can be reduced with sequential access in the same page. --- Quote End --- Should the cache not take care of this? -Michel
