
copy performance in user space vs. kernel

Altera_Forum
Honored Contributor II

Hi, 

 

I am using Linux 2.6.32 with MMU and DDR SDRAM, and I've run into some performance issues copying data. Copying from DDR to DDR using memcpy in an application only gets me 6-10 MB/s. Copying from SRAM (via an mmapped buffer) is equally slow. Doing the same copy in the kernel is about 3 times faster: a memcpy on vmalloced or kmalloced buffers is about as fast as a DMA copy. 

 

I am using the binary toolchain for MMU Linux. 32KiB data and instruction caches. 8 uTLB entries for data and instructions, 256 TLB entries. 

 

Both the kernel and the application are compiled with -O2, and the process is set to real-time priority (SCHED_FIFO). 
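
Setting the priority looks roughly like this (a minimal sketch; the priority value 50 is just an example, not necessarily what I use):

#include <sched.h>
#include <stdio.h>

/* Sketch: give the current process SCHED_FIFO scheduling.
 * The priority value 50 is only an example; requires root. */
static int set_rt_priority(void)
{
    struct sched_param sp = { .sched_priority = 50 };
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
        perror("sched_setscheduler");
        return -1;
    }
    return 0;
}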

 

My tests: 

#define BUFLEN 53600

source = (char*)malloc(BUFLEN);
dest = (char*)malloc(BUFLEN);

for (i = 0; i < BUFLEN; i++) {
    source[i] = i % 100 + 20;
}

*tp = 1;
memcpy(dest, source, BUFLEN);  // 5.3ms
*tp = 0;

char* source = kmalloc(53600, GFP_DMA | GFP_KERNEL);
char* dest = kmalloc(53600, GFP_DMA | GFP_KERNEL);
char* source_io = (char*)ioremap_nocache((unsigned int)source, 53600);
char* dest_io = (char*)ioremap_nocache((unsigned int)dest, 53600);
char* source_v = vmalloc(53600);
char* dest_v = vmalloc(53600);
volatile unsigned int* dma =
    (volatile unsigned int*)ioremap_nocache(DDR_TO_DDR_DMA_BASE, DDR_TO_DDR_DMA_SPAN);
int j;

/* Altera DMA controller registers: 0=status, 1=readaddress,
   2=writeaddress, 3=length, 6=control */
dma[0] = 0;                                  /* reset status */
dma[1] = (unsigned int)virt_to_phys(source); /* read address */
dma[2] = (unsigned int)virt_to_phys(dest);   /* write address */
dma[3] = 53600;                              /* length */
dma[6] = 0x84;                               /* control: LEEN | WORD */

for (j = 0; j < 53600; j++) {
    source[j] = j % 100 + 20;
}

*tp = 1;
memcpy(dest, source, 53600);        // 1.6ms
*tp = 0;
mdelay(1);

*tp = 1;
memcpy(dest_v, source_v, 53600);    // 1.8ms
*tp = 0;
mdelay(1);

*tp = 1;
memcpy(dest_io, source_io, 53600);  // 4.7ms
*tp = 0;
mdelay(1);

*tp = 2;
dma[6] = 0x8C;                      /* set GO */
while (!(dma[0] & 1));              // wait for DONE: 1.7ms
dma[6] = 0x84;                      /* clear GO */
*tp = 0;

Measurements are done by toggling a signal on a GPIO pin (tp); in user space the pin is accessed via mmap on /dev/mem, so no file operations are included in the measurement. 
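
For completeness, the user-space mapping looks roughly like this (a sketch; GPIO_BASE is a placeholder, not the real base address of my PIO core):

#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define GPIO_BASE 0x10000000u  /* placeholder physical base of the PIO core */

volatile uint32_t *tp;

int map_testpoint(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0)
        return -1;
    /* Map one page of the PIO register space into user space. */
    tp = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
              MAP_SHARED, fd, GPIO_BASE);
    close(fd);
    return (tp == MAP_FAILED) ? -1 : 0;
}

After this, *tp = 1 writes straight to the mapped data register.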

 

I would expect the user-space copy to perform like the vmalloc copy (with some performance loss compared to the kmalloc copy because of non-contiguous memory), but it's 3 times slower; in fact it is close to, and even slightly slower than, the uncached in-kernel copy. The results are consistent between runs. 

 

Any ideas to explain the discrepancy?
Altera_Forum
Honored Contributor II

The kernel uses stream-lined memory access, while the standard glibc memcpy doesn't. You could adapt arch/nios2/lib/memcpy.c for your application instead of relying on glibc. 

 

- Hippo 

 

commit b3eaa1c911109e99c6cd06ec9b91a86a34929fa6 

 

nios2: memcpy with stream-lined memory access 

 

As SDRAM has long access latency, the total access time can be reduced with sequential access in the same page. 

 

This patch unrolls the copy loop and stream-lines the memory access as (read x8 + write x8) for aligned source and destination, or (read x4 + write x4) for unaligned, using registers as the buffer.
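
As a rough illustration of the technique (a user-space sketch, not the actual kernel code; it assumes word-aligned buffers and a length that is a multiple of 32 bytes):

#include <stddef.h>
#include <stdint.h>

/* Grouped reads and writes: read 8 words into registers, then write
 * all 8, so the SDRAM controller sees runs of sequential accesses
 * within the same page. Assumes 4-byte-aligned src/dst, n % 32 == 0. */
void *memcpy_streamed(void *dst, const void *src, size_t n)
{
    uint32_t *d = dst;
    const uint32_t *s = src;

    while (n >= 32) {
        uint32_t w0 = s[0], w1 = s[1], w2 = s[2], w3 = s[3];
        uint32_t w4 = s[4], w5 = s[5], w6 = s[6], w7 = s[7];
        d[0] = w0; d[1] = w1; d[2] = w2; d[3] = w3;
        d[4] = w4; d[5] = w5; d[6] = w6; d[7] = w7;
        s += 8;
        d += 8;
        n -= 32;
    }
    /* A complete version would handle the tail and the unaligned
     * (read x4 + write x4) case here. */
    return dst;
}

A routine like this can be linked into the application and called in place of the glibc memcpy.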
Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

Measurements are done by putting out a signal on a GPIO pin (tp) 

--- Quote End ---  

Since I/O devices can't be accessed directly in user mode, are you sure that setting the GPIO pin doesn't trigger an exception and a switch to kernel mode and back, which would slow down the process? 

 

Testing with several different buffer sizes should make this clear. Moreover, cache effects might play a role here. 
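
Something along these lines would show where the behaviour changes (a sketch; it times with gettimeofday instead of the GPIO pin):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

int main(void)
{
    static const size_t sizes[] = { 1024, 8192, 65536, 524288, 4194304 };
    size_t i;

    for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
        size_t n = sizes[i];
        char *src = malloc(n);
        char *dst = malloc(n);
        struct timeval t0, t1;
        long us;

        memset(src, 0x55, n);  /* touch the source pages */
        gettimeofday(&t0, NULL);
        memcpy(dst, src, n);   /* dst pages are demand-faulted during this copy */
        gettimeofday(&t1, NULL);
        us = (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_usec - t0.tv_usec);
        printf("%u bytes: %ld us\n", (unsigned)n, us);
        free(src);
        free(dst);
    }
    return 0;
}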

 

-Michael
Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

As SDRAM has long access latency, the total access time can be reduced with sequential access in the same page. 

--- Quote End ---  

 

Should the cache not take care of this? 

-Michael