
copy performance in user space vs. kernel

Altera_Forum
Honored Contributor II

Hi, 

 

I am using Linux 2.6.32 with MMU and DDR SDRAM, and I've run into some performance issues copying data. Copying from DDR to DDR using memcpy in an application only gets me 6-10 MB/s. Copying from SRAM (via an mmapped buffer) is equally slow. Doing the same copy in the kernel is about 3 times faster: a memcpy on vmalloced or kmalloced buffers is about as fast as a DMA copy. 

 

I am using the binary toolchain for MMU Linux. 32KiB data and instruction caches. 8 uTLB entries for data and instructions, 256 TLB entries. 

 

Both the kernel and the application are compiled with -O2, and the process is set to real-time priority (SCHED_FIFO). 
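
Setting the priority looks roughly like this (a minimal sketch; the priority value 50 is just an example, not necessarily what I use):

#include <sched.h>
#include <stdio.h>

/* Sketch: give the current process SCHED_FIFO scheduling.
 * The priority value 50 is only an example; requires root. */
static int set_rt_priority(void)
{
    struct sched_param sp = { .sched_priority = 50 };
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
        perror("sched_setscheduler");
        return -1;
    }
    return 0;
}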

 

My tests: 

#define BUFLEN 53600

source = (char*)malloc(BUFLEN);
dest = (char*)malloc(BUFLEN);

for (i = 0; i < BUFLEN; i++) {
    source[i] = i % 100 + 20;
}

*tp = 1;
memcpy(dest, source, BUFLEN);  // 5.3ms
*tp = 0;

char* source = kmalloc(53600, GFP_DMA | GFP_KERNEL);
char* dest = kmalloc(53600, GFP_DMA | GFP_KERNEL);
char* source_io = (char*)ioremap_nocache((unsigned int)source, 53600);
char* dest_io = (char*)ioremap_nocache((unsigned int)dest, 53600);
char* source_v = vmalloc(53600);
char* dest_v = vmalloc(53600);
volatile unsigned int* dma =
    (volatile unsigned int*)ioremap_nocache(DDR_TO_DDR_DMA_BASE, DDR_TO_DDR_DMA_SPAN);
int j;

/* Altera DMA controller registers: 0=status, 1=readaddress,
   2=writeaddress, 3=length, 6=control */
dma[0] = 0;                                  /* reset status */
dma[1] = (unsigned int)virt_to_phys(source); /* read address */
dma[2] = (unsigned int)virt_to_phys(dest);   /* write address */
dma[3] = 53600;                              /* length */
dma[6] = 0x84;                               /* control: LEEN | WORD */

for (j = 0; j < 53600; j++) {
    source[j] = j % 100 + 20;
}

*tp = 1;
memcpy(dest, source, 53600);        // 1.6ms
*tp = 0;
mdelay(1);

*tp = 1;
memcpy(dest_v, source_v, 53600);    // 1.8ms
*tp = 0;
mdelay(1);

*tp = 1;
memcpy(dest_io, source_io, 53600);  // 4.7ms
*tp = 0;
mdelay(1);

*tp = 2;
dma[6] = 0x8C;                      /* set GO */
while (!(dma[0] & 1));              // wait for DONE: 1.7ms
dma[6] = 0x84;                      /* clear GO */
*tp = 0;

Measurements are done by toggling a signal on a GPIO pin (tp); in user space the pin is accessed via mmap on /dev/mem, so no file operations are included in the measurement. 
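
For completeness, the user-space mapping looks roughly like this (a sketch; GPIO_BASE is a placeholder, not the real base address of my PIO core):

#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define GPIO_BASE 0x10000000u  /* placeholder physical base of the PIO core */

volatile uint32_t *tp;

int map_testpoint(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0)
        return -1;
    /* Map one page of the PIO register space into user space. */
    tp = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
              MAP_SHARED, fd, GPIO_BASE);
    close(fd);
    return (tp == MAP_FAILED) ? -1 : 0;
}

After this, *tp = 1 writes straight to the mapped data register.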

 

I would expect the user-space copy to perform like the vmalloc copy (with some performance loss compared to the kmalloc copy because of non-contiguous memory), but it's 3 times slower; in fact it is close to, and even slightly slower than, the uncached in-kernel copy. The results are consistent between runs. 

 

Any ideas to explain the discrepancy?
Altera_Forum
Honored Contributor II

The kernel uses stream-lined memory access, while the standard glibc memcpy doesn't. You could adapt arch/nios2/lib/memcpy.c for your application instead of relying on glibc. 

 

- Hippo 

 

commit b3eaa1c911109e99c6cd06ec9b91a86a34929fa6 

 

nios2: memcpy with stream-lined memory access 

 

As SDRAM has long access latency, the total access time can be reduced with sequential access in the same page. 

 

This patch unrolls the copy loop and stream-lines the memory access as (read x8 + write x8) for aligned source and destination, or (read x4 + write x4) for unaligned, using registers as the buffer.
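
As a rough illustration of the technique (a user-space sketch, not the actual kernel code; it assumes word-aligned buffers and a length that is a multiple of 32 bytes):

#include <stddef.h>
#include <stdint.h>

/* Grouped reads and writes: read 8 words into registers, then write
 * all 8, so the SDRAM controller sees runs of sequential accesses
 * within the same page. Assumes 4-byte-aligned src/dst, n % 32 == 0. */
void *memcpy_streamed(void *dst, const void *src, size_t n)
{
    uint32_t *d = dst;
    const uint32_t *s = src;

    while (n >= 32) {
        uint32_t w0 = s[0], w1 = s[1], w2 = s[2], w3 = s[3];
        uint32_t w4 = s[4], w5 = s[5], w6 = s[6], w7 = s[7];
        d[0] = w0; d[1] = w1; d[2] = w2; d[3] = w3;
        d[4] = w4; d[5] = w5; d[6] = w6; d[7] = w7;
        s += 8;
        d += 8;
        n -= 32;
    }
    /* A complete version would handle the tail and the unaligned
     * (read x4 + write x4) case here. */
    return dst;
}

A routine like this can be linked into the application and called in place of the glibc memcpy.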
Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

Measurements are done by putting out a signal on a GPIO pin (tp) 

--- Quote End ---  

Since I/O devices can't be accessed directly in user mode, are you sure that setting the GPIO pin doesn't trigger an exception and a switch to kernel mode and back, which would slow down the process? 

 

Testing with several different buffer sizes should make this clear. Moreover, cache effects might play a role here. 
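
Something along these lines would show where the behaviour changes (a sketch; it times with gettimeofday instead of the GPIO pin):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

int main(void)
{
    static const size_t sizes[] = { 1024, 8192, 65536, 524288, 4194304 };
    size_t i;

    for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
        size_t n = sizes[i];
        char *src = malloc(n);
        char *dst = malloc(n);
        struct timeval t0, t1;
        long us;

        memset(src, 0x55, n);  /* touch the source pages */
        gettimeofday(&t0, NULL);
        memcpy(dst, src, n);   /* dst pages are demand-faulted during this copy */
        gettimeofday(&t1, NULL);
        us = (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_usec - t0.tv_usec);
        printf("%u bytes: %ld us\n", (unsigned)n, us);
        free(src);
        free(dst);
    }
    return 0;
}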

 

-Michael
Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

As SDRAM has long access latency, the total access time can be reduced with sequential access in the same page. 

--- Quote End ---  

 

Should the cache not take care of this? 

-Michael