Nios® V/II Embedded Design Suite (EDS)
Support for Embedded Development Tools, Processors (SoCs and Nios® V/II processor), Embedded Development Suites (EDSs), Boot and Configuration, Operating Systems, C and C++

Slow memcpy speed

Honored Contributor II

Hi all, 

I have a design based on the "Lab 4 - Linux FFT Application" from RocketBoards, running on the Terasic DE0-Nano-SoC (Cyclone V SoC) evaluation board. 


First the data is transferred from the FPGA to the HPS SDRAM using DMA. This transfer is fast: 8 kBytes (1k * 64 bit) takes 21 us => 380 Mbytes/s. 


Doing HPS signal processing on the data while it is stored in SDRAM is a bit slow, so to increase the signal processing speed the 8 kBytes of data are copied into a local array using memcpy. 

Now the signal processing is much faster, but the memcpy "penalty" is high: copying the 8 kBytes of data takes 500 us (16 Mbytes/s) with the compiler flags -O0, -O2 and -O3. 

With -O1 the copy takes 188 us (42 Mbytes/s), but from what I have read this still seems to be at least 4 times slower than expected. 


Has anyone done similar tests, or know if there are any other options that must be set to get a faster memcpy transfer? 


All timing measurements are done using an oscilloscope (start/stop trigger signals are written from the HPS to the FPGA-GPIO). 


OS: Angstrom v2015.12. Linux real time kernel version 4.1.22-ltsi-rt (PREEMPT RT)
2 Replies
Honored Contributor II

An update: 


When defining arrays like this 

int value[2048]; // source array 

int dest[2048];  // destination array 

and running memcpy(dest, value, 2048*4), the memcpy speed is high: 446 Mbytes/s. 

And the compiler flag -Ofast gives a faster speed than -O1, as expected. 


- - - - - - 


My design is based on the fpga_fft example from RocketBoards, where DMA transfers data from the FPGA into the HPS SDRAM.  

The memory space for these data (*value) is defined using mmap:  


volatile unsigned int *value; 

volatile unsigned int dest[2048*4]; 

#define result_base (FFT_SUB_DATA_BASE + (int)mappedbase + (FFT_SUB_DATA_SPAN/2)) 


- - - - - - 

In main: 


// we need to get a pointer to the LW_BRIDGE from the software's point of view.  

// need to open a file. 

/* Open /dev/mem */ 

if ((mem = open("/dev/mem", O_RDWR | O_SYNC)) == -1) { 

    fprintf(stderr, "Cannot open /dev/mem\n"); 

    exit(1); 

} 

// now map it into LW bridge space: 

mappedbase = mmap(0, 0x1F0000, PROT_READ | PROT_WRITE, MAP_SHARED, mem, ALT_LWFPGASLVS_OFST); 

if (mappedbase == MAP_FAILED) { 

    printf("Memory map failed.\n"); 

    exit(1); 

} 



Run DMA and wait for completion 





// And when the DMA is finished the data is available: 

value = (volatile unsigned int *)result_base;  


- - - - - - 


Now, when running memcpy(dest, value, 2048*4) the speed is slow: only 42 Mbytes/s, and the compiler does not respond as expected to the -O flags, i.e. -Ofast is slower than -O1. 

It seems that using mmap really slows down the access to memory. Is it possible to speed this up? 


Any help would be greatly appreciated! 


Honored Contributor II

I think my problem is related to the high address (ALT_LWFPGASLVS_OFST = 0xFF200000) that is used, and this might have to be fixed in kernel space… 

While waiting for someone to fix this for me :) , I wrote an assembly version of memcpy using the "NEON memory copy with preload" example from the ARM Information Center.  

I had to add "SUBS r2,r2,#0x40" before the loop; otherwise the loop would go 64 bytes too far (thus overwriting memory). 


Using this "neon memcpy" I got a bit more speed (62 MBytes/s), and I could use the -Ofast flag to optimize the rest of the code. 

This function is called the same way as memcpy, but the data must be 64-byte aligned (and, given the 64-byte loop stride, n should be a multiple of 64): 

void *neon_memcpy(void *ut, const void *in, size_t n) 



.arch armv7-a 

.fpu neon 

.global neon_memcpy 

.type neon_memcpy, %function 


neon_memcpy: 

SUBS r2, r2, #0x40 


neon_copy_loop: 

PLD  [r1, #0xC0] 

VLDM r1!, {d0-d7} 

VSTM r0!, {d0-d7} 

SUBS r2, r2, #0x40 

BGE  neon_copy_loop 


bx lr