Nios® II Embedded Design Suite (EDS)

Slow memcpy speed

Honored Contributor II

Hi all, 

I have a design based upon the “Lab 4 - Linux FFT Application” from Rocketboard which runs on the Terasic DE0-Nano-SoC (Cyclone V SoC) evaluation board. 


First the data is transferred from the FPGA to the HPS SDRAM using DMA. This transfer is fast: 8 kBytes (1k * 64 bit) takes 21 us => 380 Mbytes/s. 


Doing the HPS signal processing directly on the data in SDRAM is a bit slow, so to speed up the processing the 8 kBytes of data are first copied into an array using memcpy. 

Now the signal processing is much faster, but the memcpy “penalty” is high: copying the 8 kBytes takes 500 us = 16 Mbytes/s with the compile flags -O0, -O2 or -O3. 

With the compile flag -O1 the memcpy time drops to 188 us = 42 Mbytes/s, but from what I have read this still seems to be at least 4 times slower than expected. 


Has anyone done similar tests, or know if there are any other options that must be set to get a faster memcpy transfer? 


All timing measurements are done using an oscilloscope (start/stop trigger signals are written from the HPS to the FPGA-GPIO). 


OS: Angstrom v2015.12. Linux real time kernel version 4.1.22-ltsi-rt (PREEMPT RT)
Honored Contributor II

An update: 


When the arrays are defined like this 

int value[2048]; //source array 

int dest[2048] ; //destination array 

and I run memcpy(dest,value,2048*4), the memcpy speed is high: 446 Mbytes/s. 

And the compile flag -Ofast gives a faster speed than -O1, as expected. 


- - - - - - 


My design is based upon the fpga_fft example from Rocketboard where DMA transfers data from FPGA into HPS’s DRAM memory.  

The memory space for these data (*value) is defined using mmap:  


volatile unsigned int *value; 

volatile unsigned int dest[2048*4]; 

# define result_base (FFT_SUB_DATA_BASE + (int)mappedbase +(FFT_SUB_DATA_SPAN/2)) 


- - - - - - 

In main: 


// We need a pointer to the LW bridge from the software's point of view, 
// so first open /dev/mem. 
if ((mem = open("/dev/mem", O_RDWR | O_SYNC)) == -1) { 
    fprintf(stderr, "Cannot open /dev/mem\n"); 
    exit(1); 
} 

// Now map the LW bridge address space: 
mappedbase = mmap(0, 0x1f0000, PROT_READ | PROT_WRITE, MAP_SHARED, mem, ALT_LWFPGASLVS_OFST); 

if (mappedbase == MAP_FAILED) { 
    printf("Memory map failed. error %i\n", (int)mappedbase); 
    exit(1); 
} 



Run DMA and wait for completion 





// And when the DMA is finished the data is available: 

value = (unsigned int *)((int)result_base);  


- - - - - - 


Now, when running memcpy(dest,value,2048*4), the speed is slow: only 42 Mbytes/s, and the compiler does not respond as expected to the -O flags, i.e. -Ofast is slower than -O1. 

It seems that accessing the memory through the mmap'ed pointer really slows things down. Is it possible to speed this up? 


Any help would be greatly appreciated! 


Honored Contributor II

I think my problem is related to the high address (ALT_LWFPGASLVS_OFST = 0xff200000) that is used, and this might have to be fixed in kernel space… 

While waiting for someone to fix this for me :) , I wrote an assembly version of memcpy based on the “NEON memory copy with preload” example from the ARM Infocenter.  

I had to add “SUBS r2,r2,#0x40” before the loop; otherwise the loop would copy 64 bytes too many (thus overwriting memory). 


Using this "neon memcpy" I got a bit more speed (62 MBytes/s), and I could use the -Ofast flag to optimize the rest of the code. 

This function is called the same way as memcpy, but the data must be 64-byte aligned: 

void *neon_memcpy(void *ut, const void *in, size_t n); 



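.arch armv7-a 
.fpu neon 
.global neon_memcpy 
.type neon_memcpy, %function 

neon_memcpy: 
SUBS r2,r2,#0x40        @ pre-decrement so the loop stops after exactly n bytes 
neon_copy_loop: 
PLD [r1,#0xC0]          @ preload source data 192 bytes ahead 
VLDM r1!,{d0-d7}        @ load 64 bytes and advance the source pointer 
VSTM r0!,{d0-d7}        @ store 64 bytes and advance the destination pointer 
SUBS r2,r2,#0x40        @ 64 bytes done 
BGE neon_copy_loop      @ repeat while bytes remain 
bx lr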