Community
Altera_Forum
Honored Contributor I
1,545 Views

Slow memcpy speed

Hi all, 

I have a design based upon the “Lab 4 - Linux FFT Application” from RocketBoards, running on the Terasic DE0-Nano-SoC (Cyclone V SoC) evaluation board. 

 

First the data is transferred from the FPGA to the HPS SDRAM using DMA. This transfer is fast: 8 kBytes (1k * 64 bit) takes 21 us => 380 Mbytes/s. 

 

Doing HPS signal processing on the data while it is stored in SDRAM is a bit slow, so to increase the signal processing speed the 8 kBytes of data is copied into an array using memcpy. 

Now the signal processing is much faster, but the memcpy “penalty” is high: transferring the 8 kBytes of data takes 500 us (16 Mbytes/s) with the compile flags -O0, -O2 or -O3. 

Using the compile flag -O1 brings the memcpy time down to 188 us (42 Mbytes/s), but from what I have read this still seems to be at least 4 times slower than expected. 

 

Has anyone done similar tests, or know if there are any other options that must be set to get a faster memcpy transfer? 

 

All timing measurements are done using an oscilloscope (start/stop trigger signals are written from the HPS to the FPGA-GPIO). 

 

OS: Angstrom v2015.12. Linux real time kernel version 4.1.22-ltsi-rt (PREEMPT RT)
2 Replies
Altera_Forum

An update: 

 

When defining arrays like this 

int value[2048]; // source array 

int dest[2048]; // destination array 

and running memcpy(dest, value, 2048*4), the memcpy speed is high: 446 Mbytes/s. 

And the compile flag -Ofast gives a faster speed than -O1, as expected. 

 

- - - - - - 

 

My design is based upon the fpga_fft example from RocketBoards, where DMA transfers data from the FPGA into the HPS’s DRAM memory.  

The memory space for these data (*value) is defined using mmap:  

 

volatile unsigned int *value; 

volatile unsigned int dest[2048*4]; 

#define result_base (FFT_SUB_DATA_BASE + (uintptr_t)mappedBase + (FFT_SUB_DATA_SPAN/2)) 

 

- - - - - - 

In main: 

 

// we need to get a pointer to the LW_BRIDGE from the software's point of view.  

// need to open a file. 

/* Open /dev/mem */ 

if ((mem = open("/dev/mem", O_RDWR | O_SYNC)) == -1) { 
    fprintf(stderr, "Cannot open /dev/mem\n"); 
    exit(1); 
} 

// now map it into lw bridge space: 

mappedBase = mmap(0, 0x1f0000, PROT_READ | PROT_WRITE, MAP_SHARED, mem, ALT_LWFPGASLVS_OFST); 

 

if (mappedBase == MAP_FAILED) { 
    perror("mmap"); 
    exit(1); 
} 

 

Run DMA and wait for completion 

... 

... 

 

 

// And when the DMA is finished the data is available: 

value = (unsigned int *)result_base;  

 

- - - - - - 

 

Now, when running memcpy(dest, value, 2048*4) the speed is slow: only 42 Mbytes/s, and the compiler does not respond as expected to the -O compiler flags, i.e. -Ofast is slower than -O1. 

It seems that using mmap really slows down the access to memory. Is it possible to speed this up? 

 

Any help would be greatly appreciated! 

 

Thanks,
Altera_Forum

I think my problem is related to the high address (ALT_LWFPGASLVS_OFST = 0xFF200000) that is used, and this might have to be fixed in kernel space… 

While waiting for someone to fix this for me :) , I wrote an assembly version of memcpy using the “NEON memory copy with preload” example from the ARM Information Center.  

 

I had to add “SUBS r2,r2,#0x40” before the loop; otherwise the loop would run one pass too many and copy 64 bytes too far (thus overwriting memory). 

 

Using this "neon memcpy" I got a bit more speed (62 MBytes/s), and I could use the -Ofast flag to optimize the rest of the code. 

This function is called the same way as memcpy, but the data must be 64-byte aligned and the size a multiple of 64 bytes: 

void *neon_memcpy(void *ut, const void *in, size_t n) 

 

neon_memcpy.S: 

.arch armv7-a 
.fpu neon 
.global neon_memcpy 
.type neon_memcpy, %function 

neon_memcpy: 
    SUBS    r2, r2, #0x40 
neon_copy_loop: 
    PLD     [r1, #0xC0] 
    VLDM    r1!, {d0-d7} 
    VSTM    r0!, {d0-d7} 
    SUBS    r2, r2, #0x40 
    BGE     neon_copy_loop 
    BX      lr