Hi all,
I have a design based on the "Lab 4 - Linux FFT Application" from RocketBoards, running on the Terasic DE0-Nano-SoC (Cyclone V SoC) evaluation board. First the data is transferred from the FPGA to the HPS SDRAM using DMA. This transfer is fast: 8 kBytes (1k x 64 bit) takes 21 us, i.e. about 380 MBytes/s.

Doing the HPS signal processing on the data while it sits in SDRAM is a bit slow, so to increase the signal processing speed the 8 kBytes of data are copied into an array using memcpy. The signal processing is then much faster, but the memcpy "penalty" is high: copying the 8 kBytes takes 500 us (16 MBytes/s) with the compiler flags -O0, -O2 or -O3. With -O1 the memcpy improves to 188 us (42 MBytes/s), but from what I have read this still seems to be at least 4 times slower than expected.

Has anyone done similar tests, or does anyone know of other options that must be set to get a faster memcpy? All timing measurements are done with an oscilloscope (start/stop trigger signals are written from the HPS to an FPGA GPIO).

OS: Angstrom v2015.12. Linux real-time kernel 4.1.22-ltsi-rt (PREEMPT RT).
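(For reference, the scope measurement can also be cross-checked in software. The harness below is a minimal, hypothetical sketch of mine, not part of the lab design; it times the same 8 kByte copy with clock_gettime and prints the throughput.)

    /* Hypothetical benchmark sketch: time an 8 kByte memcpy in software.
     * Build e.g.: gcc -O1 -o bench bench.c   (swap in -O0/-O2/-O3 to compare;
     * add -lrt on older toolchains) */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <time.h>

    #define COPY_BYTES (2048 * 4)   /* 8 kBytes, as in the FFT design */
    #define ITERATIONS 1000         /* average over many copies */

    /* Non-static globals are externally visible, so the compiler
     * cannot optimize the copies away. */
    uint32_t src[2048], dst[2048];

    int main(void)
    {
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ITERATIONS; i++)
            memcpy(dst, src, COPY_BYTES);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%.1f us per copy, %.1f MBytes/s\n",
               secs / ITERATIONS * 1e6,
               (double)COPY_BYTES * ITERATIONS / secs / 1e6);
        return 0;
    }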
2 Replies
An update:
When defining arrays like this:

    int value[2048];    // source array
    int dest[2048];     // destination array

and running memcpy(dest, value, 2048*4), the memcpy speed is high: 446 MBytes/s. And the -Ofast flag gives faster speed than -O1, as expected.

My design is based on the fpga_fft example from RocketBoards, where DMA transfers data from the FPGA into the HPS's DRAM. The memory space for these data (*value) is set up using mmap:

    volatile unsigned int *value;
    volatile unsigned int dest[2048*4];

    #define result_base (FFT_SUB_DATA_BASE + (int)mappedBase + (FFT_SUB_DATA_SPAN/2))

In main:

    // We need a pointer to the LW bridge from the software's point of view,
    // so open /dev/mem and map the bridge region.
    if ((mem = open("/dev/mem", O_RDWR | O_SYNC)) == -1) {
        fprintf(stderr, "Cannot open /dev/mem\n");
        exit(1);
    }

    // Now map it into the LW bridge space:
    mappedBase = mmap(0, 0x1f0000, PROT_READ | PROT_WRITE, MAP_SHARED,
                      mem, ALT_LWFPGASLVS_OFST);
    if (mappedBase == (void *)-1) {    // i.e. MAP_FAILED
        printf("Memory map failed. error %i\n", (int)mappedBase);
        perror("mmap");
    }

    // Run the DMA and wait for completion ...

    // When the DMA is finished the data is available:
    value = (unsigned int *)((int)result_base);

Now, when running memcpy(dest, value, 2048*4), the speed is slow: only 42 MBytes/s, and the compiler does not respond as expected to the -O flags, i.e. -Ofast is slower than -O1. It seems that using mmap really slows down the access to memory. Is it possible to speed this up? Any help would be greatly appreciated! Thanks,
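Edit: one more thing worth checking. Passing the volatile-qualified value pointer to memcpy silently discards the qualifier, and the compiler is free to expand memcpy into whatever access pattern it likes, which may not suit an uncached mapping. A plain word-copy loop keeps every 32-bit access explicit; a minimal sketch (the helper name is hypothetical, not from the fpga_fft example):

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical helper: copy 32-bit words from the memory-mapped region
     * into ordinary (cached) memory, one explicit bus access at a time. */
    static void copy_from_mapped(uint32_t *dst, const volatile uint32_t *src,
                                 size_t nwords)
    {
        for (size_t i = 0; i < nwords; i++)
            dst[i] = src[i];
    }

    /* Usage with the sizes from this post (the cast drops dest's
     * volatile qualifier for the call):
     *     copy_from_mapped((uint32_t *)dest, value, 2048);
     */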
I think my problem is related to the high address (ALT_LWFPGASLVS_OFST = 0xff200000) that is used, and this might have to be fixed in kernel space… (a quick probe to test that hypothesis is sketched below).
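If the mapping really is treated as uncached/device memory, every single 32-bit load through it should cost a full bus round trip, so the per-read time stays high no matter how the copy loop is written. A hypothetical probe (my own sketch; value is assumed to be the mmap()ed pointer from my previous post):

    #include <stdio.h>
    #include <stdint.h>
    #include <time.h>

    #define NREADS 1000000

    /* Hypothetical latency probe: time individual 32-bit reads through
     * the mapping and report the average per-read cost. */
    double probe_read_latency(const volatile uint32_t *value)
    {
        struct timespec t0, t1;
        uint32_t sink = 0;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < NREADS; i++)
            sink += value[i & 2047];        /* stay inside the 8 kByte window */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("sink=%u, %.1f ns per read\n", (unsigned)sink, secs / NREADS * 1e9);
        return secs / NREADS;
    }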
While waiting for someone to fix this for me :) , I wrote an assembly version of the memcpy using the "NEON memory copy with preload" example from the ARM Infocenter. I had to add "SUBS r2,r2,#0x40" before the loop, otherwise the loop would run 64 bytes too far (and overwrite memory). Using this "neon memcpy" I got a bit more speed (62 MBytes/s), and I could use the -Ofast flag to optimize the rest of the code. The function is called the same way as memcpy, but the data must be 64-byte aligned:

    void *neon_memcpy(void *ut, const void *in, size_t n);

neon_memcpy.S:

    .arch   armv7-a
    .fpu    neon
    .global neon_memcpy
    .type   neon_memcpy, %function

    neon_memcpy:
        MOV     r3, r0              @ remember dest so it can be returned, like memcpy
        SUBS    r2, r2, #0x40       @ pre-decrement the count so the loop stops in time
    neon_copy_loop:
        PLD     [r1, #0xC0]         @ preload well ahead of the read pointer
        VLDM    r1!, {d0-d7}        @ load 64 bytes from the source
        VSTM    r0!, {d0-d7}        @ store 64 bytes to the destination
        SUBS    r2, r2, #0x40
        BGE     neon_copy_loop
        MOV     r0, r3              @ return the original dest pointer
        BX      lr
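For reference, a hypothetical caller (the allocation and size choices are mine, not from the original design): the length is kept a multiple of 64 bytes to match the loop's chunk size, and the buffers are allocated 64-byte aligned:

    #include <stdlib.h>
    #include <stdint.h>

    void *neon_memcpy(void *ut, const void *in, size_t n);

    int main(void)
    {
        size_t n = 2048 * 4;        /* 8 kBytes; already a multiple of 64 */
        void *dst;

        /* 64-byte aligned source and destination (in the real design the
         * source would be the mmap()ed 'value' pointer, which is
         * page-aligned anyway). */
        static uint8_t src[2048 * 4] __attribute__((aligned(64)));

        if (posix_memalign(&dst, 64, n) != 0)
            return 1;

        neon_memcpy(dst, src, n);

        free(dst);
        return 0;
    }

    /* Build: gcc -Ofast main.c neon_memcpy.S -o test */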