Programmable Devices
CPLDs, FPGAs, SoC FPGAs, Configuration, and Transceivers

HPS2FPGA bridge throughput

Altera_Forum
Honored Contributor II

Hi everybody, 

 

I am currently playing around with the Cyclone V SoC Development board. 

I have a Qsys system running that resembles the Golden System Reference Design, except that the h2f_axi_clock is driven by a PLL on the FPGA side. 

The h2f_axi_clock is 80 MHz in my case.  

 

I have the on-chip memory up and running according to the GSRD. I used the Linux example to run a Linux application on the HPS. From within the application I can write to and read from the on-chip memory; the memory address range is mmap'ed into Linux user space for this purpose. 

Now the question: 

I get about 45 MBytes/s throughput when using memcpy to copy a block of 65 kBytes of data from the HPS to the FPGA (the transfer takes about 1.4 ms). I measured the memcpy time as follows: 

 

clock_gettime(CLOCK_REALTIME, &start); 

memcpy((void*)hw_onchip_mem_base, (void*)&buffer[0], ONCHIP_MEMORY2_0_SPAN); 

clock_gettime(CLOCK_REALTIME, &end); 

 

 

45 MBytes/s seems quite low. I have a 64-bit bus to the memory and an 80 MHz clock, so I would expect a theoretical throughput of about 640 MBytes/s. 

Of course I can imagine that the bridge can only transmit with a certain burst size, that arbitration must take place, that Linux data handling adds some overhead, and that there may be other restrictions. 

But is 45 MByte/s all I can get? That would be quite bad... 

Any ideas how to improve the performance? 

What am I doing wrong? 

Has anybody achieved better results, and how? 

 

Tools: Quartus 13.1, Linux 3.9 kernel. 

 

Thanks in advance!! 

 

Volker
10 Replies
Altera_Forum
Honored Contributor II

I'm not a Linux expert, but I think memcpy does not utilize DMA or bursts or anything similar. From my experience with Nios and memcpy, I saw that all the data was transferred byte by byte. So even if you have a 64-bit bus interface, memcpy does not "know" about it and copies just one byte after another.

Altera_Forum
Honored Contributor II

Some memcpy implementations use 32-bit copy operations, and if the CPU has a data cache it will cause burst transfers, but it is still slower than a DMA solution, as I'd say you need about 10 CPU cycles for each 32-bit transfer (maybe a Nios II guru will correct this figure). 

If the source or destination isn't 32-bit aligned, memcpy will most likely fall back to 8-bit transfers anyway.
Altera_Forum
Honored Contributor II

Thanks! I did some googling on that; you seem to be right. 

I will try some DMA approaches in the near future. If anybody has hints or examples, they would be very much appreciated! 

 

Thanks again!! 

 

Volker
Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

 

if anybody has hints or examples, they would be very much appreciated! 

 

--- Quote End ---  

 

 

1) Look at the generated assembly code.  

 

You'll likely find that the CPU performs a read (64-bits perhaps), and then a write. From that you can immediately understand why the transfer will be slow. 

 

2) Use SignalTap II to probe "something".  

 

For example, if you cannot probe the bridge, probe the destination memory interface (Avalon-MM bus signals). 

 

3) Read the documentation regarding DMA controllers, and then test it. 

 

I have not used the HPS system, but I would assume they have DMA controllers, or allow a DMA controller in the FPGA fabric to access the HPS system buses.  

 

One of the first things I do when determining whether a processor is suitable for a project is to test the DMA controller(s) to ensure my bus transfer requirements are met. A CPU generally cannot generate burst transactions, so testing memcpy() is not a good performance test. 

 

Cheers, 

Dave
Altera_Forum
Honored Contributor II

Hi, 

 

I have to deal with a similar task. 

 

I wonder if it's really the responsibility of the developer to write all the code for the DMA HPS2FPGA transfer himself, or if Altera is going to provide some kind of Linux driver library that makes it easy for the "user space" developer to access this hardware component (and all the others, too) without knowledge of every control register bit of the architecture and the additional knowledge of AMBA, AXI, AHB, NIC-301, etc. 

 

If the answer to that question is "yes", it would at least shed a little light on the (Linux) software support Altera intends to provide for their SoC chips. 

 

I'm asking because there are fragments of support for the device in the Linux kernel source tree: the "socfpga_defconfig", several device driver modules I can choose via menuconfig, and a (not fully?) implemented device tree source (dts) file. 

I wonder whether it is Altera's intention to push these things further, or whether it is left entirely to the "community" to implement (or not implement) them for the socfpga. 

 

I, too, am looking for some examples that show the full might of the HPS2FPGA interface, which is advertised as a "high bandwidth interconnect backbone". All I can find so far is really only useful for switching some LEDs on a dev board, which hardly needs such sophisticated hardware, right? 

 

Regards, 

Maik
Altera_Forum
Honored Contributor II

Hi all, 

 

I just want to add some feedback here. 

I have now successfully integrated the modular SGDMA from BadOmen (http://www.alterawiki.com/wiki/modular_sgdma) on the FPGA side and connected it as a read master to the FPGA2SDRAM interface. I decided to use an FPGA DMA because I think I can get the highest performance this way (the HPS DMA is connected via the L3 interconnect, and its bandwidth to the SDRAM controller must be shared with other peripherals).  

 

My FPGA Qsys bus logic currently allows a theoretical throughput of 320 MBytes/s (32-bit Avalon-MM interface on a custom component @ 80 MHz, so there is much room for optimization). With SignalTap I see that the DMA is able to copy the data with a throughput of about 305 MBytes/s. 

This looks really promising now :-) 

 

Thanks again, 

Volker
Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

Hi all, 

 

I just want to add some feedback here. 

I have now successfully integrated the modular SGDMA from BadOmen (http://www.alterawiki.com/wiki/modular_sgdma) on the FPGA side and connected it as a read master to the FPGA2SDRAM interface. I decided to use an FPGA DMA because I think I can get the highest performance this way (the HPS DMA is connected via the L3 interconnect, and its bandwidth to the SDRAM controller must be shared with other peripherals).  

 

My FPGA Qsys bus logic currently allows a theoretical throughput of 320 MBytes/s (32-bit Avalon-MM interface on a custom component @ 80 MHz, so there is much room for optimization). With SignalTap I see that the DMA is able to copy the data with a throughput of about 305 MBytes/s. 

This looks really promising now :-) 

 

Thanks again, 

Volker 

--- Quote End ---  

 

 

Dear designer777 (http://www.alteraforum.com/forum/member.php?u=87612), 

 

OK, but what you wrote concerns the opposite direction of this topic, i.e. the FPGA_to_HPS_SDRAM dedicated interfaces (at most 6 channels). 

Combining an FPGA-side DMA master peripheral with this dedicated interface sounds good. 

I think the "simple" FPGA_to_HPS bridge has lower throughput. 

 

When I first created a pilot Qsys FW design with HPS_to_FPGA bridges, in which the ARM swapped data between FPGA-side memory and on-chip (OCRAM) memory, I only measured 20-30 MBytes/s (using SoCAL functions in a bare-metal application). That is poor performance. 

 

Do the FPGA-to-HPS and HPS-to-FPGA bridges have the same performance? 

 

Moreover, you mentioned ~80 MHz as the clock frequency. Does that mean the HPS-FPGA bridge clocks are driven from an FPGA-side ALT_PLL?  

From my earlier examination, the HPS-FPGA bridge clocks (e.g. h2f_axi_clock, h2f_lw_axi_clock, f2h_axi_clock, etc.) can be used at no more than ~133 MHz. Do you have any experience setting them properly? 

 

Regards, 

 

ZS.V.
Altera_Forum
Honored Contributor II

One comment... 

 

I have almost a year of Cyclone V SoC experience.  

 

Quartus has been much improved through versions 12.1, 13.01 ... 13.1 (at the HW/FW level). 

However, it is a sad but true fact that Altera does not currently provide any valuable reference designs (e.g. a GHRD) to support the integration of the FPGA-to-HPS bridges (the opposite direction) and the FPGA-to-SDRAM interfaces for memory-bandwidth testing purposes. The released GHRDs only integrate the HPS-to-FPGA bridge or its lightweight version. In most cases the information must be collected from different sources (this forum, the RocketBoards mailing list, etc.). Even an interesting component like the Address Span Extender has only a short Altera wiki entry :( 

 

The only memory-throughput SW test is in the Board Test System (included in the Cyclone V SoC Kit package), but its source code has not been released yet (I sent an SR about this, without any success). 

 

Unfortunately, their tutorials and reference designs are still not at the same level as their IDEs. 

 

Regards, 

ZS.V.
Altera_Forum
Honored Contributor II

Dear Designer777, 

 

I'm now working on a system that is quite similar to what you did a few months ago. We have managed to write to the HPS's SDRAM through the hps2sdram bridge using the modular SGDMA in the FPGA fabric. I have a counter that advances as long as the Avalon ready signal is asserted. I write this counter to SDRAM, but after 63 KB it stops and the ready signal goes low. I set descriptor.length = 0xFFFFFFFF; and tried different starting addresses, but the result is the same. 

Could you please help me understand how I can define the length of the SDRAM region to which I'm writing?  

 

 

Thanks, 

A.
Altera_Forum
Honored Contributor II

Hi, can you please help me with the 2x HPS2FPGA bridges' throughput / max clock / a C code example (full bridge)?  

 

I currently use the lightweight bridge at 100 MHz successfully; at 200 MHz the bridge fails. Someone mentioned a maximum of 133 MHz for the bridge clock. What is the maximum for a Cyclone V I7 device?  

 

Apart from the full bridge's larger data width, can I clock it faster than the lightweight bridge for better throughput?  

Apart from data width, why does the full bridge have better performance?  

 

Can someone please provide a C code (bare-metal) example of how to set up the full HPS2FPGA bridge and how to use it, e.g. accessing a Qsys component (PIO, memory, etc.)? I cannot find any examples on the web.  

 

Thanks