Re: FPGA Master Driven Coherent Memory Transactions Using Write/Read Masters on Linux

Altera_Forum · ‎02-13-2017

Hi,

I'm trying to compare three FPGA-master-driven memory transaction scenarios on an Altera Cyclone V SoC:

- Connecting to the SDRAM controller directly,

- Connecting to the SDRAM controller using the F2H bridge,

- Connecting to the SCU through the ACP using the F2H bridge.

Also:

- I'm running Linux.

- I use MSGDMA write/read IPs as the FPGA masters.

I allocate contiguous memory using the kernel's CMA feature combined with the Texas Instruments CMEM API.

Naturally, the MSGDMA is controlled via a Linux driver.

I would appreciate help with using the ACP port.

Current progress, following the documentation (page 9-29 of the Cyclone V Device Handbook):

The SCU is enabled (checked manually),

The SMP bit in ACTLR is set (checked manually),

I have no access to the page table configuration, but I assume the allocated coherent memory is marked shareable because I use both ARM cores,

I target the ACP window in the physical address map (base address 0x80000000),

After reset, the ACP ID mapper is in dynamic mode,

The write/read masters use Avalon interfaces, not AXI, so they don't drive the AxUSER and AxCACHE signals! These signals are generated by the memory-mapped interconnect.

Here I get dizzy... The documentation states that the ACP ID mapper can override these signals for AXI masters that cannot drive the sideband signals themselves.

BUT... it also states that this does not apply to FPGA masters, so I conclude that I must drive these signals myself.

I went into the source code and set the AxUSER and AxCACHE signals manually. I tried setting AxCACHE[1] and AxUSER[0] to 1. There were noticeable changes in transfer speed (2x-3x), but the transactions still weren't coherent.

Then I set AxUSER to "11111" and AxCACHE to "1111". Again there were changes in transaction speed.

I find this quite a challenging task, and finally here is my question: is it possible to use the ACP with an FPGA master through the memory-mapped interconnect? And if so, what am I missing here?

Altera_Forum · ‎02-16-2017

Success!!! I'm overwhelmed! :)

The problem was hiding in the Texas Instruments CMEM API driver I've been using!

So, the CMA kernel feature allows dma_alloc_coherent() to dynamically allocate large buffers of contiguous memory. The CMEM API, while mapping the physical address to user space, sets the page protection settings (as I see it, some of these settings are architecture dependent). I can't find good documentation about these parameters for ARM, so if you know a good source, please post it!

I found the flags in "arch/arm/include/asm/pgtable-2level.h":

#define L_PTE_MT_UNCACHED      (_AT(pteval_t, 0x00) << 2)  /* 0000 */
#define L_PTE_MT_BUFFERABLE    (_AT(pteval_t, 0x01) << 2)  /* 0001 */
#define L_PTE_MT_WRITETHROUGH  (_AT(pteval_t, 0x02) << 2)  /* 0010 */
#define L_PTE_MT_WRITEBACK     (_AT(pteval_t, 0x03) << 2)  /* 0011 */
#define L_PTE_MT_MINICACHE     (_AT(pteval_t, 0x06) << 2)  /* 0110 (sa1100, xscale) */
#define L_PTE_MT_WRITEALLOC    (_AT(pteval_t, 0x07) << 2)  /* 0111 */
#define L_PTE_MT_DEV_SHARED    (_AT(pteval_t, 0x04) << 2)  /* 0100 */
#define L_PTE_MT_DEV_NONSHARED (_AT(pteval_t, 0x0c) << 2)  /* 1100 */
#define L_PTE_MT_DEV_WC        (_AT(pteval_t, 0x09) << 2)  /* 1001 */
#define L_PTE_MT_DEV_CACHED    (_AT(pteval_t, 0x0b) << 2)  /* 1011 */
#define L_PTE_MT_VECTORS       (_AT(pteval_t, 0x0f) << 2)  /* 1111 */
#define L_PTE_MT_MASK          (_AT(pteval_t, 0x0f) << 2)
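To make the encoding above a bit more tangible, here is a minimal userspace sketch of how the memory-type field could be decoded from a Linux PTE value. It only reuses the values from the listing (with the kernel's upper-case macro names); the typedef stands in for the kernel's 32-bit pteval_t on 2-level page tables, and pte_memory_type() is a hypothetical helper written for illustration, not a kernel function.

```c
#include <stdint.h>

/* Stand-in for the kernel type: pteval_t is 32 bits wide with 2-level tables. */
typedef uint32_t pteval_t;

/* Memory-type values from the listing; the field occupies bits [5:2]. */
#define L_PTE_MT_UNCACHED      ((pteval_t)0x00 << 2)
#define L_PTE_MT_BUFFERABLE    ((pteval_t)0x01 << 2)
#define L_PTE_MT_WRITETHROUGH  ((pteval_t)0x02 << 2)
#define L_PTE_MT_WRITEBACK     ((pteval_t)0x03 << 2)
#define L_PTE_MT_MINICACHE     ((pteval_t)0x06 << 2)
#define L_PTE_MT_WRITEALLOC    ((pteval_t)0x07 << 2)
#define L_PTE_MT_DEV_SHARED    ((pteval_t)0x04 << 2)
#define L_PTE_MT_DEV_NONSHARED ((pteval_t)0x0c << 2)
#define L_PTE_MT_DEV_WC        ((pteval_t)0x09 << 2)
#define L_PTE_MT_DEV_CACHED    ((pteval_t)0x0b << 2)
#define L_PTE_MT_VECTORS       ((pteval_t)0x0f << 2)
#define L_PTE_MT_MASK          ((pteval_t)0x0f << 2)

/* Hypothetical helper: name the memory type encoded in a PTE value. */
static const char *pte_memory_type(pteval_t pte)
{
    switch (pte & L_PTE_MT_MASK) {   /* ignore all bits outside the field */
    case L_PTE_MT_UNCACHED:      return "uncached";
    case L_PTE_MT_BUFFERABLE:    return "bufferable";
    case L_PTE_MT_WRITETHROUGH:  return "writethrough";
    case L_PTE_MT_WRITEBACK:     return "writeback";
    case L_PTE_MT_MINICACHE:     return "minicache";
    case L_PTE_MT_WRITEALLOC:    return "writealloc";
    case L_PTE_MT_DEV_SHARED:    return "dev_shared";
    case L_PTE_MT_DEV_NONSHARED: return "dev_nonshared";
    case L_PTE_MT_DEV_WC:        return "dev_wc";
    case L_PTE_MT_DEV_CACHED:    return "dev_cached";
    case L_PTE_MT_VECTORS:       return "vectors";
    default:                     return "unknown";
    }
}
```

Note that the DEV_* entries are device memory types, so seeing "dev_shared" here does not by itself mean the mapping is cacheable normal memory.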

Yeah... my assumption that the memory is marked shared seems to be incorrect. It might be shared between processor cores, but it is not shared with other peripherals (devices). Anyway, I used the L_PTE_MT_DEV_SHARED flag and the data seems to be cached (dummy data creation done by the HPS is 5x faster compared with uncached memory), and the transferred data passes validation.

In conclusion:

The ACP port can be used with Avalon-MM masters, but the AxCACHE and AxUSER signals must be set directly (by default the Qsys interconnect ties them to ground).

I'm planning to summarize all this into a tutorial (in a couple of months). Let me know if you're interested! ;)

Altera_Forum · ‎04-05-2017

I was incorrect: L_PTE_MT_DEV_SHARED stands for something else and this memory is not cached, although initialization was even faster than for memory acquired with malloc (I suppose the explanation could be lazy allocation; I'm still curious, though, why it was faster than plain non-cached memory).

Nevertheless, the PROT signals additionally had to be tied to ground.

So in conclusion:

- cache and user signals set high,

- prot signals set low,

- access only through the dedicated ACP window ( address | 0x80000000 ).

Note: with these settings, only cached memory accessed via the ACP will work; a connection to SDRAM via the F2H bridge and non-cached memory will not.
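The "address | 0x80000000" rule from the list above can be sketched as a small C helper. The window base comes from this thread; the 1 GiB window size and the name phys_to_acp() are assumptions made for illustration, so please check them against the Cyclone V address map before relying on the range check.

```c
#include <stdbool.h>
#include <stdint.h>

#define ACP_WINDOW_BASE 0x80000000u  /* from the thread */
#define ACP_WINDOW_SIZE 0x40000000u  /* ASSUMPTION: the window spans 1 GiB */

/*
 * Hypothetical helper: translate a physical SDRAM address into the
 * address the FPGA master has to use to reach it through the ACP.
 * Returns false if the address does not fit into the assumed window.
 */
static bool phys_to_acp(uint32_t phys, uint32_t *acp_addr)
{
    if (phys >= ACP_WINDOW_SIZE)
        return false;
    *acp_addr = phys | ACP_WINDOW_BASE;
    return true;
}
```

For example, a buffer at physical address 0x10000000 would be programmed into the DMA descriptor as 0x90000000.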

Altera_Forum · ‎10-25-2017

--- Quote Start ---

I was incorrect, L_PTE_MT_DEV_SHARED stands for something else...

--- Quote End ---

What setting did you end up using instead of the L_PTE_MT_DEV_SHARED value?

Also, do you have any code examples you can share?

Altera_Forum · ‎10-27-2017

code.

I actually ended up writing a driver to allocate contiguous cached/non-cached memory for userspace :rolleyes:. The code is available here: http://git.edi.lv/rihards.novickis/fpsoc_linux_drivers

The driver is called cma; there are still some features I should implement...

Anyway, check the USAGE file and please give me some feedback :)

page protection.

Regarding page protection flags - for cached memory I used the default ones provided by the mmap() syscall. See the cma/driver/cma.c file, cma_mmap(...) function.

sideband signals.

I'm not sure whether this is the problem you are having, but my problem was actually on the hardware side. Internally the ARM uses an AXI interface, but the MSGDMA cores use an Avalon interface, so Qsys generates a special interconnect to connect the two. For the ACP to work, you need to assert special sideband signals that are present in AXI but not in Avalon. The default values for these sideband signals are set by the Qsys interconnect, so if you want your DMA core to be able to access cached memory, you have to set them manually after Qsys system generation. Haha, this is quite dirty, and you have to redo it after every Qsys generation.

I set the default values of these signals in the generated top entity file. I use VHDL, so for me it is hps_system.vhd. In the code it looks something like this (and I'm not 100% sure about the requirement to reassert the prot signals):

f2h_AWCACHE => "1111",   -- .awcache
f2h_AWPROT  => "000",    -- .awprot
f2h_AWUSER  => "11111",  -- .awuser
f2h_ARCACHE => "1111",   -- .arcache
f2h_ARPROT  => "000",    -- .arprot
f2h_ARUSER  => "11111",  -- .aruser

Anyway, I suggest using the RTL viewer to check whether these signals are asserted correctly.

As proof of this working :D, here is a graph of measured speeds, where you can observe that the transaction speed drops when the "packet" size becomes comparable with the L2 cache size :)
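For reference, here is a small C sketch of what the sideband values above encode. The bit meanings are my reading of the AXI3 specification and the Cortex-A9 MPCore TRM (AxCACHE[1] = cacheable, AxUSER[0] = shared on the ACP, AxPROT[1] = non-secure), so treat them as assumptions to verify; the function names are hypothetical. Earlier in this thread, setting only AxCACHE[1] and AxUSER[0] did not give coherent transfers, so the first check is a necessary condition rather than a guarantee.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * ASSUMPTION (AXI3 / Cortex-A9 MPCore TRM): an ACP request is treated
 * as coherent only if it is cacheable (AxCACHE[1]) and shared (AxUSER[0]).
 */
static bool acp_request_is_coherent(uint8_t axcache, uint8_t axuser)
{
    bool cacheable = ((axcache >> 1) & 1u) != 0;
    bool shared    = (axuser & 1u) != 0;
    return cacheable && shared;
}

/* ASSUMPTION (AXI3): AxPROT[1] = 0 marks the access as secure. */
static bool axprot_is_secure(uint8_t axprot)
{
    return ((axprot >> 1) & 1u) == 0;
}
```

With the values from the port map ("1111", "11111", "000") both checks pass, whereas the all-zero defaults fail the first one.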

https://www.alteraforum.com/forum/attachment.php?attachmentid=14235

I hope this helps!

Altera_Forum · ‎03-13-2018

Rihards,

So I've been having trouble consistently using the FPGA-to-HPS bridge (and the SDRAM hooked directly to the CPU) with larger buffers (not sure about the smaller ones), so I'd like to ask for your help.

I'm using Linux too and have a custom VHDL IP that I set up to write data to memory (4M buffer).

My device driver simply uses "dma_alloc_coherent" to allocate the buffer, and ioctls control the IP, one of which enables writing to the 4M buffer. What I see in SignalTap is that the IP starts to write to the memory, BUT after 15 writes "avm_waitrequest" is asserted, so nothing else happens. The IP is waiting for the HPS side to do its thing...

Do you happen to have any insight into this? Any suggestions? Thanks in advance.

Altera_Forum · ‎03-14-2018

Hey!

Some directions I would follow:

- Is the FPGA-to-HPS bridge enabled? (Check under "/sys/class/fpga_bridge".)

- Then I would check whether the data is actually written to memory (using non-cached memory at first).

- If the data is not there: how do you pass the physical address to your DMA IP core?

I would be glad to look into it myself, provided your core is not under NDA or something...
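The first check can also be done from a small C program. This is only a sketch: it assumes the kernel exposes a "state" attribute per bridge that reads "enabled" or "disabled" (as the mainline fpga_bridge class does, as far as I know), and the exact path, e.g. "/sys/class/fpga_bridge/br0/state", is a hypothetical example that depends on the kernel version and device tree.

```c
#include <stdio.h>
#include <string.h>

/*
 * Read a bridge state attribute from sysfs.
 * Returns 1 if it reads "enabled", 0 for any other content,
 * and -1 if the file cannot be opened or read.
 */
static int bridge_is_enabled(const char *state_path)
{
    char buf[32];
    FILE *f = fopen(state_path, "r");

    if (!f)
        return -1;
    if (!fgets(buf, sizeof buf, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return strncmp(buf, "enabled", 7) == 0 ? 1 : 0;
}
```

Calling bridge_is_enabled() with the state file of the FPGA-to-HPS bridge before starting the DMA gives a quick way to rule out a disabled bridge as the cause of a stuck waitrequest.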