Greetings,
We have questions about FPGA-HPS I/O on the Agilex 7M FPGA. The latency of FPGA->HPS writes is not behaving as we expect, which is critical for our application. We are working with the DK-DEV-AGM039FES M-Series Development Kit.
We have a small amount of data (1 cache line) that we want to send from FPGA logic to the HPS for processing. Using the Cache Coherency Unit (CCU), we expect the Agilex 7 to route the write from the FPGA directly into the HPS L2 cache, since the cache line for that address should already be allocated [1][2].
Instead, we observe that the cache line appears to get evicted and the underlying memory updated instead. The HPS then pays the full latency (~230 ns) of fetching the cache line from the underlying memory. We have provided details of our test setup below.
Q1: Can you help us determine if we have a configuration issue causing the cache to be flushed and underlying memory to be written instead?
Q2: Does the Agilex 7 have a hardware bug that prevents the Cache Coherency Unit from functioning correctly, leading to increased latency in FPGA-HPS communication? See [4], [5], [7] about a documented bug.
Q3: Do you have an estimate of what latency should be realizable for writing 1 cache line from FPGA logic to the HPS (with the HPS then accessing one word of it)?
Q4: Is there a reference design and setup for Agilex 7 which can demonstrate low-latency access from FPGA to HPS cache? Alternatively, can you describe how we could demonstrate this on the evkit? [6] does not seem to directly demonstrate the achieved latency.
Test setup: FPGA logic
======================
We have used the DMA IP as demonstrated in [6]. We also tried our own IP block that generates 64-byte writes to the EMIF memory. These accesses go through the Cache Coherency Translator (CCT) IP block. The CCT CSR is configured as follows:
ARDOMAIN = 0b01 # [1:0]
ARBAR = 0b00 # [3:2]
ARSNOOP = 0b0000 # [7:4]
ARCACHE = 0b1111 # [11:8]
AWDOMAIN = 0b01 # [13:12]
AWBAR = 0b00 # [15:14]
AWSNOOP = 0b000 # [18:16]
AWCACHE = 0b1111 # [22:19]
AxUSER_upper = 0b000001 # [28:23] = AxUSER[7:2]
AxPROT = 0b001 # [31:29]
We have tried many variations, including all values of AxPROT and setting AWSNOOP = 0b001 (supposedly WriteLineUnique), but we were not able to fix the latency by tuning these settings. A sketch of how we pack this CSR value is shown below.
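For reference, here is a minimal C sketch that packs the fields above into the 32-bit CSR value; the bit positions are the ones listed, and the helper name is ours:

#include <stdint.h>
#include <stdio.h>

/* Pack the CCT CSR fields at the bit positions listed above. */
static uint32_t cct_csr_value(void)
{
    uint32_t v = 0;
    v |= 0x1u << 0;    /* ARDOMAIN     [1:0]   = 0b01     */
    v |= 0x0u << 2;    /* ARBAR        [3:2]   = 0b00     */
    v |= 0x0u << 4;    /* ARSNOOP      [7:4]   = 0b0000   */
    v |= 0xFu << 8;    /* ARCACHE      [11:8]  = 0b1111   */
    v |= 0x1u << 12;   /* AWDOMAIN     [13:12] = 0b01     */
    v |= 0x0u << 14;   /* AWBAR        [15:14] = 0b00     */
    v |= 0x0u << 16;   /* AWSNOOP      [18:16] = 0b000    */
    v |= 0xFu << 19;   /* AWCACHE      [22:19] = 0b1111   */
    v |= 0x1u << 23;   /* AxUSER_upper [28:23] = 0b000001 */
    v |= 0x1u << 29;   /* AxPROT       [31:29] = 0b001    */
    return v;          /* = 0x20F81F01 */
}

int main(void)
{
    printf("CCT CSR = 0x%08X\n", cct_csr_value());
    return 0;
}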
Test setup: HPS configuration
=============================
The HPS runs Linux. We use a u-dma-buf device tree fragment, similar to [6], to get a cacheable mapping to the EMIF memory:
#include "../socfpga_agilex7m_socdk.dts"
/ {
reserved-memory {
#address-cells = <2>;
#size-cells = <2>;
ranges;
testbuf: testbuf@10000000 {
compatible = "shared-dma-pool";
reusable;
reg = <0x0 0x10000000 0x0 0x00400000>;
label = "testbuf";
};
};
soc {
udmabuf@10000000 {
compatible = "ikwzm,u-dma-buf";
device-name = "udmabuf0";
size = <0x00400000>;
memory-region = <&testbuf>;
dma-coherent;
};
};
};
We use the kernel command line isolcpus=1 nohz_full=1 together with sched_setaffinity() to isolate a CPU for our tests, as sketched below.
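As a rough illustration, pinning the measurement thread to the isolated CPU looks like this (our own helper, not from any reference design):

#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling thread to CPU 1, matching isolcpus=1 on the cmdline. */
static int pin_to_cpu1(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(1, &set);
    return sched_setaffinity(0, sizeof(set), &set); /* 0 = calling thread */
}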
From Linux user space, we map this memory with one of:
#include <fcntl.h>    /* open(), O_RDWR, O_SYNC */
#include <sys/mman.h> /* mmap(), PROT_*, MAP_SHARED */

int fd = open("/dev/udmabuf0", O_RDWR);          /* Cached, or ... */
int fd = open("/dev/udmabuf0", O_RDWR | O_SYNC); /* Uncached / Device */
void *mapping = mmap(NULL, sz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
In Arm Development Studio, we can verify that a cacheable (or uncacheable, depending on the request) virtual mapping of NP:0x10000000 is present in our process.
In uncached mode, we measure about 230 ns of latency per access to 'mapping', reading a word in a loop through a volatile pointer.
In cached mode, one access initially completes in about 1.67 ns. However, once the FPGA updates the memory (<= 64 bytes), the CPU read latency for that address jumps to approximately 230 ns on the next read. The cache line should still be present, since the CPU busy-polls the address with reads.
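For reference, the busy-poll measurement is roughly the following sketch (the iteration count is arbitrary and the helper is our own; catching the latency jump right after an FPGA write is done the same way with per-read timestamps):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Read one word of the mapping in a loop through a volatile pointer
   and print the average read latency. */
static void measure_avg_read_ns(void *mapping)
{
    volatile uint64_t *word = (volatile uint64_t *)mapping;
    const long iters = 1000000;
    struct timespec t0, t1;
    uint64_t sink = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        sink += *word;                     /* one load per iteration */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("avg read latency: %.2f ns (sink=%llu)\n",
           ns / (double)iters, (unsigned long long)sink);
}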
Using the Arm Development Studio "Memory" and "Cache" views, or by establishing both a cached and an uncached mapping and reading both, we can observe that the underlying memory does indeed get updated unexpectedly when the FPGA logic writes to the address. We expect only the cache to be updated, with no write to memory. When the CPU itself writes to the cached mapping, we observe that only the cache is updated and the underlying memory is not (until the line is evicted much later).
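To make that check concrete, here is a minimal sketch of the dual-mapping comparison (assuming the device permits two simultaneous mappings; error handling omitted):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t sz = 0x00400000;
    int fd_c = open("/dev/udmabuf0", O_RDWR);          /* cached view   */
    int fd_u = open("/dev/udmabuf0", O_RDWR | O_SYNC); /* uncached view */
    volatile uint64_t *c = (volatile uint64_t *)
        mmap(NULL, sz, PROT_READ | PROT_WRITE, MAP_SHARED, fd_c, 0);
    volatile uint64_t *u = (volatile uint64_t *)
        mmap(NULL, sz, PROT_READ | PROT_WRITE, MAP_SHARED, fd_u, 0);

    /* After an FPGA write: if *u changed, the write reached the
       underlying memory; if only *c changed, it stayed in cache. */
    printf("cached: 0x%016llx  uncached: 0x%016llx\n",
           (unsigned long long)c[0], (unsigned long long)u[0]);
    return 0;
}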
References
==========
[1] Agilex 7 HPS Technical Reference Manual, section 7.3.5.3.: "If you do a Cache Allocate transaction, the CCU maintains coherency and allocates in the cache. This is useful when you want to maintain coherency and keep data available in the system with minimal latency, so the masters avoid traversing to the external memory for each transaction." ... "On cache hits, write data is stored in cache." https://www.intel.com/programmable/technical-pdfs/683567.pdf
[2] AN 886: Intel Agilex 7 Device Design Guidelines, section 5.1.8: https://cdrdv2-public.intel.com/773606/an886-683634-773606.pdf
[3] Setting up and Using Bridges Linux Example: https://altera-fpga.github.io/rel-25.1/embedded-designs/agilex-7/f-series/soc/setup-use-bridges/ug-setup-use-bridges-agx7f-soc/#add-u-dma-buf-driver-to-create-cma-regions
[4] KB 000086381: Why do I see cache coherency problems between the HPS and FPGA on Intel Agilex® 7 FPGA SoC designs in Intel® Quartus® Prime Pro Edition Software version 20.4 and earlier? https://www.intel.com/content/www/us/en/support/programmable/articles/000086381.html
[5] U-Boot: "drivers: cache: ncore: Disable snoop filter": https://github.com/altera-fpga/u-boot-socfpga/commit/d192adafebcd5e742a229aedbdcc7d6957d68f02
[6] Setting up and Using Bridges Linux Example: https://altera-fpga.github.io/rel-25.1/embedded-designs/agilex-7/f-series/soc/setup-use-bridges/ug-setup-use-bridges-agx7f-soc/#add-u-dma-buf-driver-to-create-cma-regions
[7] About snoop filters: https://www.intel.com/content/www/us/en/docs/programmable/814346/25-1/snoop-filters.html
Hello,
Thank you for contacting Altera. My name is Boon Khai; I am currently working on an answer and will get back to you soon.
Regards,
Boon Khai.
Hi, thank you for waiting.
I'm still working to get input from the subject matter experts (SMEs) at Altera regarding your questions 1 and 3. In the meantime, I can provide responses to Q2 and Q4 based on the current documentation and known issues.
For Q2
- You are right: the hardware bug described is an IP bug in the CCU, and from a software perspective, disabling the snoop filter is the only workaround. The snoop filter mainly helps under heavy transaction loads; in your case, sending a small amount of data (e.g., one cache line), disabling it is unlikely to hurt performance. In fact, since the HPS expects fast access to such small transfers, disabling the snoop filter should be beneficial in your case, improving both latency and reliability.
For Q4
- At this time, to the best of my knowledge, there is no fully validated reference design that demonstrates low-latency cache-injected writes from the FPGA to the HPS L2 cache. That said, I will check with the SMEs to confirm whether there is an internal example or recommended setup that we can share.
Regards,
Boon Khai.
Thank you Boon Khai for your efforts so far. This issue is still very relevant for us.
It is good to hear that the CCU bug should not be a problem for us. In that case, I'm still hoping that this is something we can fix by configuration or software.
Don't hesitate to ask any extra information about our setup if needed.
