One float16 access per external memory DIMM in a kernel running at an operating frequency equal to or higher than the memory controller operating frequency (200 MHz on Stratix V and 266 MHz on Arria 10) will exhaust the external memory bandwidth and result in memory stalls. This is expected behavior and will not necessarily reduce performance, since the kernel is already running at its peak achievable throughput.
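As a rough check of the statement above, the sketch below compares the bandwidth demanded by one float16 (64-byte) access per kernel cycle with the peak of a single memory bank. It assumes the bank delivers at most 64 bytes per memory controller cycle, which is what the quoted statement implies; the function names are illustrative only and do not come from any Intel tool.

```python
FLOAT16_BYTES = 16 * 4  # an OpenCL float16 holds 16 single-precision floats = 64 bytes


def bandwidth_gbps(bytes_per_cycle, freq_mhz):
    """Bandwidth in GB/s of a port moving bytes_per_cycle on every cycle at freq_mhz."""
    return bytes_per_cycle * freq_mhz * 1e6 / 1e9


def utilization(kernel_mhz, controller_mhz, accesses_per_cycle=1):
    """Fraction of one bank's peak demanded by float16 accesses issued every kernel cycle.

    Assumption: the bank peak is one 64-byte transfer per memory controller cycle.
    """
    demand = bandwidth_gbps(FLOAT16_BYTES * accesses_per_cycle, kernel_mhz)
    peak = bandwidth_gbps(FLOAT16_BYTES, controller_mhz)
    return demand / peak


if __name__ == "__main__":
    for board, controller_mhz in (("Stratix V", 200), ("Arria 10", 266)):
        peak = bandwidth_gbps(FLOAT16_BYTES, controller_mhz)
        print(f"{board}: assumed peak {peak:.3f} GB/s per bank")
    # A kernel at 230 MHz on Arria 10 with one, then two, float16 accesses per cycle.
    print(f"one access:  {utilization(230, 266):.2f} of peak")
    print(f"two accesses: {utilization(230, 266, accesses_per_cycle=2):.2f} of peak")
```

Under these assumptions a single float16 stream at 230 MHz stays below the Arria 10 bank peak, while two such streams on the same bank ask for more than the bank can deliver, which is where stalls would be expected.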
I am able to achieve a frequency of 230 MHz (on Arria 10). My kernel has one access like that, plus a write of the result. I believed that having separate read and write ports should not result in memory contention. Is my assumption correct?
Also, I am able to achieve high memory bandwidth (with high occupancy) when reading, but the bandwidth for writing the output to memory is very low (with low occupancy), and I am not able to figure out the reason for that. Can you suggest how to improve the output write bandwidth?
Is this the same code as the one in your other thread? The read access is 512-bit wide which means it can efficiently saturate the bandwidth of one memory bank. However, the write is narrow and unless you have multiple consecutive writes that are coalesced into one wider access, write performance is going to be poor. You should check the HTML report to see how the memory ports are instantiated by the compiler. Since, by default, the compiler interleaves buffers if your board has two memory banks, your reads and writes can actually conflict. But if you disable memory interleaving as mentioned in the Programming Guide and put the input and output buffers on separate memory banks, then the accesses will not conflict with each other.
Yes, that is the same code as in my other thread. According to my board specification, I have only one DDR4 memory bank. In the generated report I can see that multiple ports are created for storing the data, as you can see in the attached photo. I thought that even with one memory bank there would be separate read and write ports, which would mean there should not be any memory contention. Is that not correct?
Although my writes are consecutive, they are not coalesced; is that correct to say? If so, would having one float8 or float16 write and one float16 read be the best case, given that I have one memory bank?
The ports are indeed separate, but they are all connected to the same bus and they will all be competing with each other for the memory bandwidth. The writes in your kernel are already coalesced over x (all the ports in your image are actually 512-bit ports). What you should do is disable unrolling over y, since that is the reason for all the non-coalescable ports you are getting. Please check my reply in the other thread.
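To illustrate why separate ports still contend on a single bank, here is a minimal sketch that assumes a simple proportional-sharing model: when the combined demand of all ports exceeds the bus peak, every port is scaled down by the same factor. The real arbiter will behave differently in detail; `share_bus` and the port names are hypothetical, and the peak is again assumed to be 64 bytes per 266 MHz controller cycle.

```python
def share_bus(demands_gbps, peak_gbps):
    """Return the bandwidth granted to each port under proportional sharing.

    demands_gbps maps a port name to the bandwidth it requests in GB/s.
    If the total request fits under peak_gbps, every port gets what it asked for;
    otherwise all requests are scaled by the same factor (an illustrative assumption).
    """
    total = sum(demands_gbps.values())
    scale = min(1.0, peak_gbps / total) if total > 0 else 1.0
    return {port: demand * scale for port, demand in demands_gbps.items()}


PEAK_GBPS = 64 * 266e6 / 1e9         # assumed peak of the single DDR4 bank
PORT_DEMAND_GBPS = 64 * 230e6 / 1e9  # one 512-bit access per cycle at 230 MHz

if __name__ == "__main__":
    alone = share_bus({"read": PORT_DEMAND_GBPS}, PEAK_GBPS)
    together = share_bus(
        {"read": PORT_DEMAND_GBPS, "write": PORT_DEMAND_GBPS}, PEAK_GBPS
    )
    print("read alone:      ", alone)
    print("read plus write: ", together)
```

In this model a single 512-bit port gets everything it requests, but once a second full-width port is added both are throttled, even though the ports themselves are separate. Each additional non-coalesced port adds to the total demand on the same bus in the same way.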