Re: Multi-Port-Front-End DDR3 controller is too slow

Altera_Forum · ‎12-21-2015

Hi,

I'm using a Cyclone V SoC with 1Gbyte of DDR3 memory connected to the FPGA DDR3 pins, and I've set the basic clock rate to 350MHz. I've got the NIOS CPU connected to this, and can verify that the memory is working.

I'm trying to send multiple streams of data to the DDR3 memory, and read streams back out again. This is being controlled by the FPGA. Unfortunately, I seem to have a bottleneck in terms of bandwidth.

I'm assuming that maximum bandwidth would be 350M x 2 x 32bits = 22.4Gbits per second. I'm not managing to get 6Gbits per second (total).
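As a quick check of that arithmetic, here is a minimal Python sketch. The function name is illustrative only and not part of any Altera tool; the figure it produces is a theoretical ceiling that ignores refresh, row activation and every other source of overhead.

```python
def ddr_peak_bits_per_second(clock_hz: int, bus_width_bits: int) -> int:
    """Theoretical DDR peak: two transfers per clock, one bus word each."""
    return clock_hz * 2 * bus_width_bits


# Numbers from the post: 350 MHz memory clock, 32-bit DDR3 interface.
peak = ddr_peak_bits_per_second(350_000_000, 32)
print(f"Theoretical peak: {peak / 1e9:.1f} Gbit/s")
```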

I've set up the MPFE in Qsys with 5 ports. One 32-bit read/write port is connected to the NIOS. Three 64-bit write-only ports are exported to the FPGA, and one 64-bit read-only port is also exported to the FPGA.

I've set the clock rate on these ports to 150MHz, and I'm sending 32 words at a time in bursts of 1. (I've tried bursts of 4 but that didn't work, and I'm puzzled as to why my burst width limits me to a max burst of 4.) Equally, I'm puzzled why, given that I'm using v15.1 and the Avalon MM port is generated with a "beginbursttransfer" signal, the Avalon MM manual says "Altera recommends that you do not use this signal. This signal exists to support legacy memory controllers." How legacy is the MPFE on the UniPHY IP for Cyclone V in Quartus 15.1? I've also monitored the NIOS bus, and that uses burst cycles of 1 as well.

I know the manual for the MPFE suggests I use 128-bit wide bus as I'm running at 1/2 the DDR clock rate, so should use 4x the bus width, but then I need to assign two FIFOs to each port, and I run out of FIFOs in the MPFE. Is there no way of speeding this up? Why is it not even managing 6Gbits per second - which is nearly 1/4 of the maximum?
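The width-matching reasoning can be sketched in a few lines of Python. This is only arithmetic on the numbers in the post (350 MHz, 32-bit memory interface, 150 MHz 64-bit ports, "1/2 the DDR clock rate" taken as 175 MHz); the helper names are hypothetical and nothing here models the MPFE itself.

```python
def memory_peak_bps(mem_clock_hz: int, mem_width_bits: int) -> int:
    """Theoretical DDR peak: two transfers per memory clock."""
    return mem_clock_hz * 2 * mem_width_bits


def port_width_to_match(mem_clock_hz: int, mem_width_bits: int, port_clock_hz: int) -> float:
    """Port width (bits) at which one port could carry the full memory peak."""
    return memory_peak_bps(mem_clock_hz, mem_width_bits) / port_clock_hz


MEM_CLOCK_HZ = 350_000_000
MEM_WIDTH_BITS = 32

# A port clocked at half the memory clock needs 4x the memory width.
half_rate_width = port_width_to_match(MEM_CLOCK_HZ, MEM_WIDTH_BITS, MEM_CLOCK_HZ // 2)

# One 64-bit port at 150 MHz carries at most this much, whatever the DDR3 can do.
port_peak_bps = 150_000_000 * 64

# 6 Gbit/s as a fraction of the theoretical memory peak.
observed_fraction = 6_000_000_000 / memory_peak_bps(MEM_CLOCK_HZ, MEM_WIDTH_BITS)

print(half_rate_width, port_peak_bps, round(observed_fraction, 3))
```

On these assumptions a single 64-bit port at 150MHz tops out at 9.6Gbits per second, so each port on its own is well below the 22.4Gbits per second memory peak.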

Altera_Forum · ‎12-22-2015

I don't have enough experience to address your specific case, but I can share some experiences I have had.

I've got a project on the Cyclone V (non-SoC, but I'm most likely going in that direction next) using the Arrow BMCV board. This board has a 16-bit memory interface to the hard memory controller and it runs at 333 MHz if my memory serves me correctly.

In my design I've got an MPFE with a 32-bit wide port going to a JTAG bridge and a 128-bit wide port fed from my logic running a burst length of 128, and I achieved about 80% of my calculated efficiency.

I think I calculated that if there was no overhead for calibration or anything else you could get 10.6Gbits/sec, and after a lot of testing up and down I almost always had a hard failure at 80% of this (I varied clock and burst length). In my case the burst length seemed secondary, but to keep the overhead down bursting really does seem to make sense. It took me a while (and some help from an FAE) to get all of the Avalon signals managed correctly.
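For reference, those figures can be reproduced with a short Python sketch, assuming the 333 MHz clock and 16-bit interface quoted from memory above. The 80% factor is simply the observed ceiling from this post, not a documented controller limit, and the variable names are illustrative.

```python
CLOCK_HZ = 333_000_000   # quoted from memory in the post
BUS_WIDTH_BITS = 16      # memory interface width on this board

# DDR transfers data on both clock edges.
peak_bps = CLOCK_HZ * 2 * BUS_WIDTH_BITS

# Observed hard ceiling at roughly 80% of the theoretical peak.
ceiling_bps = peak_bps * 80 // 100

print(f"peak    = {peak_bps / 1e9:.2f} Gbit/s")
print(f"ceiling = {ceiling_bps / 1e9:.2f} Gbit/s")
```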

There's a KB article that hints at a possible explanation of the 80% in my case, but I never got a confirmation (it talks about losing a cycle due to scheduling every four cycles or something like that, but it doesn't really add up).

I had really hoped to get into the low 90's but never figured out what the problem was so I had to alter my approach.

I ran into similar issues with the FIFO resources.

I did use beginbursttransfer in my design.

So I guess I would expect you to be able to get a much higher efficiency in yours. In mine since one port is basically idle and the other port is as wide as I can make it with long bursts, I'm not sure it is a fair comparison.

I have really only scrutinized the write bandwidth; I read it out much slower, but I'm constantly writing.

-Lance

Altera_Forum · ‎12-22-2015

There is really not enough information provided to help you.

Quite simply, you need to keep in mind that you are using DDR3 RAM. If you only write 32-bit words, there will be a lot of overhead, because DDR3 read/write operations use BL8 bursts. There should also be some logic for the addressing: too many row changes within one bank lead to bad performance. Finally, you will never reach the calculated maximum, because you cannot completely eliminate the latencies between activate, read, write and precharge, and refresh cycles consume time as well.
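To put a rough number on the BL8 overhead, here is a minimal Python sketch. It rests on a deliberately pessimistic assumption: every single-beat write occupies one whole BL8 burst and the controller merges nothing. It also ignores activate, precharge and refresh time, so real efficiency will differ. The function name is hypothetical.

```python
DDR3_BURST_LENGTH = 8  # BL8: eight beats per DDR3 burst


def burst_utilization(useful_bits: int, interface_width_bits: int) -> float:
    """Fraction of one BL8 burst that carries useful data.

    Assumes (simplification) that each transfer consumes a full burst.
    """
    burst_bits = DDR3_BURST_LENGTH * interface_width_bits
    return min(useful_bits, burst_bits) / burst_bits


# 32-bit memory interface, as in the original post.
for word_bits in (32, 64, 256):
    fraction = burst_utilization(word_bits, 32)
    print(f"{word_bits:3d}-bit write: {fraction:.1%} of a BL8 burst used")
```

Under that assumption a lone 32-bit word uses one eighth of a burst and a lone 64-bit word one quarter, which is why longer bursts to consecutive addresses matter so much.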