Hi, I'm trying to maximize the read bandwidth through the High Performance (HP) DDR2 memory controller, using the Stratix III devkit and the included Micron DIMM. I'm getting less than I had hoped for (~50% when doing 8x256b read bursts, with waitrequest backpressure slowing things down). I'll keep trying tricks to coax a few more % of perf out (at the expense of latency, additional buffering, longer bursts, etc.), but I'd really appreciate it if anyone has more insight into what works best with this controller, whether there's a way to use the controller differently, or even how to redesign a module or two of it. (I'm hoping not to have to design my own, or buy another, for this relatively simple need.)

Here's the specific scenario, to help bound this potentially huge question:

- I'm only doing reads (when I care about the perf), no writes. I'd happily sacrifice write perf if necessary.
- Only 1 Avalon master (a DMA arbiter) is requesting all the reads, and it's already 'directly' connected to the memory controller (no burst or width adapters, no bridges or clock crossings, just the SOPC arbitration logic with idle/non-requesting masters).
- I'm using the half-rate controller (for max mem clk freq), which means there's a 256b-wide data bus on the Avalon side (and the controller doesn't use bursts, so the master doesn't use "Avalon burst", although it of course can and will generate sequentially addressed requests).
- I have multiple (large) read streams to fetch from the memory (>5), so I think it should be possible in DDR2 to fetch large enough bursts and arrange the address bits so that bank interleaving can hide the RAS/CAS overheads.
- The consumers of these streams of data are relatively slow, so I'm prefetching and buffering read data in order to get better read bandwidth.
- Based on the docs I've seen, I think the controller only has a 4-deep input request buffer (non-burst!), so I don't think it has enough information to hide the overheads. I'm guessing it just takes each request, activates a new page when needed, and then tries the next request, following all the DDR2 timing rules of course. It seems like this would make it hard to do anything fancy in terms of bank interleaving. At best, it can probably be opportunistic about leaving banks open and hoping a future request hits an open bank again later (but that can cause other problems, so it might be odd to do this by default).
- So, I'm assuming I can increase bandwidth by increasing the sequential address (~burst) length, but that gets more expensive (buffers), less responsive to changes in address (wasted prefetch), and adds latency when 2 or more new addresses show up for new streams. I might also be able to do some multi-bank request interleaving if pages are indeed left open.
- But what I'd really like is to get near the max data rate. If 4-bank interleaving were used to hide RAS/CAS overheads, couldn't the streams (plus some smart address-bit -> bank/row/col address-bit mapping) be used to get close to the max theoretical bandwidth (minus refresh effects) with a reasonably sized burst (e.g. 8x256b or so, depending on mem_clk frequency vs. memory timing parameters)? It might not be the most power efficient, since pages would be getting opened and closed more than necessary, but this seems straightforward... (though maybe not with this Avalon interface?)

Like I said, I'm no expert, so hopefully I'm just missing something. Any tips, insight, achieved perf #s, etc. would be really appreciated. I can use any tips on how to get the last bits of perf from the HP DDR2 memory controller, but I'm hoping someone will tell me I can modify an existing DDR2 controller to give me lower-level sequence control.
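To make the "smart address bits -> bank/row/col mapping" idea concrete, here's a minimal sketch in Python. The geometry and bit widths below are assumptions for illustration (not the actual Micron DIMM parameters, and not how the HP controller maps addresses): the point is just that placing the bank bits immediately above the burst-sized column bits makes consecutive bursts rotate through the banks.

```python
# Hypothetical DDR2 geometry (illustrative only, not the real DIMM):
# 4 banks, 1024 columns, 256-bit (one Avalon word) per column pair.
COL_BITS = 10    # 1024 columns
BANK_BITS = 2    # 4 banks
BURST_WORDS = 8  # 8 x 256b sequential reads per chunk
BURST_BITS = 3   # log2(BURST_WORDS)

def map_address(word_addr):
    """Split a linear Avalon word address into (row, bank, col) so that
    consecutive 8-word bursts land in successive banks, letting the
    controller overlap one bank's ACT/RAS-CAS with another's data phase."""
    col_low = word_addr & (BURST_WORDS - 1)        # within-burst column bits
    rest = word_addr >> BURST_BITS
    bank = rest & ((1 << BANK_BITS) - 1)           # bank bits just above burst bits
    rest >>= BANK_BITS
    col_high = rest & ((1 << (COL_BITS - BURST_BITS)) - 1)
    row = rest >> (COL_BITS - BURST_BITS)
    col = (col_high << BURST_BITS) | col_low
    return row, bank, col

# Consecutive bursts hit banks 0, 1, 2, 3, 0, ... so four streams (or one
# long sequential fetch) naturally interleave across all the banks.
for burst in range(5):
    print(burst, map_address(burst * BURST_WORDS))
```

With this kind of swizzle, even a single long sequential read keeps all four banks busy; whether the HP controller actually exploits that depends on how deep its request pipeline looks ahead.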
Thanks, -Brian PS: using Quartus II 8.0 sp1, sopc builder, stratix III speed grade 3, ... PPS: oh, an explanation of what the HP DDR2 controller does or doesn't do (eg: does it keep page open if intermediate request is to another bank?) would also help keep me from trial/error type of reverse engineering.
If you read sequentially, then you can utilize your DDR2 controller up to 95%. Forget about bursts and so on -- just read ahead of the data until you need it. Take care that the Avalon master clock and the DDR2 clock are the same (sys_clock or something similar)! Regards, Kest
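A back-of-envelope estimate of why sequential full-row reads get into that ~95% range. All timing numbers below are assumptions (typical DDR2-667 values, not taken from the Micron DIMM datasheet):

```python
# Rough efficiency estimate for sequential full-row DDR2 reads.
# Assumed (illustrative) DDR2-667 timing, in memory clocks unless noted:
tCK_ns = 3.0          # 333 MHz memory clock
cols_per_row = 1024   # columns per row
tRP = 5               # precharge time
tRCD = 5              # RAS-to-CAS delay
tRFC = 43             # refresh cycle time
tREFI_ns = 7800.0     # average refresh interval

# A full row streams cols_per_row beats in cols_per_row/2 clocks (DDR).
data_clks = cols_per_row // 2
row_overhead = tRP + tRCD            # close old row, open the next one
row_eff = data_clks / (data_clks + row_overhead)

refresh_eff = 1.0 - (tRFC * tCK_ns) / tREFI_ns
print("row efficiency:   %.3f" % row_eff)
print("incl. refresh:    %.3f" % (row_eff * refresh_eff))
```

So even without any bank interleaving, the row open/close overhead amortizes over 512 data clocks and only refresh takes another couple of percent, which is consistent with the ~95% figure. Short scattered bursts pay that same overhead over far fewer data clocks, which is where the ~50% comes from.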
I would recommend the Avalon Multi-port DDR controller from Microtronix. I managed to read 512x128-bit data in one burst, in 512 clocks (continuous readdatavalid) plus around 25 clocks of read latency. Have a look at http://www.microtronix.com/products/?product_id=92 Best regards avtx30
It's no problem to get near the theoretical RAM bandwidth when performing full-row accesses, as kest mentioned. I understand that the present problem is reading smaller quantities concurrently. My understanding is that the HP controller keeps the state of each individual memory page in the RAM array, so the answer to briane's last question would be yes. But I haven't yet tested HP performance when switching open pages, and I'm not aware of any available options to optimize the behaviour in this regard.
Thanks FvM. That's right, I don't think there are any perf questions when reading 1 continuous block of memory. The question is about efficiently managing multiple concurrent block reads with the HP controller (to still get ~max DDR2 bandwidth with minimal buffering and minimal worst-case latency). I can hope that the Microtronix controller could end up doing better (if I redesign to use its multiple ports most efficiently), but this optimization requirement is specific enough that I believe I could make a more efficient solution with lower-level control. My coworker heard about an (Altera?) class on designing your own memory (DDR2?) controller. Perhaps I'll look into that for the long-term solution.
As long as you have only one slave port on the memory controller, you cannot optimise (hide) the cycles for opening a new bank for read operations. If your reads are randomly spread, you have to live with the lower performance. I did some tests with the Microtronix controller some time ago (in a single-data-rate configuration), and the result was rather disappointing due to clock-domain synchronisation. In some cases it is better to have a slower clock to the memory and no clock-domain synchronisation, instead of high-speed memory access with a big synchronisation penalty. Stefaan
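To illustrate Stefaan's point with arithmetic (all the numbers here are made up for illustration, not measurements of either controller): if every short burst pays a fixed clock-domain-crossing penalty before data flows, a slower synchronous clock can deliver more effective bandwidth than a faster asynchronous one.

```python
# Illustrative comparison: fast memory clock with a per-request
# clock-domain-crossing (CDC) penalty vs. a slower synchronous clock.
# All parameter values are invented for the example.
def effective_bw(clk_mhz, bus_bits, burst_beats, penalty_clks):
    """Bytes/s delivered for short bursts when each request pays a fixed
    synchronisation penalty before readdatavalid starts asserting."""
    clks_per_burst = burst_beats + penalty_clks
    bytes_per_burst = burst_beats * bus_bits // 8
    return clk_mhz * 1e6 * bytes_per_burst / clks_per_burst

fast = effective_bw(200, 128, 8, 12)  # 200 MHz, 12-clock CDC penalty
slow = effective_bw(125, 128, 8, 2)   # 125 MHz, synchronous, small overhead
print("fast+CDC: %.0f MB/s" % (fast / 1e6))
print("slow sync: %.0f MB/s" % (slow / 1e6))
```

For long continuous bursts the penalty amortizes away and the faster clock wins again, so the crossover depends on your typical burst length.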