
NIOS SDRAM performance

Altera_Forum
Honored Contributor II

I have measured the speed of memcpy on the Nios II. I use optimized code consisting of four consecutive reads followed by four consecutive writes. Code snippet:

while (i--) {
    d0 = __builtin_ldwio(pfrom);
    d1 = __builtin_ldwio(pfrom+1);
    d2 = __builtin_ldwio(pfrom+2);
    d3 = __builtin_ldwio(pfrom+3);
    pfrom += 4;

    __builtin_stwio(pto,   d0);
    __builtin_stwio(pto+1, d1);
    __builtin_stwio(pto+2, d2);
    __builtin_stwio(pto+3, d3);
    pto += 4;
}

Compiling this with -O3 yields quite optimal code with four reads into different registers followed by four writes:

    movhi   r7, %hiadj(1048576)    # pfrom
    addi    r7, r7, %lo(1048576)   # pfrom
    movhi   r6, %hiadj(1052672)    # pto
    addi    r6, r6, %lo(1052672)   # pto
    movi    r8, 15                 # i
.L25:
    ldwio   r3, 0(r7)              # d0, *pfrom
    ldwio   r4, 4(r7)              # d1
    ldwio   r5, 8(r7)              # d2
    ldwio   r9, 12(r7)             # d3
    addi    r7, r7, 16             # pfrom, pfrom
    stwio   r3, 0(r6)              # d0, *pto
    stwio   r4, 4(r6)              # d1
    stwio   r5, 8(r6)              # d2
    stwio   r9, 12(r6)             # d3
    addi    r8, r8, -1             # i, i
    cmpnei  r3, r8, -1             # i
    addi    r6, r6, 16             # pto, pto
    bne     r3, zero, .L25
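For reference, here is a self-contained version of the C loop above with the missing declarations filled in. This is only a sketch: the base addresses (0x100000 / 0x101000), the 16 iterations and the wrapper name copy_256_bytes() are inferred from the constants visible in the generated assembly, not taken from the original test program.

#include <stdint.h>

/* Copy 256 bytes, 16 bytes per iteration, using the Nios II GCC built-ins
 * for ldwio/stwio (cache-bypassing 32-bit loads and stores). */
static void copy_256_bytes(void)
{
    volatile uint32_t *pfrom = (volatile uint32_t *)0x100000; /* source (1048576)      */
    volatile uint32_t *pto   = (volatile uint32_t *)0x101000; /* destination (1052672) */
    uint32_t d0, d1, d2, d3;
    int i = 16;

    while (i--) {
        d0 = __builtin_ldwio(pfrom);
        d1 = __builtin_ldwio(pfrom + 1);
        d2 = __builtin_ldwio(pfrom + 2);
        d3 = __builtin_ldwio(pfrom + 3);
        pfrom += 4;

        __builtin_stwio(pto,     d0);
        __builtin_stwio(pto + 1, d1);
        __builtin_stwio(pto + 2, d2);
        __builtin_stwio(pto + 3, d3);
        pto += 4;
    }
}

Because ldwio/stwio bypass the data cache, every load in this loop goes straight to the SDRAM controller.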

The transfer rates seemed too slow, so I investigated further. It turns out that the Nios II has very poor SDRAM read performance because it does not perform back-to-back SDRAM read accesses (though it does for write accesses).

 

Here's a link to an oscilloscope image of a read access: oscilloscope: sdram read (http://dziegel.free.fr/nios2/sdram_read.jpg)

However, write access seems to be fine: oscilloscope: sdram write (http://dziegel.free.fr/nios2/sdram_write.jpg)

 

Note: As you can see in the oscilloscope images, the accesses do not cross an SDRAM row boundary (there is no RAS cycle between reads).

 

Tests were performed on a NIOS 1C20 Development Kit, Project: NIOS2 full_featured.  

 

So my questions are: 

- What are the reasons for the slow read performance? IMHO, the read requests could be executed at the same speed as the write requests.

- Will this behaviour be changed / fixed? 

 

Thank you, 

Dirk
Altera_Forum
Honored Contributor II

Dirk, 

 

Thanks for the good info. I was planning on writing exactly the code you did for moving/copying SDRAM, and I just assumed it would have good performance due to back-to-back reads.

 

Looking at the datasheet, a read should take the number of clocks you have set for CAS latency in most situations (certainly for back-to-back reads on the same row). I count about 12 clocks in your trace!

 

Since SDRAM is so popular for NIOS/SOPC systems this would be an excellent place to focus efforts to increase performance. 

 

It would be nice if someone from Altera could at least comment on this topic.  

 

If Altera has no plans to improve this, I would be willing to fund the improvements if anyone knows how to modify/replace the SDRAM controller (assuming it's legal to modify the SDRAM controller).

 

Does anyone know of a better SDRAM controller that is SOPC Builder-ready?

 

Ken
Altera_Forum
Honored Contributor II

Ken and Altera, 

 

I guess it's not the SDRAM controller's fault; this comes from NIOS :-( . A copy using DMA does not show this behaviour - I see consecutive reads (sorry, no oscilloscope image at hand). Maybe the NIOS data master implementation is suboptimal... I'd really be interested in the technical reason for these delays ;-)

 

Dirk
Altera_Forum
Honored Contributor II

Dirk, 

 

I've definitely found the DMA to be the ultimate data mover as well.

 

While this is OK for some things, you can't, for instance, DMA directly into registers when you need to manipulate data.

 

I guess a workaround would be to DMA small work packets into a small on-chip RAM - sort of like a manual cache. Unless LDW on on-chip RAM has the same problem as it does when reading SDRAM.
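For what it's worth, here is a rough sketch of that DMA workaround using the Nios II HAL DMA API (sys/alt_dma.h). This is purely illustrative: "/dev/dma_0" is an assumed device name that depends on how the DMA controller is named in SOPC Builder, dma_copy() is a hypothetical helper, and error handling is kept minimal.

#include <sys/alt_dma.h>

static volatile int dma_done = 0;

/* Called by the HAL when the receive channel has stored all the data. */
static void copy_done(void* handle, void* data)
{
    dma_done = 1;
}

/* Copy 'len' bytes (e.g. from SDRAM into a small on-chip buffer) via the DMA. */
static int dma_copy(void* dst, void* src, int len)
{
    alt_dma_txchan tx = alt_dma_txchan_open("/dev/dma_0"); /* assumed device name */
    alt_dma_rxchan rx = alt_dma_rxchan_open("/dev/dma_0");

    if ((tx == NULL) || (rx == NULL))
        return -1;

    dma_done = 0;
    if (alt_dma_rxchan_prepare(rx, dst, len, copy_done, NULL) < 0)
        return -1;
    if (alt_dma_txchan_send(tx, src, len, NULL, NULL) < 0)
        return -1;

    while (!dma_done)
        ;   /* the CPU could do useful work here instead of spinning */

    return 0;
}

For very small packets the channel setup overhead can easily outweigh the transfer itself, so this mainly pays off for larger blocks.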

 

Have you looked into LD and ST to and from on-chip memory?

 

Ken
Altera_Forum
Honored Contributor II

Ken, 

 

I didn't look at LD and ST to on-chip memory - I need memcpy performance from the processor in my app. I often copy small amounts of data (~20 bytes), so the setup time of a DMA is no longer negligible compared to the copy duration. And the SDRAM is accessed by custom components, so I really need the data in RAM (cache bypass!). What also worries me is that this behaviour may affect performance in general, since all consecutive data reads (if you work on an array, for example) are affected. I'd really like to see the NIOS data master fixed...

 

Dirk
Altera_Forum
Honored Contributor II

Hi Dirk, 

 

What you're seeing is the result of the Nios data master not being 'latency aware' (the instruction master is, and this allows relatively speedy instruction fetch even with a cache miss). Both master ports on the DMA controller are, and that is why Ken sees the performance he does. In a nutshell, Nios II was really designed to be as simple (small/fast) as possible and deliver best performance when things are cached.

 

However, you raise a valid point with respect to more complex systems that have custom logic or other processors sharing memory -- as such things cannot be cached. I'll have a chat with our CPU expert to see what the penalty for adding latency awareness to the data master would be.

 

In the meantime I have to second the opinions above for either using DMA (which sounds like something you don't want to do), or dedicating small on-chip RAM(s) to your high-speed buffers. The on-chip memories can also be dual-ported, further enhancing performance.

 

PS: Latency aware means that an Avalon master accepts the 'readdatavalid' signal, rather than merely the 'waitrequest' signal as all masters must do.
Altera_Forum
Honored Contributor II

Hello Jesse, 

 

The main reason why we (my colleagues and I) chose an FPGA-based solution is that we needed some custom peripherals that can share memory with a CPU ;-). Additionally, small portions of the shared data have to be copied around in memory; that's what I need memcpy performance for. And my total shared memory is ~2 MB in size, so on-chip RAM won't do.

 

Technically, where do these 12 cycles come from? I'd expect to see 2 or 3 cycles between reads, not 12.

 

But I see your argument (size/speed tradeoff); this is reasonable. For my application, I'd like a wizard option to choose between a small or a fast data master, just as for the NIOS2 core.

 

Dirk
Altera_Forum
Honored Contributor II

Jesse, 

 

Any progress in your talks with your CPU Expert? 

 

This is really important stuff. 2-3 cycles expected vs. 12 cycles is death for many applications. (including mine) 

 

And the biggest/scariest problem is that this info has to come out only after monumental efforts by people like Dirk. 

 

Chapter 16 "Nios II Core Implementation Details" states that the /f core is designed to "Minimize the instructions-per-cycle execution efficiency" and for "performance-critical applications ... with large amounts of code and or/data..." 

 

Then in the Instruction Execution Performance for the Nios II /f Core Table, LOAD has >1 as the number of cycles.  

 

Anyway, "large amounts of code and or/data" means SDRAM these days, and although 12 cycles is technically covered by >1, the enormous penalty for SDRAM access should at least be spelled out. 

 

What other issues like this remain undocumented?

 

We're designing real products with real money and deserve to have this information provided. We can handle the truth; we just need to know what it is.

 

Of course there is also the fact that if you want to compete (or want us customers to compete) with the Nios II processor, we'll need to do much better than 12 cycles per read.

 

Sorry if this comes across as harsh. I have all the respect and appreciation for Altera and the Nios/SOPC group and wish you all and your products the greatest possible success. And I look forward to the day when I ship a Nios based product that kicks butt. 

 

Ken
Altera_Forum
Honored Contributor II

Hi all, 

Regarding what Jesse said, I might be wrong, but as far as I know, whenever you have an Avalon master device you intend to use with the Altera SDRAM controller, that master HAS to use the 'readdatavalid' signal. The 'waitrequest' signal is only used by the SDRAM controller to hold off new requests from the master if there is no more room in the pipeline.

In order to optimize SDRAM access you have to be able to fill the SDRAM pipeline with requests; this means you will have to request data in advance (whether through 'dumb', DMA-like approaches, or 'smart' techniques like caching or instruction/branch prediction). When you do random data reads, optimizing SDRAM access will require help from an operation-dedicated cache controller (which can 'predict' where your next reads will be) or a very big cache. Even if you are not changing the row address, having holes in the pipeline will cause poor performance.

Now, probably an Altera guru can explain this a bit more accurately, but this is how I would explain Dirk's findings: the 12 clock cycles for a read come from the fact that the data cache doesn't request/store words of data from the SDRAM in advance. The cache simply waits for the CPU to request a word, encounters a miss, passes it to the SDRAM controller, gets it back and passes it to the CPU.

So we have probably 3 clock cycles until the request hits the Avalon bus, then we have 3 cycles till it hits the SDRAM chip, 3-4 clocks into the SDRAM (read+CAS delay), +1 clock back to the Avalon bus, +1 clock at least to the CPU, +1 clock next cached instruction, VOILA. 
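Adding those estimates up: 3 + 3 + (3 to 4) + 1 + 1 + 1 ≈ 12 to 13 cycles per uncached read, which is roughly the 12 clocks visible in Dirk's trace.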

Add 3-4 more clocks if you need to change the SDRAM row, depending on clock speed.

While all this happens, the CPU simply waits, doing nothing else.

Now it would be nice if we could fill up the pipeline so we could eliminate the wasted clocks. This can't be done unless the cache controller burst-reads some words in advance (with the added penalty if you end up not using them). It also means that if you do a lot of random reads you will get far worse performance, as you will trigger a burst read for each access when you in fact only need one word here and there.

These issues are not Altera-specific; they will happen in other systems using processors like ARM or x86 variants. The only difference is that those will usually run the SDRAM at 133 MHz or above, so the impact on performance is not so visible. With a Cyclone device, these speeds are hard to achieve.

One thing Altera can do is add burst-type access to the SDRAM controller, add burst-read capability to the data cache, and let the user enable/disable these in SOPC Builder. This way, a user can try both approaches and decide which one fits his/her application best.

Now, I have the feeling I'm forgetting something... Oh yes, why are the writes in Dirk's example taking only one cycle each?

The most reasonable explanation is that it's because of the write-back capability in the cache. The cache controller simply stores all the requests and then commits them to the SDRAM at once, DMA-style. What would be interesting to see here (and we can't see on the timing graphs) is how many cycles pass between the CPU initiating the first write and the time the SDRAM gets the actual write command at the pins.

Hopefully you will get more interesting details from Altera. 

 

Regards, 

C
Altera_Forum
Honored Contributor II

Clancy, 

 

Interesting theory, but I've seen evidence to the contrary. Jesse seems to know the problem, and the fact that the DMA engine doesn't exhibit it pretty much proves the CPU core needn't either.

 

12 clocks on a 50 MHz bus is 240 ns. No competitive processor is going to take that long to do carefully hand-coded back-to-back consecutive reads from PC100 SDRAM.

 

RAS + CAS for this part should be something like 30-50 ns for the initial access, then 1 clock per additional access in the same row.

 

As I've written before, I've seen the DMA master read 480 32-bit words out of an on-chip FIFO and put them into SDRAM in about 485 clocks. That is writing (and reading!), but no way is 12 clocks per read what we should expect.

 

An interesting trace would be a DMA from SDRAM to on-chip memory. This would show us what Dirk's traces should look like.

 

Ken
Altera_Forum
Honored Contributor II

Clancy, 

- I'm not doing random reads. My tests were performing a memcpy; these are consecutive reads, so there is no RAS cycle except when the bank/row needs to be changed. 

- I use stwio/ldwio instructions. These are cache-bypass instructions, so the data cache should have no performance impact. There cannot be a cycle penalty "until a request hits the Avalon bus". 

- There are almost no cycle penalties (I guess one or two) to get requests out to the SDRAM. You can see consecutive SDRAM writes, so all cycles necessary to get a request from NIOS to the SDRAM are "hidden" in the time the SDRAM controller needs to select the bank/row and initiate the write (the SDRAM controller has a pipeline that can store a few requests). 

- If a request is passed to the SDRAM, it needs one cycle to crank out the response, at least if no bank/row needs to be changed: just put the new address on the bus and set the control signals (RAS, CAS, CS, WE) to READ. 

- The next instruction (no matter whether it is cached or not) is already in the pipeline (which is stalled until the response arrives), so there is no cycle penalty to get the next instruction (at least if NIOS does not flush its pipeline, but IMO there should be no reason for that).

Additionally, we have been in contact with Altera support about this issue for more than 3 months (through our support here in Germany) and they still have not given us an official statement about the reason for this or a planned fix...

 

Dirk
Altera_Forum
Honored Contributor II

Hi Dirk, 

 

I was trying to think why I didn't suspect the SDRAM/SDRAM controller, and I remembered why.

 

When I went to a Nios class the instructor said the new SDRAM controller provided one-clock-per-word performance. I asked him why, then, cache was important (one clock is one clock, right?). He said that was a good question and took my name and number to get back to me. (Still no call.)

 

I also found the shiny brochure for one of my 3 devkit boards and it says this: 

 

"Enhanced SDRAM Controller" 

 

"The NIOS SDRAM controller has been enhanced to support pipelined data transactions; it provides single-cycle access to low cost single data rate (SDR) SDRAM devices" 

 

I'm not as hardware-centric as I would like to be, but these two statements tell me not to expect 12-cycle access. Perhaps it's a matter of semantics and the fault is my misunderstanding of the exact meaning of the terms?

 

Still, the bottom line is: how do we get decent SDRAM performance? A new version of the SDRAM controller? Secret .ptf settings? A 3rd-party controller?

 

Any ideas? 

 

Thanks, 

Ken
Altera_Forum
Honored Contributor II

Hello Ken, 

 

From my point of view there is nothing we can do. I think the SDRAM controller is fine; it has a small pipeline to store requests. The main problem is the NIOS data master: as Jesse said, it's not latency aware. Looking at the Avalon spec, this means NIOS cannot "enqueue" multiple read requests into the SDRAM controller pipeline; it has to wait until a request has been processed. I still don't understand why this takes 12 cycles - maybe there is more overhead involved in the NIOS pipeline (a flush??? hopefully not).

But I can imagine that adding latency awareness to NIOS is a very intrusive change to the processor design. It means NIOS would need to be able to analyse dependencies between instructions ("this ldwio instruction does not depend on the previous ldwio, so it can safely be executed"). It also implies the "memory" stage of the NIOS pipeline must be able to hold multiple queued requests and complete them as the SDRAM delivers data (the memory pipeline stage can be active while the rest is stalled). I can imagine that this is expensive in both logic elements (a config option?) and design "intrusion", since Altera would have to partly redesign the pipeline.

This is my guess about why Altera is so "quiet" about this issue. But these are just the reasons for the lack of performance that I can imagine; it may be something else as well. I hope to get confirmation from Altera about this some day.

 

It's reasonable to optimize the CPU to be small and work well when things are cached. But IMHO applications with custom components that share RAM with the CPU are not a corner case for this kind of FPGA system, so this should be a config option. You can't get that capability as elegantly anywhere else for this price and effort - the only processor I found that is able to share SDRAM out of the box is the IBM PowerPC with its external bus master feature. But even the smallest PowerPC (133 MHz) was too powerful and expensive for my application. And IBM targets >500 MHz in the future, not <100 MHz.

 

I have only one "bad hack"&#153; idea that could do something about it - use knowledge about the data cache for copying by using normal cached read instructions, but invalidating the cacheline(s) before reading may speed things up. The cache is AFAIK latency aware, so it can quickly retrieve the data from SDRAM, and NIOS can get it in full speed from cache. However, I won&#39;t try that in the near future, I am too busy developing my application. 

 

Dirk
Altera_Forum
Honored Contributor II

Hi guys, 

 

Sorry, I don't mean to ignore the conversation here or keep quiet about it -- we are rapidly approaching our next Nios/Quartus/SOPC Builder release and have the associated time crunch to deal with. I will try to post something more useful early next week.

 

There are several recently introduced but not-yet-documented Avalon features I want to discuss... this won't solve the immediate problem that Dirk presents (successive loads from SDRAM where the cache misses every time), but it will be of assistance in complex (multi-master) systems where getting the best memory bandwidth is key. Additionally, our aforementioned next release has several more features (and documentation!) that will speed things up further (sorry, latency awareness on the CPU data master isn't one of them... but as I say, we will be giving this a serious look).
Altera_Forum
Honored Contributor II

Hi Guys, 

 

I think that I know what your problem is here. Do you have your code stored in the same SDRAM as the data? If so, the SDRAM controller opens a bank and reads the data, then opens another bank and reads the next bit of code. You can fix this in several ways:

 

1. Put your code somewhere else.  

 

2. If you have an instruction cache and your code is in a loop, this should be OK the nth time through the loop (where n != 1).

 

3. The SDRAM can have multiple banks open; if the data and code are in different banks you should still get fast performance. Unfortunately the SDRAM controller from Altera does not support this and will always close the bank rather than leaving it open when the new address is in another bank. The SDRAM controller needs to be quite a bit more complex to take care of this. We wrote one, but not for Avalon. You could write your own; it took us about 2 months to do this. I can't distribute it as it is the property of my old company.

 

I could be wrong about this being the cause of your 12 cycles, but opening a bank and doing a read is about 5 cycles, and the next read should be 1 cycle. Changing banks (if the bank is open) is, I think, 2 cycles, i.e. a saving of 3 cycles per read, so 12 cycles reduced to 6 (3 for the data read and 3 for the next instruction read).

 

Good Luck.
Altera_Forum
Honored Contributor II

If I read Jesse right, the problem is in Nios' data master. It is not "latency aware", meaning it does not monitor 'readdatavalid'.

 

Soooo, with only static timing at its disposal, the data master must use the same absolute worst-case timing for each and every read.

 

Privately I've been shown that read performance is still a dismal 5-6 clocks from initial access to 'readdatavalid', even when accessing the same row back-to-back.

 

I'd really like someone to explain the "single cycle access" statements that are given in Nios classes and docs. Better yet, make the statements come true.

 

Ken
Altera_Forum
Honored Contributor II

Hello to Ken and the other guys discussing this topic. 

 

Here is the link to the image that Ken mentioned: http://www.entner-electronics.com/images/nios2sdram_with_explanation.jpg

 

As you can see, there is also a delay when writing to the SDRAM: Altera's SDRAM controller has 2 write buffers. Therefore the first 2 writes operate at full speed, that is, 2 cycles per write. Then wait states are inserted until one of the two write buffers becomes free for the third write, and so on. If you had 8 back-to-back writes instead of 4, you could also see this on the SDRAM signals.

 

When reading, things become worse: here the latency of the SDRAM and of the SDRAM controller take full effect. The Nios II core itself also requires several cycles, so even with internal SRAM you have about 4 cycles per read (I looked at it, but do not remember the exact number; maybe it was 3, more likely 5 or 6...).

 

SDRAM controllers are a topic I could discuss for hours, so I will try to keep it short (many things were already mentioned before):

- The Altera controller always keeps only one bank open (you can see in the diagram that the writes are in bank 1 and the reads in bank 0; they get precharged anyway. This is very conservative; at least it could have activated bank 1 before precharging bank 0. On the other hand: what do those 3 cycles help when it needs about 50 for reading 4 words...).

- The controller has 2 write buffers, which will help a lot in many applications. 

- The 11 or 12 cycles per read are with the program running from ANOTHER memory or cache.

- I have not checked it, but I suppose that reading the program memory is much more efficient (and more important in most cases).

- Making the data master latency-aware would be tough: it would need to guess what data will be read next by the program and preload it into a buffer / small cache.

 

Do not forget that Altera cannot have only performance in mind, but also LC count. A design that is very fast but requires e.g. 6,000 LCs would not help much either. We are talking about a Nios II with about 2,000 LCs (core + SDRAM controller), not about an Athlon 64 with I don't know how many million gates. Somewhere there will be performance bottlenecks.

 

What can we do? 

- Increase the clock-rate 

- Use DMA 

- Solve the specific problem with our own logic (nice, we have an FPGA...)

 

I will most likely design an SDRAM and a DDR-II controller with an interface for very fast video transfers (or other streaming things) and a Nios II Avalon interface within the next few months. But I do not think that I will address this specific issue, as I will use the "fast video interface" for the tasks that require maximum performance. If there is interest, I could also offer it as IP for Nios II (but not for free ;-).

 

Regards 

 

Thomas 

www.entner-electronics.com (http://www.entner-electronics.com)
Altera_Forum
Honored Contributor II

Hello Jesse, 

 

Is there any progress regarding this topic?

 

Dirk
Altera_Forum
Honored Contributor II

I am sorry, no, I have not forgotten about this. I have one other piece of "homework" that I have to write up at the moment (for my real job), and then I can get to work on a screed here.

 

Topics to be covered are: why the data master is not latency aware, how to improve SDRAM performance when multiple masters access it simultaneously, using the DMA controller in an efficient manner to (hopefully) alleviate part of the original poster's pain, and as much of a sneak preview as I can give (without getting in trouble) of new features that are coming in the next Quartus/SOPC/Nios release, which we are finishing up now and which will be available in the coming weeks.
Altera_Forum
Honored Contributor II

This thread seems to continue here: ddr vs. sdr ram... (http://www.niosforum.com/forum/index.php?act=st&f=2&t=797&st=0)
