Nios® V/II Embedded Design Suite (EDS)

DDR vs. SDR RAM...

Altera_Forum
Honored Contributor II
2,350 Views

Does anyone know whether DDR RAM will fare better with the Nios II data master than SDR SDRAM?

 

After Dirk figured out that each SDRAM read takes a whopping 12 clocks at 50 MHz (240 ns) when not using DMA, we realized we had to respin our board. Jesse explained that the problem is that the Nios II data master is not latency-aware and so must use worst-case timing.
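The arithmetic behind that figure can be sketched in a few lines of C. This is only an illustration: the function names are made up for this example, and the 4-byte word size is an assumption about a 32-bit data master.

```c
#include <assert.h>

/* Time in nanoseconds taken by `cycles` clock ticks at `freq_mhz` MHz. */
double cycles_to_ns(unsigned cycles, double freq_mhz)
{
    assert(freq_mhz > 0.0);
    return cycles * 1000.0 / freq_mhz;
}

/* Sustained read rate in MB/s (10^6 bytes/s) when every read of
 * `bytes_per_read` bytes costs `cycles` ticks. Bytes per microsecond
 * is numerically the same as MB/s. */
double read_rate_mbps(unsigned cycles, double freq_mhz, unsigned bytes_per_read)
{
    assert(cycles > 0);
    return bytes_per_read * freq_mhz / cycles;
}
```

With these helpers, cycles_to_ns(12, 50.0) gives the 240 ns quoted above, and read_rate_mbps(12, 50.0, 4) comes out at roughly 16.7 MB/s for 32-bit reads.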

 

I'm wondering if DDR would fare better. I'm not familiar with it at all. 

 

Anybody know? 

 

I'm looking for a bulk memory that is also high performance with the NiosII. 

 

Thanks, 

Ken
27 Replies
Altera_Forum
Honored Contributor II
111 Views

Hello, 

 

Maybe a configuration option could be added that reduces fmax but increases memory throughput? Just from reading the above explanations, a few optimizations may be possible:

 

1) 1 tick for cache miss - if I use ldwio and friends, I know in advance this is a cache miss 

2) Combine "prepare read / read signals asserted" to one tick 

3) Squeeze one tick out of SDRAM controller ("SDRAM-controller needs 3 clocks to assert CAS after chip-selected internally") 

4) Let the SDRAM controller read a few bytes in advance if its "read job queue" is empty ("speculative prereading"). This would at least accelerate memcpy and restoring context from the stack (but of course not random reads).
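To make idea 4) concrete, here is a toy cycle-count model in C. It is a sketch under loud assumptions, not a description of the real SDRAM controller: the 12-cycle cost is the figure reported in this thread, the 6-cycle cost for a preread hit is purely hypothetical, and the model assumes the controller is always idle long enough to finish fetching the next word.

```c
#include <assert.h>
#include <stddef.h>

enum {
    FULL_READ_CYCLES   = 12, /* per-read cost reported in this thread */
    PREREAD_HIT_CYCLES = 6   /* hypothetical cost when the word was preread */
};

/* Total cycles to read the word addresses addr[0..n) when the controller
 * speculatively fetches word (a + 1) after every read of word a. */
unsigned long cycles_with_preread(const unsigned *addr, size_t n)
{
    unsigned long total = 0;
    unsigned preread_addr = 0;
    int have_preread = 0;

    assert(addr != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (have_preread && addr[i] == preread_addr)
            total += PREREAD_HIT_CYCLES;  /* word is already waiting */
        else
            total += FULL_READ_CYCLES;    /* normal full-latency read */
        preread_addr = addr[i] + 1;       /* controller reads ahead */
        have_preread = 1;
    }
    return total;
}

/* Baseline: every read pays the full latency. */
unsigned long cycles_without_preread(size_t n)
{
    return (unsigned long)n * FULL_READ_CYCLES;
}
```

In this model a sequential walk (memcpy, stack restore) pays the full latency only on its first word, while a random pattern gains nothing, which is the behaviour described in 4).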

 

I'd guess the extra clock ticks in 2) and 3) are there just to achieve a higher fmax, but without having the Nios source code and taking a closer look at it, this is of course wild guessing... But if you can get three ticks out of there, you get close to that "X-brand" processor ;) Depending on how much fmax you would have to give up for that (e.g. I only need 60-70 MHz), this would be a helpful optimization.

 

Are you discussing internally whether to make a few optimizations on this issue?

 

Dirk
Altera_Forum
Honored Contributor II
111 Views

Hi Thomas, 

 

Are you saying Jesse's timetable doesn't apply to on-chip RAM? He didn't qualify it as such. If it does apply, then on-chip reads should take 7 clocks. (I would expect the LDxIO instructions to do it in 6 or even 5 if they can also skip the align.) So DMA'ing to on-chip RAM seems of little help unless that on-chip RAM is the cache itself.
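For anyone who has not used the LDxIO forms from C: a minimal sketch, assuming a GCC Nios II toolchain that provides the `__builtin_ldwio` builtin and defines `__nios2__` (check your toolchain's documentation). The wrapper name `read_word_uncached` is made up for this example, and the `#else` branch is only a host stand-in so the code compiles off-target. Note that this only bypasses the data cache; it says nothing about how many clocks the access takes.

```c
#include <assert.h>
#include <stdint.h>

/* Read one 32-bit word while bypassing the Nios II data cache. */
uint32_t read_word_uncached(const volatile uint32_t *addr)
{
    assert(addr != 0);
#if defined(__nios2__)
    /* On target: the GCC builtin emits an ldwio instruction. */
    return (uint32_t)__builtin_ldwio(addr);
#else
    /* Off target (assumption for host builds): plain volatile load. */
    return *addr;
#endif
}
```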

 

I've wondered about a custom instruction interface to bypass the Avalon bus. I'm not sure how, or whether, it would be integrated into the memory map, but even if it were a rogue interface it would be well worth it to gain 1-clock access, at least to on-chip RAM. Maybe a smart guy like you could do this :)

 

Ken
Altera_Forum
Honored Contributor II
111 Views

Hi Ken, 

 

You are right: if you are using the /f core, internal RAM accesses that miss the cache are slow as well. As Jesse pointed out, using the /s core would remove some clocks of delay because it has no data cache, so there is no need for a cache check. Still, you cannot achieve 1 or 2 cycles, even with internal RAM.

 

Using custom instructions would be the quickest way, but also a more complicated one. Achieving 1-cycle accesses will be tricky, but it is possible for streaming data. For random accesses you need a second cycle, because the internal SRAM is always registered (at least on Cyclone).

 

If you would like me to design it for you, feel free to send me an email.

 

Regards, 

 

Thomas
Altera_Forum
Honored Contributor II
111 Views

The Nios II/f is optimized to run at 145 MHz in fast Stratix devices and even higher in Stratix II. 

It achieves this performance by giving preference to cache hits over cache misses (a common CPU design technique). 

This also takes advantage of the relatively fast speed of RAMs on FPGAs. 

However, as you have discovered, when you miss in the D-cache or bypass it, it takes a significant number of cycles for the load/store instruction to execute. This is required to maintain the 145 MHz design goal because of the relatively slow speed of logic, muxing, and wires in an FPGA.

 

Unfortunately, if you don't need 145 MHz, you still have all the extra cycles of latency.

I'd love to do a version of Nios II that is optimized for latency instead of Fmax.

It is just a matter of development priorities. 

 

One thing that might help some customers is the new multiple clock domain support in the Quartus II 4.2 version of SOPC Builder.

You can now build a system in which the CPU runs at a high frequency and other components run at a low frequency.

Of course, for good performance, your memory controller needs to run at the same frequency as the CPU, because crossing clock domains adds several cycles of latency.

 

We have ideas for ways to reduce the latency of accesses to on-chip memories that I'd like to see in a future product release. We'll let you know if it happens.
Altera_Forum
Honored Contributor II
111 Views

Hi James, 

 

I would vote for absolute performance over fmax. I'm trying to think of an embedded app that would revisit old data (and so get a data cache hit). Whether you're processing audio, video, network packets, telemetry data, or some other data, why would you need to process it twice? (I'm sure there are some examples, I just can't think of any.)

 

In the meantime, is there any relief? Is there any way to DMA into the cache? How about read-ahead caching? It's very typical to run through a work packet of data in order.

 

Can we at least get rid of the cache miss penalty for the LD*IO instructions? 

 

Thanks, 

Ken
0 Kudos
Altera_Forum
Honored Contributor II
111 Views

Ken, I can't divulge future product plans, but let me just say that relief is in sight...

Altera_Forum
Honored Contributor II
111 Views

Hi James, 

 

We'd be happy to beta test any ideas you implement. We've had some success using a custom instruction to access memory. (In our first case it reads from an external FIFO, but we'll be adding more CI interfaces as we move forward.)

 

We were DMA'ing the FIFO into SDRAM at an amazing 1 clock per word, but getting the words out of SDRAM to operate on them killed us. Now we can CI a word into the Nios II in 2 clocks to operate on the data - mucha bedda.
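For readers wondering what the software side of such a CI read might look like, here is a rough sketch, not our actual code. It assumes the Nios II GCC builtin `__builtin_custom_in`, which issues a custom instruction that takes no operands and returns an int; the index `FIFO_READ_CI_N`, the names `fifo_pop` and `fifo_read_block`, and the host stand-in that returns 1, 2, 3, ... are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIFO_READ_CI_N 0 /* hypothetical custom-instruction index */

#if defined(__nios2__)
/* On target: one custom instruction pops one word from the FIFO. */
static inline int32_t fifo_pop(void)
{
    return __builtin_custom_in(FIFO_READ_CI_N);
}
#else
/* Host stand-in (assumption): pretend the FIFO holds 1, 2, 3, ... */
static int32_t fake_fifo_next = 1;
static inline int32_t fifo_pop(void)
{
    return fake_fifo_next++;
}
#endif

/* Pull n words from the FIFO into dst without an SDRAM round trip. */
void fifo_read_block(int32_t *dst, size_t n)
{
    assert(dst != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        dst[i] = fifo_pop();
}
```

The point of the design is that the data arrives in a CPU register through the custom instruction port, so the data master and its worst-case read timing are never involved.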

 

We're busy bringing up our new custom Stratix/Nios II board, but soon we'll be back to maximizing firmware execution speed on the Nios II.

 

If anybody else is interested in this CI approach, we'll soon be sharing some of our results and who helped us attain them (if you can't already guess :).

 

Ken