
Nios II DDR SDRAM read latency

Altera_Forum
Honored Contributor II

Hi,  

I’m trying to optimize RAM access from Nios II.  

 

When reading 10'000 successive words from RAM using IORD in a for-loop, it takes 25 clock cycles per read (without loop overhead). To me, this looks far too slow; I would expect something around 10cc for a random read access! 
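Roughly, the read loop looks like this (a simplified sketch, not my exact code; ALTMEMDDR_BASE is the DDR controller's base address from system.h, and I time the whole loop with a cycle counter):

/* Simplified sketch of the timed read loop (not the exact project code).
 * ALTMEMDDR_BASE comes from system.h; IORD bypasses the data cache. */
#include "io.h"
#include "system.h"
#include "alt_types.h"

static alt_u32 read_10000_words(void)
{
    alt_u32 sum = 0;
    int i;
    for (i = 0; i < 10000; i++) {
        sum += IORD(ALTMEMDDR_BASE, i);  /* uncached 32-bit read of word i */
    }
    return sum;  /* use the result so the loop is not optimized away */
}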

 

I looked at many different posts and various documentation, but I can't figure out whether this is the best I can achieve or whether I'm doing something wrong. 

 

Please give me some feedback: what minimal read latency is achievable when reading DDR SDRAM from a Nios II? 

(Note: In practice I read single words, so DMA is not an option.) 

 

 

Some additional info: 

  • I'm using a Cyclone III device with a Nios II/f with a 4 KByte instruction cache.  

  • The external RAM is a Micron DDR SDRAM (MT46V32M16BN-6IT:F).  

  • I use the Altera DDR SDRAM High Performance Controller II with ALTMEMPHY.  

  • The memory clock is set to 90 MHz, and Nios - as well as all the other components in the SOPC design - uses the altmemddr_sysclk.  

  • I use Quartus 13.1.4 with Qsys (the latest version to support Cyclone III).  

  • Compiler optimization in the Nios EDS is set to maximum (Level 3).  

 

HPC II (High Performance Controller II) settings:  

  • tCAS = 2cc  

  • tRAS = 42ns (4cc)  

  • tRCD = 18ns (2cc)  

  • tRP = 18ns (2cc) 

     

 

 

Best Regards 

Simon
Altera_Forum
Honored Contributor II

Hi Simon, 

 

I would suggest using memcpy instead of IORD. You could read data in blocks, considerably reducing the latency involved. I understand that you are using single words, but it would be more efficient to load a block of data from DDR, store it in an array, and use each word from that block. 
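Something along these lines (just a sketch to illustrate the idea; BLOCK_WORDS, the buffer, and sum_block are placeholder names, and ALTMEMDDR_BASE is assumed to be the DDR controller's base address from system.h):

/* Sketch: copy a block from DDR into a local buffer once, then read
 * individual words from the buffer instead of issuing single DDR reads. */
#include <string.h>
#include "system.h"
#include "alt_types.h"

#define BLOCK_WORDS 256                 /* arbitrary block size */

static alt_u32 block[BLOCK_WORDS];      /* ideally placed in on-chip RAM */

static alt_u32 sum_block(alt_u32 ddr_word_offset)
{
    alt_u32 sum = 0;
    int i;

    /* One large, sequential (burst-friendly) transfer instead of many single reads. */
    memcpy(block, (const void *)(ALTMEMDDR_BASE + ddr_word_offset * 4), sizeof(block));

    for (i = 0; i < BLOCK_WORDS; i++) {
        sum += block[i];                /* fast accesses to the local copy */
    }
    return sum;
}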

 

Best regards, 

Thiago
Altera_Forum
Honored Contributor II

Hi Thiago, 

 

Of course, reading a large block of data into a local array and then going over it is more efficient - assuming you are willing to provide enough on-chip memory. 

But that is not the point; it's more of a general question: is the memory controller really this terribly slow, or am I missing something? 

 

Best Regards 

Simon
Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

Is the memory controller really so terribly slow 

--- Quote End ---  

In short - yes - when you're doing single, random accesses. 

 

However, that is not how DDR memory is intended to be used. To get greater - far greater - access bandwidth, you must access multiple consecutive values with a single command. As Thiago suggests, accessing blocks of memory will improve your access bandwidth vastly. So, to use DDR efficiently you need to cache chunks of memory from it and then work from the cache. Latency to your cached data will be a couple of clock cycles at worst. 

 

If you genuinely need single, random accesses to external memory, then you need to consider SRAM or SSRAM. At 29 clock cycles per DDR access, I suspect you'll find either of these will be just as quick. 

 

Regards, 

Alex
Altera_Forum
Honored Contributor II

I found an Altera user guide (http://ridl.cfd.rit.edu/products/manuals/altera/user%20guides%20and%20appnotes/external%20memory/emi_ddr_ug.pdf) for the setup I use. There is a "Latency" chapter that states that the HPC II with Cyclone III has a total read latency of 19 clock cycles. 

 

So I know I'm not too far off B) 

 

As you already pointed out, moving data into a cache before working on it is of course much more efficient. But I will have to see what I can do in this direction... 

 

Thanks 

Simon
Altera_Forum
Honored Contributor II

How many clocks does your loop take without the IORD() ? 

How many does it take if you do the IORD() twice in each loop (with nothing in between)? 

Or try using SignalTap to trace the actual cycles. 
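If you don't already have a cycle counter wired up, the HAL timestamp driver is one way to do the measurement (a sketch, assuming a timestamp timer is configured in the BSP; converting ticks to clock cycles depends on which timer you use):

/* Timing sketch using the HAL timestamp driver. */
#include <stdio.h>
#include "sys/alt_timestamp.h"
#include "alt_types.h"
#include "io.h"
#include "system.h"

static void time_reads(void)
{
    volatile alt_u32 dummy;
    alt_u32 start, end;
    int i;

    alt_timestamp_start();              /* reset and start the timestamp timer */
    start = alt_timestamp();
    for (i = 0; i < 10000; i++) {
        dummy = IORD(ALTMEMDDR_BASE, i);    /* the reads being measured */
    }
    end = alt_timestamp();

    (void)dummy;
    printf("10000 IORD took %lu ticks\n", (unsigned long)(end - start));
}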

 

I've a note that says most reads from SDRAM took 16 clocks, but some took 12. 

Writes were more interesting: single writes took 2 clocks, but the third one took much longer. I assume a two-stage 'pipeline' of some sort. 

DDR will only be worse than SDRAM.
Altera_Forum
Honored Contributor II

Hi dsl, 

 

Sorry for the delay and thanks for the good idea :) 

 

- For IORD it takes 29cc + 25cc for every additional IORD I add in the loop  

- For IOWR it takes 6cc + 2cc for every additional IOWR I add in the loop (constant increment for 3, 4, or 5 additional IOWRs) 

 

- With an empty loop it takes 1cc for 10'000 iterations => meaning the whole loop gets optimized away and the overhead of the (custom-made) timestamp is 1cc. 

 

So I guess I have to subtract 4cc for loop overhead in my original post :cool:
Altera_Forum
Honored Contributor II

If you are trying to time things, it is always worth looking at the generated code - just to check the compiler has generated the instruction sequence you think you are measuring. 

I've measured the number of clocks for the Nios CPU; the only undocumented stall I've seen is for a read of tightly coupled memory immediately following a write to the same memory block (the same probably applies to the data cache). 

 

I suspect your writes are repeatedly updating the same bytes in a 'pending line' register. 

If you do random writes the 3rd and later probably have longer delays. 

 

I think that what you are seeing is a side effect of the access times of DRAM memory not really having improved very much (if at all). What has improved is the bandwidth for sequential access bursts. 

Work out how many clocks a full DDR burst read takes - you might find it is much nearer to the clock count you are seeing.
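As a rough sanity check with the numbers already in this thread (an estimate, not a measurement): the HPC II user guide quoted above gives about 19 clock cycles of controller read latency on Cyclone III, and the Avalon interconnect plus the Nios II load itself add a few more, so the 25-29 cycles you measured per isolated IORD are roughly what I'd expect.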
Altera_Forum
Honored Contributor II

Hey dsl, 

 

 

--- Quote Start ---  

I suspect your writes are repeatedly updating the same bytes in a 'pending line' register. 

If you do random writes the 3rd and later probably have longer delays. 

--- Quote End ---  

 

 

The code is as follows: 

 

// Write to RAM
time1 = GET_TIMESTAMP;
for (i = 0; i < 10000*5; i+=5) {
    IOWR(ALTMEMDDR_BASE, i, i);
    IOWR(ALTMEMDDR_BASE, i+1, i);
    IOWR(ALTMEMDDR_BASE, i+2, i);
    IOWR(ALTMEMDDR_BASE, i+3, i);
    IOWR(ALTMEMDDR_BASE, i+4, i);
}
time2 = GET_TIMESTAMP;
printf("10'000 IOWR = %i\n", time2 - time1);

// Read from RAM
time1 = GET_TIMESTAMP;
for (i = 0; i < 10000*5; i+=5) {
    IORD(ALTMEMDDR_BASE, i);
    IORD(ALTMEMDDR_BASE, i+1);
    IORD(ALTMEMDDR_BASE, i+2);
    IORD(ALTMEMDDR_BASE, i+3);
    IORD(ALTMEMDDR_BASE, i+4);
}
time2 = GET_TIMESTAMP;
printf("10'000 IORD = %i\n", time2 - time1);

 

Of course, this is far from random. With truly random access, where columns and rows change between each access, both read and write timing will increase significantly. 

I just thought it was not fair to count the 4cc of the for-loop as IORD or IOWR time. 

 

My goal is not to track down each and every clock cycle. I just wanted to get the overall picture. 

 

In my real (unoptimized) project the situation is much worse. The following line takes about 300cc to execute: 

 

memcpy(my_int, (alt_u8*)unaligned_address, 4); 

 

This is due to several factors: 

- My data is not word-aligned in RAM, so memcpy does multiple individual reads. 

- My Nios and RAM run at different clock frequencies, introducing slow clock-crossing bridges. 

- My instruction code, as well as the stack and heap, is stored in the same RAM (same bank, different columns), so there are additional reads just to fetch the instructions... 

 

This is the reason I started looking into "what is the lowest latency for reading from SDRAM" in the first place :-P
Altera_Forum
Honored Contributor II

The general answer is to align your data :-) 

The code will be cached; I presume you have a data cache for most data accesses. 

memcpy() will be optimised for long aligned copies. 

If you do need to read misaligned items from uncached memory, you want to do two 32-bit aligned reads and then some shifts and masks.
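A sketch of that last point (assuming a little-endian Nios II and a region read through IORD; read_unaligned_32 is just an illustrative helper, not a HAL function):

/* Read a misaligned 32-bit value using two aligned reads combined with shifts. */
#include "io.h"
#include "alt_types.h"

static alt_u32 read_unaligned_32(alt_u32 base, alt_u32 byte_addr)
{
    alt_u32 word  = byte_addr >> 2;            /* aligned word containing the first byte */
    alt_u32 shift = (byte_addr & 3) * 8;       /* offset of the first byte, in bits */
    alt_u32 lo, hi;

    lo = IORD(base, word);                     /* first aligned 32-bit read */
    if (shift == 0)
        return lo;                             /* already aligned: one read is enough */

    hi = IORD(base, word + 1);                 /* second aligned 32-bit read */
    return (lo >> shift) | (hi << (32 - shift));
}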