Nios® V/II Embedded Design Suite (EDS)
Support for Embedded Development Tools, Processors (SoCs and Nios® V/II processor), Embedded Development Suites (EDSs), Boot and Configuration, Operating Systems, C and C++

NIOS with DDR as Main Memory

Altera_Forum
Honored Contributor II

Hello everyone, 

 

I am attempting to use the DDR memory as the main memory for the NIOS core. As such, I decided to enable burst transfers for the processor. I have a couple of questions regarding the burst transfers: 

 

1. Are there any constructs to ensure that an arbitrary piece of data (for example, a variable of array type) will be transferred in burst mode? 

- This question relates to blocks of data that I expect to write to the DDR memory. Currently I am using a for loop, but I am not sure whether the compiler is smart enough to burst the data in order to increase throughput. Perhaps there is a construct that will give the compiler a hint? 
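For concreteness, the fill loop I am using looks roughly like this (the array name and block size are illustrative):

```c
#include <stdint.h>

#define N 1024                  /* illustrative block size in words */

/* Naive sequential fill of a buffer in DDR: each assignment compiles
 * to an individual word-sized store on the Avalon bus - plain C has
 * no construct that forces the stores to be issued as a burst. */
void fill_block(uint32_t *ddr_buf)
{
    for (int i = 0; i < N; i++)
        ddr_buf[i] = (uint32_t)i;
}
```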

 

2. What if I instead use a standard RAM block for the CPU's main memory, but connect the DDR SDRAM to the CPU's data master as a separate slave? Does the compiler know how to optimize the data transfers, especially if I am writing arrays of data into memory, which would naturally lend themselves to bursting? 

 

Any insight would be very useful and would help me increase performance. The previous alternative was to disable burst transfers to the DDR, which resulted in millions of cycles just to write 4096 bytes of data to the DDR.  

 

Thanks
7 Replies
Altera_Forum
Honored Contributor II

The compiler won't do anything special; all writes from the Nios II CPU's instruction unit are single words. 

It would be normal to use the data cache for external memory (DDR or SDRAM); cache line writes are likely to be burst transfers. 

 

I also believe the Avalon slave interface to DDR/SDRAM will merge writes to adjacent locations into a burst transfer to the memory itself. Certainly the first 2 writes are 'posted' (i.e. the data and address are latched and the Avalon bus cycle completes before the memory cycle starts).
Altera_Forum
Honored Contributor II

Also, have you compiled everything (including the BSP and libc bits) with -O2 or -O3? Without those flags there will be a lot of cycles spent accessing the stack (which will almost certainly also be in your DDR memory). 

Even with a data cache the stack will probably be displaced from the cache by your sequential memory accesses.
Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

Also, have you compiled everything (including the BSP and libc bits) with -O2 or -O3? Without those flags there will be a lot of cycles spent accessing the stack (which will almost certainly also be in your DDR memory). 

Even with a data cache the stack will probably be displaced from the cache by your sequential memory accesses. 

--- Quote End ---  

 

 

 

I have not tried modifying the compiler flags yet. Would it be better to connect the DDR as just another slave on the CPU's data master while using internal memory for the instructions?
Altera_Forum
Honored Contributor II

If your application is small enough (which probably means you aren't using any Altera library functions!) it will run faster from internal memory (use tightly coupled instruction memory) - much like having an instruction cache that is pre-filled with the code. 

Similarly, use tightly coupled data memory for 'normal' data. 

However, you may well need to write your own linker script and decide how the code/data will be downloaded to get it into the correct areas. 

You also need to link the .rodata into the data section, not the code (don't use gcc4 - Altera moved switch-statement jump tables from .rodata into the code section). 

 

On our card we use the PCIe slave to load everything, holding the Nios in soft reset until the code is present. Writing code to read the ELF program headers is fairly simple, or the code/data can be extracted and inserted into the data area of the host program using several objcopy commands (it is even possible to link the host program with the symbol table of the Nios one, allowing direct access to data items with a suitable host driver!).
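A minimal sketch of such an ELF program-header loader, using the standard Elf32 structures from <elf.h>. The function name and the local-RAM remapping are illustrative only - on real hardware the destination would be Nios II memory reached over PCIe:

```c
#include <elf.h>
#include <stdint.h>
#include <string.h>

/* Copy each PT_LOAD segment of an in-memory ELF image to its load
 * address. Here load addresses are remapped into a local buffer
 * ('ram', starting at physical address 'ram_base') for demonstration;
 * a real loader would write into the Nios II memory map instead.
 * Returns 0 on success, -1 if the image is not an ELF file. */
static int load_elf_segments(const uint8_t *image, uint8_t *ram,
                             uint32_t ram_base)
{
    const Elf32_Ehdr *eh = (const Elf32_Ehdr *)image;
    if (memcmp(eh->e_ident, ELFMAG, SELFMAG) != 0)
        return -1;                              /* bad ELF magic */

    const Elf32_Phdr *ph = (const Elf32_Phdr *)(image + eh->e_phoff);
    for (int i = 0; i < eh->e_phnum; i++) {
        if (ph[i].p_type != PT_LOAD)
            continue;                           /* skip non-loadable */
        uint8_t *dst = ram + (ph[i].p_paddr - ram_base);
        memcpy(dst, image + ph[i].p_offset, ph[i].p_filesz);
        /* zero the bss tail of the segment (memsz > filesz) */
        memset(dst + ph[i].p_filesz, 0, ph[i].p_memsz - ph[i].p_filesz);
    }
    return 0;
}
```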
Altera_Forum
Honored Contributor II

Very interesting! 

 

To give you an idea of the application: I want to create 3 very large arrays of data. Each array needs to hold about 2,000,000 64-bit floating point numbers. Naturally, I am not able to use internal memory for this; even with the Stratix IV GX 230, there simply is not enough space.  

 

As a result, I turned to the DDR as an output/storage device. I figured the easiest way would be to just connect the DDR to the processor and use it as the main memory. With this setup, I then create the gigantic arrays and use for loops to fill them sequentially. However, it seems like the arrays are not being filled with burst transfers. So I thought maybe the compiler was not "optimizing" the transfers with respect to the burst nature of the memory. 

 

Then I tried another implementation that used an internal RAM as well as the DDR, each connected to the CPU's data master. The internal RAM was connected to the CPU's reset and interrupt vectors, as configured in the CPU's parameters in the SOPC Builder wizard. 

 

When filling the arrays, is there a way to direct the compiler/source code to burst the writes, or should this happen automatically? When I placed performance counters around the transfers, it seemed to take more than 25,000,000 cycles to fill the arrays, even when I set the array size to 32768.
Altera_Forum
Honored Contributor II

I would look at the generated code; the instruction set is fairly simple to understand. 

 

To get any performance you need to ensure everything is compiled with -O2 or -O3 - these will also make the generated code easier to understand. 

Use 'gcc -S -fverbose-asm -O3 -o foo.S foo.c'. 

 

If you are processing the data sequentially then a small data cache (with 32 byte lines) should improve things. 

 

If you are indexing the same offsets in each array, check the cache associativity (I think it doesn't have any!) - so you may want to ensure the three arrays are offset from each other so that the same index uses different cache lines. 
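A rough illustration of that aliasing, assuming a direct-mapped 2 KB data cache with 32-byte lines (check your actual Nios II cache configuration - these sizes are illustrative):

```c
#include <stdint.h>

/* Assumed cache geometry: direct-mapped, 2 KB total, 32-byte lines. */
#define CACHE_SIZE 2048u
#define LINE_SIZE  32u

/* Which cache line a given byte address maps to in a direct-mapped
 * cache: the address modulo the cache size, divided by the line size. */
static unsigned cache_line(uint32_t addr)
{
    return (addr % CACHE_SIZE) / LINE_SIZE;
}
```

Arrays whose base addresses sit a multiple of CACHE_SIZE apart map the same index to the same line, so a[i], b[i] and c[i] evict each other on every access; padding one array by a single LINE_SIZE moves it to a different line.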

 

My system has 16Mb of SDRAM for buffers, but since these are accessed randomly (one byte from each buffer) I don't use the data cache at all.
Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

Very interesting! 

When filling the arrays, is there a way to direct the compiler/source code to burst the writes, or should this happen automatically? When I placed performance counters around the transfers, it seemed to take more than 25,000,000 cycles to fill the arrays, even when I set the array size to 32768. 

--- Quote End ---  

 

 

How do you generate the pattern that gets filled into the DRAM? Does the generation use floating-point arithmetic or integer/floating-point conversion? 

You should realize that on Nios II both operations are very slow, so if you do either of them the time difference between burst access and single-word access to DRAM is probably lost in the noise. 

 

Now, assuming you don't do something slow in the generator, a memory fill via a cached Nios II is still unlikely to achieve good efficiency because of the write-back, write-allocate architecture of the cache, which is badly suited to a large memory fill. However, with a 100MHz+ CPU clock and a 200-300MHz DRAM clock (400-600MT/s data rate) you should be able to fill a single 16-bit DDR SDRAM chip at 100-150 MB/s, i.e. at approximately 10% of peak memory throughput. To do any better you can try one of the following: 

1. Program+data in internal RAMs. Uncached (__builtin_stwio or upper 2GB) access to DRAM in a manually unrolled loop, relying on the merge-access feature of the HPC2 DDR controller. This solution requires minimal hardware expertise but, IMHO, is not sufficiently robust. 
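A sketch of option 1. __builtin_stwio is the Nios II GCC intrinsic for a store that bypasses the data cache; the fallback macro and the unroll factor are only there so the sketch compiles and runs off-target:

```c
#include <stdint.h>

/* On a Nios II toolchain, __builtin_stwio() issues an uncached store;
 * elsewhere fall back to a plain volatile store so the sketch builds.
 * (__nios2__ is assumed to be the toolchain's predefined macro.) */
#ifdef __nios2__
#define STWIO(p, v) __builtin_stwio((p), (v))
#else
#define STWIO(p, v) (*(volatile uint32_t *)(p) = (v))
#endif

/* Fill nwords words at dst, unrolled by 4 so consecutive stores reach
 * the DDR controller back to back and can be merged into a burst. */
static void fill_uncached(uint32_t *dst, uint32_t value, unsigned nwords)
{
    unsigned i;
    for (i = 0; i + 4 <= nwords; i += 4) {
        STWIO(&dst[i],     value);
        STWIO(&dst[i + 1], value);
        STWIO(&dst[i + 2], value);
        STWIO(&dst[i + 3], value);
    }
    for (; i < nwords; i++)     /* remaining tail words */
        STWIO(&dst[i], value);
}
```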

 

2. Program+data in internal RAMs. All or part of the internal RAM is dual-ported, with one memory port connected to the CPU via a tightly-coupled data port and the other port connected to a DMA engine. You prepare your fill pattern chunk by chunk in the internal dual-ported memory and then DMA it into DRAM. For maximum performance, use a double buffer - DMA from the first while filling the second, then switch. This solution is the most robust but also takes more development work and consumes more FPGA resources. 
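The ping-pong control flow of option 2 might look like the following; dma_copy is a stand-in stub for programming the real DMA engine, so the overlap between generating one chunk and draining the other is not modelled here:

```c
#include <stdint.h>
#include <string.h>

#define CHUNK_WORDS 256         /* illustrative on-chip buffer size */

/* Stand-in for the DMA engine: on hardware this would program the DMA
 * controller and return immediately; here it copies synchronously. */
static void dma_copy(uint32_t *dst, const uint32_t *src, unsigned n)
{
    memcpy(dst, src, n * sizeof *src);
}

/* Example generator: writes the word index as the pattern. */
static void gen_index(uint32_t *dst, unsigned base, unsigned n)
{
    for (unsigned i = 0; i < n; i++)
        dst[i] = base + i;
}

/* Double-buffered fill: generate chunk k+1 into one on-chip buffer
 * while the DMA drains chunk k from the other, swapping each pass. */
static void fill_via_dma(uint32_t *dram, unsigned total_words,
                         void (*generate)(uint32_t *, unsigned, unsigned))
{
    static uint32_t buf[2][CHUNK_WORDS];    /* the dual-ported RAMs */
    unsigned done = 0, which = 0;

    while (done < total_words) {
        unsigned n = total_words - done;
        if (n > CHUNK_WORDS)
            n = CHUNK_WORDS;
        generate(buf[which], done, n);      /* fill one buffer */
        dma_copy(dram + done, buf[which], n); /* drain it to DRAM */
        done += n;
        which ^= 1;                         /* swap ping-pong buffers */
    }
}
```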

 

Hope that helps, 

Michael