
sdram transfer

Altera_Forum
Honored Contributor II
1,677 Views

We tested block transfers in SDRAM with the memcpy() function on the Stratix 1S10 Nios II development kit running uClinux v1.1.

The CPU frequency is 50 MHz.

 

With the Nios II/s (no data cache) we obtained:

8-bit transfer: 1.7 Mbytes/s

16-bit transfer: 3.4 Mbytes/s

32-bit transfer: 6.9 Mbytes/s

 

With the Nios II/f (2 Kbytes data cache):

Block size < 1 Kbyte (between CPU and cache)

8-bit transfer: 6.6 Mbytes/s

16-bit transfer: 13.2 Mbytes/s

32-bit transfer: 26 Mbytes/s

Block size > 1 Kbyte (between CPU and SDRAM through the cache)

8-bit transfer: 0.9 Mbytes/s

16-bit transfer: 1.9 Mbytes/s

32-bit transfer: 5 Mbytes/s

 

With the Nios II/f, when the block size is greater than the cache size, performance is lower than with the Nios II/s configuration. It seems that the cache penalizes the byte rate.

Do you have a Nios test bench that confirms my measurements?

For a 32-bit transfer with no cache, we obtain 6.9 Mbytes/s. This byte rate is very low.

Is that normal?
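For anyone who wants to reproduce figures like these, here is a minimal sketch of a timing loop. It is not the code used for the numbers above: the helper names `copy32` and `copy_rate_mbytes_s` are made up for illustration, it times with the standard `clock()` call, and it counts 1 Mbyte as 10^6 bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Copy `words` 32-bit words. The volatile qualifiers keep the compiler
 * from eliding or merging the accesses that are being timed. */
static void copy32(volatile uint32_t *dest, const volatile uint32_t *src,
                   size_t words)
{
    for (size_t i = 0; i < words; i++) {
        dest[i] = src[i];
    }
}

/* Return the copy throughput in Mbytes/s (10^6 bytes per second),
 * or 0.0 when the elapsed time is below the timer resolution. */
static double copy_rate_mbytes_s(uint32_t *dest, const uint32_t *src,
                                 size_t words, unsigned repeats)
{
    clock_t start = clock();
    for (unsigned r = 0; r < repeats; r++) {
        copy32(dest, src, words);
    }
    clock_t stop = clock();

    double seconds = (double)(stop - start) / CLOCKS_PER_SEC;
    if (seconds <= 0.0) {
        return 0.0; /* timer too coarse: raise `repeats` and retry */
    }
    double bytes = (double)words * sizeof(uint32_t) * (double)repeats;
    return bytes / seconds / 1e6;
}
```

Running it once with a block well under the data cache size and once with a block well over it should show whether the same drop appears on another system.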

 

Thanks in advance, 

 

Fred
Altera_Forum
Honored Contributor II
384 Views

Fred, 

 

Thanks for the valuable info and please keep it coming! 

 

The only way I found to copy/move memory around quickly was using the DMA controller. memcpy and pointer arithmetic are slow. The DMA controller uses the exact timing set in SOPC Builder with no delay in between writes. Probably not your point here, but just in case.

 

If you want to post your test code, I'd be happy to run it on my Nios I and Nios II SDRAM systems. (Cyclone based, 75 MHz)

 

Ken
Altera_Forum
Honored Contributor II
384 Views

Another thing you can do is, if you can identify a place where the data cache hurts you like this, you can use the "bit 31" trick to copy from/to noncacheable memory. This may help your II/f benchmarks converge with your II/s core benchmarks. 

 

The "bit 31" trick is covered on page 7-7 of the Nios II Software Developer's Handbook. In short, only bits 30-0 of an address are actually driven onto the address bus. Bit 31 controls whether it goes through the data cache or not; if set, the data cache is bypassed. So all you have to do is pass (address | 0x80000000) to memcpy for each buffer you want uncached. Try your benchmarks with the cache bypassed for reads only, writes only, or both, and see what works best.
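A minimal sketch of the address arithmetic, assuming a core where bit 31 really does select the cache bypass. The helper names `uncached_alias` and `bus_address` are made up for this example and are not part of any Altera header.

```c
#include <stdint.h>

/* Assumption: bit 31 of a data address selects the cache-bypass alias,
 * as described in the post above. This only holds on cores where the
 * feature is enabled. */
#define NIOS2_BYPASS_DCACHE_BIT UINT32_C(0x80000000)

/* Return the uncached alias of a 32-bit Nios II data address. */
static inline uint32_t uncached_alias(uint32_t addr)
{
    return addr | NIOS2_BYPASS_DCACHE_BIT;
}

/* Return the address that is actually driven onto the bus (bits 30-0). */
static inline uint32_t bus_address(uint32_t addr)
{
    return addr & ~NIOS2_BYPASS_DCACHE_BIT;
}
```

On the target you would cast the result back to a pointer before handing it to memcpy, for example `memcpy((void *)uncached_alias((uint32_t)dest), src, n)` to bypass the cache for writes only.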
Altera_Forum
Honored Contributor II
384 Views

Be warned, though, that this meaning for bit 31 is optional and may be turned off. The only way to avoid the cache that is guaranteed to work for all Nios II cores is to use the I/O load and store instructions (ldwio/stwio and their byte and halfword variants) instead of the normal loads and stores. The io suffix treats the read/write like a device access, so it avoids the cache, because repeated reads and writes to device registers have to actually reach the device. Perhaps there is a specialised memcpy that uses these instructions?
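I don't know of a ready-made one, but here is a rough sketch of what such a copy could look like. It assumes the GCC Nios II builtins `__builtin_ldwio` and `__builtin_stwio` are available in your toolchain; on any other compiler it falls back to plain volatile accesses so that it still builds. The function name `memcpy_words_io` is made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__nios2__)
/* Assumed GCC Nios II builtins that emit ldwio/stwio (cache-bypassing). */
#define LOAD_WORD_IO(p)     ((uint32_t)__builtin_ldwio((volatile const void *)(p)))
#define STORE_WORD_IO(p, v) __builtin_stwio((volatile void *)(p), (int)(v))
#else
/* Host fallback so the sketch compiles anywhere: plain volatile accesses. */
#define LOAD_WORD_IO(p)     (*(volatile const uint32_t *)(p))
#define STORE_WORD_IO(p, v) (*(volatile uint32_t *)(p) = (v))
#endif

/* Hypothetical helper: copy `words` 32-bit words, bypassing the data cache
 * on Nios II. Both pointers must be 4-byte aligned. */
static void memcpy_words_io(uint32_t *dest, const uint32_t *src, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        STORE_WORD_IO(&dest[i], LOAD_WORD_IO(&src[i]));
    }
}
```

Note that if the same buffers are also touched through normal cached accesses, stale cache lines are a separate problem that this sketch does not handle.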

Altera_Forum
Honored Contributor II
384 Views

Hi Fred, 

 

Here are some preliminary results.  

I modified your source to use nr_timer_milliseconds() instead of gettimeofday(), but that is all. I additionally added dmaMemcpy() and dmaMemcpy4() to test dma transfers. 

 

Here is the summary. 

The system is NiosI with single Micron 4Mx32 sdram @75MHz 2K I-cache, 0K D-cache. 

 

memcpy() 10.4-10.7MB/s 

dma1() 8.6-9.1MB/s 

dma4() 30.3-35.5MB/s 

 

Kind of disappointing considering I've read we should be able to get roughly one transfer per clock in burst. I wonder if we're missing some setting?

I'll try to get the Nios II results today as well.

 

Ken
Altera_Forum
Honored Contributor II
384 Views

Thanks for your help and your results.

Ken,

I agree with you. A 75 MHz, 32-bit SDRAM should deliver 300 MB/s in burst mode (read or write). But when you chain single read and write transactions to different SDRAM pages, you lose clock cycles. To increase performance, it is better to chain burst reads and burst writes. I don't know whether the DMA can work in this mode. I don't think so, because the DMA FIFO depth is small.
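The 300 MB/s figure is simply one 32-bit word per clock: 75,000,000 cycles/s x 4 bytes. A small sketch of that arithmetic, with made-up helper names and 1 MB taken as 10^6 bytes, which also expresses a measured rate as a fraction of the peak:

```c
#include <stdint.h>

/* Theoretical peak in bytes per second, assuming one full bus-width
 * transfer on every clock cycle (ideal burst, no page misses). */
static uint64_t peak_bytes_per_sec(uint64_t clock_hz, unsigned bus_bits)
{
    return clock_hz * (bus_bits / 8u);
}

/* Measured throughput (in MB/s, 1 MB = 10^6 bytes) as a percentage
 * of the theoretical peak. */
static double bus_efficiency_percent(double measured_mb_s,
                                     uint64_t clock_hz, unsigned bus_bits)
{
    return 100.0 * measured_mb_s * 1e6
           / (double)peak_bytes_per_sec(clock_hz, bus_bits);
}
```

With the 30.3 MB/s measured for dma4() above, this works out to roughly 10% of the peak.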
Altera_Forum
Honored Contributor II
384 Views

Fred, 

 

That&#39;s true, but a 10X performance hit is pretty severe. (300MB/s -> 30MB/s) 

 

I just wonder if there is anything that can be done to improve this? I'm going through my code and carefully moving key variables into registers and on-chip RAM. It's making huge improvements, but I wasn't planning on going to this level to get the performance I need. Next thing I'll be hand coding assembly and creating custom instructions. I guess it's nice to have the options.

I'm going to toss in the towel on running on Nios II. It runs but gives bogus results. Maybe I'll get another wild hair and try some more.

 

Can you share how you were doing 32bit xfers? Are you simply relying on optimization inside of memcpy?  

 

Thanks, 

Ken
Altera_Forum
Honored Contributor II
384 Views

Ken, 

(I am working with Fred, but on the software side ...) 

 

For 32bit transfers, we just use a hand-made loop with 32bit pointers (see below). 

With 8bit pointers (=8bit xfer), we get the same throughput as with memcpy(), so it seems there is no optimisation in memcpy(). 

 

Bruno. 

 

--------------- 

 

static void
myMemcpyLong(const char *src, char *dest, int taille)
{
    /* This code will fail when buffers or size are not properly aligned,
     * so it is not a drop-in replacement for memcpy().
     * (anyway 'src' and 'dest' parameters are swapped ...) */
    register const long *mySrc = (const long *)src;
    register long *myDest = (long *)dest;

    /* Copy one long (32 bits on Nios II) per iteration. */
    while (taille > 0) {
        *myDest++ = *mySrc++;
        taille -= (int)sizeof(*mySrc);
    }
} /* myMemcpyLong */