Nios® V/II Embedded Design Suite (EDS)

I need some help creating shared DDR SDRAM memory

Altera_Forum
Honored Contributor II
3,613 Views

Hi, everyone:), 

 

My system is like this: using the NEEK, I created 2 processors, both running from the DDR SDRAM. Processor1's reset vector offset is 0x100 and its exception vector offset is 0x120. Processor2's reset vector offset is 0x1000000 and its exception vector offset is 0x1000020. As you can see, I didn't use the first 0x100 bytes of the memory address space because I want to use them as the shared memory.

 

Then in the software for processor2, I created a float pointer pointing to the address DDR_SDRAM_BASE defined in system.h and wrote some value to it. Processor1 reads and prints the value at address DDR_SDRAM_BASE in its program.
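
To illustrate, this is roughly what the two programs do (just a sketch, not my exact code):

```c
#include <stdio.h>
#include "system.h"    /* DDR_SDRAM_BASE */

/* Processor2: write a value to the start of the shared region. */
void processor2_write(void)
{
    volatile float *shared = (volatile float *) DDR_SDRAM_BASE;
    *shared = 1.234f;
}

/* Processor1: read and print the value at the same address. */
void processor1_read(void)
{
    volatile float *shared = (volatile float *) DDR_SDRAM_BASE;
    printf("shared value = %f\n", *shared);   /* always prints 0.000 */
}
```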

 

I thought that by doing this, processor2's changes to the SDRAM would be visible to processor1. However, processor1 always prints 0.000 no matter what processor2 writes.

 

I feel confused :confused: 

 

Can anyone help?
0 Kudos
23 Replies
Altera_Forum
Honored Contributor II
1,337 Views

If your CPUs have a data cache, it is possible that they won't detect changes in the main RAM when the value to be read is already cached. 

Either change your pointers to non-cached ones using the alt_remap_uncached() function, or use the IORD/IOWR macros to access the shared memory. Those macros always bypass the cache.
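
For example (just a sketch; DDR_SDRAM_BASE comes from your system.h, the word offset and the 64-byte length are arbitrary choices for illustration):

```c
#include <io.h>               /* IORD / IOWR macros    */
#include <sys/alt_cache.h>    /* alt_remap_uncached()  */
#include "system.h"           /* DDR_SDRAM_BASE        */

/* Option 1: the IORD/IOWR macros always bypass the data cache. */
void write_shared_word(unsigned int value)
{
    IOWR(DDR_SDRAM_BASE, 0, value);        /* word 0 of the shared area */
}

unsigned int read_shared_word(void)
{
    return IORD(DDR_SDRAM_BASE, 0);
}

/* Option 2: remap the region so ordinary pointer accesses bypass the cache. */
volatile float *open_shared_float(void)
{
    /* 64 bytes is an arbitrary length for this example; see the later
     * replies about flushing the cache for this range first. */
    return (volatile float *) alt_remap_uncached((void *) DDR_SDRAM_BASE, 64);
}
```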
0 Kudos
Altera_Forum
Honored Contributor II
1,337 Views

You'd be better off allocating an M9K block, dual-porting it to both CPUs as 'tightly coupled data memory', and using that as the shared space (just make sure you don't do concurrent writes to the same address). 

 

In either case you'll need to use something like Dekker's algorithm for mutual exclusion.
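
A minimal sketch of Dekker's algorithm for two CPUs, assuming the three control variables live in memory that both CPUs access uncached (the struct layout is just for illustration):

```c
typedef struct {
    volatile int wants_to_enter[2];  /* one flag per CPU              */
    volatile int turn;               /* whose turn it is to back off  */
} dekker_lock_t;

/* cpu_id is 0 or 1 */
static void dekker_lock(dekker_lock_t *lk, int cpu_id)
{
    int other = 1 - cpu_id;

    lk->wants_to_enter[cpu_id] = 1;
    while (lk->wants_to_enter[other]) {
        if (lk->turn != cpu_id) {
            /* Not our turn: back off until the other CPU is done. */
            lk->wants_to_enter[cpu_id] = 0;
            while (lk->turn != cpu_id)
                ;                     /* spin */
            lk->wants_to_enter[cpu_id] = 1;
        }
    }
}

static void dekker_unlock(dekker_lock_t *lk, int cpu_id)
{
    lk->turn = 1 - cpu_id;            /* hand the turn to the other CPU */
    lk->wants_to_enter[cpu_id] = 0;
}
```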
0 Kudos
Altera_Forum
Honored Contributor II
1,337 Views

1. Thank you for Daixiwen's answer. Right, if I remove the data cache, it works well. 

 

2. For dsl: I have always been confused about the difference between tightly coupled memory and normal on-chip RAM. Tightly coupled memory needs to be connected to the tightly coupled master port of the CPU, while normal on-chip RAM can be connected directly to a normal master port. Right? Can I use normal on-chip RAM as the shared memory in my design? 

 

Thanks
0 Kudos
Altera_Forum
Honored Contributor II
1,337 Views

Tightly coupled memory is a regular on-chip memory configured to be 32 bits wide, with a read latency of 1, and connected to only a single tightly coupled master. This gives a deterministic access time of 0 cycles for writes and 1 cycle for reads. 

 

Other memories in your system might be multi-mastered or accessed through the cache, both of which cause non-deterministic access times. The nice thing about using tightly coupled memory as shared memory is that a) it's the fastest memory you can use, and b) accesses to it bypass the cache. So you could dual-port an on-chip memory, connect a tightly coupled data master from each CPU to it, and use some sort of algorithm/hardware to prevent data hazards. You could use the mutex hardware component, for example, to make sure both CPUs don't attempt to access the same memory at the same time. 
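
A rough sketch of using the hardware mutex through its HAL driver; the instance name "/dev/message_buffer_mutex" and SHARED_MEM_BASE are made-up names for this example, and the shared region is assumed to be tightly coupled or otherwise uncached:

```c
#include <altera_avalon_mutex.h>
#include "system.h"

void write_shared_value(float value)
{
    alt_mutex_dev *mutex = altera_avalon_mutex_open("/dev/message_buffer_mutex");
    volatile float *shared = (volatile float *) SHARED_MEM_BASE;

    if (mutex == NULL)
        return;                             /* mutex core not found */

    altera_avalon_mutex_lock(mutex, 1);     /* blocks until this CPU owns it */
    *shared = value;                        /* critical section              */
    altera_avalon_mutex_unlock(mutex);
}
```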

 

I, on the other hand, prefer to use FIFOs to move data between processor cores, since data hazards are not a problem that way. This method doesn't scale when you have a lot of processors that need to communicate with each other, since you have to use a lot of FIFOs. I recommend you get your software up and running first, and if you need more speed, then determine whether other shared-data mechanisms are needed.
0 Kudos
Altera_Forum
Honored Contributor II
1,337 Views

Hi, BadOmen, 

 

Exactly. Originally I used the mailbox core to do the on-chip shared-memory communication. I think the mailbox is kind of like the FIFO you mentioned. However, it is very slow; the speed is about one tenth of Ethernet communication! (I am not sure whether I have done something wrong.) So now I am thinking of using shared memory and the hardware mutex core. 

 

On-chip memory is a limited resource. I am wondering whether I can use the DDR SDRAM or the SRAM to build the shared memory. So you mean that if I use regular on-chip memory connected to the regular master port of the processor, it will not bypass the cache. Right?
0 Kudos
Altera_Forum
Honored Contributor II
1,337 Views

That's correct. The mailbox component is similar to a FIFO, only parts of it are implemented in software, which is why it's slow when sending small messages back and forth. The good part is that it's scalable to any memory in your system. Using the FIFO approach, I either drain the FIFOs until they are empty, or map the FIFO 'used' signal so that I know ahead of time how much data I can safely read out of it if I don't want to end up blocking the processor when the FIFO is empty. Sometimes I make this interrupt-based as well. 

 

If the on-chip memory is connected directly to the (regular) Nios II data master and the data cache is enabled, it will be cacheable. Think of the tightly coupled master and the caches as being in parallel inside the processor; the decode that determines whether an access is tightly coupled, cacheable, or uncacheable happens concurrently. Just remember that the Nios II JTAG debugger can't download instructions into tightly coupled memory unless there is a data master connection to it (regular or tightly coupled). 

 

If you decide to make the on-chip RAM cacheable, you can carve out some of that memory and remap it to be uncacheable so that you don't need to use IOWR/IORD to access the shared data. If you do this, use the HAL calls that remap the pointer and don't flip the MSB of the pointer manually, since there are some cache considerations that the HAL takes care of.
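
Something like this (a sketch only; ONCHIP_RAM_BASE and the 128-byte window are placeholders, and the flush covers the caveat mentioned in a later reply):

```c
#include <sys/alt_cache.h>   /* alt_remap_uncached(), alt_dcache_flush() */
#include "system.h"

volatile float *get_shared_region(void)
{
    void *region = (void *) ONCHIP_RAM_BASE;   /* placeholder name */
    const int len = 128;

    /* Write back anything left in the data cache for this range first,
     * since alt_remap_uncached() itself doesn't flush. */
    alt_dcache_flush(region, len);

    /* The HAL returns an alias of the same physical memory that bypasses
     * the data cache. */
    return (volatile float *) alt_remap_uncached(region, len);
}
```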
0 Kudos
Altera_Forum
Honored Contributor II
1,337 Views

Dear BadOmen, 

 

Thanks for your explanation. 

 

I think it makes no sense to have a cacheable shared memory between processors, right? 

 

So if I understand correctly, I only have the following 3 choices: 

 

1. Remove the data cache; 

2. Use tightly coupled memory; 

3. Use regular on-chip memory or other memory (e.g., SDRAM, SRAM) and use non-cacheable accesses (the IOWR/IORD macros, or remap the memory as uncacheable as you said). 

 

Am I right? Then I think I will see how much resource I have and choose between choice 2 and choice 3 :)
0 Kudos
Altera_Forum
Honored Contributor II
1,337 Views

Also remember that alt_remap_uncached() doesn't flush the cache for the addressed area, nor can it ensure (and it doesn't report an error) that the area covers entire cache lines. 

Both of these can lead to unexpected read/write transfers from the cache to the memory.
0 Kudos
Altera_Forum
Honored Contributor II
1,337 Views

If you want the fastest memory transfer method, use a dual-port on-chip memory and connect each port to a tightly coupled memory port on each of the CPUs. 

If you need more memory than what you have available in the FPGA, use external memory. If the external memory is an SSRAM, you can use uncached pointers or the uncached macros. If the external memory is an SDRAM, you will have a huge speed loss on consecutive accesses without a cache, so I would recommend using the data cache and the flush functions. But in that case you must be sure that your memory blocks are aligned on cache lines, because, as dsl says, nasty things happen when different blocks share a cache line and you do flushes.
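
A sketch of that cached-SDRAM hand-off, assuming a 32-byte data cache line and a made-up SHARED_SDRAM_BASE address that is cache-line aligned:

```c
#include <string.h>
#include <sys/alt_cache.h>             /* alt_dcache_flush() */

#define SHARED_SDRAM_BASE  0x04000000  /* placeholder carved-out SDRAM block */
#define DCACHE_LINE        32          /* assumed data cache line size       */

static char * const shared_block = (char *) SHARED_SDRAM_BASE;

/* Producer: fill the block, then write the dirty line back to SDRAM. */
void producer_publish(const char *msg)
{
    strncpy(shared_block, msg, DCACHE_LINE - 1);
    shared_block[DCACHE_LINE - 1] = '\0';
    alt_dcache_flush(shared_block, DCACHE_LINE);
}

/* Consumer: flush (write back + invalidate) any stale copy, then read. */
void consumer_fetch(char *dst)
{
    alt_dcache_flush(shared_block, DCACHE_LINE);
    strncpy(dst, shared_block, DCACHE_LINE);
}
```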
0 Kudos
Altera_Forum
Honored Contributor II
1,337 Views

Thanks for your comments.

0 Kudos
Altera_Forum
Honored Contributor II
1,337 Views

I did some measurements for random uncached SDRAM accesses. 

IIRC the first 2 writes happen without significant delay. I suspect that the first one is actioned asynchronously, and the second is put into an SDRAM 'line' buffer (32 bytes ?) in case the next transfer is to the same SDRAM line. 

 

It is also likely that SDRAM reads always read a full memory line, and Avalon reads that match the buffered data (e.g. sequential accesses) are completed without doing an actual memory transfer. 

 

However, cache transfers should be faster since (I think) they get pipelined. 

 

Unfortunately the Nios CPU is missing an instruction to create a valid cache line without doing the memory read (useful when you know you are going to modify the entire line). I think there are some other missing cache operations that rather hurt attempts to run a Unix OS.
0 Kudos
Altera_Forum
Honored Contributor II
1,337 Views

Thank you, everyone, for the information. 

 

So I think tightly coupled memory would be a good choice. However, I wonder why I should use the dual-port mode? 

 

If I have 2 processors, maybe configuring it in dual-port mode is better because I can connect one port to each processor. 

 

However, if I have more than 2 processors and I want a common shared memory, I think dual-port mode is not necessary at all, and I should connect all of the processors to a common slave port of the memory and use a mutex to coordinate the access. Right? 

 

:D;)
0 Kudos
Altera_Forum
Honored Contributor II
1,337 Views

Oh, I have found that one slave port of a tightly coupled memory can only be connected to one master port. So that means a tightly coupled memory can only be shared by at most 2 processors in dual-port mode, I think.

0 Kudos
Altera_Forum
Honored Contributor II
1,337 Views

That is correct. The reason for this is that tightly coupled memory cannot stall the processor pipeline, so the slave port cannot be shared, as arbitration could cause wait states. 

 

From my own testing, I find that worst-case random access patterns to SDRAM will drop performance down to 33% efficiency. Best case is around 97%, but you would need something like a DMA to hit that. A cache, I suspect, should achieve around 90%, as most SDRAM controllers have built-in management for the rows/columns/banks (or whatever SDRAM people call those terms). Due to the mapping of the cache lines in the address space, the first access will typically have some overhead, but the next seven accesses (assuming 32 bytes per line) should enter the controller efficiently, assuming the arbitration share is set to 8 or greater so that other masters don't get in and start thrashing the memory.
0 Kudos
Altera_Forum
Honored Contributor II
1,337 Views

Hi, 

 

I have implemented the tightly coupled memory as the shared memory. 

 

However, I have found that it is really difficult to implement the communication software for the shared memory. 

 

Originally I used the mailbox core, and its mailbox_pend() API automatically blocks when there is no message and unblocks when there is one, which makes it very easy to use in a process. 

 

However, since I now use the hardware mutex for the shared memory, there is no such API. The problem is that this is a 'mutex', not a 'semaphore', which means it can't be released by other processors! Since there is no semaphore, I think I cannot implement the communication as a while() loop, as in the classical producer/consumer problem. If Processor1 has put some data into the shared memory, Processor2 can't know when the data is ready. Even if Processor2 knows, Processor1 will not know when Processor2 has finished reading the data. 

 

Am I right? I am not a CS major so there might be some misunderstanding:(
0 Kudos
Altera_Forum
Honored Contributor II
1,337 Views

Ummmmm, I forget... I haven't done this stuff since school, but you can take a look at this to see if it gives you any ideas: http://www.altera.com/support/examples/nios2/exm-multi-nios2-hardware.html?gsa_pos=4&wt.oss_r=1&wt.oss=multi That design shares data between two CPUs using a mutex and on-chip RAM. 

 

If I remember correctly, the mailbox was essentially two mutexes, a 'binding' to some physical memory in your system, and a software protocol implemented in the driver for the component. Maybe looking at the driver will give you some ideas of what to do, while eliminating anything you don't need that might slow it down.
0 Kudos
Altera_Forum
Honored Contributor II
1,337 Views

Hi, BadOmen, 

 

Thank you for your reply. I am very familiar with the example you provided. Though it creates a hardware mutex, it doesn't use it; all the communication happens through the mailbox. 

 

I will check the mailbox driver anyway :) It seems there is no way to use the mutex as conveniently as the mailbox. However, the mutex should be faster.
0 Kudos
Altera_Forum
Honored Contributor II
1,337 Views

Maybe if you used more than one mutex that would help.

0 Kudos
Altera_Forum
Honored Contributor II
1,337 Views

Yes. If I could use multiple semaphores, then it would work well, because that's the generic producer/consumer problem. 

 

However, unlike a semaphore, a mutex can only be released by the one who locked it. So a mutex can be used to protect the shared memory; however, it cannot be used to coordinate tasks on different processors, in my opinion :(
0 Kudos
Altera_Forum
Honored Contributor II
1,282 Views

You'll need to use Dekker's algorithm on uncached memory to do any form of inter-CPU synchronisation. 

Generating a 'spin lock' is easiest, and, in fact, all the other synchronisation schemes are built on top of spin locks.
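
For example, a single-slot producer/consumer hand-off can be built from just a data-ready flag spinning in uncached shared memory (a sketch with made-up names; one writer and one reader only):

```c
/* Single-slot hand-off between one producer CPU and one consumer CPU.
 * Both the flag and the payload must live in memory neither CPU caches. */
typedef struct {
    volatile int   data_ready;   /* 0 = slot empty, 1 = slot full */
    volatile float payload;
} shared_slot_t;

/* Producer: wait for the slot to be empty, fill it, then publish. */
void produce(shared_slot_t *slot, float value)
{
    while (slot->data_ready)     /* spin until the consumer has drained it */
        ;
    slot->payload    = value;
    slot->data_ready = 1;
}

/* Consumer: wait for the slot to be full, read it, then release it. */
float consume(shared_slot_t *slot)
{
    float value;

    while (!slot->data_ready)    /* spin until the producer has published */
        ;
    value            = slot->payload;
    slot->data_ready = 0;
    return value;
}
```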
0 Kudos