NIOSII addressing for 32 bit wide SDRAM

Altera_Forum · ‎08-27-2010

Hello, first time poster here.

As the title suggests, I have an FPGA accessing a 32 bit wide DDR2 device (actually 2x 16 bit devices with appropriately shared signals).

I am using SOPC Builder to generate a simple system with mainly a NIOSII/f and one HPCII DDR2 controller.

The controller is running full rate, which means the "local width" is 64 bits.

The DDR2 size is 256MB, so the span is 0x10000000 (bytes), and the base address is 0x20000000 in SOPC.

I am trying to confirm how to access all of the memory.

Should I use IOWR/IORD or IOWR_32DIRECT/IORD_32DIRECT?

If I use IOWR_32DIRECT/IORD_32DIRECT then should I read/write as:

IOWR_32DIRECT(BASE, 0, val);

IOWR_32DIRECT(BASE, 4, val);

IOWR_32DIRECT(BASE, 8, val);

...

IOWR_32DIRECT(BASE, 0xFFFFFFC, val);

Or should the address increments be multiples of 8? Or something else?

Thank-you.

Altera_Forum · ‎08-27-2010

This may seem like a silly question, but I ask because when I use increments of 8 and write data=addr to each address, the values are all unique and the read value equals the written value.

But with increments of 8 the memory span seems incorrect: I can't go beyond 0x10000000 without reading all F's. It also does not seem logical to use 8, as I believe we are using byte addressing and a 32DIRECT access would hit 4 of these byte addresses at a time.

However, when using increments of 4 there is 2x duplication between subsequent addresses when reading back.

There could be timing issues relating to this, but I wanted to make sure that I am at least doing the right thing in the NIOS code first.

Thank-you.

Altera_Forum · ‎08-27-2010

The IOWR/IORD macros assume word addressing (32-bit). I recommend being explicit and just use the width specific macros: IORD_<8/16/32>DIRECT and IOWR_<8/16/32>DIRECT. This way you can use them to access narrow data and won't have some confusing mix of the two macro types in your code.

So by using the _32DIRECT macros you are correctly incrementing the address by 4 bytes at a time.

To debug the issue you are seeing I recommend simulating the design. Write a few values out to the memory and then read them back and watch the transactions on the fabric to see if you are getting what you expect returned in hardware.

Altera_Forum · ‎08-27-2010

Thanks for the reply, BadOmen.

As mentioned:

>>Using SOPC builder to generate a simple system with mainly NIOSII/f and one

>>HPCII DDR2 controller.

Would you recommend trying to close timing on this project before anything else, though?

If so, and if I am going to run this thing at 200MHz with a -6 ArriaIIGX, then perhaps I need to alter the HDL indirectly through SOPC changes (either parametric, for the current components, or architectural, for example by adding a pipelining bridge), as "Report Timing" in TimeQuest shows a number of violations (negative slack) even for clk to clk transfers (ie. launch and latch clock are the same). Any hints here?

Would SignalTap also be of use here, versus setting up sims?

Your thoughts?

Regards.

Altera_Forum · ‎08-27-2010

Yes you should close timing before expecting your software to work. Same goes for using Signaltap II. I also would prototype at a much lower speed to start, make sure the memory accesses are functional, and then crank the clock up.

If you simulate this system you won't need to meet timing since you'll just be verifying it at a functional level. After the functionality looks right in simulation then you could move on to closing timing.

clk to clk transfers are going to be the majority of your design (register to register transfers will normally be on the same domain). If the TNS (total negative slack) is close to 0 then you could probably just do some Quartus II setting tweaks to meet timing. If the TNS is high then you might need architectural changes to achieve 200MHz.

Altera_Forum · ‎08-27-2010

Thanks again, BadOmen.

On the topic of timing closure: I have noticed even for

clk to clk transfers that some designers use constraints such as set_max_delay to fix the

offending constraints. Is this good practice or are we just asking for trouble and should

this only be done as a last resort?

Or should, as you say, procedures such as HDL modifications, architectural changes,

or Quartus settings tweaks(for small TNS) be explored first?

>>>you could probably just do some Quartus II setting tweaks to meet timing.

Perhaps using the Quartus tool "advisors" for this or are they not so good?

Perhaps if the violation count is low then one might opt for constraining the few violating

paths but in my case there are too many to count so probably I need to either lower

the clock speed or make some architectural changes and not even think about trying to

constrain with say set_max_delay, eh? Or perhaps entertain a faster part (ie. -4)?

Thoughts?

Regards.

Altera_Forum · ‎08-28-2010

It is difficult to say what would be a good or bad constraint without knowing the design. I would take a look at the Timequest documentation to learn the best practices. Also, for the SDRAM, you have added its automatically generated constraints to your Quartus project, correct?

Altera_Forum · ‎08-28-2010

Thanks,

>>I would take a look at the Timequest documentation to learn the best practices.

I have read through mountains of material and took the approx one hour online

training and saw little practical value there. I found that the online training lacked

examples of the nature "here is a violation and here is how we fix it and here is why

we do it this way." Same thing with the TimeQuest docs. They seem to focus on

"here are the tools and good luck." Hence the questions. :)

>>Also for the SDRAM you have added its automatically generated constraints to your Quartus project correct?

That is correct. Actually, when I contacted support about failing constraints they

made a couple of changes to the proj settings and gave me an additional.sdc file

to further constrain the DDR which had a bunch of constraints like:

set_max_delay -to [get_keepers *stage_counter\[*] 10.459

They have not solved my problems so I turn to the forum.

Regards.

Altera_Forum · ‎08-28-2010

FYI: here is the output from the test(below).

I now have the NIOS processor running at 50MHz and all Critical Warnings removed

such that I have timing closure.

FYI: output results did not change from 200MHz case.

Thoughts on the pattern error?

Some kinda configuration adjustment for the NIOS or DDR2 HPCII in SOPC?

Thank-you.

OUTPUT

--------

ddr_test_fulldata: 1: wrote 00000001, test_addr 00000000, read_val 00000004, device 0

ddr_test_fulldata: 2: wrote 00000002, test_addr 00000004, read_val 00000004, device 0

ddr_test_fulldata: 3: wrote 00000004, test_addr 00000008, read_val 00000008, device 0

ddr_test_fulldata: 4: wrote 00000008, test_addr 0000000c, read_val 00000008, device 0

ddr_test_fulldata: 5: wrote 00000010, test_addr 00000010, read_val 00000040, device 0

ddr_test_fulldata: 6: wrote 00000020, test_addr 00000014, read_val 00000040, device 0

ddr_test_fulldata: 7: wrote 00000040, test_addr 00000018, read_val 00000080, device 0

ddr_test_fulldata: 8: wrote 00000080, test_addr 0000001c, read_val 00000080, device 0

ddr_test_fulldata: 9: wrote 00000100, test_addr 00000020, read_val 00000400, device 0

ddr_test_fulldata: 10: wrote 00000200, test_addr 00000024, read_val 00000400, device 0

ddr_test_fulldata: 11: wrote 00000400, test_addr 00000028, read_val 00000800, device 0

ddr_test_fulldata: 12: wrote 00000800, test_addr 0000002c, read_val 00000800, device 0

ddr_test_fulldata: 13: wrote 00001000, test_addr 00000030, read_val 00004000, device 0

ddr_test_fulldata: 14: wrote 00002000, test_addr 00000034, read_val 00004000, device 0

ddr_test_fulldata: 15: wrote 00004000, test_addr 00000038, read_val 00008000, device 0

ddr_test_fulldata: 16: wrote 00008000, test_addr 0000003c, read_val 00008000, device 0

ddr_test_fulldata: 17: wrote 00010000, test_addr 00000040, read_val 00040000, device 0

ddr_test_fulldata: 18: wrote 00020000, test_addr 00000044, read_val 00040000, device 0

ddr_test_fulldata: 19: wrote 00040000, test_addr 00000048, read_val 00080000, device 0

ddr_test_fulldata: 20: wrote 00080000, test_addr 0000004c, read_val 00080000, device 0

ddr_test_fulldata: 21: wrote 00100000, test_addr 00000050, read_val 00400000, device 0

ddr_test_fulldata: 22: wrote 00200000, test_addr 00000054, read_val 00400000, device 0

ddr_test_fulldata: 23: wrote 00400000, test_addr 00000058, read_val 00800000, device 0

ddr_test_fulldata: 24: wrote 00800000, test_addr 0000005c, read_val 00800000, device 0

ddr_test_fulldata: 25: wrote 01000000, test_addr 00000060, read_val 04000000, device 0

ddr_test_fulldata: 26: wrote 02000000, test_addr 00000064, read_val 04000000, device 0

ddr_test_fulldata: 27: wrote 04000000, test_addr 00000068, read_val 08000000, device 0

ddr_test_fulldata: 28: wrote 08000000, test_addr 0000006c, read_val 08000000, device 0

ddr_test_fulldata: 29: wrote 10000000, test_addr 00000070, read_val 40000000, device 0

ddr_test_fulldata: 30: wrote 20000000, test_addr 00000074, read_val 40000000, device 0

ddr_test_fulldata: 31: wrote 40000000, test_addr 00000078, read_val 80000000, device 0

ddr_test_fulldata: 32: wrote 80000000, test_addr 0000007c, read_val 80000000, device 0

Altera_Forum · ‎08-29-2010

So, I did a bunch of testing and experimenting and I need help

to clarify things that may put this thing to rest.

Firstly, I have only NIOS, DDR, On-chip mem, and JTAG in the system

and I have reduced the clocks for all possible to the 50MHz sysclk.

With this, all critical warnings have been removed as previously stated.

The result with 4x address incrementing is like this:

1: wrote 00000001, test_addr 00000000, read_val 00000004, device 0

2: wrote 00000002, test_addr 00000004, read_val 00000004, device 0

3: wrote 00000004, test_addr 00000008, read_val 00000008, device 0

4: wrote 00000008, test_addr 0000000c, read_val 00000008, device 0

5: wrote 00000010, test_addr 00000010, read_val 00000040, device 0

6: wrote 00000020, test_addr 00000014, read_val 00000040, device 0

7: wrote 00000040, test_addr 00000018, read_val 00000080, device 0

8: wrote 00000080, test_addr 0000001c, read_val 00000080, device 0

... etc, etc

and with 8x address incrementing is like this:

1: wrote 00000001, test_addr 00000000, read_val 00000001, device 0

2: wrote 00000002, test_addr 00000008, read_val 00000002, device 0

3: wrote 00000004, test_addr 00000010, read_val 00000004, device 0

4: wrote 00000008, test_addr 00000018, read_val 00000008, device 0

5: wrote 00000010, test_addr 00000020, read_val 00000010, device 0

6: wrote 00000020, test_addr 00000028, read_val 00000020, device 0

7: wrote 00000040, test_addr 00000030, read_val 00000040, device 0

8: wrote 00000080, test_addr 00000038, read_val 00000080, device 0

etc, etc, ...

8x incrementing appears to give correct results, but since we only get 4 bytes per read from NIOS, that means we are missing the other 4 bytes, assuming the addresses provided to the _32DIRECT RD/WR macros are byte addresses.

I confirmed that with 8x addr incrementing I cannot write past the span of 0x10000000 (256MB, which is the size of the device).

I used SignalTap to look at the *local* signals and addresses and this

seems the root of the problem.

When using 4x incrementing then the local_address increases as

0,0,1,1,2,2,3,3, etc. for every local_read_req pulse.

When using 8x incrementing then the local_address increases as

0,1,2,3, etc.

So, here, 8x looks correct, but I can only access up to the span of 0x10000000 (256MB if byte addressed) before NIOS returns 0xFFFFFFFF as data. Since the devices total 256MB, half the addresses are apparently not available.

Note for the reads the data for one of the 8x reads on this local_rdata

bus (64 bits wide) is:

0x00000002_00000002

Note for the reads the data for one of the 4x reads on this local_rdata

bus (64 bits wide) is:

0x00000004_00000004

How do I access ____ALL____ 256MB of this device.

The emi_ddr_ug.pdf by Altera shows local_address

on page 121 which talks about "LSB of column address on memory side

is ignored" which may be of some relevance.

Does this mean we ____WILL____ only see half of our memory???

Thank-you.

Altera_Forum · ‎08-30-2010

Has anyone had this same experience as in my post# 10?

Thank-you.

Altera_Forum · ‎08-30-2010

There appears to be a somewhat similar thread at:

http://www.alterauserforums.org/forum/showthread.php?t=24132

with title: "DDR SDRA IP local address" where BadOmen helped out

another poor soul.

Difference here is that I am using a fully enclosed SOPC system where

all code is generated by the tools which I shouldn't have to touch.

Help. :)

Altera_Forum · ‎08-30-2010

Can you copy your test code into this post, just to make sure it's not a software issue? Also, when you say that you dropped the clock speed down to 50MHz, you left the SDRAM operating at a faster clock speed, correct? DDR-SDRAM has a minimum clock frequency of 77MHz if I remember correctly; DDR2-SDRAM minimum frequency is 125MHz, I think.

Also I still highly recommend simulating this design, it'll show you what the accesses look like on the fabric and will not be affected by timing issues. The Nios II software will get pulled into the simulation so there isn't much you have to do besides run a macro or two in Modelsim before starting the simulation.

Altera_Forum · ‎08-30-2010

Thank-you for the reply.

Will attach the code. Code will eventually be for two 32 bit devices

so there are a few small hacks related to this but otherwise fairly clean code.

Code is a c file but changed to .txt for upload.

>>>>Also when you say that you dropped the clock speed down to 50MHz, you left the

>>>>SDRAM operating at a faster clock speed correct? DDR-SDRAM has a minimum clock

>>>>frequency of 77MHz if I remember correctly, DDR2-SDRAM minimum frequency is

>>>>125MHz I think.

DDR2 is still running at 200MHz.

I will take a look simulating this.

Regards.

Altera_Forum · ‎08-30-2010

Your 'test_addr' goes 1, 2, 4, 8, 16, 32, 64, 128 ,etc... This doesn't seem to match the output that you showed above where the address just increments four or eight bytes at a time. I think you meant to write 1, 2, 4, 8, 16, etc... as the data but only move four/eight bytes at a time.

This is the line(s) I'm talking about:

test_addr = 8 * (1 << i);

Altera_Forum · ‎08-31-2010

For the output above where the address just increments four or eight bytes at a time:

this is output from the function ddr_test_fulldata() so test_addr = 8*i

(ie. 0x0,0x8,0x10,0x18, etc) which matches the 8x address incrementing above

for ddr_test_fulldata().

For test_addr = 8 * (1 << i); we have 1 shifting left i times and then multiply by 8

(ie. 0x8, 0x10,0x20,0x40, etc).

>>>Your 'test_addr' goes 1, 2, 4, 8, 16, 32, 64, 128 ,etc...

I am not sure where this is coming from. :(

The output was:

1: wrote 00000001, test_addr 00000000, read_val 00000001, device 0

2: wrote 00000002, test_addr 00000008, read_val 00000002, device 0

3: wrote 00000004, test_addr 00000010, read_val 00000004, device 0

4: wrote 00000008, test_addr 00000018, read_val 00000008, device 0

5: wrote 00000010, test_addr 00000020, read_val 00000010, device 0

yadda, yadda.

Perhaps my awkward reporting order/style of: wrote_val:test_addr:read_val

is confusing. ie. test_addr is not the first column(it is the middle column).

Sorry if I have not been clear.

Altera_Forum · ‎08-31-2010

Thinking out loud here but assuming there is an issue with the system SOPC

is generating then I wonder if increasing the column address width by 1 would help.

Then access as 8x and my span would be twice so I could see all the memory.

Downside would be that more SOPC span is lost but that would be ok.

??

Altera_Forum · ‎09-02-2010

Ok, I started an SR with Altera and they said:

>>When NIOS II (32 bit Master) is trying to access your memory controller (64 bit slave), Master Byte Address will need to be at incremental of 8.

>>Could you please refer to Avalon Interface specification document for better explanation. http://www.altera.com/literature/manual/mnl_avalon_spec.pdf Table 3-3

>>Such incremental step is required for Avalon bus interfacing between NIOS-II and your memory controller local bus, and it is not related with memory burst length setting (x4 or x8) in memory controller.

So, this confirms my suspicion and experiments that the address needs to increment by 8.

But note you still only get 4 bytes of data per address increment of 8.

In the same SR update I had asked two other questions which did not get answers

(common with Altera SR feedback depending on the support individual):

1) ie. how do I access ____ALL____ 256MB of this device.

2) Does this mean we ____WILL____ only see half of our memory???

So, the span generated is 0x10000000 for the 256MB DDR, and I cannot access past this span. With address increments of 8 I will therefore only see half the memory, and would need a span twice this size to see it all. No? Otherwise, how do I see all the memory?

Thoughts?

Regards.

Altera_Forum · ‎09-10-2010

Still no resolution to this :(

Here is some more data...

(sorry about the screwed up formatting in the table here)

For writes(B = byte, L = local):

ACCESS  BADDR  BDATA  LADDR  BE  WDATA
1       0      1      0      0F  0000000100000001h
2       4      2      0      F0  0000000200000002h
3       8      4      1      0F  0000000400000004h
4       c      8      1      F0  0000000800000008h
5       10     10     2      0F  0000001000000010h
6       14     20     2      F0  0000002000000020h

...

Behavior for reads is same but at the time of local_rdata_valid

the data on local_rdata is:

ACCESS  local_rdata
1       0000000400000004h
2       0000000400000004h
3       0000000800000008h
4       0000000800000008h
5       0000004000000040h
6       0000004000000040h

...

For all these:

local_size = 1

Clearly not a fabric issue.

Issue must reside in controller(and/or the way it is configured)

or in my hardware. From here on I will only test with x4 addressing.

Currently, I am doing all the writes and then all the reads but if I do the writes and reads back to back then it works ... temporarily.

ie. if I go write,write,write,... then read,read,read,... then no good

if I go write,read,write,read, ... then ok (until I read 'em all after this)

So, it would seem that subsequent writes are stepping on the previous writes.

When I do x8 accesses BE is always 0x0F and when I do x4 accesses then BE

alternates between 0x0F and 0xF0 as expected and I think it is when I mix these two types of accesses that things go awry.

If I do x8 accesses starting at addr 0 (ie. 0,8,0x10,0x18, etc) then _nothing_ gets stomped on.

If I do x8 accesses starting at addr 4 (ie. 4,0xC,0x14,0x1C, etc) then _nothing_ gets stomped on.

However, when some writes use BE 0x0F and others use BE 0xF0, that is when things go awry.

I will clarify for you how the hardware is hooked up:

o I have two 16 bit DDR2 devices.

o The same address bus goes to both DDR A0-A12

o The same BA goes to both DDR BA0-BA2

o The same CK/CK# go to both DDR CK/CK#

o The same CS,WE,RAS,CAS go to both

o DQS0,1 go to device 0 LDQS,UDQS and DQS2,3 go to device 1 LDQS,UDQS

o ODT is also shared

o DQ0-15 go to device 0, DQ16-31 go to device 1.

o LDM,UDM for both devices is grounded through a resistor.

Of course, the controller is configured as a 32 bit single device

with 1 CS, 8 DQ per DQS, drive DM pins=NO, 1 output clk pair.

Thanks.

Altera_Forum · ‎09-11-2010

Devices I am using are MT47H64M16

Configuration: 8 Meg x 16bit x 8 banks (ie. 1Gb, 128MB, or 64M x 16bit locations)

Row Addr: A[12:0] (8K)

Bank Addr: BA[2:0] (8)

Col Addr: A[9:0] (1K)

Total of row/bank/col = 8Kx8x1K = 64M x 16bit locations(according to the datasheet).

Now, given the way I have connected the two devices we still have the

same number of locations but now the density doubles since the

number of DQ's has doubled.

In order to access every location I need access to all address locations

(row,bank,col) but the datasheet for the controller says for its example:

>>>>>local_address[23:11] = row address[12:0]

>>>>>local_address[10:9] = bank address [1:0]

>>>>>local_address [8:0] = column address[9:1]

>>>>>The least significant bit (LSB) of the column address (multiples of four) on

>>>>>the memory side is ignored, because the local data width is twice that of the

>>>>>memory data bus width.

Is it not true that I need to be able to toggle the LSB of the column address

to access all the memory? To me it would seem absolutely necessary.

With consideration to my previous post:

I am trying to understand how the given accesses show up on the mem side.

The datasheet does not really cover my case clearly with the aid of timing diagrams.

Here was the data in question:

For writes(B = byte, L = local):

ACCESS  BADDR  BDATA  LADDR  BE  WDATA
1       0      1      0      0F  0000000100000001h
2       4      2      0      F0  0000000200000002h
3       8      4      1      0F  0000000400000004h
4       c      8      1      F0  0000000800000008h
5       10     10     2      0F  0000001000000010h
6       14     20     2      F0  0000002000000020h

...

I would imagine that all of these would be single 32 bit accesses

to the memory(which is 32 bits wide) and thus from one cycle

to another the only things that should change would be the mem_DQ's and the

mem_addr. I would expect all other signals to behave the same

(until bank end is reached) so on the mem interface things should not

look different whether BE is 0x0F or 0xF0 but something is definitely

not functioning correctly here.

It is my contention that there are bugs in the controller for how

varying BE transfers are handled.

Here is some more data:

1: wrote 00000001, test_addr 00000000, read_val fffffffe, device 1

2: wrote 00000002, test_addr 00000008, read_val fffffffd, device 1

3: wrote 00000004, test_addr 00000010, read_val fffffffb, device 1

4: wrote 00000008, test_addr 00000018, read_val fffffff7, device 1

5: wrote fffffffe, test_addr 00000004, read_val fffffffe, device 1

6: wrote fffffffd, test_addr 0000000c, read_val fffffffd, device 1

7: wrote fffffffb, test_addr 00000014, read_val fffffffb, device 1

8: wrote fffffff7, test_addr 0000001c, read_val fffffff7, device 1

For this test I wanted to see what would happen to four BE=0x0F (x8) writes

if I followed them with four BE=0xF0 (x8 but with offset 4 addrs).

Here I do the 8 writes and then read them back at the end in the order

they were written.

Sure enough the data from 1-4 get stomped on by the writes 5-8.

If I read 1-4 back before the 5-8 writes then all is good.

It is almost as if the local_be(byte enables) are being completely ignored

by the controller.

For all accesses, local_size is 1 and local_burstbegin toggles for one cycle

with the local_write_req.

I welcome any feedback.

Thank-you.

Altera_Forum · ‎09-11-2010

Hmmm, I wonder if accesses were made to be 64 bit on the local bus which means

that it would use local_be = 0xFF. I wonder if that would work.

Perhaps using one of the DMA controllers.

What's the easiest way to test this theory?