Honored Contributor I

DMA on a DE0

Hi all, 

 

Can someone verify for me that they have DMA to SDRAM working on a DE0 board ? I've been trying to get it to work for a couple of days, but no joy. Here's how it's set up in SOPC builder: 

 

http://0x0000ff.com/imgs/fpga/dma.png  

 

... and all I'm doing is instantiating the SOPC system, then creating a BSP from it with the included 'memtest.c' example. What I see when typing in values that won't conflict with the code (since the test is destructive) is that everything passes, but the DMA hangs... 

 

http://0x0000ff.com/imgs/fpga/dma-out.png  

 

'ramClock' is delayed relative to 'cpuClock' (the clock phase shift on 'c1' is set to -3ns). I'm not sure if it matters, but the 'compensated for' clock is set to 'c0' (which is the cpuClock output).  

 

I am getting some timing violations on the 'altera_reserved_tck' clock since switching to the 'standard' Nios2 CPU, but the other clocks in the system seem to be within range, and the 'altera_reserved_tck' clock seems to be just for JTAG anyway... 

 

http://0x0000ff.com/imgs/fpga/dma-clocks.png  

 

Assuming the clock isn't the cause, is there anything else I ought to be doing ? Do you need any other support modules to get DMA working ?  

 

Assuming the clock is the cause, is there a way to isolate the 'altera_reserved_tck' clock to just the JTAG circuit ?  

 

This is all using Quartus 10.1, if it matters. That's the revision I've had most success with, to date. 

 

Any help gratefully appreciated :) 

 

Simon
14 Replies
Honored Contributor I

While your code is running, the stack and heap could be in the range you tested, so I wouldn't recommend testing this way. There is a smaller version of the memtest software that would fit into the on-chip memory if you boost the size a bit. The readme text for that software template should tell you how big the code footprint needs to be (it's either 4 or 8 kB... I forget).

Honored Contributor I

Yes, I saw the 'small' variant, but it doesn't do any DMA testing. I was trying to keep to the supplied code (to reduce the number of things I could have done wrong :) ) 

 

The code I'm using does do a non-DMA test, which works fine. It's only when it starts the DMA test that the board hangs. I was also using offsets from top/bottom of RAM to allow the stack/heap to have space to work in. I did try using a small section in the middle of RAM as well (0xA00000 -> 0xB00000), and had the same problem. 

 

The goal is to have DMA sending video from SDRAM to a VGA core, and I want to try and get it working with a presumably "known good" solution before adding in my VGA core... 

 

I did look at the university program stuff, because it sets up a PLL for the SDRAM and the CPU, but it's all wrapped up inside a custom component, so I don't think I can get at how they've configured it, to check if my -3ns is the correct phase offset. Perhaps I'll just try various offsets and see if it helps... 

 

Cheers 

Simon
Honored Contributor I

Ah you are correct, this whole time I thought the small memtest used the DMA. I think if you took the memtest code, and removed the flash testing stuff and switched the software over to using small printfs you could probably get it to fit in an 8k on-chip memory. 

 

If your VGA controller has a streaming input then maybe something like this would work well for you: http://www.alterawiki.com/wiki/modular_sgdma_video_frame_buffer 

 

Instead of phase shifting the SDRAM clock, I would recommend writing .sdc constraints for the SDRAM. The fitter will move the logic of the SDRAM controller around to meet the off-chip timing. This will require you to read the SDRAM device data sheet to find out its timing, so that you can key the numbers into the custom timing constraints. If you are new to Timequest this might take a while to learn, and maybe what I refer to as the "lick your finger and hold it up to the wind" method (a phase-shifted clock) would be the quickest solution for you.
Honored Contributor I

 

--- Quote Start ---  

Ah you are correct, this whole time I thought the small memtest used the DMA. I think if you took the memtest code, and removed the flash testing stuff and switched the software over to using small printfs you could probably get it to fit in an 8k on-chip memory. 

--- Quote End ---  

 

 

I may give that a go, but I'm pretty sure I'm not over-writing anything at the moment - I'm printing out the addresses I'm using and... 

  1. the MemDMATest is the last thing to allocate any RAM (for the DMA buffers) and they're well below the area I'm testing.  

  2. the stack ought to be nowhere near where I'm testing either. The Memory extends from 0x800000 to 0xFFFFFF, and I'm only testing from 0xA00000 to 0xB00000 at the moment. 

 

 

One of the runs looks like: 

 

--- Quote Start ---  

Testing RAM from 0xA00000 to 0xB00000 

-Data bus test passed 

-Address bus test passed 

-Byte and half-word access test passed 

-Testing each bit in memory device. . . passed 

-Testing memory using DMA. 

channels ... tx handle=0x816e28, rx handle=0x816e44 

DMA write buffer:0x8081a2a0, read buffer:0x8081b2a8 

[tx: 0xa00000](tx:rc=0)(rx:rc=0) 

--- Quote End ---  

 

 

You can see that the read/write buffers are way outside the testing area, and I can't believe the stack has extended down from 0xFFFFFF to impact on 0xB00000. I'm not currently trying to exhaustively test DMA to/from the SDRAM, just get it working at all... 

 

Still, if there's a shadow of doubt, it may be worth trying :) 

 

 

--- Quote Start ---  

If your VGA controller has a streaming input then maybe something like this would work well for you: http://www.alterawiki.com/wiki/modular_sgdma_video_frame_buffer 

 

--- Quote End ---  

 

Yes, I saw that one :) I was going to try and get it working with "simple" DMA before going to the scatter/gather option... It would be nice to start it going and forget about it from then on, though. I also wasn't sure whether a DDR SDRAM would be required, or whether it's just what the design was demonstrated on. The DE0 only has an SDR SDRAM :( 

 

 

--- Quote Start ---  

Instead of phase shifting the SDRAM clock I would recommend writing .sdc constraints for the SDRAM instead. The fitter will move the logic of the SDRAM controller around to meet the offchip timing. This will require you to read the SDRAM device data sheet to find out it's timing so that you can key them into the custom timing constraints. If you are new to Timequest this might take a while to learn and maybe what I refer to as the "lick your finger and hold it up to the wind" method (phase shifted clock) would be the quickest solution for you. 

--- Quote End ---  

 

 

New to Timequest, and to just about everything else here... I was actually pretty pleased with myself for getting the PLL clocks worked out in Timequest [grin].  

 

However, I've just tried setting the phase-shift to -2.5ns and then -3.5ns, without any success on either setting, and it reads just fine at either setting when not using DMA. I guess DMA puts the SDRAM under a lot more strain than just single reads/writes... 

 

Are there any "gotchas" that newbies like myself might be tripped up by, that seasoned designers do as a matter of course ? Or is it really just down to the SDRAM being *that* picky about working ? 

 

Cheers 

Simon
Honored Contributor I

Hmm - ok, looking around the forum, I found: http://www.alteraforum.com/forum/showthread.php?t=1269&page=2 which has some fairly clear docs on how to constrain the SDRAM using Timequest, so if I did: 

 

 

--- Quote Start ---  

create_clock -period 20.000 -name ext_clk [get_ports {CLOCK_50}] 

derive_pll_clocks 

derive_clock_uncertainty 

 

set sdram_clk {gpu|the_altpll_0|sd1|pll7|clk[0]} 

 

create_generated_clock -name sdram_clk_pin \ 

    -source $sdram_clk -offset 0.5 [get_ports {sdram_clk}] 

 

 

--- Quote End ---  

 

 

... using a 0.5ns offset for PCB routing. Reading the datasheet (the DRAM on a DE0 is an A3V64S40ETP chip, with the speed-class -6, so it can run at 166MHz), and using CAS latency=3, it seems  

 

  • Max Clk-to-valid-output-delay = 5.4ns 

  • Min Output data-hold time = 2.5ns 

  • Min setup = 1.5ns 

  • Min hold = 1 ns 

 

 

... so if I further add: 

 

--- Quote Start ---  

 

set_input_delay -clock sdram_clk_pin -max [expr 5.4 + 0.6] \ 

    [get_ports {DRAM_CAS_N DRAM_RAS_N DRAM_CS_N DRAM_WE_N DRAM_ADDR[*]}] 

 

set_input_delay -clock sdram_clk_pin -min [expr 2.5 + 0.4] \ 

    [get_ports {DRAM_CAS_N DRAM_RAS_N DRAM_CS_N DRAM_WE_N DRAM_ADDR[*]}] 

 

set_output_delay -clock sdram_clk_pin -max [expr 1.5 + 0.6] \ 

    [get_ports {DRAM_CAS_N DRAM_RAS_N DRAM_CS_N DRAM_WE_N DRAM_ADDR[*]}] 

 

set_output_delay -clock sdram_clk_pin -min [expr -1.0 + 0.4] \ 

    [get_ports {DRAM_CAS_N DRAM_RAS_N DRAM_CS_N DRAM_WE_N DRAM_ADDR[*]}] 

 

--- Quote End ---  

 

 

... it ought to self-constrain without me having to use a second clock with an offset ?  

 

Questions: 

 

  • I don't really see why DRAM_BA and DRAM_{L,U}DQM aren't part of the ports listed. I'm following the recommendations in the document, but would love to know :)  

  • I'm guessing that the DRAM_DQ data lines aren't specified because they'll be latched at some time during the waveform that the other signals comprise, so they're less important to get synchronised. 

  • DRAM_CKE isn't mentioned anywhere either. Ought I be putting this signal into the list of ports above ? 

 

 

Cheers 

Simon
Honored Contributor I

Typically I use the same clock for the controller as the one I send off chip, since the constraints will drive the fitter to move things around for you. If timing is still an issue, mixing timing constraints with a phase-shifted off-chip clock can sometimes solve the problem. In that case you are trading off either read or write timing against the other, which is also what the -3.5ns phase shift that people blindly add to their designs does (but without constraints you have no idea whether it really works). 

 

I haven't read that document before so I can't really comment but I would think you need to set the same constraints across all the I/O for the interface. Here is another doc that I highly recommend: http://www.alterawiki.com/wiki/timequest_user_guide 

 

Also, I should mention that if your SDRAM interface does not meet timing, this will not affect the DMA (besides the fact that it might transfer corrupt data). So the DMA getting stuck would be caused by other things, like it accessing address space it shouldn't, or maybe the software getting clobbered, etc...
Honored Contributor I

Well, since this is such a simple design, I decided to try out Quartus II v11 and see if that helped matters, especially since you say the DMA ought not hang... 

 

Long story short: it works in v11. I can't see anything I'm doing differently (apart from minor things - there's a clock-bridge component rather than directly exporting the PLL's sdram clock). In any event, the last two times I've tried it, the SDRAM passes its DMA test as well as its CPU test... 

 

http://www.0x0000ff.com/imgs/fpga/dma-success.png  

 

Note that I'm testing the same memory-area as before. This is working at 100MHz as well, not just at the 50MHz I was trying on Quartus 10.1 

 

Oh well, should have tried the later version earlier :( Still, glad it's working now :) 

 

Simon
Honored Contributor I

Perhaps the DMA or bridge suffered from the same bug the SGDMA used to have, and it was fixed in 11.0 (it's been a while, so I forget). This bug would prevent the master from issuing a request until waitrequest is low, so as you can imagine this could cause the master to get stuck. Masters shouldn't wait for waitrequest to de-assert, so this was addressed in a few cores. 

 

To make sure the memory timing has some slack (if you decide not to use Timequest constraints) I would try a few other clock phases to make sure you are not on the edge of not meeting timing. 

 

 

Good luck with the rest of your project.
Honored Contributor I

Ok, same topic but a different subject :) 

 

Can anyone verify for me that the below code is what you'd expect to use to DMA a frame of video to the FIFO connected at the input of the VGA controller... I'm trying to narrow down whether I have a software problem or a hardware problem...  

 

The symptoms are that my chip-select line on the VGA controller never seems to go high (it's a 'chipselect' type, not a 'chipselect_n' type). I connected it (via a counter) to the LEDG array of LEDs on the board, and there wasn't any activity. Since I'm apparently not getting a chip-select, I'm assuming that nothing is being sent to the VGA controller, hence the question about the code :) 

 

#include <stdio.h>
#include <sys/alt_dma.h>
#include "system.h"

static alt_dma_txchan _txchan;
static unsigned char *_vram = (unsigned char *)0xE00000;

static void nextFrameRequired(void *handle)
{
    static int count = 0;
    int rc;

    if (count % 60 == 0)
        printf("x");
    count++;

    // Re-schedule a DMA transfer, and call this callback again
    if ((rc = alt_dma_txchan_send(_txchan, _vram, 640*480,
                                  nextFrameRequired, NULL)) < 0)
    {
        printf("Failed to repost transmit request, reason = %i\n", rc);
    }
}

int main(void)
{
    int rc;             // Result code
    int failed = 0;

    printf("\n\nStarting up the VGA controller\n");

    // Start a transfer to the VGA controller
    if ((_txchan = alt_dma_txchan_open(DMA_NAME)) == NULL)
    {
        printf("Failed to open DMA channel to VGA controller\n");
        failed = 1;
    }

    // We want to write to the VGA controller FIFO at VGA_BASE so
    // set the DMA transactions to be WRITE_ONLY
    if ((rc = alt_dma_txchan_ioctl(_txchan, ALT_DMA_TX_ONLY_ON,
                                   (void *)VGA_BASE)) < 0)
    {
        printf("Failed to configure DMA, reason = %i\n", rc);
        failed = 1;
    }

    // We want bytes to be written to the fifo
    if ((rc = alt_dma_txchan_ioctl(_txchan, ALT_DMA_SET_MODE_8, 0)) < 0)
    {
        printf("Failed to configure byte-writes\n");
        failed = 1;
    }

    // Schedule a DMA transfer, and enable a callback that will
    // reschedule a transfer when the current one is done. We transfer
    // from the top of memory
    if ((rc = alt_dma_txchan_send(_txchan, _vram, 640*480,
                                  nextFrameRequired, NULL)) < 0)
    {
        printf("Failed to post transmit request, reason = %i\n", rc);
        failed = 1;
    }

    printf("DMA scheduler activated. Failed = %d, Looping\n", failed);

    // Loop forever
    while (1)
    {
        int i;
        int pixel = 0;
        unsigned char *vram = _vram;

        for (i = 0; i < 640*480; i++)
        {
            *vram++ = pixel++;
            if (pixel == 21)
                pixel = 0;
        }
    }
}

 

The RAM location (0xE00000) passes the DMA test from the Altera-supplied memtest.c template, so I ought to be able to read from there. If anyone is interested enough to look, the complete files are at  

 

 

Thanks in advance for any help :) 

 

 

Simon
Honored Contributor I

Make sure you have lots of buffering in the video output, otherwise the time between frames might be long enough that you end up with a buffer underflow. I would simulate your design to see whether or not the DMA is behaving as expected. Also, your code fragment doesn't do any cache bypassing when it's writing to the frame buffer, so if you have the data cache enabled I would address that too. As far as I can tell the code sets the DMA up correctly, but I haven't used that DMA for around 5 years so I could be overlooking something. 

 

To do frame buffering this is way easier.... http://www.alterawiki.com/wiki/modular_sgdma_video_frame_buffer
Honored Contributor I

Okay okay, I give up [grin]. The modular SGDMA it is - I can take a hint, honest guv, once I've been beaten about the head with it a few times :) 

 

So, I tried porting the NEEK-based SGDMA - not the frame-buffer one, since I don't want to use the video-processing pipeline (the licence is too expensive for my bank account!), just the normal DMA one. 

 

I couldn't get it to work last night (to be fair, I didn't have too much time to try). I wrote a top-level Verilog module to instantiate the system (I've never used schematic entry) and linked it up with the SDRAM controller rather than the DDR SDRAM controller. It generated the system and loaded onto the DE0, and the software program compiled, but it failed to verify when using 'nios2-download -g'.  

 

So, there's one (or more, I guess :)) of three things wrong that I can think of: 

 

- The code is being overwritten somehow. It's being loaded in at 0x0 (where the SDRAM starts) and perhaps something else likes that memory-location. I might try relocating SDRAM to a higher location. 

 

- My SDRAM clock is not synchronising correctly. I tried to understand the constraint-based sdc file included, but it seems to be tied up with the DDR controller. I don't have one of those :(. I did try changing the SDRAM clock offset to -2.5, -3, -3.5 ns without any success - and it's worked at all those settings for me before. 

 

- My reset logic is screwed up somehow, and the CPU is never coming out of reset.  

 

My first choice was to use the on-chip RAM, but I can only make 32k of RAM, and the test program doesn't seem to fit into that (I get link-time errors), so the code has to be in SDRAM. I did change the RAM_BASE and RAM_SPAN so that the tested-area of RAM starts at 1MB and extends for 6MB. It's *possible* that isn't leaving enough space for the code/stack but I doubt it. 

 

I'll have another look at it tonight. Maybe I just ought to go buy a NEEK... 

 

Thanks for all the advice so far, by the way ... It may be frustrating at times, but I am (slowly) learning stuff, and that's why I started doing this :) 

 

Simon.
Honored Contributor I

If you get a message about "m_state" ........ and a bunch of text that is not really all that useful, then I would start suspecting that reset could be an issue. If the download just says that validation failed, it could be any number of things, including a memory problem. 

 

The SDR SDRAM controller doesn't make use of the first 0x20 words in the memory for calibration, so you don't have to worry about that. I would recommend using a non-volatile memory for the reset vector, since that's where the processor starts fetching instructions when coming out of reset. 

 

Your code probably wouldn't be located anywhere but the first ~80kB of the memory however the heap is probably located in the memory you are testing so I wouldn't recommend doing that. You could hack the memtest software to allocate the memory instead of having it ask you over the terminal. This way it'll be safe for the test to clobber it since it was either allocated at compile time or at run time.
Honored Contributor I

Okay, it seems there was a fourth option for why it wasn't working... blind idiocy... In my defence, it was 1am when I was playing with this ... but I'd been running the 'nios2-configure-sof' command and it was picking up the 'xxx.sof' file, not the 'xxx_time_limited.sof' file... 

 

When I removed the xxx.sof file (with some prejudice ...) it all seems to work well - at least if I run at <= 100 MHz. I can't get the SDRAM to go higher than that, even though Quartus says fMax is 144.11MHz, and even if I perturb the sdram-clock delay around a median of -3ns. 

 

Anyway, now I'm getting: 

 

 

--- Quote Start ---  

Test complete with a throughput of 39MB/s. 

Test complete with a throughput of 39MB/s. 

Test complete with a throughput of 38MB/s. 

Test complete with a throughput of 39MB/s. 

Test complete with a throughput of 38MB/s. 

Test complete with a throughput of 39MB/s. 

Test complete with a throughput of 39MB/s. 

Test complete with a throughput of 39MB/s. 

Test complete with a throughput of 39MB/s. 

Test complete with a throughput of 39MB/s. 

Test complete with a throughput of 39MB/s. 

Test complete with a throughput of 39MB/s. 

Test complete with a throughput of 39MB/s. 

Test complete with a throughput of 39MB/s. 

Test complete with a throughput of 39MB/s. 

Test complete with a throughput of 39MB/s. 

Test complete with a throughput of 38MB/s. 

Test complete with a throughput of 39MB/s. 

 

--- Quote End ---  

 

 

... which is roughly 20% of the theoretical bandwidth of the SDRAM. I've noticed you saying that you've seen almost 100% utilization previously, so I presume there are things I can do to tweak that to make it better. Presumably the fifo interface to the real VGA controller (rather than this test) would help matters as well. 

 

So, now I have a working modular SGDMA system on the DE0, I can start trying to get the VGA core to interface to it :) The one in the NEEK is way more general than I need - the DE0 only has 4 bits each for R,G,B - so I'll be adopting a standard colour definition of 16 bits (4 each for R,G,B, 2 for alpha, 2 for depth), which removes the need for re-sampling, colour conversion, etc.  

 

Once that is working, I can start to implement the Blitter/GPU side of things, and it starts to get really interesting :) 

 

Thanks for all the help :) Oh, and it may not be a great implementation (the .sdc file is basic ...) but if you want the DE0 project to go alongside the Neek one, you're welcome to it :) 

 

Simon
Honored Contributor I

The reason the performance is so bad is that the read and write masters are ping-ponging for access to the SDRAM. If you increase the arbitration share of each master, that should reduce the effect. If you were reading from the memory and writing the data somewhere else, the problem would also go away. In the end your VGA output will be the bottleneck, so as long as you have enough memory bandwidth to keep the video pipeline fed you should be fine. 

 

The only way to get the SDRAM operating reliably above 100MHz would be to constrain the interface, since Quartus II has no clue what kind of off-chip timing relationships are needed (that's what the constraints are for).