Re: Tightly Coupled OnChip DMA

Altera_Forum · ‎02-11-2014

Hi,

I've got some very strange 'performance behavior' with a tightly coupled on chip memory used as data storage. In order to speed up my design, I've created a tightly coupled on chip memory where I calculate coefficient values while a DMA controller copies new data into another portion.

In my first attempt I used the DMA synchronously, meaning that I wait for the DMA to finish copying and then do my calculations.

In my second attempt I copy while I do my calculations, but this gives me zero performance gain. How can that be? It almost seems as if the NIOS stalls while copying takes place.

Can anyone give me a good explanation on what's going on?

Thanks

Altera_Forum · ‎02-11-2014

Is the TCM single or dual port?

If it's a single port RAM, then it will definitely stall while the copy takes place.

If it's a true dual port, you should be able to make it work, just make sure the NIOS is attached to one port and the DMA controller is attached to the other.

Pete

Altera_Forum · ‎02-12-2014

Dual Port, TCM only allows one memory master connected to the TCM :).

The DMA Controller is the mSGDMA from the Altera Wiki. Seems all very strange.

Altera_Forum · ‎02-12-2014

Maybe the copy time isn't actually significant.

You could try requesting the copy twice and see how much that slows it down.

Altera_Forum · ‎02-12-2014

Hi,

I already tried that too. Here is a code snippet that shows some of the strange behavior. In the code below, the eight separate DMA copy calls are significantly faster than wrapping it all in one single DMA call. By significantly faster I mean 1400 ms per execution cycle of my algorithm instead of 1600 ms.


memcpy_read_dma_async(c_off(0,0,synth->width), &(synth->pixel),synth->width*sizeof(alt_16), DESCRIPTOR_CONTROL_EARLY_DONE_ENABLE_MASK);
memcpy_read_dma_async(c_off(1,0,synth->width), &(synth->pixel),synth->width*sizeof(alt_16), DESCRIPTOR_CONTROL_EARLY_DONE_ENABLE_MASK);
memcpy_read_dma_async(c_off(2,0,synth->width), &(synth->pixel),synth->width*sizeof(alt_16), DESCRIPTOR_CONTROL_EARLY_DONE_ENABLE_MASK);
memcpy_read_dma_async(c_off(3,0,synth->width), &(synth->pixel),synth->width*sizeof(alt_16), DESCRIPTOR_CONTROL_EARLY_DONE_ENABLE_MASK);
memcpy_read_dma_async(c_off(4,0,synth->width), &(synth->pixel),synth->width*sizeof(alt_16), DESCRIPTOR_CONTROL_EARLY_DONE_ENABLE_MASK);
memcpy_read_dma_async(c_off(5,0,synth->width), &(synth->pixel),synth->width*sizeof(alt_16), DESCRIPTOR_CONTROL_EARLY_DONE_ENABLE_MASK);
memcpy_read_dma_async(c_off(6,0,synth->width), &(synth->pixel),synth->width*sizeof(alt_16), DESCRIPTOR_CONTROL_EARLY_DONE_ENABLE_MASK);
memcpy_read_dma(c_off(7,0,synth->width), &(synth->pixel),synth->width*sizeof(alt_16));
	for(int n = 0; n < synth->width; n++)
	{
		// lift 2's
		/* *c_off(0,n,synth->width) -= HWLIFT2(*c_off(1,n,synth->width), *c_off(1,n,synth->width));
		*c_off(2,n,synth->width) -= HWLIFT2(*c_off(1,n,synth->width), *c_off(3,n,synth->width));
		*c_off(4,n,synth->width) -= HWLIFT2(*c_off(3,n,synth->width), *c_off(5,n,synth->width));*/
		*c_off(0,n,synth->width) -= ((2 + *c_off(1,n,synth->width) + *c_off(1,n,synth->width)) >> 2);
		*c_off(2,n,synth->width) -= ((2 + *c_off(1,n,synth->width) + *c_off(3,n,synth->width)) >> 2);
		*c_off(4,n,synth->width) -= ((2 + *c_off(3,n,synth->width) + *c_off(5,n,synth->width)) >> 2);
		// lift 3's
		*c_off(1,n,synth->width) += ((8 - *c_off(0,n,synth->width) + 9*(*c_off(0,n,synth->width))
				+ 9*(*c_off(2,n,synth->width)) - (*c_off(4,n,synth->width))) >> 4);
	}
	for (int y = 0; y < synth->height/2; y++)
	{
		if(y < (synth->height/2-4))
		{
			memcpy_read_dma(c_off((8 + 2*y)%MEM_LIFTCACHE_DEPTH,0,synth->width), &(synth->pixel),synth->width*sizeof(alt_16));
			memcpy_read_dma(c_off((9 + 2*y)%MEM_LIFTCACHE_DEPTH,0,synth->width), &(synth->pixel),synth->width*sizeof(alt_16));
		}
// and so on ...

My memcpy_read_dma is just a simple wrapper around writing the DMA descriptor:


/* Completion ISR. Note: read_isr_fired (declaration not shown in this post) must be
   volatile, otherwise the compiler may hoist its load out of the spin loop below. */
static void sgdma_read_complete_isr(void *context)
{
	read_isr_fired++;
	clear_irq(MSGDMA_DISPATCHER_READ_CSR_BASE);
}

/* Queue one memory-to-memory descriptor and return without waiting for the transfer. */
void memcpy_read_dma_async(void *dest, void *src, alt_u32 size, unsigned long control_bits)
{
	sgdma_standard_descriptor descriptor;

	/* Spin until there is room for another descriptor to be written to the SGDMA. */
	while ((RD_CSR_STATUS(MSGDMA_DISPATCHER_READ_CSR_BASE) & CSR_DESCRIPTOR_BUFFER_FULL_MASK) != 0)
		;
	construct_standard_mm_to_mm_descriptor(&descriptor, (alt_u32 *)src, (alt_u32 *)dest, size, control_bits);
	write_standard_descriptor(MSGDMA_DISPATCHER_READ_CSR_BASE, MSGDMA_DISPATCHER_READ_DESCRIPTOR_SLAVE_BASE, &descriptor);
}

/* Block until the transfer-complete interrupt has fired, then re-arm the flag. */
void memcpy_read_dma_compl(void)
{
	while (read_isr_fired == 0)
		;
	read_isr_fired = 0;	/* reset spin lock */
}

/* Synchronous copy: queue the descriptor with the completion IRQ enabled and wait for it. */
void memcpy_read_dma(void *dest, void *src, alt_u32 size)
{
	memcpy_read_dma_async(dest, src, size, DESCRIPTOR_CONTROL_TRANSFER_COMPLETE_IRQ_MASK);
	memcpy_read_dma_compl();
}

Altera_Forum · ‎02-13-2014

Any chance that the code fetches are being deferred by the dma?

Personally I wouldn't take an interrupt at the end of the dma either.

The cost of the ISR is probably more than the transfer.

You just need to poll the 'dma busy' register.

But I've really not looked at what the altera HAL code does. I'm not impressed by the HAL code I've looked at.
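The polling approach suggested above might look roughly like the sketch below. It is not taken from the mSGDMA driver: dma_wait_idle, its parameters and the spin limit are hypothetical names chosen for illustration, and the busy bit mask has to come from the actual dispatcher headers. On the real system the status word would be the dispatcher CSR status that the posted code reads with RD_CSR_STATUS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of any Altera driver): spin until the
 * busy bit in a memory-mapped status register clears.
 *
 * status    - address of the DMA status register (assumed memory mapped)
 * busy_mask - bit(s) that read as 1 while a transfer is in flight
 * max_spins - give up after this many polls so a hung DMA cannot lock up the CPU
 *
 * Returns 0 once the DMA is idle, -1 if the spin limit was reached.
 */
int dma_wait_idle(volatile const uint32_t *status, uint32_t busy_mask,
                  uint32_t max_spins)
{
    assert(status != NULL);

    while ((*status & busy_mask) != 0) {
        if (max_spins-- == 0) {
            return -1;   /* still busy after max_spins polls */
        }
    }
    return 0;
}
```

Compared with the ISR version, this avoids the interrupt entry and exit cost, at the price of keeping the CPU in the loop. Note that each poll is itself a bus access to the DMA's CSR slave.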

Altera_Forum · ‎02-14-2014

I have a similar situation with simple DMA and internal FPGA memory... is that the same as TCM?

Can you explain the mechanism that stalls the NIOS processor while the DMA is in operation? I would have thought that the Avalon MM fabric would take care of that arbitration and both would make forward progress together.

Thanks, Bob

Altera_Forum · ‎02-17-2014

IIRC the arbiter is likely to have given both masters the same priority, and also to avoid switching between equal-priority masters unless they release their 'request'. I think you can change the priority by right-clicking on the intersection.

The dma controller could easily be doing full bandwidth transfers on both the source and destination avalon slaves.

So if the cpu needs to access one of those slaves it could easily get locked out.

Altera_Forum · ‎02-18-2014

I got back to working on that issue yesterday, and it seems that moving the heap and stack, and also the code, to their own tightly coupled on-chip memories resolves the contention on the bus.

One small question related to that whole issue: my algorithm relies heavily on memory accesses, and I measured with a performance counter that it takes ~30,000,000 clock cycles. Doing a naive calculation, this should roughly translate to 300 ms, but measuring with an interval timer I see ~1200 ms now. Does anyone have a good guess why the two methods differ that much? Or even better, does anyone have good advice on how to increase the utilization of the NIOS?
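For reference, the naive calculation above can be written out as a small sketch. The 100 MHz clock is an assumption inferred from the numbers in the post (30,000,000 cycles giving 300 ms); it is not stated anywhere in the thread. The function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a cycle count to whole milliseconds for a clock given in Hz. */
uint32_t cycles_to_ms(uint64_t cycles, uint32_t clock_hz)
{
    assert(clock_hz != 0);
    return (uint32_t)((cycles * 1000u) / clock_hz);
}

/* Measured time relative to the ideal time, in percent (400 means 4x slower). */
uint32_t slowdown_percent(uint32_t measured_ms, uint32_t ideal_ms)
{
    assert(ideal_ms != 0);
    return (uint32_t)(((uint64_t)measured_ms * 100u) / ideal_ms);
}
```

With the assumed clock, cycles_to_ms(30000000, 100000000) gives 300, and slowdown_percent(1200, 300) gives 400, so the two measurements disagree by a factor of four.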

Thanks for all the help so far.

Altera_Forum · ‎02-19-2014

You will need to understand the object code generated by the compiler.

Compiling with (IIRC) -S -fverbose-asm will give an annotated assembly output.

You might either find that you've miscounted the number of instructions, or that there are some pipeline stalls because values read from memory can't be used for the next two clocks (i.e. there is a stall if either of the next two instructions uses the read value).

A quick look at your code makes me think that you need to copy some values to locals. I suspect a lot of them are being reread from memory multiple times in the loop.
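A minimal sketch of what copying to locals could look like for the first lifting loop in the posted code. The function name and the separate row pointers are hypothetical: each pointer stands in for c_off(k, 0, width) of the original, and int16_t is used in place of alt_16 so the block compiles on its own. The arithmetic is meant to match the posted loop, but every input is loaded once and every result is stored once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical rewrite of the posted loop. r0..r5 play the role of
 * c_off(0..5, 0, width); the rows are assumed not to overlap.
 */
void lift_rows_locals(int16_t *r0, int16_t *r1, int16_t *r2,
                      const int16_t *r3, int16_t *r4, const int16_t *r5,
                      int width)
{
    assert(r0 != NULL && r1 != NULL && r2 != NULL);
    assert(r3 != NULL && r4 != NULL && r5 != NULL);

    for (int n = 0; n < width; n++) {
        /* Load each odd-row input exactly once. */
        int16_t a1 = r1[n];
        int16_t a3 = r3[n];
        int16_t a5 = r5[n];

        /* lift 2's: same expressions as the original, kept in locals */
        int16_t a0 = (int16_t)(r0[n] - ((2 + a1 + a1) >> 2));
        int16_t a2 = (int16_t)(r2[n] - ((2 + a1 + a3) >> 2));
        int16_t a4 = (int16_t)(r4[n] - ((2 + a3 + a5) >> 2));

        /* lift 3's: uses the freshly computed locals, no re-reads */
        a1 = (int16_t)(a1 + ((8 - a0 + 9 * a0 + 9 * a2 - a4) >> 4));

        /* Store each result exactly once. */
        r0[n] = a0;
        r1[n] = a1;
        r2[n] = a2;
        r4[n] = a4;
    }
}
```

Whether this actually helps depends on what the compiler already does with the original expressions, so it is worth comparing the annotated assembly of both versions.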

Altera_Forum · ‎02-19-2014

I agree with DSL that the actual generated code needs to be analyzed to have a fair chance of correlating the numbers. I have been involved with benchmark tuning, and that method (analyzing what the code generator produced) will yield results. I had access to a "scroll pipe", a trace of the pipeline execution of each instruction, but we don't have that here.

Did you get over the contention for the internal memory? I have not tried dual porting, but another approach may be to compute your coefficients with NIOS and have two independent coefficient buffers for the DMA to work on in ping-pong fashion. This would require three internal memories: the main one for NIOS, plus a dedicated ping and a dedicated pong memory, to fully decouple the DMA from the NIOS.
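The bookkeeping for such a ping-pong scheme can be sketched as below. Everything here is hypothetical (the pingpong_t type and the pp_* names are made up for illustration), and the sketch only tracks which buffer belongs to whom; starting the DMA and waiting for it to finish is left to whatever driver is in use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical selector for two dedicated buffers (ping and pong). */
typedef struct {
    int16_t *buf[2];   /* buf[0] = ping memory, buf[1] = pong memory */
    unsigned fill;     /* index of the buffer the DMA currently owns */
} pingpong_t;

void pp_init(pingpong_t *pp, int16_t *ping, int16_t *pong)
{
    assert(pp != NULL && ping != NULL && pong != NULL);
    pp->buf[0] = ping;
    pp->buf[1] = pong;
    pp->fill = 0;
}

/* Buffer the DMA should be working on next. */
int16_t *pp_dma_target(const pingpong_t *pp)
{
    return pp->buf[pp->fill];
}

/* Buffer the CPU may touch while the DMA is busy with the other one. */
int16_t *pp_cpu_source(const pingpong_t *pp)
{
    return pp->buf[pp->fill ^ 1u];
}

/* Exchange roles; call only after the DMA has finished with its buffer. */
void pp_swap(pingpong_t *pp)
{
    pp->fill ^= 1u;
}
```

A typical loop would queue a transfer on pp_dma_target(), let the CPU work on pp_cpu_source(), wait for the DMA to complete, and then call pp_swap(). Because the two buffers sit in separate memories, the CPU and the DMA never address the same slave at the same time.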

To view contention without side effects, SignalTap can be used; otherwise bring the Avalon DMA and NIOS data master read and write signals out and probe them with a scope or logic analyzer.

Best Regards, Bob.

Altera_Forum · ‎02-20-2014

The code I run on the nios is carefully compiled without any actual function calls (they are all inlined) in order to give the compiler more registers.

All global data is accessed using 16bit offsets from the global pointer - this also significantly reduces pressure on registers.

You do get better code for global arrays if you use %gp as a register variable pointing to the array (and for global structs if you have built gcc with my patches).

I've also disabled the dynamic branch prediction logic to get guaranteed branch timings.

With code and data in tightly coupled memory, the measured timings then match the calculated ones.

I only found one undocumented pipeline stall - there is a 1 cycle stall for a read following a write to the same tightly coupled data memory.