Nios® V/II Embedded Design Suite (EDS)

Data transfer from hardware Modules to Nios

Altera_Forum
Honored Contributor II

Hello, 

 

As I have described in a previous post http://www.alteraforum.com/forum/showthread.php?t=52808&p=217155&highlight=#post217155 I want to implement a DAQ system using NIOS and the W5300. The communication problem between the NIOS and the W5300 has been solved and I am now able to transfer data over TCP/IP. Now I am dealing with a different problem: transferring data between the hardware modules (implemented in Verilog) and the NIOS. 

 

To give you a short description, my system consists of 36 hardware modules, each transferring 64 bits of data every 462 us. The goal is to grab this data and send it over Ethernet to a PC, so my required data rate is about 5 Mbit/s. Ethernet communication itself doesn't seem to be a problem because it can achieve much higher speeds. The problem is how fast the CPU grabs the data from the modules. In the beginning I thought using just PIOs wouldn't be a problem, so the NIOS collects the data from the modules and then transfers it to the W5300 FIFO, but using counters I found that my maximum data rate is only about 4 Mbit/s. Is there any alternative way to go? I thought of using DMA, but it sounds a bit complicated for me at the moment and I don't know if in the end it would be enough. 

 

Any suggestions??
15 Replies

Altera_Forum
Honored Contributor II

4 Mbit/s sounds kind of low, even for a simplistic implementation. Did you already isolate where the bottleneck is? i.e., how fast can you run the W5300 in isolation, and how fast can you grab your data in isolation? 

 

Assuming your data acquisition is the bottleneck: how does your data acquisition portion work? Is the NIOS writing to the PIO to force the trigger, and then following that up with a read for the data? Anything where you are performing multiple PIO interactions to execute a single logical operation is an area where it should be very straightforward to improve your performance just by creating your own Avalon-MM Slave component.
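
As a rough illustration of what that buys you, here is a minimal C-side sketch; the base addresses and function names below are hypothetical, not taken from your design:

#include <stdint.h>
#include "io.h" /* Nios II HAL IORD/IOWR macros */

/* Hypothetical base addresses, for illustration only */
#define TRIGGER_PIO_BASE 0x21500
#define DATA_PIO_BASE    0x21510
#define DAQ_SLAVE_BASE   0x21400

/* With separate PIOs, one logical sample costs two Avalon transactions */
uint16_t read_sample_two_transactions(void)
{
    IOWR(TRIGGER_PIO_BASE, 0, 1);   /* transaction 1: force the trigger */
    return IORD(DATA_PIO_BASE, 0);  /* transaction 2: fetch the data    */
}

/* A custom Avalon-MM slave can pop its internal FIFO as a side effect
   of the read, collapsing trigger + fetch into a single transaction */
uint16_t read_sample_one_transaction(void)
{
    return IORD(DAQ_SLAVE_BASE, 0);
}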
Altera_Forum
Honored Contributor II

I haven't performed any isolation measurements yet, but I am pretty sure the problem is in the data grabbing. The W5300 data transfer is in the range of 100 Mbit/s, so it's more than enough for my application. The code itself is quite straightforward and looks like this: 

 

 

--- Quote Start ---  

while (1) { 
    data_buf2[0]  = IORD(0x214f0, 0); 
    data_buf2[1]  = IORD(0x214e0, 0); 
    data_buf2[2]  = IORD(0x214d0, 0); 
    data_buf2[3]  = IORD(0x214c0, 0); 
    data_buf2[4]  = IORD(0x214b0, 0); 
    data_buf2[5]  = IORD(0x214a0, 0); 
    data_buf2[6]  = IORD(0x21490, 0); 
    data_buf2[7]  = IORD(0x21480, 0); 
    data_buf2[8]  = IORD(0x21470, 0); 
    data_buf2[9]  = IORD(0x21460, 0); 
    data_buf2[10] = IORD(0x21450, 0); 
    data_buf2[11] = IORD(0x21440, 0); 
    data_buf2[12] = IORD(0x21430, 0); 
    data_buf2[13] = IORD(0x21420, 0); 
    data_buf2[14] = IORD(0x21410, 0); 
    data_buf2[15] = IORD(0x21400, 0); 

    test_tcps(0, 5000, data_buf2, 0); 
} 

--- Quote End ---  

 

At the moment I am using just 16 PIOs, so 256 bits; the input is just a counter running at 18 kHz. I've seen some improvement using NIOS II/f, and now my data rate is 4.6 Mbit/s. 

 

The modules run independently, as I wrote before, so the CPU only has to read a register value connected to the PIO, that's all. I don't mind receiving the same data twice, but this requires reading faster than intended.
Altera_Forum
Honored Contributor II

The W5300 may be capable of 100 Mbit/s, but is your system (NIOS + RAM + software + W5300) capable of that performance level? From your code snippet, I'm guessing the answer is no: your IORD()s should take a handful of clocks each (and I'm assuming they are in the same [fast] clock domain as the NIOS), so I am guessing all your time is being consumed within test_tcps(). 

 

Because you only need a modest improvement in performance over what you've got working already, I would suggest briefly reviewing https://www.altera.com/en_us/pdfs/literature/an/an391.pdf and picking one method you are comfortable with to find the biggest bottleneck in your test_tcps(), then just making whatever easy change improves that bottleneck. 
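
If you want a quick-and-dirty starting point, here is a sketch using the HAL timestamp driver. It assumes your BSP has a timestamp timer configured and that the timer clock is above 1 MHz; the section being timed is just a placeholder:

#include <stdio.h>
#include "alt_types.h"
#include "sys/alt_timestamp.h" /* Nios II HAL timestamp driver */

void profile_section(void)
{
    alt_u32 start, ticks;

    if (alt_timestamp_start() < 0) {
        printf("No timestamp timer configured in the BSP\n");
        return;
    }

    start = alt_timestamp();
    /* ... section under test, e.g. test_tcps() or one of its callees ... */
    ticks = alt_timestamp() - start;

    printf("elapsed: %lu us\n",
           (unsigned long)(ticks / (alt_timestamp_freq() / 1000000)));
}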

 

To answer your original post, DMA really wouldn't buy you much in the above code snippet, as I think all your time is spent inside test_tcps(), which DMA wouldn't help one bit.
Altera_Forum
Honored Contributor II

Thank you very much, you are actually right. I used the timestamp timer and the total time for the PIO reads was about 4 us, so the problem is with the CPU-W5300 communication. I used the drivers provided by Wiznet and the code isn't that complicated. On the other hand, I am using the generic tristate controller in Qsys to connect the W5300 to the Nios, so maybe there is a problem with the configuration of the tristate controller. Attached is the timing I used (I am running with a 140 MHz clock). Do you have any experience with this? 

 

https://www.alteraforum.com/forum/attachment.php?attachmentid=12555
Altera_Forum
Honored Contributor II

The forum shrunk your picture, so it's not very readable. I did notice that you're specifying your timing in terms of "cycles", meaning your 140 MHz clock, and that your "Read wait time" was double digits [14? 24?]. The W5300 datasheet says tRD = 65 ns, so 9.1 cycles [round up to 10]. And the turnaround time is even larger? (It shouldn't be.) 

 

This is just one example, but I personally find it more intuitive to use "nanoseconds" and then just plug in the values from the datasheet table as you read them, and let Qsys handle the rounding. 

 

After you've done all that, will you see a dramatic increase? It's hard to say without knowing what percentage of your time is actually spent reading from and writing to the external device. 

 

Other sources of bottleneck could be basic things like running with no cache from a slow memory, for example. It could also be that you took source code written for another environment and missed some porting detail, like how time is managed if there is any sort of internal delay in your ported code.  

 

Anyway, I would drill down one more level into the W5300 code and profile the time used by the register read/write primitives to identify whether your problem is there or not. If a big percentage of your time is spent in that I/O, then yes, focus on the tri-state bridge configuration.
Altera_Forum
Honored Contributor II

The code is running entirely from on-chip RAM, both data and instructions, so I assume that is the best in terms of speed. I am not using any caches; actually, whenever I enable them the code doesn't run, which is kind of strange!! 

 

There are no delays in the Wiznet code, just raw reads and writes to registers. 

 

I followed your advice and changed the times to ns, but the result was the same. As for the turnaround time, it should be the time to change from write to read, measured from the last command, so according to the datasheet it has to be higher than tcs+tcsn, I suppose. For your information, the times I am using now are: 

 

Read wait time: 100 ns 
Write wait time: 100 ns 
Setup time: 0 ns 
Data hold time: 10 ns 
Max pending: 1 
Turnaround time: 130 ns 
Read latency: 55 ns 

 

I have also included the code I am using at the moment for the Nios.
Altera_Forum
Honored Contributor II

Can you get timestamp data for test_tcps() as a whole, and then for wiz_write_buf(), which is where the data is actually copied?

Altera_Forum
Honored Contributor II

test_tcps(): 51 us 
wiz_write_buf(): 35 us 

I am running now with a 125 MHz clock and the new timings are:  

Read wait time: 9 cycles 
Write wait time: 7 cycles 
Setup time: 2 cycles 
Data hold time: 2 cycles 
Max pending: 2 
Turnaround time: 2 cycles 
Read latency: 2 cycles
Altera_Forum
Honored Contributor II

Couple of things: 

 

send() and wiz_write_buf() appear to work off of lengths in units of 16-bit words, not bytes. 

Your test_tcps() is saying send 32 of them but I think you only need 16? 

 

Then, your wiz_write_buf() taking 35 us means >1 us for each of the (32) writes you are doing. 

I would quickly try some things like at least turning on compiler optimization and maybe inlining the wiz_write_buf() loop and function calls, to reduce the loop overhead. Even with a conservative budget of 250 ns per write taken by the tri-state timing, you're losing 3x that (another 750 ns) on loop overhead. NIOS isn't a great performer, but I think you should be able to manage better than that.
Altera_Forum
Honored Contributor II

That is true, I only need 16, but the index idx in the loop increases by 2 each time, so the loop actually runs 16 times. If I change the send to 16 I am not sending the correct data. As I said before, most of the code is as it was from the Wiznet driver. 

 

Can you be more specific about how to enable the compiler optimization and inline the loop and the function calls?
Altera_Forum
Honored Contributor II

In your project settings, change the optimization level from "None" to -O3. Your software might break, or it might be fine. 

 

Your code boils down to: 

void IINCHIP_WRITE(uint32 addr, uint16 data)
{
    (*((vuint16*) addr)) = data;
}

uint32 wiz_write_buf(SOCKET s, uint8* buf, uint32 len)
{
    uint32 idx = 0;

    // M_08082008
    IINCHIP_CRITICAL_SECTION_ENTER();
    for (idx = 0; idx < len; idx += 2)
        IINCHIP_WRITE(Sn_TX_FIFOR(s), *((uint16*) (buf + idx)));
    // M_08082008
    IINCHIP_CRITICAL_SECTION_EXIT();
}

 

The compiler can do some improvements by itself once you have enabled the optimizer, but as an example, changing to a macro and using the 'register' keyword will probably bring some improvement: 

#define IINCHIP_WRITE(addr, data) ((*((vuint16*) addr)) = data)

uint32 wiz_write_buf(SOCKET s, uint8* buf, uint32 len)
{
    register uint32 idx = 0;
    register uint32 fifo_addr = Sn_TX_FIFOR(s);

    // M_08082008
    IINCHIP_CRITICAL_SECTION_ENTER();
    for (idx = 0; idx < len; idx += 2)
        IINCHIP_WRITE(fifo_addr, *((uint16*) (buf + idx)));
    // M_08082008
    IINCHIP_CRITICAL_SECTION_EXIT();
}

 

The simple edits above eliminated 2 x 16 = 32 function-call overheads from your loop. The math in the loop also isn't great, but deal with the function calls first.
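
If you later want to attack the loop math too, here is an untested sketch of the same loop with pointer walking instead of per-iteration index arithmetic (same Wiznet types and macros as above):

uint32 wiz_write_buf(SOCKET s, uint8* buf, uint32 len)
{
    register vuint16* fifo = (vuint16*) Sn_TX_FIFOR(s);
    register uint16* src = (uint16*) buf;
    register uint16* end = (uint16*) (buf + len);

    // M_08082008
    IINCHIP_CRITICAL_SECTION_ENTER();
    while (src < end)
        *fifo = *src++;   /* one 16-bit store per word, no index math */
    // M_08082008
    IINCHIP_CRITICAL_SECTION_EXIT();
    return len;
}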
Altera_Forum
Honored Contributor II

I've enabled the code optimization but this didn't bring any improvement. The changes in the code above also didn't work; on the contrary, they brought more trouble, because by changing the code I started to lose some packets (I've tested the changes separately...).

Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

I've enabled the code optimization but this didn't bring any improvement. The changes in the code above also didn't work; on the contrary, they brought more trouble, because by changing the code I started to lose some packets (I've tested the changes separately...). 

--- Quote End ---  

 

 

"losing packets" could simply be a side effect of your code not waiting for the transmit to complete before issuing the next request. 

After you've sped things up as much as you can, just add a simple wait loop to delay between packet transmits and see if that affects your lost packets at all.
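
Something as crude as this would do for the experiment (usleep() comes from the Nios II HAL; the 500 us figure is an arbitrary starting point, not a tuned value):

#include <unistd.h> /* Nios II HAL provides usleep() */

void send_loop(void)
{
    for (;;) {
        /* ... grab the samples and send the packet as before ... */
        usleep(500); /* arbitrary spacing between transmits; tune as needed */
    }
}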
Altera_Forum
Honored Contributor II

No, it's just that the code runs slower and doesn't return fast enough to catch the next data. I've added a delay and the behavior is the same.

Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

No, it's just that the code runs slower and doesn't return fast enough to catch the next data. I've added a delay and the behavior is the same. 

--- Quote End ---  

 

 

Sorry, I don't understand the problem: each of those code improvements should have resulted in faster execution time for wiz_write_buf(). 

 

If they somehow made it operate slower, I guess make changes one at a time and figure out why it got slower?