Nios® V/II Embedded Design Suite (EDS)

Tightly-coupled interface to custom components

Altera_Forum
Honored Contributor II

All experienced SOPC/Nios2 developers are probably aware that Nios2 CPU access to Avalon-MM components is pretty slow.

Specifically, a Nios2/f core with a non-burst data master generates Avalon-MM read transactions (0 wait states, pipeline latency = 1 clock) at a maximum rate of 1 transaction per 4 clocks, with 2 more clocks of latency for instructions that depend on the load result. The same core generates Avalon-MM write transactions (also 0 wait states) at a maximum rate of 1 transaction per 2 clocks.

A Nios2/f core with a burst data master is slower still - 6+2 clocks for a read and 4 clocks for a write.
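
To put those numbers in perspective, here is a minimal PIO read loop of the kind I have in mind; the slave base address is made up, and the cycle counts in the comments are just the figures above applied at 100 MHz.

--- Code Start ---
#include <io.h>   /* Nios II HAL: IORD_32DIRECT / IOWR_32DIRECT */

#define MY_SLAVE_BASE 0x08000000   /* made-up base of a 0-ws Avalon-MM slave */

/* Plain PIO read loop.  At ~4 clocks per read (non-burst master) a
 * 100 MHz /f core tops out around 25 Mwords/s, i.e. ~100 MB/s of
 * 32-bit data, before any loop overhead.  With a burst master the
 * 6-clock-per-read figure drops that further still. */
static void pio_read_block(unsigned *dst, unsigned nwords)
{
    unsigned i;
    for (i = 0; i < nwords; i++)
        dst[i] = IORD_32DIRECT(MY_SLAVE_BASE, i * 4);
}
--- Code End ---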

 

I personally am a big fan of programmed I/O (PIO) - for things that don't have to run at absolute maximum speed it is much simpler to program than DMA, less prone to strange errors (to name just one, how about cache-line tearing?), more friendly to multitasking environments and, last but not least, costs me little or nothing in terms of FPGA resources. Besides, in theory, for small packets PIO, especially in the write direction, could end up faster than DMA.
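
As an illustration of what I mean by simple, sending a small packet by PIO is just a loop (the register address below is made up); the IOWR macro bypasses the data cache, so there is no cache-line tearing to worry about, and no descriptor or interrupt plumbing as with DMA.

--- Code Start ---
#include <io.h>

#define TX_FIFO_BASE 0x08001000   /* made-up address of a peripheral's TX register */

/* Push a small packet by PIO.  At ~2 clocks per write (non-burst
 * master) a packet of a few dozen words is finished before a DMA
 * descriptor would even have been set up. */
static void pio_send_packet(const unsigned *pkt, unsigned nwords)
{
    unsigned i;
    for (i = 0; i < nwords; i++)
        IOWR_32DIRECT(TX_FIFO_BASE, 0, pkt[i]);
}
--- Code End ---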

 

The problem is that on Nios2, PIO is so slow that the slowness seriously reduces its applicability.

 

The most depressing part is that I see no good technical reason for Nios2 PIO to be slow. Actually, the Nios2/f core has a near-perfect tool for fast PIO in the form of the tightly-coupled data port (TCM-DP). The limitations of the TCM-DP protocol - specifically, a point-to-point master-slave connection, zero wait states and one clock of pipeline latency - look perfectly reasonable. There is only one problem, but it is a lethal one - the damn SOPC Builder refuses to connect the TCM-DP to anything except Altera's own on-chip memory components.
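
For the record, from the software side such a component would be trivial to use: tightly-coupled accesses bypass the data cache, so a plain volatile pointer is all it takes. The base address below is, of course, made up, since SOPC Builder won't let me build this in the first place.

--- Code Start ---
#define TCM_DP_BASE 0x04000000   /* hypothetical base of a slave on the TC data port */

static volatile unsigned * const tcm_regs = (volatile unsigned *)TCM_DP_BASE;

/* Tightly-coupled accesses bypass the data cache and have a fixed
 * one-clock latency, so no IORD/IOWR macros are needed here. */
static inline unsigned tcm_read(unsigned reg)              { return tcm_regs[reg]; }
static inline void     tcm_write(unsigned reg, unsigned v) { tcm_regs[reg] = v; }
--- Code End ---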

 

I'm writing this post more out of despair than in the hope of getting real help, but still... I guess my question is: does anybody know how to trick SOPC Builder into accepting a custom Avalon-MM component on the TCM-DP port without getting too ugly?

 

Or, more generally, I want to hear what the honorable Altera gurus think about ways of improving Nios2/f PIO performance, especially in the more problematic case of a burst data master.

For example, I personally can think of "creative" use of custom instructions. But multi-cycle custom instructions themselves are only marginally faster than Avalon-MM accesses, and hardware-wise they are not as cheap as the TCM-DP. Besides, using custom instructions for I/O doesn't agree with my sense of aesthetics.
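
For completeness, the custom-instruction route would look roughly like this on the software side - nios2-elf-gcc provides __builtin_custom_* intrinsics for it - with the opcode number and the register/data semantics made up for illustration.

--- Code Start ---
#define CI_IO_N 0   /* made-up custom-instruction opcode number from SOPC Builder */

/* Read: the operand selects a register inside the custom logic and
 * the result returns its value.  Write: two operands, no result. */
static inline unsigned ci_read(unsigned reg)
{
    return (unsigned)__builtin_custom_ini(CI_IO_N, (int)reg);
}

static inline void ci_write(unsigned reg, unsigned val)
{
    __builtin_custom_nii(CI_IO_N, (int)reg, (int)val);
}
--- Code End ---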

 

Thank you for your patience,

Michael
Altera_Forum
Honored Contributor II

I measured 2 clocks for both read and write on a /f without a data cache. 

I think the slave is generating a wait state - which is very difficult to avoid on the read cycle! 

 

My thoughts in this area are that the nios need not stall on MM transfers. 

For writes simply using a 'posted write' would be enough to allow most writes to complete in a single cycle (a second transfer would have to stall). 

For reads it ought, somehow, to be possible to cause a 'D' stage stall when the required value is needed, instead of an 'A' stage stall.

Both of these would need to be options, since they would increase the processor size.

 

It is worth noting that the 'late result' is actually a 'normal result' - and happens when the resultant value has to go via the register file. 

I think non 'late result' instructions use special logic to forward the output of the ALU back to its inputs.
Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

I measured 2 clocks for both read and write on a /f without a data cache. 

 

--- Quote End ---  

 

 

On a /f without a data cache I measured 3 clocks per operation for both read and write. So a read is one clock faster than the dcache-no-burst case, but a write is one clock slower.

Anyway, when you have the bulk of the data in SDRAM or any other external memory, working without a data cache doesn't sound like a feasible option.

 

 

--- Quote Start ---  

I think the slave is generating a wait state - which is very difficult to avoid on the read cycle! 

--- Quote End ---  

 

 

I don't understand this comment.  

The slave is my own; I know for sure that it doesn't generate wait states (i.e. it can accept a new address every clock), just one clock of pipeline latency. So all the wait states come from either the SOPC fabric (unlikely) or from the Nios itself.

 

 

 

--- Quote Start ---  

 

My thoughts in this area are that the nios need not stall on MM transfers. 

For writes simply using a 'posted write' would be enough to allow most writes to complete in a single cycle (a second transfer would have to stall). 

For reads it ought, somehow, be possible to cause a 'D' stage stall when the required value is needed instead of an 'A' stage stall. 

Both these would need to be options - since they will increase the processor size. 

 

--- Quote End ---  

 

 

Interesting thoughts, but IMHO purely theoretical. 

Unfortunately, there is very little chance that Altera is going to significantly redesign the /f core.

On the other hand, allowing connection of custom components to the tightly-coupled data port would take just a very small change in SOPC Builder and achieve the same or better improvement for a significant class of custom components.

 

 

 

--- Quote Start ---  

 

It is worth noting that the 'late result' is actually a 'normal result' - and happens when the resultant value has to go via the register file. 

I think non 'late result' instructions use special logic to forward the output of the ALU back to its inputs. 

--- Quote End ---  

 

 

Of course, they had to build forwarding logic for single-cycle instructions. However, I don't think that 'late result' = 'normal result without forwarding'. Even without a pipeline bypass, the results of 'combinatorial' instructions would be available one clock earlier than a 'late result'.
Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

Originally Posted by dsl (http://www.alteraforum.com/forum/showthread.php?p=93142#post93142)

I think the slave is generating a wait state - which is very difficult to avoid on the read cycle!

 

Originally Posted by mshatz  

I don't understand this comment.  

The slave is my own, I know for sure that it doesn't generate wait states (i.e. every clock it can accept a new address), just one clock of pipeline latency. So all wait states come from either SOPC (unlikely) or from Nios itself. 

--- Quote End ---  

I was thinking of non-pipelined single transfers. The pipeline latency becomes a wait state on the nios (probably unless you are doing cache line reads!). 

 

 

--- Quote Start ---  

 

Originally Posted by dsl 

It is worth noting that the 'late result' is actually a 'normal result' - and happens when the resultant value has to go via the register file.

I think non 'late result' instructions use special logic to forward the output of the ALU back to its inputs.

 

Originally Posted by mshatz  

Of course, they had to build forwarding logic for single-cycle instructions. However, I don't think that 'late result'='normal result without forwarding'. Even without pipeline bypass results of 'combinatorial' instructions would be available one clock earlier than 'late result' 

--- Quote End ---  

I was thinking that the instruction pipeline does 'read register file' - 'execute' - 'write register file' over 3 clocks for each instruction, and that a 'read' would only give the new data the clock after a 'write' of the same register.

Thinking further about the M9K rules, for dual-port access using the same clock it is possible for a concurrent read/write to return the new data on the read - which might allow for only a single clock of delay.

Maybe the extra clock comes from the delays added by the decision code.
Altera_Forum
Honored Contributor II

When you use a data cache line size of 4 bytes/line, or no data cache at all, the Nios II core does not have a readdatavalid signal hooked up to the fabric. Without this signal Nios II is not able to perform pipelined reads, and as a result you should expect read turnaround times to affect performance.

Also, if any master that supports pipelined reads (i.e. has a readdatavalid signal) has reads outstanding, it will not be able to begin reading from another slave port until all the previous reads return. So, for example, with the Nios II 'f' core you could read 32 back-to-back bytes from SDRAM and then read from the PIO core immediately after; the reads from the SDRAM must return before the read from the PIO core is allowed to complete (waitrequest will be asserted by the fabric).
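
As a rough sketch of that last scenario (the base addresses are placeholders): a cache-line fill from SDRAM is still outstanding when the uncached peripheral read is issued, so the fabric holds the peripheral read off with waitrequest until all the SDRAM data has returned.

--- Code Start ---
#include <io.h>

#define SDRAM_BASE  0x00000000   /* placeholder addresses */
#define PERIPH_BASE 0x09000000

unsigned read_after_line_fill(void)
{
    volatile unsigned *buf = (volatile unsigned *)SDRAM_BASE;
    unsigned sum = 0, i;

    /* Cached loads that miss trigger pipelined (readdatavalid-based)
     * line fills from SDRAM. */
    for (i = 0; i < 8; i++)
        sum += buf[i];

    /* This uncached peripheral read cannot complete until every
     * outstanding SDRAM read has returned; the fabric asserts
     * waitrequest in the meantime. */
    sum += IORD_32DIRECT(PERIPH_BASE, 0);
    return sum;
}
--- Code End ---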
