Cache Behavior on NIOS-II

Altera_Forum · 07-13-2012

Hi everybody,

I'm working on a project that requires me to understand exactly how many cycles the d-cache/i-cache needs to fetch a whole cache line.

Assuming that the cache line size is 32 bytes, my questions are as follows:

1. Does the CPU need to bring in the whole cache line before it starts executing the first word that arrives (i.e. no pipelining)?

2. 32 bytes means 8 words; does the CPU need to stall for 8 cycles to bring in that cache line?

Altera_Forum · 07-13-2012

The instruction cache uses critical-word-first line filling. So if an instruction at address 20 is fetched and is not already cached, you can expect the instruction master to perform the reads at the following addresses:

(first) (last)

20, 24, 28, 0, 4, 8, 12, 16

These are pipelined reads so as long as the slave/fabric doesn't stall you can expect the read data to return back-to-back after 'x' number of cycles.
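To make the wrap-around order concrete, here is a minimal sketch in C that computes the read order for a critical-word-first fill. The function name `cwf_fill_order` and its parameters are made up for illustration; it assumes 4-byte words and a power-of-two line size, and it only models the address sequence, not the hardware itself.

```c
#include <stdint.h>

/* Address order of a critical-word-first line fill (illustrative model).
 * miss_addr:  byte address that missed
 * line_bytes: cache line size in bytes, assumed to be a power of two
 * out:        receives line_bytes / 4 word addresses in fetch order
 * Returns the number of addresses written to out. */
unsigned cwf_fill_order(uint32_t miss_addr, uint32_t line_bytes, uint32_t *out)
{
    uint32_t base  = miss_addr & ~(line_bytes - 1u); /* first byte of the line */
    uint32_t first = miss_addr & ~3u;                /* word holding the miss  */
    unsigned words = line_bytes / 4u;

    for (unsigned i = 0; i < words; i++) {
        /* Start at the missed word, then wrap around inside the line. */
        uint32_t offset = ((first - base) + 4u * i) % line_bytes;
        out[i] = base + offset;
    }
    return words;
}
```

Calling `cwf_fill_order(20, 32, order)` with an 8-entry `order` array gives 20, 24, 28, 0, 4, 8, 12, 16, matching the sequence listed above.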

Altera_Forum · 07-13-2012

--- Quote Start ---

These are pipelined reads so as long as the slave/fabric doesn't stall you can expect the read data to return back-to-back after 'x' number of cycles.

--- Quote End ---

Thank you BadOmen for your reply.

Yes, of course the data will start arriving after 'x' cycles and then continue to arrive at a rate of one word per cycle.

My question is whether the CPU can continue processing as soon as the first word of data arrives, or whether it has to stall until the whole line is in the cache.

Just to let you know, I'm talking about the d-cache.

Altera_Forum · 07-13-2012

Oh sorry, I didn't notice the part about the D$.

For the I$: once the word that caused the miss is loaded into the cache, the processor proceeds. That's the purpose of a critical-word-first cache: the line doesn't need to fill completely before the instructions can be used.

The data cache does not have the critical-word-first feature, so no matter where on the line the miss occurs, your code will stall until the data cache line fill completes. This is why there are options for the data cache line size (1, 2, 8 words): depending on your access pattern, an eight-word line might not be ideal.

The best case times would be as follows (assuming no system stalls):

With the I$, the wait time = memory latency + 1 cycle

With the D$, the wait time = memory latency + 9 cycles

If you simulate a design you should be able to capture these numbers graphically.
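As a rough way to play with these numbers before simulating, here is a small sketch in C. The function names are hypothetical. The I$ formula is the one quoted above; the D$ formula generalizes the "+ 9" figure to "line words + 1", which is my own extrapolation from the eight-word case and should be checked against a simulation.

```c
/* Best-case miss penalties in cycles, assuming no fabric stalls and
 * one word returned per cycle once data starts arriving. */

/* I$ miss: critical word first, so only the missed word is waited for. */
unsigned icache_miss_wait(unsigned mem_latency)
{
    return mem_latency + 1u;
}

/* D$ miss: the whole line must arrive before the CPU continues.
 * ASSUMPTION: the quoted "latency + 9" for an 8-word line is read as
 * latency + line_words + 1; other line sizes are extrapolated. */
unsigned dcache_miss_wait(unsigned mem_latency, unsigned line_words)
{
    return mem_latency + line_words + 1u;
}
```

With a memory latency of 1 and an eight-word line, this gives 2 cycles for the I$ and 10 cycles for the D$.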

Altera_Forum · 07-15-2012

Thank you for your explanation. However, I still have some questions.

When I did some experiments on the assembly level, I got some strange numbers.

When I read from an address that is already in the d-cache, it takes one clock cycle. On the other hand, when I read from an address that is not in the d-cache, it takes much more than one extra cycle to access the on-chip memory on the Avalon bus. Even when I instruct Qsys not to add any registers in between to improve fmax, the result still does not make any sense.

So my question is: how many cycles does the CPU take to read from on-chip memory that is connected through the Avalon bus?

1. Using a normal load, asm(ldw)

2. Using an I/O load, asm(ldwio)

Thank you again for your support.

Altera_Forum · 07-16-2012

I would expect ldw with a 4 byte/line data cache to take 3 cycles on a miss (assuming no pipelining and a read latency of 1). With a 32 byte/line data cache it could be much higher, since the data cache doesn't implement critical-word-first loading.

I would expect ldwio to take 2 cycles, assuming there is no pipelining in the fabric and the on-chip RAM is set up for 1 cycle of read latency.
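If you are doing the experiment from C rather than hand-written assembly, the two kinds of load can be selected roughly as sketched below. `__builtin_ldwio` is the GCC built-in for Nios II targets; the wrapper names and the `__NIOS2__` / `__nios2__` guard are my own assumptions, and the fallback branch exists only so the sketch compiles on a host machine, where it does not bypass any cache.

```c
#include <stdint.h>

#if defined(__NIOS2__) || defined(__nios2__)
/* Nios II GCC: emits ldwio, which bypasses the data cache. */
#define LOAD_WORD_UNCACHED(p) \
    ((uint32_t)__builtin_ldwio((volatile const void *)(p)))
#else
/* Host fallback (assumption, for compilation only): a plain volatile read. */
#define LOAD_WORD_UNCACHED(p) (*(volatile const uint32_t *)(p))
#endif

/* Normal load: typically compiles to ldw and may hit or miss the D$. */
uint32_t load_word_cached(const uint32_t *p)
{
    return *p;
}

/* I/O load: ldwio on Nios II, so every call goes out to the fabric. */
uint32_t load_word_uncached(const uint32_t *p)
{
    return LOAD_WORD_UNCACHED(p);
}
```

Timing each wrapper around a single access (and checking the disassembly to confirm which instruction was emitted) should let you compare your measurements with the 3-cycle and 2-cycle estimates above.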

So in general, having a data cache when you place data into on-chip memory doesn't make a lot of sense. The data cache is itself implemented with on-chip memory, so you are using on-chip memory to cache on-chip memory, which is a bit redundant. Instead, you can use the on-chip memory as a tightly coupled memory, or remove the data cache, assuming on-chip memory is the only memory connected to the data master.

Altera_Forum · 07-16-2012

Dual-port the on-chip memory: connect one port as tightly coupled code/data memory and the other port to the Avalon bus for other masters.

Remember not to try to give the cpu's data port access to tightly coupled data memory - that will give errors since the two blocks end up at the same address.

If you are reading from external memory (eg SDRAM) it will take considerably longer.

Altera_Forum · 07-16-2012

--- Quote Start ---

Remember not to try to give the cpu's data port access to tightly coupled data memory - that will give errors since the two blocks end up at the same address.

--- Quote End ---

I did not follow. Can you please explain what you meant by ending up with two blocks at overlapping addresses?

Altera_Forum · 07-16-2012

DSL means don't hook up both the tightly coupled data master and the regular data master to the two ports of a dual-port on-chip memory. If you did that, you would have two data paths into the same memory, which will cause problems at the linking stage.

Assigning the ports to different addresses would avoid this, but it opens up a new can of worms, since you would then have aliased memory. There is no reason to have both types of data masters connected to a single memory, so just don't do it.

Typically, when people dual-port a tightly coupled memory, it is to put instructions into it, with the 2nd port connected to a tightly coupled instruction master. The other common usage is to share data without having to worry about cache coherency (other processors or DMAs in your system can access the TCM without you needing to worry about L1 data cache coherency, since TCM accesses always bypass the cache).