I thought I understood the Nios instruction timings, and exactly when pipeline stalls occur. However, I've found some discrepancies between calculated and measured execution times. This is a /f processor without the dynamic branch predictor and with all code and (almost) all data cycles going to tightly coupled memory. Apart from delays due to contention on the few Avalon data transfers, the execution time ought to be determinable. (I've measured the non-contended Avalon cycles.) One place I've found an unexpected stall appears to be in the difference between:
    ldhu rx, 0(ry)
    add  ra, rb, rc
    stb  rd, 0(ra)
    bne  re, zero, label

and

    add  ra, rb, rc
    stb  rd, 0(ra)
    ldhu rx, 0(ry)
    bne  re, zero, label

Although there are no 'late result' stalls, if execution branches to label (forwards, so predicted not taken) then the second version has an additional stall, on top of the 4 cycles lost because the branch is mispredicted. I'm comparing the execution time of the above with the code path that takes a branch just before, and merges just after. The difference between the two should be 1 clock (one is 7, the other 8).

Does anyone have any thoughts on this? Is there a lurking extra stall cycle when a memory load (etc.) precedes a mispredicted branch? I'm also not sure I have enough mispredicted branches in my slow code path to account for the overall additional delays. I might try to get a SignalTap trace of the code addresses.
I've located the undocumented stall. There is a 1 clock delay if a read follows a write to the same tightly coupled data block. The same delay may apply to data cache accesses. The reason is that a write cannot be requested unless the address matches and the instruction is a write. A read can be (and is) issued to every tightly coupled memory block every cycle, and the value is only used if the instruction was a read on that block. This means that writes actually happen one clock later than reads.
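
To make the rule concrete, here is a rough Python sketch of a toy cycle counter. It is not a model of the real Nios II pipeline. It assumes one issue cycle per instruction, assumes that "follows" means "immediately follows", and assumes the stb and ldhu in the two sequences from the first post hit the same tightly coupled block. The name count_cycles and the block numbers are made up for illustration, and the mispredicted-branch penalty is deliberately left out because it is the same for both orderings.

```python
# Toy cycle model only: it encodes the single rule described above and
# nothing else about the real pipeline.
LOADS = {"ldb", "ldbu", "ldh", "ldhu", "ldw"}
STORES = {"stb", "sth", "stw"}


def count_cycles(program):
    """Count cycles for a list of (mnemonic, tcm_block) pairs.

    tcm_block is a hypothetical block id, or None for instructions that
    do not touch tightly coupled data memory.
    """
    cycles = 0
    prev = None
    for mnemonic, block in program:
        cycles += 1  # assumption: one issue cycle per instruction
        read_after_write = (
            prev is not None
            and prev[0] in STORES
            and mnemonic in LOADS
            and block is not None
            and prev[1] == block
        )
        if read_after_write:
            cycles += 1  # the 1 clock stall reported in this post
        prev = (mnemonic, block)
    return cycles


# The two orderings from the first post, assuming both memory accesses
# go to the same tightly coupled block (block 0).
version_1 = [("ldhu", 0), ("add", None), ("stb", 0), ("bne", None)]
version_2 = [("add", None), ("stb", 0), ("ldhu", 0), ("bne", None)]

extra = count_cycles(version_2) - count_cycles(version_1)
print("extra cycles in second ordering:", extra)
```

Under those assumptions only the second ordering has a load directly after a store to the same block, which matches the one extra clock I measured. If the load and store were in different tightly coupled blocks, the sketch would charge no stall.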