Nios® V/II Embedded Design Suite (EDS)
Support for Embedded Development Tools, Processors (SoCs and Nios® V/II processor), Embedded Development Suites (EDSs), Boot and Configuration, Operating Systems, C and C++

tightly-coupled memory performance !!

Honored Contributor II

I wanted to compare the performance of cache versus tightly-coupled memory, so I did the following experiment:

call the function()

call the same function()

I noticed that the second run was slightly faster than the first run.

All code and data, even the stack, are placed in the on-chip tightly-coupled memory.

Can anyone comment on this behavior?
Honored Contributor II

I would look at the assembled code (the objdump file) to see what the compiler is doing. My guess is that the register-preserving operations done for the first call are not duplicated for the second call, which makes the second call faster. That would have nothing to do with tightly-coupled memory; it's just a code optimization.

Honored Contributor II

The calling function is very simple; it is the called function that contains the read loop. 

The called function should be the same for each call, right?
Honored Contributor II

The called function's assembly code:

05009254 <corner_turn_main>:
 5009254: 2015883a mov r10,r4
 5009258: 0013883a mov r9,zero
 500925c: 00001506 br 50092b4 <corner_turn_main+0x60>
 5009260: 40800017 ldw r2,0(r8)
 5009264: 3885883a add r2,r7,r2
 5009268: 11000017 ldw r4,0(r2)
 500926c: 01400784 movi r5,30
 5009270: 3145383a mul r2,r6,r5
 5009274: 1245883a add r2,r2,r9
 5009278: 1085883a add r2,r2,r2
 500927c: 1085883a add r2,r2,r2
 5009280: 00c14474 movhi r3,1297
 5009284: 18e7c604 addi r3,r3,-24808
 5009288: 10c5883a add r2,r2,r3
 500928c: 11000015 stw r4,0(r2)
 5009290: 00c00044 movi r3,1
 5009294: 30cd883a add r6,r6,r3
 5009298: 00800104 movi r2,4
 500929c: 388f883a add r7,r7,r2
 50092a0: 317fef1e bne r6,r5,5009260 <corner_turn_main+0xc>
 50092a4: 48d3883a add r9,r9,r3
 50092a8: 5095883a add r10,r10,r2
 50092ac: 00800a04 movi r2,40
 50092b0: 48800426 beq r9,r2,50092c4 <corner_turn_main+0x70>
 50092b4: 5011883a mov r8,r10
 50092b8: 000d883a mov r6,zero
 50092bc: 000f883a mov r7,zero
 50092c0: 003fe706 br 5009260 <corner_turn_main+0xc>
 50092c4: f800283a ret

The calling function's assembly code:

 5008cc8: 04010034 movhi r16,1024
 5008ccc: 84062804 addi r16,r16,6304
 5008cd0: 84800037 ldwio r18,0(r16)
 5008cd4: a009883a mov r4,r20
 5008cd8: 50092540 call 5009254 <corner_turn_main>
 5008cdc: 84400037 ldwio r17,0(r16)
 5008ce0: 8ca3c83a sub r17,r17,r18
 5008ce4: 04c000b4 movhi r19,2
 5008ce8: 9cc7af04 addi r19,r19,7868
 5008cec: 9809883a mov r4,r19
 5008cf0: 880b883a mov r5,r17
 5008cf4: 000feb00 call feb0 <printf>
 5008cf8: 84800037 ldwio r18,0(r16)
 5008cfc: a009883a mov r4,r20
 5008d00: 50092540 call 5009254 <corner_turn_main>
 5008d04: 84000037 ldwio r16,0(r16)
 5008d08: 84a1c83a sub r16,r16,r18
 5008d0c: 9809883a mov r4,r19
 5008d10: 800b883a mov r5,r16
 5008d14: 000feb00 call feb0 <printf>
Honored Contributor II

Actually, I just meant look at the calling function's assembly code to see whether some of the instructions before the 'call' are omitted for the second call. 

Assuming all the instructions and data of the called function are located in tightly-coupled memory, and the code executes the same instructions on each call, the execution time should be the same. But that doesn't mean the stack operations leading into the call are the same between the two calls. So check the instructions before each 'call' to make sure you are not simply seeing additional work being done for the first call that is omitted for the second one.
Honored Contributor II

Thank you BadOmen, 

I understand what you are saying. However, I still find the difference strange: the compiler does nothing extra for the second call, yet the difference grows as the loop iteration count increases. 

Can somebody help me understand this issue?
Honored Contributor II

I'd guess the dynamic branch predictor is changing the latency of some branches: it is cold on the first call and warmed up for the second, and the gap grows with the number of branches executed. 

There is a hidden menu (in SOPC Builder at least) with some extra Nios II CPU options. One of them removes the dynamic branch prediction logic and falls back to the static prediction 'assume backwards taken' and 'forwards not taken'. 

It's actually a shame there isn't an 'assume all not taken' option. 

I needed to minimise the worst-case path, so I had to persuade gcc to generate forward branches to backward jumps in quite a few places (by adding an asm volatile () that contains only comments).
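The trick mentioned above can be sketched like this; this is my interpretation, not dsl's actual code. An asm statement with an empty (or comment-only) template emits no instructions, but because GCC cannot see inside it, it acts as a barrier that prevents the compiler from moving or merging the surrounding code, which can be used to influence where branches land.

```c
/* sketch: an empty asm volatile with a "memory" clobber is a classic
 * compiler barrier; dsl's variant reportedly used an assembler comment
 * as the template instead of an empty string */
static int sum_array(const int *p, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++) {
        /* no instructions emitted, but GCC will not reorder code
         * across this point or merge this block with another path */
        __asm__ volatile("" ::: "memory");
        sum += p[i];
    }
    return sum;
}
```

The cost of the barrier is that it also blocks some legitimate optimizations (values cached in registers may be reloaded), so it is worth checking the resulting objdump to confirm the branch layout actually improved.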
Honored Contributor II

Thank you dsl, that makes sense. However, is there any trick to force the Nios II/f to use the static branch predictor?

Honored Contributor II

The SOPC Builder has a hidden menu that allows some additional configuration of the Nios II CPU. 

If you ask your FAE they might tell you how to find it. 

I'm not sure its presence is supposed to be public knowledge!
Honored Contributor II

Who are the FAEs? I'm a university student.