I wanted to compare the performance of cache versus tightly-coupled memory, so I ran the following experiment:

tic
call the function()
toc
tic
call the same function()
toc

I noticed that the second run was slightly faster than the first. All code and data are placed in the on-chip tightly-coupled memory, even the stack. Can anyone comment on this behavior?
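For context, a minimal sketch of what such a tic/toc harness might look like using the Nios II HAL timestamp driver (the disassembly later in this thread suggests the timer is actually read directly with ldwio; the function name and this HAL-based approach are assumptions):

#include <stdio.h>
#include "sys/alt_timestamp.h"      /* Nios II HAL timestamp driver */

extern void corner_turn_main(int **rows);   /* function under test */

void run_experiment(int **rows)
{
    alt_timestamp_type t0, t1;

    /* requires a timestamp timer to be configured in the BSP */
    if (alt_timestamp_start() < 0) {
        printf("no timestamp timer configured\n");
        return;
    }

    t0 = alt_timestamp();
    corner_turn_main(rows);                 /* first call  */
    t1 = alt_timestamp();
    printf("first call:  %lu ticks\n", (unsigned long)(t1 - t0));

    t0 = alt_timestamp();
    corner_turn_main(rows);                 /* second call */
    t1 = alt_timestamp();
    printf("second call: %lu ticks\n", (unsigned long)(t1 - t0));
}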
I would look at the assembled code (objdump file) to see what the compiler is doing. I'm guessing the register-preserving operations for the first call are not duplicated for the second call, and as a result the second call is faster. This would have nothing to do with tightly-coupled memory; it's just a code optimization.
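A disassembly can be produced with the Nios II GNU toolchain, for example (the ELF file name here is just a placeholder):

nios2-elf-objdump -d app.elf > app.objdump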
The calling function is very simple; it is the called function that contains the read loop.
The called function should be the same for each call, right?
The called function's assembly code:

05009254 <corner_turn_main>:
 5009254: 2015883a mov r10,r4
 5009258: 0013883a mov r9,zero
 500925c: 00001506 br 50092b4 <corner_turn_main+0x60>
 5009260: 40800017 ldw r2,0(r8)
 5009264: 3885883a add r2,r7,r2
 5009268: 11000017 ldw r4,0(r2)
 500926c: 01400784 movi r5,30
 5009270: 3145383a mul r2,r6,r5
 5009274: 1245883a add r2,r2,r9
 5009278: 1085883a add r2,r2,r2
 500927c: 1085883a add r2,r2,r2
 5009280: 00c14474 movhi r3,1297
 5009284: 18e7c604 addi r3,r3,-24808
 5009288: 10c5883a add r2,r2,r3
 500928c: 11000015 stw r4,0(r2)
 5009290: 00c00044 movi r3,1
 5009294: 30cd883a add r6,r6,r3
 5009298: 00800104 movi r2,4
 500929c: 388f883a add r7,r7,r2
 50092a0: 317fef1e bne r6,r5,5009260 <corner_turn_main+0xc>
 50092a4: 48d3883a add r9,r9,r3
 50092a8: 5095883a add r10,r10,r2
 50092ac: 00800a04 movi r2,40
 50092b0: 48800426 beq r9,r2,50092c4 <corner_turn_main+0x70>
 50092b4: 5011883a mov r8,r10
 50092b8: 000d883a mov r6,zero
 50092bc: 000f883a mov r7,zero
 50092c0: 003fe706 br 5009260 <corner_turn_main+0xc>
 50092c4: f800283a ret

The calling function's assembly code:

 5008cc8: 04010034 movhi r16,1024
 5008ccc: 84062804 addi r16,r16,6304
 5008cd0: 84800037 ldwio r18,0(r16)
 5008cd4: a009883a mov r4,r20
 5008cd8: 50092540 call 5009254 <corner_turn_main>
 5008cdc: 84400037 ldwio r17,0(r16)
 5008ce0: 8ca3c83a sub r17,r17,r18
 5008ce4: 04c000b4 movhi r19,2
 5008ce8: 9cc7af04 addi r19,r19,7868
 5008cec: 9809883a mov r4,r19
 5008cf0: 880b883a mov r5,r17
 5008cf4: 000feb00 call feb0 <printf>
 5008cf8: 84800037 ldwio r18,0(r16)
 5008cfc: a009883a mov r4,r20
 5008d00: 50092540 call 5009254 <corner_turn_main>
 5008d04: 84000037 ldwio r16,0(r16)
 5008d08: 84a1c83a sub r16,r16,r18
 5008d0c: 9809883a mov r4,r19
 5008d10: 800b883a mov r5,r16
 5008d14: 000feb00 call feb0 <printf>
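For readability, here is a rough C reconstruction of the disassembled loop (my reading of the listing, not the original source; the names, dimensions, and array declarations are all inferred):

/* hypothetical reconstruction of corner_turn_main from the listing above */
#define NROWS 40                 /* outer loop bound: movi r2,40 */
#define NCOLS 30                 /* inner loop bound: movi r5,30 */

extern int dst[];                /* base address: movhi r3,1297 / addi */

void corner_turn_main(int **rows)       /* r4: array of row pointers */
{
    for (int i = 0; i < NROWS; i++) {       /* counter in r9 */
        int *row = rows[i];                 /* ldw r2,0(r8)  */
        for (int j = 0; j < NCOLS; j++) {   /* counter in r6 */
            /* store index j*30 + i, as computed at 5009270..500928c */
            dst[j * NCOLS + i] = row[j];
        }
    }
}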
Actually, I just mean look at the "calling function assembly code" to see whether some of the instructions before the 'call' are omitted for the second one.
Assuming all the instructions and data of the called function are located in the tightly-coupled memory, and the code executes the same instructions between calls, it should have the same execution time. But that doesn't mean the stack operations leading into the call will be the same for both calls. So look at the instructions before each 'call' to make sure you are not simply seeing additional work done for the first call that is omitted for the second.
Thank you BadOmen,
I understand what you are saying. However, I still find this difference strange. The compiler did not emit anything different for the second call, and the difference grows as the number of loop iterations increases! I need somebody to help me understand this issue.
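One way to narrow this down (a sketch under the same timestamp-driver assumption as above) would be to time several back-to-back calls: if only the first call is slow and every later call is equally fast, the cause is a one-time warm-up effect whose cost scales with the loop length, not a fixed per-call overhead:

#include <stdio.h>
#include "sys/alt_timestamp.h"

extern void corner_turn_main(int **rows);

void repeat_test(int **rows)
{
    alt_timestamp_type t0, t1;

    alt_timestamp_start();
    for (int k = 0; k < 5; k++) {
        t0 = alt_timestamp();
        corner_turn_main(rows);
        t1 = alt_timestamp();
        /* expect call 0 to be slower, calls 1..4 roughly equal */
        printf("call %d: %lu ticks\n", k, (unsigned long)(t1 - t0));
    }
}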
I'd guess the dynamic branch predictor is changing the latency of some branches: the first call trains the predictor, so the loop branches mispredict less on the second call, and the saving scales with the iteration count.
There is a hidden menu (in SOPC Builder, at least) with some extra Nios II CPU options. One of them removes the dynamic branch prediction logic and falls back to the static predictions 'backwards taken' and 'forwards not taken'. It's actually a shame there isn't an 'assume all not taken' option. I needed to minimise the worst-case path, so I had to persuade gcc to generate forward branches to backwards jumps in quite a few places (by adding an asm volatile () that contains only comments, as sketched below).
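A sketch of how that trick might look (my interpretation of the description above, not the actual code): an asm statement containing only an assembler comment is opaque to gcc, so it cannot be deleted or duplicated, which discourages gcc from pulling the guarded block into the fall-through path:

/* hypothetical example of the asm-volatile-comment trick */
static int slow_path_count;

void handle(int rare)
{
    if (rare) {
        /* '#' starts a comment in Nios II assembly; the opaque block
           tends to keep this path out of line, so it is reached by a
           forward (statically predicted not-taken) branch */
        __asm__ volatile ("# rare path - keep out of line");
        slow_path_count++;
    }
    /* common path continues here */
}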
Thank you dsl, that makes sense. However, is there any trick to force the Nios II/f to use the static branch predictor?
SOPC Builder has a hidden menu that allows some additional configuration of the Nios II CPU.
If you ask your FAE, they might tell you how to find it. I'm not sure its presence is supposed to be public knowledge!
Who are the FAEs? I'm a university student.
