Performance Counter offset

Altera_Forum
Honored Contributor II
1,805 Views

Hi, I have a question about the Performance Counter component.

 

I have the following code to measure the overhead of starting/stopping the counter:

PERF_BEGIN (PERFORMANCE_COUNTER_BASE, 1);
PERF_END (PERFORMANCE_COUNTER_BASE, 1);

From the two lines above I always get a value of 3 clocks. So far so good. 

 

But when I measure the following code ... 

 

for (int x = 0; x < 10000; x++) {
    PERF_BEGIN (PERFORMANCE_COUNTER_BASE, 1);
    a++;
    PERF_END (PERFORMANCE_COUNTER_BASE, 1);
}

 

... I get a result of 10012 clocks.

 

Is the offset value of 3 clocks constant or not? It seems not to be, because if it were I would expect a value of about 40000 clocks (10000 * (1 clock + 3 clocks offset)).

 

What is the error in my reasoning?
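
The expectation in the question can be written out as a small calculation. This is only a sketch of the arithmetic, assuming one clock for `a++` and a fixed three-clock overhead per PERF_BEGIN/PERF_END pair as the post does; the function names are hypothetical and not part of any Altera API.

```c
#include <assert.h>
#include <stdint.h>

/* Clocks expected if every PERF_BEGIN/PERF_END pair adds a fixed overhead
 * on top of the measured body. */
uint64_t expected_clocks(uint64_t iterations, uint64_t body_clocks,
                         uint64_t overhead_clocks)
{
    return iterations * (body_clocks + overhead_clocks);
}

/* Average clocks the counter actually accumulated per iteration. */
double observed_clocks_per_iteration(uint64_t total_clocks,
                                     uint64_t iterations)
{
    assert(iterations > 0); /* avoid dividing by zero */
    return (double)total_clocks / (double)iterations;
}
```

With the numbers from the post, the model predicts 40000 clocks, while the measured 10012 clocks works out to roughly one clock per iteration, so the assumed per-pair cost of four clocks does not hold inside the loop.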
10 Replies
Altera_Forum

I'm only guessing here, but maybe you have an instruction and/or data cache, in which case the loop may run out of the cache and execute faster, whereas the single instance cannot benefit from it.

Altera_Forum

Hi, I noticed that the performance counter gives a different result the first time I run my program than on later runs. What could cause the difference?

Altera_Forum

It depends. What are you doing in between the begin and end? Cache could definitely cause this to happen.

Altera_Forum

I have a long function in between the begin and end. I also tried a smaller function, which gives 821 clocks on the performance counter, and the result is always the same. However, when I put in a bigger function, the result varies. I don't even have a data cache in the SOPC CPU. Where could the error be?

Altera_Forum

No i-cache either? The instruction cache could cause deltas as well.

Altera_Forum

I disabled the instruction cache in the SOPC CPU and it gave me errors. It is set to 2 KB of cache. So I enabled it again in SOPC and tried adding alt_icache_flush_all() before the performance counter, but it still gives a different clock count. In my SOPC system there is a shared memory between the CPU and the FPGA. I created a dual-port RAM using a custom instruction and linked it to the CPU. It has a fixed latency. Could this cause the difference in clock count?

Altera_Forum

 

--- Quote Start ---  

Could this cause the difference in clock count?

--- Quote End ---  

 

 

Yeah, that's a possibility. Cache and inconsistent system latencies are the typical causes of deltas like this. 

 

--slacker
Altera_Forum

Possibly the compiler moved the PERF_BEGIN/PERF_END calls outside the loop.

You need to look at the generated code.
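
One way to reduce the chance of the compiler moving or merging the measured statement is to make the variable volatile. The following is a minimal sketch, not the original poster's code: the helper name is hypothetical, and the PERF_BEGIN/PERF_END calls appear only as comments so the block compiles without the Altera HAL headers.

```c
#include <assert.h>

/* Hypothetical helper: increments a volatile variable `iterations` times.
 * Because `a` is volatile, the compiler must perform a real load and store
 * for every a++ and may not fold the loop into a single addition. */
int count_with_volatile(int iterations)
{
    volatile int a = 0;
    assert(iterations >= 0);
    for (int x = 0; x < iterations; x++) {
        /* On target: PERF_BEGIN (PERFORMANCE_COUNTER_BASE, 1); */
        a++;
        /* On target: PERF_END (PERFORMANCE_COUNTER_BASE, 1); */
    }
    return a;
}
```

Whether this changes the measurement still has to be confirmed in the generated code, for example in the objdump output.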
Altera_Forum

At times like these, what I typically do is simulate the system, watch the instructions (program counter) being executed, and compare them to the objdump file. This should give a lot of insight into whether cache misses or other performance penalties are occurring.

 

Also, I would highly recommend starting and stopping the performance counter *infrequently*. The performance counter is accurate, but if you constantly start and stop it, even the smallest amount of overhead adds up, and compared to a register increment instruction this overhead is very significant. Typically people start the counter before entering a loop and stop it after the loop completes. If you know how many iterations the loop took, you can get a ballpark estimate of how long each iteration took.
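
The "start once, stop once, divide by the iteration count" approach could look roughly like this. It is a sketch under assumptions: the helper name is hypothetical, the section time is assumed to have been read back from the counter already, and the target-side macros are shown only in a comment so the block builds on any host compiler.

```c
#include <assert.h>
#include <stdint.h>

/*
 * On target the section would bracket the whole loop once:
 *
 *   PERF_BEGIN (PERFORMANCE_COUNTER_BASE, 1);
 *   for (int x = 0; x < iterations; x++) { a++; }
 *   PERF_END (PERFORMANCE_COUNTER_BASE, 1);
 *
 * The single begin/end overhead is then subtracted once, not once per
 * iteration.
 */
double clocks_per_iteration(uint64_t section_clocks, uint64_t overhead_clocks,
                            uint64_t iterations)
{
    assert(iterations > 0);                    /* avoid dividing by zero */
    assert(section_clocks >= overhead_clocks); /* avoid unsigned underflow */
    return (double)(section_clocks - overhead_clocks) / (double)iterations;
}
```

The result is an average that includes loop control (compare and branch), so it is a ballpark figure for one iteration rather than the cost of the loop body alone.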

 

Edit: just realized I said the same thing DSL said, so +1 :)
Altera_Forum

For performance measurements I got our HW guys to put a 32-bit up-counter clocked by sys_clk onto an Avalon bus and read that. (Reading the counter with a custom instruction would save some clocks!)

 

I still needed to add asm("":::"memory") lines to stop gcc from caching memory values in registers (it acts as a memory barrier for the compiler), and then check the generated code (e.g. with objdump) to ensure the correct code was being counted.

Looking at the asm will also allow you to significantly improve the performance by writing C that gcc can compile to better code!
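
A rough sketch of the free-running counter approach described above, assuming GCC and a memory-mapped 32-bit counter. The function names and the way the counter address is passed in are hypothetical; the real address depends on the system design.

```c
#include <assert.h>
#include <stdint.h>

/* Compiler-only barrier in GCC syntax, as mentioned in the post.
 * It emits no instructions; it only stops the compiler from keeping
 * memory values in registers across this point. */
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

/* Elapsed ticks of a free-running 32-bit up-counter. Unsigned subtraction
 * wraps modulo 2^32, so a single rollover between the reads is handled. */
uint32_t elapsed_ticks(uint32_t start, uint32_t end)
{
    return (uint32_t)(end - start);
}

/* Times work() using the counter register that `counter` points to.
 * The pointer value is an assumption supplied by the caller. */
uint32_t time_section(volatile const uint32_t *counter, void (*work)(void))
{
    assert(counter != 0 && work != 0);
    COMPILER_BARRIER();
    uint32_t start = *counter;
    COMPILER_BARRIER();
    work();
    COMPILER_BARRIER();
    uint32_t end = *counter;
    COMPILER_BARRIER();
    return elapsed_ticks(start, end);
}
```

The barriers constrain only the compiler, not the hardware, so the generated code should still be checked to confirm that the two counter reads surround the intended instructions.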