
Why is the speed of operations not constant? Is the cache the reason?

Altera_Forum
Honored Contributor II

Hi  

 

First of all I'm using Quartus 8.1 and the NiosII IDE 8.1.  

 

The Nios II system I set up processes data from an ADC with the help of an FFT module that I added to the processor system. The output of the FFT module is read by a DMA controller, which copies the FFT results to an internal buffer (512 x 32 bit). Each FFT output word consists of a 16-bit real part and a 16-bit imaginary part. The goal is to calculate the magnitude of 4 bundles of data from the real and imaginary parts of each bin.

 

How do I do this?  

1. input the first bundle of data to the FFT module 

2. read out the FFT module with the DMA

3. calculate magnitude of all bins 

4. repeat these steps for the other 3 bundles of data 

 

The entire source code runs from an external SRAM, which requires 30 ns per access (3 cycles in my system, which runs at 100 MHz).

 

Here is the source code of the magnitude calculation for one bundle of data:

 

for (SampleCounter_u16 = 0; SampleCounter_u16 < 512; SampleCounter_u16++)
{
    /* extract 16bit real and imaginary from 32bit value */
    Real_s16 = (int16_t)(samples_u32sa[SampleCounter_u16] & 0xFFFF);
    Imag_s16 = (int16_t)((samples_u32sa[SampleCounter_u16] >> 16) & 0xFFFF);

    /* typecast with custom instruction */
    Real_f32 = CI_INTTOFLOAT(Real_s16);
    Imag_f32 = CI_INTTOFLOAT(Imag_s16);

    /* calculate magnitude and scale */
    Result1_f32a[SampleCounter_u16] =
        CI_FPSQRT(Real_f32*Real_f32 + Imag_f32*Imag_f32) / STEP_SIZE_1MILIG;
}

 

The problem is that the execution time of this code is not constant. I run this loop 4 times to calculate the magnitude of all 4 data bundles, and the overall time of each run is different (measured with the timestamp timer).

 

The next step was to measure the execution time of each line of code.

E.g. the line "Real_f32 = CI_INTTOFLOAT(Real_s16);" should have a constant execution time, but it does not. In one of the four loops the operation takes longer than in the other three loops.

E.g. during the first, second and fourth loop the operation takes 20 ns in each of the 512 iterations. During the third loop it takes 100 ns in each of the 512 iterations.

 

This is very strange. At first I thought the timer could be the reason, but then I changed from the NiosII-f to the NiosII-e processor and the phenomenon disappeared. With the NiosII-s and -f this strange behaviour occurs.

 

I assume the caching of the NiosII-s and -f is the reason for this, but I'm not sure. Does anybody have an idea about this?
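One way the cache hypothesis could produce exactly this pattern (one loop consistently slow, the others consistently fast) is address aliasing in a direct-mapped cache. The following is only a host-side toy model, not a description of the actual system: the cache size, line size, addresses and all names (`dm_cache`, `misses_for_pair`) are assumptions chosen for illustration. It counts misses when two addresses are touched alternately, as happens when a loop body and a helper routine (or two buffers) are used in turn.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_BYTES 4096u                 /* hypothetical cache size */
#define LINE_BYTES  32u                   /* hypothetical line size  */
#define NUM_LINES   (CACHE_BYTES / LINE_BYTES)

/* Minimal direct-mapped cache model: one tag per line plus a miss counter. */
typedef struct {
    uint32_t tag[NUM_LINES];
    uint8_t  valid[NUM_LINES];
    uint32_t misses;
} dm_cache;

static void dm_cache_reset(dm_cache *c)
{
    for (uint32_t i = 0; i < NUM_LINES; i++) {
        c->valid[i] = 0;
    }
    c->misses = 0;
}

/* Look up one address; on a miss, replace the line and count it. */
static void dm_cache_access(dm_cache *c, uint32_t addr)
{
    uint32_t line  = addr / LINE_BYTES;
    uint32_t index = line % NUM_LINES;    /* which cache line is used   */
    uint32_t tag   = line / NUM_LINES;    /* which memory block it holds */

    if (!c->valid[index] || c->tag[index] != tag) {
        c->valid[index] = 1;
        c->tag[index]   = tag;
        c->misses++;
    }
}

/* Touch addr_a and addr_b alternately n times and return the miss count. */
static uint32_t misses_for_pair(uint32_t addr_a, uint32_t addr_b, uint32_t n)
{
    dm_cache c;

    assert(n > 0);
    dm_cache_reset(&c);
    for (uint32_t i = 0; i < n; i++) {
        dm_cache_access(&c, addr_a);
        dm_cache_access(&c, addr_b);
    }
    return c.misses;
}
```

In this model, `misses_for_pair(0x1000, 0x1100, 512)` only misses on the first access to each address, while `misses_for_pair(0x1000, 0x2000, 512)` misses on every access because both addresses map to the same line. If something similar happens for one of the four loop copies, every iteration of that loop would pay an external SRAM access, which would also fit the observation that the effect disappears on the cacheless NiosII-e.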
5 Replies
Altera_Forum

The Nios-S and -F have a cache and the -e does not, so this could be one source.

Do you see any differences between -s and -f?

The -f has a data cache as well, but the -s has only an instruction cache.

Do you have a timer that produces interrupts that need to be handled?
Altera_Forum

The only difference between -s and -f I noticed is the absolute time required for an operation, but this is normal since the -f has a higher performance. The timer does not produce any interrupts. I use the Altera Interval Timer IP core as the timestamp timer, which runs in an endless loop. I take snapshots to determine the length of an operation, e.g. like this:

 

time1 = timestamp(); 

operation; 

time2 = timestamp(); 

duration = time2 -time1;
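A slightly more detailed version of this snapshot approach could record the minimum, maximum and sum per loop and subtract the cost of the timestamp call itself, so that a single slow iteration can be told apart from a uniformly slow loop. This is only a sketch: the names `timing_stats`, `timing_record`, `timing_overhead` and `fake_timestamp` are hypothetical, and `fake_timestamp` is a host-side stand-in for the real counter read (on the target one would pass whatever function reads the timestamp timer).

```c
#include <assert.h>
#include <stdint.h>

/* Any function that returns the current value of a free-running counter. */
typedef uint32_t (*timestamp_fn)(void);

typedef struct {
    uint32_t min_ticks;
    uint32_t max_ticks;
    uint64_t sum_ticks;
    uint32_t count;
} timing_stats;

static void timing_reset(timing_stats *s)
{
    s->min_ticks = UINT32_MAX;
    s->max_ticks = 0;
    s->sum_ticks = 0;
    s->count     = 0;
}

/* Cost of taking two back-to-back snapshots with nothing in between. */
static uint32_t timing_overhead(timestamp_fn now)
{
    uint32_t t1, t2;

    assert(now != 0);
    t1 = now();
    t2 = now();
    return t2 - t1;
}

/* Record one measurement; unsigned subtraction survives one counter wrap. */
static void timing_record(timing_stats *s, uint32_t t1, uint32_t t2,
                          uint32_t overhead)
{
    uint32_t d = t2 - t1;

    d = (d > overhead) ? (d - overhead) : 0;
    if (d < s->min_ticks) s->min_ticks = d;
    if (d > s->max_ticks) s->max_ticks = d;
    s->sum_ticks += d;
    s->count++;
}

/* Host-side stand-in for the hardware counter: advances 3 ticks per read. */
static uint32_t fake_ticks;
static uint32_t fake_timestamp(void)
{
    fake_ticks += 3;
    return fake_ticks;
}
```

Keeping one `timing_stats` per loop and comparing `min_ticks` with `max_ticks` afterwards shows whether the slow loop is slow in every iteration or only in a few of them. Note that the recording code itself occupies instruction cache lines, so it can shift the effect being measured.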
Altera_Forum

If your system runs at 100 MHz, one clock cycle is 10 ns.

You say that in loops 1, 2 and 4 the operation takes 20 ns = 2 clocks in each of the 512 iterations, but in loop 3 it takes 100 ns = 10 cycles in each of the 512 iterations.

If an interrupt, a refresh cycle or something similar were the cause, you would measure this additional time in all 4 loops and not only in loop 3.

This is indeed strange.

 

Have you set up SignalTap and monitored what is going on in each of the 4 loops, especially for those custom instructions?

 

Another question. You wrote:

Real_f32 = CI_INTTOFLOAT(Real_s16); 

Imag_f32 = CI_INTTOFLOAT(Imag_s16); 

Result1_f32a[SampleCounter_u16] = CI_FPSQRT(Real_f32*Real_f32 + Imag_f32*Imag_f32)/STEP_SIZE_1MILIG;

 

So you have 2 float multiplications and one float addition.

Why don't you write

 

Result1_f32a[SampleCounter_u16] = CI_FPSQRT( CI_INTTOFLOAT( Real_s16 * Real_s16 + Imag_s16 * Imag_s16 ) )/STEP_SIZE_1MILIG; 

 

Now you have 2 integer multiplications and one integer addition. This should be a bit faster, also because only one CI_INTTOFLOAT operation is needed.
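As a rough sketch of the integer variant, the unpacking and the sum of squares can be separated into small helpers and checked on a host before the result is handed to the custom instructions. The helper names `fft_real`, `fft_imag` and `mag_squared` are hypothetical. One caveat that the sketch makes explicit: if both parts are -32768, the sum is 2^31, which does not fit into a signed 32-bit integer, so the sum is kept unsigned here. Whether CI_INTTOFLOAT accepts an unsigned operand is not stated in this thread and would need to be checked.

```c
#include <stdint.h>

/* Unpack one FFT word: low 16 bits real, high 16 bits imaginary. */
static inline int16_t fft_real(uint32_t word)
{
    return (int16_t)(word & 0xFFFFu);
}

static inline int16_t fft_imag(uint32_t word)
{
    return (int16_t)((word >> 16) & 0xFFFFu);
}

/*
 * Squared magnitude in integer arithmetic.
 * Each square is at most 2^30 and fits into int32_t; the sum can reach
 * 2^31, so it is accumulated as uint32_t to avoid signed overflow.
 */
static inline uint32_t mag_squared(int16_t re, int16_t im)
{
    int32_t r = re;
    int32_t i = im;

    return (uint32_t)(r * r) + (uint32_t)(i * i);
}
```

In the loop, the value returned by `mag_squared(fft_real(w), fft_imag(w))` would then take the place of the integer expression passed to CI_INTTOFLOAT.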
Altera_Forum

Hi, 

 

I verified the custom instructions separately and they have a constant duration. Your tip to change the float operations to integer operations is very good and increases the performance. Nice!

 

I tried to monitor the SRAM access with SignalTap too, but I do not have enough internal memory available in my design to capture enough data, because the absolute times are much higher than I mentioned. 20 ns and 100 ns were just examples. Maybe the integer multiplication and addition reduce the times enough that I can capture more useful data.
Altera_Forum

Okay, that's what was confusing me a bit about the times you mentioned.

 

You could speed up the loop a bit more if you can move the division inside the square root function, so that it becomes an integer division ...

 

Can you monitor the SRAM access via SignalTap to see whether it sometimes takes a lot longer?

Or monitor different sections to see where these changes come from?