Hi
First of all, I'm using Quartus 8.1 and the Nios II IDE 8.1. The Nios II system I set up processes data from an ADC with the help of an FFT module I added to the processor system. The output of the FFT module is read by a DMA controller, which copies the FFT results to an internal buffer (512 x 32 bit). Each FFT output word consists of 16 bit real + 16 bit imaginary. The goal is to calculate the magnitude of 4 bundles of data from the real and imaginary parts of the bins.

How do I do this?
1. Input the first bundle of data to the FFT module
2. Read out the FFT module with the DMA
3. Calculate the magnitude of all bins
4. Repeat these steps for the other 3 bundles of data

The entire source code runs from an external SRAM which requires 30 ns per access (3 cycles in my system, running at 100 MHz).

Here is the source code of the magnitude calculation for one bundle of data:

    for(SampleCounter_u16 = 0; SampleCounter_u16 < 512; SampleCounter_u16++)
    {
        /* extract 16bit real and imaginary from 32bit value */
        Real_s16 = (int16_t)(samples_u32sa[SampleCounter_u16] & 0xFFFF);
        Imag_s16 = (int16_t)((samples_u32sa[SampleCounter_u16] >> 16) & 0xFFFF);
        /* typecast with custom instruction */
        Real_f32 = CI_INTTOFLOAT(Real_s16);
        Imag_f32 = CI_INTTOFLOAT(Imag_s16);
        /* calculate magnitude and scale */
        Result1_f32a[SampleCounter_u16] = CI_FPSQRT(Real_f32*Real_f32 + Imag_f32*Imag_f32)/STEP_SIZE_1MILIG;
    }

The problem is that the execution time of this source code varies. I perform this loop 4 times to calculate the magnitude of all 4 data bundles, and the overall time of each loop is different (measured with the timestamp timer).

The next step was to measure the execution time of each line of code. E.g. the code "Real_f32 = CI_INTTOFLOAT(Real_s16);" should have a constant execution time, but it doesn't. The operation takes longer during one of the four loops than during the other three. E.g. during the first, second and fourth loop the operation takes 20 ns on each of the 512 iterations. During the third loop it takes 100 ns on each of the 512 repetitions.

This is very strange. First I thought the timer could be the reason, but then I changed from the NiosII-f to the NiosII-e processor and the phenomenon disappeared. With the NiosII-s and -f this strange behaviour occurs. I assume the caching of the NiosII-s and -f is the reason for this, but I'm not sure. Does anybody have an idea about this?
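For readers following the data layout, here is a minimal host-side sketch of the unpacking step from the loop above, assuming (as in the posted code) that the real part sits in the low half-word and the imaginary part in the high half-word. The helper names fft_real and fft_imag are made up for illustration and are not part of the original project.

```c
#include <stdint.h>

/* Real part: low 16 bits of the FFT output word, reinterpreted as signed.
   The cast to int16_t relies on the usual two's-complement behaviour. */
static int16_t fft_real(uint32_t word)
{
    return (int16_t)(uint16_t)(word & 0xFFFFu);
}

/* Imaginary part: high 16 bits of the FFT output word, reinterpreted as signed. */
static int16_t fft_imag(uint32_t word)
{
    return (int16_t)(uint16_t)(word >> 16);
}
```

With this layout, the word 0x0004FFFD unpacks to real = -3 and imaginary = 4.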
5 Replies
The Nios II -s and -f have a cache and the -e does not, so this could be one source.
Do you see any differences between -s and -f? The -f has a data cache as well, but the -s has only an instruction cache. Do you have a timer that produces interrupts that need to be handled?
The only difference between -s and -f I noticed is the absolute time required for an operation, but this is normal since the -f has higher performance. The timer does not produce any interrupts. I use the Altera Interval Timer IP core as timestamp timer, which runs continuously in an endless loop. I take snapshots to determine the length of an operation, e.g. like this:

    time1 = timestamp();
    operation;
    time2 = timestamp();
    duration = time2 - time1;
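A rough sketch of the snapshot pattern above, extended to subtract the cost of the two snapshots themselves and to convert ticks to nanoseconds at the 100 MHz clock mentioned in the thread. The counter is passed in as a function pointer so the code can run on a host. On the Nios II HAL, alt_timestamp() from sys/alt_timestamp.h would presumably be the source, but that wiring is an assumption here. fake_now and fake_op are stand-ins invented purely for this illustration.

```c
#include <stdint.h>

#define CPU_CLOCK_HZ 100000000u   /* 100 MHz, as stated in the thread */

/* Any free-running tick counter; hypothetical abstraction for this sketch. */
typedef uint32_t (*tick_source_fn)(void);

/* Cost of two back-to-back snapshots, i.e. the measurement overhead. */
static uint32_t snapshot_overhead(tick_source_fn now)
{
    uint32_t t1 = now();
    uint32_t t2 = now();
    return t2 - t1;               /* unsigned subtraction survives one wrap */
}

/* Ticks spent in op(), with the snapshot overhead subtracted. */
static uint32_t measure_ticks(tick_source_fn now, void (*op)(void))
{
    uint32_t overhead = snapshot_overhead(now);
    uint32_t t1 = now();
    op();
    uint32_t t2 = now();
    uint32_t raw = t2 - t1;
    return raw > overhead ? raw - overhead : 0u;
}

/* Convert ticks to nanoseconds: 10 ns per tick at 100 MHz. */
static uint64_t ticks_to_ns(uint32_t ticks)
{
    return (uint64_t)ticks * 1000000000u / CPU_CLOCK_HZ;
}

/* Host-side stand-ins: each counter read costs 2 ticks, the operation 7. */
static uint32_t fake_ticks;
static uint32_t fake_now(void) { fake_ticks += 2u; return fake_ticks; }
static void fake_op(void) { fake_ticks += 7u; }
```

Calling measure_ticks(fake_now, fake_op) reports the 7 ticks of the simulated operation. On a cached core the first pass through any measured code may still include cache misses, so comparing a first and a repeated measurement can help separate cache effects from the operation itself.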
If your system runs at 100MHz, then you will have a clock cycle of 10ns.
You say that the 1st, 2nd and 4th loops take 20 ns = 2 clocks on each of the 512 iterations, but the 3rd takes 100 ns = 10 cycles on each of the 512 loop steps. If some interrupt or refresh cycle or whatever occurred, then you would measure this additional time within all 4 loops and not only in loop 3. This is indeed strange. Have you set up SignalTap and monitored what is going on in each of the 4 loop cycles, especially those custom instructions?

Another question. You wrote

    Real_f32 = CI_INTTOFLOAT(Real_s16);
    Imag_f32 = CI_INTTOFLOAT(Imag_s16);
    Result1_f32a[SampleCounter_u16] = CI_FPSQRT(Real_f32*Real_f32 + Imag_f32*Imag_f32)/STEP_SIZE_1MILIG;

so you have 2 float multiplications and one float addition. Why don't you write

    Result1_f32a[SampleCounter_u16] = CI_FPSQRT( CI_INTTOFLOAT( Real_s16 * Real_s16 + Imag_s16 * Imag_s16 ) )/STEP_SIZE_1MILIG;

Now you have 2 integer multiplications and one integer addition. This should be a bit faster, as also only one CI_INTTOFLOAT operation is needed.
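A small sketch of the integer sum of squares suggested above, written so it can be checked on a host. The helper name bin_power is made up for illustration. Each square fits in a signed 32-bit value, but the sum reaches 2^31 in the single corner case where both parts are -32768, so the sketch returns an unsigned value. Whether CI_INTTOFLOAT accepts that value unchanged depends on how the custom instruction treats its operand, which the thread does not say.

```c
#include <stdint.h>

/* Squared magnitude of one FFT bin, computed entirely in integers.
   The result would then go through the int-to-float conversion and
   the square root, as in the post above. */
static uint32_t bin_power(int16_t re, int16_t im)
{
    int32_t re32 = re;   /* widen first so the products cannot overflow */
    int32_t im32 = im;
    return (uint32_t)(re32 * re32) + (uint32_t)(im32 * im32);
}
```

For example, bin_power(-3, 4) gives 25, whose square root is the magnitude 5.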
Hi,
I verified the custom instructions separately and they have a constant duration. Your tip of changing the floats to integers is very good and increases the performance. Nice! I tried to monitor the SRAM access with SignalTap too, but I do not have enough internal memory available in my design to capture enough data, because the absolute times are much higher than I mentioned. 20 ns and 100 ns were just examples. Maybe the integer multiplication and addition reduce the times so that I can capture more valuable data.
Okay, that's what was confusing me a bit about the times you mentioned.
You could speed up the loop a bit more if you can move the division inside the square root function to gain an integer division ... Can you monitor the SRAM access via SignalTap to see if it sometimes takes a lot longer? Or monitor different sections to see where these changes come from?
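One possible reading of the division suggestion, sketched under the assumption that the scale factor can be expressed as a positive integer (the thread never gives the type or value of STEP_SIZE_1MILIG). Since sqrt(p)/k equals sqrt(p/(k*k)) for k > 0, the division can be done on the integer power before the square root. The helper name scale_power is hypothetical.

```c
#include <stdint.h>

/* Divide the integer power by the squared step before taking the root.
   step must be non-zero and below 65536 so that step * step fits in
   32 bits. Integer division truncates, so precision is lost whenever
   power is not much larger than step * step. */
static uint32_t scale_power(uint32_t power, uint32_t step)
{
    return power / (step * step);
}
```

For power = 400 and step = 10 this yields 4, whose root 2 matches sqrt(400)/10. For power = 99 and step = 10 it yields 0, although sqrt(99)/10 is about 0.99, so the trade-off between speed and resolution needs checking against the real value range.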