Nios® V/II Embedded Design Suite (EDS)

Floating point math

Altera_Forum
Honored Contributor II
1,519 Views

Hello, this will be my first real project with an FPGA, so I need some help.

Originally I was using an STM32F4 to compute a single-point DFT from a 128-point array, but even at 250 MHz it is too slow.

The C code looks like this (it first creates the coefficients in memory, so I can use the very fast FMAC operation from memory):

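#include <math.h>

/* Single-bin DFT (bin 6) of a 128-point array.
   C cannot return two values, so the results go to *real and *imag. */
void sum_array(const int *A, float *real, float *imag) {
    float cos_fast[128];
    float sin_fast[128];
    int k;

    /* Precompute the coefficients so the loop below is pure multiply-accumulate. */
    for (k = 0; k < 128; k++) {
        cos_fast[k] = cosf((6.28318531f * k * 6) / 128);
        sin_fast[k] = sinf((6.28318531f * k * 6) / 128);
    }

    *real = 0;
    *imag = 0;
    for (k = 0; k < 128; k++) {
        *real += cos_fast[k] * A[k];
        *imag += sin_fast[k] * A[k];
    }
}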

 

The idea is to use something like a counter, but instead of counting it outputs the next f32 number from memory (or elsewhere) on each step, and that number is then multiplied by the 16-bit value from the ADC. (Again, these are different data types. Should I just convert the number to f32 by putting zeros in the sign and exponent fields and the u16 bits in the mantissa?)

After that I just add each product to the running sum, and output the answer to my trusty STM32F4.
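
On the conversion question: leaving the sign and exponent fields at zero and placing the 16 ADC bits in the mantissa gives a denormal number, not the ADC value, because IEEE 754 single precision expects a normalized mantissa with a biased exponent. The sketch below shows one way to build the correct bit pattern for an unsigned 16-bit sample. The function names are hypothetical, and the ADC output is assumed to be unsigned straight binary.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: IEEE 754 single-precision bit pattern of an
   unsigned 16-bit value (assumes the ADC delivers straight binary). */
static uint32_t u16_to_f32_bits(uint16_t v)
{
    if (v == 0) {
        return 0u;                       /* +0.0f is all zero bits */
    }

    /* Find the position of the most significant set bit (0..15). */
    int msb = 15;
    while (((v >> msb) & 1u) == 0) {
        msb--;
    }

    /* Biased exponent: the value is 1.xxx * 2^msb, bias is 127. */
    uint32_t exponent = 127u + (uint32_t)msb;

    /* Shift the leading 1 up to bit 23, then drop it (it is implicit). */
    uint32_t mantissa = ((uint32_t)v << (23 - msb)) & 0x7FFFFFu;

    return (exponent << 23) | mantissa;  /* sign bit stays 0 */
}

/* Reinterpret a 32-bit pattern as a float without breaking aliasing rules. */
static float bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

In hardware the same steps map to a priority encoder (find the leading one) and a barrel shifter. The Altera floating-point user guide linked further down the thread should also describe a ready-made integer-to-float conversion block (ALTFP_CONVERT), which would likely be simpler than building this by hand.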
6 Replies
Altera_Forum
Honored Contributor II
401 Views

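The processor manual for that chip advertises 3 cycles for a floating-point multiply-accumulate. Your inner kernel looks like it needs two of them per sample, and at 250 MHz that is considerable processing power. For 128 samples that is on the order of 128 × 2 × 3 = 768 cycles, or about 3 µs, before load and loop overhead.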

 

If you are actually seeing performance significantly worse than that, then your question becomes an STM32 tools/optimization discussion, which is not appropriate for this forum, although many people here may be knowledgeable about it.

 

If you are certain you have already maxed out the chip and definitely require a co-processor in an FPGA, then you will need to say some more about your proposed system architecture and requirements.
Altera_Forum
Honored Contributor II
401 Views

 


 

Memory-to-memory FMAC is more than good enough for my task. The problem is timing: I need to check flags, the clock state, and so on, which is slow and kills performance.

The idea is simple.

The processor will generate a start pulse (50 ns long).

The FPGA will start the detector with the same pulse, wait 7 delay cycles (pipelined ADC), and after that multiply each ADC value by the corresponding floating-point constant and add the products together. That would generate the real and imaginary parts for that exact frequency.

So the C code would look like this:

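while (i < 128) {
    while (GPIOC->IDR < 32766);  /* is the FIFO empty? */
    CLK_LOW;
    CLK_HIGH;
    k = GPIOB->IDR;              /* ADC value from the FIFO */
    real += k * cosinusas;       /* cosinusas: cosine coefficient for this sample */
    imag -= k * sinusas;         /* sinusas: sine coefficient for this sample */
    i++;
}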

 

After that, I would output real and imag on 2×32-bit ports and do the complex math on the STM32F4. Or maybe it is possible to get the phase value directly from this?

C code for a fast atan2 is here:

dspguru.com/dsp/tricks/fixed-point-atan2-with-self-normalization 

If that were possible, I could implement a complete PIC controller inside the Cyclone FPGA.
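
For the phase question, here is a rough sketch of a self-normalizing atan2 approximation of the kind the linked page describes, written in floating point for readability. It is not copied from that page. The name fast_atan2 and the 1e-10 guard value are assumptions for illustration, and this first-order version is only accurate to roughly 0.07 rad, so it suits applications that can tolerate that error.

```c
#include <math.h>

#define PI_F 3.14159265f

/* Hypothetical helper: approximate atan2(y, x) in radians.
   The ratio r = (x - |y|) / (x + |y|) always stays within [-1, 1],
   so no range check on the division result is needed. */
static float fast_atan2(float y, float x)
{
    const float quarter_pi = PI_F / 4.0f;
    const float abs_y = fabsf(y) + 1e-10f;   /* guard against 0/0 at the origin */
    float angle;

    if (x >= 0.0f) {
        /* Right half-plane: angle runs from 0 to pi/2. */
        float r = (x - abs_y) / (x + abs_y);
        angle = quarter_pi - quarter_pi * r;
    } else {
        /* Left half-plane: angle runs from pi/2 to pi. */
        float r = (x + abs_y) / (abs_y - x);
        angle = 3.0f * quarter_pi - quarter_pi * r;
    }

    /* Mirror the result for the lower half-plane. */
    return (y < 0.0f) ? -angle : angle;
}
```

Calling fast_atan2(imag, real) on the accumulated sums would give an estimate of the phase of that frequency bin. The approximation needs one division per call, which is the expensive part if it is moved into the FPGA.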
Altera_Forum
Honored Contributor II
401 Views

Sure, you can do all that in an FPGA if you like. 

 

Your FPGA design would consist of three major parts:

1. the ADC/detector timing generation and sample acquisition 

2. the floating point arithmetic 

3. the external processor interface 

 

Are you looking for pointers to "getting started" tutorials etc. or did you already have some specific questions? 

 

Since the title of the thread you created is "floating point math", here is a pointer to the Altera-supplied floating point arithmetic modules, which are free to use: 

http://www.altera.com/literature/ug/ug_altfp_mfug.pdf 

 

The core of your algorithm would likely consist of simply chaining together the correct series of blocks, i.e. your STM32F4 MAC instruction would require an ALTFP_MULT followed by an ALTFP_ADD_SUB, with the necessary glue logic and registers in between.
Altera_Forum
Honored Contributor II
401 Views

A more precise question:

How do I make a "counter" that steps through constants? (Must it be done in Verilog? If yes, can someone at least guide me on how to do that?)

I could use an LPM constant and a bus multiplexer and write all the constants by hand, but there should be an easier way to do that.

Inputs: CLK (rising edge), reset. Output: f32 (a constant from my file or elsewhere).

All the multiplying and accumulating I could do in schematic entry (since I don't know Verilog).

The output should look like this:

RESET// 

1.00000000 //First rising clk edge 

0.95694035 //Second rising clk edge 

0.83146960 //Third rising clk edge 

0.63439327 //Fourth rising clk edge 

0.38268343 // and so on
Altera_Forum
Honored Contributor II
401 Views

One solution to your problem is to use a lookup table, just as you have done in your C code. 

 

In your C code, your arrays "cos_fast" and "sin_fast" could be realized with FPGA on-chip memory (ALTSYNCRAM as a ROM). 

The contents of that memory would be initialized with the contents of a data file which you would populate prior to compiling your FPGA. 

 

At run-time, your loop iterator "k" would be tied to the address input port of the given on-chip memory, while the data output ports of that memory would be tied to the input of your ALTFP_MULT multiplier.
Altera_Forum
Honored Contributor II
401 Views

OK, I found all the parts for making this. The question is how to deal with all the latency. Is there a simple way to do that, or do I need to boost the clock from a PLL? For example, if I run data at 30 MHz and have a 5-cycle delay (the multiplier has a latency of 5 clock cycles), do I need a 5x clock (150 MHz) and some latch?
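
On the latency question: a pipelined multiplier normally accepts a new operand pair on every clock, so a 5-cycle latency delays each result by 5 clocks without requiring a 5x clock. A common approach is to carry a "valid" flag through a delay line of the same depth, so the accumulator knows when real results start arriving. The C model below is only a cycle-level sketch of that idea. The names pipe_t, pipe_clock, and run_mac, and the LATENCY value of 5 (taken from the question), are illustrative assumptions and not part of any Altera block.

```c
#define LATENCY 5   /* assumed multiplier pipeline depth, from the question */

/* Delay line modelling the multiplier pipeline: data plus a valid flag. */
typedef struct {
    float data[LATENCY];
    int   valid[LATENCY];
} pipe_t;

/* Advance the pipeline by one clock: the oldest stage comes out,
   everything shifts by one, and the new input enters stage 0.
   Returns the valid flag of the value written to *out. */
static int pipe_clock(pipe_t *p, float in, int in_valid, float *out)
{
    float out_data  = p->data[LATENCY - 1];
    int   out_valid = p->valid[LATENCY - 1];

    for (int s = LATENCY - 1; s > 0; s--) {
        p->data[s]  = p->data[s - 1];
        p->valid[s] = p->valid[s - 1];
    }
    p->data[0]  = in;
    p->valid[0] = in_valid;

    *out = out_data;
    return out_valid;
}

/* Feed n coefficient/sample pairs, one per clock, and accumulate the
   products as they emerge. Writes the number of clocks used to *clocks. */
static float run_mac(const float *coeff, const float *sample, int n, int *clocks)
{
    pipe_t p = {{0}, {0}};
    float acc = 0.0f;
    int t = 0;       /* current clock */
    int done = 0;    /* products accumulated so far */

    while (done < n) {
        int   in_valid = (t < n);
        float product  = in_valid ? coeff[t] * sample[t] : 0.0f;
        float out;

        if (pipe_clock(&p, product, in_valid, &out)) {
            acc += out;   /* only accumulate results flagged as valid */
            done++;
        }
        t++;
    }

    *clocks = t;
    return acc;
}
```

With n samples the model finishes in n + LATENCY clocks rather than n × LATENCY, all at the data rate. The sketch ignores the latency of the floating-point adder inside the accumulate loop, which needs separate handling in a real design.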
