Hello, this will be my first real job with an FPGA, so I need some help.
Originally I was using an STM32F4 to compute a single-point DFT from a 128-point array, but even at 250 MHz it is too slow. The C code looks like this (the coefficients are created in memory first, so that I can use the very fast FMAC operation from memory):

    #include <math.h>

    /* Single-point DFT (bin 6) of a 128-sample array. */
    void sum_array(const int *A, float *real_out, float *imag_out)
    {
        int k;
        float cos_fast[128] = {0};
        float sin_fast[128] = {0};

        /* Precompute the coefficients so the inner loop is pure multiply-accumulate. */
        for (k = 0; k < 128; k++) {
            cos_fast[k] = cosf((6.28318531 * k * 6) / 128);
            sin_fast[k] = sinf((6.28318531 * k * 6) / 128);
        }

        float real = 0;
        float imag = 0;
        for (k = 0; k < 128; k++) {
            real += cos_fast[k] * A[k];
            imag += sin_fast[k] * A[k];
        }

        /* C cannot return two values, so hand both back through pointers. */
        *real_out = real;
        *imag_out = imag;
    }

The idea is to use something like a counter, but one that counts by supplying an f32 number from memory (or some other place) on each step, and then to multiply that number by the 16-bit value from the ADC. (Again, these are different types of data: should I just convert the ADC number to f32 by putting zeros in the sign and exponent fields and using the 16 ADC bits as the mantissa?) After that I simply add each product to the running sum and output the answer to my trusty STM32F4.
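On the conversion question in the post above, a small C sketch may help show why zero sign and exponent fields do not give the sample value. It assumes IEEE-754 single precision and an unsigned 16-bit sample; the function names are hypothetical and this is a bit-level illustration, not a description of any particular FPGA block. With an all-zero exponent field the pattern is a denormal (a tiny number near zero), so the leading 1 has to be found and the exponent derived from its position.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as an IEEE-754 single-precision float. */
static float bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Naive idea: sign = 0, exponent = 0, sample placed in the mantissa field.
   The result is a denormal number, not the sample value. */
static float u16_to_f32_naive(uint16_t sample)
{
    return bits_to_float((uint32_t)sample);
}

/* Normalizing conversion: locate the leading 1, derive the biased exponent
   from its position, and left-align the remaining bits in the 23-bit field. */
static float u16_to_f32_normalized(uint16_t sample)
{
    if (sample == 0)
        return 0.0f;

    int msb = 15;
    while (((sample >> msb) & 1u) == 0)
        msb--;

    uint32_t exponent = 127u + (uint32_t)msb;                   /* bias is 127 */
    uint32_t fraction = (uint32_t)sample & ((1u << msb) - 1u);  /* drop hidden 1 */
    uint32_t mantissa = fraction << (23 - msb);

    assert(mantissa < (1u << 23));
    return bits_to_float((exponent << 23) | mantissa);
}
```

In hardware the "find the leading 1" step is a priority encoder followed by a shifter, which is the work an integer-to-float conversion has to do in one form or another.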
6 Replies
The processor manual for that chip advertises 3 cycles for a floating point multiply-accumulate. It looks like your inner kernel requires two of them per sample, and at 250 MHz that is considerable processing power.
If you are actually seeing performance significantly worse than that, then your question becomes an STM32 tools/optimization discussion, which is not appropriate for this forum, although many people here may be knowledgeable about it. If you are certain you have already maxed out the chip and definitely require a co-processor in an FPGA, then you will need to say more about your proposed system architecture and requirements.
--- Quote Start ---
The processor manual for that chip advertises 3 cycles for a floating point multiply-accumulate. It looks like your inner kernel requires two of them per sample, and at 250 MHz that is considerable processing power. If you are actually seeing performance significantly worse than that, then your question becomes an STM32 tools/optimization discussion, which is not appropriate for this forum, although many people here may be knowledgeable about it. If you are certain you have already maxed out the chip and definitely require a co-processor in an FPGA, then you will need to say more about your proposed system architecture and requirements.
--- Quote End ---
Memory-to-memory FMAC is more than good enough for my task. The problem is timing: I need to check flags, the clock state and so on, and this is slow and kills performance. The idea is simple. The processor generates a start pulse (50 ns long), and the FPGA starts the detector with the same pulse, waits 7 delay cycles (pipelined ADC), and then multiplies each ADC value by the corresponding floating point constant and adds the products together. That would give the real and imaginary parts at that exact frequency. So the C code would look like this:

    while (i < 128) {
        while (GPIOC->IDR < 32766)
            ;                       // is the FIFO empty?
        CLK_LOW;
        CLK_HIGH;
        k = GPIOB->IDR;             // ADC value from the FIFO
        real += k * cosinusas;
        imag -= k * sinusas;
        i++;
    }

After that I would output real and imag on 2x32-bit ports and do the complex math on the STM32F4. Or maybe it is possible to get the phase value directly from this? C code for a fast atan2 is here: dspguru.com/dsp/tricks/fixed-point-atan2-with-self-normalization
If that is possible, I could implement the complete PIC controller inside the Cyclone FPGA.
Sure, you can do all that in an FPGA if you like.
Your FPGA design would consist of three major parts:
1. the ADC/detector timing generation and sample acquisition
2. the floating point arithmetic
3. the external processor interface
Are you looking for pointers to "getting started" tutorials etc., or did you already have some specific questions? Since the title of the thread you created is "floating point math", here is a pointer to the Altera-supplied floating point arithmetic modules, which are free to use: http://www.altera.com/literature/ug/ug_altfp_mfug.pdf
The core of your algorithm would likely consist of simply chaining together the correct series of blocks, i.e. your STM32F4 MAC instruction would require ALTFP_MULT followed by ALTFP_ADD_SUB, with the necessary glue logic and registers in between.
A more precise question:
How do I make a "counter" that counts through constants? (Must it be done in Verilog? If yes, can someone at least guide me on how to do that?) I could use LPM_CONSTANT blocks and a bus multiplexer and write all the constants by hand, but there should be an easier way. The block would have an input CLK (rising edge), a reset, and an f32 output (a constant from my file or elsewhere). All the multiplying and accumulating I could do in schematic entry (since I don't know Verilog). The output should look like this:

    RESET
    1.00000000 // first rising CLK edge
    0.95694035 // second rising CLK edge
    0.83146960 // third rising CLK edge
    0.63439327 // fourth rising CLK edge
    0.38268343 // and so on
One solution to your problem is to use a lookup table, just as you have done in your C code.
In your C code, your arrays "cos_fast" and "sin_fast" could be realized with FPGA on-chip memory (ALTSYNCRAM as a ROM). The contents of that memory would be initialized with the contents of a data file which you would populate prior to compiling your FPGA. At run-time, your loop iterator "k" would be tied to the address input port of the given on-chip memory, while the data output ports of that memory would be tied to the input of your ALTFP_MULT multiplier.
OK, I found all the parts for making this. The question is how to deal with all the latency. Is there a simple way to do that, or do I need to boost the clock from a PLL? For example, if I run data at 30 MHz and have a 5-cycle delay (say the multiplier has a latency of 5 clock cycles), do I need a 5x clock (150 MHz) and some latch?
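A small software model can illustrate what latency means for a pipelined block. It rests on an assumption that should be checked against the megafunction documentation: that the multiplier is fully pipelined, meaning it accepts a new operand pair on every clock and delivers each product a fixed number of clocks later. Under that assumption no faster clock is needed; a valid flag is simply delayed by the same number of stages as the data. The mult_pipe type and its functions are hypothetical, and the model covers only the multiplier, not the accumulator feedback path.

```c
#include <assert.h>
#include <string.h>

#define LATENCY 5

/* Software model of a fully pipelined multiplier: one shift register holds
   the products in flight, a second one holds the matching valid flags. */
typedef struct {
    float data[LATENCY];
    int   valid[LATENCY];
} mult_pipe;

static void pipe_reset(mult_pipe *p)
{
    assert(p != NULL);
    memset(p, 0, sizeof *p);
}

/* Advance one clock: accept a new operand pair and emit the product that
   entered LATENCY clocks ago. Returns 1 when *result is valid. */
static int pipe_clock(mult_pipe *p, float a, float b, int in_valid,
                      float *result)
{
    assert(p != NULL && result != NULL);
    *result = p->data[LATENCY - 1];
    int out_valid = p->valid[LATENCY - 1];
    for (int i = LATENCY - 1; i > 0; i--) {
        p->data[i] = p->data[i - 1];
        p->valid[i] = p->valid[i - 1];
    }
    p->data[0] = a * b;
    p->valid[0] = in_valid;
    return out_valid;
}
```

Feeding 128 samples into such a model takes 128 + LATENCY clocks in total: one sample goes in per clock, and the last 5 clocks only flush the pipeline. The accumulator should add a product only when the delayed valid flag is set.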