
UG-FFT, FFT IP Core, interpreting performance and resource utilization




I have an application where I need to perform 1024 point FFTs on multiple channels simultaneously. I'm gauging the capabilities of the UG-FFT IP Core while also narrowing down which Altera FPGA I will use for the task.  


This will be the first time I'm using an FPGA to do FFTs so I'd like to make sure I'm interpreting the information correctly from the user guide here: https://www.altera.com/en_us/pdfs/literature/ug/ug_fft.pdf 


The first question is maybe a silly one but it's not clear to me. In the table below, (table excerpt from user guide), what is the fmax column referring to? 




Is this the maximum clock rate at which I can clock the FFT Core for the particular device? Or is it the maximum number of FFTs per second I can compute using the particular device at its maximum clock rate? Or something else?


The second question has to do with interpreting the logic resource requirements. For example, let's say I need to compute ten (10) FFTs in parallel as quickly as possible, and I want to do it with an Arria V device using the 4-engine buffered burst option from the table above. Am I correct in assuming the total resources would be each column multiplied by 10? So for my example of 10 parallel FFTs with an Arria V, would my total ALM requirement be 44,850 (10 * 4,485)? And so on for the other resources?


Thanks in advance!
1 Reply



For your first question: this is the maximum clock rate they were able to achieve in that device family. But remember that they were only fitting a single FFT in the design with minimal other hardware, so the figure may be optimistic. Usually your clock rate is driven by your data sample rate, though that may not be the case in your design.


If your requirements are close to those numbers, expect to spend a lot of time on timing closure. But you should be able to reach 160-200 MHz with no problem, depending on the family.
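To connect the clock rate to the "FFTs per second" part of the question, here is a rough sketch. It assumes a core that accepts one complex sample per clock with no gaps between frames, which is my simplification and not a figure from the user guide; a buffered burst core will generally be slower than this. The function name and the `cycles_per_sample` parameter are made up for illustration.

```python
def transforms_per_second(clock_hz: float, fft_points: int,
                          cycles_per_sample: float = 1.0) -> float:
    """Upper-bound transform rate for one core.

    ASSUMPTION: the core consumes one sample every `cycles_per_sample`
    clocks and frames are fed back to back with no idle cycles.
    """
    return clock_hz / (fft_points * cycles_per_sample)


# The 160-200 MHz range mentioned above, for a 1024-point transform.
for clock_mhz in (160, 200):
    rate = transforms_per_second(clock_mhz * 1e6, 1024)
    print(f"{clock_mhz} MHz -> about {rate:,.0f} transforms/s per core")
```

Treat the result as a ceiling only. The real rate depends on the I/O data flow option you pick and on how fast you can actually deliver samples to the core.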


If you need better than that, you should look at the Stratix V or Stratix 10 family.


As far as size, yes, your reasoning is correct, but you also need to add whatever additional resources are required to mux/demux the data between the 10 cores.
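The scaling can be sketched as below. The 4,485 ALM figure is the one quoted in the question; the `glue_fraction` parameter and the 5% value are placeholders I made up to stand in for the mux/demux logic, not numbers from the user guide or from a real compile.

```python
import math


def estimate_total(per_core: int, cores: int, glue_fraction: float = 0.0) -> int:
    """Scale a single-core resource count to `cores` parallel instances.

    ASSUMPTION: glue logic (mux/demux, control) is modelled as a flat
    fraction of the combined core total. This is a guess, not a measurement.
    """
    core_total = per_core * cores
    return core_total + math.ceil(core_total * glue_fraction)


ALMS_PER_CORE = 4485  # Arria V, 4-engine buffered burst, as quoted in the question

print(estimate_total(ALMS_PER_CORE, 10))        # cores only: 44850
print(estimate_total(ALMS_PER_CORE, 10, 0.05))  # with a hypothetical 5% for glue logic
```

The same function applies to the other columns of the table (memory bits, DSP blocks, and so on), but only a trial compile of the full design will give you a dependable number.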


Again, these are estimates from a single run. The size of the FFT in your design may vary significantly depending on your timing constraints and the area/speed settings in the synthesis tool.

