Has anyone done a comparison of the different parallelization techniques for 1D FFTs?
I'd like to know which method is generally faster: internal threading or user threading?
Currently my application is single-threaded, but improving the FFT time would warrant the added complexity of handling our own threads for this app.
First of all, please look at the list of threaded functions in the version of MKL you are using. You can find this list in the MKL User's Guide (see chapter 6, "Using Intel MKL Parallelism").
There are some restrictions on this functionality, e.g. for the latest MKL 10.2 Update 4:
etc.
--Gennady
I am using MKL 10.2.3; my specific use case is a one-dimensional complex-to-complex transform.
At the beginning of Chapter 6, the user guide says "FFT" is threaded. It does not mention any of the restrictions listed in your reply.
In previous versions the memory array given to the FFT needed to be aligned to a multiple of 128 for best performance. Is this still the case?
Does running the transform using out-of-place versus in-place memory make a difference?
Hi,
Note that Gennady took that list of limitations from the MKL 10.2.4 User's Guide.
As for memory alignment: alignment to 16, to 128, or even to a page boundary should give better performance, because the DFT kernels use vectorized code and because pages are used more compactly (I mean fewer DTLB misses).
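To make the alignment point concrete, here is a minimal sketch of the address arithmetic involved. It is not MKL code: the helper names `aligned_buffer` and `is_aligned` are hypothetical, and it uses Python's `ctypes` only to obtain a real memory address. In an actual MKL application one would more likely request alignment directly from the allocator (MKL provides `mkl_malloc`, which takes an alignment argument).

```python
import ctypes


def aligned_buffer(nbytes, alignment=128):
    """Return (raw, offset) so that address_of(raw) + offset is aligned.

    Hypothetical helper for illustration; 'alignment' must be a power of two.
    """
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    # Over-allocate so that an aligned start always fits inside the buffer.
    raw = ctypes.create_string_buffer(nbytes + alignment)
    base = ctypes.addressof(raw)
    # Number of bytes to skip to reach the next aligned boundary.
    offset = (-base) % alignment
    return raw, offset


def is_aligned(address, alignment):
    """True if 'address' is a multiple of 'alignment'."""
    return address % alignment == 0


if __name__ == "__main__":
    raw, offset = aligned_buffer(16 * 1024, alignment=128)
    start = ctypes.addressof(raw) + offset
    print("128-aligned:", is_aligned(start, 128))
    print("16-aligned: ", is_aligned(start, 16))
```

An address aligned to 128 is automatically aligned to 16 as well, so choosing the larger alignment satisfies both cases mentioned above.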
As for out-of-place versus in-place for 1D transforms:
for small sizes the difference is insignificant (figures below are in Gfs for 1 thread):
Forward_DFT_C, x210, 8.432, 1th, 1D, in-place
Forward_DFT_C, x210, 8.502, 1th, 1D, out-of-place
Forward_DFT_C, x504, 9.896, 1th, 1D, in-place
Forward_DFT_C, x504, 9.885, 1th, 1D, out-of-place
but for large sizes (larger than the cache) the difference becomes more noticeable, with out-of-place roughly 2 to 3% faster:
Forward_DFT_C, x3211264, 4.196, 1th, 1D, in-place
Forward_DFT_C, x3211264, 4.312, 1th, 1D, out-of-place
Forward_DFT_C, x6250000, 3.763, 1th, 1D, in-place
Forward_DFT_C, x6250000, 3.837, 1th, 1D, out-of-place
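To put the figures above in perspective, the following sketch computes the relative difference between the out-of-place and in-place results. The numbers are copied from the post. Two assumptions are made here: that "x210" and similar fields denote the transform size, and that "Gfs" means GFlops, so higher is better. The function name `relative_gain` is hypothetical.

```python
# (size, in-place Gfs, out-of-place Gfs), copied from the post above.
# Assumption: the "x..." field is the transform size.
RESULTS = [
    (210, 8.432, 8.502),
    (504, 9.896, 9.885),
    (3211264, 4.196, 4.312),
    (6250000, 3.763, 3.837),
]


def relative_gain(in_place, out_of_place):
    """Fractional speed advantage of out-of-place over in-place.

    Positive means out-of-place is faster, negative means in-place is faster.
    """
    return (out_of_place - in_place) / in_place


if __name__ == "__main__":
    for size, in_place, out_of_place in RESULTS:
        gain = relative_gain(in_place, out_of_place)
        print(f"N={size:>8}: out-of-place vs in-place {gain:+.2%}")
```

For the two small sizes the difference stays within about one percent in either direction, while for the two large sizes out-of-place comes out ahead by a few percent.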