<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic An array of FFTSpec[i] and in Intel® Integrated Performance Primitives</title>
    <link>https://community.intel.com/t5/Intel-Integrated-Performance/Parallel-Intel-IPP-FFT-function/m-p/975897#M20971</link>
    <description>An array of &lt;STRONG&gt;FFTSpec&lt;/STRONG&gt;[N] and &lt;STRONG&gt;Buffer&lt;/STRONG&gt;[N] needs to be used, and the size of each array should be equal to the number of chunks:
...
parallel_for( 0, chunks, [=](size_t i )
{
&lt;STRONG&gt;ippsFFTFwd_CToC_32fc&lt;/STRONG&gt;( ..., &amp;amp;FFTSpec[i], &amp;amp;Buffer[i] );
} );
...
Also, please take a look at this article:

software.intel.com/en-us/articles/threading-and-intel-integrated-performance-primitives

for more information. A complete list of threaded IPP functions should be in the IPP docs folder.</description>
    <pubDate>Tue, 19 Mar 2013 13:00:00 GMT</pubDate>
    <dc:creator>SergeyKostrov</dc:creator>
    <dc:date>2013-03-19T13:00:00Z</dc:date>
    <item>
      <title>Parallel Intel IPP FFT function</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Parallel-Intel-IPP-FFT-function/m-p/975896#M20970</link>
      <description>&lt;P&gt;Hi experts:&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; I want to compute multiple FFTs in parallel, so my code is similar to the following:&lt;/P&gt;
&lt;P&gt;ippsFFTGetSize_C_32fc(....)&lt;/P&gt;
&lt;P&gt;ippsFFTInit_C_32fc(...FFTSpec,&amp;nbsp;Buffer)&lt;/P&gt;
&lt;P&gt;parallel_for(0, chunks, [=](size_t i){&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; ippsFFTFwd_CToC_32fc(...FFTSpec, Buffer);&lt;/P&gt;
&lt;P&gt;});&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; But I found that the results are not correct. I suspect that FFTSpec and Buffer hold state during the FFT operation, so the parallel FFTs conflict with each other.&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; Could you please let me know the real reason?&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; And is there any way to run multiple FFTs in parallel? (I do not want to put ippsFFTInit_C_32fc inside the loop, as it is time-consuming.)&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 19 Mar 2013 10:19:09 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Parallel-Intel-IPP-FFT-function/m-p/975896#M20970</guid>
      <dc:creator>caosun</dc:creator>
      <dc:date>2013-03-19T10:19:09Z</dc:date>
    </item>
    <item>
      <title>An array of FFTSpec[i] and</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Parallel-Intel-IPP-FFT-function/m-p/975897#M20971</link>
      <description>An array of &lt;STRONG&gt;FFTSpec&lt;/STRONG&gt;[N] and &lt;STRONG&gt;Buffer&lt;/STRONG&gt;[N] needs to be used, and the size of each array should be equal to the number of chunks:
...
parallel_for( 0, chunks, [=](size_t i )
{
&lt;STRONG&gt;ippsFFTFwd_CToC_32fc&lt;/STRONG&gt;( ..., &amp;amp;FFTSpec[i], &amp;amp;Buffer[i] );
} );
...
Also, please take a look at this article:

software.intel.com/en-us/articles/threading-and-intel-integrated-performance-primitives

for more information. A complete list of threaded IPP functions should be in the IPP docs folder.</description>
      <pubDate>Tue, 19 Mar 2013 13:00:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Parallel-Intel-IPP-FFT-function/m-p/975897#M20971</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-03-19T13:00:00Z</dc:date>
    </item>
    <item>
      <title>in that case for elimination</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Parallel-Intel-IPP-FFT-function/m-p/975898#M20972</link>
      <description>&lt;P&gt;In that case, to avoid thread oversubscription, you can call ippSetNumThreads(1).&lt;/P&gt;</description>
      <pubDate>Tue, 19 Mar 2013 13:56:27 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Parallel-Intel-IPP-FFT-function/m-p/975898#M20972</guid>
      <dc:creator>Gennady_F_Intel</dc:creator>
      <dc:date>2013-03-19T13:56:27Z</dc:date>
    </item>
    <item>
      <title>Please also take into account</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Parallel-Intel-IPP-FFT-function/m-p/975899#M20973</link>
      <description>Please also take into account that &lt;STRONG&gt;ippsFFTFwd_CToC_32fc&lt;/STRONG&gt; is threaded internally (I just verified this in v7.1) and that all internal threading will be removed in future versions of IPP. At the moment, if your data set is large and internal IPP threading is active, your own TBB-based threading could create more problems and degrade performance. I would call this "double-threaded" processing, and I think you really need to do a performance evaluation. Do you have any numbers as an example?

Gennady's suggestion should force single-threaded processing by the IPP function, and in that case your code looks good.</description>
      <pubDate>Tue, 19 Mar 2013 14:50:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Parallel-Intel-IPP-FFT-function/m-p/975899#M20973</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-03-19T14:50:00Z</dc:date>
    </item>
    <item>
      <title>Hi,</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Parallel-Intel-IPP-FFT-function/m-p/975900#M20974</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;I have already disabled the internal IPP OpenMP threading.&lt;/P&gt;
&lt;P&gt;It does not make sense to me, if I use&amp;nbsp;&lt;STRONG&gt;FFTSpec&lt;/STRONG&gt;[N] and&amp;nbsp;&lt;STRONG&gt;Buffer&lt;/STRONG&gt;[N] where the size of the array N should be equal to the number of chunks. The reasons are:&lt;/P&gt;
&lt;P&gt;1. The program might not know the number of chunks until it runs the parallel&amp;nbsp;&lt;STRONG&gt;ippsFFTFwd_CToC_32fc&lt;/STRONG&gt; calls; that is, the size is dynamic and cannot be known beforehand.&lt;/P&gt;
&lt;P&gt;2. The number of threads created is limited by the number of cores, and each thread performs multiple &lt;STRONG&gt;ippsFFTFwd_CToC_32fc&lt;/STRONG&gt;&amp;nbsp;calls serially. So the array size only needs to be, at most, equal to the number of cores, with each thread having its own &lt;STRONG&gt;FFTSpec&lt;/STRONG&gt;&amp;nbsp;and&amp;nbsp;&lt;STRONG&gt;Buffer&lt;/STRONG&gt;.&amp;nbsp;But how can I control that?&lt;/P&gt;
&lt;P&gt;Your comments are highly appreciated.&lt;/P&gt;</description>
      <pubDate>Mon, 08 Apr 2013 01:15:32 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Parallel-Intel-IPP-FFT-function/m-p/975900#M20974</guid>
      <dc:creator>caosun</dc:creator>
      <dc:date>2013-04-08T01:15:32Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...It does not make sense</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Parallel-Intel-IPP-FFT-function/m-p/975901#M20975</link>
      <description>&amp;gt;&amp;gt;...It does not make sense to me, if I use &lt;STRONG&gt;FFTSpec[N]&lt;/STRONG&gt; and &lt;STRONG&gt;Buffer[N]&lt;/STRONG&gt; where size of the array N should be equal to
&amp;gt;&amp;gt;the number of chunks...

This is by design of the function: if it is used in a multi-threaded environment, different threads cannot share these parameters. A similar problem with using IPP functions together with TBB was solved by another IDZ user in 2012.</description>
      <pubDate>Tue, 09 Apr 2013 12:59:38 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Parallel-Intel-IPP-FFT-function/m-p/975901#M20975</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-04-09T12:59:38Z</dc:date>
    </item>
    <item>
      <title>If all FFTs have the same</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Parallel-Intel-IPP-FFT-function/m-p/975902#M20976</link>
      <description>&lt;P&gt;If all FFTs have the same order, one FFTSpec is enough. In IPP terminology (described in the manual), a Spec is always const, while a State (for example, for FIRs and IIRs) stores temporary function state in order to support stream processing. So for correct threading you should create one common&amp;nbsp;FFTSpec and a number of unique buffers, one for each thread. Buffers are used for temporary storage after each butterfly, while the Spec contains only pre-calculated twiddle factors and the bit-reverse table.&lt;/P&gt;
&lt;P&gt;regards, Igor&lt;/P&gt;</description>
      <pubDate>Tue, 09 Apr 2013 20:04:33 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Parallel-Intel-IPP-FFT-function/m-p/975902#M20976</guid>
      <dc:creator>Igor_A_Intel</dc:creator>
      <dc:date>2013-04-09T20:04:33Z</dc:date>
    </item>
    <item>
      <title>    Thank you for your</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Parallel-Intel-IPP-FFT-function/m-p/975903#M20977</link>
      <description>&lt;P&gt;&amp;nbsp; &amp;nbsp; Thank you for your information.&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; Igor, for one FFT order we just need to create 1 FFTSpec, that's good.&amp;nbsp;I still need to create a number of buffers, one for each thread. But the problem is that I do not want to create one buffer per &lt;STRONG&gt;chunk&lt;/STRONG&gt;, since I do not know the number of chunks beforehand. If I just need to create number of chunks = number of CPU cores = number of threads, that will be great. Do you know how to do that with Intel TBB tools?&lt;/P&gt;</description>
      <pubDate>Wed, 10 Apr 2013 01:09:14 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Parallel-Intel-IPP-FFT-function/m-p/975903#M20977</guid>
      <dc:creator>caosun</dc:creator>
      <dc:date>2013-04-10T01:09:14Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...If I just need to create</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Parallel-Intel-IPP-FFT-function/m-p/975904#M20978</link>
      <description>&amp;gt;&amp;gt;...If I just need to create number of chunks = number of CPU cores = number of threads, that will be great...

This is actually what multi-threading guides recommend: ideally, you shouldn't exceed the number of logical cores. Oversubscription can also be used, but some performance impact is expected.

Also, take a look at &lt;STRONG&gt;Intel C++ Compiler User and Reference Guides&lt;/STRONG&gt;:
...
&lt;STRONG&gt;Cache Blocking&lt;/STRONG&gt;

Cache blocking involves structuring data blocks so that they conveniently fit into a portion of
the L1 or L2 cache. By controlling data cache locality, an application can minimize performance
delays due to memory bus access. The application controls the behavior by dividing a large
array into smaller blocks of memory so a thread can make repeated accesses to the data while
the data is still in cache.
For example, image processing and video applications are well suited to cache blocking
techniques because an image can be processed on smaller portions of the total image or video
frame. Compilers often use the same technique, by grouping related blocks of instructions close
together so they execute from the L2 cache.
The effectiveness of the cache blocking technique depends on data block size, processor cache
size, and the number of times the data is reused. Cache sizes vary based on processor. An
application can detect the data cache size using the CPUID instruction and dynamically adjust
cache blocking tile sizes to maximize performance. As a general rule, cache block sizes should
target approximately one-half to three-quarters the size of the physical cache. For systems
that are Hyper-Threading Technology (HT Technology) enabled target one-quarter to one-half
the physical cache size. (See Designing for Hyper-Threading Technology for more other design
considerations.)
...</description>
      <pubDate>Thu, 11 Apr 2013 04:01:02 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Parallel-Intel-IPP-FFT-function/m-p/975904#M20978</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-04-11T04:01:02Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...how do that with Intel</title>
      <link>https://community.intel.com/t5/Intel-Integrated-Performance/Parallel-Intel-IPP-FFT-function/m-p/975905#M20979</link>
      <description>&amp;gt;&amp;gt;...how to do that with Intel TBB tools?

Do you mean TBB classes? If yes, take a look at the TBB examples for details; the &lt;STRONG&gt;simple_partitioner&lt;/STRONG&gt; class could be used in your case.</description>
      <pubDate>Thu, 11 Apr 2013 04:06:01 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Integrated-Performance/Parallel-Intel-IPP-FFT-function/m-p/975905#M20979</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-04-11T04:06:01Z</dc:date>
    </item>
  </channel>
</rss>

