<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: HelloMKLwithDPCPP - how to make sure FFTs run in parallel? in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/HelloMKLwithDPCPP-how-to-make-sure-FFTs-run-in-parallel/m-p/1353331#M32648</link>
    <description>&lt;P&gt;Frans,&lt;/P&gt;
&lt;P&gt;&amp;gt;&amp;gt; Why is the dynamically linked executable 8 times slower than the statically linked one (both using mkl_intel_thread)?&lt;/P&gt;
&lt;P&gt;Performance should be the same whether static or dynamic linking is used.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;We have to double-check this case.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;What is the problem size?&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Did you run the code on a Core(TM) i9-9880H CPU?&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Did you measure the compute_forward/backward part only?&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Fri, 21 Jan 2022 04:51:28 GMT</pubDate>
    <dc:creator>Gennady_F_Intel</dc:creator>
    <dc:date>2022-01-21T04:51:28Z</dc:date>
    <item>
      <title>HelloMKLwithDPCPP - how to make sure FFTs run in parallel?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/HelloMKLwithDPCPP-how-to-make-sure-FFTs-run-in-parallel/m-p/1346508#M32499</link>
      <description>&lt;P&gt;Hi there,&lt;/P&gt;
&lt;P&gt;I have a pretty basic/conceptual question with respect to the usage of oneMKL in combination with DPC++ and Visual Studio 2017.&lt;/P&gt;
&lt;P&gt;I'm running the example code (see below) on the 16 logical cores of the Intel Core i9-9880H CPU&amp;nbsp;@ 2.30 GHz of my Dell Precision 7540 and hoped that oneMKL would &lt;U&gt;use as many cores as possible to calculate 10 complex-valued FFTs (12.5 million points each)&lt;/U&gt;. At the moment it looks like I'm only using a single core, given that calculating these 10 FFTs takes 10 times longer than calculating a single FFT.&lt;/P&gt;
&lt;P&gt;Here's the output of the program:&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Frans1_0-1640349129485.png" style="width: 400px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/25069i02779869765A3CE8/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" role="button" title="Frans1_0-1640349129485.png" alt="Frans1_0-1640349129485.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;Here are my settings in Visual Studio 2017 (enabling the use of oneTBB doesn't help)&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Frans1_1-1640349203663.png" style="width: 400px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/25070i09DA7587107C1DA6/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" role="button" title="Frans1_1-1640349203663.png" alt="Frans1_1-1640349203663.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;Here's the output when running the VTune Profiler&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Frans1_2-1640349291206.png" style="width: 400px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/25071iFA5E6E2D24623CDE/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" role="button" title="Frans1_2-1640349291206.png" alt="Frans1_2-1640349291206.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;Apparently I'm overlooking &lt;U&gt;something essential&lt;/U&gt; to make the program take full advantage of the 16 cores in my CPU. Please let me know what I'm missing, and where I can find more &lt;U&gt;conceptual information&lt;/U&gt; about fine-tuning for optimal parallel usage of the CPU/GPU device (mainly from oneMKL, possibly in combination with oneTBB) using Visual Studio 2017.&lt;/P&gt;
&lt;P&gt;Thanks and regards,&lt;/P&gt;
&lt;P&gt;Frans&lt;/P&gt;
&lt;P&gt;------------------------------------&lt;/P&gt;
&lt;P&gt;&lt;FONT size="2"&gt;#include &amp;lt;mkl.h&amp;gt;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;#include &amp;lt;CL/sycl.hpp&amp;gt;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;#include &amp;lt;iostream&amp;gt;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;#include &amp;lt;string&amp;gt;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;#include &amp;lt;oneapi/mkl/dfti.hpp&amp;gt;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;#include &amp;lt;oneapi/mkl/rng.hpp&amp;gt;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;#include &amp;lt;complex&amp;gt;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;#include &amp;lt;chrono&amp;gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="2"&gt;int main(int argc, char** argv)&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;{&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;try&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;{&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;// Probably not 100% idiot-proof ... using 25 Mio points on CPU by default and 4 parallel complex-valued FFTs&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;unsigned int nrOfPoints = (argc &amp;lt; 2) ? 25000000U : std::stoi(argv[1]);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;std::string selector = (argc &amp;lt; 3) ? "cpu" : argv[2];&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;unsigned int nrOfParallelFFTs = (argc &amp;lt; 4) ? 4U : std::stoi(argv[3]);&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="2"&gt;sycl::queue Q;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;if (selector == "cpu")&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;Q = sycl::queue(sycl::cpu_selector{});&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;else if (selector == "gpu")&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;Q = sycl::queue(sycl::gpu_selector{});&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;else if (selector == "host")&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;Q = sycl::queue(sycl::host_selector{});&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;else&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;{&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;std::cout &amp;lt;&amp;lt; "Please use: " &amp;lt;&amp;lt; argv[0] &amp;lt;&amp;lt; " &amp;lt;nrOfPoints:25Mio&amp;gt; &amp;lt;selector cpu|gpu|host&amp;gt; &amp;lt;nrOfParallelFFTs:4&amp;gt;" &amp;lt;&amp;lt; std::endl;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;return EXIT_FAILURE;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;}&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="2"&gt;auto sycl_device = Q.get_device();&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;auto sycl_context = Q.get_context();&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="2"&gt;// Get more specific info about the device&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;std::cout &amp;lt;&amp;lt; "Running on: " &amp;lt;&amp;lt; sycl_device.get_info&amp;lt;sycl::info::device::name&amp;gt;() &amp;lt;&amp;lt; std::endl;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;std::cout &amp;lt;&amp;lt; " Available number of parallel compute units (a.k.a. cores): " &amp;lt;&amp;lt; sycl_device.get_info&amp;lt;sycl::info::device::max_compute_units&amp;gt;() &amp;lt;&amp;lt; std::endl;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="2"&gt;// Use fixed seed in combination with random data&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;std::uint32_t seed = 0;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;oneapi::mkl::rng::mcg31m1 pseudoRndGen(Q, seed); // Initialize the pseudo-random generator.&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;// Uniform distribution only supports floats and doubles (e.g. not std::complex&amp;lt;float&amp;gt;)&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;// Use reinterpret_cast&amp;lt;float*&amp;gt; to fill an array of complex&amp;lt;float&amp;gt; values&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;oneapi::mkl::rng::uniform&amp;lt;float, oneapi::mkl::rng::uniform_method::standard&amp;gt; uniformDistribution(-1, 1);&lt;/FONT&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;// Complex-valued IQ data.&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;auto iqData = sycl::malloc_shared&amp;lt;std::complex&amp;lt;float&amp;gt;&amp;gt;(nrOfPoints, sycl_device, sycl_context);&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="2"&gt;// Interpret as float values such that we can use the random generator&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;oneapi::mkl::rng::generate(uniformDistribution, pseudoRndGen, 2 * nrOfPoints, reinterpret_cast&amp;lt;float*&amp;gt;(iqData)).wait();&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="2"&gt;// Keeping track of 1st value to compensate for current scaling impact&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;// when combining forward and backward FFT.&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;auto iqData1 = iqData[0];&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="2"&gt;std::cout &amp;lt;&amp;lt; "iqData before in-place FFT: " &amp;lt;&amp;lt; iqData[0] &amp;lt;&amp;lt; " .. " &amp;lt;&amp;lt; iqData[1] &amp;lt;&amp;lt; " .. " &amp;lt;&amp;lt; std::endl;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="2"&gt;oneapi::mkl::dft::descriptor&amp;lt;oneapi::mkl::dft::precision::SINGLE, oneapi::mkl::dft::domain::COMPLEX&amp;gt; fftDescriptorIQ(nrOfPoints);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;// Don't forget to commit the FFT descriptor to the queue.&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;fftDescriptorIQ.commit(Q);&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="2"&gt;auto startTime = std::chrono::system_clock::now();&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;oneapi::mkl::dft::compute_forward&amp;lt; oneapi::mkl::dft::descriptor&amp;lt;oneapi::mkl::dft::precision::SINGLE, oneapi::mkl::dft::domain::COMPLEX&amp;gt;, std::complex&amp;lt;float&amp;gt;&amp;gt;(fftDescriptorIQ, iqData).wait();&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;auto stopTime = std::chrono::system_clock::now();&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="2"&gt;std::cout &amp;lt;&amp;lt; "iqData after in-place forward FFT: " &amp;lt;&amp;lt; iqData[0] &amp;lt;&amp;lt; " .. " &amp;lt;&amp;lt; iqData[1] &amp;lt;&amp;lt; " .. " &amp;lt;&amp;lt; std::endl;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;std::cout &amp;lt;&amp;lt; "Elapsed time (ms) for " &amp;lt;&amp;lt; nrOfPoints &amp;lt;&amp;lt; " points (complex, in-place): " &amp;lt;&amp;lt; std::chrono::duration_cast&amp;lt;std::chrono::milliseconds&amp;gt;(stopTime - startTime).count() &amp;lt;&amp;lt; std::endl;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="2"&gt;startTime = std::chrono::system_clock::now();&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;oneapi::mkl::dft::compute_backward&amp;lt; oneapi::mkl::dft::descriptor&amp;lt;oneapi::mkl::dft::precision::SINGLE, oneapi::mkl::dft::domain::COMPLEX&amp;gt;, std::complex&amp;lt;float&amp;gt;&amp;gt;(fftDescriptorIQ, iqData).wait();&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;stopTime = std::chrono::system_clock::now();&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="2"&gt;std::cout &amp;lt;&amp;lt; "iqData after in-place backward FFT: " &amp;lt;&amp;lt; iqData[0] &amp;lt;&amp;lt; " .. " &amp;lt;&amp;lt; iqData[1] &amp;lt;&amp;lt; " .. " &amp;lt;&amp;lt; std::endl;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;std::cout &amp;lt;&amp;lt; "Elapsed time (ms) for " &amp;lt;&amp;lt; nrOfPoints &amp;lt;&amp;lt; " points (complex, in-place): " &amp;lt;&amp;lt; std::chrono::duration_cast&amp;lt;std::chrono::milliseconds&amp;gt;(stopTime - startTime).count() &amp;lt;&amp;lt; std::endl;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="2"&gt;// PROPER SOLUTION: make sure we perform proper scaling in forward and backward FFT&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;auto scaleFactor = abs(iqData[0]) / abs(iqData1);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;std::cout &amp;lt;&amp;lt; "iqData after in-place backward FFT and scaling: " &amp;lt;&amp;lt; iqData[0] / scaleFactor &amp;lt;&amp;lt; " .. " &amp;lt;&amp;lt; iqData[1] / scaleFactor &amp;lt;&amp;lt; " .. " &amp;lt;&amp;lt; std::endl;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="2"&gt;sycl::free(iqData, sycl_context);&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="2"&gt;// +++ PARALLEL FFTS +++&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;// How to allocated an array of USM? Safe to use new?&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;std::complex&amp;lt;float&amp;gt; **iqDataMC = new std::complex&amp;lt;float&amp;gt; *[nrOfParallelFFTs];&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;sycl::event *eventsMC = new sycl::event[nrOfParallelFFTs];&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="2"&gt;std::cout &amp;lt;&amp;lt; "Creating " &amp;lt;&amp;lt; nrOfParallelFFTs &amp;lt;&amp;lt; " sets of " &amp;lt;&amp;lt; nrOfPoints &amp;lt;&amp;lt; " random IQ data." &amp;lt;&amp;lt; std::endl;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="2"&gt;// Can we do this too in parallel?&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;int i;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;for (i = 0; i &amp;lt; nrOfParallelFFTs; i++)&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;{&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;iqDataMC[i] = sycl::malloc_shared&amp;lt;std::complex&amp;lt;float&amp;gt;&amp;gt;(nrOfPoints, sycl_device, sycl_context);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;oneapi::mkl::rng::generate(uniformDistribution, pseudoRndGen, 2 * nrOfPoints, reinterpret_cast&amp;lt;float*&amp;gt;(iqDataMC[i])).wait();&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;}&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="2"&gt;std::cout &amp;lt;&amp;lt; "Done creating " &amp;lt;&amp;lt; nrOfParallelFFTs &amp;lt;&amp;lt; " sets of " &amp;lt;&amp;lt; nrOfPoints &amp;lt;&amp;lt; " random complex IQ data." &amp;lt;&amp;lt; std::endl;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;std::cout &amp;lt;&amp;lt; "Performing " &amp;lt;&amp;lt; nrOfParallelFFTs &amp;lt;&amp;lt; " forward complex single-precision FFTs." &amp;lt;&amp;lt; std::endl;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="2"&gt;startTime = std::chrono::system_clock::now();&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="2"&gt;// Need for a parallel_for loop ?&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;// Based on timing we clearly don't have any speed improvement ... time multiplied by # FFTS ...&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;// Do we need to commit different FFT descriptors to different queues?&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;// Given the fact that a queue is linked to a device, one would expect this device to properly deal&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;// with parallelism ...&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;for (i = 0; i &amp;lt; nrOfParallelFFTs; i++)&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;eventsMC[i] = oneapi::mkl::dft::compute_forward&amp;lt;oneapi::mkl::dft::descriptor&amp;lt;oneapi::mkl::dft::precision::SINGLE, oneapi::mkl::dft::domain::COMPLEX&amp;gt;, std::complex&amp;lt;float&amp;gt;&amp;gt;(fftDescriptorIQ, iqDataMC[i]);&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="2"&gt;// Wait for all events to indicate the calculation is done ...&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;for (i = 0; i &amp;lt; nrOfParallelFFTs; i++)&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;eventsMC[i].wait();&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="2"&gt;stopTime = std::chrono::system_clock::now();&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="2"&gt;std::cout &amp;lt;&amp;lt; "Done performing " &amp;lt;&amp;lt; nrOfParallelFFTs &amp;lt;&amp;lt; " forward complex single-precision FFTs." &amp;lt;&amp;lt; std::endl;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;std::cout &amp;lt;&amp;lt; "Elapsed time (ms) for " &amp;lt;&amp;lt; nrOfParallelFFTs &amp;lt;&amp;lt; " forward complex single-precision FFTs: " &amp;lt;&amp;lt; std::chrono::duration_cast&amp;lt;std::chrono::milliseconds&amp;gt;(stopTime - startTime).count() &amp;lt;&amp;lt; std::endl;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="2"&gt;for (i = 0; i &amp;lt; nrOfParallelFFTs; i++)&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;sycl::free(iqDataMC[i], sycl_context);&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="2"&gt;return EXIT_SUCCESS;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;}&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;catch (sycl::exception&amp;amp; e)&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;{&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;std::cout &amp;lt;&amp;lt; "SYCL exception: " &amp;lt;&amp;lt; e.what() &amp;lt;&amp;lt; std::endl;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;}&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="2"&gt;}&lt;/FONT&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 24 Dec 2021 12:43:13 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/HelloMKLwithDPCPP-how-to-make-sure-FFTs-run-in-parallel/m-p/1346508#M32499</guid>
      <dc:creator>Frans1</dc:creator>
      <dc:date>2021-12-24T12:43:13Z</dc:date>
    </item>
    <item>
      <title>Re: HelloMKLwithDPCPP - how to make sure FFTs run in parallel?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/HelloMKLwithDPCPP-how-to-make-sure-FFTs-run-in-parallel/m-p/1346782#M32504</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thanks for reaching out to us.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Could you please try running your code from the Intel oneAPI command prompt by following the steps below?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;gt; set MKL_DOMAIN_NUM_THREADS="MKL_DOMAIN_ALL=1, MKL_DOMAIN_FFT=4"&lt;/P&gt;
&lt;P&gt;&amp;gt; set KMP_AFFINITY=granularity=fine,compact,1,0&lt;/P&gt;
&lt;P&gt;(Please refer to the following links for more details on improving performance with threading:&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;A href="https://www.intel.com/content/www/us/en/develop/documentation/onemkl-windows-developer-guide/top/managing-performance-and-memory/improving-performance-with-threading/using-additional-threading-control/mkl-domain-num-threads.html" target="_blank" rel="noopener"&gt;https://www.intel.com/content/www/us/en/develop/documentation/onemkl-windows-developer-guide/top/managing-performance-and-memory/improving-performance-with-threading/using-additional-threading-control/mkl-domain-num-threads.html&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&lt;A href="https://www.intel.com/content/www/us/en/develop/documentation/onemkl-windows-developer-guide/top/managing-performance-and-memory/improving-performance-with-threading/using-intel-hyper-threading-technology.html" target="_blank" rel="noopener"&gt;https://www.intel.com/content/www/us/en/develop/documentation/onemkl-windows-developer-guide/top/managing-performance-and-memory/improving-performance-with-threading/using-intel-hyper-threading-technology.html&lt;/A&gt;)&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;gt; dpcpp&amp;nbsp;-DMKL_ILP64&amp;nbsp;-I"%MKLROOT%\include" file.cpp -c /EHsc&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;gt; dpcpp file.obj -fsycl-device-code-split=per_kernel mkl_sycl.lib&amp;nbsp;mkl_intel_ilp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib sycl.lib OpenCL.lib&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;gt; file.exe 12500000 cpu 10&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;When we tried the above steps on our end, the run took less time than yours. Please refer to the screenshot for the output.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="VidyalathaB_Intel_0-1640605803749.png" style="width: 400px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/25093iC17022FDD55FA1CD/image-size/medium?v=v2&amp;amp;px=400&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" role="button" title="VidyalathaB_Intel_0-1640605803749.png" alt="VidyalathaB_Intel_0-1640605803749.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;gt;&amp;gt;&lt;I&gt;At this moment it looks like I'm only using a single core given the fact that .....&lt;/I&gt;&lt;/P&gt;
&lt;P&gt;We are working on this issue, we will get back to you soon.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Regards,&lt;/P&gt;
&lt;P&gt;Vidya.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 27 Dec 2021 16:46:35 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/HelloMKLwithDPCPP-how-to-make-sure-FFTs-run-in-parallel/m-p/1346782#M32504</guid>
      <dc:creator>VidyalathaB_Intel</dc:creator>
      <dc:date>2021-12-27T16:46:35Z</dc:date>
    </item>
    <item>
      <title>Re: HelloMKLwithDPCPP - how to make sure FFTs run in parallel?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/HelloMKLwithDPCPP-how-to-make-sure-FFTs-run-in-parallel/m-p/1346833#M32505</link>
      <description>&lt;P&gt;Thanks Vidya.&lt;/P&gt;
&lt;P&gt;Your compile/link settings speed up the actual execution as you can see below.&lt;/P&gt;
&lt;P&gt;Can you please explain why these settings don't show up when using&amp;nbsp;&lt;A href="https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl-link-line-advisor.html" target="_blank" rel="noopener"&gt;https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl-link-line-advisor.html&lt;/A&gt;&amp;nbsp;?&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Frans1_0-1640682903076.png" style="width: 400px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/25136i937AED0898C230A3/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" role="button" title="Frans1_0-1640682903076.png" alt="Frans1_0-1640682903076.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;Calculating 10 FFTs still takes roughly 10 times as long as calculating a single FFT. Hence it looks like I'm still only using a single core.&lt;/P&gt;
&lt;P&gt;Also, I question the relevance of&lt;BR /&gt;&amp;gt; set MKL_DOMAIN_NUM_THREADS="MKL_DOMAIN_ALL=1, MKL_DOMAIN_FFT=4"&lt;BR /&gt;because it is tied to OpenMP, while the documentation suggests OpenMP is only used with the C/Fortran API.&lt;/P&gt;
&lt;P&gt;As such I repeated your proposal &lt;U&gt;without setting these variables and I obtain the same performance&lt;/U&gt;.&lt;/P&gt;
&lt;P&gt;Something else that bothers me is that the main speed improvement is due to the &lt;U&gt;static linking instead of dynamic linking&lt;/U&gt;.&lt;BR /&gt;If I link dynamically using&lt;BR /&gt;&amp;gt;&amp;nbsp;dpcpp HelloMKLwithDPCPP.obj -fsycl-device-code-split=per_kernel mkl_sycl_dll.lib mkl_intel_ilp64_dll.lib mkl_intel_thread_dll.lib mkl_core_dll.lib libiomp5md.lib sycl.lib OpenCL.lib&lt;BR /&gt;the speed is roughly back to what it was (about 8 times slower).&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Frans1_1-1640685129781.png" style="width: 400px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/25139i1EA15E9A18DA25C1/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" role="button" title="Frans1_1-1640685129781.png" alt="Frans1_1-1640685129781.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;Also, &lt;U&gt;replacing mkl_intel_thread.lib with mkl_tbb_thread.lib&lt;/U&gt; (static linking) results in a similar slowdown (roughly a factor of 8).&lt;/P&gt;
&lt;P&gt;Can you please explain the observed speed impact (dynamic versus static, and mkl_tbb_thread versus mkl_intel_thread)?&lt;/P&gt;
&lt;P&gt;Looking forward to your feedback.&lt;/P&gt;
&lt;P&gt;Thanks and regards,&lt;/P&gt;
&lt;P&gt;Frans&lt;/P&gt;</description>
      <pubDate>Tue, 28 Dec 2021 09:59:47 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/HelloMKLwithDPCPP-how-to-make-sure-FFTs-run-in-parallel/m-p/1346833#M32505</guid>
      <dc:creator>Frans1</dc:creator>
      <dc:date>2021-12-28T09:59:47Z</dc:date>
    </item>
    <item>
      <title>Re:HelloMKLwithDPCPP - how to make sure FFTs run in parallel?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/HelloMKLwithDPCPP-how-to-make-sure-FFTs-run-in-parallel/m-p/1347708#M32520</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;&lt;I&gt;&amp;gt;&amp;gt;without setting these variables and I obtain the same performance.&lt;/I&gt;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Could you please share the output by setting MKL_VERBOSE=1 (set MKL_VERBOSE=1) before running your code (without setting the environment variables which was mentioned above)?&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;&lt;I&gt;&amp;gt;&amp;gt;Can you please explain why these settings don't show up when using&lt;/I&gt;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;We can get these settings if we select programming language as dpc++ instead of c++ language and we can get the option for linking against with TBB.&lt;/P&gt;&lt;P&gt;When linked it against mkl_intel_thread we can get speed as it is faster when compared with TBB threading.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;Vidya.&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Fri, 31 Dec 2021 06:40:35 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/HelloMKLwithDPCPP-how-to-make-sure-FFTs-run-in-parallel/m-p/1347708#M32520</guid>
      <dc:creator>VidyalathaB_Intel</dc:creator>
      <dc:date>2021-12-31T06:40:35Z</dc:date>
    </item>
    <item>
      <title>Re: Re:HelloMKLwithDPCPP - how to make sure FFTs run in parallel?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/HelloMKLwithDPCPP-how-to-make-sure-FFTs-run-in-parallel/m-p/1347743#M32521</link>
      <description>&lt;P&gt;Hey Vidya,&lt;/P&gt;
&lt;P&gt;Here's the verbose output when &lt;U&gt;not&lt;/U&gt; setting the variables (before and after set MKL_VERBOSE=1)&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Frans1_1-1640947665259.png" style="width: 400px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/25202iFF3D2297DC87CD4E/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" role="button" title="Frans1_1-1640947665259.png" alt="Frans1_1-1640947665259.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;Here's the verbose output when setting the variables (before and after set MKL_VERBOSE=1)&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Frans1_2-1640947762396.png" style="width: 400px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/25203i6FC58EFFBA11A76E/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" role="button" title="Frans1_2-1640947762396.png" alt="Frans1_2-1640947762396.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;Here's what I get when I use the linker advisor tool&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Frans1_3-1640948285908.png" style="width: 400px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/25204i32DE405A2CAE1BEE/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" role="button" title="Frans1_3-1640948285908.png" alt="Frans1_3-1640948285908.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;I also ran vtune (MKL_VERBOSE disabled and without setting the variables)&lt;BR /&gt;&amp;gt; vtune -collect threading HelloMKLwithDPCPP 12500000 cpu 10&lt;/P&gt;
&lt;P&gt;Here's part of the report&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Frans1_4-1640950859597.png" style="width: 400px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/25205iC48107C7020CC1F9/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" role="button" title="Frans1_4-1640950859597.png" alt="Frans1_4-1640950859597.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;Here's another portion of the output which I don't understand at this moment&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Frans1_5-1640951465732.png" style="width: 400px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/25206iDDC1FC7E2A03C2C3/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" role="button" title="Frans1_5-1640951465732.png" alt="Frans1_5-1640951465732.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;U&gt;Open questions:&lt;/U&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;With respect to the link advisor tool: I don't see&amp;nbsp;libiomp5md.lib showing up in its output, yet it appears as the top function with spin and overhead time. Why?&lt;/LI&gt;
&lt;LI&gt;Why doesn't the link advisor tool indicate that I should use&amp;nbsp;&lt;SPAN&gt;mkl_intel_thread, which is about 8 times faster?&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;Where can I find information explaining that I&amp;nbsp;should use&amp;nbsp;&lt;SPAN&gt;mkl_intel_thread instead of TBB? Speed is key to our application.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;Why is the dynamically linked executable 8 times slower than the statically linked one (both using mkl_intel_thread)?&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;Based on the MKL verbose output, it looks like I'm using 8 threads. Correct?&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;Where can I find information to interpret the MKL verbose output?&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;Running VTune using the oneAPI Tools Command Window gives a different result and reports thread oversubscription. How can I use the MKL verbose output and the VTune report to speed up the application? What are the key take-aways from both reports?&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Thanks and best wishes for 2022,&lt;/P&gt;
&lt;P&gt;Frans&lt;/P&gt;</description>
      <pubDate>Fri, 31 Dec 2021 11:53:50 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/HelloMKLwithDPCPP-how-to-make-sure-FFTs-run-in-parallel/m-p/1347743#M32521</guid>
      <dc:creator>Frans1</dc:creator>
      <dc:date>2021-12-31T11:53:50Z</dc:date>
    </item>
    <item>
      <title>Re: HelloMKLwithDPCPP - how to make sure FFTs run in parallel?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/HelloMKLwithDPCPP-how-to-make-sure-FFTs-run-in-parallel/m-p/1348043#M32525</link>
      <description>&lt;P&gt;Hi Frans,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;I&gt;&amp;gt;&amp;gt;With respect to the link advisor tool: I don't see libiomp5md.lib showing up while it shows up at the top function with spin and overhead time. Why?&lt;/I&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The likely reason is that the mkl_intel_thread option is included in your linking command, which is why libiomp5md.lib shows up in the VTune output.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If you link against TBB instead (as the link line advisor suggests), you will get output similar to the screenshot below (no occurrence of libiomp5md.lib):&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="VidyalathaB_Intel_2-1641206380816.png" style="width: 400px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/25255iA62E7BA1E85335B0/image-size/medium?v=v2&amp;amp;px=400&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" role="button" title="VidyalathaB_Intel_2-1641206380816.png" alt="VidyalathaB_Intel_2-1641206380816.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;I&gt;&amp;gt;&amp;gt;Based on the MKL verbose output, it looks like I'm using 8 threads. Correct?&lt;/I&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Yes, you are correct. It is utilizing 8 threads.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;I&gt;&amp;gt;&amp;gt;Where can I find information to interpret the MKL verbose output?&lt;/I&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Please refer to the link below, which describes the information contained in a Verbose call-description line and should help you interpret the MKL_VERBOSE output:&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;A href="https://www.intel.com/content/www/us/en/develop/documentation/onemkl-windows-developer-guide/top/managing-output/using-onemkl-verbose-mode/call-description-line.html" target="_blank" rel="noopener"&gt;https://www.intel.com/content/www/us/en/develop/documentation/onemkl-windows-developer-guide/top/managing-output/using-onemkl-verbose-mode/call-description-line.html&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;We are working on your issue internally and will address your remaining questions shortly.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Regards,&lt;/P&gt;
&lt;P&gt;Vidya.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 03 Jan 2022 10:40:15 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/HelloMKLwithDPCPP-how-to-make-sure-FFTs-run-in-parallel/m-p/1348043#M32525</guid>
      <dc:creator>VidyalathaB_Intel</dc:creator>
      <dc:date>2022-01-03T10:40:15Z</dc:date>
    </item>
    <item>
      <title>Re: HelloMKLwithDPCPP - how to make sure FFTs run in parallel?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/HelloMKLwithDPCPP-how-to-make-sure-FFTs-run-in-parallel/m-p/1353331#M32648</link>
      <description>&lt;P&gt;Frans,&lt;/P&gt;
&lt;P&gt;&amp;gt;&amp;gt; Why is the dynamically linked executable 8 times slower than the statically linked one (both using mkl_intel_thread)?&lt;/P&gt;
&lt;P&gt;The performance should be the same whether static or dynamic linking is used.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;We will double-check this case.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;What is the problem size?&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Did you run the code on the Core(TM) i9-9880H CPU?&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Did you measure only the compute_forward/backward part?&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 21 Jan 2022 04:51:28 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/HelloMKLwithDPCPP-how-to-make-sure-FFTs-run-in-parallel/m-p/1353331#M32648</guid>
      <dc:creator>Gennady_F_Intel</dc:creator>
      <dc:date>2022-01-21T04:51:28Z</dc:date>
    </item>
    <item>
      <title>Re: HelloMKLwithDPCPP - how to make sure FFTs run in parallel?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/HelloMKLwithDPCPP-how-to-make-sure-FFTs-run-in-parallel/m-p/1354049#M32657</link>
      <description>&lt;P&gt;&lt;SPAN&gt;&amp;gt;&amp;gt; Why is the dynamically linked executable 8 times slower than the statically linked one (both using mkl_intel_thread)?&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;We checked the problem on our end and see a similar performance. &lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Here are the verbose outputs (MKL 2021 Update 4, Linux OS):&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;Running on: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz&lt;BR /&gt;Available number of parallel compute units (a.k.a. cores): 160&lt;BR /&gt;MKL_VERBOSE &lt;STRONG&gt;oneMKL 2021.0 Update 4&lt;/STRONG&gt; Product build 20210904 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Intel(R) Deep Learning Boost (Intel(R) DL Boost), EVEX-encoded AES and Carry-Less Multiplication Quadword instructions, Lnx 2.30GHz tbb_thread&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;static:&lt;/STRONG&gt; (MKL_VERBOSE=1 ./_stat.x 12500000 cpu 10)&lt;STRONG&gt;:&lt;/STRONG&gt; &lt;BR /&gt;MKL_VERBOSE FFT(scfi12500000,tLim:80,desc:0x6a9fb00) 114.27ms CNR:OFF Dyn:1 FastMM:1&lt;BR /&gt;MKL_VERBOSE FFT(scfi12500000,tLim:80,desc:0x6a9fb00) 122.21ms CNR:OFF Dyn:1 FastMM:1&lt;BR /&gt;MKL_VERBOSE FFT(scfi12500000,tLim:80,desc:0x6a9fb00) 113.64ms CNR:OFF Dyn:1 FastMM:1&lt;BR /&gt;MKL_VERBOSE FFT(scfi12500000,tLim:80,desc:0x6a9fb00) 125.89ms CNR:OFF Dyn:1 FastMM:1&lt;BR /&gt;MKL_VERBOSE FFT(scfi12500000,tLim:80,desc:0x6a9fb00) 115.52ms CNR:OFF Dyn:1 FastMM:1&lt;BR /&gt;MKL_VERBOSE FFT(scfi12500000,tLim:80,desc:0x6a9fb00) 116.16ms CNR:OFF Dyn:1 FastMM:1&lt;BR /&gt;MKL_VERBOSE FFT(scfi12500000,tLim:80,desc:0x6a9fb00) 104.32ms CNR:OFF Dyn:1 FastMM:1&lt;BR /&gt;MKL_VERBOSE FFT(scfi12500000,tLim:80,desc:0x6a9fb00) 103.92ms CNR:OFF Dyn:1 FastMM:1&lt;BR /&gt;MKL_VERBOSE FFT(scfi12500000,tLim:80,desc:0x6a9fb00) 104.61ms CNR:OFF Dyn:1 FastMM:1&lt;BR /&gt;MKL_VERBOSE FFT(scfi12500000,tLim:80,desc:0x6a9fb00) 106.50ms CNR:OFF Dyn:1 FastMM:1&lt;BR /&gt;Done performing 10 forward complex single-precision FFTs.&lt;BR /&gt;Elapsed time (ms) for 10 forward complex single-precision &lt;STRONG&gt;FFTs: 1130&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;dynamic&lt;/STRONG&gt;: (MKL_VERBOSE=1 ./_dyn.x 12500000 cpu 10 )&lt;BR /&gt;Performing 10 forward complex single-precision FFTs.&lt;BR /&gt;MKL_VERBOSE FFT(scfi12500000,tLim:80,desc:0x2d7bbc0) 105.54ms CNR:OFF Dyn:1 FastMM:1&lt;BR /&gt;MKL_VERBOSE FFT(scfi12500000,tLim:80,desc:0x2d7bbc0) 104.11ms CNR:OFF Dyn:1 FastMM:1&lt;BR /&gt;MKL_VERBOSE FFT(scfi12500000,tLim:80,desc:0x2d7bbc0) 106.27ms CNR:OFF Dyn:1 FastMM:1&lt;BR /&gt;MKL_VERBOSE FFT(scfi12500000,tLim:80,desc:0x2d7bbc0) 106.64ms CNR:OFF Dyn:1 FastMM:1&lt;BR /&gt;MKL_VERBOSE FFT(scfi12500000,tLim:80,desc:0x2d7bbc0) 105.82ms CNR:OFF Dyn:1 FastMM:1&lt;BR /&gt;MKL_VERBOSE FFT(scfi12500000,tLim:80,desc:0x2d7bbc0) 104.91ms CNR:OFF Dyn:1 FastMM:1&lt;BR /&gt;MKL_VERBOSE FFT(scfi12500000,tLim:80,desc:0x2d7bbc0) 96.06ms CNR:OFF Dyn:1 FastMM:1&lt;BR /&gt;MKL_VERBOSE FFT(scfi12500000,tLim:80,desc:0x2d7bbc0) 104.94ms CNR:OFF Dyn:1 FastMM:1&lt;BR /&gt;MKL_VERBOSE FFT(scfi12500000,tLim:80,desc:0x2d7bbc0) 107.61ms CNR:OFF Dyn:1 FastMM:1&lt;BR /&gt;MKL_VERBOSE FFT(scfi12500000,tLim:80,desc:0x2d7bbc0) 106.05ms CNR:OFF Dyn:1 FastMM:1&lt;BR /&gt;Elapsed time (ms) for 10 forward complex single-precision &lt;STRONG&gt;FFTs: 1051&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 24 Jan 2022 11:57:13 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/HelloMKLwithDPCPP-how-to-make-sure-FFTs-run-in-parallel/m-p/1354049#M32657</guid>
      <dc:creator>Gennady_F_Intel</dc:creator>
      <dc:date>2022-01-24T11:57:13Z</dc:date>
    </item>
    <item>
      <title>Re:HelloMKLwithDPCPP - how to make sure FFTs run in parallel?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/HelloMKLwithDPCPP-how-to-make-sure-FFTs-run-in-parallel/m-p/1354720#M32664</link>
      <description>&lt;P&gt;The issue has not been reproduced and it is closing. We will no longer respond to this thread.&amp;nbsp;If you require additional assistance from Intel, please start a new thread.&amp;nbsp;Any further interaction in this thread will be considered community only.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Wed, 26 Jan 2022 08:16:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/HelloMKLwithDPCPP-how-to-make-sure-FFTs-run-in-parallel/m-p/1354720#M32664</guid>
      <dc:creator>Gennady_F_Intel</dc:creator>
      <dc:date>2022-01-26T08:16:29Z</dc:date>
    </item>
    <item>
      <title>Re: Re:HelloMKLwithDPCPP - how to make sure FFTs run in parallel?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/HelloMKLwithDPCPP-how-to-make-sure-FFTs-run-in-parallel/m-p/1356043#M32682</link>
      <description>&lt;P&gt;This is an "easy" way to close an issue ... I've been out of the loop for 2 weeks due to an emergency situation at home (in fact my father-in-law passed away on January 20th with funeral @ January 29th, so forgive me to only look into this now).&lt;BR /&gt;You tried this on a &lt;U&gt;different OS&lt;/U&gt; and a &lt;U&gt;different CPU&lt;/U&gt; and concluded it should also work in my situation, which is clearly not the case.&lt;/P&gt;
&lt;P&gt;I will re-post this based on our priority support.&lt;/P&gt;
&lt;P&gt;Regards,&lt;BR /&gt;Frans&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 31 Jan 2022 14:20:07 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/HelloMKLwithDPCPP-how-to-make-sure-FFTs-run-in-parallel/m-p/1356043#M32682</guid>
      <dc:creator>Frans1</dc:creator>
      <dc:date>2022-01-31T14:20:07Z</dc:date>
    </item>
  </channel>
</rss>

