HelloMKLwithDPCPP - how to make sure FFTs run in parallel?

Frans1 · ‎12-24-2021

Hi there,

I have a pretty basic/conceptual question with respect to the usage of oneMKL in combination with DPC++ and Visual Studio 2017.

I'm running the example code (see below) on the 16 logical cores (8 physical cores with Hyper-Threading) of the Intel Core i9-9880H CPU @ 2.30 GHz of my Dell Precision 7540 and hoped that oneMKL would use as many cores as possible to calculate 10 complex-valued FFTs (12.5 million points each). At this moment it looks like I'm only using a single core given the fact that it takes 10 times longer to calculate these 10 FFTs than a single FFT.

Here's the output of the program:

Here are my settings in Visual Studio 2017 (enabling the use of oneTBB doesn't help)

Here's the output when running the VTune Profiler

Apparently I'm overlooking something essential to make sure the program takes full advantage of the 16 logical cores in my CPU. Please let me know what I'm missing and where I can find more conceptual information about tuning for optimal parallel use of the CPU/GPU device (mainly with oneMKL, possibly in combination with oneTBB) using Visual Studio 2017.

Thanks and regards,

Frans

------------------------------------

#include <mkl.h>
#include <CL/sycl.hpp>
#include <iostream>
#include <string>
#include <oneapi/mkl/dfti.hpp>
#include <oneapi/mkl/rng.hpp>
#include <complex>
#include <chrono>
#include <cstdint>   // std::uint32_t
#include <cstdlib>   // EXIT_SUCCESS, EXIT_FAILURE
#include <exception> // std::exception

int main(int argc, char** argv)
{
    // Single-precision, complex-valued FFT descriptor type used throughout.
    using FftDescriptor = oneapi::mkl::dft::descriptor<oneapi::mkl::dft::precision::SINGLE, oneapi::mkl::dft::domain::COMPLEX>;

    try
    {
        // Probably not 100% idiot-proof ... using 25 million points on CPU by default and 4 parallel complex-valued FFTs
        unsigned int nrOfPoints = (argc < 2) ? 25000000U : std::stoi(argv[1]);
        std::string selector = (argc < 3) ? "cpu" : argv[2];
        unsigned int nrOfParallelFFTs = (argc < 4) ? 4U : std::stoi(argv[3]);

        sycl::queue Q;
        if (selector == "cpu")
            Q = sycl::queue(sycl::cpu_selector{});
        else if (selector == "gpu")
            Q = sycl::queue(sycl::gpu_selector{});
        else if (selector == "host")
            Q = sycl::queue(sycl::host_selector{});
        else
        {
            std::cout << "Please use: " << argv[0] << " <nrOfPoints:25Mio> <selector cpu|gpu|host> <nrOfParallelFFTs:4>" << std::endl;
            return EXIT_FAILURE;
        }

        auto sycl_device = Q.get_device();
        auto sycl_context = Q.get_context();

        // Get more specific info about the device
        std::cout << "Running on: " << sycl_device.get_info<sycl::info::device::name>() << std::endl;
        std::cout << " Available number of parallel compute units (a.k.a. cores): " << sycl_device.get_info<sycl::info::device::max_compute_units>() << std::endl;

        // Use fixed seed in combination with random data
        std::uint32_t seed = 0;
        oneapi::mkl::rng::mcg31m1 pseudoRndGen(Q, seed); // Initialize the pseudo-random generator.
        // Uniform distribution only supports floats and doubles (e.g. not std::complex<float>)
        // Use reinterpret_cast<float*> to fill an array of complex<float> values
        oneapi::mkl::rng::uniform<float, oneapi::mkl::rng::uniform_method::standard> uniformDistribution(-1, 1);

        // Complex-valued IQ data.
        auto iqData = sycl::malloc_shared<std::complex<float>>(nrOfPoints, sycl_device, sycl_context);

        // Interpret as float values such that we can use the random generator
        oneapi::mkl::rng::generate(uniformDistribution, pseudoRndGen, 2 * nrOfPoints, reinterpret_cast<float*>(iqData)).wait();

        // Keeping track of 1st value to compensate for current scaling impact
        // when combining forward and backward FFT.
        auto iqData1 = iqData[0];

        std::cout << "iqData before in-place FFT: " << iqData[0] << " .. " << iqData[1] << " .. " << std::endl;

        FftDescriptor fftDescriptorIQ(nrOfPoints);
        // Don't forget to commit the FFT descriptor to the queue.
        fftDescriptorIQ.commit(Q);

        // steady_clock is monotonic, which makes it the safer choice for measuring intervals.
        auto startTime = std::chrono::steady_clock::now();
        oneapi::mkl::dft::compute_forward<FftDescriptor, std::complex<float>>(fftDescriptorIQ, iqData).wait();
        auto stopTime = std::chrono::steady_clock::now();

        std::cout << "iqData after in-place forward FFT: " << iqData[0] << " .. " << iqData[1] << " .. " << std::endl;
        std::cout << "Elapsed time (ms) for " << nrOfPoints << " points (complex, in-place): " << std::chrono::duration_cast<std::chrono::milliseconds>(stopTime - startTime).count() << std::endl;

        startTime = std::chrono::steady_clock::now();
        oneapi::mkl::dft::compute_backward<FftDescriptor, std::complex<float>>(fftDescriptorIQ, iqData).wait();
        stopTime = std::chrono::steady_clock::now();

        std::cout << "iqData after in-place backward FFT: " << iqData[0] << " .. " << iqData[1] << " .. " << std::endl;
        std::cout << "Elapsed time (ms) for " << nrOfPoints << " points (complex, in-place): " << std::chrono::duration_cast<std::chrono::milliseconds>(stopTime - startTime).count() << std::endl;

        // Workaround: derive the scale factor empirically from the first sample.
        // PROPER SOLUTION: make sure we perform proper scaling in forward and backward FFT
        auto scaleFactor = std::abs(iqData[0]) / std::abs(iqData1);
        std::cout << "iqData after in-place backward FFT and scaling: " << iqData[0] / scaleFactor << " .. " << iqData[1] / scaleFactor << " .. " << std::endl;

        sycl::free(iqData, sycl_context);

        // +++ PARALLEL FFTS +++
        // How to allocate an array of USM pointers? Safe to use new?
        std::complex<float>** iqDataMC = new std::complex<float>*[nrOfParallelFFTs];
        sycl::event* eventsMC = new sycl::event[nrOfParallelFFTs];

        std::cout << "Creating " << nrOfParallelFFTs << " sets of " << nrOfPoints << " random IQ data." << std::endl;

        // Can we do this too in parallel?
        for (unsigned int i = 0; i < nrOfParallelFFTs; i++)
        {
            iqDataMC[i] = sycl::malloc_shared<std::complex<float>>(nrOfPoints, sycl_device, sycl_context);
            oneapi::mkl::rng::generate(uniformDistribution, pseudoRndGen, 2 * nrOfPoints, reinterpret_cast<float*>(iqDataMC[i])).wait();
        }

        std::cout << "Done creating " << nrOfParallelFFTs << " sets of " << nrOfPoints << " random complex IQ data." << std::endl;
        std::cout << "Performing " << nrOfParallelFFTs << " forward complex single-precision FFTs." << std::endl;

        startTime = std::chrono::steady_clock::now();

        // Need for a parallel_for loop ?
        // Based on timing we clearly don't have any speed improvement ... time multiplied by # FFTS ...
        // Do we need to commit different FFT descriptors to different queues?
        // Given the fact that a queue is linked to a device, one would expect this device to properly deal
        // with parallelism ...
        for (unsigned int i = 0; i < nrOfParallelFFTs; i++)
            eventsMC[i] = oneapi::mkl::dft::compute_forward<FftDescriptor, std::complex<float>>(fftDescriptorIQ, iqDataMC[i]);

        // Wait for all events to indicate the calculation is done ...
        for (unsigned int i = 0; i < nrOfParallelFFTs; i++)
            eventsMC[i].wait();

        stopTime = std::chrono::steady_clock::now();

        std::cout << "Done performing " << nrOfParallelFFTs << " forward complex single-precision FFTs." << std::endl;
        std::cout << "Elapsed time (ms) for " << nrOfParallelFFTs << " forward complex single-precision FFTs: " << std::chrono::duration_cast<std::chrono::milliseconds>(stopTime - startTime).count() << std::endl;

        for (unsigned int i = 0; i < nrOfParallelFFTs; i++)
            sycl::free(iqDataMC[i], sycl_context);

        // Release the host-side bookkeeping arrays allocated with new[].
        delete[] eventsMC;
        delete[] iqDataMC;

        return EXIT_SUCCESS;
    }
    catch (const sycl::exception& e)
    {
        std::cout << "SYCL exception: " << e.what() << std::endl;
        return EXIT_FAILURE;
    }
    catch (const std::exception& e)
    {
        // Non-SYCL failures, e.g. std::stoi rejecting a command-line argument.
        std::cout << "Exception: " << e.what() << std::endl;
        return EXIT_FAILURE;
    }
}
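
A side note on the scaling comment in the listing: an unnormalized forward transform followed by an unnormalized backward transform returns every sample multiplied by the number of points N, which is why the listing divides by an empirically derived factor. The sketch below illustrates this with a naive O(N^2) DFT in plain standard C++. It does not use oneMKL, and naive_dft and round_trip are hypothetical helpers written only for this illustration. With oneMKL itself, the usual route would be to configure a backward scale of 1/N on the descriptor before committing it; please check the DFT configuration parameters in the oneMKL documentation for the exact usage.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Naive O(N^2) DFT, for illustration only (hypothetical helper, not a oneMKL call).
// sign = -1 gives the forward transform, sign = +1 the unnormalized backward transform.
std::vector<std::complex<double>> naive_dft(const std::vector<std::complex<double>>& in, int sign)
{
    const std::size_t n = in.size();
    const double pi = std::acos(-1.0);
    std::vector<std::complex<double>> out(n);
    for (std::size_t k = 0; k < n; ++k)
    {
        std::complex<double> sum(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j)
        {
            const double angle = sign * 2.0 * pi * static_cast<double>(j * k) / static_cast<double>(n);
            sum += in[j] * std::polar(1.0, angle);
        }
        out[k] = sum;
    }
    return out;
}

// Forward followed by unnormalized backward: every element comes back multiplied by N.
std::vector<std::complex<double>> round_trip(const std::vector<std::complex<double>>& in)
{
    return naive_dft(naive_dft(in, -1), +1);
}
```

For a 4-point input x, round_trip(x) returns 4 * x up to rounding error, so dividing by N (or scaling the backward transform by 1/N) recovers the original data.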

VidyalathaB_Intel · ‎12-27-2021

Hi,

Thanks for reaching out to us.

Could you please try running your code from the Intel oneAPI command prompt, following the steps below?

> set MKL_DOMAIN_NUM_THREADS="MKL_DOMAIN_ALL=1, MKL_DOMAIN_FFT=4"

> set KMP_AFFINITY=granularity=fine,compact,1,0

(Please refer to the following links for more details on improving performance with threading:

https://www.intel.com/content/www/us/en/develop/documentation/onemkl-windows-developer-guide/top/managing-performance-and-memory/improving-performance-with-threading/using-additional-threading-control/mkl-domain-num-threads.html

https://www.intel.com/content/www/us/en/develop/documentation/onemkl-windows-developer-guide/top/managing-performance-and-memory/improving-performance-with-threading/using-intel-hyper-threading-technology.html)

> dpcpp -DMKL_ILP64 -I"%MKLROOT%\include" file.cpp -c /EHsc

> dpcpp file.obj -fsycl-device-code-split=per_kernel mkl_sycl.lib mkl_intel_ilp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib sycl.lib OpenCL.lib

> file.exe 12500000 cpu 10

When we tried the above steps on our end, the program took less time than in your run. Please refer to the screenshot for the output.

>>At this moment it looks like I'm only using a single core given the fact that .....

We are working on this issue and will get back to you soon.

Regards,

Vidya.

Frans1 · ‎12-27-2021

Thanks Vidya.

Your compile/link settings speed up the actual execution as you can see below.

Can you please explain why these settings don't show up when using https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl-link-line-advisor.html ?

Calculating 10 FFTs still takes roughly 10 times as long as calculating a single FFT. Hence it looks like I'm still only using a single core.
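
A rough way to put a number on this observation is the ratio between the time N back-to-back FFTs would need and the time actually measured for N FFTs. The following is a minimal sketch in plain C++; effective_concurrency is a hypothetical helper name, and the numbers in the note after the code are made up for illustration. A value near 1 only says that the FFTs did not overlap in time. It does not by itself say how many threads each individual FFT used.

```cpp
#include <stdexcept>

// Ratio of "N FFTs executed strictly one after another" to the measured batch time.
// ~1.0 : the N FFTs ran back to back (no overlap between FFTs)
// ~N   : the N FFTs overlapped completely
double effective_concurrency(double singleFftMs, double batchMs, unsigned int nrOfFFTs)
{
    if (singleFftMs <= 0.0 || batchMs <= 0.0 || nrOfFFTs == 0U)
        throw std::invalid_argument("timings and FFT count must be positive");
    return (singleFftMs * nrOfFFTs) / batchMs;
}
```

With a hypothetical single-FFT time of 100 ms and 1000 ms for 10 FFTs, effective_concurrency(100.0, 1000.0, 10) returns 1.0.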

Also, I question the relevance of
> set MKL_DOMAIN_NUM_THREADS="MKL_DOMAIN_ALL=1, MKL_DOMAIN_FFT=4"
because it is linked to OpenMP, while the documentation suggests OpenMP is only used in case of the C/Fortran API.

As such I repeated your proposal without setting these variables and I obtain the same performance.

Something else that bothers me is that the main speed improvement is due to the static linking instead of dynamic linking.
If I link dynamically using
> dpcpp HelloMKLwithDPCPP.obj -fsycl-device-code-split=per_kernel mkl_sycl_dll.lib mkl_intel_ilp64_dll.lib mkl_intel_thread_dll.lib mkl_core_dll.lib libiomp5md.lib sycl.lib OpenCL.lib
the speed is roughly back to what it was (about 8 times slower).

Also, replacing mkl_intel_thread.lib with mkl_tbb_thread.lib (static linking) results in a similar negative speed impact (factor 8).

Can you please explain the observed speed impact (dynamic versus static and mkl_tbb_thread versus mkl_intel_thread) ?

Looking forward to your feedback.

Thanks and regards,

Frans

VidyalathaB_Intel · ‎12-30-2021

Hi,

>>without setting these variables and I obtain the same performance.

Could you please share the output obtained with verbose mode enabled (run set MKL_VERBOSE=1 before running your code), without setting the environment variables mentioned above?

>>Can you please explain why these settings don't show up when using

These settings show up if you select DPC++ instead of C++ as the programming language; the tool then also offers the option of linking against TBB.

When linked against mkl_intel_thread, the code runs faster than with TBB threading.

Regards,

Vidya.

Frans1 · ‎12-31-2021

Hey Vidya,

Here's the verbose output when not setting the variables (before and after set MKL_VERBOSE=1)

Here's the verbose output when setting the variables (before and after set MKL_VERBOSE=1)

Here's what I get when I use the linker advisor tool

I also ran vtune (MKL_VERBOSE disabled and without setting the variables)
> vtune -collect threading HelloMKLwithDPCPP 12500000 cpu 10

Here's part of the report

Here's another portion of the output which I don't understand at this moment

Open questions:

With respect to the link advisor tool: I don't see libiomp5md.lib showing up while it shows up at the top function with spin and overhead time. Why?
Why doesn't the link advisor tool indicate that I should use mkl_intel_thread which is about 8 times faster?
Where can I find information explaining that I should use mkl_intel_thread instead of TBB? Speed is key to our application.
Why is the dynamically linked executable 8 times slower than the statically linked one (both using mkl_intel_thread)?
Based on the MKL verbose output, it looks like I'm using 8 threads. Correct?
Where can I find information to interpret the MKL verbose output?
Running VTune using the oneAPI Tools Command Window gives a different result and reports thread oversubscription. How can I use the MKL verbose output and the VTune report to speed up the application? What are the key take-aways from both reports?

Thanks and best wishes for 2022,

Frans

VidyalathaB_Intel · ‎01-03-2022

Hi Frans,

>>With respect to the link advisor tool: I don't see libiomp5md.lib showing up while it shows up at the top function with spin and overhead time. Why?

A possible reason is that mkl_intel_thread is included in your link command, so libiomp5md.lib shows up in the VTune analysis output.

If you link against TBB (as the Link Line Advisor suggests), you will get output similar to the screenshot below (no occurrence of libiomp5md.lib).

>>Based on the MKL verbose output, it looks like I'm using 8 threads. Correct?

Yes, you are correct. It is utilizing 8 threads.

>>Where can I find information to interpret the MKL verbose output?

Please refer to the link below, which describes the information contained in a call description line in verbose mode; it may help you interpret the MKL_VERBOSE output.

https://www.intel.com/content/www/us/en/develop/documentation/onemkl-windows-developer-guide/top/managing-output/using-onemkl-verbose-mode/call-description-line.html

We are working on your issue internally and will address your remaining questions shortly.

Regards,

Vidya.

Gennady_F_Intel · ‎01-20-2022

Frans,

>> Why is the dynamically linked executable 8 times slower than the statically linked one (both using mkl_intel_thread)?

The performance should be the same regardless of whether static or dynamic linking is used.

We have to double-check this case.

What is the problem size?

Did you run the code on the Core(TM) i9-9880H CPU?

Did you measure the compute_forward/backward part only?
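
To isolate the compute part when measuring, one option is a small timing helper along the following lines. This is a minimal sketch in plain C++, assuming that the first call may carry one-time setup cost and is therefore run once untimed before the measured repetitions. median_ms is a hypothetical helper name; it uses std::chrono::steady_clock, which is monotonic and so better suited to intervals than system_clock. The callable passed in would wrap just the compute_forward(...).wait() call.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Runs `work` once untimed (warm-up), then `repeats` timed runs, and returns the median in ms.
template <typename Work>
double median_ms(Work&& work, std::size_t repeats)
{
    if (repeats == 0)
        throw std::invalid_argument("repeats must be at least 1");

    work(); // warm-up run, excluded from the measurement

    std::vector<double> samples;
    samples.reserve(repeats);
    for (std::size_t r = 0; r < repeats; ++r)
    {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        samples.push_back(std::chrono::duration<double, std::milli>(stop - start).count());
    }

    std::sort(samples.begin(), samples.end());
    const std::size_t mid = samples.size() / 2;
    // Even number of samples: average the two middle values.
    return (samples.size() % 2 == 1) ? samples[mid] : 0.5 * (samples[mid - 1] + samples[mid]);
}
```

The median is used here rather than the mean so that a single slow outlier run does not dominate the reported figure.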

Gennady_F_Intel · ‎01-24-2022

>> Why is the dynamically linked executable 8 times slower than the statically linked one (both using mkl_intel_thread)?

We checked the problem on our end and see similar performance for static and dynamic linking.

Here are the verbose outputs (MKL 2021 Update 4, Linux* OS):

Running on: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
Available number of parallel compute units (a.k.a. cores): 160
MKL_VERBOSE oneMKL 2021.0 Update 4 Product build 20210904 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Intel(R) Deep Learning Boost (Intel(R) DL Boost), EVEX-encoded AES and Carry-Less Multiplication Quadword instructions, Lnx 2.30GHz tbb_thread

static: (MKL_VERBOSE=1 ./_stat.x 12500000 cpu 10):
MKL_VERBOSE FFT(scfi12500000,tLim:80,desc:0x6a9fb00) 114.27ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE FFT(scfi12500000,tLim:80,desc:0x6a9fb00) 122.21ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE FFT(scfi12500000,tLim:80,desc:0x6a9fb00) 113.64ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE FFT(scfi12500000,tLim:80,desc:0x6a9fb00) 125.89ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE FFT(scfi12500000,tLim:80,desc:0x6a9fb00) 115.52ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE FFT(scfi12500000,tLim:80,desc:0x6a9fb00) 116.16ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE FFT(scfi12500000,tLim:80,desc:0x6a9fb00) 104.32ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE FFT(scfi12500000,tLim:80,desc:0x6a9fb00) 103.92ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE FFT(scfi12500000,tLim:80,desc:0x6a9fb00) 104.61ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE FFT(scfi12500000,tLim:80,desc:0x6a9fb00) 106.50ms CNR:OFF Dyn:1 FastMM:1
Done performing 10 forward complex single-precision FFTs.
Elapsed time (ms) for 10 forward complex single-precision FFTs: 1130

dynamic: (MKL_VERBOSE=1 ./_dyn.x 12500000 cpu 10 )
Performing 10 forward complex single-precision FFTs.
MKL_VERBOSE FFT(scfi12500000,tLim:80,desc:0x2d7bbc0) 105.54ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE FFT(scfi12500000,tLim:80,desc:0x2d7bbc0) 104.11ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE FFT(scfi12500000,tLim:80,desc:0x2d7bbc0) 106.27ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE FFT(scfi12500000,tLim:80,desc:0x2d7bbc0) 106.64ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE FFT(scfi12500000,tLim:80,desc:0x2d7bbc0) 105.82ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE FFT(scfi12500000,tLim:80,desc:0x2d7bbc0) 104.91ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE FFT(scfi12500000,tLim:80,desc:0x2d7bbc0) 96.06ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE FFT(scfi12500000,tLim:80,desc:0x2d7bbc0) 104.94ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE FFT(scfi12500000,tLim:80,desc:0x2d7bbc0) 107.61ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE FFT(scfi12500000,tLim:80,desc:0x2d7bbc0) 106.05ms CNR:OFF Dyn:1 FastMM:1
Elapsed time (ms) for 10 forward complex single-precision FFTs: 1051

Gennady_F_Intel · ‎01-26-2022

The issue could not be reproduced, so we are closing it. We will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.

Frans1 · ‎01-31-2022

This is an "easy" way to close an issue ... I've been out of the loop for 2 weeks due to an emergency situation at home (in fact my father-in-law passed away on January 20th, with the funeral on January 29th, so forgive me for only looking into this now).
You tried this on a different OS and a different CPU and concluded it should also work in my situation, which is clearly not the case.

I will re-post this based on our priority support.

Regards,
Frans