HelloMKLwithDPCPP - different behaviour selecting CPU and GPU - feature or bug?

Frans1 · 12-22-2021

Hi there,

It looks like I hit a rather interesting feature/bug using oneMKL and DPC++.

I created a small test program to get an initial feel for oneMKL in combination with DPC++ (as opposed to using ISO C++, which IMHO imposes the use of the C API).

The program calculates a single-precision FFT of a random sequence of real values of a specified length, running on either the GPU (Intel UHD Graphics 630) or the CPU (Intel Core i9-9880H CPU @ 2.30 GHz) of my Dell Precision 7540.

I'm also roughly measuring how long the FFT takes. For now the FFT is out-of-place so that I can keep both the time and the frequency data.
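
For this kind of rough timing, a minimal sketch of a small helper built on std::chrono::steady_clock, which is monotonic and therefore usually a better fit for measuring intervals than std::chrono::system_clock. The name elapsedMilliseconds is hypothetical and not part of oneMKL or SYCL, and the sketch assumes the work passed in blocks until it has finished (for example by calling wait() on the returned event).

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: runs `work` once and returns the elapsed wall time in ms.
// Assumes `work` only returns once the computation is complete.
template <typename Work>
double elapsedMilliseconds(Work&& work)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();

    // steady_clock never goes backwards, so the interval cannot be negative.
    assert(stop >= start);

    // Fractional milliseconds, so short runs do not round down to 0.
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

The FFT call, including its wait(), could then be wrapped in a lambda and passed to this helper.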

With 1k points, I get identical results for GPU and CPU: the FFT output shows up in the freqData variable, while timeData still holds the (random) input data.

As soon as I use 10k points (or the 25 million points required in an upcoming application), everything still works as expected on the GPU. On the CPU, however, I obtain the same values, but the FFT output shows up in timeData (overwriting the FFT input), implying the FFT suddenly acts as if in-place instead of out-of-place.

Here's the output; HelloMKLwithDPCPP accepts 2 arguments, i.e. the number of points and gpu|cpu.

I'm using oneAPI Base Toolkit, v.2022 (downloaded last Friday, Dec 17th) in combination with Visual Studio 2017.

I tried to attach the simple source code as a file, but for some weird reason I get the following error (?!)

So I had to copy-paste the source code below.

Can you please confirm this is a bug?

Also, is my statement that sticking to ISO C++ imposes the use of the oneMKL C API correct?

Thanks and regards,

Frans

-----------------------------------------

#include <mkl.h>
#include <CL/sycl.hpp>
#include <cstdlib>
#include <iostream>
#include <string>
#include <oneapi/mkl/dfti.hpp>
#include <oneapi/mkl/rng.hpp>
#include <complex>
#include <chrono>

using namespace oneapi::mkl::dft;

int main(int argc, char** argv)
{
    try
    {
        // Probably not 100% idiot-proof ... using 25 Mio points on CPU by default
        unsigned int nrOfPoints = (argc < 2) ? 25000000U : std::stoi(argv[1]);
        std::string selector = (argc < 3) ? "cpu" : argv[2];

        sycl::queue Q;
        if (selector == "cpu")
            Q = sycl::queue(sycl::cpu_selector{});
        else if (selector == "gpu")
            Q = sycl::queue(sycl::gpu_selector{});
        else
        {
            std::cout << "Please use: " << argv[0] << " <nrOfPoints (default 25Mio)> <selector cpu|gpu>" << std::endl;
            return EXIT_FAILURE;
        }

        std::cout << "Running on: " << Q.get_device().get_info<sycl::info::device::name>() << "\n";

        auto sycl_device = Q.get_device();
        auto sycl_context = Q.get_context();

        // For the time being not yet trying complex-valued IQ data due to missing random generation of complex values.
        auto timeData = sycl::malloc_shared<float>(nrOfPoints, sycl_device, sycl_context);
        // Initially not in-place ... later in-place in an attempt to speed up things
        auto freqData = sycl::malloc_shared<float>(nrOfPoints, sycl_device, sycl_context);

        // Use fixed seed in combination with random data
        std::uint32_t seed = 0;
        oneapi::mkl::rng::mcg31m1 pseudoRndGen(Q, seed); // Initialize the pseudo-random generator.
        // Uniform distribution only supports floats and doubles (e.g. not std::complex<float>)
        oneapi::mkl::rng::uniform<float, oneapi::mkl::rng::uniform_method::standard> uniformDistribution(-1, 1);

        oneapi::mkl::rng::generate(uniformDistribution, pseudoRndGen, nrOfPoints, timeData).wait();

        oneapi::mkl::dft::descriptor<oneapi::mkl::dft::precision::SINGLE, oneapi::mkl::dft::domain::REAL> fftDescriptor(nrOfPoints);
        // Don't forget to commit the FFT descriptor to the queue.
        fftDescriptor.commit(Q);

        // Calculate the forward FFT and wait until done before printing the first and last value.
        // Apparently no support to have N floats as input and N/2 + 1 complex<float>s as output.
        // Not sure how the complex FFT values are stored ... expect real|imag|real|imag|...

        auto startTime = std::chrono::system_clock::now();
        oneapi::mkl::dft::compute_forward<oneapi::mkl::dft::descriptor<oneapi::mkl::dft::precision::SINGLE, oneapi::mkl::dft::domain::REAL>, float, float>(fftDescriptor, timeData, freqData).wait();
        auto stopTime = std::chrono::system_clock::now();

        // +++ BUG ALERT +++
        // When using the CPU the freq data end up in the time data (?!) starting from 10k points while not the case for GPU.
        // +++ BUG ALERT +++

        std::cout << "time data: " << timeData[0] << " .. " << timeData[1] << " .. " << timeData[2] << " .. " << timeData[3] << " .. " << std::endl;
        std::cout << "freq data: " << freqData[0] << " .. " << freqData[1] << "j .. " << freqData[2] << " .. " << freqData[3] << "j .. " << std::endl;
        std::cout << "Elapsed time (ms) for " << nrOfPoints << " points: " << std::chrono::duration_cast<std::chrono::milliseconds>(stopTime - startTime).count() << std::endl;

        return EXIT_SUCCESS;
    }
    catch (const sycl::exception& e)
    {
        std::cout << "SYCL exception: " << e.what() << std::endl;
        // Report the failure to the caller instead of falling through to a success exit code.
        return EXIT_FAILURE;
    }
}
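
The first comment in the program admits that the argument handling is not fully robust: std::stoi throws std::invalid_argument or std::out_of_range on bad input, and neither is caught by the sycl::exception handler. A minimal sketch of a stricter parser in plain C++ follows; parsePointCount is a hypothetical helper name, and rejecting zero is an assumption about what the program should accept.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <limits>
#include <string>

// Hypothetical helper: parses a strictly positive point count.
// Returns true and stores the value in `count` on success, false otherwise.
bool parsePointCount(const std::string& text, std::size_t& count)
{
    // std::stoull accepts leading whitespace and a sign (a minus wraps around),
    // so require the text to start with a digit.
    if (text.empty() || text[0] < '0' || text[0] > '9')
        return false;

    try
    {
        std::size_t consumed = 0;
        const unsigned long long value = std::stoull(text, &consumed);

        // Reject trailing garbage ("12abc"), zero, and values that do not fit.
        if (consumed != text.size() || value == 0 ||
            value > std::numeric_limits<std::size_t>::max())
            return false;

        count = static_cast<std::size_t>(value);
        assert(count > 0); // callers may allocate `count` elements
        return true;
    }
    catch (const std::exception&) // std::invalid_argument or std::out_of_range
    {
        return false;
    }
}
```

In main, a false return could print the same usage message as an unknown selector and exit with EXIT_FAILURE.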

VarshaS_Intel · 12-23-2021

Hi,

Thanks for reaching out to us.

We are able to reproduce your issue. We are working on it internally and will get back to you soon.

Thanks & Regards,

Varsha

Gennady_F_Intel · 01-20-2022

This is a specific of our implementation, not a bug. In this case, if the user wants the output results in the freqData array, they have to explicitly set the DFTI_NOT_INPLACE mode.

For example, before committing the descriptor: desc.set_value(oneapi::mkl::dft::config_param::PLACEMENT, DFTI_NOT_INPLACE);
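
On the question raised in the program's comments about how the complex FFT values are stored: a minimal sketch, in plain C++ without oneMKL, of a forward real transform that keeps only the N/2 + 1 non-redundant complex values, interleaved as real|imag|real|imag. Whether oneMKL uses exactly this layout for a given descriptor configuration is an assumption here and should be checked against the oneMKL documentation; packedOutputFloats and naiveRealDft are hypothetical names, and the O(N^2) loop is only a reference for small N, not a replacement for the library call.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Floats needed to hold n/2 + 1 complex values stored as (re, im) pairs.
std::size_t packedOutputFloats(std::size_t n)
{
    return 2 * (n / 2 + 1);
}

// Naive O(n^2) forward DFT of real input. Only bins 0 .. n/2 are kept,
// because the remaining bins are complex conjugates of these.
std::vector<float> naiveRealDft(const std::vector<float>& timeData)
{
    assert(!timeData.empty());
    const std::size_t n = timeData.size();
    const double twoPi = 2.0 * std::acos(-1.0);

    std::vector<float> freqData(packedOutputFloats(n));
    for (std::size_t k = 0; k <= n / 2; ++k)
    {
        std::complex<double> sum(0.0, 0.0);
        for (std::size_t t = 0; t < n; ++t)
        {
            const double angle = -twoPi * static_cast<double>(k * t) / static_cast<double>(n);
            sum += static_cast<double>(timeData[t]) *
                   std::complex<double>(std::cos(angle), std::sin(angle));
        }
        freqData[2 * k] = static_cast<float>(sum.real());     // real part of bin k
        freqData[2 * k + 1] = static_cast<float>(sum.imag()); // imaginary part of bin k
    }
    return freqData;
}
```

Under this layout an N-point transform with even N produces N + 2 floats of output, which is more than the nrOfPoints floats allocated for freqData in the program above, so that allocation may need to grow if oneMKL stores its result this way.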

The thread is closing and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.