Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

HelloMKLwithDPCPP - different behaviour selecting CPU and GPU - feature or bug ?

Frans1

Hi there,

It looks like I've hit a rather interesting feature/bug using oneMKL and DPC++.

I created a small test program to get an initial feel for oneMKL in combination with DPC++ (as opposed to using ISO C++, which IMHO imposes the use of the C API).

The program computes a single-precision FFT of a random sequence of real values of a specified length, running on either the GPU (Intel UHD Graphics 630) or the CPU (Intel Core i9-9880H CPU @ 2.30 GHz) of my Dell Precision 7540.

I'm also roughly timing the FFT, which for now is computed out-of-place so that both the time and frequency data remain available.
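For the rough timing mentioned above, a monotonic clock is usually a safer choice than std::chrono::system_clock, which can jump if the system time is adjusted. A minimal sketch, where time_ms is a hypothetical helper name and not part of oneMKL or SYCL:

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper (not part of oneMKL or SYCL): runs `work` once and
// returns the elapsed wall-clock time in milliseconds, measured with the
// monotonic std::chrono::steady_clock.
template <typename Work>
double time_ms(Work&& work)
{
    const auto start = std::chrono::steady_clock::now();
    std::forward<Work>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

In the test program this would wrap the compute_forward(...).wait() call in a lambda. The first call on a device can include one-time setup costs, so timing a second run may be more representative.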

When specifying 1k points, I get identical results for GPU and CPU. The FFT output data shows up in the freqData variable, while timeData still holds the (random) input data.

As soon as I use 10k points (or 25 million points, as required in an upcoming application), everything works as expected on the GPU. However, when using the CPU, I obtain identical results but the FFT output data shows up in timeData (overwriting the FFT input data), implying that the FFT suddenly acts as if it were in-place instead of out-of-place.

Here's the output. HelloMKLwithDPCPP accepts two arguments: the number of points and gpu|cpu.

[Image: Frans1_0-1640193652976.png, program output for the CPU and GPU runs]
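The two command-line arguments can be validated before any device work starts. The following is only a sketch: parse_point_count and is_valid_selector are hypothetical helper names introduced for illustration, and the accepted selector strings are taken from the usage described above.

```cpp
#include <cstddef>
#include <limits>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical helper: parses a strictly positive point count.
// Returns std::nullopt for empty, non-numeric, negative, zero, or
// out-of-range input.
std::optional<std::size_t> parse_point_count(const std::string& text)
{
    // Digits only: rejects signs, spaces, and suffixes such as "10k".
    if (text.empty() || text.find_first_not_of("0123456789") != std::string::npos)
        return std::nullopt;
    try
    {
        const unsigned long long value = std::stoull(text);
        if (value == 0 || value > std::numeric_limits<std::size_t>::max())
            return std::nullopt;
        return static_cast<std::size_t>(value);
    }
    catch (const std::out_of_range&)
    {
        return std::nullopt; // more digits than unsigned long long can hold
    }
}

// Hypothetical helper: accepts exactly the two selector strings the program uses.
bool is_valid_selector(const std::string& selector)
{
    return selector == "cpu" || selector == "gpu";
}
```

Returning std::optional keeps the "print usage and exit" decision in main, where the program already handles an unknown selector.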

I'm using oneAPI Base Toolkit, v.2022 (downloaded last Friday, Dec 17th) in combination with Visual Studio 2017.

I tried to attach the source code as a file, but for some reason I get the following error:

[Image: Frans1_0-1640194344450.png, error message shown when attaching the file]

So I had to copy and paste the source code below.

Can you please confirm this is a bug?

Also, is my statement that sticking to ISO C++ imposes the use of the oneMKL C API correct?

Thanks and regards,

Frans

-----------------------------------------

#include <mkl.h>
#include <CL/sycl.hpp>
#include <iostream>
#include <string>
#include <oneapi/mkl/dfti.hpp>
#include <oneapi/mkl/rng.hpp>
#include <complex>
#include <chrono>
#include <cstdint> // std::uint32_t
#include <cstdlib> // EXIT_SUCCESS, EXIT_FAILURE

using namespace oneapi::mkl::dft;

int main(int argc, char** argv)
{
    try
    {
        // Probably not 100% idiot-proof ... using 25 million points on CPU by default
        unsigned int nrOfPoints = (argc < 2) ? 25000000U : std::stoi(argv[1]);
        std::string selector = (argc < 3) ? "cpu" : argv[2];

        sycl::queue Q;
        if (selector == "cpu")
            Q = sycl::queue(sycl::cpu_selector{});
        else if (selector == "gpu")
            Q = sycl::queue(sycl::gpu_selector{});
        else
        {
            std::cout << "Please use: " << argv[0] << " <nrOfPoints (default 25 million)> <selector cpu|gpu>" << std::endl;
            return EXIT_FAILURE;
        }

        std::cout << "Running on: " << Q.get_device().get_info<sycl::info::device::name>() << "\n";

        auto sycl_device = Q.get_device();
        auto sycl_context = Q.get_context();

        // For the time being not yet trying complex-valued IQ data due to missing random generation of complex values.
        auto timeData = sycl::malloc_shared<float>(nrOfPoints, sycl_device, sycl_context);
        // Initially not in-place ... later in-place in an attempt to speed up things
        auto freqData = sycl::malloc_shared<float>(nrOfPoints, sycl_device, sycl_context);

        // Use fixed seed in combination with random data
        std::uint32_t seed = 0;
        oneapi::mkl::rng::mcg31m1 pseudoRndGen(Q, seed); // Initialize the pseudo-random generator.
        // Uniform distribution only supports floats and doubles (e.g. not std::complex<float>)
        oneapi::mkl::rng::uniform<float, oneapi::mkl::rng::uniform_method::standard> uniformDistribution(-1, 1);

        oneapi::mkl::rng::generate(uniformDistribution, pseudoRndGen, nrOfPoints, timeData).wait();

        oneapi::mkl::dft::descriptor<oneapi::mkl::dft::precision::SINGLE, oneapi::mkl::dft::domain::REAL> fftDescriptor(nrOfPoints);
        // Don't forget to commit the FFT descriptor to the queue.
        fftDescriptor.commit(Q);

        // Calculate the forward FFT and wait until done before printing the first few values.
        // Apparently no support to have N floats as input and N/2 + 1 complex<float>s as output.
        // Not sure how the complex FFT values are stored ... expect real|imag|real|imag|...

        auto startTime = std::chrono::system_clock::now();
        oneapi::mkl::dft::compute_forward<oneapi::mkl::dft::descriptor<oneapi::mkl::dft::precision::SINGLE, oneapi::mkl::dft::domain::REAL>, float, float>(fftDescriptor, timeData, freqData).wait();
        auto stopTime = std::chrono::system_clock::now();

        // +++ BUG ALERT +++
        // When using the CPU the freq data end up in the time data (?!) starting from 10k points while not the case for GPU.
        // +++ BUG ALERT +++

        std::cout << "time data: " << timeData[0] << " .. " << timeData[1] << " .. " << timeData[2] << " .. " << timeData[3] << " .. " << std::endl;
        std::cout << "freq data: " << freqData[0] << " .. " << freqData[1] << "j .. " << freqData[2] << " .. " << freqData[3] << "j .. " << std::endl;
        std::cout << "Elapsed time (ms) for " << nrOfPoints << " points: " << std::chrono::duration_cast<std::chrono::milliseconds>(stopTime - startTime).count() << std::endl;

        return EXIT_SUCCESS;
    }
    catch (const sycl::exception& e)
    {
        std::cout << "SYCL exception: " << e.what() << std::endl;
        return EXIT_FAILURE; // report the failure instead of falling through to an implicit 0
    }
}
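The comments in the listing ask how the complex FFT values of a real input are stored. The sketch below assumes the conjugate-even interleaved layout (real|imag|real|imag|...) that real-to-complex FFT libraries commonly use, holding only the N/2 + 1 non-redundant outputs. Whether oneMKL uses exactly this layout for a given descriptor configuration should be checked against its documentation. The function names are hypothetical, and the O(N^2) loop is a plain reference DFT, not how oneMKL computes anything.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Floats needed for the N/2 + 1 complex outputs of a length-N real FFT,
// stored interleaved as re|im pairs (assumed layout, see the text above).
std::size_t cce_float_count(std::size_t n)
{
    return 2 * (n / 2 + 1);
}

// Reference O(N^2) forward DFT of real input; writes X[0..N/2] interleaved.
std::vector<float> naive_real_dft(const std::vector<float>& x)
{
    const std::size_t n = x.size();
    assert(n > 0);
    const double pi = std::acos(-1.0);
    std::vector<float> out(cce_float_count(n));
    for (std::size_t k = 0; k <= n / 2; ++k)
    {
        std::complex<double> sum(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j)
        {
            // Forward transform uses the exp(-2*pi*i*j*k/N) convention.
            const double angle = -2.0 * pi * static_cast<double>(k) * static_cast<double>(j) / static_cast<double>(n);
            sum += static_cast<double>(x[j]) * std::complex<double>(std::cos(angle), std::sin(angle));
        }
        out[2 * k] = static_cast<float>(sum.real());     // real part of X[k]
        out[2 * k + 1] = static_cast<float>(sum.imag()); // imaginary part of X[k]
    }
    return out;
}
```

Under this assumed layout the output of an N-point real transform occupies 2 * (N/2 + 1) floats, which is N + 2 for even N, so a frequency buffer sized for exactly N floats would be two floats short.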

2 Replies
VarshaS_Intel
Moderator

Hi,


Thanks for reaching out to us.


We are able to reproduce your issue. We are working on it internally and will get back to you soon.


Thanks & Regards,

Varsha


Gennady_F_Intel
Moderator

This behavior comes from our specific implementation; it is not a bug. The default placement is in-place, so a user who wants the results written to the freqData array has to explicitly set the DFTI_NOT_INPLACE mode on the descriptor before committing it.

For example, it could look as follows: desc.set_value(oneapi::mkl::dft::config_param::PLACEMENT, DFTI_NOT_INPLACE);

 

The thread is closing and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.

 

