Intel® oneAPI DPC++/C++ Compiler

Parallel result is different from serial result

Siqiao_Fu
Beginner
#include <CL/sycl.hpp>
#include <iostream>
#include <ctime>

using namespace std;
using namespace cl::sycl;

constexpr long num_steps = 10000000;

double without_oneapi() {
    double step = 1.0 / (double)num_steps;
    double x = 0.0;
    double sum = 0.0;

    clock_t start = clock();
    for (int i = 0; i < num_steps; i++) {
        x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    clock_t end = clock();
    std::cout << "Without oneapi cost " << (double)(end - start) / CLOCKS_PER_SEC << " second" << std::endl;

    double res = step * sum;
    return res;
}

double with_oneapi_buffer() {
    queue q;
    //std::cout << "Device: " << q.get_device().get_info<info::device::name>() << std::endl;

    double step = 1.0 / (double)num_steps;
    double data[2] = {step, 0.0};

    buffer buf(data, range<1>(2));

    clock_t start = clock();
    q.submit([&](handler& h) {
        accessor a(buf, h);
        h.parallel_for(range<1>(num_steps), [=](auto i) {
            double temp = ((double)i + 0.5) * a[0];
            a[1] += 4.0 / (1.0 + temp * temp);    // all work-items update a[1] concurrently
            });
        }).wait();
    clock_t end = clock();
    std::cout << "With oneapi buffer cost " << (double)(end - start) / CLOCKS_PER_SEC << " second" << std::endl;

    double res = step * data[1];
    return res;
}

double with_oneapi_usm() {
    queue q;
    //std::cout << "Device: " << q.get_device().get_info<info::device::name>() << std::endl;

    double step = 1.0 / (double)num_steps;
    double* data = malloc_shared<double>(2, q);
    data[0] = step;
    data[1] = 0.0;

    clock_t start = clock();
    q.parallel_for(range<1>(num_steps), [=](id<1> i){
        double temp = ((double)i + 0.5) * data[0];
        data[1] += 4.0 / (1.0 + temp * temp);    // all work-items update data[1] concurrently
    }).wait();
    clock_t end = clock();
    std::cout << "With oneapi usm cost " << (double)(end - start) / CLOCKS_PER_SEC << " second" << std::endl;

    double res = step * data[1];
    free(data, q);
    return res;
}


int main() {

    double PI1 = without_oneapi();
    double PI2 = with_oneapi_buffer();
    double PI3 = with_oneapi_usm();
    
    std::cout << "Without oneapi result:" << PI1 << std::endl;
    std::cout << "With oneapi buffer result:" << PI2 << std::endl;
    std::cout << "With oneapi usm result:" << PI3 << std::endl;

    return 0;
}

I tried this simple test and the result is strange. When num_steps is larger than 10000, both the buffer result and the USM result come out smaller than they should be, which is PI.

 

Why does this happen? Am I using oneAPI correctly?

3 Replies
RahulV_intel
Moderator (accepted solution)

Hi,


The source code you provided is not thread-safe.


Specifically, this line of code:

a[1] += 4.0 / (1.0 + temp * temp);


The DPC++ runtime launches as many threads as specified in the parallel_for range. In your case, it launches 10000000 threads (the value of num_steps). Since all of these threads perform an unsynchronized read-modify-write on the same memory location (a[1]), updates from different threads overwrite one another and contributions are lost, which is why the computed value comes out smaller than PI.
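
As an aside, the single update itself can be made safe with an atomic add. Here is a minimal sketch of that idea, assuming SYCL 2020's sycl::atomic_ref is available and the device supports 64-bit floating-point atomics; it replaces the parallel_for in your with_oneapi_usm function and reuses q, data and num_steps:

q.parallel_for(range<1>(num_steps), [=](id<1> i) {
    double temp = (i[0] + 0.5) * data[0];
    // atomic_ref makes the read-modify-write on data[1] a single atomic step
    sycl::atomic_ref<double, sycl::memory_order::relaxed,
                     sycl::memory_scope::device,
                     sycl::access::address_space::global_space>
        sum(data[1]);
    sum.fetch_add(4.0 / (1.0 + temp * temp));
}).wait();

Atomics fix the correctness problem but serialize the updates, so at 10000000 iterations a reduction is usually the better tool.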


To avoid this, you could give each work-item its own slot in an array of partial results and then perform a reduction (sum) over that array to get the final value, as in the sketch below.
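
Recent DPC++ releases implement this partial-results-plus-combine pattern directly as SYCL 2020's sycl::reduction. A minimal self-contained sketch, assuming a compiler new enough to provide the <sycl/sycl.hpp> header and sycl::reduction (this is not your original code):

#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    constexpr long num_steps = 10000000;
    const double step = 1.0 / (double)num_steps;

    sycl::queue q;
    double* sum = sycl::malloc_shared<double>(1, q);
    *sum = 0.0;

    // sycl::reduction gives each work-item a private partial sum and
    // combines them with sycl::plus, so no two work-items race on *sum
    q.parallel_for(sycl::range<1>(num_steps),
                   sycl::reduction(sum, sycl::plus<double>()),
                   [=](sycl::item<1> it, auto& acc) {
                       double x = (it[0] + 0.5) * step;
                       acc += 4.0 / (1.0 + x * x);
                   }).wait();

    std::cout << "PI ~= " << step * *sum << std::endl;
    sycl::free(sum, q);
    return 0;
}

With the race removed, the parallel result matches the serial one.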


Alternatively, you might want to check out the Monte Carlo PI approximation sample in the link below:

https://github.com/oneapi-src/oneAPI-samples/tree/master/Libraries/oneMKL/monte_carlo_pi



Regards,

Rahul


Siqiao_Fu
Beginner

Really helpful! Thank you Rahul!

RahulV_intel
Moderator

Hi,


Thanks for the confirmation. Intel will no longer monitor this thread. Further discussions on this thread will be considered community only.


Regards,

Rahul

