Intel® oneAPI Data Parallel C++
Support for Intel® oneAPI DPC++ Compiler, Intel® oneAPI DPC++ Library, Intel® DPC++ Compatibility Tool, and GDB*
Announcements
This community is designed for sharing of public information. Please do not share Intel or third-party confidential information here.
348 Discussions

Parallel result is different with serial result

Siqiao_Fu
Beginner
327 Views
#include <CL/sycl.hpp>
#include <iostream>
#include <ctime>

using namespace std;
using namespace cl::sycl;

constexpr long num_steps = 10000000;

double without_oneapi() {
    double step = 1.0 / (double)num_steps;
    double x = 0.0;
    double sum = 0.0;

    clock_t start = clock();
    for (int i = 0; i < num_steps; i++) {
        x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    clock_t end = clock();
    std::cout << "Without oneapi cost " << (double)(end - start) / CLOCKS_PER_SEC << " second" << std::endl;

    double res = step * sum;
    return res;
}

double with_oneapi_buffer() {
    queue q;
    //std::cout << "Device: " << q.get_device().get_info<info::device::name>() << std::endl;

    double step = 1.0 / (double)num_steps;
    double data[2] = {step, 0.0};

    buffer buf(data, range<1>(2));

    clock_t start = clock();
    q.submit([&](handler& h) {
        accessor a(buf, h);
        h.parallel_for(range<1>(num_steps), [=](auto i) {
            double temp = ((double)i + 0.5) * a[0];
            a[1] += 4.0 / (1.0 + temp * temp);
            });
        }).wait();
    clock_t end = clock();
    std::cout << "With oneapi buffer cost " << (double)(end - start) / CLOCKS_PER_SEC << " second" << std::endl;

    double res = step * data[1];
    return res;
}

double with_oneapi_usm() {
    queue q;
    //std::cout << "Device: " << q.get_device().get_info<info::device::name>() << std::endl;

    double step = 1.0 / (double)num_steps;
    double* data = malloc_shared<double>(2, q);
    data[0] = step;
    data[1] = 0.0;

    clock_t start = clock();
    q.parallel_for(range<1>(num_steps), [=](id<1> i){
        double temp = ((double)i + 0.5) * data[0];
        data[1] += 4.0 / (1.0 + temp * temp);
    }).wait();
    clock_t end = clock();
    std::cout << "With oneapi usm cost " << (double)(end - start) / CLOCKS_PER_SEC << " second" << std::endl;

    double res = step * data[1];
    free(data, q);
    return res;
}


int main() {

    clock_t start, end;
    
    double PI1 = without_oneapi();
    double PI2 = with_oneapi_buffer();
    double PI3 = with_oneapi_usm();
    
    std::cout << "Without oneapi result:" << PI1 << std::endl;
    std::cout << "With oneapi buffer result:" << PI2 << std::endl;
    std::cout << "With oneapi usm result:" << PI3 << std::endl;

  return 0;
}

I tried this simple test and the result is weird. When num_steps is larger than 10000. Both of the results of using usm and using buffer becomes smaller than they should be which is PI. 

 

Why does this happened? Did I use oneAPI in the right way? 

0 Kudos
1 Solution
RahulV_intel
Moderator
297 Views

Hi,


The source code provided by you is not thread-safe.


Specifically, this line of code:

a[1] += 4.0 / (1.0 + temp * temp);


The DPC++ runtime launches as many threads as specified in the parallel_for region. In your case, the runtime launches 10000000 threads (as specified by the num_steps variable). Since all these threads are writing to the same memory location (a[1]), the results can go wrong.


To avoid this, you could create an array to store all the partial results and then perform a reduction sum operation on the array to get the final result.


Alternatively, you might want to check out the Montecarlo PI approximation sample in the link below:

https://github.com/oneapi-src/oneAPI-samples/tree/master/Libraries/oneMKL/monte_carlo_pi



Regards,

Rahul


View solution in original post

3 Replies
RahulV_intel
Moderator
298 Views

Hi,


The source code provided by you is not thread-safe.


Specifically, this line of code:

a[1] += 4.0 / (1.0 + temp * temp);


The DPC++ runtime launches as many threads as specified in the parallel_for region. In your case, the runtime launches 10000000 threads (as specified by the num_steps variable). Since all these threads are writing to the same memory location (a[1]), the results can go wrong.


To avoid this, you could create an array to store all the partial results and then perform a reduction sum operation on the array to get the final result.


Alternatively, you might want to check out the Montecarlo PI approximation sample in the link below:

https://github.com/oneapi-src/oneAPI-samples/tree/master/Libraries/oneMKL/monte_carlo_pi



Regards,

Rahul


Siqiao_Fu
Beginner
291 Views

Really helpful! Thank you Rahul!

RahulV_intel
Moderator
270 Views

Hi,


Thanks for the confirmation. Intel will no longer monitor this thread. Further discussions on this thread will be considered community only.


Regards,

Rahul


Reply