Parallel for is very slow compared to iterative solution

sidrakiyani — Tue, 24 May 2022 17:42:06 GMT

I am trying to accelerate an algorithm using DPC++, but the plain iterative calculation runs about 1.5 times faster than the parallel kernel. The code for both versions is below.

num_items is currently 16,000. I also tried smaller values such as 500, with the same result: the CPU version is much faster than the kernel.

I am using Visual Studio 2022 with the oneAPI DPC++ compiler and running on the FPGA emulator, but I don't know how to find details about the emulator, such as the frequency it runs at. The full code is: https://ideone.com/iEHQHa b

    // This is the normal iterative code.
    std::vector<double> distance_calculation(const std::vector<std::vector<double>>& dataset,
                                             const std::vector<double>& curr_test) {
        auto start = std::chrono::high_resolution_clock::now();
        std::vector<double> res;
        // Squared Euclidean distance from curr_test to every row of dataset.
        for (std::size_t i = 0; i < dataset.size(); ++i) {
            double dis = 0;
            for (std::size_t j = 0; j < dataset[i].size(); ++j) {
                dis += (curr_test[j] - dataset[i][j]) * (curr_test[j] - dataset[i][j]);
            }
            res.push_back(dis);
        }
        auto finish = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> elapsed = finish - start;
        std::cout << "Elapsed time: " << elapsed.count() << " s\n";
        return res;
    }
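Both functions time a single call with `std::chrono`. A single measurement can include one-time costs, so a comparison is usually easier to trust when it uses a warm-up call and an average over several runs. The following is only a minimal sketch in standard C++; `mean_seconds` and its `runs` parameter are hypothetical names that do not appear in the original code.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper (not part of the original post): calls `work` once as an
// untimed warm-up, then returns the mean wall-clock seconds over `runs` calls.
template <typename F>
double mean_seconds(F&& work, std::size_t runs = 10) {
    assert(runs > 0);
    work();  // warm-up call, excluded from the measurement
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t r = 0; r < runs; ++r) {
        work();
    }
    const auto finish = std::chrono::steady_clock::now();
    const std::chrono::duration<double> elapsed = finish - start;
    return elapsed.count() / static_cast<double>(runs);
}
```

It could be called as `mean_seconds([&] { distance_calculation(dataset, curr_test); })`, and likewise around the kernel submission followed by `q.wait()`.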

    // This is FPGA emulation code.
    // Assumes the SYCL header and `using namespace sycl;`, as in the full code linked above.
    std::vector<double> distance_calculation_FPGA(queue& q,
                                                  const std::vector<std::vector<double>>& dataset,
                                                  const std::vector<double>& curr_test) {
        // Flatten the dataset row-major so the kernel can index it as a[i * 5 + j].
        std::vector<double> linear_dataset;
        for (std::size_t i = 0; i < dataset.size(); ++i) {
            for (std::size_t j = 0; j < dataset[i].size(); ++j) {
                linear_dataset.push_back(dataset[i][j]);
            }
        }
        range<1> num_items{dataset.size()};
        std::vector<double> res;
        //std::cout << "im in" << std::endl;

        res.resize(dataset.size());
        {
            // The buffers live in this scope so that the results are written back
            // to res when the scope ends, before res is returned.
            buffer dataset_buf(linear_dataset);
            buffer curr_test_buf(curr_test);
            buffer res_buf(res.data(), num_items);

            auto start = std::chrono::high_resolution_clock::now();
            q.submit([&](handler& h) {
                accessor a(dataset_buf, h, read_only);
                accessor b(curr_test_buf, h, read_only);
                // write_only + no_init: every element is assigned below and never read.
                // (The original used read_write with no_init and then `dif[i] +=`,
                // which accumulates onto uninitialized device memory.)
                accessor dif(res_buf, h, write_only, no_init);
                h.parallel_for(num_items, [=](id<1> i) {
                    //  dif[i] = a[i].size() * 1.0;// a[i];
                    // Accumulate in a private variable; the row length 5 is hard-coded.
                    double dis = 0;
                    for (int j = 0; j < 5; ++j) {
                        dis += (b[j] - a[i * 5 + j]) * (b[j] - a[i * 5 + j]);
                    }
                    dif[i] = dis;
                });
            });
            q.wait();
            auto finish = std::chrono::high_resolution_clock::now();
            std::chrono::duration<double> elapsed = finish - start;
            std::cout << "Elapsed time: " << elapsed.count() << " s\n";
        }
        /*
        for (int i = 0; i < dataset.size(); ++i) {
            double dis = 0;
            for (int j = 0; j < dataset[i].size(); ++j) {
                dis += (curr_test[j] - dataset[i][j]) * (curr_test[j] - dataset[i][j]);
            }
            res.push_back(dis);
        }
        */
        return res;
    }
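The kernel relies on the dataset being flattened row-major, so that element `j` of row `i` sits at index `i * dim + j` (with `dim` fixed at 5 in the post). As a rough host-only illustration of that indexing, which can also serve to check kernel results, here is a sketch in plain C++ without SYCL. The names `flatten` and `squared_distances` are hypothetical, and it assumes every row has the same length as the query point.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: copies a rectangular dataset into one contiguous
// row-major array. Assumes every row has exactly `dim` elements.
std::vector<double> flatten(const std::vector<std::vector<double>>& dataset,
                            std::size_t dim) {
    std::vector<double> flat;
    flat.reserve(dataset.size() * dim);
    for (const auto& row : dataset) {
        assert(row.size() == dim);
        flat.insert(flat.end(), row.begin(), row.end());
    }
    return flat;
}

// Hypothetical helper: squared Euclidean distance from `point` to each row
// of `flat`, where element j of row i is stored at flat[i * dim + j].
std::vector<double> squared_distances(const std::vector<double>& flat,
                                      const std::vector<double>& point) {
    const std::size_t dim = point.size();
    assert(dim > 0 && flat.size() % dim == 0);
    std::vector<double> res(flat.size() / dim);
    for (std::size_t i = 0; i < res.size(); ++i) {
        double dis = 0.0;  // local accumulator, written to res exactly once
        for (std::size_t j = 0; j < dim; ++j) {
            const double diff = point[j] - flat[i * dim + j];
            dis += diff * diff;
        }
        res[i] = dis;
    }
    return res;
}
```

Taking the row length from `point.size()` instead of a literal 5 keeps the indexing valid if the number of features changes.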

Re: Parallel for is very slow compared to iterative solution

Steve_Lionel — Tue, 24 May 2022 17:55:25 GMT

This should be moved to Intel® oneAPI Data Parallel C++ - Intel Communities

Re: Parallel for is very slow compared to iterative solution

Barbara_P_Intel — Tue, 24 May 2022 18:08:02 GMT

Definitely on the wrong forum. Moving to DPC++ Forum.
