Parallel Version of Code not as efficient as Serial Version of Code

Nikhil_T · 02-02-2021

Hi there!

I wrote code for the restriction operator used in multigrid algorithms. The code is given below:

#include <cmath> // std::sqrt
#include <iostream>
#include <CL/sycl.hpp>
#include <vector>

using namespace sycl;

std::vector <float> Restriction2D(std::vector <float>& vec_h) {
int vec_h_dim = int(std::sqrt(vec_h.size()));
int vec_2h_dim = int((vec_h_dim - 1) / 2);
std::vector<float> vec_2h(vec_2h_dim * vec_2h_dim, 0);
for (int i_2h = 1; i_2h <= vec_2h_dim; i_2h++) {

for (int j_2h = 1; j_2h <= vec_2h_dim; j_2h++) {

// Weight written as 1.0f / 16.0f: the integer expression (1 / 16) evaluates to 0.
vec_2h[(i_2h - 1) * vec_2h_dim + j_2h - 1] = (1.0f / 16.0f) * (vec_h[(2 * i_2h - 1 - 1) * vec_h_dim + 2 * j_2h - 1 - 1] + vec_h[(2 * i_2h - 1 - 1) * vec_h_dim + 2 * j_2h]
+ vec_h[2 * i_2h * vec_h_dim + 2 * j_2h - 1 - 1] + vec_h[2 * i_2h * vec_h_dim + 2 * j_2h] + 2 * (vec_h[(2 * i_2h - 1) * vec_h_dim + 2 * j_2h - 1 - 1] +
vec_h[(2 * i_2h - 1) * vec_h_dim + 2 * j_2h] + vec_h[(2 * i_2h - 1 - 1) * vec_h_dim + 2 * j_2h - 1] + vec_h[2 * i_2h * vec_h_dim + 2 * j_2h - 1]) +
4 * vec_h[(2 * i_2h - 1) * vec_h_dim + 2 * j_2h - 1]);
}
}
return vec_2h;
}

std::vector <float> Restriction2D_parallel(std::vector <float>& vec_h) {
int vec_h_dim = int(std::sqrt(vec_h.size()));
int vec_2h_dim = int((vec_h_dim - 1) / 2);
std::vector<float> vec_2h(vec_2h_dim * vec_2h_dim, 0);
cl::sycl::queue q;
{
buffer <float, 2> vec_2h_buf(vec_2h.data(), range<2>{vec_2h_dim, vec_2h_dim});
buffer <float, 2> vec_h_buf(vec_h.data(), range<2>{vec_h_dim, vec_h_dim});

//float* host_vector_2h = malloc_host<float>(vec_2h_dim, q);
q.submit([&](handler& h) {
accessor vec_2h_acc{ vec_2h_buf , h };
accessor vec_h_acc{ vec_h_buf , h };
program p(q.get_context());
p.build_with_kernel_type<class Restriction>();

h.parallel_for<class Restriction>(p.get_kernel<class Restriction>(),range<2>{ vec_2h_dim, vec_2h_dim}, [=](id<2>idx) {
int i_2h = idx[0]; //0 to vec_2h_dim -1
int j_2h = idx[1]; //0 to vec_2h_dim -1
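// Coarse point (i_2h, j_2h) is centered on fine point (2 * i_2h + 1, 2 * j_2h + 1), as in the serial version.
// Centering on (2 * i_2h, 2 * j_2h) would read index -1 when i_2h or j_2h is 0.
// Weight written as 1.0f / 16.0f: the integer expression (1 / 16) evaluates to 0.
vec_2h_acc[i_2h][j_2h] = (1.0f / 16.0f) * (vec_h_acc[2 * i_2h][2 * j_2h] + vec_h_acc[2 * i_2h][2 * j_2h + 2]
+ vec_h_acc[2 * i_2h + 2][2 * j_2h] + vec_h_acc[2 * i_2h + 2][2 * j_2h + 2] + 2 * (vec_h_acc[2 * i_2h + 1][2 * j_2h] +
vec_h_acc[2 * i_2h + 1][2 * j_2h + 2] + vec_h_acc[2 * i_2h][2 * j_2h + 1] + vec_h_acc[2 * i_2h + 2][2 * j_2h + 1]) +
4 * vec_h_acc[2 * i_2h + 1][2 * j_2h + 1]);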
});
});
//q.wait();
}
return vec_2h;
}

int main() {
std::size_t size = 11108889; // 3333 * 3333, i.e. a 3333 x 3333 fine grid
std::vector<float> test_vec(size, 0.0);
for (std::size_t i = 0; i < test_vec.size(); i++) {
test_vec[i] = i / 4.0;
}
std::vector<float>test_vec_restricted = Restriction2D_parallel(test_vec);
//std::cout << test_vec_restricted.size();
return 0;

}

When I run Restriction2D and Restriction2D_parallel, the serial version seems to perform better than the parallel version. I have also attached the results of the HPC VTune analysis for both.

Can someone explain to me why this is happening? What knowledge am I lacking here?
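
Before comparing timings, it can help to confirm on a tiny grid that the stencil produces the expected values. The following is a minimal sketch in plain C++ (no SYCL) of the full-weighting restriction; the names `full_weighting_at`, `restrict_one_based`, `restrict_zero_based`, and `integer_weight` are hypothetical and not part of the posted code. It assumes a square, row-major fine grid with an odd side length. It exercises two details: the weight has to be a floating-point expression, and a 0-based coarse index `i` maps to the fine-grid center `2 * i + 1`.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// In C++, integer division truncates, so this constant is 0, not 0.0625.
constexpr int integer_weight = 1 / 16;

// Hypothetical helper: full-weighting stencil centered on fine point (r, c)
// of an n x n row-major grid. Corners count once, edges twice, center four times.
float full_weighting_at(const std::vector<float>& fine, int n, int r, int c) {
    const float corners = fine[(r - 1) * n + c - 1] + fine[(r - 1) * n + c + 1]
                        + fine[(r + 1) * n + c - 1] + fine[(r + 1) * n + c + 1];
    const float edges = fine[r * n + c - 1] + fine[r * n + c + 1]
                      + fine[(r - 1) * n + c] + fine[(r + 1) * n + c];
    return (1.0f / 16.0f) * (corners + 2.0f * edges + 4.0f * fine[r * n + c]);
}

// Side length of a square grid stored in a flat vector (assumes a perfect square).
int side_length(const std::vector<float>& grid) {
    return static_cast<int>(std::lround(std::sqrt(static_cast<double>(grid.size()))));
}

// 1-based coarse indices, as in a serial loop running from 1 to m.
std::vector<float> restrict_one_based(const std::vector<float>& fine) {
    const int n = side_length(fine);
    const int m = (n - 1) / 2;
    std::vector<float> coarse(static_cast<std::size_t>(m) * m, 0.0f);
    for (int i = 1; i <= m; ++i) {
        for (int j = 1; j <= m; ++j) {
            coarse[(i - 1) * m + (j - 1)] = full_weighting_at(fine, n, 2 * i - 1, 2 * j - 1);
        }
    }
    return coarse;
}

// 0-based coarse indices, as a kernel index running from 0 to m - 1 would use.
std::vector<float> restrict_zero_based(const std::vector<float>& fine) {
    const int n = side_length(fine);
    const int m = (n - 1) / 2;
    std::vector<float> coarse(static_cast<std::size_t>(m) * m, 0.0f);
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < m; ++j) {
            coarse[i * m + j] = full_weighting_at(fine, n, 2 * i + 1, 2 * j + 1);
        }
    }
    return coarse;
}
```

On a 5 x 5 grid filled with 1.0f, both helpers return four values equal to 1.0f, because the stencil weights sum to 16. If the weight were written as the integer expression, every output would be 0.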

AbhishekD_Intel · 02-05-2021

Hi Nikhil,

Thanks for reaching out to us.

From your code, we can see that you are performing simple operations on vector elements and adding them to get the desired result. Because each element access and each arithmetic operation takes constant time, your code behaves much like a vector add. And since the complexity of the loops does not exceed O(size), your code takes less than a second to complete sequentially.

So this small workload is not ideal for comparing sequential and parallel execution. With so little work per element, the fixed costs of the parallel version (creating the queue, building the kernel, and moving the buffers to and from the device) can outweigh the time saved in the computation itself. This is the reason for the difference in performance between the two.

To get meaningful performance statistics for sequential versus parallel execution, try increasing your workload and making your code more compute-intensive.

I hope these details help clarify the issue.
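
One way to make such a comparison more informative is to separate one-time costs from steady-state run time. The sketch below is a generic timing helper in standard C++ and is not taken from this thread: `time_average_ms` is a hypothetical name, and the default warm-up and repetition counts are arbitrary assumptions. It calls the function under test a few times untimed, so that one-time work is not counted, and then reports the mean wall-clock time over several timed calls.

```cpp
#include <chrono>
#include <functional>

// Hypothetical helper: runs `work` `warmups` times untimed, then returns the
// mean wall-clock time in milliseconds over `reps` timed calls.
// Assumes reps > 0.
double time_average_ms(const std::function<void()>& work, int warmups = 1, int reps = 5) {
    for (int i = 0; i < warmups; ++i) {
        work();  // absorbs one-time costs such as JIT compilation or cache warm-up
    }
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) {
        work();
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / reps;
}
```

For example, `time_average_ms([&] { Restriction2D(test_vec); })` and the same call with `Restriction2D_parallel` would give per-call averages. Note that the posted parallel function creates the queue, buffers, and program inside the function, so those costs would likely still be part of every timed call.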

Warm Regards,

Abhishek

AbhishekD_Intel · 02-14-2021

Hi Nikhil,

Please give us an update on the provided details.

Warm Regards,

Abhishek

AbhishekD_Intel · 02-21-2021

Hi Nikhil,

We haven't heard back from you for a long time, so we are assuming that the provided solution helped you solve your issue. We will no longer monitor this thread.

Please post a new thread if you have any other issues.

Warm Regards,

Abhishek