Solved: How to change h.parallel_for(range(M, P), [=](auto index) to a single_task function

Wei-Chih · ‎08-02-2022

Hi support team,

I modified the code from the oneAPI samples (mul), and I want to use it on FPGA hardware. I am not sure how to convert these three parallel_for calls to single_task. Could you give me some suggestions?

Below is the code:

#if FPGA_EMULATOR
  // DPC++ extension: FPGA emulator selector on systems without FPGA card.
  ext::intel::fpga_emulator_selector d_selector;
#elif FPGA
  // DPC++ extension: FPGA selector on systems with FPGA card.
  ext::intel::fpga_selector d_selector;
#else
  // The default device selector will select the most performant device.
  default_selector d_selector;
#endif
  try {
    queue q(d_selector, dpc_common::exception_handler);

    cout << "Device: " << q.get_device().get_info<info::device::name>() << "\n";

    // Create 2D buffers for matrices, buffer c is bound with host memory c_back
    buffer<float, 2> a_buf(range(M, N));
    buffer<float, 2> b_buf(range(N, P));
    buffer c_buf(reinterpret_cast<float*>(c_back), range(M, P));

    cout << "Problem size: c(" << M << "," << P << ") = a(" << M << "," << N
         << ") * b(" << N << "," << P << ")\n";

    // Using three command groups to illustrate execution order. The use of
    // first two command groups for initializing matrices is not the most
    // efficient way. It just demonstrates the implicit multiple command group
    // execution ordering.

    // Submit command group to queue to initialize matrix a

    // start the clock
    // dpc_common::TimeInterval kernel_runtime;
    dpc_common::TimeInterval kernel_e_a_runtime;

    auto e_a = q.submit([&](auto& h) {
      // Get write only access to the buffer on a device.
      accessor a(a_buf, h, write_only);

      // Execute kernel.
      h.parallel_for(range(M, N), [=](auto index) {
        // Each element of matrix a is 1.
        a[index] = 1.0f;
      });
    });

    double elapsed_e_a_time = kernel_e_a_runtime.Elapsed();

    dpc_common::TimeInterval kernel_e_b_runtime;
    // Submit command group to queue to initialize matrix b
    auto e_b = q.submit([&](auto& h) {
      // Get write only access to the buffer on a device
      accessor b(b_buf, h, write_only);

      // Execute kernel.
      h.parallel_for(range(N, P), [=](auto index) {
        // Each column of b is the sequence 1,2,...,N
        b[index] = index[0] + 1.0f;
      });
    });
    double elapsed_e_b_time = kernel_e_b_runtime.Elapsed();

    dpc_common::TimeInterval kernel_e_c_runtime;
    // Submit command group to queue to multiply matrices: c = a * b
    auto e_c = q.submit([&](auto& h) {
      // Read from a and b, write to c
      accessor a(a_buf, h, read_only);
      accessor b(b_buf, h, read_only);
      accessor c(c_buf, h, write_only);

      int width_a = a_buf.get_range()[1];

      // Execute kernel.
      h.parallel_for(range(M, P), [=](auto index) {
      // h.single_task<c_calc>([=]() [[intel::kernel_args_restrict]] {
      //   for (int i = 0; i < M; i++) {
      // #pragma unroll 1
      //     for (int j = 0; j < P; j++) {
        // Get global position in Y direction.
        int row = index[0];
        // int row = j;
        // Get global position in X direction.
        int col = index[1];
        // int col = i;

        float sum = 0.0f;

        // Compute the result of one element of c
        // #pragma unroll 1
        for (int i = 0; i < width_a; i++) {
          sum += a[row][i] * b[i][col];
        }

        c[index] = sum;
        // c[i][j] = sum;
      //     }
      //   }
      });
    });

aikeu · ‎08-15-2022

Hi Wei-Chih,

Sorry for the late reply.

I may not fully understand the code that you are working with.

Can you provide more description of the operation that you are trying to perform?

I think you can split your task using normal for loops instead of parallel_for.

A brief example is below:

// Computes the product of two square matrices.
void matrix_multiply(double** m1, double** m2, double** result, size_t size)
{
   for (size_t i = 0; i < size; i++)
   {
      for (size_t j = 0; j < size; j++)
      {
         double temp = 0;
         for (size_t k = 0; k < size; k++)
         {
            temp += m1[i][k] * m2[k][j];
         }
         result[i][j] = temp;
      }
   }
}

// Computes the product of two square matrices in parallel.
// Note: this parallel_for(first, last, func) form appears to be
// concurrency::parallel_for from the Parallel Patterns Library (<ppl.h>),
// not the SYCL handler::parallel_for used in the question.
void parallel_matrix_multiply(double** m1, double** m2, double** result, size_t size)
{
   parallel_for (size_t(0), size, [&](size_t i)
   {
      for (size_t j = 0; j < size; j++)
      {
         double temp = 0;
         for (size_t k = 0; k < size; k++)
         {
            temp += m1[i][k] * m2[k][j];
         }
         result[i][j] = temp;
      }
   });
}
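Applied to the M x N x P matrices in the question, the same loop nest could look like the sketch below. It is only an illustration: `MatMulLoopNest` and `MultiplyOnHost` are hypothetical names, the code is plain C++ so it can be tried without a SYCL compiler, and the `single_task` placement shown in the comment is an assumption that has not been tested on FPGA hardware. The point is the index mapping: the outer loop over M takes the place of `index[0]` (row), the loop over P takes the place of `index[1]` (col), and the reduction loop uses its own variable `k`.

```cpp
#include <cstddef>
#include <vector>

// Loop nest that could form the body of a single-work-item kernel.
// A, B and C are any types indexable as x[row][col]; in the question's code
// they would be the 2D accessors a, b and c (assumption, untested on FPGA).
template <typename A, typename B, typename C>
void MatMulLoopNest(const A& a, const B& b, C& c,
                    std::size_t m, std::size_t n, std::size_t p) {
  for (std::size_t row = 0; row < m; ++row) {      // replaces index[0]
    for (std::size_t col = 0; col < p; ++col) {    // replaces index[1]
      float sum = 0.0f;
      for (std::size_t k = 0; k < n; ++k) {        // separate reduction index
        sum += a[row][k] * b[k][col];
      }
      c[row][col] = sum;                           // same (row, col) as the sum
    }
  }
}

// Hypothetical placement inside the third command group of the question:
//   h.single_task<c_calc>([=]() {
//     MatMulLoopNest(a, b, c, M, N, P);
//   });

using Matrix = std::vector<std::vector<float>>;

// Host-side helper that runs the same loop nest on nested vectors.
Matrix MultiplyOnHost(const Matrix& a, const Matrix& b) {
  const std::size_t m = a.size();
  const std::size_t n = b.size();
  const std::size_t p = b.empty() ? 0 : b[0].size();
  Matrix c(m, std::vector<float>(p, 0.0f));
  MatMulLoopNest(a, b, c, m, n, p);
  return c;
}
```

In the commented-out attempt in the question, `row` is taken from `j` and `col` from `i` while the store goes to `c[i][j]`, and the inner loop reuses the name `i`. That pairing computes the sum for one position and writes it to the transposed one, which might be one source of wrong results when M and P differ.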

This document might also help when considering how to structure your code:

https://www.colfax-intl.com/downloads/oneAPI_module04_DPCplusplusFundamentals2of2.pdf
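To check whether a loop-based rewrite still produces the right matrix, the result can be compared on the host. A minimal sketch, assuming the initialization from the question (every element of a is 1 and b[k][j] = k + 1), so every element of c should equal 1 + 2 + ... + N = N(N+1)/2. `ExpectedElement` and `ResultMatches` are hypothetical helper names, and c is assumed to be stored row-major in a flat vector. The exact float comparison is only safe while the sums stay small enough to be represented exactly.

```cpp
#include <cstddef>
#include <vector>

// Expected value of each c[i][j] when a[i][k] = 1 and b[k][j] = k + 1:
// the sum 1 + 2 + ... + n.
float ExpectedElement(std::size_t n) {
  return static_cast<float>(n * (n + 1) / 2);
}

// Returns true when c holds exactly m * p elements and all of them
// equal the expected value for inner dimension n.
bool ResultMatches(const std::vector<float>& c, std::size_t m, std::size_t p,
                   std::size_t n) {
  if (c.size() != m * p) return false;
  const float expected = ExpectedElement(n);
  for (float value : c) {
    if (value != expected) return false;
  }
  return true;
}
```

A single mismatching element makes `ResultMatches` return false, which is usually enough to notice an index mix-up between rows and columns.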

Thanks.

Regards,

Aik Eu

aikeu · ‎08-03-2022

Hi Wei-Chih,

Please refer to the link below:

https://www.intel.com/content/www/us/en/develop/documentation/oneapi-fpga-optimization-guide/top/optimize-your-design/throughput-1/single-work-item-kernels/single-work-item-kernel-design-guidelines.html

Thanks.

Regards,

Aik Eu

aikeu · ‎08-07-2022

Hi Wei-Chih,

May I know whether the link in my previous comment helped answer your question?

Thanks.

Regards,

Aik Eu

Wei-Chih · ‎08-08-2022

Hi Aikeu,

When I follow the tutorial at the link to modify the code, I get wrong results. Could you show me how to set up the for loops in the single_task function for this case ( h.parallel_for(range(M, P), [=](auto index) { )?

aikeu · ‎08-18-2022

Hi Wei-Chih,

May I know if there is any follow-up to my previous comment?

Thanks.

Regards,

Aik Eu

aikeu · ‎08-22-2022

Hi Wei-Chih,

I will close this thread if there are no further questions.

Thanks.

Regards,

Aik Eu

aikeu · ‎08-24-2022

Hi Wei-Chih,

We have not received any response from you to the previous reply that I provided. This thread will be transitioned to community support. If you have a new question, feel free to open a new thread to get support from Intel experts. Otherwise, the community users will continue to help you on this thread. Thank you.

Thanks.

Regards,

Aik Eu