Hi support team,
I modified the matrix mul code from the oneAPI samples, and I want to run it on FPGA hardware. I am not sure how to convert these three parallel_for calls to single_task. Could you give me some suggestions?
The code is below:
#if FPGA_EMULATOR
  // DPC++ extension: FPGA emulator selector on systems without FPGA card.
  ext::intel::fpga_emulator_selector d_selector;
#elif FPGA
  // DPC++ extension: FPGA selector on systems with FPGA card.
  ext::intel::fpga_selector d_selector;
#else
  // The default device selector will select the most performant device.
  default_selector d_selector;
#endif
  try {
    queue q(d_selector, dpc_common::exception_handler);
    cout << "Device: " << q.get_device().get_info<info::device::name>() << "\n";
    // Create 2D buffers for matrices, buffer c is bound with host memory c_back
    buffer<float, 2> a_buf(range(M, N));
    buffer<float, 2> b_buf(range(N, P));
    buffer c_buf(reinterpret_cast<float*>(c_back), range(M, P));
    cout << "Problem size: c(" << M << "," << P << ") = a(" << M << "," << N
         << ") * b(" << N << "," << P << ")\n";
    // Using three command groups to illustrate execution order. The use of
    // first two command groups for initializing matrices is not the most
    // efficient way. It just demonstrates the implicit multiple command group
    // execution ordering.
    // Submit command group to queue to initialize matrix a
    // start the clock
    // dpc_common::TimeInterval kernel_runtime;
    dpc_common::TimeInterval kernel_e_a_runtime;
    auto e_a = q.submit([&](auto& h) {
      // Get write only access to the buffer on a device.
      accessor a(a_buf, h, write_only);
      // Execute kernel.
      h.parallel_for(range(M, N), [=](auto index) {
        // Each element of matrix a is 1.
        a[index] = 1.0f;
      });
    });
    double elapsed_e_a_time = kernel_e_a_runtime.Elapsed();
    dpc_common::TimeInterval kernel_e_b_runtime;
    // Submit command group to queue to initialize matrix b
    auto e_b = q.submit([&](auto& h) {
      // Get write only access to the buffer on a device
      accessor b(b_buf, h, write_only);
      // Execute kernel.
      h.parallel_for(range(N, P), [=](auto index) {
        // Each column of b is the sequence 1,2,...,N
        b[index] = index[0] + 1.0f;
      });
    });
    double elapsed_e_b_time = kernel_e_b_runtime.Elapsed();
    dpc_common::TimeInterval kernel_e_c_runtime;
    // Submit command group to queue to multiply matrices: c = a * b
    auto e_c = q.submit([&](auto& h) {
      // Read from a and b, write to c
      accessor a(a_buf, h, read_only);
      accessor b(b_buf, h, read_only);
      accessor c(c_buf, h, write_only);
      int width_a = a_buf.get_range()[1];
      // Execute kernel.
      h.parallel_for(range(M, P), [=](auto index) {
        // h.single_task<c_calc>([=]() [[intel::kernel_args_restrict]] {
        // for (int i = 0; i < M; i++) {
        //#pragma unroll 1
        // for (int j = 0; j < P; j++) {
        // Get global position in Y direction.
        int row = index[0];
        // int row = j;
        // Get global position in X direction.
        int col = index[1];
        // int col = i;
        float sum = 0.0f;
        // Compute the result of one element of c
        //#pragma unroll 1
        for (int i = 0; i < width_a; i++) {
          sum += a[row][i] * b[i][col];
        }
        c[index] = sum;
        //c[i][j] = sum;
        // }
        // }
      });
    });
Hi Wei-Chih,
Sorry for the late reply.
I may not fully understand the code you are working with.
Can you provide more description of the operation you are trying to implement?
I think you can split your task using a normal for loop instead of parallel_for.
A brief example is below:
// Computes the product of two square matrices.
void matrix_multiply(double** m1, double** m2, double** result, size_t size)
{
    for (size_t i = 0; i < size; i++)
    {
        for (size_t j = 0; j < size; j++)
        {
            double temp = 0;
            // size_t index avoids a signed/unsigned comparison with size.
            for (size_t k = 0; k < size; k++)
            {
                temp += m1[i][k] * m2[k][j];
            }
            result[i][j] = temp;
        }
    }
}
// Computes the product of two square matrices in parallel.
// Note: parallel_for(first, last, functor) is the form used by Microsoft's
// PPL (<ppl.h>), not SYCL's handler::parallel_for, which takes a range.
void parallel_matrix_multiply(double** m1, double** m2, double** result, size_t size)
{
    parallel_for (size_t(0), size, [&](size_t i)
    {
        for (size_t j = 0; j < size; j++)
        {
            double temp = 0;
            for (size_t k = 0; k < size; k++)
            {
                temp += m1[i][k] * m2[k][j];
            }
            result[i][j] = temp;
        }
    });
}
This document may help you decide how to structure your code:
https://www.colfax-intl.com/downloads/oneAPI_module04_DPCplusplusFundamentals2of2.pdf
Thanks.
Regards,
Aik Eu
Hi Wei-Chih,
May I know whether the link in my previous comment helped answer your question?
Thanks.
Regards,
Aik Eu
Hi Aikeu,
When I follow the tutorial in the link to modify the code, I get the wrong result. Could you show me how to set up the for loops inside single_task for this case ( h.parallel_for(range(M, P), [=](auto index) { )?
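For readers with the same question, here is a hedged sketch of the loop structure, written in plain C++ so it compiles without an FPGA toolchain. It is not the sample's official port: the function name matmul_single_task_style and the flat row-major std::vector storage are hypothetical stand-ins for the buffers and accessors. The three loop nests correspond to the three parallel_for kernels. In the SYCL code, each nest would sit inside h.single_task([=]() { ... }), with accessor indexing such as a[row][k] in place of the flat indices. Note that the commented-out attempt in the original post takes row from the loop bounded by P and col from the loop bounded by M, and reuses the name i for the inner loop; either could explain a wrong result when M and P differ.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the oneAPI sample): runs the three loop
// nests sequentially, the way a single_task kernel body would.
// Matrices are row-major in flat vectors: a is M x N, b is N x P, c is M x P.
std::vector<float> matmul_single_task_style(std::size_t M, std::size_t N,
                                            std::size_t P) {
  std::vector<float> a(M * N), b(N * P), c(M * P);

  // Kernel 1 equivalent: every element of a is 1.
  for (std::size_t row = 0; row < M; ++row)
    for (std::size_t col = 0; col < N; ++col)
      a[row * N + col] = 1.0f;

  // Kernel 2 equivalent: each column of b is the sequence 1, 2, ..., N.
  for (std::size_t row = 0; row < N; ++row)
    for (std::size_t col = 0; col < P; ++col)
      b[row * P + col] = static_cast<float>(row) + 1.0f;

  // Kernel 3 equivalent: c = a * b.
  // The outer loops replace index[0] (row, bounded by M) and
  // index[1] (col, bounded by P) from parallel_for(range(M, P), ...).
  for (std::size_t row = 0; row < M; ++row) {
    for (std::size_t col = 0; col < P; ++col) {
      float sum = 0.0f;
      // Distinct name k so the reduction index does not shadow row or col.
      for (std::size_t k = 0; k < N; ++k)
        sum += a[row * N + k] * b[k * P + col];
      c[row * P + col] = sum;
    }
  }
  return c;
}
```

With a all ones and each column of b equal to 1, 2, ..., N, every element of c should equal N(N+1)/2, which gives a quick check against the device result.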
Hi Wei-Chih,
We have not received a response from you to my previous reply. This thread will be transitioned to community support. If you have a new question, feel free to open a new thread to get support from Intel experts. Otherwise, community users will continue to help you on this thread. Thank you.
Thanks.
Regards,
Aik Eu