Hi support team,

I modified the matrix multiplication (mul) code from the oneAPI samples, and I want to run it on FPGA hardware. I am not sure how to convert these three parallel_for calls to single_task. Could you give me some suggestions?

Below is the code:

#if FPGA_EMULATOR
  // DPC++ extension: FPGA emulator selector on systems without FPGA card.
  ext::intel::fpga_emulator_selector d_selector;
#elif FPGA
  // DPC++ extension: FPGA selector on systems with FPGA card.
  ext::intel::fpga_selector d_selector;
#else
  // The default device selector will select the most performant device.
  default_selector d_selector;
#endif

  try {
    queue q(d_selector, dpc_common::exception_handler);

    cout << "Device: " << q.get_device().get_info<info::device::name>() << "\n";

    // Create 2D buffers for matrices, buffer c is bound with host memory c_back
    buffer<float, 2> a_buf(range(M, N));
    buffer<float, 2> b_buf(range(N, P));
    buffer c_buf(reinterpret_cast<float*>(c_back), range(M, P));

    cout << "Problem size: c(" << M << "," << P << ") = a(" << M << "," << N
         << ") * b(" << N << "," << P << ")\n";

    // Using three command groups to illustrate execution order. The use of
    // first two command groups for initializing matrices is not the most
    // efficient way. It just demonstrates the implicit multiple command group
    // execution ordering.

    // Submit command group to queue to initialize matrix a
    // start the clock
    // dpc_common::TimeInterval kernel_runtime;
    dpc_common::TimeInterval kernel_e_a_runtime;

    auto e_a = q.submit([&](auto& h) {
      // Get write only access to the buffer on a device.
      accessor a(a_buf, h, write_only);

      // Execute kernel.
      h.parallel_for(range(M, N), [=](auto index) {
        // Each element of matrix a is 1.
        a[index] = 1.0f;
      });
    });

    double elapsed_e_a_time = kernel_e_a_runtime.Elapsed();

    dpc_common::TimeInterval kernel_e_b_runtime;

    // Submit command group to queue to initialize matrix b
    auto e_b = q.submit([&](auto& h) {
      // Get write only access to the buffer on a device
      accessor b(b_buf, h, write_only);

      // Execute kernel.
      h.parallel_for(range(N, P), [=](auto index) {
        // Each column of b is the sequence 1,2,...,N
        b[index] = index[0] + 1.0f;
      });
    });

    double elapsed_e_b_time = kernel_e_b_runtime.Elapsed();

    dpc_common::TimeInterval kernel_e_c_runtime;

    // Submit command group to queue to multiply matrices: c = a * b
    auto e_c = q.submit([&](auto& h) {
      // Read from a and b, write to c
      accessor a(a_buf, h, read_only);
      accessor b(b_buf, h, read_only);
      accessor c(c_buf, h, write_only);

      int width_a = a_buf.get_range()[1];

      // Execute kernel.
      h.parallel_for(range(M, P), [=](auto index) {
      // h.single_task<c_calc>([=]() [[intel::kernel_args_restrict]] {
      //   for (int i = 0; i < M; i++) {
      // #pragma unroll 1
      //     for (int j = 0; j < P; j++) {
        // Get global position in Y direction.
        int row = index[0];
        // int row = j;

        // Get global position in X direction.
        int col = index[1];
        // int col = i;

        float sum = 0.0f;

        // Compute the result of one element of c
        // #pragma unroll 1
        for (int i = 0; i < width_a; i++) {
          sum += a[row][i] * b[i][col];
        }

        c[index] = sum;
        // c[i][j] = sum;
      //     }
      //   }
      });
    });

Hi Wei-Chih,

Sorry for the late reply.

I may not fully understand the code you are working with.

Can you provide more description of the operation you are trying to implement?

I think you can split your task using a normal for loop instead of parallel_for.

A brief example is below:

// Computes the product of two square matrices.
void matrix_multiply(double** m1, double** m2, double** result, size_t size)
{
   for (size_t i = 0; i < size; i++)
   {
      for (size_t j = 0; j < size; j++)
      {
         double temp = 0;
         for (size_t k = 0; k < size; k++)
         {
            temp += m1[i][k] * m2[k][j];
         }
         result[i][j] = temp;
      }
   }
}

// Computes the product of two square matrices in parallel.
// Note: this parallel_for takes (first, last, function), which differs from
// the range-based handler::parallel_for used in the SYCL code in the question.
void parallel_matrix_multiply(double** m1, double** m2, double** result, size_t size)
{
   parallel_for (size_t(0), size, [&](size_t i)
   {
      for (size_t j = 0; j < size; j++)
      {
         double temp = 0;
         for (size_t k = 0; k < size; k++)
         {
            temp += m1[i][k] * m2[k][j];
         }
         result[i][j] = temp;
      }
   });
}
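Applied to the two initialization kernels from the question, the same loop-nest idea could look roughly like the sketch below. It is plain host C++ so that it runs without a SYCL toolchain; the helper names init_a and init_b and the flat row-major std::vector storage are assumptions made for illustration, not part of the oneAPI sample. In a SYCL single_task, each loop nest would sit inside the kernel lambda and write through the accessor instead of the vector.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: fill an M x N row-major matrix with 1.0f, as the first
// kernel in the question does. One pair of loops replaces the 2D index that
// parallel_for(range(M, N), ...) would supply.
void init_a(std::vector<float>& a, std::size_t M, std::size_t N) {
  assert(a.size() == M * N);
  for (std::size_t i = 0; i < M; ++i) {
    for (std::size_t j = 0; j < N; ++j) {
      a[i * N + j] = 1.0f;  // parallel_for version: a[index] = 1.0f
    }
  }
}

// Hypothetical helper: each column of the N x P matrix b is 1, 2, ..., N,
// as in the second kernel. The row counter i plays the role of index[0].
void init_b(std::vector<float>& b, std::size_t N, std::size_t P) {
  assert(b.size() == N * P);
  for (std::size_t i = 0; i < N; ++i) {
    for (std::size_t j = 0; j < P; ++j) {
      b[i * P + j] = static_cast<float>(i) + 1.0f;  // index[0] + 1.0f
    }
  }
}
```

The loop bounds are the same values that were passed to range(...) in the parallel_for calls, so each element is still written exactly once.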

This document may help you decide how to structure your code:

https://www.colfax-intl.com/downloads/oneAPI_module04_DPCplusplusFundamentals2of2.pdf

Thanks.

Regards,

Aik Eu

Hi Wei-Chih,

May I know whether the link in my previous comment helped answer your question?

Thanks.

Regards,

Aik Eu

Hi Aikeu,

When I follow the linked tutorial to modify the code, I get the wrong result. Could you show me how to set up the for loops inside the single_task function for this case ( h.parallel_for(range(M, P), [=](auto index) { )?
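One possible reason for the wrong result, judging only from the commented-out attempt in the question: the outer loop variable i runs over M but is assigned to col, and the inner variable j runs over P but is assigned to row, while the store still goes to c[i][j]. When M and P differ, that reads a outside its M rows. A minimal sketch of a loop nest with a consistent mapping follows. It is plain host C++ so it can be checked without an FPGA or SYCL compiler; the function name multiply_single_task_style and the flat row-major storage are assumptions for illustration. In SYCL, the loop nest would form the body of the lambda passed to h.single_task, reading and writing through the accessors.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: c (M x P) = a (M x N) * b (N x P), all row-major.
// The loop nest mirrors what the body of a single_task kernel could contain.
void multiply_single_task_style(const std::vector<float>& a,
                                const std::vector<float>& b,
                                std::vector<float>& c,
                                std::size_t M, std::size_t N, std::size_t P) {
  assert(a.size() == M * N && b.size() == N * P && c.size() == M * P);
  for (std::size_t row = 0; row < M; ++row) {    // replaces index[0]
    for (std::size_t col = 0; col < P; ++col) {  // replaces index[1]
      float sum = 0.0f;
      // Dot product of one row of a with one column of b.
      for (std::size_t k = 0; k < N; ++k) {
        sum += a[row * N + k] * b[k * P + col];
      }
      // The same (row, col) pair is used for the reads and for the write.
      c[row * P + col] = sum;
    }
  }
}
```

The key point is that the variable bounded by M indexes the rows of a and c, and the variable bounded by P indexes the columns of b and c. Any FPGA-specific tuning (for example unroll pragmas or kernel attributes) is left out of this sketch.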

Hi Wei-Chih,

We have not received a response from you to my previous reply, so this thread will be transitioned to community support. If you have a new question, feel free to open a new thread to get support from Intel experts. Otherwise, community users will continue to help you on this thread. Thank you.

Thanks.

Regards,

Aik Eu
