Hi support team,
I am modifying code to run on FPGA hardware on the oneAPI DevCloud. The original sample code targets GPU/CPU and uses a parallel_for lambda for the kernel, but for FPGA optimization I think it should be changed to a single_task lambda. However, I have no idea how to pass an nd_item<1> to the h.single_task lambda.
I need the nd_item<1> class, but it seems h.single_task cannot take that parameter. How can I modify this? I need the values of get_local_id(0), get_group(0), and get_local_range(0).
Fragments of my code:
h.parallel_for<class bude_kernel>(nd_range<1>(global, wgSize), [=](nd_item<1> item) {
  const size_t lid = item.get_local_id(0);
  const size_t gid = item.get_group(0);
  const size_t lrange = item.get_local_range(0);
  float etot[NUM_TD_PER_THREAD];
  cl::sycl::float3 lpos[NUM_TD_PER_THREAD];
  cl::sycl::float4 transform[NUM_TD_PER_THREAD][3];
  size_t ix = gid * lrange * NUM_TD_PER_THREAD + lid;
  ix = ix < nposes ? ix : nposes - NUM_TD_PER_THREAD;
  ...
Hi,
Thanks for reaching out to us.
We are working on your issue. We will get back to you soon.
Thanks & Regards,
Noorjahan.
Hi,
You do not need to pass an nd_item<1> to a single work-item kernel, i.e., an h.single_task lambda. In this programming model there is only one work-item within one work-group running on the FPGA, so you do not need get_local_id(0), get_group(0), or get_local_range(0); these values are all 0. Instead, use "for" loops (or any other loop) to process your data inside the lambda. You will have something like this:
h.single_task<class bude_kernel>([=]() [[intel::kernel_args_restrict]] {
  for (size_t id = 0; id < global; id++) { // I'm assuming that "global" is the size of the problem.
    size_t ix = ...; // It is up to you to compute this value from your "id".
    ...
  }
});
You have only one "for" loop because the nd_item dimension was one. If the nd_item dimension were N, you would have N nested loops. Nested loops can also be fused into a single loop.
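As an illustration of the N-dimensional case, here is a minimal, self-contained sketch (not code from this thread; the sizes, buffer names, and the doubling operation are all made up) of a 2D problem written as a single_task kernel with two nested loops:
// Hypothetical sketch only: a 2D problem expressed as one single_task kernel
// with two nested loops instead of an nd_range<2> parallel_for.
#include <CL/sycl.hpp>
#include <vector>
using namespace cl::sycl;

int main() {
  constexpr size_t rows = 64, cols = 64;
  std::vector<float> in(rows * cols, 1.0f), out(rows * cols, 0.0f);

  queue q; // on the DevCloud you would select the FPGA (or FPGA emulator) device here
  {
    buffer<float, 1> in_buf(in.data(), range<1>(rows * cols));
    buffer<float, 1> out_buf(out.data(), range<1>(rows * cols));

    q.submit([&](handler &h) {
      auto in_acc = in_buf.get_access<access::mode::read>(h);
      auto out_acc = out_buf.get_access<access::mode::write>(h);

      h.single_task<class example_2d_kernel>([=]() [[intel::kernel_args_restrict]] {
        for (size_t row = 0; row < rows; row++) {   // replaces the first nd_item dimension
          for (size_t col = 0; col < cols; col++) { // replaces the second nd_item dimension
            const size_t ix = row * cols + col;     // linearized index
            out_acc[ix] = in_acc[ix] * 2.0f;        // example per-"work-item" work
          }
        }
      });
    });
  } // buffers go out of scope here, so results are copied back into "out"
  return 0;
}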
I hope this clarifies things for you!
Regards,
Daouda
Hi Daouda,
Thanks for your reply. I will try it based on your suggestion; if I have further questions, I will ask you.
Thanks a lot
Hi @Wei-Chih,
Greetings, just checking in to see if there are any further doubts regarding this matter.
Hope we have clarified your doubts.
Best Wishes
BB
Hi support team / Daouda,
Following your suggestion to change the kernel to a single_task function, I now get the result I expect. However, if I make the for-loop bound too large (size_t id = 0; id < global; id++, where global is 65536, or 65536/4, 65536/8, ...), the FPGA hardware (Arria 10) runs for a very long time and even fails for some of the larger bounds. Do you have any suggestions on how to optimize it? Currently I only add unroll (with factor 1) on the loops. Could you suggest other optimization methods I can use in my case? I need to decrease the kernel execution time when running on FPGA hardware.
Below is my modified code (I use id < 4 for now):
h.single_task<class bude_kernel>([=]() [[intel::kernel_args_restrict]] {
  #pragma unroll 1
  for (size_t id = 0; id < 4; id++) {
    const size_t lid = id;
    const size_t gid = id;
    const size_t lrange = 1;
    float etot[NUM_TD_PER_THREAD];
    cl::sycl::float3 lpos[NUM_TD_PER_THREAD];
    cl::sycl::float4 transform[NUM_TD_PER_THREAD][3];
    size_t ix = gid * lrange * NUM_TD_PER_THREAD + lid;
    ix = ix < nposes ? ix : nposes - NUM_TD_PER_THREAD;
    #pragma unroll 1
    for (int i = lid; i < ntypes; i += lrange) local_forcefield[i] = forcefield[i];
    #pragma unroll 1
    for (size_t i = 0; i < NUM_TD_PER_THREAD; i++) {
      size_t index = ix + i * lrange;
      ...
Hi Wei-Chih,
You may need to use the oneAPI FPGA Optimization Guide to optimize the execution time of your kernel. You can find the guide here:
Use the report file from the DPC++ compiler and make sure that the initiation interval (II) of your loops is not high (the ideal value is 1). The report also indicates the main bottlenecks of your kernel so that you can optimize them.
Then you may increase your unrolling factor to express more parallelism instead of using an unrolling factor of one. Note that the loop trip count should be divisible by the unrolling factor for good performance; otherwise the II may be greater than 1. For example, you can fully unroll your "range" and "NUM_TD_PER_THREAD" loops if their trip counts are not too large, and partially unroll the "global" loop when its bound is high. I don't know what the execution time of your kernel was, but it should not be too high even for global = 65536.
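As a rough, hedged sketch (assuming NUM_TD_PER_THREAD is a small compile-time constant as in your code; N_POSES and the unroll factor are illustrative names, and the loop body is elided), partial and full unrolling could look like this:
// Sketch only: partially unroll the large outer loop with a factor that divides
// its trip count (8 divides 65536), and fully unroll the small inner loop.
// N_POSES is an illustrative name, not one from the original code.
constexpr size_t N_POSES = 65536;

h.single_task<class unroll_sketch>([=]() [[intel::kernel_args_restrict]] {
  #pragma unroll 8                       // partial unroll of the big loop
  for (size_t id = 0; id < N_POSES; id++) {
    float etot[NUM_TD_PER_THREAD];

    #pragma unroll                       // full unroll: fixed, small trip count
    for (size_t i = 0; i < NUM_TD_PER_THREAD; i++) {
      etot[i] = 0.0f;
    }
    // ... rest of the per-pose computation ...
  }
});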
When you say "... even fail in some large condition case.", do you mean that you get bad results, or that the execution does not complete?
Best regards,
Daouda
Hi Daouda,
I tried increasing the unroll factor (4, 8, 16, ...), but it causes a RAM utilization problem, as shown below. Do you have any recommendation for this? Another question: if I do not set any unroll factor (just #pragma unroll), will the compiler set it automatically?
aoc: Warning RAM Utilization is at 128%
Hi Wei-Chih,
Can you tell me which loop you are applying the unrolling to?
You should also be careful with the memory access pattern of your application. Use the optimization guide to optimize your memory accesses if needed and decrease the BRAM usage. There are multiple LSU implementations depending on your memory access pattern.
If you use "#pragma unroll" without setting an unroll factor, your loop will be fully unrolled. For partial unrolling, you need to specify the unroll factor.
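To make the difference concrete, here is a small illustrative fragment (reusing NUM_TD_PER_THREAD and etot from the earlier snippet; the body is only an example):
// Illustrative only. With no factor, the loop is fully unrolled (the trip count
// must be known at compile time); with a factor, it is partially unrolled.
#pragma unroll                  // full unroll: NUM_TD_PER_THREAD copies of the body
for (size_t i = 0; i < NUM_TD_PER_THREAD; i++)
  etot[i] = 0.0f;

#pragma unroll 2                // partial unroll: two copies of the body per iteration
for (size_t i = 0; i < NUM_TD_PER_THREAD; i++)
  etot[i] = 0.0f;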
Best regards,
Daouda
Hi Daouda,
Below is my current single_task function. Could you tell me whether there is any issue with how I added the unroll optimizations?
OK, I will study the optimization guide. Thanks a lot.
h.single_task<class bude_kernel>([=]() [[intel::kernel_args_restrict]] {
  #pragma unroll 2
  [[intelfpga::initiation_interval(1)]]
  for (size_t id = 0; id < DEFAULT_WGSIZE; id++) {
    const size_t lid = id;
    const size_t gid = id;
    const size_t lrange = 1;
    float etot[NUM_TD_PER_THREAD];
    cl::sycl::float3 lpos[NUM_TD_PER_THREAD];
    cl::sycl::float4 transform[NUM_TD_PER_THREAD][3];
    size_t ix = gid * lrange * NUM_TD_PER_THREAD + lid;
    ix = ix < nposes ? ix : nposes - NUM_TD_PER_THREAD;
    #pragma unroll 2
    [[intelfpga::initiation_interval(1)]]
    for (int i = lid; i < ntypes; i += lrange) local_forcefield[i] = forcefield[i];
    #pragma unroll 2
    [[intelfpga::initiation_interval(1)]]
    for (size_t i = 0; i < NUM_TD_PER_THREAD; i++) {
      size_t index = ix + i * lrange;
      const float sx = cl::sycl::sin(transforms_0[index]);
      const float cx = cl::sycl::cos(transforms_0[index]);
      const float sy = cl::sycl::sin(transforms_1[index]);
      const float cy = cl::sycl::cos(transforms_1[index]);
      const float sz = cl::sycl::sin(transforms_2[index]);
      const float cz = cl::sycl::cos(transforms_2[index]);
      transform[i][0].x() = cy * cz;
      transform[i][0].y() = sx * sy * cz - cx * sz;
      transform[i][0].z() = cx * sy * cz + sx * sz;
      transform[i][0].w() = transforms_3[index];
      transform[i][1].x() = cy * sz;
      transform[i][1].y() = sx * sy * sz + cx * cz;
      transform[i][1].z() = cx * sy * sz - sx * cz;
      transform[i][1].w() = transforms_4[index];
      transform[i][2].x() = -sy;
      transform[i][2].y() = sx * cy;
      transform[i][2].z() = cx * cy;
      transform[i][2].w() = transforms_5[index];
      etot[i] = ZERO;
    }
    // item.barrier(access::fence_space::local_space);
    // Loop over ligand atoms
    size_t il = 0;
    do {
      // Load ligand atom data
      const Atom l_atom = ligand_molecule[il];
      const FFParams l_params = local_forcefield[l_atom.type];
      const bool lhphb_ltz = l_params.hphb < ZERO;
      const bool lhphb_gtz = l_params.hphb > ZERO;
      const cl::sycl::float4 linitpos(l_atom.x, l_atom.y, l_atom.z, ONE);
      #pragma unroll 2
      [[intelfpga::initiation_interval(1)]]
      for (size_t i = 0; i < NUM_TD_PER_THREAD; i++) {
        lpos[i].x() = transform[i][0].w() +
                      linitpos.x() * transform[i][0].x() +
                      linitpos.y() * transform[i][0].y() +
                      linitpos.z() * transform[i][0].z();
        lpos[i].y() = transform[i][1].w() +
                      linitpos.x() * transform[i][1].x() +
                      linitpos.y() * transform[i][1].y() +
                      linitpos.z() * transform[i][1].z();
        lpos[i].z() = transform[i][2].w() +
                      linitpos.x() * transform[i][2].x() +
                      linitpos.y() * transform[i][2].y() +
                      linitpos.z() * transform[i][2].z();
      }
      size_t ip = 0;
      do {
        const Atom p_atom = protein_molecule[ip];
        const FFParams p_params = local_forcefield[p_atom.type];
        const float radij = p_params.radius + l_params.radius;
        const float r_radij = 1.f / (radij);
        const float elcdst = (p_params.hbtype == HBTYPE_F && l_params.hbtype == HBTYPE_F) ? FOUR : TWO;
        const float elcdst1 = (p_params.hbtype == HBTYPE_F && l_params.hbtype == HBTYPE_F) ? QUARTER : HALF;
        const bool type_E = ((p_params.hbtype == HBTYPE_E || l_params.hbtype == HBTYPE_E));
        const bool phphb_ltz = p_params.hphb < ZERO;
        const bool phphb_gtz = p_params.hphb > ZERO;
        const bool phphb_nz = p_params.hphb != ZERO;
        const float p_hphb = p_params.hphb * (phphb_ltz && lhphb_gtz ? -ONE : ONE);
        const float l_hphb = l_params.hphb * (phphb_gtz && lhphb_ltz ? -ONE : ONE);
        const float distdslv = (phphb_ltz ? (lhphb_ltz ? NPNPDIST : NPPDIST) : (lhphb_ltz ? NPPDIST : -FloatMax));
        const float r_distdslv = 1.f / (distdslv);
        const float chrg_init = l_params.elsc * p_params.elsc;
        const float dslv_init = p_hphb + l_hphb;
        #pragma unroll 2
        [[intelfpga::initiation_interval(1)]]
        for (size_t i = 0; i < NUM_TD_PER_THREAD; i++) {
          const float x = lpos[i].x() - p_atom.x;
          const float y = lpos[i].y() - p_atom.y;
          const float z = lpos[i].z() - p_atom.z;
          const float distij = cl::sycl::sqrt(x * x + y * y + z * z);
          const float distbb = distij - radij;
          const bool zone1 = (distbb < ZERO);
          etot[i] += (ONE - (distij * r_radij)) * (zone1 ? 2 * HARDNESS : ZERO);
          float chrg_e = chrg_init * ((zone1 ? 1 : (ONE - distbb * elcdst1)) * (distbb < elcdst ? 1 : ZERO));
          const float neg_chrg_e = -cl::sycl::fabs(chrg_e);
          chrg_e = type_E ? neg_chrg_e : chrg_e;
          etot[i] += chrg_e * CNSTNT;
          const float coeff = (ONE - (distbb * r_distdslv));
          float dslv_e = dslv_init * ((distbb < distdslv && phphb_nz) ? 1 : ZERO);
          dslv_e *= (zone1 ? 1 : coeff);
          etot[i] += dslv_e;
        }
      } while (++ip < natpro); // loop over protein atoms
    } while (++il < natlig); // loop over ligand atoms
    const size_t td_base = gid * lrange * NUM_TD_PER_THREAD + lid;
    if (td_base < nposes) {
      #pragma unroll 2
      [[intelfpga::initiation_interval(1)]]
      for (size_t i = 0; i < NUM_TD_PER_THREAD; i++) {
        etotals[td_base + i * lrange] = etot[i] * HALF;
      }
    }
    // }
  }
});
}
Hi @Wei-Chih,
Do keep us posted with any updates.
We are reviewing the code provided and will get back to you.
Looking forward to hearing from you.
Best Wishes
BB
Thanks for your help. I am waiting for your reply.
Hi @Wei-Chih,
Thank you for your patience. After some investigation of the code snippet you provided, below are some recommendations:
- Unrolling nested loops has drawbacks rather than optimizing them, as it leads to much longer compile times, which I think is a major cause here. I would therefore recommend converting the nested loops to a single loop if possible (see the sketch after this list): https://www.intel.com/content/www/us/en/develop/documentation/oneapi-fpga-optimization-guide/top/optimize-your-design/throughput-1/single-work-item-kernels/single-work-item-kernel-design-guidelines.html
- There are also data dependencies (e.g., on local_forcefield) between the for and do loops, which appear to run for a large number of iterations; this might be another cause.
- I would recommend simplifying the loops, as they are complex, deeply nested, and tightly coupled; unrolling the loops in this situation does not help much and instead has drawbacks.
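For example, a hypothetical sketch of flattening two nested loops into one flat loop (the bounds outer_n/inner_n and the accessors in_acc/out_acc are made-up names, not from the code above); the optimization guide also covers loop-related techniques such as loop coalescing:
// Hypothetical sketch: replace two nested loops with one flat loop and recover
// the original indices arithmetically. All names here are illustrative.
constexpr size_t outer_n = 1024;
constexpr size_t inner_n = 16;

h.single_task<class flattened_kernel>([=]() [[intel::kernel_args_restrict]] {
  for (size_t idx = 0; idx < outer_n * inner_n; idx++) {
    const size_t outer = idx / inner_n;  // recover the original outer index
    const size_t inner = idx % inner_n;  // recover the original inner index
    out_acc[idx] = in_acc[idx] + static_cast<float>(outer + inner);  // example work
  }
});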
Hope that clarifies.
Best Wishes
BB
Thanks, I will try to modify the code.
Besides unrolling, what else could I do to optimize this code? Any suggestions?
Hi @Wei-Chih,
As every application executes differently, it would be best to go through the optimization report or the profiler, which will give more accurate insights into what to change; you can also refer to the optimization guide.
As there are no further questions on this thread, it will be transitioned to community support for further help, and we will no longer monitor this thread.
Thank you for the questions; as always, it is a pleasure having you here.
Best Wishes
BB