Intel® High Level Design
Support for Intel® High Level Synthesis Compiler, DSP Builder, OneAPI for Intel® FPGAs, Intel® FPGA SDK for OpenCL™

How to obtain the values of get_local_id(0), get_group(0) and get_local_range(0) in single_task

Wei-Chih
Novice

Hi support team,

I am modifying code to run on FPGA hardware on the oneAPI DevCloud. The original sample code is for GPU/CPU and uses a parallel_for lambda for the kernel, but for FPGA optimization reasons I think it should be changed to a single_task lambda. However, I have no idea how to pass nd_item<1> to the h.single_task lambda.

I need to use the nd_item<1> class, but it seems h.single_task cannot take that parameter. How can I modify it? I need to get the values of get_local_id(0), get_group(0) and get_local_range(0).

 

Fragments of my code:

 

h.parallel_for<class bude_kernel>(nd_range<1>(global, wgSize), [=](nd_item<1> item) {
  const size_t lid = item.get_local_id(0);
  const size_t gid = item.get_group(0);
  const size_t lrange = item.get_local_range(0);

  float etot[NUM_TD_PER_THREAD];
  cl::sycl::float3 lpos[NUM_TD_PER_THREAD];
  cl::sycl::float4 transform[NUM_TD_PER_THREAD][3];

  size_t ix = gid * lrange * NUM_TD_PER_THREAD + lid;
  ix = ix < nposes ? ix : nposes - NUM_TD_PER_THREAD;

  // ...

NoorjahanSk_Intel
Moderator

Hi,


Thanks for reaching out to us.


We are working on your issue. We will get back to you soon.


Thanks & Regards,

Noorjahan.


DDIAKITE
New Contributor I

Hi,

You do not need to pass nd_item<1> to your kernel when you use the single work-item model, i.e., the h.single_task lambda. In this programming model only a single work-item in a single work-group runs on the FPGA, so you do not need get_local_id(0), get_group(0), or get_local_range(0): these values are constant (the IDs are 0 and the local range is 1). Instead, use a "for" loop (or any other loop) to process your data inside the lambda. You will have something like this:

h.single_task<class bude_kernel>([=]() [[intel::kernel_args_restrict]] {
  for (size_t id = 0; id < global; id++) {  // I'm assuming that "global" is the size of the problem.
    size_t ix = /* it is up to you to compute this value from your "id" */;
    // ...
  }
});

 

You have only one "for" loop because your nd_range was one-dimensional. If it were N-dimensional, you would have N nested loops, although nested loops can also be fused into a single loop.
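As an illustration of the nested-loops-versus-fused-loop point, here is a minimal sketch (not from the original code) for a hypothetical 2D nd_range; "rows", "cols" and "out" are assumed names:

// Minimal sketch, assuming hypothetical bounds "rows"/"cols" and an
// accessor "out" captured by the lambda. Only one of the two forms
// below would be used in practice; both compute the same thing.
h.single_task<class fused_loops_example>([=]() [[intel::kernel_args_restrict]] {
  // Form 1: one loop per nd_range dimension (the "N nested loops" case).
  for (size_t r = 0; r < rows; r++) {
    for (size_t c = 0; c < cols; c++) {
      out[r * cols + c] = static_cast<float>(r + c);
    }
  }

  // Form 2: the two loops fused into a single loop over rows * cols;
  // the 2D indices are recovered from the flat counter.
  for (size_t id = 0; id < rows * cols; id++) {
    const size_t r = id / cols;
    const size_t c = id % cols;
    out[r * cols + c] = static_cast<float>(r + c);
  }
});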

I hope this clarifies things for you!

Regards,
Daouda

Wei-Chih
Novice

Hi Daouda,

Thanks for your reply. I will try it based on your suggestion; if I have further questions I will ask you.

Thanks a lot.

BoonBengT_Intel
Moderator

Hi @Wei-Chih,


Greetings, just checking in to see if there are any further doubts regarding this matter.

Hope we have clarified your doubts.


Best Wishes

BB


Wei-Chih
Novice

Hi support team / Daouda,

Following your suggestion to modify the single_task function for the kernel, I can now get the result I expect. However, if I make the for-loop bound too large (size_t id = 0; id < global; id++, where global is 65536, or 65536/4, 65536/8, ...), the FPGA hardware (Arria 10) runs for a very long time and even fails in some of the larger cases. Do you have any suggestions on how to optimize it? Currently I only add unroll (with factor 1) on the loops. Could you suggest other optimization methods I can use in my case? I need to decrease the kernel execution time when running on FPGA hardware.

Below is my modified code (I use id < 4 for now):

 

h.single_task<class bude_kernel>([=]() [[intel::kernel_args_restrict]] {
  #pragma unroll 1
  for (size_t id = 0; id < 4; id++) {
    const size_t lid = id;
    const size_t gid = id;
    const size_t lrange = 1;

    float etot[NUM_TD_PER_THREAD];
    cl::sycl::float3 lpos[NUM_TD_PER_THREAD];
    cl::sycl::float4 transform[NUM_TD_PER_THREAD][3];

    size_t ix = gid * lrange * NUM_TD_PER_THREAD + lid;
    ix = ix < nposes ? ix : nposes - NUM_TD_PER_THREAD;

    #pragma unroll 1
    for (int i = lid; i < ntypes; i += lrange) local_forcefield[i] = forcefield[i];

    #pragma unroll 1
    for (size_t i = 0; i < NUM_TD_PER_THREAD; i++) {
      size_t index = ix + i * lrange;

      // ...

DDIAKITE
New Contributor I

Hi Wei-Chih,

You may need to use the oneAPI optimization guide to optimize the execution time of your kernel. You can find the guide here:

https://www.intel.com/content/www/us/en/develop/documentation/oneapi-fpga-optimization-guide/top.html

Use the report file from the DPC++ compiler and ensure that your loops' initiation interval (II) is not high (the ideal value is 1). The report also indicates the main bottlenecks of your kernel so that you can optimize them.

Then you may increase your unrolling factor to express more parallelism instead of using just one as the unrolling factor. Note that the loop trip count should be divisible by the unrolling factor for good performance; otherwise the II may be greater than 1. For example, you can fully unroll your "range" and "NUM_TD_PER_THREAD" loops if the trip count is not too large, and partially unroll the "global" loop for large trip counts. I don't know what the execution time of your kernel was, but it shouldn't be too high even for global = 65536.
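As a rough sketch of this advice (simplified, not the actual BUDE kernel; "results" is a hypothetical output accessor and the loop bodies are placeholders):

// Sketch only: full unroll on the small inner loops, partial (or no)
// unroll on the large outer loop. "global" and "NUM_TD_PER_THREAD" follow
// the naming used in this thread; "results" is an assumed accessor.
h.single_task<class unroll_sketch>([=]() [[intel::kernel_args_restrict]] {
  for (size_t id = 0; id < global; id++) {            // large loop: keep rolled
                                                      // or unroll by a small factor
    float etot[NUM_TD_PER_THREAD];

    #pragma unroll                                    // small fixed trip count:
    for (size_t i = 0; i < NUM_TD_PER_THREAD; i++) {  // fully unroll
      etot[i] = 0.0f;
    }

    #pragma unroll 4                                  // partial unroll: the factor
    for (size_t i = 0; i < NUM_TD_PER_THREAD; i++) {  // should divide the trip count
      results[id * NUM_TD_PER_THREAD + i] = etot[i];
    }
  }
});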

When you say "... even fail in some large condition case", do you mean that you get bad results, or that the execution does not complete?

Best regards,
Daouda

Wei-Chih
Novice

Hi Daouda,

I tried to increase the unroll factor (4, 8, 16, ...) but it causes a RAM utilization problem, as shown below. Do you have any recommendations? Another question: if I don't set any unroll factor (#pragma unroll), will the compiler set it automatically?

 

 

aoc: Warning RAM Utilization is at 128%

(screenshot of the compiler warning attached)

 

DDIAKITE
New Contributor I

Hi Wei-Chih,

Can you tell me on which loop you are applying the unrolling?

You should also be careful with the memory access pattern of your application. Use the optimization guide to optimize your memory accesses if needed and decrease the BRAM usage. There are multiple LSU implementations depending on your memory access pattern.

If you use "#pragma unroll" without setting any unroll factor, your loop will be fully unrolled. For partial unrolling, you need to specify the unroll factor.
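For reference, a minimal sketch of the two forms described above ("acc" and the loop bodies are placeholders):

#pragma unroll        // no factor given: the compiler fully unrolls the loop
for (int i = 0; i < 8; i++) {
  acc[i] += 1.0f;     // placeholder body
}

#pragma unroll 2      // explicit factor: the loop is partially unrolled by 2
for (int i = 0; i < 8; i++) {
  acc[i] += 1.0f;     // placeholder body
}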

Best regards,
Daouda

Wei-Chih
Novice

Hi Daouda,

OK, I will study the optimization guide. Thanks a lot.

Below is my current single_task function. Could you tell me whether there is any issue with how I am adding the unroll optimization?

h.single_task<class bude_kernel>([=]() [[intel::kernel_args_restrict]] {
  #pragma unroll 2
  [[intelfpga::initiation_interval(1)]]
  for (size_t id = 0; id < DEFAULT_WGSIZE; id++) {

    const size_t lid = id;
    const size_t gid = id;
    const size_t lrange = 1;

    float etot[NUM_TD_PER_THREAD];
    cl::sycl::float3 lpos[NUM_TD_PER_THREAD];
    cl::sycl::float4 transform[NUM_TD_PER_THREAD][3];

    size_t ix = gid * lrange * NUM_TD_PER_THREAD + lid;
    ix = ix < nposes ? ix : nposes - NUM_TD_PER_THREAD;

    #pragma unroll 2
    [[intelfpga::initiation_interval(1)]]
    for (int i = lid; i < ntypes; i += lrange) local_forcefield[i] = forcefield[i];

    #pragma unroll 2
    [[intelfpga::initiation_interval(1)]]
    for (size_t i = 0; i < NUM_TD_PER_THREAD; i++) {
      size_t index = ix + i * lrange;

      const float sx = cl::sycl::sin(transforms_0[index]);
      const float cx = cl::sycl::cos(transforms_0[index]);
      const float sy = cl::sycl::sin(transforms_1[index]);
      const float cy = cl::sycl::cos(transforms_1[index]);
      const float sz = cl::sycl::sin(transforms_2[index]);
      const float cz = cl::sycl::cos(transforms_2[index]);

      transform[i][0].x() = cy * cz;
      transform[i][0].y() = sx * sy * cz - cx * sz;
      transform[i][0].z() = cx * sy * cz + sx * sz;
      transform[i][0].w() = transforms_3[index];
      transform[i][1].x() = cy * sz;
      transform[i][1].y() = sx * sy * sz + cx * cz;
      transform[i][1].z() = cx * sy * sz - sx * cz;
      transform[i][1].w() = transforms_4[index];
      transform[i][2].x() = -sy;
      transform[i][2].y() = sx * cy;
      transform[i][2].z() = cx * cy;
      transform[i][2].w() = transforms_5[index];

      etot[i] = ZERO;
    }

    // item.barrier(access::fence_space::local_space);

    // Loop over ligand atoms
    size_t il = 0;
    do {
      // Load ligand atom data
      const Atom l_atom = ligand_molecule[il];
      const FFParams l_params = local_forcefield[l_atom.type];
      const bool lhphb_ltz = l_params.hphb < ZERO;
      const bool lhphb_gtz = l_params.hphb > ZERO;

      const cl::sycl::float4 linitpos(l_atom.x, l_atom.y, l_atom.z, ONE);

      #pragma unroll 2
      [[intelfpga::initiation_interval(1)]]
      for (size_t i = 0; i < NUM_TD_PER_THREAD; i++) {
        lpos[i].x() = transform[i][0].w() +
                      linitpos.x() * transform[i][0].x() +
                      linitpos.y() * transform[i][0].y() +
                      linitpos.z() * transform[i][0].z();
        lpos[i].y() = transform[i][1].w() +
                      linitpos.x() * transform[i][1].x() +
                      linitpos.y() * transform[i][1].y() +
                      linitpos.z() * transform[i][1].z();
        lpos[i].z() = transform[i][2].w() +
                      linitpos.x() * transform[i][2].x() +
                      linitpos.y() * transform[i][2].y() +
                      linitpos.z() * transform[i][2].z();
      }

      size_t ip = 0;
      do {
        const Atom p_atom = protein_molecule[ip];
        const FFParams p_params = local_forcefield[p_atom.type];

        const float radij = p_params.radius + l_params.radius;
        const float r_radij = 1.f / (radij);

        const float elcdst = (p_params.hbtype == HBTYPE_F && l_params.hbtype == HBTYPE_F) ? FOUR : TWO;
        const float elcdst1 = (p_params.hbtype == HBTYPE_F && l_params.hbtype == HBTYPE_F) ? QUARTER : HALF;
        const bool type_E = ((p_params.hbtype == HBTYPE_E || l_params.hbtype == HBTYPE_E));

        const bool phphb_ltz = p_params.hphb < ZERO;
        const bool phphb_gtz = p_params.hphb > ZERO;
        const bool phphb_nz = p_params.hphb != ZERO;
        const float p_hphb = p_params.hphb * (phphb_ltz && lhphb_gtz ? -ONE : ONE);
        const float l_hphb = l_params.hphb * (phphb_gtz && lhphb_ltz ? -ONE : ONE);
        const float distdslv = (phphb_ltz ? (lhphb_ltz ? NPNPDIST : NPPDIST) : (lhphb_ltz ? NPPDIST : -FloatMax));
        const float r_distdslv = 1.f / (distdslv);

        const float chrg_init = l_params.elsc * p_params.elsc;
        const float dslv_init = p_hphb + l_hphb;

        #pragma unroll 2
        [[intelfpga::initiation_interval(1)]]
        for (size_t i = 0; i < NUM_TD_PER_THREAD; i++) {

          const float x = lpos[i].x() - p_atom.x;
          const float y = lpos[i].y() - p_atom.y;
          const float z = lpos[i].z() - p_atom.z;

          const float distij = cl::sycl::sqrt(x * x + y * y + z * z);

          const float distbb = distij - radij;
          const bool zone1 = (distbb < ZERO);

          etot[i] += (ONE - (distij * r_radij)) * (zone1 ? 2 * HARDNESS : ZERO);

          float chrg_e = chrg_init * ((zone1 ? 1 : (ONE - distbb * elcdst1)) * (distbb < elcdst ? 1 : ZERO));
          const float neg_chrg_e = -cl::sycl::fabs(chrg_e);
          chrg_e = type_E ? neg_chrg_e : chrg_e;
          etot[i] += chrg_e * CNSTNT;

          const float coeff = (ONE - (distbb * r_distdslv));
          float dslv_e = dslv_init * ((distbb < distdslv && phphb_nz) ? 1 : ZERO);
          dslv_e *= (zone1 ? 1 : coeff);
          etot[i] += dslv_e;
        }
      } while (++ip < natpro); // loop over protein atoms
    } while (++il < natlig); // loop over ligand atoms

    const size_t td_base = gid * lrange * NUM_TD_PER_THREAD + lid;

    if (td_base < nposes) {
      #pragma unroll 2
      [[intelfpga::initiation_interval(1)]]
      for (size_t i = 0; i < NUM_TD_PER_THREAD; i++) {
        etotals[td_base + i * lrange] = etot[i] * HALF;
      }
    }

    // }
  }
});
}

BoonBengT_Intel
Moderator

Hi @Wei-Chih,


Do keep us posted with any updates.

We are reviewing the code provided and will get back to you.

Looking forward to hearing from you.


Best Wishes

BB


Wei-Chih
Novice

Thanks for your help. I am waiting for your reply.

BoonBengT_Intel
Moderator

Hi @Wei-Chih,


Thank you for your patience. After some investigation of the mentioned code snippet, here are some recommendations:

- Unrolling nested loops has drawbacks rather than optimizing the design, as it causes longer compile times, which I think is a major cause here; hence we would recommend converting to a single loop if possible. (https://www.intel.com/content/www/us/en/develop/documentation/oneapi-fpga-optimization-guide/top/optimize-your-design/throughput-1/single-work-item-kernels/single-work-item-kernel-design-guidelines.html)

- There are also some data dependencies (i.e. on local_forcefield) between the for and do loops, which appear to iterate a large number of times; this might be another cause (see the sketch after these points).

- We would recommend simplifying the loops, as they are complex, with multiple tightly coupled nested loops; unrolling the loops in this situation would not help much and instead has drawbacks.
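Regarding the local_forcefield dependency mentioned above, here is a minimal sketch of one possible restructuring: copy the forcefield to an on-chip array once, before the large pose loop. This assumes local_forcefield can be declared as a private array with a compile-time bound; MAX_TYPES is an assumed constant, and the pose computation is elided.

// Sketch only: copy the forcefield to on-chip memory once, before the
// large pose loop, so the loop body no longer re-reads global memory on
// every iteration. MAX_TYPES is an assumed compile-time upper bound on
// ntypes; the pose computation itself is elided.
h.single_task<class hoisted_copy_sketch>([=]() [[intel::kernel_args_restrict]] {
  FFParams local_forcefield[MAX_TYPES];

  for (int i = 0; i < ntypes; i++) {     // done once, outside the big loop
    local_forcefield[i] = forcefield[i];
  }

  for (size_t id = 0; id < DEFAULT_WGSIZE; id++) {
    // ... pose computation as before, reading only local_forcefield ...
  }
});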


Hope that clarifies.


Best Wishes

BB


Wei-Chih
Novice

Thanks, I will try to modify the code.

Besides unrolling, what else could I do to optimize this code? Any suggestions?

BoonBengT_Intel
Moderator

Hi @Wei-Chih,


As every application is coded and executes differently, it would be best to go through the optimization report or profiler, which will give more accurate insights into what to change, and to refer to the optimization guide.


With no further clarification needed on this thread, it will be transitioned to community support for further help on any remaining doubts, and we will no longer monitor this thread.

Thank you for the questions and, as always, it is a pleasure having you here.


Best Wishes

BB

