Intel® High Level Design
Support for Intel® High Level Synthesis Compiler, DSP Builder, OneAPI for Intel® FPGAs, Intel® FPGA SDK for OpenCL™
702 Discussions

Unable to coalesce kernel execution to memory reads in Arria 10 FPGA using opencl 17.1 bsp

MAstr
Novice
1,676 Views

Hi,

I have a benchmarking program based on the intel example for read and kernel coalesced execution: 

https://www.intel.com/content/www/us/en/programmable/documentation/mwh1391807516407.html#ran1551293458343

 

I know this example was added to the manual after version 17.1 but its core is just basic opencl event and kernel queue usage.

As seen in  @buffers_timeline.PNG the first execution does get coalesced (kernel execution and read buffer at the same time), but all the subsequent reads are not coalesced. The 3rd execution of the kernel is waiting for the second buffer somehow.

 

Any ideas?

 

I only need a way of getting data out. Is this a problem from version 17.1 ?  As you can see the data generation kernel "data_in" gets stalled (I supose because the "daq" kernel is slowing it down). 

@BW_occ_stall.PNG

 

 

So I coded the following example host program:

@hostcode.PNG

...

 

cl_event kernelEvent[100], readEvent[100];
const size_t global_work_size[1] = {1};

clEnqueueNDRangeKernel(kernelQ2[0], kernel[1], 1,NULL,global_work_size,NULL, 0, NULL, NULL);
const double start_time = getCurrentTimestamp();

//start daq loop. DAQ_ITERATIONS buffers in host to be filled by the device
printf("\n Starting DAQ %d iterations... \n", num_iterations);

clEnqueueNDRangeKernel(kernelQ[0], kernel[0], 1,NULL,global_work_size,NULL, 0, NULL, &kernelEvent[0]);
clEnqueueReadBuffer(readQ[0], d_output_buff[0], CL_FALSE, 0, d_buffer_num_elem*sizeof(cl_short8), h_output_buf[0], 1, &kernelEvent[0], &readEvent[0]);
clEnqueueNDRangeKernel(kernelQ[0], kernel[0], 1,NULL,global_work_size,NULL, 0, NULL, &kernelEvent[1]);
clEnqueueReadBuffer(readQ[0], d_output_buff[1], CL_FALSE, 0, d_buffer_num_elem*sizeof(cl_short8), h_output_buf[1], 1, &kernelEvent[1], &readEvent[1]);
// clFlush(readQ[0]);
clFlush(kernelQ[0]);
clFlush(readQ[0]);
for (int i=2; i<num_iterations; i++) {
printf("\nIteration %d, buffer %d: \n", i, i%2);
status = clSetKernelArg(kernel[0], i%2, sizeof(cl_mem), &d_output_buff[i]);
checkError(status, "Failed to set argument %d", i%2);
clEnqueueNDRangeKernel(kernelQ[0], kernel[0], 1,NULL,global_work_size,NULL, 1, &readEvent[i-2], &kernelEvent[i]);
// clFlush(kernelQ[0]);
clEnqueueReadBuffer(readQ[0], d_output_buff[i], CL_FALSE, 0, d_buffer_num_elem*sizeof(cl_short8), h_output_buf[i], 1, &kernelEvent[i], &readEvent[i]);
// clFlush(readQ[0]);
}

clFlush(kernelQ[0]);
clFlush(readQ[0]);

// Wait for all kernels to finish.
clWaitForEvents(1,&readEvent[num_iterations-1]);

const double end_time = getCurrentTimestamp();
const double total_time = end_time - start_time;

 

...

 

0 Kudos
4 Replies
HRZ
Valued Contributor III
1,620 Views

Doesn't your "checkError" function happen to have a clFlush, or clFinish, or clWaitForEvents, etc. that might be serializing some of the enqueues? Other than that, your host code looks fine and the kernel and read operations should get completely overlapped.

 

P.S. I remember there was this limitation in Intel's runtime and BSP that it wasn't possible to do simultaneous reads and writes through PCI-E, despite PCI-E being a full-duplex medium, which was fixed in some 18.x version of the compiler, and that limitation could cause unnecessary serialization of simultaneous PIC-E reads and writes. However, I don't think that applies to your case since you are trying to overlap kernel execution with PCI-E read and there is no PCI-E write involved, and that fix requires a compatible BSP anyway, which I don't think you have or else you wouldn't be using v17.1. of the compiler.

0 Kudos
BoonBengT_Intel
Moderator
1,582 Views

Hi @MAstr,

 

Thank you for posting in Intel community forum and hope all is well.
By any chances did you managed to recommendation mention by HRZ, please do let us know if there is any further clarification we can help with.

 

Best Wishes
BB

0 Kudos
MAstr
Novice
1,558 Views

Thanks for the suggestion. I am not able to use the hardware at the moment, will get back to this topic once I can test the program again. Thank you! @BoonBengT_Intel 

0 Kudos
BoonBengT_Intel
Moderator
1,510 Views

Hi @MAstr,

 

Good to know your doubts has been clarified, with no further clarification on this thread, it will be transitioned to community support for further help on doubts in this thread.
Thanks for your questions and as always pleasure having you here.

 

Best Wishes
BB

0 Kudos
Reply