Intel® Quartus® Prime Software
Intel® Quartus® Prime Design Software, Design Entry, Synthesis, Simulation, Verification, Timing Analysis, System Design (Platform Designer, formerly Qsys)

OpenCL error on code compilation

FCive
New Contributor I

Dear all,

 

I would like to use the OpenCL SDK with a Terasic DE5-Net to deploy my algorithm (I am using Quartus 18.1 and OpenCL SDK 18.1). I successfully ran the examples provided by Intel FPGA for OpenCL and they work. I then tried to compile my own code, an integer iFFT based on the Cooley-Tukey algorithm. The OpenCL emulator gives correct results when I run the code on the x86 machine, but when I try to generate the .aocx for the DE5-Net, the compiler returns the following warnings and one error (after 5-6 hours of compilation):

 

...several warnings about auto-unrolled loops...

Compiler Warning: removing out-of-bounds accesses to xtmp1.i.i.i

Compiler Warning: removing out-of-bounds accesses to xtmp2.i.i.i

Compiler Warning: removing out-of-bounds accesses to xtmp3.i.i.i

intelFPGA/18.1/hld/board/terasic/de5net/tests/fft1Dint/device/fft1dint.cl:1424: Compiler Warning: Aggressive compiler optimization: removing unnecessary storage to local memory

Error: exception "std::bad_alloc" in PostBuildSteps [ifft_512_function ifft_512_function ]
Error: Verilog generator FAILED.

 

Could you please provide more information about this error? How can I work out which part of my code is responsible for this output (buffer allocation or something else)?
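For reference, here is the recursion structure of the Cooley-Tukey inverse FFT I am implementing, as a plain floating-point Python sketch (my actual kernel is an integer OpenCL version, so this is only illustrative of the algorithm, not of the failing code):

```python
import cmath

def _ifft_unscaled(x):
    # Radix-2 decimation-in-time Cooley-Tukey recursion.
    # The inverse transform uses the +j twiddle sign.
    n = len(x)
    if n == 1:
        return list(x)
    even = _ifft_unscaled(x[0::2])
    odd = _ifft_unscaled(x[1::2])
    out = [0] * n
    for k in range(n // 2):
        t = cmath.exp(2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def ifft(x):
    # Apply the usual 1/N scaling once, at the top level.
    n = len(x)
    return [v / n for v in _ifft_unscaled(x)]

# The flat spectrum [1, 1, 1, 1] inverse-transforms to the unit impulse.
print(ifft([1, 1, 1, 1]))
```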

 

Thank you for your support.

 

Best Regards,

 

Federico

24 Replies
FCive
New Contributor I

Ok, I can try to explain the issue to one of the Intel-affiliated moderators on the forums. Thank you for your suggestion.

 

About the processing time, I tried to read/write fewer bits per clock cycle. At the moment I read/write 1024 bits 8 times instead of 8192 bits in a single read/write. In this way, the kernel operating frequency is higher and reaches 242 MHz, according to the profiler. Unfortunately, the kernel execution time remains the same.
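A quick back-of-the-envelope sketch of why this might be (the DDR bandwidth figure is an assumption for illustration, not the DE5-Net specification): if the kernel is memory-bound, its lower bound on time is bytes moved divided by memory bandwidth, so splitting one wide access into several narrow ones moves the same data and gives the same bound regardless of kernel frequency, which would match the unchanged execution time I observe.

```python
# Hedged sketch: for a memory-bound kernel, time depends on bytes moved
# and memory bandwidth, not on the kernel clock frequency.

def transfer_time_s(total_bits, mem_bandwidth_bytes_per_s):
    """Lower bound on the time to move total_bits through memory."""
    return (total_bits / 8) / mem_bandwidth_bytes_per_s

DDR_BW = 12.8e9  # assumed ~12.8 GB/s for one DDR bank (illustrative only)

# One 8192-bit access vs. eight 1024-bit accesses: same total data,
# so the memory-bound lower bound is identical in both cases.
wide = transfer_time_s(8192, DDR_BW)
split = 8 * transfer_time_s(1024, DDR_BW)
print(f"wide: {wide:.2e} s, split: {split:.2e} s")
```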

About the PCIe bottleneck, do you mean that it is not worth doing the processing in hardware for only 8192 bits because of the PCIe transfer bottleneck?

Moreover, I would like to be sure that I have understood the flow from host to FPGA and vice versa correctly. As an example, consider the kernel read operation. The flow consists of:

  • PCIe writes the data to the DDR memory.
  • The kernel reads the data from DDR.

Thus, the bottlenecks are DDR access and PCIe writes. Is that correct?

 

Thank you for your support.

HRZ
Valued Contributor III

Depending on the size of your data and how much "data movement" you have in each step, your bottleneck could fall on PCI-E, DDR, or computation on the FPGA itself. If you transfer one byte of data through PCI-E to DDR, read it once from DDR, and process it once on the FPGA, your bottleneck is going to be the PCI-E transfer, since it has the least bandwidth. If you transfer once through PCI-E but read multiple times from DDR, processing the data only once per read, then the bottleneck will fall on the DDR transfers. Finally, if you transfer once through PCI-E and read once from DDR but do a lot of compute on the FPGA for every byte you read, then you might finally be able to actually saturate the compute capabilities of the FPGA.

In your case, since your data size is very small, unless you are processing the same data a couple of million times, your bottleneck is going to be the PCI-E transfer.
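As a numeric sketch of this reasoning (all bandwidth and throughput figures below are illustrative assumptions, not board specifications): each stage takes work divided by its bandwidth, and whichever stage takes the longest is the bottleneck.

```python
# Hedged sketch: the stage that takes the longest for the work it must do
# is the bottleneck. Bandwidth/throughput defaults are illustrative only.

def bottleneck(pcie_bytes, ddr_bytes, compute_ops,
               pcie_bw=8e9,            # assumed ~8 GB/s PCI-E
               ddr_bw=25.6e9,          # assumed ~25.6 GB/s DDR
               compute_ops_per_s=1e12  # assumed ~1 Tops/s on the FPGA
               ):
    times = {
        "PCI-E": pcie_bytes / pcie_bw,
        "DDR": ddr_bytes / ddr_bw,
        "compute": compute_ops / compute_ops_per_s,
    }
    return max(times, key=times.get)

# 8192 bits (1024 bytes) transferred once, read once, processed once:
# PCI-E dominates because it has the least bandwidth.
print(bottleneck(1024, 1024, 1024))

# Same transfer, but the data is read 100 times from DDR: DDR dominates.
print(bottleneck(1024, 100 * 1024, 1024))
```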

FCive
New Contributor I

Thank you for your explanation and happy new year!

 

I am trying to understand exactly where the bottleneck is, so I enabled the ACL_PROFILE_TIMER variable to see the memory transfers.

It seems that global memory accesses reach only 4.2% occupancy rather than 100%. Moreover, in the Kernel Execution panel there are empty spaces that represent global memory access time, if I understood the OpenCL Best Practices Guide correctly. I have also tried 128 transfers of 64 bits each from/to global memory, but I did not see any improvement. Please find attached the screenshots from the profiler.

 

[Attached: Screenshot from 2019-01-16 16-15-31.png, Screenshot from 2019-01-16 16-15-05.png]

How can I improve the occupancy?

 

Thank you!

 

HRZ
Valued Contributor III

The Best Practices Guide says:

 

"The Kernel Execution tab also displays information on memory transfers between the host and your devices"

 

I don't think "memory transfers" here refers to global memory, since I highly doubt the profiler implements separate counters for global memory traffic. Furthermore, memory and compute operations generally overlap in a kernel, and it would not be easy to separate them with run-time counters. The guide does not address this topic very clearly, so I am not sure how the information you have obtained from the profiler should be interpreted.

 

Regarding the low occupancy, it could simply be caused by your code performing memory operations less frequently than compute. "Best Practices Guide, Section 4.3.4.2. Low Occupancy Percentage" contains the official guidelines for improving occupancy.
