FCive
New Contributor I
1,559 Views

OpenCL error on code compilation

Dear all,

 

I would like to use the OpenCL SDK with a Terasic DE5-Net to deploy my algorithm (I am using Quartus 18.1 and OpenCL SDK 18.1). I successfully ran the examples provided by Intel FPGA for OpenCL. Now I am trying to compile my own code, which is an integer iFFT based on the Cooley-Tukey algorithm. The OpenCL emulator gives correct results when I run the code on the x86 machine. However, when I try to generate the .aocx for the DE5-Net, the compiler exits with the following warnings and one error (after 5 to 6 hours of compilation):

 

..several warnings about auto unrolled loop..

Compiler Warning: removing out-of-bounds accesses to xtmp1.i.i.i

Compiler Warning: removing out-of-bounds accesses to xtmp2.i.i.i

Compiler Warning: removing out-of-bounds accesses to xtmp3.i.i.i

intelFPGA/18.1/hld/board/terasic/de5net/tests/fft1Dint/device/fft1dint.cl:1424: Compiler Warning: Aggressive compiler optimization: removing unnecessary storage to local memory

Error: exception "std::bad_alloc" in PostBuildSteps [ifft_512_function ifft_512_function ]
Error: Verilog generator FAILED.

 

Could you please give me more information about this error? How can I find out which part of my code triggers it (buffer allocation or something else)?

 

Thank you for your support.

 

Best Regards,

 

Federico

24 Replies
HRZ
Valued Contributor II
131 Views

Please attach the quartus_sh_compile.log file from your compilation folder. How much memory does your compilation machine have? Also note that Terasic's latest BSP for the DE5-Net board targets Quartus v18.0; it is not guaranteed to work with 18.1.

FCive
New Contributor I
131 Views

Thank you for your reply.

 

I can't attach quartus_sh_compile.log since there is no such file in my folder. The compilation fails before generating report.html. The command below for the intermediate compilation also fails with the same output. I noticed that this command takes around 3 hours to complete, although the code does not exceed 1500 lines including comments (the Intel guide says this command should take a few minutes). I can attach the "fft1dint.log" file.

aoc -c fft1dint.cl -report

 

About the Terasic support, the BSP also works with 18.1. There were some issues with it, such as a kernel module bug that I fixed, but in the end it worked. I tested the BSP with both examples provided by Intel (hello_world and fft1d_offchip) and with another OpenCL algorithm that I had. All the tests passed.

 

Best Regards,

 

Federico

 

FCive
New Contributor I
131 Views

Sorry, I forgot to write that my PC has 32 GB of RAM

HRZ
Valued Contributor II
131 Views

3 hours is indeed excessive. The first stage of compilation should take only a few minutes for typical kernels. Based on the line numbers in the log, you seem to have a relatively large kernel. Furthermore, the compiler is auto-unrolling a lot of loops, which might not be what you want (especially from a resource-usage point of view), and it is also removing out-of-bounds accesses to some of your buffers, which indicates logical issues in your code. I think your kernel is probably too large and complex for the compiler to handle; the compiler is probably running into a memory leak somewhere, filling your memory, and crashing when it runs out.

 

My recommendation is to first make sure to modify your code to remove all the warnings in the log and then try to simplify your kernel. As it is, even if your kernel passes the first stage of the compilation, it will probably be too big to fit on the device.

FCive
New Contributor I
131 Views

Thank you HRZ.

I tried to reduce the number of loops and also the number of variables. Finally, I was able to compile without the previous error. Unfortunately, my design is still too big to fit into the FPGA. In particular, there are some variables that require too much logic. For example, I have the variable 'xtmp1', which is an array of 16 elements of 16 bits each, and in the report I can see the following comment:

 

Private Variable: - 'xtmp1' (fft1dint.cl:393):

  • Type: Register
    • 512 registers of width 16 and depth 1
    • 256 registers of width 16 and depth 7 (depth was increased by a factor of 7 due to a loop initiation interval of 7.)

 

Why does the compiler use such a large number of registers for the 'xtmp1' variable? How can I reduce the resources used by this variable?

 

Federico

 

HRZ
Valued Contributor II
131 Views

It is difficult to determine the reason for the high resource usage without seeing the source code. However, one thing I can say based on the report snippet you posted is that, due to the high initiation interval (II) of your loop, all buffers inside the loop need to be enlarged to accommodate it, resulting in even higher resource usage. One thing you should consider is optimizing your loop to reduce the initiation interval. If you are comfortable with posting your kernel code, or a simplified version of it that exhibits the same issue, I might be able to give you a more concrete answer.
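
To make the effect of the initiation interval concrete, here is a minimal sketch in plain C that adds up the register bits implied by the report rows quoted above. The struct and function names are hypothetical, and it assumes, as the report text states, that the depth of a register group scales with the loop II. The real flip-flop count is decided by the compiler and may differ.

```c
#include <stddef.h>

/* One row of the "Private Variable" report:
 * `count` registers of `width` bits, each replicated `depth` times.
 * (Hypothetical type, introduced only for this illustration.) */
struct reg_group {
    unsigned count;
    unsigned width;
    unsigned depth;
};

/* Total register bits implied by a set of report rows. */
unsigned long total_register_bits(const struct reg_group *groups, size_t n)
{
    unsigned long bits = 0;
    for (size_t i = 0; i < n; i++) {
        bits += (unsigned long)groups[i].count * groups[i].width * groups[i].depth;
    }
    return bits;
}
```

With the two rows from the report (512 x 16 x 1 and 256 x 16 x 7) this gives 36864 bits. If the loop reached an II of 1 and the second row's depth dropped to 1, the same rows would give 12288 bits.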

FCive
New Contributor I
131 Views

Thank you again for your support.

 

I do not have many loops with a high initiation interval, but some variables that require a lot of resources are inside a loop with a high II. Thus, I am trying to reduce the II of the loops and the ALUT and FF utilization. It is no problem for me to post the code, and I really appreciate your help. I have also attached the report, which can give you more details.

Moreover, I was able to get RAM usage down to 50%, but the ALUTs and FFs were always well beyond 100%. Could you please give me some suggestions on how to reduce ALUTs, FFs and also RAM? If possible, I would like to achieve the maximum performance in terms of execution time.

 

Federico

HRZ
Valued Contributor II
131 Views

It seems your code has been initially written to run on a standard CPU, hence not every construct used in the code is suitable for FPGA acceleration. There are lots of opportunities to improve your code:

 

  • Starting from the top function, I can see that you are processing 1024 data points while only writing 512 points back. This will result in a significant waste of computing cycles and also FPGA area. You should modify the code to only compute what you are going to write back to external memory and later read in the host.
  • You are unrolling the read and write loops in the top function, which is the correct thing to do to achieve compile-time access coalescing. However, the unroll factor is far too large (512). Supporting such large accesses results in significant waste of FPGA resources, especially Block RAMs. The external memory bandwidth of the FPGA will be saturated with one 512-bit read and one 512-bit write per loop iteration (in case of two DDR memory banks and an II of one). This effectively translates to an unroll factor of 16 for the "int" datatype. What you should consider doing is to reconstruct your code so that you are reading, processing and writing back 16 points per loop iteration. Assuming that the FPGA is overutilized, you can then reduce the number of parallel points to fit the design.
  • There is excessive use of function calls in your code. Every function call will be implemented individually as a circuit on the FPGA, resulting in excessive use of FPGA resources. This is similar to the case of a fully-unrolled loop. Furthermore, such calls prevent the compiler from correctly reporting the area usage per kernel line in the HTML report (as is evident in your report, where "No Source Line" occupies half the area), which in turn makes performance debugging very difficult. You should avoid function calls as much as possible and try to use loops over the functions instead, partially or fully unrolling the loops based on the available area.
  • The way the "ibfly4_16" is currently written is very inefficient on FPGAs (loop inside of a branch). Since the loops inside of both sides of the branch over "type" are the same, you should instead use one loop and move the branch inside of the loop. Furthermore, using the "out = (condition) ? in_1 : in_2;" construct rather than if/else could lead to area savings in some cases.
  • The main problem in your code seems to stem from the cpack_16_64 function which cannot achieve an II of one due to dependency on "x", resulting in the depth of all the buffers in the loop being increased by the II. Since the function is instantiated multiple times, it leads to huge area waste. I think the dependency exists since you are reading from the x[i+1] point and then overwriting it. If you can split "x" into two buffers and write to x1[i] and x2[i] instead, you might be able to avoid this problem. Of course this will require significant code rewriting which will likely propagate all the way to the top function.
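
The restructuring suggested for "ibfly4_16" (one loop with the branch moved inside it) can be sketched as follows. This is plain C rather than the actual kernel, which is not shown in the thread: the function names, the add/subtract bodies and the 16-element size are assumptions chosen only to illustrate the shape of the change.

```c
#include <stddef.h>

#define BFLY_POINTS 16 /* assumed size, for illustration only */

/* Original shape: the same loop appears on both sides of the branch. */
void combine_branch_outside(const short *a, const short *b, short *out, int type)
{
    if (type) {
        for (size_t i = 0; i < BFLY_POINTS; i++)
            out[i] = (short)(a[i] + b[i]);
    } else {
        for (size_t i = 0; i < BFLY_POINTS; i++)
            out[i] = (short)(a[i] - b[i]);
    }
}

/* Suggested shape: a single loop, with the selection done per element
 * through the "out = (condition) ? in_1 : in_2;" construct. */
void combine_branch_inside(const short *a, const short *b, short *out, int type)
{
    for (size_t i = 0; i < BFLY_POINTS; i++) {
        const short sum = (short)(a[i] + b[i]);
        const short diff = (short)(a[i] - b[i]);
        out[i] = type ? sum : diff;
    }
}
```

Both functions compute the same result. In hardware, the second form describes one loop body with a selection at its output instead of two separate loop pipelines, which is where the area saving would come from, if there is one.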

 

There are probably other things that can be done to improve the code but I cannot find and list them all since the code is relatively large. You can try converting each function to a separate kernel manually and then compile them one by one and optimize each separately based on the information you get from the report and then put them back in the original kernel.

FCive
New Contributor I
131 Views

Thank you, HRZ. I really appreciate your help.

 

I tried to follow your suggestions and made the kernel smaller (it now works with 256 samples instead of 512). It is much better now, but I have some problems with loops. I still have memory dependencies, and I do not understand how to fix them. Moreover, the variables involved in these loops require a lot of resources.

Please find the attached new report: could you please tell me how I can resolve these dependencies?

 

Federico

HRZ
Valued Contributor II
131 Views

Those dependencies look like false write-after-write dependencies. The compiler seems to be assuming that the store addresses might overlap and cause undefined behavior in the pipelined loop but since the loop bound is fixed and the addresses do not seem to overlap, it is probably safe to add #pragma ivdep to the loop to avoid the false dependency.

FCive
New Contributor I
131 Views

Thank you again for your reply.

 

Finally I was able to put the kernel into the HW. At the moment the resource utilization is around 50%. Your suggestions were really helpful.

Unfortunately, the results of the OpenCL kernel in HW are different from the emulator. The kernel returns a buffer in which some elements are correct and others are not. Do you know what could cause this behavior?

Looking at the dynamic profiler, I see that the transfer throughput from global memory to the FPGA is 574 MB/s and from the FPGA to global memory is 147 MB/s, but I got around 2500 MB/s with the "aocl diagnose" command. Also, the kernel clock frequency is not optimal: I got 155 MHz, but I was able to reach more than 200 MHz with other kernels.

Could you please let me know what I can do to improve the performance in terms of execution time and global memory read/write throughput?

 

Thank you.

HRZ
Valued Contributor II
131 Views

Different output on FPGA compared to emulation can have two reasons:

 

  • A bug in the compiler that results in the generation of an incorrect hardware circuit (less likely)
  • Race condition in global memory accesses or incorrect usage of the ivdep pragma (more likely)

 

It is possible that I missed some important detail in your kernel and my suggestion of adding ivdep to avoid the dependencies was incorrect. You can try removing them to see if you will get correct output (at the cost of lower performance).

 

I wouldn't rely too much on the numbers reported by the profiler; in my experience, these numbers are not very accurate. The peak external memory bandwidth of your board is 25.6 GB/s (23.8 GiB/s); however, you should not expect to get close to that number except in extremely ideal situations. You can find the math behind the calculation of the external memory bandwidth and my recommendations on how to improve external memory performance in this thread (check the reply before the last; usernames have been lost after migration from Altera's forum):

 

https://forums.intel.com/s/question/0D50P00003yyTK3SAM/global-memory-access-512-bit-width-constrain

 

Regarding operating frequency, it largely depends on loop-carried dependencies and area usage. OpenCL users have very little control over the kernel operating frequency and it is difficult to give recommendations as to how it can be improved. You can try changing the default target operating frequency from 240 to some higher number using the -fmax switch and force the compiler to insert more registers into the pipeline; this can potentially improve operating frequency. However, it might result in higher II for loops that are the fmax bottleneck. In that case you should focus on optimizing those loops to resolve whatever dependency that is causing the bottleneck.

FCive
New Contributor I
131 Views

Thank you for the information and the calculation about accessing the global memory. With this I can estimate the communication latency, taking into account the PCIe interface between host and device.

I will try to force the maximum frequency with the -fmax option and see if the performance improves. I asked how I can improve performance because my kernel takes 600 us (according to the profiler), and I would like to reach hundreds of nanoseconds of processing time, if possible.

 

About the different output, I followed your suggestion about the ivdep pragma, but the area usage was still too high. So I tried to remove the dependencies, and I reached 50% area just by optimizing the loops, without using the ivdep pragma. The accesses to global memory are only for reading and writing (please see the code below).

 

Is there a race condition in the global memory accesses anyway? In case of a compiler bug, could you please tell me how I can fix it?

reading:
#pragma unroll
for (ushort i = 0; i < 512; i++)
    data[i] = x[i];

writing:
#pragma unroll
for (ushort i = 0; i < 512; i++)
    y[i] = data[i];

Thank you.

 

HRZ
Valued Contributor II
131 Views

Do you still get incorrect output after removing the ivdep pragmas? Also, as I mentioned before, there is really no point in fully unrolling your memory reads and writes since the memory bandwidth will be saturated with an unroll factor of 16, and you will be just wasting FPGA area with such large unroll factors.
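
As a rough illustration of the unroll-factor argument, the sketch below derives the factor from the 512-bit access width mentioned earlier and copies the data in chunks of that size. It is plain C, not kernel code. The names are hypothetical, the 512-point size is taken from the loops above, and the place where "#pragma unroll" would go in an OpenCL kernel is marked in a comment so that the code compiles with an ordinary C compiler.

```c
#include <stddef.h>
#include <stdint.h>

#define BUS_BITS 512u /* one 512-bit access per loop iteration */
#define POINTS 512    /* number of points, as in the posted loops */
#define CHUNK 16      /* 512 bits / 32-bit "int" */

/* Number of elements of `elem_bytes` bytes that fit in one 512-bit access. */
unsigned saturating_unroll_factor(size_t elem_bytes)
{
    return (unsigned)(BUS_BITS / (8u * elem_bytes));
}

/* Copy POINTS values, CHUNK at a time. In an OpenCL kernel only the inner
 * loop would carry "#pragma unroll"; the outer loop stays pipelined. */
void copy_in_chunks(const int32_t *x, int32_t *y)
{
    for (size_t base = 0; base < POINTS; base += CHUNK) {
        /* #pragma unroll  (kernel code only) */
        for (size_t i = 0; i < CHUNK; i++) {
            y[base + i] = x[base + i];
        }
    }
}
```

For a 4-byte element, `saturating_unroll_factor` returns 16, which matches the factor quoted above for the "int" datatype.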

 

It is unlikely that your problem is caused by a bug in the compiler; however, if it is, there is nothing any of us can do about it other than reporting it to Intel and hoping that they would fix it in a later version. It might also be possible to avoid bugs in certain cases by changing the design strategy.

FCive
New Contributor I
131 Views

Yes. I still have incorrect output and I am not using the pragma ivdep. After a debug session, I have found out that the problem is in the following code.

for (uchar j=0; j<8; j++) { k = j*16; p = j*16+128; h = j*16+256; t = j*16+384; for (uchar i=0; i<16; i+=2) { result2[k+i] = (((result[p+i+1]*tw256[k+i+1] + result[p+i]*tw256[k+i] + result[h+i+1]*tw256[p+i+1] + result[h+i]*tw256[p+i] + result[t+i+1]*tw256[h+i+1] + result[t+i]*tw256[h+i]) >> 15) + result[k+i])>>1; result2[k+i+1] = (((result[p+i+1]*tw256[k+i] - result[p+i]*tw256[k+i+1] + result[h+i+1]*tw256[p+i] - result[h+i]*tw256[p+i+1] + result[t+i+1]*tw256[h+i] - result[t+i]*tw256[h+i+1]) >> 15) + result[k+i+1])>>1; result2[p+i] = (((result[t+i+1]*tw256[h+i] - result[t+i]*tw256[h+i+1] - result[h+i+1]*tw256[p+i+1] - result[h+i]*tw256[p+i] - result[p+i+1]*tw256[k+i] + result[p+i]*tw256[k+i+1]) >> 15) + result[k+i])>>1; result2[p+i+1] = (((result[p+i+1]*tw256[k+i+1] + result[p+i]*tw256[k+i] - result[h+i+1]*tw256[p+i] + result[h+i]*tw256[p+i+1] - result[t+i+1]*tw256[h+i+1] - result[t+i]*tw256[h+i]) >> 15) + result[k+i+1])>>1; result2[h+i] = (((result[h+i+1]*tw256[p+i+1] + result[h+i]*tw256[p+i] - result[t+i+1]*tw256[h+i+1] - result[t+i]*tw256[h+i] - result[p+i+1]*tw256[k+i+1] - result[p+i]*tw256[k+i]) >> 15) + result[k+i])>>1; result2[h+i+1] = (((result[h+i+1]*tw256[p+i] - result[h+i]*tw256[p+i+1] - result[t+i+1]*tw256[h+i] + result[t+i]*tw256[h+i+1] - result[p+i+1]*tw256[k+i] + result[p+i]*tw256[k+i+1]) >> 15) + result[k+i+1])>>1; result2[t+i] = (((result[p+i+1]*tw256[k+i] - result[p+i]*tw256[k+i+1] - result[h+i+1]*tw256[p+i+1] - result[h+i]*tw256[p+i] - result[t+i+1]*tw256[h+i] + result[t+i]*tw256[h+i+1]) >> 15) + result[k+i])>>1; result2[t+i+1] = (((result[t+i+1]*tw256[h+i+1] + result[t+i]*tw256[h+i] - result[h+i+1]*tw256[p+i] + result[h+i]*tw256[p+i+1] - result[p+i+1]*tw256[k+i+1] - result[p+i]*tw256[k+i]) >> 15) + result[k+i+1])>>1; } }

The buffer "result" is correct and it is calculated earlier in the code. The buffer "tw256" is constant and read-only. The variables k, p, h, t are initialized. The buffer "result2" is incorrect. Both "result" and "result2" are declared as registers (512 registers of width 16 and depth 1). I also tried to split the "result2" writes across several unrolled loops. The output has fewer errors but is still wrong.

Could you please let me know if something is wrong? I can share the complete code if it helps.

 

Thank you for your support.

HRZ
Valued Contributor II
131 Views

I don't really see a race condition in your current code snippet and cannot think of any reason why it would generate incorrect output. One test you can do is to extract that part of the code and convert it to a separate kernel and do the rest of the computation on the host and see if the output is still incorrect. If it is, then you can go ahead and add a "lightweight" printf to your code and try to see if you can figure out where things are going wrong on the hardware.
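
One way to double-check on the host that the stores in the posted loop nest do not collide is to replay only the index arithmetic and count how often each "result2" slot is written. The sketch below does that in plain C. The function name is hypothetical, and the index variables are deliberately declared as `unsigned` here because their types are not shown in the thread.

```c
#include <string.h>

#define STAGE_POINTS 512 /* size of result2, per the "512 registers" note */

/* Replays the store indices of the posted loop nest and returns 1 if every
 * result2 index in [0, 512) is written exactly once, 0 otherwise. */
int stores_cover_each_index_once(void)
{
    unsigned char writes[STAGE_POINTS];
    memset(writes, 0, sizeof writes);

    for (unsigned j = 0; j < 8; j++) {
        /* Same offsets as in the post. Note that t + i + 1 reaches 511, so
         * these variables need at least 9 bits; an 8-bit type would wrap. */
        const unsigned k = j * 16;
        const unsigned p = j * 16 + 128;
        const unsigned h = j * 16 + 256;
        const unsigned t = j * 16 + 384;

        for (unsigned i = 0; i < 16; i += 2) {
            const unsigned idx[8] = {
                k + i, k + i + 1, p + i, p + i + 1,
                h + i, h + i + 1, t + i, t + i + 1
            };
            for (unsigned s = 0; s < 8; s++) {
                if (idx[s] >= STAGE_POINTS)
                    return 0; /* out-of-bounds store */
                writes[idx[s]]++;
            }
        }
    }

    for (unsigned n = 0; n < STAGE_POINTS; n++) {
        if (writes[n] != 1)
            return 0; /* slot written twice or never */
    }
    return 1;
}
```

The function returns 1 for this index pattern, which is consistent with there being no write-after-write collision inside the loop nest. It says nothing about the arithmetic itself or about how the compiler schedules the surrounding stages.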

FCive
New Contributor I
131 Views

Finally I found the error. In my kernel code, I have 3 different stages to complete the FFT. The first and third ones are composed of an unrolled nested loop, while the second one consists of a pipelined nested loop. Converting the pipelined nested loop in the second stage to an unrolled nested loop resolved the problem. I would like to ask you a question to confirm my idea: is it possible that the third stage, with its unrolled loop, runs before the data from the second stage is ready? This would explain the issue that I had. If not, what could be the reason for this error?

 

I noticed that the execution time is around 700 us: what are good practices to reduce it? My goal is to reach 500 to 600 ns, or 1 us at most. Do you think it is feasible?

 

Thank you very much.

HRZ
Valued Contributor II
131 Views

As far as I know, each set of nested loops in a kernel will be implemented as an individual pipeline. If the compiler does not detect any dependency between two such pipelines, it might reorder or parallelize them. However, this should not happen in your case since there is a data dependency between the loop nests. In your case, the compiler must guarantee that each loop nest is completely processed and its pipeline is flushed before starting the next one. I am not sure what could be causing the problem in your case; it could as well be a compiler bug.

 

Regarding run time, you should probably first calculate the total amount of data that is transferred between the FPGA and its external memory in your code and divide it by the FPGA external memory bandwidth. This gives you the minimum possible run time, i.e. an upper bound on the performance you can achieve. If this bound is higher (worse) than your goal, then your goal is unachievable. If your goal is higher (worse) than the bound, then the further it is from the bound, the more likely you are to achieve it. Of course, there is never any guarantee that you will reach this bound in practice.
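
The bound described above is a single division. A minimal sketch in plain C follows. The function names are hypothetical, and the only unit fact used is that 1 GB/s equals 1 byte per nanosecond.

```c
/* Minimum time, in nanoseconds, needed to move `total_bytes` at
 * `bandwidth_gb_per_s` (1 GB/s is 1 byte per nanosecond). */
double min_transfer_time_ns(double total_bytes, double bandwidth_gb_per_s)
{
    return total_bytes / bandwidth_gb_per_s;
}

/* Returns 1 if the run-time goal is not ruled out by the memory bound.
 * Passing this check does not mean the goal will actually be reached. */
int goal_not_ruled_out(double goal_ns, double total_bytes, double bandwidth_gb_per_s)
{
    return goal_ns >= min_transfer_time_ns(total_bytes, bandwidth_gb_per_s);
}
```

As an illustration only, 1024 bytes (512 samples of 16 bits, an assumed figure) at the 25.6 GB/s peak quoted earlier gives about 40 ns. This covers external memory traffic only and ignores pipeline latency and PCIe transfers.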

FCive
New Contributor I
131 Views

I would like to understand in more depth whether the problem is caused by an issue in my code or whether it really is a compiler bug. Do you think I should open a technical support ticket with Intel to report it?

 

About the run time performance, if I understood well, I can calculate the FPGA external memory bandwidth as:

kernel_frequency x number_of_banks x bus_width

In my case I have 2 banks of DDR3 @933MHz, thus the memory operating frequency is 933x2=1866MHz. According to your answer in the thread you pointed me to, the memory controller on the FPGA has a frequency of 1866/8=233MHz. Thus, the maximum frequency achievable for the kernel is 233MHz. I have to read/write 512x16=8192 bits from/to the global memory. Assuming that the kernel operating frequency is 233MHz, the max FPGA external memory bandwidth is 233MHz x 2 x 64 = 29.8Gbps and the upper bound is 8192bit/29.8Gbps=275 nanoseconds to transfer data from/to the FPGA.

Are these calculations correct? Am I making any mistakes?

 

Thank you very much.

HRZ
Valued Contributor II
42 Views

You can try reporting your issue to Intel, but unless you have Premier Support access, I don’t think it is possible to open tickets with Intel anymore. They have offloaded support for people who do not have Premier Support access to the forums. Maybe you can send a PM to one of the Intel-affiliated moderators in the forum and then they can open a ticket with the engineering team on your behalf.

 

Regarding the memory performance, the memory of the DE5-Net board operates at 1600 MHz (800 MHz double data-rate) and the memory controller operates at 200 MHz. The peak external memory bandwidth of the board is 2 x 64 bit x 1600 MHz = 25.6 GB/s = 23.8 GiB/s. Assuming that your OpenCL kernel is running at 200 MHz, you will need to read/write a minimum of 128 bytes per clock to saturate the external memory bandwidth. Since the kernel and the memory controller operate at different clocks, and the clock of the memory controller is fixed, you can saturate the memory bandwidth using fewer bytes per clock at a higher kernel operating frequency, or using more bytes per clock at a lower kernel operating frequency. The compiler places buffers between the kernel and the memory interface to allow this. If the total amount of data you need to transfer between the FPGA and its external memory is only 8192 bits, then you would probably be better off just running your code on the host since your bottleneck will definitely be the PCI-E transfer. The computation on the FPGA in this case will be latency-bound since you will not be saturating the pipeline, and your run time will then depend on the depth of the pipeline and kernel operating frequency - parameters which the user has very little control over in a high-level design.
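
The figures above can be reproduced with a few lines of plain C. This is only a sketch of the arithmetic, with hypothetical function names. The inputs (2 banks, 64-bit bus, 1600 MT/s, 200 MHz kernel clock) are the ones quoted in this reply.

```c
/* Peak external memory bandwidth in GB/s (10^9 bytes per second):
 * banks x bus width in bytes x millions of transfers per second / 1000. */
double peak_bandwidth_gb_per_s(unsigned banks, unsigned bus_bits, double mega_transfers_per_s)
{
    return banks * (bus_bits / 8.0) * mega_transfers_per_s / 1000.0;
}

/* Same bandwidth expressed in GiB/s (2^30 bytes per second). */
double gb_to_gib_per_s(double gb_per_s)
{
    return gb_per_s * 1e9 / (1024.0 * 1024.0 * 1024.0);
}

/* Bytes the kernel must move per clock cycle to sustain that bandwidth. */
double bytes_per_kernel_clock(double bandwidth_gb_per_s, double kernel_mhz)
{
    return bandwidth_gb_per_s * 1000.0 / kernel_mhz;
}
```

With these inputs the functions give 25.6 GB/s, roughly 23.8 GiB/s, and 128 bytes per clock. At a higher kernel clock, `bytes_per_kernel_clock` drops accordingly, which is the trade-off described above.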
