Altera_Forum
Honored Contributor I

Cannot find specific errors of kernel compilation

I'm trying to compile a kernel I've written myself. The compilation fails while generating the hardware implementation: it aborts after about an hour of processing with no stated reason. Most probably it cannot synthesize the design, but I cannot find any log file that explains what exactly is happening. Here is my kernel:

 

 

__kernel void Test51(__global double *data, __global double *rands, int index, int rand_max)
{
    double2 temp;
    int gid = get_global_id(0);
    temp = data;
    temp = (double) rands * temp;
    temp = (double) rands * temp;
    temp = (double) rands * temp;
    temp = (double) rands * temp;
    temp = (double) rands * temp;
    temp = (double) rands * temp;
    temp = (double) rands * temp;
    .....
    temp = (double) rands * temp;
    temp = (double) rands * temp;
    temp = (double) rands * temp;
    temp = (double) rands * temp;
    temp = (double) rands * temp;
    temp = (double) rands * temp;
    temp = (double) rands * temp;
    data = temp.s0;
}

 

I've tried hard to understand the exact reason for the failure but cannot figure out anything. Can anyone help me with this issue?
4 Replies
Altera_Forum
Honored Contributor I

Do you have either of the "quartus_sh_compile.log" or the "*kernel_name*.log" files? Do you get the resource estimation if you add --report to your command line? If you don't get the latter and your compilation is failing during LLVM compilation/optimization, you should consider changing your design strategy. 
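For reference, a minimal sketch of the invocation with the resource-estimation report enabled (assuming the standard aoc compiler driver; the kernel filename and any board-selection flags are placeholders for your setup):

    aoc --report Test51.cl

The estimation is printed before the long hardware compilation starts, so it is a cheap way to check whether the design has any chance of fitting.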

 

Why don't you use a for loop instead of your current unrolled design? Your design will most likely not fit on any FPGA, even if it compiles.
Altera_Forum
Honored Contributor I

As I expected, your design is too big to fit on the FPGA, and that is why it is failing to compile. This is the area estimation from the log: 

 

+--------------------------------------------------------------------+
; Estimated Resource Usage Summary                                    ;
+----------------------------------------+---------------------------+
; Resource                               + Usage                     ;
+----------------------------------------+---------------------------+
; Logic utilization                      ; 191%                      ;
; ALUTs                                  ; 69%                       ;
; Dedicated logic registers              ; 121%                      ;
; Memory blocks                          ; 14%                       ;
; DSP blocks                             ; 136%                      ;
+----------------------------------------+---------------------------;

 

Your kernel will probably need at least 499 double-precision multipliers and a huge amount of memory and logic to support them. If you write the kernel as follows, you will have the exact same dependency but with a very modest area usage: 

 

__kernel void Test51(__global double *data, __global double *rands, int index, int rand_max)
{
    double2 temp;
    int gid = get_global_id(0);
    temp = data;
    for (int i = 1; i < 500; i++) {
        temp = (double) rands * temp;
    }
    data = temp.s0;
}

 

It is worth mentioning that loops in NDRange kernels are not pipelined; instead, they are shared by multiple threads to keep the pipeline busy. Because of this, loop-carried dependencies do not have much of a negative effect in NDRange kernels. However, if you write the same kernel as single work-item, you will get an initiation interval higher than one due to the loop-carried dependency on the temp variable and very bad performance (which can be fixed by inferring a shift register, as outlined in Intel's documents).
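To illustrate, here is a rough sketch of that shift-register pattern for a single work-item variant of the kernel above. This is not the poster's original code: the depth of 8 is an arbitrary tuning knob, and the indexing into rands and data is assumed (the original post multiplies by a bare "rands"); see the best-practices documentation for the exact recipe:

    __kernel void Test51_swi(__global double *data, __global double *rands)
    {
        // Shift register of interleaved partial products.
        // DEPTH is an assumed value; increase it until the compiler
        // reports an initiation interval (II) of 1 for the main loop.
        #define DEPTH 8
        double shift_reg[DEPTH + 1];
        #pragma unroll
        for (int i = 0; i < DEPTH + 1; i++)
            shift_reg[i] = 1.0;   // neutral element for multiplication

        for (int i = 1; i < 500; i++) {
            // Multiply into the tail, then shift: the loop-carried
            // dependency distance becomes DEPTH instead of 1.
            shift_reg[DEPTH] = shift_reg[0] * rands[i];   // rands[i] is assumed indexing
            #pragma unroll
            for (int j = 0; j < DEPTH; j++)
                shift_reg[j] = shift_reg[j + 1];
        }

        // Combine the partial products into the final result.
        double result = data[0];
        #pragma unroll
        for (int i = 0; i < DEPTH; i++)
            result *= shift_reg[i];
        data[0] = result;
    }

Because multiplication is associative and commutative, splitting the chain into DEPTH interleaved partial products and combining them at the end yields the same result as the sequential loop.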
Altera_Forum
Honored Contributor I

Thanks very much for the answer. What is interesting to me is the lack of a specific error message when the synthesis phase fails; I was expecting some verbose output somewhere in the log files.

 

I also have one more question. You mentioned the inefficiency of my kernel, since I unrolled the loop manually instead of writing a simple iterative loop, and you said that in NDRange kernels the pipeline is shared by multiple threads to keep it busy. Does this mean:

 

1) Single work-item designs should be considered the first choice when deploying OpenCL code on an FPGA?

 

2) Does "thread" here mean a single work-item belonging to a work-group?

 

Thanks.
Altera_Forum
Honored Contributor I

Actually, in normal circumstances the compilation fails during fitting in such cases, and you get an explicit message in stdout saying that Quartus failed to fit the design on the FPGA. In your case, however, it seems fitting actually finishes successfully (though with a slowed OpenCL clock) and the flow fails after that, which is pretty strange. Since AOC's estimation can be off at times (especially on Arria 10), the "top.fit.summary" file will give you a much more accurate picture of area usage.

 

Your kernel is not necessarily inefficient performance-wise; it is rather inefficient from a coding-effort point of view, since you can simply use a for loop as I suggested and apply "#pragma unroll" to fully or partially unroll the loop based on the amount of area you have available on the FPGA.
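For example, the loop from the earlier reply could be partially unrolled like this (a sketch only: the factor of 8 is an arbitrary placeholder to be sized against the available area, and the loop body is kept exactly as quoted in the original post):

    #pragma unroll 8   // instantiate 8 multipliers instead of ~500
    for (int i = 1; i < 500; i++) {
        temp = (double) rands * temp;
    }

A fully unrolled loop (#pragma unroll with no factor) reproduces the original hand-unrolled design and its area cost, so the partial factor is the knob that trades area for throughput.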

 

Regarding your questions: 

 

1) It is hard to come up with a fixed formula as to when single work-item works better and when NDRange does, but I would suggest using single work-item in cases where loop-carried dependencies exist and can be resolved by using temporary registers or shift registers to achieve an initiation interval of one. For fully data-parallel kernels, you might as well use NDRange. In the general case, I would consider single work-item as the first choice, and if I couldn't achieve good performance due to unresolvable dependencies or unpipelinable loops, I would switch to NDRange.

 

2) Yes, that is what I was referring to by using the term "thread".