Area & Timing report mismatches

Altera_Forum · 02-02-2018

Hello!

I am trying to understand why results seem to differ depending on where I look at them. I emphasize "seem" since I guess I am missing something. I am working with version 17.0.2, build 602.

Regarding area, the results in report.html differ from those in acl_quartus_report.txt (which match those in the Quartus Fitter report). I am also unable to find any relation between the ALUTs/FFs/RAMs reported in acl_quartus_report.txt and any of the modules listed under "Area Analysis of system/source" in the HLD FPGA Reports (Beta) of report.html. Any idea on this? Which one should be trusted? And which part of the whole system is covered by the results in acl_quartus_report.txt?

Regarding maximum operating frequency, I guess the valid value is the one in acl_quartus_report.txt, since I haven't been able to find any such figure in report.html. Also, if aoc is run with a --board specification, is the reported fmax the real fmax the design will run at on the device?

Also, is there any way to get an aggregated value for the latency of a complete kernel? So far, all results in the "System viewer" of report.html are given on a per-block basis. With this report (which I find hard to understand, by the way, although I still need to go through the guides in greater detail), I'd say that adding up the latencies reported for all blocks is not the way to go. Any hint on this?

I have been looking in the Programming and Best Practices guides and in the forum, and I haven't been able to find this. Not to mention that the guides for the 17/17.1 versions still use the previous .txt optimization reports in some sections, even though these have since been superseded by the HTML report. But well, I guess it is just a matter of time before the guides are updated and catch up with the tools (!), hopefully in future versions.

Thanks so much in advance,

Rubén

Altera_Forum · 02-03-2018

Well, whatever you see in the area report is an inaccurate estimation of resource utilization, and is just supposed to give you an idea of whether your kernel is going to fit on the FPGA or not. In fact, in Quartus v17.1 a new message has been added to the compiler's output that explicitly says so. The numbers in acl_quartus_report.txt are the final post-place-and-route area and timing results which are 100% accurate. The early estimation, which is based on some model Altera has developed, is required since the compilation and place and route might take a couple of hours and people generally cannot wait so long just to see if their kernel is going to fit on the device. However, the accuracy is low since modelling area utilization on FPGAs is not easy. There is one exception and that is DSP utilization; DSP utilization in the area report is accurate because modelling that is easy. The only possible source of discrepancy is if your BSP is reserving some DSPs but not using them, which will be counted in the area report but not the final report. In this case, the number of DSPs that are actually used will be equal to the number of DSPs reported in the area report minus the number of DSPs reserved by the BSP.
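To make the DSP arithmetic above concrete, here is a minimal C sketch. The function name and the numbers in the usage note are made up for illustration; they do not come from any real report or BSP.

```c
#include <assert.h>

/* Hypothetical helper: DSPs actually used by the design, following the rule
 * above (DSP count in the area report minus the DSPs that the BSP reserves
 * but does not use). */
unsigned dsps_actually_used(unsigned area_report_dsps, unsigned bsp_reserved_dsps)
{
    /* The reserved DSPs are part of the area-report figure, so they
     * cannot exceed it. */
    assert(bsp_reserved_dsps <= area_report_dsps);
    return area_report_dsps - bsp_reserved_dsps;
}
```

For example, if the area report showed 120 DSPs and the BSP reserved 20 of them without using them, the final report would be expected to show 100.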

The operating frequency you see in acl_quartus_report.txt is the final real Fmax. Fmax cannot be easily estimated, hence there is no estimation in the area report (in fact, there was one in the early versions of AOC but they removed it since it was extremely inaccurate).

Regarding latency, if you add up the latency of the different blocks in your kernel, you will get the minimum latency of the pipeline. The actual latency is always higher due to stalls from external memory and channel operations. Furthermore, since the latency of each block is for one loop iteration (single work-item kernels) or one work-item (NDRange kernels), if you want an estimate of the minimum run time, you should multiply the latency of each block by its loop trip count or by the number of work-items that traverse it.
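The two sums described above could be sketched in C roughly as follows. The struct, its field names and the sample numbers in the usage note are hypothetical; the per-block latencies would have to be read off the system viewer by hand, and the result is only the stall-free lower bound, not a prediction of real run time.

```c
#include <assert.h>
#include <stddef.h>

/* One block as listed in the system viewer (hypothetical representation). */
struct block {
    unsigned long latency; /* cycles for one loop iteration or one work-item */
    unsigned long count;   /* loop trip count or number of work-items */
};

/* Sum of the per-block latencies: minimum latency of one pass through the
 * pipeline, ignoring stalls. */
unsigned long min_pipeline_latency(const struct block *blocks, size_t n)
{
    assert(blocks != NULL || n == 0);
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++)
        total += blocks[i].latency;
    return total;
}

/* Per-block latency multiplied by its trip/work-item count, summed over all
 * blocks: the rough minimum run-time estimate described above. */
unsigned long min_runtime_cycles(const struct block *blocks, size_t n)
{
    assert(blocks != NULL || n == 0);
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++)
        total += blocks[i].latency * blocks[i].count;
    return total;
}
```

With three made-up blocks of latency 10, 25 and 5 cycles, where only the middle one is a loop body executed 1000 times, the first function gives 40 cycles and the second 25015 cycles. Dividing the cycle count by the Fmax from acl_quartus_report.txt would turn it into a time.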

Altera_Forum · 02-06-2018

Thanks HRZ for the nice and fast response, and sorry for this longer-than-expected reply.

OK, I get it. I thought that by the end of the compilation & synthesis flow, the report.html would have been updated with the final results. Seems it's not the case.

I don't know what you folks think, but I would like to have two different reports here (the estimate for early feedback, and the post-place-and-route one) so we can check both in the HTML GUI, ideally with the profiler information integrated. But this is another story.

I have some other questions on these area results:

(1) I guess the information in acl_quartus_report.txt is "just" for the kernel logic (including the associated work-item ID dispatching etc.), without the Board Interface and Global Interconnect contribution, right? I am asking because I'd like to know the specific resource consumption of my kernel "logic", i.e., the part implementing the actual functionality. It would be nice to get results for the actual datapath pipeline separated from everything else (LSUs and the like, i.e., all the other elements within the kernel that are there to provide system-level integration). The best way to describe what I mean is to imagine a standard VHDL entity of a given circuit that could be plugged into the kernel to do the computation and then integrated at system level (this can actually be done through the OpenCL library in AOC). Those are the post-place-and-route area results I'd like to know, for just the functional part of my kernel, for comparison with other tools.

(2) I guess that if more than one kernel (in one or several .cl files) is compiled into the same .aocx, the result in this Quartus report file would be the aggregate of all kernels, right? (So far I have only tried a single kernel.) In that case, the solution would be to go to Quartus and check the resource utilization by entity, but this approach clearly does not scale.

Now, regarding latency, I get it too. What I was looking for was the actual "capability" of the synthesized circuit (pipeline), regardless of external factors such as bus contention, memory bottlenecks and the like. Basically, just the number of clock cycles needed for an input datum to traverse the whole pipeline until it reaches the output: this is what I would measure in RTL and logic simulation. To check whether I got it right, let me formulate the following: from the point of view of an RTL implementation, the minimum (ideal) latency of a datapath (pipeline), ignoring stalls caused by external factors (meaning one new datum can be accepted each clock cycle), can be computed the same way for any pipeline as clock_cycles_to_fill_pipeline + N_clock_cycles. So far so good. Now consider the following case in OpenCL (computing a set of dot products):

__kernel void foo (global char * restrict in, global char * restrict out, const uchar m)
{
    // SIZE (number of samples to filter, SIZE > LENGTH) is assumed to be defined elsewhere
    #define LENGTH 3
    // actual ff values don't matter
    char ff[4][LENGTH] = {
        { 1, 2, 3},
        { 5, 5, 6},
        { 7, 8, 9},
        { 10, 11, 12}
    };
    char f[LENGTH];
    // fill 'f' according to a kernel parameter, 'm'
    #pragma unroll
    for (int i = 0; i < LENGTH; i++){
        // 'm' is a kernel input parameter
        f[i] = ff[m][i];
    }
    // hold input samples in a shift reg. structure
    char tmp[LENGTH];
    // 'in' is a kernel input parameter (data to be filtered: SIZE > LENGTH)
    #pragma unroll
    for (int j = 0; j < LENGTH; j++){
        tmp[j] = in[j];
    }
    // filter
    #pragma unroll 1 // prevents unrolling
    for (int j = 0; j < SIZE+1; j++) {
        // accumulator
        short acc = 0;
        // dot product
        #pragma unroll
        for (int i = 0; i < LENGTH; i++){
            acc += f[i] * tmp[i];
        }
        // write output: 'out' is a kernel output parameter
        out[j] = acc;
        // updates tmp buffer
        // ...shift samples
        #pragma unroll
        for (int i = 0; i < LENGTH-1; i++){
            tmp[i] = tmp[i+1];
        }
        // ...get new sample (index assumed: the next unread input sample)
        tmp[LENGTH-1] = in[j + LENGTH];
    }
}

Here things change a bit, the difference coming from how fully unrolled and non-unrolled loops contribute to the latency. When adding up the latency of the different blocks, blocks corresponding to fully unrolled loops would count just once (their latency is added once), while loops that are not unrolled at all would be multiplied by their loop trip count (and partially unrolled loops by original_trip_count/unroll_factor). Is this right?
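The counting rule proposed above can be sketched in C as follows. This only illustrates the rule as stated in this post, not a confirmed model of what aoc generates. The function names are hypothetical, N in the pipeline formula is read as the number of input items, and rounding up when the trip count is not a multiple of the unroll factor is an added assumption.

```c
#include <assert.h>

/* Ideal cycle count of a stall-free pipeline that accepts one item per
 * cycle: clock_cycles_to_fill_pipeline + N. */
unsigned long ideal_pipeline_cycles(unsigned long fill_cycles, unsigned long n_items)
{
    return fill_cycles + n_items;
}

/* Iterations left after unrolling: original_trip_count / unroll_factor.
 * unroll_factor == 1 means "not unrolled"; a factor equal to (or larger
 * than) a non-zero trip count means "fully unrolled" and yields 1.
 * Rounding up is an assumption for non-divisible trip counts. */
unsigned long effective_trip_count(unsigned long trip_count, unsigned long unroll_factor)
{
    assert(unroll_factor >= 1);
    return (trip_count + unroll_factor - 1) / unroll_factor;
}

/* Contribution of one block: its reported latency times the number of
 * iterations that remain after unrolling. */
unsigned long block_cycles(unsigned long block_latency,
                           unsigned long trip_count,
                           unsigned long unroll_factor)
{
    return block_latency * effective_trip_count(trip_count, unroll_factor);
}
```

For the kernel above, a fully unrolled inner loop with LENGTH = 3 gives effective_trip_count(3, 3) = 1, so its block is counted once, while the outer loop marked with unroll 1 keeps its full trip count. With made-up numbers, block_cycles(20, 1024, 4) is 20 * 256 = 5120.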

As before, hopefully Intel will do this computation for us at some point, clearly stating of course that it represents the ideal situation without pipeline stalls, which can't be predicted (at least not completely). However, since the memory access pattern is clear from the kernel code, if that is the only kernel in your system, no other piece of code is accessing main memory to feed data to the kernel, and you compile/synthesize against a given board, I guess it shouldn't be impossible for the tool to compute the number of clock cycles required to complete the whole computation (given that all constants and loop bounds are known at compile time). How does this sound to you?

Cheers,

Rubén