Placement cannot find a legal solution

Altera_Forum · ‎10-11-2017

Hi,

I am targeting the a10gx using AOC and Quartus Prime, both v16.1.2 Build 203.

According to the logs, my design is not overutilizing resources, but the build fails during fitting:


Error (18999): Placement cannot find a legal solution

I found another post reporting the same problem (https://www.alteraforum.com/forum/showthread.php?t=55881), but it doesn't help, as I am compiling on Linux and targeting the a10gx.

My only guess is that my design requires more resources than are available and the tool is not reporting this correctly.

As far as I know, error messages due to overutilization are different.

Could anyone give some hints on this?

The following link contains some logs (they are too large to attach to this post):

https://www.dropbox.com/s/izv7ld1bj5x4cxq/aoc_quartus_placement_error_logs.zip?dl=0

Altera_Forum · ‎10-11-2017

Your log indeed looks strange. The fitter just throws some "Placement cannot find a legal solution" messages and aborts. The numbers in the top.fit.summary file do not make much sense either, especially the ALM utilization. However, as I also mentioned in the other thread, these numbers seem to overflow in cases where utilization goes over 100%. What do you get from aoc's resource estimation? If the logic utilization you get is over 80%, my guess is that you are running out of logic, even though that would be very rare/strange on Arria 10.

Altera_Forum · ‎10-11-2017

Hi HRZ,

Thank you for checking the log files!

--- Quote Start ---

... However, as I also mentioned in the other thread, these numbers seem to overflow in cases where utilization goes over 100%. What do you get from aoc's resource estimation? If the logic utilization you get is over 80%, my guess is that you are running out of logic, even though that would be very rare/strange on Arria 10.

--- Quote End ---

This is the logic estimation I got using aoc:

+--------------------------------------------------------------------+
; Estimated Resource Usage Summary                                   ;
+----------------------------------------+---------------------------+
; Resource                               + Usage                     ;
+----------------------------------------+---------------------------+
; Logic utilization                      ;   64%                     ;
; ALUTs                                  ;   31%                     ;
; Dedicated logic registers              ;   34%                     ;
; Memory blocks                          ;   67%                     ;
; DSP blocks                             ;   44%                     ;
+----------------------------------------+---------------------------;

All numbers are under 80% so no idea what could be wrong ...

Altera_Forum · ‎10-11-2017

Hmm, that doesn't really look like a standard case of overutilization, but I guess since the logic utilization is being reported as less than 1% in the fitting report, your kernel might be overutilizing logic and the number might be overflowing in the report.

Is there any way you can reduce the area usage of your kernel to get a proper estimation of the actual area utilization after placement and routing, e.g. by reducing the unroll factor or SIMD size? After all, the estimation by aoc is generally not very accurate and I have seen differences of up to 50%, though aoc pretty much always overestimates the area utilization for me.

Altera_Forum · ‎10-12-2017

--- Quote Start ---

...

Is there any way you can reduce the area usage of your kernel to get a proper estimation of the actual area utilization after placement and routing, e.g. by reducing the unroll factor or SIMD size? After all, the estimation by aoc is generally not very accurate and I have seen differences of up to 50%, though aoc pretty much always overestimates the area utilization for me.

--- Quote End ---

Although all my kernels are single-threaded, I think it is important to mention the following:

I had a kernel whose performance I couldn't improve further due to dependencies, so I decided to duplicate it, and then (if possible) triplicate it, by creating other kernels with the same functionality (with minor differences to ensure overall synchronization).

The duplicated version fits on the a10gx and its fitter summary is the following:


+--------------------------------------------------------------------------+
; Fitter Summary                                                           ;
+-----------------------------+--------------------------------------------+
; Fitter Status               ; Successful - Wed Oct  4 20:51:46 2017      ;
; Quartus Prime Version       ; 16.1.2 Build 203 01/18/2017 SJ Pro Edition ;
; Revision Name               ; top                                        ;
; Top-level Entity Name       ; top                                        ;
; Family                      ; Arria 10                                   ;
; Device                      ; 10AX115S2F45I1SG                           ;
; Timing Models               ; Final                                      ;
; Logic utilization (in ALMs) ; 192,778 / 427,200 ( 45 % )                 ;
; Total registers             ; 495991                                     ;
; Total pins                  ; 173 / 960 ( 18 % )                         ;
; Total virtual pins          ; 0                                          ;
; Total block memory bits     ; 9,621,616 / 55,562,240 ( 17 % )            ;
; Total RAM Blocks            ; 1,502 / 2,713 ( 55 % )                     ;
; Total DSP Blocks            ; 469 / 1,518 ( 31 % )                       ;
; Total HSSI RX channels      ; 8 / 72 ( 11 % )                            ;
; Total HSSI TX channels      ; 8 / 72 ( 11 % )                            ;
; Total PLLs                  ; 78 / 144 ( 54 % )                          ;
+-----------------------------+--------------------------------------------+

Then I decided to triplicate it, and that's when I got the "Placement cannot find a legal solution" error message.

All the logs attached previously correspond to this triplication attempt.

For comparison, here is the fitter summary of the triplicated design showing, as you mentioned, the strange < 1% logic utilization:


+--------------------------------------------------------------------------+
; Fitter Summary                                                           ;
+-----------------------------+--------------------------------------------+
; Fitter Status               ; Failed - Tue Oct 10 21:21:23 2017          ;
; Quartus Prime Version       ; 16.1.2 Build 203 01/18/2017 SJ Pro Edition ;
; Revision Name               ; top                                        ;
; Top-level Entity Name       ; top                                        ;
; Family                      ; Arria 10                                   ;
; Device                      ; 10AX115S2F45I1SG                           ;
; Timing Models               ; Final                                      ;
; Logic utilization (in ALMs) ; 136 / 427,200 ( < 1 % )                    ;
; Total registers             ; 685413                                     ;
; Total pins                  ; 173 / 960 ( 18 % )                         ;
; Total virtual pins          ; 0                                          ;
; Total block memory bits     ; 12,651,216 / 55,562,240 ( 23 % )           ;
; Total RAM Blocks            ; 134 / 2,713 ( 5 % )                        ;
; Total DSP Blocks            ; 696 / 1,518 ( 46 % )                       ;
; Total HSSI RX channels      ; 8 / 72 ( 11 % )                            ;
; Total HSSI TX channels      ; 8 / 72 ( 11 % )                            ;
; Total PLLs                  ; 78 / 144 ( 54 % )                          ;
+-----------------------------+--------------------------------------------+

Following similar reasoning, I think there might be an overutilization of RAM as well.

Does this help to get a better idea? Thank you!
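
The suspicion that the failed summary is misreporting its counts can be checked with a little arithmetic. The sketch below uses only figures from the failed fitter summary, plus one assumption that is not in the thread: that each Arria 10 ALM holds four registers. The bits per RAM block are derived from the device totals in the report itself. Registers may also sit outside ALMs (for example in DSP or I/O blocks), so the ALM lower bound is only rough.

```python
import math

# Values copied from the failed (triplicated) fitter summary.
reported_alms = 136
total_registers = 685_413
reported_ram_blocks = 134
block_memory_bits = 12_651_216

# Device capacity from the same summary (10AX115S2F45I1SG).
device_ram_blocks = 2_713
device_memory_bits = 55_562_240

# Assumption (not stated in the thread): four registers per ALM.
REGISTERS_PER_ALM = 4

# Bits per RAM block, derived from the device totals in the report.
bits_per_block = device_memory_bits // device_ram_blocks

# Lower bounds implied by the register and memory-bit counts.
min_alms = math.ceil(total_registers / REGISTERS_PER_ALM)
min_ram_blocks = math.ceil(block_memory_bits / bits_per_block)

alm_count_plausible = reported_alms >= min_alms
ram_count_plausible = reported_ram_blocks >= min_ram_blocks

print(f"ALMs: reported {reported_alms:,}, at least {min_alms:,} needed for the registers")
print(f"RAM blocks: reported {reported_ram_blocks:,}, at least {min_ram_blocks:,} needed for the bits")
print(f"ALM count plausible: {alm_count_plausible}, RAM count plausible: {ram_count_plausible}")
```

Under these assumptions, neither the reported ALM count nor the reported RAM block count can hold the registers and memory bits listed in the same report, which supports the idea that those two rows are not trustworthy in the failed run.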

Altera_Forum · ‎10-12-2017

I see. Based on what you describe, even though a linear extrapolation would give the impression that three kernel copies should also fit, that does not seem to be the case, and both logic and RAM seem to be getting overutilized. Assuming that it is possible for you to decouple memory accesses from compute in your application and put them in different kernels connected via channels, I recommend converting the compute part to an autorun kernel and then replicating it using the num_compute_units attribute (it behaves differently here than when it is used with NDRange kernels). In my experience, replicating single work-item autorun kernels using num_compute_units results in very small replication overhead.
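
The linear extrapolation mentioned above can be written out as a small sketch. It scales the whole duplicated design by 3/2, which is a simplifying assumption: the fixed part of the design (such as the BSP) does not grow with the number of kernel copies, so under a strictly linear model this overestimates. The `extrapolate` helper is a hypothetical name used only for illustration.

```python
# Fitter results for the duplicated design (two kernel copies): (used, available).
duplicated = {
    "ALMs": (192_778, 427_200),
    "RAM blocks": (1_502, 2_713),
    "DSP blocks": (469, 1_518),
}


def extrapolate(used, copies_from=2, copies_to=3):
    """Scale usage linearly with the number of kernel copies (hypothetical helper)."""
    return used * copies_to / copies_from


projected = {
    name: (extrapolate(used), total) for name, (used, total) in duplicated.items()
}

for name, (used, total) in projected.items():
    print(f"{name}: {used:,.0f} / {total:,} ({100 * used / total:.0f} %)")
```

All three projections stay below the device capacity, which is why three copies look like they should fit. For comparison, the 17.0 build of the triplicated design reported later in this thread used 2,668 RAM blocks, well above the projected 2,253, so RAM usage did not scale linearly here.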

Altera_Forum · ‎10-13-2017

Before trying the "num_compute_units" attribute, I tried building again, but this time with AOC 17.0.

The only modification to the source code was in the channel function calls, since from v17 they are named "write_channel_intel" instead of "write_channel_altera".
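
For a larger code base, this rename could be scripted. The sketch below is a minimal example and not an official migration tool. Only the write_channel_altera to write_channel_intel rename is confirmed in this post; treating read_channel_* and the non-blocking *_nb_* variants the same way is an assumption, and `migrate_channel_calls` is a hypothetical helper name.

```python
import re

# Confirmed by the post: write_channel_altera became write_channel_intel in v17.0.
# Assumption: read_channel_* and the non-blocking *_nb_* variants follow the same rename.
CHANNEL_CALL = re.compile(r"\b((?:read|write)_channel(?:_nb)?)_altera\b")


def migrate_channel_calls(source: str) -> str:
    """Replace the _altera suffix of channel calls with _intel (hypothetical helper)."""
    return CHANNEL_CALL.sub(r"\1_intel", source)


sample = "write_channel_altera(c_out, x); int y = read_channel_altera(c_in);"
migrated = migrate_channel_calls(sample)
print(migrated)
```

The word boundaries keep the pattern from touching unrelated identifiers that merely contain one of these names as a substring.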

The .aocx binary was produced successfully, and the utilization of my "triplicated" design was the following:


+--------------------------------------------------------------------------+
; Fitter Summary                                                           ;
+-----------------------------+--------------------------------------------+
; Fitter Status               ; Successful - Thu Oct 12 23:57:12 2017      ;
; Quartus Prime Version       ; 17.0.0 Build 290 04/26/2017 SJ Pro Edition ;
; Revision Name               ; top                                        ;
; Top-level Entity Name       ; top                                        ;
; Family                      ; Arria 10                                   ;
; Device                      ; 10AX115S2F45I1SG                           ;
; Timing Models               ; Final                                      ;
; Logic utilization (in ALMs) ; 234,954 / 427,200 ( 55 % )                 ;
; Total registers             ; 581181                                     ;
; Total pins                  ; 173 / 960 ( 18 % )                         ;
; Total virtual pins          ; 0                                          ;
; Total block memory bits     ; 12,777,548 / 55,562,240 ( 23 % )           ;
; Total RAM Blocks            ; 2,668 / 2,713 ( 98 % )                     ;
; Total DSP Blocks            ; 648 / 1,518 ( 43 % )                       ;
; Total HSSI RX channels      ; 8 / 72 ( 11 % )                            ;
; Total HSSI TX channels      ; 8 / 72 ( 11 % )                            ;
; Total PLLs                  ; 78 / 144 ( 54 % )                          ;
+-----------------------------+--------------------------------------------+

Perhaps there is a problem with the tool ... ?

Altera_Forum · ‎10-13-2017

Your kernel is very close to overutilizing the Block RAMs even with v17.0. I did a quick test on v17.0 some time ago and noticed that it seems to reduce Block RAM utilization compared to v16.1.2 for the same kernel; this is probably the reason why your kernel does not fit with v16.1.2 but fits with v17.0. However, the area reduction might be due to Altera slimming down the BSP in v17.0, which leaves more room for the kernel to be placed on the FPGA, rather than to an improvement in the OpenCL compiler. In my experience, v17.0 unfortunately decreases performance for the same kernel by up to 30% compared to v16.1.2, so you might want to check the actual performance you achieve before switching to this version.
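
To see how little headroom is left, the "used / available" rows of a fitter summary can be parsed directly. The sketch below is one way to do that, assuming the row layout shown in the summaries in this thread; `parse_usage` is a hypothetical helper, and the sample rows are copied from the v17.0 summary.

```python
import re

# Matches rows such as "; Total RAM Blocks  ; 2,668 / 2,713 ( 98 % )  ;".
USAGE_ROW = re.compile(
    r"^;\s*(?P<name>[^;]+?)\s*;\s*(?P<used>[\d,]+)\s*/\s*(?P<total>[\d,]+)\s*\("
)


def parse_usage(summary: str) -> dict:
    """Return {resource name: (used, available)} for rows of the form 'used / total'."""
    usage = {}
    for line in summary.splitlines():
        match = USAGE_ROW.match(line.strip())
        if match:
            used = int(match["used"].replace(",", ""))
            total = int(match["total"].replace(",", ""))
            usage[match["name"]] = (used, total)
    return usage


# Rows copied from the v17.0 fitter summary of the triplicated design.
summary_17_0 = """\
; Logic utilization (in ALMs) ; 234,954 / 427,200 ( 55 % )                 ;
; Total RAM Blocks            ; 2,668 / 2,713 ( 98 % )                     ;
; Total DSP Blocks            ; 648 / 1,518 ( 43 % )                       ;
"""

usage = parse_usage(summary_17_0)
for name, (used, total) in usage.items():
    print(f"{name}: {total - used:,} free ({100 * used / total:.1f} % used)")
```

With the v17.0 numbers, only 45 of the 2,713 RAM blocks remain free, so even a small increase in RAM demand could make placement fail again.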