Honored Contributor I

Why does this design not fit and how to reason about congestion

Hi all, 

 

I'm trying to optimize a simple program that computes a correlation matrix. To promote reuse and parallelism, the design proceeds in blocks. However, I cannot get the design to route for larger block sizes (> 12).

 

I do not fully understand why it does not fit. Here is the code: 

 

#define NR_RECEIVERS 576
#define NR_BASELINE 166176
#define NR_SAMPLES_PER_CHANNEL 1024
#define NR_CHANNELS 64
#define NR_POLARIZATIONS 2
#define COMPLEX 2
#define BLOCK_SIZE_X 8
#define BLOCK_SIZE (BLOCK_SIZE_X * BLOCK_SIZE_X)
#define NR_BLOCK_X (NR_RECEIVERS / BLOCK_SIZE_X)
#define NR_BLOCKS (NR_BLOCK_X * (NR_BLOCK_X + 1)) / 2
#define OUTSIZE NR_BLOCKS * BLOCK_SIZE

typedef signed char int8_t;
typedef float2 fcomplex;
typedef fcomplex InputType/**/;
typedef fcomplex OutputType/**/;

fcomplex __attribute__((__overloadable__,__always_inline__,const)) mulConj(fcomplex a, fcomplex b) {
    return (float2)(a.x * b.x + a.y * b.y, a.y * b.x - a.x * b.y);
}

__kernel __attribute__((reqd_work_group_size(1,1,1))) __attribute__((max_global_work_dim(0)))
void Correlator(__global OutputType *restrict output, const __global volatile InputType *restrict input) {
    int blockx = 0;
    int blocky = 0;
    for (int baselineBlock = 0; baselineBlock < NR_BLOCKS; baselineBlock++) {
        fcomplex sums;
        #pragma unroll
        for (int i = 0; i < BLOCK_SIZE_X; i++) {
            #pragma unroll
            for (int j = 0; j < BLOCK_SIZE_X; j++) {
                sums = (fcomplex)0;
            }
        }
        for (int time = 0; time < NR_SAMPLES_PER_CHANNEL; time++) {
            fcomplex memx;
            fcomplex memy;
            #pragma unroll
            for (int i = 0; i < BLOCK_SIZE_X; i++) {
                memy = (*input);
            }
            #pragma unroll
            for (int i = 0; i < BLOCK_SIZE_X; i++) {
                memx = (*input);
            }
            #pragma unroll
            for (int i = 0; i < BLOCK_SIZE_X; i++) {
                #pragma unroll
                for (int j = 0; j < BLOCK_SIZE_X; j++) {
                    sums += mulConj(memx, memy);
                }
            }
        }
        #pragma unroll
        for (int i = 0; i < BLOCK_SIZE_X; i++) {
            #pragma unroll
            for (int j = 0; j < BLOCK_SIZE_X; j++) {
                (*output) = sums;
            }
        }
        blocky += BLOCK_SIZE_X;
        if (blocky > blockx) {
            blockx += BLOCK_SIZE_X;
            blocky = 0;
        }
    }
}

For block size 16, the compiler estimate (--report) suggests that it should fit, with estimates like this: 

+----------------------------------------+---------------------------+
; Estimated Resource Usage Summary                                   ;
+----------------------------------------+---------------------------+
; Resource                               ; Usage                     ;
+----------------------------------------+---------------------------+
; Logic utilization                      ; 40%                       ;
; ALUTs                                  ; 11%                       ;
; Dedicated logic registers              ; 28%                       ;
; Memory blocks                          ; 14%                       ;
; DSP blocks                             ; 67%                       ;
+----------------------------------------+---------------------------+

 

At large block sizes, quartus_sh_compile.log reports high congestion in the design. I assumed that this was due to the large amount of reuse in the design: for a block size of 16, the 16 + 16 loaded values feed 16 x 16 = 256 multiply-accumulate units, so every value in memx or memy is read 16 times.

 

To combat this effect, I redesigned the kernel as a systolic-array-like design, in which data is shifted through a grid in both the x and y directions and each value is read only once (instead of 16 times):

#define NR_RECEIVERS 576
#define NR_BASELINE 166176
#define NR_SAMPLES_PER_CHANNEL 1024
#define NR_CHANNELS 64
#define NR_POLARIZATIONS 2
#define COMPLEX 2
#define BLOCK_SIZE_X 16
#define BLOCK_SIZE_Y (BLOCK_SIZE_X / 2)
#define BLOCK_SIZE (BLOCK_SIZE_X * BLOCK_SIZE_X)
#define NR_BLOCK_X (NR_RECEIVERS / BLOCK_SIZE_X)
#define NR_BLOCKS (NR_BLOCK_X * (NR_BLOCK_X + 1)) / 2
#define OUTSIZE ((NR_RECEIVERS * (NR_RECEIVERS + 1)) / 2)

typedef signed char int8_t;
typedef float2 fcomplex;
typedef fcomplex InputType/**/;
// Write non-contiguous:
// typedef fcomplex OutputType/**/;
// Write contiguous:
typedef fcomplex OutputType/**/;

fcomplex __attribute__((__overloadable__,__always_inline__,const)) mulConj(fcomplex a, fcomplex b) {
    return (float2)(a.x * b.x + a.y * b.y, a.y * b.x - a.x * b.y);
}

__kernel __attribute__((reqd_work_group_size(1,1,1))) __attribute__((max_global_work_dim(0)))
void Correlator(__global OutputType *restrict output, const __global volatile InputType *restrict input) {
    int blockx = 0;
    int blocky = 0;
    #pragma unroll 1
    for (int baselineBlock = 0; baselineBlock < NR_BLOCKS; baselineBlock++) {
        fcomplex sums;
        #pragma unroll
        for (int i = 0; i < BLOCK_SIZE_X; i++) {
            #pragma unroll
            for (int j = 0; j < BLOCK_SIZE_X; j++) {
                sums = (fcomplex)0;
            }
        }
        fcomplex bufx;
        #pragma unroll
        for (int i = 0; i < 2 * BLOCK_SIZE - 1; i++) {
            #pragma unroll
            for (int j = 0; j < BLOCK_SIZE; j++) {
                bufx = (fcomplex)0;
            }
        }
        fcomplex bufy;
        #pragma unroll
        for (int i = 0; i < BLOCK_SIZE; i++) {
            #pragma unroll
            for (int j = 0; j < 2 * BLOCK_SIZE - 1; j++) {
                bufy = (fcomplex)0;
            }
        }
        #pragma unroll 1
        for (int time = 0; time < NR_SAMPLES_PER_CHANNEL; time++) {
            #pragma unroll
            for (int j = 0; j < BLOCK_SIZE; j++) {
                #pragma unroll
                for (int i = 2 * BLOCK_SIZE - 2; i >= 1; i--) {
                    bufx = bufx;
                }
            }
            #pragma unroll
            for (int i = 0; i < BLOCK_SIZE; i++) {
                #pragma unroll
                for (int j = 2 * BLOCK_SIZE - 2; j >= 1; j--) {
                    bufy = bufy;
                }
            }
            #pragma unroll
            for (int i = 0; i < BLOCK_SIZE; i++) {
                bufx = (*input);
            }
            #pragma unroll
            for (int i = 0; i < BLOCK_SIZE; i++) {
                bufy = (*input);
            }
            #pragma unroll
            for (int i = 0; i < BLOCK_SIZE_X; i++) {
                #pragma unroll
                for (int j = 0; j < BLOCK_SIZE_Y; j++) {
                    float2 a = bufx;
                    float2 b = bufy;
                    sums += (float2)(a.x * b.x + a.y * b.y, a.y * b.x - a.x * b.y);
                }
            }
        }
        #pragma unroll
        for (int i = 0; i < BLOCK_SIZE_X; i++) {
            #pragma unroll
            for (int j = 0; j < BLOCK_SIZE_X; j++) {
                (*output) = sums;
            }
        }
        blocky += BLOCK_SIZE_X;
        if (blocky > blockx) {
            blockx += BLOCK_SIZE_X;
            blocky = 0;
        }
    }
}

 

 

This gives estimates similar to before. However, it also fails to fit for larger block sizes (> 12), just like the earlier design. The log states: Cannot place the following 18 DSP cells -- a legal placement which satisfies all the DSP requirements could not be found.

 

I do not understand why. Why do these designs have high congestion, how can I reason about it, and how can I make a larger block size fit? Is it simply unrealistic to use more than 50% of the DSP blocks?

 

Any ideas appreciated!
7 Replies
Honored Contributor I

I don't know, but you might want to take a look at the routing resources; it could be that most of the interconnect is used up and there is none left to connect the DSPs. This is just a guess, though, so it could be something completely different.

Honored Contributor I

Thanks for the suggestion. What do you mean by taking a look at the routing resources? Where can I find this information? I'm a bit of a beginner here. Thanks in advance!

Honored Contributor I

Altera's estimation can be very wrong at times, especially on Arria 10. 

 

If your design is failing during placement, then it means the design is too big to fit on the FPGA, even though in my experience, AOC's DSP usage estimation is generally correct. 

 

If, however, your design is failing during routing due to congestion, with an area utilization that is not particularly high, then there isn't much that can be done about it except redesigning the kernel. On Arria 10, since OpenCL uses Partial Reconfiguration through PCI-E by default, routing fails very regularly due to congestion in the region that connects the static block to the dynamic block (I have encountered this far too many times to count). I have never gone over 60% DSP utilization on Arria 10, though (logic and RAM are usually over-utilized long before DSP in my experience), so that might also be a contributing factor.

On Stratix V, however, I have encountered routing failure due to congestion only a few times, and that was because my design was bad (too many non-coalesced reads and writes from the same local buffer). 

 

Can you provide the quartus_sh_compile.log and also mention your target device and the version of Quartus/AOC you are using? 

 

P.S. Your first kernel looks relatively straightforward and data-parallel; you might have better luck if you implement it as NDRange. In particular, spatial blocking using local buffers that are implemented as standard multi-ported memory buffers on the FPGA (in contrast to, for example, shift registers) seems to work better in NDRange kernels in my [limited] experience.
Honored Contributor I

Thanks for the input. I've attached the quartus_sh_compile.log for the second kernel below. I'm afraid I've deleted it for the first kernel, but I will rerun that tonight and send it later.

 

I'm using an Arria 10 with Quartus 16.0 (no 16.1 yet for Nalla board). 

 

What exactly is the definition of congestion here? How do I prevent it? 

 

I'll also try an NDRange kernel; the problem with that is that dividing the computation requires some extra computations involving square roots.

 

Thanks in advance for the help!
Honored Contributor I

This report shows that failure is occurring during fitting due to lack of DSPs (or maybe DSPs that match placement requirements). Does the compiler also generate the file named top.fit.summary in this case? That report should make it clear whether you are really running out of DSPs or not (even though I regularly see negative or overflown numbers in that file). 

 

With block size 16 for your first code, I get this with Quartus 16.1.2 and Arria 10: 

 

+----------------------------------------+---------------------------+
; Estimated Resource Usage Summary                                   ;
+----------------------------------------+---------------------------+
; Resource                               ; Usage                     ;
+----------------------------------------+---------------------------+
; Logic utilization                      ; 65%                       ;
; ALUTs                                  ; 31%                       ;
; Dedicated logic registers              ; 35%                       ;
; Memory blocks                          ; 40%                       ;
; DSP blocks                             ; 90%                       ;
+----------------------------------------+---------------------------+

 

I would expect this to be very hard to fit on the FPGA, especially since the logic utilization is probably underestimated. 

Your second code takes a very long time in the first stage of compilation with a block size of 16, so I gave up on it, but the estimate seems to report a lower number of DSPs for a block size of 8 compared to the first code.

 

Both codes seem to compile very nicely, with all loops being fully pipelined with an II of one. Based on the system overview from Quartus v16.1.2, the system has 4 fully-coalesced memory ports and a low-latency pipeline (100-200 clocks). I guess this case is simply a problem of high area utilization. You could reduce your area overhead by converting some of your less important fully-unrolled loops to partially-unrolled ones.
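To illustrate that last suggestion (a hypothetical fragment, not taken from the posted kernels): giving the unroll pragma an explicit factor replicates the loop body only that many times, trading throughput for area:

```c
// Fully unrolled: BLOCK_SIZE_X parallel copies of the body.
#pragma unroll
for (int i = 0; i < BLOCK_SIZE_X; i++) { /* ... */ }

// Partially unrolled: only 4 copies of the body, so roughly 4x less
// logic/DSP, at the cost of BLOCK_SIZE_X/4 iterations per sample.
#pragma unroll 4
for (int i = 0; i < BLOCK_SIZE_X; i++) { /* ... */ }
```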
Honored Contributor I

The top.fit.summary (good to know about, by the way!) confirms what you say: the estimate is simply off. A block size of 16 will not fit, as it requires 101% of the DSP blocks:

 

Fitter Status                  : Failed - Mon May 1 13:24:51 2017
Quartus Prime Version          : 16.0.0 Build 211 04/27/2016 SJ Pro Edition
Revision Name                  : top
Top-level Entity Name          : top
Family                         : Arria 10
Device                         : 10AX115N3F40E2SG
Timing Models                  : Final
Logic utilization (in ALMs)    : 257,659 / 427,200 ( 60 % )
Total registers                : 468262
Total pins                     : 288 / 826 ( 35 % )
Total virtual pins             : 0
Total block memory bits        : 4,183,118 / 55,562,240 ( 8 % )
Total RAM Blocks               : 227 / 2,713 ( 8 % )
Total DSP Blocks               : 1,536 / 1,518 ( 101 % )
Total HSSI RX channels         : 8 / 48 ( 17 % )
Total HSSI TX channels         : 8 / 48 ( 17 % )
Total PLLs                     : 18 / 112 ( 16 % )

 

Thanks for the help! Next time, I'll check the actual figures (instead of the estimates) for a smaller block size first.
Honored Contributor I

Yeah, that settles it. Time to pre-order a Stratix 10 board, LOL! :D
