Hi. I was implementing a really simple dense matrix-vector multiplication, with 32 blocks each computing a partial sum of A*x and one block computing the final sum of the partial sums.
Obviously, there is a dependency between the final-sum block and the other 32 blocks. What I don't understand is why the 32 blocks are also serialized with respect to each other, since there is no dependency between any of them.
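To make the structure concrete, here is a much-simplified sketch of the kernel (only two of the 32 partial-sum blocks are shown, the float8 vectorization is omitted, and names like N, A0, A1 and the partial buffers are illustrative rather than the exact code in the attached file):
#define N 1024  // illustrative matrix dimension (not the real value)

__kernel void mvm(__global const float *restrict A0,
                  __global const float *restrict A1,
                  /* ... A2 through A31, one buffer per HBM pseudo-channel ... */
                  __global const float *restrict x,
                  __global float *restrict y)
{
    float partial0[N]; // partial result of block 0
    float partial1[N]; // partial result of block 1

    // Partial-sum block for bank 0 (one of the 32 parallel blocks, e.g. k0.B6)
    for (int i = 0; i < N; i++)
    {
        float acc = 0.0f;
        for (int j = 0; j < N / 32; j++)
            acc += A0[i * (N / 32) + j] * x[0 * (N / 32) + j];
        partial0[i] = acc;
    }

    // Partial-sum block for bank 1 (e.g. k0.B10); 30 more such blocks follow in the real kernel
    for (int i = 0; i < N; i++)
    {
        float acc = 0.0f;
        for (int j = 0; j < N / 32; j++)
            acc += A1[i * (N / 32) + j] * x[1 * (N / 32) + j];
        partial1[i] = acc;
    }

    // Final-sum block (k0.B134): reduce the partial results into y
    for (int i = 0; i < N; i++)
        y[i] = partial0[i] + partial1[i]; // + partial2[i] + ... in the real kernel
}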
Going into the code, the 32 parallel blocks are
k0.B6 (98), k0.B10 (110), ..., k0.B130 (470)
and the serial block is
k0.B134 (483).
The AOCL report does mention the serial execution:
"Iteration executed serially across k0.B6, k0.B10, k0.B14, ... , k0.B130, k0.B134. Only a single loop iteration will execute inside this region due to memory dependency:
From: Load Operation (hbm_boardtest.cl: 102)
To: Load Operation (hbm_boardtest.cl: 485)"
which is true for B134, but B6 through B130 should be able to execute in parallel.
I am using AOCL 19.4 Build 64, targeting Stratix 10 (1SM21BHU2F53E2VGS1, s10mx_ref:s10mx_hbm_es), with the command "aoc -board=s10mx_hbm_es -fp-relaxed device/hbm_boardtest.cl".
I am attaching my kernel file.
Could anyone help with how to solve this problem?
Thank you.
The way you have described your design right now, the 32 loop nests will be pipelined rather than parallelized, and since the outer loop is not pipelined, the loop nests will also execute one by one. If you want those 32 loop nests to run in parallel, you should describe them as a single loop nest wrapped inside another loop with a trip count of 32 that is fully unrolled; in that case, you will get 32 parallel blocks.

However, in your case it is not straightforward to construct the code like that: due to the genius way HBM works on Intel FPGAs (and apparently also Xilinx), where there is no interleaving, you are forced to allocate 32 buffers with different names, and buffers with different names cannot be addressed from a loop (a problem that somehow does not exist on GPUs, which have been using HBM for 3-4 years now). One possible solution I can think of is to construct your code as described above and use a large switch-case block inside the unrolled loop to map each iteration of the unrolled loop to one of the differently-named buffers, like this:
#pragma unroll
for (int U = 0; U < 32; U++)
{
    for (int i = 0; i < matrix_size / 1; i++)
    {
        // ...
        union uf8 local_A;

        // U is a compile-time constant in each unrolled copy, so the switch
        // selects exactly one buffer per copy
        switch (U)
        {
            case 0:
                local_A.f8 = A0[i * matrix_size / 8 / 32 + j];
                break;
            case 1:
                local_A.f8 = A1[i * matrix_size / 8 / 32 + j];
                break;
            // ... cases 2 through 31 ...
        }
        // ...
    }
}
Hopefully the compiler will be smart enough to create just one memory port per block in this case (and optimize out the rest), rather than 32 ports per block with a multiplexer at the end...
If this doesn't work, another option is a multi-kernel design in which each of the 32 blocks gets its own kernel, with one kernel handling the memory reads and one kernel performing the final reduction and memory writes. You can probably leverage the autorun kernel type to implement the 32 compute kernels with minimal code size. Of course, a multi-kernel design with blocking channels will incur a large area overhead if you also want to use the Hyper-Optimized Handshaking optimization (another great feature of Stratix 10).
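To give an idea of what that could look like, here is a rough sketch (the channel names, depths, data types and the placeholder math are illustrative assumptions, not a drop-in implementation):
#pragma OPENCL EXTENSION cl_intel_channels : enable

// Illustrative channel arrays: one channel per compute unit / HBM bank
channel float8 a_ch[32] __attribute__((depth(64)));
channel float  partial_ch[32] __attribute__((depth(8)));

// 32 replicated autorun compute kernels; each replica only touches its own
// pair of channels, selected by its compile-time compute ID
__attribute__((max_global_work_dim(0)))
__attribute__((autorun))
__attribute__((num_compute_units(32)))
__kernel void compute()
{
    const int id = get_compute_id(0);
    while (1)
    {
        float8 a = read_channel_intel(a_ch[id]);
        // ... multiply with the matching slice of x and accumulate ...
        float partial = a.s0 + a.s1 + a.s2 + a.s3 + a.s4 + a.s5 + a.s6 + a.s7; // placeholder math
        write_channel_intel(partial_ch[id], partial);
    }
}

// Not shown: a regular "reader" kernel that takes the 32 HBM buffers as
// arguments and feeds a_ch[0..31], and a regular "reducer" kernel that reads
// partial_ch[0..31], sums the results and writes them back to global memory.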
P.S. What Stratix 10 MX board is this that already supports OpenCL? Is it Bittware's board?
Using a switch plus the unroll pragma for the different HBM ports has worked perfectly. I am planning to recommend this coding style to my group. Thanks so much for your advice :)
Reply to your P.S.: it is called the S10MX ES ("early silicon") version, received from Intel (not sure about Bittware).