Solved: The performance of the loop

Kilinc__Gorkem · ‎01-13-2019

Hello,

I want to element-wise multiply two arrays of unsigned integers on DE-10 Standard SoC board by Terasic . When I write my kernel as such, I don't think that AOC compiles it the most efficient way:

kernel void test_builtin (global unsigned int * restrict in1, global unsigned int * restrict in2, global unsigned int * restrict out) {

int i = get_global_id(0);

unsigned int x = in1;

unsigned int y = in2;

out = x * y;

}

In the report file I see that the line corresponding to multiplication uses 2 DSPs. However, as far as I know, DSP utilizes sequential logic[1] whereas it is possible to fully implement the multiplication just by combinational logic. For this reason, I implemented unsigned integer multiplication as simple as possible in Verilog:

`timescale 1 ps / 1 ps

module mul_uint (
input   clock,
input   resetn,
input   ivalid,
input   iready,
output ovalid,
output oready,
input   [31:0] datainA,
input   [31:0] datainB,
output [31:0] dataout);

assign ovalid = 1'b1;
assign oready = 1'b1;
// clk, ivalid, iready, resetn are ignored
assign dataout = ( datainB[0] ? datainA << 0 : 0) +
( datainB[1] ? datainA << 1 : 0) +
( datainB[2] ? datainA << 2 : 0) +

// 28 lines of code ...

( datainB[31] ? datainA << 31 : 0);

endmodule

Following the steps shown in the example code[2] I included this code as library function as described by the programming guide[3] . Now that I have a multiplication function which takes zero clock cycles to operate, I would expect this function to work faster than the standard way of AOC which utilizes 2 DSPs. Using the host code provided as the example code as template, my kernel function looks like this:

unsigned int mul_uint(unsigned int x, unsigned int y);

// Using HDL library components

kernel void test_lib (global unsigned int * restrict in1, global unsigned int * restrict in2, global unsigned int * restrict out, int N) {

int i = get_global_id(0);

ulong j = mul_uint(i,N);

for (int k = 0; k < N; k++) {

unsigned int x = in1[j + k];

unsigned int y = in2[j + k];

out[j + k] = mul_uint( x , y );

}

// Using identical (in function and implementation) built-in components

kernel void test_builtin (global unsigned int * restrict in1, global unsigned int * restrict in2, global unsigned int * restrict out, int N) {

int i = get_global_id(0);

ulong j = i*N;

for (int k = 0; k < N; k++) {

unsigned int x = in1[j + k];

unsigned int y = in2[j + k];

out[j + k] = x * y;

}

Surprisingly test_built_in works as fast as test_lib for N = 0, which gives the fastest computation. For other values of N I have not observed an obvious pattern about which kernel works faster. I thought including an RTL might have overhead, is that true? Why can't the simplest possible combinational multiplication beat two DSP implementation?

Thanks in advance

Mark_H_Intel1 · ‎01-16-2019

The performance of the loop is ultimately limited by the need to read input value from memory, and to write back output values.

That means the Initiation Interval of the loop must be at least 1 - the compiled hardware cannot launch loop iterations more than one per cycle (unrolling the loop can change this).

Although I'm not familiar with the Cyclone V DSP block, it looks like there is a single layer of registering - that 32-bit multiplication will take 2 cycles if the registers of both DSP blocks are used.

That will not prevent the loop from having an II of 1, as it is possibly to feed the first DSP block with new values at the same time that its contribution to the multiplication result is passed to the second DSP block.

That means there will be no overall time difference between a loop that uses a zero-latency multiplier and one that has a latency of 1 cycle. If the multiplier has a latency of 2 cycles the total time for N iterations of the loop will take 1 extra cycle.

This is the power of OpenCL and HLS - the compiler builds a computational pipeline that produces one result per cycle (as long as all the loop IIs are 1), regardless of how complex the computation might be.

The timing variations between values of N are probably due to the interactions with memory - which is a resource shared between the HPS and FPGA fabric. When N=0 you have no memory interactions in the OpenCL kernel.

In general large combinational logic structures such as you've written in the Verilog will limit the ability of Quartus to lay out the logic for a good fMax - putting registers in the right places is the key to getting a good operating frequency (and this is one of the great strengths of the OpenCL compiler).

Try creating a design that has only the DSP based multiplier (no test_lib kernel) and compare its operating frequency against that of the design you've described here.

Mark

View solution in original post

Mark_H_Intel1 · ‎01-16-2019