OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU
Announcements
This forum covers OpenCL* for CPU only. OpenCL* for GPU questions can be asked in the GPU Compute Software forum. Intel® FPGA SDK for OpenCL™ questions can be ask in the FPGA Intel® High Level Design forum.
1663 Discussions

Including HDL code as library function in Intel OpenCL SDK for FPGA

Kilinc__Gorkem
Beginner
252 Views

Hello,

I want to element-wise multiply two arrays of unsigned integers on DE-10 Standard SoC board by Terasic . When I write my kernel as such, I don't think that AOC compiles it the most efficient way:

kernel void test_builtin (global unsigned int * restrict in1, global unsigned int * restrict in2, global unsigned int * restrict out) {

int i = get_global_id(0);

unsigned int x = in1;

unsigned int y = in2;

out = x * y;

}

In the report file I see that the line corresponding to multiplication uses 2 DSPs. However, as far as I know, DSP utilizes sequential logic[1] whereas it is possible to fully implement the multiplication just by combinational logic. For this reason, I implemented unsigned integer multiplication as simple as possible in Verilog:

`timescale 1 ps / 1 ps

module mul_uint (
  input   clock,
  input   resetn,
  input   ivalid,
  input   iready,
  output  ovalid,
  output  oready,
  input   [31:0]  datainA,
  input   [31:0]  datainB,
  output  [31:0]  dataout);

  assign  ovalid = 1'b1;
  assign  oready = 1'b1;
  // clk, ivalid, iready, resetn are ignored
  assign dataout = ( datainB[0] ? datainA << 0 : 0) +
  ( datainB[1] ? datainA << 1 : 0) +
  ( datainB[2] ? datainA << 2 : 0) +


  // 28 lines of code ...


  ( datainB[31] ? datainA << 31 : 0);

endmodule

Following the steps shown in the example code[2] I included this code as library function as described by the programming guide[3] . Now that I have a multiplication function which takes zero clock cycles to operate, I would expect this function to work faster than the standard way of AOC which utilizes 2 DSPs. Using the host code provided as the example code as template, my kernel function looks like this:

unsigned int mul_uint(unsigned int x, unsigned int y);

// Using HDL library components

kernel void test_lib (global unsigned int * restrict in1, global unsigned int * restrict in2, global unsigned int * restrict out, int N) {

int i = get_global_id(0);

ulong j = mul_uint(i,N);

for (int k = 0; k < N; k++) {

unsigned int x = in1[j + k];

unsigned int y = in2[j + k];

out[j + k] = mul_uint( x , y );

}

}

// Using identical (in function and implementation) built-in components

kernel void test_builtin (global unsigned int * restrict in1, global unsigned int * restrict in2, global unsigned int * restrict out, int N) {

int i = get_global_id(0);

ulong j = i*N;

for (int k = 0; k < N; k++) {

unsigned int x = in1[j + k];

unsigned int y = in2[j + k];

out[j + k] = x * y;

}

}

Surprisingly test_built_in works as fast as test_lib for N = 0, which gives the fastest computation. For other values of N I have not observed an obvious pattern about which kernel works faster. I thought including an RTL might have overhead, is that true? Why can't the simplest possible combinational multiplication beat two DSP implementation?

Thanks in advance

 

0 Kudos
1 Solution
Mark_H_Intel1
Employee
252 Views

The performance of the loop is ultimately limited by the need to read input value from memory, and to write back output values.

That means the Initiation Interval of the loop must be at least 1 - the compiled hardware cannot launch loop iterations more than one per cycle (unrolling the loop can change this). 

Although I'm not familiar with the Cyclone V DSP block, it looks like there is a single layer of registering - that 32-bit multiplication will take 2 cycles if the registers of both DSP blocks are used. 

That will not prevent the loop from having an II of 1, as it is possibly to feed the first DSP block with new values at the same time that its contribution to the multiplication result is passed to the second DSP block. 

That means there will be no overall time difference between a loop that uses a zero-latency multiplier and one that has a latency of 1 cycle. If the multiplier has a latency of 2 cycles the total time for N iterations of the loop will take 1 extra cycle.

This is the power of OpenCL and HLS - the compiler builds a computational pipeline that produces one result per cycle (as long as all the loop IIs are 1), regardless of how complex the computation might be. 

The timing variations between values of N are probably due to the interactions with memory - which is a resource shared between the HPS and FPGA fabric. When N=0 you have no memory interactions in the OpenCL kernel.

In general large combinational logic structures such as you've written in the Verilog will limit the ability of Quartus to lay out the logic for a good fMax - putting registers in the right places is the key to getting a good operating frequency (and this is one of the great strengths of the OpenCL compiler).

Try creating a design that has only the DSP based multiplier (no test_lib kernel) and compare its operating frequency against that of the design you've described here. 

Mark

 

View solution in original post

3 Replies
Mark_H_Intel1
Employee
253 Views

The performance of the loop is ultimately limited by the need to read input value from memory, and to write back output values.

That means the Initiation Interval of the loop must be at least 1 - the compiled hardware cannot launch loop iterations more than one per cycle (unrolling the loop can change this). 

Although I'm not familiar with the Cyclone V DSP block, it looks like there is a single layer of registering - that 32-bit multiplication will take 2 cycles if the registers of both DSP blocks are used. 

That will not prevent the loop from having an II of 1, as it is possibly to feed the first DSP block with new values at the same time that its contribution to the multiplication result is passed to the second DSP block. 

That means there will be no overall time difference between a loop that uses a zero-latency multiplier and one that has a latency of 1 cycle. If the multiplier has a latency of 2 cycles the total time for N iterations of the loop will take 1 extra cycle.

This is the power of OpenCL and HLS - the compiler builds a computational pipeline that produces one result per cycle (as long as all the loop IIs are 1), regardless of how complex the computation might be. 

The timing variations between values of N are probably due to the interactions with memory - which is a resource shared between the HPS and FPGA fabric. When N=0 you have no memory interactions in the OpenCL kernel.

In general large combinational logic structures such as you've written in the Verilog will limit the ability of Quartus to lay out the logic for a good fMax - putting registers in the right places is the key to getting a good operating frequency (and this is one of the great strengths of the OpenCL compiler).

Try creating a design that has only the DSP based multiplier (no test_lib kernel) and compare its operating frequency against that of the design you've described here. 

Mark

 

Kilinc__Gorkem
Beginner
252 Views

Thank you Mark,

 

I tried compiling DSP based multiplier on its own as you suggested. The result complies with your explanation. Quartus report files show that without library kernel fmax is 138 MHz, whereas fmax is 98 MHz when we include library kernel. Considering what you wrote about II, it seems like the only way to make a kernel run faster is optimizing its bottleneck or data-parallelizing it (either by simd or multiple compute unit pragmas).

 

Regards,

Gorkem 

Mark_H_Intel1
Employee
252 Views

With FPGAs the amount of data parallelism is limited only by the size of the device - and it's quite easy to express.

There are two really easy ways of getting wide data-parallelism in OpenCL:

1. use OpenCL vector data types (these are up to 16 elements wide, e.g. int16 is a 16-wide vector)

2. unroll the loop - using #pragma unroll directive

Putting #pragma unroll before a for loop tells the compiler to make hardware for each iteration of the loop. But... performance will be less than expected if the number of iterations is not a constant, or if there is any dependency between loop iterations.

If the compiler has inferred a loop carried dependency (maybe due to one iteration of the loop writing to an array which is used in other iterations), you can use #pragma ivdep to tell the compiler that despite appearances there are no loop-carried dependencies.

The way to go is probably:

1. use a task-kernel rather than an NDRange kernel
2. copy your data into fixed-size local arrays or vectors
3. use vector operations and/or unrolled loops

The sample code I'll write here is untested, as a suggested starting point for you. It's designed to operate on chunks of 64 bytes, which are generally efficient for main-memory transfers.

kernel void test_builtin (global unsigned int * restrict in1, global unsigned int * restrict in2, global unsigned int * restrict out, int N) {
int Nchunks = (N+15)/16;

int chunk, i;

unsigned int local_in1[16], local_in2[16], local_out[16];

int chunkstart = 0;

for (chunk = 0; chunk < Nchunks; chunk++) {

  #pragma unroll 16
  for (i = 0; i < 16; i++) {
    local_in1 = in1[chunkstart+i];
    local_in2 = in2[chunkstart+i];
  }

  #pragma unroll 16
  for (i = 0; i < 16; i++) {
    local_out = local_in1 + local_in2;
  }

  #pragma unroll 16
  for (i = 0; i < 16; i++) {
    out[chunksize+i] = local_out;
  }

}

Vector version (also untested) - in this case, an NDRange kernel operating on vectors of 16 uints

kernel void test_builtin (global uint16 * restrict in1, global uint16 * restrict in2, global uint16 * restrict out) {

int i = get_global_id(0);
uint16 x = in1;
uint16 y = in2;
out = x * y;
}

Reply