Data Parallelization in FPGA

Kilinc__Gorkem · ‎01-17-2019

Hello,

I am working on DE10 Standard SoC Board from Terasic (Cyclone V FPGA). I have a simple kernel for addition of two floats:

__kernel void vector_add(__global const float *x,
                         __global const float *y,
                         __global float *restrict z)
{
    int index = get_global_id(0);
    z[index] = x[index] + y[index];
}

After compilation this kernel takes up less than 10% of FPGA resources for each type of them (LAB, RAM and DSP) . So I want to benefit from the unused resources by creating more than one pipelines. The suggested way of having data-parallelization on FPGA is either by using num_simd_work_items or num_compute_units pragmas. However, none of those gives a performance increase even though they replicate the same pipeline in different ways. On the other hand, when I create copies of the same kernel within the same program, share the data between them and run them from different command queues the performance increase indicates that they run in parallel and do a better job than using pragmas. To be more precise, my code looks like this:

__kernel void vector_add1(__global const float *x,
                         __global const float *y,
                         __global float *restrict z)
{
    int index = get_global_id(0);
    z[index] = x[index] + y[index];
}

__kernel void vector_add2(__global const float *x,
                         __global const float *y,
                         __global float *restrict z)
{
    int index = get_global_id(0);
    z[index] = x[index] + y[index];
}

__kernel void vector_add3(...

...

Increasing the number of kernels, after second kernel (vector_add2) total computation time remains constant. I am sure that this is not the way to achieve what I intend to do. But can you explain why I get slower computation using the pragmas? Especially for num_compute_units, what does it do if not what I did manually? I am aware of the frequency decrease as more resources are utilized but its effect should be minimal compared to something else I am missing (fmax is 121MHz for multiple compute units, 140MHz for simd and 146 MHz without data parallelization).

Thanks in advance.

Michael_C_Intel1 · ‎01-17-2019

*edit update for links that didn't resolve initially*

Hi GorkemK,

Thanks for posting this question.

If you can post the a minimally necessary example reproducer as an attachment to this article... that may help...

The kernel... host side launch... and compile flags you're using can be helpful. It could make it obvious for a forum watcher or Intel® FPGA user to identify where the gap is in scaling up the example kernel. Please just make sure nothing proprietary or privileged is included.

Note, I'm not an FPGA user... but the sample may help FPGA users familiar with the process provide some comments.

A couple of pointers to related collateral relevant for this post...
Section 1.3 of this pdf guide. It goes into the nature of one work item on FPGA, which is different from many other OpenCL™ devices. Also see section 6 (6.3) of the guide for a description of scaling up compute units.
This website also has a few source code examples of the features mentioned.
Karl Qi has a good introduction video to general FPGA programming and then dives into an OpenCL™ based overview at around to 30minute mark of this video.

-MichaelC

Kilinc__Gorkem · ‎01-17-2019

Hi Michael,

To be more clear I upload host and kernel files. I found it from examples in the website: https://www.intel.com/content/www/us/en/programmable/support/support-resources/design-examples/design-software/opencl/vector-addition.html . I made some changes on it. But I think it is not necessary reproduce what I did because my question is more generic than this specific case. The compilation command I use is: aoc vector_add.cl -o vector_add/vector_add.aocx -board=de10_standard_sharedonly -v -report -fp-relaxed -fpc . For different programs I always used the same flags.

Gorkem

Michael_C_Intel1 · ‎01-18-2019

*Updating this thread adding the appropriate hyperlinks to the earlier post*

Karl_Q_Intel · ‎01-18-2019

It surprised me that num_simd didn't help with performance. Did you verify from the HTML report that that the SIMD was actually applied? Did you use any other kernel attributes with simd? simd requires reqd_workgroup_size and has several other limitations.

Num compute unit may not have helped depending on the number of workgroups launched on the host side. And sometimes num compute unit also leads to undesirable memory access patterns so if your memory bandwidth is already fully utilized then that may be the problem..

Kilinc__Gorkem · ‎01-23-2019

Hi Karl,

I retried same computation with different kernels. Seems like I made a mistake about simd kernel previously. For 4194304 additions, simd kernel (16 num_simd_work_items, required work group size is (64,1,1) ) takes about 19 milliseconds whereas it takes 29 milliseconds when no kernel attribute is indicated. When num_compute_units is indicated the performance is not as good as plain kernel and the performance gets worse as the number is increased. However, when I compile more than one kernel within the program each of them being identical as I mentioned in my first message and use them I get an increase in performance, which I still can not explain. Basically those kernels must race for the limited bandwidth as in num_compute_units kernel. But still it takes about 20 milliseconds when I use two of them.

I have another question based on your answer. I don't understand how work group size can change the performance. Until now I observed that it can slightly change the results in some cases but considering that we use an offline compiler for FPGA what difference does it make if we change its value? Most of the time we have one pipeline in FPGA, but if we indicate num_compute_units to be n is it better to have n work groups ( local size = global size / n ) ?