Hi,
Since the NDRange model is implemented as a work-item-based pipeline on the FPGA, if I understand it correctly, the maximum number of work items in flight should be determined by the complexity (i.e., the number of pipeline stages) of the kernel, right?
Take the following kernel code for example (from the beginning of Chapter 4 of the Intel "Best Practices" guide):
__kernel void add(__global int *a,
                  __global int *b,
                  __global int *c)
{
    int gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
}
The compiler generates a 3-stage pipeline for it:
1) Two Load units (load a and b simultaneously)
2) One Add unit
3) One Store unit
So for this 3-stage pipeline, at most 3 work items can be in flight, no matter how many work items are specified in the host code. If we want more in-flight work items, we have to add more computation or operations that will be translated into extra pipeline stages. Do I understand this correctly?
Since a deeper pipeline provides more parallelism, if my understanding above is correct, a simple kernel with few operations cannot actually benefit much from the NDRange implementation (no matter how many work items are specified), right?
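To make the question concrete, here is a hypothetical variant of the kernel above (my own sketch, not code from the Best Practices guide): the extra arithmetic should each be compiled into additional pipeline stages, so presumably more work items could be in flight at once.

```c
// Hypothetical OpenCL kernel (a sketch, not from the guide).
// The multiply, add, and store should each map to a pipeline stage,
// giving a deeper pipeline than the simple add kernel above.
__kernel void madd(__global const int *a,
                   __global const int *b,
                   __global int *c)
{
    int gid = get_global_id(0);
    int t = a[gid] * b[gid];   // stage: multiply
    t = t + a[gid];            // stage: add
    c[gid] = t;                // stage: store
}
```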
Thanks!
That makes sense. That email address is still valid, so you can contact me there. I think it should also be possible to send private messages through the forum now.
Sent it. I don't know how to send private messages through this forum ...
I'll think about your reply and then provide some feedback in 2-3 days if I have any. Right now I have to finish a course project whose deadline is tomorrow ...
Thanks
I checked your report; it seems the area usage difference you are seeing is simply an artifact of estimation accuracy. The LSUs created in both cases (vector or unrolled) are exactly the same, and the area usage for the loads is also the same. However, there are minor differences with respect to pipeline latencies.

"Select", from what I remember, refers to comparison operations and branches which require MUXes, but it is not immediately clear why the definition of "output_cache" requires such operations. "xN" shows how many such operations are needed by that line.

I tried compiling your code with Quartus 16.1.2; there, instead of "Select", the report says "State". Going to the latest v19.1 and compiling against the a10_ref BSP: when the vector load is disabled, "output_cache" is implemented as a "barrel shifter", and a few wide "select" operations are used to support it, which finally clears things up. However, when the vector load is enabled, it seems "output_cache" is absorbed into the pipeline, since the area usage of the buffer definition disappears from the report.

With v19.1, the difference in area estimation between the two versions becomes minute, which I believe is the correct result. If you actually place and route the two versions, despite the large difference in the reports of v17.1.1 (which you are using), the post-place-and-route difference will very likely be negligible, and there will be no performance difference either (adjusted, of course, for possible differences in operating frequency).
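For readers following along, the two coding styles being compared probably look something like the sketch below. This is a hypothetical reconstruction (I don't have the original code; the element type, width of 4, and kernel names are my assumptions); the point is that the compiler coalesces the unrolled consecutive accesses into the same wide LSU that the explicit vector load produces.

```c
// Hypothetical reconstruction of the two compared versions (a sketch).
// Version 1: explicit vector load -- one wide LSU per access.
__kernel void copy_vec(__global const int4 *in, __global int4 *out)
{
    int gid = get_global_id(0);
    out[gid] = in[gid];
}

// Version 2: unrolled scalar loads -- the four consecutive accesses
// should be coalesced into the same wide LSU, which is why the LSUs
// (and their area) come out identical in both reports.
__kernel void copy_unroll(__global const int *in, __global int *out)
{
    int gid = get_global_id(0);
    #pragma unroll
    for (int i = 0; i < 4; i++)
        out[gid * 4 + i] = in[gid * 4 + i];
}
```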
Thanks a lot for your analysis!
"there will be no performance difference either" -- this is exactly what I saw in last week's comparison experiments (I also mentioned this in my previous reply). I agree with you that this is probably an artifact. Also, both versions have the same "Kernel Clock Frequency" (248 MHz for both kernels, in the Intel Dynamic Profiler). In my past experience, higher area usage tends to come with a lower frequency, but that does not apply to the two versions we are discussing now. So I think you are right!