I have some theoretical questions to better understand the Intel FPGA OpenCL compiler. First of all, I still don't know when to prefer an NDRange kernel over a single-task kernel. To my understanding, a single-task kernel can be data-parallelized more flexibly by using the unroll pragma on loops, which lets us mark exactly the parts we want vectorized. An NDRange kernel, on the other hand, offers the SIMD attribute, which is restricted to powers of 2 (why?) and requires the programmer to fix the work-group size. The NDRange concept fits other OpenCL platforms well because their fixed hardware consists of multiple compute units, but I cannot grasp why it is necessary on an FPGA.
Secondly, I would like to know when to prefer multiple compute units over SIMD. According to the Best Practices Guide, finding the best combination of compute units and SIMD width takes some experimentation. But I cannot think of a scenario in which n compute units, which get no memory coalescing, would outperform a single compute unit with n SIMD lanes. It seems to me that it is always better to decrease the number of compute units by a factor of n and increase the SIMD width by the same factor (as long as there are enough resources). If that is the case, what justifies the existence of the multiple compute units attribute?
Lastly, after optimizing the number of compute units and the SIMD width, what procedure should we follow to find the best work-group size? The Best Practices Guide states that each work-group can run on only one compute unit. So the number of work-groups should be a multiple of the number of compute units for better performance (or not?). I have always aimed for the smallest possible reqd_work_group_size value so that choosing the global work size is easier (it has to be a multiple of the work-group size on my device). Is there a more elegant way of choosing the work-group size?
Your first question can be answered by the following free online training.
"When to prefer an NDRange kernel over a single-task kernel"
One approach is not better than the other.
Create single work-item kernels if:
–Data processing sequencing is critical
–Algorithm can’t easily break down into work-items due to data dependencies
–Not all data available prior to kernel launch
–Data cannot be easily partitioned into workgroups
Create NDRange kernels if:
–Kernel does not have loop and memory dependencies
–Kernel can execute multiple work-items in parallel efficiently
–Able to take advantage of SIMD processing
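To make the two styles above concrete, here is a minimal sketch of the same trivial computation written both ways. The kernel names, the unroll factor of 4, and the work-group size of 64 are arbitrary choices for illustration, not taken from the training material. The `#ifndef __OPENCL_VERSION__` shim is my own addition so that the single work-item version can also be compiled and exercised with an ordinary C compiler; the NDRange version is only seen by an OpenCL compiler.

```c
/* Host-side shim (an assumption, for illustration only): when this file is
   built with a regular C compiler, the OpenCL qualifiers expand to nothing. */
#ifndef __OPENCL_VERSION__
#define __kernel
#define __global
#endif

/* Single work-item form: one loop over all elements. The unroll factor
   (4 here, chosen arbitrarily) asks for four copies of the loop body,
   so any factor can be used, not only powers of 2. */
__kernel void scale_single(__global const float *restrict in,
                           __global float *restrict out,
                           int n)
{
    #pragma unroll 4
    for (int i = 0; i < n; i++) {
        out[i] = 2.0f * in[i];
    }
}

#ifdef __OPENCL_VERSION__
/* NDRange form: each work-item handles one element. The SIMD attribute
   requires a fixed work-group size that is divisible by the SIMD width. */
__attribute__((reqd_work_group_size(64, 1, 1)))
__attribute__((num_simd_work_items(4)))
__kernel void scale_ndrange(__global const float *restrict in,
                            __global float *restrict out)
{
    size_t gid = get_global_id(0);
    out[gid] = 2.0f * in[gid];
}
#endif
```

In the single work-item form the element count is an ordinary argument, while in the NDRange form it comes from the global work size chosen by the host, which has to respect the fixed work-group size.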
num_compute_units vs. simd
Try SIMD vectorization first
–Usually leads to more efficient hardware than compute unit replication
You can combine SIMD vectorization with compute unit replication
–Possibly required to achieve best performance and/or fit
--The SIMD width must be a power of 2, so a 16-lane SIMD kernel may not fit in your device while 3 CUs with a SIMD width of 4 (12 lanes total) may.
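The arithmetic behind that last point can be sketched in a few lines of C. This is a toy model only: it treats "how much fits on the device" as a simple lane budget, which is my simplification, and the helper names are hypothetical. The cap of 16 on the SIMD width reflects my understanding of the Intel attribute's allowed values and should be checked against the Programming Guide.

```c
#include <assert.h>

/* A SIMD width is accepted here if it is a power of 2 no larger than 16.
   (The upper bound of 16 is an assumption about the toolchain.) */
static int is_valid_simd(int simd)
{
    return simd >= 1 && simd <= 16 && (simd & (simd - 1)) == 0;
}

/* Largest valid SIMD width that does not exceed the lane budget. */
static int largest_simd(int max_lanes)
{
    assert(max_lanes >= 1);
    int simd = 1;
    while (simd * 2 <= max_lanes && simd * 2 <= 16) {
        simd *= 2;
    }
    return simd;
}

/* Total parallel lanes when a SIMD kernel is replicated across compute units. */
static int total_lanes(int compute_units, int simd)
{
    assert(compute_units >= 1 && is_valid_simd(simd));
    return compute_units * simd;
}
```

With a budget of 12 lanes, `largest_simd(12)` gives 8, whereas `total_lanes(3, 4)` reaches the full 12. That gap between power-of-2 steps is where compute unit replication can help.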
Work-group size should be set based on the algorithm. Ask: do any work-items need to share local memory? How many work-items does it make sense for the algorithm to process at a time? For example, JPEG encoding works on 8x8 pixel blocks, so the required work-group size should be (8, 8). That allows the work-items to share local (on-chip) memory. If you are doing matrix multiplication, the maximum work-group size would be the largest supported matrix size, (N, M). In general, if the work-group size isn't obvious, it probably makes sense to write a single work-item kernel instead. You can get "SIMD-like" parallelization in single work-item kernels using loop unrolling.
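Once the work-group size is fixed by the algorithm, the global work size follows from it rather than the other way around. A minimal host-side sketch, assuming the usual requirement that each global dimension be a multiple of the work-group size; the helper names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next multiple of m (m must be non-zero). */
static size_t round_up(size_t n, size_t m)
{
    assert(m > 0);
    return ((n + m - 1) / m) * m;
}

/* Number of work-groups along one dimension after padding n up to a
   multiple of the work-group size `local`. */
static size_t num_groups(size_t n, size_t local)
{
    return round_up(n, local) / local;
}
```

For a 1000x750 image with (8, 8) work-groups, this gives a global size of 1000x752 and 125x94 work-groups. The two padded rows fall outside the image, so the kernel needs a bounds check (or the host needs to pad the buffers) for those work-items.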
If you can, I highly recommend taking this course.
I want to clarify a few things:
Most algorithms are easier to write and perform better as single work-item kernels. I would recommend starting with a single work-item kernel and switching to NDRange only if you can't get good performance.
Memory access is usually the bottleneck. Pay very close attention to all memory accesses. Make memory accesses wide and sequential whenever possible (use loop unrolling and on-chip caching). Use as few compute units as possible to avoid contention.
Thank you for the information. The way you describe loop unrolling as "SIMD-like" parallelization implies that the generated hardware differs between the two approaches (NDRange kernel + SIMD vs. single work-item kernel + loop unrolling). I will check the online courses you mentioned if I can resolve my registration problem with the FPGA Program.