Application Acceleration With FPGAs
Programmable Acceleration Cards (PACs), DCP, FPGA AI Suite, Software Stack, and Reference Designs

The weird aggressive aocl optimization "removing unnecessary storage to local memory"

hiratz
Novice
2,930 Views

Hello,

 

I used local memory variables in my kernels, but when compiling them with aocl for an Intel Arria 10 FPGA I got many compilation warnings like the one in the title.

 

When the kernels are compiled as the task type (single work-item), they do not produce correct results. However, if I use global memory variables instead of these local memory ones, the results are always correct.

 

When the kernels are compiled as the NDRange type, they always produce correct results, regardless of whether the local memory or the global memory variables are used.

 

So I was wondering whether such an aggressive optimization could affect the correctness of the calculation.

(I checked my code again and again. Logically, the store statements that cause this warning should not be removed; otherwise the emulation would definitely give wrong results.)

 

Has anyone else encountered this warning, or does anyone know whether it affects the program semantics?

 

Thanks!

 

0 Kudos
1 Solution
HRZ
Valued Contributor III
1,942 Views

With respect to functional verification, I construct my host code in a way that supports both run-time and offline compilation, the latter for FPGAs and the former for other devices, and I use AMD's OpenCL SDK for the other devices. In this case, as long as the run-time OpenCL driver is installed, the same host code can be used to execute the same kernel on any type of CPU, GPU or FPGA. You can take a look at the host code/makefiles of the optimized benchmarks in the following repository as an example of achieving this:

 

https://github.com/fpga-opencl-benchmarks/rodinia_fpga
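For reference, here is a minimal sketch (not the actual rodinia_fpga host code; the file names and the load_file() helper are hypothetical) of how a host program can pick between a pre-built FPGA binary and run-time source compilation with the standard OpenCL host API:

/* Choose offline compilation (.aocx binary, FPGA) or run-time compilation
 * (OpenCL C source, CPU/GPU); everything else in the host code stays
 * identical for all devices. */
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read a whole file into memory. */
static unsigned char *load_file(const char *path, size_t *size)
{
    FILE *f = fopen(path, "rb");
    if (!f) { perror(path); exit(1); }
    fseek(f, 0, SEEK_END);
    *size = (size_t)ftell(f);
    rewind(f);
    unsigned char *buf = malloc(*size + 1);
    fread(buf, 1, *size, f);
    buf[*size] = '\0';
    fclose(f);
    return buf;
}

cl_program create_program(cl_context ctx, cl_device_id dev, int use_fpga)
{
    cl_int err;
    size_t size;
    cl_program prog;

    if (use_fpga) {
        /* Offline flow: load the .aocx produced by aoc. */
        const unsigned char *bin = load_file("kernel.aocx", &size);
        prog = clCreateProgramWithBinary(ctx, 1, &dev, &size, &bin, NULL, &err);
    } else {
        /* Run-time flow: let the CPU/GPU driver compile the same .cl source. */
        const char *src = (const char *)load_file("kernel.cl", &size);
        prog = clCreateProgramWithSource(ctx, 1, &src, &size, &err);
    }
    if (err != CL_SUCCESS) { fprintf(stderr, "program creation failed\n"); exit(1); }

    err = clBuildProgram(prog, 1, &dev, "", NULL, NULL);
    if (err != CL_SUCCESS) { fprintf(stderr, "build failed\n"); exit(1); }
    return prog;
}

The rest of the host code (buffers, kernel launches, verification) does not need to change between devices, which is what makes this kind of cross-device functional verification convenient.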

 

I emulated all of those kernels on CPUs/GPUs using the same host and kernel codes. What I would tell you is that if an NDRange kernel with sufficiently large local and global sizes performs correctly on a GPU, it should also perform correctly on an FPGA (unless there is a bug in the FPGA compiler). A CPU should also work fine even if the whole kernel runs on one core, since there will still be multiple threads (work-items) running on that core that could be issued out of order; this is usually enough to expose concurrency issues, but a GPU would likely be more trustworthy in this case.

 

With respect to, let's say, HDL vs. OpenCL, many old-school HDL programmers tend to think that OpenCL or HLS tools in general are insufficient and that it is possible to achieve better results using HDL. This is indeed true in some cases, like latency-sensitive or low-power applications where clock-by-clock control over the code is required, or applications that are limited by logic resources, but I would not say it is the case for high-throughput applications where the limitation is memory/PCIe bandwidth or DSP count, since these limitations are independent of the programming language. With respect to the particular case of unpipelinable nested loops, HDL or OpenCL would not make a difference. If you have a regular outer loop with an irregular inner loop, the outer loop cannot be pipelined; it doesn't matter how you "describe" the code. There are two ways to approach such loops on FPGAs:

1- Use NDRange and let the run-time work-item scheduler do its best in maximizing pipeline efficiency and minimizing the average loop II.

2- Collapse the nested loop as long as it is not too irregular and get an II of one at the cost of a noticeable Fmax hit. Though by "collapse" I mean manual collapse and not the compiler's "coalesce" pragma. Take a look at Section 3.2.4.3 in this document:

 

https://arxiv.org/abs/1810.09773

 

Even though the provided example involves collapsing a regular nested loop, this optimization also sometimes applies to irregular nested loops. In such a case, the condition inside the collapsed loop that is used to increment the variable of the original outer loop will have more than one statement (which complicates the critical path and reduces the Fmax). It is indeed also possible to implement parts of your application in HDL and use it as an HDL library in an OpenCL kernel, but you are going to run into complications if your HDL library does not have a fixed latency, and I highly doubt you would be able to achieve much better results in the end.
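As an illustration of option 2, here is a minimal sketch of a manually collapsed nested loop (the kernel name, bounds and body are hypothetical placeholders, not code from this thread):

// Original form: an outer loop over rows with an inner loop over columns.
// The two loops are collapsed into a single loop with explicit index bookkeeping.
__attribute__((max_global_work_dim(0)))
__kernel void collapsed(__global const float * restrict in,
                        __global float * restrict out,
                        const int N, const int M)
{
    int i = 0, j = 0;
    for (int k = 0; k < N * M; k++) {
        out[i * M + j] = in[i * M + j] * 2.0f;  // placeholder body
        j++;
        if (j == M) {  // for an irregular inner loop, this index update needs
            j = 0;     // more than one statement, which lengthens the critical
            i++;       // path and costs Fmax, as noted above
        }
    }
}

The single collapsed loop can then reach an II of one, at the Fmax cost mentioned above.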

 

Finally, with respect to NDRange vs. Single Work-item, I recommend reading Section 3.1 (and particularly 3.1.4) of the document I posted above.

View solution in original post

0 Kudos
41 Replies
hiratz
Novice
694 Views

Thanks for the information.

 

I just downloaded the Intel® Acceleration Stack Version 1.2 (a10_gx_pac_ias_1_2_pv_rte_installer.tar.gz) (https://www.intel.com/content/www/us/en/programmable/solutions/acceleration-hub/downloads.html), but I was not able to install it on the vLab machine because doing so requires sudo permission, which I don't have. I believe only a10_gx_pac_ias_1_1_pv is currently installed.

 

Sorry about that. If you have sudo permission, you can try it instead.

 

0 Kudos
HRZ
Valued Contributor III
694 Views

@hiratz​ Can you post your design files again? The file you uploaded before is not available anymore.

 

I don't think your problem will go away by using a newer version of Quartus anyway; the optimizations done by the compiler do not change drastically from one version to another.

0 Kudos
hiratz
Novice
694 Views

@HRZ​ KTan9 has posted it again below. (Please understand this is still part of my current work :), so I just put it here for several days)

 

Let's hope KTan9 (as the Intel expert) can shed some light on this problem.

 

Here I put some interesting data across different Quartus versions:

 

For my complete kernel files, the resource utilization breakdowns across various versions are as follows:

 

Logic utilization, ALUTs, Dedicated logic registers, Memory blocks, DSP blocks

 

For 17.1.1, they are 58%, 33%, 28%, 59% and 4%, respectively.

For 18.0, they are 57%, 31%, 28%, 62% and 4%, respectively.

For 19.1, they are 52%, 29%, 24%, 57% and 3%, respectively.

But for 18.1, they are 85%, 54%, 36%, 70% and 14%, respectively.

 

(It looks like the latest 19.1 has the best resource utilization! (at least for my design))

 

I asked this in the Harp community and was told that a change that large suggests the compiler made a significantly different topological decision.

 

These observations are for your reference.

 

0 Kudos
Kenny_Tan
Moderator
694 Views

Here you go.

0 Kudos
Kenny_Tan
Moderator
694 Views

Actually, what we suggest is that:

 

INTELFPGAOCLROOT set to Q19.1

QUARTUS_ROOT set to Q17.1.1

AOCL_BOARD_PACKAGE_ROOT set to 1.2

 

Let me know if you get any errors. I tested it from my side and I am getting an error. This PAC should work with the OpenCL 19.1 RTE and with Quartus 17.1.1.

 

Thanks

 

0 Kudos
hiratz
Novice
694 Views

Thanks for your suggestions! Unfortunately, there is no BSP 1.2 on the Intel Harp machine, and I was not able to install it there because installing it requires sudo permission. So I cannot try "AOCL_BOARD_PACKAGE_ROOT set to 1.2".

0 Kudos
Kenny_Tan
Moderator
694 Views

Then try 1.1 first.

0 Kudos
hiratz
Novice
694 Views

I've been running it for more than 15 minutes and the errors have not yet appeared, which seems like a good sign! Previously the errors showed up within 5 minutes of launching it.

 

I will let you know once the compilation finishes successfully. For NDRange (#define attr_setup attr_max_wg), the compilation usually takes 4 ~ 5 hours under 17.1.1; for task (single work-item) (#define attr_setup attr_task), the compilation is much faster and takes about 2 hours under 17.1.1.

 

Here is my configuration script:

 

sys_quartus_dir=/export/quartus_pro
my_quartus_dir=$HOME

export QSYS_ROOTDIR=${sys_quartus_dir}/17.1.1/qsys/bin
export QUARTUS_ROOTDIR=${sys_quartus_dir}/17.1.1/quartus/bin
export QUARTUS_ROOTDIR_OVERRIDE=${sys_quartus_dir}/17.1.1/quartus

export PATH="${sys_quartus_dir}/17.1.1/quartus/bin/:$PATH"
export PATH="${sys_quartus_dir}/17.1.1/qsys/bin:$PATH"

# PAC_A10 BSP version 1.1
export AOCL_BOARD_PACKAGE_ROOT=/export/fpga/release/a10_gx_pac_ias_1_1_pv/opencl/opencl_bsp
# export PATH="$my_quartus_dir/intelFPGA_pro/19.1/hld/board/a10_ref/:$PATH"

# set OpenCL version 19.1
export ALTERAOCLSDKROOT=$my_quartus_dir/intelFPGA_pro/19.1/hld
export INTELFPGAOCLSDKROOT=$my_quartus_dir/intelFPGA_pro/19.1/hld

export PATH=/homes/hiratz/intelFPGA_pro/19.1/hld/bin:$PATH

# run the OpenCL setup script in 19.1
source $my_quartus_dir/intelFPGA_pro/19.1/hld/init_opencl.sh

/export/fpga/bin/qsub-fpga

 

0 Kudos
Kenny_Tan
Moderator
694 Views

Good to hear that. By the way, how do you access the Intel Harp machine? I will look into requesting an upgrade to the 1.2 version.

0 Kudos
hiratz
Novice
694 Views

Intel's Harp environment is documented at: https://wiki.intel-research.net/FPGA.html

To apply for access, you can fill out the form at: https://registration.intel-research.net/register

 

Once you get an account, you can send emails to iam@intel-research.net for help requests.

 

(Actually, I already asked about the BSP 1.2 installation in Intel's Harp community and they replied that they will probably update it soon.)

 

Update about my progress:

I successfully compiled the code for both ndrange type and task type. The former took 1:42 hours (1 hour, 42 mins) and the latter took 1:29 hours.

 

Please note: for the task type compilation, __attribute__((task)) is no longer supported in 19.1. You have to replace it with __attribute__((max_global_work_dim(0))). In addition, although 17.1.1 allows calls like "get_global_id()" or "get_global_size()" to stay in the code when compiling a task type version, 19.1 does not. If you leave them in anyway, you will get an obviously abnormal resource utilization report:

 

Logic utilization (423226%), ALUTs (502536%), Dedicated logic registers (82%), Memory blocks(65%), DSP blocks(2%)

 

So you may want to remove them and replace the variable "gid" with 0 and the variable "gsize" with 1.
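To illustrate, here is a hedged sketch of that change (the kernel name and body are hypothetical, not my actual code):

// 17.1.1 accepted this combination, 19.1 does not:
//   __attribute__((task))
//   __kernel void foo(...) { int gid = get_global_id(0); ... }

// 19.1 single work-item style: max_global_work_dim(0), no NDRange calls.
__attribute__((max_global_work_dim(0)))
__kernel void foo(__global float * restrict data, const int n)
{
    const int gid = 0;    // was get_global_id(0)
    const int gsize = 1;  // was get_global_size(0)
    for (int i = gid; i < n; i += gsize)
        data[i] += 1.0f;  // placeholder body
}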

 

However, the binary (aocx file) does not work correctly for either emulation or hardware execution.

 

The emulation shows the binary has some problems:

 

Using AOCX: decom_comp.aocx
Error: Malformed program interface definition found in binary:
Error: FAILED to read auto-discovery string at byte 44. Full auto-discovery string value is 19 emulatorhash0000000000000000000000000000 EmulatorDevice 0 0 0 0 3 decomp 0 0 0 0 0 0 0 1 0 10 2 1 8 2 1 8 2 1 8 2 1 8 0 0 4 0 0 4 2 1 8 2 1 8 0 0 4 0 0 4 0 0 0 0 0 1 2147483647 0 1 compress 0 0 0 0 0 0 0 1 0 13 2 1 8 2 1 8 0 0 4 0 0 4 0 0 4 2 1 8 2 1 8 0 0 4 0 0 4 2 1 8 2 1 8 0 0 4 0 0 4 0 0 0 0 0 1 2147483647 0 1 merge_streams 0 0 0 0 0 0 0 1 0 8 2 1 8 2 1 8 2 1 8 2 1 8 2 1 8 0 0 4 0 0 4 0 0 4 0 0 0 0 0 1 2147483647 0 1
Error: Invalid binary
ERROR: CL_INVALID_BINARY
Location: ../common-fpga/src/AOCLUtils/opencl.cpp:392
Failed to create program with binary

Launching the hardware version shows similar errors. Some additional messages are:

acl_hal_mmd.c:1393:assert failure: Failed to initialize kernel interface
acl_hal_mmd.c:1393: l_try_device: Assertion `0' failed.

So I think there may be some mismatch among Quartus 17.1.1, the OpenCL RTE 19.1 and the underlying BSP 1.1.

 

So for now, it is better for me to stick with 17.1.1

0 Kudos
HRZ
Valued Contributor III
694 Views

I had a look at your code; there are some fundamental issues in it which raise doubts about its correctness:

 

-You are supposedly using local memory, but there is not even one local memory barrier in the whole code (or at least I can't see any). This can only mean one of two things (a sketch of the access pattern that would require a barrier follows after this list):

1-The code is incorrect.

2-Every local memory read is only done by the same work-item that initially wrote to that location and there is no data sharing between the work-items through local memory (or else barriers would have been required). This essentially means there is no point in using local memory for this code, and the compiler's decision to remove the local memory operations is correct.
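Here is a minimal, hypothetical sketch (not code from this thread) of the access pattern referred to in point 1, where a local memory barrier is mandatory because one work-item reads what another work-item wrote:

__kernel void needs_barrier(__global const float * restrict in,
                            __global float * restrict out)
{
    __local float tile[256];            // assumes a work-group size of at most 256
    const int lid = get_local_id(0);
    const int gid = get_global_id(0);

    tile[lid] = in[gid];                // each work-item writes its own element
    barrier(CLK_LOCAL_MEM_FENCE);       // required: another work-item reads it next
    const int next = (lid + 1) % (int)get_local_size(0);
    out[gid] = tile[next];              // read an element written by a different work-item
}

If no such producer/consumer pattern exists, point 2 applies and local memory gains you nothing.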

 

-You seem to be under the impression that you can convert an NDRange kernel to Single Work-item just by adding "__attribute__((task))" to the kernel header. This is indeed not the case (unless the compiler nowadays has the capability to automatically convert NDRange kernels to Single Work-item, but I doubt it). You need to replace all NDRange-specific functions like "get_global_id", "get_local_id", etc. with appropriate loops, so that the Single Work-item version performs the same number of operations as the NDRange one using loop iterations rather than work-items. I am surprised the compiler actually compiles code like this!
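For illustration, a hedged sketch of this conversion (hypothetical kernels, not the code under discussion): the implicit iteration over work-items in the NDRange kernel becomes an explicit loop in the Single Work-item version.

// NDRange version: one work-item per element.
__kernel void scale_ndrange(__global float * restrict data, const float factor)
{
    const int gid = get_global_id(0);
    data[gid] = data[gid] * factor;
}

// Single Work-item version: the former global range becomes a loop,
// so the same number of operations is performed by loop iterations.
__attribute__((max_global_work_dim(0)))
__kernel void scale_task(__global float * restrict data, const float factor,
                         const int n)
{
    for (int i = 0; i < n; i++)   // replaces get_global_id(0)
        data[i] = data[i] * factor;
}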

 

Remember that just because the code works fine in the emulator, it does not mean it is actually correct. The emulator does not fully replicate the hardware execution environment and, specifically, it cannot correctly emulate concurrency issues and race conditions that might exist in your code. Porting code optimized for CPUs/GPUs to FPGAs is extremely inefficient; you will essentially have to "de-optimize" the code first and then re-optimize it for FPGAs. Moreover, considering the size of your code and all the functions involved in it, debugging it will not be very different from looking for a needle in a haystack. I would recommend starting from an unoptimized implementation instead (if you can get such an implementation from the code authors); that would save a lot of time (and headache).

 

>For 17.1.1: they are 58%, 33%, 28%, 59%, 4%, respectively

>For 18.0, they are: 57%, 31%, 28%, 62% and 4%, respectively.

>For 19.1, they are 52%, 29%, 24%, 57%, 3%, respectively.

>But for 18.1, they are: 85%, 54%, 36%, 70% and 14%, respectively.

 

Are these post-place-and-route utilization numbers or resource estimation numbers from the first stage of compilation? The latter are highly unreliable and you should not draw any conclusions from them. Since the compiler auto-unrolls some of the loops, and the auto-unroll logic might change from one compiler version to another, this alone could change resource utilization drastically. In fact, auto-unrolling of loops seems to have been completely disabled in v19.1, which means your code will definitely behave differently with this version compared to previous ones.

 

0 Kudos
hiratz
Novice
694 Views

Hi HRZ,

 

@HRZ​ Thank you so much for spending time looking at our code and writing so much feedback!!

 

Actually I have multiple similar versions of this code. In the one shown here, I used local memory for two key struct variables that are frequently accessed by many stream-related functions: __local zfp_stream zfp[MAX_SEG]; __local bitstream stream[MAX_SEG]; (in the kernels "decomp" and "compress", respectively). You may notice the global pointer arguments "__global zfp_stream * restrict zfp2, __global bitstream * restrict stream2", which are not used here and belong to another implementation where zfp and stream are put in global memory. Some members of stream, like "buffer", "bits" and "i" (the current read/write position), are accessed in many called functions. Removing some of the assignment statements to them (the ones causing the warnings) makes the emulation results incorrect. Though emulation cannot reproduce the concurrency, it can tell us whether a function is correct from the perspective of logic (please correct me if I'm wrong).

 

Other buffers, like xy_buffer and xy_bs1 in the kernel "decomp", may be too big to be put into local memory (e.g., for a 2048 x 2048 double matrix, xy_buffer occupies 2K x 2K x 8 bytes = 32 MB).

 

About the barrier: as you can see from the code, our framework is constructed as a 3-stage pipeline, decompression -> processing -> compression. The processing could be any kind of computation (e.g., processing an image, transposing a matrix, etc.; I did not show its code here). Synchronization is needed between two consecutive stages. In one of the earlier versions, in which only one kernel was used, I used barriers to synchronize between stages. Later on I found that this was inefficient, so I broke the single big kernel into four (3 of them are shown in the code here). The synchronization between them is controlled by OpenCL events on the host side, so this becomes a barrier-free design.
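For reference, a minimal host-side sketch of this barrier-free chaining (the queue, kernel and size names are hypothetical placeholders, not my actual host code):

#include <CL/cl.h>

/* Each stage starts only after the event of the previous stage completes. */
void run_pipeline(cl_command_queue queue,
                  cl_kernel decompress_kernel,
                  cl_kernel process_kernel,
                  cl_kernel compress_kernel,
                  size_t global_size, size_t local_size)
{
    cl_event decompress_done, process_done, compress_done;

    /* Stage 1: decompression. */
    clEnqueueNDRangeKernel(queue, decompress_kernel, 1, NULL,
                           &global_size, &local_size, 0, NULL, &decompress_done);

    /* Stage 2: processing waits on stage 1 through its event wait list. */
    clEnqueueNDRangeKernel(queue, process_kernel, 1, NULL,
                           &global_size, &local_size, 1, &decompress_done, &process_done);

    /* Stage 3: compression waits on stage 2. */
    clEnqueueNDRangeKernel(queue, compress_kernel, 1, NULL,
                           &global_size, &local_size, 1, &process_done, &compress_done);

    clWaitForEvents(1, &compress_done);

    clReleaseEvent(decompress_done);
    clReleaseEvent(process_done);
    clReleaseEvent(compress_done);
}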

 

If you look at the main compression loop (in codec_2d_public.h) (Input: data to be compressed "xy_buffer", Output: bitstream buffer "begin")

 

for (int b = start_b; b < start_b + nblock; b++)
{
    zfp_encode_block_double_2(begin, stream, zfp, xy_buffer + b * BLOCK_ITEMS);
}

 

you can see what I want to do: split an xy plane (like an image) into multiple regions, where one region contains nblock 4x4 blocks. Each work-item compresses just one region. The above loop should be executed by all work-items in parallel, but they access different regions of a big chunk of global memory. zfp and stream contain some control data, like the current bitstream read/write position, etc. Therefore, there is actually no data sharing among the work-items, and no conflict or overlap between them. (One potential synchronization across work-items happens between compression and merge_streams, but it can also be done on the host side.)

 

The decompression loop is similar to the compression one (its input: bitstream buffer xy_bs1, output: xy_buffer)

 

Unfortunately, my code's NDRange version is not stable. For small matrix sizes (like 64 x 64), it works well; but for large ones (like 256 x 256 or 512 x 512), only using one work-item gives correct results; using more than one gives me wrong results most of the time. I am still not able to find the root cause of this phenomenon.

For the task version (using __attribute__((task))), there also exists a weird but interesting bug: if I put zfp or stream in local memory (as shown in the code) or private memory (defined as: zfp_stream zfp; bitstream stream), the results are not correct; but if I put them into global memory by defining them as global pointers, the results are always correct. Still, I don't know what exactly happens behind this (though logically I cannot see anything wrong). I once suspected that something was wrong with the alignment of zfp or stream, but even after I changed the alignment size in their definitions (codec_2d.h) (e.g., to 256), the problem persists.

 

"You seem to be under the impression that you can convert an NDRange kernel to Single Work-item just by adding "__attribute__((task))" to the kernel header. This is indeed not the case ..."

 

You are totally correct! I did not realize this until last night when I tried the latest 19.1. With 17.1.1, I can simply use __attribute__((task)) even though "get_global_id()" or "get_global_size()" still exist in the code. The reports generated by the initial compilation show that the code is indeed compiled into a single work-item type and that most loops are pipelined where possible (but the real underlying implementation may not follow the correct logic even though 17.1.1 successfully compiled it; I have no idea whether the bug I mentioned above is related to this). However, with 19.1, __attribute__((task)) is no longer supported and is not recognized by the compiler. I have to use "__attribute__((max_global_work_dim(0)))" instead. In this case, if I still leave "get_global_id()" or "get_global_size()" in my code, I get an obviously incorrect report:

 

Logic utilization (423226%), ALUTs (502536%), Dedicated logic registers (82%), Memory blocks(65%), DSP blocks(2%)

 

After I removed all "get_global_id()" or "get_global_size()" and replaced all "gid" with 0, the report looks normal.

 

Please note: __attribute__((reqd_work_group_size(1, 1, 1))) does not make 19.1 treat the code as a single work-item type (it is still viewed as an NDRange type).

 

"Remember that just because the code works fine in the emulator it does not mean it is actually correct."

 

I've been stuck on this kind of problem for more than a month. For all my implementation versions, the emulations are always correct, but the hardware executions are not necessarily so. The compression software zfp has not yet provided an FPGA implementation (their GPU version was published just recently). Is it possible that they already tried an FPGA version but found it inefficient? I think I need to contact the authors.

 

For the resource utilization numbers across different Quartus versions, yes, they are from the first stage of compilation (which takes 1 ~ 2 minutes). For the numbers from 18.1, I tried the compilation several times and 18.1 always gives similar numbers.

 

With 17.1.1, I always get warnings like "Compiler Warning: Auto-unrolled loop at file_path: 40 (line number)" if I do not use "#pragma unroll N". That is exactly the auto-unrolling you mentioned. But with 19.1, these warnings are gone. So you are right; this feature has probably been removed (or disabled) in 19.1.

 

Finally, would you consider a possible collaboration with us if you have the interest and time? Currently I am the only programmer in this project and I don't have much experience. If you would like to join, we would consider you a contributor to our project and add your name to the paper we plan to submit in the future :)

 

Thank you again!

 

 

 

0 Kudos
hiratz
Novice
694 Views

Moved.

0 Kudos
Kenny_Tan
Moderator
694 Views

Hi,

 

Can you open a new thread to address this separately?

 

Thanks

0 Kudos
hiratz
Novice
694 Views

Do you mean opening a new thread for the "-fast-compile" problem above?

If so, I will delete it and open a new question for it.

0 Kudos
Kenny_Tan
Moderator
694 Views
0 Kudos
hiratz
Novice
694 Views
0 Kudos
HRZ
Valued Contributor III
694 Views

@hiratz​ This is the best way I can simplify the problem of local memory and barriers:

 

1- Is there any instance in your code where work-item "i" writes to point X in a local buffer and work-item "j" reads from that point? (based on your last reply it seems the answer is no)

Yes: You need a local memory barrier after every such write operation. --> END

No: You do not need to use local memory. --> GOTO 2

 

2- Is there any instance in your code where work-item "i" reads point Y from global memory multiple times? (based on your last reply it seems the answer is yes)

Yes: Then create a private variable (rather than a local one) to store the point and reuse it. The size of this variable will depend on how many such points you need to store on-chip per work-item. You do not need to account for the work-group size or SIMD factor in this case; the compiler will automatically create one such buffer for as many work-items as your SIMD factor (see the sketch after this list). --> END

No: There is no point in optimization using local/private memory in this case. --> END
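A minimal, hypothetical sketch of point 2 above (the kernel and buffer names are placeholders): a value that would otherwise be read from global memory several times is kept in a private variable instead of being staged in local memory.

__kernel void reuse_private(__global const float * restrict coeff,
                            __global float * restrict out)
{
    const int gid = get_global_id(0);

    // Without caching, coeff[gid] would be read from global memory three times.
    const float c = coeff[gid];   // one global read into a private variable

    out[3 * gid + 0] = c;         // the private copy is reused below
    out[3 * gid + 1] = c * c;
    out[3 * gid + 2] = c * c * c;
}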

 

Getting correct results with one work-item but incorrect results with more than one sounds very much like a concurrency issue to me. If there is no overlap between the regions that the work-items access in global memory, then the problem must come from local memory or incorrect synchronization. Of course, there are also cases where the compiler generates incorrect logic due to a compiler bug, but such cases are quite rare. Considering that you are not using channels (or any other FPGA-specific constructs), have you tried running your modified code on a GPU? I personally would not use Intel's emulator unless I were using such constructs. Running on a GPU is not only much faster, but also allows debugging concurrency issues.

 

Finally, I wouldn't mind an academic collaboration, but right now I am occupied with my own projects, and since I have little knowledge of the code you are trying to port and, to be honest, it looks quite big, I am not sure I could allocate enough time for such a collaboration. One piece of advice: first make sure that porting this code to FPGAs is worth the time you are spending on it. If it ends up being too slow because the algorithm is not suitable for FPGA acceleration, it will be difficult to justify the time and effort spent on it. If the code is memory-intensive/memory-bound, if it largely involves random or indirect memory accesses, or if it cannot be properly pipelined due to loop-carried dependencies or variable loop exit conditions, I wouldn't say it is a good candidate for FPGA acceleration.

0 Kudos
hiratz
Novice
694 Views

@HRZ​ Thank you for the specific code guidance and suggestions! I'll go through my code again with your points 1 and 2 in mind.

Speaking of concurrency, my project actually has three different implementation versions for now: "CPU Serial", "OpenCL CPU" and the one I showed here, "OpenCL FPGA". The "CPU Serial" version is used to generate the correct results for comparison with the other two versions; the "OpenCL CPU" version is used to detect potential concurrency bugs or problems.

I forgot to mention in my last reply that another important reason why I removed all barriers from my code is that only doing so makes all work-items in the "OpenCL CPU" version run on DIFFERENT CPU cores. On the one hand, Intel's OpenCL runtime driver for CPUs views a core as a "Compute Unit (CU)" and puts the work-items of one work-group onto one core. On the other hand, as we all know, OpenCL barriers only apply within one work-group. As a result, my earlier "OpenCL CPU" versions (one single big kernel, one group) with many barriers could only run on a single core, and no concurrent execution happened! But if I put the work-items into different groups with each group containing only one item, the results became incorrect because the barriers could no longer play their role. Therefore, I decided to break that big single kernel into multiple kernels and remove all barriers. Since then it has worked stably and always gives correct results on my own computers. (Note that Intel's OpenCL CPU runtime uses Intel TBB as its underlying mechanism to create work-items (threads). I also think putting a whole group onto one core is a limitation of the Intel OpenCL CPU runtime itself, not of OpenCL; I have not tried other vendors' CPU runtimes, like AMD's.)

So this is why I did not pay much attention to the concurrency issue. I have not implemented an OpenCL GPU version yet. Considering that both CPUs and GPUs use an instruction-decoding-based execution style, I feel there should be no concurrency issues on a GPU either if my CPU version works correctly. I may be wrong, of course; if you think this reasoning is not correct, I'll consider also implementing it on a GPU, which should not be difficult. Unlike CPUs and GPUs, an FPGA implements its concurrency in a direct circuit-based execution style, so there may still be a chance for concurrency issues to appear on the FPGA. In other words, there may be no "concurrency-test portability" among CPU, GPU and FPGA.

I totally agree with you about which applications should be put on an FPGA; not all applications are suitable for FPGA acceleration. But let me first clarify some concepts before further discussion: "OpenCL-based FPGA" vs. "FPGA itself". One faculty member in our group thinks the problem is that the OpenCL compiler cannot generate an efficient circuit, not a problem of the FPGA itself, and that a low-level RTL implementation may solve it.

Take my case for example: the "(de)compression" kernel is actually a big top-level loop which contains many small loops, many of which are IRREGULAR. By "irregular", I mean these loops do not have a fixed number of iterations. As you said, they have "variable loop exit conditions" that are determined by run-time data. So the top-level loop will never be pipelined by the current compiler. (It also seems difficult to break this big top-level loop into multiple parts according to the code's semantics.) Do you think it is possible to handle such irregular cases, including the other cases you mentioned, with an RTL implementation that bypasses the OpenCL compiler? (In my opinion, issues caused by program logic or semantics are intrinsic and hard to solve by changing the implementation.)

Actually I'm now considering two solutions: 1) implement these irregular loops in RTL and then call them through an OpenCL library (the Intel compiler supports this). But I'm not very sure whether that is feasible (especially at the statement level, e.g., calling an API that is an RTL implementation of an irregular loop somewhere inside a kernel), and I'm concerned that the top-level loop might still not be pipelined even if it is; 2) implement the whole (de)compression function in RTL. Undoubtedly this would take much more time.

Though Intel's compiler supports both the task and NDRange types, it seems that Intel focuses more on the former and implicitly encourages people to use it, since more aggressive optimizations can be applied there (according to the "Programming Guide" and "Best Practices" manuals). In my case, a SIMD-style optimization rather than a pipelined one should be more appropriate, because multiple work-items access non-overlapping data. But even for NDRange kernels, Intel still uses a pipeline across work-items. Only their "kernel vectorization" is truly SIMD-style, and it only targets the kernel level (not the statement level). The other optimization, "multiple compute units", is obviously not feasible in my case: a pair of "decomp" and "compress" kernels already occupies ~50% of the resources. From this perspective, the instruction-decoding-based execution style is not limited by the code's logic complexity and size; that's why thousands of simple cores can be put into a GPU to do massive SIMD-style processing. So I think pipelining is the main advantage of the FPGA, which is what both CPUs and GPUs lack (note that the microarchitectural pipeline inside a CPU is specifically for instruction-level parallelism, not the function-level pipelining we are talking about on the FPGA). It seems impractical to achieve performance comparable to a GPU by replicating function units (especially complex ones).
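For reference, a hedged sketch (hypothetical kernel, illustrative attribute values) of the two NDRange replication knobs mentioned above in Intel's OpenCL compiler, i.e. kernel vectorization via num_simd_work_items and pipeline replication via num_compute_units:

__attribute__((reqd_work_group_size(64, 1, 1)))  // required for num_simd_work_items
__attribute__((num_simd_work_items(4)))          // "kernel vectorization": 4-wide SIMD datapath
__attribute__((num_compute_units(2)))            // replicate the whole kernel pipeline twice
__kernel void vectorized(__global const float * restrict in,
                         __global float * restrict out)
{
    const int gid = get_global_id(0);
    out[gid] = in[gid] * 2.0f;   // placeholder body
}

Both attributes multiply resource usage, which is exactly why compute unit replication is not an option when a single kernel pair already uses ~50% of the device.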

 

PS: Quartus 18.1 and 19.1 provide many example OpenCL designs (in 18.1/hld/examples_aoc, and 19.1/hld/examples_aoc, respectively). I looked through them quickly and noticed that most loops are regular.

 

Finally, I totally understand that you are busy with your projects and don't have much time. That's fine and I still appreciate your kindness and your advice!

0 Kudos
HRZ
Valued Contributor III
1,943 Views

With respect to functional verification, I construct my host code in a way that supports both run-time and offline compilation, the latter for FPGAs and the former for other devices, and I use AMD's OpenCL SDK for the other devices. In this case, as long as the run-time OpenCL driver is installed, the same host code can be used to execute the same kernel on any type of CPU, GPU or FPGA. You can take a look at the host code/makefiles of the optimized benchmarks in the following repository as an example of achieving this:

 

https://github.com/fpga-opencl-benchmarks/rodinia_fpga

 

I emulated all of those kernels on CPUs/GPUs using the same host and kernel codes. What I would tell you is that if an NDRange kernel with sufficiently large local and global sizes performs correctly on a GPU, it should also perform correctly on an FPGA (unless there is a bug in the FPGA compiler). A CPU should also work fine even if the whole kernel runs on one core, since there will still be multiple threads (work-items) running on that core that could be issued out of order; this is usually enough to expose concurrency issues, but a GPU would likely be more trustworthy in this case.

 

With respect to, let's say, HDL vs. OpenCL, many old-school HDL programmers tend to think that OpenCL or HLS tools in general are insufficient and that it is possible to achieve better results using HDL. This is indeed true in some cases, like latency-sensitive or low-power applications where clock-by-clock control over the code is required, or applications that are limited by logic resources, but I would not say it is the case for high-throughput applications where the limitation is memory/PCIe bandwidth or DSP count, since these limitations are independent of the programming language. With respect to the particular case of unpipelinable nested loops, HDL or OpenCL would not make a difference. If you have a regular outer loop with an irregular inner loop, the outer loop cannot be pipelined; it doesn't matter how you "describe" the code. There are two ways to approach such loops on FPGAs:

1- Use NDRange and let the run-time work-item scheduler do its best in maximizing pipeline efficiency and minimizing the average loop II.

2- Collapse the nested loop as long as it is not too irregular and get an II of one at the cost of a noticeable Fmax hit. Though by "collapse" I mean manual collapse and not the compiler's "coalesce" pragma. Take a look at Section 3.2.4.3 in this document:

 

https://arxiv.org/abs/1810.09773

 

Even though the provided example involves collapsing a regular nested loop, this optimization also sometimes applies to irregular nested loops. In such a case, the condition inside the collapsed loop that is used to increment the variable of the original outer loop will have more than one statement (which complicates the critical path and reduces the Fmax). It is indeed also possible to implement parts of your application in HDL and use it as an HDL library in an OpenCL kernel, but you are going to run into complications if your HDL library does not have a fixed latency, and I highly doubt you would be able to achieve much better results in the end.

 

Finally, with respect to NDRange vs. Single Work-item, I recommend reading Section 3.1 (and particularly 3.1.4) of the document I posted above.

0 Kudos
hiratz
Novice
692 Views

Great reading materials! I'll read the optimized code of rodinia_fpga and the arXiv paper carefully. They will improve my understanding of the optimizations.

 

Also, thank you for sharing your experience with functional verification and how you emulated the kernels. Good practices can help avoid a lot of unnecessary trouble.

 

I'll update my work's status here once I make significant progress.

 

Finally, thank you again!

0 Kudos
Reply