Solved: The weird aggressive aocl optimization "removing unnecessary storage to local memory"

hiratz · ‎03-29-2019

Hello,

I used local memory variables in my kernels and got many compilation warnings like the one in the title when compiling them with aocl for an Intel Arria 10 FPGA.

When the kernels are compiled as the task type (single work item), they do not produce correct results. However, if I use global memory variables instead of the local memory ones, the results are always correct.

When the kernels are compiled as the NDRange type, the results are always correct, regardless of which type of variable (local memory or global memory) is used.

So I was wondering if it is possible for such aggressive optimization to affect the correctness of the calculation.

(I checked my code again and again. Logically, the store statements that cause this warning should not be removed; otherwise the emulation definitely will give the wrong results).

Has anyone else encountered this warning, or does anyone know whether it has any impact on the program semantics?

Thanks!

HRZ · ‎04-19-2019

With respect to functional verification, I construct my host code so that both run-time and offline compilation are supported, the latter for FPGAs and the former for other devices, for which I use AMD's OpenCL SDK. As long as the run-time OpenCL driver is installed, the same host code can then be used to execute the same kernel on any type of CPU, GPU or FPGA. You can take a look at the host code/makefiles of the optimized benchmarks in the following repository as an example of achieving this:

https://github.com/fpga-opencl-benchmarks/rodinia_fpga

I emulated all of those kernels on CPUs/GPUs using the same host and kernel codes. What I would tell you is that if an NDRange kernel with sufficiently large local and global size performs correctly on a GPU, it should also perform correctly on an FPGA (unless there is a bug in the FPGA compiler). A CPU should also work fine even if the whole kernel runs on one core, since there will still be multiple threads (work-items) running on that core that could be issued out of order, and this is usually enough to expose concurrency issues. A GPU would likely be more trustworthy in this case, though.

With respect to, let's say, HDL vs. OpenCL, many old-school HDL programmers tend to think that OpenCL or HLS tools in general are insufficient and that it is possible to achieve better results using HDL. This is indeed true in some cases, like latency-sensitive or low-power applications where clock-by-clock control over the code is required, or applications that are limited by logic resources. I would not say this is the case for high-throughput applications where the limitation is memory/PCI-E bandwidth or DSP count, since these limitations are independent of the programming language. With respect to the particular case of unpipelinable nested loops, HDL or OpenCL would not make a difference. If you have a regular outer loop with an irregular inner loop, the outer loop cannot be pipelined; it doesn't matter how you "describe" the code. There are two ways to approach such loops on FPGAs:

1- Use NDRange and let the run-time work-item scheduler do its best in maximizing pipeline efficiency and minimizing the average loop II.

2- Collapse the nested loop as long as it is not too irregular and get an II of one at the cost of a noticeable Fmax hit. Though by "collapse" I mean manual collapse and not the compiler's "coalesce" pragma. Take a look at Section 3.2.4.3 in this document:

https://arxiv.org/abs/1810.09773

Even though the provided example involves collapsing a regular nested loop, this optimization also sometimes applies to irregular nested loops. In such a case, the condition inside the collapsed loop that is used to increment the variable of the original outer loop will have more than one statement (which complicates the critical path and reduces the Fmax). Indeed, the possibility also exists to implement parts of your application in HDL and use it as an HDL library in an OpenCL kernel, but you are going to run into complications if your HDL library does not have a fixed latency, and I highly doubt you would be able to achieve much better results in the end.
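To make the idea of a manual collapse concrete, here is a minimal sketch in plain C rather than OpenCL. It is not taken from the linked document or from any kernel in this thread. The names `ROWS`, `len`, `nested_sum` and `collapsed_sum` are hypothetical, and the sketch assumes every inner trip count is at least 1 and that the total trip count can be computed before the flat loop starts.

```c
#include <assert.h>

#define ROWS 4

/* Hypothetical irregular inner trip counts, one per outer iteration. */
static const int len[ROWS] = {3, 1, 4, 2};

/* Reference version: an ordinary two-level nested loop. */
static long nested_sum(void) {
    long acc = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < len[i]; j++)
            acc += (long)i * 10 + j;
    return acc;
}

/* Manually collapsed version: a single flat loop in which the original
 * loop variables i and j are advanced by hand. */
static long collapsed_sum(void) {
    long acc = 0;
    int total = 0;

    /* Total number of inner iterations over all outer iterations. */
    for (int r = 0; r < ROWS; r++) {
        assert(len[r] >= 1); /* assumption: no empty inner loops */
        total += len[r];
    }

    int i = 0, j = 0;
    for (int k = 0; k < total; k++) {
        acc += (long)i * 10 + j;
        j++;
        if (j == len[i]) { /* exit condition of the original inner loop */
            j = 0;
            i++;           /* increment of the original outer loop */
        }
    }
    return acc;
}
```

Both functions visit the same (i, j) pairs in the same order, so they return the same value. The `if` at the end of the flat loop is the condition mentioned above: the more irregular the inner loop, the more work ends up inside it.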

Finally, with respect to NDRange vs. Single Work-item, I recommend reading Section 3.1 (and particularly 3.1.4) of the document I posted above.


KennyTan_Altera · ‎04-01-2019

Hi,

1. Make sure there is no missing variable in your .cl code.
2. If there are no loops and the code is simple, use NDRange.
3. If there are loops, especially ones that can be pipelined, use single work-item (EnqueueTask).

Let me know if it helps.

hiratz · ‎04-01-2019

Thanks for your reply.

Actually my kernels are quite complex (I'm porting a data compression/decompression algorithm to the FPGA from its C version) and contain a lot of complex loops (some have a fixed number of iterations known at compile time, some do not). If I compile them with "__attribute__((task))", the report.html shows that some loops are pipelined but others are not. If I compile them as NDRange, no loops are pipelined.

What do you mean by "missing variable"? My four kernels can be compiled successfully for both emulation and real hardware. Only the compilation for hardware shows this warning if I use some key local memory variables. (If I use global or private ones, there are no such warnings) The emulated version always gives correct results.

KennyTan_Altera · ‎04-01-2019

Why don't you attach the *.cl files here so that we can start commenting on them?

hiratz · ‎04-01-2019

Hi KTan9,

I really appreciate your help and time!

I just attached all (.h and .cl) files (as a .zip file, Hiratz_device.zip) that are needed by the compilation. The main .cl file is decom_comp.cl that includes all other eight .h files.

I'm implementing the zfp floating point compression algorithm (https://computation.llnl.gov/projects/floating-point-compression) (version 0.5.4). The kernel "decomp" is for decompression, and the kernel "compress" is responsible for compression.

Naturally, zfp's block-level decompression/compression is very suitable for a GPU's SIMD processing. For example, an N x N matrix can be split into N_SEG segments or sections that are then processed by multiple work items at the same time. However, I'm still hoping that I can achieve the same acceleration on an Intel FPGA with all possible optimizations.
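As a rough illustration of that splitting, here is a plain C sketch that assigns the 4 x 4 blocks of an N x N matrix to segments. It is only an assumption about how such a partition could look, not the scheme used in the attached kernels. The names `seg_range`, `num_blocks` and `segment_blocks` are hypothetical, and N is assumed to be a multiple of 4.

```c
#include <assert.h>

#define BLOCK_DIM 4 /* zfp compresses 2D data in blocks of 4 x 4 values */

/* Half-open range of block indices owned by one segment. */
typedef struct {
    int first; /* index of the first block in the segment */
    int count; /* number of blocks in the segment */
} seg_range;

/* Number of 4 x 4 blocks in an n x n matrix (n assumed a multiple of 4). */
static int num_blocks(int n) {
    assert(n > 0 && n % BLOCK_DIM == 0);
    int per_side = n / BLOCK_DIM;
    return per_side * per_side;
}

/* Blocks assigned to segment `seg` (0-based) out of `n_seg` segments.
 * The first (total % n_seg) segments receive one extra block each. */
static seg_range segment_blocks(int n, int n_seg, int seg) {
    assert(n_seg > 0 && seg >= 0 && seg < n_seg);
    int total = num_blocks(n);
    int base = total / n_seg;
    int extra = total % n_seg;
    seg_range r;
    r.count = base + (seg < extra ? 1 : 0);
    r.first = seg * base + (seg < extra ? seg : extra);
    return r;
}
```

For a 16 x 16 matrix there are 16 blocks, so with three segments the ranges are [0, 6), [6, 11) and [11, 16). In an NDRange kernel each work item would take one such range; in a single work-item kernel the same ranges would be walked by an outer loop.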

Unfortunately, there are still some problems with the current version I attached here. For the NDRange version, besides the warnings I mentioned, num_simd_work_items(n) also does not work if I put it before the kernel function. If I increase the number of work items and the matrix dimension size (i.e., N), the kernel is no longer stable and sometimes produces wrong results (the emulation always produces correct results). The task (single work item) version does not give correct results at all. If I replace "zfp[MAX_SEG]" and "stream [MAX_SEG]" with global pointers in the kernel arguments (using global memory), the kernel works but is quite slow.

Two local memory arrays, "zfp[MAX_SEG]" and "stream [MAX_SEG]", are the culprits that cause the warnings I mentioned before. But according to the program semantics, all related code lines that cause these warnings should not be removed (otherwise the emulation running will give wrong results).

The two most complex nested loops are in "decode_ints_uint64(...) " in decode.2h and in "encode_ints_uint64" in encode_2d.h, respectively. They look similar: each is a 3-level nested loop whose inner 2-level nested loop is so irregular that I cannot find any effective way to rewrite or optimize it.

In addition, I also tried to use a channel to transfer lseg_size[] from the kernel "compress" to the kernel "merge_stream" but failed. The reason is related to the kernel "compress".

Overall, it seems that most of the optimizations mentioned in the "Best Practices Guide" are hard to apply to this application ...

The compilation command lines I am using are as follows:

1) for quick resource utilization report

aoc -v -c -I./device device/decom_comp.cl -o decom_comp.aocx -g board=pac_a10

2) complete compilation

aoc -v -I./device device/decom_comp.cl -o decom_comp.aocx -g board=pac_a10

Either 1) or 2) will show the warnings I mentioned before if you run them on an Intel vLab (Harp) machine, but the compilation eventually succeeds.

I've been stuck on the above-mentioned problems for over half a month. I appreciate your help so much!

KennyTan_Altera · ‎04-02-2019

I am getting the errors below when using your .cl files. Is there anything that needs to be changed in your code?

/data/ts_farm/kentan/Open_cl/forum/forum_1_case/device/decom_comp.cl:49:1: error: 'max_work_group_size' attribute requires exactly 3 arguments
attr_setup
^
/data/ts_farm/kentan/Open_cl/forum/forum_1_case/device/decom_comp.cl:46:20: note: expanded from macro 'attr_setup'
#define attr_setup attr_max_wg
^
/data/ts_farm/kentan/Open_cl/forum/forum_1_case/device/decom_comp.cl:40:36: note: expanded from macro 'attr_max_wg'
#define attr_max_wg __attribute__((max_work_group_size(256)))
^
/data/ts_farm/kentan/Open_cl/forum/forum_1_case/device/decom_comp.cl:83:1: error: 'max_work_group_size' attribute requires exactly 3 arguments
attr_setup
^
/data/ts_farm/kentan/Open_cl/forum/forum_1_case/device/decom_comp.cl:46:20: note: expanded from macro 'attr_setup'
#define attr_setup attr_max_wg
^
/data/ts_farm/kentan/Open_cl/forum/forum_1_case/device/decom_comp.cl:40:36: note: expanded from macro 'attr_max_wg'
#define attr_max_wg __attribute__((max_work_group_size(256)))
^
/data/ts_farm/kentan/Open_cl/forum/forum_1_case/device/decom_comp.cl:111:1: error: 'max_work_group_size' attribute requires exactly 3 arguments
attr_setup
^
/data/ts_farm/kentan/Open_cl/forum/forum_1_case/device/decom_comp.cl:46:20: note: expanded from macro 'attr_setup'
#define attr_setup attr_max_wg
^
/data/ts_farm/kentan/Open_cl/forum/forum_1_case/device/decom_comp.cl:40:36: note: expanded from macro 'attr_max_wg'
#define attr_max_wg __attribute__((max_work_group_size(256)))

hiratz · ‎04-02-2019

I know this problem. In the Programming Guide, the attribute max_work_group_size has three parameters, like "__attribute__((max_work_group_size(X, Y, Z)))". However, this made my compilation fail. Later on, I noticed that this post (https://forums.intel.com/s/question/0D50P00003yyRoGSAU/error-when-specifying-max-work-group-size?language=en_US) says it should be used with only one parameter. So I used it like "__attribute__((max_work_group_size(256)))" and it compiles successfully on the Intel vLab machine.

So you can try "__attribute__((max_work_group_size(256, 1, 1)))" instead.

BTW, I tested my files before I uploaded them and they were compiled successfully. Let me know if you have other issues when you compile them.

Thanks!

KennyTan_Altera · ‎04-04-2019

I was able to compile your design. Did you get an error something like "Aggressive compiler optimization: pushing out local memory contents"?

I tried to locate this in the report but was not able to find it. Where do you see this warning? Can you point me to the correct log files?

hiratz · ‎04-04-2019

No, I did not get the error "Aggressive compiler optimization: pushing out local memory contents". What I got was the warning "Aggressive aocl optimization: removing unnecessary storage to local memory".

What's your aocl version? My version is "aocl 17.1.1.273 (Intel(R) FPGA SDK for OpenCL(TM), Version 17.1.1 Build 273)". I did not see any related report file. These warnings were shown on the screen.

KennyTan_Altera · ‎04-04-2019

I am using the Q19.1 version, which was released yesterday. Maybe you can post a screenshot here so that I can compare it with what I see on my side.

Also, can you try on Q19.1?

hiratz · ‎04-04-2019

The attached zip file (Screen_Output.zip) contains two screenshot pictures. You can compare them with yours.

Currently Intel vLab's machine is the only platform I can use. Q17.1.1 is their default version for the pac_a10 board. I'm not sure if I can choose a different version as a customer (You can see the computing node "vsl111" assigned to me this time from the attached screenshot).

KennyTan_Altera · ‎04-04-2019

The message is actually in the decom.log files. I tried it on Q17.1 and the message is shown there, but with Q19.1 it no longer appears. I will get back to you on this.

hiratz · ‎04-04-2019

Thanks! If a newer version can solve this problem, that will be great. I'd appreciate it if you could tell me how to get access to Q19.1.

HRZ · ‎04-05-2019

http://fpgasoftware.intel.com/19.1/?edition=pro

hiratz · ‎04-05-2019

Thanks, HRZ! I will look at the setup scripts on vLab and see if I can configure it with a different compiler version.

hiratz · ‎04-11-2019

Hi KTan9 @KennyT_Intel and HRZ @HRZ ,

I'm sorry I have to bother you again.

I installed Quartus 19.1 in my home directory successfully, but some bsp-related errors occurred (I will show them below).

According to this post ("Using A10 PAC BSP with OpenCL SDK 18.1", https://forums.intel.com/s/question/0D50P00004894Hn/using-a10-pac-bsp-with-opencl-sdk-181?language=en_US), one can compile an OpenCL kernel using the latest Intel SDK with an older BSP version. Though @FJumaah shows a detailed configuration procedure, I could not get it to work by following it directly (at least on Intel vLab's pac_a10, which is not a generic Arria 10).

Then I tried a second configuration method:

I copied the configuration directory "/export/fpga/bin" to "$HOME/fpga/bin" and made the following changes:

1) In ./fpga/bin/sh/fpga_classes, change "fpga_quartus_version[fpga-pac-a10]="17.1.1" to "19.1" (or "18.1", etc.)

2) In all set-*-env files in ./fpga/bin, change "SCRIPT_DIR="/export/fpga/bin" to "$HOME/fpga/bin"

3) In setup-synth-env, on lines 52 to 57, change "/export/fpga/tools/quartus_pro" to "$HOME/intelFPGA_pro"

(For 18.1 and 17.1.1, step 3) is not applied because those versions are in /export/fpga/tools/quartus_pro.)

Then I run "source $HOME/fpga/bin/setup-fpga-env fpga-pac-a10

qsub-fpga"

as before.

I tested the above modified scripts with 17.1.1, 18.1 and 19.1. Only 17.1.1 works. Both 18.1 and 19.1 show bsp-related errors. The configured environment variables and part of the errors are shown below (I will attach the file "quartus_sh_compile.log" as well).

INTELFPGAOCLSDKROOT is set to /homes/hiratz/intelFPGA_pro/19.1/hld. Using that.
 
Will use $QUARTUS_ROOTDIR_OVERRIDE= /homes/hiratz/intelFPGA_pro/19.1/quartus  to find Quartus
 
AOCL_BOARD_PACKAGE_ROOT is set to /export/fpga/release/a10_gx_pac_ias_1_1_pv/opencl/opencl_bsp. Using that.
Adding /homes/hiratz/intelFPGA_pro/19.1/hld/bin to PATH
Adding /homes/hiratz/intelFPGA_pro/19.1/hld/host/linux64/lib to LD_LIBRARY_PATH
Adding /export/fpga/release/a10_gx_pac_ias_1_1_pv/opencl/opencl_bsp/linux64/lib to LD_LIBRARY_PATH
 
Configured FPGA environment for fpga-pac-a10:
  Quartus:  /homes/hiratz/intelFPGA_pro/19.1/quartus
  Platform: /export/fpga/release/a10_gx_pac_ias_1_1_pv
  OPAE:     /export/fpga/opae/install/opae-install-20190112
Starting interactive job on queue fpga-pac-a10
 
qsub: waiting for job 132804.iam-pbs to start
qsub: job 132804.iam-pbs ready

You can see the bsp I'm using is /export/fpga/release/a10_gx_pac_ias_1_1_pv which is the default bsp specifically for pac_a10. (Note that the generic a10 bsp is in /homes/hiratz/intelFPGA_pro/19.1/hld/board/a10_ref/). The SDK I'm using is /homes/hiratz/intelFPGA_pro/19.1/quartus. Both "ALTERAOCLSDKROOT" and "INTELFPGAOCLSDKROOT" are set to "/homes/hiratz/intelFPGA_pro/19.1/hld"

The bsp errors are as follows:

aoc: Linking with IP library ...
aoc: Checking if memory usage is larger than 100%...
aoc: Memory usage is not above 100.
Compiler Warning: addpipe in board_spec.xml is set to 1 which is no longer supported
Compiler Warning: global memory pipeline stage is now implemented in BSP instead
aoc: First stage compilation completed successfully.
Compiling for FPGA. This process may take a long time, please be patient.
Error (16045): Instance "ccip_std_afu|bsp_logic_inst|board_inst" instantiates undefined entity "board" File: /homes/hiratz/ndr-test/decom_comp/build/bsp_logic.sv Line: 133
Error (16185): Can't elaborate user hierarchy "ccip_std_afu|bsp_logic_inst|board_inst" File: /homes/hiratz/ndr-test/decom_comp/build/bsp_logic.sv Line: 133
Error (16185): Can't elaborate user hierarchy "ccip_std_afu|bsp_logic_inst" File: /homes/hiratz/ndr-test/decom_comp/build/BBB_cci_mpf/hw/rtl/cci-mpf-if/cci_mpf_if.vh Line: 38
Error (16185): Can't elaborate user hierarchy "ccip_std_afu" File: /homes/hiratz/ndr-test/decom_comp/build/platform/green_bs.sv Line: 183
Error (16186): Can't elaborate top-level user hierarchy
Error: Flow failed: 
Error: Quartus Prime Synthesis was unsuccessful. 6 errors, 413 warnings
Error (23035): Tcl error: ERROR: Error(s) found while running an executable. See report file(s) for error message(s). Message log indicates which executable was run last.
Error (23031): Evaluation of Tcl script a10_partial_reconfig/flow.tcl unsuccessful
Error: Quartus Prime Shell was unsuccessful. 12 errors, 413 warnings
Error: Compiler Error, not able to generate hardware

18.1 caused bsp errors similar to the above.

Note: The above bsp errors only happened for complete compilation to the hardware. The quick initial compilation for a report (aoc -report -v -rtl -I./device device/decom_comp.cl -board=pac_a10) works well.

kTan9 @KennyT_Intel : My 19.1 compilation showed neither "Aggressive compiler optimization: removing unnecessary storage to local memory" nor "Aggressive compiler optimization: pushing out local memory contents".

So I guess there must be something wrong with my configurations but I just cannot figure it out after I tried it again and again.

Would you please show me a correct configuration procedure you used for pac_a10 with Quartus 19.1 or 18.1? If you need more information, please let me know.

I really appreciate your help!

KennyTan_Altera · ‎04-12-2019

Hi,

You cannot use Q19.1 for pac_a10. I used a different board, for testing purposes only.

I will have to get back to you on how to remove the warning if possible. You may have to stick with Q17.1.1.

Thanks

hiratz · ‎04-12-2019

I see. Thank you!

I look forward to hearing good news from you.

Best

KennyTan_Altera · ‎04-12-2019

Hi,

Can you try acl19.1 with dcp1.2 (acds17.1.1)?

Let me know if you run into any problems.

Thanks,

hiratz · ‎04-12-2019

Is dcp 1.2 a newer BSP for the Intel pac a10? Unfortunately I do not know where it is on the vLab machine. I did find a directory "17.1.1_pac1.2" in /export/fpga/tools/quartus_pro/, but all files in it are actually symbolic links to /export/fpga/tools/quartus_pro/17.1.1

Or can I download it from some place?

Thanks

KennyTan_Altera · ‎04-12-2019

https://www.intel.com/content/www/us/en/programmable/documentation/mwh1391807309901.html#mwh1391807297091

Intel FPGA SDK for OpenCL Pro Edition and BSP Backwards Compatibility

To use an older BSP with the Intel® FPGA SDK for OpenCL™ , you must have a version of Intel® Quartus® Prime Pro Edition with the same version number as your BSP. For example, to use a Version 18.1 BSP with Intel® FPGA SDK for OpenCL™ Pro Edition Version 19.1, you need Intel® Quartus® Prime Pro Edition Version 18.1.