System integrator fails when the data type size of the external IO channel is different from 256b.
You can reproduce the error using the attached code.
$ make trigger_bug
aoc -v -v -v -D__TRIGGER_BUG__ -rtl -report -g krnl_chtest.cl -o krnl_chtest.aocr -board="p520_max_sg280l" -no-interleaving=default
...
!===========================================================================
! The report below may be inaccurate. A more comprehensive
! resource usage report can be found at krnl_chtest/reports/report.html
!===========================================================================
+--------------------------------------------------------------------+
; Estimated Resource Usage Summary                                   ;
+----------------------------------------+---------------------------+
; Resource                               + Usage                     ;
+----------------------------------------+---------------------------+
; Logic utilization                      ; 69%                       ;
; ALUTs                                  ; 35%                       ;
; Dedicated logic registers              ; 35%                       ;
; Memory blocks                          ; 31%                       ;
; DSP blocks                             ; 29%                       ;
+----------------------------------------+---------------------------;
remove krnl_chtest.bc
/cm/shared/opt/intelFPGA_pro/18.1.1/hld/linux64/bin/system_integrator --cic-global_no_interleave --bsp-flow top --rand-hash 6e2a5000a9316341eb32862b79911918a518e834 /opt/intelFPGA_pro/18.1.1/hld/board/nalla_pcie/hardware/p520_max_sg280l/board_spec.xml "krnl_chtest.bc.xml" none
Compiler Error: Trying to bind incompatible signal to input port
Compiler Error: Port: avm_channel_id_kernel_input_ch0_read (N9custom_ic3hdl21AvalonStreamPortGroupE) / Signal: avm_channel_id_kernel_input_ch0_read (N9custom_ic3hdl23AvalonStreamSignalGroupE)
Compiler Error: Bound signal: kernel_input_ch0 (N9custom_ic3hdl23AvalonStreamSignalGroupE)
Compiler Error:
Compiler Error: Port signal declaration:
Compiler Error: logic avm_channel_id_kernel_input_ch0_read_valid;
Compiler Error: logic avm_channel_id_kernel_input_ch0_read_ready;
Compiler Error: logic [31:0] avm_channel_id_kernel_input_ch0_read_data;
Compiler Error: Bound signal declaration:
Compiler Error: logic kernel_input_ch0_valid;
Compiler Error: logic kernel_input_ch0_ready;
Compiler Error: logic [255:0] kernel_input_ch0_data;
Error: System integrator FAILED. Refer to krnl_chtest/krnl_chtest.log for details.
make: *** [trigger_bug] Error 1
The file krnl_chtest.log does not provide any further details.
Issue found in Intel FPGA OpenCL SDK 18.0.1 and 18.1.1 for Bittware 520N.
The issue is not present in Intel FPGA OpenCL SDK 17.1 for Bittware 385A.
This issue matters because defining channels with a smaller width makes it possible to reach a higher 'Occupancy%' when processing 256b (the width of the external I/O channels) per clock cycle is not needed or wanted.
For example, consider a network connection with a bandwidth of 40 Gbit/s. A 256b-wide channel can transfer data at 156 MHz at most.
Using such a channel in a pipeline synthesized at a higher frequency becomes a significant source of 'Stall%' (e.g. 48% in a 300 MHz design).
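As a rough check of the numbers above, here is a minimal C sketch of the arithmetic. The function names are hypothetical and the model is deliberately simple: it assumes the kernel wants one full channel word per clock cycle and ignores any protocol overhead on the link.

```c
#include <assert.h>
#include <stdint.h>

/* Highest rate (in Hz) at which full channel words can arrive over the link.
 * Hypothetical helper: link bandwidth divided by the channel word width. */
static uint64_t channel_word_rate_hz(uint64_t link_bits_per_s, uint32_t word_bits) {
    assert(word_bits > 0);
    return link_bits_per_s / word_bits;
}

/* Lower bound on Stall% for a kernel that tries to consume one channel word
 * per cycle while words arrive at word_rate_hz. */
static double min_stall_percent(uint64_t kernel_hz, uint64_t word_rate_hz) {
    assert(kernel_hz > 0);
    if (word_rate_hz >= kernel_hz) {
        return 0.0; /* the link keeps up with the kernel, no stall expected */
    }
    return 100.0 * (1.0 - (double)word_rate_hz / (double)kernel_hz);
}
```

With a 40 Gbit/s link and a 256b word this gives 156.25 MHz, and a stall bound of roughly 48% for a 300 MHz kernel. With a 128b word the word rate doubles to 312.5 MHz, so under this model a 300 MHz kernel would not stall on the channel at all.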
Have you confirmed with Nallatech that this is not a BSP issue? It is very much possible they did not include the necessary logic in their BSP for the 520N board to convert the channel read/write width if the logical width and physical width do not match.
I see. Then it would probably be faster if they directly raise the issue with Intel through their direct channels, rather than going through the forums.
BittWare has recreated this behavior compiling against the 19.1 520 BSPs, and using the 19.1 OpenCL Compiler with an Arria10 BSP.
Attached is an example showing how to transfer data over the IO channels when the data is narrower than 256b. Hopefully this will be of some use while you await a response from Intel. The example transfers two 128b data words every 2nd kernel clock cycle. This lets the kernel maintain its clock rate and should remove the IO channels as a source of stall. In this example, an even number of words must always be transmitted.
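For readers without access to the attachment, the sending side of this idea can be modeled in plain C roughly as follows. This is only a host-side sketch of the pairing logic, not the attached kernel code; the type and function names (word128, word256, pair_packer) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t lo, hi; } word128;   /* hypothetical 128b payload */
typedef struct { word128 half[2]; } word256;   /* one full-width channel word */

typedef struct {
    word128 pending;     /* first word of the pair, held for one cycle */
    bool    has_pending; /* true while a word is waiting for its partner */
} pair_packer;

/* Models one kernel cycle on the sending side. Returns true when *out holds
 * a complete 256b word to write to the IO channel (every 2nd call). */
static bool pair_packer_push(pair_packer *p, word128 w, word256 *out) {
    assert(p != NULL && out != NULL);
    if (!p->has_pending) {
        p->pending = w;
        p->has_pending = true;
        return false;
    }
    out->half[0] = p->pending;
    out->half[1] = w;
    p->has_pending = false;
    return true;
}

/* True if a word is stranded in the packer, i.e. an odd number of words
 * has been pushed so far. */
static bool pair_packer_has_leftover(const pair_packer *p) {
    assert(p != NULL);
    return p->has_pending;
}
```

The leftover check illustrates why an even number of words is required: with an odd count, the last 128b word never reaches the channel unless the sender adds a dummy partner word.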
None of the previous and current versions of the Bittware BSPs contain any logic to resize IO channel interfaces automatically. Any success on previous BSPs is likely due to logic added by the Intel aoc compilation process.
Thanks for performing this test.
We noticed that the BSP you tested with was named the "a10gx" BSP. If this BSP is unaltered, it does not contain any channels, and while the compiler may have passed RTL generation, it will probably fail later in the compilation process. However, the Intel "a10gx_hostpipe" BSP does contain IO channels, and we were able to recreate the bug/error with this BSP. Attached are the modified kernel code and the compilation output.
Could you please try the BSP with the IO channel and verify you can replicate the issue?
I have shared the case with our Engineering team. There is no change in the HLD tools from 17.1 to 18.0 that would have caused this. System integrator now enforces bit width checking to prevent any mismatches in this case. In summary, the tool behavior in 17.1 (letting a width mismatch compile) was a bug, and the behavior is correct now.
You can just pad the unused 224 bits of the i/o channel in the kernel code to get the desired effect. Or better yet, make use of all 256 bits.
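To make the padding suggestion concrete, here is a minimal C model of it, assuming a 32-bit payload as in the error log above. The names chan_word, pad_to_channel_width and strip_padding are hypothetical; in an actual kernel the padded type would be whatever 256-bit struct or vector the channel is declared with.

```c
#include <assert.h>
#include <stdint.h>

#define CHANNEL_LANES 8u  /* 8 x 32 bits = 256 bits, the physical channel width */

typedef struct { uint32_t lane[CHANNEL_LANES]; } chan_word;

/* Number of unused bits per channel word for a given payload width. */
static unsigned padding_bits(unsigned payload_bits) {
    assert(payload_bits <= 32u * CHANNEL_LANES);
    return 32u * CHANNEL_LANES - payload_bits;
}

/* Sender side: place a 32-bit payload in lane 0 and zero the other 224 bits. */
static chan_word pad_to_channel_width(uint32_t payload) {
    chan_word w = {{0}};
    w.lane[0] = payload;
    return w;
}

/* Receiver side: keep lane 0 and discard the padding. */
static uint32_t strip_padding(chan_word w) {
    return w.lane[0];
}
```

Note that with this scheme each channel word carries only 32 of 256 bits as payload, i.e. 12.5% of the channel width.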
How was this a "bug" in v17.1 when everything worked correctly and there was no crash, data corruption, etc.? I wrote a test for the serial I/O channel between the two FPGAs on the dual-Arria 10 Nallatech 510T board (https://github.com/zohourih/FPGAStream/blob/master/fpga-stream-kernel-sch.cl) and, regardless of bit width (both lower and higher than the physical 256-bit width), I verified the integrity of the data passing through the channel in all cases and nothing was ever corrupted [unless I did something wrong somewhere]. I also verified that channel throughput scaled linearly with data size until the channel's bandwidth was saturated. This led me to believe that the compiler adds a buffer between the kernel and the I/O interface to adjust the data width. In fact, such a buffer would be required anyway, since the kernel runs at a different clock than the I/O channel, and I fail to see why the same buffer cannot also be used to adjust the data width.
Please, consider the attached image. It shows the profile report for two Intel FPGA OpenCL SDK 17.1 Bittware 385A designs.
In both, data is transferred between two kernels using external channels: 'krnlA_send' sends to 'krnlB_recv' and 'krnlB_send' sends to 'krnlA_recv'.
On the left, you have 'float4' (128b) external channels. On the right, 'float8' (256b) external channels. Both reach almost the same fmax: 308 MHz (float4) and 294.55 MHz (float8).
As you can see, the bandwidth measurement (last column) shows the same values in both cases: 4900+ MByte/s, very close to the peak performance of 5000 MByte/s (40 Gbit/s). So, there is no issue on this point.
The main difference is in the 'Stall%' and 'Occupancy%' columns.
In the 'float4' case, there is no stall, i.e. we get full occupancy. This means that there are no bubbles in the pipeline: every clock cycle is actively used for computation.
On the other hand for 'float8', 'Stall%' is equal to 47%. This is expected. In fact, given a bandwidth of 40 Gbit/s, an Intel channel whose datatype is 256b-wide can transfer data at most at 156 MHz.
Considering that the actual frequency of the design is 294 MHz, the maximum 'Occupancy%' achievable can be at most (156MHz / 294MHz) = 53%.
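The same bound can be written in terms of bytes per second, which makes the float4 and float8 cases directly comparable. The following C sketch uses a hypothetical helper name and assumes a simple model in which the kernel consumes one channel word per cycle and the link peaks at 5000 MByte/s.

```c
#include <assert.h>

/* Upper bound on Occupancy% for a kernel that reads one channel word of
 * word_bytes bytes per cycle at fmax_mhz, fed by a link that delivers at
 * most link_mbyte_per_s. Hypothetical helper for illustration only. */
static double max_occupancy_percent(double link_mbyte_per_s,
                                    unsigned word_bytes,
                                    double fmax_mhz) {
    assert(word_bytes > 0 && fmax_mhz > 0.0);
    /* MByte/s the kernel would consume at one word per cycle */
    double demand = (double)word_bytes * fmax_mhz;
    double ratio = link_mbyte_per_s / demand;
    return 100.0 * (ratio < 1.0 ? ratio : 1.0);
}
```

For float4 (16 bytes) at 308 MHz the demand is 4928 MByte/s, just under the 5000 MByte/s peak, so the bound is 100% and the measured 4900+ MByte/s is consistent with a stall-free pipeline. For float8 (32 bytes) at 294.55 MHz the demand is about 9426 MByte/s, so the bound drops to roughly 53%.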
Sorry for the long preamble, but it's necessary to share with you my point of view.
Defining channels with a smaller width would let me reach a higher 'Occupancy%' when I don't need or want to process 256b (the width of the external channels) per clock cycle.
This behavior was correctly implemented in the Intel FPGA OpenCL SDK 17.1 BSP for Bittware 385A.
If I'm forced to process 256b, the resulting 'Stall%' will propagate along the pipeline and will affect the other computations and memory accesses.
Moreover, for complex designs, extending the data path to 256b is not always feasible in terms of area utilization.
I know that I could write workarounds in my code in order to overcome these issues. But this will cause drawbacks that were not present previously.
>If I'm forced to process 256b, the resulting 'Stall%' will propagate along the pipeline and will affect the other computations and memory accesses.
This is a perfectly valid point to explain why "padding" the data to match the physical width is not a good idea. You will be essentially limiting your kernel "throughput" to the throughput of the I/O channel, even if you don't need to fully utilize the channel throughput.
Before 18.0, we used the Qsys flow in system integrator, and Qsys adapted the channel widths, which came with decent performance as described. However, this benefit of using Qsys was an unintended side effect. The expectation was that the data type matches the channel width. Without the Qsys flow in 18.0 and later, the adaptation no longer exists and the compile fails if the widths don't match.
Unfortunately, the only workaround right now is to implement the width adaptation in your OpenCL code. That means not padding but instead buffering the smaller packets into one large one covering the entire channel width before writing and then reading every 2nd, 3rd, 4th, etc. cycle depending on what the width of the data type is. With that change, it should be similar to what you were getting in 17.1 with the Qsys flow.
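The receiving half of such a width adapter can be modeled in plain C as shown below. This is only a sketch of the buffering logic under stated assumptions, not kernel code: the names (chan_word, unpacker) are hypothetical, the channel word is modeled as eight 32-bit lanes, and the item width is assumed to be a multiple of 32 bits that divides 256 evenly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CHANNEL_LANES 8u  /* 8 x 32 bits = 256 bits */

typedef struct { uint32_t lane[CHANNEL_LANES]; } chan_word;

typedef struct {
    chan_word buf;            /* last full-width word read from the channel */
    unsigned  lanes_per_item; /* item width in 32-bit lanes: 1, 2, 4 or 8 */
    unsigned  next_lane;      /* next unread lane; CHANNEL_LANES means empty */
} unpacker;

/* How many kernel cycles one channel read feeds, e.g. 2 for 128b items. */
static unsigned cycles_per_channel_read(unsigned item_bits) {
    assert(item_bits >= 32 && item_bits % 32 == 0 && 256 % item_bits == 0);
    return 256 / item_bits;
}

static void unpacker_init(unpacker *u, unsigned item_bits) {
    u->lanes_per_item = CHANNEL_LANES / cycles_per_channel_read(item_bits);
    u->next_lane = CHANNEL_LANES; /* start empty: first cycle must read */
}

/* True on the cycles where a new 256b word must be read from the channel. */
static bool unpacker_needs_word(const unpacker *u) {
    return u->next_lane >= CHANNEL_LANES;
}

static void unpacker_load(unpacker *u, chan_word w) {
    u->buf = w;
    u->next_lane = 0;
}

/* Copy the next narrow item into out[0 .. lanes_per_item-1]. */
static void unpacker_pop(unpacker *u, uint32_t *out) {
    assert(!unpacker_needs_word(u));
    for (unsigned i = 0; i < u->lanes_per_item; ++i) {
        out[i] = u->buf.lane[u->next_lane + i];
    }
    u->next_lane += u->lanes_per_item;
}
```

In a kernel loop, the equivalent logic would perform the channel read only on iterations where unpacker_needs_word is true and consume one narrow item per iteration. The writing side is the mirror image: accumulate items into a buffer and write to the channel once the buffer is full.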
Did you consider this approach?
We have been considering the suggested approach since the beginning.
From a functional point of view, it works.
However, this approach carries performance penalties: lower fmax and lower Occupancy%.
Is the automatic width adaptation going to be brought back in the future versions of the compiler, or is it going to be left as it is?
Hello. Intel currently has no plan to bring back the automatic width adaptation. If you think this would be a useful feature for you and other customers going forward, please let me know the specific use cases.
Hi @PGorl1, I am the manager of the engineer assigned to this case. Could you please elaborate on what you meant when you said that this approach adds performance penalties (lower fmax and Occupancy%)? Do you have any data you could provide to support this, so that we can evaluate on our end what we could do differently?