Re: [FPGA SDK for OpenCL] External IO channel problem.

PGorl1 · 05-15-2019

The system integrator fails when the data type of the external I/O channel is not 256b wide.

You can reproduce the error using the attached code.

$ make trigger_bug
aoc -v -v -v -D__TRIGGER_BUG__ -rtl -report -g krnl_chtest.cl -o krnl_chtest.aocr  -board="p520_max_sg280l" -no-interleaving=default
 
 ...
 
!===========================================================================
! The report below may be inaccurate. A more comprehensive           
! resource usage report can be found at krnl_chtest/reports/report.html    
!===========================================================================
 
+--------------------------------------------------------------------+
; Estimated Resource Usage Summary                                   ;
+----------------------------------------+---------------------------+
; Resource                               + Usage                     ;
+----------------------------------------+---------------------------+
; Logic utilization                      ;   69%                     ;
; ALUTs                                  ;   35%                     ;
; Dedicated logic registers              ;   35%                     ;
; Memory blocks                          ;   31%                     ;
; DSP blocks                             ;   29%                     ;
+----------------------------------------+---------------------------;
remove krnl_chtest.bc
/cm/shared/opt/intelFPGA_pro/18.1.1/hld/linux64/bin/system_integrator   --cic-global_no_interleave  --bsp-flow top --rand-hash 6e2a5000a9316341eb32862b79911918a518e834 /opt/intelFPGA_pro/18.1.1/hld/board/nalla_pcie/hardware/p520_max_sg280l/board_spec.xml "krnl_chtest.bc.xml" none 
Compiler Error: Trying to bind incompatible signal to input port
Compiler Error:    Port: avm_channel_id_kernel_input_ch0_read (N9custom_ic3hdl21AvalonStreamPortGroupE) / Signal: avm_channel_id_kernel_input_ch0_read (N9custom_ic3hdl23AvalonStreamSignalGroupE)
Compiler Error:    Bound signal: kernel_input_ch0 (N9custom_ic3hdl23AvalonStreamSignalGroupE)
Compiler Error: 
Compiler Error:    Port signal declaration:
Compiler Error:       logic avm_channel_id_kernel_input_ch0_read_valid;
Compiler Error:       logic avm_channel_id_kernel_input_ch0_read_ready;
Compiler Error:       logic [31:0] avm_channel_id_kernel_input_ch0_read_data;
Compiler Error:    Bound signal declaration:
Compiler Error:       logic kernel_input_ch0_valid;
Compiler Error:       logic kernel_input_ch0_ready;
Compiler Error:       logic [255:0] kernel_input_ch0_data;
Error: System integrator FAILED.
Refer to krnl_chtest/krnl_chtest.log for details.
 
make: *** [trigger_bug] Error 1

The file krnl_chtest.log does not provide any further details.

Issue found in Intel FPGA OpenCL SDK 18.0.1 and 18.1.1 for Bittware 520N.

The issue is not present in Intel FPGA OpenCL SDK 17.1 for Bittware 385A.

This issue matters because defining channels with a smaller width makes it possible to achieve a higher 'Occupancy%' when processing 256b per clock cycle (the width of the external I/O channels) is not needed or wanted.

For example, consider a network connection with a bandwidth of 40 Gbit/s. A 256b-wide channel can transfer data at a maximum rate of about 156 MHz (40 Gbit/s divided by 256 bits).

Using such a channel in a pipeline synthesized at a higher frequency will be a significant source of 'Stall%' (e.g. 48% in a 300 MHz design, where the channel can move a word in only about 156 out of every 300 cycles).
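The arithmetic behind these two figures can be checked with a short Python sketch. It is only a back-of-the-envelope model: it assumes the pipeline stalls in every cycle in which the channel cannot move a 256b word, and the helper names are made up for illustration.

```python
# Numbers taken from the post above; helper names are hypothetical.
LINK_BANDWIDTH_BPS = 40e9   # 40 Gbit/s network link
CHANNEL_WIDTH_BITS = 256    # physical width of the external I/O channel


def max_channel_rate_hz(bandwidth_bps, width_bits):
    """Highest rate at which full-width words can cross the link."""
    return bandwidth_bps / width_bits


def stall_fraction(kernel_clock_hz, channel_rate_hz):
    """Fraction of kernel cycles in which the channel cannot move a word.

    Simplifying assumption: one channel access per kernel cycle, and the
    pipeline stalls whenever the channel is not ready.
    """
    if kernel_clock_hz <= channel_rate_hz:
        return 0.0
    return 1.0 - channel_rate_hz / kernel_clock_hz


channel_rate = max_channel_rate_hz(LINK_BANDWIDTH_BPS, CHANNEL_WIDTH_BITS)
stall = stall_fraction(300e6, channel_rate)

print(f"channel rate: {channel_rate / 1e6:.2f} MHz")
print(f"stall at 300 MHz: {stall:.0%}")
```

Under these assumptions the sketch gives 156.25 MHz for the channel and roughly 48% stall for a 300 MHz kernel, which matches the numbers quoted above.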

HRZ · 08-13-2019

There are two approaches to tackle the problem of reading from/writing to an I/O channel with a width smaller than the physical width in an OpenCL kernel:

1- Manually pad the data to the physical width of the I/O channel. This reply explains perfectly why this approach is counterproductive:

https://forums.intel.com/s/question/0D70P000006S5nSSAS/

Essentially, even if the programmer wants to use only a small portion of the I/O channel's bandwidth, this approach limits the throughput of the design to the bandwidth of the I/O channel: the padded accesses exhaust the channel's bandwidth, and the pipeline stalls frequently to compensate.

2- Manually buffer the data in the OpenCL kernel and pass it to the channel through full-width writes once every few loop iterations as suggested here:

https://forums.intel.com/s/question/0D70P000006RWHbSAO

This approach is extremely difficult to implement if the physical channel width is not a multiple of the data size. It also carries a large area and operating-frequency overhead, since it requires barrel shifters or large register-based buffers to hold the data (using Block RAM-based buffers would increase the loop II). Finally, it requires at least one extra loop inside the main loop, which complicates the critical path formed by the loop exit conditions and further hurts operating frequency; Stratix 10 in particular is very sensitive to this issue.
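To make the second approach more concrete, here is a minimal software model in Python of the packing step. This is not OpenCL kernel code and says nothing about the resulting area or fmax. The function name `pack_words`, the 32b word size in the usage lines, and the choice to place the first word in the low bits are all assumptions made for illustration. The sketch deliberately handles only the easy case where the channel width is a multiple of the data size, and rejects the hard case described above.

```python
CHANNEL_WIDTH_BITS = 256  # physical width of the I/O channel


def pack_words(words, word_bits, channel_bits=CHANNEL_WIDTH_BITS):
    """Group narrow words into full-width channel words.

    Returns (full_words, (partial_buffer, partial_count)). The first word of
    each group lands in the low bits (an arbitrary choice for this sketch).
    """
    if channel_bits % word_bits != 0:
        # The difficult case from the text: it would need barrel shifting.
        raise ValueError("channel width is not a multiple of the data size")

    words_per_write = channel_bits // word_bits
    mask = (1 << word_bits) - 1

    full_words = []
    buffer, count = 0, 0
    for word in words:
        buffer |= (word & mask) << (count * word_bits)
        count += 1
        if count == words_per_write:
            # One full-width write every `words_per_write` iterations.
            full_words.append(buffer)
            buffer, count = 0, 0
    return full_words, (buffer, count)


# Usage: ten 32b values give one full 256b word plus two buffered leftovers.
full, (leftover, leftover_count) = pack_words(range(1, 11), word_bits=32)
print(len(full), leftover_count)
```

In hardware the `buffer` variable corresponds to the register-based buffer mentioned above, and the `count == words_per_write` check is the extra control logic that ends up on the critical path.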

Fortunately, Intel does not expect people to do only 512-bit reads/writes to memory, even though that is the physical width of the path between the kernel and each memory bank. In a similar fashion, expecting people to do only same-width reads/writes to I/O channels is unreasonable, especially when this problem was previously handled automatically by the tool-chain.

PGorl1 · 08-23-2019

Hi @NRaml,

We re-synthesized our designs with the 19.1.0 BSP. The suggested workaround that implements 128b-wide channels now works better, and we are able to achieve a good Occupancy%.

We tested the workaround by connecting the RX and TX kernels to another FPGA card which receives/transmits data using 256b-wide channels at full throughput.

The theoretical peak throughput of the external channel for the workaround kernels can be computed as

Peak throughput = 256b * min(0.5 * fmax, 156.25 MHz)
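As a quick numeric illustration, the formula can be evaluated in Python for a few kernel frequencies. The fmax values below are hypothetical and are not measurements from these designs. My reading of the 0.5 factor, which the post does not spell out, is that a 128b-wide workaround channel needs two kernel cycles per 256b word.

```python
CHANNEL_WIDTH_BITS = 256       # physical channel width
CHANNEL_MAX_RATE_HZ = 156.25e6 # 40 Gbit/s / 256 bit


def peak_throughput_bps(fmax_hz):
    """Peak throughput = 256b * min(0.5 * fmax, 156.25 MHz)."""
    return CHANNEL_WIDTH_BITS * min(0.5 * fmax_hz, CHANNEL_MAX_RATE_HZ)


# Hypothetical fmax values, chosen only to show both sides of the min().
for fmax_mhz in (250.0, 300.0, 312.5, 400.0):
    gbps = peak_throughput_bps(fmax_mhz * 1e6) / 1e9
    print(f"fmax = {fmax_mhz:6.1f} MHz -> peak = {gbps:.1f} Gbit/s")
```

Below 312.5 MHz the kernel clock is the limit; at or above it, the workaround kernels can in theory saturate the 40 Gbit/s link.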

Considering the code proposed by GGene, the original problem was in the TX kernel (see attached image).

With the 18.1.1 BSP, the Occupancy%, and consequently the achieved percentage of the theoretical peak throughput, was 83%. With the 19.1.0 BSP, this value is 97%.

So far, we are getting the same Occupancy% for other designs that implement this kind of workaround.

Cheers,

Paolo

Hazlina_R_Intel · 08-24-2019

Hi @PGorl1, good to know that you have made progress at your end. FYI, I have already forwarded the two pieces of feedback that you provided to our engineering team so they can assess them for future tool enhancements.

I apologize for the delayed responses on this case, which were mostly caused by a resource crunch at our end that we are working to resolve in the next few weeks. The next time you feel that you are not getting a quick or good enough response from this forum, please contact me directly at hazlina.ramly@intel.com to expedite the resolution. I am the manager/lead for the 25 engineers who currently support the forums and will be able to expedite resolutions as needed.

Please do use our forums going forward, and if you have feedback on our support or products/tools, please do not hesitate to reach out to me.