Application Acceleration With FPGAs

Uneven bandwidth between two sides of a channel in the OpenCL profile report

SBioo
Beginner

I have a design in which one kernel sends data through a channel to another, autorun kernel. When I profile the kernels, I see mismatched bandwidth results on the two sides of the channel. Here are screenshots of the bandwidths:

img1.png

img2.png

As you can see, "mem_read_data.cl" sends data to "winograd_transform_channels" at 9769 MB/s. On the other hand, "winograd_transform.cl" (autorun) receives the data at only 1850 MB/s, and the stall percentage on that side appears to be 81%. I don't understand how this is possible. For more information, this is how I reset the autorun profile counters on the host side:

img3.png

I assume that I'm resetting the counters in the right places.

Any idea what is going on here?

Thanks
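
A quick arithmetic check on the figures quoted in the post, offered as a sketch and not as a statement about how the profiler computes its numbers: if data crossed the channel at the sender's reported rate only during the time the receiver is not stalled, the average rate would be the peak rate times (1 - stall fraction). The function name `effective_rate` is made up for this illustration; the three numbers are the ones reported above.

```python
def effective_rate(peak_mb_s, stall_fraction):
    """Average rate if data moves at peak_mb_s only during non-stalled time."""
    return peak_mb_s * (1.0 - stall_fraction)


sender_mb_s = 9769.0    # reported on the mem_read_data side
receiver_mb_s = 1850.0  # reported on the winograd_transform side
stall = 0.81            # stall fraction reported on the receiving side

estimate = effective_rate(sender_mb_s, stall)
relative_gap = abs(estimate - receiver_mb_s) / receiver_mb_s
print(f"estimated {estimate:.0f} MB/s, reported {receiver_mb_s:.0f} MB/s, "
      f"gap {relative_gap:.2%}")
```

The estimate lands well within 1% of the receiving side's figure, so the two reports may be describing the same traffic measured over different time bases. That is only a hypothesis, and it says nothing about where the stall originates.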

4 Replies
HRZ
Valued Contributor III

Have you checked further down the pipeline, especially the memory write at the very end? The stalls could be propagating from there all the way up the pipeline. Also, do all your loops have the same II (initiation interval)?

P.S. Fantastic comments in your code. :D

SBioo
Beginner

I have checked the other PEs and also the final stage, which receives the final output and writes it back to memory. They suffer from the same stall.

I'm kinda afraid that I'm not capturing the PEs' counter values properly.

BTW, about the comments, that's how a software engineer survives FPGA programming :D

HRZ
Valued Contributor III

Then it sounds like the stalls are propagating from the bottom of the pipeline. I am afraid I have never profiled autorun kernels, so I cannot comment on whether you are capturing the counters correctly. However, I find it very unlikely for regular compute PEs or on-chip channels to become a performance bottleneck. As a test, you can remove all your PEs from the kernel and keep just the memory read and write kernels, connected directly to each other through a channel. If you get similar stalling on this simplified design, the problem is coming from memory. Note that if you are exhausting the external memory bandwidth, seeing such stalls on the channels is completely normal.
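
The back-pressure argument above can be illustrated with a toy cycle-level model in Python. It is not OpenCL and does not model the real hardware or the profiler: the stage layout, the channel depth, and the `writer_interval` parameter (cycles between items the memory writer can accept) are all assumptions made for this sketch. It shows only that when the last stage is slow, bounded channels fill up and every upstream stage stalls at roughly the same rate.

```python
from collections import deque


def simulate(cycles, depth, writer_interval):
    """Toy model: reader -> ch_a -> PE -> ch_b -> writer, one item per cycle at most."""
    ch_a, ch_b = deque(), deque()
    reader_stalls = pe_stalls = written = 0

    for cycle in range(cycles):
        # Writer: accepts one item only every `writer_interval` cycles
        # (stands in for limited external memory bandwidth).
        if cycle % writer_interval == 0 and ch_b:
            ch_b.popleft()
            written += 1

        # PE: forwards one item per cycle when it has input and ch_b has room.
        if ch_a:
            if len(ch_b) < depth:
                ch_b.append(ch_a.popleft())
            else:
                pe_stalls += 1  # blocked on a full output channel

        # Reader: tries to push one item per cycle into ch_a.
        if len(ch_a) < depth:
            ch_a.append(cycle)
        else:
            reader_stalls += 1  # blocked on a full output channel

    return {
        "reader_stall": reader_stalls / cycles,
        "pe_stall": pe_stalls / cycles,
        "throughput": written / cycles,
    }


if __name__ == "__main__":
    for interval in (1, 5):
        print(interval, simulate(cycles=10_000, depth=4, writer_interval=interval))
```

With `writer_interval=5` the writer drains one item every five cycles, so both the reader and the PE end up stalled on roughly 80% of cycles even though neither is slow itself. With `writer_interval=1` the reader never stalls in this model.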

SBioo
Beginner

Alright, thanks a lot. I will try that approach and see how things change, and will post an update here :)
