Data transfer via channels is taking more time

MFaiz · 12-06-2018

Hi All,

I am working on an application in which I transfer data from one kernel to another, autorun kernel. In my kernel, I copy the data from global memory into local variables before passing it through channels.

Please refer to the code at the end of this post.

The data transfer alone takes approximately 8-9 ms.

Can you please suggest ways to optimize this?

- The data we are passing via the channel is local; does it reside in local memory itself?

- Instead of passing the data, can I pass a pointer so that the autorun kernel receives the address and fetches the values itself?

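    #pragma OPENCL EXTENSION cl_intel_channels : enable

    // Channel element type matches the uint data being sent.
    channel uint chan __attribute__((depth(1024 * 10)));

    __kernel void producer(__global const uint *src)
    {
        // Copy the input into on-chip memory once.
        __local uint src_local[1024];
        for (unsigned i = 0; i < 1024; i++)
            src_local[i] = src[i];

        // Replay the buffer through the channel.
        // `iterations` is assumed to be defined elsewhere (e.g. a compile-time constant).
        for (int x = 0; x < iterations; x++) {
            for (int i = 0; i < 1024; i++) {
                write_channel_intel(chan, src_local[i]);
            }
        }
    }

    __attribute__((max_global_work_dim(0)))
    __attribute__((autorun))
    __kernel void consumer()
    {
        __local uint dst_local[1024];
        for (int x = 0; x < iterations; x++) {
            for (int i = 0; i < 1024; i++) {
                dst_local[i] = read_channel_intel(chan);
            }
        }
    }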

Thanks in advance

HRZ · 12-07-2018

Using the term "local memory" for single work-item kernels can be misleading, since there are no threads running in parallel that would share data through local memory, and the "__local" qualifier makes no difference in this kernel type either. For single work-item kernels, any data that is not in external memory is implemented as buffers/FIFOs/RAMs that use FPGA on-chip memory resources (registers and Block RAMs). This includes all variables, channels, etc.

Autorun kernels do NOT have an interface to host or external memory; hence, even if you pass a global memory pointer to an autorun kernel, you will not be able to read from the global buffer in the autorun kernel since it is not connected to the memory interface.
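To illustrate the point about autorun kernels, here is a minimal sketch of one way to structure the design so that the autorun kernel only ever touches channels. The second channel `out_chan`, the `writer` kernel, its argument `n`, and the `v + 1` placeholder operation are all hypothetical names introduced for illustration; they are not from the original post.

```c
#pragma OPENCL EXTENSION cl_intel_channels : enable

channel uint in_chan  __attribute__((depth(1024)));
channel uint out_chan __attribute__((depth(1024)));  // hypothetical result channel

// Autorun kernel: it takes no arguments, so no global pointer can reach it.
// All of its input and output goes through channels.
__attribute__((max_global_work_dim(0)))
__attribute__((autorun))
__kernel void consumer(void)
{
    while (1) {
        uint v = read_channel_intel(in_chan);
        write_channel_intel(out_chan, v + 1);  // placeholder processing
    }
}

// Host-launched kernel (hypothetical): it owns the global-memory
// interface and drains the results produced by the autorun kernel.
__kernel void writer(__global uint * restrict dst, const int n)
{
    for (int i = 0; i < n; i++) {
        dst[i] = read_channel_intel(out_chan);
    }
}
```

In this arrangement, only the host-launched kernels on either side of the autorun kernel are connected to external memory.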

I am not sure how you are judging that your kernel is slow. Considering your code snippet, your bottleneck is likely the external memory transfers, not the channel transfers, and it will likely run at the same speed even if you remove the channels.
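As a rough sanity check on the channel side, the following plain C sketch estimates how long the channel writes alone would take. It assumes the inner loop is pipelined with an initiation interval of 1 (one channel write per clock cycle) and that the consumer never stalls; the 200 MHz kernel clock used in the note below is a hypothetical figure, so substitute the value from your own compilation report.

```c
#include <stdint.h>

/* Cycles needed to push iterations * elems values through a channel,
 * ASSUMING one channel write is issued per clock cycle (II = 1)
 * and the channel never back-pressures the producer. */
uint64_t channel_write_cycles(uint64_t iterations, uint64_t elems)
{
    return iterations * elems;
}

/* Convert a cycle count to milliseconds for a kernel clock given in MHz.
 * cycles / (MHz * 1e6) seconds == cycles / (MHz * 1e3) milliseconds. */
double cycles_to_ms(uint64_t cycles, double clock_mhz)
{
    return (double)cycles / (clock_mhz * 1000.0);
}
```

Under these assumptions, `cycles_to_ms(channel_write_cycles(1, 1024), 200.0)` comes to about 0.005 ms per pass over the 1024 elements, so the channel writes by themselves would only add up to 8-9 ms if `iterations` were on the order of 1,500 to 1,800.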

P.S. There is no need to load data from global memory into a separate variable and then write it into the channel. You can write directly from global memory to the channel.
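A minimal sketch of the direct version, assuming the channel carries `uint` values and that `iterations` is passed as a kernel argument (both are assumptions, since the original snippet does not define them):

```c
#pragma OPENCL EXTENSION cl_intel_channels : enable

channel uint chan __attribute__((depth(1024)));

// Single work-item producer: streams src straight into the channel
// without an intermediate on-chip copy.
__kernel void producer(__global const uint * restrict src,
                       const int iterations)  // hypothetical argument
{
    for (int x = 0; x < iterations; x++) {
        for (int i = 0; i < 1024; i++) {
            // The global-memory read feeds the channel write directly.
            write_channel_intel(chan, src[i]);
        }
    }
}
```

Note that this version reads `src` from external memory on every pass of the outer loop. If the same 1024 values really are replayed many times, keeping an on-chip copy as in the original snippet avoids those repeated external reads.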