Intel® Quartus® Prime Software
Intel® Quartus® Prime Design Software, Design Entry, Synthesis, Simulation, Verification, Timing Analysis, System Design (Platform Designer, formerly Qsys)
Data transfer via channels is taking more time

MFaiz
Beginner

Hi All,

I am working on an application in which I transfer data from one kernel to another, autorun kernel. In my kernel, I copy the data from global memory into local variables before passing it through channels.

Please refer to the code below.

 

But the data transfer itself takes approximately 8-9 ms.

Can you please suggest ways to optimize this?

- The data we pass through the channel is local; does it reside in local memory?

- Instead of passing the data, can I pass a pointer so that the autorun kernel receives the address and fetches the values itself?

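#pragma OPENCL EXTENSION cl_intel_channels : enable

// `iterations` is assumed to be a compile-time constant defined elsewhere;
// the post does not show its definition.
// The channel element type is uint (the post had char) so that values read
// from src are not truncated.
channel uint chan __attribute__((depth(1024 * 10)));

__kernel void producer(__global const uint *src)
{
  __local uint src_local[1024];

  // Copy the input from global memory into on-chip memory.
  for (unsigned i = 0; i < 1024; i++)
    src_local[i] = src[i];

  for (int x = 0; x < iterations; x++) {
    for (int i = 0; i < 1024; i++) {
      // The post indexed src_local[2*i], which reads past the end of the
      // 1024-element buffer for i >= 512.
      write_channel_intel(chan, src_local[i]);
    }
  }
}

__attribute__((max_global_work_dim(0)))
__attribute__((autorun))
__kernel void consumer()
{
  __local uint dst_local[1024];

  for (int x = 0; x < iterations; x++) {
    for (int i = 0; i < 1024; i++) {
      // The post wrote to an undeclared dst; dst_local is the buffer above.
      dst_local[i] = read_channel_intel(chan);
    }
  }
}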

 

Thanks in advance

1 Reply
HRZ
Valued Contributor II

Using the term "local memory" for single work-item kernels can be misleading: there are no threads running in parallel that share data with each other through local memory, and the "__local" qualifier makes no difference in this kernel type, either. For single work-item kernels, any data that is not in external memory is implemented as buffers/FIFOs/RAMs that use FPGA on-chip memory resources (registers and Block RAMs). This includes all variables, channels, etc.

 

Autorun kernels do NOT have an interface to host or external memory; hence, even if you pass a global memory pointer to an autorun kernel, you will not be able to read from the global buffer in the autorun kernel since it is not connected to the memory interface.

 

I am not sure how you are judging that your kernel is slow. Considering your code snippet, your bottleneck is likely the external memory transfers, not the channel transfers, and it will likely run at the same speed even if you remove the channels.
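One way to check where the time goes is to build a variant of the producer that performs the same global memory reads but no channel writes, and time it with the same host-side method used for the original. The sketch below is only an illustration: the kernel name `producer_no_channel` and the `out` argument are hypothetical and not part of the original code; `out` exists only so that the compiler does not optimize the reads away.

```c
// Hypothetical test kernel (not from the original post): performs the same
// 1024 global memory reads as the producer, but writes nothing to a channel.
__kernel void producer_no_channel(__global const uint *restrict src,
                                  __global uint *restrict out)
{
    uint acc = 0;

    // Read every element once so the external memory traffic matches
    // the copy loop of the original producer.
    for (unsigned i = 0; i < 1024; i++)
        acc += src[i];

    // Store the sum so the loads above have an observable effect.
    out[0] = acc;
}
```

If this variant takes about as long as the original producer, the channel is unlikely to be the bottleneck.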

 

P.S. There is no need to load data from global memory into a separate variable and then write it into the channel. You can write directly from global memory to the channel.
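A minimal sketch of that suggestion, assuming the channel carries `uint` values and a single pass over the 1024 input elements. The kernel name `producer_direct` and the channel depth of 1024 are illustrative assumptions; the autorun consumer from the question would stay as it is and keep reading from `chan`.

```c
#pragma OPENCL EXTENSION cl_intel_channels : enable

// Assumed element type and depth; adjust to match the real design.
channel uint chan __attribute__((depth(1024)));

// Hypothetical variant of the producer: each value read from global memory
// goes straight into the channel, with no intermediate on-chip copy.
__kernel void producer_direct(__global const uint *restrict src)
{
    for (int i = 0; i < 1024; i++) {
        write_channel_intel(chan, src[i]);
    }
}
```

Only this kernel touches global memory here, which fits the point above that an autorun kernel has no memory interface and can receive data only through channels.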
