Hello,
I have two kernels: a producer (which generates a stream of float numbers) and a consumer (which reads the numbers and computes their sum):
#define W 32
channel float channel_float __attribute__((depth(W)));

//generates a stream of floating point numbers
__kernel void producer(int N)
{
    int outer_loop_limit=(int)(N/(W));
    int remainder=N-(outer_loop_limit)*(W);
    for(int i=0;i<outer_loop_limit;i++)
    {
        #pragma unroll
        for(int j=0;j<W;j++)
            write_channel_intel(channel_float,(float)(2.0));
    }
    //Remainder
    for(int i=0;i<remainder;i++)
        write_channel_intel(channel_float,(float)(2.0));
}
__kernel void consumer(int N, __global float* res)
{
    int outer_loop_limit=(int)(N/(W));
    int remainder=N-(outer_loop_limit)*(W);
    float acc_o=0, acc_i=0;
    float mult[W], x[W];
    for(int i=0; i<outer_loop_limit; i++)
    {
        #pragma unroll
        for(int j=0; j<W; j++)
            x[j]=read_channel_intel(channel_float);
        acc_i=0;
        #pragma unroll
        for(int j=0; j<W; j++)
            acc_i+=x[j];
        acc_o+=acc_i;
    }
    //Remainder
    acc_i=0;
    for(int i=0;i<remainder;i++)
    {
        x[i]=read_channel_intel(channel_float);
        acc_i+=x[i];
    }
    acc_o+=acc_i;
    *res=acc_o;
}
To increase the number of writes/reads per clock cycle, I've unrolled the inner loops W times. To handle a stream of N numbers, we then need a little extra code for the case in which N is not a multiple of W (the parts marked //Remainder in the code above).
Now, if I compile the code with the remainder part commented out, everything is fine: the inner loops are fully unrolled and the outer loops are pipelined with II=1.
If I keep the remainder handling, however, the compiler prints a list of warnings like "Multiple writes to channel channel_float. This may lead to bad QoR" and the loops are pipelined with II=34, claiming that there is a dependency between the writes into the channel.
Any hints on how to solve the problem?
Quartus version is 18.0, compiled for Arria 10.
You might be able to avoid the issue by merging both channel reads and both channel writes into one read and one write and instead use an "if" condition inside of the unrolled read/write loops to avoid reading/writing the out-of-bound data.
I've tried it, and the loops are pipelined with II=1. But this comes at the expense of a lower Fmax (as the report states).
The new code looks like this:
__kernel void producer(const int N)
{
    int outer_loop_limit=(int)(N/(W))+1;
    for(int i=0;i<outer_loop_limit;i++)
    {
        const int it=i*W;
        #pragma unroll
        for(int j=0;j<W;j++)
        {
            if(it+j<N) //handle boundary conditions
                write_channel_intel(channel_float,(float)(2.0));
        }
    }
}
__kernel void consumer(const int N, __global float* res)
{
    const int outer_loop_limit=(int)(N/(W))+1;
    float acc_o=0, acc_i=0;
    float mult[W], x[W];
    for(int i=0; i<outer_loop_limit; i++)
    {
        const int it=i*W;
        #pragma unroll
        for(int j=0; j<W; j++)
        {
            if(it+j<N) //handle boundary conditions
                x[j]=read_channel_intel(channel_float);
        }
        acc_i=0;
        #pragma unroll
        for(int j=0; j<W; j++)
        {
            if(it+j<N)
                acc_i+=x[j];
        }
        acc_o+=acc_i;
    }
}
The report states that a bottleneck occurs at line 34 (Fmax) due to data dependencies; it points at the comparisons in the two loop guards of the consumer and at the definition of the outer_loop_limit constant.
I am not getting any messages from the compiler about an Fmax bottleneck with your second code; however, in my environment, the compiler now fails to unroll the loops over the channel operations due to "conditional channel execution". Looks like my suggestion made things worse instead of better.
Is there any reason why you are trying to avoid passing the out-of-bound data via channels? Since W is small, the overhead of passing the few extra indexes will be extremely small. I suggest that you send the extra data via channels anyway to avoid the "conditional channel execution", but instead avoid processing the extra data. You already have the same condition in the second loop in the consumer kernel and that loop is correctly unrolled and pipelined; this should be enough to generate correct output even if you do send the out-of-bound data through the channels.
Sorry,
in the second posted code I forgot to copy the last line of the consumer kernel, which saves the result to memory (otherwise the compiler optimizes the whole kernel away):
//...
    *res=acc_o;
}
Apart from this, in my case too it complains about the conditional write, but then, according to the report, the loops are unrolled, although with a lower Fmax.
I would like to avoid generating more data than needed, for the sake of code portability (e.g. the generator may be implemented by some other code). Besides, even if I try it, the problem remains (the bottleneck is still Fmax) and a new compiler warning appears: "Cannot unroll loop for.body3 in producer because channel endpoints would undergo different amounts of unrolling"
The Fmax bottleneck only happens if one of the channel loops is not unrolled. I commented out both conditions in the channel loops, left the one in the compute loop, and all loops were pipelined with an II of one, without any Fmax bottleneck. This is the code:
#define W 32

#pragma OPENCL EXTENSION cl_intel_channels : enable
channel float channel_float __attribute__((depth(W)));

__kernel void producer(const int N)
{
    int outer_loop_limit=(int)(N/(W))+1;
    for(int i=0;i<outer_loop_limit;i++)
    {
        const int it=i*W;
        #pragma unroll
        for(int j=0;j<W;j++)
        {
            //if(it+j<N) //handle boundary conditions
            write_channel_intel(channel_float,(float)(2.0));
        }
    }
}

__kernel void consumer(const int N, __global float* res)
{
    const int outer_loop_limit=(int)(N/(W))+1;
    float acc_o=0, acc_i=0;
    float mult[W], x[W];
    for(int i=0; i<outer_loop_limit; i++)
    {
        const int it=i*W;
        #pragma unroll
        for(int j=0; j<W; j++)
        {
            //if(it+j<N) //handle boundary conditions
            x[j]=read_channel_intel(channel_float);
        }
        acc_i=0;
        #pragma unroll
        for(int j=0; j<W; j++)
        {
            if(it+j<N)
                acc_i+=x[j];
        }
        acc_o+=acc_i;
    }
    *res=acc_o;
}
Regarding the problem with portability, I would personally recommend against having multiple reads/writes from/to channels. In fact, up until v17.1 (or maybe v17.0), doing so was not even allowed. If you want to completely avoid extra channel reads/writes while having pipelineable loops, you might be able to achieve this by using W channels instead of one channel that is unrolled W times. Something like this:
// assumes a declaration like: channel float channel_float[W] __attribute__((depth(W)));
#pragma unroll
for (int j = 0; j < W; j++)
{
    if(it + j < N)
    {
        x[j] = read_channel_intel(channel_float[j]);
    }
}
In this case the compiler should not fail to unroll the loop due to conditional execution anymore, since now you have W channels instead of one.
OK, I see.
Multiple channels may work, but I have to keep their number low (at least, that is what the Intel documentation states).
Thanks