Ordering of channel operations

tde_m · ‎06-18-2019

Hello,

I'm currently struggling with enforcing the order of write/read to channels.

I know, from the Intel programming guide, that two independent channels operation can be re-ordered by the compiler to generate efficient hardware.

However, this seems to occur even if there is a clear dependence. I've created a minimal working example for this:

#pragma OPENCL EXTENSION cl_intel_channels : enable
 
//message
typedef struct{
	bool request;
	int data;
}message_t;
 
//represents the status of the computation
typedef struct{
	bool start;
	message_t m;
}computation_t;
 
channel message_t channels[2] __attribute__((depth(2)));
 
// Auxiliary function for receiving data
void receive(computation_t *status, int *data){
    if(status->start){
        //at the beginning send the request for data
        write_channel_intel(channels[0],status->m);
        status_>start=false;
    }
    //receive the data and store it
    status->m=read_channel_intel(channels[1]);
    *data=status->m.data;
}
 
__kernel void comp(const int N, const int start, __global int *mem){
    int data;
    computation_t status;
    status.start=true;
    status.m.data=N;
    for(int i=0;i<N;i++)
    {
        //receive data, increment and store it to memory
        receive(&status,&data);
        data++;
        mem[i]=data;
    }
 
}
 
 
//generates a stream of data upon request
__kernel void generator(){
	//receive the request
	message_t m=read_channel_intel(channels[0]);
	for(int i=0;i<m.data;i++)
	{
            message_t send;
            send.data=i;
            send.request=false;
            write_channel_intel(channels[1],send);
	}
}

The "comp" kernel is characterized from a pipelined loop in which it receives data coming from the "generator" kernel using the "receive" function. At the first iteration, a request is sent to the generator in order to let it generate the right amount of data.

If I try to compile this, the channel operations of the "receive" function are re-oderdered, as can be seen from the report:

This occurs even if there is a clear dependency between the two.

Clearly, if in hardware it is first executed the read, this will lead to deadlock.

This happens with Quartus 18.1 and 19.1 (Stratix 10 as target board).

In you opinion, is it a compiler bug or I have to handle this in a different way?

Thanks

HRZ · ‎06-20-2019

I highly doubt the order of channel operations in the System Viewer section of the HTML report follows the actual order implemented by the compiler; the drawing probably prioritizes minimizing space used for the figure, rather than accuracy with respect to order of operations. I would go ahead and just compile and run the kernel to see if it actually gives incorrect results. You can also try adding a barrier in-between the two channel operations as a test to see if the order of the operations will change in the report.

tde_m · ‎06-20-2019

Hi,

I've already tried to generate the bitstream. The compiled version hangs (of course in emulation it works).

If I introduce a fence between the two channel operations, they result to be properly ordered, even if the loop II is now 7 instead of 1 (see figure). The compiled version works.

Still, there exists a dependency between the two channel operations so they should not be arbitrarily re-ordered.

HRZ · ‎06-21-2019

Then this looks like a compiler bug, indeed those channel operations should not be reordered. Can you mention the version of aoc you are using?

Maybe @MUsman or one of the other Intel-affiliated moderators can take a look at this issue and report to the engineering team for a possible fix.

tde_m · ‎06-21-2019

I'm using version 18.1.1 (build 263) and I'm compiling with the "-fpc -fp-relaxed: flags.

I got the same report and the same hardware behaviour using version 19.1

Douglas_P_Intel · ‎06-21-2019

Hi,

This is expected behavior. Within a kernel, multiple channel calls (to different channels) are considered independent. This can be a problem if channels form a cycle between kernels.

When I compile your code, the compiler generates a warning as follows: (check your log files)

Compiler Warning: Kernels comp and generator may form a cycle due to connectivity of channels channels[0] and channels[1]. Use mem_fence if you require source code-based ordering of channel operations. Channel depths cannot be optimized.

Ideally, you want to avoid forming a cycle between kernels with channels.

https://www.intel.com/content/www/us/en/programmable/documentation/mwh1391807965224.html#mwh1391806067772

HRZ · ‎06-21-2019

@douglas.prinn The "receive" kernel has a clear Write-after-Read hazard/dependency on "status->m". The two operations involving the variable, regardless of what those operations are, should not be reordered (by any sane compiler). The existence of a cycle a channels in this case, or channel operations for that matter, is irrelevant. Are you implying that aoc ignores data dependencies when channel operations are involved?

tde_m · ‎06-21-2019

Hi Douglas, thanks for your reply.

I've seen the warning, and I recognize that (from the compiler point of view) a potential cycle is recognized. Yet, this is want I desired and, in any case, the re-ordering of the compiler is not resolving this issue (which, in this case is not a problem).

So:

should I assume that the compiler will treat all the channel operations as independent even if there is a clear data dependency (write-after-read in this case)?
why the presence of the barrier increases the II up to 7?

Thanks for your support

Douglas_P_Intel · ‎06-21-2019

@tde m @HRZ - Yes, you should assume all channel operations are independent even if you think there is a dependency. I know it looks like a read / write dependency on status, but the order that the channel operations are done is not dependent on the read / write order of status. The compiler will order the reads and writes to "status" correctly, but it considers the channel operations totally independent. It would be the same if you were reading or writing to 2 totally different regions in memory. You don't see the reads and writes to status because it's a register. If you force it to a RAM as shown below, it becomes clear. (See diagram)

computation_t __attribute__((memory) status;

cycle

4 - start read from CH1

5 - read (LD) from status (gets old data)

5 - write (ST) to status (result from CH1 read)

12 - start write to CH0 using old data from status

Douglas_P_Intel · ‎06-21-2019

@tde m I don't see an increase in II when I add the mem fence in 18.1 or 19.1.

mem_fence(CLK_CHANNEL_MEM_FENCE);

//receive the data and store it

status->m=read_channel_intel(channels[1]);

tde_m · ‎06-21-2019

@douglas.prinn

I see the II increase, it jumps to 7 by inserting a fence exactly like you described (it jumps to 20 if the status variable is stored in memory). This is why the diagram was red in the picture attached above.

Also, if I compile with attibute memory, I obtain a different diagram wrt to yours (I don't have 4 stores, just one).

Still, I don't understand the argument about why channels are considered to be independent:

write channel take in input a variable V;
read channel writes into a variable Z

If Z==V, the compiler must enforce that the two operations are not swapped.

If this is not the case, please explain it clearly into your documentation.

Douglas_P_Intel · ‎06-21-2019

Here's one way to think of it

temp = Z; (cycle 5 above)

if(temp) write channel take in input temp;

read channel writes into Z

As long as temp=Z happens before read channel writes into Z, it doesn't matter what order the write channel and read channel happen.

Douglas_P_Intel · ‎06-21-2019

If I copy and paste the code above and add the mem fence I still see an II of ~1.

Douglas_P_Intel · ‎06-21-2019

tde_m · ‎06-21-2019

Hello,

Your reply with the tmp example clarified what is happening (thanks).

Still, I'm not understanding why is this happening: are there any explanations about the compiler not taking care of that dependencies?

Furthermore, having a cyclic dependencies (that the barrier does not resolve) is a thing that could happen frequently and ordering is what you rely on to build client-server patterns for example.

Concerning the higher II, I have the same II (7) even with the 19.1 compiler. I've attached the reports (the same occurs if I use the s10gx BSP included in the reports). Do you have any idea about why is this happening?

Thanks for you help

HRZ · ‎06-22-2019

You are compiling against Stratix 10 while @douglas.prinn is probably compiling against Arria 10. On Arria 10, II is one after adding the mem_fence while on Stratix 10 it becomes 7. This is likely because the default target operating frequency on Arria 10 is 240 MHz, while on Stratix 10 it has been increased to 480 MHz. The solution is to probably not use channels at all (or multi-kernel designs, or the autorun kernel feature, or any other useful feature of the compiler that worked perfectly fine on previous generation FPGAs but for some unexplained reason, has "high overhead" on Stratix 10) as written in Section 9 of the Best Practices Guide. You can also try reducing the Fmax targert using the --fmax switch to reduce the II, but that will likely also lower your final post-place-and-route operating frequency.

tde_m · ‎06-23-2019

I agree with you: iIf I compile against Arria 10 I also obtain an II=1.

For the Stratix 10:

the fmax switch does not help;
if I use non-blocking channels (like suggest in the Intel documentation), I get a Serial Exe on the for loop in the comp kernel (i.e. it is not pipelined)

Therefore, @douglas.prinn are there any ways of obtaining an II=1 for a Stratix 10?

Thanks

Douglas_P_Intel · ‎06-24-2019

You can get ii = 1 with Stratix 10 by removing the data dependency on the variable status in the main loop. There are several ways to remove the dependency.

One way is by moving the status variable inside the loop. See below.

__kernel void comp(const int N, const int start, __global int *mem){

int data;

// computation_t status;

// status.start=true;

// status.m.data=N;

for(int i=0;i<N;i++)

{

computation_t status;

if(i==0) { status.start=true; status.m.data=N; }

//receive data, increment and store it to memory

receive(&status,&data);

data++;

mem[i]=data;

}

tde_m · ‎06-24-2019

I can move the initialization inside (like you suggested), but not the declaration of the status variable, as this will result in incorrect code.

In your code you are essentially declaring status locally on the loop iteration

So, If move the initialization only, like the following:

computation_t status;

// status.start=true;

// status.m.data=N;

for(int i=0;i<N;i++)

{

if(i==0) { status.start=true; status.m.data=N; }

I still obtain an II of 7.

Douglas_P_Intel · ‎06-24-2019

Right. I see now. You are writing the previous value of status that you read from channel 1 into channel 0, so you can't move status inside the loop. The problem is that if the value of status is carried across loop iterations, iteration i+1 can't start until the value of status has been updated by iteration i. I’m not sure how to remove the dependency. I will have to think about it a little more.

tde_m · ‎06-28-2019

Thanks, it would be helpful. Up to now I was not able to remove it.