Relaxing Data Dependencies on Memory Access

Altera_Forum · ‎03-17-2016

I'm trying to stream in a block of contiguous memory, but only process the date until an end marker is reached. Put simply, iterate through an array until a certain value is found, after which all further elements are to be ignored. A simplistic solution with a OpenCL single work item kernel wold be as follows:


__kernel void in_streamer(__global const uint2* in, uint n) {
    for(uint i = 0; i != n; ++i) {
        uint2 value = in;
        write_channel_altera(chan, value);
        
        if(value.s0 == END_MARKER) {
            break;
        }
    }
}
    
__kernel void consumer() {
    while(true) {
        uint2 value = read_channel_altera(chan);
        
        // do work here
        
        if(value.s0 == END_MARKER) {
            break;
        }
    }
}

The consumer kernel is entirely unproblematic, the data dependency to the previous iteration only contains a equality operation. The in_streamer, while working as intended, causes terribly poor performance because there is a data dependency on a memory load operation. The AOCL compiler produces the following warning in the optimization report: "Successive iterations launched every 164 cycles due to: Data dependency on variable, Largest Critical Path Contributor: 98%: Load Operation". This in itself is of course nothing special. I've dealt with such data dependencies before by using a shift register to relax the dependency as the Altera Best Practice Guide suggests.

The Idea is to let allow the compiler to pipeline an expensive operation. To make this possible one I can't use the data in the next iteration, but only after a large number iterations. This usually worked for me in these kinds of problems. It doesn't seem to work with memory accesses.

The following solution tries to implement the in_streamer to break the loop after the end marker was found, but not immediately, in order to relax the dependency. The elements that are read after the end marker was found are discarded and not written to the channel:


__kernel void in_streamer(__global const uint2* in, uint n) {
    const uint MEM_DELAY = 164;
    bool endmarker_reached;
   # pragma unroll
    for(int s = 0; s < MEM_DELAY; ++s) {
        endmarker_reached = false;
    }
    
    for(uint i = 0; i < n; ++i) {
        uint2 value = in;
        if(endmarker_reached) {
            write_channel_altera(chan, value);
        }
        
        bool end_it = false;
        if(value.s0 == 0x70000000) {
            end_it = true;
        }
       # pragma unroll
        for(int s = (MEM_DELAY-1); s > 0; --s) {
            endmarker_reached = endmarker_reached;
        }
        endmarker_reached = end_it;
        
        if(endmarker_reached) {
            break;
        }
    }
}

Here I run into a problem. While the dependency is relaxed I still get reduced performance, just not as badly reduced as before. The optimization report now says "Successive iterations launched every 2 cycles...". It then gives the following details over a hundred times: "Data dependency on variable, Largest Critical Path Contributor: 45%: Load Operation".

While this is much better than before, it's still a massive waste of processing time. It doesn;t matter how high I set the constant MEM_DELAY, the issue remains.

Another working solution would be the following:


__kernel void in_streamer(__global const uint2* in, uint n) {
    bool end = false;
    
    for(uint i = 0; i != n; ++i) {
        uint2 value = in;
        if(!end) {
            write_channel_altera(chan, value);
        }
        
        
        if(value.s0 == END_MARKER) {
            end = true;
        }
    }
}

This works and both kernels are pipelined perfectly. The problem is, that the input array is read to the very end, the values are only discarded after the end marker.

Has anyone encountered a similar issue? I'd be very interested in where the delay comes from.

Altera_Forum · ‎03-17-2016

One thing I found so far is that in version 15.1 of the compiler (has been out for a while), the loop iterations in producer are launched every 4 cycles, with Load still being major part of critical path. With 15.0, I'm seeing 164 cycles just like you are.

The code has to wait for load to complete to know if write_channel_altera() should be done during the next iteration -- so that's your loop-carried dependency. This delay is <delay of load> + <delay of integer comparison to END_MARKER>. In 15.1, the expected delay of the simple load has changed, hence the drop in number of cycles between iterations from 164 to 4.

I'm not sure how to relax this below 2 (as you already found).

Altera_Forum · ‎03-18-2016

--- Quote Start ---

The code has to wait for load to complete to know if write_channel_altera() should be done during the next iteration -- so that's your loop-carried dependency.

--- Quote End ---

But isn't that the case as well in the third code sample I gave. I don't get any timing issues there. Furthermore: when I comment out that part of the code as follows, the problem persists:


//if(endmarker_reached) {
    write_channel_altera(chan, value);
 //}

As for the 15.1 compiler: Unfortunately I don't have a board support package for that yet.