Data is read before accumulation is finished

Altera_Forum · ‎11-07-2017

===============================================================

for(uint t = 0; t < loop_cnt; t++) {
    //load data to data buffer
    for(uint w = 0; w < TILE_WIDTH; w++) {
        data[w] = read_channel_altera(data_in_ch);
    }
    for(uint h = 0; h < TILE_HEIGHT; h++) {
        weight[h] = read_channel_altera(weight_in_ch);
    }

    //compute the matrix tile multiplication using the PE(mac) array
    #pragma unroll
    for(uint w = 0; w < TILE_WIDTH; w++) {
        float data_temp = data[w];
        #pragma unroll
        for(uint h = 0; h < TILE_HEIGHT; h++) {
            float weight_temp = weight[h];
            float temp = data_temp * weight_temp;
            if(t == 0)
                output[h * TILE_WIDTH + w] = temp;
            else
                output[h * TILE_WIDTH + w] = output[h * TILE_WIDTH + w] + temp;
        }
    } //end of w loop
} //end of t loop

//declare output data to be enqueued in altera channel
lane output_lane;

for(uint w = 0; w < TILE_WIDTH; w++) {
    #pragma unroll
    for(uint h = 0; h < TILE_HEIGHT; h++) {
        //multiply with scale and plus bias before moving it out
        output_lane.lane_data[h] = output[h * TILE_WIDTH + w] * scale[h] + bias[h];
    }
    write_channel_altera(output_ch, output_lane);
}

========================================================================================

Here is a snippet of my code. I am doing a matrix multiplication and moving the data out through a channel once the accumulation is finished. But according to the hardware run, the output is not fully accumulated: it is moved out before the accumulation is finished. For example, if the correct output pattern is all 36, the hardware run result is a mix of values smaller than 36. The compilation report seems to support this. With TILE_WIDTH 4 and TILE_HEIGHT 8, the number of simultaneous reads to the "output" local buffer should be 32, but the report shows 40, because after accumulation I have 8 simultaneous reads to move the data out (32 + 8 = 40). So it looks like the accumulation and the moving out are happening at the same time!! This is very weird, because moving out should happen only after accumulation is finished.

Below is the report for the local buffer "output"

===========================================================================================

Local memory: Optimal.

Requested size 128 bytes (rounded up to nearest power of 2), implemented size 128 bytes, stall-free, 40 reads and 32 writes. Additional information:

- Banked on lowest dimension into 32 separate banks (this is a good thing).

- Reducing accesses to exactly one read and one write for all on-chip memory systems may increase overall system performance.

==========================================================================================

Any advice would be greatly appreciated!!

Altera_Forum · ‎11-08-2017

The 40 reads that the compiler reports are correct and do not have the implication you think they do. Your kernel will turn into one single pipeline, and there will be 40 read ports going from the pipeline to the "output" buffer. It does not matter in which part of the code those reads are located; the compiler will make sure all reads can be satisfied in parallel to avoid any stalls in the pipeline. Note that all of these ports are reported in the same part of the report; you should not expect to see 32 ports in one part and another 8 ports in another part.

Regarding outputs being sent to the channel before being fully accumulated: this is not possible. The compiler will ensure the loop on "t" fully finishes before the loop on the output channels starts. If you are seeing different output than what you expect, the problem is somewhere else. Have you tried checking what output you receive in the emulator and debugging by printing the intermediate values?
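For the emulator comparison suggested above, a plain C reference model can give known-good values to check against. The following is only a minimal sketch under stated assumptions: the function name `reference_lanes`, the constant input values and the zero-initialized buffer are hypothetical and not part of the posted kernel. It accumulates over every "t" first and only then applies scale and bias, which is the ordering the compiler is expected to preserve.

```c
#define TILE_WIDTH  4
#define TILE_HEIGHT 8

/* Hypothetical reference model (not the posted kernel): every data and
 * weight element is assumed to hold the same constant value. */
static void reference_lanes(unsigned loop_cnt,
                            float data_value, float weight_value,
                            const float scale[TILE_HEIGHT],
                            const float bias[TILE_HEIGHT],
                            float lanes[TILE_WIDTH][TILE_HEIGHT])
{
    /* Zero-initialized so that loop_cnt == 0 is still well defined. */
    float output[TILE_HEIGHT * TILE_WIDTH] = {0};

    /* Accumulation: one outer product per step t. */
    for (unsigned t = 0; t < loop_cnt; t++) {
        for (unsigned w = 0; w < TILE_WIDTH; w++) {
            for (unsigned h = 0; h < TILE_HEIGHT; h++) {
                float temp = data_value * weight_value;
                if (t == 0)
                    output[h * TILE_WIDTH + w] = temp;
                else
                    output[h * TILE_WIDTH + w] += temp;
            }
        }
    }

    /* Output stage: runs only after the whole t loop has finished. */
    for (unsigned w = 0; w < TILE_WIDTH; w++) {
        for (unsigned h = 0; h < TILE_HEIGHT; h++) {
            lanes[w][h] = output[h * TILE_WIDTH + w] * scale[h] + bias[h];
        }
    }
}
```

With all inputs set to 1, scale 1, bias 0 and loop_cnt 36, every lane entry should come out as 36; any smaller value from the emulator would point at the kernel logic rather than at the hardware.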

A few tips:

1- Be very careful with using uint loop variables; if you compare "uint" with "int", you could get very different behavior compared to what you expect. Specifically for the loop on "t", "loop_cnt" must also be uint for correct behavior.

2- You might have to initialize the output buffer; depending on how the compiler actually unrolls the loops, the accumulation line might result in undefined behavior if the buffer is not initialized.

3- You should probably consider changing the order of your "h" and "w" loops, or transposing your "output" buffer, so that when you unroll the loops, the unrolled accesses to the buffer can be coalesced, resulting in larger but fewer read and write ports. With your current implementation, if you keep increasing the size of the output buffer to the point that it has to be implemented using Block RAMs, you will get a really large replication factor from the compiler to support all the non-coalesced read and write ports.
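The first tip can be reproduced in plain C, since OpenCL C follows the same integer conversion rules for scalar types as far as I know. The sketch below is an illustration only, and the helper names are hypothetical. It shows what happens to the loop condition "t < loop_cnt" when "loop_cnt" is a signed int holding a negative value.

```c
#include <stdbool.h>

/* Hypothetical helper: does the first iteration of
 * "for (uint t = 0; t < loop_cnt; t++)" run when loop_cnt is a signed int?
 * The cast spells out the conversion that C applies implicitly when an
 * unsigned int is compared with an int. */
static bool first_iteration_runs_mixed(int loop_cnt)
{
    unsigned int t = 0;
    return t < (unsigned int)loop_cnt;  /* -1 becomes UINT_MAX here */
}

/* Same check with a signed loop variable: no conversion takes place. */
static bool first_iteration_runs_signed(int loop_cnt)
{
    int t = 0;
    return t < loop_cnt;
}
```

For a loop_cnt of -1 the mixed version enters the loop (and would keep iterating for a very long time), while the signed version skips it. For non-negative values both behave the same.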

Altera_Forum · ‎11-09-2017

Hi HRZ,

Thanks for your kind reply!

I will follow your advice and make some changes! If you don't mind, would you like to see my source code and report so that you can see my problems better? Your help would be greatly appreciated!!

Best regards,

Jiang Wenbo

Altera_Forum · ‎11-09-2017

I can take a look at your code, but I will be mostly unavailable in the following week and might not be able to help you much.

Altera_Forum · ‎11-09-2017

It's okay, I've been stuck here for more than one month. Do you have a personal email?

Altera_Forum · ‎11-09-2017

Since the board does not seem to allow private messages and I prefer not to post my email address directly on an open forum to avoid it being picked up by bots, please check this page (https://github.com/fpga-opencl-benchmarks/rodinia_fpga) for my email address. I am the second guy in the contact list (at the very bottom).

Altera_Forum · ‎11-10-2017

Hi HRZ,

Thanks for your advice. I removed the initialization part and declared a variable called last_sum in the accumulation loop, which is 0 when t is 0 and the current accumulator value when t is greater than 0, and that resolves the problem. I have 2 more questions:

1. With TILE_WIDTH = 4 and TILE_HEIGHT = 4, the "output" buffer is supposed to be duplicated 16 times (I intentionally made it BRAM), but in the report it takes 32 BRAMs. Where does this extra factor of 2 in the replication come from?

2. In the report, there are 16 simultaneous threads launched for the loop controlled by t. Does this mean pipelining? If yes, this loop is not pipelinable, due to the data dependency on the "output" buffer.

Any advice would be greatly appreciated!
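The last_sum change described in this post can be checked on the host with a small C sketch. This is an assumption-laden illustration, not the kernel itself: the function name `accumulate_step` is hypothetical and the channel reads are replaced by plain arrays. The point is that on the first step the old buffer content is never read, so stale or uninitialized values in "output" cannot leak into the sum.

```c
#define TILE_HEIGHT 8
#define TILE_WIDTH  4
#define CVEC        2

typedef struct {
    float vector[CVEC];
} vec;

/* Hypothetical host-side version of one step t of the accumulation. */
static void accumulate_step(unsigned t,
                            const vec data[TILE_WIDTH],
                            const vec weight[TILE_HEIGHT],
                            float output[TILE_HEIGHT * TILE_WIDTH])
{
    for (unsigned h = 0; h < TILE_HEIGHT; h++) {
        for (unsigned w = 0; w < TILE_WIDTH; w++) {
            /* Start from 0 on the first step, otherwise from the
             * running sum stored by the previous step. */
            float last_sum = (t == 0) ? 0.0f : output[h * TILE_WIDTH + w];

            /* Dot product over the CVEC elements of each vector. */
            for (unsigned vv = 0; vv < CVEC; vv++)
                last_sum += data[w].vector[vv] * weight[h].vector[vv];

            output[h * TILE_WIDTH + w] = last_sum;
        }
    }
}
```

Filling "output" with garbage before the first call and then running the steps in order should still give the expected sums, which is an easy property to test in the emulator as well.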

Altera_Forum · ‎11-10-2017

Please post your new code and also the report.

Altera_Forum · ‎11-10-2017

=========================================================================

#define TILE_HEIGHT 8
#define TILE_WIDTH 4
#define CVEC 2

typedef struct {
    float vector[CVEC];
} vec;

__kernel void some_kernel() {

........

    __local vec data[TILE_WIDTH];
    //weight buffer is two dimensional because whole rows of weight are buffered for reuse
    __local vec weight[TILE_HEIGHT];
    __local float output[TILE_HEIGHT * TILE_WIDTH];
    __local float scale[TILE_HEIGHT];
    __local float bias[TILE_HEIGHT];

    for(uint t = 0; t < (input_dim3 / CVEC) * FILTER_UNROLL_SIZE; t++) {
        //load data to data buffer
        for(uint w = 0; w < TILE_WIDTH; w++) {
            data[w] = read_channel_altera(data_in_ch);
        }
        for(uint h = 0; h < TILE_HEIGHT; h++) {
            weight[h] = read_channel_altera(weight_in_ch);
        }

        //compute the matrix tile multiplication using the PE(mac) array
        #pragma unroll
        for(uint h = 0; h < TILE_HEIGHT; h++) {
            vec weight_temp = weight[h];
            #pragma unroll
            for(uint w = 0; w < TILE_WIDTH; w++) {
                vec data_temp = data[w];
                //start from 0 on the first step, otherwise from the running sum
                float last_sum;
                if (t == 0)
                    last_sum = 0;
                else
                    last_sum = output[h * TILE_WIDTH + w];
                #pragma unroll
                for(uint vv = 0; vv < CVEC; vv++) {
                    last_sum += data_temp.vector[vv] * weight_temp.vector[vv];
                }
                output[h * TILE_WIDTH + w] = last_sum;
            }
        } //end of h loop
    } //end of t loop

    //declare output data to be enqueued in altera channel
    lane output_lane;

    //bias and scale
    for(uint w = 0; w < TILE_WIDTH; w++) {
        #pragma unroll 1
        for(uint h = 0; h < TILE_HEIGHT; h++) {
            output_lane.lane_data[h] = output[h * TILE_WIDTH + w] * scale[h] + bias[h];
        }
        write_channel_altera(output_ch, output_lane);
    }

..........

}

============================================================================================
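In the listing above, w is now the innermost unrolled loop of the accumulation, which lines up with the earlier tip about loop order. A rough way to see why the order matters is to look at the distance between the buffer indices touched by neighbouring iterations of the innermost loop. The C sketch below does just that; the helper names are hypothetical and it says nothing about what the compiler will actually decide.

```c
#define TILE_HEIGHT 8
#define TILE_WIDTH  4

/* Linear index used by the kernels for element (h, w) of the output tile. */
static unsigned output_index(unsigned h, unsigned w)
{
    return h * TILE_WIDTH + w;
}

/* Index distance between neighbouring iterations when w is innermost. */
static unsigned stride_w_innermost(void)
{
    return output_index(0, 1) - output_index(0, 0);
}

/* Index distance between neighbouring iterations when h is innermost. */
static unsigned stride_h_innermost(void)
{
    return output_index(1, 0) - output_index(0, 0);
}
```

With w innermost the unrolled accesses touch consecutive elements (stride 1), while with h innermost they are TILE_WIDTH elements apart. Consecutive accesses are the case where coalescing into wider ports is at least possible.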

Below is the report about the loop controlled by t

============================================================================================

Block34:

Maximum simultaneous execution: 16 threads

Local memories are replicated to maximize throughput.

See Area analysis of system for exact replication factor.

===========================================================================================

Below is the report about local memory "data", "weight", "output"

===========================================================================================

conv.cl:135 (data) ALUTs: 0 FFs: 0 BRAMs: 8 DSPs: 0 Local memo...

conv.cl:135 (data):

Local memory: Good but replicated.

Requested size 32 bytes (rounded up to nearest power of 2), implemented size 96 bytes, replicated 3 times total, stall-free, 1 read and 1 write. Additional information:

- Replicated 3 times to create private copies for simultaneous execution of 3 threads in the loop containing accesses to the array.

conv.cl:135 (weight) ALUTs: 0 FFs: 0 BRAMs: 16 DSPs: 0 Local memo...

conv.cl:137 (weight):

Local memory: Good but replicated.

Requested size 64 bytes (rounded up to nearest power of 2), implemented size 192 bytes, replicated 3 times total, stall-free, 1 read and 1 write. Additional information:

- Replicated 3 times to create private copies for simultaneous execution of 3 threads in the loop containing accesses to the array.

conv.cl:139 (output) ALUTS: 1025 FFs: 8192 BRAMS: 64 DSPs: 0 Local memo...

conv.cl:139 (output):

Local memory: Optimal.

Requested size 128 bytes (rounded up to nearest power of 2), implemented size 128 bytes, stall-free, 3 reads and 1 write. Additional information:

- Reducing accesses to exactly one read and one write for all on-chip memory systems may increase overall system performance.

=============================================================================================

Altera_Forum · ‎11-12-2017

I already explained the reason for this extra replication in your other thread (http://www.alteraforum.com/forum/showthread.php?t=56880&p=231291#post231291).

This replication factor is decided by the compiler and, at least in the older versions of the compiler, could be controlled using "#pragma max_concurrency".