How to tell the compiler that store operations are not dependent?

NSriv2 · ‎12-17-2018

Hi,

I am trying to compile the following kernel using aoc 18.0

__kernel void C_unloader (
  const int num_elems,
  __global int* restrict C,
  __local int16* __shared
) {
    bool read_success[SYS_NUM_ROWS][SYS_NUM_COLS];
    Drain_data data[SYS_NUM_ROWS][SYS_NUM_COLS];
 
    int count = 0;
    while (count != num_elems) {
      #pragma ivdep array(C)
      #pragma unroll
      for (int y_id = 0; y_id < SYS_NUM_ROWS; y_id++) {
        #pragma ivdep array(C)
        #pragma unroll
        for (int x_id = 0; x_id < SYS_NUM_COLS; x_id++) {
           data[y_id][x_id] = read_channel_nb_intel(C_unloader_channel[y_id][x_id], &read_success[y_id][x_id]);
           if (read_success[y_id][x_id]) {
             count++;
             C[data[y_id][x_id].addr] = data[y_id][x_id].data;
           }
        }
      }
    }
}

And the compiler says that the achieved II is 313 due to the memory dependency between the store operations on line 20. I know these memory accesses will never alias. Is there a way to tell the compiler that these memory accesses will never alias to achieve an II of 1? I tried #pragma ivdep but that does not solve the issue.

Thanks,

Nitish

HRZ · ‎12-19-2018

Can you post a snippet that can be compiled? The current one doesn't compile due to missing definitions. Since the two inner for loops are fully unrolled, I assume the high II is for the while loop, but you have not used #prgam ivdep for that loop which means the compiler will not ignore memory dependencies for that loop.

It is worth mentioning that if you are sure the addresses do not overlap, you should be able to modify the addressing so that indirect addressing can be avoided. Your algorithm is going to perform very poorly since accesses to the C buffer cannot be coalesced due to indirect addressing.

NSriv2 · ‎12-19-2018

Hi,

I also tried #pragma ivdep for while loop which didn't work. However, replacing while loop with a for (count = 0; count < num_elems; ) loop and then applying #pragma ivdep worked.

I am sure that the addresses do not overlap since each address is unique. However, it is not possible to statically determine the addresses since they depend on some task scheduling algorithm. Do you think the performance degradation will be a lot? Is there a way I can improve this situation?

HRZ · ‎12-19-2018

ivdep working for for but not while is unexpected. Either way, indirect addressing is going to result in poor performance on any hardware but things will be worse on FPGAs since there is no cache hierarchy. The amount of performance degradation will depend on ratio of memory accesses to compute in your algorithm. If your algorithm is memory-intensive, then performance is going to suffer greatly from indirect addressing and lack of memory access coalescing.

NSriv2 · ‎12-19-2018

Hi HRZ,

Yes, that makes sense. The current design is just testing version. In the main design I will vectorize the output data to make efficient use of memory bandwidth.

I am also having some issues where this design works in the emulation mode (with strict channel depths) but on the FPGA it is giving wrong results. I am facing similar issues for another benchmark as well. I am using aoc 17.1 on Virtual Lab (VLAB) from Intel. Have you ever faced this issue? Here is the code: https://drive.google.com/file/d/1QcMYwOPU9onk4CiZx6YFD1pEYjXOSsbY/view?usp=sharing Can you see if there is any obvious bug?

Here are the commands that you can run. (You might need to modify the Makefile according to your aoc setup):

% make compile-emulation

% make compile-host

% make run-emulation

This will pass and give correct results.

FPGA flow:

% make compile-device

% make compile-host

% make run

runs it on FPGA but the results are all 0s

Thanks,

Nitish

HRZ · ‎12-20-2018

I will reply to the new problem in your other thread.