SChun19
Beginner

Reducing initiation interval, relaxing loop-carried dependency


Hi, I am trying to implement and optimize a multiply-add accumulate as a single work-item kernel. The access pattern is not sequential for one of the reads, so unrolling creates a separate LSU for each unrolled copy, plus a private cache for each unit. This improves performance but blows up area, which isn't ideal. Below is simplified code that still compiles and shows the issues I am having:

kernel void base(global float* restrict compute, global float* restrict in,
                 global float* restrict w, int N, int M, int O) {
    for (int i = 0; i < N; ++i) {
        for (int yy = 0; yy < M; ++yy) {
            for (int xx = 0; xx < M; ++xx) {
                int yy_curr = yy * M;
                int i_curr = i * O;
                #pragma unroll 32
                for (int rc = 0; rc < O; ++rc) {
                    compute[yy_curr + xx] += in[((rc * M) + yy) * M + xx] * w[i_curr + rc];
                }
            }
        }
    }
}

kernel void single_var(global float* restrict compute, global float* restrict in,
                       global float* restrict w, int N, int M, int O) {
    for (int i = 0; i < N; ++i) {
        for (int yy = 0; yy < M; ++yy) {
            for (int xx = 0; xx < M; ++xx) {
                int yy_curr = yy * M;
                int i_curr = i * O;
                float tmp = 0.0f;
                #pragma unroll 32
                for (int rc = 0; rc < O; ++rc) {
                    tmp += in[((rc * M) + yy) * M + xx] * w[i_curr + rc];
                }
                compute[yy_curr + xx] = tmp;
            }
        }
    }
}

kernel void separate_sums(global float* restrict compute, global float* restrict in,
                          global float* restrict w, int N, int M, int O) {
    for (int i = 0; i < N; ++i) {
        for (int yy = 0; yy < M; ++yy) {
            for (int xx = 0; xx < M; ++xx) {
                int yy_curr = yy * M;
                int i_curr = i * O;

                float p_sums[32];
                #pragma unroll
                for (int jj = 0; jj < 32; ++jj) {
                    p_sums[jj] = 0.0f;
                }

                #pragma unroll 32
                for (int rc = 0; rc < O; ++rc) {
                    p_sums[rc % 32] += in[((rc * M) + yy) * M + xx] * w[i_curr + rc];
                }

                float tmp = 0.0f;
                #pragma unroll
                for (int jj = 0; jj < 32; ++jj) {
                    tmp += p_sums[jj];
                }

                compute[yy_curr + xx] = tmp;
            }
        }
    }
}

#define II_CYCLES 5
#define UF 32

kernel void shift_reg(global float* restrict compute, global float* restrict in,
                      global float* restrict w, int N, int M, int O) {
    for (int i = 0; i < N; ++i) {
        for (int yy = 0; yy < M; ++yy) {
            for (int xx = 0; xx < M; ++xx) {
                int yy_curr = yy * M;
                int i_curr = i * O;
                float shift_reg[II_CYCLES + 1];
                #pragma unroll
                for (unsigned j = 0; j < II_CYCLES + 1; j++)
                    shift_reg[j] = 0.0f;

                for (int rc = 0; rc < O; rc += UF) {
                    float acc_i = 0.0f;
                    #pragma unroll
                    for (int rcrc = rc; rcrc < rc + UF; rcrc++) {
                        acc_i += in[((rcrc * M) + yy) * M + xx] * w[i_curr + rcrc];
                    }
                    shift_reg[II_CYCLES] = shift_reg[0] + acc_i;
                    #pragma unroll
                    for (unsigned j = 0; j < II_CYCLES; ++j)
                        shift_reg[j] = shift_reg[j + 1];
                }

                float sum = 0.0f;
                #pragma unroll
                for (int jj = 0; jj < II_CYCLES; ++jj) {
                    sum += shift_reg[jj];
                }

                compute[yy_curr + xx] = sum;
            }
        }
    }
}

kernel void shift_reg_no_unroll(global float* restrict compute, global float* restrict in,
                                global float* restrict w, int N, int M, int O) {
    for (int i = 0; i < N; ++i) {
        for (int yy = 0; yy < M; ++yy) {
            for (int xx = 0; xx < M; ++xx) {
                int yy_curr = yy * M;
                int i_curr = i * O;
                float shift_reg[II_CYCLES + 1];
                #pragma unroll
                for (unsigned j = 0; j < II_CYCLES + 1; j++)
                    shift_reg[j] = 0.0f;

                for (int rc = 0; rc < O; ++rc) {
                    shift_reg[II_CYCLES] = shift_reg[0] + in[((rc * M) + yy) * M + xx] * w[i_curr + rc];
                    #pragma unroll
                    for (unsigned j = 0; j < II_CYCLES; ++j)
                        shift_reg[j] = shift_reg[j + 1];
                }

                float sum = 0.0f;
                #pragma unroll
                for (int jj = 0; jj < II_CYCLES; ++jj) {
                    sum += shift_reg[jj];
                }

                compute[yy_curr + xx] = sum;
            }
        }
    }
}

base and single_var compile with an II of 161 on the unrolled loops (report lines 10 and 27): 5 cycles (the latency of a hardened floating-point add) * 32 (the unroll factor) + 1, all due to unrolling. In hopes of bringing it down, separate_sums declares a separate register for each unrolled copy, which has a significant impact on performance and brings the II down to 5.
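As a sanity check that the separate partial sums compute the same result as a single accumulator, here is a plain C model of the trick (a host-side sketch of my own, not the generated hardware):

```c
#include <assert.h>

#define BANKS 32

/* Host-side model of the separate_sums reduction: accumulate into
 * BANKS independent partial sums, then combine them at the end.
 * This splits the single loop-carried dependency into BANKS much
 * shorter dependency chains. */
static float banked_sum(const float *x, int n) {
    float p[BANKS] = {0.0f};
    for (int i = 0; i < n; ++i)
        p[i % BANKS] += x[i];   /* each bank updated every BANKS iterations */
    float s = 0.0f;
    for (int j = 0; j < BANKS; ++j)
        s += p[j];              /* final tree-style combine */
    return s;
}
```

With small integer-valued floats, the banked result matches the sequential sum exactly; for general floats the two differ only by rounding, since float addition is not associative.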

 

I want to bring this down further, and the manual suggests inferring shift registers to bring the II to 1. shift_reg is my attempt at preserving the unrolling and adding the shift register at the same time by creating another nested loop. Unfortunately, the loop is still scheduled with an II of 5, with an "Undetermined reason" shown in the Loop Analysis section of the reports. If I get rid of the loop unrolling but keep the loop structure as is, adding the shift register brings the II down to 1 in shift_reg_no_unroll; however, this ends up performing worse than the unrolled versions.
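For reference, the shift-register accumulation pattern the manual describes can be modeled on the host like this (a plain C sketch of my own, not the kernel itself):

```c
#include <assert.h>

#define II_CYCLES 5

/* Host-side model of the inferred shift register: each new value is
 * added to the entry leaving the register, so consecutive loop
 * iterations write to different registers and the accumulation
 * dependency is II_CYCLES iterations apart instead of 1. */
static float shift_reg_sum(const float *x, int n) {
    float sr[II_CYCLES + 1] = {0.0f};
    for (int i = 0; i < n; ++i) {
        sr[II_CYCLES] = sr[0] + x[i];   /* add to the oldest partial sum */
        for (int j = 0; j < II_CYCLES; ++j)
            sr[j] = sr[j + 1];          /* shift everything down by one */
    }
    float s = 0.0f;
    for (int j = 0; j < II_CYCLES; ++j)
        s += sr[j];                     /* combine the partial sums */
    return s;
}
```

After each iteration the live partial sums sr[0..II_CYCLES-1] add up to the running total, which is why the final combine recovers the full sum.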

 

Any suggestions to improve the performance here? My guess is that the compiler does not like the non-unit loop increment, and the kernel also suffers from the non-sequential reads. Or perhaps this is the wrong approach altogether?


15 Replies
HRZ
Valued Contributor II

In general, you should avoid partially unrolling a loop unless the loop has a compile-time-known trip count that is divisible by the unroll factor. In the particular case of reduction loops, partial unrolling also breaks the shift register optimization: partially unrolling the loop effectively increases the latency of each loop iteration, which means you need a larger shift register to resolve the dependency on the reduction variable, and the register has to keep growing as you increase the unroll factor. This is certainly not a scalable method. To solve this issue, check Section 3.2.2.1 in the following document:

 

https://arxiv.org/ftp/arxiv/papers/1810/1810.09773.pdf

 

Specifically, the code segment in Figure 3-5 shows how to implement partial unrolling manually by splitting the reduction loop in two: the innermost loop has a trip count equal to the unroll factor and is fully unrolled (and no longer requires a shift register), while the outer loop's trip count is adjusted accordingly and the shift register optimization is applied only to that non-unrolled outer loop, with a fixed size independent of the unroll factor.

 

One more thing to consider is re-ordering the content of your "in" buffer on the host side so that accesses to that buffer in the kernel become consecutive; otherwise your external memory performance will suffer greatly.
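A host-side sketch of that re-ordering (plain C; the transposed layout `in2[(yy*M + xx)*O + rc]` is one possible choice, using the index expressions from your kernel):

```c
#include <assert.h>
#include <stdlib.h>

/* The kernel indexes in[(rc*M + yy)*M + xx] with rc varying fastest
 * under unrolling, i.e. a stride of M*M floats between consecutive
 * accesses.  Transposing to in2[(yy*M + xx)*O + rc] makes the
 * unrolled rc accesses consecutive in memory. */
static float *reorder_in(const float *in, int M, int O) {
    float *in2 = malloc((size_t)M * M * O * sizeof *in2);
    for (int rc = 0; rc < O; ++rc)
        for (int yy = 0; yy < M; ++yy)
            for (int xx = 0; xx < M; ++xx)
                in2[(yy * M + xx) * O + rc] = in[((rc * M) + yy) * M + xx];
    return in2;
}
```

The kernel would then read `in2[(yy_curr + xx) * O + rc]`, which the compiler can coalesce into one wide burst per unrolled group.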

SChun19
Beginner

Hi HRZ, thanks for the response. I've been able to get the II down to 1 with the optimization in Figure 3-5 by setting the shift register size to 32. The first problem I was having was that setting the shift register size to the actual MAC latency still gave an II of ~6 for an "Undetermined reason". Despite this, the shift register optimization leads to worse results in practice: about 2 to 4x worse runtime on my board. Is this something that you encountered?

 

I will consider reordering the buffer for consecutive access.

HRZ
Valued Contributor II

Can you post your new kernel? You should not need to set the shift register size to a value higher than the MAC latency using the method I mentioned.

 

Regarding run time, it is worse than which case? Due to the non-coalesceable accesses to the "in" buffer in your code, you are going to see a huge amount of contention caused by all those memory ports competing for the memory bus, and your unroll factor is also quite large, which makes things even worse. It is not very surprising to get worse performance in such cases compared to the non-unrolled case or smaller unroll factors: the kernel would be memory-bound either way, while a lower unroll factor results in less memory contention and better performance. Moreover, with the method I mentioned, you will be doing some extra unused computation in the last iteration; hence, if your iteration count is not divisible by the unroll factor and, at the same time, is not very large compared to the unroll factor, the unused computation in the last iteration can have a substantial effect on run time.
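To quantify that unused computation (simple arithmetic of my own, assuming the trip count is rounded up to a multiple of the unroll factor):

```c
#include <assert.h>

/* With reduction trip count O and unroll factor UF, the manually
 * unrolled outer loop runs ceil(O/UF) times, so when UF does not
 * divide O, the last iteration performs (ceil(O/UF)*UF - O)
 * multiply-adds whose results are discarded. */
static int wasted_macs(int O, int UF) {
    int trips = (O + UF - 1) / UF; /* ceiling division */
    return trips * UF - O;
}
```

For example, O = 100 with UF = 16 runs 7 outer iterations and wastes 12 multiply-adds, which only matters when O is small relative to UF.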

SChun19
Beginner
#define II_CYCLES 24
#define UF 32

kernel void shift_reg(global float* restrict compute, global float* restrict in,
                      global float* restrict w, int N, int M, int O) {
    for (int i = 0; i < N; ++i) {
        for (int yy = 0; yy < M; ++yy) {
            for (int xx = 0; xx < M; ++xx) {
                int yy_curr = yy * M;
                int i_curr = i * O;
                float shift_reg[II_CYCLES] = {0.0f}, final_sum = 0.0f;
                int exit = O / UF;
                for (int j = 0; j < exit; j++) {
                    float acc_i = 0.0f;
                    #pragma unroll
                    for (int k = 0; k < UF; k++) {
                        int rc = j * UF + k;
                        acc_i += in[((rc * M) + yy) * M + xx] * w[i_curr + rc];
                    }
                    shift_reg[II_CYCLES - 1] = shift_reg[0] + acc_i;
                    #pragma unroll
                    for (unsigned k = 0; k < II_CYCLES - 1; ++k)
                        shift_reg[k] = shift_reg[k + 1];
                }
                #pragma unroll
                for (int jj = 0; jj < II_CYCLES; ++jj) {
                    final_sum += shift_reg[jj];
                }
                compute[yy_curr + xx] = final_sum;
            }
        }
    }
}

This is the new kernel. It is worse than all of the above, though it still gets much better performance than the non-unrolled case, as the 32 separate cache lines generated greatly reduce memory contention through a much higher hit rate than a single LSU achieves. In my case, the iteration count is always divisible by the unroll factor, as both are known prior to runtime.

 

The MAC latency is 4 cycles. If I set II_CYCLES to 4, Loop Analysis still fails to schedule the loop at an II of 1. So I tried setting it to 7+4=11; it is still scheduled with II~2. Only with larger shift registers does the compiler successfully schedule it at II=1.

HRZ
Valued Contributor II

I did some testing on your kernel; the problem comes from your overly large unroll factor and all the memory ports it creates. If you set II_CYCLES to 5 (but not below that) and reduce UF to 16, you will get an II of 1. Moreover, at least with v19.4 (probably 19.0+, but I didn't test), you will get an II of 1 even with II_CYCLES set to 1 on Arria 10, because the compiler optimizes out the shift register and implements single-cycle accumulation instead. This brings me to a better solution to your problem if you are targeting Arria 10:

#define UF 16

kernel void shift_reg(global float* restrict compute, global float* restrict in,
                      global float* restrict w, int N, int M, int O) {
    for (int i = 0; i < N; ++i) {
        for (int yy = 0; yy < M; ++yy) {
            for (int xx = 0; xx < M; ++xx) {
                int yy_curr = yy * M;
                int i_curr = i * O;
                float final_sum = 0.0f;
                int exit = O / UF;
                for (int j = 0; j < exit; j++) {
                    float acc_i = 0.0f;
                    #pragma unroll
                    for (int k = 0; k < UF; k++) {
                        int rc = j * UF + k;
                        acc_i += in[((rc * M) + yy) * M + xx] * w[i_curr + rc];
                    }
                    final_sum += acc_i;
                }
                compute[yy_curr + xx] = final_sum;
            }
        }
    }
}

This will give an II of 1 regardless of UF. However, I think single-cycle accumulation is not available on Stratix 10, since the default target frequency there is 480 MHz, which is too high for single-cycle accumulation. Either way, you can still use the shift register method on Stratix 10; just don't use an unroll factor larger than 16. My guess is that if you compile your kernel with values of UF varying from 1 to 32, performance will go up until 4 or 8, but start going down after that.

 

P.S. You should also take the operating frequency into account when comparing the performance of different kernels.
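For example, scaling run time by fmax gives a clock-independent cycle count (a trivial helper of my own, not tied to any tool):

```c
#include <assert.h>

/* Run time multiplied by clock frequency gives the number of clock
 * cycles spent.  Comparing cycle counts factors out the different
 * operating frequencies two kernel variants may achieve. */
static double cycles_spent(double runtime_ms, double fmax_mhz) {
    return runtime_ms * 1e-3 * fmax_mhz * 1e6;
}
```

A kernel that runs in 10 ms at 300 MHz and one that runs in 10 ms at 220 MHz spend very different cycle counts, so the slower-clocked design is actually doing less work per cycle headroom.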


SChun19
Beginner

I see. From what I read I expected that as well, but in practice I was seeing a linear increase in performance from an unroll factor of 4 up to 64, which is why I went with this. One thing I should mention is that I am using an experimental Stratix 10 MX board with HBM2 on Quartus 19.1, though I have seen the same behavior with a Stratix 10 SX (PAC with DDR4) on vLab. I will give that a try.

 

Aside from this (and making the in reads coalesceable), what are the more common methods to increase parallelism? I can think of vectorization or invoking more compute units, but I find it hard to imagine maximizing parallel use of the DSPs before hitting a BRAM/logic wall due to LSU bloat.

 

I will also keep the fmax in mind for future comparisons. Thank you so much for your help so far.

 

HRZ
Valued Contributor II

Since your kernel only has three global buffers, you can saturate the bandwidth of at most 3 memory banks (reference: https://arxiv.org/abs/1910.06726), even with interleaving enabled. If you disable interleaving and do manual [DDR] memory banking (S10 MX with HBM doesn't even support interleaving as far as I know), you should hit the "memory wall" when each of those 3 buffers is read/written with 16 floats per loop iteration at II=1 and an operating frequency of 300 MHz on S10 GX/SX (I think you need 400 MHz, or a larger vector size, for MX with HBM memory). Of course, all this assumes your memory ports are coalesced; with non-coalesced memory accesses, performance should stop scaling at smaller vector sizes, since such accesses are less efficient.
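As a back-of-the-envelope check of those numbers (my own arithmetic, not from any tool): 3 buffers × 16 floats × 4 bytes per cycle at 300 MHz is about 57.6 GB/s of demanded bandwidth.

```c
#include <assert.h>

/* Demanded external memory bandwidth at II=1: each clock cycle moves
 * (buffers * floats_each * 4) bytes, fmax cycles per second. */
static double gb_per_s(int buffers, int floats_each, double fmax_mhz) {
    return buffers * floats_each * 4.0 * fmax_mhz * 1e6 / 1e9;
}
```

Comparing this figure against the aggregate bandwidth of the banks the buffers can reach tells you whether the kernel is memory-bound before you even synthesize.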

 

If you fix the memory coalescing problem and reduce the number of memory ports to three, the LSU bloat will largely disappear and you can continue increasing the vector size (unroll factor). A big portion of the LSU bloat is also typically due to the private cache the compiler might create for each port, whose area overhead grows proportionally to the number of ports. You should consider tiling your loops and manually creating a scratchpad memory to absorb the data reuse in your kernel, and completely disabling the private cache by [falsely] marking your global buffers as volatile to avoid its area overhead. On Stratix 10 GX/SX, you will very likely hit the memory wall far faster than you can utilize the FPGA resources. For S10 MX with HBM, however, you might be able to scale performance until you properly saturate the DSPs, but you will have to split your global buffers into 32 chunks on the host yourself (since the HBM memory has 32 banks), do manual banking, and use large switch/cases in the kernel code to read from and write to all 32 chunks of each buffer simultaneously. You are going to have a lot of LSUs in this case anyway, which will have a large area overhead even if you disable the private cache; I am hoping Intel has already thought of something in the compiler to address this problem...
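A host-side sketch of the manual banking I mean (plain C; the cyclic chunk layout here is my own choice for illustration, and the kernel side would select among the chunk pointers with a switch/case):

```c
#include <assert.h>
#include <stddef.h>

#define NUM_BANKS 32

/* Split one logical buffer into NUM_BANKS chunks, one per HBM bank,
 * distributed cyclically so that 32 consecutive elements land in 32
 * different banks and can be fetched in parallel.  chunk[b] must
 * each hold at least ceil(n / NUM_BANKS) floats. */
static void split_banks(const float *buf, size_t n, float *chunk[NUM_BANKS]) {
    for (size_t i = 0; i < n; ++i)
        chunk[i % NUM_BANKS][i / NUM_BANKS] = buf[i];
}
```

On the device, each chunk would be a separate global argument bound to its own bank, and element i of the logical buffer is found at chunk[i % 32][i / 32].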

 

P.S. You can further reduce area overhead of burst coalesced LSUs by inferring aligned LSUs instead of non-aligned. This can reliably be achieved if you use OpenCL vectors or wide structs for reading/writing from/to global buffers. But of course this will only work if all your memory accesses are aligned by the vector size.

SChun19
Beginner

I changed the access pattern so that in is now accessed sequentially. With a UF of 64 (fmax = 220 MHz), this becomes coalesced into a 2048-bit read. This results in a 5x increase in runtime compared to the non-sequential access pattern above. Minding your comment about memory saturation, I decreased the UF to 16 (fmax = 309 MHz) and resynthesized. Area is drastically reduced in both cases, but performance does not improve much. I would have imagined that sequential reads would greatly improve performance here, but they do not. I've included some profiled test cases below (from OpenCL events).

 

N, M, O : runtime

UF 64, II=1:

  64, 112,   32 : 14.711209 ms
 128,  56,   64 : 14.717709 ms
 128,  56,  128 : 29.326209 ms
 256,  28,  128 : 10.700833 ms
 256,  28,  256 : 29.360500 ms
 512,  14,  256 :  2.439333 ms
 512,  14,  512 : 21.220958 ms
1024,   7,  512 :  2.243875 ms
1024,   7, 1024 :  3.990750 ms

UF 16, II=1:

  64, 112,   32 : 10.554042 ms
 128,  56,   64 : 10.457500 ms
 128,  56,  128 : 20.892041 ms
 256,  28,  128 : 10.520417 ms
 256,  28,  256 : 20.920375 ms
 512,  14,  256 : 10.513583 ms
 512,  14,  512 : 21.001666 ms
1024,   7,  512 :  8.579375 ms
1024,   7, 1024 : 21.323458 ms

 

In comparison, the non-sequential read (as per the kernel code in a previous post) with a UF of 64 gets this (these rows list four values: N, M, M, O):

 

  64, 112, 112,   32 : 7.376835 ms
 128,  56,  56,   64 : 3.769259 ms
 128,  56,  56,  128 : 5.600221 ms
 256,  28,  28,  128 : 2.882847 ms
 256,  28,  28,  256 : 4.712566 ms
 512,  14,  14,  256 : 2.420952 ms
 512,  14,  14,  512 : 4.206385 ms
1024,   7,   7,  512 : 2.233914 ms
1024,   7,   7, 1024 : 4.076918 ms

 

Regarding HBM, I don't think much has changed; only one HBM channel can be assigned to a global memory argument. Creating 32 different kernels, as in the bandwidth test example, seems to be another way of doing it... but I cannot think of other approaches, or tell whether it would be worthwhile.

 

HRZ
Valued Contributor II

That doesn't make much sense to me; sequential access should improve performance compared to non-sequential. The only other thing I can think of is that the compiler creates the private cache for the non-sequential case but not for the sequential one, and your kernel has a lot of data reuse that gets absorbed by the cache, improving performance substantially in the non-sequential case.

SChun19
Beginner

The compiler is creating a private cache for both cases, but I have just noticed that in the sequential case it builds a non-aligned burst-coalesced cached LSU (whereas in the non-sequential case, the LSU is aligned). Perhaps this accounts for the performance drop.

HRZ
Valued Contributor II

That is expected. In the non-coalesced case, all accesses are 32-bit and 32-bit-aligned, so you always get aligned LSUs without coalescing. With coalescing, however, unless the compiler can guarantee that all accesses are aligned by the size of the coalesced port (i.e. the vector size), you will get non-aligned LSUs, and this is what happens in 99% of cases. If all your accesses are in fact aligned by the vector size, then whether the LSU is aligned or not will not affect performance; the non-aligned LSU just has a slightly higher area overhead. As I mentioned earlier, you can infer aligned coalesced LSUs by using OpenCL vector types or a wide struct that has one multi-index array as its only member, but you can only describe your computation like this if all the accesses are already aligned by the vector size, in which case you should not expect a performance improvement over what you already have. If, however, you have some or many non-aligned accesses, then you are losing a lot of memory performance (again, refer to https://arxiv.org/abs/1910.06726 for numbers) and should try to fix the alignment problems, which can result in a substantial improvement in memory performance.
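The alignment condition itself is simple to state on the host (a helper of my own; offsets counted in floats):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* An access at element offset `off_floats` into a float buffer is
 * aligned for a coalesced port of `vec` floats iff the byte offset
 * is a multiple of the port width (vec * 4 bytes). */
static bool aligned_for_vec(size_t off_floats, size_t vec) {
    return (off_floats * sizeof(float)) % (vec * sizeof(float)) == 0;
}
```

So with a 16-float port, only starting offsets that are multiples of 16 elements (64 bytes) are aligned; any other starting offset forces the non-aligned path.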

SChun19
Beginner

I see, that makes sense. In this case, all the accesses are always aligned by the vector size, so this shouldn't be an issue. I did run a few experiments in the past with non-aligned accesses and saw the same kind of results: almost a 5-10x reduction in performance.

 

At this point I guess I can try maintaining my own private caches manually. Perhaps I can revisit loop reordering & tiling optimizations again...

Anil

Hi

Sorry for the wrong links provided in an earlier answer.

I am deleting them to avoid confusion. Thanks SChun19 for pointing it out.

There are also some good references to concepts like loop speculation for OpenCL here:

https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/opencl-sdk/aocl-best-pra...

 

Thanks and Regards

Anil

SChun19
Beginner
211 Views

In the case where II=1 thanks to the single-cycle accumulator, would loop speculation still benefit performance? Especially if the exit condition is already simple enough.

Anil

Hi ,

In your latest optimized design, enabling loop speculation will not have an impact. But knowing about it would help in situations like the one you encountered initially.

Thanks and Regards

Anil
