Honored Contributor I

Minimum II of 2 but HTML report has no further information

I have a single work-item kernel with local memory used as a ping-pong buffer, in a form similar to the following:

 

local float __attribute__((bankwidth(4),
                           numreadports(2),
                           numwriteports(2),
                           doublepump,
                           bank_bits(2,1,0))) mem[1024][4][2];

for (uint outer_outer = 0; outer_outer < 8; ++outer_outer)
{
    // some integer adds, subs, and shifts used to help compute x_idx, y_idx later
    float x_pipe[4];
    float y_pipe[4];
    uint x_idx_pipe[4];
    uint y_idx_pipe[4];

    for (uint outer = 0; outer < 8; ++outer)
    {
        uint x_idx, y_idx;

        // compute x_idx and y_idx using integer adds, subs, and shifts

        #pragma unroll
        for (uint inner = 0; inner < 4; ++inner)
        {
            float x_fetched = mem[x_idx][inner][0];
            float y_fetched = mem[y_idx][inner][0];

            mem[x_idx_pipe[0]][inner][1] = x_pipe[0];
            mem[y_idx_pipe[0]][inner][1] = y_pipe[0];

            // shift register statements + computations on x and y

            x_pipe[3] = x_fetched;
            y_pipe[3] = y_fetched;

            x_idx_pipe[3] = x_idx;
            y_idx_pipe[3] = y_idx;
        }
    }
}

The compiler seems to detect the parallelization of the inner loop correctly, but my II on the 'outer' loop is 2. Unfortunately, the Loop Analysis section of the HTML report gives no additional information about the limiting factor. Does anyone have insight into what it means when the HTML report doesn't say what is limiting the II? Does that mean the control logic is causing it, and hence there's nothing I can do?

 

I've tried forcing it with #pragma ii 1, but the compiler fails. Looking at the system view, I notice the two store ops are sequential (the second dependent on the first), but I'm unsure whether this is just a graphical artifact (i.e., the system view doesn't display double pumping, which would allow parallel stores).
10 Replies

Honored Contributor I

In nested loops, the II of the outer loop will be 2, because achieving II = 1 on the outer loop would require evaluating the exit conditions of both the inner and outer loops in a single cycle; that would create a very long critical path and significantly reduce operating frequency. This issue does not necessarily reduce performance; however, you can manually merge your loops into one to achieve an II of 1. There should be a note about this at the bottom of the report if you click on the line with the II info, but I don't remember exactly.
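The manual loop merge can be sketched in plain C (the 8x8 bounds match the kernel above; the function and index names are illustrative, not from the original code):

```c
#include <assert.h>

/* Hypothetical sketch of merging two nested 8 x 8 loops into a single
 * 64-iteration loop. With one loop there is only one exit condition to
 * evaluate per cycle, which is what makes II = 1 reachable. The
 * original indices are recovered with cheap shift/mask operations. */
int sum_merged(int a[8][8]) {
    int sum = 0;
    for (unsigned i = 0; i < 8 * 8; ++i) {
        unsigned outer = i >> 3; /* i / 8 */
        unsigned inner = i & 7;  /* i % 8 */
        sum += a[outer][inner];
    }
    return sum;
}
```

The same transformation applies directly to the 'outer'/'inner' pair in the kernel, at the cost of slightly more index arithmetic per iteration.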

Honored Contributor I

HRZ, 

 

Thanks for the insight. I had not considered nested loops, and experimenting with that produced something interesting. I removed the inner loop and just did the calculations on one bank to test things out. I also unrolled a for loop I had used to implement the shift registers (idx_pipes, etc.) so that there were no nested loops inside the 'outer' loop. The tool still shows an II of 2 on the 'outer' loop, except now it provides more info: it says there is a store dependency on those two lines. I wouldn't expect that behavior, because it's a double-pumped memory (2 write ports, 2 read ports).
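The shift-register pattern in question, sketched in plain C (depth 4 matches the *_pipe arrays; the helper name is illustrative):

```c
#include <assert.h>

#define DEPTH 4

/* Illustrative shift-register update, depth 4 like the *_pipe arrays.
 * When the shift loop is fully unrolled (#pragma unroll in the kernel),
 * the FPGA compiler can map the array to registers instead of RAM. */
void shift_in(float pipe[DEPTH], float new_val) {
    for (int j = 0; j < DEPTH - 1; ++j) /* unrolled in the kernel */
        pipe[j] = pipe[j + 1];
    pipe[DEPTH - 1] = new_val;
}
```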

 

Do you have any advice on that? 

 

EDIT: This is with version 17.0.
Honored Contributor I

Please post your new code. You seem to be using indirect addressing on the local buffer; this is very likely not a good idea. Double-pumping memory should not affect load/store dependencies.

Honored Contributor I

I could see double pumping not affecting load/store dependencies, but from an II perspective I think it should matter. Here's my thinking; please let me know if you disagree: if I do two writes per loop iteration at clock rate 'clk', and my memory is double-pumped so that it operates at 'clk2x', then the first write is performed on the first cycle of clk2x and the second write on the second cycle. The writes are performed in order, within one cycle of 'clk'.

 

Also, do you have any insight into why indirect addressing is bad in OpenCL? Is it just Altera preventing anyone from accidentally causing write collisions?

 

 

local float __attribute__((bankwidth(4),
                           numreadports(2),
                           numwriteports(2),
                           doublepump,
                           bank_bits(2,1,0))) mem[1024][4][2];

for (uint outer_outer = 0; outer_outer < 8; ++outer_outer)
{
    // some integer adds, subs, and shifts used to help compute x_idx, y_idx later
    float x_pipe[4];
    float y_pipe[4];
    uint x_idx_pipe[4];
    uint y_idx_pipe[4];

    for (uint outer = 0; outer < 8; ++outer)
    {
        uint x_idx, y_idx;

        // compute x_idx and y_idx using integer adds, subs, and shifts

        float x_fetched = mem[x_idx][0][0];
        float y_fetched = mem[y_idx][0][0];

        mem[x_idx_pipe[0]][0][1] = x_pipe[0];
        mem[y_idx_pipe[0]][0][1] = y_pipe[0];

        // manually coded shift register statements (i.e. no for loop) + computations on x and y

        x_pipe[3] = x_fetched;
        y_pipe[3] = y_fetched;

        x_idx_pipe[3] = x_idx;
        y_idx_pipe[3] = y_idx;
    }
}
Honored Contributor I

HRZ, 

 

You are correct that indirect addressing is causing it. If I index with constants, it reduces to II = 1. I'm not sure I understand why indirect addressing is such a problem, though...
Honored Contributor I

By using indirect addressing, you are preventing the compiler from properly determining whether there is any load/store dependency in accessing the mem buffer in the loop; hence, the compiler assumes the worst case and enforces the highest II to ensure correct functionality.
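The problem can be illustrated in plain C (names here are illustrative, not from the kernel): with runtime indices, the compiler cannot prove that a store never aliases a later load, so it must conservatively assume a loop-carried dependency:

```c
#include <assert.h>

/* With runtime indices, the compiler cannot prove that the store to
 * mem[w_idx[i]] never aliases a later load from mem[r_idx[i]], so a
 * hardware compiler must assume a loop-carried dependency and raise
 * the II. With constant indices, the (non-)aliasing is provable. */
float indirect(float *mem, const unsigned *r_idx, const unsigned *w_idx, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) {
        float x = mem[r_idx[i]];  /* load: address unknown at compile time */
        mem[w_idx[i]] = x * 2.0f; /* store: may alias a later load */
        acc += x;
    }
    return acc;
}
```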

Honored Contributor I

I see what you're saying, but I'm reading from one bank and writing to another. The tool correctly infers that the loads can be done in parallel, even with indirect addressing. I don't see how a store dependency would affect anything since the memory is double-pumped: just schedule the second store on the second edge of clk2x. Perhaps Altera needs to provide more control over the M20K configuration, such as the ability to specify what happens during a collision, because this unexpected behavior halves the throughput of the kernel.

Honored Contributor I

Double pumping just reduces Block RAM usage; it will not help eliminate load/store dependencies. You have two loops here, both of which are pipelined. Note that your pipeline might be long enough that not only all iterations of the inner loop but also some iterations of the outer loop are in flight at the same time. This can potentially lead to iteration "i" of the outer loop (at some iteration "x" of the inner loop) writing to the same location in the mem buffer that iteration "i+1" of the outer loop (at iteration "y" of the inner loop) is trying to read from; unless the compiler can ensure this will not happen, it assumes it does and adjusts the II accordingly.

Honored Contributor I

That's a good point I had not considered, although in my case the "outer_outer" loop is executed serially (as correctly reported by the tool) due to the structure of my code. Therefore an II of 1 would be achievable on the "outer" loop IF the tool utilized both write ports of the BRAM simultaneously, which is a common design pattern in HDL, but I guess it is not supported at this time by the OpenCL compiler.

 

As an aside, from an RTL perspective, double pumping really does more than just reduce BRAM usage: it increases the memory bandwidth, or throughput, of each BRAM. With two single-pumped BRAMs I should be able to do 1 write and 2 reads per kernel clock; with one double-pumped BRAM I should be able to do 2 writes and 2 reads per kernel clock.

 

EDIT: I think I may have been wrong. According to the Best Practices guide:  

 

--- Quote Start ---
By default, each local memory bank has one read port and one write port. The double pumping feature allows each local memory bank to support up to three read ports.
--- Quote End ---

 

I wonder if the M20K read latency is half the write latency, or something like that, which results in this...
Honored Contributor I

Actually, you are correct; I had forgotten that you cannot have more than one write per buffer with single-pumped Block RAMs.
