Re: #pragma ivdep not allowing for parallel stores to local memory

Altera_Forum · 10-20-2017

Hello,

I am having difficulty parallelizing local memory store operations and would appreciate help. The loads from local memory appear to be parallelized; however, the System Viewer tab of report.html shows a chain of dependencies between the stores: store(lmem[idx[0]] = result[0]) -> store(lmem[idx[1]] = result[1]) -> etc. I've experimented with the numbanks attribute, and the tool banks the memory correctly, but this seems to have no effect on the store dependencies.

This is a single work-item kernel, and this code snippet is part of a larger loop. Within a given loop iteration the code never stores to the same address twice (no address collisions), but depending on the iteration number it may or may not have bank conflicts. The only way to have zero bank conflicts in every iteration is to use registers, but lmem is too large to implement in registers.

Given the above, I understand that in some loop iterations the stores will end up being sequential (when they all target the same bank); however, I want to take advantage of parallel stores in the iterations that have no bank conflicts.
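To make the bank-conflict behaviour described above concrete, here is a minimal host-side C sketch (a model, not kernel code). It assumes each bank accepts one store per cycle and that the low bits of the word index select the bank; `bank_of` and `store_cycles` are hypothetical helper names, and the real bank mapping is chosen by the compiler and may differ.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 8   /* stores issued per loop iteration, as in the snippet */

/* Hypothetical bank mapping: low bits of the word index select the bank.
   The mapping actually chosen by the compiler may differ. */
static unsigned bank_of(unsigned idx, unsigned num_banks)
{
    return idx % num_banks;
}

/* Cycles needed to retire LANES stores when every bank accepts one store
   per cycle: the largest number of stores aimed at any single bank. */
static unsigned store_cycles(const unsigned idx[LANES], unsigned num_banks)
{
    unsigned hits[LANES] = {0};
    unsigned worst = 0;

    assert(num_banks >= 1 && num_banks <= LANES);
    for (size_t i = 0; i < LANES; ++i) {
        unsigned b = bank_of(idx[i], num_banks);
        if (++hits[b] > worst)
            worst = hits[b];
    }
    return worst;
}
```

Under these assumptions, indices 0..7 spread over 8 banks retire in one cycle, while indices 0, 8, 16, ... all fall into bank 0 and need eight cycles, which is the spread between best and worst case that the post is asking the compiler to exploit.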

Things I've tried on the store section:

- Removing the #pragma unroll. This resulted in the compiler automatically unrolling the stores.

- Using #pragma unroll 1. This bottlenecks my algorithm to the point where I won't see any benefit from vectorization.

- Swapping the order of #pragma ivdep and #pragma unroll. No effect.

I would have expected #pragma ivdep to resolve this. Can someone please help?

Thank you.

See the code snippet below:

__private float2 operand[8];
__private float2 result[8];
__private uint idx[8];

__local float2 __attribute__((bankwidth(8))) lmem1[8192];
__local float2 __attribute__((bankwidth(8))) lmem2[8192];

for (uint aa = 0; aa < 4; ++aa)
{
    #pragma ivdep
    for (uint bb = 0; bb < 8192; bb += 8)
    {
        ...
        ... Code that computes the idx array
        ...

        if ((aa & 0x1) == 0) // ping pong buffer
        {
            #pragma unroll
            #pragma ivdep
            for (uint ii = 0; ii < 8; ++ii)
            {
                operand[ii] = lmem1[idx[ii]];
            } // ii
        } else
        {
            #pragma unroll
            #pragma ivdep
            for (uint ii = 0; ii < 8; ++ii)
            {
                operand[ii] = lmem2[idx[ii]];
            } // ii
        }

        ...
        ... Code that computes the result array
        ...

        if ((aa & 0x1) == 0) // ping pong buffer
        {
            #pragma unroll
            #pragma ivdep
            for (uint ii = 0; ii < 8; ++ii)
            {
                lmem2[idx[ii]] = result[ii];
            } // ii
        } else
        {
            #pragma unroll
            #pragma ivdep
            for (uint ii = 0; ii < 8; ++ii)
            {
                lmem1[idx[ii]] = result[ii];
            } // ii
        }
    } // bb
} // aa

UPDATE: Perhaps two things are occurring. First, the BRAMs inferred for local memory have a fixed number of write ports, so in theory, if all the entries of the idx array write to the same BRAM, 8 cycles are needed. Second, the compiler is trying to preserve store order in case any entries of idx are equal (in which case the earlier data gets overwritten). I guess the way to solve my problem would be to somehow create a FIFO per BRAM that can assert back pressure to the kernel? Surely I can't be the first person to encounter this...
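The ordering concern in the update can be illustrated with a small host-side C sketch (again a model, not kernel code). `all_distinct` and `store_all` are hypothetical helpers: the first checks the condition under which the stores could be issued in any order, and the second applies the eight stores either in program order or reversed so the two outcomes can be compared.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define LANES 8

/* True when no two lanes target the same address, i.e. the stores may be
   issued in any order without changing the final memory contents. */
static bool all_distinct(const unsigned idx[LANES])
{
    for (size_t i = 0; i < LANES; ++i)
        for (size_t j = i + 1; j < LANES; ++j)
            if (idx[i] == idx[j])
                return false;
    return true;
}

/* Apply the stores in program order (reverse == false) or reversed. */
static void store_all(float *mem, const unsigned idx[LANES],
                      const float val[LANES], bool reverse)
{
    assert(mem != NULL);
    for (size_t k = 0; k < LANES; ++k) {
        size_t i = reverse ? LANES - 1 - k : k;
        mem[idx[i]] = val[i];
    }
}
```

When the indices are distinct, both orders leave memory in the same state; as soon as two lanes share an address, the last store wins and the order becomes visible. A compiler that cannot prove distinctness therefore has to keep program order.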

Altera_Forum · 10-21-2017

#pragma ivdep is designed to avoid false load/store dependencies on accesses to global memory, which arise because certain information is available only in the host code and not to the kernel compiler. I have never encountered a case in which this pragma had any effect on load/store dependencies on local buffers. In my experience, the compiler never makes a mistake in detecting such dependencies on local buffers; hence, trying to avoid them will likely result in incorrect output.

In your case, it is not easy to make a judgment without seeing the whole code. However, from what I can see, you are using indirect addressing into the local buffer; the compiler cannot know that these indirect accesses will not overlap and hence has to force sequential reads and writes. In single work-item kernels, this overhead is generally unavoidable unless you change your design strategy. If you are certain that these addresses never overlap, there must be another method you can use to avoid the indirect addressing. If there is no way to avoid it, I think you might get better performance with NDRange kernels, because in that case the scheduler will at least try to maximize pipeline utilization at runtime by reordering the threads, rather than forcing fully sequential operation.
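One way to read "avoid the indirect addressing" is to make the bank a compile-time function of the unrolled lane, so that only the row inside a bank is data dependent. The host-side C sketch below models that access pattern under the assumption (not confirmed by the posted code) that the algorithm can be reshaped so lane ii only ever touches bank ii; `store_banked` and `float2_t` are hypothetical names.

```c
#include <assert.h>

#define BANKS 8                 /* one bank per unrolled lane */
#define ROWS  (8192 / BANKS)    /* same capacity as lmem1 / lmem2 */

typedef struct { float x, y; } float2_t;   /* stand-in for OpenCL float2 */

/* Lane ii always writes column ii, so the bank is known statically and
   only the row index depends on run-time data. */
static void store_banked(float2_t mem[ROWS][BANKS],
                         const unsigned row[BANKS],
                         const float2_t val[BANKS])
{
    for (unsigned ii = 0; ii < BANKS; ++ii) {
        assert(row[ii] < ROWS);
        mem[row[ii]][ii] = val[ii];
    }
}
```

In a kernel, the equivalent would presumably be a two-dimensional __local buffer with a numbanks attribute and the lane index in the lowest dimension; whether the compiler then reports the eight stores as independent should be checked in the report for the tool version in use.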

I am not sure if it applies to your case but if FIFO-based synchronization can help you, you can always use the channels extension.

Altera_Forum · 10-24-2017

Thank you for your reply, HRZ.

I eventually figured out a way to rewrite my algorithm to use banking. In the System Viewer section of the compiler report I now see parallel loads and stores; however, I still see a couple of minor, strange store dependencies within what appears to be a section of the code that is confined to a single bank. I'm guessing this is because my code writes two values to the same bank, even though I have set numwriteports to 2.

I understand what the original issue was, but after revisiting the Best Practices Guide, it seems Altera/Intel suggests that ivdep should be applicable to local memory. From the guide:

"The array specified by the ivdep pragma must be a local or private memory array, or a pointer variable that points to a global, local, or private memory storage."

It seems that ivdep applied to local memory could be useful in my original situation, where the array access is indirect or complex but the user knows there won't be bank collisions.

Thanks for your help again.

Altera_Forum · 10-25-2017

I see. I remember reading something in the changelog of either v17.0 or one of its updates that Altera had fixed a bug related to ivdep not being applied in certain cases. If you are using v16.0/16.1, that could be the reason why it didn't work in your original case.