Intel® Quartus® Prime Software
Intel® Quartus® Prime Design Software, Design Entry, Synthesis, Simulation, Verification, Timing Analysis, System Design (Platform Designer, formerly Qsys)

Forcing loop iterations to execute sequentially

Altera_Forum
Honored Contributor II

Hi all, 

 

I have some code with a nested for loop in a single work-item kernel, like so: 

 

__local float lmem[M][2]; // ping-pong buffer

for (uint outer = 0; outer < N; ++outer)
{
    __private uint wr_bank_sel = outer & 0x1;
    __private uint rd_bank_sel = !(outer & 0x1);

    for (uint inner = 0; inner < M; ++inner)
    {
        lmem[inner][wr_bank_sel] = lmem[inner][rd_bank_sel] * 10.0f / (inner + outer); // placeholder math op, but real dependencies
    }
}

The code above worked in the emulator but failed in hardware. After much digging via printf statements, what I discovered was that the outer and inner loops were both executing out of order simultaneously, i.e.: 

[outer = 0, inner = 0] -> [outer = 1, inner = 0] -> [outer = 2, inner = 0] -> [outer = 0, inner = 1] -> [outer = 3, inner = 0], etc  

when it should be: 

[outer = 0, inner = 0] -> [outer = 0, inner = 1] -> [outer = 0, inner = 2] -> ... -> [outer = 0, inner = M-1] -> [outer = 1, inner = 0], etc. 

 

I'm fine with the inner loop executing out of order, but the outer loop executing out of order at the same time obviously doesn't work with a ping-pong buffer strategy. 

 

I've tried mem_fences, which don't appear to have any effect (I've tried CLK_GLOBAL_MEM_FENCE, CLK_LOCAL_MEM_FENCE, CLK_CHANNEL_MEM_FENCE, and combinations of those). 
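A minimal sketch of the kind of placement I mean (a fence at the end of each outer iteration; the flags are shown OR'd together here, though I also tried them individually):

for (uint outer = 0; outer < N; ++outer)
{
    __private uint wr_bank_sel = outer & 0x1;
    __private uint rd_bank_sel = !(outer & 0x1);

    for (uint inner = 0; inner < M; ++inner)
    {
        lmem[inner][wr_bank_sel] = lmem[inner][rd_bank_sel] * 10.0f / (inner + outer);
    }

    // No observable effect on the ordering in hardware:
    mem_fence(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE | CLK_CHANNEL_MEM_FENCE);
}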

 

What does seem to work is adding an unnecessary channel in my outer loop. This modification looks like the following: 

 

__local float lmem[M][2]; // ping-pong buffer

for (uint outer = 0; outer < N; ++outer)
{
    write_channel_intel(fake_channel, outer); // NEW

    mem_fence(CLK_CHANNEL_MEM_FENCE); // NEW

    const uint fake_outer = read_channel_intel(fake_channel); // NEW

    __private uint wr_bank_sel = fake_outer & 0x1;
    __private uint rd_bank_sel = !(fake_outer & 0x1);

    for (uint inner = 0; inner < M; ++inner)
    {
        lmem[inner][wr_bank_sel] = lmem[inner][rd_bank_sel] * 10.0f / (inner + fake_outer); // NEW (replaced outer with fake_outer)
    }
}
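For completeness: the snippet above assumes the channels extension is enabled and fake_channel is declared at file scope, roughly like this (the depth value is arbitrary):

#pragma OPENCL EXTENSION cl_intel_channels : enable

channel uint fake_channel __attribute__((depth(1))); // dummy channel, only used to break the dependency analysis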

 

 

 

In report.html I now see that there is a serial execution dependency: "Iteration executed serially across BlockN. Only a single loop iteration will execute inside this region due to memory dependency". I think this is exactly what I want -- my outer loop to execute serially. 

 

I've built and run this modified code, and it seems to work. I've also found that you can play with atomics to get the same message in report.html (I have not yet built and tried that in hardware, though). 
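(By "play with atomics" I mean something along these lines -- a dummy atomic read-modify-write in the outer loop so the compiler sees a loop-carried dependency. This is only a sketch and, as mentioned, untested in hardware.)

__local volatile int serialize_token; // dummy location, exists only to create a dependency

for (uint outer = 0; outer < N; ++outer)
{
    atomic_add(&serialize_token, 1); // read-modify-write on the same location every outer iteration

    // ... same ping-pong code as above ...
}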

 

 

Is there a better way?  

 

I'm having a hard time believing Intel/Altera would not have considered this use case. I also imagine I'm incurring some performance penalty with these workarounds. 

 

Thanks in advance for your help. If nobody replies, I hope I've at least provided some workaround strategies for anyone who stumbles upon this problem in the future. 

 

 

This is with the 17.0 version of the compiler.
Altera_Forum
Honored Contributor II

You can force serial execution with loop unrolling. I don't know what your N and M values are, but have you tried doing partial or full unrolling?
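For example, something along these lines (a sketch only; whether full or partial unrolling is feasible depends on your M and your resource budget):

for (uint outer = 0; outer < N; ++outer)
{
    uint wr_bank_sel = outer & 0x1;
    uint rd_bank_sel = !(outer & 0x1);

    #pragma unroll        // or e.g. "#pragma unroll 4" for partial unrolling
    for (uint inner = 0; inner < M; ++inner)
    {
        lmem[inner][wr_bank_sel] = lmem[inner][rd_bank_sel] * 10.0f / (inner + outer);
    }
}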

Altera_Forum
Honored Contributor II

Unless your kernel is being incorrectly compiled as NDRange, loops are never executed out of order in single work-item kernels. NDRange kernels have a scheduler that can change the order of threads at runtime, but single work-item kernels do not have a scheduler, and loop iterations are guaranteed to execute in order. What is happening here is that your printfs are coming out out of order, not the iterations themselves. I am still not sure how printf is implemented on the hardware, but there is likely some on-chip buffering in place, and hence I don't think correct ordering of printfs is guaranteed on the hardware. You should probably not use printf for debugging ordering on the hardware. 

 

Anyway, I wrote a test kernel based on your code snippet, and as I expected, the compiler already correctly detects the dependency and forces the outer loop to be executed sequentially. This is the test kernel I wrote: 

 

#define M 100
#define N 1000

//__attribute__((max_global_work_dim(0)))
__kernel void ast(__global float* restrict in, __global float* restrict out)
{
    __local float lmem[M];

    for (int i = 0; i < M; i++)
    {
        lmem[i] = in[i];
    }

    for (uint outer = 0; outer < N; ++outer)
    {
        uint wr_bank_sel = outer & 0x1;
        uint rd_bank_sel = !(outer & 0x1);
        for (uint inner = 0; inner < M; ++inner)
        {
            lmem[inner] = lmem[inner] * 10.0f / (inner + outer); // placeholder math op but real dependencies
        }
    }

    for (int i = 0; i < M; i++)
    {
        out[i] = lmem[i];
    }
}

 

And this is the compiler's dependency report (v16.1.2): 

The kernel is compiled for single work-item execution.
The kernel has a required work-group size of (1, 1, 1).

Loop Report:

+ Loop "Block1" (file test.cl line 9)
    Pipelined well. Successive iterations are launched every cycle.

+ Loop "Block2" (file test.cl line 14)
  | Pipelined with successive iterations launched every cycle.
  |
  | Iterations executed serially across the region listed below.
  | Only a single loop iteration will execute inside the listed region.
  | This will cause performance degradation unless the region is pipelined well
  | (can process an iteration every cycle).
  |
  |   Loop "Block3" (file test.cl line 18)
  |     due to:
  |       Memory dependency on Load Operation from: (file test.cl line 20)
  |       Store Operation (file test.cl line 20)
  |
  |-+ Loop "Block3" (file test.cl line 18)
        Pipelined well. Successive iterations are launched every cycle.

+ Loop "Block5" (file test.cl line 24)
    Pipelined well. Successive iterations are launched every cycle.

 

The dependency is also properly detected in v17.0.2, and sequential execution is forced in the outer loop. I am not sure why this is not happening in your case. You are not using #pragma ivdep on the outer loop, are you?
Altera_Forum
Honored Contributor II

Thanks HRZ, unfortunately I'm still having this issue. The test code you posted detects the serial execution dependency fine; however, I get very inconsistent results from the compiler for my own program. I'm going to keep trying to create simplified versions to show you what I mean, but if you have any #pragma recommendations or anything else in the meantime, let me know.  

 

I've noticed it occurs when I add a for loop around some code that contains two-layer-deep nested for loops: adding this "outer" for loop seems to make my inner loops lose their serial execution dependency statements in my reports (and their II goes from ~100 to ~2, with no memory replication). I'm not using #pragma ivdep anywhere in my code, and adding #pragma max_concurrency 1 and/or #pragma unroll 1 didn't help.
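For reference, the structure I'm describing looks roughly like this (heavily simplified, with placeholder names and bounds), including where I placed the pragmas:

#pragma max_concurrency 1   // tried with and without
#pragma unroll 1            // tried with and without
for (uint outer = 0; outer < N; ++outer)          // newly added wrapper loop
{
    for (uint middle = 0; middle < P; ++middle)   // pre-existing two-deep nest
    {
        for (uint inner = 0; inner < M; ++inner)
        {
            lmem[inner][outer & 0x1] = lmem[inner][!(outer & 0x1)] * 10.0f / (inner + middle);
        }
    }
}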
Altera_Forum
Honored Contributor II

I am afraid I do not know of any pragmas that could force serial execution. Altera has many undocumented pragmas, and they could well have one for this, but rarely does anybody outside of Altera know about them. 

 

If you manage to create a simple example that reproduces the problem, it would be interesting to take a look at it; there could be a bug in the compiler after all. 

 

P.S. Try to test your code with the latest version of Quartus (17.1 was released yesterday), even if your board does not have a BSP for it, to see how the optimization report changes. They might have fixed the issue already.