Intel® Quartus® Prime Software
Intel® Quartus® Prime Design Software, Design Entry, Synthesis, Simulation, Verification, Timing Analysis, System Design (Platform Designer, formerly Qsys)

Elongate pipeline voluntarily

Altera_Forum
Honored Contributor II

Hi peeps, 

 

Suppose I have a global memory access which, according to profiling, stalls the pipeline very often. 

 

Well, first of all, is there a way to know how many cycles the pipeline stalls for? Let's say this number is 'P'. 

 

Then, is there a way to elongate the pipeline voluntarily to wait for the memory access without stalling? 

 

For example, in pseudo code: 

 

var = getFromGlobal(randomIndex) // Stalls very often
var = var
var = var
...
var = var
var = var
calculate(var) // Use the result

 

And then, there would be a bypass for when the memory access takes a long time, so the result can be placed directly in var[P] once it is received. 

 

This way, we could keep inserting work-items into the pipeline every clock cycle, and retrieve them at the same rate. 

 

If it's not clear enough, I can try to explain in another way.
4 Replies
Altera_Forum
Honored Contributor II

NDRange kernels will typically hide these stalls by keeping many work-items in flight to fill the bubbles in the pipeline. The first line of your pseudocode suggests that you are indexing into memory in a non-sequential (or unpredictable) order, which I think is the source of the problem you are running into. So even though the kernel scheduler will attempt to keep the pipeline full, the access pattern will most likely prevent the read data from arriving fast enough to keep the pipeline busy doing work. OpenCL aside, if a master reads from an SDRAM device in a random order, you will typically see idle periods between the blocks of read data returning. When SDRAM is accessed sequentially, the read data typically returns in long continuous blocks (i.e. no stalls). 

 

Instead of trying to elongate the pipeline (which I doubt will help, and which isn't easy to do without knowing how the compiler works), maybe you can describe the size of the data being accessed by the kernel and whether the index used has any predictable pattern, and we can try to suggest a way to improve the memory accesses and address the root of the problem. In cases like these I typically change my algorithm to access memory in a different order, or preload a block of global memory contents sequentially and then access the local copy randomly (local memory can be accessed in any order without any performance degradation).
Altera_Forum
Honored Contributor II

I'm basically searching for 16-byte elements in a hashset that is ~8 GB big. What I did for now to speed up collision resolution is limit each element to a maximum of 7 spots to check. These 7 spots are calculated right from the start, and the memory accesses are independent of each other, like this: 

 

Elements[0] = hashset[spot0]
Elements[1] = hashset[spot1]
...
Elements[6] = hashset[spot6]

 

Then, I verify which element is the right answer. 

 

So I expect all the memory accesses together to cause only one pipeline stall overall, since they are done in parallel thanks to their independence. According to the aocl report, the pipeline stall percentages for the individual memory accesses differ, but they all hover around ~75%. Do you think the stalls are indeed combined?
Altera_Forum
Honored Contributor II

The problem is that even though the kernel will issue these accesses rapidly, they most likely will not be coalesced unless spot0-spot6 are packed together. As a result you end up with a subpar memory access pattern, and the read data will most likely return sporadically, which causes the stalling. If data doesn't flow into the kernel in large blocks, the pipeline will stall because it is dependent on that data.

Altera_Forum
Honored Contributor II

Okay, I understand. I'll take this into consideration. 

 

BTW, I had noticed your absence from the OpenCL forums for a while now. I'm glad to see you back, providing us with some valuable insight!