Intel® Quartus® Prime Software
Intel® Quartus® Prime Design Software, Design Entry, Synthesis, Simulation, Verification, Timing Analysis, System Design (Platform Designer, formerly Qsys)

Elongate pipeline voluntarily

Altera_Forum
Honored Contributor II

Hi peeps, 

 

Suppose I have a global memory access which, according to profiling, stalls the pipeline very often. 

 

Well, first of all, is there a way to know how many cycles the pipeline stalls for? Let's say this number is 'P'. 

 

Then, is there a way to elongate the pipeline voluntarily to wait for the memory access without stalling? 

 

For example, in pseudo code: 

 

var = getFromGlobal(randomIndex) // Stalls very often
var = var
var = var
...
var = var
var = var
calculate(var) // Use the result

 

And then, there would be a bypass for when the memory access takes a long time, so the result can be placed directly in var[P] once it is received. 

 

This way, we could keep inserting work-items into the pipeline every clock cycle, and retrieve them at the same rate. 

 

If it's not clear enough, I can try to explain in another way.
4 Replies
Altera_Forum
Honored Contributor II

NDRange kernels will typically hide these stalls by keeping many work-items in flight to fill the bubbles in the pipeline. The first line of your pseudocode suggests that you are indexing into memory in a non-sequential (or unpredictable) order, which I think is the source of the problem you are running into. So even though the kernel scheduler will attempt to keep the pipeline full, the access pattern will most likely prevent the read data from arriving fast enough to keep the pipeline busy doing work. OpenCL aside, if a master reads from an SDRAM device in a random order, you will typically see idle periods between the blocks of read data returning. When SDRAM is accessed sequentially, the read data typically returns in long continuous blocks (i.e. no stalls). 

 

Instead of trying to elongate the pipeline (which I doubt will help, and which isn't easy to do without knowing how the compiler works), maybe you can describe the size of the data being accessed by the kernel and whether the index used has any predictable pattern, and we can try to suggest a way to improve the memory accesses and address the root of the problem. In cases like these I typically change my algorithm to access memory in a different order, or preload a block of global memory contents sequentially and then access the local copy randomly (local memory can be accessed in any order without any performance degradation).
Altera_Forum
Honored Contributor II

I'm basically searching for 16-byte elements in a hashset that is ~8 GB big. What I did for now to speed up collision resolution is limit each element to a maximum of 7 spots to check. These 7 spots are calculated right from the start, and the memory accesses are independent of each other, like this: 

 

Elements[0] = hashset[spot0]
Elements[1] = hashset[spot1]
...
Elements[6] = hashset[spot6]

 

Then, I verify which element is the right answer. 

 

So I expect all the memory accesses together to cause only one pipeline stall overall, since they are done in parallel thanks to their independence. According to the aocl report, the pipeline stall percentages for the individual memory accesses differ, but they all hover around ~75%. Do you think the stalls are indeed combined?
Altera_Forum
Honored Contributor II

The problem is that even though the kernel will issue these accesses rapidly, they most likely will not be coalesced unless spot0-spot6 are packed together. As a result you end up with a subpar memory access pattern, and the read data will most likely return sporadically, which causes the stalling. If data doesn't flow into the kernel in large blocks, the pipeline will stall because it is dependent on that data.

Altera_Forum
Honored Contributor II

Okay, I understand. I'll take this into consideration. 

 

BTW, I had noticed your absence from the OpenCL forums for a while now. I'm glad to see you back, providing us with some valuable insight!