Vectorize circular buffer?

stx · 10-15-2010

Hi!

I have a lot of circular buffers and I wonder if it is possible to vectorize the inner loops. They will typically look like this (read phase):

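[cpp]void read(float *buf, float *u, int pbuf, const int bufferLength)
{
  int s;

  /* N (samples read per call) is a placeholder name: the loop bound is not legible in the post. */
  for(s=0; s<N; s++) {
    u[s] = buf[pbuf];

    pbuf++;

    /* wrap around at the end of the circular buffer */
    if(pbuf == bufferLength)
      pbuf = 0;
  }

  return;
}[/cpp]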

I can make bufferLength a power of 2, and simplify the above to something like:

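[cpp]void read(float *buf, float *u, int pbuf, const int bufferLength)
{
  int s;

  /* N (samples read per call) is a placeholder name: the loop bound is not legible in the post. */
  for(s=0; s<N; s++) {
    /* the mask replaces the wrap-around test; requires a power-of-2 bufferLength */
    u[s] = buf[pbuf & (bufferLength-1)];

    pbuf++;
  }

  return;
}[/cpp]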

However, it still won't vectorize. The compiler says "loop was not vectorized: dereference too complex", which I kind of understand. So my question is, is there anything smart I can do here? Is it even possible to vectorize this kind of operation?

Thanks!

TimP · 10-20-2010

If you write at most 512 floats before moving on to code that may read what you have written, I would not expect memory bandwidth to be a major issue. Nontemporal stores mainly pay off when you store so much data that ordinary (temporal) stores would evict what you need next from cache, which doesn't appear to be your case.
On the other hand, your buffer does look big enough that, if you fill most of it before reading it back, the data would not be expected to remain in L1 cache, which may be a point in favor of nontemporal stores. If you can get a 10% advantage in what looks like a marginal case, that would be useful information, and you can get it only by testing.
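
For reference, here is a minimal sketch of what a nontemporal store amounts to when written by hand. It assumes an x86 target with SSE. The helper name `stream_copy` and its parameters are hypothetical, not taken from the code in this thread, and the sketch further assumes a 16-byte-aligned destination and a length that is a multiple of 4.

```cpp
#include <xmmintrin.h>  // SSE: _mm_loadu_ps, _mm_stream_ps, _mm_sfence

// Hypothetical helper: copies n floats using streaming (nontemporal) stores.
// Assumes dst is 16-byte aligned and n is a multiple of 4.
void stream_copy(float *dst, const float *src, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(src + i);  // ordinary unaligned load
        _mm_stream_ps(dst + i, v);         // store with a "do not keep in cache" hint
    }
    _mm_sfence();  // order the streamed stores ahead of any later stores
}
```

Timing a loop like this against a plain copy, at your real buffer sizes, is probably the only reliable way to tell whether the hint helps.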

In your last posted code, the failure to vectorize still looks more likely to be due to the missing restrict qualifiers, or possibly to the lack of a #pragma vector always or #pragma vector nontemporal. Note that both of those pragmas ask for vectorization without evaluating efficiency, so it is possible that nontemporal could promote vectorization. It might be interesting to see whether making bufferLength unsigned has any influence on the compiler's analysis, possibly with combinations of signed and unsigned casts. unsigned tends to be inefficient for loop counters, subscripts, and bounds (contrary to the opinion of many), but may be preferable for masking.
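
Putting those suggestions together, something like the following might be worth trying. This is only a sketch: the name `read_masked` and the count parameter `n` are assumptions, `__restrict` is the compiler-extension spelling of the C99 qualifier in C++, and the pragma is guarded so that non-Intel compilers never see it. The loop counter stays signed and only the masking arithmetic is unsigned.

```cpp
// Hypothetical variant of the read loop with restrict-qualified pointers
// and an unsigned mask. n (samples to read) is an assumed parameter.
void read_masked(const float *__restrict buf, float *__restrict u,
                 unsigned pbuf, unsigned bufferLength, int n)
{
    const unsigned mask = bufferLength - 1u;  // valid only for power-of-2 lengths

#if defined(__INTEL_COMPILER)
#pragma vector always
#endif
    for (int s = 0; s < n; s++)
        u[s] = buf[(pbuf + (unsigned)s) & mask];  // index wraps through the mask
}
```

Whether this actually vectorizes depends on how the compiler treats the masked subscript, so check the vectorization report rather than taking it for granted.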