Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Vectorize circular buffer?

stx
Beginner
1,235 Views
Hi!
I have a lot of circular buffers and I wonder if it is possible to vectorize the inner loops. They will typically look like this (read phase):
[cpp]void read(float *buf, float *u, int pbuf, const int bufferLength)
  int s;

  for(s=0; s = buf[pbuf];
	
    pbuf++;
  
    if(pbuf == bufferLength)
      pbuf = 0;
  }

  return;
}[/cpp]
I can make the bufferLength power of 2, and simplify the above to something like:
void read(float *buf, float *u, int pbuf, const int bufferLength)
  int s;

  for(s=0; s = buf[pbuf & (bufferLength-1)];
	       
        pbuf++;
  }

  return;
}
However, it still won't vectorize. The compiler says "loop was not vectorized: dereference too complex", which I kind of understand. So my question is, is there anything smart I can do here? Is it even possible to vectorize this kind of operation?
Thanks!
0 Kudos
21 Replies
Om_S_Intel
Employee
1,139 Views

You may need to tell the compiler that pointers are not aliased. This you can do with "restrict" key word.

Also loop count should be known for vectorization.

0 Kudos
stx
Beginner
1,139 Views
Yeah, you are right!
I did that by using "#pragma ivdep", which seems to work, but forgot it in my example. Are you sure about the loop count? It will not be known at compile time, but I see other loops in my code using the same loop count which do vectorize.
0 Kudos
TimP
Honored Contributor III
1,139 Views
icc performs automatic loop splitting or peeling only for a limited group of cases, not likely to include any with an explicit conditional. You could write it out explicitly with an additional level of looping so as to take the pointer wrap out of the inner loop. You could write it as (at most a pair of) memcpy() if that is appealing.
Your 2nd alternative is a good idea, with a possible version to try:

voidread(float*restrict buf,float*restrict u, const int pbuf,const intbufferLength)
#pragma vector always
for(int s=0;s u=buf[pbuf + (s &(bufferLength-1))];

which ought to be possible to auto-vectorize, at least with SSE4. If not, you might submit an actual reproducer on your premier.intel.com account to ask for advice. As this leads to scalar loads packed into parallel stores, one would expect less effectiveness than with loop splitting, depending of course on the values of N and bufferLength and whether you align u[].
0 Kudos
TimP
Honored Contributor III
1,139 Views
The requirement on N is that it be clearly loop invariant. I hate to try to say it more technically than Om. A possible alternative to restrict would be to define local copies within the scope containing the loop. As you said, #ivdep should eliminate any analysis for possible aliasing, but it's good to use syntax which supports the optimization you want without having to tell the compiler not to check anything.
The forum seems to have updates delayed by hours today; I didn't see the intervening posts.
0 Kudos
stx
Beginner
1,139 Views
Thanks for your replies!
I did some more tests, and this is what I found out:
This code will vectorize:
[cpp]void read(float *buf, float *u, const int N, int *pBuf, const int bufferLength)
{
  int s;
  int pbuf = *pBuf;
  int moduloMask = bufferLength - 1;

#pragma ivdep
  for(s=0; s = buf[(s + pbuf) + moduloMask];
  }

  *pBuf = pbuf+N;
  
  return;
}  [/cpp]
While this code won't:
[cpp]void read(float *buf, float *u, const int N, int *pBuf, const int bufferLength)
{
  int s;
  int pbuf = *pBuf;
  int moduloMask = bufferLength - 1;

#pragma ivdep
  for(s=0; s = buf[(s + pbuf) & moduloMask];
  }

  *pBuf = pbuf+N;
  
  return;
}  [/cpp]
I would have hoped that the addition instruction would have been replaced with something like ANDPS. I guess I can code this using intrinsics like _mm_and_ps instead.
0 Kudos
stx
Beginner
1,139 Views
Wait! This can never work. There are no SSE instructions to load data from four different addresses, only from four consecutive addresses, right?
So if the wraparound happens at a random boundary, this will never work. I guess it can be solved by writing extra (duplicate) data at the end of the buffer to catch any reads past the end of the buffer.
0 Kudos
jimdempseyatthecove
Honored Contributor III
1,139 Views
u=buf[(s+pbuf)+moduloMask];

The above is wrong. You need to & with moduloMask as in your 2nd code example.

Try something like this untested code

[bash]void read(float *buf, float *u, const int N, int *pBuf, const int bufferLength)
{
  int s, e;
  int pbuf = *pBuf;

  s = 0;
  e = min(bufferLength, pbuf + N);

#pragma ivdep
  for(r=pbuf; r;

#pragma ivdep
  for(r=0; s;
  

  *pBuf = r;
  
  return;
}  [/bash]

Jim Dempsey

0 Kudos
stx
Beginner
1,139 Views
Yes, its wrong! I just wanted to demonstrate that the compiler knows how to vectorize the construct as such (but using another operator than &). Sorry for not being clear about that.
Your ide in the example is probably the best thing to do, to explicitly avoid the wraparound. I will look into that. I was hoping to be able to avoid that solution, as I also have more complicated functions with several pointers, and then it gets complicated, but I'll give it a try.
Thanks for all your help!
0 Kudos
jimdempseyatthecove
Honored Contributor III
1,139 Views
stx,

Will you always be peeling the data out of the circular buffer in vector sized chunks?
IOW 2 for doubles, 4 for floats.

If so, consider defining your buffer as a union of floats (doubles) with __m128i data types (or cast to __m128i). Then use you modulus buffer index into the __m128i buffer.

Note, source and destination buffers must be aligned to 16-byte boundaries.

Jim Dempsey
0 Kudos
stx
Beginner
1,139 Views
Yeah, that sounds like a clever thing to do, but only if the modulo is a factor of 4 (for floats). This is typically not the case in my application. I guess this can be solved by explicitly writing samples at both ends of the buffers when close to the wrap around?
Thanks for your effort!
0 Kudos
jimdempseyatthecove
Honored Contributor III
1,139 Views
If the amount of data per move is sufficiently large, I suggest you take Tim's advice and use the memcpy (twice when necessary). This will take care of having you write the alignment and partial sse vector code into your function (read more portable). When your data packets are small though, it may be faster to just use an index on the array of floats. Not knowing the data flow characteristics makes it rather difficult to recommend a "best way".

Jim Dempsey
0 Kudos
stx
Beginner
1,139 Views
I see what you are saying. It seems like there is not perfect solution for this problem, so I will set up a test case to compare memcpy, plain loops and explicitly split-up loops (to avoid modulo). It might be that some of these cases are limited by memory bandwidth, and then SSE won't help much I guess.
Thanks again for your help!
0 Kudos
jimdempseyatthecove
Honored Contributor III
1,139 Views
Quoting stx
I see what you are saying. It seems like there is not perfect solution for this problem, so I will set up a test case to compare memcpy, plain loops and explicitly split-up loops (to avoid modulo). It might be that some of these cases are limited by memory bandwidth, and then SSE won't help much I guess.
Thanks again for your help!

memcpy, with the Intel compiler, includes a decision tree that determines lenght, alignment, and SSE capabilityissues andwhen suitable it will take a path that uses SSE instructions. The decision tree has some up front overhead.

Additionaly, the memcpy may incure additional SSE register save/restore and invalidate of transient SSE registers. Meaning, if your circular buffer code gets inlined, it may avoid some SSE register save/restore and invalidation of transient SSE registers.

If your circular buffer is a choke point in your code, you might consider hand tuning it. I would venture to guess that the deciding issue is first, the average length of copy. And second, the size of the ring buffer (determines the frequency of split copies).

The other possible issue is if you have a worst case latency requirement. Usually this is not a concerne as much as average latency.

Jim Dempsey

0 Kudos
TimP
Honored Contributor III
1,139 Views
Quoting stx
some of these cases are limited by memory bandwidth, and then SSE won't help much I guess.

memcpy, with the Intel compiler, includes a decision tree that determines lenght, alignment, and SSE capabilityissues andwhen suitable it will take a path that uses SSE instructions. The decision tree has some up front overhead.

This includes invoking non-temporal move where appropriate, which seems likely, as you mention it may be memory bandwidth limited.
If you do need a significant part of the code of an optimized memcpy(), the function call saves a lot of code expansion in your application, and may improve instruction cache locality. For this reason, Intel compilers make memcpy() substitutions automatically in for() when it appears appropriate, based on the assumption that the loop count is at least 100 (should you not specify it by #pragma or visible static array size declaration).
As Jim has been reminding us, the choice among such methods depends on whether you have long data moves or many short ones. We can go on and on presenting information you may not want, if you are unwilling to answer those questions.
0 Kudos
jimdempseyatthecove
Honored Contributor III
1,139 Views
>>This includes invoking non-temporal move where appropriate

In the case of the circular buffer, if this is a single producer single consumer queue .and. producer and consumer do not share L1 (HT siblings) .and. the time in the buffer is very short .and./.or. the data is processed immediately after extraction from the buffer (IOW producer and consumer do not share L1 and are using circular buffer as inter-thread message passing), then non-temporal moves would not be appropriate as you would want the data to be placed into and extracted from L3 cache (or L2/L1 if HT siblings). Note, L1 onlyis written on non-temporal move (other caches are invalidated), in the special case of message passing between HT siblings the non-temporal moves (of short data) may remain fully cached in L1.

Is there a pragma or other function call to inhibit (or control) non-temporal moves?

#pragma vectortemporal
memcpy(dst, src, n);

On this subject, I did notice that the CEAN extension does pay attention to #pragma vectortemporal. So you could use

if((index + n) <= bufferSize)
{
#pragma vectortemporal
dstAsFloats[0:n] = bufferAsFloats[index:n];
index+= n;
if(index == bufferSize)
index = 0;
} else {
int part1 = bufferSize - index;
int part2 = n - part1;
#pragma vectortemporal
dstAsFloats[0:part1] = bufferAsFloats[index:part1];
#pragma vectortemporal
dstAsFloats[part1:part2] = bufferAsFloats[0:part2];
index = part2;
}


Jim Dempsey
0 Kudos
TimP
Honored Contributor III
1,139 Views
memcpy() would choose non-temporal move only when the data segment is so large that it would evict most of the data currently in cache. Then, the (unforeseen) case where non-temporal is bad is the one where the code is written explicitly to use ("consume") first the tail end of the array which was just moved.
There are a number of cases where the default action of the Intel optimizer can defeat the purpose of code which uses a forward-backward scheme with the intent of improving cache locality between loops for large arrays.
The pragma would not change the action of memcpy(), but it can be used to control instruction selection for vectorizable for() loops.
If you don't know more than you are willing to tell us about your usage, you will have to experiment. If you perform a 2 part move, it's even conceivable that one should be non-temporal and the other (which you want to remain in cache) should be temporal.
0 Kudos
stx
Beginner
1,139 Views
Quoting tim18
As Jim has been reminding us, the choice among such methods depends on whether you have long data moves or many short ones. We can go on and on presenting information you may not want, if you are unwilling to answer those questions.

Alright, here is some more info. The length of the circular buffer will range from ~1000 to ~15000. However, the delay between the writepointer and the readpointer is variable over the entire range, more or less, and not known at compile time. The bufferLength, the number of floats processed per call, is also variable in the range 32 to 512 or something like that.

Also, the function will typically write at two places and read from two places in each call, which makes the explicit modulo stuff a bit cumbersome, but still feasible.

And again, thanks for all your help! Even though not all of these ideas might apply in my case, I still find them very interesting.

0 Kudos
stx
Beginner
1,139 Views
On this subject, I did notice that the CEAN extension does pay attention to #pragma vectortemporal. So you could use

if((index + n) <= bufferSize)
{
#pragma vectortemporal
dstAsFloats[0:n] = bufferAsFloats[index:n];
index+= n;
if(index == bufferSize)
index = 0;
} else {
int part1 = bufferSize - index;
int part2 = n - part1;
#pragma vectortemporal
dstAsFloats[0:part1] = bufferAsFloats[index:part1];
#pragma vectortemporal
dstAsFloats[part1:part2] = bufferAsFloats[0:part2];
index = part2;
}


Jim Dempsey

Hi again. I have done some experiments with code similar to the one above, but with more reads and writes, and some arithmetic operations. The result is a 2:1 speed-up, which is great! The most simple test cases, with only read and write, will not see the same speed-up. My guess that is due to memory bandwidth being the limiting factor in those situations.

0 Kudos
stx
Beginner
1,139 Views
Quoting tim18
memcpy() would choose non-temporal move only when the data segment is so large that it would evict most of the data currently in cache. Then, the (unforeseen) case where non-temporal is bad is the one where the code is written explicitly to use ("consume") first the tail end of the array which was just moved.
There are a number of cases where the default action of the Intel optimizer can defeat the purpose of code which uses a forward-backward scheme with the intent of improving cache locality between loops for large arrays.
The pragma would not change the action of memcpy(), but it can be used to control instruction selection for vectorizable for() loops.
If you don't know more than you are willing to tell us about your usage, you will have to experiment. If you perform a 2 part move, it's even conceivable that one should be non-temporal and the other (which you want to remain in cache) should be temporal.

This is all interesting. I didn't understand the temporal/non-temporal stuff before, but after reading this and experimenting a bit further, I can see that I gain another 10% by using #pragma vector nontemporal. This also makes sense according to what you write above.

Thanks!

0 Kudos
jimdempseyatthecove
Honored Contributor III
910 Views
The "temporal' choice of words and effect on caches is a bit confusing. Whereas SSE nomenclature of might be less so.

Temporal has the effect of:

write into L1, into L2, into L3, and dribble in sequence to RAM

non-Temporal had the effect of

write into L1, invalidate L2, invalidate L3, and dribble into RAM not necessarily in sequence and potentially with write combining.

The primary difference is

pollute / populate L2 and L3
or
invalidate L2, L3 and dribble into RAM not necessarily in sequence and potentially with WC

As to if it is pollute or populate, this depends on if you want to immediatelyreuse the data or not.

Jim Dempsey


0 Kudos
Reply