[cpp]void read(float *buf, float *u, int pbuf, const int bufferLength)
{
    int s;
    /* the loop bound was eaten by the forum formatting; N (the number
       of floats to copy, as in the later posts) is assumed here */
    for(s = 0; s < N; s++) {
        u[s] = buf[pbuf];
        pbuf++;
        if(pbuf == bufferLength)
            pbuf = 0;
    }
    return;
}[/cpp]
[cpp]void read(float *buf, float *u, int pbuf, const int bufferLength)
{
    int s;
    /* again, the loop bound N was lost in the forum formatting */
    for(s = 0; s < N; s++) {
        u[s] = buf[pbuf & (bufferLength - 1)];
        pbuf++;
    }
    return;
}[/cpp]
You may need to tell the compiler that the pointers are not aliased. You can do this with the "restrict" keyword.
Also, the loop count should be known for vectorization.
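As a sketch of that advice (the function name and the `N` parameter are assumed, following the code earlier in the thread): with restrict-qualified pointers and a trip count the compiler can see, the copy loop becomes a much better vectorization candidate.

```c
/* Sketch only: names (read_restrict, N) are assumed, not from the
   original post.  restrict promises the compiler that buf and u do
   not overlap, and N is a loop bound it can see. */
void read_restrict(const float *restrict buf, float *restrict u,
                   int pbuf, const int N, const int bufferLength)
{
    for (int s = 0; s < N; s++) {
        u[s] = buf[pbuf];
        if (++pbuf == bufferLength)
            pbuf = 0;      /* wrap around the ring */
    }
}
```
The per-element wrap test still inhibits straightforward vectorization, which is what the mask and loop-splitting variants later in the thread address.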
Your 2nd alternative is a good idea, with a possible version to try:
[cpp]void read(float *restrict buf, float *restrict u, const int pbuf, const int bufferLength)
{
#pragma vector always
    /* the loop body was truncated in the original post; reconstructed
       from the surrounding discussion (masked index, N assumed) */
    for(int s = 0; s < N; s++)
        u[s] = buf[(s + pbuf) & (bufferLength - 1)];
}[/cpp]
which ought to be possible to auto-vectorize, at least with SSE4. If not, you might submit an actual reproducer on your premier.intel.com account to ask for advice. As this leads to scalar loads packed into parallel stores, one would expect less effectiveness than with loop splitting, depending of course on the values of N and bufferLength and whether you align u[].
The forum seems to have updates delayed by hours today; I didn't see the intervening posts.
This code vectorizes:
[cpp]void read(float *buf, float *u, const int N, int *pBuf, const int bufferLength)
{
    int s;
    int pbuf = *pBuf;
    int moduloMask = bufferLength - 1;
#pragma ivdep
    for(s = 0; s < N; s++) {
        u[s] = buf[(s + pbuf) + moduloMask];
    }
    *pBuf = pbuf + N;
    return;
}[/cpp]
While this code won't:
[cpp]void read(float *buf, float *u, const int N, int *pBuf, const int bufferLength)
{
    int s;
    int pbuf = *pBuf;
    int moduloMask = bufferLength - 1;
#pragma ivdep
    for(s = 0; s < N; s++) {
        u[s] = buf[(s + pbuf) & moduloMask];
    }
    *pBuf = pbuf + N;
    return;
}[/cpp]
I would have hoped that the addition instruction would have been replaced with something like ANDPS. I guess I can code this using intrinsics like _mm_and_ps instead.
The above is wrong. You need to & with moduloMask as in your 2nd code example.
Try something like this untested code
[cpp]void read(float *buf, float *u, const int N, int *pBuf, const int bufferLength)
{
    int s, e, r;
    int pbuf = *pBuf;
    s = 0;
    e = min(bufferLength, pbuf + N);
    /* loop bodies were lost in the forum formatting; reconstructed */
#pragma ivdep
    for(r = pbuf; r < e; r++)   /* part 1: up to the end of the ring */
        u[s++] = buf[r];
#pragma ivdep
    for(r = 0; s < N; r++)      /* part 2: wrapped remainder */
        u[s++] = buf[r];
    *pBuf = (e == bufferLength) ? r : e;  /* new read position */
    return;
}[/cpp]
Jim Dempsey
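For reference, here is a compilable rendering of the same loop-splitting idea (same untested-sketch caveat; the `imin` helper and the exact `*pBuf` update are my additions, not from the post):

```c
static inline int imin(int a, int b) { return a < b ? a : b; }

/* Two contiguous copy loops instead of a per-element wrap test, so
   each loop is trivially vectorizable.  Assumes N <= bufferLength. */
void read_split(const float *buf, float *u, const int N,
                int *pBuf, const int bufferLength)
{
    int s = 0;
    int pbuf = *pBuf;
    int e = imin(bufferLength, pbuf + N);
    int r;
    for (r = pbuf; r < e; r++)   /* part 1: up to the end of the ring */
        u[s++] = buf[r];
    for (r = 0; s < N; r++)      /* part 2: wrapped remainder */
        u[s++] = buf[r];
    /* if we wrapped, r holds the new position; otherwise it is e */
    *pBuf = (e == bufferLength) ? r : e;
}
```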
Will you always be peeling the data out of the circular buffer in vector sized chunks?
IOW 2 for doubles, 4 for floats.
If so, consider defining your buffer as a union of floats (doubles) with __m128i data types (or cast to __m128i). Then use your modulo buffer index to index into the __m128i buffer.
Note, source and destination buffers must be aligned to 16-byte boundaries.
Jim Dempsey
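A sketch of that suggestion (all names hypothetical, and using a cast rather than a union): the ring is moved in 4-float chunks, and the power-of-two mask is applied to the chunk index instead of the float index. It assumes `buf` and `u` are 16-byte aligned, `N` is a multiple of 4, and `bufferLength/4` is a power of two.

```c
#include <xmmintrin.h>  /* SSE: _mm_load_ps, _mm_store_ps */

/* Move the ring contents one 4-float vector at a time.  Both the
   buffer index and the wrap mask are in units of vectors. */
void read_m128(const float *buf, float *u, const int N,
               int *pBuf, const int bufferLength)
{
    int vlen = bufferLength / 4;   /* ring length in 4-float vectors */
    int vpos = *pBuf / 4;          /* read position in vectors       */
    int mask = vlen - 1;           /* vlen must be a power of two    */
    for (int v = 0; v < N / 4; v++)
        _mm_store_ps(&u[v * 4],
                     _mm_load_ps(&buf[((vpos + v) & mask) * 4]));
    *pBuf = ((vpos + N / 4) & mask) * 4;   /* back to float units */
}
```
Because whole vectors never straddle the wrap point, the split-copy problem disappears, at the cost of restricting `N` and the buffer size.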
memcpy, with the Intel compiler, includes a decision tree that determines length, alignment, and SSE capability issues, and when suitable it takes a path that uses SSE instructions. The decision tree has some up-front overhead.
Additionally, the memcpy call may incur extra SSE register save/restore and invalidation of transient SSE registers. Meaning, if your circular buffer code gets inlined, it may avoid some of that save/restore and invalidation.
If your circular buffer is a choke point in your code, you might consider hand-tuning it. I would venture to guess that the deciding issues are, first, the average copy length, and second, the size of the ring buffer (which determines the frequency of split copies).
The other possible issue is whether you have a worst-case latency requirement. Usually this is not as much of a concern as average latency.
Jim Dempsey
If you do need a significant part of the code of an optimized memcpy(), the function call saves a lot of code expansion in your application, and may improve instruction cache locality. For this reason, Intel compilers substitute memcpy() for for() loops automatically when it appears appropriate, based on the assumption that the loop count is at least 100 (should you not specify it by #pragma or a visible static array size declaration).
As Jim has been reminding us, the choice among such methods depends on whether you have long data moves or many short ones. We can go on and on presenting information you may not want, if you are unwilling to answer those questions.
In the case of the circular buffer: if this is a single-producer single-consumer queue, and producer and consumer do not share L1 (HT siblings), and the time in the buffer is very short and/or the data is processed immediately after extraction (IOW producer and consumer are using the circular buffer for inter-thread message passing), then non-temporal moves would not be appropriate, as you would want the data placed into and extracted from the L3 cache (or L2/L1 if HT siblings). Note that on a non-temporal move only L1 is written (the other cache levels are invalidated), so in the special case of message passing between HT siblings the non-temporal moves (of short data) may remain fully cached in L1.
Is there a pragma or other function call to inhibit (or control) non-temporal moves? Something like:
#pragma vector temporal
memcpy(dst, src, n);
On this subject, I did notice that the CEAN extension does pay attention to #pragma vector temporal. So you could use
if((index + n) <= bufferSize)
{
#pragma vector temporal
    dstAsFloats[0:n] = bufferAsFloats[index:n];
    index += n;
    if(index == bufferSize)
        index = 0;
} else {
    int part1 = bufferSize - index;
    int part2 = n - part1;
#pragma vector temporal
    dstAsFloats[0:part1] = bufferAsFloats[index:part1];
#pragma vector temporal
    dstAsFloats[part1:part2] = bufferAsFloats[0:part2];
    index = part2;
}
Jim Dempsey
There are a number of cases where the default action of the Intel optimizer can defeat the purpose of code which uses a forward-backward scheme with the intent of improving cache locality between loops for large arrays.
The pragma would not change the action of memcpy(), but it can be used to control instruction selection for vectorizable for() loops.
If you don't know more than you are willing to tell us about your usage, you will have to experiment. If you perform a 2 part move, it's even conceivable that one should be non-temporal and the other (which you want to remain in cache) should be temporal.
Alright, here is some more info. The length of the circular buffer will range from ~1000 to ~15000. However, the delay between the write pointer and the read pointer is variable over more or less the entire range, and not known at compile time. N, the number of floats processed per call, is also variable, in the range of 32 to 512 or so.
Also, the function will typically write at two places and read from two places in each call, which makes the explicit modulo stuff a bit cumbersome, but still feasible.
And again, thanks for all your help! Even though not all of these ideas might apply in my case, I still find them very interesting.
Hi again. I have done some experiments with code similar to the one above, but with more reads and writes, and some arithmetic operations. The result is a 2:1 speed-up, which is great! The simplest test cases, with only a read and a write, do not see the same speed-up. My guess is that this is due to memory bandwidth being the limiting factor in those situations.
This is all interesting. I didn't understand the temporal/non-temporal stuff before, but after reading this and experimenting a bit further, I can see that I gain another 10% by using #pragma vector nontemporal. This also makes sense according to what you write above.
Thanks!
Temporal has the effect of:
write into L1, into L2, into L3, and dribble to RAM in sequence.
Non-temporal has the effect of:
write into L1, invalidate L2, invalidate L3, and dribble into RAM, not necessarily in sequence and potentially with write combining.
The primary difference is:
pollute/populate L2 and L3,
or
invalidate L2 and L3 and dribble into RAM, not necessarily in sequence and potentially with write combining.
Whether it is "pollute" or "populate" depends on whether you want to reuse the data immediately or not.
Jim Dempsey
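As an illustration of what a non-temporal move looks like at the intrinsics level (a sketch, not code from the thread; assumes dst is 16-byte aligned and n is a multiple of 4):

```c
#include <xmmintrin.h>  /* SSE: _mm_stream_ps, _mm_sfence */

/* Copy n floats with non-temporal (streaming) stores, so the
   destination data dribbles to RAM instead of populating L2/L3. */
void copy_nontemporal(float *dst, const float *src, int n)
{
    for (int i = 0; i < n; i += 4)
        _mm_stream_ps(&dst[i], _mm_loadu_ps(&src[i]));
    _mm_sfence();   /* order the streaming stores before later reads */
}
```
This is the hand-coded equivalent of what #pragma vector nontemporal asks the compiler to generate for a vectorizable loop.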