This code scales poorly with AVX on my Sandy Bridge. How can I make it more vectorizer-friendly?
for (auto i = 0; i < pcount; i += 2){
    for (auto j = 0; j < pcount; j += 2){
        if (i == j) continue;
        auto px = points0[i] * aspect - points0[j] * aspect;
        auto py = points0[i + 1] - points0[j + 1];
        auto z = divisor / (px * px + py * py + offset);
        points1[i] = points0[i] + (velocities[i] -= px * z) / aspect;
        points1[i + 1] = points0[i + 1] + (velocities[i + 1] -= py * z);
    }
}
All variables are long double.
- Tags:
- Intel® Advanced Vector Extensions (Intel® AVX)
- Intel® Streaming SIMD Extensions
- Parallel Computing
AVX doesn't support long double. If you compile with double for AVX and turn on -opt-report:4 you should get some indication about vectorization (maybe you need double * __restrict points0). If you get a "protects exception" remark you may need #pragma omp simd or #pragma vector. If you don't want to rely on -Qprec-div-, you could get some improvement by calculating 1/aspect outside the loop; even the AVX divide on Sandy Bridge isn't nearly as efficient as on more recent CPUs. You may also see some gains by factoring out aspect, and the opt-report figures should tell you whether it's worthwhile.
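A minimal sketch of those suggestions combined, assuming the loop from the original post: double instead of long double, __restrict-qualified pointers, aspect factored out of the subtraction, and 1/aspect computed once so the per-iteration divide becomes a multiply. The function name and signature are hypothetical.

void step(double * __restrict points0, double * __restrict points1,
          double * __restrict velocities, int pcount,
          double aspect, double divisor, double offset)
{
    const double inv_aspect = 1.0 / aspect; // one divide instead of one per iteration
    for (int i = 0; i < pcount; i += 2) {
        for (int j = 0; j < pcount; j += 2) {
            if (i == j) continue;
            double px = (points0[i] - points0[j]) * aspect; // aspect factored out
            double py = points0[i + 1] - points0[j + 1];
            double z = divisor / (px * px + py * py + offset);
            points1[i] = points0[i] + (velocities[i] -= px * z) * inv_aspect;
            points1[i + 1] = points0[i + 1] + (velocities[i + 1] -= py * z);
        }
    }
}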
I'm using __restrict. Changing the variables to double and moving the aspect compensation outside of the loop improved performance by about 10%, but AVX still only improves performance by about 1.4x compared to disabling intrinsic functions.
"#pragma vector always" is required to vectorize the loop.
The way the data is packed into the points0 and points1 arrays may not be so conducive to vectorization. You're basically using AoS (array of structures) to store the x,y pairs interleaved in memory. The x and y points go through different calculations, so they have to be grouped into different vectors to take full advantage of vector instructions, and I'm not sure how well the compiler can shuffle the interleaved data into vectors and then shuffle it back to interleaved form to write back to memory. You might get better performance if you can switch to SoA (structure of arrays) for the points (and velocity) data. If you can't change the data storage format for the points, you might get better results by rewriting the code with intrinsics, where you could shuffle/blend the data into vectors yourself. It could also help to ensure your arrays are aligned to 32 bytes.
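A hedged sketch of the SoA idea, with illustrative names not taken from the original code: separate x and y arrays mean each 256-bit register can be filled with four x values or four y values by a plain contiguous load, with no shuffling.

#include <cstddef>

struct PointsSoA {
    double *x; // all x coordinates, ideally 32-byte aligned
    double *y; // all y coordinates
};

void accumulate(const PointsSoA &p, double * __restrict vx, double * __restrict vy,
                std::size_t n, double divisor, double offset)
{
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            if (i == j) continue;
            double px = p.x[i] - p.x[j]; // unit-stride loads, no interleaving
            double py = p.y[i] - p.y[j];
            double z = divisor / (px * px + py * py + offset);
            vx[i] -= px * z;
            vy[i] -= py * z;
        }
    }
}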
With /Qvec-report6 I get "loop was not vectorized: existence of vector dependence" for the inner loop, and I realize now why it thinks that's the case: the velocity is changed on each iteration. But it doesn't actually matter what order the loop is executed in! So how do I tell the compiler to ignore it?
#pragma ivdep makes no difference.
The Intel compiler should be able to recognize the reduction idiom with the velocities (see the section about sums):
https://software.intel.com/en-us/node/522572
Maybe try moving the assignments to points1[i] and points1[i+1] outside and after the inner loop. There isn't much point in assigning to them on every trip of the inner loop, since only the value assigned on the last trip matters; that should speed things up in any case.
You could also use some local variables inside the outer loop instead of velocities[i] and velocities[i+1] as your summing variables, and then just copy the values to the velocities array after the inner loop. I wouldn't think this part is necessary, but if you do it the inner loop will look pretty much the same as the sum-reduction idiom in the reference guide.
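A sketch of both suggestions combined, using the variable names from the earlier posts (aspect handling omitted for brevity): local summing variables inside the outer loop, and the points1 stores moved out of the inner loop so the inner loop matches the sum-reduction idiom.

for (int i = 0; i < pcount; i += 2) {
    double vx = velocities[i]; // local summing variables
    double vy = velocities[i + 1];
    for (int j = 0; j < pcount; j += 2) {
        if (i == j) continue;
        double px = points0[i] - points0[j];
        double py = points0[i + 1] - points0[j + 1];
        double z = divisor / (px * px + py * py + offset);
        vx -= px * z; // plain sum reduction, recognizable to the vectorizer
        vy -= py * z;
    }
    velocities[i] = vx; // write back once per i
    velocities[i + 1] = vy;
    points1[i] = points0[i] + vx; // moved out of the inner loop
    points1[i + 1] = points0[i + 1] + vy;
}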
The branch in the inner loop might also impair performance a bit. You could try just removing the "if (i == j) continue;" and seeing whether the few extra computations of the i == j case are less costly than the branch. If your offset variable can be zero this won't work, since you'll get a divide by zero; but if you can guarantee that offset is nonzero, that extra trip through the loop will just add zero to your velocities and have no effect, which might be faster than the branch.
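A sketch of that branchless inner loop, valid only under the assumption that offset is strictly positive, so the i == j trip divides by offset rather than by zero and contributes exactly zero:

for (auto j = 0; j < pcount; j += 2) {
    auto px = points0[i] - points0[j]; // 0 when j == i
    auto py = points0[i + 1] - points0[j + 1]; // 0 when j == i
    auto z = divisor / (px * px + py * py + offset); // divisor / offset when j == i
    velocities[i] -= px * z; // subtracts 0 when j == i
    velocities[i + 1] -= py * z;
}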
Another way would be to split the inner loop into two parts, for j above and below i, but the trick above might vectorize a bit better if you know offset is nonzero.
This is where I am now:
if (aspect != 1.0L)
    //#pragma omp parallel for simd
    for (auto i = 0; i < pcount; i += 2)
        points0[i] *= aspect;
#pragma vector always
//#pragma omp parallel for simd
for (auto i = 0; i < pcount; i += 2){
    auto iy = i + 1;
    for (auto j = 0; j < pcount; j += 2){
        if (i == j) continue;
        auto px = points0[i] - points0[j];
        auto py = points0[iy] - points0[j + 1];
        auto z = divisor / (px * px + py * py);
        velocities[i] -= px * z;
        velocities[iy] -= py * z;
    }
    points1[i] = points0[i] + velocities[i];
    points1[iy] = points0[iy] + velocities[iy];
}
if (aspect != 1.0L)
    //#pragma omp parallel for simd
    for (auto i = 0; i < pcount; i += 2)
        points1[i] /= aspect;
AVX now improves performance by about 1.84x, and multithreading with "#pragma omp parallel for simd" scales amazingly well!
"if (i == j) continue;" is required to prevent a NaN.
I no longer need the offset.
Thanks for all the help.
That looks like it could be vectorized pretty well now. The operations on points0, points1 and velocities are easily vectorized, and the calculation of z should be able to use a horizontal add instruction. It'd be interesting to see the assembly generated by the compiler to see how well it does at vectorizing this.
In the case where aspect != 1, though, the extra passes over points1 and points0 could end up costing more than the vectorization gains. For example, if points0 and points1 are too big for the cache, the extra load and store on each of those arrays will be really costly.
Not sure how far you want to push this optimization, but you could try getting rid of the branch inside your inner loop by splitting it like this:
for (auto j = 0; j < i; j += 2){ ... }
for (auto j = i + 2; j < pcount; j += 2){ ... }
The gravity equation makes more sense to me now, so I cleaned up the variables a bit:
for (auto i = 0; i < pcount; i += 2){
    auto iy = i + 1;
    for (auto j = 0; j < i; j += 2){
        auto px = points0[i] - points0[j];
        auto py = points0[iy] - points0[j + 1];
        auto f = m / (px * px + py * py);
        velocities[i] -= px * f;
        velocities[iy] -= py * f;
    }
    for (auto j = i + 2; j < pcount; j += 2){
        auto px = points0[i] - points0[j];
        auto py = points0[iy] - points0[j + 1];
        auto f = m / (px * px + py * py);
        velocities[i] -= px * f;
        velocities[iy] -= py * f;
    }
    points1[i] = points0[i] + velocities[i];
    points1[iy] = points0[iy] + velocities[iy];
}
No gravitational constant is needed because it's not physically accurate, just an experiment.
I'm only using something on the order of 2048 points, which is 32 KB (2048 points × 2 coordinates × 8 bytes), so it should all fit in the L2 cache.
This is the assembly for f (previously z):
vmovupd ymm4, ymmword ptr [rip+0xce19]
lea ebp, ptr [rax+0x1]
vmovsd xmm3, qword ptr [rip+0xce3e]
vmovups xmmword ptr [rsp+0x20], xmm10
vmovups xmmword ptr [rsp+0x30], xmm11
vmovups xmmword ptr [rsp+0x40], xmm12
shr ebp, 0x1f
vmovups xmmword ptr [rsp+0x50], xmm13
vmovups xmmword ptr [rsp+0x60], xmm14
vmovups xmmword ptr [rsp+0x70], xmm15
lea ebp, ptr [rax+rbp*1+0x1]
sar ebp, 0x1
vmulpd ymm11, ymm10, ymm10
vmulpd ymm12, ymm5, ymm5
vaddpd ymm11, ymm11, ymm12
vdivpd ymm12, ymm4, ymm11
vmulsd xmm14, xmm12, xmm12
vmulsd xmm15, xmm5, xmm5
vaddsd xmm14, xmm14, xmm15
vdivsd xmm15, xmm3, xmm14
vmovupd ymm4, ymmword ptr [rip+0xcc22]
lea r10d, ptr [rax+0x1]
vmovsd xmm3, qword ptr [rip+0xcc46]
vmovups xmmword ptr [rsp+0x20], xmm10
vmovups xmmword ptr [rsp+0x30], xmm11
vmovups xmmword ptr [rsp+0x40], xmm12
shr r10d, 0x1f
vmovups xmmword ptr [rsp+0x50], xmm13
vmovups xmmword ptr [rsp+0x60], xmm14
vmovups xmmword ptr [rsp+0x70], xmm15
lea r10d, ptr [rax+r10*1+0x1]
sar r10d, 0x1
Now the limit is px * f; VTune reports a high CPI of about 1.7 for it but no memory problems:
vmovsd xmm11, qword ptr [r8+r11*8]
vxorpd xmm15, xmm15, xmm15
vxorps xmm5, xmm5, xmm5
vmovsd xmm2, xmm15, xmm11
vinsertf128 ymm2, ymm2, xmm5, 0x1
vmulpd ymm10, ymm10, ymm12   <-- high CPI
vsubpd ymm2, ymm2, ymm10
vextractf128 xmm11, ymm2, 0x1
vaddpd xmm12, xmm2, xmm11
vunpckhpd xmm14, xmm12, xmm12
vaddsd xmm11, xmm12, xmm14
vmulsd xmm12, xmm12, xmm15
vsubsd xmm11, xmm11, xmm12
vmovsd qword ptr [r8+r11*8], xmm11
The stall cycles at px * f (and the high effectiveness of threading) are presumably the result of the expected delay in the availability of the division result.
Could you consider 32-byte alignment?
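For example (a sketch, assuming the arrays are heap-allocated; the function name is hypothetical): _mm_malloc/_mm_free give you 32-byte alignment, and alignas(32) on a fixed-size array would also work.

#include <xmmintrin.h> // _mm_malloc / _mm_free

double *alloc_points(int pcount)
{
    // 32-byte alignment lets the compiler use aligned 256-bit loads/stores
    return static_cast<double *>(_mm_malloc(pcount * sizeof(double), 32));
}
// ... later, release with _mm_free(ptr);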