Intel® ISA Extensions

Sandy Bridge: SSE performance and AVX gather/scatter

capens__nicolas
New Contributor I
Hi all,

I'm curious how the two symmetric 128-bit vector units on Sandy Bridge affect SSE performance. What's its peak throughput, and sustainable throughput for legacy SSE instructions?

I also wonder when parallel gather/scatter instructions will finally be supported. AVX is great in theory, but in practice parallelizing a loop requires the ability to load/store elements from (slightly) divergent memory locations. Serially inserting and extracting elements was still somewhat acceptable for SSE, but with 256-bit AVX it becomes a serious bottleneck, which partially cancels its theoretical benefits.
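Without hardware gather, loading lanes from divergent addresses has to be done element by element, roughly like this scalar sketch (the function name and shapes are hypothetical, not from any shipping API):

[cpp]
#include <array>
#include <cstddef>

// Emulated 8-lane gather, as required before hardware gather support:
// one scalar load plus one insert per lane, fully serialized.
std::array<float, 8> gather8(const float* base, const std::array<int, 8>& idx) {
    std::array<float, 8> out{};
    for (std::size_t lane = 0; lane < 8; ++lane)
        out[lane] = base[idx[lane]];   // one scalar load + insert per lane
    return out;
}
[/cpp]

With 8 lanes per 256-bit vector this serial insert/extract cost grows with the vector width, which is the bottleneck described above.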

Sandy Bridge's CPU cores are actually more powerful than its GPU, but the lack of gather/scatter will limit the use of all this computing power.

Cheers,

Nicolas
jan_v_
New Contributor I

bronxzv wrote:

EDIT: I just tested with VTune and your demo is indeed maxing out memory bandwidth apparently, for example FQuake64 AVX2.exe use a very steady (when looking around the room) 19 GB/s aggregate bandwidth (10.2 GB/s average read bw) with spikes at  20.8 GB/s, it never fell lower than 17.8 GB/s in a ~ 30 s test run

That is still with the IGP displaying? If so, you are mainly measuring the Intel driver copying buffers.

As I'm getting 324 fps (max 380 fps looking at a wall) at 2560x1440, that is 5.2 GB/s just for writing the rendered images. If these can be sent to the discrete GPU with one read, there should be no more than 10.4 GB/s of memory bandwidth usage (as all textures fit in the L3 cache imho).

Note that 400 fps seems to be the limit allowed by PCIe 3 bandwidth. Even with point sampling (hit the space bar) and close to a wall it doesn't get any faster.

bronxzv
New Contributor II

jan v. wrote:

Quote:

bronxzv wrote:

EDIT: I just tested with VTune and your demo is indeed maxing out memory bandwidth apparently, for example FQuake64 AVX2.exe use a very steady (when looking around the room) 19 GB/s aggregate bandwidth (10.2 GB/s average read bw) with spikes at  20.8 GB/s, it never fell lower than 17.8 GB/s in a ~ 30 s test run

That is still with the IGP displaying ? If so you are mainly measuring the Intel driver copying buffers.

the IGP, sure, as a true fan of pure software rendering I have no discrete gfx card on this system... frame buffer size = 1600x1200x4B = 7.68 MB, at 195 fps that is just ~1.5 GB/s of write bandwidth, or 3 GB/s of aggregate copy bandwidth; display is at 60 Hz so ~0.5 GB/s of extra read bandwidth for the screen refresh
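The arithmetic above can be sanity-checked with a small helper (the function name is hypothetical; decimal GB = 1e9 bytes, as in the post):

[cpp]
// Bandwidth needed to write one rendered frame per refresh:
// frame size in bytes times frames per second, reported in GB/s.
double copy_bandwidth_gbs(int w, int h, int bytes_per_pixel, double fps) {
    double frame_bytes = double(w) * h * bytes_per_pixel;  // 1600*1200*4 = 7.68 MB
    return frame_bytes * fps / 1e9;                        // decimal GB/s
}
[/cpp]

At 1600x1200x4B and 195 fps this gives ~1.5 GB/s of write bandwidth (so ~3 GB/s aggregate for a copy), and ~0.46 GB/s for the 60 Hz screen refresh reads, matching the figures quoted.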

bronxzv
New Contributor II

jan v. wrote:
Even with point sampling (hit space bar) and close to wall it doesn't get any faster.

thanks for the hint, I tested it again with this keyboard shortcut: FQuake64 AVX2.exe is at 194-195 fps with bilinear interpolation and 200-201 fps with point sampling, so indeed it looks like the bottlenecks are somewhere else. An idea to better show the advantage of AVX2 vs SSE2 for your texture mapping code would be to have more than one texture per final sample, for example using transparent surfaces, bump maps and reflection maps

jan_v_
New Contributor I

bronxzv wrote:

so indeed it looks like the bottlenecks are somewhere else. An idea to better show the advantage of AVX2 vs SSE2 for your texture mapping code would be to have more than one texture per final sample, for example using transparent surfaces, bump maps and reflection maps

Actually I already had something different in mind:   bicubic interpolation :-)

bronxzv
New Contributor II

jan v. wrote:
That is still with the IGP displaying?

out of curiosity I have measured the time it takes to display a frame (copying one of my own back buffers to the IGPU front buffer) without the rendering fighting for bandwidth. I get these timings at 1600 x 1200, 32-bit (alpha ignored):

Direct 2D : 2.45 ms per frame
GDI : 10.85 ms per frame

bronxzv
New Contributor II

jan v. wrote:

Quote:

bronxzv wrote:

so indeed it looks like the bottlenecks are somewhere else. An idea to better show the advantage of AVX2 vs SSE2 for your texture mapping code would be to have more than one texture per final sample, for example using transparent surfaces, bump maps and reflection maps

Actually I already had something different in mind:   bicubic interpolation :-)

sure, bicubic interpolation is also a nice idea since GPUs don't have direct hardware support for it (AFAIK), so GPU people must resort to fragment shaders to get it working

bronxzv
New Contributor II

c0d1f1ed wrote:
bronxzv, was the performance regression you observed due to not clearing the destination register? Apparently that is required to break a dependency chain.

it's possible, the culprit code looks like this (used in a lot of different places):

[cpp]
        vpcmpeqd   ymm0, ymm0, ymm0                           ;118.3
        vmovdqu    xmm4, XMMWORD PTR [rdx]                    ;118.3
        vmovdqu    xmm1, XMMWORD PTR [16+rdx]                 ;118.36
        vmovdqa    ymm5, ymm0                                 ;118.3
        vpgatherdq ymm3, YMMWORD PTR [rcx+xmm4], ymm5         ;118.3
        vpgatherdq ymm2, YMMWORD PTR [rcx+xmm1], ymm0         ;118.36
[/cpp]

the ymm3 and ymm2 destinations aren't cleared before the gather instructions; this is generated by the Intel compiler from the intrinsic _mm256_i32gather_epi64

when I try to explicitly clear the destination variables with _mm256_setzero_si256() the compiler ignores it, so it is more challenging to test than I would like

EDIT: I managed to test it by resorting to inline ASM for clearing the destination (vpxor used before vpgatherdq) and I get no speedup, in fact a slowdown, unfortunately

[cpp]
test                      duration                     speedup

AVX baseline              13523 - 13665 kclocks
AVX2 gather               14289 - 14358 kclocks        0.946x
AVX2 gather w/ clr dst    15052 - 15167 kclocks        0.898x
[/cpp]
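For readers following the clear-the-destination discussion: a scalar model of the gather mask semantics (illustrative only, not the real VPGATHERDQ instruction) shows why the destination register is an input and can carry a dependency if it isn't cleared first:

[cpp]
#include <array>
#include <cstdint>

// Scalar model of a masked gather: lanes whose mask is clear keep the old
// destination value, so the destination is a true input of the operation.
std::array<std::int64_t, 4> gather_masked(std::array<std::int64_t, 4> dst,
                                          const std::int64_t* base,
                                          const std::array<std::int32_t, 4>& idx,
                                          const std::array<bool, 4>& mask) {
    for (int lane = 0; lane < 4; ++lane)
        if (mask[lane])
            dst[lane] = base[idx[lane]];  // gathered lanes overwrite dst
        // else: dst[lane] is preserved -> dependency on the old dst value
    return dst;
}
[/cpp]

With an all-ones mask every lane is overwritten, which is why clearing the destination beforehand can (in principle) break the chain, even though it didn't help in the timings above.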

 

 

 

jan_v_
New Contributor I

I've adapted my SSE2 versus AVX2 texture mapping demo. Now it can run full screen when hitting the Enter key. With only the HD 4600 displaying it's now much faster compared to windowed. Apparently the extra bit blitting was a bit too much overhead.

Edit: Full screen works on Win 7; on Win 8 it seems it doesn't...

Bernard
Valued Contributor I

>>>bicubic interpolation is also a nice idea since GPUs don't have direct hardware support for it>>>

Bicubic interpolation uses instructions that are easily representable at the machine-code level and implemented inside the shader processors.

Bernard
Valued Contributor I

>>>bicubic interpolation is also a nice idea since GPUs don't have direct hardware support for it>>>

Or do you mean hardware accelerated, like a transcendental functions unit?

Bernard
Valued Contributor I

Hi bronxzv,

sorry for asking this twice (a large part of our fx-related conversation was lost during the forum transition). How is writing to the frame buffer managed by software rendering? Do you use DirectX to do that?

Thanks in advance

jan_v_
New Contributor I

iliyapolak wrote:

sorry for asking this twice (a large part of our fx-related conversation was lost during the forum transition). How is writing to the frame buffer managed by software rendering? Do you use DirectX to do that?

He already mentioned that:

" I measured the time to copy a 32-bit 1600x1200 frame with Direct 2D and the iGPU (CPU at 3.9 GHz fixed) and it takes 3.28 += 0.02 "

bronxzv
New Contributor II

iliyapolak wrote:

>>>bicubic interpolation is also a nice idea since GPUs don't have direct hardware support for it>>>

Bicubic interpolation uses instructions that are easily representable at the machine-code level and implemented inside the shader processors.

by "no direct hardware support " I was meaning that the Texturing Units can't be used directly for it, so you must resort to a software based solution on GPUs too (so it levels the playing field when comparing GPU and CPU rendering speed), there was a paper about that in one of the GPU Gems or Shader X book, sorry but I don't remember exactly the title of the paper

bronxzv
New Contributor II

iliyapolak wrote:

Hi bronxzv,

sorry for asking this twice (a large part of our fx-related conversation was lost during the forum transition). How is writing to the frame buffer managed by software rendering? Do you use DirectX to do that?

Thanks in advance

I was using DirectX in the past (actually the IDirect3D9 interface), even with my own memory copy routines for some specific targets; for compatibility with all platforms there is also a GDI-based fallback

now I have replaced the custom D3D version with one based on Direct 2D (requires >= Windows 7, or Vista with a patch); the code is much simpler now and the performance was the same (D3D vs D2D) when I tested it on a low-end discrete GPU

here are the timings (D2D and GDI) that I get on Haswell with the IGPU: http://software.intel.com/en-us/comment/reply/285867/1740857

 

 

 

Bernard
Valued Contributor I

jan v. wrote:

Quote:

iliyapolak wrote:

sorry for asking this twice (a large part of our fx-related conversation was lost during the forum transition). How is writing to the frame buffer managed by software rendering? Do you use DirectX to do that?

He already mentioned that:

" I measured the time to copy a 32-bit 1600x1200 frame with Direct 2D and the iGPU (CPU at 3.9 GHz fixed) and it takes 3.28 += 0.02 "

I must admit that I did not read the whole thread :)

Bernard
Valued Contributor I

>>>by "no direct hardware support " I was meaning that the Texturing Units can't be used directly for it, so you must resort to a software based solution on GPUs too>>>

I see what you mean.

Bernard
Valued Contributor I

>>>here are the timings (D2D and GDI) that I get on Haswell with the IGPU: http://software.intel.com/en-us/comment/reply/285867/1740857>>>

I think that D2D is faster because more of its operations better utilize hardware acceleration.

jan_v_
New Contributor I

bronxzv wrote:

EDIT: I just tested with VTune and your demo is indeed maxing out memory bandwidth apparently, for example FQuake64 AVX2.exe use a very steady (when looking around the room) 19 GB/s aggregate bandwidth (10.2 GB/s average read bw) with spikes at  20.8 GB/s, it never fell lower than 17.8 GB/s in a ~ 30 s test run

Did you check the new version I put up, which can run in full screen (Enter key)? Did you see any frame rate improvement? (only works on Win 7 / Vista)

bronxzv
New Contributor II

jan v. wrote:
Did you check the new version I put up, which can run in full screen (Enter key)? Did you see any frame rate improvement? (only works on Win 7 / Vista)

Hi jan, no I didn't, and I have Windows 8 on the Haswell system, so if I understand it well full screen mode will not work on it; once you have a Windows 8 version available I'll be glad to test it

JEROME_B_Intel1
Employee

bronxzv wrote:

Serially inserting and extracting elements was still somewhat acceptable for SSE, but with 256-bit AVX it becomes a serious bottleneck,

For "slightly divergent locations", i.e.most elements in the same 64B cache line, AFAIK with SSEthe bestsolutionwas with indirect jumps to a series of static shuflles (controls as immediates) in order to maximize 128-bit load/stores. Now with AVX we can use dynamic shuffles (controls inYMM registers)using VPERMILPS. Based on IACA the new AVX solution is more than 2x faster than legacy SSE, I suppose it will be even more than 2x faster on real hardware since the main issue with the indirect jump solution is the high branch miss rate

Could you point me at an example of loading 4 double floats into a YMM register from non-contiguous locations, using AVX intrinsics? This is exactly the problem I have (on Sandy Bridge), and the only way I can see to do it is to "marshal" the data into contiguous locations, which is obviously a non-starter from a performance standpoint.
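The dynamic-shuffle idea quoted above can be modeled in scalar code; this sketch (a hypothetical helper, not a real intrinsic) mimics how each lane of a VPERMILPS result selects a source lane within its 128-bit half using the low two bits of its per-lane control value:

[cpp]
#include <array>
#include <cstdint>

// Scalar model of a VPERMILPS-style dynamic shuffle over one 128-bit half:
// each output lane picks one of the four source lanes via its control value.
std::array<float, 4> permilps_model(const std::array<float, 4>& src,
                                    const std::array<std::uint32_t, 4>& ctrl) {
    std::array<float, 4> out{};
    for (int lane = 0; lane < 4; ++lane)
        out[lane] = src[ctrl[lane] & 3];  // only the low 2 bits are used
    return out;
}
[/cpp]

Because the control comes from a register rather than an immediate, one load plus one shuffle can replace the branchy indirect-jump tables the post describes for SSE.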

bronxzv
New Contributor II

JEROME B. (Intel) wrote:

Could you point me at an example of loading 4 double floats into a YMM register from non-contiguous locations, using AVX intrinsics? This is exactly the problem I have (on Sandy Bridge), and the only way I can see to do it is to "marshal" the data into contiguous locations, which is obviously a non-starter from a performance standpoint.

It's generally slower to arrange the data in memory just before filling a YMM register (among other reasons because store-to-load forwarding is blocked when 4 x 64-bit stores are followed by a 256-bit load)

I suppose the Intel compiler generates the best code sequence if you use _mm256_set_pd
