Intel® ISA Extensions

Sandy Bridge: SSE performance and AVX gather/scatter

capens__nicolas
New Contributor I
Hi all,

I'm curious how the two symmetric 128-bit vector units on Sandy Bridge affect SSE performance. What are the peak and sustainable throughputs for legacy SSE instructions?

I also wonder when parallel gather/scatter instructions will finally be supported. AVX is great in theory, but in practice parallelizing a loop requires the ability to load/store elements from (slightly) divergent memory locations. Serially inserting and extracting elements was still somewhat acceptable for SSE, but with 256-bit AVX it becomes a serious bottleneck, which partially cancels its theoretical benefits.
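To make that concrete, here is a minimal sketch (my own illustration, not from any shipping compiler) of such a "serial gather" with AVX intrinsics; every element needs its own scalar load:

[cpp]
#include <immintrin.h>

// Emulated 8-wide gather with AVX: eight scalar loads plus packing.
// A real parallel gather instruction would replace this whole body.
static inline __m256 GatherEmulated(const float *base, const int idx[8])
{
    return _mm256_set_ps(base[idx[7]], base[idx[6]], base[idx[5]], base[idx[4]],
                         base[idx[3]], base[idx[2]], base[idx[1]], base[idx[0]]);
}
[/cpp]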

Sandy Bridge's CPU cores are actually more powerful than its GPU, but the lack of gather/scatter will limit the use of all this computing power.

Cheers,

Nicolas
bronxzv
New Contributor II

iliyapolak wrote:
CPI for gather instructions is measured when memory references are cached in L1. If that is not the case, CPI will rise due to the latency of memory access.

sure, in other words the gather instructions are worthless even in the situation where they should provide the best speedup (few L1D cache misses)

 

Bernard
Valued Contributor I

Very true.

Btw I am waiting for a new computer to start testing AVX2.

TimP
Honored Contributor III

Enough people have been saying they wanted gather instructions to simplify intrinsics coding under Visual Studio that the CPU architects appear to have been encouraged to include them. One of the goals is to see whether there will be sufficient use to justify hardware features that accelerate gather in later generations.

You might compare the Intel® Xeon Phi™ version of gather, which can fetch all operands falling in a single cache line simultaneously but currently requires iteration over the group of cache lines involved (which the compiler handles implicitly).
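As a rough mental model (my own scalar sketch, not actual Xeon Phi code, and the exact semantics are an assumption), the iteration can be pictured as completing one 64-byte cache line per step:

[cpp]
#include <cstdint>

// Scalar model of a per-cache-line gather: all pending lanes whose element
// lies in the same 64-byte line as the first pending lane complete together,
// and the loop repeats until no lanes remain.
void GatherPerLine(float dst[16], const float *base, const int idx[16])
{
    unsigned pending = 0xFFFFu;                        // one bit per lane
    while (pending) {
        int lead = __builtin_ctz(pending);             // first pending lane (GCC-style)
        uintptr_t line = (uintptr_t)(base + idx[lead]) >> 6;
        for (int i = 0; i < 16; ++i)
            if ((pending >> i & 1) && (uintptr_t)(base + idx[i]) >> 6 == line) {
                dst[i] = base[idx[i]];                 // satisfied from this line
                pending &= ~(1u << i);                 // mark lane done
            }
    }
}
[/cpp]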

capens__nicolas
New Contributor I

It's good to avoid ISA fragmentation by providing new instructions even if their implementation is not optimal yet. But that requires them to not be slower than legacy code, and correct me if I'm wrong, but that doesn't seem to be the case for Haswell's gather implementation. Developers will still need multiple code paths, so the purpose of providing these instructions early on is defeated. I'm glad there's an implicit promise in the docs that it will get faster, but developers will have to test for both AVX2 support and the specific microarchitecture before using gather.
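Just to illustrate the burden, the dispatch boils down to something like this (a sketch using the GCC-style __builtin_cpu_supports; other compilers need their own CPUID check, and a real dispatcher would also have to identify the microarchitecture):

[cpp]
// Hypothetical per-ISA implementations of the same kernel.
void ConvertAVX2(const float *lut, const float *src, float *dst, int n); // gather path
void ConvertAVX (const float *lut, const float *src, float *dst, int n); // insert/extract path

// Runtime dispatch: take the gather-based path only when AVX2 is reported.
void Convert(const float *lut, const float *src, float *dst, int n)
{
    if (__builtin_cpu_supports("avx2"))
        ConvertAVX2(lut, src, dst, n);
    else
        ConvertAVX(lut, src, dst, n);
}
[/cpp]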

For the same reasons of ISA fragmentation I'm very disappointed that TSX support is not available on all Haswell models.

The reason this is critically important is that most developers don't bother with more than a couple of code paths. More paths mean higher QA cost and a higher support budget for when one of the code paths needs maintenance. They'd much rather make a one-time investment in supporting AVX2 with a not-slower gather implementation, plus TSX. Without all of that, the investment becomes harder to justify because there is no clarity on when it will provide value. It's all about the ROI. But this also affects Intel's ROI: if AVX2 and TSX are underutilized, they're just dead silicon that swallowed a lot of R&D cost and will take longer to become a selling point.

bronxzv
New Contributor II

c0d1f1ed wrote:
But that requires them to not be slower than legacy code. Correct me if I'm wrong, but that doesn't seem to be the case for Haswell's gather implementation

it was an overall regression in my own use cases, though I'm not sure that applies to other people's use cases, or even to all my functions; maybe some individual functions get a speedup and others a slowdown?

to get deeper insight I'll write a series of gather-focused microbenchmarks and publish my findings here

jan_v_
New Contributor I

I've been updating my software renderer to make use of AVX2 gather (see the bottom of the web page). There is a speedup of around 10%. Clearing the destination register gave some extra speedup. It's faster than the DX10 version on HD4600, but only with an external PCIe3 GPU to write the software-rendered images to.

bronxzv
New Contributor II

c0d1f1ed wrote:
Correct me if I'm wrong, but that doesn't seem to be the case for Haswell's gather implementation
  

I just wrote a microbenchmark where hardware gather provides some speedup. I tried to make a simple example where the legacy AVX path uses 18 instructions (the very same implementation we discussed in the past); the code is a simplistic float-to-float LUT-based conversion.

Source code (partial):
[cpp]
void GatherTest(const float *lut, const float *src, float *dst, int n)  
{
  for (int i=0; i<n; i+=8) Store(dst+i,Gather(lut,Trunc(OctoFloat(src+i))));
}
[/cpp]
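For readers without my wrapper classes, here is what the AVX2 path presumably maps to in raw intrinsics (inferred from the generated assembly below; the OctoFloat/Trunc/Gather/Store helpers are assumed to correspond to these operations):

[cpp]
#include <immintrin.h>

// Assumed raw-intrinsics equivalent of the wrapped code above (AVX2 path).
void GatherTestIntrin(const float *lut, const float *src, float *dst, int n)
{
    for (int i = 0; i < n; i += 8) {
        __m256i idx = _mm256_cvttps_epi32(_mm256_loadu_ps(src + i)); // Trunc(OctoFloat(src+i))
        __m256  v   = _mm256_i32gather_ps(lut, idx, 4);              // Gather(lut, idx)
        _mm256_storeu_ps(dst + i, v);                                // Store(dst+i, v)
    }
}
[/cpp]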

AVX path, aka "SW gather":
[cpp]
.B3.3:                          ; Preds .B3.3 .B3.2
        vcvttps2dq ymm2, YMMWORD PTR [esi+eax*4]                ;123.51
        vmovd     edi, xmm2                                     ;123.40
        vextracti128 xmm6, ymm2, 1                              ;123.40
        vmovss    xmm0, DWORD PTR [ecx+edi*4]                   ;123.40
        vpextrd   edi, xmm2, 1                                  ;123.40
        vinsertps xmm1, xmm0, DWORD PTR [ecx+edi*4], 16         ;123.40
        vpextrd   edi, xmm2, 2                                  ;123.40
        vinsertps xmm3, xmm1, DWORD PTR [ecx+edi*4], 32         ;123.40
        vpextrd   edi, xmm2, 3                                  ;123.40
        vinsertps xmm0, xmm3, DWORD PTR [ecx+edi*4], 48         ;123.40
        vmovd     edi, xmm6                                     ;123.40
        vmovss    xmm4, DWORD PTR [ecx+edi*4]                   ;123.40
        vpextrd   edi, xmm6, 1                                  ;123.40
        vinsertps xmm5, xmm4, DWORD PTR [ecx+edi*4], 16         ;123.40
        vpextrd   edi, xmm6, 2                                  ;123.40
        vinsertps xmm7, xmm5, DWORD PTR [ecx+edi*4], 32         ;123.40
        vpextrd   edi, xmm6, 3                                  ;123.40
        vinsertps xmm1, xmm7, DWORD PTR [ecx+edi*4], 48         ;123.40
        vinsertf128 ymm2, ymm0, xmm1, 1                         ;123.40
        vmovups   YMMWORD PTR [ebx+eax*4], ymm2                 ;123.28
        add       eax, 8                                        ;123.22
        cmp       eax, edx                                      ;123.19
        jl        .B3.3         ; Prob 82%                      ;123.19
[/cpp]

AVX2 path, aka "HW gather":
[cpp]
.B4.3:                          ; Preds .B4.3 .B4.2
        vcvttps2dq ymm0, YMMWORD PTR [edi+eax*4]                ;139.53
        vpcmpeqd  ymm1, ymm1, ymm1                              ;139.40
        vxorps    ymm2, ymm2, ymm2                              ;139.40
        vgatherdps ymm2, YMMWORD PTR [ecx+ymm0*4], ymm1         ;139.40
        vmovups   YMMWORD PTR [esi+eax*4], ymm2                 ;139.28
        add       eax, 8                                        ;139.22
        cmp       eax, edx                                      ;139.19
        jl        .B4.3         ; Prob 82%                      ;139.19
[/cpp]

timings (single thread):

[cpp]
    128 elts      (1536 B): SW gather 0.569 ns/elt  HW gather 0.562 ns/elt  HW speedup = 1.012 x
    256 elts      (3072 B): SW gather 0.578 ns/elt  HW gather 0.547 ns/elt  HW speedup = 1.058 x
    512 elts      (6144 B): SW gather 0.568 ns/elt  HW gather 0.543 ns/elt  HW speedup = 1.048 x
   1024 elts     (12288 B): SW gather 0.560 ns/elt  HW gather 0.544 ns/elt  HW speedup = 1.030 x
   2048 elts     (24576 B): SW gather 0.725 ns/elt  HW gather 0.692 ns/elt  HW speedup = 1.047 x
   4096 elts     (49152 B): SW gather 0.719 ns/elt  HW gather 0.678 ns/elt  HW speedup = 1.061 x
   8192 elts     (98304 B): SW gather 0.694 ns/elt  HW gather 0.607 ns/elt  HW speedup = 1.144 x
  16384 elts    (196608 B): SW gather 0.650 ns/elt  HW gather 0.568 ns/elt  HW speedup = 1.143 x
  32768 elts    (393216 B): SW gather 0.777 ns/elt  HW gather 0.782 ns/elt  HW speedup = 0.994 x
  65536 elts    (786432 B): SW gather 0.975 ns/elt  HW gather 0.991 ns/elt  HW speedup = 0.984 x
 131072 elts   (1572864 B): SW gather 1.323 ns/elt  HW gather 1.362 ns/elt  HW speedup = 0.971 x
 262144 elts   (3145728 B): SW gather 1.526 ns/elt  HW gather 1.539 ns/elt  HW speedup = 0.992 x
 524288 elts   (6291456 B): SW gather 1.790 ns/elt  HW gather 1.835 ns/elt  HW speedup = 0.975 x
1048576 elts  (12582912 B): SW gather 2.186 ns/elt  HW gather 2.250 ns/elt  HW speedup = 0.972 x
2097152 elts  (25165824 B): SW gather 4.056 ns/elt  HW gather 4.081 ns/elt  HW speedup = 0.994 x
4194304 elts  (50331648 B): SW gather 6.224 ns/elt  HW gather 6.236 ns/elt  HW speedup = 0.998 x
8388608 elts (100663296 B): SW gather 7.915 ns/elt  HW gather 7.919 ns/elt  HW speedup = 1.000 x
[/cpp]

the best speedup (1.14x) is with the total working set (input & output arrays + LUT) in the L2 cache

 

Configuration: Core i7 4770K @ 3.5 GHz (both core ratio and cache ratio fixed at 35 with bclock = 100.0 MHz) + DDR3-2400, HT disabled 

bronxzv
New Contributor II

I get better speedups after 8x unrolling (roughly the best unroll factor) since HW gather gets faster while SW gather is unchanged

Source code (partial):
[cpp]
void GatherTest(const float *lut, const float *src, float *dst, int n) 
{
#pragma unroll(8)
  for (int i=0; i<n; i+=8) Store(dst+i,Gather(lut,Trunc(OctoFloat(src+i))));
}
[/cpp]

The AVX path is not shown, it's way too complex after 8x unrolling...

AVX2 path:
[cpp]
.B4.7:                          ; Preds .B4.7 .B4.6
        vcvttps2dq ymm1, YMMWORD PTR [edx+ebx]                  ;139.53
        inc       ecx                                           ;139.3
        vmovdqa   ymm2, ymm0                                    ;139.40
        vgatherdps ymm3, YMMWORD PTR [edi+ymm1*4], ymm2         ;139.40
        vmovups   YMMWORD PTR [edx+eax], ymm3                   ;139.28
        vcvttps2dq ymm4, YMMWORD PTR [32+edx+ebx]               ;139.53
        vmovdqa   ymm5, ymm0                                    ;139.40
        vgatherdps ymm6, YMMWORD PTR [edi+ymm4*4], ymm5         ;139.40
        vmovups   YMMWORD PTR [32+edx+eax], ymm6                ;139.28
        vcvttps2dq ymm7, YMMWORD PTR [64+edx+ebx]               ;139.53
        vmovdqa   ymm1, ymm0                                    ;139.40
        vgatherdps ymm2, YMMWORD PTR [edi+ymm7*4], ymm1         ;139.40
        vmovups   YMMWORD PTR [64+edx+eax], ymm2                ;139.28
        vcvttps2dq ymm3, YMMWORD PTR [96+edx+ebx]               ;139.53
        vmovdqa   ymm4, ymm0                                    ;139.40
        vgatherdps ymm5, YMMWORD PTR [edi+ymm3*4], ymm4         ;139.40
        vmovups   YMMWORD PTR [96+edx+eax], ymm5                ;139.28
        vcvttps2dq ymm6, YMMWORD PTR [128+edx+ebx]              ;139.53
        vmovdqa   ymm1, ymm0                                    ;139.40
        vgatherdps ymm7, YMMWORD PTR [edi+ymm6*4], ymm1         ;139.40
        vmovups   YMMWORD PTR [128+edx+eax], ymm7               ;139.28
        vcvttps2dq ymm1, YMMWORD PTR [160+edx+ebx]              ;139.53
        vmovdqa   ymm2, ymm0                                    ;139.40
        vgatherdps ymm3, YMMWORD PTR [edi+ymm1*4], ymm2         ;139.40
        vmovups   YMMWORD PTR [160+edx+eax], ymm3               ;139.28
        vcvttps2dq ymm4, YMMWORD PTR [192+edx+ebx]              ;139.53
        vmovdqa   ymm5, ymm0                                    ;139.40
        vgatherdps ymm6, YMMWORD PTR [edi+ymm4*4], ymm5         ;139.40
        vmovups   YMMWORD PTR [192+edx+eax], ymm6               ;139.28
        vcvttps2dq ymm7, YMMWORD PTR [224+edx+ebx]              ;139.53
        vmovdqa   ymm1, ymm0                                    ;139.40
        vgatherdps ymm2, YMMWORD PTR [edi+ymm7*4], ymm1         ;139.40
        vmovups   YMMWORD PTR [224+edx+eax], ymm2               ;139.28
        add       edx, 256                                      ;139.3
        cmp       ecx, esi                                      ;139.3
        jb        .B4.7         ; Prob 99%                      ;139.3
[/cpp]

timings (single thread):

[cpp]
    128 elts      (1536 B): SW gather 0.570 ns/elt  HW gather 0.481 ns/elt  HW speedup = 1.185 x
    256 elts      (3072 B): SW gather 0.586 ns/elt  HW gather 0.456 ns/elt  HW speedup = 1.286 x
    512 elts      (6144 B): SW gather 0.571 ns/elt  HW gather 0.447 ns/elt  HW speedup = 1.276 x
   1024 elts     (12288 B): SW gather 0.560 ns/elt  HW gather 0.444 ns/elt  HW speedup = 1.263 x
   2048 elts     (24576 B): SW gather 0.634 ns/elt  HW gather 0.572 ns/elt  HW speedup = 1.109 x
   4096 elts     (49152 B): SW gather 0.718 ns/elt  HW gather 0.591 ns/elt  HW speedup = 1.216 x
   8192 elts     (98304 B): SW gather 0.695 ns/elt  HW gather 0.575 ns/elt  HW speedup = 1.209 x
  16384 elts    (196608 B): SW gather 0.696 ns/elt  HW gather 0.624 ns/elt  HW speedup = 1.116 x
  32768 elts    (393216 B): SW gather 0.740 ns/elt  HW gather 0.733 ns/elt  HW speedup = 1.011 x
  65536 elts    (786432 B): SW gather 1.066 ns/elt  HW gather 1.080 ns/elt  HW speedup = 0.987 x
 131072 elts   (1572864 B): SW gather 1.351 ns/elt  HW gather 1.357 ns/elt  HW speedup = 0.996 x
 262144 elts   (3145728 B): SW gather 1.539 ns/elt  HW gather 1.542 ns/elt  HW speedup = 0.998 x
 524288 elts   (6291456 B): SW gather 1.714 ns/elt  HW gather 1.767 ns/elt  HW speedup = 0.970 x
1048576 elts  (12582912 B): SW gather 2.195 ns/elt  HW gather 2.103 ns/elt  HW speedup = 1.044 x
2097152 elts  (25165824 B): SW gather 4.057 ns/elt  HW gather 4.017 ns/elt  HW speedup = 1.010 x
4194304 elts  (50331648 B): SW gather 6.220 ns/elt  HW gather 6.203 ns/elt  HW speedup = 1.003 x
8388608 elts (100663296 B): SW gather 7.909 ns/elt  HW gather 7.908 ns/elt  HW speedup = 1.000 x
[/cpp]

the best speedup is now 1.26-1.28x, when the whole working set fits in the L1D cache

I also tested this code under VTune and I don't get the same performance warnings (e.g. "Machine Clears") as with my full project; there are simply a lot of assists (Filled Pipeline Slots -> Retiring -> Assists: 0.092), which looks normal since gather is implemented as microcode at the moment

 

Configuration: Core i7 4770K @ 3.5 GHz (both core ratio and cache ratio fixed at 35 with bclock = 100.0 MHz) + DDR3-2400, HT disabled

bronxzv
New Contributor II

jan v. wrote:
Clearing the destination register caused some extra speedup.

interesting, I see that the Intel compiler does just that, see vxorps ymm2, ymm2, ymm2 in my example above (AFAIK this is a "zeroing idiom"), though it's omitted in the unrolled version for some reason
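in intrinsics terms this corresponds to seeding the masked gather form with an explicit zero; a minimal sketch (assuming the _mm256_mask_i32gather_ps form with an all-ones mask):

[cpp]
#include <immintrin.h>

// Gather with an explicitly zeroed destination: the zero src operand lets
// the compiler emit the vxorps zeroing idiom, breaking the dependency on
// whatever the destination register held before.
static inline __m256 GatherZeroDst(const float *lut, __m256i idx)
{
    const __m256 allOnes = _mm256_castsi256_ps(_mm256_set1_epi32(-1)); // full mask
    return _mm256_mask_i32gather_ps(_mm256_setzero_ps(), lut, idx, allOnes, 4);
}
[/cpp]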

bronxzv
New Contributor II

jan v. wrote:
I've been updating my software renderer, making use of AVX2 gather (see bottom web page). There is a speedup, of around 10%.

out of curiosity I tested your demo; keeping the default initial viewpoint I get these scores:

FQuake64.exe : 208 fps
FQuake64 AVX2.exe : 195 fps

so it looks like the AVX2 path is slower than the other one (presumably AVX?); how can I measure the 10% speedup you are referring to?

 

Configuration: Core i7 4770K @ 3.5 GHz (turbo up to 4.0 GHz) + DDR3-2400, HT enabled

SergeyKostrov
Valued Contributor II
>>...the best speedup (1.14x) is with the total working set (input & output arrays + LUT) in the L2 cache...

It actually means a 14% improvement for that kind of processing. If you ever looked at the Intel MKL Release Notes you could see improvement numbers like 2% or 3% for some functions (algorithms), and it makes a difference when large data sets need to be processed.
bronxzv
New Contributor II

Sergey Kostrov wrote:
It actually means 14% improvement for that kind of processing.

note that my goal was to find an upper bound for the use case most detrimental to legacy "software" gather, namely when it is implemented as a generic function equivalent to AVX2 hardware gather (start from a 256-bit vector of indices and store to a 256-bit destination). After further tests the best speedup I have reached so far is 1.29x (29%), example shown below (this code has no practical purpose):

AVX2 path (unrolled 8x):
[cpp]
.B6.7:                          ; Preds .B6.7 .B6.6
;;;   {
;;;     const OctoInt indices(Trunc(work));
        vcvttps2dq ymm3, ymm1                                   ;170.22
        inc       edx                                           ;168.3
;;;     checkSum ^= Gather(lut,indices);
        vmovdqa   ymm4, ymm2                                    ;171.13
        vgatherdps ymm5, YMMWORD PTR [esi+ymm3*4], ymm4         ;171.13
;;;     work = work * work;
        vmulps    ymm3, ymm1, ymm1                              ;172.19
        vxorps    ymm6, ymm0, ymm5                              ;171.10
        vcvttps2dq ymm0, ymm3                                   ;170.22
        vmovdqa   ymm1, ymm2                                    ;171.13
        vgatherdps ymm7, YMMWORD PTR [esi+ymm0*4], ymm1         ;171.13
        vxorps    ymm4, ymm6, ymm7                              ;171.10
        vmulps    ymm6, ymm3, ymm3                              ;172.19
        vcvttps2dq ymm0, ymm6                                   ;170.22
        vmovdqa   ymm1, ymm2                                    ;171.13
        vgatherdps ymm5, YMMWORD PTR [esi+ymm0*4], ymm1         ;171.13
        vxorps    ymm1, ymm4, ymm5                              ;171.10
        vmulps    ymm4, ymm6, ymm6                              ;172.19
        vcvttps2dq ymm7, ymm4                                   ;170.22
        vmovdqa   ymm0, ymm2                                    ;171.13
        vgatherdps ymm3, YMMWORD PTR [esi+ymm7*4], ymm0         ;171.13
        vxorps    ymm6, ymm1, ymm3                              ;171.10
        vmulps    ymm1, ymm4, ymm4                              ;172.19
        vcvttps2dq ymm5, ymm1                                   ;170.22
        vmovdqa   ymm0, ymm2                                    ;171.13
        vgatherdps ymm7, YMMWORD PTR [esi+ymm5*4], ymm0         ;171.13
        vxorps    ymm4, ymm6, ymm7                              ;171.10
        vmulps    ymm6, ymm1, ymm1                              ;172.19
        vcvttps2dq ymm0, ymm6                                   ;170.22
        vmovdqa   ymm3, ymm2                                    ;171.13
        vgatherdps ymm5, YMMWORD PTR [esi+ymm0*4], ymm3         ;171.13
        vxorps    ymm1, ymm4, ymm5                              ;171.10
        vmulps    ymm4, ymm6, ymm6                              ;172.19
        vcvttps2dq ymm7, ymm4                                   ;170.22
        vmovdqa   ymm0, ymm2                                    ;171.13
        vgatherdps ymm3, YMMWORD PTR [esi+ymm7*4], ymm0         ;171.13
        vxorps    ymm6, ymm1, ymm3                              ;171.10
        vmulps    ymm1, ymm4, ymm4                              ;172.19
        vcvttps2dq ymm5, ymm1                                   ;170.22
        vmulps    ymm1, ymm1, ymm1                              ;172.19
        vmovdqa   ymm0, ymm2                                    ;171.13
        vgatherdps ymm7, YMMWORD PTR [esi+ymm5*4], ymm0         ;171.13
        vxorps    ymm0, ymm6, ymm7                              ;171.10
        cmp       edx, eax                                      ;168.3
        jb        .B6.7         ; Prob 99%                      ;168.3
[/cpp] 
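For reference, the loop body visible in the ;;; source comments above maps to roughly this with raw intrinsics (my own mapping of the wrapper classes, not the original source; the XOR checksum keeps the gathers live):

[cpp]
#include <immintrin.h>

// Reconstruction of the benchmark kernel from the ;;; comments above.
__m256 GatherChecksum(const float *lut, __m256 work, int iters)
{
    __m256 checkSum = _mm256_setzero_ps();
    const __m256 mask = _mm256_castsi256_ps(_mm256_set1_epi32(-1));
    for (int i = 0; i < iters; ++i) {
        __m256i indices = _mm256_cvttps_epi32(work);                // Trunc(work)
        __m256 g = _mm256_mask_i32gather_ps(_mm256_setzero_ps(),
                                            lut, indices, mask, 4); // Gather(lut, indices)
        checkSum = _mm256_xor_ps(checkSum, g);                      // checkSum ^= ...
        work = _mm256_mul_ps(work, work);                           // work = work * work
    }
    return checkSum;
}
[/cpp]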

I still have to do more tests, for example the same kind of code in a multi-threaded application with HT enabled, but I suppose it answers c0d1f1ed's main question: it makes sense to consider AVX2 gather instructions in today's code (for some use cases) since there isn't always a performance regression

Sergey Kostrov wrote:
If you ever looked at the Intel MKL Release Notes you could see improvement numbers like 2% or 3% for some functions (algorithms).

I can't see any mention of gather in the latest MKL Release Notes (MKL 11.0 update 4), so you probably read it somewhere else; if you find it again please let me know

Sergey Kostrov wrote:
and it makes a difference when large data sets need to be processed.

actually it makes more of a difference with small data sets (the best speedup I reached, 29%, is with a 16 KiB read-only dataset entirely in the L1D cache); with truly big sets you are mostly LLC bandwidth bound, or worse, memory bound

capens__nicolas
New Contributor I

That's actually not bad at all! The worst case seems to be a 3% performance loss, which is negligible, especially since 14% or more can be gained. And with even more to be gained in future implementations, I see no reason to hold back on using gather.

bronxzv, was the performance regression you observed due to not clearing the destination register? Apparently that is required to break a dependency chain.

While the mask and blend functionality of gather may not seem very valuable from a software point of view, it is actually essential for handling interrupts at the hardware level. The instruction can keep the partial result in the destination register and track where it left off by updating the mask register, before both get stored to memory for a thread switch. The alternative would have been to discard the partial result and start the gather operation all over again when resuming, which could make it perform worse than an extract/insert sequence.
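In pseudo-C the architectural behavior is roughly the following (a scalar model of the documented operation, not a statement about the hardware implementation):

[cpp]
// Scalar model of vgatherdps semantics: each completed element clears its
// mask bit, so after an interrupt or fault the instruction can resume with
// only the still-set lanes outstanding instead of redoing the whole gather.
void GatherModel(float dst[8], const float *base, const int idx[8], int mask[8])
{
    for (int i = 0; i < 8; ++i)
        if (mask[i]) {
            dst[i] = base[idx[i]]; // may fault; earlier lanes keep their results
            mask[i] = 0;           // progress is recorded in the mask register
        }
}
[/cpp]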

capens__nicolas
New Contributor I

jan v. wrote:
I've been updating my software renderer, making use of AVX2 gather (see bottom web page). There is a speedup, of around 10%.

Does that include widening to 256-bit?

jan v. wrote:
It's faster than the DX10 version on HD4600, but only with an external PCIe3 GPU to write the software-rendered images to.

That is hugely impressive! Hopefully it helps Intel take CPU-GPU unification seriously. If the IGP were replaced with more CPU cores, software like yours would be faster overall than 'dedicated' hardware. It's much easier to program than a heterogeneous system, and there would be no limitations.

That just leaves power consumption as an issue. But that could be addressed by equipping each core with two clusters of 512-bit SIMD units running at half frequency, each dedicated to one thread. The loss of Hyper-Threading for SIMD operations (at the hardware level, not the software level) could be compensated for by executing AVX-1024 operations over two cycles, which would be more power efficient to boot.

jan_v_
New Contributor I

bronxzv wrote:

out of curiosity I tested your demo; keeping the default initial viewpoint I get these scores:

FQuake64.exe : 208 fps
FQuake64 AVX2.exe : 195 fps

so it looks like the AVX2 path is slower than the other one (presumably AVX?); how can I measure the 10% speedup you are referring to?

 

Configuration: Core i7 4770K @ 3.5 GHz (turbo up to 4.0 GHz) + DDR3-2400, HT enabled

I'm running on: Core i7 4770K @ 3.5 GHz (turbo does 3.9 GHz for all cores) + DDR3-1600, HT enabled, Win 7, at 2560x1440 resolution

FQuake64.exe : 295 fps            --> actually this is an SSE2 version, so 128-bit
FQuake64 AVX2.exe : 324 fps   --> 256-bit

That is with the screen attached to a discrete GPU supporting PCIe3 (8 GB/s), an AMD 7970.

With the IGP displaying and the CPU rendering, I'm only getting a mere 100 fps; some buffer copying must be going on, eating all the memory bandwidth. The IGP in DX10 does 260 fps.

jan_v_
New Contributor I

c0d1f1ed wrote:

Does that include widening to 256-bit?

That is hugely impressive! Hopefully it helps Intel take CPU-GPU unification seriously. If the IGP were replaced with more CPU cores, software like yours would be faster overall than 'dedicated' hardware. It's much easier to program than a heterogeneous system, and there would be no limitations.

Yes, 256-bit.

I'm winning here mainly because the type of rendering I'm doing is very bandwidth efficient. Given more memory bandwidth, like the integrated 128 MB cache, the IGP would be faster. And indeed it uses some 50 W more.

bronxzv
New Contributor II

jan v. wrote:
I'm running on: Core i7 4770K @ 3.5 GHz (turbo does 3.9 GHz for all cores) + DDR3-1600, HT enabled, Win 7, at 2560x1440 resolution

FYI, I tested at 1600x1200 with the 4770K iGPU. Btw, it looks like there is a big serial portion of code in your demo at each frame (thus low overall CPU usage, less than 85% on my machine; in other words you are leaving more than 15% of the performance potential on the table at the moment). It's probably when you copy the rendered frames to the front buffer. I'd advise doing that in parallel with the rendering of the next frame (triple buffering), so that copying the rendered frames is concurrent with rendering. Typically one thread is enough to saturate the PCIe bandwidth, so a very sensible arrangement is an 8-thread pool for rendering plus a single extra thread for copying the final frames.

EDIT: I measured the time to copy a 32-bit 1600x1200 frame with Direct2D and the iGPU (CPU at 3.9 GHz fixed) and it takes 3.28 ± 0.02 ms (executing in parallel with 8 running threads, > 99% overall CPU usage), i.e. the hard limit is ~300 fps if copying the rendered frames overlaps with rendering the next frames (which my own engine does, btw)
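a minimal skeleton of the arrangement I mean (illustrative only, with placeholder Render/Copy calls and two back buffers in addition to the front buffer):

[cpp]
#include <condition_variable>
#include <mutex>

// Illustrative pipeline: the render pool works on frame N+1 while a single
// presenter thread copies frame N, alternating between two back buffers.
struct FramePipeline {
    std::mutex m;
    std::condition_variable cv;
    int rendered = -1, presented = -1;   // last frame rendered / copied

    void RenderLoop() {                  // feeds the 8-thread render pool
        for (int f = 0; ; ++f) {
            { std::unique_lock<std::mutex> lk(m);
              cv.wait(lk, [&]{ return presented >= f - 2; }); } // back buffer free?
            // RenderFrame(f % 2);       // placeholder: rasterize into buffer f%2
            { std::lock_guard<std::mutex> lk(m); rendered = f; }
            cv.notify_all();
        }
    }
    void PresentLoop() {                 // the single extra copy thread
        for (int f = 0; ; ++f) {
            { std::unique_lock<std::mutex> lk(m);
              cv.wait(lk, [&]{ return rendered >= f; }); }
            // CopyToFrontBuffer(f % 2); // placeholder: overlaps the next render
            { std::lock_guard<std::mutex> lk(m); presented = f; }
            cv.notify_all();
        }
    }
};
[/cpp]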

jan v. wrote:
FQuake64.exe : 295 fps            --> actually this is an SSE2 version, so 128-bit
FQuake64 AVX2.exe : 324 fps   --> 256-bit

ah! so your 10% is not the gain from gather in isolation. Now, a 10% speedup looks quite low for going from SSE2 to AVX2; you can probably get more from moving from 128-bit to 256-bit floats and ints + FMA + gather. I'm getting around 45% speedup for my own texture mapping code (using 256-bit unpack and shuffle + FMA, but neither gather nor generic permute at the moment)

jan_v_
New Contributor I

bronxzv wrote:

FYI, I tested at 1600x1200 with the 4770K iGPU. Btw, it looks like there is a big serial portion of code in your demo at each frame (thus low overall CPU usage, less than 85% on my machine; in other words you are leaving more than 15% of the performance potential on the table at the moment). It's probably when you copy the rendered frames to the front buffer. I'd advise doing that in parallel with the rendering of the next frame (triple buffering), so that copying the rendered frames is concurrent with rendering. Typically one thread is enough to saturate the PCIe bandwidth, so a very sensible arrangement is an 8-thread pool for rendering plus a single extra thread for copying the final frames.

Indeed there is still a serial part that can be parallelized. It's the geometry processing and setup for parallel rendering of screen blocks. The renderer doesn't do any frame copying; that would take too much bandwidth anyhow.

bronxzv wrote:

ah! so you are not comparing the gain from gather in isolation; btw the speedup looks quite low from SSE2 to AVX2. I'm getting around 45% speedup for my own texture mapping code (using 256-bit unpack and shuffle + FMA, but neither gather nor generic permute at the moment)

The speedup is low because the SSE2 perspective-correct bilinear interpolation is already so efficient that it completely overlaps with memory reads. AVX2 does not make the math faster, as it's not the bottleneck. The bottleneck is texel fetching, and that is where gather brings a modest improvement.
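Roughly, the texel fetch for 8 pixels becomes something like this with AVX2 (a sketch with assumed names and layout, not my actual renderer code): four gathers fetch the bilinear neighbors in one go instead of 32 scalar loads.

[cpp]
#include <immintrin.h>

// Fetch the four bilinear neighbors for 8 pixels at once; texels is a
// 32-bit texture with 'pitch' texels per row, u/v hold integer coordinates.
static inline void FetchBilinearNeighbors(const int *texels, int pitch,
                                          __m256i u, __m256i v, __m256i out[4])
{
    const __m256i one = _mm256_set1_epi32(1);
    __m256i row = _mm256_mullo_epi32(v, _mm256_set1_epi32(pitch));
    __m256i i00 = _mm256_add_epi32(row, u);                                 // top-left
    __m256i i10 = _mm256_add_epi32(i00, _mm256_set1_epi32(pitch));          // bottom-left
    out[0] = _mm256_i32gather_epi32(texels, i00, 4);
    out[1] = _mm256_i32gather_epi32(texels, _mm256_add_epi32(i00, one), 4); // top-right
    out[2] = _mm256_i32gather_epi32(texels, i10, 4);
    out[3] = _mm256_i32gather_epi32(texels, _mm256_add_epi32(i10, one), 4); // bottom-right
}
[/cpp]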

bronxzv
New Contributor II

jan v. wrote:
The speedup is low because the SSE2 perspective-correct bilinear interpolation is already so efficient that it completely overlaps with memory reads. AVX2 does not make the math faster, as it's not the bottleneck. The bottleneck is texel fetching, and that is where gather brings a modest improvement.

your textures look so low-res and your mesh / BSP tree so simple that it's strange that memory bandwidth is an issue. Do you know the size of your working set for each frame? Also, if you are memory bandwidth bound, or even LLC bandwidth bound, there is no way gather buys you a 10% speedup (based on my microbenchmark results above in this thread)

EDIT: I just tested with VTune and your demo does indeed appear to max out memory bandwidth; for example FQuake64 AVX2.exe uses a very steady (when looking around the room) 19 GB/s aggregate bandwidth (10.2 GB/s average read bandwidth) with spikes at 20.8 GB/s, and it never fell below 17.8 GB/s in a ~30 s test run

jan_v_
New Contributor I

bronxzv wrote:

your textures look so low-res and your mesh so simple that it's strange that memory bandwidth is an issue. Do you know the size of your working set for each frame? Also, if you are memory bandwidth bound, or even LLC bandwidth bound, there is no way gather buys you a 10% speedup

The texture and geometry are from a game of decades ago. They are even 8-bit with a palette, but I turn them into 32-bit for rendering (except for the sky and mud, where I do byte gathering instead of int, plus an extra gather through the palette). The other textures are actually larger than you would think, as all texture on the walls/floors is unique. They are blended from a material texture and a light texture into a third texture buffer that is used for rendering.

By memory reads I mean, from a software point of view, any instruction with a memory operand. I'm reasonably sure all texture seen in a frame fits in the L3 cache, and the L1D hit rate should be pretty high, so most texel fetches will hit the L1 cache. That must be about the best-case scenario for gather.

bronxzv
New Contributor II

jan v. wrote:
The texture and geometry, are from a game of decades ago. They are even 8 bit with palette, but I turn them in 32 bit for rendering (except for the sky and mud, where I do byte gathering instead of int, and an extra gather through the palette). The other textures are acutally larger than you would think, as all texure on the walls/floor is unique. They are blended from a material and light texture into a third texture buffer that is used for rendering.

ah, that makes sense, so it's indeed well bigger than I expected; I was thinking that you were doing two texture passes per sample (low-res tiled albedo + light map). EDIT: I see that it's clearly explained on your site that you do a single pass, though it's not mentioned how often you update the LRU cache with the composited texture

please note that I edited my previous message with bandwidth measurements of your demo
