NStoj
Beginner

Optimizing application for L3 cache access on Intel Xeon D


Hello,

We are running a bare-metal application on an Intel Xeon D-1548, in an isolated CPU core, under the Jailhouse hypervisor.
The issue we are observing with our application is very poor L3 cache utilization and poor latency.

Measuring LONGEST_LAT_CACHE.REFERENCE and LONGEST_LAT_CACHE.MISS with the PMCs, we can see that the two values are equal. Does this mean that we have 100% L3 cache misses?
 

/**********************************************************************************************************/
// pointer shmem points to a 256MB physically contiguous DDR memory segment, mapped CACHED into the application.
/**********************************************************************************************************/

#define X86_AS_WRITE(addr, value)  \
    (*(volatile float *)(shmem +  0xc00000 + addr) = (value))

#define X86_VIO_READ_FLOAT(addr)  \
    (*(volatile float *)(shmem +  addr))

 
static double vars[1024] __attribute__((aligned(64)));
static inline void test_fnc(void){
    int i;

    for(i=0; i<1024; i++){
        X86_AS_WRITE(0x10000 + i*4, vars[i]);
    }

    for(i=0; i<1024; i++){
        vars[i] = X86_VIO_READ_FLOAT(0x9600000 + i*4);
    }
}

And here we observe an L3_REFERENCE/L3_MISS ratio of 1.

We are trying to understand why we see this behavior. Executing test_fnc() above takes the CPU 4696 cycles at 1.9 GHz (4696 / 1.9 GHz ≈ 2.5 us), which seems absurd just for writing to and reading from memory.

Can somebody help us debug further where the issue could be?

Many thanks in advance!

Br,

Nikola


14 Replies
McCalpinJohn
Black Belt

It looks like a single invocation of test_fnc():

  • copies 4KiB from a local (normal system memory) address range and writes it to a portion of the special range, then
  • copies 4KiB from a different part of the special range and overwrites the (just-used) local 4KiB range.

For this single invocation, neither portion of the special range is reused, while the local 4KiB "vars[]" array should be re-used in L1 cache.

If you are describing a scenario with multiple calls to test_fnc() with the same addresses, then one would expect re-use of the addresses in the special range from the L1 cache as well -- since only 3 distinct 4KiB ranges are being accessed.

I would double-check the assembly code and I would time the writing and reading portions of the function separately.   I would also compare the absolute performance counter values with the expected values -- ratios can be misleading in many cases....
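As an editorial aside, the split timing suggested above could be sketched like this (assuming GCC on x86; `cycles_for` and `demo_body` are hypothetical helpers, with the demo body standing in for the write or read loop of test_fnc()):

```c
#include <stdint.h>
#include <x86intrin.h>

/* Read the TSC via RDTSCP, which waits for prior instructions to
   execute before sampling the counter. */
static inline uint64_t tsc_now(void)
{
    unsigned aux;
    return __rdtscp(&aux);
}

/* Time one region in isolation. */
static uint64_t cycles_for(void (*body)(void))
{
    uint64_t t0 = tsc_now();
    body();
    uint64_t t1 = tsc_now();
    return t1 - t0;
}

/* Placeholder body: touches a small buffer, standing in for one of
   the two loops in test_fnc(). */
static int scratch[1024];
static void demo_body(void)
{
    for (int i = 0; i < 1024; i++)
        scratch[i] = i;
}
```

On a fixed-frequency part, dividing the per-loop cycle counts by the core clock (2.0 GHz here) converts them to nanoseconds directly.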

NStoj
Beginner

Hi John,

Thank you very much for the reply!

I understand what you are referring to, and I agree that we are not utilizing the cache well. But we are still trying to understand how to improve the system in this scenario.

I have added more detailed PMC measurements, which tell a slightly different story when executing the function above:


loop_cnt 80834000: average CPU cycles 4572, current CPU cycles 4576, worst CPU cycles 5584

****************************
L3Miss 66469
L3UnsharedHit 21
L2HitM 15
L2Hit 404371061
****************************
L2Miss total 66505
L2HitRatio 0.9998
L3HitRatio 0.5
****************************

We are trying to understand whether the latency observed above for this simple function is what we should expect from our system:

The Intel Xeon D-1548 runs at a constant 2 GHz, and our memory layout looks like this:

-memory
       description: System Memory
       physical id: 1
       slot: System board or motherboard
       size: 16GiB
     *-bank:0
          description: SODIMM DDR4 Synchronous 2400 MHz (0.4 ns)
          physical id: 0
          serial: #
          slot: Bottom DIMM
          size: 8GiB
          width: 72 bits
          clock: 2400MHz (0.4ns)
     *-bank:1
          description: SODIMM DDR4 Synchronous [empty]
          product: NO DIMM
          vendor: NO DIMM
          physical id: 1
          serial: NO DIMM
          slot: Top DIMM Upper
     *-bank:2
          description: SODIMM DDR4 Synchronous 2400 MHz (0.4 ns)
          physical id: 2
          serial: #
          slot: Top DIMM Lower
          size: 8GiB
          width: 72 bits
          clock: 2400MHz (0.4ns)
  *-cache:0
       description: L1 cache
       physical id: 1c
       slot: CPU Internal L1
       size: 512KiB
       capacity: 512KiB
       capabilities: synchronous internal write-back
       configuration: level=1
  *-cache:1
       description: L2 cache
       physical id: 1d
       slot: CPU Internal L2
       size: 2MiB
       capacity: 2MiB
       capabilities: synchronous internal write-back unified
       configuration: level=2
  *-cache:2
       description: L3 cache
       physical id: 2
       slot: CPU Internal L3
       size: 12MiB
       capacity: 12MiB
       capabilities: synchronous internal write-back unified
       configuration: level=3

 

Thank you!

 

Br,

Nikola

McCalpinJohn
Black Belt

If the number of calls is 80834000 and the number of L2 hits is 404371061, then there are only 5 L2 accesses per call.  This only makes sense if the "vars[]" array is cached in L1 and the accesses to the special region are either cached in the L1 or not accessed via the caches at all.

The average cycle count of the function call is large enough that there would be negligible overhead in adding RDTSCP() calls to measure the "write to special memory" and "read from special memory" loops separately.  If the transfers are by cache line, then the 4096 bytes written (or read) correspond to 64 cache lines.  The average time of 4572 cycles is 2286 ns, or 36 ns per cacheline (considering either the reads or the writes independently).  This would be the expected performance if the hardware latency is ~71 ns and the mode of operation supports two concurrent cacheline transfers.  Understanding how the time is split across the two loops would help narrow cases to consider for the next step.

Speaking from personal experience, I should note that it is very easy to get confused by the interaction of MTRRs and page-level caching controls, and to end up with a mode that is not what you expected. 

NStoj
Beginner

Thanks John for the reply!

Hm, could you elaborate on how you figured out that there are only 5 L2 accesses per call?

Is there a way to check whether the hardware latency is indeed ~71 ns? I have enabled the adjacent cache line prefetcher, so in theory it could load two cache lines at the same time.

I have added rdtscp around the read and write loops...

loop_cnt  87868000: average CPU cycles 4976, current CPU cycle 4980, worst CPU cycle 5976

****************************
L3Miss 72391
L3UnsharedHit 11
L2HitM 16
L2Hit 441554595
****************************
L2Miss 72418
L2HitRatio 0.9998
L3HitRatio 0.3
****************************
Read loop, worst CPU cycles  2176
Write loop, worst CPU cycles  2764
****************************
Thanks!

McCalpinJohn
Black Belt

I am just guessing that I understand your output, but L2Hit / loop_cnt = 441554595/87868000 = 5.025....

You can check the hardware latency with the Intel Memory Latency checker.  If run as root, it will disable the HW prefetchers for the latency test.
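If MLC turns out to be unusable in the Jailhouse environment, the idea behind its idle-latency test (a dependent-load pointer chase through a buffer larger than L3) can be sketched directly. This is an illustration of the technique, not MLC's actual implementation, and with the HW prefetchers left enabled it will underestimate the true latency:

```c
#include <stdint.h>
#include <stdlib.h>
#include <x86intrin.h>

#define NLINES (1 << 18)   /* 2^18 64-byte lines = 16 MiB, larger than the 12 MiB L3 */

struct line { struct line *next; char pad[56]; };  /* exactly one cache line */

static struct line chain[NLINES] __attribute__((aligned(64)));
static volatile struct line *sink;                 /* keeps the chase live */

/* Average cycles per dependent load across one random cyclic permutation
   of all cache lines. Each load's address depends on the previous load,
   so the loads cannot overlap. */
static double chase_latency_cycles(void)
{
    static size_t idx[NLINES];
    for (size_t i = 0; i < NLINES; i++) idx[i] = i;
    for (size_t i = NLINES - 1; i > 0; i--) {      /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (size_t i = 0; i < NLINES; i++)            /* link one big cycle */
        chain[idx[i]].next = &chain[idx[(i + 1) % NLINES]];

    struct line *p = &chain[idx[0]];
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    for (size_t i = 0; i < NLINES; i++)            /* dependent-load chase */
        p = p->next;
    uint64_t t1 = __rdtscp(&aux);
    sink = p;                                      /* defeat dead-code elimination */
    return (double)(t1 - t0) / NLINES;
}
```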

Next steps might include counters for the total number of loads (MEM_UOPS_RETIRED.ALL_LOADS, Event 0xD0, Umask 0x81), the total number of stores (MEM_UOPS_RETIRED.ALL_STORES, Event 0xD0, Umask 0x82), and the number of L1 Data Cache fills (L1D.REPLACEMENT, Event 0x51, Umask 0x01).  If these numbers make sense, I would add L1D_PEND_MISS.PENDING (Event 0x48, Umask 0x01) and L1D_PEND_MISS.CYCLES (Event 0x48, Umask 0x01, Cmask=0x1).  (Unfortunately the latter event runs on Counter 2 only, so it will take 2 runs to get both values.)  The ratio L1D_PEND_MISS.PENDING/L1D_PEND_MISS.CYCLES gives the average number of concurrent L1D misses, while L1D_PEND_MISS.PENDING/L1D.REPLACEMENT should give the average latency to service the L1D misses. 

It might be easier to understand what is happening if the "vars[]" array is replaced by a scalar -- then all of the memory references will be to the "special" range.
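As a supplementary sketch, the raw IA32_PERFEVTSELx encodings for the events above can be built as follows (assuming the standard layout from the SDM, with counting enabled in both user and kernel mode; the names in the comments follow the Broadwell event list):

```c
#include <stdint.h>

/* IA32_PERFEVTSELx layout: bits 7:0 event select, bits 15:8 unit mask,
   bit 16 USR, bit 17 OS, bit 22 EN, bits 31:24 counter mask (CMASK). */
#define PERFEVTSEL(event, umask, cmask)                     \
    ((uint64_t)(event)         |                            \
     ((uint64_t)(umask) << 8)  |                            \
     (1ULL << 16) | (1ULL << 17) | (1ULL << 22) |           \
     ((uint64_t)(cmask) << 24))

static const uint64_t EV_ALL_LOADS   = PERFEVTSEL(0xD0, 0x81, 0); /* MEM_UOPS_RETIRED.ALL_LOADS  */
static const uint64_t EV_ALL_STORES  = PERFEVTSEL(0xD0, 0x82, 0); /* MEM_UOPS_RETIRED.ALL_STORES */
static const uint64_t EV_L1D_REPL    = PERFEVTSEL(0x51, 0x01, 0); /* L1D.REPLACEMENT             */
static const uint64_t EV_PEND        = PERFEVTSEL(0x48, 0x01, 0); /* L1D_PEND_MISS.PENDING       */
static const uint64_t EV_PEND_CYCLES = PERFEVTSEL(0x48, 0x01, 1); /* ...PENDING with CMASK=1     */
```

Each value would be written to one of the IA32_PERFEVTSELx MSRs, with the corresponding IA32_PMCx read back afterwards; in a bare-metal inmate that is a WRMSR/RDMSR pair.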

NStoj
Beginner

Thanks John for the reply!

I couldn't run Intel's MLC... Calling ./mlc gave "No such file or directory". All related dependencies on the Linux side are satisfied, so I'm not sure what I'm missing here.

 

With added:

  • MEM_LOAD_UOPS_RETIRED.ALL_LOADS 
  • MEM_LOAD_UOPS_RETIRED.ALL_STORES
  • L1D.REPLACEMENT

Values are:

loop_cnt 40000000: average CPU cycles 5215, current CPU cycles 5212, worst CPU cycles 5940
****************************
MEM_LOAD_UOPS_RETIRED.ALL_LOADS    245316941668
MEM_LOAD_UOPS_RETIRED.ALL_STORES   280194449
L1D.REPLACEMENT                    960664821
****************************
Read loop, worst 2276
Write loop, worst 3004

------------------------------------------------------------------------------------------------------------------------------------

With added:

  •  L1D_PEND_MISS.PENDING

Values are:

loop_cnt  40000000: average CPU cycles 5054, current CPU cycles 5056, worst CPU cycles 6096
****************************
L1D_PEND_MISS.PENDING               4379118702
****************************
Read loop, worst 2108
Write loop, worst 2840

------------------------------------------------------------------------------------------------------------------------------------

With added:

  •  L1D_PEND_MISS.CYCLES

loop_cnt 40000000: average CPU cycles 5114, current CPU cycles 5116, worst CPU cycles 6032
****************************
L1D_PEND_MISS.CYCLES                 4988182140
****************************
Read loop, worst 2108
Write loop, worst 2900

 

The ratio L1D_PEND_MISS.PENDING/L1D_PEND_MISS.CYCLES:

4379118702/4988182140  = 0.8778

 

The ratio  L1D_PEND_MISS.PENDING/L1D.REPLACEMENT

4379118702/960664821 = 4.5584

 

Moving forward with the "vars[]" array replaced by a scalar. This gives "faster" loops, as expected, since the scalar variable went into a CPU register.

loop_cnt 10000000: average CPU cycles 1874, current CPU cycles 1876, worst CPU cycles 2772
L3Miss 13567
L3UnsharedHit 6
L2HitM 24
L2Hit 58329343
****************************
L2Miss 13597
L2HitRatio 0.9997
L3HitRatio 0.22
****************************

Read loop, worst 1168
Write loop, worst 608
 

So this is around 1.3 us versus 3 us... Hm. It looks like our global "vars[]" array costs the application quite a bit of latency.

Br,

Nikola

McCalpinJohn
Black Belt

Maybe I have not had enough coffee yet, but I am a bit confused by these numbers....

The total number of loads works out to 6133 per call (245316941668/40000000), or almost exactly 6 loads per array element.   You are only loading two array variables, so I would guess that the compiler is inserting redundant loads of pointers and loop counters because the optimization level is low or because it is not sure about aliasing....

The total number of L1 fills is almost exactly 24 per call (960664821/40000000), which is much higher than the number of L2 accesses in any of the cases. 

The number of stores is almost exactly 7 per call (280194449/40000000), which makes no sense at all, unless the compiler optimization is very high and it has decided that repeated calls to this routine are redundant (which should not happen with the "volatile" keywords, but it is hard to be sure how compilers "think")....

The L1 D miss pending counters suggest that there is negligible concurrency (which I expected, because I don't think that your "special" memory region is actually mapped as cacheable), but the average L1D_PEND_MISS.CYCLES is 4988182140/40000000 = 124.7, which is a very small fraction of the 5114 cycle average execution time for the test.

The last two numbers provide some more hints, but I am struggling to create a consistent model.  If the "special" memory region is mapped as Write-Through (WT) or Write-Protect (WP), then reads from the "special" region can be cached, while writes must be written through the caches.  Write combining is explicitly allowed in WT mode, while Section 11.3 of Volume 3 of the SWDM is silent on this topic for WP mode (but I recall that AMD allows write combining in WP mode as well).  (WT updates store targets in the caches while writing through to memory, while WP invalidates store targets in the caches while writing through to memory.)   If the "special" memory region is mapped as Write-Combining (WC), then reads are uncached, but are allowed to be speculative, and writes are combined (if possible) in write-combining buffers (and invalidate store targets in any caches).   Enforcement of ordering between accesses to WB regions (e.g., the "vars[]" array, or any redundant loads of pointers or loop indices) and accesses to regions of other memory types is one of the possible mechanisms for the very slow performance you are seeing....

One problem that makes a lot of this hard to understand is that Intel does not document how most of the performance counters operate with memory references to regions that are not ordinary Write-Back (WB) system memory.   Some counter events will increment on any "access" (whether to the cache tags or the cache data), and others will only increment on data accesses.  With uncertain memory type, there are too many possible interpretations for me to keep in my head....

One easy test would be to replace the references to the "special" range with references to another part of ordinary write-back memory.  This should give behavior that is easier to understand, and may help with the interpretation of the performance counters.....
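As an editorial footnote on checking how the region is mapped: the page-table side is governed by the IA32_PAT MSR (0x277), whose eight byte-wide entries name the memory types selectable via the PAT/PCD/PWT page-table bits (the MTRRs then combine with this to give the effective type). Below is a sketch of decoding a raw PAT value; reading the MSR itself requires ring 0 (e.g. rdmsr from msr-tools, or a RDMSR-capable inmate):

```c
#include <stdint.h>
#include <stdio.h>

/* Memory-type encodings used in IA32_PAT entries (low 3 bits of each byte). */
static const char *pat_type_name(uint8_t entry)
{
    switch (entry & 0x7) {
    case 0:  return "UC";   /* uncacheable                   */
    case 1:  return "WC";   /* write-combining               */
    case 4:  return "WT";   /* write-through                 */
    case 5:  return "WP";   /* write-protect                 */
    case 6:  return "WB";   /* write-back                    */
    case 7:  return "UC-";  /* uncacheable, MTRR-overridable */
    default: return "reserved";
    }
}

/* Print all eight PAT entries of a raw IA32_PAT value. */
static void dump_pat(uint64_t pat)
{
    for (int i = 0; i < 8; i++)
        printf("PAT%d = %s\n", i, pat_type_name((uint8_t)(pat >> (8 * i))));
}
```

For example, the power-on default value 0x0007040600070406 decodes to WB, WT, UC-, UC repeated twice.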

NStoj
Beginner

Thanks John for the reply!

Hm, I am not sure how to understand these numbers myself either.

Compiler optimizations that we use are:

-O3 -funroll-loops -fomit-frame-pointer -finline -ftree-loop-distribution -floop-interchange -freorder-blocks-and-partition

-march=native -mtune=native -msse4 -m64 -mmmx -msse -msse2 -msse3 -mssse3 -mavx

 

Hm, I am not sure how to check how the "special region" is mapped (WT, WP, or WC)?

Replacing "special region" with some other type of "var[]" gave:

static volatile double vars_ram[1024] __attribute__((aligned(64)));
static volatile double vars_ram2[1024] __attribute__((aligned(64))); 
static volatile double vars[1024] __attribute__((aligned(64)));

static inline void test_fnc(void){
    int i;

    for(i=0; i<1024; i++){
        vars_ram[i] = vars[i]; // special region segment A is replaced by vars_ram
    }

    for(i=0; i<1024; i++){
        vars[i] = vars_ram2[i]; // special region segment B is replaced by vars_ram2
    }
}

Here are the PMCs:

 

lcnt 5000000: average cycles 2800, current cycles 2764, worst cycles 3608
L3Miss 10327
L3UnsharedHit 46
L2HitM 21
L2Hit 54904340
****************************
L2Miss 10394
L2HitRatio 0.9998
L3HitRatio 0.64
****************************
Read loop, worst cycles 1616
Write loop, worst cycles 1208


lcnt 10000000: average cycles 2791, current cycles 2812, worst cycles 3608
L3Miss 12005
L3UnsharedHit 46
L2HitM 21
L2Hit 111617422
****************************
L2Miss 12072
L2HitRatio 0.9998
L3HitRatio 0.55
****************************
Read loop, worst cycles 1616
Write loop, worst cycles 1192

The L3HitRatio seems to drop as the loop count grows....
-----------------------------------------------------------------------------------------------------------


lcnt 5000000: average cycles 2654, current cycles 2616, worst cycles 2908
****************************
MEM_LOAD_UOPS_RETIRED.ALL_LOADS       32979675244
MEM_LOAD_UOPS_RETIRED.ALL_STORES     55750279
L1D.REPLACEMENT 177309920
****************************
Read loop, worst cycles 1488
Write loop, worst cycles 1220



lcnt 10000000: average cycles 2632, current cycles 2616, worst cycles 2908
****************************
MEM_LOAD_UOPS_RETIRED.ALL_LOADS        64858435324
MEM_LOAD_UOPS_RETIRED.ALL_STORES     107687733
L1D.REPLACEMENT 342310164
****************************
Read loop, worst cycles 1488
Write loop, worst cycles 1220

-----------------------------------------------------------------------------------------------------------

lcnt 5000000: average cycles 2682, current cycles 2716, worst cycles 2860
****************************
L1D_PEND_MISS.PENDING 898013718
****************************
Read loop, worst cycles 1452
Write loop, worst cycles 1224


lcnt 10000000: average cycles 2686, current cycles 2692, worst cycles 2860
****************************
L1D_PEND_MISS.PENDING 1783915455
****************************
Read loop, worst cycles 1452
Write loop, worst cycles 1220

-----------------------------------------------------------------------------------------------------------

lcnt 5000000: average cycles 2699, current cycles 2704, worst cycles 2880
****************************
L1D_PEND_MISS.CYCLES 725733523
****************************
Read loop, worst cycles 1492
Write loop, worst cycles 1220


lcnt 10000000: average cycles 2702, current cycles 2740, worst cycles 2884
****************************
L1D_PEND_MISS.CYCLES 1460849861
****************************
Read loop, worst cycles 1492
Write loop, worst cycles 1212

-----------------------------------------------------------------------------------------------------------

lcnt 5000000
The ratio L1D_PEND_MISS.PENDING/L1D_PEND_MISS.CYCLES:
898013718/725733523   = 1.2373

lcnt 10000000
The ratio L1D_PEND_MISS.PENDING/L1D_PEND_MISS.CYCLES:
1783915455/1460849861 = 1.2211
 

lcnt 5000000
The ratio  L1D_PEND_MISS.PENDING/L1D.REPLACEMENT
898013718/177309920   = 5.0646

lcnt 10000000
The ratio  L1D_PEND_MISS.PENDING/L1D.REPLACEMENT
1783915455/342310164  = 5.2114

 

Thanks again John for helping out!

Br,

Nikola

 

 

McCalpinJohn
Black Belt

Since these are simple copy loops, they would benefit from SIMD vectorization.   GCC is capable of vectorization at optimization level O3, but the "volatile" keywords probably prevent it from happening....

With scalar loads and stores, each of the two loops will have 1024 32-bit loads and 1024 32-bit stores, with an execution time of 1024 cycles each, for 2048 cycles total (assuming all L1 DCache hits).  Your best times are not too far above this lower bound.

With 256-bit loads and stores, each loop will have 128 256-bit loads and 128 256-bit stores, with an execution time of 128 cycles each, or 256 cycles total (again assuming all L1 DCache hits).   Eliminating the "volatile" keywords in the "vars_ram" version of the code should allow the compiler to vectorize these loops.   (I would also add "-mavx2" for your processor, but the existing "-mavx" option should be enough to enable 256-bit register use for these copy loops.)
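A hand-vectorized version of the copy (roughly what GCC should emit once volatile is removed) might look like the sketch below, using AVX intrinsics. The `src_buf`/`dst_buf` names are stand-ins for the vars arrays, and the target attribute stands in for building with -mavx:

```c
#include <immintrin.h>

#define N 1024

static double src_buf[N] __attribute__((aligned(64)));
static double dst_buf[N] __attribute__((aligned(64)));

/* Copy n doubles in 32-byte chunks: one vmovapd load plus one vmovapd
   store move 4 doubles, so a 1024-element loop becomes 256 iterations. */
__attribute__((target("avx")))
static void copy_avx(double *dst, const double *src, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m256d v = _mm256_load_pd(src + i);   /* aligned 32-byte load  */
        _mm256_store_pd(dst + i, v);           /* aligned 32-byte store */
    }
}
```

Both pointers must stay 32-byte aligned for the aligned load/store forms; the 64-byte alignment attribute on the arrays guarantees that here.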

NStoj
Beginner

Thanks John for the reply!

I've removed the volatile qualifiers, per your suggestion in the previous reply. No need for volatile here.

I see, so it could be that the times observed are actually the limit of the system...

So overall, access to our "special" region is not that much different from the latency observed here:

Access to our "special" region:
loop_cnt 10000000: average CPU cycles 1874, current CPU cycles 1876, worst CPU cycles 2772
L3Miss 13567
L3UnsharedHit 6
L2HitM 24
L2Hit 58329343
****************************
L2Miss 13597
L2HitRatio 0.9997
L3HitRatio 0.22
****************************

Read loop, worst 1168
Write loop, worst 608

------------------------------------------------------------------------------------------------------------------------

Access with ordinary global arrays (the "vars_ram[]" version):
lcnt 5000000: average cycles 2800, current cycles 2764, worst cycles 3608
L3Miss 10327
L3UnsharedHit 46
L2HitM 21
L2Hit 54904340
****************************
L2Miss 10394
L2HitRatio 0.9998
L3HitRatio 0.64
****************************
Read loop, worst cycles 1616
Write loop, worst cycles 1208

Br,

Nikola

McCalpinJohn
Black Belt

Have you looked at the inner loops of the assembly code?   Inspection of those loops should clarify the instruction count and cache access count, as well as possibly revealing other surprises....

NStoj
Beginner

Thanks John for the reply!

static double vars_ram[1024] __attribute__((aligned(64)));
static double vars_ram2[1024] __attribute__((aligned(64)));
static double vars[1024] __attribute__((aligned(64)));

static inline void test_fnc(void){
    int i;

    for(i=0; i<1024; i++){
        vars_ram[i] = vars[i]; // special region segment A is replaced by vars_ram
    }

    for(i=0; i<1024; i++){
        vars[i] = vars_ram2[i]; // special region segment B is replaced by vars_ram2
    }
}

This is the assembly code between the two NOPs (inserted for easier tracing).

    a4c8:	90                   	nop
    a4c9:	ba 00 20 00 00       	mov    $0x2000,%edx
    a4ce:	be 80 7b 01 00       	mov    $0x17b80,%esi
    a4d3:	bf 80 3b 01 00       	mov    $0x13b80,%edi
    a4d8:	e8 5c a1 ff ff       	callq  4639 <memcpy>
    a4dd:	ba 00 20 00 00       	mov    $0x2000,%edx
    a4e2:	be 80 5b 01 00       	mov    $0x15b80,%esi
    a4e7:	bf 80 7b 01 00       	mov    $0x17b80,%edi
    a4ec:	e8 48 a1 ff ff       	callq  4639 <memcpy>
    a4f1:	90                   	nop

I don't see anything unusual here. Right?

Br,

Nikola

McCalpinJohn
Black Belt

Nothing unusual, but with a call to an external routine it is hard to tell what is actually happening....   I think that the best (?) way to prevent GCC from replacing the loops with calls to memcpy is to change -ftree-loop-distribute-patterns to -fno-tree-loop-distribute-patterns.   It might not make much difference to performance, but you should be able to see exactly what code is run.


NStoj
Beginner

Thanks John for the reply!

Adding -fno-tree-loop-distribute-patterns produced the following code:

    a4d8:	90                      nop
    a4d9:	31 c9                	xor    %ecx,%ecx
    a4db:	c5 7d 28 99 80 7b 01 	vmovapd 0x17b80(%rcx),%ymm11
    a4e2:	00 
    a4e3:	c5 7d 28 a1 a0 7b 01 	vmovapd 0x17ba0(%rcx),%ymm12
    a4ea:	00 
    a4eb:	c5 7d 28 a9 c0 7b 01 	vmovapd 0x17bc0(%rcx),%ymm13
    a4f2:	00 
    a4f3:	c5 7d 28 b1 e0 7b 01 	vmovapd 0x17be0(%rcx),%ymm14
    a4fa:	00 
    a4fb:	c5 7d 28 b9 00 7c 01 	vmovapd 0x17c00(%rcx),%ymm15
    a502:	00 
    a503:	c5 fd 28 81 20 7c 01 	vmovapd 0x17c20(%rcx),%ymm0
    a50a:	00 
    a50b:	c5 fd 28 99 40 7c 01 	vmovapd 0x17c40(%rcx),%ymm3
    a512:	00 
    a513:	c5 fd 28 89 60 7c 01 	vmovapd 0x17c60(%rcx),%ymm1
    a51a:	00 
    a51b:	c5 7d 29 99 80 3b 01 	vmovapd %ymm11,0x13b80(%rcx)
    a522:	00 
    a523:	c5 7d 29 a1 a0 3b 01 	vmovapd %ymm12,0x13ba0(%rcx)
    a52a:	00 
    a52b:	c5 7d 29 a9 c0 3b 01 	vmovapd %ymm13,0x13bc0(%rcx)
    a532:	00 
    a533:	c5 7d 29 b1 e0 3b 01 	vmovapd %ymm14,0x13be0(%rcx)
    a53a:	00 
    a53b:	c5 7d 29 b9 00 3c 01 	vmovapd %ymm15,0x13c00(%rcx)
    a542:	00 
    a543:	c5 fd 29 81 20 3c 01 	vmovapd %ymm0,0x13c20(%rcx)
    a54a:	00 
    a54b:	c5 fd 29 99 40 3c 01 	vmovapd %ymm3,0x13c40(%rcx)
    a552:	00 
    a553:	c5 fd 29 89 60 3c 01 	vmovapd %ymm1,0x13c60(%rcx)
    a55a:	00 
    a55b:	48 81 c1 00 01 00 00 	add    $0x100,%rcx
    a562:	48 81 f9 00 20 00 00 	cmp    $0x2000,%rcx
    a569:	0f 85 6c ff ff ff    	jne    a4db <inmate_main+0x4cb>
    a56f:	45 31 c0             	xor    %r8d,%r8d
    a572:	c4 c1 7d 28 90 80 5b 	vmovapd 0x15b80(%r8),%ymm2
    a579:	01 00 
    a57b:	c4 c1 7d 28 a0 a0 5b 	vmovapd 0x15ba0(%r8),%ymm4
    a582:	01 00 
    a584:	c4 c1 7d 28 a8 c0 5b 	vmovapd 0x15bc0(%r8),%ymm5
    a58b:	01 00 
    a58d:	c4 c1 7d 28 b0 e0 5b 	vmovapd 0x15be0(%r8),%ymm6
    a594:	01 00 
    a596:	c4 c1 7d 28 b8 00 5c 	vmovapd 0x15c00(%r8),%ymm7
    a59d:	01 00 
    a59f:	c4 41 7d 28 80 20 5c 	vmovapd 0x15c20(%r8),%ymm8
    a5a6:	01 00 
    a5a8:	c4 41 7d 28 88 40 5c 	vmovapd 0x15c40(%r8),%ymm9
    a5af:	01 00 
    a5b1:	c4 41 7d 28 90 60 5c 	vmovapd 0x15c60(%r8),%ymm10
    a5b8:	01 00 
    a5ba:	c4 c1 7d 29 90 80 7b 	vmovapd %ymm2,0x17b80(%r8)
    a5c1:	01 00 
    a5c3:	c4 c1 7d 29 a0 a0 7b 	vmovapd %ymm4,0x17ba0(%r8)
    a5ca:	01 00 
    a5cc:	c4 c1 7d 29 a8 c0 7b 	vmovapd %ymm5,0x17bc0(%r8)
    a5d3:	01 00 
    a5d5:	c4 c1 7d 29 b0 e0 7b 	vmovapd %ymm6,0x17be0(%r8)
    a5dc:	01 00 
    a5de:	c4 c1 7d 29 b8 00 7c 	vmovapd %ymm7,0x17c00(%r8)
    a5e5:	01 00 
    a5e7:	c4 41 7d 29 80 20 7c 	vmovapd %ymm8,0x17c20(%r8)
    a5ee:	01 00 
    a5f0:	c4 41 7d 29 88 40 7c 	vmovapd %ymm9,0x17c40(%r8)
    a5f7:	01 00 
    a5f9:	c4 41 7d 29 90 60 7c 	vmovapd %ymm10,0x17c60(%r8)
    a600:	01 00 
    a602:	49 81 c0 00 01 00 00 	add    $0x100,%r8
    a609:	49 81 f8 00 20 00 00 	cmp    $0x2000,%r8
    a610:	0f 85 5c ff ff ff    	jne    a572 <inmate_main+0x562>
    a616:	90                      nop

 

This made latency much better!

lcnt 5000000: average cycles 948, current cycles 1032, worst cycles 1408
L3Miss 10071
L3UnsharedHit 8
L2HitM 22
L2Hit 75445841
****************************
L2Miss 10101
L2HitRatio 0.9998
L3HitRatio 0.29
****************************
Read loop, worst cycles 732
Write loop, worst cycles 292
 

Br,

Nikola
