Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Last-Level-Cache Hits/Misses

Hilmar_A_
Beginner

Hi,

I recently started writing my master's thesis, which is about cache profiling with performance counters. In this context I discovered some strange results (reproducible on two different CPUs) that I don't understand. The two CPUs are:

family 06H / model 45H   (Intel i5-4300U, Haswell)
family 06H / model 3CH   (Intel i5-4670,  Haswell)


The profiled code is nothing more than the following C code:

#include <inttypes.h>
#include <stdlib.h>
#include <stdio.h>
#include "perf.h"

#define DATA_BLOCK 24*1024*1024

int main (int argc, char * argv[]) {
    uint8_t * arr = (uint8_t *)malloc(DATA_BLOCK);
    uint32_t i;

    /* first loop: initial write (first touch of the freshly allocated pages) */
    for (i=0; i<DATA_BLOCK; i++)
        *(arr+i) = i;

    /* second loop: write the same data again */
    for (i=0; i<DATA_BLOCK; i++)
        *(arr+i) = i;

    /* third loop: read-only sum */
    uint8_t sum = 0;
    for (i=0; i<DATA_BLOCK; i++)
        sum += *(arr+i);

    return 0;
}

The counters are switched on and off directly with rdmsr/wrmsr from a kernel module and are read with rdpmc. I am measuring last-level-cache events with event code B7H and umask 01H (OFFCORE_RESPONSE_0), using the following eight combinations in the auxiliary MSR 01A6H (MSR_OFFCORE_RSP_0):

(0) DMND_DATA_RD | (L3_HITM | L3_HITE | L3_HITS) | (all snoop response bits set)
(1) DMND_DATA_RD | (L4_HIT_LOCAL_L4 | L4_HIT_REMOTE_HOP0_L4 | L4_HIT_REMOTE_HOP1_L4 | L4_HIT_REMOTE_HOP2P_L4) | (all snoop response bits set)
(2) DMND_RFO | (L3_HITM | L3_HITE | L3_HITS) | (all snoop response bits set)
(3) DMND_RFO | (L4_HIT_LOCAL_L4 | L4_HIT_REMOTE_HOP0_L4 | L4_HIT_REMOTE_HOP1_L4 | L4_HIT_REMOTE_HOP2P_L4) | (all snoop response bits set)
(4) PF_DATA_RD | (L3_HITM | L3_HITE | L3_HITS) | (all snoop response bits set)
(5) PF_DATA_RD | (L4_HIT_LOCAL_L4 | L4_HIT_REMOTE_HOP0_L4 | L4_HIT_REMOTE_HOP1_L4 | L4_HIT_REMOTE_HOP2P_L4) | (all snoop response bits set)
(6) PF_RFO | (L3_HITM | L3_HITE | L3_HITS) | (all snoop response bits set)
(7) PF_RFO | (L4_HIT_LOCAL_L4 | L4_HIT_REMOTE_HOP0_L4 | L4_HIT_REMOTE_HOP1_L4 | L4_HIT_REMOTE_HOP2P_L4) | (all snoop response bits set)

(These are essentially the same events that perf measures as LLC-loads, LLC-load-misses, and so on.)
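For completeness, here is a stripped-down sketch of what this setup roughly looks like (not my actual kernel-module code; this version writes the MSRs through /dev/cpu/0/msr instead, and the register addresses follow the SDM):

#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

#define IA32_PERFEVTSEL0   0x186
#define IA32_PMC0          0x0C1
#define MSR_OFFCORE_RSP_0  0x1A6   /* request/response filter for event B7H / umask 01H */

/* Write an MSR on CPU 0 via the msr driver (my real setup does this in a
 * kernel module instead, but the register values are the same). */
static void wrmsr_cpu0(uint32_t msr, uint64_t value)
{
    int fd = open("/dev/cpu/0/msr", O_WRONLY);
    if (fd >= 0) {
        pwrite(fd, &value, sizeof(value), msr);
        close(fd);
    }
}

/* Program OFFCORE_RESPONSE_0 on counter 0 with one of the eight response
 * masks listed above (assumes IA32_PERF_GLOBAL_CTRL already enables PMC0). */
static void setup_offcore_response(uint64_t response_mask)
{
    wrmsr_cpu0(IA32_PMC0, 0);                      /* clear the counter */
    wrmsr_cpu0(MSR_OFFCORE_RSP_0, response_mask);  /* request type + response/snoop bits */
    /* event B7H, umask 01H, USR (bit 16), EN (bit 22) */
    wrmsr_cpu0(IA32_PERFEVTSEL0, 0xB7 | (0x01 << 8) | (1ULL << 16) | (1ULL << 22));
}

/* Read PMC0 directly with RDPMC (user-mode RDPMC must be allowed, CR4.PCE). */
static inline uint64_t read_pmc(uint32_t counter)
{
    uint32_t lo, hi;
    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(counter));
    return ((uint64_t)hi << 32) | lo;
}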

As far as I understand it, one could translate these eight events as:

(0) L2 read-lookup for a cacheline in L3 ~> L3 hit occurs
(1) L2 read-lookup for a cacheline in L3 ~> L3 miss occurs
(2) L2-prefetcher read-lookup for a cacheline in L3 ~> L3 hit occurs
(3) L2-prefetcher read-lookup for a cacheline in L3 ~> L3 miss occurs
(4) L2 write-lookup for a cacheline in L3 ~> L3 hit occurs
(5) L2 write-lookup for a cacheline in L3 ~> L3 miss occurs
(6) L2-prefetcher write-lookup for a cacheline in L3 ~> L3 hit occurs
(7) L2-prefetcher write-lookup for a cacheline in L3 ~> L3 miss occurs

For my example above, I would now expect, for each loop (with a 64-byte cache line):
(24*1024*1024)/64 = 393216 lookups in the L3, most of them requested by the L2 prefetcher and MOST OF THEM MISSING IN THE L3.

This is exactly what I measure for the second and third loops, but in the first loop I get a lot of HITS IN THE L3. The exact results are:

first loop:

(0) 9539
(1) 4070
(2) 6098
(3) 3734
(4) 17818
(5) 26210
(6) 248019
(7) 110017

second loop:

(0) 6499
(1) 258
(2) 7824
(3) 1047
(4) 660
(5) 24843
(6) 2156
(7) 369703

third loop:

(0) 3873
(1) 21724
(2) 5412
(3) 378818
(4) 420
(5) 18
(6) 1362
(7) 190


With a data block of twice the size (48*1024*1024), basically all the numbers double as well. On another CPU I could not reproduce this; there the results were as I would expect. (The architecture was not Haswell in that case, but I can no longer remember the exact CPU model.)

Is there any important detail that I am missing?

Kind regards,
Hilmar Ackermann

McCalpinJohn
Honored Contributor III

I did some similar testing on a Xeon E5 v3 with the performance counters programmed outside the program and then read inline before and after each loop.

NOTE that the "translations" above for events 2&3 correspond to the original events 4&5, and vice-versa.   I can't tell if the numerical results above follow the first or second list of the events.

 

My results for the 1st loop are not easy to understand because the OS has to create the page mappings for each page here.  This appears to involve black magic.  Fortunately I don't have any real codes whose performance is dominated by OS page instantiation, so I ignore these results....

I had to modify the code to prevent the compiler from eliminating the second loop.   After the modification, I got reasonable results for the second loop, but the code is a bit different.   I changed the assignment to an update, so the data will be read first, rather than generating an RFO.   The total number of events matched the expected number of lines.  (I used a bigger array to overflow the larger cache in my Xeon E5-2690 v3.)  
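A sketch of the modified loops (not my exact code): the second loop becomes an update, so each line is read before it is written, and printing the sum keeps the compiler from discarding the loops:

/* second loop changed from an assignment to an update: each cache line is
 * now read by a demand load before being modified, rather than generating an RFO */
for (i = 0; i < DATA_BLOCK; i++)
    arr[i] += 1;

/* third loop unchanged, but the result is consumed so the compiler
 * cannot eliminate the reads */
uint8_t sum = 0;
for (i = 0; i < DATA_BLOCK; i++)
    sum += arr[i];
printf("sum = %u\n", (unsigned) sum);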

The results for the 3rd loop also add up to the expected total value.  

The table below shows the results divided by the number of cache lines in the array:

                      read hit  read miss  rfo hit  rfo miss  pf_rd hit  pf_rd miss  pf_rfo hit  pf_rfo miss  Total
Loop 1 -- initialize      1%        0%        0%       3%         0%          0%          0%           0%       5%
Loop 2 -- update         57%       28%        0%       0%         1%         14%          0%           0%     100%
Loop 3 -- sum            62%       30%        0%       0%         1%          8%          0%           0%     101%

The results show that the hardware prefetcher is effective at bringing the data into the L3 cache, with 57% to 62% of the data being found in the L3 (despite actually coming from memory). 

Comparing the Request Type fields for Sandy Bridge and Haswell shows that the Haswell does not include counts for the "L2 HW Prefetch to L3" event type.   A comment in https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/520331 suggests that this is due to a bug in the Haswell performance counters.   I am not sure if the "L2 HW Prefetch to L3" is a subset of the "PF_DATA_RD" event, or if the "PF_DATA_RD" event only counts "L2 HW Prefetch to L2" events.   Figuring out the actual sequence of events given the limited documentation of the implementation and the limited trustworthiness of the performance counters is a challenge....

McCalpinJohn
Honored Contributor III

The results I posted suggest that the OFFCORE_RESPONSE event PF_DATA_RD is only counting L2 HW prefetches that bring data into the L2.  There are also L2 HW prefetches that bring the data into the L3, but Haswell does not provide an OFFCORE_RESPONSE event to count these explicitly.   They are still visible implicitly, since they are the reason that 57% to 62% of the data is found in the L3 (DMND_DATA_RD.LLC_HIT).

So to summarize the results of loop 2:

  • 14% of the data is brought into the L2 cache by L2 HW Prefetches to L2  (PF_DATA_RD).  For these lines there is no L2 cache miss, but the OFFCORE_RESPONSE counter captures the data motion due to the HW prefetch.
  • 57% of the data is brought into the L3 cache by L2 HW Prefetches to L3.  For these lines the L2 Demand Read Miss hits in the L3.  DMND_DATA_RD.LLC_HIT.*
  • 28% of the data is not prefetched, so the L2 Demand Read Miss also misses in the L3 and goes to memory.  DMND_DATA_RD.LLC_MISS.
  • None of the writebacks in loop 2 are directly visible.  (This makes sense -- the counter is for OFFCORE_RESPONSE events, and a Writeback is not a "response" from the offcore -- it is a unidirectional transaction sent from the core to the uncore.)

Loop 3 is similar, but with a slightly lower rate of L2 HW Prefetches to L2 (8% vs 14%), and a slightly higher rate of L2 HW Prefetches to L3 (62% vs 57%). 

On the Xeon E5 v3 parts these events can also be measured in the Uncore using the CBo counters.   The Haswell Core i3/i5/i7 parts also have some uncore counters, but I have not looked at what insight they might provide....

Hilmar_A_
Beginner

Hi,

First of all: thank you very much! That helped me a lot in understanding what's going on.

I did some additional measurements (with the same example), but with three differences:

1) I did not set the OS flag, so kernel-mode (ring 0) activity is not counted.
2) I measured every combination of enabled/disabled prefetchers.
3) The results are normalized to the expected number of LLC cache misses.

The columns (x direction) are the same as in your example. The rows (y direction) specify which prefetchers are disabled (1 = disabled); from left to right the bits are:
DCU IP prefetcher, DCU prefetcher, L2 adjacent cache line prefetcher, L2 hardware prefetcher
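For reference, a sketch of how these bits map onto the prefetcher-control MSR that my kernel module writes (the bit assignment follows Intel's hardware prefetcher control disclosure; wrmsr_cpu0() stands for the same MSR-write helper as in the earlier sketch):

#include <stdint.h>

/* MSR 0x1A4 hardware prefetcher control: setting a bit DISABLES the
 * corresponding prefetcher. */
#define MSR_PREFETCH_CTRL        0x1A4
#define L2_HW_PF_DISABLE         (1ULL << 0)
#define L2_ADJACENT_PF_DISABLE   (1ULL << 1)
#define DCU_PF_DISABLE           (1ULL << 2)
#define DCU_IP_PF_DISABLE        (1ULL << 3)

void wrmsr_cpu0(uint32_t msr, uint64_t value);   /* MSR write on the measured core */

/* Example: table row "0011" (DCU IP = 0, DCU = 0, L2 adjacent = 1, L2 HW = 1)
 * disables both L2 prefetchers. */
static void disable_l2_prefetchers(void)
{
    wrmsr_cpu0(MSR_PREFETCH_CTRL, L2_ADJACENT_PF_DISABLE | L2_HW_PF_DISABLE);
}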

First write loop
==============


          | r hit  | r miss | r p hit|r p miss| w hit  | w miss | w p hit|w p miss| total
------------------------------------------------------------------------------------------
     0000 | 0.043% | 0.032% | 0.024% | 0.008% | 1.917% | 1.897% | 58.05% | 30.26% | 92.24% 
     0001 | 0.035% | 0.031% | 0.018% | 0.007% | 37.93% | 11.17% | 27.13% | 13.55% | 89.89% 
     0010 | 0.036% | 0.033% | 7.629% | 0.013% | 2.476% | 1.777% | 59.04% | 30.22% | 93.60% 
     0011 | 0.033% | 0.030% |   0.0% |   0.0% | 70.37% | 19.94% |   0.0% |   0.0% | 90.38% 
     0100 | 0.034% | 0.032% | 0.017% | 0.016% | 2.238% | 1.944% | 58.27% | 30.39% | 92.95% 
     0101 | 0.035% | 0.032% | 0.017% | 0.006% | 37.84% | 11.13% | 27.14% | 13.61% | 89.83% 
     0110 | 0.032% | 0.032% | 0.015% | 0.005% | 1.996% | 1.415% | 58.06% | 30.13% | 91.69% 
     0111 | 0.032% | 0.030% |   0.0% |   0.0% | 71.04% | 19.78% |   0.0% |   0.0% | 90.88% 
     1000 | 0.040% | 0.031% | 0.012% | 0.014% | 2.351% | 1.789% | 58.97% | 28.14% | 91.35% 
     1001 | 0.036% | 0.032% | 0.013% | 0.006% | 38.62% | 11.26% | 26.91% | 14.93% | 91.82% 
     1010 | 0.043% | 0.030% | 0.001% | 0.005% | 2.487% | 1.802% | 56.86% | 27.09% | 88.33% 
     1011 | 0.042% | 0.030% |   0.0% |   0.0% | 71.87% | 18.99% |   0.0% |   0.0% | 90.95% 
     1100 | 0.039% | 0.031% | 0.020% | 0.012% | 1.909% | 1.690% | 57.41% | 31.31% | 92.44% 
     1101 | 0.032% | 0.031% | 0.016% | 0.005% | 37.35% | 12.61% | 27.37% | 15.80% | 93.23% 
     1110 | 0.032% | 0.031% | 7.629% | 0.006% | 2.168% | 1.747% | 55.82% | 32.16% | 91.97% 
     1111 | 0.035% | 0.030% |   0.0% |   0.0% | 72.25% | 20.06% |   0.0% |   0.0% | 92.38% 


Second write loop
==============


          | r hit  | r miss | r p hit|r p miss| w hit  | w miss | w p hit|w p miss| total
------------------------------------------------------------------------------------------
     0000 | 0.031% | 0.010% | 0.033% | 0.013% | 0.014% | 3.773% | 0.064% | 96.03% | 99.97% 
     0001 | 0.034% | 0.014% | 0.024% | 0.003% | 0.017% | 53.79% | 0.013% | 46.24% | 100.1% 
     0010 | 0.032% | 0.013% | 0.002% | 0.015% | 0.012% | 3.289% | 0.037% | 96.19% | 99.59% 
     0011 | 0.033% | 0.016% |   0.0% |   0.0% | 0.038% | 99.97% |   0.0% |   0.0% | 100.0% 
     0100 | 0.037% | 0.010% | 0.027% | 0.012% | 0.014% | 3.216% | 0.064% | 96.80% | 100.1% 
     0101 | 0.028% | 0.014% | 0.019% | 0.005% | 0.024% | 53.70% | 0.019% | 46.21% | 100.0% 
     0110 | 0.023% | 0.011% | 0.019% | 0.011% | 0.017% | 2.913% | 0.055% | 97.36% | 100.4% 
     0111 | 0.036% | 0.017% |   0.0% |   0.0% | 0.036% | 99.99% |   0.0% |   0.0% | 100.0% 
     1000 | 0.038% | 0.010% | 0.027% | 0.017% | 0.014% | 3.469% | 0.067% | 95.88% | 99.53% 
     1001 | 0.032% | 0.015% | 0.020% | 0.004% | 0.026% | 53.87% | 0.016% | 46.19% | 100.1% 
     1010 | 0.031% | 0.011% | 0.004% | 0.010% | 0.015% | 3.370% | 0.050% | 97.21% | 100.7% 
     1011 | 0.031% | 0.016% |   0.0% |   0.0% | 0.017% | 99.99% |   0.0% |   0.0% | 100.0% 
     1100 | 0.036% | 0.010% | 0.029% | 0.013% | 0.010% | 2.662% | 0.069% | 97.38% | 100.2% 
     1101 | 0.024% | 0.014% | 0.017% | 0.005% | 0.017% | 53.56% | 0.011% | 46.37% | 100.0% 
     1110 | 0.027% | 0.013% | 0.004% | 0.013% | 0.013% | 2.878% | 0.063% | 97.33% | 100.3% 
     1111 | 0.025% | 0.014% |   0.0% |   0.0% | 0.047% | 99.97% |   0.0% |   0.0% | 100.0% 


Read loop
==============


          | r hit  | r miss | r p hit|r p miss| w hit  | w miss | w p hit|w p miss| total
------------------------------------------------------------------------------------------
     0000 | 0.035% | 2.683% | 0.051% | 97.35% | 0.013% | 0.001% | 0.061% | 0.002% | 100.2% 
     0001 | 0.022% | 51.82% | 0.017% | 48.39% | 0.009% | 0.002% | 0.011% | 0.005% | 100.2% 
     0010 | 0.047% | 2.898% | 0.002% | 97.22% | 0.010% | 0.001% | 0.044% | 0.003% | 100.2% 
     0011 | 0.029% | 100.0% |   0.0% |   0.0% | 0.015% | 0.003% |   0.0% |   0.0% | 100.0% 
     0100 | 0.040% | 2.824% | 0.042% | 97.12% | 0.012% | 0.002% | 0.058% | 0.001% | 100.1% 
     0101 | 0.022% | 51.75% | 0.020% | 48.30% | 0.014% | 0.001% | 0.021% | 0.001% | 100.1% 
     0110 | 0.031% | 2.529% | 0.004% | 97.48% | 0.015% | 0.004% | 0.075% | 0.002% | 100.1% 
     0111 | 0.028% | 100.0% |   0.0% |   0.0% | 0.020% | 0.003% |   0.0% |   0.0% | 100.0% 
     1000 | 0.054% | 2.078% | 0.038% | 98.00% | 0.013% | 0.001% | 0.069% | 0.004% | 100.2% 
     1001 | 0.033% | 53.35% | 0.016% | 46.65% | 0.020% | 0.002% | 0.016% | 0.005% | 100.1% 
     1010 | 0.030% | 2.185% | 0.002% | 97.93% | 0.015% | 0.001% | 0.068% | 0.004% | 100.2% 
     1011 | 0.028% | 99.99% |   0.0% |   0.0% | 0.010% | 0.003% |   0.0% |   0.0% | 100.0% 
     1100 | 0.040% | 1.801% | 0.036% | 98.10% | 0.009% | 0.002% | 0.044% | 0.007% | 100.0% 
     1101 | 0.027% | 53.30% | 0.015% | 46.76% | 0.014% | 0.002% | 0.012% | 0.001% | 100.1% 
     1110 | 0.034% | 1.846% | 0.002% | 98.06% | 0.012% | 0.001% | 0.058% | 0.002% | 100.0% 
     1111 | 0.023% | 99.99% |   0.0% |   0.0% | 0.011% | 0.003% |   0.0% |   0.0% | 100.0% 


The ~10% of events missing in the first loop are probably, as you suggested, due to paging. It seems the L2 adjacent cache line prefetcher can do about 50% of the work of the L2 hardware prefetcher, but if only the L2 adjacent cache line prefetcher is disabled, there is barely any change.

OK, so next I will look into the uncore counters and try some measurements on non-Haswell architectures. Thank you for your help so far.

Kind regards,
Hilmar

McCalpinJohn
Honored Contributor III

It is interesting that the statistics are so different between your system (Haswell Core) and my system (Haswell Xeon) on the read loop (which should be pretty much the same in your code and mine).   I have not looked into the differences between the hardware prefetcher implementations in the client and server uncores, but it looks like the behavior is quite dissimilar.

I have also seen that the L2 adjacent line prefetcher does not seem to do very much when you have a contiguous access pattern.  It may be more useful for "filling in" pages that are accessed in a non-contiguous order.

Did you keep the performance data for these runs?   It would be interesting to see how close you got to saturating the DRAM interface in each of these cases -- running in the bandwidth-limited regime will result in increased latency & buffer occupancy, which may account for the difference in the behavior of the prefetchers....

 

GHui
Novice

I am also researching LLC misses. Can I get your tools and test code?

Any help will be appreciated.

 

Hilmar_A_
Beginner

Hi,

@John

I had not kept any further data, but I repeated the same measurements and this time also measured memory bandwidth. I don't know the best way to do this, but I read out the memory-controller requests (as suggested here: https://software.intel.com/en-us/articles/monitoring-integrated-memory-controller-requests-in-the-2nd-3rd-and-4th-generation-intel), then enabled the CPU_CLK_UNHALTED performance counter and calculated:

( [frequency_of_my_processor] * (UNC_IMC_DRAM_DATA_READS + UNC_IMC_DRAM_DATA_WRITES) * 64 Bytes ) / CPU_CLK_UNHALTED
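In code, that estimate looks roughly like this (a sketch; the counter values are assumed to have been read already, and the names are just placeholders):

/* Sketch of the bandwidth estimate above.  The three counter values are
 * assumed to have been read elsewhere; core_freq_hz is 1.9e9 in my case. */
double bandwidth_bytes_per_sec(uint64_t imc_data_reads,
                               uint64_t imc_data_writes,
                               uint64_t cpu_clk_unhalted,
                               double core_freq_hz)
{
    double seconds = (double) cpu_clk_unhalted / core_freq_hz;
    double bytes   = (double) (imc_data_reads + imc_data_writes) * 64.0;
    return bytes / seconds;   /* == freq * (reads + writes) * 64 / cycles */
}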

[frequency_of_my_processor] is 1.9 GHz in my case, and in the end this formula should give the memory bandwidth in bytes/sec. (I know this is not 100% exact, because I measure CPU_CLK_UNHALTED only on the logical core my benchmark runs on while the memory-controller counter is system-wide, but I think the error should not be that large.) So in the end I get the following bandwidths (in MB/sec; the columns are just the eight measurement runs, so in this context their values should be basically the same, yet there are large peaks, for example in the 7th row):

first write-loop
==============


          | r hit  | r miss | r p hit|r p miss| w hit  | w miss | w p hit|w p miss
------------------------------------------------------------------------------------------
     0000 | 785.16 | 787.19 | 791.84 | 782.74 | 787.49 | 793.27 | 782.72 | 787.14 |
     0001 | 770.97 | 772.27 | 773.21 | 773.04 | 781.78 | 776.60 | 773.07 | 772.61 |
     0010 | 785.56 | 782.52 | 787.02 | 783.96 | 782.08 | 785.33 | 786.52 | 780.93 |
     0011 | 765.03 | 762.36 | 764.66 | 764.01 | 767.25 | 791.07 | 799.45 | 811.29 |
     0100 | 836.36 | 855.48 | 899.15 | 915.55 | 870.24 | 887.07 | 946.76 | 970.00 |
     0101 | 965.43 | 964.17 | 971.66 | 964.92 | 958.39 | 963.48 | 968.12 | 969.10 |
     0110 | 982.04 | 978.17 | 1495.5 | 1382.3 | 976.27 | 987.10 | 971.04 | 1003.2 |
     0111 | 962.04 | 969.36 | 961.33 | 955.84 | 966.77 | 961.55 | 961.24 | 961.72 |
     1000 | 979.22 | 978.29 | 966.39 | 977.31 | 977.34 | 978.76 | 986.59 | 976.31 |
     1001 | 965.10 | 962.00 | 976.77 | 960.64 | 960.47 | 967.01 | 962.74 | 966.58 |
     1010 | 980.33 | 977.90 | 977.24 | 980.73 | 969.68 | 975.14 | 981.52 | 984.77 |
     1011 | 962.41 | 954.98 | 957.48 | 954.97 | 962.38 | 971.86 | 956.97 | 960.72 |
     1100 | 976.44 | 981.27 | 981.03 | 979.10 | 982.90 | 982.69 | 977.59 | 978.94 |
     1101 | 970.99 | 966.23 | 963.46 | 967.88 | 965.31 | 964.52 | 969.20 | 964.29 |
     1110 | 962.86 | 979.39 | 972.05 | 984.79 | 976.00 | 984.16 | 980.04 | 981.87 |
     1111 | 967.60 | 965.07 | 961.64 | 961.60 | 979.56 | 970.63 | 960.05 | 971.46 |


first read-loop
==============


          | r hit  | r miss | r p hit|r p miss| w hit  | w miss | w p hit|w p miss
------------------------------------------------------------------------------------------
     0000 | 750.56 | 746.97 | 753.99 | 742.64 | 744.14 | 755.83 | 743.93 | 746.78 |
     0001 | 735.44 | 738.80 | 734.81 | 744.78 | 736.19 | 735.70 | 738.80 | 734.33 |
     0010 | 749.03 | 745.36 | 745.54 | 747.16 | 745.72 | 745.66 | 748.44 | 747.01 |
     0011 | 731.42 | 731.25 | 730.94 | 733.38 | 731.10 | 769.21 | 769.79 | 782.18 |
     0100 | 804.45 | 825.73 | 872.47 | 897.04 | 834.14 | 852.84 | 917.87 | 939.49 |
     0101 | 931.38 | 931.89 | 931.67 | 934.75 | 960.04 | 933.80 | 940.00 | 932.70 |
     0110 | 939.81 | 941.11 | 1470.7 | 1398.1 | 935.43 | 946.40 | 938.27 | 945.11 |
     0111 | 930.62 | 929.68 | 927.73 | 926.35 | 934.82 | 928.09 | 933.38 | 926.23 |
     1000 | 939.06 | 939.08 | 935.46 | 939.59 | 944.13 | 937.42 | 941.63 | 939.01 |
     1001 | 931.90 | 932.97 | 933.89 | 931.58 | 934.68 | 937.89 | 938.34 | 940.22 |
     1010 | 943.17 | 939.18 | 932.83 | 941.98 | 940.10 | 941.24 | 940.44 | 941.44 |
     1011 | 927.74 | 927.95 | 928.77 | 928.86 | 928.25 | 941.98 | 932.29 | 929.30 |
     1100 | 939.16 | 972.93 | 939.61 | 940.12 | 945.31 | 948.24 | 940.04 | 947.60 |
     1101 | 941.93 | 934.19 | 932.48 | 932.66 | 930.67 | 931.97 | 930.67 | 949.08 |
     1110 | 939.21 | 939.03 | 940.97 | 940.50 | 936.55 | 942.98 | 938.11 | 950.23 |
     1111 | 931.34 | 933.78 | 929.01 | 931.55 | 938.30 | 938.35 | 942.54 | 938.89 |


second read-loop
==============


          | r hit  | r miss | r p hit|r p miss| w hit  | w miss | w p hit|w p miss
------------------------------------------------------------------------------------------
     0000 | 714.87 | 713.21 | 735.00 | 716.13 | 718.50 | 729.30 | 714.65 | 713.32 |
     0001 | 709.93 | 713.62 | 718.03 | 717.02 | 711.85 | 715.57 | 714.68 | 715.59 |
     0010 | 716.37 | 716.33 | 715.48 | 714.76 | 717.31 | 715.55 | 714.02 | 716.98 |
     0011 | 706.69 | 703.39 | 704.09 | 706.73 | 707.90 | 715.36 | 739.91 | 754.82 |
     0100 | 780.37 | 795.62 | 843.41 | 877.66 | 808.69 | 827.99 | 894.69 | 909.53 |
     0101 | 919.25 | 910.54 | 911.61 | 911.98 | 913.48 | 909.33 | 911.34 | 916.90 |
     0110 | 909.09 | 909.62 | 1475.2 | 1339.4 | 919.55 | 915.27 | 911.52 | 922.03 |
     0111 | 900.34 | 900.67 | 900.16 | 901.27 | 899.77 | 900.45 | 905.11 | 899.79 |
     1000 | 917.42 | 913.66 | 913.90 | 914.80 | 919.94 | 913.03 | 916.99 | 914.39 |
     1001 | 905.98 | 904.65 | 908.03 | 907.09 | 905.07 | 913.67 | 907.52 | 911.04 |
     1010 | 922.79 | 915.42 | 911.15 | 913.63 | 913.42 | 912.96 | 920.01 | 912.65 |
     1011 | 899.08 | 900.21 | 899.25 | 905.91 | 898.26 | 917.06 | 904.09 | 898.90 |
     1100 | 916.38 | 917.99 | 915.36 | 916.85 | 921.17 | 916.89 | 916.56 | 932.64 |
     1101 | 906.96 | 903.79 | 907.25 | 905.80 | 907.25 | 914.72 | 905.21 | 945.73 |
     1110 | 915.23 | 915.10 | 915.30 | 916.15 | 914.89 | 916.50 | 915.22 | 919.33 |
     1111 | 915.92 | 908.12 | 904.03 | 905.92 | 910.52 | 925.75 | 908.07 | 912.67 |

The interesting part now is to determine my maximum bandwidth. I googled a bit, but it seems there is no easy way to obtain that information. Most people suggested benchmarks, so I tried 'bandwidth' (http://zsmith.co/bandwidth.html); I think it is good enough to at least get an idea of what the maximum bandwidth could look like. On my computer I get, for example, the following output:

Random read (64-bit), size = 512 MB, loops = 1, 2355.9 MB/s

Random write (64-bit), size = 512 MB, loops = 1, 2660.0 MB/s

@GHui:

So basically I wrote a small, quick-and-dirty kernel module that activates the counters, enables/disables prefetchers, and calls wbinvd to reset the caches. The kernel module is accessed through a pseudo-file in /proc/[..], so the routines get executed on the right logical core (which of course means you can only use this to measure non-parallel code), and the program then reads the counters with rdpmc (which can be executed in user mode, at least as long as your Linux distribution allows it). For the examples above, I inserted the calls to these routines manually. The benchmark is then basically run with:

sudo insmod [Name of Kernel Module]
taskset 0x1 ./[program name]
sudo rmmod [Name of Kernel Module]
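To give an idea of the user-space side, here is a hypothetical sketch of the /proc interaction (the path and the command strings are placeholders, not my actual interface):

#include <stdio.h>

/* Send a command string to the (hypothetical) /proc file exported by the
 * kernel module; the module performs the wrmsr/wbinvd on the core that
 * executes this write, which is why the program is pinned with taskset. */
static void pmc_ctl(const char *cmd)
{
    FILE *f = fopen("/proc/pmc_ctl", "w");   /* placeholder path */
    if (f) {
        fputs(cmd, f);
        fclose(f);
    }
}

/* usage: pmc_ctl("start"); ...code under test...; pmc_ctl("stop"); */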

Secondly, I wrote a tool that inserts this code automatically, but it is currently a work in progress and, I assume, not yet usable by others. Still, I will publish it as open source at some point, once it is more complete.

If you are interested in the first part, just reply briefly and I can put together the relevant parts, with a simple Makefile and a simple runnable example included, and upload them here.

Kind regards,
Hilmar

 

 

McCalpinJohn
Honored Contributor III

I am confused by these bandwidth numbers....

Both the Intel Core i5-4300U and the Core i5-4670 have 2 DRAM channels that can run at up to DDR3/1600 rates.   Each channel is 8 Bytes wide, so each channel has a peak bandwidth of 12.8 GB/s and the two channels together have a peak bandwidth of 25.6 GB/s.  I have not tested this on a Haswell Core i5, but I seem to recall that I could reach asymptotic bandwidth with 1 thread on the Sandy Bridge Xeon E3.  On the Sandy Bridge system these values were in the range of 8 GB/s read + 8 GB/s write for the second loop and in the range of 16 GB/s read for the third loop, so I would expect ~10+10 GB/s and 20 GB/s on the Haswell Core i5 systems.

McCalpinJohn
Honored Contributor III

I ran a version of this program on a Haswell Core i7-4960HQ using several different compilers (running Mac OS X 10.10.5).

The system has 2 channels of DDR3/1600 memory, so the peak bandwidth is 25.6 GB/s.

I estimated the memory bandwidth by simply assuming that since the array is larger than the cache, all the data had to come from DRAM (and return to DRAM if written).
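A minimal sketch of that estimate for the update loop (the helper below is an illustration, not the code I actually ran; the factor of two accounts for the read plus the writeback when streaming stores are not used):

#include <stdint.h>
#include <stddef.h>
#include <sys/time.h>

/* Rough estimate in GB/s for one pass of the update loop: the array is much
 * larger than the caches, so every line is assumed to be read from DRAM and
 * written back (hence the factor of 2). */
double update_loop_bandwidth(uint8_t *arr, size_t n)
{
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (size_t i = 0; i < n; i++)
        arr[i] += 1;                         /* read + modify + write */
    gettimeofday(&t1, NULL);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) * 1e-6;
    return 2.0 * (double) n / sec / 1e9;     /* reads + writebacks */
}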

When compiled with "icc -xCORE-AVX2" I get:

  • Loop 1: ~2.2 GB/s
  • Loop 2: ~19.5 GB/s (uses streaming stores, so no DRAM reads, only writes)
  • Loop 3: ~14 GB/s

When compiled with "gcc -O3" I get:

  • Loop 1: ~2.7 GB/s
  • Loop 2: ~11 GB/s (does not use streaming stores, so there must be 11 GB/s of reads plus 11 GB/s of writebacks)
  • Loop 3: ~9.5 GB/s

When compiled with the default "cc" compiler (Apple LLVM version 7.0.2 (clang-700.1.81)), I get results very similar to the gcc results.
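For reference, the streaming-store behavior can be reproduced by hand with intrinsics; a simplified sketch (it stores a constant rather than the loop index, and assumes the buffer is 32-byte aligned):

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Fill the buffer with non-temporal (streaming) stores: the lines are
 * written straight to memory without first being read into the caches,
 * so the loop generates write traffic only. */
static void fill_streaming(uint8_t *arr, size_t n, uint8_t value)
{
    __m256i v = _mm256_set1_epi8((char) value);
    for (size_t i = 0; i + 32 <= n; i += 32)
        _mm256_stream_si256((__m256i *)(arr + i), v);
    _mm_sfence();   /* make the non-temporal stores globally visible */
}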
