Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

OS X: tuning micro-kernel parameters for Haswell 4980



I've used various CPUID programs and Intel's own tool, as well as grepping through the documentation, but I'm not sure what to look for.

Anyway, I'm building some threaded and OpenMP libraries for OS X 10.11 and looking to tune the micro-kernel constants.  One of the libraries is BLIS, and I need to find out the cache sizes.  The values below are for Sandy Bridge.

What are the definitions of the constants below for the Haswell i7 4980, as well as for FMA3?  What is the SIMD alignment size?  The heap stride alignment size?  The contiguous stride alignment size?


// -- LEVEL-3 MICRO-KERNEL CONSTANTS -------------------------------------------

// -- Cache blocksizes --

#define BLIS_SGEMM_UKERNEL         bli_sgemm_asm_8x8
#define BLIS_DEFAULT_MC_S          128
#define BLIS_DEFAULT_KC_S          384
#define BLIS_DEFAULT_NC_S          4096
#define BLIS_DEFAULT_MR_S          8
#define BLIS_DEFAULT_NR_S          8

#define BLIS_DGEMM_UKERNEL         bli_dgemm_asm_8x4
#define BLIS_DEFAULT_MC_D          96
#define BLIS_DEFAULT_KC_D          256
#define BLIS_DEFAULT_NC_D          4096
#define BLIS_DEFAULT_MR_D          8
#define BLIS_DEFAULT_NR_D          4

#define BLIS_CGEMM_UKERNEL         bli_cgemm_asm_8x4
#define BLIS_DEFAULT_MC_C          96
#define BLIS_DEFAULT_KC_C          256
#define BLIS_DEFAULT_NC_C          4096
#define BLIS_DEFAULT_MR_C          8
#define BLIS_DEFAULT_NR_C          4

#define BLIS_ZGEMM_UKERNEL         bli_zgemm_asm_4x4
#define BLIS_DEFAULT_MC_Z          64 
#define BLIS_DEFAULT_KC_Z          192
#define BLIS_DEFAULT_NC_Z          4096
#define BLIS_DEFAULT_MR_Z          4
#define BLIS_DEFAULT_NR_Z          4

// -- Register blocksizes --

#define BLIS_DEFAULT_MR_S              8
#define BLIS_DEFAULT_NR_S              4

#define BLIS_DEFAULT_MR_D              4
#define BLIS_DEFAULT_NR_D              6

#define BLIS_DEFAULT_MR_C              8
#define BLIS_DEFAULT_NR_C              4

#define BLIS_DEFAULT_MR_Z              8
#define BLIS_DEFAULT_NR_Z              4

// NOTE: If the micro-kernel, which is typically unrolled by a factor
// of f, handles leftover edge cases (i.e., when k % f > 0), then these
// register blocksizes in the k dimension can be defined as 1.

//#define BLIS_DEFAULT_KR_S              1
//#define BLIS_DEFAULT_KR_D              1
//#define BLIS_DEFAULT_KR_C              1
//#define BLIS_DEFAULT_KR_Z              1

// -- Maximum cache blocksizes (for optimizing edge cases) --

// NOTE: These cache blocksize "extensions" have the same constraints as
// the corresponding default blocksizes above. When these values are
// larger than the default blocksizes, blocksizes used at edge cases are
// enlarged if such an extension would encompass the remaining portion of
// the matrix dimension.





// -- Packing register blocksize (for packed micro-panels) --

// NOTE: These register blocksize "extensions" determine whether the
// leading dimensions used within the packed micro-panels are equal to
// or greater than their corresponding register blocksizes above.

//#define BLIS_PACKDIM_MR_S              (BLIS_DEFAULT_MR_S + ...)
//#define BLIS_PACKDIM_NR_S              (BLIS_DEFAULT_NR_S + ...)

//#define BLIS_PACKDIM_MR_D              (BLIS_DEFAULT_MR_D + ...)
//#define BLIS_PACKDIM_NR_D              (BLIS_DEFAULT_NR_D + ...)

//#define BLIS_PACKDIM_MR_C              (BLIS_DEFAULT_MR_C + ...)
//#define BLIS_PACKDIM_NR_C              (BLIS_DEFAULT_NR_C + ...)

//#define BLIS_PACKDIM_MR_Z              (BLIS_DEFAULT_MR_Z + ...)
//#define BLIS_PACKDIM_NR_Z              (BLIS_DEFAULT_NR_Z + ...)

Black Belt

I have not worked much with BLIS, but the cache sizes on Haswell processors are generally the same as on Sandy Bridge processors, and likely the same as on whatever processors the BLIS authors used for development. 

Loop blocking for linear algebra codes often has three levels: register blocking, L2 cache blocking, and L3 cache (or TLB) blocking. 

Unfortunately I can't tell how the names of the various blocking factors above map to the different indices of the underlying algorithm, so you are going to need to do either some more research or some experiments....



Some background information that may be helpful....

Register blocking for Haswell has to be more aggressive than for Sandy Bridge because FMA doubles the floating-point throughput, so you need more independent registers accumulating partial sums. 

For double-precision matrix multiplication written as:

for (i=0; i<N; i++) {
    for (j=0; j<N; j++) {
        for (k=0; k<N; k++) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}
my analysis suggests that the optimum register blocking is to unroll the "i" loop by 4, unroll the "j" loop by 12, and unroll the "k" loop by 4.   The number of partial sums is the product of the "i" and "j" unroll factors, but the "j" direction is vectorized (4 doubles per 256-bit AVX register), so this ends up using 12 (4*12/4) registers for partial sums.
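For what it's worth, that 4x12 register block can be sketched in plain C (illustrative only: `ukernel_4x12` is a made-up name, A and B are assumed packed into MRxKC and KCxNR panels as BLIS does, and a real kernel would keep the accumulators in ymm registers via AVX2 FMA intrinsics or assembly):

```c
/* Sketch of one 4x12 double-precision register block of C += A*B.
   Hypothetical function; assumes packed A (MR-major) and B (NR-major) panels. */
void ukernel_4x12(int K, const double *A, const double *B,
                  double *C, int ldc)
{
    double acc[4][12] = {{0.0}};         /* 48 doubles = 12 AVX registers */

    for (int k = 0; k < K; k++)          /* "k" loop (unrolled by 4 in asm) */
        for (int i = 0; i < 4; i++)      /* "i" unrolled by 4 */
            for (int j = 0; j < 12; j++) /* "j" unrolled by 12, vectorized */
                acc[i][j] += A[k * 4 + i] * B[k * 12 + j];

    for (int i = 0; i < 4; i++)          /* write the block back to C */
        for (int j = 0; j < 12; j++)
            C[i * ldc + j] += acc[i][j];
}
```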

Blocking for Level-1 cache is not usually done -- it is just too small to be useful.

Blocking for Level-2 cache should be the same on Haswell as on Sandy Bridge -- the cache is 256KiB in each case.  For matrix multiplication you want to hold blocks of each of the three matrices, so the maximum size of each block is about 104x104 double-precision elements.  Lots of codes round this down to 96x96.  Square is not necessarily the best answer -- 192x48 may provide better cache and TLB utilization.

Blocking for the Level-3 cache is not always done, but it is getting more common as memory hierarchies get deeper and more unbalanced.  Most codes are set up to have each core work on a completely separate block, so the issue is the amount of L3 cache per core.  The server processors generally have 2.5 MiB of L3 cache per core, but the Core i7-4980 has only 1.5 MiB/core (6MiB for 4 cores).  For a variety of reasons you don't want to try to use all of the L3 cache -- 2/3 of the available cache is reasonable.  1 MiB is enough to hold three blocks of 200x200 elements each.  An alternate approach is to use the whole L3 cache for one block and parallelize the processing of the block across the cores.  This allows a bigger block size of somewhere in the range of 400x400 to 500x500 with a 6MiB cache.

For very large caches you have to worry about the reach of the TLB.  On Sandy Bridge the level 2 TLB can map 512 4KiB pages, or 2MiB.  This is slightly less than the size of the core's share of the L3 cache (2.5 MiB/core).   Haswell increases the level 2 TLB to 1024 entries, so each core can map 4MiB with the default 4KiB pages.


Wow, many thanks for the response, lots of food for thought.

I like the concept of using all of the L3 cache for one block and parallelizing across the cores.  Would you use OpenMP for that, or go with native pthreads?
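i.e. something along these lines (a naive sketch of the work division I have in mind, not a tuned kernel; `gemm_panel` is a made-up name, the inner loops are unblocked, and it assumes packed row-major A and B panels; compile with -fopenmp):

```c
/* Sketch: one KC x NC panel of B kept hot in the shared L3, with the
   MC rows of the block split across cores by OpenMP (hypothetical). */
void gemm_panel(int MC, int NC, int KC,
                const double *A, const double *B, double *C, int ldc)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < MC; i++)
        for (int j = 0; j < NC; j++) {
            double sum = 0.0;
            for (int k = 0; k < KC; k++)
                sum += A[i * KC + k] * B[k * NC + j];
            C[i * ldc + j] += sum;
        }
}
```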

As a sanity check, do these values make sense to you?  What size buffer for L2 would you use, for double/complex/long long?

#define L3_SIZE 6291456
#define L3_ASSOCIATIVE 12
#define L3_LINESIZE 64

#define ITB_SIZE 4096
#define ITB_ENTRIES 1024
#define DTB_SIZE 4096


Black Belt

I am not sure exactly what you mean by "what size buffer for L2 would you use" -- the L2 is 256KiB and you can use almost all of it without triggering lots of misses.  256KiB is 32768 doubles, so rounding down to 30000 doubles is reasonable.  This is consistent with the blocking used in most matrix multiplication algorithms -- the cache is expected to hold 3 blocks of 100x100 doubles.

I don't know how the code uses the ITB and DTB variables above.  The values provided are not quite right. 

  • According to the Intel Optimization Reference Manual (document 248966), the Haswell Instruction TLB is 4-way associative with 128 entries for 4KiB pages.  (The hardware supports 2MiB pages in the ITLB, but I have never heard of any software that uses these.)
  • The Haswell Data TLB is also 4-way associative with 64 entries for 4KiB pages.
  • Both TLBs are backed by a second-level Shared TLB that supports 1024 entries with 8-way associativity.  DTLB misses that hit in the STLB are not usually a performance problem, so claiming that the DTLB has 1024 entries will probably not be a problem.
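Putting those corrected numbers back into #define form (my own translation -- the *_ENTRIES names beyond what was posted are guesses at what the library expects):

```c
#define ITB_SIZE       4096   /* 4 KiB pages */
#define ITB_ENTRIES    128    /* Haswell L1 ITLB, 4-way */
#define DTB_SIZE       4096
#define DTB_ENTRIES    64     /* Haswell L1 DTLB, 4-way */
#define STLB_ENTRIES   1024   /* shared second-level TLB, 8-way */
```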