Chronus_Taizen
Beginner

KNL thread binding to tiles.

Hello,

I am curious: for KNL, how does a programmer know when her OpenMP threads are assigned to the same tile? If the programmer writes with the L2 cache in mind, say she is thinking about groups of 8 threads on a tile to cover the entire socket, how would she know whether her co-operating threads are located on the same tile?

A picture would be helpful.

Thanks.

9 Replies
jimdempseyatthecove
Black Belt

You have to have each thread use __cpuid() to see if x2APIC is supported (it is on KNL, but it may not be on other systems your app runs on), then use the CPUID leaf for the APIC or x2APIC to obtain the APIC ID. Note that some APIC numbers on KNL will exceed 255, hence the need to use the x2APIC method to get the APIC number. You will then need to query the cache parameters using __cpuidex (you may have to write this yourself), running through the subleaves of leaf 0x04 to locate cache level 2 (for your thread) and to obtain the bit-shift count for the APIC mask. Then apply the mask to the APIC numbers to find the siblings. The shift counts for each cache level should be the same for all logical processors. Once all threads have obtained their APIC number, you can then use the L2 mask to locate the other siblings.

Also note that using APIC numbers is independent of KMP_AFFINITY and independent of whether you are running on KNL, Xeon, or Core 2 Duo (which also has 2 cores/4 threads sharing L2).

Jim Dempsey

McCalpinJohn
Black Belt

There are a number of APIs that can be used to get a general solution, but in this case there is a quick solution that works on every KNL that I have seen.   All of my KNLs have logical processor numbering that pairs an even logical processor number and the next odd logical processor number on each tile.  E.g., [0,1], [2,3], etc.   The same applies to the other three thread contexts, except that they start at P, 2*P, and 3*P, where P is the number of physical cores activated on the die.   So on my Xeon Phi 7210 (64 core), the logical processors sharing a tile are:

[j, j+64, j+128, j+192],[j+1,j+65,j+129,j+193], for j=0..63

Similarly, for the Xeon Phi 7250 (68 core), the logical processors sharing a tile are:

[j,j+68,j+136,j+204],[j+1,j+69,j+137,j+205]

I verified this on about 400 of TACC's Xeon Phi 7250 nodes using the CPUID "Extended Topology Enumeration Leaf" (initial value of EAX=0x0B), described in the CPUID instruction section of Volume 2 of the Intel Architecture Software Developer's Manual (document 325383).   The x2APIC ID of the logical processor executing the CPUID instruction is returned in the EDX register in this case.   If the initial value in ECX is set to 0x00, then the returned value in EBX will be 0x04, indicating 4 SMT threads per physical core.  If the initial value in ECX is set to 0x01, then the returned value in EBX will be 0x110 (272 decimal), indicating a 68-core/272-thread processor. 

In other documents Intel describes how to use the "Extended Topology Enumeration Leaf" to describe groups of logical processors that are larger than a core, but smaller than a package.  This could be useful for describing tiles, for example, or it could be useful for describing the groups of cores that share an L3 cache in "cluster on die" mode. BUT, the description of the CPUID instruction in Volume 2 of the SWDM seems to rule this out -- the only allowed "level type" values are "SMT" (0x01) and "core" (0x02).

So all of that is very regular, but there is still a somewhat arbitrary assignment of the "base" logical processor number to each "base" x2APIC ID.  This is controlled by the BIOS, so it is mostly invisible, but I did notice significant changes in the mappings when a node is in SNC-4 mode, for example.  The good news is that the formulas above still work -- for any value of "j", all of the logical processor numbers listed share a tile.

Loc_N_Intel
Employee

Another approach is to use the Intel MPI command "cpuinfo" to display information about cores, logical processors, and cache.

$ source /opt/intel/parallel_studio_xe_2017.1.043/psxevars.sh intel64
$ cpuinfo
Intel(R) processor family information utility, Version 2017 Update 1 Build 20161016 (id: 16418)
Copyright (C) 2005-2016 Intel Corporation.  All rights reserved.

=====  Processor composition  =====
Processor name    : Intel(R) Xeon Phi(TM)  7250
Packages(sockets) : 1
Cores             : 68
Processors(CPUs)  : 272
Cores per package : 68
Threads per core  : 4

=====  Processor identification  =====
Processor       Thread Id.      Core Id.        Package Id.
0               0               0               0
1               0               1               0
2               0               2               0
3               0               3               0
4               0               4               0
5               0               5               0
…………………………………………………………………………………………………

262             3               64              0
263             3               65              0
264             3               68              0
265             3               69              0
266             3               70              0
267             3               71              0
268             3               72              0
269             3               73              0
270             3               74              0
271             3               75              0
=====  Placement on packages  =====
Package Id.     Core Id.        Processors
0               0,1,2,3,4,5,6,7,10,11,12,13,14,15,16,17,18,19,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,68,69,70,71,72,73,74,75             
(0,68,136,204)(1,69,137,205)(2,70,138,206)(3,71,139,207)(4,72,140,208)(5,73,141,209)(6,74,142,210)(7,75,143,211)(8,76,144,212)(9,77,145,213)(10,78,146,214)(11,79,147,215)(12,80,148,216)(13,81,149,217)(14,82,150,218)(15,83,151,219)(16,84,152,220)(17,85,153,221)(18,86,154,222)(19,87,155,223)(20,88,156,224)(21,89,157,225)(22,90,158,226)(23,91,159,227)(24,92,160,228)(25,93,161,229)(26,94,162,230)(27,95,163,231)(28,96,164,232)(29,97,165,233)(30,98,166,234)(31,99,167,235)(32,100,168,236)(33,101,169,237)(34,102,170,238)(35,103,171,239)(36,104,172,240)(37,105,173,241)(38,106,174,242)(39,107,175,243)(40,108,176,244)(41,109,177,245)(42,110,178,246)(43,111,179,247)(44,112,180,248)(45,113,181,249)(46,114,182,250)(47,115,183,251)(48,116,184,252)(49,117,185,253)(50,118,186,254)(51,119,187,255)(52,120,188,256)(53,121,189,257)(54,122,190,258)(55,123,191,259)(56,124,192,260)(57,125,193,261)(58,126,194,262)(59,127,195,263)(60,128,196,264)(61,129,197,265)(62,130,198,266)(63,131,199,267)(64,132,200,268)(65,133,201,269)(66,134,202,270)(67,135,203,271)

=====  Cache sharing  =====
Cache   Size            Processors
L1      32  KB          (0,68,136,204)(1,69,137,205)(2,70,138,206)(3,71,139,207)(4,72,140,208)(5,73,141,209)(6,74,142,210)(7,75,143,211)(8,76,144,212)(9,77,145,213)(10,78,146,214)(11,79,147,215)(12,80,148,216)(13,81,149,217)(14,82,150,218)(15,83,151,219)(16,84,152,220)(17,85,153,221)(18,86,154,222)(19,87,155,223)(20,88,156,224)(21,89,157,225)(22,90,158,226)(23,91,159,227)(24,92,160,228)(25,93,161,229)(26,94,162,230)(27,95,163,231)(28,96,164,232)(29,97,165,233)(30,98,166,234)(31,99,167,235)(32,100,168,236)(33,101,169,237)(34,102,170,238)(35,103,171,239)(36,104,172,240)(37,105,173,241)(38,106,174,242)(39,107,175,243)(40,108,176,244)(41,109,177,245)(42,110,178,246)(43,111,179,247)(44,112,180,248)(45,113,181,249)(46,114,182,250)(47,115,183,251)(48,116,184,252)(49,117,185,253)(50,118,186,254)(51,119,187,255)(52,120,188,256)(53,121,189,257)(54,122,190,258)(55,123,191,259)(56,124,192,260)(57,125,193,261)(58,126,194,262)(59,127,195,263)(60,128,196,264)(61,129,197,265)(62,130,198,266)(63,131,199,267)(64,132,200,268)(65,133,201,269)(66,134,202,270)(67,135,203,271)
L2      1   MB          (0,1,68,69,136,137,204,205)(2,3,70,71,138,139,206,207)(4,5,72,73,140,141,208,209)(6,7,74,75,142,143,210,211)(8,9,76,77,144,145,212,213)(10,11,78,79,146,147,214,215)(12,13,80,81,148,149,216,217)(14,15,82,83,150,151,218,219)(16,17,84,85,152,153,220,221)(18,19,86,87,154,155,222,223)(20,21,88,89,156,157,224,225)(22,23,90,91,158,159,226,227)(24,25,92,93,160,161,228,229)(26,27,94,95,162,163,230,231)(28,29,96,97,164,165,232,233)(30,31,98,99,166,167,234,235)(32,33,100,101,168,169,236,237)(34,35,102,103,170,171,238,239)(36,37,104,105,172,173,240,241)(38,39,106,107,174,175,242,243)(40,41,108,109,176,177,244,245)(42,43,110,111,178,179,246,247)(44,45,112,113,180,181,248,249)(46,47,114,115,182,183,250,251)(48,49,116,117,184,185,252,253)(50,51,118,119,186,187,254,255)(52,53,120,121,188,189,256,257)(54,55,122,123,190,191,258,259)(56,57,124,125,192,193,260,261)(58,59,126,127,194,195,262,263)(60,61,128,129,196,197,264,265)(62,63,130,131,198,199,266,267)(64,65,132,133,200,201,268,269)(66,67,134,135,202,203,270,271)

 

For example, in the Cache sharing section, the tuple (0,1,68,69,136,137,204,205) indicates that these logical processors share L2 cache.

From the "Placement on packages" section, logical processors 0, 68, 136, and 204 belong to core 0, and logical processors 1, 69, 137, and 205 belong to core 1. Therefore cores 0 and 1 share an L2 cache; that is, cores 0 and 1 belong to the same tile.

jimdempseyatthecove
Black Belt

The following is a portion of actual, running code that performs an N-body simulation. The simulation code has been excised, leaving the memory allocation and initialization. When I am finished with the formal code I will post links to it here.

The KNL system is somewhat different from traditional SMP systems. The KNL has on-package MCDRAM that can be configured as an LLC or as extended, NUMA-node-accessible RAM. To complicate things further, the KNL can be configured as 1, 2, or 4 NUMA nodes, with the MCDRAM as all cache, half cache, or fully NUMA-node accessible.

In the following test program, the KNL was configured as SNC-4, Cache (4 NUMA nodes with MCDRAM partitioned into 4 caches).

// TemplateTestSuite.cpp : Verify templates
//
#if defined(WIN32)
#include <intrin.h>
#include <process.h>
#endif
#include "QuickThread.h"
#include "parallel_for.h"
#include "parallel_task.h"
#include "parallel_wait.h"

using namespace qt;

double* X = NULL;
double* Y = NULL;
double* Z = NULL;
double* dX = NULL;
double* dY = NULL;
double* dZ = NULL;
double* fX = NULL;
double* fY = NULL;
double* fZ = NULL;
double* Mass = NULL;

__int64 nBodies = 1*1000*1000*1000; // default to 1 billion

double* do_alloc()
{
 double* block = (double*)_mm_malloc(nBodies * sizeof(double), 64);
 if(block)
  return block;

 printf("Allocation error\n");
 exit(-1);
}

// global allocate - let Linux "First Touch" map pages to NUMA node
void AllocateBodies()
{
 X = do_alloc();
 Y = do_alloc();
 Z = do_alloc();
 dX = do_alloc();
 dY = do_alloc();
 dZ = do_alloc();
 fX = do_alloc();
 fY = do_alloc();
 fZ = do_alloc();
 Mass = do_alloc();
}

// Initialization performs first touch to map memory to the NUMA node of first touch
// Note: the init slice has not been optimized, as it is only performed once
void InitSliceL1(__int64 iBegin, __int64 iEnd)
{
 for(__int64 i=iBegin; i<iEnd; ++i)
 {
  unsigned __int64 random_val;
  while(_rdrand64_step(&random_val)==0)
   continue;
  X[i] = (double)random_val;
  while(_rdrand64_step(&random_val)==0)
   continue;
  Y[i] = (double)random_val;
  while(_rdrand64_step(&random_val)==0)
   continue;
  Z[i] = (double)random_val;
  dX[i] = 0.0;
  dY[i] = 0.0;
  dZ[i] = 0.0;
  fX[i] = 0.0;
  fY[i] = 0.0;
  fZ[i] = 0.0;
  while(_rdrand64_step(&random_val)==0)
   continue;
  Mass[i] = (double)random_val;
 }
}

void InitSliceM0(__int64 iBegin, __int64 iEnd)
{
 // partition the node by the number of cores within the node
 parallel_for(qtPlacement(OneEach_Within_M0$+L1$), InitSliceL1, iBegin, iEnd);
}
void InitBodies()
{
 // partition the nBodies by the number of compute available NUMA nodes
 parallel_for(qtPlacement(OneEach$+M0$),InitSliceM0, (__int64)0, nBodies);
}

int main(void)
{
 int nThreadsInit = -1;   // all possible threads
 qtInit qtInit(nThreadsInit); // enumerate system and initialize thread pool

 qtControl control;   // QuickThread control object used to access utility functions

 // thread teaming capability by proximity to current thread
 printf("Number of threads in team at L0$ %d\n", control.SelectAffinities(L0$)); // L0$ is self
 printf("Number of threads in team at L1$ %d\n", control.SelectAffinities(L1$)); // any thread sharing current thread's L1
 printf("Number of threads in team at L2$ %d\n", control.SelectAffinities(L2$)); // any thread sharing current thread's L2
 printf("Number of threads in team at L3$ %d\n", control.SelectAffinities(L3$)); // any thread sharing current thread's L3 (core L3, socket L3, or MCDRAM prorated L3, or M0 NUMA node)
 printf("Number of threads in team at M0$ %d\n", control.SelectAffinities(M0$)); // any thread sharing current thread's NUMA node
 printf("Number of threads in team at M1$ %d\n", control.SelectAffinities(M1$)); // any thread sharing current thread's NUMA node and nodes within closest distance
 printf("Number of threads in team at M2$ %d\n", control.SelectAffinities(M2$)); // any thread sharing current thread's NUMA node and nodes within next closest distance
 printf("Number of threads in team at M3$ %d\n", control.SelectAffinities(M3$)); // any thread sharing current thread's NUMA node and nodes within furthest distance (IOW all threads)
 printf("\n");

 // thread teaming capability across each sized proximity collective
 printf("Number of threads in team at OneEach$+L0$ %d\n", control.SelectAffinities(OneEach$+L0$)); // Each self IOW all threads
 printf("Number of threads in team at OneEach$+L1$ %d\n", control.SelectAffinities(OneEach$+L1$)); // One thread per core
 printf("Number of threads in team at OneEach$+L2$ %d\n", control.SelectAffinities(OneEach$+L2$)); // One thread per L2
 printf("Number of threads in team at OneEach$+L3$ %d\n", control.SelectAffinities(OneEach$+L3$)); // One thread per L3 (core L3, socket L3, or MCDRAM prorated L3, or M0 NUMA node)
 printf("Number of threads in team at OneEach$+M0$ %d\n", control.SelectAffinities(OneEach$+M0$)); // One thread per NUMA node
 printf("Number of threads in team at OneEach$+M1$ %d\n", control.SelectAffinities(OneEach$+M1$)); // One thread within closest NUMA distance
 printf("Number of threads in team at OneEach$+M2$ %d\n", control.SelectAffinities(OneEach$+M2$)); // One thread within next closest NUMA distance
 printf("Number of threads in team at OneEach$+M3$ %d\n", control.SelectAffinities(OneEach$+M3$)); // One thread within furthest NUMA distance (IOW 1/master thread)
 printf("\n");

 // Obtain Cache sizes and proration by numbers of threads sharing the cache
 printf("L1 Cache Size %dKB, per thread at L1 %dKB\n", CacheLevelSize(1) / 1024, CacheLevelSize(1) / 1024 / control.SelectAffinities(L1$));
 printf("L2 Cache Size %dKB, per thread at L2 %dKB\n", CacheLevelSize(2) / 1024, CacheLevelSize(2) / 1024 / control.SelectAffinities(L2$));
 printf("L3 Cache Size %dKB, per thread at L3 %dKB\n", CacheLevelSize(3) / 1024, CacheLevelSize(3) / 1024 / control.SelectAffinities(L3$));

 // Cache line sizes could potentially vary (they don't in this example)
 printf("L1 Cache Line Size %d\n", CacheLevelLineSize(1));
 printf("L2 Cache Line Size %d\n", CacheLevelLineSize(2));
 printf("L3 Cache Line Size %d\n", CacheLevelLineSize(3));
 printf("\n");

 AllocateBodies();
 printf("Bodies allocated\n");

 printf("Initializing bodies\n");
 InitBodies();
 printf("Bodies Initialized\n");

 // simulation code follows
 // ...
 // end simulation code

 // deallocation follows
 // ...

 // End QuickThread thread pool
 qtInit.EndQT();

 return 0;
}

Report from numactl -H and output of initialization code:
SNC-4 Cache

[jim@KNL ~]$ numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207
node 0 size: 24450 MB
node 0 free: 23302 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223
node 1 size: 24576 MB
node 1 free: 23733 MB
node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239
node 2 size: 24576 MB
node 2 free: 23214 MB
node 3 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255
node 3 size: 24576 MB
node 3 free: 23300 MB
node distances:
node   0   1   2   3 
  0:  10  21  21  21 
  1:  21  10  21  21 
  2:  21  21  10  21 
  3:  21  21  21  10 

Running test program:

Number of threads in team at L0$ 1
Number of threads in team at L1$ 4
Number of threads in team at L2$ 8
Number of threads in team at L3$ 64
Number of threads in team at M0$ 64
Number of threads in team at M1$ 256
Number of threads in team at M2$ 256
Number of threads in team at M3$ 256

Number of threads in team at OneEach$+L0$ 256
Number of threads in team at OneEach$+L1$ 64
Number of threads in team at OneEach$+L2$ 32
Number of threads in team at OneEach$+L3$ 4
Number of threads in team at OneEach$+M0$ 4
Number of threads in team at OneEach$+M1$ 1
Number of threads in team at OneEach$+M2$ 1
Number of threads in team at OneEach$+M3$ 1

L1 Cache Size 32KB, per thread at L1 8KB
L2 Cache Size 1024KB, per thread at L2 128KB
L3 Cache Size 4194304KB, per thread at L3 65536KB

I just got the QuickThread library up and running on KNL this weekend. The library has been around for 10 years (under-appreciated). If anyone is interested in experimenting, and possibly collaborating on a paper, please contact me.

The current state of QuickThread is that it is configured for SMP systems, including KNL with its MCDRAM quirks. Some of the templates, such as a class with a base class with virtual functions, have disambiguation issues among other instantiations with the same name but different signatures. These can be avoided with relatively easy programming changes.

For future development, I intend to integrate MPI and offloading into the templates; in other words, to create a unified programming paradigm for complex systems.

Jim Dempsey

Chronus_Taizen
Beginner

 

Loc N. (Intel), thanks for the reference. These tools make life much easier. Are there free versions of the Intel cpuinfo utility?

Jim, I'd be curious about OpenMP on KNLs with threading for your N-body simulation. And, about your first response: do you have some sample code that relates to your proposed solution? I am not familiar with the masks and function calls you mentioned. I did some googling, and it seems that there is a rather steep path into the assembly world to get things to work.

John, thanks for the empirical sanity check.

 

jimdempseyatthecove
Black Belt

>>Jim, I'd be curious about OpenMP on KNLs with threading for your N-Body simulation.

I am preparing comparative-analysis programs for a paper. I've done some earlier work illustrating thread pinning and scheduling using OpenMP. Entering "OpenMP HT team Dempsey" into the "powered by Google" search at the top of the page will get you some links. One of the better articles is "The Chronicles of Phi".

OpenMP is somewhat lacking in giving the application control over thread teaming. You are generally limited to specifying a particular pinning arrangement using KMP_AFFINITY, OMP_PLACE_THREADS, etc., which is performed externally to the application. This leaves the application stuck with the specified pinning arrangement. With QuickThread, the design goal is to allow the application to (optionally) specify thread-team selection (at pinned levels) at run time, regardless of environment variables.
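For comparison, the closest OpenMP can get to tile-level teaming is an explicit place list built from the logical processor numbering established earlier in the thread. This is a hedged sketch, not a general solution: OpenMP has no built-in "tile" abstraction, the CPU numbers below assume the 68-core Xeon Phi 7250 layout shown above, and the list would need all 34 tiles spelled out (only the first two are shown).

```shell
# One OpenMP place per tile: the 8 logical CPUs sharing an L2 on a
# 68-core KNL (assumed numbering: (0,1,68,69,136,137,204,205) is tile 0).
export OMP_PLACES="{0,1,68,69,136,137,204,205},{2,3,70,71,138,139,206,207}"
export OMP_PROC_BIND=close   # keep co-operating threads on nearby places
export OMP_NUM_THREADS=16
```

As Jim notes, this pinning is fixed before the program starts; the application cannot re-team threads at run time through this mechanism.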

>>do you have some sample code that relates to your proposed solution?

I am writing them up now.

In the code snip above, you will notice that the number of bodies was:

__int64 nBodies = 1*1000*1000*1000; // default to 1 billion

I do not think I can get a practical solution with a billion bodies on a single KNL system. The simplified body data has position, velocity, force, and mass, all doubles, for 80 bytes per body. 1 GiBodies is 80 GiB (my system has 96 GiB). This size is approximately 5x that of the MCDRAM (16 GiB). The test programs I am preparing are intended to explore:

RAM access
LLC provided by MCDRAM
L2 cache
L1 cache

The intention is to use that number of bodies and then determine the average body-to-body force calculation and accumulation times for:

NUMA node and cache oblivious
NUMA aware
NUMA + LLC aware
NUMA + LLC + L2
NUMA + LLC + L2 + L1
NUMA + L2
NUMA + L1

The initial focus is to determine the effect of team sequestering on the body-to-body interaction time. Later on, time permitting, I might introduce optimizations using #pragma simd, and then _mm512_... intrinsics.

Jim Dempsey

JJK
New Contributor III

The thread siblings can be retrieved from the /sys/devices/system/cpu/cpuX/topology/ directory. Here's some sample code to show the thread siblings, i.e., the Linux CPU IDs that share the same core (and thus the same L2 cache). This code is not KNL-specific; it works on any system running a modern Linux kernel.

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    FILE   *f = NULL;
    size_t  n;
    int     cpu;
    int     node;
    char    thread_siblings_path[PATH_MAX];
    char    thread_siblings[256];

    syscall(SYS_getcpu, &cpu, &node, NULL);
    printf("cpu = %3d node = %3d\n", cpu, node);

    snprintf(thread_siblings_path, sizeof(thread_siblings_path),
             "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list", cpu);
    f = fopen(thread_siblings_path, "r");
    if (f)
    {
      n = fread(thread_siblings, 1, sizeof(thread_siblings) - 1, f);
      if (n > 0)
      {
        thread_siblings[n] = '\0';   /* fread() does not NUL-terminate */
        printf("siblings = %s\n", thread_siblings);
      }
      else
        printf("ERROR: Could not read sibling list\n");
      fclose(f);
    }
    else
      printf("ERROR: Could not open %s\n", thread_siblings_path);

    return 0;
}

 

McCalpinJohn
Black Belt

The Linux topology "thread_siblings" only shows the logical processors that share a physical core -- it does not show cores that share an L2 cache.

But the L2 sharing information is available in /sys/devices/system/cpu/cpuN/cache/index2/shared_cpu_list.  On a 68-core system I get values like:

0-1,68-69,136-137,204-205

This is entirely consistent with the formulas I provided in note 3 above....

Loc_N_Intel
Employee

Hi Luis,

That utility is part of the Intel MPI runtime library. One way to get it is via the MPI runtime library included in the Intel® Distribution for Python*. This package is free and available to the public at https://software.intel.com/en-us/intel-distribution-for-python

After installing Python from this package (there are two Python versions available, 2.7 and 3.5; just pick either one), you just need to activate the root Intel Python Conda environment:

$ source /opt/intel/intelpython27/bin/activate root

At this point, you are ready to use the cpuinfo utility. 
