topic Regarding what you are saying in Intel® Moderncode for Parallel Architectures

omp parallel on the same CPU

MGRAV — Thu, 14 Jan 2016 09:08:43 GMT

Hi all,

I see a strange effect when using the numa library with OpenMP.

I do:

#pragma omp parallel
{
    int cpu = sched_getcpu();
    int node = numa_node_of_cpu(cpu);
    printf("%d ; %d\n", cpu, node);
}

I expect to have one thread on each CPU, so each CPU should appear exactly once.
But in reality I get something like this:

39 ; 1
22 ; 0
13 ; 1
12 ; 1
11 ; 1
31 ; 1
10 ; 1
21 ; 0
30 ; 1
3 ; 0
21 ; 0
10 ; 1
12 ; 1
32 ; 1
33 ; 1
21 ; 0
3 ; 0
14 ; 1
18 ; 1
16 ; 1
15 ; 1
21 ; 0
4 ; 0
33 ; 1
17 ; 1
30 ; 1
31 ; 1
37 ; 1
18 ; 1
21 ; 0
17 ; 1
14 ; 1
30 ; 1
18 ; 1
15 ; 1
38 ; 1
12 ; 1
30 ; 1
35 ; 1
30 ; 1

So I do have 40 threads, but for some reason they are not in the right place.
Does anyone have an explanation?
And a solution?

I am not interested in the thread number itself; this is about allocating memory in the right place, not about the thread ID.

Mathieu

McCalpinJohn — Thu, 14 Jan 2016 16:27:12 GMT

I don't know if the code is correct -- I have not used those interfaces before -- but there are a couple of other things you should consider.

Most threaded applications should bind the threads to a core or set of cores.
   1. With the Intel compilers the KMP_AFFINITY variable is the preferred way of controlling OpenMP thread placement, and the "verbose" option will cause the job to print out lots of useful information at the start of the job. See example below.
   2. With the GNU compilers the GOMP_CPU_AFFINITY variable is used to control OpenMP thread placement. This provides similar low-level functionality, but requires cores to be numbered explicitly, so it is a pain to port across systems that use different core-numbering policies.
Some environments will provide an external binding when launching your job. This is very common with MPI applications, but can also be done by other job control infrastructures or you can test this manually using "numactl".
   1. By default, KMP_AFFINITY will respect the processor mask that the OpenMP job inherits from its environment. In this case the logical processor bindings shown by the "verbose" option will include only the logical processors that the external processor mask allows.
   2. You can get the processor mask for each OpenMP thread using the "sched_getaffinity()" call. For OpenMP threads you need to use a value of zero as the pid to get sched_getaffinity to return the affinity mask for the current OpenMP thread.
If your threads are not bound to specific cores, the act of calling sched_getcpu() could easily cause the OS to move the thread.
   1. Every Linux system that I know of supports an alternative way of getting the chip and core currently being used, by including this information in the IA32_TSC_AUX register that can be read using the RDTSCP instruction. This is a user-mode instruction that will not give the OS any particular excuse to move the process. (If the process is not bound the OS can move it whenever it wants to, but this is usually triggered by an OS call or by receiving an interrupt. The RDTSCP instruction does not do either of these things.) A sample routine to execute the RDTSCP instruction and return the socket and core number is appended.
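
The sched_getaffinity() call mentioned above can be sketched roughly as follows. This is only an illustration, assuming Linux with glibc; the helper names allowed_cpu_count() and print_allowed_cpus() are made up for this example and are not part of any library.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Hypothetical helper: number of logical processors the calling thread
 * is allowed to run on, or -1 on error. A pid of 0 means "the calling
 * thread", which is what you want inside an OpenMP parallel region. */
int allowed_cpu_count(void)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0)
        return -1;
    return CPU_COUNT(&mask);
}

/* Hypothetical helper: print the calling thread's affinity mask as a
 * list of OS proc numbers. */
void print_allowed_cpus(void)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_getaffinity");
        return;
    }
    printf("allowed OS procs:");
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &mask))
            printf(" %d", cpu);
    printf("\n");
}
```

If print_allowed_cpus() is called from every thread inside "#pragma omp parallel", an unbound run should list every logical processor for every thread, while a bound run should list only the processor set each thread was given.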

Example using KMP_AFFINITY with and without external binding:

icc -openmp binding_test.c -o binding_test
export OMP_NUM_THREADS=8
export KMP_AFFINITY="verbose,scatter"
echo "run with no external binding"
./binding_test
   [lots of output, including...]
   OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 
   OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 1 core 0 
   [...]
   OMP: Info #242: KMP_AFFINITY: pid 30988 thread 0 bound to OS proc set {0}
   OMP: Info #242: KMP_AFFINITY: pid 30988 thread 1 bound to OS proc set {1}
   OMP: Info #242: KMP_AFFINITY: pid 30988 thread 2 bound to OS proc set {2}
   OMP: Info #242: KMP_AFFINITY: pid 30988 thread 3 bound to OS proc set {3}
   OMP: Info #242: KMP_AFFINITY: pid 30988 thread 5 bound to OS proc set {5}
   OMP: Info #242: KMP_AFFINITY: pid 30988 thread 4 bound to OS proc set {4}
   OMP: Info #242: KMP_AFFINITY: pid 30988 thread 6 bound to OS proc set {6}
   OMP: Info #242: KMP_AFFINITY: pid 30988 thread 7 bound to OS proc set {7}

echo "run with external binding to cores 0-3"
numactl --physcpubind=0-3 ./binding_test
   [lots of output, including...]
   OMP: Info #242: KMP_AFFINITY: pid 31342 thread 0 bound to OS proc set {0}
   OMP: Info #242: KMP_AFFINITY: pid 31342 thread 1 bound to OS proc set {1}
   OMP: Info #242: KMP_AFFINITY: pid 31342 thread 2 bound to OS proc set {2}
   OMP: Info #242: KMP_AFFINITY: pid 31342 thread 3 bound to OS proc set {3}
   OMP: Info #242: KMP_AFFINITY: pid 31342 thread 5 bound to OS proc set {1}
   OMP: Info #242: KMP_AFFINITY: pid 31342 thread 4 bound to OS proc set {0}
   OMP: Info #242: KMP_AFFINITY: pid 31342 thread 6 bound to OS proc set {2}
   OMP: Info #242: KMP_AFFINITY: pid 31342 thread 7 bound to OS proc set {3}

Example code to use RDTSCP -- the function returns the Time Stamp Counter value and also writes to the "chip" and "core" variables passed (by reference) with the chip number and (global) logical processor number where the code was running when the RDTSCP instruction was executed.

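unsigned long tacc_rdtscp(int *chip, int *core)
{
    unsigned a, d, c;

    /* RDTSCP returns the TSC in EDX:EAX and IA32_TSC_AUX in ECX. */
    __asm__ volatile("rdtscp" : "=a" (a), "=d" (d), "=c" (c));
    *chip = (c & 0xFFF000) >> 12;   /* bits 12-23: chip number */
    *core = c & 0xFFF;              /* bits 0-11: global logical processor number */

    return ((unsigned long)a) | (((unsigned long)d) << 32);
}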
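
To make the bit layout used by tacc_rdtscp() concrete, here is a small sketch that only does the decoding step, so it can be tried without executing RDTSCP. It assumes the same layout as the routine above (chip in bits 12-23, logical processor in bits 0-11); decode_tsc_aux() and print_location() are hypothetical names chosen for this example.

```c
#include <stdio.h>

/* Split a raw IA32_TSC_AUX value into chip and logical processor number,
 * using the same masks as tacc_rdtscp(). */
void decode_tsc_aux(unsigned aux, int *chip, int *core)
{
    *chip = (int)((aux & 0xFFF000u) >> 12);
    *core = (int)(aux & 0xFFFu);
}

/* Print one "cpu ; node" line in the format used in the first post. */
void print_location(unsigned aux)
{
    int chip, core;

    decode_tsc_aux(aux, &chip, &core);
    printf("%d ; %d\n", core, chip);
}
```

For example, a raw value of (1 << 12) | 39 decodes to chip 1 and logical processor 39, which corresponds to the "39 ; 1" line in the original output.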

jimdempseyatthecove — Mon, 18 Jan 2016 22:46:10 GMT

It looks like you have 2 physical processors (that is, two chips), each with 10 cores, and with Hyperthreading enabled. (I am not aware of any 20-core chip.)

From your printout, it looks as if you do not have affinity set (see John's reply).

The sched_getcpu function returns a system logical processor number, whose relationship to physical CPU, core, and hardware thread depends on the configuration settings, or lack thereof, at the time of the function call.

Jim Dempsey

SergeyKostrov — Mon, 01 Feb 2016 05:49:00 GMT

I agree that your problem is related to not setting affinity for these 40 OpenMP threads.

>>...I expect have one thread one each CPU, so to have each CPU one time...

Unfortunately, no. The OpenMP directive

#pragma omp parallel { ... some processing ... }

does not guarantee at all that every OpenMP thread will be assigned to its own logical CPU (a 1-to-1 relation), and it does not matter whether you are using a NUMA system or a non-NUMA system. That is why your output has at least three assignments to CPU 12 on NUMA node 1:

... 12 ; 1 ... 12 ; 1 ... 12 ; 1 ...

In that case the OpenMP-thread-to-CPU relation is 3-to-1. Do you agree with that?

You need to set affinity for all 40 OpenMP threads before (!!!) the #pragma omp parallel directive, and you should not try to do it every time an OpenMP thread executes inside the #pragma omp parallel block.

Since I've implemented my own fully portable OpenMP-thread-to-CPU affinity management, I would say that the OpenMP designers did not fully study what had been done in the multithreaded world in the past. That is, these designers neglected the legacy of process and thread affinity management from the beginning and do not want to introduce it in the latest versions of the OpenMP standard. That is why Intel introduced its own KMP-based ( partially portable ) solution to the problem. That is why I've implemented my own fully portable solution.

The famous David Cutler, former Lead Software Engineer and the "Father" of the Windows NT scheduler ( SMT-based ), designed the SetProcessAffinityMask and SetThreadAffinityMask Win32 API functions for Windows NT from the beginning! These two functions are more than 25 years old and, even though they were designed for the multi-processor systems of the 1990s ( before multi-core CPUs appeared around the year 2000 ), they still do what they need to do now.
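
As a rough illustration of what setting affinity for a thread looks like on Linux, here is a minimal sketch using sched_setaffinity() from glibc. It is not the portable implementation described in this post, and it is not what KMP_AFFINITY does internally; pin_current_thread() and running_on() are hypothetical helper names.

```c
#define _GNU_SOURCE
#include <sched.h>

/* Hypothetical helper: pin the calling thread to one logical processor
 * (OS proc number). Returns 0 on success, -1 on failure. */
int pin_current_thread(int cpu)
{
    cpu_set_t mask;

    if (cpu < 0 || cpu >= CPU_SETSIZE)
        return -1;
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    /* A pid of 0 selects the calling thread. */
    return sched_setaffinity(0, sizeof(mask), &mask);
}

/* Hypothetical helper: 1 if the calling thread is currently running on
 * the given logical processor, 0 otherwise. */
int running_on(int cpu)
{
    return sched_getcpu() == cpu;
}
```

One possible use, assuming the OS proc numbers 0 to 39 are all available to the process, is to call pin_current_thread(omp_get_thread_num()) once from each thread in a first parallel region. OpenMP runtimes typically reuse their worker threads, so the binding usually carries over to later regions, but that is an implementation detail rather than a guarantee. In most cases the environment variables from John's reply are the simpler route.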

MGRAV — Thu, 04 Feb 2016 11:04:47 GMT

Regarding what you are saying, I did a lot of tests and research to understand how this thread management works! I now understand why this flexibility exists. It gave me a new perspective for future optimizations.

And now I have two small questions:

1. When Hyper-Threading is enabled, are the L1 and L2 caches shared between the threads, or are they divided so that each thread gets a part?
2. I like the “rdtscp” solution, which is easier to port (and a little faster). Is there a version that works on Xeon Phi? (I get the compile error: Error: `rdtscp' is not supported on `k1om')

jimdempseyatthecove — Thu, 04 Feb 2016 13:24:07 GMT

On current processors, the HyperThreads (or Hardware Threads when speaking of Xeon Phi) of a single core share the L1 and L2 caches of that core. This also applies to the instruction cache.

>>I like the “rdtscp” solution, that is more easy to port (and a little faster). Is there a version who is ok with XeonPhi ? (compile error Error: `rdtscp' is not supported on `k1om')

Doesn't this imply it does not port?

Most multi-threaded applications that are concerned about cache also perform affinity pinning, using one of the following:

Locking a software thread to a hardware thread
Locking a software thread to a core (i.e. permitted to run on any hardware thread of a given core)
Locking a software thread to a CPU
Locking a software thread to a NUMA node
Locking a software thread to within a NUMA distance

For L1 and L2 binding, the first two are used.
For L3/LLC, CPU binding is used.
For close memory, either CPU or NUMA node binding is used.
Most users will never experience more than one NUMA hop (current and adjacent).
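
The coarser levels in the list above (core, CPU, NUMA node) amount to giving a thread a set of logical processors instead of a single one. The following is a sketch under stated assumptions: Linux with glibc, and a processor list in the "cpulist" text format that Linux exposes in files such as /sys/devices/system/node/node0/cpulist. The names parse_cpulist() and bind_to_cpulist() are invented for this example.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>

/* Hypothetical helper: parse a cpulist string such as "0-9,20-29" into a
 * cpu_set_t. Returns the number of CPUs in the set, or -1 if the string
 * is malformed or names a CPU beyond CPU_SETSIZE. */
int parse_cpulist(const char *list, cpu_set_t *set)
{
    const char *p = list;

    CPU_ZERO(set);
    while (*p != '\0' && *p != '\n') {
        char *end;
        long first = strtol(p, &end, 10);
        long last;

        if (end == p || first < 0)
            return -1;
        last = first;
        if (*end == '-') {              /* a range "first-last" */
            p = end + 1;
            last = strtol(p, &end, 10);
            if (end == p || last < first)
                return -1;
        }
        if (last >= CPU_SETSIZE)
            return -1;
        for (long cpu = first; cpu <= last; cpu++)
            CPU_SET((int)cpu, set);
        p = end;
        if (*p == ',')
            p++;
        else if (*p != '\0' && *p != '\n')
            return -1;
    }
    return CPU_COUNT(set);
}

/* Hypothetical helper: restrict the calling thread to the CPUs named in
 * a cpulist string. Returns 0 on success, -1 on failure. */
int bind_to_cpulist(const char *list)
{
    cpu_set_t set;

    if (parse_cpulist(list, &set) <= 0)
        return -1;
    return sched_setaffinity(0, sizeof(set), &set);
}
```

On the machine from the first post, node 0 would presumably be "0-9,20-29", since CPUs 3, 4, 21 and 22 report node 0 while CPUs 10-18 and 30-39 report node 1. That numbering is inferred from the printed output, not confirmed by the poster.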

On KNC, there is only one timestamp counter within the CPU (one CPU on coprocessor), and this counter does not run at the same rate as the host timestamp counter. While there may be a way to nearly synchronize the timestamp counters of multiple coprocessors, doing so is likely not practical.

Jim Dempsey

SergeyKostrov — Mon, 08 Feb 2016 02:14:16 GMT

>>I like the “rdtscp” solution, that is more easy to port (and a little faster). Is there a version who is ok
>>with XeonPhi ? (compile error Error: `rdtscp' is not supported on `k1om')

1. Take into account that rdtscp is a privileged instruction and can only be executed in Ring 0 ( the privileged ring ).
2. You need to provide more details about your C++ compiler ( Note: you could also check that the immintrin.h header file ( support for AVX ISA ) exists ).

SergeyKostrov — Mon, 08 Feb 2016 02:22:05 GMT

>>...While there may be a way to nearly synchronize the timestamp counters of multiple coprocessors, doing so
>>is likely not practical.

I did some research on that about 2 years ago, but it was only partially completed. I managed to get timestamp values for all working CPUs but did not try to synchronize them. There is also a thread about it on IDZ.

SergeyKostrov — Mon, 08 Feb 2016 02:35:20 GMT

>>...or are they divided and each thread has a part?

Honestly, this is what I want to hear from Intel hardware or software engineers, and I don't think they will respond. The standard prefetch instruction, which can be used with hints T0, T1, T2 and NTA, does not allow loading data into a specific portion of a cache. Am I wrong? Here is a piece of code from my headers, just for your information:

[ HrtAL.h ]
...
////////////////////////////////////////////////////////////////////////////////
// Note 01: Descriptions of Hint Codes for _mm_prefetch intrinsic function:
//          Loads one cache line of data ( address is an input ) to a location
//          closer to the Processor Unit.
//
//          _MM_HINT_T0  - Prefetch data into all Levels of the cache hierarchy
//                         ( temporal data ).
//          _MM_HINT_T1  - Prefetch data into Level 1 cache and higher
//                         ( temporal data with respect to 1 Level cache ).
//          _MM_HINT_T2  - Prefetch data into Level 2 cache and higher
//                         ( temporal data with respect to 2 Level cache ).
//          _MM_HINT_NTA - Prefetch data into non-temporal cache structure and
//                         into a location close to the processor, minimizing cache pollution
//                         ( non-temporal data with respect to all cache Levels ).
//
// Note 02: Implementation HrtPrefetchData< T0/T1/T2/NTA > functions matches to:
//
//          _mm_prefetch( ( RTchar * )piAddress, _MM_HINT_T0 );   // 0F 18 08 prefetcht0 [eax]
//          _mm_prefetch( ( RTchar * )piAddress, _MM_HINT_T1 );   // 0F 18 10 prefetcht1 [eax]
//          _mm_prefetch( ( RTchar * )piAddress, _MM_HINT_T2 );   // 0F 18 18 prefetcht2 [eax]
//          _mm_prefetch( ( RTchar * )piAddress, _MM_HINT_NTA );  // 0F 18 00 prefetchnta [eax]
...

McCalpinJohn — Mon, 08 Feb 2016 19:52:33 GMT

RDTSCP is not a privileged instruction except in the unusual case that the CR4.TSD bit is set. I have never seen this bit set outside of some virtual machine implementations, but I have heard that some folks who are extremely paranoid about covert channels may have also used this bit to disable low-latency time stamp counter access.

Unfortunately, the RDTSCP instruction is newer than the P54C core used in the Xeon Phi, and it is not supported there. Xeon Phi works best with strong mandatory thread binding using the KMP_AFFINITY and KMP_PLACE_THREADS environment variables. It looks like the next generation Xeon Phi (Knights Landing) will support the RDTSCP instruction, because it supports the associated IA32_TSC_AUX MSR.

Concerning the caches -- when HyperThreading is enabled, the L1 Instruction Cache, L1 Data Cache, and L2 unified cache are shared by the two threads. There are lots of ways to adjust the "fairness" of the sharing using the LRU policies of the caches, but I am not aware of any Intel disclosures in this area. For homogeneous workloads, the behavior is pretty much what you would expect if the cache were evenly split between the two threads, but there are corner cases where behavior is less easy to understand (particularly if each thread wants to use more than 4 of the 8 ways of associativity of the cache).