Hi all,

I am seeing a strange effect when using the numa library with OpenMP. I run:

```c
#pragma omp parallel
{
    int cpu  = sched_getcpu();
    int node = numa_node_of_cpu(cpu);
    printf("%d ; %d\n", cpu, node);
}
```

I expect to have one thread on each CPU, so that each CPU appears exactly once. But in reality I get something like this:
39 ; 1
22 ; 0
13 ; 1
12 ; 1
11 ; 1
31 ; 1
10 ; 1
21 ; 0
30 ; 1
3 ; 0
21 ; 0
10 ; 1
12 ; 1
32 ; 1
33 ; 1
21 ; 0
3 ; 0
14 ; 1
18 ; 1
16 ; 1
15 ; 1
21 ; 0
4 ; 0
33 ; 1
17 ; 1
30 ; 1
31 ; 1
37 ; 1
18 ; 1
21 ; 0
17 ; 1
14 ; 1
30 ; 1
18 ; 1
15 ; 1
38 ; 1
12 ; 1
30 ; 1
35 ; 1
30 ; 1
So I do effectively have 40 threads, but for some reason they are not in the right places.

Does anyone have an explanation, and ideally a solution?

I am not interested in the thread numbers themselves; my concern is allocating memory in the right place, not the IDs.
Mathieu
I don't know if the code is correct -- I have not used those interfaces before -- but there are a couple of other things you should consider.
- Most threaded applications should bind the threads to a core or set of cores.
- With the Intel compilers the KMP_AFFINITY variable is the preferred way of controlling OpenMP thread placement, and the "verbose" option will cause the job to print out lots of useful information at the start of the job. See example below.
- With the GNU compilers the GOMP_CPU_AFFINITY variable is used to control OpenMP thread placement. This provides similar low-level functionality, but requires cores to be numbered explicitly, so it is a pain to port across systems that use different core-numbering policies.
- Some environments will provide an external binding when launching your job. This is very common with MPI applications, but can also be done by other job control infrastructures or you can test this manually using "numactl".
- By default, KMP_AFFINITY will respect the processor mask that the OpenMP job inherits from its environment. In this case the logical processor bindings shown by the "verbose" option will include only the logical processors that the external processor mask allows.
- You can get the processor mask for each OpenMP thread using the "sched_getaffinity()" call. For OpenMP threads you need to use a value of zero as the pid to get sched_getaffinity to return the affinity mask for the current OpenMP thread.
- If your threads are not bound to specific cores, the act of calling sched_getcpu() could easily cause the OS to move the thread.
- Every Linux system that I know of supports an alternative way of getting the chip and core currently being used, by including this information in the IA32_TSC_AUX register that can be read using the RDTSCP instruction. This is a user-mode instruction that will not give the OS any particular excuse to move the process. (If the process is not bound the OS can move it whenever it wants to, but this is usually triggered by an OS call or by receiving an interrupt. The RDTSCP instruction does not do either of these things.) A sample routine to execute the RDTSCP instruction and return the socket and core number is appended.
Example using KMP_AFFINITY with and without external binding:

```
icc -openmp binding_test.c -o binding_test
export OMP_NUM_THREADS=8
export KMP_AFFINITY="verbose,scatter"

echo "run with no external binding"
./binding_test
[lots of output, including...]
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 1 core 0
[...]
OMP: Info #242: KMP_AFFINITY: pid 30988 thread 0 bound to OS proc set {0}
OMP: Info #242: KMP_AFFINITY: pid 30988 thread 1 bound to OS proc set {1}
OMP: Info #242: KMP_AFFINITY: pid 30988 thread 2 bound to OS proc set {2}
OMP: Info #242: KMP_AFFINITY: pid 30988 thread 3 bound to OS proc set {3}
OMP: Info #242: KMP_AFFINITY: pid 30988 thread 5 bound to OS proc set {5}
OMP: Info #242: KMP_AFFINITY: pid 30988 thread 4 bound to OS proc set {4}
OMP: Info #242: KMP_AFFINITY: pid 30988 thread 6 bound to OS proc set {6}
OMP: Info #242: KMP_AFFINITY: pid 30988 thread 7 bound to OS proc set {7}

echo "run with external binding to cores 0-3"
numactl --physcpubind=0-3 ./binding_test
[lots of output, including...]
OMP: Info #242: KMP_AFFINITY: pid 31342 thread 0 bound to OS proc set {0}
OMP: Info #242: KMP_AFFINITY: pid 31342 thread 1 bound to OS proc set {1}
OMP: Info #242: KMP_AFFINITY: pid 31342 thread 2 bound to OS proc set {2}
OMP: Info #242: KMP_AFFINITY: pid 31342 thread 3 bound to OS proc set {3}
OMP: Info #242: KMP_AFFINITY: pid 31342 thread 5 bound to OS proc set {1}
OMP: Info #242: KMP_AFFINITY: pid 31342 thread 4 bound to OS proc set {0}
OMP: Info #242: KMP_AFFINITY: pid 31342 thread 6 bound to OS proc set {2}
OMP: Info #242: KMP_AFFINITY: pid 31342 thread 7 bound to OS proc set {3}
```
Example code to use RDTSCP -- the function returns the Time Stamp Counter value and also writes to the "chip" and "core" variables passed (by reference) with the chip number and (global) logical processor number where the code was running when the RDTSCP instruction was executed.
```c
unsigned long tacc_rdtscp(int *chip, int *core)
{
    unsigned a, d, c;
    __asm__ volatile("rdtscp" : "=a" (a), "=d" (d), "=c" (c));
    *chip = (c & 0xFFF000) >> 12;
    *core = c & 0xFFF;
    return ((unsigned long)a) | (((unsigned long)d) << 32);
}
```
It looks like you have 2 physical processors (i.e., two chips), each with 10 cores, with Hyper-Threading enabled. (I am not aware of any 20-core chip.)
From your printout, it looks as if you do not have affinity set (see John's reply).
The sched_getcpu function returns a system logical processor number, whose relationship to physical CPU, core, and hardware thread is dependent on configuration settings, or lack thereof, ... at the time of the function call.
Jim Dempsey
Following what you said, I did a lot of tests and research to understand how this thread management works! I now understand why this flexibility exists, and it gave me a new vision for future optimizations.

And now I have two small questions:

- With Hyper-Threading, are the L1 and L2 caches shared between the threads, or are they partitioned so that each thread gets its own part?
- I like the "rdtscp" solution; it is easier to port (and a little faster). Is there a version that works on Xeon Phi? (I get the compile error: `rdtscp' is not supported on `k1om')
On current processors, the HyperThreads (or hardware threads, when speaking of Xeon Phi) of a single core share the L1 and L2 caches of that core. This also applies to the instruction cache.
>>I like the "rdtscp" solution; it is easier to port (and a little faster). Is there a version that works on Xeon Phi? (compile error: `rdtscp' is not supported on `k1om')
Doesn't this imply it does not port?
Most multi-threaded applications that are concerned about cache also perform affinity pinning, at one of the following granularities:

- Locking a software thread to a hardware thread
- Locking a software thread to a core (i.e., permitted to run on any hardware thread of a given core)
- Locking a software thread to a CPU
- Locking a software thread to a NUMA node
- Locking a software thread to within a NUMA distance

For L1 and L2 binding, the first two are used. For L3/LLC binding, CPU binding is used. For close memory, either CPU or NUMA-node binding is used.
Most users will never experience more than one NUMA hop (current and adjacent).
On KNC, there is only one timestamp counter within the CPU (one CPU on coprocessor), and this counter does not run at the same rate as the host timestamp counter. While there may be a way to nearly synchronize the timestamp counters of multiple coprocessors, doing so is likely not practical.
Jim Dempsey
RDTSCP is not a privileged instruction except in the unusual case that the CR4.TSD bit is set. I have never seen this bit set outside of some virtual machine implementations, but I have heard that some folks who are extremely paranoid about covert channels may have also used this bit to disable low-latency time stamp counter access.
Unfortunately, the RDTSCP instruction is newer than the P54C core used in the Xeon Phi, and it is not supported there. Xeon Phi works best with strong mandatory thread binding using the KMP_AFFINITY and KMP_PLACE_THREADS environment variables. It looks like the next generation Xeon Phi (Knights Landing) will support the RDTSCP instruction, because it supports the associated IA32_TSC_AUX MSR.
Concerning the caches -- when HyperThreading is enabled, the L1 Instruction Cache, L1 Data Cache, and L2 unified cache are shared by the two threads. There are lots of ways to adjust the "fairness" of the sharing using the LRU policies of the caches, but I am not aware of any Intel disclosures in this area. For homogeneous workloads, the behavior is pretty much what you would expect if the cache were evenly split between the two threads, but there are corner cases where behavior is less easy to understand (particularly if each thread wants to use more than 4 of the 8 ways of associativity of the cache).