I'm trying to run the optimized LINPACK on a Hyper-Threading-enabled system. No matter which KMP_AFFINITY options I choose, only half of the threads run. Here is the latest run:
[kirk] (uid) linpack> numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14
node 0 size: 12279 MB
node 0 free: 11031 MB
node 1 cpus: 1 3 5 7 9 11 13 15
node 1 size: 12288 MB
node 1 free: 11907 MB
node distances:
node 0 1
0: 10 20
1: 20 10
[kirk] (uid) linpack> ./runme_xeon64
This is a SAMPLE run script. Change it to reflect the correct number
of CPUs/threads, problem input files, etc..
Thu Jul 23 17:04:41 MST 2009
OMP: Warning #190: Bad message catalog "libiomp5.cat": Version "2" found, version "1" expected.
OMP: Hint: Check NLSPATH environment variable, its value is "/opt/intel/Compiler/11.0/083/mkl/lib/64/locale/%l_%t/%N:/opt/intel/mkl/10.1.2.024/lib/em64t/locale/%l_%t/%N:/opt/intel/Compiler/11.0/083/lib/intel64/locale/%l_%t/%N:/opt/intel/Compiler/11.0/083/ipp/em64t/lib/locale/%l_%t/%N:/opt/intel/Compiler/11.0/083/mkl/lib/em64t/locale/%l_%t/%N:/opt/intel/Compiler/11.0/083/idb/intel64/locale/%l_%t/%N:/opt/intel/Compiler/11.0/083/lib/intel64/locale/%l_%t/%N:/opt/intel/Compiler/11.0/083/ipp/em64t/lib/locale/%l_%t/%N:/opt/intel/Compiler/11.0/083/mkl/lib/em64t/locale/%l_%t/%N:/opt/intel/Compiler/11.0/083/idb/intel64/locale/%l_%t/%N".
OMP: Info #3: Default messages will be used.
OMP: Info #157: KMP_AFFINITY: Affinity capable, using global cpuid instr info
OMP: Info #162: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #164: KMP_AFFINITY: 16 available OS procs
OMP: Info #165: KMP_AFFINITY: Uniform topology
OMP: Info #167: KMP_AFFINITY: 2 packages x 4 cores/pkg x 2 threads/core (8 total cores)
OMP: Info #168: KMP_AFFINITY: OS proc to physical thread map ([] => level not in map):
OMP: Info #178: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 thread 0
OMP: Info #178: KMP_AFFINITY: OS proc 8 maps to package 0 core 0 thread 1
OMP: Info #178: KMP_AFFINITY: OS proc 2 maps to package 0 core 1 thread 0
OMP: Info #178: KMP_AFFINITY: OS proc 10 maps to package 0 core 1 thread 1
OMP: Info #178: KMP_AFFINITY: OS proc 4 maps to package 0 core 2 thread 0
OMP: Info #178: KMP_AFFINITY: OS proc 12 maps to package 0 core 2 thread 1
OMP: Info #178: KMP_AFFINITY: OS proc 6 maps to package 0 core 3 thread 0
OMP: Info #178: KMP_AFFINITY: OS proc 14 maps to package 0 core 3 thread 1
OMP: Info #178: KMP_AFFINITY: OS proc 1 maps to package 1 core 0 thread 0
OMP: Info #178: KMP_AFFINITY: OS proc 9 maps to package 1 core 0 thread 1
OMP: Info #178: KMP_AFFINITY: OS proc 3 maps to package 1 core 1 thread 0
OMP: Info #178: KMP_AFFINITY: OS proc 11 maps to package 1 core 1 thread 1
OMP: Info #178: KMP_AFFINITY: OS proc 5 maps to package 1 core 2 thread 0
OMP: Info #178: KMP_AFFINITY: OS proc 13 maps to package 1 core 2 thread 1
OMP: Info #178: KMP_AFFINITY: OS proc 7 maps to package 1 core 3 thread 0
OMP: Info #178: KMP_AFFINITY: OS proc 15 maps to package 1 core 3 thread 1
OMP: Info #155: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0}
OMP: Info #155: KMP_AFFINITY: Internal thread 1 bound to OS proc set {1}
OMP: Info #155: KMP_AFFINITY: Internal thread 2 bound to OS proc set {2}
OMP: Info #155: KMP_AFFINITY: Internal thread 3 bound to OS proc set {3}
OMP: Info #155: KMP_AFFINITY: Internal thread 4 bound to OS proc set {4}
OMP: Info #155: KMP_AFFINITY: Internal thread 5 bound to OS proc set {5}
OMP: Info #155: KMP_AFFINITY: Internal thread 6 bound to OS proc set {6}
OMP: Info #155: KMP_AFFINITY: Internal thread 7 bound to OS proc set {7}
Done: Thu Jul 23 17:12:03 MST 2009
[kirk] (uid) linpack> cat lin_xeon64.txt
Thu Jul 23 17:04:41 MST 2009
Intel LINPACK data
Current date/time: Thu Jul 23 17:04:41 2009
CPU frequency: 2.666 GHz
Number of CPUs: 16
Number of threads: 16
Parameters are set to:
Number of tests : 1
Number of equations to solve (problem size) : 35000
Leading dimension of array : 45000
Number of trials to run : 1
Data alignment value (in Kbytes) : 1
Maximum memory requested that can be used = 12600901024, at the size = 35000
============= Timing linear equation system solver =================
Size LDA Align. Time(s) GFlops Residual Residual(norm)
35000 45000 1 366.288 78.0419 1.073967e-09 3.117562e-02
Performance Summary (GFlops)
Size LDA Align. Average Maximal
35000 45000 1 78.0419 78.0419
End of tests
Thu Jul 23 17:12:03 MST 2009
[kirk] (uid) linpack> cat runme_xeon64
#!/bin/bash
#
export KMP_AFFINITY=nowarnings,verbose,granularity=fine,proclist=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],explicit
echo "This is a SAMPLE run script. Change it to reflect the correct number"
echo "of CPUs/threads, problem input files, etc.."
date
date > lin_xeon64.txt
./xlinpack_xeon64 lininput_xeon64 >> lin_xeon64.txt
date >> lin_xeon64.txt
echo -n "Done: "
date
This is the latest run, but I also tried granularity=fine,compact and several other options. I expected to get a thread running on each of the 16 logical processors. Even the output states 16 CPUs and 16 threads, but only 8 run. Could the system be configured wrong? Any help would be appreciated.
Regards, Steve.
4 Replies
Hello Steve,
The KMP_AFFINITY variable helps bind threads to CPU cores.
Have you tried setting the OMP_NUM_THREADS variable, for example, export OMP_NUM_THREADS=16?
As far as I know, the current MKL version by default spawns only half as many threads as there are logical processors on a Hyper-Threading-enabled system, because running a thread on every logical processor may not benefit performance and sometimes hurts it.
Here is the explanation from the MKL User's Guide for your reference:
The use of Hyper-Threading Technology:
Hyper-Threading Technology (HT Technology) is especially effective when each thread is performing different types of operations and when there are under-utilized resources on the processor.
However, Intel MKL fits neither of these criteria because the threaded portions of the library execute at high efficiencies using most of the available resources and perform identical operations on each thread. You may obtain higher performance by disabling HT Technology.
By default, MKL creates threads according to the number of physical cores, so I guess that is why you see only 8 threads run.
Best Regards,
Ying
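To make the physical-versus-logical distinction concrete, here is a minimal sketch that reads the "Info #167" topology line printed by KMP_AFFINITY=verbose and derives both counts. The helper name topology_counts is hypothetical; it is not part of MKL or the OpenMP runtime, and it assumes the line has the same format as in the output posted above.

```shell
#!/bin/bash
# Hypothetical helper: derive physical-core and logical-processor counts
# from the "Info #167" line that KMP_AFFINITY=verbose prints.
topology_counts() {
    # Expects a line of the form:
    #   OMP: Info #167: KMP_AFFINITY: 2 packages x 4 cores/pkg x 2 threads/core (8 total cores)
    local line="$1"
    local pkgs cores threads
    pkgs=$(echo "$line" | sed -n 's/.* \([0-9][0-9]*\) packages x.*/\1/p')
    cores=$(echo "$line" | sed -n 's/.* x \([0-9][0-9]*\) cores\/pkg.*/\1/p')
    threads=$(echo "$line" | sed -n 's/.* x \([0-9][0-9]*\) threads\/core.*/\1/p')
    # Bail out if the line did not match the assumed format.
    if [ -z "$pkgs" ] || [ -z "$cores" ] || [ -z "$threads" ]; then
        echo "unrecognized topology line" >&2
        return 1
    fi
    echo "physical_cores=$((pkgs * cores)) logical_procs=$((pkgs * cores * threads))"
}

# Usage with the line from the run above:
topology_counts "OMP: Info #167: KMP_AFFINITY: 2 packages x 4 cores/pkg x 2 threads/core (8 total cores)"
# prints: physical_cores=8 logical_procs=16
```

On this system the first number (8) matches the thread count observed in the run, and the second (16) is what the "Number of CPUs" line reports.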
Ying,
Thank you for your reply. I had tried the OMP_NUM_THREADS variable and could never get more than 8 threads. What made me think I could get more was the output that states 16 CPUs / 16 threads. I think that needs to be fixed. I accept now that the program will only run 1 thread per core, especially if the performance would be worse with more threads.
Best Regards,
Steve
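For anyone comparing the "Number of threads: 16" line against what actually ran, one rough check is to count the "Internal thread N bound to OS proc set" lines in the verbose output. The sketch below assumes the output format shown in the first post; count_bound_threads is a hypothetical helper name, and the count only reflects threads for which the runtime printed a binding message.

```shell
#!/bin/bash
# Hypothetical helper: read KMP_AFFINITY=verbose output on stdin and print
# how many "Internal thread N bound to OS proc set" lines it contains.
count_bound_threads() {
    grep -c 'KMP_AFFINITY: Internal thread [0-9]* bound to OS proc set'
}

# Possible usage (not run here), capturing both stdout and stderr:
#   ./runme_xeon64 2>&1 | count_bound_threads
```

Applied to the run in the first post, this would count the eight binding lines for internal threads 0 through 7.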
Steve,
Try exporting both the MKL_DYNAMIC=FALSE and OMP_NUM_THREADS=16 environment variables.
Currently MKL detects the number of physical cores and limits threading to that number to avoid over-threading. (With Hyper-Threading enabled, that is only half of the logical processors.)
To change this behavior, use the following two environment variables:
export MKL_DYNAMIC=FALSE
export MKL_NUM_THREADS=16 (or OMP_NUM_THREADS=16)
Thanks,
Chao
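Put together with the sample script from the first post, the suggestion might look like the sketch below. It assumes xlinpack_xeon64 and lininput_xeon64 are in the current directory, as in the original runme_xeon64; the existence check and the choice of granularity=fine,compact (one of the settings already tried earlier in the thread) are illustrative additions, not part of the original script.

```shell
#!/bin/bash
# Sketch of runme_xeon64 with the thread-count settings suggested above.
export MKL_DYNAMIC=FALSE      # ask MKL not to lower the requested thread count on its own
export MKL_NUM_THREADS=16     # one thread per logical processor on this 2 x 4 x 2 system
export OMP_NUM_THREADS=16     # same request at the OpenMP level
export KMP_AFFINITY=nowarnings,verbose,granularity=fine,compact

# Guard added for illustration: only run the benchmark if the binary is present.
if [ -x ./xlinpack_xeon64 ]; then
    date > lin_xeon64.txt
    ./xlinpack_xeon64 lininput_xeon64 >> lin_xeon64.txt
    date >> lin_xeon64.txt
else
    echo "xlinpack_xeon64 not found; variables exported but benchmark not run" >&2
fi
```

With verbose still set, the binding messages in the output show whether 16 internal threads were actually created. Whether 16 threads run faster than 8 on this machine is a separate question that only a measurement can answer.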
Always run Linpack without hyperthreading to utilize all the threads. Linpack is not meant to be run with hyperthreading on.