We are studying the network traffic of HPC workloads. For this we are using the Intel MPI Library (the latest version in the Intel HPC Toolkit as of 12/10/2022) and the NAS Parallel Benchmarks (NPB 3.4.2). Before measuring network traffic, I measured performance on a single machine, with no network traffic involved. We used the following platforms:
machine 1. Xeon Silver 4310 server, 8ch, 64GB RAM, Hyper-Threading on, CentOS 7.9, Turbo on
machine 2. Xeon Silver 4214 server, 6ch, 96GB RAM, Hyper-Threading on, CentOS 7.9, no Turbo
machine 3. 4-core, 8GB RAM virtual machine on machine 1, CentOS 7.9
machine 4. 4-core, 8GB RAM virtual machine on machine 2, CentOS 7.9
Results:
Test 1. mpirun -n 4 ./bin/bt.B.x (4 processes, smaller array: 102 x 102 x 102)
machine 1. 49.87 sec
machine 2. 62.02 sec
machine 3. 43.92 sec
machine 4. 63.11 sec
Test 2. mpirun -n 4 ./bin/bt.C.x (4 processes, larger array: 162 x 162 x 162)
machine 1. 388.57 sec
machine 2. 253.40 sec
machine 3. 201.79 sec
machine 4. 256.78 sec
For test 1, the results were understandable: the performance differences were as expected.
However, test 2 showed very strange results. There are two unexpected things:
1. On the real machines, the newer (3rd) generation Xeon was much slower than the older (2nd) generation Xeon.
2. The newer (3rd) generation Xeon showed a big improvement when the benchmark was executed in the virtual machine.
As for memory, machine 2 has 1.5 times as much as machine 1 (96GB vs. 64GB). However, test 2 (bt.C.x) consumes only about 4GB (according to the free command), so the difference in memory size is unlikely to have such a big effect on the results.
I also executed the tests with Open MPI 4.1. The results are as follows:
Test 1. mpirun -np 4 ./bin/bt.B.x (4 processes, smaller array)
machine 1. 52.31 sec
machine 2. 61.73 sec
Test 2. mpirun -np 4 ./bin/bt.C.x (4 processes, larger array)
machine 1. 198.70 sec
machine 2. 252.31 sec
So it seems that the combination of Intel MPI, 3rd Gen Xeon, and large arrays may cause the performance drop, and that I cannot use Intel MPI with 3rd Gen Xeon. But Intel MPI makes it much easier to specify the fabric, so I want to use it for our network traffic evaluation if possible. To use the Intel MPI Library, I would like to know the following:
1. Why was the 3rd Gen Xeon slow? Why did the slowdown not appear in my virtual machine, even though it also runs on the 3rd Gen Xeon?
2. Why does the performance drop appear with the Intel MPI Library?
3. Is there any way to improve performance with Intel MPI on 3rd Gen Xeon?
Please help!
K. Kunita
I am sorry. In the case of MPICC, I made a mistake because I had modified the setting you recommended in your last reply. The "undefined reference" error occurred because I forgot to run "make clean". Now I can build with mpiifort.
However, mpirun -n 4 ./bin/bt.C.x gave the following results:
On the real 3rd Gen Xeon machine: 261.22 sec. On a virtual machine (KVM) on that same machine: 201.xx sec.
It is strange that the benchmark runs faster in the virtual machine than on the real machine. Normally a virtual machine should be slightly slower than the non-virtualized case.
Also, in your results, bt.A.x is faster on the 3rd Gen Xeon, but bt.C.x is slower on the 3rd Gen Xeon. That means you are seeing what I saw, and there is some issue with the Intel MPI / 3rd Gen Xeon combination. So I think you can reproduce my issue.
What do you think?
Sorry, I clicked the Post Reply button before I had a value measured under exactly the same conditions. The 201.xx sec figure was measured under somewhat different conditions. The execution time on the virtual machine in the same environment was 220.18 sec.
Regards, K. Kunita
I also tried to use DevCloud, but I could not find a way to use a 3rd/4th Gen Xeon SP. How can I set up DevCloud to use a 3rd/4th Gen Xeon SP?
Anyway, I think you can reproduce my issue in your environment, even with the new compiler. Your results show that the 3rd Gen Xeon SP is slower than the 2nd Gen Xeon in Class C, while in the same environment the Class B result is faster on the 3rd Gen Xeon SP than on the 2nd Gen Xeon SP. The normal expectation is that a newer-generation CPU shows faster results in most cases, so this result might not be good for Intel either. That is what I want you to investigate. (I also think you would see the same result as I did if you tried a virtual machine: the virtual machine result is faster than the real machine result, and it may be the result I would expect to see on the real machine.)
Regards, Kuni
Do you have any reply to my previous message?
Could you please comment on my previous message? I have been waiting for more than a month.
One thing I should apologize for: I overlooked your response about DevCloud. I have now run NPB on a 3rd Gen Xeon SP there, and the results I saw were the same as yours. I think this reproduces my issue: the 3rd Gen Xeon SP is slower than the 2nd Gen Xeon SP (what you used is a 1st Gen Xeon SP). I also built with mpiicc/mpiifort in my environment. There, the real 3rd Gen Xeon machine took 261.22 sec, and a virtual machine (KVM) on the same machine took about 201 sec. That means something is going wrong on the non-virtualized 3rd Gen Xeon with Intel MPI. I also tried Open MPI on the non-virtualized 3rd Gen Xeon. In that case the result is similar to the virtualized 3rd Gen Xeon SP with Intel MPI, and the 3rd Gen Xeon SP showed a better result than the 2nd Gen Xeon SP.
I also ran all the basic NPB tests. The results are as follows:
Benchmark | Gold 6128 | Gold 6348 | Silver 4310 | Silver 4310-VM | Silver 4214R | Silver 4214R-VM |
---|---|---|---|---|---|---|
bt.B.x n=4 | 49.76 sec | 45.59 sec | 49.87 sec | 50.56 sec | 61.20 sec | 62.37 sec |
bt.C.x n=4 | 214.96 sec | 236.99 sec | 261.87 sec | 205.56 sec | 249.57 sec | 254.04 sec |
ft.B.x n=4 | 9.40 sec | 11.74 sec | 11.37 sec | 8.65 sec | 10.11 sec | 10.44 sec |
ft.C.x n=4 | 40.59 sec | 48.76 sec | 35.71 sec | 36.32 sec | 42.75 sec | 43.51 sec |
lu.B.x n=4 | 24.95 sec | 24.07 sec | 28.75 sec | 29.59 sec | 35.64 sec | 36.59 sec |
lu.C.x n=4 | 104.15 sec | 100.05 sec | 140.86 sec | 123.15 sec | 145.38 sec | 148.16 sec |
is.B.x n=4 | 0.68 sec | 0.80 sec | 0.46 sec | 0.44 sec | 0.44 sec | 0.52 sec |
is.C.x n=4 | 2.77 sec | 2.64 sec | 1.51 sec | 1.64 sec | 1.76 sec | 1.88 sec |
is.D.x n=4 | 49.59 sec | 45.65 sec | 30.03 sec | 29.08 sec | 32.97 sec | 33.55 sec |
cg.B.x n=4 | 8.52 sec | 8.46 sec | 8.78 sec | 10.55 sec | 9.84 sec | 11.78 sec |
cg.C.x n=4 | 25.29 sec | 33.32 sec | 23.33 sec | 25.63 sec | 27.63 sec | 33.46 sec |
cg.D.x n=4 | 2224.97 sec | 1475.95 sec | 1872.25 sec | 1916.41 sec | 3448.39 sec | 3437.87 sec |
ep.B.x n=4 | 9.37 sec | 10.98 sec | 12.97 sec | 13.15 sec | 14.43 sec | 14.58 sec |
ep.C.x n=4 | 37.44 sec | 41.70 sec | 51.61 sec | 52.22 sec | 57.67 sec | 58.24 sec |
sp.B.x n=4 | 36.22 sec | 45.42 sec | 33.51 sec | 34.31 sec | 37.94 sec | 38.76 sec |
sp.C.x n=4 | 176.85 sec | 282.39 sec | 143.23 sec | 146.02 sec | 168.69 sec | 171.56 sec |
mg.B.x n=4 | 1.30 sec | 2.04 sec | 1.36 sec | 1.41 sec | 1.46 sec | 1.58 sec |
mg.C.x n=4 | 11.36 sec | 15.38 sec | 11.38 sec | 11.87 sec | 12.79 sec | 13.02 sec |
dt.B.x n=43 BH | 1.54 sec | 1.17 sec | 1.00 sec | 4.64 sec | 1.31 sec | 5.38 sec |
dt.B.x n=192 SH | 7.16 sec | 6.76 sec | 4.46 sec | 22.73 sec | 5.65 sec | 27.72 sec |
dt.C.x n=85 BH | 26.15 sec | 18.78 sec | 18.98 sec | error(memory?) | 19.44 sec | error (memory?) |
In these tests, I executed each test more than 2 times and selected the minimum execution time. If the time was less than 30 sec, I executed the test more than 3 times. The exception is cg.D.x, which was executed only once because it takes a very long time.
In the above results, bt.C.x, ft.B.x and lu.C.x showed worse results on the real 3rd Gen Xeon SP machine, whereas the virtual machine on the 3rd Gen Xeon showed better results for these tests.
I also ran Geekbench 5 on the two DevCloud hosts for reference. The single-thread results are below; every test except Text Compression is faster on the Gold 6348. The NPB results above nevertheless show some performance degradation on the 3rd Gen Xeon SP. This suggests either an issue in the Intel MPI Library on 3rd Gen Xeon SP, or that some special option is needed to run Intel MPI on 3rd Gen.
Test | Gold 6128 | Gold 6348 |
---|---|---|
Single-Core Score | 1016 | 1152 |
Crypto Score | 1355 | 2250 |
Integer Score | 961 | 1043 |
Floating Point Score | 1086 | 1205 |
AES-XTS | 1355 | 2250 |
Text Compression | 1035 | 840 |
Image Compression | 1007 | 1082 |
Navigation | 806 | 853 |
HTML5 | 907 | 1187 |
SQLite | 1013 | 1076 |
PDF Rendering | 947 | 1130 |
Text Rendering | 881 | 1075 |
Clang | 1075 | 1125 |
Camera | 1008 | 1077 |
N-Body Physics | 954 | 1083 |
Rigid Body Physics | 1067 | 1110 |
Gaussian Blur | 718 | 791 |
Face Detection | 983 | 1037 |
Horizon Detection | 826 | 968 |
Image Inpainting | 1841 | 2131 |
HDR | 1917 | 2140 |
Ray Tracing | 1366 | 1557 |
Structure from Motion | 938 | 1062 |
Speech Recognition | 1004 | 1099 |
Machine Learning | 918 | 984 |
What do you think?
Regards, K. Kunita
Hi,
We have escalated this issue to the development team and will get back to you soon.
Thanks & Regards
Shivani
I tried RoCE communication across 4 workstations. The bt.D.x test is the only test in NPB show but performance degradation with 3rd Gen Xeon. With Open MPI, the same 3rd Gen Xeon machines ran it about 2x faster.
The execution command lines are as follows.
Case 1:
Intel MPI 2nd Gen. Xeon SP with Nvidia ConnectX-5: mpirun -n 36 -ppn 9 -host svr0-100g,svr1-100g,svr2-100g,svr3-100g ./bin/bt.D.x
Result: 626.03 sec
Case 2:
Intel MPI 3rd Gen. Xeon SP with Nvidia ConnectX-6: mpirun -n 36 -ppn 9 -host svr4-100g,svr5-100g,svr6-100g,svr7-100g ./bin/bt.D.x
Result: 1011.72 sec
Case 3:
OpenMPI 2nd Gen Xeon SP with Nvidia ConnectX-5: mpirun -np 36 -host svr0-100g:9,svr1-100g:9,svr2-100g:9,svr3-100g:9 ./bin/bt.D.x
Result: 602.64 sec
Case 4:
OpenMPI 3rd Gen. Xeon SP with Nvidia ConnectX-6: mpirun -np 36 -host svr4-100g:9,svr5-100g:9,svr6-100g:9,svr7-100g:9 ./bin/bt.D.x
Result: 497.51 sec
No NPB test other than BT showed such performance degradation with Intel MPI on 3rd Gen Xeon.
Regards, K. Kunita
There is one typo: "show but performance degradation" should be "which showed performance degradation".
By the way, is there any update?
I have the same issue. The 4309Y single-core integer benchmark is lower than the 4314, 6142, E5 2643v3... too bad.
The other Xeon 43xx CPUs are fine. It is just the 4309Y.
@Homer-b83fb642db9bb85
I don't think this is the right place for you.
You might want to check which performance profile your 4309Y is running:
https://www.intel.com/content/www/us/en/products/sku/215275/intel-xeon-silver-4309y-processor-12m-cache-2-80-ghz/specifications.html
Intel® Speed Select Technology - Performance Profile (Intel® SST-PP)
Config | Active Cores | Base Frequency | TDP | Description |
---|---|---|---|---|
4309Y(0) | 8 | 2.8 GHz | 105W | |
4309(1) | 8 | 2.6 GHz | 95W | |
4309(2) | 8 | 2.3 GHz | 85W | |
If you think it's related to Intel MPI, please provide a reproducer and detailed information on your environment apart from the CPU SKU.
Best
@TobiasK Thanks.
This is an unusual performance issue: a lower-clocked CPU, e.g. the Xeon 4314, may perform better. This is just a single-core test with the same BIOS/OS settings.
This does not happen on many other generations of Xeon CPUs. How strange.
@TobiasK Thanks.
Due to other reasons, I was unable to switch operating system versions, so testing was conducted on different operating system versions. The test results were consistent when tested on the same operating system previously.
Turbostat info is in the attachment.
Xeon 4314, el7, gcc version 9.1.1 20190605, 3.10.0-1160.49.1.el7.x86_64
Xeon 4309Y, el9, gcc version 11.3.1 20221121, 5.14.0-284.18.1.el9_2.x86_64
Xeon 4309Y, gcc version 11.3.1 20221121, 5.14.0-284.18.1.el9_2.x86_64
Output: 91.6857 <------------ higher CPU MHz but takes more time
sum = 9048129480000
Test command:
g++ -O2 test.cpp -o add
numactl -C 3 ./add
#include <cstdlib>   // std::rand
#include <ctime>     // clock, CLOCKS_PER_SEC
#include <iostream>
int main()
{
    // Fill the array with pseudo-random values in [0, 255].
    const unsigned arraySize = 1048576;
    int data[arraySize];  // 1048576 ints = 4 MiB on the stack
    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;
    // Time 90000 passes that add up every element >= 128.
    clock_t start = clock();
    long long sum = 0;
    for (unsigned i = 0; i < 90000; ++i)
    {
        for (unsigned c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }
    // clock() reports processor time used by this process, in seconds here.
    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
    std::cout << elapsedTime << std::endl;
    std::cout << "sum = " << sum << std::endl;
}
lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: GenuineIntel
BIOS Vendor ID: Intel(R) Corporation
Model name: Intel(R) Xeon(R) Silver 4309Y CPU @ 2.80GHz
BIOS Model name: Intel(R) Xeon(R) Silver 4309Y CPU @ 2.80GHz
CPU family: 6
Model: 106
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
Stepping: 6
BogoMIPS: 5600.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect wbnoinvd dtherm ida arat pln pts hwp_epp avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid fsrm md_clear pconfig flush_l1d arch_capabilities
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 768 KiB (16 instances)
L1i: 512 KiB (16 instances)
L2: 20 MiB (16 instances)
L3: 24 MiB (2 instances)
NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0-7,16-23
NUMA node1 CPU(s): 8-15,24-31
Vulnerabilities:
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable
Retbleed: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Srbds: Not affected
Tsx async abort: Not affected
@TobiasK Thanks
I'm not sure what's going on, but my previous replies haven't been showing up. I'll try again.
For other reasons, I'm unable to switch operating systems. Currently, I'm comparing performance across different operating systems. However, according to previous testing results, this test script actually performs better on higher versions of GCC. Even when switching to the same operating system, this behavior persists.
1x Xeon 4314 gcc version 9.1.1 20190605 3.10.0-1160.49.1.el7.x86_64
numactl -C 3 ./add
65.21
sum = 9048129480000
2x Xeon 4309Y gcc version 11.3.1 20221121 5.14.0-284.18.1.el9_2.x86_64
numactl -C 3 ./add
91.6857 <----------------------- takes more time at the higher CPU frequency
sum = 9048129480000
Here is the build command and the test.cpp code:
g++ -O2 test.cpp -o add
#include <cstdlib>   // std::rand
#include <ctime>     // clock, CLOCKS_PER_SEC
#include <iostream>
int main()
{
    // Fill the array with pseudo-random values in [0, 255].
    const unsigned arraySize = 1048576;
    int data[arraySize];  // 1048576 ints = 4 MiB on the stack
    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;
    // Time 90000 passes that add up every element >= 128.
    clock_t start = clock();
    long long sum = 0;
    for (unsigned i = 0; i < 90000; ++i)
    {
        for (unsigned c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }
    // clock() reports processor time used by this process, in seconds here.
    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
    std::cout << elapsedTime << std::endl;
    std::cout << "sum = " << sum << std::endl;
}
@Homer-b83fb642db9bb85
According to your turbostat output, the 4309Y machine is fully loaded with other tasks, so it is no surprise that it is slower than an idle machine.
Thank you. I overlooked it.
My system shows 100% CPU busy on each core with turbostat, but the load average in top is 0. That doesn't seem right.
The power policy was set to OS control, but the cpupower command could not control it.
Disabling and then re-enabling Hyper-Threading resolved the problem of all cores showing 100% busy. Maybe I should upgrade the firmware.
top - 20:57:45 up 72 days, 15:34, 1 user, load average: 0.00, 0.00, 0.00
Tasks: 460 total, 1 running, 459 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.0 sy, 0.0 ni, 99.8 id, 0.0 wa, 0.2 hi, 0.0 si, 0.0 st
MiB Mem : 256946.5 total, 252870.0 free, 4715.9 used, 2437.1 buff/cache
MiB Swap: 4096.0 total, 0.4 free, 4095.6 used. 252230.7 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 root 20 0 170896 10172 6204 S 0.0 0.0 0:50.60 systemd
2 root 20 0 0 0 0 S 0.0 0.0 1:14.99 kthreadd
3 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_gp
4 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_par_gp
5 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 slub_flushwq
6 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 netns
8 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 kworker/0:0H-events_highpri
Turbostat
cpu8: PKG Limit #2: ENabled (126.000 Watts, 1.000000* sec, clamp ENabled)
cpu8: MSR_VR_CURRENT_CONFIG: 0x000006c8
cpu8: PKG Limit #4: 217.000000 Watts (UNlocked)
cpu8: MSR_DRAM_POWER_INFO,: 0x8028009800180090 (18 W TDP, RAPL 3 - 19 W, 0.039062 sec.)
cpu8: MSR_DRAM_POWER_LIMIT: 0x00000000 (UNlocked)
cpu8: DRAM Limit: DISabled (0.000 Watts, 0.000977 sec, clamp DISabled)
cpu0: MSR_IA32_TEMPERATURE_TARGET: 0x805b0a00 (91 C)
cpu8: MSR_IA32_TEMPERATURE_TARGET: 0x805b0a00 (91 C)
cpu0: MSR_IA32_PACKAGE_THERM_STATUS: 0x882a0800 (49 C)
cpu0: MSR_IA32_PACKAGE_THERM_INTERRUPT: 0x00000003 (91 C, 91 C)
cpu8: MSR_IA32_PACKAGE_THERM_STATUS: 0x88280800 (51 C)
cpu8: MSR_IA32_PACKAGE_THERM_INTERRUPT: 0x00000003 (91 C, 91 C)
cpu4: MSR_PKGC3_IRTL: 0x00000000 (NOTvalid, 0 ns)
cpu4: MSR_PKGC6_IRTL: 0x00000000 (NOTvalid, 0 ns)
cpu4: MSR_PKGC7_IRTL: 0x00000000 (NOTvalid, 0 ns)
Package Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IPC IRQ SMI POLL C1 C1E C6 POLL% C1% C1E% C6% CPU%c1 CPU%c6 CoreTmp CoreThr PkgTmp Pkg%pc2 Pkg%pc6 PkgWatt RAMWatt PKG_% RAM_%
- - - 3392 100.00 3400 2794 0.11 161475 0 2491818 0 0 0 99.65 0.00 0.00 0.00 0.00 0.00 51 0 51 0.00 0.00 184.71 8.90 0.00 0.00
0 0 0 3392 100.00 3400 2793 0.11 5009 0 77835 0 0 0 99.60 0.00 0.00 0.00 0.00 0.00 50 0 50 0.00 0.00 94.94 4.78 0.00 0.00
0 0 16 3392 100.00 3400 2793 0.11 5011 0 77844 0 0 0 99.63 0.00 0.00 0.00 0.00
0 1 1 3392 100.00 3400 2793 0.11 5019 0 77766 0 0 0 99.63 0.00 0.00 0.00 0.00 0.00 50 0
0 1 17 3392 100.00 3400 2793 0.11 5011 0 77728 0 0 0 99.63 0.00 0.00 0.00 0.00
0 2 2 3392 100.00 3400 2793 0.11 5013 0 77863 0 0 0 99.64 0.00 0.00 0.00 0.00 0.00 48 0
0 2 18 3392 100.00 3400 2793 0.11 5011 0 77867 0 0 0 99.64 0.00 0.00 0.00 0.00
0 3 3 3392 100.00 3400 2793 0.11 5011 0 77863 0 0 0 99.64 0.00 0.00 0.00 0.00 0.00 46 0
0 3 19 3392 100.00 3400 2793 0.11 5011 0 77848 0 0 0 99.64 0.00 0.00 0.00 0.00
0 4 4 3392 100.00 3400 2793 0.11 5103 0 77929 0 0 0 99.66 0.00 0.00 0.00 0.00 0.00 47 0
0 4 20 3392 100.00 3400 2793 0.11 5011 0 77858 0 0 0 99.63 0.00 0.00 0.00 0.00
0 5 5 3392 100.00 3400 2793 0.11 5008 0 77845 0 0 0 99.64 0.00 0.00 0.00 0.00 0.00 47 0
0 5 21 3392 100.00 3400 2793 0.11 5009 0 77842 0 0 0 99.64 0.00 0.00 0.00 0.00
0 6 6 3392 100.00 3400 2793 0.11 5011 0 77904 0 0 0 99.64 0.00 0.00 0.00 0.00 0.00 47 0
0 6 22 3392 100.00 3400 2793 0.11 5009 0 77828 0 0 0 99.64 0.00 0.00 0.00 0.00
0 7 7 3392 100.00 3400 2793 0.11 5009 0 77851 0 0 0 99.64 0.00 0.00 0.00 0.00 0.00 46 0
0 7 23 3392 100.00 3400 2793 0.11 5012 0 77836 0 0 0 99.64 0.00 0.00 0.00 0.00
1 0 8 3392 100.00 3400 2793 0.11 5007 0 77842 0 0 0 99.64 0.00 0.00 0.00 0.00 0.00 50 0 51 0.00 0.00 89.75 4.12 0.00 0.00
1 0 24 3392 100.00 3400 2793 0.11 5007 0 77821 0 0 0 99.64 0.00 0.00 0.00 0.00
1 1 9 3392 100.00 3400 2793 0.11 5009 0 77856 0 0 0 99.62 0.00 0.00 0.00 0.00 0.00 51 0
1 1 25 3392 100.00 3400 2793 0.11 5105 0 77930 0 0 0 99.66 0.00 0.00 0.00 0.00
1 2 10 3392 100.00 3400 2793 0.11 5011 0 77868 0 0 0 99.63 0.00 0.00 0.00 0.00 0.00 51 0
1 2 26 3392 100.00 3400 2793 0.11 5007 0 77837 0 0 0 99.64 0.00 0.00 0.00 0.00
1 3 11 3392 100.00 3400 2793 0.11 5195 0 77977 0 0 0 99.65 0.00 0.00 0.00 0.00 0.00 50 0
1 3 27 3392 100.00 3400 2793 0.11 5101 0 77952 0 0 0 99.64 0.00 0.00 0.00 0.00
1 4 12 3392 100.00 3400 2793 0.11 5100 0 77943 0 0 0 99.65 0.00 0.00 0.00 0.00 0.00 49 0
1 4 28 3392 100.00 3400 2793 0.11 5009 0 77825 0 0 0 99.63 0.00 0.00 0.00 0.00
1 5 13 3392 100.00 3400 2793 0.11 5140 0 77953 0 0 0 99.64 0.00 0.00 0.00 0.00 0.00 50 0
1 5 29 3392 100.00 3400 2793 0.11 5108 0 77917 0 0 0 99.65 0.00 0.00 0.00 0.00
1 6 14 3392 100.00 3400 2793 0.11 5098 0 77905 0 0 0 99.65 0.00 0.00 0.00 0.00 0.00 50 0
1 6 30 3392 100.00 3400 2793 0.11 5009 0 77838 0 0 0 99.63 0.00 0.00 0.00 0.00
1 7 15 3392 100.00 3400 2793 0.11 5107 0 77895 0 0 0 99.65 0.00 0.00 0.00 0.00 0.00 49 0
1 7 31 3392 100.00 3400 2793 0.11 5194 0 77952 0 0 0 99.63 0.00 0.00 0.00 0.00
cpupower
cpupower frequency-info
analyzing CPU 0:
no or unknown cpufreq driver is active on this CPU
CPUs which run at the same hardware frequency: Not Available
CPUs which need to have their frequency coordinated by software: Not Available
maximum transition latency: Cannot determine or is not supported.
Not Available
available cpufreq governors: Not Available
Unable to determine current policy
current CPU frequency: Unable to call hardware
current CPU frequency: Unable to call to kernel
boost state support:
Supported: yes
Active: yes
After reboot
Package Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IPC IRQ SMI POLL C1 C1E C6 POLL% C1% C1E% C6% CPU%c1 CPU%c6 CoreTmp CoreThr PkgTmp Pkg%pc2 Pkg%pc6 PkgWatt RAMWatt PKG_% RAM_%
- - - 113 3.17 3559 2793 3.98 6423 0 8 103 366 2429 0.00 0.00 0.09 96.75 8.48 88.35 44 0 44 49.61 0.00 131.82 7.31 0.00 0.00
0 0 0 2 0.31 800 2793 0.90 123 0 0 0 2 496 0.00 0.00 0.03 99.72 40.51 59.18 36 0 44 0.00 0.00 74.95 4.72 0.00 0.00
0 0 16 1 0.12 801 2793 0.42 121 0 0 6 39 295 0.00 0.00 0.12 99.81 40.70
0 1 1 0 0.03 800 2793 0.58 41 0 0 0 1 49 0.00 0.00 0.00 99.97 2.25 97.72 36 0
0 1 17 0 0.01 801 2793 0.69 19 0 0 0 1 31 0.00 0.00 0.00 99.99 2.27
0 2 2 0 0.02 800 2793 0.62 28 0 0 0 1 34 0.00 0.00 0.01 99.97 1.31 98.67 33 0
0 2 18 0 0.01 800 2793 0.71 27 0 0 0 1 33 0.00 0.00 0.01 99.98 1.31
0 3 3 3590 100.00 3598 2793 3.99 5018 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 0.00 44 0
0 3 19 0 0.00 3480 2793 0.47 26 0 0 0 1 26 0.00 0.00 0.01 99.99 100.00
0 4 4 0 0.02 800 2793 0.63 29 0 0 0 1 34 0.00 0.00 0.00 99.99 1.16 98.83 36 0
0 4 20 0 0.01 800 2793 0.71 30 0 0 0 0 35 0.00 0.00 0.00 99.99 1.16
0 5 5 1 0.12 800 2793 0.62 77 0 0 0 2 198 0.00 0.00 0.03 99.88 13.20 86.68 31 0
0 5 21 0 0.04 800 2793 0.38 34 0 0 0 1 155 0.00 0.00 0.01 99.97 13.28
0 6 6 0 0.02 800 2793 0.54 37 0 0 0 17 46 0.00 0.00 0.10 99.89 3.09 96.88 33 0
0 6 22 0 0.03 800 2793 0.57 43 0 0 0 7 60 0.00 0.00 0.05 99.92 3.09
0 7 7 1 0.06 801 2793 0.28 30 0 0 0 1 206 0.00 0.00 0.01 99.96 22.39 77.55 32 0
0 7 23 2 0.19 800 2793 0.53 369 0 6 17 151 376 0.00 0.00 0.44 99.41 22.26
1 0 8 0 0.01 800 2793 0.60 16 0 0 0 7 19 0.00 0.00 0.12 99.87 0.15 99.84 32 0 35 99.23 0.00 56.87 2.59 0.00 0.00
1 0 24 0 0.01 800 2793 0.62 16 0 0 0 7 17 0.00 0.00 0.13 99.86 0.15
1 1 9 0 0.01 800 2793 0.62 17 0 0 0 3 22 0.00 0.00 0.05 99.94 0.22 99.77 34 0
1 1 25 0 0.03 800 2793 0.78 57 0 2 44 24 13 0.00 0.00 0.18 99.79 0.20
1 2 10 0 0.02 800 2793 0.56 20 0 0 0 8 25 0.00 0.00 0.14 99.85 0.15 99.83 34 0
1 2 26 0 0.01 800 2793 0.60 15 0 0 0 7 21 0.00 0.00 0.13 99.86 0.16
1 3 11 0 0.01 800 2793 0.61 15 0 0 0 7 18 0.00 0.00 0.13 99.86 0.15 99.84 32 0
1 3 27 0 0.02 800 2793 0.65 14 0 0 0 7 21 0.00 0.00 0.12 99.86 0.14
1 4 12 1 0.07 800 2793 2.05 23 0 0 0 9 34 0.00 0.00 0.15 99.79 0.28 99.66 31 0
1 4 28 0 0.02 800 2793 0.50 27 0 0 0 10 34 0.00 0.00 0.19 99.79 0.32
1 5 13 0 0.01 800 2793 0.70 15 0 0 0 1 20 0.00 0.00 0.01 99.98 0.22 99.77 32 0
1 5 29 0 0.03 800 2793 0.76 52 0 0 36 17 10 0.00 0.06 0.13 99.79 0.20
1 6 14 0 0.03 800 2793 0.48 27 0 0 0 9 30 0.00 0.00 0.16 99.82 0.25 99.73 33 0
1 6 30 0 0.02 800 2793 0.53 21 0 0 0 10 31 0.00 0.00 0.19 99.79 0.26
1 7 15 0 0.01 800 2793 0.60 13 0 0 0 5 19 0.00 0.00 0.09 99.90 0.27 99.71 30 0
1 7 31 1 0.09 800 2793 2.27 23 0 0 0 9 21 0.00 0.00 0.17 99.74 0.19
