Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.
1750 Discussions

Different inter-core latency measured on two identical Skylake Xeon Gold 6154 systems

AATLI3
Beginner
1,889 Views

We have been using two same Skylake servers with completely the same softwares, Centos 7 OS and BIOS settings. Everything is the same, except the latency performance. Our software is using AVX512.

In tests, I noticed that something slows down performance (increasing latency) in the one of the systems each time. There is a significant performance difference. I checked everything, all are the same. In the attached code, I'm trying to measure inter-cpu latency and even this shows me a proof of slowness, but for the real code, I'm using SIMD/AVX512 for math operations as well and in the field working application, there is an incredible difference between the same systems. How can this happen?

What should I do to solve this problem? Which tool can help?

Thanks in advance..

sudo lshw -class cpu
  *-cpu:0                   
       description: CPU
       product: Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz
       vendor: Intel Corp.
       vendor_id: GenuineIntel
       physical id: 400
       bus info: cpu@0
       version: Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz
       slot: CPU1
       size: 3GHz
       capacity: 4GHz
       width: 64 bits
       clock: 1010MHz
       capabilities: lm fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp x86-64 constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_ppin intel_pt ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear spec_ctrl intel_stibp flush_l1d
       configuration: cores=18 enabledcores=18 threads=18
  *-cpu:1 DISABLED
       description: CPU [empty]
       physical id: 401
       slot: CPU2

I also attached the following sample code as an example to measure latency for different mesh network paths on the same die, different cpus have different inter-cpu latency. But still one of the same systems is 25% slower than the other in all case.

Also I don't know if it affects, but the slow CPU has extra md_clear CPU feature flag.

Both system's output as follows: (2-17 are isolated cpus, and the attached code ignores the first iteration for cold cache, thread startup, cache miss etc..)

            slow cpu    fast cpu
------------------------------------                

(M=2-W=3)   mean=580    mean=374
(M=2-W=4)   mean=463    mean=365
(M=2-W=5)   mean=449    mean=391
(M=2-W=6)   mean=484    mean=345
(M=2-W=7)   mean=430    mean=386
(M=2-W=8)   mean=439    mean=369
(M=2-W=9)   mean=445    mean=376
(M=2-W=10)  mean=480    mean=354
(M=2-W=11)  mean=440    mean=392
(M=2-W=12)  mean=475    mean=324
(M=2-W=13)  mean=453    mean=373
(M=2-W=14)  mean=474    mean=344
(M=2-W=15)  mean=445    mean=384
(M=2-W=16)  mean=468    mean=372
(M=2-W=17)  mean=462    mean=373
(M=3-W=2)   mean=447    mean=392
(M=3-W=4)   mean=556    mean=386
(M=3-W=5)   mean=418    mean=409
(M=3-W=6)   mean=473    mean=372
(M=3-W=7)   mean=397    mean=400
(M=3-W=8)   mean=408    mean=403
(M=3-W=9)   mean=412    mean=413
(M=3-W=10)  mean=447    mean=389
(M=3-W=11)  mean=412    mean=423
(M=3-W=12)  mean=446    mean=399
(M=3-W=13)  mean=427    mean=407
(M=3-W=14)  mean=445    mean=390
(M=3-W=15)  mean=417    mean=448
(M=3-W=16)  mean=438    mean=386
(M=3-W=17)  mean=435    mean=396
(M=4-W=2)   mean=463    mean=368
(M=4-W=3)   mean=433    mean=401
(M=4-W=5)   mean=561    mean=406
(M=4-W=6)   mean=468    mean=378
(M=4-W=7)   mean=416    mean=387
(M=4-W=8)   mean=425    mean=386
(M=4-W=9)   mean=425    mean=415
(M=4-W=10)  mean=464    mean=379
(M=4-W=11)  mean=424    mean=404
(M=4-W=12)  mean=456    mean=369
(M=4-W=13)  mean=441    mean=395
(M=4-W=14)  mean=460    mean=378
(M=4-W=15)  mean=427    mean=405
(M=4-W=16)  mean=446    mean=369
(M=4-W=17)  mean=448    mean=391
(M=5-W=2)   mean=447    mean=382
(M=5-W=3)   mean=418    mean=406
(M=5-W=4)   mean=430    mean=397
(M=5-W=6)   mean=584    mean=386
(M=5-W=7)   mean=399    mean=399
(M=5-W=8)   mean=404    mean=386
(M=5-W=9)   mean=408    mean=408
(M=5-W=10)  mean=446    mean=378
(M=5-W=11)  mean=411    mean=407
(M=5-W=12)  mean=440    mean=385
(M=5-W=13)  mean=424    mean=402
(M=5-W=14)  mean=442    mean=381
(M=5-W=15)  mean=411    mean=411
(M=5-W=16)  mean=433    mean=398
(M=5-W=17)  mean=429    mean=395
(M=6-W=2)   mean=486    mean=356
(M=6-W=3)   mean=453    mean=388
(M=6-W=4)   mean=471    mean=353
(M=6-W=5)   mean=452    mean=388
(M=6-W=7)   mean=570    mean=360
(M=6-W=8)   mean=444    mean=377
(M=6-W=9)   mean=450    mean=376
(M=6-W=10)  mean=485    mean=335
(M=6-W=11)  mean=451    mean=410
(M=6-W=12)  mean=479    mean=353
(M=6-W=13)  mean=463    mean=363
(M=6-W=14)  mean=479    mean=359
(M=6-W=15)  mean=450    mean=394
(M=6-W=16)  mean=473    mean=364
(M=6-W=17)  mean=469    mean=373
(M=7-W=2)   mean=454    mean=365
(M=7-W=3)   mean=418    mean=410
(M=7-W=4)   mean=443    mean=370
(M=7-W=5)   mean=421    mean=407
(M=7-W=6)   mean=456    mean=363
(M=7-W=8)   mean=527    mean=380
(M=7-W=9)   mean=417    mean=392
(M=7-W=10)  mean=460    mean=361
(M=7-W=11)  mean=421    mean=402
(M=7-W=12)  mean=447    mean=354
(M=7-W=13)  mean=430    mean=381
(M=7-W=14)  mean=449    mean=375
(M=7-W=15)  mean=420    mean=393
(M=7-W=16)  mean=442    mean=352
(M=7-W=17)  mean=438    mean=367
(M=8-W=2)   mean=463    mean=382
(M=8-W=3)   mean=434    mean=411
(M=8-W=4)   mean=452    mean=372
(M=8-W=5)   mean=429    mean=402
(M=8-W=6)   mean=469    mean=368
(M=8-W=7)   mean=416    mean=418
(M=8-W=9)   mean=560    mean=418
(M=8-W=10)  mean=468    mean=385
(M=8-W=11)  mean=429    mean=394
(M=8-W=12)  mean=460    mean=378
(M=8-W=13)  mean=439    mean=392
(M=8-W=14)  mean=459    mean=373
(M=8-W=15)  mean=429    mean=383
(M=8-W=16)  mean=452    mean=376
(M=8-W=17)  mean=449    mean=401
(M=9-W=2)   mean=440    mean=368
(M=9-W=3)   mean=410    mean=398
(M=9-W=4)   mean=426    mean=385
(M=9-W=5)   mean=406    mean=403
(M=9-W=6)   mean=447    mean=378
(M=9-W=7)   mean=393    mean=427
(M=9-W=8)   mean=408    mean=368
(M=9-W=10)  mean=580    mean=392
(M=9-W=11)  mean=408    mean=387
(M=9-W=12)  mean=433    mean=381
(M=9-W=13)  mean=418    mean=444
(M=9-W=14)  mean=441    mean=407
(M=9-W=15)  mean=408    mean=401
(M=9-W=16)  mean=427    mean=376
(M=9-W=17)  mean=426    mean=383
(M=10-W=2)  mean=478    mean=361
(M=10-W=3)  mean=446    mean=379
(M=10-W=4)  mean=461    mean=350
(M=10-W=5)  mean=445    mean=373
(M=10-W=6)  mean=483    mean=354
(M=10-W=7)  mean=428    mean=370
(M=10-W=8)  mean=436    mean=355
(M=10-W=9)  mean=448    mean=390
(M=10-W=11) mean=569    mean=350
(M=10-W=12) mean=473    mean=337
(M=10-W=13) mean=454    mean=370
(M=10-W=14) mean=474    mean=360
(M=10-W=15) mean=441    mean=370
(M=10-W=16) mean=463    mean=354
(M=10-W=17) mean=462    mean=358
(M=11-W=2)  mean=447    mean=384
(M=11-W=3)  mean=411    mean=408
(M=11-W=4)  mean=433    mean=394
(M=11-W=5)  mean=413    mean=428
(M=11-W=6)  mean=455    mean=383
(M=11-W=7)  mean=402    mean=395
(M=11-W=8)  mean=407    mean=418
(M=11-W=9)  mean=417    mean=424
(M=11-W=10) mean=452    mean=395
(M=11-W=12) mean=577    mean=406
(M=11-W=13) mean=426    mean=402
(M=11-W=14) mean=442    mean=412
(M=11-W=15) mean=408    mean=411
(M=11-W=16) mean=435    mean=400
(M=11-W=17) mean=431    mean=415
(M=12-W=2)  mean=473    mean=352
(M=12-W=3)  mean=447    mean=381
(M=12-W=4)  mean=461    mean=361
(M=12-W=5)  mean=445    mean=366
(M=12-W=6)  mean=483    mean=322
(M=12-W=7)  mean=431    mean=358
(M=12-W=8)  mean=438    mean=340
(M=12-W=9)  mean=448    mean=409
(M=12-W=10) mean=481    mean=334
(M=12-W=11) mean=447    mean=351
(M=12-W=13) mean=580    mean=383
(M=12-W=14) mean=473    mean=359
(M=12-W=15) mean=441    mean=385
(M=12-W=16) mean=463    mean=355
(M=12-W=17) mean=462    mean=358
(M=13-W=2)  mean=450    mean=385
(M=13-W=3)  mean=420    mean=410
(M=13-W=4)  mean=440    mean=396
(M=13-W=5)  mean=418    mean=402
(M=13-W=6)  mean=461    mean=385
(M=13-W=7)  mean=406    mean=391
(M=13-W=8)  mean=415    mean=382
(M=13-W=9)  mean=421    mean=402
(M=13-W=10) mean=457    mean=376
(M=13-W=11) mean=422    mean=409
(M=13-W=12) mean=451    mean=381
(M=13-W=14) mean=579    mean=375
(M=13-W=15) mean=430    mean=402
(M=13-W=16) mean=440    mean=408
(M=13-W=17) mean=439    mean=394
(M=14-W=2)  mean=477    mean=330
(M=14-W=3)  mean=449    mean=406
(M=14-W=4)  mean=464    mean=355
(M=14-W=5)  mean=450    mean=389
(M=14-W=6)  mean=487    mean=342
(M=14-W=7)  mean=432    mean=380
(M=14-W=8)  mean=439    mean=360
(M=14-W=9)  mean=451    mean=405
(M=14-W=10) mean=485    mean=356
(M=14-W=11) mean=447    mean=398
(M=14-W=12) mean=479    mean=338
(M=14-W=13) mean=455    mean=382
(M=14-W=15) mean=564    mean=383
(M=14-W=16) mean=481    mean=361
(M=14-W=17) mean=465    mean=351
(M=15-W=2)  mean=426    mean=409
(M=15-W=3)  mean=395    mean=424
(M=15-W=4)  mean=412    mean=427
(M=15-W=5)  mean=395    mean=425
(M=15-W=6)  mean=435    mean=391
(M=15-W=7)  mean=379    mean=405
(M=15-W=8)  mean=388    mean=412
(M=15-W=9)  mean=399    mean=432
(M=15-W=10) mean=432    mean=389
(M=15-W=11) mean=397    mean=432
(M=15-W=12) mean=426    mean=393
(M=15-W=13) mean=404    mean=407
(M=15-W=14) mean=429    mean=412
(M=15-W=16) mean=539    mean=391
(M=15-W=17) mean=414    mean=397
(M=16-W=2)  mean=456    mean=368
(M=16-W=3)  mean=422    mean=406
(M=16-W=4)  mean=445    mean=384
(M=16-W=5)  mean=427    mean=397
(M=16-W=6)  mean=462    mean=348
(M=16-W=7)  mean=413    mean=408
(M=16-W=8)  mean=419    mean=361
(M=16-W=9)  mean=429    mean=385
(M=16-W=10) mean=463    mean=369
(M=16-W=11) mean=426    mean=404
(M=16-W=12) mean=454    mean=391
(M=16-W=13) mean=434    mean=378
(M=16-W=14) mean=454    mean=412
(M=16-W=15) mean=424    mean=416
(M=16-W=17) mean=578    mean=378
(M=17-W=2)  mean=460    mean=402
(M=17-W=3)  mean=419    mean=381
(M=17-W=4)  mean=446    mean=394
(M=17-W=5)  mean=424    mean=422
(M=17-W=6)  mean=468    mean=369
(M=17-W=7)  mean=409    mean=401
(M=17-W=8)  mean=418    mean=405
(M=17-W=9)  mean=428    mean=414
(M=17-W=10) mean=459    mean=369
(M=17-W=11) mean=424    mean=387
(M=17-W=12) mean=451    mean=372
(M=17-W=13) mean=435    mean=382
(M=17-W=14) mean=459    mean=369
(M=17-W=15) mean=426    mean=401
(M=17-W=16) mean=446    mean=371

 

0 Kudos
3 Replies
McCalpinJohn
Honored Contributor III
1,890 Views

I don't understand your code, but these 18-core processors each have 6 disabled cores, so the positions of the logical processors are likely to be different on many chips. 

There is a bit map of the enabled L3 slices in a PCI configuration space register known as CAPID6.   On a typical 2-socket system, these can be found with:

# setpci -s 17:1e.3 0x9c.l
0ffbfdff

# setpci -s 85:1e.3 0x9c.l
0fff9fff

These examples are from two 26-core processors in a 2-socket node.  The low-order 28 bits of this 32-bit word are a bit map of the enabled L3 slices.  Every system that I have checked has the same number of L3 slices disabled in the upper 14 bits and the lower 14 bits (probably to better support SNC mode).   In this example, the processor in socket 0 has L3 slices disabled at positions 9 and 18, while the processor in socket 1 has L3 slices disabled at positions 13 and 14.  Some discussion about how the bit positions map to locations on the die is in my presentation at https://www.ixpug.org/documents/1524216121knl_skx_topology_coherence_2018-03-23.pptx

The position of the cores is not the only issue, unfortunately.  The latency depends on that, but it is also strongly dependent on the location of the L3 slice that is responsible for handling the cacheline address you are using.    Cores that are adjacent can have very low cache-to-cache intervention latency if they are exchanging an address mapped to one of the co-located L3 slices, or they can have a very high cache-to-cache intervention latency if they are exchanging an address mapped to a distant L3 slice.   Cores that are further apart have less variation by address (but higher best-case values).   You might want to try repeat your test using each of the 64 cache lines in a 4KiB page as the shared variable.   (I recommend using "mlock()" on the data page to prevent the OS from changing the virtual to physical translation during the run.  Alternately, I use the /proc/self/pagemaps interface to look up the physical address of the base of the 4KiB page before the tests and again after the tests.)

 

0 Kudos
AATLI3
Beginner
1,890 Views

The attached code was to check the aligned 64-byte loads/stores atomicity using AVX512. Also scalar loads/stores produced the similar results on both systems, so that makes me suspicious of AVX512, not the bitmap..

In order to find the disabled cores, I tried to read CAPID6 to check the bitmap of the enabled L3 slices in a PCI configuration space register as below, and it seems all 28 L3/CHA slices are enabled, but I have only 18 cores. What's wrong with it?

$ lspci | grep :1e.3
16:1e.3 System peripheral: Intel Corporation Sky Lake-E PCU Registers (rev 04)

$ setpci -s 16:1e.3 0x9c.l
ffffffff

0 Kudos
McCalpinJohn
Honored Contributor III
1,890 Views

The response "ffffffff" (all 32 bits) is used to indicate an error -- in this case it is probably a simple permissions error.  User processes are allowed to read the first 64 Bytes of each PCI devices configuration space, but root permission is needed to read past that point.  My output above was run as root.   On a 28-slice processor, the output is 0ffffffff.

I am not drawing any conclusions about the source of the latency difference yet -- there are several possibilities, and unless all the variables are properly controlled, it is easy to get misled by the results.....

0 Kudos
Reply