Software Tuning, Performance Optimization & Platform Monitoring
Discussion around monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform monitoring

Many more perf events with multicore than single core

Jones__Brian
New Contributor I

Below I show the loop body of a NASM program that uses only AVX-512 instructions. I ran this program with 1 core and with 4 cores, but with 4 cores it runs 35-50% slower than with a single core.

I used Linux perf to profile it, using only the 65 PEBS counters so the results would be line-accurate. The perf reports show a lot of cache activity (hits and misses at all 3 cache levels), memory loads/stores, and branch instructions retired on source lines that contain only register-to-register AVX-512 instructions. Cache hits and misses appear far more frequently when run with 4 cores than with 1 core. According to https://software.intel.com/content/www/us/en/develop/articles/use-instructions-retired-events-to-eva... that is a sign that something is wrong, and my results show it.
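For reference, the profiling commands had roughly this shape (the event list here is abbreviated, and ./prog stands in for the actual binary; these are only the PEBS-capable events from the reports below):

```shell
# Sketch of the profiling commands. The :pp suffix requests PEBS precision so
# samples attribute to (near-)exact instructions; ./prog is a placeholder.
perf record -e mem_load_retired.l1_hit:pp,mem_load_retired.l3_miss:pp,br_inst_retired.all_branches_pebs:pp ./prog
perf report --stdio
```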

The program uses AVX-512 register-to-register instructions exclusively from the data read at line 194 to the point where the final results are written to a shared memory buffer at line 275.

The input data is divided into 4 equal sections, with each core taking one section. The stride for each core is 64 bytes (the width of an AVX-512 register), and all 4 threads are pinned to physical cores. Both the single-core and multicore runs were on an Intel Xeon Gold 6140 CPU @ 2.30GHz, which has two AVX-512 FMA units.

In the data below I show the loop body, followed by the perf reports for the multicore and single-core runs. The perf reports are formatted with the source line first, then each perf counter result on its own line showing the counter name, the source line number, the percentage of events attributed to that line, and the total count for that event.

My main questions are: (1) why do I see any cache or memory activity at all on lines that contain only register-to-register instructions, and (2) why is there so much more of it with 4 cores than with 1?

I know this is a lot of data, but I think it's what's needed to show the problem.

Loop body:

label_401:

vmovupd zmm14,[r12+r11] ; SOURCE LINE 194
add r11,r9 ; stride ; SOURCE LINE 195

vmulpd zmm13,zmm14,zmm31 ; SOURCE LINE 205

vmulpd zmm9,zmm14,zmm29 ; SOURCE LINE 216
vmulpd zmm8,zmm13,zmm30 ; SOURCE LINE 217

mov r8,1 ; SOURCE LINE 219
Exponent_Label_0:
vmulpd zmm7,zmm29,zmm29
add r8,1
cmp r8,2 ;rdx
jl Exponent_Label_0

vmulpd zmm3,zmm7,zmm8

vsubpd zmm0,zmm9,zmm3 ; SOURCE LINE 236

vmulpd zmm1,zmm0,zmm28
VCVTTPD2QQ zmm0,zmm1 ; SOURCE LINE 249
VCVTUQQ2PD zmm2,zmm0 ; SOURCE LINE 250
vsubpd zmm3,zmm1,zmm2 ; SOURCE LINE 251
vmulpd zmm4,zmm3,zmm27 ; SOURCE LINE 252
VCVTTPD2QQ zmm5,zmm4 ; SOURCE LINE 253

VPCMPGTQ k2,zmm5,zmm26 ; SOURCE LINE 256
VPCMPEQQ k3,zmm5,zmm26 ; SOURCE LINE 257
KADDQ k1,k2,k3 ; SOURCE LINE 258

VCVTQQ2PD zmm2,zmm0 ; SOURCE LINE 261
vmulpd zmm1,zmm2,zmm25
vmovupd zmm2,zmm1
VADDPD zmm2{k1},zmm1,zmm25 ; SOURCE LINE 268

vmovapd [r15+r14],zmm2 ; SOURCE LINE 275
add r14,r9 ; stride

cmp r11,r10
jl label_401

Perf results for multicore (counter, line #, percent, total event count):

195 add %r9,%r11
br_inst_retired.all_branches_pebs 195 69.23 296232346
br_inst_retired.conditional 195 55.26 198548579
br_inst_retired.near_taken 195 55.38 167741505
br_inst_retired.not_taken 195 39.39 128270540
inst_retired.prec_dist 195 50.00 1953302772
inst_retired.total_cycles_ps 195 73.02 2318027140
mem_inst_retired.all_loads 195 62.16 420795640
mem_inst_retired.all_stores 195 71.60 261311780
mem_inst_retired.split_loads 195 70.00 15911221
mem_inst_retired.stlb_miss_stores 195 50.00 277390
mem_load_l3_hit_retired.xsnp_none 195 85.71 1272960
mem_load_l3_miss_retired.local_dram 195 50.00 495953
mem_load_retired.fb_hit 195 73.81 11868893
mem_load_retired.l1_hit 195 66.67 414307174
mem_load_retired.l1_miss 195 56.10 10189564
mem_load_retired.l2_hit 195 48.65 7888956
mem_load_retired.l2_miss 195 87.50 1505887
mem_load_retired.l3_hit 195 50.00 1090482
mem_load_retired.l3_miss 195 100.00 479332
uops_retired.stall_cycles 195 90.16 1551403448

205 vmulpd %zmm31,%zmm14,%zmm13
br_inst_retired.all_branches_pebs 205 13.46 296232346
br_inst_retired.conditional 205 15.79 198548579
br_inst_retired.near_taken 205 18.46 167741505
br_inst_retired.not_taken 205 42.42 128270540
inst_retired.prec_dist 205 15.00 1953302772
mem_inst_retired.all_loads 205 18.92 420795640
mem_inst_retired.all_stores 205 12.35 261311780
mem_inst_retired.split_loads 205 20.00 15911221
mem_inst_retired.stlb_miss_stores 205 16.67 277390
mem_load_retired.l1_hit 205 33.33 414307174
mem_load_retired.l1_miss 205 31.71 10189564
mem_load_retired.l2_hit 205 29.73 7888956
mem_load_retired.l3_hit 205 12.50 1090482
rs_events.empty_end 205 50.00 16794950

216 vmulpd %zmm29,%zmm14,%zmm9
br_inst_retired.conditional 216 15.79 198548579
br_inst_retired.near_taken 216 15.38 167741505
br_inst_retired.not_taken 216 15.15 128270540
inst_retired.prec_dist 216 18.33 1953302772
mem_inst_retired.all_loads 216 13.51 420795640
mem_inst_retired.all_stores 216 11.11 261311780
mem_load_l3_miss_retired.local_dram 216 50.00 495953
mem_load_retired.l2_hit 216 10.81 7888956
mem_load_retired.l2_miss 216 12.50 1505887
mem_load_retired.l3_hit 216 37.50 1090482
rs_events.empty_end 216 50.00 16794950

217 vmulpd %zmm30,%zmm13,%zmm8
mem_inst_retired.stlb_miss_stores 217 16.67 277390
mem_load_l3_hit_retired.xsnp_none 217 14.29 1272960

219 mov $0x1,%r8d
inst_retired.prec_dist 219 11.67 1953302772
mem_inst_retired.stlb_miss_stores 219 16.67 277390

236 vsubpd %zmm3,%zmm9,%zmm0
mem_load_retired.l1_hit 236 13.33 414307174
uops_retired.stall_cycles 236 12.22 1551403448

249 vcvttpd2qq %zmm1,%zmm0
mem_inst_retired.all_loads 249 14.16 420795640
mem_inst_retired.split_loads 249 12.98 15911221
mem_inst_retired.stlb_miss_loads 249 100.00 30435
mem_load_l3_hit_retired.xsnp_none 249 66.67 1272960
mem_load_l3_miss_retired.local_dram 249 50.00 495953
mem_load_retired.fb_hit 249 12.83 11868893
mem_load_retired.l1_miss 249 14.94 10189564
mem_load_retired.l2_miss 249 48.44 1505887
mem_load_retired.l3_hit 249 54.22 1090482
mem_load_retired.l3_miss 249 44.00 479332

250 vcvtuqq2pd %zmm0,%zmm2
mem_load_l3_hit_retired.xsnp_none 250 17.71 1272960
mem_load_l3_miss_retired.local_dram 250 18.18 495953
mem_load_retired.l1_hit 250 13.33 414307174
mem_load_retired.l2_miss 250 21.88 1505887
mem_load_retired.l3_hit 250 19.28 1090482

251 vsubpd %zmm2,%zmm1,%zmm3
br_inst_retired.near_taken 251 12.19 167741505
mem_inst_retired.stlb_miss_stores 251 12.50 277390
mem_load_retired.fb_hit 251 12.83 11868893
mem_load_retired.l2_hit 251 10.29 7888956

252 vmulpd %zmm27,%zmm3,%zmm4
br_inst_retired.conditional 252 12.01 198548579
br_inst_retired.not_taken 252 10.71 128270540

253 vcvttpd2qq %zmm4,%zmm5
br_inst_retired.all_branches_pebs 253 11.56 296232346
br_inst_retired.conditional 253 12.01 198548579
br_inst_retired.not_taken 253 11.36 128270540
mem_inst_retired.stlb_miss_stores 253 12.50 277390

256 vpcmpgtq %zmm26,%zmm5,%k2
br_inst_retired.not_taken 256 10.06 128270540
inst_retired.total_cycles_ps 256 16.13 2318027140
mem_inst_retired.split_loads 256 11.90 15911221
uops_retired.stall_cycles 256 11.11 1551403448

257 vpcmpeqq %zmm26,%zmm5,%k3
br_inst_retired.not_taken 257 10.06 128270540

258 kaddq %k3,%k2,%k1
mem_inst_retired.stlb_miss_stores 258 18.75 277390

261 vcvtqq2pd %zmm0,%zmm2
br_inst_retired.near_taken 261 12.19 167741505
br_inst_retired.not_taken 261 11.04 128270540
mem_inst_retired.all_stores 261 11.78 261311780
mem_load_retired.l2_hit 261 11.73 7888956

268 vaddpd %zmm25,%zmm1,%zmm2{%k1}
mem_inst_retired.stlb_miss_stores 268 18.75 277390

275 vmovapd %zmm2,(%r15,%r14,1)
br_inst_retired.all_branches_pebs 275 21.43 296232346
br_inst_retired.conditional 275 18.28 198548579
br_inst_retired.near_taken 275 18.64 167741505
br_inst_retired.not_taken 275 19.81 128270540
inst_retired.prec_dist 275 21.84 1953302772
inst_retired.total_cycles_ps 275 10.75 2318027140
mem_inst_retired.all_loads 275 21.08 420795640
mem_inst_retired.all_stores 275 23.23 261311780
mem_inst_retired.split_loads 275 17.47 15911221
mem_inst_retired.stlb_miss_stores 275 18.75 277390
mem_load_retired.fb_hit 275 17.26 11868893
mem_load_retired.l1_hit 275 26.67 414307174
mem_load_retired.l1_miss 275 20.54 10189564
mem_load_retired.l2_hit 275 26.34 7888956
rs_events.empty_end 275 70.83 16794950
uops_retired.stall_cycles 275 18.89 1551403448

276 add %r9,%r14
br_inst_retired.far_branch 276 100.00 391608
inst_retired.total_cycles_ps 276 21.51 2318027140

Perf results for single core (counter, line #, percent, total event count):

186 cmp %r10,%r11
inst_retired.total_cycles_ps 186 16.00 2497665430
uops_retired.stall_cycles 186 13.64 2029986338

190 add %r9,%r11
br_inst_retired.all_branches_pebs 190 100.00 283376354
br_inst_retired.conditional 190 50.00 179596698
br_inst_retired.near_taken 190 80.00 157852893
br_inst_retired.not_taken 190 50.00 124369766
inst_retired.prec_dist 190 70.00 1685951392
inst_retired.total_cycles_ps 190 56.00 2497665430
mem_inst_retired.all_stores 190 65.00 251877072
uops_retired.stall_cycles 190 86.36 2029986338

200 vmulpd %zmm31,%zmm14,%zmm13
inst_retired.total_cycles_ps 200 12.00 2497665430

211 vmulpd %zmm29,%zmm14,%zmm9
br_inst_retired.conditional 211 25.00 179596698
br_inst_retired.not_taken 211 50.00 124369766
inst_retired.prec_dist 211 10.00 1685951392
mem_inst_retired.all_stores 211 20.00 251877072

214 mov $0x1,%r8d
br_inst_retired.conditional 214 25.00 179596698
br_inst_retired.near_taken 214 20.00 157852893
inst_retired.prec_dist 214 20.00 1685951392
inst_retired.total_cycles_ps 214 12.00 2497665430
mem_inst_retired.all_stores 214 15.00 251877072

243 vmulpd %zmm28,%zmm0,%zmm1
mem_load_l3_miss_retired.local_dram 243 14.29 362959

244 vcvttpd2qq %zmm1,%zmm0{%k7}
mem_load_l3_miss_retired.local_dram 244 28.57 362959
mem_load_retired.l2_miss 244 17.02 807429
mem_load_retired.l3_hit 244 10.53 899526
mem_load_retired.l3_miss 244 66.67 362469

245 vcvtuqq2pd %zmm0,%zmm2{%k7}
mem_load_l3_miss_retired.local_dram 245 14.29 362959

246 vsubpd %zmm2,%zmm1,%zmm3
mem_load_retired.l3_hit 246 10.53 899526

247 vmulpd %zmm27,%zmm3,%zmm4
inst_retired.total_cycles_ps 247 10.81 2497665430
mem_load_l3_hit_retired.xsnp_none 247 12.50 625422
mem_load_retired.l3_hit 247 13.16 899526
uops_retired.stall_cycles 247 11.76 2029986338

251 vpcmpgtq %zmm26,%zmm5,%k2
inst_retired.prec_dist 251 10.00 1685951392
mem_inst_retired.stlb_miss_stores 251 77.78 133237
mem_load_retired.l3_hit 251 10.53 899526
uops_retired.stall_cycles 251 11.76 2029986338

252 vpcmpeqq %zmm26,%zmm5,%k3{%k7}
inst_retired.total_cycles_ps 252 13.51 2497665430

256 vcvtqq2pd %zmm0,%zmm2{%k7}
br_inst_retired.all_branches_pebs 256 14.65 283376354
br_inst_retired.conditional 256 23.56 179596698
br_inst_retired.near_taken 256 31.03 157852893
br_inst_retired.not_taken 256 15.05 124369766
mem_inst_retired.all_stores 256 31.86 251877072

257 vmulpd %zmm25,%zmm2,%zmm1{%k7}
mem_inst_retired.stlb_miss_stores 257 22.22 133237
mem_load_retired.l1_hit 257 33.33 408735454

270 vmovapd %zmm2,(%r15,%r14,1)
br_inst_retired.all_branches_pebs 270 66.88 283376354
br_inst_retired.conditional 270 59.77 179596698
br_inst_retired.near_taken 270 52.59 157852893
br_inst_retired.not_taken 270 67.74 124369766
inst_retired.prec_dist 270 45.91 1685951392
inst_retired.total_cycles_ps 270 18.92 2497665430
mem_inst_retired.all_loads 270 84.12 410923152
mem_inst_retired.all_stores 270 52.42 251877072
mem_inst_retired.split_loads 270 77.87 6473151
mem_load_l3_hit_retired.xsnp_none 270 50.00 625422
mem_load_l3_miss_retired.local_dram 270 42.86 362959
mem_load_retired.fb_hit 270 66.99 6206128
mem_load_retired.l1_hit 270 66.67 408735454
mem_load_retired.l1_miss 270 76.28 9265872
mem_load_retired.l2_hit 270 77.26 10014175
mem_load_retired.l2_miss 270 48.94 807429
mem_load_retired.l3_hit 270 39.47 899526
mem_load_retired.l3_miss 270 33.33 362469
rs_events.empty_end 270 93.42 21755837
uops_retired.stall_cycles 270 11.76 2029986338

271 add %r9,%r14
inst_retired.total_cycles_ps 271 13.51 2497665430
uops_retired.stall_cycles 271 32.35 2029986338

 

1 Solution
Bernard
Black Belt

>>>Below I show the loop body of a NASM program that uses only AVX-512 instructions. I ran this program with 1 core and with 4 cores, but with 4 cores it runs 35-50% slower than the single core.>>>

 

The loop contains heavy floating-point AVX-512 machine code, so the highest AVX power license (L2) was probably activated by each core's Power Management Unit, lowering the core frequency when the multi-threaded computation was executed.

Judging from this information, though, the down-clocking was only moderate.
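One way to check this hypothesis is to count cycles spent in each AVX turbo-license level; on Skylake-SP, perf exposes these as core_power.* events. A sketch, with ./prog as a placeholder for the actual binary:

```shell
# If core_power.lvl2_turbo_license dominates in the 4-core run but not the
# 1-core run, AVX-512 license-based down-clocking is the likely cause.
perf stat -e core_power.lvl0_turbo_license,core_power.lvl1_turbo_license,core_power.lvl2_turbo_license,core_power.throttle ./prog
```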

 

 

