Jones__Brian
New Contributor I

Many more perf events with multicore than single core


Below I show the loop body of a NASM program that uses only AVX-512 instructions. I ran this program with 1 core and with 4 cores, but with 4 cores it runs 35-50% slower than with a single core.

I used Linux perf to profile it, using only the 65 PEBS counters to get line-accurate results. The perf reports show a lot of cache activity (cache hits and misses at all 3 cache levels), memory loads/stores, and branch instructions retired on source lines that are AVX-512 all-register instructions. The cache hits and misses appear far more frequently when run with 4 cores than when run with 1 core. According to https://software.intel.com/content/www/us/en/develop/articles/use-instructions-retired-events-to-eva... that is a sign that something is wrong, and my results show it.

The program uses AVX-512 register-to-register instructions exclusively from the data read at line 194 to where the final results are written to a shared memory buffer at line 275.

The input data is divided into 4 equal sections with each core taking a section. Stride for each core is 64 bytes (the width of an AVX-512 register). All 4 threads are pinned to physical cores. Both single core and multicore were run on an Intel Xeon Gold 6140 CPU @ 2.30GHz, which has two FMA units for AVX-512.
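The partitioning and pinning scheme described above can be sketched in a few lines of Python. This is a minimal stand-in, not the OP's NASM program: the worker function, the scalar arithmetic inside the loop, and the core-selection logic are all hypothetical, and the sketch picks cores from the set the process is actually allowed to use so it also runs on machines with fewer than 4 cores.

```python
import os
import threading

def worker(slot, data, start, stop, out):
    # Pin this thread to one core (Linux-only; os.sched_setaffinity with
    # pid 0 applies to the calling thread). Core numbering is hypothetical.
    allowed = sorted(os.sched_getaffinity(0))
    os.sched_setaffinity(0, {allowed[slot % len(allowed)]})
    acc = 0.0
    for i in range(start, stop):     # each worker walks its own section
        acc += data[i] * data[i]     # scalar stand-in for the AVX-512 work
    out[slot] = acc

data = [float(i) for i in range(4096)]
n_workers = 4
chunk = len(data) // n_workers       # 4 equal sections, one per core
out = [0.0] * n_workers
threads = [threading.Thread(target=worker,
                            args=(t, data, t * chunk, (t + 1) * chunk, out))
           for t in range(n_workers)]
for t in threads:
    t.start()
for t in threads:
    t.join()
total = sum(out)
```

In the real program each worker would advance through its section in 64-byte strides (one ZMM register per step) rather than element by element.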

In the data below I show the loop body, followed by the perf reports for multicore and single core. Each perf report is formatted with the source line first, then each perf counter result on a separate line showing the counter name, the source line number, the percentage of events attributed to that source line, and the total event count.

My main questions are: (1) why do I see cache or memory activity at all on lines that are all register-to-register instructions, and (2) why is there so much more activity with 4 cores than with 1?

I know this is a lot of data, but I think it's what's needed to show the problem.

Loop body:

label_401:

vmovupd zmm14,[r12+r11] ; SOURCE LINE 194
add r11,r9 ; stride ; SOURCE LINE 195

vmulpd zmm13,zmm14,zmm31 ; SOURCE LINE 205

vmulpd zmm9,zmm14,zmm29 ; SOURCE LINE 216
vmulpd zmm8,zmm13,zmm30 ; SOURCE LINE 217

mov r8,1 ; SOURCE LINE 219
Exponent_Label_0:
vmulpd zmm7,zmm29,zmm29
add r8,1
cmp r8,2 ;rdx
jl Exponent_Label_0

vmulpd zmm3,zmm7,zmm8

vsubpd zmm0,zmm9,zmm3 ; SOURCE LINE 236

vmulpd zmm1,zmm0,zmm28
VCVTTPD2QQ zmm0,zmm1 ; SOURCE LINE 249
VCVTUQQ2PD zmm2,zmm0 ; SOURCE LINE 250
vsubpd zmm3,zmm1,zmm2 ; SOURCE LINE 251
vmulpd zmm4,zmm3,zmm27 ; SOURCE LINE 252
VCVTTPD2QQ zmm5,zmm4 ; SOURCE LINE 253

VPCMPGTQ k2,zmm5,zmm26 ; SOURCE LINE 256
VPCMPEQQ k3,zmm5,zmm26 ; SOURCE LINE 257
KADDQ k1,k2,k3 ; SOURCE LINE 258

VCVTQQ2PD zmm2,zmm0 ; SOURCE LINE 261
vmulpd zmm1,zmm2,zmm25
vmovupd zmm2,zmm1
VADDPD zmm2{k1},zmm1,zmm25 ; SOURCE LINE 268

vmovapd [r15+r14],zmm2 ; SOURCE LINE 275
add r14,r9 ; stride

cmp r11,r10
jl label_401

Perf results for multicore - counter, line#, percent, total event count

195 add %r9,%r11
br_inst_retired.all_branches_pebs 195 69.23 296232346
br_inst_retired.conditional 195 55.26 198548579
br_inst_retired.near_taken 195 55.38 167741505
br_inst_retired.not_taken 195 39.39 128270540
inst_retired.prec_dist 195 50.00 1953302772
inst_retired.total_cycles_ps 195 73.02 2318027140
mem_inst_retired.all_loads 195 62.16 420795640
mem_inst_retired.all_stores 195 71.60 261311780
mem_inst_retired.split_loads 195 70.00 15911221
mem_inst_retired.stlb_miss_stores 195 50.00 277390
mem_load_l3_hit_retired.xsnp_none 195 85.71 1272960
mem_load_l3_miss_retired.local_dram 195 50.00 495953
mem_load_retired.fb_hit 195 73.81 11868893
mem_load_retired.l1_hit 195 66.67 414307174
mem_load_retired.l1_miss 195 56.10 10189564
mem_load_retired.l2_hit 195 48.65 7888956
mem_load_retired.l2_miss 195 87.50 1505887
mem_load_retired.l3_hit 195 50.00 1090482
mem_load_retired.l3_miss 195 100.00 479332
uops_retired.stall_cycles 195 90.16 1551403448

205 vmulpd %zmm31,%zmm14,%zmm13
br_inst_retired.all_branches_pebs 205 13.46 296232346
br_inst_retired.conditional 205 15.79 198548579
br_inst_retired.near_taken 205 18.46 167741505
br_inst_retired.not_taken 205 42.42 128270540
inst_retired.prec_dist 205 15.00 1953302772
mem_inst_retired.all_loads 205 18.92 420795640
mem_inst_retired.all_stores 205 12.35 261311780
mem_inst_retired.split_loads 205 20.00 15911221
mem_inst_retired.stlb_miss_stores 205 16.67 277390
mem_load_retired.l1_hit 205 33.33 414307174
mem_load_retired.l1_miss 205 31.71 10189564
mem_load_retired.l2_hit 205 29.73 7888956
mem_load_retired.l3_hit 205 12.50 1090482
rs_events.empty_end 205 50.00 16794950

216 vmulpd %zmm29,%zmm14,%zmm9
br_inst_retired.conditional 216 15.79 198548579
br_inst_retired.near_taken 216 15.38 167741505
br_inst_retired.not_taken 216 15.15 128270540
inst_retired.prec_dist 216 18.33 1953302772
mem_inst_retired.all_loads 216 13.51 420795640
mem_inst_retired.all_stores 216 11.11 261311780
mem_load_l3_miss_retired.local_dram 216 50.00 495953
mem_load_retired.l2_hit 216 10.81 7888956
mem_load_retired.l2_miss 216 12.50 1505887
mem_load_retired.l3_hit 216 37.50 1090482
rs_events.empty_end 216 50.00 16794950

217 vmulpd %zmm30,%zmm13,%zmm8
mem_inst_retired.stlb_miss_stores 217 16.67 277390
mem_load_l3_hit_retired.xsnp_none 217 14.29 1272960

219 mov $0x1,%r8d
inst_retired.prec_dist 219 11.67 1953302772
mem_inst_retired.stlb_miss_stores 219 16.67 277390

236 vsubpd %zmm3,%zmm9,%zmm0
mem_load_retired.l1_hit 236 13.33 414307174
uops_retired.stall_cycles 236 12.22 1551403448

249 vcvttpd2qq %zmm1,%zmm0
mem_inst_retired.all_loads 249 14.16 420795640
mem_inst_retired.split_loads 249 12.98 15911221
mem_inst_retired.stlb_miss_loads 249 100.00 30435
mem_load_l3_hit_retired.xsnp_none 249 66.67 1272960
mem_load_l3_miss_retired.local_dram 249 50.00 495953
mem_load_retired.fb_hit 249 12.83 11868893
mem_load_retired.l1_miss 249 14.94 10189564
mem_load_retired.l2_miss 249 48.44 1505887
mem_load_retired.l3_hit 249 54.22 1090482
mem_load_retired.l3_miss 249 44.00 479332

250 vcvtuqq2pd %zmm0,%zmm2
mem_load_l3_hit_retired.xsnp_none 250 17.71 1272960
mem_load_l3_miss_retired.local_dram 250 18.18 495953
mem_load_retired.l1_hit 250 13.33 414307174
mem_load_retired.l2_miss 250 21.88 1505887
mem_load_retired.l3_hit 250 19.28 1090482

251 vsubpd %zmm2,%zmm1,%zmm3
br_inst_retired.near_taken 251 12.19 167741505
mem_inst_retired.stlb_miss_stores 251 12.50 277390
mem_load_retired.fb_hit 251 12.83 11868893
mem_load_retired.l2_hit 251 10.29 7888956

252 vmulpd %zmm27,%zmm3,%zmm4
br_inst_retired.conditional 252 12.01 198548579
br_inst_retired.not_taken 252 10.71 128270540

253 vcvttpd2qq %zmm4,%zmm5
br_inst_retired.all_branches_pebs 253 11.56 296232346
br_inst_retired.conditional 253 12.01 198548579
br_inst_retired.not_taken 253 11.36 128270540
mem_inst_retired.stlb_miss_stores 253 12.50 277390

256 vpcmpgtq %zmm26,%zmm5,%k2
br_inst_retired.not_taken 256 10.06 128270540
inst_retired.total_cycles_ps 256 16.13 2318027140
mem_inst_retired.split_loads 256 11.90 15911221
uops_retired.stall_cycles 256 11.11 1551403448

257 vpcmpeqq %zmm26,%zmm5,%k3
br_inst_retired.not_taken 257 10.06 128270540

258 kaddq %k3,%k2,%k1
mem_inst_retired.stlb_miss_stores 258 18.75 277390

261 vcvtqq2pd %zmm0,%zmm2
br_inst_retired.near_taken 261 12.19 167741505
br_inst_retired.not_taken 261 11.04 128270540
mem_inst_retired.all_stores 261 11.78 261311780
mem_load_retired.l2_hit 261 11.73 7888956

268 vaddpd %zmm25,%zmm1,%zmm2{%k1}
mem_inst_retired.stlb_miss_stores 268 18.75 277390

275 vmovapd %zmm2,(%r15,%r14,1)
br_inst_retired.all_branches_pebs 275 21.43 296232346
br_inst_retired.conditional 275 18.28 198548579
br_inst_retired.near_taken 275 18.64 167741505
br_inst_retired.not_taken 275 19.81 128270540
inst_retired.prec_dist 275 21.84 1953302772
inst_retired.total_cycles_ps 275 10.75 2318027140
mem_inst_retired.all_loads 275 21.08 420795640
mem_inst_retired.all_stores 275 23.23 261311780
mem_inst_retired.split_loads 275 17.47 15911221
mem_inst_retired.stlb_miss_stores 275 18.75 277390
mem_load_retired.fb_hit 275 17.26 11868893
mem_load_retired.l1_hit 275 26.67 414307174
mem_load_retired.l1_miss 275 20.54 10189564
mem_load_retired.l2_hit 275 26.34 7888956
rs_events.empty_end 275 70.83 16794950
uops_retired.stall_cycles 275 18.89 1551403448

276 add %r9,%r14
br_inst_retired.far_branch 276 100.00 391608
inst_retired.total_cycles_ps 276 21.51 2318027140

Perf results for single core - counter, line#, percent, total event count:

186 cmp %r10,%r11
inst_retired.total_cycles_ps 186 16.00 2497665430
uops_retired.stall_cycles 186 13.64 2029986338

190 add %r9,%r11
br_inst_retired.all_branches_pebs 190 100.00 283376354
br_inst_retired.conditional 190 50.00 179596698
br_inst_retired.near_taken 190 80.00 157852893
br_inst_retired.not_taken 190 50.00 124369766
inst_retired.prec_dist 190 70.00 1685951392
inst_retired.total_cycles_ps 190 56.00 2497665430
mem_inst_retired.all_stores 190 65.00 251877072
uops_retired.stall_cycles 190 86.36 2029986338

200 vmulpd %zmm31,%zmm14,%zmm13
inst_retired.total_cycles_ps 200 12.00 2497665430

211 vmulpd %zmm29,%zmm14,%zmm9
br_inst_retired.conditional 211 25.00 179596698
br_inst_retired.not_taken 211 50.00 124369766
inst_retired.prec_dist 211 10.00 1685951392
mem_inst_retired.all_stores 211 20.00 251877072

214 mov $0x1,%r8d
br_inst_retired.conditional 214 25.00 179596698
br_inst_retired.near_taken 214 20.00 157852893
inst_retired.prec_dist 214 20.00 1685951392
inst_retired.total_cycles_ps 214 12.00 2497665430
mem_inst_retired.all_stores 214 15.00 251877072

243 vmulpd %zmm28,%zmm0,%zmm1
mem_load_l3_miss_retired.local_dram 243 14.29 362959

244 vcvttpd2qq %zmm1,%zmm0{%k7}
mem_load_l3_miss_retired.local_dram 244 28.57 362959
mem_load_retired.l2_miss 244 17.02 807429
mem_load_retired.l3_hit 244 10.53 899526
mem_load_retired.l3_miss 244 66.67 362469

245 vcvtuqq2pd %zmm0,%zmm2{%k7}
mem_load_l3_miss_retired.local_dram 245 14.29 362959

246 vsubpd %zmm2,%zmm1,%zmm3
mem_load_retired.l3_hit 246 10.53 899526

247 vmulpd %zmm27,%zmm3,%zmm4
inst_retired.total_cycles_ps 247 10.81 2497665430
mem_load_l3_hit_retired.xsnp_none 247 12.50 625422
mem_load_retired.l3_hit 247 13.16 899526
uops_retired.stall_cycles 247 11.76 2029986338

251 vpcmpgtq %zmm26,%zmm5,%k2
inst_retired.prec_dist 251 10.00 1685951392
mem_inst_retired.stlb_miss_stores 251 77.78 133237
mem_load_retired.l3_hit 251 10.53 899526
uops_retired.stall_cycles 251 11.76 2029986338

252 vpcmpeqq %zmm26,%zmm5,%k3{%k7}
inst_retired.total_cycles_ps 252 13.51 2497665430

256 vcvtqq2pd %zmm0,%zmm2{%k7}
br_inst_retired.all_branches_pebs 256 14.65 283376354
br_inst_retired.conditional 256 23.56 179596698
br_inst_retired.near_taken 256 31.03 157852893
br_inst_retired.not_taken 256 15.05 124369766
mem_inst_retired.all_stores 256 31.86 251877072

257 vmulpd %zmm25,%zmm2,%zmm1{%k7}
mem_inst_retired.stlb_miss_stores 257 22.22 133237
mem_load_retired.l1_hit 257 33.33 408735454

270 vmovapd %zmm2,(%r15,%r14,1)
br_inst_retired.all_branches_pebs 270 66.88 283376354
br_inst_retired.conditional 270 59.77 179596698
br_inst_retired.near_taken 270 52.59 157852893
br_inst_retired.not_taken 270 67.74 124369766
inst_retired.prec_dist 270 45.91 1685951392
inst_retired.total_cycles_ps 270 18.92 2497665430
mem_inst_retired.all_loads 270 84.12 410923152
mem_inst_retired.all_stores 270 52.42 251877072
mem_inst_retired.split_loads 270 77.87 6473151
mem_load_l3_hit_retired.xsnp_none 270 50.00 625422
mem_load_l3_miss_retired.local_dram 270 42.86 362959
mem_load_retired.fb_hit 270 66.99 6206128
mem_load_retired.l1_hit 270 66.67 408735454
mem_load_retired.l1_miss 270 76.28 9265872
mem_load_retired.l2_hit 270 77.26 10014175
mem_load_retired.l2_miss 270 48.94 807429
mem_load_retired.l3_hit 270 39.47 899526
mem_load_retired.l3_miss 270 33.33 362469
rs_events.empty_end 270 93.42 21755837
uops_retired.stall_cycles 270 11.76 2029986338

271 add %r9,%r14
inst_retired.total_cycles_ps 271 13.51 2497665430
uops_retired.stall_cycles 271 32.35 2029986338

 

1 Solution
Bernard
Black Belt

>>>Below I show the loop body of a NASM program that uses only AVX-512 instructions. I ran this program with 1 core and with 4 cores, but with 4 cores it runs 35-50% slower than the single core.>>>

 

The loop executes heavy AVX-512 floating-point code, so each core's power management unit most likely activated the highest power license (license 2), lowering the clock frequency when the multi-threaded computation ran.

Judging from this information, the down-clocking was moderate.
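A license-based frequency drop alone is roughly the size of the observed slowdown. A back-of-envelope check, assuming the commonly published frequencies for the Xeon Gold 6140 (2.3 GHz non-AVX base vs. roughly 1.5 GHz heavy-AVX-512 base; both are assumptions from public spec tables, not measurements):

```python
# Frequencies below are assumptions, not measured values.
base_ghz = 2.3          # non-AVX base frequency of the Xeon Gold 6140
avx512_base_ghz = 1.5   # approximate license-2 (heavy AVX-512) base frequency

slowdown = 1 - avx512_base_ghz / base_ghz
print(f"frequency drop: {slowdown:.0%}")
```

This works out to about a 35% frequency drop, which is in line with the lower end of the observed 35-50% gap; the single-core run can stay at a higher license frequency because only one core is drawing heavy AVX-512 power.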

 

 
