I have found strange IPC behavior in a test program that benchmarks matrix multiplication using the MPFR library at 53-bit and 113-bit precision. The 113-bit version was always much faster (typically 20-30%) even though it performs more computation. After analysis, I have reduced the problem to the mpfr_mul function.
Here is the assembly extract where I think the problem lies: in the mpfr_mul function, more precisely in the section that performs the 1x1, 2x1 or 2x2 limb multiplication:
cmpq $2, %r9
jg .L21
movq 24(%r14), %rsi
leaq 8(%rbx), %rdi
movq 24(%r13), %rcx
movq (%rsi), %rax
# APP
# 324 "mul.c" 1
mulq (%rcx)
# 0 "" 2
# NO_APP
cmpq $1, %r9
movq %rdx, %r11
movq %rax, (%rbx)
movq %rdx, 8(%rbx)
je .L23
movq 8(%rsi), %rax
# APP
# 334 "mul.c" 1
mulq (%rcx)
# 0 "" 2
# 335 "mul.c" 1
addq %rax,%r11
adcq $0,%rdx
# 0 "" 2
# NO_APP
cmpq $1, -136(%rbp)
movq %rdx, 16(%rbx)
movq %r11, (%rdi)
# je .L189
movq 8(%rcx), %r9
movq (%rsi), %rcx
movq %rcx, %rax
# APP
# 346 "mul.c" 1
mulq %r9
# 0 "" 2
# NO_APP
movq %rdx, %r11
movq %rax, %rcx
movq 8(%rsi), %rax
# APP
# 347 "mul.c" 1
mulq %r9
# 0 "" 2
# 348 "mul.c" 1
addq %rax,%r11
adcq $0,%rdx
# 0 "" 2
# NO_APP
movq 8(%rbx), %rax
movq %rdx, 24(%rbx)
movq 16(%rbx), %rdx
# APP
# 350 "mul.c" 1
addq %rcx,%rax
adcq %r11,%rdx
# 0 "" 2
# NO_APP
movq %rdx, 16(%rbx)
movq %rax, (%rdi)
cmpq %r11, 16(%rbx)
setb %r11b
movzbl %r11b, %r11d
addq 24(%rbx), %r11
movq %r11, 24(%rbx)
.L23:
subq -144(%rbp), %r8
shrq $63, %r11
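As a reference for what this section computes, here is a minimal Python model of a 1x1, 2x1 or 2x2 limb product. It is a sketch and not MPFR's source: the names `mul_64x64`, `mul_limbs` and `limbs_to_int` are hypothetical, and it assumes 64-bit limbs stored least significant first. Under that assumption a 53-bit significand fits in one limb (the 1x1 case, where je .L23 is taken) and a 113-bit significand needs two (the 2x2 case).

```python
MASK64 = (1 << 64) - 1  # assumption: limbs are 64 bits wide


def mul_64x64(a, b):
    """Model of x86-64 mulq: return the (high, low) 64-bit halves of a * b."""
    p = a * b
    return p >> 64, p & MASK64


def mul_limbs(a, b):
    """Schoolbook product of two little-endian limb lists (1 or 2 limbs each)."""
    out = [0] * (len(a) + len(b))
    for j, bj in enumerate(b):
        carry = 0
        for i, ai in enumerate(a):
            hi, lo = mul_64x64(ai, bj)
            s = out[i + j] + lo + carry      # may exceed 64 bits
            out[i + j] = s & MASK64          # keep the low limb
            carry = hi + (s >> 64)           # propagate the rest upward
        out[len(a) + j] = carry
    return out


def limbs_to_int(limbs):
    """Rebuild the integer value of a little-endian limb list."""
    return sum(limb << (64 * k) for k, limb in enumerate(limbs))


# Cross-check the three cases against Python's arbitrary-precision product.
for a, b in ([[MASK64], [MASK64]],
             [[MASK64, 7], [MASK64]],
             [[MASK64, MASK64], [MASK64, MASK64]]):
    assert limbs_to_int(mul_limbs(a, b)) == limbs_to_int(a) * limbs_to_int(b)
```

The 1x1 case needs a single `mul_64x64` call, while the 2x2 case needs four plus the carry handling, which is why the two-limb path executes more instructions.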
When I leave the asm as it is (as produced by gcc, with one small change, commenting out je .L189, to better show the problem), I get this performance (using the Linux perf stat -B tool):
23431,087207 task-clock # 0,976 CPUs utilized
2 109 context-switches # 0,000 M/sec
4 CPU-migrations # 0,000 M/sec
11 888 page-faults # 0,001 M/sec
49 043 462 004 cycles # 2,093 GHz [50,06%]
stalled-cycles-frontend
stalled-cycles-backend
30 713 070 462 instructions # 0,63 insns per cycle [75,02%]
4 492 657 867 branches # 191,739 M/sec [74,99%]
71 968 726 branch-misses # 1,60% of all branches [74,95%]
24,008123640 seconds time elapsed
If I comment out the je .L23 line in the assembly source (a jump that skips only 29 instructions), I get:
12919,383975 task-clock # 0,943 CPUs utilized
1 520 context-switches # 0,000 M/sec
15 CPU-migrations # 0,000 M/sec
11 887 page-faults # 0,001 M/sec
27 032 904 739 cycles # 2,092 GHz [50,04%]
stalled-cycles-frontend
stalled-cycles-backend
31 976 622 505 instructions # 1,18 insns per cycle [75,04%]
4 734 392 898 branches # 366,457 M/sec [75,03%]
64 698 800 branch-misses # 1,37% of all branches [74,93%]
13,704240040 seconds time elapsed
It runs much faster even though it actually executes more instructions (the IPC is nearly twice as high, and this is the IPC of the whole program).
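The IPC figures can be rechecked from the raw counters in the two perf stat outputs. The short Python sketch below only redoes that arithmetic. The dictionary keys and the `ipc` helper are hypothetical names chosen for illustration, and the numbers are copied from the two runs.

```python
# Counters copied from the two perf stat runs (cycles, instructions).
runs = {
    "with je .L23": {
        "cycles": 49_043_462_004,
        "instructions": 30_713_070_462,
    },
    "je .L23 commented out": {
        "cycles": 27_032_904_739,
        "instructions": 31_976_622_505,
    },
}


def ipc(run):
    """Instructions retired per cycle for one run."""
    return run["instructions"] / run["cycles"]


slow = runs["with je .L23"]
fast = runs["je .L23 commented out"]

for name, run in runs.items():
    print(f"{name}: IPC = {ipc(run):.2f}")

# Ratios between the two runs.
print(f"cycle ratio (slow/fast): {slow['cycles'] / fast['cycles']:.2f}")
print(f"instruction ratio (fast/slow): "
      f"{fast['instructions'] / slow['instructions']:.3f}")
```

The fast run retires about 4% more instructions but needs roughly 1.8 times fewer cycles, which matches the 0,63 and 1,18 insns per cycle reported by perf.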
I cannot explain this behavior. It has been seen on multiple Intel Core CPUs (not only mine, which is an Intel Core2 Duo T6500). The full benchmark code for Linux is available on request.
If I replace je .L23 with an unconditional jump, I get the slow behavior.
If I replace je .L23 with a nop instruction (or 2, 3 or 4 nops), I get the fast behavior.
Does anyone have an explanation for this?
Hello PpHd,
We do not recommend running benchmark software because it may show incorrect information. On our side, we have stress test software that you can run, and it will diagnose all internal components of the processor.
Here are the links:
64 bit:
https://downloadcenter.intel.com/Detail_Desc.aspx?DwnldID=19792&lang=eng
32 bit:
https://downloadcenter.intel.com/Detail_Desc.aspx?DwnldID=19791&lang=eng
Kevin M
Hello Kevin,
I want to clarify one thing: I am not trying to test my CPU with stress test software or another benchmark to diagnose a possible CPU failure. I am trying to improve my code to get maximum performance from Intel CPUs, and I see this behavior, which I cannot explain.
This behavior has been seen on the following CPUs:
Intel(R) Core(TM)2 Duo CPU T6500
Intel(R) Core(TM)2 Quad CPU Q9550
Intel(R) Core(TM) i5-3570 CPU
Intel(R) Core(TM) i5-2500 CPU
Intel(R) Core(TM) i5-4570 CPU
PpHd
Hello PpHd,
Thank you for the information. My best recommendation is for you to post your question in the Developer Zone. Here is the contact link:
https://software.intel.com/en-us/intel-developer-zone-responsive
Kevin M
Hello Kevin,
Thanks a lot for the information. I'll post my question there.
PpHd