Strange IPC behavior

PPéli · ‎10-14-2014

I have found a strange IPC behavior in a test program that benchmarks matrix multiplication using the MPFR library at 53 and 113 bits of precision. The 113-bit version was always much faster (typically 20-30%) even though it performs more computation. After analysis, I have reduced the problem to the mpfr_mul function.

Here is the assembly extract where I think the problem lies: it is in the mpfr_mul function, or more precisely in the section which performs the 1x1, 2x1 or 2x2 multiplication:

    cmpq $2, %r9
    jg .L21
    movq 24(%r14), %rsi
    leaq 8(%rbx), %rdi
    movq 24(%r13), %rcx
    movq (%rsi), %rax
# APP
# 324 "mul.c" 1
    mulq (%rcx)
# 0 "" 2
# NO_APP
    cmpq $1, %r9
    movq %rdx, %r11
    movq %rax, (%rbx)
    movq %rdx, 8(%rbx)
    je .L23                  # <-- the je .L23 discussed below
    movq 8(%rsi), %rax
# APP
# 334 "mul.c" 1
    mulq (%rcx)
# 0 "" 2
# 335 "mul.c" 1
    addq %rax,%r11
    adcq $0,%rdx
# 0 "" 2
# NO_APP
    cmpq $1, -136(%rbp)
    movq %rdx, 16(%rbx)
    movq %r11, (%rdi)
#   je .L189
    movq 8(%rcx), %r9
    movq (%rsi), %rcx
    movq %rcx, %rax
# APP
# 346 "mul.c" 1
    mulq %r9
# 0 "" 2
# NO_APP
    movq %rdx, %r11
    movq %rax, %rcx
    movq 8(%rsi), %rax
# APP
# 347 "mul.c" 1
    mulq %r9
# 0 "" 2
# 348 "mul.c" 1
    addq %rax,%r11
    adcq $0,%rdx
# 0 "" 2
# NO_APP
    movq 8(%rbx), %rax
    movq %rdx, 24(%rbx)
    movq 16(%rbx), %rdx
# APP
# 350 "mul.c" 1
    addq %rcx,%rax
    adcq %r11,%rdx
# 0 "" 2
# NO_APP
    movq %rdx, 16(%rbx)
    movq %rax, (%rdi)
    cmpq %r11, 16(%rbx)
    setb %r11b
    movzbl %r11b, %r11d
    addq 24(%rbx), %r11
    movq %r11, 24(%rbx)
.L23:
    subq -144(%rbp), %r8
    shrq $63, %r11
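
For readers who prefer C, the arithmetic in this extract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the MPFR source: the function name mul_small, its parameters and the use of unsigned __int128 (a GCC/Clang extension) are all hypothetical, and the mapping of the two early exits to je .L23 and je .L189 is my reading of the listing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef unsigned __int128 u128;   /* assumption: GCC or Clang on x86-64 */

/* Hypothetical helper: tmp[0..bn+cn-1] = b[0..bn-1] * c[0..cn-1]
   for 1 <= cn <= bn <= 2, with 64-bit words stored least significant first. */
void mul_small(uint64_t *tmp, const uint64_t *b, size_t bn,
               const uint64_t *c, size_t cn)
{
    assert(bn >= 1 && bn <= 2 && cn >= 1 && cn <= bn);

    /* All three cases start with the same 64x64 -> 128 bit product. */
    u128 p = (u128)b[0] * c[0];
    tmp[0] = (uint64_t)p;
    tmp[1] = (uint64_t)(p >> 64);
    if (bn == 1)
        return;                   /* 1x1 case: what I read as je .L23 */

    /* 2x1: add b[1]*c[0] one word higher. */
    p = (u128)b[1] * c[0] + tmp[1];
    tmp[1] = (uint64_t)p;
    tmp[2] = (uint64_t)(p >> 64);
    if (cn == 1)
        return;                   /* 2x1 case: what I read as je .L189 */

    /* 2x2: add (b[1]:b[0]) * c[1] one word higher, propagating the carry. */
    p = (u128)b[0] * c[1] + tmp[1];
    tmp[1] = (uint64_t)p;
    p = (u128)b[1] * c[1] + tmp[2] + (uint64_t)(p >> 64);
    tmp[2] = (uint64_t)p;
    tmp[3] = (uint64_t)(p >> 64);
}
```

Seen this way, the je .L23 exit only saves the second and third steps for one-word operands, which is why commenting it out makes the program execute more instructions.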

When I leave the asm as it is (as produced by gcc, with one little change, the commented-out je .L189, in order to better show the problem), I get this performance (using the Linux perf stat -B tool):

    23431,087207 task-clock               # 0,976 CPUs utilized
           2 109 context-switches         # 0,000 M/sec
               4 CPU-migrations           # 0,000 M/sec
          11 888 page-faults              # 0,001 M/sec
  49 043 462 004 cycles                   # 2,093 GHz [50,06%]
                 stalled-cycles-frontend
                 stalled-cycles-backend
  30 713 070 462 instructions             # 0,63 insns per cycle [75,02%]
   4 492 657 867 branches                 # 191,739 M/sec [74,99%]
      71 968 726 branch-misses            # 1,60% of all branches [74,95%]

    24,008123640 seconds time elapsed

If I comment out the je .L23 line in the assembly source (a jump which skips only 29 instructions), I get:

    12919,383975 task-clock               # 0,943 CPUs utilized
           1 520 context-switches         # 0,000 M/sec
              15 CPU-migrations           # 0,000 M/sec
          11 887 page-faults              # 0,001 M/sec
  27 032 904 739 cycles                   # 2,092 GHz [50,04%]
                 stalled-cycles-frontend
                 stalled-cycles-backend
  31 976 622 505 instructions             # 1,18 insns per cycle [75,04%]
   4 734 392 898 branches                 # 366,457 M/sec [75,03%]
      64 698 800 branch-misses            # 1,37% of all branches [74,93%]

    13,704240040 seconds time elapsed

It runs much faster even though it actually executes more instructions (the IPC is nearly twice as high, and this is the IPC of the whole program).

I cannot explain this behavior. It has been seen on multiple Intel Core CPUs (not only mine, which is an Intel Core2 Duo T6500). Full benchmark code for Linux is available on request.
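
The IPC figures printed by perf can be rechecked from the raw counters, since IPC is simply instructions divided by cycles. The small C sketch below does that for the two runs; the function and constant names are illustrative only, and the counter values are copied from the two perf stat outputs above.

```c
#include <assert.h>
#include <stdint.h>

/* Counter values copied from the two perf stat runs (hypothetical names). */
static const uint64_t SLOW_CYCLES = 49043462004ULL;  /* with je .L23 */
static const uint64_t SLOW_INSNS  = 30713070462ULL;
static const uint64_t FAST_CYCLES = 27032904739ULL;  /* je .L23 commented out */
static const uint64_t FAST_INSNS  = 31976622505ULL;

/* Instructions per cycle from two raw counters. */
double ipc(uint64_t instructions, uint64_t cycles)
{
    assert(cycles > 0);
    return (double)instructions / (double)cycles;
}

/* How many times larger a is than b (used for cycle and instruction ratios). */
double ratio(uint64_t a, uint64_t b)
{
    assert(b > 0);
    return (double)a / (double)b;
}
```

Calling ipc(SLOW_INSNS, SLOW_CYCLES) and ipc(FAST_INSNS, FAST_CYCLES) gives about 0.63 and 1.18, matching perf. ratio(SLOW_CYCLES, FAST_CYCLES) is about 1.81 while ratio(FAST_INSNS, SLOW_INSNS) is about 1.04, so the fast variant retires roughly 4% more instructions in a little over half the cycles.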

If I replace the je .L23 with an unconditional jump, I get the slow behavior.

If I replace the je .L23 with a nop instruction (or 2, 3 or 4 nops), I get the fast behavior.

Does anyone have an explanation for this?

Kevin_M_Intel · ‎10-15-2014

Hello PpHd,

We do not recommend running benchmark software because it may show incorrect information. On our side, we have stress test software you can run that will diagnose all internal components of the processor.

Here are the links:

64 bit:

https://downloadcenter.intel.com/Detail_Desc.aspx?DwnldID=19792&lang=eng

32 bit:

https://downloadcenter.intel.com/Detail_Desc.aspx?DwnldID=19791&lang=eng

Kevin M

PPéli · ‎10-15-2014

Hello Kevin,

I want to clarify one thing: I am not trying to test my CPU with stress test software or another benchmark in order to diagnose a possible CPU failure. I am trying to improve my code to get maximum performance from Intel CPUs, and I get this behavior, which I cannot explain.

This behavior has been seen on the following CPUs:

Intel(R) Core(TM)2 Duo CPU T6500

Intel(R) Core(TM)2 Quad CPU Q9550

Intel(R) Core(TM) i5-3570 CPU

Intel(R) Core(TM) i5-2500 CPU

Intel(R) Core(TM) i5-4570 CPU

PpHd

Kevin_M_Intel · ‎10-17-2014

Hello PpHd,

Thank you for the information. My best recommendation is for you to post your question at Developer Zone. Here is the contact link:

https://software.intel.com/en-us/intel-developer-zone-responsive

Kevin M

PPéli · ‎10-19-2014

Hello Kevin,

Thanks a lot for the information. I'll post my question there.

PpHd