topic Strange IPC behavior in Intel® ISA Extensions

Strange IPC behavior

Patrick_P_2 — Sun, 19 Oct 2014 20:11:33 GMT

Following the discussion at https://communities.intel.com/message/257079, I am creating this thread to get some help in explaining a strange behavior in the time taken by some instructions on Intel CPUs.

In short, I am measuring the IPC of a program in two cases:

Case 1: when I skip 29 instructions in the control flow of the program,
Case 2: when I execute them.

For case 1, I get the following perf output:

    23431,087207 task-clock # 0,976 CPUs utilized
    2 109 context-switches # 0,000 M/sec
    4 CPU-migrations # 0,000 M/sec
    11 888 page-faults # 0,001 M/sec
    49 043 462 004 cycles # 2,093 GHz [50,06%]
    stalled-cycles-frontend
    stalled-cycles-backend
    30 713 070 462 instructions # 0,63 insns per cycle [75,02%]
    4 492 657 867 branches # 191,739 M/sec [74,99%]
    71 968 726 branch-misses # 1,60% of all branches [74,95%]

    24,008123640 seconds time elapsed

For case 2, I get the following perf output:

    12919,383975 task-clock # 0,943 CPUs utilized
    1 520 context-switches # 0,000 M/sec
    15 CPU-migrations # 0,000 M/sec
    11 887 page-faults # 0,001 M/sec
    27 032 904 739 cycles # 2,092 GHz [50,04%]
    stalled-cycles-frontend
    stalled-cycles-backend
    31 976 622 505 instructions # 1,18 insns per cycle [75,04%]
    4 734 392 898 branches # 366,457 M/sec [75,03%]
    64 698 800 branch-misses # 1,37% of all branches [74,93%]

    13,704240040 seconds time elapsed

Case 2 runs much faster even though it actually executes more instructions (the IPC is nearly twice as high, and this is the IPC of the whole program). I cannot explain such behavior. It has been seen on multiple Intel Core CPUs:

    Intel(R) Core(TM)2 Duo CPU T6500
    Intel(R) Core(TM)2 Quad CPU Q9550
    Intel(R) Core(TM) i5-3570 CPU
    Intel(R) Core(TM) i5-2500 CPU
    Intel(R) Core(TM) i5-4570 CPU

If anyone has any clues about this, I would welcome them. (This is not really about Intel ISA Extensions, but I was not able to find a better forum.)
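For readers who want to check the figures, the IPC values can be recomputed from the raw counters in the two perf reports. The following is a minimal Python sketch, not part of the original post; the helper names `parse_count` and `ipc` are hypothetical, and the only inputs are the cycle and instruction counts quoted above (printed by perf with space-separated digit groups).

```python
def parse_count(text: str) -> int:
    """Parse a perf counter printed with space-separated digit groups."""
    return int("".join(text.split()))


def ipc(instructions: str, cycles: str) -> float:
    """Instructions per cycle, computed from two raw perf counters."""
    return parse_count(instructions) / parse_count(cycles)


# Counters copied from the two perf reports in the post.
case1_ipc = ipc("30 713 070 462", "49 043 462 004")
case2_ipc = ipc("31 976 622 505", "27 032 904 739")

print(f"case 1 IPC: {case1_ipc:.2f}")
print(f"case 2 IPC: {case2_ipc:.2f}")
print(f"case 2 / case 1: {case2_ipc / case1_ipc:.2f}")
```

The two computed values should agree with the "insns per cycle" column that perf prints, and their ratio quantifies the "nearly twice as high" observation.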

Bernard — Mon, 20 Oct 2014 08:10:47 GMT

Probably, while executing the case 2 code, the CPU is able to exploit Instruction Level Parallelism (ILP) more efficiently.

Vincent_Lefevre — Mon, 20 Oct 2014 11:50:43 GMT

iliyapolak wrote:

Probably, while executing the case 2 code, the CPU is able to exploit Instruction Level Parallelism (ILP) more efficiently.

Case 2 could yield better ILP overall, but this doesn't explain why it is much faster than case 1 (13.7s vs 24.0s) while there is more work to do.

Patrick, why are there more branches in case 2 than in case 1 while a "je" instruction has been commented out? It seems that the code flow is not exactly the same, and this could have an influence, IMHO.
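The difference in branch counts can be made explicit by subtracting the case 1 counters from the case 2 counters. This is only an illustrative Python sketch using the numbers from the two perf reports; the dictionary names are hypothetical, and the sampled counters (perf shows them at roughly 75% coverage) are scaled estimates, so small differences should be read with care.

```python
# Counters copied from the two perf reports at the top of the thread.
case1 = {
    "instructions": 30_713_070_462,
    "branches": 4_492_657_867,
    "branch_misses": 71_968_726,
}
case2 = {
    "instructions": 31_976_622_505,
    "branches": 4_734_392_898,
    "branch_misses": 64_698_800,
}

# Positive values mean case 2 counted more events than case 1.
delta = {name: case2[name] - case1[name] for name in case1}

for name, diff in delta.items():
    print(f"{name}: {diff:+,}")
```

A positive branch delta on this scale is what prompts the question above: removing a single "je" would not be expected to add branches, so the two runs are probably not following the same code path.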

Bernard — Mon, 20 Oct 2014 13:11:00 GMT

>>>but this doesn't explain why it is much faster than case 1 (13.7s vs 24.0s) while there is more work to do.>>>

Thanks for the correction; I did not pay attention to those 29 instructions.

I would try running VTune on those two versions of the code in order to get more comprehensive CPU metrics. The aforementioned code should also be run under a debugger, in order to see which code path is executed in the version where the code is compiled in (case 2).

Vincent_Lefevre — Tue, 21 Oct 2014 08:03:31 GMT

Vincent Lefevre wrote:

Patrick, why are there more branches in case 2 than in case 1 while a "je" instruction has been commented out? It seems that the code flow is not exactly the same, and this could have an influence, IMHO.

According to a private reply from Patrick, his asm excerpt was incorrect and did indeed yield a different code flow in case 1 and case 2, which explains the timings obtained.

Patrick_P_2 — Tue, 21 Oct 2014 11:44:09 GMT

To complete Vincent's answer, my asm modification of the code was wrong (the modification results in the corruption of a stack variable)... but the original IPC problem I wish to analyse is still present and is not affected by my wrong asm modification (and I get a comparable difference in the IPC with 53 bits vs 113 bits of precision). I will try to reduce the test case once again.