Intel® oneAPI HPC Toolkit
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Intel MPI real-time performance problem

wang_b_1
Beginner

Hello:
In my case, I found that MPI_Gather() sometimes takes 5000-40000 CPU cycles, but normally it takes only about 2000 CPU cycles.


I can confirm that there is no timer interrupt or other interrupt disturbing MPI_Gather(). I also tried calling mlockall() and replacing i_malloc and i_free with my own malloc(), but it did not help.
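
For reference, a minimal sketch of the mlockall() call I mean (error handling is reduced to a perror()):

#include <sys/mman.h>
#include <stdio.h>

/* Lock all current and future pages into RAM so that page faults
   cannot add latency during the timed MPI calls. */
static void lock_memory(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        perror("mlockall");
}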


When my program calls MPI_Gather(), it needs the call to return within a deterministic time. I don't know whether there is anything I can do to improve MPI real-time performance, or whether there is any tool that can help me find out why this function sometimes takes so long.


OS: Linux 3.10

cpuinfo: I use isolcpus, so the cpuinfo command may report wrong information; the CPU is actually 8 cores / 16 threads.

Intel(R) processor family information utility, Version 4.1 Update 3 Build 20140124
Copyright (C) 2005-2014 Intel Corporation.  All rights reserved.
=====  Processor composition  =====
Processor name    : Intel(R) Xeon(R)  E5-2660 0 
Packages(sockets) : 1
Cores             : 1
Processors(CPUs)  : 1
Cores per package : 1
Threads per core  : 1
=====  Processor identification  =====
Processor       Thread Id.      Core Id.        Package Id.
0               0               0               0   
=====  Placement on packages  =====
Package Id.     Core Id.        Processors
0               0               0
=====  Cache sharing  =====
Cache   Size            Processors
L1      32  KB          no sharing
L2      256 KB          no sharing
L3      20  MB          no sharing


The test program looks like this:
...
        t1 = rdtsc;
        Ierr = MPI_Bcast(&nn, 1, MPI_INTEGER, 0, MPI_COMM_WORLD);
        t2 = rdtsc;
        if (t2 > t1 && (t2 - t1) / 1000 > 5)
            record_it(t2 - t1);
...
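
For completeness, a self-contained sketch of the whole measurement loop (assumptions: __rdtsc() from <x86intrin.h> stands in for my rdtsc macro, MPI_INT replaces MPI_INTEGER, and record_it here just prints the outlier; a bucketing version of record_it is sketched further below):

#include <mpi.h>
#include <x86intrin.h>   /* __rdtsc() */
#include <stdio.h>

#define STEPS 1000000    /* same number of steps as in the statistics below */

/* Placeholder: just print the outlier; see the bucketing sketch further below. */
static void record_it(unsigned long long cycles)
{
    printf("slow call: %llu cycles\n", cycles);
}

int main(int argc, char **argv)
{
    int nn = 0;
    MPI_Init(&argc, &argv);

    for (long i = 0; i < STEPS; i++) {
        unsigned long long t1 = __rdtsc();
        MPI_Bcast(&nn, 1, MPI_INT, 0, MPI_COMM_WORLD);
        unsigned long long t2 = __rdtsc();
        if (t2 > t1 && (t2 - t1) / 1000 > 5)    /* record calls slower than 5000 cycles */
            record_it(t2 - t1);
    }

    MPI_Finalize();
    return 0;
}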

mpirun -n 4 -env I_MPI_DEBUG=4 ./my_test

[0] MPI startup(): Intel(R) MPI Library, Version 4.1 Update 3  Build 20140124
[0] MPI startup(): Copyright (C) 2003-2014 Intel Corporation.  All rights reserved.
[1] MPI startup(): shm data transfer mode
[3] MPI startup(): shm data transfer mode
[0] MPI startup(): shm data transfer mode
[2] MPI startup(): shm data transfer mode
[0] MPI startup(): Device_reset_idx=8
[0] MPI startup(): Allgather: 1: 1-1 & 0-2147483647
[0] MPI startup(): Allgather: 4: 2-4 & 0-2147483647
[0] MPI startup(): Allgather: 1: 5-10 & 0-2147483647
[0] MPI startup(): Allgather: 4: 11-22 & 0-2147483647
[0] MPI startup(): Allgather: 1: 23-469 & 0-2147483647
[0] MPI startup(): Allgather: 4: 470-544 & 0-2147483647
[0] MPI startup(): Allgather: 1: 545-3723 & 0-2147483647
[0] MPI startup(): Allgather: 3: 3724-59648 & 0-2147483647
[0] MPI startup(): Allgather: 1: 59649-3835119 & 0-2147483647
[0] MPI startup(): Allgather: 3: 0-2147483647 & 0-2147483647
[0] MPI startup(): Allgatherv: 1: 0-1942 & 0-2147483647
[0] MPI startup(): Allgatherv: 3: 1942-128426 & 0-2147483647
[0] MPI startup(): Allgatherv: 4: 128426-193594 & 0-2147483647
[0] MPI startup(): Allgatherv: 3: 193594-454523 & 0-2147483647
[0] MPI startup(): Allgatherv: 4: 454523-561981 & 0-2147483647
[0] MPI startup(): Allgatherv: 3: 0-2147483647 & 0-2147483647
[0] MPI startup(): Allreduce: 7: 0-6 & 0-2147483647
[0] MPI startup(): Allreduce: 1: 6-13 & 0-2147483647
[0] MPI startup(): Allreduce: 7: 13-37 & 0-2147483647
[0] MPI startup(): Allreduce: 1: 37-104 & 0-2147483647
[0] MPI startup(): Allreduce: 7: 104-409 & 0-2147483647
[0] MPI startup(): Allreduce: 1: 409-5708 & 0-2147483647
[0] MPI startup(): Allreduce: 7: 5708-12660 & 0-2147483647
[0] MPI startup(): Allreduce: 2: 12660-61166 & 0-2147483647
[0] MPI startup(): Allreduce: 6: 61166-74718 & 0-2147483647
[0] MPI startup(): Allreduce: 8: 74718-163640 & 0-2147483647
[0] MPI startup(): Allreduce: 2: 163640-355186 & 0-2147483647
[0] MPI startup(): Allreduce: 6: 355186-665233 & 0-2147483647
[0] MPI startup(): Allreduce: 7: 0-2147483647 & 0-2147483647
[0] MPI startup(): Alltoall: 3: 0-1 & 0-2147483647
[0] MPI startup(): Alltoall: 2: 2-2 & 0-2147483647
[0] MPI startup(): Alltoall: 3: 3-25 & 0-2147483647
[0] MPI startup(): Alltoall: 2: 26-48 & 0-2147483647
[0] MPI startup(): Alltoall: 3: 49-1826 & 0-2147483647
[0] MPI startup(): Alltoall: 2: 1827-947308 & 0-2147483647
[0] MPI startup(): Alltoall: 3: 947309-1143512 & 0-2147483647
[0] MPI startup(): Alltoall: 2: 1143513-3715953 & 0-2147483647
[0] MPI startup(): Alltoall: 3: 0-2147483647 & 0-2147483647
[0] MPI startup(): Alltoallv: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Alltoallw: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Barrier: 1: 0-2147483647 & 0-2147483647
[0] MPI startup(): Bcast: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Exscan: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Gather: 3: 1-2045 & 0-2147483647
[0] MPI startup(): Gather: 2: 2046-3072 & 0-2147483647
[0] MPI startup(): Gather: 3: 3073-313882 & 0-2147483647
[0] MPI startup(): Gather: 2: 0-2147483647 & 0-2147483647
[0] MPI startup(): Gatherv: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Reduce_scatter: 4: 0-5 & 0-2147483647
[0] MPI startup(): Reduce_scatter: 1: 5-162 & 0-2147483647
[0] MPI startup(): Reduce_scatter: 3: 162-81985 & 0-2147483647
[0] MPI startup(): Reduce_scatter: 2: 81985-690794 & 0-2147483647
[0] MPI startup(): Reduce_scatter: 5: 0-2147483647 & 0-2147483647
[0] MPI startup(): Reduce: 1: 4-11458 & 0-2147483647
[0] MPI startup(): Reduce: 5: 11459-22008 & 0-2147483647
[0] MPI startup(): Reduce: 1: 0-2147483647 & 0-2147483647
[0] MPI startup(): Scan: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Scatter: 3: 1-24575 & 0-2147483647
[0] MPI startup(): Scatter: 2: 24576-37809 & 0-2147483647
[0] MPI startup(): Scatter: 3: 37810-107941 & 0-2147483647
[0] MPI startup(): Scatter: 2: 107942-399769 & 0-2147483647
[0] MPI startup(): Scatter: 3: 399770-2150807 & 0-2147483647
[0] MPI startup(): Scatter: 2: 0-2147483647 & 0-2147483647
[0] MPI startup(): Scatterv: 0: 0-2147483647 & 0-2147483647
[1] MPI startup(): Recognition=2 Platform(code=8 ippn=2 dev=1) Fabric(intra=1 inter=1 flags=0x0)
[3] MPI startup(): Recognition=2 Platform(code=8 ippn=2 dev=1) Fabric(intra=1 inter=1 flags=0x0)
mrloop is using core 3
mrloop is using core 5
[2] MPI startup(): Recognition=2 Platform(code=8 ippn=2 dev=1) Fabric(intra=1 inter=1 flags=0x0)
mrloop is using core 4
[0] MPI startup(): Rank    Pid      Node name  Pin cpu
[0] MPI startup(): 0       3477     zte        0
[0] MPI startup(): 1       3478     zte        0
[0] MPI startup(): 2       3479     zte        0
[0] MPI startup(): 3       3480     zte        0
[0] MPI startup(): Recognition=2 Platform(code=8 ippn=2 dev=1) Fabric(intra=1 inter=1 flags=0x0)
[0] MPI startup(): I_MPI_DEBUG=8
[0] MPI startup(): I_MPI_PIN_MAPPING=4:0 0,1 0,2 0,3 0

core 2: statistics over 1000000 steps:
         3:997331  4:2168   5:1  46:1 
core 3: statistics over 1000000 steps:
         1:4497  2:995003  3:1 
core 4: statistics over 1000000 steps:
         1:2767  2:996733  16:1 
core 5: statistics over 1000000 steps:
         1:1  2:999070  3:430 

 

3:997331 means: 997331 times the call took 3000 CPU cycles.

4:2168 means: 2168 times the call took 4000 CPU cycles.

As you can see, one time it took 46000 CPU cycles, but in the other 999501 times it only took 1000-3000 CPU cycles.
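
A sketch of how record_it() could accumulate these bucket counts (assuming 1000-cycle buckets; the bucket cap of 64 is arbitrary):

#include <stdio.h>

#define MAX_BUCKET 64   /* buckets of 1000 cycles each; 64 is an arbitrary cap */

static unsigned long hist[MAX_BUCKET + 1];

/* Count one measurement into its 1000-cycle bucket,
   e.g. 3xxx cycles goes into bucket 3. */
static void record_it(unsigned long long cycles)
{
    unsigned long long bucket = cycles / 1000;
    if (bucket > MAX_BUCKET)
        bucket = MAX_BUCKET;
    hist[bucket]++;
}

/* Print the histogram in the "bucket:count" form shown above, e.g. "3:997331". */
static void print_hist(void)
{
    for (int b = 0; b <= MAX_BUCKET; b++)
        if (hist[b])
            printf("%d:%lu  ", b, hist[b]);
    printf("\n");
}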

James_T_Intel
Moderator

Does this also occur in the latest version of the Intel® MPI Library, 2017 Update 1?
