Intel® oneAPI HPC Toolkit
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Intel MPI real-time performance problem

wang_b_1
Beginner

Hello:
In my case, I found that MPI_Gather() sometimes takes 5000-40000 CPU cycles, but normally it takes only about 2000 CPU cycles.


I can confirm that there is no timer interrupt or other interrupt disturbing MPI_Gather(). I also tried calling mlockall() and replacing i_malloc and i_free with my own malloc(), but it did not help.
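
For reference, a minimal sketch of the mlockall() call I mean (error handling is reduced to a perror()):

#include <sys/mman.h>
#include <stdio.h>

/* Lock all current and future pages into RAM so that page faults
   cannot add latency during the timed MPI calls. */
static void lock_memory(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        perror("mlockall");
}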


When my program calls MPI_Gather(), it needs the call to return within a deterministic time. I don't know whether there is anything I can do to improve MPI real-time performance, or whether there is any tool that can help me find out why this function sometimes takes so long.


OS: Linux 3.10

cpuinfo: I use isolcpus, so the cpuinfo command may report wrong information; the CPU is actually 8 cores / 16 threads.

Intel(R) processor family information utility, Version 4.1 Update 3 Build 20140124
Copyright (C) 2005-2014 Intel Corporation.  All rights reserved.
=====  Processor composition  =====
Processor name    : Intel(R) Xeon(R)  E5-2660 0 
Packages(sockets) : 1
Cores             : 1
Processors(CPUs)  : 1
Cores per package : 1
Threads per core  : 1
=====  Processor identification  =====
Processor       Thread Id.      Core Id.        Package Id.
0               0               0               0   
=====  Placement on packages  =====
Package Id.     Core Id.        Processors
0               0               0
=====  Cache sharing  =====
Cache   Size            Processors
L1      32  KB          no sharing
L2      256 KB          no sharing
L3      20  MB          no sharing


The test program looks like this:
...
        t1 = rdtsc;
        Ierr = MPI_Bcast(&nn, 1, MPI_INTEGER, 0, MPI_COMM_WORLD);
        t2 = rdtsc;
        if (t2 > t1 && (t2 - t1) / 1000 > 5)
            record_it(t2 - t1);
...
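
For completeness, a self-contained sketch of the whole measurement loop (assumptions: __rdtsc() from <x86intrin.h> stands in for my rdtsc macro, MPI_INT replaces MPI_INTEGER, and record_it here just prints the outlier; a bucketing version of record_it is sketched further below):

#include <mpi.h>
#include <x86intrin.h>   /* __rdtsc() */
#include <stdio.h>

#define STEPS 1000000    /* same number of steps as in the statistics below */

/* Placeholder: just print the outlier; see the bucketing sketch further below. */
static void record_it(unsigned long long cycles)
{
    printf("slow call: %llu cycles\n", cycles);
}

int main(int argc, char **argv)
{
    int nn = 0;
    MPI_Init(&argc, &argv);

    for (long i = 0; i < STEPS; i++) {
        unsigned long long t1 = __rdtsc();
        MPI_Bcast(&nn, 1, MPI_INT, 0, MPI_COMM_WORLD);
        unsigned long long t2 = __rdtsc();
        if (t2 > t1 && (t2 - t1) / 1000 > 5)    /* record calls slower than 5000 cycles */
            record_it(t2 - t1);
    }

    MPI_Finalize();
    return 0;
}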

mpirun -n 4 -env I_MPI_DEBUG=4 ./my_test

[0] MPI startup(): Intel(R) MPI Library, Version 4.1 Update 3  Build 20140124
[0] MPI startup(): Copyright (C) 2003-2014 Intel Corporation.  All rights reserved.
[1] MPI startup(): shm data transfer mode
[3] MPI startup(): shm data transfer mode
[0] MPI startup(): shm data transfer mode
[2] MPI startup(): shm data transfer mode
[0] MPI startup(): Device_reset_idx=8
[0] MPI startup(): Allgather: 1: 1-1 & 0-2147483647
[0] MPI startup(): Allgather: 4: 2-4 & 0-2147483647
[0] MPI startup(): Allgather: 1: 5-10 & 0-2147483647
[0] MPI startup(): Allgather: 4: 11-22 & 0-2147483647
[0] MPI startup(): Allgather: 1: 23-469 & 0-2147483647
[0] MPI startup(): Allgather: 4: 470-544 & 0-2147483647
[0] MPI startup(): Allgather: 1: 545-3723 & 0-2147483647
[0] MPI startup(): Allgather: 3: 3724-59648 & 0-2147483647
[0] MPI startup(): Allgather: 1: 59649-3835119 & 0-2147483647
[0] MPI startup(): Allgather: 3: 0-2147483647 & 0-2147483647
[0] MPI startup(): Allgatherv: 1: 0-1942 & 0-2147483647
[0] MPI startup(): Allgatherv: 3: 1942-128426 & 0-2147483647
[0] MPI startup(): Allgatherv: 4: 128426-193594 & 0-2147483647
[0] MPI startup(): Allgatherv: 3: 193594-454523 & 0-2147483647
[0] MPI startup(): Allgatherv: 4: 454523-561981 & 0-2147483647
[0] MPI startup(): Allgatherv: 3: 0-2147483647 & 0-2147483647
[0] MPI startup(): Allreduce: 7: 0-6 & 0-2147483647
[0] MPI startup(): Allreduce: 1: 6-13 & 0-2147483647
[0] MPI startup(): Allreduce: 7: 13-37 & 0-2147483647
[0] MPI startup(): Allreduce: 1: 37-104 & 0-2147483647
[0] MPI startup(): Allreduce: 7: 104-409 & 0-2147483647
[0] MPI startup(): Allreduce: 1: 409-5708 & 0-2147483647
[0] MPI startup(): Allreduce: 7: 5708-12660 & 0-2147483647
[0] MPI startup(): Allreduce: 2: 12660-61166 & 0-2147483647
[0] MPI startup(): Allreduce: 6: 61166-74718 & 0-2147483647
[0] MPI startup(): Allreduce: 8: 74718-163640 & 0-2147483647
[0] MPI startup(): Allreduce: 2: 163640-355186 & 0-2147483647
[0] MPI startup(): Allreduce: 6: 355186-665233 & 0-2147483647
[0] MPI startup(): Allreduce: 7: 0-2147483647 & 0-2147483647
[0] MPI startup(): Alltoall: 3: 0-1 & 0-2147483647
[0] MPI startup(): Alltoall: 2: 2-2 & 0-2147483647
[0] MPI startup(): Alltoall: 3: 3-25 & 0-2147483647
[0] MPI startup(): Alltoall: 2: 26-48 & 0-2147483647
[0] MPI startup(): Alltoall: 3: 49-1826 & 0-2147483647
[0] MPI startup(): Alltoall: 2: 1827-947308 & 0-2147483647
[0] MPI startup(): Alltoall: 3: 947309-1143512 & 0-2147483647
[0] MPI startup(): Alltoall: 2: 1143513-3715953 & 0-2147483647
[0] MPI startup(): Alltoall: 3: 0-2147483647 & 0-2147483647
[0] MPI startup(): Alltoallv: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Alltoallw: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Barrier: 1: 0-2147483647 & 0-2147483647
[0] MPI startup(): Bcast: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Exscan: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Gather: 3: 1-2045 & 0-2147483647
[0] MPI startup(): Gather: 2: 2046-3072 & 0-2147483647
[0] MPI startup(): Gather: 3: 3073-313882 & 0-2147483647
[0] MPI startup(): Gather: 2: 0-2147483647 & 0-2147483647
[0] MPI startup(): Gatherv: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Reduce_scatter: 4: 0-5 & 0-2147483647
[0] MPI startup(): Reduce_scatter: 1: 5-162 & 0-2147483647
[0] MPI startup(): Reduce_scatter: 3: 162-81985 & 0-2147483647
[0] MPI startup(): Reduce_scatter: 2: 81985-690794 & 0-2147483647
[0] MPI startup(): Reduce_scatter: 5: 0-2147483647 & 0-2147483647
[0] MPI startup(): Reduce: 1: 4-11458 & 0-2147483647
[0] MPI startup(): Reduce: 5: 11459-22008 & 0-2147483647
[0] MPI startup(): Reduce: 1: 0-2147483647 & 0-2147483647
[0] MPI startup(): Scan: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Scatter: 3: 1-24575 & 0-2147483647
[0] MPI startup(): Scatter: 2: 24576-37809 & 0-2147483647
[0] MPI startup(): Scatter: 3: 37810-107941 & 0-2147483647
[0] MPI startup(): Scatter: 2: 107942-399769 & 0-2147483647
[0] MPI startup(): Scatter: 3: 399770-2150807 & 0-2147483647
[0] MPI startup(): Scatter: 2: 0-2147483647 & 0-2147483647
[0] MPI startup(): Scatterv: 0: 0-2147483647 & 0-2147483647
[1] MPI startup(): Recognition=2 Platform(code=8 ippn=2 dev=1) Fabric(intra=1 inter=1 flags=0x0)
[3] MPI startup(): Recognition=2 Platform(code=8 ippn=2 dev=1) Fabric(intra=1 inter=1 flags=0x0)
mrloop is using core 3
mrloop is using core 5
[2] MPI startup(): Recognition=2 Platform(code=8 ippn=2 dev=1) Fabric(intra=1 inter=1 flags=0x0)
mrloop is using core 4
[0] MPI startup(): Rank    Pid      Node name  Pin cpu
[0] MPI startup(): 0       3477     zte        0
[0] MPI startup(): 1       3478     zte        0
[0] MPI startup(): 2       3479     zte        0
[0] MPI startup(): 3       3480     zte        0
[0] MPI startup(): Recognition=2 Platform(code=8 ippn=2 dev=1) Fabric(intra=1 inter=1 flags=0x0)
[0] MPI startup(): I_MPI_DEBUG=8
[0] MPI startup(): I_MPI_PIN_MAPPING=4:0 0,1 0,2 0,3 0

core 2: statistics over 1000000 steps:
         3:997331  4:2168   5:1  46:1 
core 3: statistics over 1000000 steps:
         1:4497  2:995003  3:1 
core 4: statistics over 1000000 steps:
         1:2767  2:996733  16:1 
core 5: statistics over 1000000 steps:
         1:1  2:999070  3:430 

 

3:997331 means: 997331 times the call took 3000 CPU cycles.

4:2168 means: 2168 times the call took 4000 CPU cycles.

As you can see, one time it took 46000 CPU cycles, but in the other 999501 times it only took 1000-3000 CPU cycles.
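
A sketch of how record_it() could accumulate these bucket counts (assuming 1000-cycle buckets; the bucket cap of 64 is arbitrary):

#include <stdio.h>

#define MAX_BUCKET 64   /* buckets of 1000 cycles each; 64 is an arbitrary cap */

static unsigned long hist[MAX_BUCKET + 1];

/* Count one measurement into its 1000-cycle bucket,
   e.g. 3xxx cycles goes into bucket 3. */
static void record_it(unsigned long long cycles)
{
    unsigned long long bucket = cycles / 1000;
    if (bucket > MAX_BUCKET)
        bucket = MAX_BUCKET;
    hist[bucket]++;
}

/* Print the histogram in the "bucket:count" form shown above, e.g. "3:997331". */
static void print_hist(void)
{
    for (int b = 0; b <= MAX_BUCKET; b++)
        if (hist[b])
            printf("%d:%lu  ", b, hist[b]);
    printf("\n");
}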

James_T_Intel
Moderator

Does this also occur in the latest version of the Intel® MPI Library, 2017 Update 1?
