侯玉山
Novice

With I_MPI_DEBUG=6, no binding information is printed

Hi,

I execute the MPI program with the following command:

[host test]# mpiexec -genv I_MPI_DEBUG=6 -n 6 -ppn 3 -hosts lico-c1,head ./test1.o
[0] MPI startup(): Intel(R) MPI Library, Version 2021.1 Build 20201112 (id: b9c9d2fc5)
[0] MPI startup(): Copyright (C) 2003-2020 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.11.0-impi
[0] MPI startup(): libfabric provider: psm2
[0] MPI startup(): I_MPI_ROOT=/opt/intel/oneapi/mpi/2021.1.1
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_DEBUG=6
Hello world: rank 0 of 6 running on lico-C1
Hello world: rank 1 of 6 running on lico-C1
Hello world: rank 2 of 6 running on lico-C1
Hello world: rank 3 of 6 running on head
Hello world: rank 4 of 6 running on head
Hello world: rank 5 of 6 running on head

 

The program runs normally, but it does not output the binding information.

I would like to know why.

Thanks

11 Replies
侯玉山
Novice

It also printed this information:

[0] MPI startup(): Incorrect Gather result in I_MPI_Pinning_printing 

SantoshY_Intel
Moderator

Hi,

 

Thanks for reaching out to us.

 

Could you try increasing the debug level and check again?

Also, it seems like you are getting a message regarding an incorrect Gather result.

You can use the -check_mpi option from ITAC (Intel Trace Analyzer and Collector) to check the correctness of an MPI program.

 

Could you please try the command below and provide us the output log?

I_MPI_DEBUG=30 mpiexec -check_mpi -n 6 -ppn 3 -hosts lico-c1,head ./test1.o
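Note that -check_mpi is provided by ITAC, so the ITAC environment needs to be loaded before running. A hedged sketch (the setvars.sh path is an assumption based on the I_MPI_ROOT shown in your logs):

```shell
# Load the oneAPI environment so that the -check_mpi (ITAC) runtime is found.
# The path below is an assumption based on I_MPI_ROOT=/opt/intel/oneapi/mpi/2021.1.1.
source /opt/intel/oneapi/setvars.sh

# Then run with message checking and a higher debug level:
I_MPI_DEBUG=30 mpiexec -check_mpi -n 6 -ppn 3 -hosts lico-c1,head ./test1.o
```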

 

Thanks & Regards,

Santosh

 

侯玉山
Novice

Hi,

Here is the output from the new command.

This problem only appears across multiple nodes; on a single node the binding information is printed.

[root@head mpi_dir]# I_MPI_DEBUG=30 mpiexec -check_mpi -genv -n 6 -ppn 3 -hosts lico-c1,head ./test1.o
[0] MPI startup(): Intel(R) MPI Library, Version 2021.1 Build 20201112 (id: b9c9d2fc5)
[0] MPI startup(): Copyright (C) 2003-2020 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): Size of shared memory segment (374 MB per rank) * (3 local ranks) = 1122 MB total
[0] MPI startup(): libfabric version: 1.11.0-impi
libfabric:1122027:core:core:ofi_hmem_init():202<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:1122027:core:core:ofi_hmem_init():202<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:1122027:core:core:ofi_hmem_init():202<info> Hmem iface FI_HMEM_ZE not supported
libfabric:1122027:core:core:ofi_hmem_init():202<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:1122027:core:core:ofi_hmem_init():202<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:1122027:core:core:ofi_hmem_init():202<info> Hmem iface FI_HMEM_ZE not supported
libfabric:1122027:core:core:ofi_register_provider():427<info> registering provider: verbs (111.0)
libfabric:1122027:core:core:ofi_register_provider():427<info> registering provider: tcp (111.0)
libfabric:1122027:core:core:ofi_register_provider():427<info> registering provider: sockets (111.0)
libfabric:1122027:core:core:ofi_register_provider():427<info> registering provider: shm (111.0)
libfabric:1122027:core:core:ofi_register_provider():427<info> registering provider: ofi_rxm (111.0)
[3] MPI startup(): Size of shared memory segment (406 MB per rank) * (3 local ranks) = 1218 MB total
libfabric:1122027:core:core:ofi_register_provider():427<info> registering provider: psm2 (111.0)
libfabric:1122027:core:core:ofi_register_provider():427<info> registering provider: mlx (1.4)
libfabric:1122027:core:core:ofi_register_provider():427<info> registering provider: ofi_hook_noop (111.0)
libfabric:1122027:core:core:fi_getinfo_():1117<info> Found provider with the highest priority psm2, must_use_util_prov = 0
libfabric:1122027:core:core:fi_getinfo_():1144<info> Since psm2 can be used, mlx has been skipped. To use mlx, please, set FI_PROVIDER=mlx
libfabric:1122027:core:core:fi_getinfo_():1144<info> Since psm2 can be used, verbs has been skipped. To use verbs, please, set FI_PROVIDER=verbs
libfabric:1122027:core:core:fi_getinfo_():1144<info> Since psm2 can be used, tcp has been skipped. To use tcp, please, set FI_PROVIDER=tcp
libfabric:1122027:core:core:fi_getinfo_():1144<info> Since psm2 can be used, sockets has been skipped. To use sockets, please, set FI_PROVIDER=sockets
libfabric:1122027:core:core:fi_getinfo_():1144<info> Since psm2 can be used, shm has been skipped. To use shm, please, set FI_PROVIDER=shm
libfabric:1122027:core:core:fi_getinfo_():1117<info> Found provider with the highest priority psm2, must_use_util_prov = 0
libfabric:1122027:core:core:fi_getinfo_():1144<info> Since psm2 can be used, mlx has been skipped. To use mlx, please, set FI_PROVIDER=mlx
libfabric:1122027:core:core:fi_getinfo_():1144<info> Since psm2 can be used, verbs has been skipped. To use verbs, please, set FI_PROVIDER=verbs
libfabric:1122027:core:core:fi_getinfo_():1144<info> Since psm2 can be used, tcp has been skipped. To use tcp, please, set FI_PROVIDER=tcp
libfabric:1122027:core:core:fi_getinfo_():1144<info> Since psm2 can be used, sockets has been skipped. To use sockets, please, set FI_PROVIDER=sockets
libfabric:1122027:core:core:fi_getinfo_():1144<info> Since psm2 can be used, shm has been skipped. To use shm, please, set FI_PROVIDER=shm
[0] MPI startup(): libfabric provider: psm2
[0] MPI startup(): detected psm2 provider, set device name to "psm2"
libfabric:1122027:core:core:fi_fabric_():1397<info> Opened fabric: psm2
[0] MPI startup(): max_ch4_vcis: 1, max_reg_eps 1, enable_sep 0, enable_shared_ctxs 0, do_av_insert 1
libfabric:1122027:core:core:ofi_shm_map():137<warn> shm_open failed
[0] MPI startup(): addrnamelen: 16
libfabric:1122027:core:core:ofi_ns_add_local_name():370<warn> Cannot add local name - name server uninitialized
[0] MPI startup(): Load tuning file: "/opt/intel/oneapi/mpi/2021.1.1/etc/tuning_skx_shm-ofi.dat"
[2] MPI startup(): Incorrect Gather result in I_MPI_Pinning_printing
[0] MPI startup(): Incorrect Gather result in I_MPI_Pinning_printing
[0] MPI startup(): I_MPI_ROOT=/opt/intel/oneapi/mpi/2021.1.1
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_DEBUG=30

[0] INFO: CHECK LOCAL:EXIT:SIGNAL ON
[0] INFO: CHECK LOCAL:EXIT:BEFORE_MPI_FINALIZE ON
[0] INFO: CHECK LOCAL:MPI:CALL_FAILED ON
[0] INFO: CHECK LOCAL:MEMORY:OVERLAP ON
[0] INFO: CHECK LOCAL:MEMORY:ILLEGAL_MODIFICATION ON
[0] INFO: CHECK LOCAL:MEMORY:INACCESSIBLE ON
[0] INFO: CHECK LOCAL:MEMORY:ILLEGAL_ACCESS OFF
[0] INFO: CHECK LOCAL:MEMORY:INITIALIZATION OFF
[0] INFO: CHECK LOCAL:REQUEST:ILLEGAL_CALL ON
[0] INFO: CHECK LOCAL:REQUEST:NOT_FREED ON
[0] INFO: CHECK LOCAL:REQUEST:PREMATURE_FREE ON
[0] INFO: CHECK LOCAL:DATATYPE:NOT_FREED ON
[0] INFO: CHECK LOCAL:BUFFER:INSUFFICIENT_BUFFER ON
[0] INFO: CHECK GLOBAL:DEADLOCK:HARD ON
[0] INFO: CHECK GLOBAL:DEADLOCK:POTENTIAL ON
[0] INFO: CHECK GLOBAL:DEADLOCK:NO_PROGRESS ON
[0] INFO: CHECK GLOBAL:MSG:DATATYPE:MISMATCH ON
[0] INFO: CHECK GLOBAL:MSG:DATA_TRANSMISSION_CORRUPTED ON
[0] INFO: CHECK GLOBAL:MSG:PENDING ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:DATATYPE:MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:DATA_TRANSMISSION_CORRUPTED ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:OPERATION_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:SIZE_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:REDUCTION_OPERATION_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:ROOT_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:INVALID_PARAMETER ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:COMM_FREE_MISMATCH ON
[0] INFO: maximum number of errors before aborting: CHECK-MAX-ERRORS 1
[0] INFO: maximum number of reports before aborting: CHECK-MAX-REPORTS 0 (= unlimited)
[0] INFO: maximum number of times each error is reported: CHECK-SUPPRESSION-LIMIT 10
[0] INFO: timeout for deadlock detection: DEADLOCK-TIMEOUT 60s
[0] INFO: timeout for deadlock warning: DEADLOCK-WARNING 300s
[0] INFO: maximum number of reported pending messages: CHECK-MAX-PENDING 20

Hello world: rank 0 of 6 running on lico-C1
Hello world: rank 1 of 6 running on lico-C1
Hello world: rank 2 of 6 running on lico-C1
Hello world: rank 3 of 6 running on head
Hello world: rank 4 of 6 running on head
Hello world: rank 5 of 6 running on head

[0] INFO: Error checking completed without finding any problems.

 

Thanks & Regards

SantoshY_Intel
Moderator

Hi,


Thanks for your response.


Could you please provide the cpuinfo details of both nodes?


Thanks & Regards,

Santosh




侯玉山
Novice

Hi,

cpuinfo:

head:

[root@head ~]# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 72
On-line CPU(s) list: 0-71
Thread(s) per core: 2
Core(s) per socket: 18
Socket(s): 2
NUMA node(s): 4
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz
Stepping: 4
CPU MHz: 1400.235
CPU max MHz: 2701.0000
CPU min MHz: 1200.0000
BogoMIPS: 5400.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 25344K
NUMA node0 CPU(s): 0-2,5,6,9,10,14,15,36-38,41,42,45,46,50,51
NUMA node1 CPU(s): 3,4,7,8,11-13,16,17,39,40,43,44,47-49,52,53
NUMA node2 CPU(s): 18-20,23,24,27,28,32,33,54-56,59,60,63,64,68,69
NUMA node3 CPU(s): 21,22,25,26,29-31,34,35,57,58,61,62,65-67,70,71
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear flush_l1d

 

lico-C1:

[root@lico-C1 ~]# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz
Stepping: 4
CPU MHz: 2299.978
CPU max MHz: 2601.0000
CPU min MHz: 1000.0000
BogoMIPS: 5200.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 19712K
NUMA node0 CPU(s): 0-11,24-35
NUMA node1 CPU(s): 12-23,36-47
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear flush_l1d

 

Thank you for your help.

 

SantoshY_Intel
Moderator

Hi,


Could you please provide some more information so that we can investigate your issue further?

Could you please provide the OS and scheduler details of both nodes?

Also, do you see the same issue when executing other sample MPI programs?


Awaiting your reply.


Thanks & Regards,

Santosh


侯玉山
Novice

Hi,

Thanks for your help.

I collected some more CPU details using cat /proc/cpuinfo:

[root@head ~]# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz
stepping : 4
microcode : 0x2000065
cpu MHz : 1371.976
cache size : 25344 KB
physical id : 0
siblings : 36
core id : 0
cpu cores : 18
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear flush_l1d
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit
bogomips : 5400.00
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz
stepping : 4
microcode : 0x2000065
cpu MHz : 1232.666
cache size : 25344 KB
physical id : 0
siblings : 36
core id : 1
cpu cores : 18
apicid : 2
initial apicid : 2
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear flush_l1d
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit
bogomips : 5400.00
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:

processor : 2
vendor_id : GenuineIntel
cpu family : 6
model : 85

...

 

 

Also, after I added "export I_MPI_PLATFORM=auto", I do get the binding information. But I don't know why I_MPI_DEBUG=4 alone doesn't work.
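For reference, the full working invocation with the workaround looks like this (a sketch assembled from the commands above):

```shell
# Workaround: force a common platform definition across the two different
# CPU models in this cluster (Gold 6150 on head, Gold 6126 on lico-c1).
export I_MPI_PLATFORM=auto
export I_MPI_DEBUG=6    # binding information is printed at debug level >= 4

mpiexec -n 6 -ppn 3 -hosts lico-c1,head ./test1.o
```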

 

 

SantoshY_Intel
Moderator

Hi,

 

>>"I added "export I_MPI_PLATFORM=auto", Then I get the binding information."

Glad to know that your issue is resolved.

 

>> "I don't know why I_MPI_DEBUG=4 doesn't work"

On multi-node systems with similar CPU SKUs, we do see the binding information with I_MPI_DEBUG alone, without setting I_MPI_PLATFORM.

Having different CPU SKUs across the nodes might be causing this behavior, i.e. the binding information not being printed.
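Independent of what the library prints, each rank can also verify its own pinning from inside the program. A minimal Linux-only sketch (not part of the original test1.o; in a real MPI program each rank would print this after MPI_Init, prefixed with its rank):

```python
import os
import socket

def affinity_report() -> str:
    """Return the host name and the set of CPUs this process may run on."""
    # os.sched_getaffinity is Linux-only; it reads the process's CPU affinity mask.
    cpus = sorted(os.sched_getaffinity(0))
    return f"{socket.gethostname()}: bound to CPUs {cpus}"

print(affinity_report())
```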

 

For more information regarding I_MPI_PLATFORM please refer to the link below:

https://software.intel.com/content/www/us/en/develop/documentation/mpi-developer-reference-linux/top...

 

Let us know if there is anything else we can help you with.

If no, please confirm whether we can close this thread from our end. 

 

Awaiting your reply.

 

Thanks & Regards,

Santosh

 

侯玉山
Novice

Hello,

Our problem has been solved, and this thread can be closed.

Thank you again for your support and help.

 


SantoshY_Intel
Moderator

Hi,

 

Thanks for the confirmation!

As this issue has been resolved, we will no longer respond to this thread. If you require any additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.

 

Have a Good day.

 

Thanks & Regards

Santosh

 
