John_H_19
Beginner
593 Views

Abaqus with Omnipath

I am trying to get Abaqus running over an Omnipath fabric.

I am using Abaqus version 6.14-1 with Intel MPI 5.1.2.

In my abaqus_v.env file I set mp_mpirun_options to: -v -genv I_MPI_FABRICS shm:tmi
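Abaqus environment files use Python syntax, so the setting described above would normally be a plain string assignment. The following is a minimal sketch that only restates the option string quoted in this post; whether these flags are appropriate for a particular Intel MPI release is not confirmed here.

```python
# Sketch of the relevant entry in the Abaqus environment file (Python syntax).
# The option string is copied from the post above; no other settings are implied.
mp_mpirun_options = '-v -genv I_MPI_FABRICS shm:tmi'
```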

By the way, the -PSM2 argument is not accepted.

I cannot cut and paste the output here (Argggh!) so have attached a rather long output file.

I do not know where the wheels are coming off here. Pun intended as this is the e1 car crash simulation.

I got lots of messages about PMI buffer overruns, but I am not sure that is the root of the problem.

 

 

11 Replies
Dmitry_S_Intel
Moderator
593 Views

Hi,

The -PSM2 flag was introduced in Intel(R) MPI Library 5.1.3. We have since released 2017 Update 2.

Can you test the issue with the latest released version of Intel(R) MPI Library?

 

--

Dmitry

John_H_19
Beginner
593 Views

Dmitry, thank you.

I had realised this, and Intel MPI has now been updated to version 2017.1.1.

I am getting these messages now. I can set I_MPI_DEBUG=2 if that is helpful.

 

Warning: string_to_uuid_array: wrong uuid format: 0)<8D>

Warning: string_to_uuid_array: correct uuid format is: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

Warning: string_to_uuid_array: wrong uuid format: 0)<8D>

Warning: string_to_uuid_array: correct uuid format is: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

gm-hpc-40.4380Driver initialization failure on /dev/hfi1 (err=23)

gm-hpc-41.54627hfi_userinit: mmap of rcvhdrq at dabbad0004030000 failed: Resource temporarily unavailable

Warning: string_to_uuid_array: wrong uuid format: 0)<8D>

 

 

Intel MPI variables are:

 

 

Hydra internal environment:

  ---------------------------

    MPIR_CVAR_NEMESIS_ENABLE_CKPOINT=1

    GFORTRAN_UNBUFFERED_PRECONNECTED=y

    I_MPI_HYDRA_UUID=e7c00000-6c6a-3da2-a64b-0500c927ac10

    DAPL_NETWORK_PROCESS_NUM=80

 

  User set environment:

  ---------------------

    I_MPI_FABRICS_LIST=tmi,dapl,tcp,ofa

    I_MPI_TMI_PROVIDER=psm2

    I_MPI_DEBUG=0

 

  Intel(R) MPI Library specific variables:

  ----------------------------------------

    I_MPI_PERHOST=allcores

    I_MPI_COMPATIBILITY=4

    I_MPI_ROOT=/cm/shared/apps/intel/compilers_and_libraries/2017.1.132/mpi

    I_MPI_HYDRA_UUID=e7c00000-6c6a-3da2-a64b-0500c927ac10

    I_MPI_FABRICS_LIST=tmi,dapl,tcp,ofa

    I_MPI_TMI_PROVIDER=psm2

    I_MPI_DEBUG=0

 

 

 

 

 

Dmitry_S_Intel
Moderator
593 Views

Please try to set

export I_MPI_HYDRA_UUID=00000000-0000-0000-0001-000000000001

manually.
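Before launching, it can help to confirm that the value actually exported matches the layout named in the warning. The sketch below assumes that each "x" in xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx stands for one hexadecimal digit; the helper name looks_like_uuid is hypothetical and is not part of Intel MPI.

```python
import re

# 8-4-4-4-12 layout from the warning text; hexadecimal digits are an assumption.
UUID_PATTERN = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\Z"
)


def looks_like_uuid(value):
    """Return True if value matches xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx."""
    return bool(UUID_PATTERN.match(value))


if __name__ == "__main__":
    import os

    # Report what the current environment holds for I_MPI_HYDRA_UUID.
    value = os.environ.get("I_MPI_HYDRA_UUID", "")
    print("I_MPI_HYDRA_UUID={!r} valid={}".format(value, looks_like_uuid(value)))
```

A well-formed value in the environment does not rule out the warning, since the string the library complains about may come from somewhere else.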

Also, please follow the steps in section 13.3.5, "Intel® Omni-Path HFI Initialization Failure", of the "Intel® Omni-Path Fabric Host Software User Guide":

$ lsmod | grep hfi

$ hfi1_control -iv

--

Dmitry

John_H_19
Beginner
593 Views

Thanks Dmitry. hfi1_control returns the following
(the -v flag does not work on this version):
 
$ hfi1_control -i
Driver Version: 0.9-248
Driver SrcVersion: 69BFC1AA6C06ED185D6CDF9
Opa Version: 10.1.0.0.145
0: BoardId: Board ID 0x1
0: Version: ChipABI 3.0, Board ID 0x1, ChipRev 7.17, SW Compat 3
0: ChipSerial: 0x007beb7b
0,1: Status: 5: LinkUp 4: ACTIVE
0,1: LID=0x1 GUID=0011:7501:017b:eb7b
John_H_19
Beginner
593 Views

Dmitry, after setting I_MPI_HYDRA_UUID I am still getting these errors.

The first error seems to come from hfi_userinit, about rcvhdrq.
 
 
 
 
 
[proxy:0:2@gm-hpc-39] forwarding command (cmd=barrier_in) upstream
[proxy:0:3@gm-hpc-40] we don't understand the response get_result; forwarding downstream
[mpiexec@gm-hpc-37] [pgid: 0] got PMI command: cmd=barrier_in
gm-hpc-38.94204hfi_userinit: mmap of rcvhdrq at dabbad0004030000 failed: Resource temporarily unavailable
Warning: string_to_uuid_array: wrong uuid format: @)<8D>
Warning: string_to_uuid_array: correct uuid format is: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
Warning: string_to_uuid_array: wrong uuid format: @)<8D>
Warning: string_to_uuid_array: correct uuid format is: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
gm-hpc-38.94204Driver initialization failure on /dev/hfi1 (err=23)
gm-hpc-40.48694hfi_userinit: mmap of rcvhdrq at dabbad0004030000 failed: Resource temporarily unavailable
John_H_19
Beginner
593 Views

By the way, as this job is running under Slurm, I am aware that the max locked memory needs to be set to unlimited, and I have implemented the slurm.conf configuration recommended here:

https://bugs.schedmd.com/show_bug.cgi?id=3363

If I run a test job and look at the limits, as in that thread, the locked memory is unlimited.

Could it be that the Abaqus launch script is somehow not inheriting this limit?
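One way to test that suspicion is to print the locked-memory limit from a process started the same way the solver is started, rather than from an interactive shell. The sketch below uses Python's standard resource module; where it is hooked into the Abaqus launch path (for example, run as a separate step inside the same Slurm job) is left as an assumption.

```python
import resource


def format_limit(value):
    """Render an rlimit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def memlock_limits():
    """Return the (soft, hard) locked-memory limits of this process as strings."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return format_limit(soft), format_limit(hard)


if __name__ == "__main__":
    soft, hard = memlock_limits()
    # Values other than 'unlimited' are in bytes.
    print("RLIMIT_MEMLOCK soft={} hard={}".format(soft, hard))
```

If this prints a finite value inside the job while a plain test job reports unlimited, the limit is being lost somewhere between Slurm and the launched process.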

 

John_H_19
Beginner
593 Views

Dmitry,

this is definitely looking like an issue with the locked memory limit. I see the same thing as these guys:

https://wiki.fysik.dtu.dk/niflheim/OmniPath#memory-limits

I am confused, though: I applied the recommended Slurm configurations for the memory limits.

I guess that these limits are somehow not being inherited properly by the Abaqus launch script.

John_H_19
Beginner
593 Views

I should say, though, that I am NOT seeing the syslog messages which the Niflheim people say should be there.

So this could be another issue, sorry.

Dmitry_S_Intel
Moderator
593 Views

Regarding "Warning: string_to_uuid_array: wrong uuid format: 0)<8D>": I've filed a ticket with the Intel MPI team.

--

Dmitry

John_H_19
Beginner
593 Views

Something else to report here - please read this fully.  I have access to two clusters with Omnipath.

I run a hello world MPI program, using two hosts, 1 process per host.

On Cluster A it runs fine.

[johnh@comp001 ~]$  hfi1_control -i
Driver Version: 0.9-294
Driver SrcVersion: 243C292B037A8EAA8075FD6
Opa Version: 10.3.0.0.81
0: BoardId: Intel Omni-Path Host Fabric Interface Adapter 100 Series
0: Version: ChipABI 3.0, ChipRev 7.17, SW Compat 3
0: ChipSerial: 0x007af039
0,1: Status: 5: LinkUp 4: ACTIVE
0,1: LID=0x4 GUID=0011:7501:017a:f039
 
The hfi devices are:
crw-rw-rw- 1 root root 241,   0 Mar 20 15:04 /dev/hfi1_0
crw------- 1 root root 241, 128 Mar 20 15:04 /dev/hfi1_diagpkt
crw------- 1 root root 241, 200 Mar 20 15:04 /dev/hfi1_diagpkt0

 

On Cluster B the error is printed (although the Hello World actually completes):

gm-hpc-38.106578hfi_userinit: mmap of rcvhdrq at dabbad0004030000 failed: Resource temporarily unavailable
gm-hpc-38.106578Driver initialization failure on /dev/hfi1 (err=23)

 

[root@gm-hpc-37 ~]# hfi1_control -i
Driver Version: 0.11-160
Driver SrcVersion: 572A6C0CADDB780054E5FB8
Opa Version: 10.1.0.0.145
0: BoardId: Intel Omni-Path Host Fabric Interface Adapter 100 Series
0: Version: ChipABI 3.0, ChipRev 7.17, SW Compat 3
0: ChipSerial: 0x0078f124
0,1: Status: 5: LinkUp 4: ACTIVE
0,1: LID=0x25 GUID=0011:7501:0178:f124

 

hfi devices are:

crw-rw-rw- 1 root root 246,   0 Feb 21 15:08 /dev/hfi1
crw-rw-rw- 1 root root 246,   1 Feb 21 15:08 /dev/hfi1_0
crw------- 1 root root 246, 128 Feb 21 15:08 /dev/hfi1_diagpkt
crw------- 1 root root 246, 200 Feb 21 15:08 /dev/hfi1_diagpkt0
crw------- 1 root root 246, 192 Feb 21 15:08 /dev/hfi1_ui0
 
 
Errr... some extra hfi devices there...
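To make the difference between the two clusters easier to see, the header lines of the two hfi1_control -i outputs above can be compared mechanically. This is only a sketch: the function names are hypothetical, and the sample strings are copied from the outputs quoted in this post.

```python
def parse_hfi1_control(text):
    """Collect the version header lines printed by hfi1_control -i."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep and key in ("Driver Version", "Driver SrcVersion", "Opa Version"):
            info[key] = value.strip()
    return info


def differing_fields(a, b):
    """Return the sorted keys whose values differ between two parsed outputs."""
    return sorted(k for k in set(a) | set(b) if a.get(k) != b.get(k))


# Header lines copied from the Cluster A and Cluster B outputs quoted above.
CLUSTER_A = """Driver Version: 0.9-294
Driver SrcVersion: 243C292B037A8EAA8075FD6
Opa Version: 10.3.0.0.81"""

CLUSTER_B = """Driver Version: 0.11-160
Driver SrcVersion: 572A6C0CADDB780054E5FB8
Opa Version: 10.1.0.0.145"""

if __name__ == "__main__":
    a = parse_hfi1_control(CLUSTER_A)
    b = parse_hfi1_control(CLUSTER_B)
    for key in differing_fields(a, b):
        print("{}: A={} B={}".format(key, a.get(key), b.get(key)))
```

All three fields differ here, so the driver and the Opa software on Cluster B do not come from the same release as those on Cluster A.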
Matthew_M_5
Beginner
593 Views

FYI: the warning below indicates that the wrong shared object libraries are being loaded, ones that do not match the version of Intel MPI you are using:

Warning: string_to_uuid_array: wrong uuid format: 

Warning: string_to_uuid_array: correct uuid format is: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

I just went through this on an omnipath installation and had to remove libmpi.so.12 and libmpi_mt.so.12 from the Abaqus code/bin directories and force LD_LIBRARY_PATH to include the intel64/lib directory for the appropriate Intel MPI release.
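The check described above can be sketched as two small helpers: one lists any libmpi shared objects sitting in a given directory, the other tests whether a directory appears in an LD_LIBRARY_PATH string. The function names are hypothetical, and the directories to pass in (the Abaqus code/bin directory and the Intel MPI intel64/lib directory) depend on the local installation.

```python
import glob
import os


def find_bundled_mpi_libs(directory):
    """Return sorted paths of libmpi*.so* files found directly in directory."""
    return sorted(glob.glob(os.path.join(directory, "libmpi*.so*")))


def ld_path_has(directory, ld_library_path):
    """Return True if directory is one of the entries of an LD_LIBRARY_PATH string."""
    wanted = os.path.normpath(directory)
    entries = [entry for entry in ld_library_path.split(os.pathsep) if entry]
    return any(os.path.normpath(entry) == wanted for entry in entries)
```

Any libmpi.so.12 or libmpi_mt.so.12 reported for the Abaqus directory is a candidate for the mismatch, since the dynamic loader may pick it up ahead of the copy that belongs to the installed Intel MPI release.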

 
