Intel® DevCloud
Help for those needing help starting or connecting to the Intel® DevCloud

MPI / Distributed Computing / Terminal

s_n
Beginner
774 Views

Hi,

A.
I am trying to run a distributed computation on two nodes, both with MPI from Julia and with Julia's standalone distributed capabilities.

On a single node, I can run calculations on all 24 workers in both interactive and batch modes without any problems in pure Julia with

using Distributed
addprocs(24)
@everywhere ...

which is basic Julia functionality, as described at [https://docs.julialang.org/en/v1/manual/distributed-computing/].

I can also run basic MPI programs with the MPI.jl package [https://github.com/pressel/MPI.jl]:

using MPI
MPI.Init()
println("Hi from $(MPI.Comm_rank(MPI.COMM_WORLD))!")
flush(stdout)

run with:

mpirun -np 24 julia hello_world.jl

However, I am not able to correctly add all the workers (24 per node, 48 in total) across two nodes. In theory I should be able to:

- add workers using a machine file when starting Julia

julia --machine-file=$PBS_NODEFILE

or

- use MPIClusterManagers.jl [https://github.com/JuliaParallel/MPIClusterManagers.jl]

using Distributed
using MPIClusterManagers
# Specify the number of MPI workers
manager = MPIManager(np=48)
# Start the MPI workers and add them as Julia workers too
addprocs(manager)
@everywhere import MPI
sleep(60.000) # give the workers time to start
# Set up the worker environments
@everywhere using PackageName
# Solve with
@mpi_do manager begin
using MPI
experiment = Examples.experiments["name_of_experiment"]
session = Session(experiment, dir="/home/uxxxxx/data/xxxxx/xxxxx.jl/mytrainings/sessions/name_of_experiment")
resume!(session)
end
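
For the machine-file route above, one detail that may matter: Julia's machine file accepts lines of the form count*host, and a line without a count starts a single worker on that host. The sketch below writes one count*host line per unique node. The function name make_machinefile is hypothetical, and the sketch assumes that $PBS_NODEFILE may list a host more than once; it is an illustration, not a verified fix for the errors below.

```shell
#!/bin/sh
# Sketch only: make_machinefile is a hypothetical helper, not part of Julia or PBS.
# Usage: make_machinefile NODEFILE WORKERS_PER_NODE
# Prints one "count*host" line per unique host found in NODEFILE.
make_machinefile() {
    nodefile=$1
    per_node=$2
    # A PBS node file may repeat a host once per slot, so de-duplicate first.
    sort -u "$nodefile" | while IFS= read -r host; do
        # Skip blank lines, if any.
        if [ -n "$host" ]; then
            printf '%s*%s\n' "$per_node" "$host"
        fi
    done
}
```

It could be used as make_machinefile "$PBS_NODEFILE" 24 > machines.txt (machines.txt is a hypothetical file name), followed by julia --machine-file=machines.txt. Julia starts the remote workers over SSH, so this still relies on passwordless SSH between the nodes, which is an assumption here.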

I have made many attempts with different combinations described on various discussion lists, but I am unable to correctly launch workers on two nodes. The most common errors I see are as follows:

ERROR: TaskFailedException

nested task error: Unable to read host:port string from worker. Launch command exited with error?
[...]
caused by: Unable to read host:port string from worker. Launch command exited with error?

and/or

sh: 7: /etc/profile.d/add-local-path.sh: Syntax error: redirection unexpected
sh: 7: /etc/profile.d/add-local-path.sh: Syntax error: redirection unexpected

or

[mpiexec@s001-n047] HYD_hostfile_parse (../../../../../src/pm/i_hydra/libhydra/hostfile/hydra_hostfile.c:69): unable to open host file: -n
[mpiexec@s001-n047] mfile_fn (../../../../../src/pm/i_hydra/mpiexec/mpiexec_params.c:489): error parsing hostfile
[mpiexec@s001-n047] match_arg (../../../../../src/pm/i_hydra/libhydra/arg/hydra_arg.c:83): match handler returned error
[mpiexec@s001-n047] HYD_arg_parse_array (../../../../../src/pm/i_hydra/libhydra/arg/hydra_arg.c:128): argument matching returned error
[mpiexec@s001-n047] mpiexec_get_parameters (../../../../../src/pm/i_hydra/mpiexec/mpiexec_params.c:1356): error parsing input array
[mpiexec@s001-n047] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1749): error parsing parameters

or

IOError: could not spawn setenv(`/home/uxxxxx/packages/julias/julia-1.6.1/bin/julia -Cnative -J/home/u77446/packages/julias/julia-1.6.1/lib/julia/sys.so -g1 --bind-to 127.0.0.1 --worker`; dir="/home/uxxxxx/data/xxxxx/xxxxx/mytrainings"): resource temporarily unavailable (EAGAIN)

I would therefore like to ask whether you could offer any guidance on this topic. I would really appreciate any information.

B.
Also, while I have the opportunity, I would like to ask about gnome-terminal on the renderkit machine. It used to work fine, but recently it no longer starts. I traced the error:

Error constructing proxy for org.gnome.Terminal:/org/gnome/Terminal/Factory0: Error calling StartServiceByName for org.gnome.Terminal: GDBus.Error:org.freedesktop.DBus.Error.Spawn.ChildExited: Process org.gnome.Terminal exited with status 8

I guess that the error is associated with a broken locale. The machine also asks for a reboot.

To get the terminal working I have to execute LC_ALL=en_US.UTF-8 /usr/bin/dbus-launch gnome-terminal in XTerm.

Would it be possible to receive some guidance on this topic?

Best regards,
SZ

0 Kudos
8 Replies
RahulU_Intel
Moderator
749 Views

Hi,


Thanks for posting in Intel forums. We are checking on this issue from our side. We will get back to you.


Thanks

Rahul


s_n
Beginner
737 Views

Thanks Rahul. One piece of information that may be relevant: I understand that, unlike MPI, Julia is not natively supported here. However, what I am currently trying to do is adapt a fairly extensive machine / reinforcement learning model (package) so that it can use oneAPI and the corresponding Intel software and hardware technologies. As I understand it, this kind of activity is in line with DevCloud's policy, hence these questions. SZ

ShivaniK_Intel
Moderator
692 Views

Hi,


We are working on it and will get back to you soon.


Thanks & Regards

Shivani


s_n
Beginner
674 Views

Hi!

Thanks. I would really appreciate any information on this topic, especially about distributed computing. Even a very general, preliminary indication of whether this is doable at all would be useful.

Regards,
SZ

clevels
Employee
660 Views

Hello - thank you for your message. I will begin an investigation into this.


s_n
Beginner
650 Views

Hello. Thank you. If you have any additional questions, or if I can be of any help, please let me know.

clevels
Employee
605 Views

Hello - For part B of this issue, it would be best to post a thread in the Rendering Toolkit forum at

https://community.intel.com/t5/Intel-oneAPI-Rendering-Toolkit/bd-p/oneapi-rendering-toolkit


For part A - compile one of the test programs in $I_MPI_ROOT/test (without Julia) and run it with the same job layout you use with Julia. If that fails, please repeat it with I_MPI_DEBUG=16 and attach the log.
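
A rough sketch of those steps, to be run inside the two-node PBS job. Several things here are assumptions rather than details from this thread: the test source is taken to be test.c, the compiler wrapper to be mpiicc, and the names run_mpi_check, mpi_test and mpi_debug.log are hypothetical. The function refuses to run when $PBS_NODEFILE is empty, so mpirun is never handed an empty host file argument.

```shell
#!/bin/sh
# Sketch only: run_mpi_check is a hypothetical helper.
# Usage: run_mpi_check NPROCS
# Builds the bundled C test program and runs it across the job's nodes.
# If the run fails, it is repeated with I_MPI_DEBUG=16 and logged.
run_mpi_check() {
    nprocs=$1
    if [ -z "${I_MPI_ROOT:-}" ]; then
        echo "I_MPI_ROOT is not set; load the Intel MPI environment first" >&2
        return 1
    fi
    if [ -z "${PBS_NODEFILE:-}" ]; then
        echo "PBS_NODEFILE is not set; run this inside a PBS job" >&2
        return 1
    fi
    # Assumption: test.c is one of the sources shipped in $I_MPI_ROOT/test.
    mpiicc "$I_MPI_ROOT/test/test.c" -o ./mpi_test || return 1
    # First attempt: same process count and host list as the Julia job.
    if mpirun -n "$nprocs" -machinefile "$PBS_NODEFILE" ./mpi_test; then
        return 0
    fi
    # Second attempt: verbose debug output captured for the forum post.
    I_MPI_DEBUG=16 mpirun -n "$nprocs" -machinefile "$PBS_NODEFILE" ./mpi_test \
        > mpi_debug.log 2>&1
    return 1
}
```

For the layout discussed here this would be called as run_mpi_check 48; if it returns non-zero after the second attempt, mpi_debug.log is the file to attach.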




s_n
Beginner
582 Views

Hello,

Re: part B:

Thank you. I will.

Re: part A:

Thank you. Sure, I will compile the test programs and then try the tests with Julia. Please note that I may spend less time on DevCloud than usual this week, but I will reply soon.

Best regards,
SZ
