MPI hangs with multiple processes on same node - Intel AI DevCloud

Melo__Luckeciano · 07-29-2019

Hello,

I am using Intel AI DevCloud to train a Deep Reinforcement Learning model, with mpi4py running several agents that collect data in parallel.

In my framework, I run N jobs (on different nodes) with agents, plus one more job that runs the optimization algorithm in Python.

When I run a single agent in each job, the application works correctly. However, when I try to run more than one agent in the same job (same node), the application hangs.

I do not think the problem is the application itself, because it works with a single agent per node. Additionally, the same application worked with multiple agents per node last year, when Intel DevCloud was running CentOS.

The application code is here: https://github.com/alexandremuzio/baselines/blob/71977df495e7840179dd05ed561a1f0e15a3b5d2/baselines/ppo1/run_soccer.py

I am probably not providing enough information about the problem, so if you need any details about the MPI setup, I can post them in this thread.
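
To help separate MPI from my application, a minimal check along these lines could be run with several processes on one node. This is only a sketch, not part of the linked code: it assumes mpi4py is importable on the node, and the helper name `ranks_per_node` is something I made up for illustration. If even this hangs, the problem is probably below the application layer.

```python
import socket
from collections import defaultdict


def ranks_per_node(hostnames):
    """Group rank indices by the hostname each rank reported.

    hostnames[i] is the hostname reported by rank i.
    """
    groups = defaultdict(list)
    for rank, host in enumerate(hostnames):
        groups[host].append(rank)
    return dict(groups)


def main():
    # Imported here so the helper above can be used without MPI installed.
    try:
        from mpi4py import MPI
    except ImportError:
        print("mpi4py is not available; skipping the MPI check")
        return

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # allgather and Barrier are collectives: every rank must reach them.
    # If they never return with several ranks on one node, the hang is
    # likely not specific to the training code.
    hostnames = comm.allgather(socket.gethostname())
    comm.Barrier()

    if rank == 0:
        for host, ranks in sorted(ranks_per_node(hostnames).items()):
            print(f"{host}: ranks {ranks}", flush=True)


if __name__ == "__main__":
    main()
```

Saved under a hypothetical name such as check_ranks.py, it could be launched inside a single job with something like `mpirun -n 2 python check_ranks.py`. The exact launcher and options depend on the MPI installation on the node.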