Intel® Optimized AI Frameworks

Regarding MPI for XGBoost multi node training

Misra
Novice

Hi,

 

I am training an XGBoost model on 2 nodes using MPI (mpi4py) for the distribution of workload. 

Following the documentation linked below,

https://devcloud.intel.com/oneapi/documentation/advanced-queue/

I listed the 2 nodes (the mother superior and the sister node) in hostfile.txt, taking them from the machine file (its path is in the environment variable $PBS_NODEFILE).

I then used the following command to run the code:

mpirun --hostfile hosts.txt python multi_node.py --N=1 

(N is a parameter in the code.)

Screenshot 2022-07-28 at 2.55.41 AM.png

(Also, when I use "mpirun -n 2 python script.py", where script.py is a minimal mpi4py program, it works fine. Should I be running my code some other way?)

 

Also, I have created a virtual environment that uses the Intel Modin toolkit libraries in oneAPI. How can I make sure the same environment is activated on the other node that the code will run on?

I am getting an error (screenshot attached) that I cannot understand or resolve. Please let me know what the issue is and how I can resolve it. Thank you!

 

Regards,

Manjari

JaideepK_Intel
Moderator

Hi,

 

Thank you for posting in Intel Communities.

Before running on multiple nodes, we need to request the number of nodes we want when starting the job.

 

qsub -I -l nodes=<number_of_nodes>:<property>:ppn=2 -d .

 

Example: qsub -I -l nodes=2:gpu:ppn=2 -d .

 

After logging in to the compute node, we need to get the names of the nodes that were allocated to us.

 

echo $PBS_NODEFILE (example output looks like this: /var/spool/torque/aux//1965007.v-qsvr-1.aidevcloud)

 

We then need to cat the file that $PBS_NODEFILE points to.

 

Example output of cat /var/spool/torque/aux//1965007.v-qsvr-1.aidevcloud:
s001-n141
s001-n141
s001-n157
s001-n157

 

Copy the node names from above and paste them into the host file (I pasted the above node names into host1).
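
The copy step can also be scripted. Below is a minimal sketch, assuming that $PBS_NODEFILE is set inside the job and that the file lists one node name per line, as in the example output above. The function name write_hostfile is hypothetical and chosen only for illustration; host1 is the host file name used in this reply.

```python
import os
from pathlib import Path


def write_hostfile(nodefile, hostfile):
    """Copy the node names from nodefile into hostfile.

    Returns the number of entries written, which can be passed to
    mpirun -n when one process per entry is wanted.
    """
    lines = Path(nodefile).read_text().splitlines()
    # Drop blank lines and surrounding whitespace; keep order and repeats.
    nodes = [line.strip() for line in lines if line.strip()]
    Path(hostfile).write_text("\n".join(nodes) + "\n")
    return len(nodes)


if __name__ == "__main__":
    # PBS_NODEFILE is only expected inside a job; do nothing elsewhere.
    nodefile = os.environ.get("PBS_NODEFILE")
    if nodefile:
        count = write_hostfile(nodefile, "host1")
        print(f"wrote {count} entries to host1")
```

With the four-line example above, the function writes four entries, which matches the -n 4 used in the mpirun command.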

After pasting the node names into the host file, we can run the mpirun command.

 

mpirun -n 4 -hostfile host1 python hello.py
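
To see where each process started by this command actually runs, and which Python interpreter it picks up (relevant to the virtual environment question in the original post), a small script that uses only the standard library can be launched in place of hello.py. This is a sketch under assumptions: the file name check_env.py and the function describe_process are hypothetical, and the script does not use mpi4py, so it reports nothing about MPI ranks.

```python
import os
import socket
import sys


def describe_process():
    """Collect where this process runs and which Python it uses."""
    return {
        "host": socket.gethostname(),
        "pid": os.getpid(),
        "python": sys.executable,
        # sys.prefix points at the virtual environment when one is active.
        "prefix": sys.prefix,
    }


if __name__ == "__main__":
    info = describe_process()
    print(
        f"{info['host']} pid={info['pid']} "
        f"python={info['python']} prefix={info['prefix']}"
    )
```

Saved as check_env.py, it could be run as mpirun -n 4 -hostfile host1 python check_env.py, and each process prints one line. A node that reports an unexpected interpreter path is probably not using the intended environment.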

 

JaideepK_Intel_0-1659418566505.png

 

If this resolves your issue, make sure to accept this as a solution. This would help others with a similar issue. Have a great day ahead.

Regards,

Jaideep

 

JaideepK_Intel
Moderator

Hi,


We assume that your issue is resolved. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.


Thanks,

Jaideep

