Intel® DevCloud
Help for those needing help starting or connecting to the Intel® DevCloud

Regarding multi node memory issue

Misra
Novice
1,735 Views

Hello,

 

I am trying to run an ML workload across multiple nodes, but the memory/RAM provided seems to be insufficient, which causes out-of-memory errors. Is there any way to increase the memory available to a job?

Also, how many nodes can we run our ML workloads on? Is it possible to go beyond 2 nodes?

Thank you!

 

Regards,

Manjari

0 Kudos
15 Replies
risan-raja
Novice
1,719 Views

Hi,

How are you distributing your workload? Are you using MPI/OpenMP? If so, there is an example in the Advanced Queue Management section of the documentation.

 

I have also used the Dask Jobqueue library, which works as well. Pay attention to the specific node requirements. For example, to request more than 2 nodes:

 

Select 4 nodes from the output of this command:

pbsnodes | grep "properties =" | awk '{print $3}' | sort | uniq -c
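If it is easier to work with in Python, the same tally can be reproduced from captured pbsnodes output. A minimal stdlib sketch follows; the property strings below are illustrative placeholders, not actual DevCloud values:

```python
from collections import Counter

# Hypothetical sample of `pbsnodes` output lines (real output varies).
sample = """\
     properties = core,cfl,i9-10920x,ram32gb,netgbe,gen9,gpu
     properties = core,cfl,i9-10920x,ram32gb,netgbe,gen9,gpu
     properties = xeon,cfl,e-2176g,ram64gb,net1gbe,gpu
"""

# Equivalent of: grep "properties =" | awk '{print $3}' | sort | uniq -c
counts = Counter(
    line.split()[2]                 # third field, like awk '{print $3}'
    for line in sample.splitlines()
    if "properties =" in line
)
for props, n in counts.items():
    print(n, props)
```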

Then, at the top of your job submission file, add something like this:

#PBS -l nodes=sXXX-nXXX:ppn=2+sXXX-nXXX:ppn=2+sXXX-nXXX:ppn=2+sXXX-nXXX:ppn=2,walltime=HH:MM:SS
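As a side note, Torque-style host lists are typically joined with + rather than commas. A small Python sketch that assembles such a resource string (the hostnames are placeholders in the sXXX-nXXX pattern, not real DevCloud nodes):

```python
def nodes_resource(hostnames, ppn=2):
    """Join explicit hostnames into a Torque-style nodes= specification."""
    return "+".join(f"{h}:ppn={ppn}" for h in hostnames)

# Placeholder hostnames; substitute names taken from your pbsnodes output.
spec = nodes_resource(["s001-n001", "s001-n002", "s001-n003", "s001-n004"])
print(f"#PBS -l nodes={spec},walltime=01:00:00")
```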

As stated, your code will only run on one node, and the application code has to call the other nodes.

For example:

1 - Head (compute) node

2, 3, 4 - Worker nodes
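Putting the steps above together, a submission file might look roughly like the sketch below. This is a hypothetical outline, not a tested DevCloud script: the node names are placeholders, it assumes the Dask distributed package provides dask-scheduler/dask-worker on every node, and my_training_script.py is a stand-in for your own application, which would connect via Client(scheduler_file='scheduler.json').

```shell
#!/bin/bash
#PBS -l nodes=s001-n001:ppn=2+s001-n002:ppn=2+s001-n003:ppn=2+s001-n004:ppn=2,walltime=01:00:00
cd "$PBS_O_WORKDIR"

# Start the Dask scheduler on the first (head) node.
dask-scheduler --scheduler-file scheduler.json &
sleep 5

# Start one worker on each remaining node listed in $PBS_NODEFILE.
for host in $(sort -u "$PBS_NODEFILE" | tail -n +2); do
    ssh "$host" "cd $PWD && dask-worker --scheduler-file scheduler.json" &
done

# The application connects via Client(scheduler_file='scheduler.json').
python my_training_script.py
wait
```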

 

Read the documentation carefully; it is pretty much self-explanatory.

0 Kudos
Misra
Novice
1,651 Views

Hi, 

 

Thank you so much for this answer. I am not using MPI.

I am trying to look into the Dask Jobqueue library but am having trouble understanding it. I have a Python script with the ML model that produces the results. Could you tell me more about how the application code calls the other nodes when using the Dask Jobqueue library?

Thanks

 

Regards,

Manjari 

0 Kudos
Rahila_T_Intel
Moderator
1,707 Views

Hi,


Good day to you.


Thank you for posting in Intel Communities.


Could you kindly let us know which Intel DevCloud you're using? Is it Intel DevCloud for oneAPI or Intel DevCloud for Edge or Intel DevCloud for FPGA?



Thanks 


0 Kudos
Misra
Novice
1,690 Views

Hi,

 

Thank you for your reply. I am using DevCloud for oneAPI. 

 

0 Kudos
Rahila_T_Intel
Moderator
1,680 Views

Hi,


In most cases, a memory error is caused by trying to run compute-intensive tasks on the login node. In such cases, log in to a compute node using qsub -I and execute your commands there.
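For example, the following requests an interactive session on a compute node; the exact flags and node properties vary, so check the available property names with pbsnodes first:

```shell
# Open an interactive shell on one compute node for two hours.
qsub -I -l nodes=1:ppn=2 -l walltime=02:00:00
```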


Regarding "how many nodes can we run our ML workloads on? Is it possible to go beyond nodes>2"


We are working to optimize our queue policies to balance the needs of all users. Currently, every user may run two jobs at a time: either one job running on two nodes simultaneously, or two jobs each running on one node.


Hope this will clarify your doubts.


Thanks


0 Kudos
Misra
Novice
1,535 Views

Hi,

 

How can we distribute XGBoost training across multiple nodes on GPU using oneAPI libraries? Is it possible? If so, are there references or notebooks we can use? Kindly let us know.

 

Thanks!

 

Regards,

Manjari 

0 Kudos
Rahila_T_Intel
Moderator
1,605 Views

Hi,


We are checking on this internally and will share updates with you.


Thanks


0 Kudos
Rahila_T_Intel
Moderator
1,592 Views

Hi,


We will connect with you through private email/message to get your DevCloud user ID.


Thanks


0 Kudos
carolnavya
Beginner
1,570 Views
0 Kudos
Misra
Novice
1,563 Views

Hi,

 

I have not received any email for my DevCloud user ID. Kindly let me know how I can share it.

 

Regards,

Manjari

0 Kudos
Misra
Novice
1,521 Views

Hi,

 

I am getting the email about the private message, but I am not able to open the messenger in DevCloud. Is there an email address where I can share my user ID?

 

 

Also, beyond the DevCloud account access issue, I have a follow-up question.

>> How can we distribute XGBoost training using oneAPI libraries on GPU for multi-node? Is it possible? If so, are there references or notebooks we can use? Kindly let us know. 

 

Thanks!

 

Regards,

Manjari

 

0 Kudos
Rahila_T_Intel
Moderator
1,491 Views

Hi,

 

Thanks for sharing your user ID. We are working on this internally and will share updates with you.

 

Thanks

 

0 Kudos
Rahila_T_Intel
Moderator
1,463 Views

Hi,


For a reference on XGBoost Optimized for Intel® Architecture, please check the link below.

https://www.intel.com/content/www/us/en/developer/articles/technical/xgboost-optimized-architecture-getting-started.html


Since XGBoost GPU support is still in development and is available in the AI Kit as more of an experimental feature, there are not many resources for it yet. As GPU support becomes ready for customer adoption, expect many more resources, which are already being planned.


For now, you can review the contents of the following repo for the XGBoost GPU-support plugin (which has been upstreamed to XGBoost in Intel Python and the AI Kit) as development continues:

https://github.com/vepifanov/xgboost/blob/dpcpp_backend_1.3.3/plugin/updater_oneapi/README.md
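In the meantime, multi-node XGBoost training on CPU is possible today through Dask. The sketch below uses the public xgboost.dask module with a LocalCluster as a stand-in for a real multi-node cluster (on DevCloud you would instead point the Client at a scheduler spanning several nodes). It assumes xgboost and dask.distributed are installed, and it is an illustration only, not an official Intel recipe:

```python
import dask.array as da
import xgboost as xgb
from dask.distributed import Client, LocalCluster

if __name__ == "__main__":
    # Stand-in for a multi-node cluster; in a real deployment, point the
    # Client at a scheduler whose workers run on separate nodes.
    cluster = LocalCluster(n_workers=2, threads_per_worker=1)
    client = Client(cluster)

    # Toy data split into chunks that Dask distributes across workers.
    X = da.random.random((10_000, 20), chunks=(2_500, 20))
    y = da.random.randint(0, 2, size=(10_000,), chunks=(2_500,))

    # Distributed training via the xgboost.dask API.
    dtrain = xgb.dask.DaskDMatrix(client, X, y)
    output = xgb.dask.train(
        client,
        {"objective": "binary:logistic", "tree_method": "hist"},
        dtrain,
        num_boost_round=20,
    )
    booster = output["booster"]  # trained model
```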



If you have any more queries related to XGBoost, please post your question in the forum below.

https://community.intel.com/t5/Intel-Optimized-AI-Frameworks/bd-p/optimized-ai-frameworks


If this resolves your issue, please accept it as a solution, as it might help others with a similar issue.


Please let us know if we can go ahead and close this case.


Thanks


0 Kudos
Rahila_T_Intel
Moderator
1,355 Views

Hi,


We have not heard back from you. 

Please let me know if we can go ahead and close this case.


Thanks


0 Kudos
Rahila_T_Intel
Moderator
1,327 Views

Hi,


We have not heard back from you. This thread will no longer be monitored by Intel. If you need further assistance, please post a new question.


Thanks


0 Kudos