Intel® DevCloud
Help for those needing help starting or connecting to the Intel® DevCloud

Regarding multi node memory issue

Misra
Novice
1,735 Views

Hello,

 

I am trying to run an ML workload across multiple nodes, but the memory/RAM provided seems to be insufficient, which causes out-of-memory errors. Is there any way to increase the memory available to a job?

Also, how many nodes can we run our ML workloads on? Is it possible to go beyond 2 nodes?

Thank you!

 

Regards,

Manjari

0 Kudos
15 Replies
risan-raja
Novice
1,719 Views

Hi,

How are you distributing your workload? Are you using MPI/OpenMP? If so, there is an example in the Advanced Queue Management section of the documentation.

 

I have also used the Dask Jobqueue library, which works as well. Pay attention to the specific node requirements. For example, to request more than 2 nodes:

 

Select 4 nodes from the output of this command:

pbsnodes | grep "properties =" | awk '{print $3}' | sort | uniq -c
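If it is easier to work with in Python, the same tally can be reproduced from captured pbsnodes output. A minimal stdlib sketch follows; the property strings below are illustrative placeholders, not actual DevCloud values:

```python
from collections import Counter

# Hypothetical sample of `pbsnodes` output lines (real output varies).
sample = """\
     properties = core,cfl,i9-10920x,ram32gb,netgbe,gen9,gpu
     properties = core,cfl,i9-10920x,ram32gb,netgbe,gen9,gpu
     properties = xeon,cfl,e-2176g,ram64gb,net1gbe,gpu
"""

# Equivalent of: grep "properties =" | awk '{print $3}' | sort | uniq -c
counts = Counter(
    line.split()[2]                 # third field, like awk '{print $3}'
    for line in sample.splitlines()
    if "properties =" in line
)
for props, n in counts.items():
    print(n, props)
```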

Then, at the top of your job submission file, add something like this:

#PBS -l nodes=sXXX-nXXX:ppn=2+sXXX-nXXX:ppn=2+sXXX-nXXX:ppn=2+sXXX-nXXX:ppn=2,walltime=HH:MM:SS
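As a side note, Torque-style host lists are typically joined with + rather than commas. A small Python sketch that assembles such a resource string (the hostnames are placeholders in the sXXX-nXXX pattern, not real DevCloud nodes):

```python
def nodes_resource(hostnames, ppn=2):
    """Join explicit hostnames into a Torque-style nodes= specification."""
    return "+".join(f"{h}:ppn={ppn}" for h in hostnames)

# Placeholder hostnames; substitute names taken from your pbsnodes output.
spec = nodes_resource(["s001-n001", "s001-n002", "s001-n003", "s001-n004"])
print(f"#PBS -l nodes={spec},walltime=01:00:00")
```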

As stated, your code will only run on one node, and the application code has to call the other nodes.

For example:

1 - Head (compute) node

2, 3, 4 - Worker nodes
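Putting the steps above together, a submission file might look roughly like the sketch below. This is a hypothetical outline, not a tested DevCloud script: the node names are placeholders, it assumes the Dask distributed package provides dask-scheduler/dask-worker on every node, and my_training_script.py is a stand-in for your own application, which would connect via Client(scheduler_file='scheduler.json').

```shell
#!/bin/bash
#PBS -l nodes=s001-n001:ppn=2+s001-n002:ppn=2+s001-n003:ppn=2+s001-n004:ppn=2,walltime=01:00:00
cd "$PBS_O_WORKDIR"

# Start the Dask scheduler on the first (head) node.
dask-scheduler --scheduler-file scheduler.json &
sleep 5

# Start one worker on each remaining node listed in $PBS_NODEFILE.
for host in $(sort -u "$PBS_NODEFILE" | tail -n +2); do
    ssh "$host" "cd $PWD && dask-worker --scheduler-file scheduler.json" &
done

# The application connects via Client(scheduler_file='scheduler.json').
python my_training_script.py
wait
```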

 

Read the documentation carefully; it is pretty much self-explanatory.

0 Kudos
Misra
Novice
1,651 Views

Hi, 

 

Thank you so much for this answer. I am not using MPI.

I am trying to look into the Dask Jobqueue library but am having trouble understanding it. I have a Python script with the ML model that produces the results. Could you tell me more about how the application code calls the other nodes when using the Dask Jobqueue library?

Thanks

 

Regards,

Manjari 

0 Kudos
Rahila_T_Intel
Moderator
1,707 Views

Hi,


Good day to you.


Thank you for posting in Intel Communities.


Could you kindly let us know which Intel DevCloud you're using? Is it Intel DevCloud for oneAPI or Intel DevCloud for Edge or Intel DevCloud for FPGA?



Thanks 


0 Kudos
Misra
Novice
1,690 Views

Hi,

 

Thank you for your reply. I am using DevCloud for oneAPI. 

 

0 Kudos
Rahila_T_Intel
Moderator
1,680 Views

Hi,


In most cases, a memory error is caused by trying to run compute-intensive tasks on the login node. In such cases, log in to a compute node using qsub -I and execute your commands there.
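For example, the following requests an interactive session on a compute node; the exact flags and node properties vary, so check the available property names with pbsnodes first:

```shell
# Open an interactive shell on one compute node for two hours.
qsub -I -l nodes=1:ppn=2 -l walltime=02:00:00
```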


Regarding "how many nodes can we run our ML workloads on? Is it possible to go beyond nodes>2"


We are working to optimize our queue policies to balance the needs of all users. Currently, every user may run two jobs at a time: either one job running on two nodes simultaneously, or two jobs each running on one node.


Hope this will clarify your doubts.


Thanks


0 Kudos
Misra
Novice
1,535 Views

Hi,

 

How can we distribute XGBoost training across multiple nodes on GPU using oneAPI libraries? Is it possible? If so, are there references or notebooks we can use? Kindly let us know.

 

Thanks!

 

Regards,

Manjari 

0 Kudos
Rahila_T_Intel
Moderator
1,605 Views

Hi,


We are checking on this internally and will share updates with you.


Thanks


0 Kudos
Rahila_T_Intel
Moderator
1,592 Views

Hi,


We will connect with you through private email/message to get your DevCloud user ID.


Thanks


0 Kudos
carolnavya
Beginner
1,570 Views
0 Kudos
Misra
Novice
1,563 Views

Hi,

 

I have not received any email for my DevCloud user ID. Kindly let me know how I can share it.

 

Regards,

Manjari

0 Kudos
Misra
Novice
1,521 Views

Hi,

 

I am getting the email about the private message, but I am not able to open the messenger in DevCloud. Is there an email address where I can share my user ID?

 

 

Also, beyond the DevCloud account access issue, I have a follow-up question.

>> How can we distribute XGBoost training using oneAPI libraries on GPU for multi-node? Is it possible? If so, are there references or notebooks we can use? Kindly let us know. 

 

Thanks!

 

Regards,

Manjari

 

0 Kudos
Rahila_T_Intel
Moderator
1,491 Views

Hi,

 

Thanks for sharing your user ID. We are working on this internally and will share updates with you.

 

Thanks

 

0 Kudos
Rahila_T_Intel
Moderator
1,463 Views

Hi,


For a reference on XGBoost Optimized for Intel® Architecture, please check the link below.

https://www.intel.com/content/www/us/en/developer/articles/technical/xgboost-optimized-architecture-getting-started.html


Since XGBoost GPU support is still in development and is available in the AI Kit as more of an experimental feature, there are not many resources for it yet. As GPU support becomes ready for customer adoption, expect many more resources, which are already being planned.


For now, you can review the contents of the following repo for the XGBoost GPU-support plugin (which has been upstreamed to XGBoost in Intel Python and the AI Kit) as development continues:

https://github.com/vepifanov/xgboost/blob/dpcpp_backend_1.3.3/plugin/updater_oneapi/README.md
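In the meantime, multi-node XGBoost training on CPU is possible today through Dask. The sketch below uses the public xgboost.dask module with a LocalCluster as a stand-in for a real multi-node cluster (on DevCloud you would instead point the Client at a scheduler spanning several nodes). It assumes xgboost and dask.distributed are installed, and it is an illustration only, not an official Intel recipe:

```python
import dask.array as da
import xgboost as xgb
from dask.distributed import Client, LocalCluster

if __name__ == "__main__":
    # Stand-in for a multi-node cluster; in a real deployment, point the
    # Client at a scheduler whose workers run on separate nodes.
    cluster = LocalCluster(n_workers=2, threads_per_worker=1)
    client = Client(cluster)

    # Toy data split into chunks that Dask distributes across workers.
    X = da.random.random((10_000, 20), chunks=(2_500, 20))
    y = da.random.randint(0, 2, size=(10_000,), chunks=(2_500,))

    # Distributed training via the xgboost.dask API.
    dtrain = xgb.dask.DaskDMatrix(client, X, y)
    output = xgb.dask.train(
        client,
        {"objective": "binary:logistic", "tree_method": "hist"},
        dtrain,
        num_boost_round=20,
    )
    booster = output["booster"]  # trained model
```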



If you have any more queries related to XGBoost, please post your question in the forum below.

https://community.intel.com/t5/Intel-Optimized-AI-Frameworks/bd-p/optimized-ai-frameworks


If this resolves your issue, please accept it as a solution, as it might help others with a similar issue.


Please let us know if we can go ahead and close this case.


Thanks


0 Kudos
Rahila_T_Intel
Moderator
1,355 Views

Hi,


We have not heard back from you. 

Please let me know if we can go ahead and close this case.


Thanks


0 Kudos
Rahila_T_Intel
Moderator
1,327 Views

Hi,


We have not heard back from you. This thread will no longer be monitored by Intel. If you need further assistance, please post a new question.


Thanks


0 Kudos