Hello,
I am trying to run an ML workload on a multi-node setup, but the memory (RAM) provided seems to be insufficient and the job runs into memory errors. Is there any way to increase the memory available to the job?
Also, how many nodes can we run our ML workloads on? Is it possible to go beyond nodes>2?
Thank you!
Regards,
Manjari
Hi,
How are you distributing your workload? Are you using MPI/OpenMP? If so, there is an example in the advanced queue management documentation.
I have also used the Dask Jobqueue library, which works as well. Pay attention to the specific node requirements. For example, to request more than 2 nodes,
select 4 nodes from the output of this command:
pbsnodes | grep "properties =" | awk '{print $3}' | sort | uniq -c
Then, at the top of your job submission file, add something like this:
#PBS -l nodes=sXXX-nXXX,sXXX-nXXX,sXXX-nXXX,sXXX-nXXX:ppn=2,walltime=HH:MM:SS
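The pbsnodes pipeline above counts how many nodes share each property string. As a rough illustration only, the same counting can be sketched in Python. The sample text below is made up (the node names and property strings are hypothetical, not real DevCloud output), and the function name is my own:

```python
from collections import Counter

# Hypothetical sample in the general shape of pbsnodes output.
SAMPLE = """\
node-a
     state = free
     properties = cpu,ram64gb
node-b
     state = free
     properties = cpu,ram64gb
node-c
     state = free
     properties = gpu,ram192gb
"""


def count_properties(text):
    """Count nodes per property string, like grep | awk '{print $3}' | sort | uniq -c."""
    counts = Counter()
    for line in text.splitlines():
        if "properties =" in line:
            fields = line.split()
            # The third whitespace-separated field is the property list.
            if len(fields) >= 3:
                counts[fields[2]] += 1
    return counts


for props, n in sorted(count_properties(SAMPLE).items()):
    print(n, props)
```

To use it on a real system, the text could come from `subprocess.run(["pbsnodes"], capture_output=True, text=True).stdout` instead of the sample.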
As stated in the documentation, your job script will only run on one node, and the application code has to call the other nodes.
For example:
Node 1 - compute node
Nodes 2, 3, 4 - worker nodes
Read the documentation carefully; it is pretty much self-explanatory.
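For the Dask Jobqueue route, here is a minimal sketch of what the application code might look like. It assumes `dask` and `dask_jobqueue` are installed in your environment. The core count, memory, walltime and `resource_spec` values are placeholders I made up, `train_model` is a hypothetical stand-in for your own function, and whether the queue accepts this resource request is something you would have to check:

```python
def cluster_options(cores=2, memory="8GB", walltime="01:00:00"):
    """Keyword arguments for dask_jobqueue.PBSCluster (placeholder values)."""
    return {
        "cores": cores,        # cores per worker job
        "memory": memory,      # memory limit Dask enforces per worker job
        "walltime": walltime,
        # Assumption: a Torque-style request of one node per worker job.
        "resource_spec": f"nodes=1:ppn={cores}",
    }


def train_model(seed):
    """Hypothetical stand-in for the real ML training function."""
    return seed * 2


def run_on_cluster(n_jobs=2):
    """Submit worker jobs through PBS and run train_model on them."""
    # Imported lazily so the helpers above work without Dask installed.
    from dask.distributed import Client
    from dask_jobqueue import PBSCluster

    cluster = PBSCluster(**cluster_options())
    cluster.scale(jobs=n_jobs)  # each job is submitted with qsub
    client = Client(cluster)
    try:
        futures = client.map(train_model, range(4))
        return client.gather(futures)
    finally:
        client.close()
        cluster.close()
```

Calling `run_on_cluster()` from your Python script starts the workers as separate PBS jobs and collects the results back in the calling process, so the script itself is the part that "calls the other nodes".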
Hi,
Thank you so much for this answer. I am not using MPI.
I am looking into the Dask Jobqueue library but am having trouble understanding it. I have a Python script that runs the ML model and produces the results. Could you tell me more about the application code needed to call the other nodes when using the Dask Jobqueue library?
Thanks
Regards,
Manjari
Hi,
Good day to you.
Thank you for posting in Intel Communities.
Could you kindly let us know which Intel DevCloud you're using? Is it Intel DevCloud for oneAPI or Intel DevCloud for Edge or Intel DevCloud for FPGA?
Thanks
Hi,
Thank you for your reply. I am using DevCloud for oneAPI.
Hi,
In most cases, memory errors are caused by running compute-intensive tasks on the login node. In such cases, log in to a compute node using qsub -I and execute your commands there.
Regarding "how many nodes can we run our ML workloads on? Is it possible to go beyond nodes>2"
We are working to optimize our queue policies to balance the needs of all users. Currently, every user is allowed to run two jobs: either one job running on two nodes simultaneously, or two jobs each running on one node.
Hope this clarifies your doubts.
Thanks
Hi,
How can we distribute XGBoost training using oneAPI libraries on GPU for multi-node? Is it possible? If so, are there references or notebooks we can use? Kindly let us know.
Thanks!
Regards,
Manjari
Hi,
We are checking on this internally and will share updates with you.
Thanks
Hi,
We will connect with you through private email/message to get your DevCloud user ID.
Thanks
Hi,
I have not received any email for my DevCloud user ID. Kindly let me know how I can share it.
Regards,
Manjari
Hi,
I am getting the email notification for the private message, but I am not able to open the messenger in the DevCloud. Is there an email address I can send my user ID to?
Also, I have another follow-up after DevCloud account access.
>> How can we distribute XGBoost training using oneAPI libraries on GPU for multi-node? Is it possible? If so, are there references or notebooks we can use? Kindly let us know.
Thanks!
Regards,
Manjari
Hi,
Thanks for sharing your user ID. We are working on this internally and will share updates with you.
Thanks
Hi,
For reference on XGBoost Optimized for Intel® Architecture, please check the link below.
Since XGBoost GPU support is still in development and is available in AI Kit as more of an experimental feature, there are not many resources available for it yet. As GPU support becomes more ready for customer adoption, expect a lot more resources, which are already being planned.
For now, you can review the contents of the following repo for XGBoost plugin GPU support (which has been upstreamed to XGBoost in Intel Python and AI Kit) as development continues:
https://github.com/vepifanov/xgboost/blob/dpcpp_backend_1.3.3/plugin/updater_oneapi/README.md
If you have any more queries related to XGBoost, please post your question in the forum below.
https://community.intel.com/t5/Intel-Optimized-AI-Frameworks/bd-p/optimized-ai-frameworks
If this resolves your issue, please accept this as a solution, as it might help others with a similar issue.
Please let us know if we can go ahead and close this case.
Thanks
Hi,
We have not heard back from you.
Please let me know if we can go ahead and close this case.
Thanks
Hi,
We have not heard back from you. This thread will no longer be monitored by Intel. If you need further assistance, please post a new question.
Thanks
