I am training a model on the 2014 NYC Taxi dataset to compare against a previous experiment run on an NVIDIA server. However, my runs exit with a "Killed" message, which appears to be due to memory exhaustion. Because of this, I tried to find a way to distribute the training across multiple nodes with Modin, but I can't find any examples of this on the DevCloud.
- Is there a way to confirm/optimize my application's memory use, so that I don't have to go multi-node?
- Are there any examples of multi-node Python on the DevCloud?
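For context, here is roughly how I've been checking peak memory use from within Python, using only the standard library (the data-building step is a placeholder for my actual dataset load):

```python
import tracemalloc

# Start tracing memory allocations made by the Python interpreter.
tracemalloc.start()

# Placeholder for the real work, e.g. loading and transforming the taxi data.
data = [list(range(1000)) for _ in range(1000)]

# current = bytes allocated right now; peak = high-water mark since start().
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")
```

Note that `tracemalloc` only sees allocations made through Python's allocator, so memory held by native extensions may not be counted.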
Thanks for posting in Intel forums.
Could you share a sample reproducer? Also, kindly let us know the following:
1) Are you running this training from the login node?
2) Are you using Intel's distribution of Modin or any of Intel's optimized frameworks?
Also, for your information, there are currently a few limitations when requesting multiple nodes on the DevCloud: a user can run at most two jobs/nodes.
Regarding examples of multi-node Python, please find the GitHub samples below.
Apologies for the delay:
- I ran from an interactive node.
- Within the last few days, I've seen some information about Modin, and I'm working to integrate it into my code now; failing that, I'll try Numba.
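In the meantime, a standard-library-only sketch of the streaming approach I'm experimenting with, so the whole CSV never has to sit in memory at once (the file name and column name are illustrative, not my actual schema):

```python
import csv

def mean_fare(path, column="fare_amount"):
    """Stream a CSV row by row and average one numeric column,
    keeping only a running total and count in memory."""
    total, count = 0.0, 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            try:
                total += float(row[column])
                count += 1
            except (ValueError, KeyError):
                continue  # skip malformed or missing values
    return total / count if count else 0.0
```

The same running-aggregate idea extends to variance, min/max, and per-group statistics, which is often enough to avoid materializing the full dataset.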
Are you saying a user can only run two jobs on one node, or that one job can request/use at most two nodes?
Also, what is a "reproducer"?
Regarding your question on job/node limitations: currently, every user can run two jobs, either one job running on two nodes simultaneously or two jobs each running on one node.
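For reference, on the DevCloud's PBS-based queue a two-node request looks roughly like the following (the script name and resource properties are illustrative; please check the DevCloud documentation or `pbsnodes` output for the properties available to you):

```shell
# Request 2 nodes with 2 processes per node for the job script run.sh
qsub -l nodes=2:ppn=2 run.sh

# Once the job is running, list the nodes assigned to it
qstat -n
```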
As for "what is a reproducer": a reproducer is a sample script containing the steps you tried, so that we can reproduce your issue. It will help us identify your exact problem.
I assume that your issue is resolved. If you need any additional information, please post a new question, as this thread will no longer be monitored by Intel.