It is really a 2-part question, so let me answer in parts.
Part 1: which DevCloud. This thread is posted on the forum for Intel DevCloud for the Edge, which is not designed for DNN training; its compute nodes are meant for edge inference. The better choice for training is Intel DevCloud for Data-centric Workloads, whose compute nodes are more suitable for training workloads.
Part 2: how to train in parallel. On Intel DevCloud for Data-centric Workloads, you can train on multiple nodes in two ways:
- Recommended: submit independent training jobs to multiple nodes, one node per job. When training, you normally need to experiment with parameters such as the learning rate, so you will finish that search faster by using multiple nodes to try multiple parameter sets at the same time. The "-F" argument of qsub, which passes arguments to the job script, is one way to do that. See https://devcloud.intel.com/datacenter/learn/advanced-queue/job-parameters for an example.
- Not recommended: submit a multi-node job to train a single network using the distributed-memory architecture syntax (see https://devcloud.intel.com/datacenter/learn/advanced-queue/distributed-memory-architecture). This is not recommended for two reasons:
  - Multi-node training is not automatic. Even if you reserve multiple nodes for the job, your network will run out-of-the-box on just one. To run on several nodes, you have to explicitly launch your executables on all nodes and inform your DL framework that you are doing this. The recipes for doing this vary from one framework to another.
  - Multi-node training of a single network is usually very intensive in network traffic, and requires 100 GbE or faster interconnects to provide enough data to keep the training devices occupied. As of early 2020, the networking infrastructure of compute nodes in Intel DevCloud for Data-centric Workloads does not have 100 GbE interconnects, so you may experience a slow-down instead of a speed-up by doing this.
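To make the recommended option concrete, here is a minimal sketch of how a parameter sweep could be turned into independent single-node jobs. It only builds the qsub command lines and does not submit anything. The script name train.sh, the --lr and --batch-size arguments, and the job names are hypothetical placeholders; "-l nodes=1" is an assumption about how you would request one node per job, so adjust it to whatever the job-parameters page above recommends.

```python
import itertools
import shlex


def build_qsub_commands(script, learning_rates, batch_sizes):
    """Return one qsub command (as an argv list) per parameter combination."""
    commands = []
    combos = itertools.product(learning_rates, batch_sizes)
    for index, (lr, batch_size) in enumerate(combos):
        # "-F" hands this string to the job script as its arguments.
        # The argument names are hypothetical; use what your script expects.
        script_args = f"--lr {lr} --batch-size {batch_size}"
        commands.append([
            "qsub",
            "-l", "nodes=1",          # assumed resource request: one node per job
            "-N", f"train_{index}",   # short job name to tell the runs apart
            "-F", script_args,
            script,
        ])
    return commands


if __name__ == "__main__":
    # Print the commands for review instead of submitting them.
    for cmd in build_qsub_commands("train.sh", [0.1, 0.01], [32, 64]):
        print(shlex.join(cmd))
```

Once the printed commands look right, each list could be passed to subprocess.run(cmd, check=True) on the login node to submit the job. Because the jobs are independent, the scheduler is free to place each one on its own node, and no traffic between nodes is needed.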
Can we do distributed computing on Intel DevCloud?
As an example, I would start a server on one compute node and use another node as a management node, but the whole process has to finish within 24 hours.
Is such a setup feasible?
Alternatively, I would prefer to use MPI4PY. How will Intel DevCloud behave with MPI4PY: will it act as a cluster with several nodes, or as workers on a single node?