Training on multiple nodes

How do I use more than one node in Intel DevCloud while training my deep neural network?

It is really a 2-part question, so let me answer in parts.

Part 1: which DevCloud. This thread is posted on the forum for Intel DevCloud for the Edge, which is not designed for DNN training; its compute nodes are intended for edge inference. The better choice for training would be Intel DevCloud for Data-centric Workloads, whose compute nodes are more suitable for training workloads.

Part 2: how to train in parallel. On Intel DevCloud for Data-centric Workloads, you can train on multiple nodes in two ways:

  1. Recommended: submit independent training jobs to multiple nodes, one node per job. When training, you normally need to experiment with hyperparameters such as the learning rate, so you will get through that exploration faster by using multiple nodes to evaluate multiple parameter sets at the same time. Take a look at the "-F" argument of qsub, which passes arguments to the job script, for an example of how to do that. See https://devcloud.intel.com/datacenter/learn/advanced-queue/job-parameters
  2. Not recommended: submit a multi-node job to train a single network using the distributed-memory architecture syntax (see https://devcloud.intel.com/datacenter/learn/advanced-queue/distributed-memory-architecture). This is not recommended for two reasons:
    1. Multi-node training is not automatic. Even if you reserve multiple nodes for the job, your network will run out-of-the-box on just one. To run on several nodes, you have to explicitly launch your executables on all nodes and inform your DL framework that you are doing this. The recipes for doing this vary from one framework to another.
    2. Multi-node training of a single network is usually very intensive in network traffic, and requires 100 GbE or faster interconnects to provide enough data to keep the training devices occupied. As of early 2020, the networking infrastructure of compute nodes in Intel DevCloud for Data-centric Workloads does not have 100 GbE interconnects, so you may experience a slow-down instead of a speed-up by doing this.
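
To make the recommended approach (option 1) more concrete, here is a minimal sketch of a parameter sweep that builds one qsub command per parameter set and passes the values to the job script through "-F". The script name train.sh, the parameter names lr and batch_size, and the "-l nodes=1" resource request are my assumptions for illustration; check the job-parameters page linked above for the exact resource syntax on your account. The sketch only prints the commands, it does not submit anything.

```python
import itertools
import shlex


def build_qsub_commands(script, grid, nodes_per_job=1):
    """Return one qsub argument list per hyperparameter combination.

    `script` is a hypothetical job script name and `grid` maps hypothetical
    parameter names to the values to try; neither comes from DevCloud itself.
    """
    names = sorted(grid)  # fixed order so the output is reproducible
    commands = []
    for values in itertools.product(*(grid[n] for n in names)):
        # Everything given to -F is forwarded to the job script as its arguments.
        script_args = " ".join(f"--{n}={v}" for n, v in zip(names, values))
        commands.append(
            # "-l nodes=N" is an assumed resource request: one node per job.
            ["qsub", "-l", f"nodes={nodes_per_job}", "-F", script_args, script]
        )
    return commands


if __name__ == "__main__":
    grid = {"lr": [0.1, 0.01], "batch_size": [32, 64]}
    for cmd in build_qsub_commands("train.sh", grid):
        # Print for inspection; pass cmd to subprocess.run(cmd) to actually submit.
        print(shlex.join(cmd))
```

Each job then runs independently on its own node, so there is no inter-node traffic during training, and the job script only has to parse its own arguments.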