Designing AI applications is a process that involves a number of steps, including performance validation and tuning on actual deployment targets. If you are designing edge applications, then a low-power accelerator, like the Intel® Movidius™ Vision Processing Unit (VPU), is your likely target due to efficiency and power consumption constraints. If you target a data center, then you might want to leverage recent Intel® Xeon® Scalable Processors with their integrated performance and scalability. Once you are done with actual coding, compilation, and other regular design activities, the usual next step is to check model performance and tune it to the particular deployment hardware. Quite often this hardware is placed somewhere in a lab or remote office, making it difficult not only to profile a model but also to analyze the performance data and tune the model to achieve maximum results. This is the exact reason why we are introducing the capability of working with remote targets in OpenVINO™ Deep Learning Workbench (DL Workbench).
As we have previously described, DL Workbench is a tool that allows you to import Deep Learning models, evaluate their performance and accuracy, and perform different optimization tasks, like calibration for 8-bit integer inference. Profiling and model optimization are device-specific; therefore, to achieve maximum performance in a deployment environment, we need to perform these steps directly in that environment. DL Workbench helps you access those capabilities on remote machines. Please note that if you want access to numerous ready-to-use hardware configurations and you do not have them locally or in your private lab, you can run DL Workbench in the Intel® DevCloud for the Edge, where you can easily start experiments with available hardware. In this paper, we primarily focus on the case when you prepare the model for deployment and need to benchmark it on a specific hardware setup available in your private lab or a pre-production sandbox.
To avoid misunderstandings, below is a brief explanation of the terms used in this paper:
- Target is a machine that hosts one or several accelerators.
- Local target is the target where DL Workbench is running. Any other target is considered remote. Usually the application is run on a target, and technically any remote target must be accessible via a passwordless SSH connection.
- Device is a particular accelerator on which you execute the model, for example Intel® Movidius™ VPU or Intel Iris® Plus Gen11.
Setting up the environment for remote profiling
To access remote targets, DL Workbench uses the plain SSH protocol. There are several reasons behind this. Firstly, we fully respect security aspects of the deployment environment. Secondly, we want to minimize the footprint on these targets. Finally, we highly value your development time and want to provide an easy and straightforward way to quickly set up the target in DL Workbench so that you can jump directly to profiling and optimizing your model. Let’s proceed to learn how a new remote target can be set up and added to the list of available targets in the DL Workbench. This process is described in more detail in the special Remote Profiling section of the DL Workbench documentation.
First, make sure that your remote target satisfies the software requirements. We assume that you have also configured a passwordless SSH connection, which is one of the most secure ways to communicate with remote servers. When the remote target is accessible with a passwordless SSH connection, run DL Workbench and navigate to the ‘Create Project’ page, where you need to click the Add Remote Target button.
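Before adding the target, it can save time to confirm that key-based login really works without a prompt. The following is a minimal sketch of such a check, not part of DL Workbench itself: it assumes an OpenSSH client is installed on the local machine, and the user name and host name in the usage note are hypothetical placeholders.

```python
import subprocess


def build_ssh_check_command(user, host, port=22, timeout_s=5):
    """Build an ssh command that fails instead of prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                # disable interactive prompts
        "-o", f"ConnectTimeout={timeout_s}",  # give up quickly on unreachable hosts
        "-p", str(port),
        f"{user}@{host}",
        "true",                               # trivial remote command
    ]


def has_passwordless_ssh(user, host, port=22, timeout_s=5):
    """Return True if the remote command runs without any prompt."""
    command = build_ssh_check_command(user, host, port, timeout_s)
    try:
        result = subprocess.run(command, capture_output=True, timeout=timeout_s + 5)
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # No ssh client found locally, or the host did not answer in time.
        return False
    return result.returncode == 0
```

For example, `has_passwordless_ssh("openvino", "remote-lab.example.com")` should return `True` only when the key is accepted. Because prompts are disabled, a host whose key has not yet been accepted locally may also be reported as failing, so a first manual connection is likely still needed.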
You are redirected to the Add Remote Target page where you are requested to provide authentication credentials and other information needed to connect to the host.
After filling in the form, you will see that your machine is in the Configuring state:
DL Workbench configures your machine for future profiling and optimization experiments: it installs drivers if needed, copies the OpenVINO™ toolkit runtime and selected scripts to the target, and installs the Python dependencies they require. Then it collects information about available accelerators. For example, if the VPU accelerator is connected to a remote target, you will be able to see it in the list of available remote devices.
You can always return to the remote target review to analyze its status, check whether devices are still available, and view the current snapshot of system information, such as CPU load or RAM in use.
Finally, you get your remote target registered in the DL Workbench:
After the remote target is configured and marked as ready, you can experiment with it in DL Workbench in the same way as with a local target. DL Workbench takes responsibility for profiling and optimization logic on both local and remote targets. Technically, the only difference is the need to transfer a model and dataset to the remote target before conducting any experiments. Therefore, a single experiment on a remote target takes more time than an equivalent one on a local target. In all other aspects of profiling and optimization mechanics, local and remote experiments are identical. We strictly respect security aspects and ensure that a model and dataset are removed from a remote target after each experiment.
Comparing model performance on remote targets
Let’s first import the squeezenet-v1.1 model by simply selecting it from the list of available OpenVINO™ Open Model Zoo models on the Import model page. After that, we need to import the validation dataset. In our case, it will be a small 200-image subset of the ImageNet dataset. Then we need to select the imported model, the dataset, and the remote target that we have just configured in the DL Workbench, and proceed to the creation of the project. Once the project is created, we can find the best inference configuration by running multiple inference experiments with various batch and stream values. You can read more about basic OpenVINO performance concepts to understand why the best combination of batch and stream is individual for each model and how it can boost your model performance. Once we have run these experiments, we can start performance analysis.
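DL Workbench runs these experiments for you from the UI. To illustrate what the batch and stream grid looks like, here is a rough sketch that builds equivalent command lines for the OpenVINO `benchmark_app` tool. It is an assumption that this resembles what DL Workbench does internally; the model path, the value ranges, and the 20-second duration are hypothetical.

```python
from itertools import product


def build_benchmark_commands(model_path, device, batches, streams, seconds=20):
    """Return one benchmark_app command per (batch, stream) combination."""
    commands = []
    for batch, stream in product(batches, streams):
        commands.append([
            "benchmark_app",
            "-m", model_path,          # path to the model (hypothetical here)
            "-d", device,              # target device, for example CPU or GPU
            "-b", str(batch),          # batch size
            "-nstreams", str(stream),  # number of parallel inference streams
            "-t", str(seconds),        # duration of each run in seconds
        ])
    return commands


if __name__ == "__main__":
    grid = build_benchmark_commands("squeezenet1.1.xml", "CPU", [1, 2, 4], [1, 2, 4])
    for command in grid:
        print(" ".join(command))
```

The sketch only prints the commands rather than executing them, so it can be inspected on a machine without OpenVINO installed.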
As we can see, the model performance depends heavily on the particular combination of batch and stream. For example, running the model with 4 streams in single-batch mode is 10% quicker than with 1 stream and batch equal to 4, and 1.57 times quicker than running the model without any additional configuration (1 stream and batch equal to 1). Our potential next step is to analyze the model performance on the kernel level or compare model performance on remote targets.
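The ratios quoted above are simple throughput quotients. As a small worked sketch, the throughput values below are hypothetical numbers (in frames per second) chosen only so that they reproduce the ratios in the text; they are not measurements from the experiments.

```python
def speedup(throughput_fps, baseline_fps):
    """Return how many times faster one configuration is than another."""
    if baseline_fps <= 0:
        raise ValueError("baseline throughput must be positive")
    return throughput_fps / baseline_fps


# Hypothetical throughput per (streams, batch) pair, in frames per second.
results = {
    (1, 1): 100.0,  # default configuration
    (1, 4): 142.7,
    (4, 1): 157.0,
}

baseline = results[(1, 1)]
best_config = max(results, key=results.get)

if __name__ == "__main__":
    for config, fps in sorted(results.items()):
        print(config, f"{speedup(fps, baseline):.2f}x vs default")
    print("best (streams, batch):", best_config)
```

Ranking configurations by throughput alone ignores latency, so in a latency-sensitive deployment the same table would need to be sorted by a different column.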
It is important to note that all experimental data is collected and aggregated on the local target in the DL Workbench internal storage. This enables you to compare model performance across remote targets and devices side by side so that you can easily detect any anomalies or define the most appropriate target and device for the final deployment scenario.
For performance comparison, we switch to the Compare Performance page and select particular experiments in our projects. In the example below, we compare the best model performance result on the remote target Remote Lab with i9-9900K, with i9-9900K selected as the accelerator, against the best model performance result on another remote target, Gen9 Sandbox, with 9th Gen Intel® Core™ i9 Processor Dreamstakes as the accelerator. Note that maximum performance on the 9th Gen Intel® Core™ i9 Processor Dreamstakes accelerator is achieved with single-stream inference and batch equal to 4, compared to the most performant inference configuration of the same model for Intel® Core™ i9-9900K with 1 stream and batch equal to 4.
We start with comparing models by throughput (higher is better) and latency (lower is better) bars. In our case, squeezenet-v1.1 is much quicker on the CPU accelerator.
What are the reasons for such a performance difference? After switching to the Performance Summary tab, we see that the GPU accelerator was quicker in executing Pooling layers, while the performance gap was mainly caused by longer execution of Convolution layers.
We have identified the main problem area of the model inference on the Intel GPU accelerator: long execution of Convolution layers. We can dig into that aspect and filter out non-Convolution layers from the kernel-level table in the third tab of the performance report.
From the image above we can see that every Convolution layer was 2.4-3.9 times slower on the GPU device. From this analysis, we can clearly see that our model does not provide the best performance on the remote machine with the GPU accelerator. So we need to either deploy the model in production on the same hardware setup as the Remote Lab with i9-9900K or continue experimenting with other accelerators and hardware configurations in our private lab. You can also experiment with existing configurations for free by running DL Workbench in the Intel® DevCloud for the Edge.
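The Convolution-only comparison described above can also be reproduced offline if the per-layer timings are available as data. The sketch below assumes a hypothetical record format (dictionaries with `name`, `type`, and `time_ms` keys) and uses made-up timings; it does not reflect the actual DL Workbench report format.

```python
def convolution_slowdowns(cpu_layers, gpu_layers):
    """Return the GPU/CPU execution-time ratio for each Convolution layer."""
    gpu_by_name = {layer["name"]: layer for layer in gpu_layers}
    slowdowns = {}
    for layer in cpu_layers:
        if layer["type"] != "Convolution":
            continue  # drop Pooling and all other layer types
        match = gpu_by_name.get(layer["name"])
        if match is None or layer["time_ms"] <= 0:
            continue  # layer missing on the other device, or unusable timing
        slowdowns[layer["name"]] = match["time_ms"] / layer["time_ms"]
    return slowdowns


# Hypothetical per-layer timings in milliseconds.
cpu_report = [
    {"name": "conv1", "type": "Convolution", "time_ms": 0.10},
    {"name": "pool1", "type": "Pooling", "time_ms": 0.05},
    {"name": "conv2", "type": "Convolution", "time_ms": 0.02},
]
gpu_report = [
    {"name": "conv1", "type": "Convolution", "time_ms": 0.24},
    {"name": "pool1", "type": "Pooling", "time_ms": 0.03},
    {"name": "conv2", "type": "Convolution", "time_ms": 0.078},
]

if __name__ == "__main__":
    for name, ratio in convolution_slowdowns(cpu_report, gpu_report).items():
        print(f"{name}: {ratio:.1f}x slower on GPU")
```

Matching layers by name is a simplification: different devices may fuse or rename layers, in which case some entries would simply be skipped by this sketch.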
We have walked through a simple process of using Deep Learning Workbench with remote targets. We learned how to register a remote target in the DL Workbench, start working with it by creating a new project, and compare model performance measured on various remote targets. With remote target support in the DL Workbench, benchmarking and optimizing a model on a pre-production sandbox with several hardware configurations is no longer a challenge. We hope that by using this feature, you will be able to improve your productivity and reduce the time required to prepare your application for release.
Notices & Disclaimers
Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.
Your costs and results may vary.
Intel technologies may require enabled hardware, software or service activation.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.