Scott Bair is a key voice at Intel Labs, sharing insights into innovative research for inventing tomorrow’s technology.
Highlights:
- The 2023 International Conference on Computer Vision (ICCV) will take place from October 2 – 6 in Paris, France.
- Intel presents six computer vision works at the event, including three main conference papers, an oral presentation, and two workshop papers.
The 2023 International Conference on Computer Vision (ICCV) will take place from October 2 – 6 in Paris, France. Intel is proud to present six works at this year’s event, including three main conference papers, an oral presentation, and two workshop papers. These works include novel techniques to boost private inference efficiency on vision transformers, a new dataset consisting of scenes that change in appearance and geometry over time, and a method that introduces continual learning to Neural Radiance Fields. Intel’s researchers will also present a canonical camera space transformation module, a model that enhances Vision Graph Neural Networks, and a method that leverages off-the-shelf pre-trained weights for large models and generates a super-network during the fine-tuning stage. Furthermore, researchers will detail the key findings of their investigation of the relationship between different diversity metrics, accuracy, and resiliency to natural image corruptions of Deep Learning (DL) image classifier ensembles.
Main Conference Papers
SAL-ViT: Towards Latency Efficient Private Inference on ViT using Selective Attention Search with a Learnable Softmax Approximation
Recently, private inference (PI) has addressed the rising concern over data and model privacy in machine learning inference as a service. However, existing PI frameworks suffer from high computational and communication overheads due to the expensive multi-party computation (MPC) protocols, particularly for large models such as vision transformers (ViTs). The majority of this overhead comes from the encrypted softmax operation in each self-attention layer. This work presents SAL-ViT with two novel techniques to boost PI efficiency on ViTs. The first is a learnable PI-efficient approximation to softmax, learnable 2Quad (L2Q), which introduces learnable scaling and shifting parameters to the prior 2Quad softmax approximation, improving accuracy. Then, based on the researchers’ observation that external attention (EA) offers lower PI latency than the widely adopted self-attention (SA) at the cost of accuracy, they present a selective attention search (SAS) method to integrate the strengths of EA and SA. Specifically, for a given lightweight EA ViT, they leverage a constrained optimization procedure to selectively search for and replace EA modules with SA alternatives to maximize accuracy. Extensive experiments show that SAL-ViT achieves, on average, 1.28×, 1.28×, and 1.14× lower PI latency with 1.79%, 1.41%, and 2.08% higher accuracy than existing alternatives on CIFAR-10, CIFAR-100, and Tiny-ImageNet, respectively.
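To make the L2Q idea concrete, here is a minimal PyTorch sketch of a learnable quadratic softmax surrogate. It assumes the 2Quad-style form (x + c)² normalized over the last dimension; the exact parameterization, constant, and module name are illustrative, not SAL-ViT’s released code.

```python
import torch
import torch.nn as nn

class L2QSoftmax(nn.Module):
    """Sketch of a learnable quadratic softmax approximation (L2Q).

    The prior 2Quad approximation replaces exp(x) with (x + c)^2 so the
    operation is cheap under MPC; L2Q adds learnable scaling and shifting.
    The exact parameterization here is an assumption, not the paper's code.
    """
    def __init__(self, c: float = 5.0):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(1))   # learnable scaling
        self.shift = nn.Parameter(torch.zeros(1))  # learnable shifting
        self.c = c

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        # Quadratic surrogate for exp(): non-negative and MPC-friendly.
        q = (self.scale * scores + self.shift + self.c) ** 2
        return q / (q.sum(dim=-1, keepdim=True) + 1e-6)

# Usage: drop-in replacement for softmax over attention scores.
attn = L2QSoftmax()
scores = torch.randn(2, 8, 16, 16)  # (batch, heads, queries, keys)
weights = attn(scores)              # rows sum to ~1
```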
CLNeRF: Continual Learning Meets NeRF
Novel view synthesis aims to render unseen views given a set of calibrated images. In practical applications, the coverage, appearance, or geometry of the scene may change over time, with new images continuously being captured. Efficiently incorporating such continuous change is an open challenge. Standard NeRF benchmarks only involve scene coverage expansion. To study other practical scene changes, this work proposes a new dataset, World Across Time (WAT), consisting of scenes that change in appearance and geometry over time. Researchers also propose a simple yet effective method, CLNeRF, which introduces continual learning (CL) to Neural Radiance Fields (NeRFs). CLNeRF combines generative replay and the Instant Neural Graphics Primitives (NGP) architecture to effectively prevent catastrophic forgetting and efficiently update the model when new data arrives. Furthermore, researchers add trainable appearance and geometry embeddings to NGP, allowing a single compact model to handle complex scene changes. Without the need to store historical images, CLNeRF trained sequentially over multiple scans of a changing scene performs on par with the upper-bound model trained on all scans at once. Compared to other CL baselines, CLNeRF performs much better across standard benchmarks and WAT. The source code and the WAT dataset are available here; the video presentation is available here.
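The generative-replay mechanism can be sketched in a few lines: before fitting a new scan, a frozen copy of the model renders pseudo ground truth at replayed camera poses, and those rays are mixed into the training batch so no historical images need to be stored. All function names and the mixing ratio below are illustrative assumptions, not CLNeRF’s actual implementation.

```python
import torch

def clnerf_step(model, old_model, new_rays, new_rgb, replay_poses,
                sample_rays, optimizer, replay_ratio=0.5):
    """One hedged training step of generative replay for continual NeRF.

    `old_model` is a frozen copy of the model from before the new scan
    arrived; it "replays" past data by rendering pseudo ground-truth
    colors, so no historical images are stored. Names are illustrative.
    """
    # Sample rays from replayed poses and label them with the old model.
    replay_rays = sample_rays(replay_poses)
    with torch.no_grad():
        replay_rgb = old_model(replay_rays)  # pseudo ground truth

    # Mix new-scan rays with replayed rays in a single batch.
    n = int(replay_ratio * len(new_rays))
    rays = torch.cat([new_rays, replay_rays[:n]], dim=0)
    target = torch.cat([new_rgb, replay_rgb[:n]], dim=0)

    loss = torch.nn.functional.mse_loss(model(rays), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```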
Metric3D: Towards Zero-shot Metric 3D Prediction from A Single Image
Reconstructing accurate 3D scenes from images is a long-standing vision task. Due to the ill-posedness of the single-image reconstruction problem, most well-established methods are built upon multi-view geometry. State-of-the-art (SOTA) monocular metric depth estimation methods can only handle a single camera model and are unable to perform mixed-data training due to metric ambiguity. Meanwhile, SOTA monocular methods trained on large mixed datasets achieve zero-shot generalization by learning affine-invariant depths, which cannot recover real-world metrics. This work shows that the key to a zero-shot single-view metric depth model lies in combining large-scale data training with resolving the metric ambiguity across various camera models. Researchers propose a canonical camera space transformation module, which explicitly addresses the ambiguity problem and can be effortlessly plugged into existing monocular models. Equipped with this module, monocular models can be stably trained on over 8 million images spanning thousands of camera models, resulting in zero-shot generalization to in-the-wild images with unseen camera settings. Experiments demonstrate the method’s SOTA performance on seven zero-shot benchmarks. Notably, the method won the championship in the 2nd Monocular Depth Estimation Challenge. It enables the accurate recovery of metric 3D structure from randomly collected internet images, paving the way for plausible single-image metrology. The potential benefits extend to downstream tasks, which can be significantly improved by simply plugging in the model. For example, it relieves the scale drift issues of monocular SLAM, leading to high-quality metric-scale dense mapping. The code is available here.
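The core of the canonical transformation can be illustrated with a label-scaling variant: depth from any camera is rescaled as if captured with one canonical focal length, making mixed-camera training consistent, and the inverse scaling is applied at inference. The canonical focal value and function names below are assumptions for illustration, not the paper’s exact module.

```python
import numpy as np

CANONICAL_FOCAL = 1000.0  # illustrative canonical focal length, in pixels

def depth_to_canonical(depth: np.ndarray, focal: float) -> np.ndarray:
    """Scale metric depth into the canonical camera space.

    Different cameras map the same scene to different pixel scales;
    scaling depth by f_c / f makes labels from all cameras mutually
    consistent, so millions of mixed-camera images can train together.
    """
    return depth * (CANONICAL_FOCAL / focal)

def depth_from_canonical(pred: np.ndarray, focal: float) -> np.ndarray:
    """Invert the transform at inference to recover metric depth."""
    return pred * (focal / CANONICAL_FOCAL)

# Usage: a depth map from a camera with a 700 px focal length.
gt = np.random.rand(480, 640).astype(np.float32) * 10.0
canonical = depth_to_canonical(gt, focal=700.0)
restored = depth_from_canonical(canonical, focal=700.0)
assert np.allclose(gt, restored)
```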
Oral Presentation
Vision HGNN: An Image is More than a Graph of Nodes
The realm of graph-based modeling has proven its adaptability across diverse real-world data types. However, its applicability to general computer vision tasks had been limited until the introduction of the Vision Graph Neural Network (ViG). ViG divides input images into patches, conceptualized as nodes, constructing a graph through connections to nearest neighbors. Nonetheless, this method of graph construction confines itself to simple pairwise relationships, leading to surplus edges and unwarranted memory and computation expenses. In this paper, researchers enhance ViG by transcending conventional “pairwise” linkages and harnessing the power of the hypergraph to encapsulate image information. The objective is to encompass more intricate inter-patch associations. In both training and inference phases, researchers adeptly establish and update the hypergraph structure using the Fuzzy C-Means method, ensuring minimal computational burden. This augmentation yields the Vision HyperGraph Neural Network (ViHGNN). The model’s efficacy is empirically substantiated through its state-of-the-art performance on both image classification and object detection tasks, courtesy of the hypergraph structure learning module that uncovers higher-order relationships.
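As a rough illustration of the hypergraph construction step, the sketch below computes standard Fuzzy C-Means soft memberships of ViT patch embeddings to a set of hyperedge prototypes; the shapes, initialization, and fuzzifier value are illustrative assumptions, not ViHGNN’s exact procedure.

```python
import torch

def fuzzy_cmeans_memberships(x, centers, m=2.0, eps=1e-8):
    """Soft memberships of patch embeddings to hyperedge prototypes.

    x: (N, D) patch features; centers: (K, D) hyperedge prototypes.
    Returns an (N, K) membership matrix in [0, 1] whose rows sum to 1,
    using the standard Fuzzy C-Means update with fuzzifier m:
        u_ik = 1 / sum_j (d_ik / d_ij)^(2 / (m - 1))
    """
    d = torch.cdist(x, centers).clamp_min(eps)            # (N, K) distances
    ratio = (d.unsqueeze(2) / d.unsqueeze(1)) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(dim=2)                         # (N, K)

# Illustrative use: 196 ViT patches grouped into 8 soft hyperedges.
patches = torch.randn(196, 64)
centers = patches[torch.randperm(196)[:8]].clone()        # init from data
H = fuzzy_cmeans_memberships(patches, centers)            # incidence matrix
```

Because the memberships are soft, each patch can participate in several hyperedges at once, which is what lets the hypergraph capture relationships beyond simple pairwise links.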
Workshop Papers
Assessing the Impact of Diversity on the Resilience of Deep Learning Ensembles: A Comparative Study on Model Architecture, Output, Activation, and Attribution
Presented at the Out Of Distribution Generalization in Computer Vision Workshop
This work investigates the relationship between different diversity metrics, accuracy, and resiliency to natural image corruptions of Deep Learning (DL) image classifier ensembles. Researchers evaluate existing diversity dimensions such as model architecture, model prediction, and neuron activations, as well as a novel diversity dimension of input attribution. Using ResNet50 as a comparison baseline, researchers evaluate the resiliency of multiple individual DL model architectures against dataset distribution shifts corresponding to natural image corruptions. The team compares ensembles created with diverse model architectures trained either independently or through a Neural Architecture Search technique, and evaluates the correlation of prediction-based and attribution-based diversity to the final ensemble accuracy. Finally, they evaluate a set of diversity enforcement heuristics for training based on negative correlation learning (NCL) and compare how effective they are at achieving independent failure behavior. The key observations are: 1) model architecture is more important for individual resiliency than model size or model accuracy, but architecture diversity in an ensemble is typically not more resilient; 2) attribution-based diversity is less negatively correlated to the ensemble accuracy than prediction-based diversity; 3) a balanced loss function of individual and ensemble accuracy creates more resilient ensembles against natural image corruptions; 4) architecture diversity produces more diversity than NCL in all explored diversity metrics: predictions, attributions, and activations.
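For readers unfamiliar with NCL, here is a minimal sketch of the classic negative correlation learning loss that such diversity-enforcement heuristics build on; the regression-style formulation and the `lam` weight are assumptions for illustration, not the paper’s exact training objective.

```python
import torch

def ncl_loss(member_outputs, target, lam=0.5):
    """Hedged sketch of negative correlation learning (NCL) for ensembles.

    member_outputs: list of (B, C) predictions, one per ensemble member.
    Each member minimizes its own error plus a penalty that rewards
    deviating from the ensemble mean, pushing members toward diverse
    (ideally independent) failure behavior. `lam` balances individual
    accuracy against ensemble diversity.
    """
    preds = torch.stack(member_outputs)          # (M, B, C)
    mean = preds.mean(dim=0, keepdim=True)       # ensemble prediction
    err = ((preds - target.unsqueeze(0)) ** 2).mean()
    # Classic NCL penalty: p_i = (f_i - fbar) * sum_{j != i}(f_j - fbar),
    # which equals -(f_i - fbar)^2 since deviations sum to zero, so the
    # diversity term is subtracted from the error term.
    diversity = ((preds - mean) ** 2).mean()
    return err - lam * diversity

# Usage: three members predicting a (batch, classes) target.
outs = [torch.randn(4, 10, requires_grad=True) for _ in range(3)]
loss = ncl_loss(outs, torch.randn(4, 10))
loss.backward()
```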
InstaTune: Instantaneous Neural Architecture Search During Fine-Tuning
Presented at the Workshop on Resource Efficient Deep Learning for Computer Vision
One-Shot Neural Architecture Search (NAS) algorithms often rely on training a hardware-agnostic super-network for a domain-specific task. Optimal sub-networks are then extracted from the trained super-network for different hardware platforms. However, training super-networks from scratch can be extremely time-consuming and compute-intensive, especially for large models that rely on a two-stage training process of pre-training and fine-tuning. State-of-the-art pre-trained models are available for a wide range of tasks, but their large sizes significantly limit their applicability on various hardware platforms. This work proposes InstaTune, a method that leverages off-the-shelf pre-trained weights for large models and generates a super-network during the fine-tuning stage. InstaTune has multiple benefits. First, since the process happens during fine-tuning, it minimizes the overall time and compute resources required for NAS. Second, the extracted sub-networks are optimized for the target task, unlike prior work that optimizes on the pre-training objective. Finally, InstaTune is easy to “plug and play” into existing frameworks. Using multi-objective evolutionary search algorithms along with lightly trained predictors, researchers find Pareto-optimal sub-networks that outperform their respective baselines across different performance objectives such as accuracy and MACs. Specifically, they demonstrate that the approach performs well across both unimodal (ViT and BERT) and multi-modal (BEiT-3) transformer-based architectures.
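The central idea, generating a super-network during fine-tuning rather than pre-training one from scratch, can be sketched as sampling a random elastic sub-network configuration at each fine-tuning step over the pre-trained weights. The `set_active_subnet` method and the search-space dimensions below are hypothetical stand-ins, not InstaTune’s actual API.

```python
import random
import torch

def instatune_finetune(supernet, loader, optimizer, search_space, steps):
    """Hedged sketch of super-network training during fine-tuning.

    `supernet` starts from off-the-shelf pre-trained weights; each step
    activates one randomly sampled sub-network configuration (e.g. depth,
    heads, MLP ratio), so a single fine-tuning run yields a weight-sharing
    super-network from which sub-networks can later be extracted.
    """
    it = iter(loader)
    for _ in range(steps):
        x, y = next(it)
        # Sample one architecture from the elastic search space.
        config = {k: random.choice(v) for k, v in search_space.items()}
        supernet.set_active_subnet(**config)  # hypothetical helper
        loss = torch.nn.functional.cross_entropy(supernet(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Illustrative elastic dimensions for a ViT-style backbone.
search_space = {"depth": [10, 11, 12], "heads": [8, 12], "mlp_ratio": [3, 4]}
```

After this fine-tuning pass, an evolutionary search over the shared weights, guided by lightly trained accuracy and MACs predictors, can pick Pareto-optimal sub-networks for each hardware target, as described above.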