
Intel Labs at The Winter Conference on Applications of Computer Vision


Scott Bair is a key voice at Intel Labs, sharing insights into innovative research for inventing tomorrow’s technology.



  • The 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) takes place January 3-7 in Waikoloa, Hawaii.  
  • Intel presents five computer vision papers covering a Dynamic Scene Graph Detection Transformer, a Fast Learnable Once-for-all Adversarial Training method, a quantization scheme for efficient training of convolutional neural networks, privacy-enhancing deepfakes for restricting face access in social photos, and a novel non-local self-attentive pooling method.  


The IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) is a prominent international computer vision event comprising a main conference and several co-located workshops and tutorials. This year’s conference runs January 3-7 in Waikoloa, Hawaii. Providing a platform for innovative research, the event draws participants from industry and academia alike.  

Intel Labs is pleased to present several works at the main conference, including a Dynamic Scene Graph Detection Transformer for learning long-term dependencies in video, a Fast Learnable Once-for-all Adversarial Training method, and a method for quantizing convolutional neural networks for efficient training. Additional works encompass quantifiably dissimilar deepfakes applied in a hypothetical social network, where users appear only in photos they approve, and a novel non-local self-attentive pooling method that can serve as a drop-in replacement for standard pooling layers. Read on for more details on these research areas. 


Exploiting Long-Term Dependencies for Generating Dynamic Scene Graphs 

Dynamic scene graph generation from a video is challenging due to the temporal dynamics of the scene and the inherent temporal fluctuations of predictions. In this paper, researchers hypothesize that capturing long-term temporal dependencies is the key to the effective generation of dynamic scene graphs. They then propose to learn the long-term dependencies in a video by capturing the object-level consistency and inter-object relationship dynamics over object-level long-term tracklets using transformers. Experimental results demonstrate that the proposed Dynamic Scene Graph Detection Transformer (DSGDETR) outperforms state-of-the-art methods by a significant margin on the benchmark dataset Action Genome. Ablation studies validate the effectiveness of each component of the proposed approach. The source code is available here. 
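The object-level tracklets that DSGDETR reasons over can be illustrated with a minimal sketch. Assuming per-frame detections are given as bounding boxes, the toy `link_tracklets` helper below (a hypothetical name, not from the paper) greedily links detections across consecutive frames by IoU; the paper's transformer then models object-level consistency and inter-object relationship dynamics over such tracklets:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def link_tracklets(frames, thresh=0.5):
    """Greedily link per-frame detections into object-level tracklets.

    frames: list of frames, each a list of boxes.
    Returns tracklets as lists of (frame_index, box) pairs.
    """
    tracklets = []
    for t, boxes in enumerate(frames):
        unmatched = list(boxes)
        for tr in tracklets:
            last_t, last_box = tr[-1]
            # Only extend tracklets that ended in the previous frame.
            if last_t != t - 1 or not unmatched:
                continue
            best = max(unmatched, key=lambda b: iou(last_box, b))
            if iou(last_box, best) >= thresh:
                tr.append((t, best))
                unmatched.remove(best)
        # Detections left unmatched start new tracklets.
        tracklets.extend([[(t, b)] for b in unmatched])
    return tracklets
```

A box drifting slowly across three frames yields a single three-element tracklet, while a detection with no IoU overlap starts a new one.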


FLOAT: Fast Learnable Once-for-All Adversarial Training for Tunable Trade-off between Accuracy and Robustness 

Training a model that is robust to adversarially perturbed images without compromising accuracy on clean images has proven challenging. Recent research has tried to resolve this issue by inserting an additional layer after each batch-normalization layer in a network to implement feature-wise linear modulation (FiLM). These extra layers enable in-situ calibration of a trained model, allowing the user to configure the desired priority between robustness and clean-image performance after deployment. However, they significantly increase training time and parameter count and add latency, which can prove costly for time- or memory-constrained applications. This paper presents Fast Learnable Once-for-all Adversarial Training (FLOAT), which transforms the weight tensors without using extra layers, thereby incurring no significant increase in parameter count, training time, or network latency compared to standard adversarial training. In particular, the approach adds configurable scaled noise to the weight tensors, enabling a trade-off between clean and adversarial performance. FLOAT is also extended to slimmable neural networks to enable a three-way in-situ trade-off between robustness, accuracy, and complexity. Extensive experiments show that FLOAT can yield state-of-the-art performance, improving clean and perturbed image classification by up to ∼6.5% and ∼14.5%, respectively, while requiring up to 1.47× fewer parameters than FiLM-based alternatives with similar hyperparameter settings. Code for this project will be made available shortly. 
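The core idea of transforming weight tensors with configurable scaled noise can be sketched in a few lines. This is a loose illustration, not the paper's implementation; the names `lam` (the user-facing robustness knob) and `alpha` (the noise scale) are assumptions:

```python
import random

def float_transform(weights, lam, alpha=0.1, seed=0):
    """Perturb a weight tensor with configurable scaled noise.

    lam = 0.0 recovers the clean-trained weights unchanged;
    lam = 1.0 applies the full noise scale, biasing the model
    toward adversarial robustness at some cost in clean accuracy.
    """
    rng = random.Random(seed)  # fixed seed so the transform is repeatable
    return [w + lam * alpha * rng.gauss(0.0, 1.0) for w in weights]
```

Because the noise is applied to existing weight tensors rather than through extra FiLM layers, the sketch adds no new parameters, mirroring the paper's stated advantage.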


HyperBlock Floating Point: Generalised Quantization Scheme for Gradient and Inference Computation 

Prior quantization methods focus on producing networks for fast and lightweight inference. However, these works overlook the cost of unquantized training, which requires significantly more time and energy than inference. In response, this paper presents a method for quantizing convolutional neural networks for efficient training. Quantizing gradients is challenging because they require higher granularity and span a wider range of values than the weights and feature maps. This work proposes an extension of the Channel-wise Block Floating Point format that allows for quick gradient computation with minimal quantization overhead. This is achieved by sharing an exponent across both the depth and batch dimensions, so tensors can be quantized once and reused for the transposed convolution during backpropagation. The method was tested on standard models, such as AlexNet, VGG, and ResNet, using the CIFAR10, SVHN, and ImageNet datasets. Results show no loss of accuracy when quantizing AlexNet weights, activations, and gradients to only 4 bits when training on ImageNet. 
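A minimal sketch of block floating point quantization, the family of formats this work extends: all values in a block share one exponent, so only low-bit integer mantissas need to be stored per value. The function names and the 4-bit mantissa width are illustrative, and mantissa-range clamping details are omitted:

```python
import math

def quantize_block(values, mantissa_bits=4):
    """Quantize a block of floats to one shared exponent plus
    low-bit integer mantissas (toy block floating point)."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 0, [0] * len(values)
    shared_exp = math.frexp(max_abs)[1]          # exponent of the largest value
    scale = 2.0 ** (shared_exp - mantissa_bits)  # step between representable values
    mantissas = [round(v / scale) for v in values]
    return shared_exp, mantissas

def dequantize_block(shared_exp, mantissas, mantissa_bits=4):
    """Reconstruct approximate floats from the shared-exponent format."""
    scale = 2.0 ** (shared_exp - mantissa_bits)
    return [m * scale for m in mantissas]
```

The quantization error of every element is bounded by half the shared scale step; values much smaller than the block maximum lose relative precision, which is why the paper notes that gradients, with their wide value range, need careful exponent sharing.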


My Face My Choice: Privacy Enhancing Deepfakes for Social Media Anonymization 

Recently, the productization of face recognition and identification algorithms has become one of the most controversial topics in ethical AI. As new policies around digital identities take shape, this work introduces My Face My Choice (MFMC), which defines three face access models for a hypothetical social network in which users appear only in photos they approve. The proposed approach goes beyond current tagging systems by replacing unapproved faces with quantitatively dissimilar deepfakes. In addition, the researchers propose new metrics specific to this task, where the deepfake is generated randomly with a guaranteed dissimilarity. The paper explains the access models in order of the data flow's strictness and discusses each model's impact on privacy, usability, and performance. The system is evaluated on the Facial Descriptor Dataset as the real dataset and on two synthetic datasets with random and equal class distributions. Running seven state-of-the-art face recognizers on the results, MFMC reduces average recognition accuracy by 61%. Lastly, the researchers extensively analyze similarity metrics, deepfake generators, and datasets in structural, visual, and generative spaces, supporting the design choices and verifying output quality. 
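The notion of a randomly chosen deepfake with guaranteed dissimilarity can be sketched as a selection rule over face embeddings. This toy example (the names and the 0.6 threshold are assumptions, not from the paper) picks a random candidate whose embedding lies at least a minimum cosine distance from the original face:

```python
import math
import random

def cosine_distance(u, v):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def pick_dissimilar(original, candidates, min_dist=0.6, seed=0):
    """Randomly pick a replacement-face embedding guaranteed to be
    at least min_dist away from the original; None if none qualifies."""
    rng = random.Random(seed)
    pool = [c for c in candidates if cosine_distance(original, c) >= min_dist]
    return rng.choice(pool) if pool else None
```

Filtering before sampling is what makes the dissimilarity a guarantee rather than an expectation: any returned candidate satisfies the threshold by construction.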


Self-Attentive Pooling for Efficient Deep Learning 

Efficient custom pooling techniques that can aggressively trim the dimensions of a feature map for resource-constrained computer vision applications have recently gained significant traction. However, prior pooling works extract only the local context of the activation maps, limiting their effectiveness. In contrast, this work proposes a novel non-local self-attentive pooling method that can be used as a drop-in replacement for standard pooling layers, such as max/average pooling or strided convolution. The novel self-attention module uses patch embedding, multi-head self-attention, and spatial-channel restoration, followed by sigmoid activation and exponential softmax. This self-attention mechanism efficiently aggregates dependencies between non-local activation patches during downsampling. Extensive experiments on standard object classification and detection tasks with various convolutional neural network (CNN) architectures demonstrate the superiority of the proposed mechanism over state-of-the-art (SOTA) pooling techniques. It surpasses the test accuracy of existing pooling techniques on different variants of MobileNet-V2 on ImageNet by an average of ∼1.2%. With aggressive down-sampling of the activation maps in the initial layers (providing up to 22× reduction in memory consumption), the approach achieves 1.43% higher test accuracy than SOTA techniques with iso-memory footprints. Because the initial activation maps of the high-resolution images required for complex vision tasks consume a significant amount of on-chip memory, this makes the proposed models deployable on memory-constrained devices, such as microcontrollers, without losing significant accuracy. The pooling method also leverages channel pruning to further reduce memory footprints. 
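The contrast with local pooling can be illustrated by a toy 1-D scalar version of attentive pooling: each window's query attends over the entire sequence, so the pooled value aggregates non-local context that max or average pooling never sees. The real module operates on 2-D activation maps with patch embedding and multi-head self-attention; this sketch is only a conceptual simplification:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attentive_pool(features, window=2):
    """Downsample a 1-D scalar feature sequence by attention.

    For each window, the window mean acts as a query that attends
    over *all* positions (non-local), and the pooled output is the
    attention-weighted sum of the full sequence.
    """
    pooled = []
    for start in range(0, len(features), window):
        chunk = features[start:start + window]
        query = sum(chunk) / len(chunk)
        weights = softmax([query * f for f in features])  # dot-product scores
        pooled.append(sum(w * f for w, f in zip(weights, features)))
    return pooled
```

With a constant input the attention weights are uniform and the output equals plain averaging; with varied inputs, each pooled value is pulled toward features anywhere in the sequence that align with its query, which is the non-local behavior the paper exploits.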

About the Author
Scott Bair is a Senior Technical Creative Director for Intel Labs, chartered with growing awareness for Intel’s leading-edge research activities, such as AI, Neuromorphic Computing, and Quantum Computing. Scott is responsible for driving marketing strategy, messaging, and asset creation for Intel Labs and its joint-research activities. In addition to his work at Intel, he has a passion for audio technology and is an active father of five children. Scott has over 23 years of experience in the computing industry bringing new products and technology to market. During his 15 years at Intel, he has worked in a variety of roles spanning R&D, architecture, strategic planning, product marketing, and technology evangelism. Scott has an undergraduate degree in Electrical and Computer Engineering and a Master of Business Administration from Brigham Young University.