Scott Bair is a key voice at Intel Labs, sharing insights into innovative research for inventing tomorrow’s technology.
- This year’s European Conference on Computer Vision (ECCV) will be held in Tel Aviv, Israel, from October 23-27, 2022, with some options for virtual attendance.
- Intel Labs’ submission, A Better Baseline for Audio-Visual Diarization, wins first place in the Ego4D Challenge 2022.
- Intel Labs will present seven papers at ECCV 2022.
Hosted this year in the beautiful city of Tel Aviv, Israel, the European Conference on Computer Vision (ECCV) will run from October 23-27, 2022. The conference also offers some options for virtual attendance.
Intel Labs is proud to have achieved first place on the Audio-Visual Diarization (AVD) task of the Ego4D Challenge 2022. The goal of the AVD task is to localize and track each person in a given scene and to associate voices with the people in the scene, including the camera wearer. Labs’ innovative approach, A Better Baseline for Audio-Visual Diarization, presents multiple technical improvements over the official baselines. First, it improves detection of the camera wearer’s voice activity by modifying the model’s training scheme. Second, our team discovered that an off-the-shelf voice activity detection model can effectively remove false positives when applied solely to the camera wearer’s voice activities. Lastly, the method shows that better active speaker detection leads to a better AVD outcome. On the Ego4D test set, our approach obtains a 65.9% diarization error rate (DER), improving on the official baseline of 73.3% and significantly outperforming all other baselines.
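The metric behind this result can be made concrete. DER sums the time spent on false alarms, missed speech, and speaker confusion, divided by the total reference speech time. Below is a minimal frame-level sketch; it is not the challenge's official scorer, which also optimally maps hypothesis speaker labels to reference labels before counting confusions:

```python
def diarization_error_rate(reference, hypothesis):
    """Frame-level DER sketch.

    reference / hypothesis: per-frame speaker labels, with None for silence.
    DER = (false alarm + missed speech + speaker confusion) / reference speech.
    """
    false_alarm = missed = confusion = 0
    total_speech = sum(1 for r in reference if r is not None)
    for r, h in zip(reference, hypothesis):
        if r is None and h is not None:
            false_alarm += 1          # speech hypothesized during silence
        elif r is not None and h is None:
            missed += 1               # real speech not detected
        elif r is not None and h is not None and r != h:
            confusion += 1            # speech attributed to the wrong speaker
    return (false_alarm + missed + confusion) / total_speech
```

For example, a hypothesis that confuses one speech frame and misses another out of three reference speech frames scores a DER of 2/3.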
In addition to presenting the details of this work, Labs will also showcase seven other works at this year’s conference. These include a method for neural video delivery that significantly reduces the cost of training content-aware SR models while achieving even better performance, as well as a novel, self-supervised continual novelty detector that achieves state-of-the-art performance in continual novelty detection, minimizing catastrophic forgetting and error propagation at each task through time. Additionally, Labs details a new camera re-localization solution named SC-wLS, which combines the advantages of feed-forward formulations and scene-coordinate-based methods and delivers significantly better results than prior methods. Read below for more information on all of the published works.
Papers and Abstracts
Since the future of computing is heterogeneous, scalability is a crucial problem for single image super-resolution. Recent works try to train one network, which can be deployed on platforms with different capacities. However, they rely on the pixel-wise sparse convolution, which is not hardware-friendly and achieves limited practical speedup. As images can be divided into patches, which have various restoration difficulties, we present a scalable method based on Adaptive Patch Exiting (APE) to achieve more practical speedup. Specifically, we propose to train a regressor to predict the incremental capacity of each layer for the patch. Once the incremental capacity is below the threshold, the patch can exit at the specific layer. Our method can easily adjust the trade-off between performance and efficiency by changing the threshold of incremental capacity. Furthermore, we propose a novel strategy to enable the network training of our method. We conduct extensive experiments across various backbones, datasets and scaling factors to demonstrate the advantages of our method. Code is available at https://github.com/littlepure2333/APE.
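As a rough illustration of the exiting mechanism described above (not the paper's implementation), the per-patch control flow might look like the following, where `layers` and `regressor` are hypothetical stand-ins for the SR backbone's layers and the learned incremental-capacity regressor:

```python
def adaptive_patch_exiting(patch, layers, regressor, threshold):
    """Sketch of Adaptive Patch Exiting: run one patch through successive
    layers, and stop as soon as the regressor predicts that the next layer's
    incremental capacity (expected quality gain) falls below the threshold."""
    feat = patch
    depth = 0
    for depth, layer in enumerate(layers):
        gain = regressor(feat, depth)   # predicted incremental capacity
        if gain < threshold:
            break                       # this patch exits at this layer
        feat = layer(feat)
    return feat, depth
```

Raising `threshold` makes patches exit earlier (faster, lower quality), lowering it does the opposite, which is the performance/efficiency trade-off the abstract describes.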
Recently, Deep Neural Networks (DNNs) have been utilized to reduce the bandwidth and improve the quality of Internet video delivery. Existing methods train a corresponding content-aware super-resolution (SR) model for each video chunk on the server, and stream low-resolution (LR) video chunks along with SR models to the client. Although they achieve promising results, the huge computational cost of network training limits their practical applications. In this paper, we present a method named Efficient Meta-Tuning (EMT) to reduce the computational cost. Instead of training from scratch, EMT adapts a meta-learned model to the first chunk of the input video. For the following chunks, it fine-tunes a subset of parameters selected by gradient masking of the previously adapted model. In order to achieve further speedup for EMT, we propose a novel sampling strategy to extract the most challenging patches from video frames. The proposed strategy is highly efficient and brings negligible additional cost. Our method significantly reduces the computational cost and achieves even better performance, paving the way for applying neural video delivery techniques to practical applications. We conduct extensive experiments based on various efficient SR architectures, including ESPCN, SRCNN, FSRCNN and EDSR-1, demonstrating the generalization ability of our work. The code is released at https://github.com/Neural-video-delivery/EMT-Pytorch-ECCV2022.
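The gradient-masking idea, fine-tuning only the parameters that mattered most on the previous chunk, can be sketched in a few lines. This is an illustrative simplification, not the released EMT code; `gradient_mask`, `masked_update`, and `keep_ratio` are hypothetical names:

```python
import numpy as np

def gradient_mask(grads, keep_ratio=0.1):
    """Select the fraction of parameters with the largest gradient magnitude
    on the previous chunk; only these get fine-tuned on the next chunk."""
    flat = np.abs(grads).ravel()
    k = max(1, int(keep_ratio * flat.size))
    thresh = np.partition(flat, -k)[-k]   # k-th largest magnitude
    return np.abs(grads) >= thresh

def masked_update(params, grads, mask, lr=0.01):
    """Apply a gradient step only where the mask is True,
    leaving all other parameters frozen."""
    return params - lr * grads * mask
```

Freezing most parameters per chunk is what makes per-chunk adaptation cheap relative to training each chunk's SR model from scratch.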
Novelty detection is a key capability for practical machine learning in the real world, where models operate in non-stationary conditions and are repeatedly exposed to new, unseen data. Yet, most current novelty detection approaches have been developed exclusively for static, offline use. They scale poorly under more realistic, continual learning regimes in which data distribution shifts occur. To address this critical gap, this paper proposes incDFM (incremental Deep Feature Modeling), a self-supervised continual novelty detector. The method builds a statistical model over the space of intermediate features produced by a deep network, and utilizes feature reconstruction errors as uncertainty scores to guide the detection of novel samples. Most importantly, incDFM estimates the statistical model incrementally (via several iterations within a task) instead of in a single shot. At each iteration, it selects only the most confident novel samples, which then guide subsequent recruitment. For a given task in which the ML model encounters a mixture of old and novel data, the detector flags novel samples so they can be incorporated into old knowledge. The detector is then updated with the flagged novel samples, in preparation for the next task. To quantify and benchmark performance, we adapted multiple datasets for continual learning: CIFAR-10, CIFAR-100, SVHN, iNaturalist, and the 8-dataset. Our experiments show that incDFM achieves state-of-the-art continual novelty detection performance. Furthermore, when examined in the greater context of continual learning for classification, our method is successful in minimizing catastrophic forgetting and error propagation.
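The core scoring mechanism, feature reconstruction error against a statistical model of known data, can be illustrated with a simple linear subspace model. This is only a stand-in for incDFM's actual per-class feature modeling and iterative recruitment:

```python
import numpy as np

def fit_subspace(features, n_components=2):
    """Fit a low-rank subspace (via SVD) to deep features of known data;
    a toy stand-in for incDFM's statistical feature model."""
    mean = features.mean(axis=0)
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:n_components]

def novelty_scores(features, mean, basis):
    """Feature reconstruction error: distance from each feature vector to
    the fitted subspace. A high residual suggests a novel sample."""
    centered = features - mean
    recon = centered @ basis.T @ basis   # project onto subspace, reconstruct
    return np.linalg.norm(centered - recon, axis=1)
```

Samples resembling old data reconstruct almost perfectly (near-zero score), while out-of-distribution samples leave a large residual; thresholding or ranking these scores drives the incremental selection of confident novel samples.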
Active Speaker Detection (ASD) in videos with multiple speakers is a challenging task as it requires learning effective audiovisual features and spatial-temporal correlations over long temporal windows. In this paper, we present SPELL, a novel spatial-temporal graph learning framework that can solve complex tasks such as ASD. To this end, each person in a video frame is first encoded in a unique node for that frame. Nodes corresponding to a single person across frames are connected to encode their temporal dynamics. Nodes within a frame are also connected to encode inter-person relationships. Thus, SPELL reduces ASD to a node classification task. Importantly, SPELL is able to reason over long temporal contexts for all nodes without relying on computationally expensive fully connected graph neural networks. Through extensive experiments on the AVA-ActiveSpeaker dataset, we demonstrate that learning graph-based representations can significantly improve the active speaker detection performance owing to its explicit spatial and temporal structure. SPELL outperforms all previous state-of-the-art approaches while requiring significantly lower memory and computational resources. Our code is publicly available.
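The graph construction that reduces ASD to node classification can be sketched as follows. This toy version (not the released code) connects detections of the same person across nearby frames (temporal edges) and different people within the same frame (spatial edges):

```python
def build_spell_graph(detections, temporal_window=2):
    """Sketch of SPELL-style graph construction.

    detections: list of (frame, person_id) tuples; each becomes one node.
    Edges connect the same person across frames within temporal_window
    (temporal dynamics) and distinct people sharing a frame (inter-person)."""
    nodes = list(detections)
    edges = set()
    for i, (f1, p1) in enumerate(nodes):
        for j, (f2, p2) in enumerate(nodes):
            if i >= j:
                continue
            same_person_nearby = p1 == p2 and 0 < abs(f1 - f2) <= temporal_window
            same_frame = f1 == f2 and p1 != p2
            if same_person_nearby or same_frame:
                edges.add((i, j))
    return nodes, sorted(edges)
```

Because each node connects only to a bounded neighborhood rather than to every other node, message passing over this graph stays cheap even for long videos, which is the efficiency point the abstract makes against fully connected alternatives.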
Multi-View 3D Morphable Face Reconstruction via Canonical Volume Fusion
Due to the capability of easy animation and editing of faces, the 3D Morphable Model (3DMM) is widely used in the task of face reconstruction. Recent methods recover 3DMM coefficients by fusing the information from a set of multi-view images via end-to-end Convolutional Neural Networks (CNNs), alleviating the inherent depth ambiguity of the single-view setting. However, most of these methods fuse global features of all views to regress the 3D morphable face, without considering the dense correspondences of multi-view images. In this paper, we propose a novel approach to reconstruct high-quality 3D morphable faces. We first use a canonical feature volume to fuse multiple view features in 3D space, which establishes dense correspondences between different views. Next, to bridge the gap between CNN regression and pixel-wise optimization and further leverage the multi-view information, we propose test-time optimization to improve the regressed results with negligible additional cost. Our method achieves state-of-the-art performance on widely used benchmarks, demonstrating the effectiveness of our approach. Code will be released.
Visual re-localization aims to recover camera poses in a known environment, which is vital for applications like robotics or augmented reality. Feed-forward absolute camera pose regression methods directly output poses by a network, but suffer from low accuracy. Meanwhile, scene coordinate based methods are accurate, but need iterative RANSAC post-processing, which brings challenges to efficient end-to-end training and inference. In order to have the best of both worlds, we propose a feed-forward method termed SC-wLS that exploits all scene coordinate estimates for weighted least squares pose regression. This differentiable formulation exploits a weight network imposed on 2D-3D correspondences, and requires pose supervision only. Qualitative results demonstrate the interpretability of learned weights. Evaluations on the 7 Scenes and Cambridge datasets show significantly improved performance when compared with former feed-forward counterparts. Moreover, our SC-wLS method enables a new capability: self-supervised test-time adaptation on the weight network. Code and models are publicly available.
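At the heart of SC-wLS is a differentiable weighted least squares solve over correspondences, with per-correspondence weights predicted by a network. Stripped of the camera-geometry specifics, that core step looks like this (an illustrative sketch, not the paper's implementation):

```python
import numpy as np

def weighted_least_squares(A, b, w):
    """Solve argmin_x sum_i w_i * (A_i @ x - b_i)^2 in closed form via the
    normal equations. In SC-wLS the rows encode 2D-3D scene-coordinate
    correspondence constraints and w comes from the learned weight network;
    here they are generic linear constraints."""
    W = np.diag(w)
    return np.linalg.solve(A.T @ W @ A, A.T @ W @ b)
```

Because the solve is a closed-form differentiable expression rather than an iterative RANSAC loop, gradients can flow back into the weight network, so the network learns to down-weight outlier correspondences end to end.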
Morphable models are essential for the statistical modeling of 3D faces. Previous works on morphable models mostly focus on large-scale facial geometry but ignore facial details. This paper augments morphable models in representing facial details by learning a Structure-aware Editable Morphable Model (SEMM). SEMM introduces a detail structure representation based on the distance field of wrinkle lines, jointly modeled with detail displacements to establish better correspondences and enable intuitive manipulation of wrinkle structure. Besides, SEMM introduces two transformation modules that translate expression blendshape weights and age values into changes in latent space, allowing effective semantic detail editing while maintaining identity. Extensive experiments demonstrate that the proposed model compactly represents facial details, outperforms previous methods in expression animation qualitatively and quantitatively, and achieves effective age editing and wrinkle line editing of facial details. Code and model are available at https://github.com/gerwang/facial-detail-manipulation.
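The distance-field representation of wrinkle lines can be illustrated with a brute-force toy version: each pixel stores its Euclidean distance to the nearest point sampled from the wrinkle lines. SEMM learns this representation jointly with detail displacements; the sketch below only conveys the underlying idea:

```python
import numpy as np

def distance_field(height, width, line_points):
    """Brute-force distance field: for every pixel (y, x), the Euclidean
    distance to the nearest sampled wrinkle-line point in line_points."""
    ys, xs = np.mgrid[0:height, 0:width]
    grid = np.stack([ys, xs], axis=-1).astype(float)    # (H, W, 2)
    pts = np.asarray(line_points, dtype=float)          # (N, 2)
    diffs = grid[:, :, None, :] - pts[None, None, :, :] # (H, W, N, 2)
    return np.sqrt((diffs ** 2).sum(-1)).min(-1)        # (H, W)
```

Encoding a wrinkle line as a smooth scalar field like this (rather than as a raster mask) is what makes the structure easy to deform and edit continuously.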