Scott Bair is a key voice at Intel Labs, sharing insights into innovative research for inventing tomorrow’s technology.
Highlights:
- The 2023 IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR) will run from June 18th through the 22nd in Vancouver, Canada.
- Intel presents six main conference papers this year, including a slice attention network that exploits height information for camera-based 3D object detection, a content-aware bit mapping method that removes the bit selector from super-resolution networks without any performance loss, a Permutation Straight-Through Estimator (PSTE) for compacting binary neural networks, a high-quality ‘neural rate-estimator’ for split-DNN computing, a sparse video-text architecture, and a framework for generating less biased scene graphs.
- The poster for Intel’s Latent Diffusion Model for 3D (LDM3D) also won the Best Poster Award at the 3DMV workshop.
- Kyle Min, a Research Scientist at Intel Labs, placed 1st on the leaderboard for the Ego4D AV Diarization challenge and 3rd for the Ego4D AV Transcription challenge, and was invited to give a spotlight talk at the workshop.
- Ilke Demir, a Senior Staff Research Scientist at Intel, will give a keynote talk at the annual mentorship dinner of the Women in Computer Vision Workshop (WiCV).
- Paula Ramos, an AI Evangelist at Intel, was invited to give a keynote speech at the Agriculture and Vision Workshop.
The IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR) is a premier computer vision event. This year, the conference will have a single track from June 18th through the 22nd in Vancouver, Canada. Additionally, all plenary events will be streamed, and the virtual platform will host videos, posters, and a chat room for every paper. Intel is pleased to debut six main conference papers, including a slice attention network that exploits height information for camera-based 3D object detection, a content-aware bit mapping method that removes the bit selector from super-resolution networks without any performance loss, and a Permutation Straight-Through Estimator (PSTE) that can both optimize the codeword selection process end-to-end and maintain the non-repetitive occupancy of the selected codewords. Intel researchers also present a high-quality ‘neural rate-estimator,’ a framework for generating less biased scene graphs, and a sparse video-text architecture that performs multi-frame reasoning at a significantly lower cost than naive transformers with dense attention.
In addition to the papers and a demo, Intel has many other exciting contributions at this year’s conference. First, Kyle Min, a Research Scientist at Intel Labs, placed on the leaderboard in two challenges at the third international Ego4D workshop. He achieved 1st rank in the Audio-Visual (AV) Diarization Challenge and 3rd rank in the AV Transcription Challenge, and was invited to give a spotlight talk at the workshop. Furthermore, two other Intel employees were invited to speak at conference workshops: Ilke Demir, a Senior Staff Research Scientist at Intel, will give a keynote talk at the annual mentorship dinner of the Women in Computer Vision Workshop (WiCV), and Paula Ramos, an AI Evangelist at Intel, will deliver a keynote speech at the Agriculture and Vision Workshop.
Conference Papers
BEV-SAN: Accurate BEV 3D Object Detection via Slice Attention Networks
Bird's-Eye-View (BEV) 3D object detection is a crucial multi-view technique for autonomous driving systems. Existing methods aggregate multi-view camera features into a flattened grid to construct the BEV feature. However, flattening the BEV space along the height dimension fails to emphasize the informative features of different heights. This paper proposes a novel method named BEV Slice Attention Network (BEV-SAN) to exploit the intrinsic characteristics of different heights. Instead of flattening the BEV space, the team first samples along the height dimension to build global and local BEV slices. The features of the BEV slices are then aggregated from the camera features and merged by an attention mechanism. Finally, the merged local and global BEV features are fused by a transformer to generate the final feature map for the task heads. The purpose of the local BEV slices is to emphasize informative heights. To find them, the team further proposes a LiDAR-guided sampling strategy that leverages the statistical distribution of LiDAR points to determine the heights of the local slices. Compared with uniform sampling, LiDAR-guided sampling selects more informative heights.
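To make the slice-and-attend idea concrete, here is a minimal PyTorch sketch that assumes a voxelized BEV volume of shape (B, C, Z, X, Y) and illustrative slice ranges; it is not the authors' implementation, and the module and argument names are ours.

```python
# Minimal, illustrative sketch of the slice-attention idea behind BEV-SAN (not the paper's code).
import torch
import torch.nn as nn

class SliceAttentionFusion(nn.Module):
    """Pool a height-aware BEV volume into slices and merge them with attention."""

    def __init__(self, channels: int, slice_ranges: list, num_heads: int = 4):
        super().__init__()
        # slice_ranges: (start, end) height-bin indices; local slices would be chosen from the
        # LiDAR height histogram, while a global slice covers the full height range.
        self.slice_ranges = slice_ranges
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.fuse = nn.TransformerEncoderLayer(channels, num_heads, batch_first=True)

    def forward(self, voxel_bev: torch.Tensor) -> torch.Tensor:
        # voxel_bev: (B, C, Z, X, Y) camera features lifted onto a BEV grid with Z height bins
        b, c, _, x, y = voxel_bev.shape
        # 1) Build one BEV slice per height range by pooling over its height bins
        slices = [voxel_bev[:, :, s:e].mean(dim=2) for s, e in self.slice_ranges]   # each (B, C, X, Y)
        tokens = torch.stack([s.flatten(2).mean(-1) for s in slices], dim=1)        # (B, S, C)
        # 2) Merge the slices with attention and turn the result into per-slice weights
        merged, _ = self.attn(tokens, tokens, tokens)                               # (B, S, C)
        weights = merged.softmax(dim=1)
        bev = torch.zeros_like(slices[0])
        for i, s in enumerate(slices):
            bev = bev + weights[:, i, :, None, None] * s                            # weighted sum of slices
        # 3) Fuse the spatial tokens of the merged map with a lightweight transformer
        fused = self.fuse(bev.flatten(2).transpose(1, 2))                           # (B, X*Y, C)
        return fused.transpose(1, 2).reshape(b, c, x, y)

# Example: three slices over 8 height bins -- one global (0, 8) and two local bands
fusion = SliceAttentionFusion(channels=64, slice_ranges=[(0, 8), (2, 4), (4, 6)])
out = fusion(torch.randn(2, 64, 8, 32, 32))   # -> (2, 64, 32, 32)
```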
CABM: Content-Aware Bit Mapping for Single Image Super-Resolution Network with Large Input
With the development of high-definition display devices, practical Super-Resolution (SR) scenarios increasingly require super-resolving large inputs, such as 2K images, to higher resolutions (4K/8K). Current mixed-precision methods train an MLP bit selector to determine the proper bit for each layer. However, they sample subnets uniformly during training, so simple subnets become overfitted and complicated subnets remain underfitted; as a result, the trained bit selector fails to determine the optimal bit. In addition, the bit selector adds extra cost to each layer of the SR network. This paper proposes a novel method named Content-Aware Bit Mapping (CABM), which removes the bit selector without any performance loss. CABM also learns a bit selector for each layer during training. After training, the team analyzed the relation between the edge information of an input patch and the bit chosen for each layer, and observed that edge information is an effective proxy for the selected bit. They therefore designed a strategy that builds an Edge-to-Bit lookup table mapping the edge score of a patch to the bit of each layer during inference; the bit configuration of the SR network is then determined by the lookup tables of all layers. This strategy finds better bit configurations, resulting in more efficient mixed-precision networks. The team also conducted detailed experiments to demonstrate the generalization ability of the method.
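The core inference-time idea, scoring a patch's edge strength and looking up a per-layer bit configuration instead of running an MLP selector, can be sketched as follows; the edge metric, binning scheme, and function names are assumptions for illustration, not the paper's code.

```python
# Illustrative sketch of the Edge-to-Bit lookup idea in CABM (not the paper's implementation).
import numpy as np

def edge_score(patch: np.ndarray) -> float:
    """Mean gradient magnitude of a grayscale patch as a simple edge-strength metric."""
    gy, gx = np.gradient(patch.astype(np.float32))
    return float(np.hypot(gx, gy).mean())

def build_edge_to_bit_table(scores, bits_per_layer, num_bins=64):
    """Map edge-score bins to the bits chosen by the training-time bit selector.

    scores:          edge scores of calibration patches, shape (N,)
    bits_per_layer:  selector decisions for those patches, shape (N, L) for L layers
    Returns (bin_edges, table), where table[b, l] is the bit for bin b and layer l.
    """
    scores = np.asarray(scores)
    bits_per_layer = np.asarray(bits_per_layer)
    bin_edges = np.quantile(scores, np.linspace(0, 1, num_bins + 1))
    bin_ids = np.clip(np.searchsorted(bin_edges, scores, side="right") - 1, 0, num_bins - 1)
    table = np.zeros((num_bins, bits_per_layer.shape[1]), dtype=np.int32)
    for b in range(num_bins):
        members = bits_per_layer[bin_ids == b]
        src = members if len(members) else bits_per_layer   # fall back to global stats for empty bins
        table[b] = np.median(src, axis=0)
    return bin_edges, table

def bits_for_patch(patch, bin_edges, table):
    """At inference, replace the MLP bit selector with a single table lookup."""
    b = int(np.clip(np.searchsorted(bin_edges, edge_score(patch), side="right") - 1, 0, len(table) - 1))
    return table[b]   # per-layer bit configuration for this patch
```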
Compacting Binary Neural Networks by Sparse Kernel Selection
Binary Neural Networks (BNNs) represent convolution weights with 1-bit values, which enhances the efficiency of storage and computation. This paper is motivated by a previously revealed phenomenon: the binary kernels in successful BNNs are nearly power-law distributed, with their values mostly clustered into a small number of codewords. This observation encourages compacting typical BNNs, while retaining close to the original performance, by learning non-repetitive kernels within a binary kernel subspace. Specifically, the team regarded binarization as kernel grouping with respect to a binary codebook, so the task becomes learning to select a smaller subset of codewords from the full codebook. They then leverage the Gumbel-Sinkhorn technique to approximate the codeword selection process and develop the Permutation Straight-Through Estimator (PSTE), which can both optimize the selection process end-to-end and maintain the non-repetitive occupancy of the selected codewords. Experiments verify that this method reduces both model size and bit-wise computational costs, and achieves accuracy improvements over state-of-the-art BNNs under comparable budgets.
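The two ingredients named above can be sketched in a few lines: a Gumbel-Sinkhorn relaxation that produces an approximately doubly-stochastic soft permutation over the codebook, and a straight-through step that uses a hard selection in the forward pass while letting gradients flow through the soft one. This is a simplified illustration, not the paper's PSTE implementation (a Hungarian assignment, for instance, would be needed to guarantee a strict permutation).

```python
# Simplified sketch of Gumbel-Sinkhorn plus a straight-through estimator (not the paper's PSTE code).
import torch

def gumbel_sinkhorn(log_alpha: torch.Tensor, tau: float = 1.0, n_iters: int = 20) -> torch.Tensor:
    """Relax an (n, n) score matrix into an approximately doubly-stochastic soft permutation."""
    gumbel = -torch.log(-torch.log(torch.rand_like(log_alpha) + 1e-20) + 1e-20)
    log_p = (log_alpha + gumbel) / tau
    for _ in range(n_iters):  # alternate row/column normalization in log space (Sinkhorn iterations)
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)
    return log_p.exp()

def straight_through(soft_perm: torch.Tensor) -> torch.Tensor:
    """Hard row-wise selection in the forward pass, gradients of the soft permutation in the backward pass."""
    hard = torch.zeros_like(soft_perm)
    hard[torch.arange(soft_perm.size(0)), soft_perm.argmax(dim=1)] = 1.0
    return hard + soft_perm - soft_perm.detach()

# Reorder a small 1-bit codebook and keep only the first k codewords: the selection stays
# non-repetitive (rows of a permutation are distinct) yet remains trainable end-to-end.
codebook = torch.sign(torch.randn(16, 9))          # 16 binary 3x3 codewords, flattened
scores = torch.randn(16, 16, requires_grad=True)   # learnable permutation scores
selected = (straight_through(gumbel_sinkhorn(scores)) @ codebook)[:4]   # k = 4 codewords
```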
Thanks to advances in computer vision and AI, there has been large growth in demand for cloud-based visual analytics, in which images captured by a low-powered edge device are transmitted to the cloud for analysis. Using conventional codecs (JPEG, MPEG, HEVC, etc.) to compress such data introduces artifacts that can seriously degrade the performance of downstream analytic tasks. Split-DNN computing has emerged as a paradigm to address such usages: a DNN is partitioned into a client-side portion and a server-side portion, and low-complexity neural networks called ‘bottleneck units’ are introduced at the split point to transform the intermediate-layer features into a lower-dimensional representation better suited for compression and transmission. Optimizing the pipeline for both compression and task performance requires high-quality estimates of the information-theoretic rate of the intermediate features, yet most works on compression for image analytics rely on heuristic rate estimates, leading to suboptimal performance. This paper proposes a high-quality ‘neural rate-estimator’ to address this gap. The researchers interpret the lower-dimensional bottleneck output as a latent representation of the intermediate feature and cast the rate-distortion optimization problem as training an equivalent variational auto-encoder with an appropriate loss function. They show that this leads to improved rate-distortion outcomes, and further demonstrate that replacing supervised loss terms (such as cross-entropy loss) with distillation-based losses in a teacher-student framework allows unsupervised training of bottleneck units without explicit training labels. This makes the method attractive for real-world deployments where access to labeled training data is difficult or expensive. The proposed method outperforms several state-of-the-art methods, obtaining improved task accuracy at lower bit rates on image classification and semantic segmentation tasks.
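As a rough illustration of the idea, the sketch below treats the bottleneck output as a latent code whose rate is estimated as its negative log-likelihood under a learned factorized Gaussian prior, with additive uniform noise standing in for quantization. This is a simplified stand-in, not the paper's entropy model or training recipe.

```python
# Simplified sketch of rate estimation for a split-DNN bottleneck (illustrative assumptions only).
import math
import torch
import torch.nn as nn

class BottleneckWithRate(nn.Module):
    """Client/server bottleneck unit that also returns an estimate of its transmission rate."""

    def __init__(self, in_ch: int, latent_ch: int):
        super().__init__()
        self.encode = nn.Conv2d(in_ch, latent_ch, kernel_size=1)   # client-side dimensionality reduction
        self.decode = nn.Conv2d(latent_ch, in_ch, kernel_size=1)   # server-side restoration
        self.prior_mean = nn.Parameter(torch.zeros(latent_ch))     # learned per-channel prior
        self.prior_logstd = nn.Parameter(torch.zeros(latent_ch))

    def forward(self, feature: torch.Tensor):
        z = self.encode(feature)
        z_noisy = z + torch.rand_like(z) - 0.5                     # uniform noise as a quantization proxy
        mean = self.prior_mean[None, :, None, None]
        logstd = self.prior_logstd[None, :, None, None]
        # negative log-likelihood under the prior, converted from nats to bits
        nll = 0.5 * ((z_noisy - mean) / logstd.exp()) ** 2 + logstd + 0.5 * math.log(2 * math.pi)
        rate_bits = nll.sum(dim=(1, 2, 3)).mean() / math.log(2.0)
        return self.decode(z_noisy), rate_bits
```

Training would then minimize a weighted sum of this rate term and a distillation loss between the split and unsplit ("teacher") network outputs, which is what removes the need for labels in the unsupervised setting described above.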
SViTT: Temporal Learning of Sparse Video-Text Transformers
Do video-text transformers learn to model temporal relationships across frames? Despite their immense capacity and the abundance of multimodal training data, recent work has revealed a strong tendency of video-text models toward frame-based spatial representations, while temporal reasoning remains largely unsolved. This work identifies several key challenges in temporal learning of video-text transformers: the spatiotemporal trade-off imposed by limited network size, the curse of dimensionality in multi-frame modeling, and the diminishing returns of semantic information as clip length grows. Guided by these findings, the team proposes SViTT, a sparse video-text architecture that performs multi-frame reasoning at a significantly lower cost than naive transformers with dense attention. Analogous to graph-based networks, SViTT employs two forms of sparsity: edge sparsity, which limits query-key communication between tokens in self-attention, and node sparsity, which discards uninformative visual tokens. Trained with a curriculum that increases model sparsity with clip length, SViTT outperforms dense transformer baselines on multiple video-text retrieval and question-answering benchmarks at a fraction of the computational cost.
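The two forms of sparsity can be pictured with a short sketch: node sparsity keeps only the most salient visual tokens, and edge sparsity restricts which keys each query may attend to. The saliency proxy (attention received from the [CLS] token) and the local-window mask below are illustrative choices, not the SViTT implementation.

```python
# Illustrative sketch of node and edge sparsity for a video-text transformer (not SViTT's code).
import torch

def node_sparsify(tokens: torch.Tensor, cls_attn: torch.Tensor, keep_ratio: float = 0.5):
    """Keep the top-k visual tokens ranked by attention received from the [CLS] token.

    tokens:   (B, N, C) visual tokens
    cls_attn: (B, N) attention weights from [CLS] to each token (a common saliency proxy)
    """
    k = max(1, int(tokens.size(1) * keep_ratio))
    idx = cls_attn.topk(k, dim=1).indices                          # (B, k) indices of kept tokens
    return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, tokens.size(2)))

def edge_sparsity_mask(num_tokens: int, window: int = 8) -> torch.Tensor:
    """Boolean mask allowing each query to attend only to keys within a local window."""
    pos = torch.arange(num_tokens)
    return (pos[None, :] - pos[:, None]).abs() <= window           # (N, N), pass as an attention mask
```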
Unbiased Scene Graph Generation in Video
Dynamic scene graph generation (SGG) from video is complicated and challenging due to the inherent dynamics of a scene, temporal fluctuations in model predictions, and the long-tailed distribution of visual relationships, in addition to the challenges already present in image-based SGG. Existing methods for dynamic SGG have focused primarily on capturing spatio-temporal context with complex architectures without addressing these challenges, especially the long-tailed distribution of relationships, which often leads to biased scene graphs. To address them, this paper introduces a new framework called TEMPURA: TEmporal consistency and Memory Prototype guided UnceRtainty Attenuation for unbiased dynamic SGG. TEMPURA enforces object-level temporal consistency via transformer-based sequence modeling, learns to synthesize unbiased relationship representations using memory-guided training, and attenuates the predictive uncertainty of visual relations using a Gaussian Mixture Model (GMM). Extensive experiments demonstrate that the method achieves significant performance gains (up to 10% in some cases) over existing methods, highlighting its superiority in generating less biased scene graphs.
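TEMPURA's exact formulation is not reproduced here, but the effect of uncertainty attenuation can be illustrated with a standard heteroscedastic classification loss (cf. Kendall and Gal, 2017), in which relation instances with high predicted variance contribute less to training.

```python
# Not TEMPURA's formulation: a generic uncertainty-attenuated classification loss for illustration.
import torch
import torch.nn.functional as F

def uncertainty_attenuated_ce(logits: torch.Tensor, log_var: torch.Tensor, target: torch.Tensor):
    """Cross-entropy scaled by predicted (aleatoric) uncertainty.

    logits:  (B, R) relationship scores
    log_var: (B,)   predicted log-variance per relation instance
    target:  (B,)   ground-truth relationship labels
    """
    ce = F.cross_entropy(logits, target, reduction="none")   # per-instance loss
    precision = torch.exp(-log_var)
    # High-uncertainty instances are down-weighted, but large variances are penalized
    return (precision * ce + 0.5 * log_var).mean()
```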
Conference Demos
LDM3D: Latent Diffusion Model for 3D
This research paper proposes a Latent Diffusion Model for 3D (LDM3D) that generates both image and depth map data from a given text prompt, allowing users to generate RGBD images from text prompts. The LDM3D model is fine-tuned on a dataset of tuples containing an RGB image, depth map and caption, and validated through extensive experiments. Researchers also developed an application called DepthFusion, which uses the generated RGB images and depth maps to create immersive and interactive 360-degree-view experiences using TouchDesigner. This technology has the potential to transform a wide range of industries, from entertainment and gaming to architecture and design. Overall, this paper presents a significant contribution to the field of generative AI and computer vision, and showcases the potential of LDM3D and DepthFusion to revolutionize content creation and digital experiences. A short video summarizing the approach can be found here. The poster for this paper also won the Best Poster Award at the 3DMV workshop.
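For readers who want to try the model, a minimal usage sketch is shown below. It assumes the Hugging Face diffusers integration of LDM3D (StableDiffusionLDM3DPipeline) and the "Intel/ldm3d" checkpoint; class and attribute names may differ across library versions.

```python
# Minimal usage sketch, assuming the diffusers LDM3D pipeline and the Intel/ldm3d checkpoint.
import torch
from diffusers import StableDiffusionLDM3DPipeline

pipe = StableDiffusionLDM3DPipeline.from_pretrained("Intel/ldm3d", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

out = pipe("a cozy reading nook with warm afternoon light")
out.rgb[0].save("nook_rgb.png")      # generated color image
out.depth[0].save("nook_depth.png")  # corresponding depth map
```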
Streamlining Quality Control: A Guide to Automated Defect Detection with Anomalib
The quality control and quality assurance processes are important for businesses to maintain their reputation and provide a good customer experience. Many industries are using automated anomaly detection through computer vision and deep learning technology to avoid errors and improve efficiency. However, for AI to work effectively, it needs balanced datasets, and sometimes the available data is not enough for accurate predictions in industries such as manufacturing and healthcare. Additionally, with large-scale manufacturing and industrial automation, it is becoming more challenging for quality inspectors to manage large quantities of products.
In this demo, a camera and robotic arm detect defective colored cubes and prevent them from entering the production line. An anomaly detection model is needed, but there is no hardware accelerator, limited data for training, and few expected defects. The goal is fast and accurate training, with the ability to retrain when external conditions change. The Anomalib library was used to design, implement, and deploy unsupervised anomaly detection models, from data collection to the edge, meeting all of these requirements.
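A minimal sketch of that workflow is shown below, written against the 1.x-style Anomalib API (Folder datamodule, PaDiM model, Engine). The class names, arguments, and dataset layout are assumptions and may differ from the demo's actual setup and between Anomalib releases.

```python
# Minimal sketch of an unsupervised Anomalib workflow (assumed 1.x-style API; details may vary).
from anomalib.data import Folder
from anomalib.models import Padim
from anomalib.engine import Engine

# Unsupervised setup: train on "good" cube images only; defects are flagged as anomalies.
datamodule = Folder(
    name="colored_cubes",
    root="datasets/cubes",
    normal_dir="good",
    abnormal_dir="defect",   # used for evaluation only
)
model = Padim()              # lightweight model that trains quickly without a hardware accelerator
engine = Engine()
engine.fit(model=model, datamodule=datamodule)
engine.test(model=model, datamodule=datamodule)
# For edge deployment, the trained model can then be exported (e.g., to OpenVINO).
```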
Workshops and Workshop Papers
The large number of ReLU and MAC operations in deep neural networks makes them ill-suited for latency- and compute-efficient private inference. This paper presents a model optimization method that allows a model to learn to be shallow. In particular, the work leverages the ReLU sensitivity of a convolutional block to remove a ReLU layer and merge its preceding and succeeding convolution layers into a shallow block. Unlike existing ReLU reduction methods, this joint reduction method can yield models with improved reductions in both ReLUs and linear operations, by up to 1.73x and 1.47x respectively, evaluated with ResNet18 on CIFAR-100 without any significant accuracy drop.
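The paper's joint reduction method is not reproduced here, but the fact it builds on is easy to check: once the ReLU between two convolutions is removed, they compose into a single, shallower linear layer. The toy sketch below verifies this for 1x1 convolutions, where the merged kernel is just a channel-wise matrix product.

```python
# Toy check (not the paper's method): without a ReLU in between, two convolutions merge into one.
import torch
import torch.nn as nn

conv1 = nn.Conv2d(8, 16, kernel_size=1, bias=False)
conv2 = nn.Conv2d(16, 4, kernel_size=1, bias=False)
merged = nn.Conv2d(8, 4, kernel_size=1, bias=False)

with torch.no_grad():
    # Compose the two pointwise kernels: (4,16) @ (16,8) -> (4,8), then restore the 1x1 spatial dims
    w = conv2.weight.squeeze(-1).squeeze(-1) @ conv1.weight.squeeze(-1).squeeze(-1)
    merged.weight.copy_(w.unsqueeze(-1).unsqueeze(-1))

x = torch.randn(1, 8, 32, 32)
assert torch.allclose(conv2(conv1(x)), merged(x), atol=1e-5)   # identical outputs, one layer fewer
```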
The Fifth Workshop on Deep Learning for Geometric Computing
Computer vision approaches have made tremendous progress toward understanding shape from various data formats, especially since entering the deep learning era. Although accurate results have been obtained in detection, recognition, and segmentation, there has been less attention and research on extracting topological and geometric information from shapes. These geometric representations provide compact and intuitive abstractions for modeling, synthesis, compression, matching, and analysis, and extracting them differs significantly from segmentation and recognition tasks because such representations capture both local and global information about a shape. To advance the state of the art in topological and geometric shape analysis using deep learning, this workshop gathers researchers from computer vision, computational geometry, computer graphics, and machine learning for “Deep Learning for Geometric Computing” at CVPR 2023. The workshop includes competitions, proceedings, keynotes, paper presentations, and a fair and diverse environment for brainstorming about future research collaborations.