Zhipeng Cai is a research scientist at Intel Labs. His research mainly focuses on fundamental computer vision and machine learning problems such as robust geometric perception, neural rendering, generative models, and continual learning.
Highlights
- Researchers at Intel Labs, in collaboration with Xiamen University, have presented LiSA, the first semantic-aware AI framework for highly accurate 3D visual localization.
- LiSA can leverage pre-trained semantic segmentation models to significantly improve state-of-the-art 3D visual localization accuracy, without introducing computational overhead during inference.
- LiSA was presented at CVPR 2024 as a highlight paper, an award given to the top 3.6% of conference papers.
Researchers at Intel Labs, in collaboration with Xiamen University, presented LiDAR Localization with Semantic Awareness (LiSA) at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024) as a highlight paper, an award given to only 3.6% of conference papers. LiSA is the first method that incorporates semantic awareness into scene coordinate regression (SCR) to boost robustness and accuracy in LiDAR localization, a fundamental task in robotics and computer vision that estimates the pose of a LiDAR point cloud within a global map. To avoid extra computation or network parameters during inference, LiSA distills knowledge from a segmentation model into the original SCR network. Experiments show that LiSA outperforms state-of-the-art methods on standard LiDAR localization benchmarks. Applying knowledge distillation not only preserves high efficiency but also achieves higher localization accuracy than introducing extra semantic segmentation modules.
Figure 1. Example outputs of different methods. LiSA can accurately and stably track the trajectory.
One of the most fundamental computer vision problems is localizing visual inputs within a pre-defined map. Visual localization aims to solve this challenge, but existing artificial intelligence (AI) models for visual localization can be sensitive to certain objects, such as dynamic objects and repeated structures, which distract the model and reduce accuracy. However, hard-coding which object types to ignore is not effective, as shown in Figure 2.
Figure 2. Impact of semantic information on LiDAR localization. Filtering out objects from different classes can significantly reduce or increase the position error. However, the noise in the semantic labels makes it hard to consistently improve both rotation and translation accuracy with point filtering.
LiSA addresses this challenge by leveraging diffusion-based knowledge distillation to transfer semantic knowledge from a pre-trained segmentation model during training, so that the localization network can adaptively ignore distracting objects. Because this knowledge transfer happens only during training, the segmentation model is no longer needed at inference time, and the significant accuracy improvement of LiSA comes with no extra inference overhead.
Existing Approaches
Localization methods can be roughly categorized as 1) non-regression-based and 2) regression-based. Non-regression-based approaches leverage retrieval and matching to perform localization, which is time- and memory-consuming and requires pre-stored map information. Regression-based approaches let a neural network learn to directly regress the camera pose from visual inputs, which is much faster and more memory efficient; a generic sketch of this pipeline is shown below. However, regression-based approaches normally suffer from generalization issues caused by distracting objects, since the training is done per scene.
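To make the contrast concrete, here is a minimal sketch of a generic SCR-style localization loop, assuming a per-scene network that regresses a global "scene coordinate" for each input point and a RANSAC + Kabsch solver that recovers the rigid pose from the resulting correspondences. All names are illustrative; this is not the LiSA or SGLoc implementation.

```python
# Generic scene-coordinate-regression localization sketch (illustrative only).
import torch

def kabsch(src: torch.Tensor, dst: torch.Tensor):
    """Least-squares rigid transform (R, t) mapping src -> dst, both (N, 3)."""
    src_c, dst_c = src.mean(0), dst.mean(0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = torch.linalg.svd(H)
    d = torch.sign(torch.linalg.det(Vt.T @ U.T)).item()  # fix reflections
    D = torch.diag(torch.tensor([1.0, 1.0, d], dtype=src.dtype))
    R = Vt.T @ D @ U.T
    t = dst_c - R @ src_c
    return R, t

def localize(points: torch.Tensor, regressor, iters=256, thresh=0.5):
    """points: (N, 3) LiDAR scan; regressor: net predicting (N, 3) scene coords."""
    with torch.no_grad():
        scene = regressor(points)              # predicted global coordinates
    best_R, best_t, best_inliers = None, None, -1
    for _ in range(iters):                     # RANSAC over minimal 3-point samples
        idx = torch.randperm(points.shape[0])[:3]
        R, t = kabsch(points[idx], scene[idx])
        residual = (points @ R.T + t - scene).norm(dim=1)
        inliers = (residual < thresh).sum().item()
        if inliers > best_inliers:
            best_R, best_t, best_inliers = R, t, inliers
    return best_R, best_t
```

Because only a forward pass and a lightweight solver run at test time, no retrieval database or pre-stored map needs to be kept in memory.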
Semantic-Aware Visual Localization
LiSA aims to enhance the robustness and accuracy of 3D visual localization using LiDAR scan frames as inputs, which are commonly used in self-driving applications. The overall framework of LiSA is shown in Figure 3.
Figure 3. The LiSA pipeline consists of three modules: scene coordinate regression, semantic segmentation (frozen), and knowledge distillation. In the scene coordinate regression module, the coordinate regression head in the regressor directly outputs scene coordinates P′, and the semantic feature regression head learns semantic segmentation features (Fstu) from the knowledge distillation module and the semantic segmentation module. After distilling the semantic knowledge during training, both the semantic segmentation and knowledge distillation modules are discarded, which ensures that no extra computation or network parameters are introduced during inference.
During training, LiSA leverages diffusion-based knowledge distillation to transfer semantic knowledge from a pre-trained segmentation model into the original localization network (Scene Coordinate Regression at the top of Figure 3). After training, the semantic segmentation model and the knowledge distillation module are discarded, so the architecture during inference remains the same and no extra computation is introduced. Besides maintaining efficiency, distilling knowledge at the feature level allows the model to learn an adaptive filter for distracting objects, making the performance improvement consistent and significant across different datasets.
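As a rough illustration of this training setup, the sketch below shows a simplified feature-level distillation step: a plain feature-matching loss standing in for LiSA's diffusion-based scheme. The function names, the two-head regressor interface, and the loss weights are assumptions for illustration, not the paper's implementation.

```python
# Simplified feature-level knowledge distillation step (illustrative stand-in
# for LiSA's diffusion-based distillation).
import torch
import torch.nn.functional as F

def training_step(points, gt_coords, regressor, seg_teacher, w_distill=1.0):
    # Student: scene coordinates and semantic features from two heads.
    pred_coords, f_stu = regressor(points)        # (N, 3), (N, C)
    # Teacher: frozen pre-trained segmentation features (no gradients).
    with torch.no_grad():
        f_tea = seg_teacher(points)               # (N, C)
    # Localization loss on the regressed scene coordinates.
    loss_coord = F.l1_loss(pred_coords, gt_coords)
    # Distillation loss pulls student semantic features toward the teacher's.
    loss_distill = F.mse_loss(f_stu, f_tea)
    return loss_coord + w_distill * loss_distill
```

At inference, only the regressor runs and the teacher and distillation terms disappear, which is why the accuracy gain comes at no extra test-time cost.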
Results
The results in the following two tables (Figures 4 and 5) show that LiSA significantly advances localization accuracy while maintaining the memory and time efficiency of regression-based approaches.
Figure 4. Quantitative results on QEOxford. Mean position error (m) and mean orientation error (◦) for various methods are reported. Best performance is highlighted in bold; lower is better. LiSA outperforms all baseline methods in terms of both position and orientation accuracy.
Figure 5. Quantitative results on NCLT. Mean position error (m) and mean orientation error (◦) for various methods are reported. Even though the semantic segmentation model does not perform perfectly on the NCLT dataset, LiSA still surpasses all competitors by a large margin.
Figure 6 below shows that without semantic knowledge, the model treats different objects uniformly and is easily distracted. With LiSA, distracting objects such as pedestrians are automatically down-weighted with lower activation values.
Figure 6. The behavior of SCR with and without semantic awareness. Given a point cloud sampled from QEOxford, we show the pointwise activation value with and without using semantic information. Warmer colors denote higher activation values. Left: The activation map of the complete point cloud. Right: Zoom-in local views. Similar activation values are assigned to all points if no semantic knowledge is utilized, whereas LiSA can discriminate important points for localization and down-weight distracting points such as pedestrians.
As shown below in Figure 7, LiSA's activation distribution is less centralized, meaning the activations on different objects vary, leading to much higher localization accuracy in practice.
Figure 7. Activation value distributions and localization results of LiSA (with semantic) and SGLoc (without semantic). (a). Given a point cloud sampled from QEOxford, we first aggregate features in the regressor (before heads) and normalize their activation values into [0,255]. Without semantic awareness, most activation values of SGLoc are clustered in the center, indicating that almost all points are treated equally. (b). The localization accuracy with semantic awareness (LiSA) is much higher than the baseline (SGLoc) without semantic awareness.
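For readers curious how such an activation map might be computed, here is a minimal sketch under the assumptions that per-point features are aggregated by their mean absolute magnitude and then min-max rescaled into [0, 255] for visualization; the paper's exact aggregation may differ.

```python
# Illustrative per-point activation map for visualization.
import torch

def activation_map(features: torch.Tensor) -> torch.Tensor:
    """features: (N, C) per-point features from the regressor (before heads)."""
    act = features.abs().mean(dim=1)                # (N,) per-point activation
    act = act - act.min()
    act = act / act.max().clamp(min=1e-8) * 255.0   # rescale into [0, 255]
    return act
```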
Overall, our experiments demonstrate that LiSA achieves much better localization robustness and accuracy. Importantly, this performance gain comes free during inference. The code is available here.