
Advancing Gen AI on Intel Gaudi AI Accelerators with Multi-Modal Panorama Generation AI Technology

ZhipengCai
Employee

Zhipeng Cai is a research scientist at Intel Labs. His research focuses on fundamental computer vision and machine learning problems such as robust geometric perception, foundation models, neural rendering, generative models, and continual learning. Co-author Tien Pei (Joey) Chou is an Applied AI Scientist at Intel Data Center & AI (DCAI). His work focuses on bringing GenAI into applications and products on multi-cloud systems across different types of accelerators.

Highlights

  •  Researchers at Intel Labs introduced Language Model Assisted Generation of Images with Coherence, a generalizable AI framework that can create panoramas, immersive videos, and 3D imagery from multiple input modalities.
  • The AI framework is now fully enabled on the Intel® Gaudi® AI accelerator platform.
  • This work has been accepted at multiple conferences. The research paper will be presented as a highlight paper and featured in a live demonstration using the Gaudi accelerator at CVPR 2024, in addition to a demo at the ISC High Performance 2024 conference.

Researchers at Intel Labs introduced Language Model Assisted Generation of Images with Coherence, a generalizable artificial intelligence (AI) framework that can create panoramas, immersive videos, and 3D imagery from multiple input modalities, including text, images, hand drawings, depth maps, and more. It leverages pre-trained large language models (LLMs) to control 2D diffusion models, allowing the generation of panoramic scenes with diverse 360-degree layouts, different types of environments (indoor, outdoor, and even underwater), and input modalities without fine-tuning.

The Intel Data Center & AI engineering team has enabled this technology to run on the Intel® Gaudi® AI accelerator platform. In addition, this work has been accepted at multiple conferences, including the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), the top academic conference for AI and computer vision. The research paper will be presented as a highlight paper and featured in a live demonstration using the Gaudi accelerator at CVPR 2024, in addition to a demo at the ISC High Performance 2024 conference. The code, paper, and video presentations can be found at the project page.
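The article does not reproduce the enablement code itself, but as a rough illustration, a diffusers-style Stable Diffusion pipeline can be run on a Gaudi accelerator through the Optimum Habana library. The model checkpoint, scheduler, and Gaudi configuration names below are assumptions for this sketch, not the engineering team's actual setup.

```python
from optimum.habana.diffusers import GaudiDDIMScheduler, GaudiStableDiffusionPipeline

# Illustrative sketch only: checkpoint and Gaudi configuration are assumptions.
scheduler = GaudiDDIMScheduler.from_pretrained(
    "stabilityai/stable-diffusion-2-base", subfolder="scheduler"
)
pipe = GaudiStableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-base",
    scheduler=scheduler,
    use_habana=True,        # execute on the Gaudi HPU
    use_hpu_graphs=True,    # capture HPU graphs to reduce host-side overhead
    gaudi_config="Habana/stable-diffusion-2",
)

image = pipe(prompt="a cozy bedroom, wide-angle photo").images[0]
image.save("gaudi_view.png")
```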

 


Research on generative artificial intelligence (GenAI) has dominated the field of AI and computer vision. Generating panoramic scenes from an abstract description, such as a piece of text or a simple hand drawing, or from the partial view in a single image, remains challenging, yet it is useful in a wide range of applications such as design, education, filming, gaming, simulation, and more.


Figure 1. Redundant objects generated by existing methods.

Existing methods suffer from three issues:

  • Many models cannot handle different types of inputs. These methods are designed only for generating panoramic scenes from text, which limits their application range.
  • As Figure 1 shows, many methods cannot generate scenes with diverse 360-degree layouts. For example, the model generates multiple beds from different viewing angles to represent an all-around view of a bedroom.
  • Many existing methods cannot close the 360-degree loop when generating all-encompassing views, as shown in Figure 2.


Figure 2. Loop closure issues of existing methods.


Method

Language Model Assisted Generation of Images with Coherence uses pre-trained multimodal language models (MLMs) and 2D diffusion models to address these three issues. As shown in Figure 3, the AI framework uses images to bridge inputs from different modalities. If the input is not a natural image, such as a piece of text or a sketch drawing, the AI framework first generates a natural image from the input using mature conditional diffusion models, such as ControlNet.
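As a hedged sketch of this first step, a scribble-conditioned ControlNet from the diffusers library can turn a hand drawing into a natural starting image. The checkpoints, file names, and prompt below are illustrative assumptions, not the exact models used by the framework.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from PIL import Image

# Load a scribble-conditioned ControlNet and a base Stable Diffusion pipeline.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-scribble", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")  # any supported accelerator works; "cuda" is just for the sketch

# Turn a hand drawing into a natural image that can seed the panorama pipeline.
sketch = Image.open("hand_drawing.png").convert("RGB")
image = pipe("a sunny living room, photorealistic", image=sketch).images[0]
image.save("initial_view.png")
```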

After obtaining the single image, the AI framework iteratively warps existing views to novel views and then completes the missing pixels after warping, using a pre-trained 2D diffusion inpainting model (Stable Diffusion v2). However, simply performing inpainting in each local perspective (warped) view cannot effectively ensure the coherence of the global scene layout because there is no mechanism to create a diverse and reasonable 360-degree layout, which causes redundantly generated objects. An important contribution of the AI framework is to leverage pre-trained MLMs, such as BLIP and ChatGPT, to control what to generate in each diffusion inpainting step.
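The warp-then-inpaint step can be illustrated with the diffusers inpainting pipeline: pixels the warp could not fill are marked in a binary mask and completed by the pre-trained model. The file names and prompt are assumptions for this minimal sketch.

```python
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

# Pre-trained Stable Diffusion 2 inpainting model.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

warped_view = Image.open("warped_view.png").convert("RGB")   # existing content warped to the new camera
missing_mask = Image.open("missing_mask.png").convert("L")   # white where pixels are missing after warping

# Complete the missing pixels in the warped perspective view.
completed = pipe(
    prompt="a bedroom, consistent lighting, photorealistic",
    image=warped_view,
    mask_image=missing_mask,
).images[0]
completed.save("novel_view.png")
```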

With proper prompt engineering, state-of-the-art MLMs can generate reasonable multi-view scene layouts for diverse in-the-wild scenes without the need for fine-tuning. MLMs can also be leveraged to monitor the diffusion model outputs, so the AI framework can automatically detect and correct erroneous outputs when the diffusion models do not follow the LLM control command. After obtaining multiple inpainted perspective views, the AI framework merges them into the final panorama and leverages mature computer vision techniques such as depth estimation and depth-based warping to produce immersive videos and 360-degree point clouds. To further enhance the output quality, the AI framework uses pre-trained diffusion super-resolution models and applies smooth multi-view fusion so that the final merged panorama is high resolution and free of visible boundaries between the fused views.
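The language-model control described above might look roughly like the sketch below, assuming BLIP for captioning and a chat LLM for layout reasoning: the caption of the current view is folded into a layout query whose answer becomes the prompt for the next inpainting step. The prompts and file names are illustrative; the framework's actual prompt engineering is not reproduced here.

```python
from transformers import BlipForConditionalGeneration, BlipProcessor
from PIL import Image

# Caption the current view with a pre-trained MLM (BLIP).
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

view = Image.open("current_view.png").convert("RGB")
inputs = processor(view, return_tensors="pt")
caption = processor.decode(blip.generate(**inputs)[0], skip_special_tokens=True)

# Ask a chat LLM what the next view should contain, discouraging duplicate objects.
layout_query = (
    f"The camera currently sees: {caption}. "
    "It now rotates 45 degrees to the right inside the same scene. "
    "Describe in one sentence what should appear, without repeating large "
    "objects that were already visible."
)
# layout_query would be sent to a chat LLM (e.g., via an API); its answer becomes
# the text prompt for the next diffusion inpainting step.
print(layout_query)
```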


Figure 3. Language Model Assisted Generation of Images with Coherence pipeline.

In testing the AI framework pipeline, the input image either comes from the real world or is synthesized by conditional diffusion models. Multiple novel views composing a 360-degree panoramic scene are generated by iterative warping and inpainting. Pre-trained diffusion models, assisted by pre-trained language models, are used to generate views with both high-quality local textures and coherent 360-degree layouts. Further quality enhancement techniques ensure smooth blending of multiple views into high-resolution panoramic scenes. The AI framework can generate panorama images, immersive videos, and 3D point clouds from various types of inputs, such as images, text, and sketch drawings.

Result


Figure 4. Quantitative results for image-to-panorama and text-to-panorama generation.

As shown in Figure 4, the AI framework quantitatively beats existing methods for both image-to-panorama and text-to-panorama generation in both human and algorithmic evaluations. For human evaluations, each baseline has two bars representing the quality of the rendered perspective views and of the 360-degree layout, respectively. The value of each bar shows how often the AI framework is preferred over that baseline in the voting; values above 50% (dashed line) mean the AI framework is preferred more often than the corresponding baseline. Algorithmic evaluations are done by computing the inception score (IS). The AI framework consistently outperforms existing methods on both metrics.
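As a minimal illustration of the algorithmic metric, the inception score can be computed with torchmetrics over perspective views rendered from each generated panorama. Random tensors stand in for real rendered views in this sketch.

```python
import torch
from torchmetrics.image.inception import InceptionScore  # requires torchmetrics[image]

# Inception score over a batch of uint8 RGB images (N, 3, H, W).
inception = InceptionScore()
rendered_views = torch.randint(0, 255, (16, 3, 299, 299), dtype=torch.uint8)
inception.update(rendered_views)
mean, std = inception.compute()
print(f"Inception score: {mean.item():.2f} +/- {std.item():.2f}")
```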

Qualitatively, as shown in Figures 5 and 6, the AI framework can smoothly close the 360-degree loop and generate scenes with coherent 360-degree layouts and no redundant objects in both image-to-panorama and text-to-panorama generation. Since the AI framework does not need model fine-tuning, the zero-shot generalization capability of the system is fully preserved. Hence, it can generate realistic panoramas for diverse scene types, such as indoor, outdoor, and even underwater scenes.


Figure 5. Image-to-panorama visualizations.

As Figures 5 and 6 demonstrate, Text2light, Stable Diffusion v2, and LDM3D cannot close the 360-degree loop (note the sharp boundaries in the middle). Text2room and MVDiffusion lack mechanisms to avoid duplicate objects. The AI framework's outputs have high local view quality and coherent scene layouts.


Figure 6. Text-to-panorama visualizations.

As Figure 7 shows, the AI framework can effectively create panoramas from various input modalities, such as a depth map (top), a sketch drawing (middle), and a colored scribble or a segmentation mask (bottom). The dotted bounding box indicates the region of the initial perspective view, which is generated by conditional diffusion models.


Figure 7. Panorama generated from other input modalities.

Leveraging mature computer vision techniques, such as depth-based warping and depth estimation, we can further create immersive videos (with both camera rotations and translations) and 360-degree 3D point clouds of the scenes. Figure 8 shows sampled point clouds generated by the AI framework, while immersive video results can be seen in the presentation video.


Figure 8. 3D point cloud generation.

Performing depth estimation on the generated panorama further enables the creation of 3D point clouds from diverse inputs. As Figure 8 shows, point clouds can be generated for both indoor and outdoor scenes, and even for the underwater scene, with clear geometry of the fish and coral reefs.
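The point-cloud step can be illustrated with a hedged sketch: a monocular depth estimator predicts per-pixel depth for one generated view, and each pixel is back-projected with a simple pinhole camera model. The model name, focal length, and file name are assumptions; the framework's actual depth and projection code is not reproduced here.

```python
import numpy as np
from PIL import Image
from transformers import pipeline

# Monocular depth estimation (relative depth) on one generated view.
depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")
image = Image.open("generated_view.png").convert("RGB")
depth = np.asarray(depth_estimator(image)["depth"], dtype=np.float32)  # resized to image size

# Assumed pinhole intrinsics for the sketch.
h, w = depth.shape
fx = fy = 0.8 * w
cx, cy = w / 2.0, h / 2.0
u, v = np.meshgrid(np.arange(w), np.arange(h))

# Back-project each pixel into 3D camera coordinates.
z = depth
x = (u - cx) * z / fx
y = (v - cy) * z / fy
points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
colors = np.asarray(image, dtype=np.float32).reshape(-1, 3) / 255.0
print(points.shape, colors.shape)  # (H*W, 3) each: a colored point cloud
```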

The paper, code, Hugging Face demo, video presentation, and result gallery can all be found at the project website.

About the Author
I am interested in general machine learning and computer vision problems. During my PhD, I worked on robust geometric perception, which estimates computer vision models (correspondences between images, poses, 3D reconstructions) from outlier-contaminated data. I was specifically interested in designing efficient algorithms with optimality guarantees, i.e., algorithms guaranteed to return the best solution. After joining Intel, my interests shifted toward a mixture of learning and vision, where I study problems such as 1) learning-based perception (feature matching, finding correspondences, pose estimation, depth estimation, etc.), 2) continual learning, and 3) generative models (e.g., novel view synthesis, image/3D scene generation). My work was selected as one of the 12 best papers at ECCV'18.