
Intel Labs and Collaborators Present Novel Computer Vision Approaches at ECCV 2024

ScottBair

Scott Bair is a key voice at Intel Labs, sharing insights into innovative research for inventing tomorrow’s technology.

Highlights

  • Intel Labs and collaborators will present six papers focused on computer vision and machine learning at ECCV 2024.
  • Intel Labs papers include a novel defense approach designed to protect text-to-image models from prompt-based red teaming attacks.
  • With university collaborators, Intel Labs co-organized a tutorial on how to Responsibly Build Generative Models.

Researchers from Intel Labs and academic and industry collaborators will present six research papers at the European Conference on Computer Vision (ECCV 2024) from September 29 to October 4. The conference, which focuses on computer vision and machine learning, is organized by the European Computer Vision Association (ECVA). The Intel Labs papers include a novel defense approach designed to protect text-to-image (T2I) models from prompt-based red teaming attacks, a new spatially-focused large-scale dataset that improves spatial consistency in T2I models, and an approach to derive ground-truth radiance fields from textured meshes for 3D generation tasks.

In addition, Intel Labs, in collaboration with the University of Maryland, Arizona State University, and the University of Maryland, Baltimore County, co-organized a tutorial on how to Responsibly Build Generative Models. While these models have evolved into production-ready tools, they face several reliability issues that can impact their widespread adoption. Intel Labs’ Ilke Demir, a senior staff research scientist, will present how Intel’s FakeCatcher uses blood flow signal algorithms for deepfake detection, allowing users to distinguish between real and fake content. Other invited speakers will cover how to mitigate copyright breaches when a model memorizes training data, and techniques for incorporating fingerprints into model weights to trace the origins of malicious content.

Conference Oral Paper

R.A.C.E.: Robust Adversarial Concept Erasure for Secure Text-to-Image Diffusion Model

Intel Labs collaboration with Arizona State University

In the evolving landscape of text-to-image (T2I) diffusion models, the remarkable capability to generate high-quality images from textual descriptions faces challenges with the potential misuse of reproducing sensitive content. To address this critical issue, we introduce Robust Adversarial Concept Erase (RACE), a novel approach designed to mitigate these risks by enhancing the robustness of concept erasure methods for T2I models. RACE utilizes a sophisticated adversarial training framework to identify and mitigate adversarial text embeddings, significantly reducing the attack success rate (ASR). Impressively, RACE achieves a 30 percentage point reduction in ASR for the “unclothed” concept against the leading white-box attack method. Our extensive evaluations demonstrate RACE’s effectiveness in defending against both white-box and black-box attacks, marking a significant advancement in protecting T2I diffusion models from generating inappropriate or misleading imagery. This work underlines the essential need for proactive defense measures in adapting to the rapidly advancing field of adversarial challenges. Code is publicly available.
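To make the adversarial training idea more concrete, here is a minimal, self-contained sketch in the same spirit: an inner loop searches for an adversarial perturbation of the concept's text embedding that escapes the erasure, and an outer loop updates the model so both the concept and its adversarial neighborhood map to a safe anchor behavior. The stand-in model, embedding sizes, step sizes, and loss terms are illustrative assumptions, not the paper's implementation.

```python
# Toy sketch of adversarial concept-erasure training (min-max over text embeddings).
# Everything here is a stand-in: a tiny MLP replaces the diffusion model.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in "diffusion model": predicts noise from a latent conditioned on a text embedding.
model = torch.nn.Sequential(torch.nn.Linear(16 + 8, 64), torch.nn.ReLU(), torch.nn.Linear(64, 16))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

concept_emb = torch.randn(8)   # embedding of the concept to erase
anchor_emb = torch.randn(8)    # embedding of a safe anchor concept

def predict(latent, text_emb):
    return model(torch.cat([latent, text_emb.expand(latent.shape[0], -1)], dim=-1))

for step in range(100):
    latent = torch.randn(32, 16)
    anchor_out = predict(latent, anchor_emb).detach()

    # Inner maximization: find a perturbed text embedding that still escapes the
    # erasure, i.e. drifts away from the safe anchor behavior.
    delta = torch.zeros(8, requires_grad=True)
    for _ in range(5):
        escape = F.mse_loss(predict(latent, concept_emb + delta), anchor_out)
        (grad,) = torch.autograd.grad(escape, delta)
        delta = (delta + 0.1 * grad.sign()).clamp(-0.5, 0.5).detach().requires_grad_(True)

    # Outer minimization: update the model so the concept embedding and its
    # adversarial neighborhood both reproduce the safe anchor behavior.
    erase_loss = F.mse_loss(predict(latent, concept_emb), anchor_out) + \
                 F.mse_loss(predict(latent, concept_emb + delta.detach()), anchor_out)
    opt.zero_grad()
    erase_loss.backward()
    opt.step()
```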

Conference Papers

CLAMP-ViT: Contrastive Data-Free Learning for Adaptive Post-Training Quantization of ViTs

Intel Labs collaboration with Georgia Institute of Technology

We present CLAMP-ViT, a data-free post-training quantization method for vision transformers (ViTs). We identify the limitations of recent techniques, notably their inability to leverage meaningful inter-patch relationships, which leads to the generation of simplistic and semantically vague data and hurts quantization accuracy. CLAMP-ViT employs a two-stage approach, cyclically adapting between data generation and model quantization. Specifically, we incorporate a patch-level contrastive learning scheme to generate richer, semantically meaningful data. Furthermore, we leverage contrastive learning in layer-wise evolutionary search for fixed- and mixed-precision quantization to identify optimal quantization parameters while mitigating the effects of a non-smooth loss landscape. Extensive evaluations across various vision tasks demonstrate the superiority of CLAMP-ViT, with performance improvements of up to 3% in top-1 accuracy for classification, 0.6 mAP for object detection, and 1.5 mIoU for segmentation at similar or better compression ratios than existing alternatives. The code, project page, and video are publicly available.
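As a rough illustration of patch-level contrastive data generation, the sketch below optimizes a batch of synthetic images so that spatially adjacent patch tokens of a frozen, stand-in ViT agree under a contrastive loss; the optimized images would then serve as calibration data for quantization. The tiny backbone, the choice of positives, and the temperature are assumptions for illustration only, not the paper's recipe.

```python
# Toy sketch: synthesize calibration images via a patch-level contrastive objective.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in frozen ViT: maps a (B, 3, 32, 32) image to (B, num_patches, dim) patch tokens.
class TinyViT(torch.nn.Module):
    def __init__(self, patch=8, dim=32):
        super().__init__()
        self.proj = torch.nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
    def forward(self, x):
        tok = self.proj(x)                       # (B, dim, H/patch, W/patch)
        return tok.flatten(2).transpose(1, 2)    # (B, num_patches, dim)

vit = TinyViT().eval()
for p in vit.parameters():
    p.requires_grad_(False)

# Synthetic images are free parameters optimized against the contrastive loss.
synthetic = torch.randn(4, 3, 32, 32, requires_grad=True)
opt = torch.optim.Adam([synthetic], lr=0.05)

def patch_contrastive_loss(tokens, tau=0.1):
    """Pull each patch toward its neighbor, push it away from all other patches."""
    B, N, D = tokens.shape
    z = F.normalize(tokens, dim=-1)
    sim = torch.matmul(z, z.transpose(1, 2)) / tau               # (B, N, N)
    pos_idx = (torch.arange(N) + 1) % N                          # next patch as the positive
    logits = sim.masked_fill(torch.eye(N, dtype=torch.bool), float('-inf'))
    return F.cross_entropy(logits.reshape(B * N, N), pos_idx.repeat(B))

for step in range(200):
    loss = patch_contrastive_loss(vit(synthetic))
    opt.zero_grad()
    loss.backward()
    opt.step()

# `synthetic` would now be used as calibration data for post-training quantization.
```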

Generating Physically Realistic and Directable Human Motions from Multi-Modal Inputs

Intel Labs collaboration with Oregon State University

This work focuses on generating realistic, physically-based human behaviors from multi-modal inputs, which may only partially specify the desired motion. For example, the input may come from a VR controller providing arm motion and body velocity, partial key-point animation, computer vision applied to videos, or even higher-level motion goals. This requires a versatile low-level humanoid controller that can handle such sparse, under-specified guidance, seamlessly switch between skills, and recover from failures. Current approaches for learning humanoid controllers from demonstration data capture some of these characteristics, but none achieve them all. To this end, we introduce the Masked Humanoid Controller (MHC), a novel approach that applies multi-objective imitation learning on augmented and selectively masked motion demonstrations. The training methodology results in an MHC that exhibits the key capabilities of catching up to out-of-sync input commands, combining elements from multiple motion sequences, and completing unspecified parts of motions from sparse multimodal input. We demonstrate these key capabilities for an MHC learned over a dataset of 87 diverse skills and showcase different multi-modal use cases, including integration with planning frameworks to highlight MHC’s ability to solve new user-defined tasks without any finetuning.
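The sketch below gives a simplified picture of imitation learning on selectively masked commands: random parts of a multi-modal command vector are zeroed out, the mask itself is fed to the policy, and the policy is trained to reproduce the expert action anyway, so it learns to act from sparse, under-specified guidance. The state and command layout, masking rate, and network are illustrative assumptions, not the MHC architecture or its full multi-objective loss.

```python
# Toy sketch: behavior cloning from randomly masked multi-modal commands.
import torch

torch.manual_seed(0)

STATE_DIM, CMD_DIM, ACT_DIM = 32, 24, 12   # proprioception, multi-modal command, joint targets

policy = torch.nn.Sequential(
    torch.nn.Linear(STATE_DIM + 2 * CMD_DIM, 128),  # command and its mask are both inputs
    torch.nn.ReLU(),
    torch.nn.Linear(128, ACT_DIM),
)
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

for step in range(100):
    # A batch of demonstration frames: states, full commands, expert actions (random stand-ins).
    state = torch.randn(64, STATE_DIM)
    command = torch.randn(64, CMD_DIM)       # e.g. target key-points, root velocity, goals
    expert_action = torch.randn(64, ACT_DIM)

    # Randomly mask out parts of the command so the controller learns to act from
    # sparse guidance (VR controllers, partial key-point animation, ...).
    mask = (torch.rand(64, CMD_DIM) > 0.5).float()
    masked_cmd = command * mask

    pred = policy(torch.cat([state, masked_cmd, mask], dim=-1))
    loss = torch.nn.functional.mse_loss(pred, expert_action)   # imitation term only
    opt.zero_grad()
    loss.backward()
    opt.step()
```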

GenQ: Quantization in Low Data Regimes with Generative Synthetic Data

Intel Labs collaboration with Yale University

In the realm of deep neural network deployment, low-bit quantization presents a promising avenue for enhancing computational efficiency. However, it often hinges on the availability of training data to mitigate quantization errors, a significant challenge when data is scarce or its use is restricted due to privacy or copyright concerns. Addressing this, we introduce GenQ, a novel approach employing an advanced generative AI model to generate photorealistic, high-resolution synthetic data, overcoming the limitations of traditional methods that struggle to accurately mimic complex objects in extensive datasets like ImageNet. Our methodology is underscored by two robust filtering mechanisms designed to ensure the synthetic data closely aligns with the intrinsic characteristics of the actual training data. In cases of limited data availability, the actual data is used to guide the synthetic data generation process, enhancing fidelity through the inversion of learnable token embeddings. Through rigorous experimentation, GenQ establishes new benchmarks in data-free and data-scarce quantization, significantly outperforming existing methods in accuracy and efficiency, thereby setting a new standard for quantization in low data regimes.
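As a rough picture of how generated images might be screened before they are used for calibration, the sketch below applies two simple filters, one on classifier confidence for the intended label and one on feature-space distance from the batch statistics. These particular filters, thresholds, and stand-in networks are assumptions chosen for illustration; they are not the paper's exact mechanisms.

```python
# Toy sketch: filter synthetic images before using them as quantization calibration data.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-ins: synthetic images with intended labels, and the full-precision model's
# feature extractor and classifier head.
synthetic = torch.randn(256, 3, 32, 32)
intended_labels = torch.randint(0, 10, (256,))
backbone = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 64), torch.nn.ReLU())
head = torch.nn.Linear(64, 10)

with torch.no_grad():
    feats = backbone(synthetic)
    probs = F.softmax(head(feats), dim=-1)

# Filter 1: keep images the model assigns to the intended class with relatively high confidence.
conf = probs[torch.arange(len(synthetic)), intended_labels]
keep_conf = conf >= conf.median()

# Filter 2: keep images whose features sit close to the batch feature statistics,
# discarding outliers that poorly reflect the data distribution.
dist = (feats - feats.mean(0)).norm(dim=-1)
keep_dist = dist < dist.mean() + dist.std()

calibration_set = synthetic[keep_conf & keep_dist]
print(f"kept {len(calibration_set)} / {len(synthetic)} synthetic images for calibration")
```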

Getting it Right: Improving Spatial Consistency in Text-to-Image Models

Intel Labs collaboration with Arizona State University, Hugging Face, University of Washington, and University of Maryland, Baltimore County

One of the key shortcomings in current text-to-image (T2I) models is their inability to consistently generate images that faithfully follow the spatial relationships specified in the text prompt. In this paper, we offer a comprehensive investigation of this limitation, while also developing datasets and methods that achieve state-of-the-art performance. First, we find that current vision-language datasets do not represent spatial relationships well enough; to alleviate this bottleneck, we create SPRIGHT, the first spatially-focused, large-scale dataset, by re-captioning 6 million images from four widely used vision datasets. Through a three-fold evaluation and analysis pipeline, we find that SPRIGHT largely improves upon existing datasets in capturing spatial relationships. To demonstrate its efficacy, we leverage only 0.25% of SPRIGHT and achieve a 22% improvement in generating spatially accurate images while improving the FID and CMMD scores. Second, we find that training on images containing a large number of objects results in substantial improvements in spatial consistency. Notably, we attain state-of-the-art results on T2I-CompBench, with a spatial score of 0.2133, by fine-tuning on fewer than 500 images. Finally, through a set of controlled experiments and ablations, we document multiple findings that we believe will enhance the understanding of factors that affect spatial consistency in text-to-image models. Demo, code, data, and models are publicly available.
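Conceptually, the dataset is built by asking a vision-language model to describe each source image with an emphasis on spatial relationships. The short sketch below shows what such a re-captioning pass looks like; the captioning call is a placeholder, and the instruction text, file names, and helper functions are hypothetical rather than SPRIGHT's actual pipeline.

```python
# Toy sketch of a spatially-focused re-captioning pass over existing images.
SPATIAL_INSTRUCTION = (
    "Describe the image, focusing on the spatial relationships between objects "
    "(left/right, above/below, in front of/behind, near/far)."
)

def recaption(image_path: str) -> str:
    # Placeholder for a vision-language model call conditioned on SPATIAL_INSTRUCTION.
    # In a real pipeline this would return something like
    # "A cat sits to the left of a sofa, below a window."
    return f"[spatial caption for {image_path}]"

def build_spatial_dataset(image_paths):
    # Pair each source image with its new spatially-focused caption.
    return [{"image": p, "caption": recaption(p)} for p in image_paths]

print(build_spatial_dataset(["example_image_1.jpg", "example_image_2.jpg"]))
```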

Mesh2NeRF: Direct Mesh Supervision for Neural Radiance Field Representation and Generation

Intel Labs collaboration with TU Munich

We present Mesh2NeRF, an approach to derive ground-truth radiance fields from textured meshes for 3D generation tasks. Many 3D generative approaches represent 3D scenes as radiance fields for training. Their ground-truth radiance fields are usually fitted from multi-view renderings from a large-scale synthetic 3D dataset, which often results in artifacts due to occlusions or under-fitting issues. In Mesh2NeRF, we propose an analytic solution to directly obtain ground-truth radiance fields from 3D meshes, characterizing the density field with an occupancy function featuring a defined surface thickness, and determining view-dependent color through a reflection function considering both the mesh and environment lighting. Mesh2NeRF extracts accurate radiance fields that provide direct supervision for training generative NeRFs and single-scene representations. We validate the effectiveness of Mesh2NeRF across various tasks, achieving a noteworthy 3.12 dB improvement in PSNR for view synthesis in single-scene representation on the ABO dataset, a 0.69 dB PSNR enhancement in single-view conditional generation on ShapeNet Cars, and notably improved mesh extraction from NeRF in the unconditional generation of Objaverse Mugs.
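The core idea of querying a ground-truth density analytically, rather than fitting it from multi-view renderings, can be pictured as a thin occupancy shell around the surface. In the sketch below a sphere stands in for the textured mesh, and the shell thickness and density magnitude are illustrative assumptions; the paper's actual occupancy and reflection functions are not reproduced here.

```python
# Toy sketch: analytic, occupancy-based density with a defined surface thickness.
import numpy as np

SURFACE_THICKNESS = 0.02    # width of the band around the surface
DENSITY_INSIDE_BAND = 1.0e3

def distance_to_surface(points, center=np.zeros(3), radius=0.5):
    # Unsigned distance to a sphere's surface (stand-in for point-to-mesh distance).
    return np.abs(np.linalg.norm(points - center, axis=-1) - radius)

def analytic_density(points):
    # Density is constant inside a thin shell around the surface and zero elsewhere,
    # i.e. an occupancy function with a defined surface thickness.
    occupied = distance_to_surface(points) < SURFACE_THICKNESS / 2
    return np.where(occupied, DENSITY_INSIDE_BAND, 0.0)

# Sample points along a camera ray and query the ground-truth density directly,
# with no multi-view fitting step.
origin, direction = np.array([0.0, 0.0, -2.0]), np.array([0.0, 0.0, 1.0])
t = np.linspace(0.0, 4.0, 512)[:, None]
samples = origin + t * direction
print("non-zero density samples along the ray:", int((analytic_density(samples) > 0).sum()))
```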

About the Author
Scott Bair is a Senior Technical Creative Director for Intel Labs, chartered with growing awareness of Intel’s leading-edge research activities in areas such as AI, neuromorphic computing, and quantum computing. Scott is responsible for driving marketing strategy, messaging, and asset creation for Intel Labs and its joint research activities. In addition to his work at Intel, he has a passion for audio technology and is an active father of five children. Scott has over 23 years of experience in the computing industry bringing new products and technologies to market. During his 15 years at Intel, he has worked in a variety of roles spanning R&D, architecture, strategic planning, product marketing, and technology evangelism. Scott has an undergraduate degree in Electrical and Computer Engineering and a Master of Business Administration from Brigham Young University.