
Intel Labs Presents State-of-the-Art AI Research at ICLR 2024

ScottBair
Employee

Scott Bair is a key voice at Intel Labs, sharing insights into innovative research for inventing tomorrow’s technology.

Highlights:

  • This year’s International Conference on Learning Representations (ICLR) is hosted in Vienna, Austria, from May 7th to May 11th.
  • Intel’s paper introducing its novel Generalizable Image Matcher (GIM) has been accepted as a spotlight presentation, a distinction granted to only the top 5% of conference papers.
  • Intel will present six papers introducing state-of-the-art AI research at the 2024 conference.

 

This year’s International Conference on Learning Representations (ICLR) is hosted in Vienna, Austria, from May 7th to May 11th. Intel Labs is proud to contribute six papers advancing various facets of artificial intelligence (AI) at the 2024 conference.

Researchers at Intel Labs, in collaboration with Xiamen University and DJI, have introduced the Generalizable Image Matcher (GIM), the first framework that can use unlabeled internet videos to train foundation AI models for zero-shot image matching. The model’s performance then grows continually with the size of the video data. This technology could transform many industrial applications, such as 3D reconstruction, neural rendering (NeRFs/Gaussian Splatting), and autonomous driving. Recognized for its significance, this work was accepted as a spotlight presentation at ICLR 2024, a prestigious distinction granted to only the top 5% of conference papers. Read this in-depth researcher blog to learn more about the technology.

Other innovations presented at the conference include a new convex score function for sparsity-aware learning of linear directed acyclic graphs; a new forward learning procedure for graph neural networks; and an approach for learning universal and transferable graph representations. Intel Labs researchers will also present a new reinforcement learning-based molecular design algorithm and an exploration of the Fusion of Experts problem formulated as an instance of supervised learning.

 

CoLiDE: Concomitant Linear DAG Estimation

This work deals with the combinatorial problem of learning directed acyclic graph (DAG) structure from observational data adhering to a linear structural equation model (SEM). Leveraging advances in differentiable, nonconvex characterizations of acyclicity, recent efforts have advocated a continuous constrained optimization paradigm to efficiently explore the space of DAGs. Most existing methods employ lasso-type score functions to guide this search, which (i) require expensive penalty parameter retuning when the SEM noise variances change across problem instances; and (ii) implicitly rely on limiting homoscedasticity assumptions. In this work, researchers propose a new convex score function for sparsity-aware learning of linear DAGs, which incorporates concomitant estimation of scale and thus effectively decouples the sparsity parameter from noise levels. Regularization via a smooth, nonconvex acyclicity penalty term yields CoLiDE (Concomitant Linear DAG Estimation), a regression-based criterion amenable to efficient gradient computation and closed-form estimation of exogenous noise levels in heteroscedastic scenarios. The algorithm outperforms state-of-the-art methods without incurring added complexity, especially when the DAGs are larger and the noise level profile is heterogeneous. Results also show that CoLiDE exhibits enhanced stability, manifested via reduced standard deviations in several domain-specific metrics, underscoring the robustness of the novel linear DAG estimator.
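
To make the decoupling idea concrete, here is a minimal NumPy sketch of a concomitant (scale-aware) score combined with a NOTEARS-style acyclicity penalty. The exact objective, optimizer, and hyperparameters are illustrative assumptions, not the paper's precise CoLiDE formulation:

```python
# Minimal sketch of a concomitant (scale-aware) score for linear DAG
# learning in the spirit of CoLiDE. The exact objective, optimizer, and
# hyperparameters here are illustrative assumptions, not the paper's.
import numpy as np
from scipy.linalg import expm

def fit_colide_style(X, lam=0.05, rho=10.0, lr=5e-3, iters=3000):
    """Alternate a closed-form noise-scale update with subgradient steps on
    a score whose sparsity weight lam is decoupled from the noise level."""
    n, d = X.shape
    W = np.zeros((d, d))
    for _ in range(iters):
        R = X - X @ W                                 # residuals
        sigma = np.sqrt((R ** 2).sum() / (n * d))     # closed-form scale estimate
        grad = (-(X.T @ R) / (n * sigma)              # data fit, rescaled by sigma
                + lam * np.sign(W)                    # l1 subgradient (sparsity)
                + rho * (expm(W * W).T * 2 * W))      # grad of tr(exp(W∘W)) - d
        W -= lr * grad
        np.fill_diagonal(W, 0.0)                      # forbid self-loops
    return W, sigma

# Toy usage: data from a 3-node chain DAG x0 -> x1 -> x2.
rng = np.random.default_rng(0)
x0 = rng.normal(size=1000)
x1 = 2.0 * x0 + rng.normal(size=1000)
x2 = -1.5 * x1 + rng.normal(size=1000)
W_hat, _ = fit_colide_style(np.column_stack([x0, x1, x2]))
print(np.round(W_hat, 2))
```

Because the residual term is rescaled by the jointly estimated sigma, the same sparsity weight lam can be reused across problem instances with different noise levels, which is the retuning cost the paper aims to eliminate.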

Forward Learning of Graph Neural Networks

Graph neural networks (GNNs) have achieved remarkable success across a wide range of applications, such as recommendation, drug discovery, and question answering. Behind the success of GNNs lies the backpropagation (BP) algorithm, which is the de facto standard for training deep neural networks (NNs). However, despite its effectiveness, BP imposes several constraints, which are not only biologically implausible, but also limit the scalability, parallelism, and flexibility in learning NNs. Examples of such constraints include storage of neural activities computed in the forward pass for use in the subsequent backward pass, and the dependence of parameter updates on non-local signals. To address these limitations, the forward-forward algorithm (FF) was recently proposed as an alternative to BP in the image classification domain, which trains NNs by performing two forward passes over positive and negative data. Inspired by this advance, this work proposes ForwardGNN, a new forward learning procedure for GNNs, which avoids the constraints imposed by BP via an effective layer-wise local forward training. ForwardGNN extends the original FF to deal with graph data and GNNs, and makes it possible to operate without generating negative inputs (hence no longer forward-forward). Further, ForwardGNN enables each layer to learn from both the bottom-up and top-down signals without relying on the backpropagation of errors. Extensive experiments on real-world datasets show the effectiveness and generality of the proposed forward graph learning framework. Find the code at this https URL.
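
The following is a minimal PyTorch sketch of the layer-wise local training idea: each layer is optimized against its own local prediction head, and only detached activations are passed upward, so no errors are backpropagated across layers. The simple mean-aggregation layer, local cross-entropy objective, and toy graph are assumptions for illustration, not the paper's exact ForwardGNN procedure:

```python
# Minimal sketch of layer-wise local (forward-only) training for a GNN:
# each layer learns from its own local head, and only detached activations
# move upward, so no error is backpropagated across layers.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalGNNLayer(nn.Module):
    def __init__(self, d_in, d_out, n_classes):
        super().__init__()
        self.lin = nn.Linear(d_in, d_out)
        self.head = nn.Linear(d_out, n_classes)    # layer-local prediction head

    def forward(self, A_hat, H):
        return torch.relu(self.lin(A_hat @ H))     # aggregate neighbors, transform

def train_layerwise(layers, A_hat, X, y, mask, epochs=100):
    H = X
    for layer in layers:                           # train one layer at a time
        opt = torch.optim.Adam(layer.parameters(), lr=0.01)
        for _ in range(epochs):
            opt.zero_grad()
            loss = F.cross_entropy(layer.head(layer(A_hat, H))[mask], y[mask])
            loss.backward()                        # gradients stay inside this layer
            opt.step()
        H = layer(A_hat, H).detach()               # freeze output, pass upward
    return H

# Toy usage: random graph, 100 nodes, 16 features, 3 classes.
n, d, c = 100, 16, 3
torch.manual_seed(0)
A = (torch.rand(n, n) < 0.05).float()
A = ((A + A.T) > 0).float()                        # symmetrize
A.fill_diagonal_(1.0)                              # add self-loops
A_hat = A / A.sum(1, keepdim=True)                 # row-normalized adjacency
X, y = torch.randn(n, d), torch.randint(0, c, (n,))
mask = torch.rand(n) < 0.5                         # training nodes
layers = [LocalGNNLayer(d, 32, c), LocalGNNLayer(32, 32, c)]
H = train_layerwise(layers, A_hat, X, y, mask)
```

Because each layer's update depends only on its own forward pass, activations from earlier layers never need to be stored for a global backward pass, which is the scalability constraint the paper targets.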

Fusing Models with Complementary Expertise

Training AI models that generalize across tasks and domains has long been among the open problems driving AI research. The emergence of Foundation Models made it easier to obtain expert models for a given task, but the heterogeneity of data that may be encountered at test time often means that any single expert is insufficient. This paper considers the Fusion of Experts (FoE) problem of fusing outputs of expert models with complementary knowledge of the data distribution and formulates it as an instance of supervised learning. The proposed method is applicable to both discriminative and generative tasks and leads to significant performance improvements in image and text classification, text summarization, multiple-choice QA, and automatic evaluation of generated text. The researchers also extend the method to the "frugal" setting, where it is desirable to reduce the number of expert model evaluations at test time.
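
Below is a minimal sketch of the supervised-fusion idea: frozen "expert" models produce outputs, and a small fuser network is trained on their concatenation to predict the label. The synthetic experts, data, and fuser architecture are stand-ins for illustration, not the paper's setup:

```python
# Minimal sketch of Fusion of Experts as supervised learning: a small
# fuser network is trained on the concatenated outputs of frozen experts.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
n_experts, n_classes, n_samples = 3, 4, 2000

# Frozen "experts" (random stand-ins for pre-trained models).
experts = [nn.Linear(8, n_classes).requires_grad_(False) for _ in range(n_experts)]
X = torch.randn(n_samples, 8)
y = torch.randint(0, n_classes, (n_samples,))

# Fuser input: concatenation of every expert's probability vector.
with torch.no_grad():
    feats = torch.cat([e(X).softmax(-1) for e in experts], dim=-1)

fuser = nn.Sequential(nn.Linear(n_experts * n_classes, 64), nn.ReLU(),
                      nn.Linear(64, n_classes))
opt = torch.optim.Adam(fuser.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    F.cross_entropy(fuser(feats), y).backward()    # standard supervised loss
    opt.step()

# In the "frugal" variant, the fuser would see outputs from only a subset
# of experts, trading accuracy for fewer expert evaluations at test time.
```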

GIM: Learning Generalizable Image Matcher From Internet Videos

Image matching is a fundamental computer vision problem. While learning-based methods achieve state-of-the-art performance on existing benchmarks, they generalize poorly to in-the-wild images. Such methods typically need to train separate models for different scene types and are impractical when the scene type is unknown in advance. One of the underlying problems is the limited scalability of existing data construction pipelines, which limits the diversity of standard image matching datasets. To address this problem, this work proposes GIM, a self-training framework for learning a single generalizable model based on any image matching architecture using internet videos, an abundant and diverse data source. Given an architecture, GIM first trains it on standard domain-specific datasets and then combines it with complementary matching methods to create dense labels on nearby frames of novel videos. These labels are filtered by robust fitting and then enhanced by propagating them to distant frames. The final model is trained on the propagated data with strong augmentations. Researchers also propose ZEB, the first zero-shot evaluation benchmark for image matching. By mixing data from diverse domains, ZEB can thoroughly assess the cross-domain generalization performance of different methods. Applying GIM consistently improves the zero-shot performance of three state-of-the-art image matching architectures; with 50 hours of YouTube videos, the relative zero-shot performance improves by 8.4% to 18.1%. GIM also enables generalization to extreme cross-domain data such as Bird's Eye View (BEV) images of projected 3D point clouds. More importantly, the single zero-shot model consistently outperforms domain-specific baselines when evaluated on downstream tasks inherent to their respective domains. The video presentation is available at this https URL.
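
As a small illustration of the robust-fitting step in the labeling pipeline, the sketch below filters candidate correspondences between two frames with a RANSAC-fitted homography via OpenCV. The homography model and the synthetic matcher output are simplifying assumptions standing in for the paper's actual geometric filtering:

```python
# Minimal sketch of the robust-fitting filter used when turning matcher
# outputs on video frames into training labels. A RANSAC homography stands
# in for the paper's geometric filtering; the matcher output is synthetic.
import numpy as np
import cv2

def filter_matches_robust(pts_a, pts_b, thresh=1.0):
    """Keep only correspondences consistent with a robustly fitted model."""
    H, inlier_mask = cv2.findHomography(pts_a, pts_b, cv2.RANSAC, thresh)
    keep = inlier_mask.ravel().astype(bool)
    return pts_a[keep], pts_b[keep]

# Synthetic stand-in for dense matches between two nearby frames:
# a consistent shift plus noise, with 10% gross outliers mixed in.
rng = np.random.default_rng(0)
pts_a = rng.uniform(0, 640, size=(500, 2)).astype(np.float32)
pts_b = (pts_a + [5.0, 2.0] + rng.normal(0, 0.3, (500, 2))).astype(np.float32)
pts_b[:50] = rng.uniform(0, 640, size=(50, 2))     # corrupt 50 matches
a, b = filter_matches_robust(pts_a, pts_b)
print(f"kept {len(a)} of 500 candidate correspondences")
```

Only correspondences that survive this kind of geometric consistency check are kept as labels, which is what lets the framework mine reliable supervision from unlabeled video.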

Searching the Space of High-Value Molecules Using Reinforcement Learning and Language Models

Reinforcement learning (RL) over text representations can be effective for finding high-value policies that can search over graphs. However, RL requires careful structuring of the search space and algorithm design to be effective in this domain. Through extensive experiments, this work explores how different design choices for text grammar and algorithmic choices for training can affect an RL policy's ability to generate molecules with desired properties. The researchers arrived at a new RL-based molecular design algorithm (ChemRLformer) and performed a thorough analysis using 25 molecule design tasks, including computationally complex protein docking simulations. This analysis uncovered unique insights into this problem space and shows that ChemRLformer achieves state-of-the-art performance while being more straightforward than prior work, demystifying which design choices are actually helpful for text-based molecule design.
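
Here is a minimal sketch of the underlying recipe, RL over text for molecule design: a tiny autoregressive policy writes SMILES-like strings and is updated with REINFORCE against a property reward. The toy vocabulary, GRU policy, and stand-in reward function are illustrative assumptions, not ChemRLformer's actual grammar, architecture, or property oracles:

```python
# Minimal sketch of RL over text for molecule design: an autoregressive
# policy writes SMILES-like strings, scored by a property reward and
# updated with REINFORCE. All components are illustrative stand-ins.
import torch
import torch.nn as nn
from torch.distributions import Categorical

VOCAB = list("CNO()=1") + ["<eos>"]                # toy SMILES-ish alphabet
V, EOS = len(VOCAB), len(VOCAB) - 1

class Policy(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.emb = nn.Embedding(V, d)
        self.rnn = nn.GRU(d, d, batch_first=True)
        self.out = nn.Linear(d, V)

    def sample(self, max_len=20):
        tok = torch.zeros(1, 1, dtype=torch.long)  # fixed start token
        h, toks, logps = None, [], []
        for _ in range(max_len):
            z, h = self.rnn(self.emb(tok), h)
            dist = Categorical(logits=self.out(z[:, -1]))
            a = dist.sample()
            logps.append(dist.log_prob(a))
            if a.item() == EOS:
                break
            toks.append(a.item())
            tok = a.unsqueeze(0)                   # feed sampled token back in
        return "".join(VOCAB[t] for t in toks), torch.stack(logps).sum()

def reward(s):
    """Stand-in property oracle; a real run would score molecules with
    chemistry tools (e.g., QED or docking), as in the paper's tasks."""
    return s.count("C") - 2.0 * abs(len(s) - 10)

policy = Policy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
baseline = 0.0
for step in range(200):
    s, logp = policy.sample()
    r = reward(s)
    baseline = 0.9 * baseline + 0.1 * r            # moving-average baseline
    loss = -(r - baseline) * logp                  # REINFORCE objective
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The paper's contribution lies in systematically testing which choices in this loop (text grammar, reward shaping, algorithm details) actually matter at scale.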

Towards Foundation Models for Knowledge Graph Reasoning

Foundation models in language and vision have the ability to run inference on any textual and visual inputs thanks to transferable representations, such as a vocabulary of tokens in language. Knowledge graphs (KGs) have different entity and relation vocabularies that generally do not overlap. The key challenge of designing foundation models on KGs is to learn such transferable representations that enable inference on any graph with arbitrary entity and relation vocabularies. This work presents a step towards such foundation models with ULTRA, an approach for learning universal and transferable graph representations. ULTRA builds relational representations as a function conditioned on their interactions. Such a conditioning strategy allows a pre-trained ULTRA model to inductively generalize to any unseen KG with any relation vocabulary and to be fine-tuned on any graph. Researchers conducted link prediction experiments on 57 different KGs and found that the zero-shot inductive inference performance of a single pre-trained ULTRA model on unseen graphs of various sizes is often on par with or better than strong baselines trained on specific graphs; fine-tuning further boosts performance.
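
To illustrate the conditioning idea, the sketch below constructs a "graph of relations" whose edges record how relations interact through shared entities (head-to-head, head-to-tail, tail-to-head, tail-to-tail). This covers only the input-construction step under simplifying assumptions (inverse relations are omitted); the GNN that turns this graph into relation representations, and the entity-level reasoner it conditions, are not shown:

```python
# Minimal sketch of a "graph of relations": edges record how relations
# interact via shared entities. An illustrative simplification of the
# construction ULTRA conditions on, not the paper's full pipeline.
from collections import defaultdict
from itertools import combinations

def relation_graph(triples):
    """triples: (head, relation, tail) facts. Returns interaction edges
    (rel_i, rel_j, type) with type in {h2h, h2t, t2h, t2t}."""
    heads, tails = defaultdict(set), defaultdict(set)
    for h, r, t in triples:
        heads[r].add(h)
        tails[r].add(t)
    rels = sorted(set(heads) | set(tails))
    edges = []
    for ri, rj in combinations(rels, 2):
        if heads[ri] & heads[rj]: edges.append((ri, rj, "h2h"))  # share a head
        if heads[ri] & tails[rj]: edges.append((ri, rj, "h2t"))  # head meets tail
        if tails[ri] & heads[rj]: edges.append((ri, rj, "t2h"))  # tail meets head
        if tails[ri] & tails[rj]: edges.append((ri, rj, "t2t"))  # share a tail
    return edges

kg = [("alice", "works_at", "intel"),
      ("intel", "located_in", "usa"),
      ("alice", "lives_in", "usa")]
print(relation_graph(kg))
# Interactions are defined structurally, not by relation names, so a model
# conditioned on them can transfer to graphs with unseen vocabularies.
```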

About the Author
Scott Bair is a Senior Technical Creative Director for Intel Labs, chartered with growing awareness of Intel’s leading-edge research activities, such as AI, neuromorphic computing, and quantum computing. Scott is responsible for driving marketing strategy, messaging, and asset creation for Intel Labs and its joint-research activities. In addition to his work at Intel, he has a passion for audio technology and is an active father of five children. Scott has over 23 years of experience in the computing industry bringing new products and technologies to market. During his 15 years at Intel, he has worked in a variety of roles spanning R&D, architecture, strategic planning, product marketing, and technology evangelism. Scott holds an undergraduate degree in Electrical and Computer Engineering and a Master of Business Administration from Brigham Young University.