Scott Bair is a key voice at Intel Labs, sharing insights into innovative research for inventing tomorrow’s technology.
Highlights:
- The thirty-seventh Conference on Neural Information Processing Systems (NeurIPS 2023) will run from Sunday December 10th through Saturday December 16th at the New Orleans Ernest N. Morial Convention Center.
- This year, Intel Labs presents 31 papers at NeurIPS, including 12 at the main conference. Contributions include training encoding models that can transfer across fMRI responses to stories and movies, a scalable framework for automatic generation of counterfactual examples using text-to-image diffusion models, and the first learning-based method for event-to-point cloud registration.
- Intel Labs also organized the AI for Accelerated Materials Discovery (AI4Mat) Workshop.
The thirty-seventh Conference on Neural Information Processing Systems (NeurIPS 2023) will run from Sunday December 10th through Saturday December 16th at the New Orleans Ernest N. Morial Convention Center.
This year, Intel Labs is presenting 31 papers at NeurIPS, including 12 at the main conference. Researchers will also present papers at several workshops, including this year’s AI for Accelerated Materials Discovery (AI4Mat) Workshop. Intel Labs is proud to have organized the workshop and to present nine works at the event. The workshop provides a platform for AI researchers and materials scientists to tackle the cutting-edge challenges in AI-driven materials discovery and development. It aims to create an opportunity for deep, thoughtful discussion among researchers working on these interdisciplinary topics, and to highlight ongoing challenges in the field.
Intel Labs’ research contributions include training encoding models that can transfer across fMRI responses to stories and movies, a scalable framework for automatic generation of counterfactual examples using text-to-image diffusion models, and the first learning-based method for event-to-point cloud registration.
Main Conference Papers
A*Net: A Scalable Path-based Reasoning Approach for Knowledge Graphs
Reasoning on large-scale knowledge graphs has long been dominated by embedding methods. While path-based methods possess the inductive capacity that embeddings lack, their scalability is limited by the exponential number of paths. This work presents A*Net, a scalable path-based method for knowledge graph reasoning. Inspired by the A* algorithm for shortest path problems, A*Net learns a priority function to select important nodes and edges at each iteration, reducing the time and memory footprint of both training and inference. The ratio of selected nodes and edges can be specified to trade off between performance and efficiency. Experiments on both transductive and inductive knowledge graph reasoning benchmarks show that A*Net achieves competitive performance with existing state-of-the-art path-based methods, while visiting merely 10% of nodes and 10% of edges at each iteration. On the million-scale dataset ogbl-wikikg2, A*Net not only achieves a new state-of-the-art result, but also converges faster than embedding methods. A*Net is the first path-based method for knowledge graph reasoning at such scale.
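The per-iteration selection step can be illustrated with a minimal Python sketch. This is not the A*Net implementation: the priority scores below are hard-coded stand-ins for the output of the learned priority function, edges are filtered only by their source node, and the helper names `select_top_ratio` and `filter_edges` are hypothetical. The sketch only shows how a single ratio parameter controls how much of the graph is visited.

```python
import math

def select_top_ratio(priority, ratio):
    """Keep the highest-priority fraction of nodes.

    priority: dict mapping node id -> score (assumed to come from a learned
              priority function; hard-coded in this sketch).
    ratio:    fraction of nodes to keep, in (0, 1].
    """
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    # Always keep at least one node so propagation can continue.
    k = max(1, math.ceil(ratio * len(priority)))
    # Sort by descending score; break ties by node id for determinism.
    ranked = sorted(priority, key=lambda n: (-priority[n], n))
    return set(ranked[:k])

def filter_edges(edges, kept_nodes):
    """Keep only edges whose source node was selected (a simplification)."""
    return [(u, v) for (u, v) in edges if u in kept_nodes]

# Toy graph: the scores are illustrative, not model outputs.
scores = {"a": 0.9, "b": 0.1, "c": 0.7, "d": 0.3, "e": 0.5}
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "a")]

kept = select_top_ratio(scores, ratio=0.5)
active_edges = filter_edges(edges, kept)
print(sorted(kept), active_edges)
```

Lowering `ratio` shrinks the set of nodes and edges touched per iteration, which is the performance-versus-efficiency trade-off the summary describes.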
Brain Encoding Models based on Multimodal Transformers can Transfer Across Language and Vision
Encoding models have been used to assess how the human brain represents concepts in language and vision. While language and vision rely on similar concept representations, current encoding models are typically trained and tested on brain responses to each modality in isolation. Recent advances in multimodal pretraining have produced transformers that can extract aligned representations of concepts in language and vision. In this work, researchers used representations from multimodal transformers to train encoding models that can transfer across fMRI responses to stories and movies. Results showed that encoding models trained on brain responses to one modality can successfully predict brain responses to the other modality, particularly in cortical regions that represent conceptual meaning. Further analysis of these encoding models revealed shared semantic dimensions that underlie concept representations in language and vision. Comparing encoding models trained using representations from multimodal and unimodal transformers, we found that multimodal transformers learn more aligned representations of concepts in language and vision. The results demonstrate how multimodal transformers can provide insights into the brain's capacity for multimodal processing.
Causal Interpretation of Self-Attention in Pre-Trained Transformers
This paper proposes a causal interpretation of self-attention in the Transformer neural network architecture. Researchers interpret self-attention as a mechanism that estimates a structural equation model for a given input sequence of symbols (tokens). The structural equation model can be interpreted, in turn, as a causal structure over the input symbols under the specific context of the input sequence. Importantly, this interpretation remains valid in the presence of latent confounders. Following this interpretation, researchers estimate conditional independence relations between input symbols by calculating partial correlations between their corresponding representations in the deepest attention layer. This enables learning the causal structure over an input sequence using existing constraint-based algorithms. In this sense, existing pre-trained Transformers can be utilized for zero-shot causal-discovery. The work demonstrates this method by providing causal explanations for the outcomes of Transformers in two tasks: sentiment classification (NLP) and recommendation.
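To make the partial-correlation step concrete, here is a small Python sketch using the standard first-order partial correlation formula on toy scalar series. It is an illustration under stated assumptions, not the paper's procedure: the paper works with token representations from an attention layer, whereas the data and the helper names `pearson` and `partial_corr` here are made up for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Toy data: z is a common cause of x and y. The two noise series are
# constructed to be uncorrelated with z and with each other.
z = [1, 2, 3, 4, 5, 6, 7, 8]
noise_x = [1, -1, -1, 1, 1, -1, -1, 1]
noise_y = [1, -1, 1, -1, -1, 1, -1, 1]
x = [a + b for a, b in zip(z, noise_x)]
y = [a + b for a, b in zip(z, noise_y)]

r_xy = pearson(x, y)                 # strong marginal correlation
r_xy_given_z = partial_corr(x, y, z)  # vanishes once z is accounted for
print(round(r_xy, 3), round(r_xy_given_z, 3))
```

A near-zero partial correlation is the kind of conditional independence signal that a constraint-based causal discovery algorithm uses to remove an edge between two variables.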
ClimateSet: A Large-Scale Climate Model Dataset for Machine Learning
Climate models have been key for assessing the impact of climate change and simulating future climate scenarios. The machine learning (ML) community has taken an increased interest in supporting climate scientists' efforts on various tasks such as climate model emulation, downscaling, and prediction tasks. Many of those tasks have been addressed on datasets created with single climate models. However, both the climate science and ML communities have suggested that to address those tasks at scale, we need large, consistent, and ML-ready climate model datasets. Here, researchers introduce ClimateSet, a dataset containing the inputs and outputs of 36 climate models from the Input4MIPs and CMIP6 archives. In addition, they provide a modular dataset pipeline for retrieving and preprocessing additional climate models and scenarios. This work showcases the potential of our dataset by using it as a benchmark for ML-based climate model emulation. The paper provides new insights about the performance and generalization capabilities of the different ML models by analyzing their performance across different climate models. Furthermore, the dataset can be used to train an ML emulator on several climate models instead of just one. Such a "super emulator" can quickly project new climate change scenarios, complementing existing scenarios already provided to policymakers. Researchers believe ClimateSet will create the basis needed for the ML community to tackle climate-related tasks at scale.
COCO-Counterfactuals: Automatically Constructed Counterfactual Examples for Image-Text Pairs
Counterfactual examples have proven to be valuable in the field of natural language processing (NLP) for both evaluating and improving the robustness of language models to spurious correlations in datasets. Despite their demonstrated utility for NLP, multimodal counterfactual examples have been relatively unexplored due to the difficulty of creating paired image-text data with minimal counterfactual changes. To address this challenge, researchers introduce a scalable framework for automatic generation of counterfactual examples using text-to-image diffusion models. They use the framework to create COCO-Counterfactuals, a multimodal counterfactual dataset of paired image and text captions based on the MS-COCO dataset. Researchers validate the quality of COCO-Counterfactuals through human evaluations and show that existing multimodal models are challenged by our counterfactual image-text pairs. Additionally, they demonstrate the usefulness of COCO-Counterfactuals for improving out-of-domain generalization of multimodal vision-language models via training data augmentation.
CorresNeRF: Image Correspondence Priors for Neural Radiance Fields
Neural implicit representations in Neural Radiance Fields (NeRF) have achieved impressive results in novel view synthesis and surface reconstruction tasks. However, their performance suffers under challenging scenarios with sparse input views. This work presents CorresNeRF, a method to leverage image correspondence priors computed by off-the-shelf methods to supervise the training of NeRF. These correspondence priors are first augmented and filtered with our adaptive algorithm. Then they are injected into the training process by adding loss terms on the reprojection error and depth error of the correspondence points. Researchers evaluate the methods on novel view synthesis and surface reconstruction tasks with density-based and SDF-based neural implicit representations across different datasets. Results show that this simple yet effective technique can be applied as a plug-and-play module to improve the performance of NeRF under sparse-view settings across different NeRF variants. The experiments show that this method outperforms previous methods in both photometric and geometric metrics. The source code is available at https://github.com/yxlao/corres-nerf.
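A reprojection error for a single correspondence can be sketched in a few lines of Python. This is a simplified illustration, not the CorresNeRF loss: it assumes a pinhole camera with one focal length `f` and principal point `c`, takes the depth as a given number rather than rendering it from a radiance field, and all function names are hypothetical.

```python
import math

def unproject(pixel, depth, f, c):
    """Back-project a pixel with known depth to a 3D point (pinhole model)."""
    u, v = pixel
    return ((u - c[0]) * depth / f, (v - c[1]) * depth / f, depth)

def transform(point, R, t):
    """Apply the rigid transform x' = R x + t (R is a 3x3 nested list)."""
    return tuple(
        sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)
    )

def project(point, f, c):
    """Project a 3D point in the camera frame to pixel coordinates."""
    x, y, z = point
    return (f * x / z + c[0], f * y / z + c[1])

def reprojection_error(pix1, depth1, pix2, R, t, f, c):
    """Pixel distance between a match in view 2 and the reprojection of
    the view-1 pixel, lifted to 3D with depth1 and moved into view 2."""
    point_in_view2 = transform(unproject(pix1, depth1, f, c), R, t)
    u, v = project(point_in_view2, f, c)
    return math.hypot(u - pix2[0], v - pix2[1])

# Toy setup: view 2 is view 1 shifted sideways, with no rotation.
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
shift = (-0.2, 0.0, 0.0)
f, c = 100.0, (50.0, 50.0)

good = reprojection_error((60, 50), 2.0, (50, 50), identity, shift, f, c)
bad = reprojection_error((60, 50), 4.0, (50, 50), identity, shift, f, c)
print(good, bad)
```

With the consistent depth the error is zero; with the wrong depth the reprojected pixel lands away from its match, which is the signal a correspondence-based loss term penalizes during training.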
DiffPack: A Torsional Diffusion Model for Autoregressive Protein Side-Chain Packing
Proteins play a critical role in carrying out biological functions, and their 3D structures are essential in determining their functions. Accurately predicting the conformation of protein side-chains given their backbones is important for applications in protein structure prediction, design and protein-protein interactions. Traditional methods are computationally intensive and have limited accuracy, while existing machine learning methods treat the problem as a regression task and overlook the restrictions imposed by the constant covalent bond lengths and angles. In this work, researchers present DiffPack, a torsional diffusion model that learns the joint distribution of side-chain torsional angles, the only degrees of freedom in side-chain packing, by diffusing and denoising on the torsional space. To avoid issues arising from simultaneous perturbation of all four torsional angles, we propose autoregressively generating the four torsional angles from χ1 to χ4 and training diffusion models for each torsional angle. Researchers evaluate the method on several benchmarks for protein side-chain packing and show that the proposed method achieves improvements of 11.9% and 13.5% in angle accuracy on CASP13 and CASP14, respectively, with a significantly smaller model size (60x fewer parameters). Additionally, they show the effectiveness of the method in enhancing side-chain predictions in the AlphaFold2 model.
Don’t Just Prune by Magnitude! Your Mask Topology is Another Secret Weapon
Recent years have witnessed significant progress in understanding the relationship between the connectivity of a deep network's architecture as a graph and the network's performance. A few prior works connected deep architectures to expander graphs or Ramanujan graphs and, in particular, demonstrated the use of such graph connectivity measures for ranking the relative performance of various sparse sub-networks (i.e., models with prune masks) without the need for training. However, no prior work explicitly explores the role of parameters in the graph's connectivity, leaving the graph-based understanding of prune masks isolated from magnitude- and gradient-based pruning practice. This paper strives to fill this gap by analyzing the Weighted Spectral Gap of Ramanujan structures in sparse neural networks and investigating its correlation with final performance. Researchers specifically examine the evolution of sparse structures under a popular dynamic sparse-to-sparse network training scheme, and intriguingly find that the generated random topologies inherently maximize Ramanujan graphs. They also identify a strong correlation between masks, performance, and the weighted spectral gap. Leveraging this observation, researchers propose to construct a new "full-spectrum coordinate" aiming to comprehensively characterize a sparse neural network's promise. Concretely, it consists of the classical Ramanujan gap (structure), the proposed weighted spectral gap (parameters), and the constituent nested regular graphs within. In this new coordinate system, a sparse subnetwork's L2-distance from its original initialization is found to be nearly linearly correlated with its performance. Finally, researchers apply this unified perspective to develop a new actionable pruning method, by sampling sparse masks to maximize the L2-coordinate distance.
The proposed method can be augmented with the "pruning at initialization" (PaI) method, and significantly outperforms existing PaI methods. With only a few iterations of training (e.g., 500 iterations), researchers obtain performance comparable to the LTH results yielded via "pruning after training", significantly saving pre-training costs. Code can be found at: https://github.com/VITA-Group/FullSpectrum-PAI.
E2PNet: Event to Point Cloud Registration with Spatio-Temporal Representation Learning
Event cameras have emerged as a promising vision sensor in recent years due to their unparalleled temporal resolution and dynamic range. While registration of 2D RGB images to 3D point clouds is a long-standing problem in computer vision, no prior work studies 2D-3D registration for event cameras. To this end, researchers propose E2PNet, the first learning-based method for event-to-point cloud registration. The core of E2PNet is a novel feature representation network called Event-Points-to-Tensor (EP2T), which encodes event data into a 2D grid-shaped feature tensor. This grid-shaped feature enables matured RGB-based frameworks to be easily used for event-to-point cloud registration, without changing hyper-parameters and the training procedure. EP2T treats the event input as spatio-temporal point clouds. Unlike standard 3D learning architectures that treat all dimensions of point clouds equally, the novel sampling and information aggregation modules in EP2T are designed to handle the inhomogeneity of the spatial and temporal dimensions. Experiments on the MVSEC and VECtor datasets demonstrate the superiority of E2PNet over hand-crafted and other learning-based methods. Compared to RGB-based registration, E2PNet is more robust to extreme illumination or fast motion due to the use of event data. Beyond 2D-3D registration, researchers also show the potential of EP2T for other vision tasks such as flow estimation, event-to-image reconstruction and object recognition. The source code can be found at: https://github.com/Xmu-qcj/E2PNet.
GLEMOS: Benchmark for Instantaneous Graph Learning Model Selection
The choice of a graph learning (GL) model (i.e., a GL algorithm and its hyperparameter settings) has a significant impact on the performance of downstream tasks. However, selecting the right GL model becomes increasingly difficult and time-consuming as more and more GL models are developed. Accordingly, it is of great significance and practical value to equip users of GL with the ability to perform a near-instantaneous selection of an effective GL model without manual intervention. Despite the recent attempts to tackle this important problem, there has been no comprehensive benchmark environment to evaluate the performance of GL model selection methods. To bridge this gap, researchers present GLEMOS in this work, a comprehensive benchmark for instantaneous GL model selection that makes the following contributions. (i) GLEMOS provides extensive benchmark data for fundamental GL tasks, i.e., link prediction and node classification, including the performances of 366 models on 457 graphs on these tasks. (ii) GLEMOS designs multiple evaluation settings and assesses how effectively representative model selection techniques perform in these different settings. (iii) GLEMOS is designed to be easily extended with new models, new graphs, and new performance records. (iv) Based on the experimental results, researchers discuss the limitations of existing approaches and highlight future research directions. To promote research on this significant problem, the benchmark data and code are publicly available at https://namyongpark.github.io/glemos.
Improving Systematic Generalization using Simplicial Embeddings and Iterated Learning
Compositional generalization, the ability of an agent to generalize to unseen combinations of latent factors, is easy for humans but hard for deep neural networks. A line of research in cognitive science has hypothesized a process, “iterated learning,” to help explain how human language developed this ability; the theory rests on simultaneous pressures towards compressibility (when an ignorant agent learns from an informed one) and expressivity (when it uses the representation for downstream tasks). Inspired by this process, researchers propose to improve the compositional generalization of deep networks by using iterated learning on models with simplicial embeddings, which can approximately discretize representations. This approach is further motivated by an analysis of compositionality based on Kolmogorov complexity. The paper shows that this combination of changes improves compositional generalization over other approaches, demonstrating these improvements both on vision tasks with well-understood latent factors and on real molecular graph prediction tasks where the latent structure is unknown.
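One way to see how a simplicial embedding can approximately discretize a representation is the Python sketch below. It assumes the commonly described form of the technique, a softmax applied separately to each group of a vector, with a temperature that controls sharpness. The summary above does not spell out these details, so the group layout, the temperature value, and the function names are assumptions for illustration only.

```python
import math

def softmax(xs, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def simplicial_embedding(z, num_groups, temperature=1.0):
    """Split z into equal groups and apply a softmax to each group, so
    every group becomes a point on a probability simplex."""
    if len(z) % num_groups != 0:
        raise ValueError("len(z) must be divisible by num_groups")
    size = len(z) // num_groups
    out = []
    for g in range(num_groups):
        out.extend(softmax(z[g * size:(g + 1) * size], temperature))
    return out

# Toy 6-dimensional representation split into 2 groups of 3.
z = [2.0, 0.0, -1.0, 0.5, 0.5, 3.0]
soft = simplicial_embedding(z, num_groups=2, temperature=1.0)
sharp = simplicial_embedding(z, num_groups=2, temperature=0.1)
print([round(v, 3) for v in soft])
print([round(v, 3) for v in sharp])
```

At a low temperature each group is close to a one-hot vector, so the representation behaves like a small set of discrete symbols while remaining differentiable.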
PERFOGRAPH: A Numerical Aware Program Graph Representation for Performance Optimization and Program Analysis
The remarkable growth and significant success of machine learning have expanded its applications into programming languages and program analysis. However, a key challenge in adopting the latest machine learning methods is the representation of programming languages, which directly impacts the ability of machine learning methods to reason about programs. In previous representation work, the absence of numerical awareness and composite data structure information, along with an improper way of presenting variables, has limited performance. To overcome the limitations and challenges of current program representations, this work proposes a novel graph-based program representation called PERFOGRAPH. PERFOGRAPH can capture numerical information and the composite data structure by introducing new nodes and edges. Furthermore, researchers propose an adapted embedding method to incorporate numerical awareness. These enhancements make PERFOGRAPH a highly flexible and scalable representation that can effectively capture intricate program dependencies and semantics. Consequently, it serves as a powerful tool for various applications such as program analysis, performance optimization, and parallelism discovery. Experimental results demonstrate that PERFOGRAPH outperforms existing representations and sets new state-of-the-art results by reducing the error rate by 7.4% (AMD dataset) and 10% (NVIDIA dataset) in the well-known Device Mapping challenge. It also sets new state-of-the-art results in various performance optimization tasks like Parallelism Discovery and NUMA and Prefetchers Configuration prediction.
AI4Mat Workshop
Data Efficient Training for Materials Property Prediction Using Active Learning Querying
EGraFFBench: Evaluation of Equivariant Graph Neural Network Force Fields for Atomistic Simulations
HoneyBee: Progressive Instruction Finetuning of Large Language Models for Materials Science
Learning Conditional Policy for Crystal Design using Offline Reinforcement Learning
MatSciML: A Broad, Multi-Task Benchmark for Solid-State Materials Modeling
On the importance of catalyst-adsorbate 3D interactions for relaxed energy prediction
Searching the Space of High-Value Molecules Using Reinforcement Learning and Language Models
Towards equilibrium molecular conformation generation with GFlowNets
Other Workshop Papers
LDM3D-VR: Latent Diffusion Model for 3D
WOUAF: Weight Modulation for User Attribution and Fingerprinting in Text-to-Image Diffusion Models
Zero-shot Conversational Summarization Evaluations with small Large Language Models
Fusing Models with Complementary Expertise
InstaTune: Instantaneous Neural Architecture Search During Fine-Tuning
Towards Causal Representations of Climate Model Data
Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks
Probing Intersectional Biases in Vision-Language Models with Counterfactual Examples
Towards foundational models for knowledge graph reasoning
PDB-Struct: A Comprehensive Benchmark for Structure-based Protein Design