
ULTRA: Foundation Models for Knowledge Graph Reasoning

Michael_Galkin

Training a single generic model for solving arbitrary datasets has always been a dream of ML researchers, especially in the era of foundation models. While such dreams have been realized in perception domains like images or natural language, whether they can be reproduced in reasoning domains (like graphs) remains an open challenge.

 

In this blog post, we show that such a generic reasoning model exists, at least for knowledge graphs (KGs). We create ULTRA, a single pre-trained reasoning model that generalizes to new KGs with arbitrary entity and relation vocabularies and serves as a default solution for any KG reasoning problem.


This post is based on our recent paper (preprint) and was written together with Xinyu Yuan (Mila), Zhaocheng Zhu (Mila), and Bruno Ribeiro (Purdue / Stanford).

 

Why KG representation learning is stuck in 2018

 

The pretrain-finetune paradigm has been with us since 2018, when ELMo and ULMFiT showed the first promising results, later cemented by BERT and GPT.

In the era of large language models (LLMs) and more general foundation models (FMs), we often have a single model (like GPT-4 or Llama-2) pre-trained on enormous amounts of data and capable of performing a sheer variety of language tasks in a zero-shot manner (or at least after fine-tuning on a specific dataset). These days, multimodal FMs even support language, vision, audio, and other modalities in one and the same model.

 

Things work a little differently in Graph ML. In particular, what's up with representation learning on KGs at the end of 2023? The main tasks here are edge-level:

  • Entity prediction (also known as knowledge graph completion) (h, r, ?): given a head node and a relation, rank all nodes in the graph that can potentially be true tails.
  • Relation prediction (h, ?, t): given two nodes, predict the relation type between them.


It turns out that, up until now, the field has been stuck somewhere pre-2018. The key problem is:

 

> Each KG has its own set of entities and relations; there is no single pre-trained model that would transfer to any graph.

 

For example, if we look at Freebase (the KG behind the Google Knowledge Graph) and Wikidata (the largest open-source KG), they have completely different sets of entities (86M vs 100M) and relations (1,500 vs 6,000). Is there any hope for current KG representation learning methods to be trained on one graph and transfer to another?

[Image: Michael_Galkin_1-1701268422183.png]

 

Classical transductive methods like TransE, ComplEx, RotatE, and hundreds of other embedding-based approaches learn a fixed set of entities and relation types from the training graph and cannot even support new nodes added to the same graph. Shallow embedding-based methods do not transfer (in fact, we believe there is no point in developing such methods anymore, except as student project exercises).
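To make the transductive limitation concrete, here is a minimal sketch of a TransE-style scorer in PyTorch (illustrative names and sizes, not code from any of the papers above). Every entity and relation ID maps to a learned embedding row, so an entity outside the training vocabulary simply has no row to look up:

```python
import torch
import torch.nn as nn

class TransE(nn.Module):
    """Shallow transductive KG embedding: one learned vector per entity/relation ID."""
    def __init__(self, num_entities: int, num_relations: int, dim: int = 128):
        super().__init__()
        self.entity = nn.Embedding(num_entities, dim)     # fixed entity vocabulary
        self.relation = nn.Embedding(num_relations, dim)  # fixed relation vocabulary

    def score(self, h, r, t):
        # TransE intuition: (h, r, t) is plausible if h + r is close to t
        return -(self.entity(h) + self.relation(r) - self.entity(t)).norm(p=1, dim=-1)

model = TransE(num_entities=14_541, num_relations=237)  # e.g., FB15k-237 sizes
# Entity prediction (h, r, ?): score the query against every entity in the vocabulary.
h, r = torch.tensor([0]), torch.tensor([5])
scores = model.score(h.expand(14_541), r.expand(14_541), torch.arange(14_541))
# Any new entity (ID >= 14_541) has no embedding row: transfer is impossible by design.
```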

 

🟡 Inductive entity methods like NodePiece and Neural Bellman-Ford Nets do not learn entity embeddings. Instead, they parameterize training (seen) and inference (unseen) nodes as a function of a fixed set of relations. Since they learn only relation embeddings, they can transfer to graphs with new nodes, but transfer to new graphs with different relations (like Freebase to Wikidata) is still beyond reach.



[Image: Michael_Galkin_2-1701268422186.png]

 

What to do if you have both new entities and new relations at inference time (a completely new graph)? If you don't learn entity or relation embeddings, is transfer theoretically possible? Let's look into the theory then.

 

Theory: What makes a model inductive and transferable?

Let’s define the setup more formally:

  • KGs are directed, multi-relational graphs with arbitrary sets of nodes and relation types.
  • Graphs arrive without features; that is, we don't assume the existence of textual descriptions (or pre-computed feature vectors) for entities and relations.
  • Given a query (head, relation, ?), we want to rank all nodes in the underlying graph (inference graph) and maximize the probability of returning a true tail.
  • Transductive setup: the sets of nodes and relations are the same at training and inference time.
  • Inductive (entity) setup: the set of relations is fixed at training time, but nodes can differ between training and inference.
  • Inductive (entity and relation) setup: both new, unseen entities and relations are allowed at inference.
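For concreteness, here is one compact way to write these three setups down (our notation, not taken verbatim from the paper):

```latex
% Training and inference graphs over entity sets V, V' and relation sets R, R':
\mathcal{G}_{\text{train}} = (\mathcal{V}, \mathcal{R}, \mathcal{E}), \qquad
\mathcal{G}_{\text{inf}}   = (\mathcal{V}', \mathcal{R}', \mathcal{E}'), \qquad
\mathcal{E} \subseteq \mathcal{V} \times \mathcal{R} \times \mathcal{V}

% Transductive:                 V' = V,      R' = R
% Inductive (entity):           V' \neq V,   R' \subseteq R
% Inductive (entity, relation): V' \neq V,   R' \neq R  (possibly disjoint)
```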

 

What do neural networks learn in order to generalize to new data? The primary reference here – the book on Geometric Deep Learning by Bronstein, Bruna, Cohen, and Veličković – posits that it is a question of symmetries and invariances.

 

What are the learnable invariances in foundation models? LLMs learn a fixed vocabulary of tokens (sub-word units, bytes, or even randomly initialized vectors as in Lexinvariant LLMs), vision models learn functions to project image patches, and audio models learn to project audio patches.

 

> What are the learnable invariances for multi-relational graphs?

 

First, we will introduce the invariances (equivariances) in standard homogeneous graphs. 

 

Standard (single) permutation equivariant graph models: A great leap in graph ML came when early GNN work (Scarselli et al. 2008, Xu et al. 2018, Morris et al. 2018) showed that inductive tasks on graphs benefit enormously from assuming that vertex IDs are arbitrary, so that the predictions of a graph model should not change if we reassign vertex IDs. This is known as permutation equivariance of the neural network with respect to node IDs. This realization created great excitement and a profusion of novel graph representation methods, since as long as the neural network is equivariant to node ID permutations, we can call it a graph model.

 

[Image: Michael_Galkin_3-1701268422200.png]

 

Single-relational graphs. GNNs are equivariant to node permutations: Michael Jackson’s node vector will have the same value even after re-labeling node IDs. Image by Authors.

 

Permutation equivariance on node IDs allows GNNs to inductively (zero-shot) transfer the patterns learned from a training graph to another (different) test graph. This is a consequence of the equivariance: since the neural network cannot use node IDs to produce embeddings, it must use the graph structure. This creates what we know as structural representations of graphs (see Srinivasan & Ribeiro, ICLR 2020).
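This property is easy to verify numerically. The toy check below (our illustration, not tied to any specific GNN library) runs a one-layer sum-aggregation GNN on a graph and on a randomly relabeled copy of it, and confirms that the outputs match up to the same permutation:

```python
import torch

torch.manual_seed(0)
n, d = 5, 4
A = (torch.rand(n, n) < 0.4).float()   # random adjacency matrix
X = torch.randn(n, d)                  # node features
W = torch.randn(d, d)                  # shared layer weights

def gnn_layer(A, X):
    # Sum-aggregation message passing: each node pools its neighbors, then projects
    return torch.relu(A @ X @ W)

P = torch.eye(n)[torch.randperm(n)]    # random permutation matrix (node relabeling)

out = gnn_layer(A, X)
out_relabel = gnn_layer(P @ A @ P.T, P @ X)   # same graph, node IDs reassigned

# Permutation equivariance: relabeling the inputs just relabels the outputs.
assert torch.allclose(out_relabel, P @ out, atol=1e-5)
```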

 

Equivariance in multi-relational graphs

 

Now the edges in the graph might have different relation types – is there any GNN theory for such graphs? In our previous work, Weisfeiler and Leman Go Relational (with Pablo Barceló, Christopher Morris, and Miguel Romero Orth, LoG 2022), we derived Relational WL – a WL expressiveness hierarchy for multi-relational graphs focusing mostly on node-level tasks. The great follow-up work by Huang et al. (NeurIPS 2023) extended the theory to link prediction and formalized conditional message passing and logical expressiveness using Relational WL. Keep conditional message passing in mind – we'll need it later – as it provably improves link prediction performance.

Their proposed addition of a global readout vector induced by incoming/outgoing edge direction resembles the recent work of Emanuele Rossi et al. on directionality in homogeneous MPNNs (read the blog post on Medium for more details). Still, those works do not envision the case where even the relations at test time are unseen.

 

Double permutation equivariant (multi-relational) graph models: Recently, Gao et al. 2023 proposed the concept of double equivariance for multi-relational graphs. Double equivariance forces the neural network to be equivariant to joint permutations of both node IDs and relation IDs. This ensures the neural network learns structural patterns between nodes and relations, which allows it to inductively (zero-shot) transfer the learned patterns to another graph with new nodes and new relations.
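Stated schematically (our paraphrase of the definition, viewing a multi-relational graph as a binary adjacency tensor):

```latex
% A is the adjacency tensor, A \in \{0,1\}^{|V| \times |R| \times |V|}.
% f is double equivariant if, for every node permutation \sigma and every
% relation permutation \tau,
f(\sigma \cdot \tau \cdot \mathbf{A}) = \sigma \cdot f(\mathbf{A}),
\quad \text{where} \quad
(\sigma \cdot \tau \cdot \mathbf{A})_{\sigma(i),\,\tau(r),\,\sigma(j)} = \mathbf{A}_{i,r,j}.
```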

 

[Image: Michael_Galkin_4-1701268422152.png]

 

Double equivariance in multi-relational graphs. Permuting both node IDs and relation IDs does not change the relational structure. Hence, the output node states should be the same (but permuted). Image by authors.

 

In our work, we identify the invariance of relation interactions: even if relation identities differ, their fundamental interactions remain the same, and those fundamental interactions can be captured by a graph of relations. In the graph of relations, each node is a relation type from the original graph. Two nodes in this graph are connected if edges with those relation types are incident in the original graph (that is, they share a head or tail node). Depending on the incidence, we distinguish 4 edge types in the graph of relations:

  • Head-to-head (h2h) – two relations can start from the same head entity
  • Tail-to-head (t2h) – the tail entity of one relation can be the head of another relation
  • Head-to-tail (h2t) – the head entity of one relation can be the tail of another relation
  • Tail-to-tail (t2t) – two relations can share the same tail entity

 

[Image: Michael_Galkin_5-1701268422166.png]

Different incidence patterns in the original graph produce different interactions in the graph of relations. Right-most: an example relation graph (inverse edges are omitted for clarity). Image by Authors.

 

A few nice properties of the relation graph:

  • It can be built from absolutely any multi-relational graph (with simple sparse matrix multiplications; see the sketch after this list).
  • The 4 fundamental interactions never change because they encode basic topology: in directed graphs there will always be head and tail nodes, and relations will always exhibit these incidence patterns.
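Here is a hedged sketch of how such a construction could look with sparse matrix multiplications (our illustration of the idea; the actual ULTRA implementation may differ). Entity-to-relation incidence matrices are built for heads and tails, and their products reveal which relation pairs share an incident entity:

```python
import torch

def build_relation_graph(edge_index, edge_type, num_nodes, num_relations):
    """edge_index: [2, E] tensor of (head, tail) entity IDs; edge_type: [E] relation IDs.
    Returns one [2, *] edge index over relation nodes per fundamental interaction."""
    heads, tails = edge_index[0], edge_index[1]
    vals = torch.ones(edge_type.numel())

    # H[v, r] = 1 if some edge of relation r starts at entity v; T likewise for tails
    H = torch.sparse_coo_tensor(torch.stack([heads, edge_type]), vals,
                                (num_nodes, num_relations)).coalesce()
    T = torch.sparse_coo_tensor(torch.stack([tails, edge_type]), vals,
                                (num_nodes, num_relations)).coalesce()

    def pairs(M):
        # (X^T @ Y)[r1, r2] > 0  <=>  relations r1 and r2 share an incident entity
        return (M.to_dense() > 0).nonzero().t()

    return {
        "h2h": pairs(torch.sparse.mm(H.t(), H)),  # same head entity
        "h2t": pairs(torch.sparse.mm(H.t(), T)),  # head of one is tail of another
        "t2h": pairs(torch.sparse.mm(T.t(), H)),  # tail of one is head of another
        "t2t": pairs(torch.sparse.mm(T.t(), T)),  # same tail entity
    }
```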

 

> Essentially, learning representations over the graph of relations can transfer to any multi-relational graph! This is the learnable invariance.

 

In fact, it can be shown (we are already working on the formal proofs, which will be available in upcoming work) that representing relations via their interactions in a graph of relations yields a double equivariant model! This means the learned relational representations are independent of identities and instead rely on the joint interactions between relations and nodes.

ULTRA: A Foundation Model for KG Reasoning

 

With all the theoretical foundations backing us up, we are now ready to introduce ULTRA, a method for unified, learnable, and transferable graph representations. ULTRA leverages the invariances (and equivariances) of the graph of relations with its fundamental interactions and applies conditional message passing to obtain relative relational representations. Perhaps the coolest fact is that

 

>  a single pre-trained ULTRA model can run 0-shot inference on any possible multi-relational graph and be fine-tuned on any graph.

 

In other words, ULTRA is pretty much a foundation model that can run inference on any graph input (with already good performance) and be fine-tuned on any target graph of interest.

 

The crucial component of ULTRA is the relative relation representations constructed from the graph of relations. Given a query (Michael Jackson, genre, ?), we first initialize the genre node in the graph of relations with the all-ones vector 1^d (all other nodes are initialized with zeros). After running a GNN, the resulting node embeddings of the relation graph are conditioned on the genre node: each query relation gets its own matrix of relational features, which is very helpful from many theoretical and practical standpoints!
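In code, this conditioning is just a labeling-trick initialization before message passing over the relation graph. A minimal sketch with illustrative names (not the actual ULTRA API):

```python
import torch

def init_relation_features(num_relations: int, query_rel: int, dim: int = 64):
    """Labeling trick: the query relation starts as all-ones, all others as zeros,
    so the GNN output over the relation graph is relative to the query relation."""
    x = torch.zeros(num_relations, dim)
    x[query_rel] = 1.0
    return x

# rel_feats = relation_gnn(relation_graph, init_relation_features(R, query_rel))
# Different query relations set different boundary conditions, so each one yields
# its own matrix of relational features.
```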



[Image: Michael_Galkin_6-1701268422196.png]

 

ULTRA employs relative relation representations (a labeling trick over the graph of relations) such that each relation (e.g., “genre”) has its own unique matrix of all relation representations. Image by authors.

 

Practically, given an input KG and a (h, r, ?) query, ULTRA executes the following steps:

  1. Construct the graph of relations;
  2. Get relation features from the conditional message passing GNN on the graph of relations (conditioned on the initialized query relation r);
  3. Use the obtained relational representations in the inductive link predictor GNN conditioned on the initialized head node h.

 

Steps 2 and 3 are implemented via slightly different modifications of the Neural Bellman-Ford network (NBFNet). ULTRA only learns embeddings of the 4 fundamental interactions (h2h, h2t, t2h, t2t) and the GNN weights – pretty small overall. The main model we experimented with has only 177k parameters.
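Putting the pieces together, the forward pass could be organized roughly as below. This is a schematic sketch under our own naming assumptions (it reuses the hypothetical build_relation_graph and init_relation_features helpers from the earlier sketches; the real implementation lives in the ULTRA repo):

```python
import torch

def ultra_forward(kg, query_head: int, query_rel: int, relation_gnn, entity_gnn):
    """Scores every entity in `kg` as a candidate tail for (query_head, query_rel, ?)."""
    # Step 1: build the graph of relations from the input KG.
    rel_graph = build_relation_graph(kg.edge_index, kg.edge_type,
                                     kg.num_nodes, kg.num_relations)

    # Step 2: conditional message passing over the relation graph, seeded with the
    # labeling-trick initialization for the query relation r.
    rel_init = init_relation_features(kg.num_relations, query_rel)
    rel_feats = relation_gnn(rel_graph, rel_init)      # one vector per relation type

    # Step 3: NBFNet-style inductive link prediction on the entity level,
    # conditioned on the head node h and using rel_feats as relation parameters.
    ent_init = torch.zeros(kg.num_nodes, rel_feats.size(-1))
    ent_init[query_head] = rel_feats[query_rel]        # condition on the head node
    return entity_gnn(kg, ent_init, rel_feats)         # one score per entity
```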

 

[Image: Michael_Galkin_7-1701268422197.png]

 

Three main steps taken by ULTRA: (1) building a relation graph; (2) running conditional message passing over the relation graph to get relative relation representations; (3) using those representations in the inductive link predictor GNN on the entity level. Image by Authors.

Experiments: best even in zero-shot inference, better after fine-tuning

We pre-trained ULTRA on 3 standard KGs based on Freebase, Wikidata, and WordNet, and ran zero-shot link prediction on 50+ other KGs of various sizes, from 1k to 120k nodes and from 2k to 1.1M edges.

 

Averaged across the datasets with a known SOTA, a single pre-trained ULTRA model is better in the zero-shot inference mode than existing SOTA models trained specifically on each graph. Fine-tuning improves the performance by a further ~10%. It's particularly remarkable that a single trained ULTRA model can scale to graphs of such different sizes (a 100x difference in node counts and 500x in edge counts), whereas GNNs are known to suffer from size generalization issues (see the prominent works by Yehudai et al., ICML 2021 and Zhou et al., NeurIPS 2022).

[Image: Michael_Galkin_8-1701268422202.png]

 

A single pre-trained ULTRA is better even in the zero-shot inference mode than supervised SOTA models trained end-to-end on specific graphs (see the Average column). Fine-tuning improves the performance even further.

 

In fact, with 57 tested graphs, we almost ran out of KGs to test ULTRA on, so if you have a fresh new benchmark hidden somewhere – let us know!

 

Scaling Behavior

We can bump the zero-shot performance even higher by adding more graphs to the pre-training mixture, although we do observe a certain performance saturation after training on 4+ graphs.

 

Scaling laws predict even better performance from bigger models trained on more high-quality data, so that's definitely on our agenda.

[Image: Michael_Galkin_9-1701268422190.png]

 

Conclusion: Code, Data, Checkpoints

 

Foundation models for KG reasoning are finally here – we are past that 2018 threshold! A single pre-trained ULTRA model can perform link prediction on any KG (multi-relational graph) from any domain. You really just need a graph with more than one edge type to get going.

 

Practically, ULTRA demonstrates very promising performance on a variety of KG benchmarks already in the zero-shot mode, and you can bump the performance even further with a short fine-tuning.

 

We make all the code, training data, and pre-trained model checkpoints available on GitHub so you can start running ULTRA on your data right away!

 

Preprint: arXiv

Code, data: GitHub repo

Checkpoints: in the GitHub repo

Project website: here

 

As a closing remark: KG reasoning represents just a fraction of the many interesting problems in the reasoning domain, and the majority still don't have a generic solution. We believe the success of KG reasoning will bring more breakthroughs in other reasoning domains (for example, we recently found that LLMs can actually learn and employ textual rules). Let's stay optimistic about the future of reasoning!

1 Comment

yhlf6

What are the differences from the work "INGRAM: Inductive Knowledge Graph Embedding via Relation Graphs" (ICLR 2023) – just the relation graph? And given the relation graph, why not use the relation patterns from "Relational Message Passing for Fully Inductive Knowledge Graph Completion" (ICDE 2022)?