Graphormer: Merging GNNs and Transformers for Cheminformatics

This is my first OPIG blog! I’m going to start with a summary of the Graphormer, a Graph Neural Network (GNN) that borrows concepts from Transformers to boost performance on graph tasks. This post is largely based on the NeurIPS paper Do Transformers Really Perform Bad for Graph Representation? by Ying et al., which introduces the Graphormer, and which we read for our last deep learning journal club. The project has since become a Microsoft Research project.

I’ll start with a cheap and cheerful summary of Transformers and GNNs before diving into the changes in the Graphormer. Enjoy!

Graphormer: the Idea

Transformers are powerful deep learning architectures for sequential data such as natural language, and they have also been applied successfully to other data modalities such as images. The Graphormer seeks to apply the Transformer architecture to graph-structured data to boost the performance of GNNs on graph-based tasks.

To achieve this, Ying et al. make three key changes to Message-Passing Neural Networks (a type of GNN). I’ll go into detail below, but in essence: they incorporate the degree of each node, they introduce a way to handle positional encodings on graphs, and they add an edge-encoding term to the attention.

Transformers

The Transformer deep-learning architecture took the world by storm in 2017. It is a powerful encoder-decoder architecture that relies on the self-attention mechanism to generate representations of the input data.

Self-attention works as follows: the input is projected into three spaces with trainable weight matrices: a query space, a key space and a value space. The query embeddings are used to weight the keys, generating an attention score for each element in the sequence. These attention scores are then used to weight the values and produce the output. Each query element returns one output; in practice, all of these are calculated at once by matrix multiplication.

This is called attention because the queries and keys indicate which elements in the sequence should be ‘attended’ to. It is further called self-attention because the keys and queries originate from the same sequence as the values. As a fun fact (to be taken with a pinch of salt), I think the terminology with keys, queries and values originates from databases, where the query selects the key and the key maps to the value.

The maths looks like this:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
(Equation 1 from Vaswani et al., 2017)

Q, K, and V are the linear embeddings of the input sequence x, such that Q = W_Qx, K = W_Kx, V = W_Vx, where the W matrices are trainable weights.

QKᵀ is the matrix product of the query and the key embeddings, which produces a weight vector for every element in the value sequence. d_k is the dimension of the key vectors, and the factor 1/sqrt(d_k) is used as a scaling normalisation. The softmax function rescales the query-key weights so that they sum to one. Multiplying by V, we get one output vector per query, stacked into a matrix.
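To make this concrete, here is a minimal sketch of scaled dot-product self-attention in PyTorch. The tensor names and dimensions are mine, chosen purely for illustration:

```python
import torch

def self_attention(x, W_q, W_k, W_v):
    """Scaled dot-product self-attention for a single sequence.

    x:             (seq_len, d_model) input embeddings
    W_q, W_k, W_v: (d_model, d_k) trainable projection matrices
    """
    Q = x @ W_q                               # queries, (seq_len, d_k)
    K = x @ W_k                               # keys,    (seq_len, d_k)
    V = x @ W_v                               # values,  (seq_len, d_k)
    scores = Q @ K.T / K.shape[-1] ** 0.5     # query-key weights, (seq_len, seq_len)
    weights = torch.softmax(scores, dim=-1)   # each row sums to one
    return weights @ V                        # one output vector per query

# toy usage: a 'sequence' of 5 elements with 16-dimensional embeddings
x = torch.randn(5, 16)
W_q, W_k, W_v = (torch.randn(16, 8) for _ in range(3))
out = self_attention(x, W_q, W_k, W_v)        # shape (5, 8)
```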

A single set of weight matrices W_Q, W_K and W_V generates only one attention output. It is empirically beneficial to run several such attention units in parallel, each with its own set of weight matrices; this is referred to as multi-head attention. The outputs of the heads are concatenated and projected once more by W^O.

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}, \qquad \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$$
(Unnumbered equation from Vaswani et al., 2017)
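In practice one rarely writes the heads out by hand; as a rough sketch, PyTorch ships a multi-head attention module that handles the per-head projections, concatenation and output projection internally (the dimensions below are illustrative):

```python
import torch
from torch import nn

d_model, n_heads, seq_len = 64, 4, 10
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads, batch_first=True)

x = torch.randn(1, seq_len, d_model)   # (batch, seq_len, d_model)
out, attn = mha(x, x, x)               # self-attention: queries, keys and values all come from x
print(out.shape)                       # torch.Size([1, 10, 64])
print(attn.shape)                      # torch.Size([1, 10, 10]), attention weights averaged over heads
```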

Self-attention has proven to be very good at capturing meaningful patterns between elements of sequences (also known as semantic correlations) and incorporating this information into the encoded representation. However, the attention output is still just a weighted combination of linear projections of the input, which limits its expressiveness. To introduce non-linearity into the Transformer, the attention output is passed through a feed-forward neural network (FFN) with a non-linear activation function such as ReLU. This consists of two fully connected layers with a ReLU activation between them:

$$\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)W_2 + b_2$$
(Equation 2 from Vaswani et al., 2017)
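A minimal sketch of this block in PyTorch (the 4x hidden expansion mirrors the original paper, but the sizes here are just for illustration):

```python
import torch
from torch import nn

d_model, d_ff = 64, 256   # hidden layer is conventionally ~4x wider than the model dimension
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.ReLU(),
    nn.Linear(d_ff, d_model),
)

x = torch.randn(10, d_model)   # one feature vector per sequence element
out = ffn(x)                   # the same FFN is applied independently at every position
```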

The depth of the Transformer’s encoder comes from stacking several of these multi-head-attention-plus-FFN blocks until a final embedding is obtained. Originally these embeddings were then decoded with a self-attention based decoder to generate text translations, but the Transformer architecture, particularly the encoder, has since been successfully applied to a multitude of data types for various ‘downstream’ tasks. Often a pre-trained encoder can be reused on alternative tasks, removing the need to retrain the whole network end-to-end.

Another thing to note is that the self-attention mechanism loses positional information: permuting the input sequence merely permutes the output, and there is no way for the model to tell which elements are close together or far apart in the sequence. To handle this, Transformer architectures use relative or absolute positional encodings that are injected (typically by superposition, i.e. addition) into the input embeddings. In the original paper, Vaswani et al. use sinusoidal functions to add positional information to the input features.
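For reference, a common implementation of the sinusoidal encoding looks roughly like this; a sketch rather than the exact code from any particular library:

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings in the style of Vaswani et al., 2017."""
    position = torch.arange(seq_len).unsqueeze(1)                                    # (seq_len, 1)
    freqs = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))  # one frequency per pair of dims
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * freqs)   # even dimensions
    pe[:, 1::2] = torch.cos(position * freqs)   # odd dimensions
    return pe

# the encodings are superposed (added) onto the token embeddings
x = torch.randn(20, 64)                               # 20 tokens, 64-dimensional embeddings
x = x + sinusoidal_positional_encoding(20, 64)
```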

That was a quick summary of Transformers; I will list some resources and papers I found useful at the bottom of this post for further reading.

Graph Neural Networks

Graph Neural Networks (GNNs) are deep learning architectures that process input data structured as a graph. A graph consists of a set of nodes or vertices V, connected to each other by edges E. Graph representations lend themselves well to entities with spatial structure, such as molecules, or data that is largely determined by relations between entities, such as social networks. In particular, a graph processing method referred to as a Message-Passing Neural Network (MPNN) has become a popular way of applying deep learning techniques to graphs.

MPNNs use the ‘graph in, graph out’ paradigm adopted in many GNN frameworks: each layer returns a graph with the same structure as the input graph, but with updated features. Node features, edge features and whole-graph features are updated at every layer until a final graph representation is generated. Architectures vary, but fundamentally an MPNN boils down to three operations: aggregate, update, and readout.

Adapted from Gilmer et al., 2017 and Ying et al., 2021

Aggregate is a permutation-invariant function that brings together, i.e. aggregates, information from the neighbours j of a node i. Popular choices are mean(·), sum(·) and max(·). The aggregation can also incorporate a trainable function, f, as in the example above. h_j^(l-1) is the feature vector of node j at layer l-1, and a_i^(l) is the message passed to node i.

Update (Ying et al. use the term combine instead, but the maths is the same) takes the aggregated message and combines it with the current node representation to update the node.

$$h_i^{(l)} = \mathrm{COMBINE}^{(l)}\left(h_i^{(l-1)},\, a_i^{(l)}\right)$$
(Equation 1 from Ying et al., 2021)

Aggregate and update are applied to every node i in the graph until the entire graph has been updated.

$$h_G = \mathrm{READOUT}\left(\{\, h_i^{(L)} \mid i \in G \,\}\right)$$
(Equation 2 from Ying et al., 2021)

The readout function is a permutation-invariant function that pools information across all nodes in the graph to generate a feature vector h_G for the whole graph.

Depending on the task, MPNNs can produce different final outputs. If the goal is to classify each node, the node level features generated by the network are of interest. If the goal is to make a prediction about the whole graph, such as whether or not a molecule is soluble, the final readout feature is most important.
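To tie the three operations together, here is a minimal, unoptimised sketch of a message-passing layer with sum aggregation, a learned update, and a mean readout. All names and design choices are mine; libraries such as PyTorch Geometric implement this far more efficiently:

```python
import torch
from torch import nn

class SimpleMPNNLayer(nn.Module):
    """One message-passing layer: sum-aggregate neighbour features, then update."""

    def __init__(self, d):
        super().__init__()
        self.message = nn.Linear(d, d)      # trainable function f applied to neighbour features
        self.update = nn.Linear(2 * d, d)   # combines the node state with the aggregated message

    def forward(self, h, adj):
        # h:   (num_nodes, d) node features
        # adj: (num_nodes, num_nodes) adjacency matrix (0/1)
        messages = adj @ self.message(h)    # aggregate: sum over each node's neighbours
        return torch.relu(self.update(torch.cat([h, messages], dim=-1)))  # update

# toy graph: 4 nodes arranged in a ring
adj = torch.tensor([[0, 1, 0, 1],
                    [1, 0, 1, 0],
                    [0, 1, 0, 1],
                    [1, 0, 1, 0]], dtype=torch.float)
h = torch.randn(4, 8)
layer = SimpleMPNNLayer(8)
h = layer(h, adj)
h_G = h.mean(dim=0)   # readout: permutation-invariant pooling over all nodes
```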

We can now see why these GNN architectures are referred to as ‘message passing’ – information is passed to a node by its neighbours in every layer. As a consequence, the receptive field of each node grows with every layer: in one iteration only direct neighbours can pass information, in the next, information from two edges away can reach a node. This can cause over-smoothing if the network is too deep, but that’s a whole other blog post. It’s also worth noting that some architectures (including the Graphormer) introduce a virtual node that is connected directly to all other nodes. This can help transmit information between distant nodes, and its embedding can also be used as the readout feature, h_G, for the whole graph.
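As an illustration of the virtual-node idea (my own minimal sketch, not the Graphormer’s implementation), one can simply append an extra node that is connected to everything and later read its embedding off as h_G:

```python
import torch
import torch.nn.functional as F

def add_virtual_node(h, adj):
    """Append a virtual node that is connected to every other node.

    h:   (num_nodes, d) node features
    adj: (num_nodes, num_nodes) adjacency matrix
    """
    h = torch.cat([h, torch.zeros(1, h.shape[1])], dim=0)   # initialise the virtual node's features
    adj = F.pad(adj, (0, 1, 0, 1), value=1.0)               # connect it to all existing nodes
    adj[-1, -1] = 0.0                                       # but not to itself
    return h, adj

# after message passing, the virtual node's final embedding h[-1] can serve as h_G
```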

Ok, phew. Now we know (roughly) how Transformers and GNNs work, we can take a look at the Graphormer in more detail.

Graphormer

The Transformer was, in part, an answer to a problem in sequence processing that GNNs also face: signal dilution between far-apart elements. This motivates transferring the Transformer architecture to GNNs, and the Graphormer is one such approach. Ying et al. introduce three changes to combine Transformers and GNNs: centrality encoding, spatial encoding, and edge encoding in the attention mechanism.

Centrality encoding

Centrality encoding incorporates the degree of each node into its initial feature representation, before any attention scores are calculated. This is motivated by the fact that the attention mechanism captures semantic correlations between nodes, but not the node connectivity. Centrality encoding adds additional learnable feature vectors to the initial input.

$$h_i^{(0)} = x_i + z^{-}_{\deg^{-}(v_i)} + z^{+}_{\deg^{+}(v_i)}$$
(Equation 5 from Ying et al., 2021)

h_i^(0) is the initial feature vector of the i-th node, before passing through any layers. It consists of an initial node representation x_i summed with feature vectors z that depend on the degree of the node: z^- is a learnable vector indexed by the number of incoming edges (the in-degree), and z^+ by the number of outgoing edges (the out-degree).

As an example, if nodes m and n have two incoming edges each, and three and one outgoing edges respectively, their z^- vectors will be the same, but their z^+ vectors will be different. These vectors are optimised during training to provide meaningful information about the degree of the nodes. For undirected graphs only one feature vector is added, since one cannot differentiate between incoming and outgoing edges.
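A minimal sketch of the centrality encoding, using learnable embedding tables indexed by in- and out-degree (the degree cap and variable names are mine, not taken from the paper):

```python
import torch
from torch import nn

d_model, max_degree = 64, 16                     # max_degree is an illustrative cap
z_minus = nn.Embedding(max_degree + 1, d_model)  # z^-, indexed by in-degree
z_plus = nn.Embedding(max_degree + 1, d_model)   # z^+, indexed by out-degree

def centrality_encoding(x, in_degree, out_degree):
    """h_i^(0) = x_i + z^-_{deg^-(i)} + z^+_{deg^+(i)}"""
    return (x
            + z_minus(in_degree.clamp(max=max_degree))
            + z_plus(out_degree.clamp(max=max_degree)))

# toy usage: 5 nodes with 64-dimensional input features
x = torch.randn(5, d_model)
in_deg = torch.tensor([2, 2, 1, 0, 3])
out_deg = torch.tensor([3, 1, 2, 2, 0])
h0 = centrality_encoding(x, in_deg, out_deg)     # shape (5, 64)
```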

Spatial encoding

Transformers use positional encodings to retain information about the positions of elements in the sequence, since the attention mechanism is otherwise permutation invariant. It is not immediately obvious how to determine and then incorporate such positional information in a graph, since the structure is much more flexible.

Furthermore, Transformers have a global receptive field, meaning that every element in a sequence is ‘seen’ when generating the attention output, rather than a cropped section or subset of the elements. This is not true for single layers of MPNNs, where each node update is only influenced by its neighbours.

As a solution to this, Ying et al. propose spatial encodings. They use a function that describes the distance between two nodes – in the paper, the shortest-path distance (SPD) – and learn a scalar bias term for each value of that function. The bias is added to the attention score calculated from the queries and keys and is shared across all layers:

$$A_{ij} = \frac{(h_i W_Q)(h_j W_K)^{T}}{\sqrt{d}} + b_{\phi(v_i, v_j)}$$
(Equation 6 from Ying et al., 2021)

b becomes a spatial bias for the attention: if b is a decreasing function of the SPD between i and j, then A, the attention score, will be greater for nodes that are close to each other in the graph.
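Here is a sketch of how the spatial bias could be computed. The shortest-path distances are obtained with networkx here, and the distance cap and variable names are my own assumptions (the paper does not prescribe an implementation):

```python
import networkx as nx
import torch
from torch import nn

max_dist = 8                                   # illustrative cap on the shortest-path distance
b = nn.Parameter(torch.zeros(max_dist + 1))    # one learnable scalar per distance, shared across layers

def spatial_bias(graph):
    """Return an (n, n) matrix of biases b_{phi(i, j)}; assumes nodes are labelled 0..n-1."""
    n = graph.number_of_nodes()
    spd = torch.full((n, n), max_dist)                          # unreachable pairs keep the cap
    for i, lengths in nx.all_pairs_shortest_path_length(graph):
        for j, dist in lengths.items():
            spd[i, j] = min(dist, max_dist)
    return b[spd]

# the bias is added elementwise to the query-key scores before the softmax, e.g.
# A = Q @ K.T / (d ** 0.5) + spatial_bias(graph)
```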

We can also see that the attention mechanism applies to all node pairs in the graph, not just neighbouring nodes. This makes the receptive field global, but it also increases the number of computations per layer. With these changes, computation scales quadratically with the number of nodes, i.e. N^2 per layer, whereas in an MPNN the number of computations scales (approximately) linearly with the sum of K(n) over the N nodes, where K(n) is the number of neighbours of the n-th node and N is the number of nodes in the graph. While this is not devastating for graphs representing small molecules, it may render the Graphormer less useful for very large graphs.

Edge encoding in the attention mechanism

Often the edges of a graph contain valuable information that should be considered for whichever downstream task is being performed. For example, in molecules the number of covalent bonds or the bond length between two atoms could be important. To incorporate this information, Ying et al. calculate a weighted average over the feature vectors of the edges on the path between two nodes i and j, and add this to their attention score.

$$c_{ij} = \frac{1}{N}\sum_{n=1}^{N} x_{e_n} (w_n^{E})^{T}$$
(Equation 7 from Ying et al., 2021)

x_{e_n} is the feature vector of the n-th edge, e_n, on the path, and w_n is the trainable weight embedding for that position on the path. n = 1…N iterates over all the edges on the shortest path between the nodes i and j. c_ij is a scalar that distils information about the edges between the two nodes. One thing I could not disambiguate in the original paper is which shortest path is used in cases where there are multiple options.
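A minimal sketch of the edge encoding for a single node pair, under the assumption that the edge features along the shortest path have already been gathered (dimensions and names are mine):

```python
import torch
from torch import nn

d_edge, max_path_len = 8, 6                                 # illustrative sizes
w_edge = nn.Parameter(torch.randn(max_path_len, d_edge))    # one weight embedding w_n per path position

def edge_encoding(path_edge_feats):
    """c_ij: average of x_{e_n} . w_n over the N edges on the shortest path from i to j.

    path_edge_feats: (N, d_edge) feature vectors of the edges along the path,
                     assumed no longer than max_path_len
    """
    n = path_edge_feats.shape[0]
    return (path_edge_feats * w_edge[:n]).sum(dim=-1).mean()

# toy usage: a shortest path of three edges between nodes i and j
c_ij = edge_encoding(torch.randn(3, d_edge))                # a scalar, added to the attention score A_ij
```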

Like the spatial bias, the edge score c_ij is added to the attention score of nodes i and j, yielding the final equation for the attention score between two nodes:

$$A_{ij} = \frac{(h_i W_Q)(h_j W_K)^{T}}{\sqrt{d}} + b_{\phi(v_i, v_j)} + c_{ij}$$
(Equation 7 from Ying et al., 2021)

This attention score is calculated for every node pair i-j and used for multi-head attention as described in the section on Transformers above.

Bringing it all together

The final aggregate equation used by the Graphormer is:

$$h'^{(l)} = \mathrm{MHA}\left(\mathrm{LN}\left(h^{(l-1)}\right)\right) + h^{(l-1)}$$
(Equation 8 from Ying et al., 2021)

where MHA is multi-head attention. The update equation for each node is:

$$h^{(l)} = \mathrm{FFN}\left(\mathrm{LN}\left(h'^{(l)}\right)\right) + h'^{(l)}$$
(Equation 9 from Ying et al., 2021)

where FFN is the feed-forward network, and LN is a layer normalisation block that normalises the feature vector of each node with what is essentially a z-score transformation with additional trainable parameters gamma and beta:

$$\mathrm{LN}(x) = \frac{x - \mu}{\sigma} \odot \gamma + \beta$$
(Unnumbered equation from Xiong et al., 2020)
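Putting equations 8 and 9 together, one Graphormer-style layer can be sketched as a pre-LayerNorm Transformer block. Plain multi-head attention is used here for brevity; the Graphormer additionally injects the spatial and edge biases into the attention scores as described above:

```python
import torch
from torch import nn

class PreLNLayer(nn.Module):
    """Pre-LayerNorm block: h' = h + MHA(LN(h)); h_new = h' + FFN(LN(h'))."""

    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, h):
        # h: (batch, num_nodes, d_model); a full Graphormer would also pass the
        # spatial and edge biases to the attention as an additive mask
        x = self.ln1(h)
        h = h + self.mha(x, x, x)[0]      # equation 8: residual connection around the attention
        h = h + self.ffn(self.ln2(h))     # equation 9: residual connection around the FFN
        return h

layer = PreLNLayer()
h = layer(torch.randn(1, 10, 64))         # a toy graph with 10 nodes and 64-dimensional features
```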

Results

The Graphormer won the 2021 Open Graph Benchmark Large-Scale Challenge (OGB-LSC) in quantum chemistry. Ying et al. demonstrate the Graphormer’s capability on multiple benchmark datasets, primarily the PCQM4M-LSC challenge, a whole-graph regression task to predict the HOMO-LUMO energy gap of molecules. They report competitive and state-of-the-art results, and their ablation studies suggest that all three of their changes contribute to the improvement, although admittedly their model has substantially more trainable parameters than some of the alternative methods.

Summarising notes

This was a really interesting deep-dive into both GNNs and Transformers, and reading about how they merge the two methods was very informative! It would be interesting to see how the Graphormer performs on larger graphs given the increased computational complexity.

References and Further Reading

Do Transformers Really Perform Bad for Graph Representation? – Ying et al., 2021

Attention Is All You Need – Vaswani et al., 2017

Neural Message Passing for Quantum Chemistry – Gilmer et al., 2017

On Layer Normalization in the Transformer Architecture – Xiong et al., 2020

Breaking the Limits of Message Passing Graph Neural Networks – Balcilar et al., 2021

PCQM4Mv2 – Quantum Chemistry Challenge

A gentle introduction to GNNs – Blog post

An introduction to MPNNs – Blog post

The annotated Transformer – Blog post

Graphormer explanation – Video

The illustrated Transformer – Blog post
