Issues with graph neural networks: the cracks are where the light shines through

Deep convolutional neural networks have led to astonishing breakthroughs in the area of computer vision in recent years. The reason for the extraordinary performance of convolutional architectures in the image domain is their strong ability to extract informative high-level features from visual data. For prediction tasks on images, this has led to superhuman performance in a variety of applications and to an almost universal shift from classical feature engineering to differentiable feature learning.

Unfortunately, the picture is not quite as rosy yet in the area of molecular machine learning. Feature learning techniques which operate directly on raw molecular graphs without intermediate feature-engineering steps have only emerged in the last few years in the form of graph neural networks (GNNs). GNNs, however, still have not managed to definitively outcompete and replace more classical non-differentiable molecular representation methods such as extended-connectivity fingerprints (ECFPs). There is an increasing awareness in the computational chemistry community that GNNs have not quite lived up to the initial hype and still suffer from a number of technical limitations:

  1. Lack of expressivity: Xu et al. [1] showed that some of the most popular GNNs (such as GCNs) lack theoretical expressivity, meaning that they are not able to distinguish certain simple graph structures and can thus underfit the training set.
  2. Meaningless initial embeddings: GNNs have to learn a meaningful chemical space embedding from scratch every time they are trained on a novel task, unlike for example molecular descriptor vectors which contain useful chemical knowledge from the start. This might be one of the reasons why GNNs often underperform on small data sets.
  3. Oversmoothing: many GNN architectures cannot be made deep due to a tendency of successively convolved node features to become indistinguishable (a phenomenon known as oversmoothing).
  4. Information loss during graph pooling: GNNs require a global pooling step to eventually reduce the graph to a vector which can form a dangerous information bottleneck. Often, the global graph representation is simply chosen to be the sum or average over all node features.
  5. Locality of receptive field: much like ECFPs, most GNNs are based on a neighbourhood-aggregation scheme which limits the size of their receptive field and prevents information flow between distant nodes in the input graph (a minimal sketch of this scheme, combined with the naive pooling from point 4, follows this list).
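
To make issues (4) and (5) concrete, here is a minimal NumPy sketch of the kind of neighbourhood-aggregation layer and naive sum readout described above. The function names and the toy chain graph are purely illustrative and not taken from any particular library:

```python
import numpy as np

def message_passing_layer(node_feats, adj, weight):
    # One neighbourhood-aggregation step: every node averages the features of
    # itself and its direct neighbours, then applies a shared linear map + ReLU.
    # The receptive field grows by only one hop per layer (issue 5).
    adj_hat = adj + np.eye(adj.shape[0])                 # add self-loops
    deg_inv = 1.0 / adj_hat.sum(axis=1, keepdims=True)   # normalise by node degree
    aggregated = deg_inv * (adj_hat @ node_feats)        # mean over the 1-hop neighbourhood
    return np.maximum(aggregated @ weight, 0.0)

def sum_readout(node_feats):
    # Naive global pooling (issue 4): collapse the whole graph into a single
    # vector by summing node features, regardless of how large the graph is.
    return node_feats.sum(axis=0)

# Toy "molecule": a chain of 5 atoms with 8-dimensional node features.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
adj = np.zeros((5, 5))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    adj[i, j] = adj[j, i] = 1.0

w1, w2 = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
h = message_passing_layer(message_passing_layer(x, adj, w1), adj, w2)
graph_vector = sum_readout(h)   # after 2 layers, node 0 has still never "seen" nodes 3 and 4
print(graph_vector.shape)       # (8,)
```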

In my experience, GNNs can often be replaced with simpler and more computationally efficient molecular representation methods such as ECFPs without a significant loss in predictive performance, especially on small data sets.
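
For context, such a fingerprint baseline takes only a few lines with RDKit (assuming RDKit is installed; the caffeine SMILES string and the radius/bit-length settings are just one common choice):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# ECFP4-style fingerprint: Morgan algorithm with radius 2, hashed to 2048 bits.
mol = Chem.MolFromSmiles("Cn1cnc2c1c(=O)n(C)c(=O)n2C")   # caffeine
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

# The resulting bit vector can be fed straight into a random forest, gradient
# boosting model or plain MLP; no feature learning step is required.
print(fp.GetNumOnBits(), "bits set out of", fp.GetNumBits())
```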

So what is the way forward?

In my estimation, we are unlikely to make substantial progress in graph-based molecular feature learning unless we address at least some of the challenges in (1)–(5) with technical innovation. Here are some exciting ideas I have come across that go in this direction (the numbers next to the bullet points indicate which of the above issues are addressed):

  • (4) Universal graph pooling: Navarin et al. [2] proposed a trainable and highly expressive graph pooling operation that can provably approximate any continuous permutation-invariant function over sets of vectors (a toy version of such a readout is sketched after this list).
  • (2) Self-supervised pretraining on unlabelled data: GNNs can be pretrained on unlabelled molecules in a self-supervised fashion and then fine-tuned on downstream tasks. Recently, Wang et al. [3] achieved strong results on molecular property prediction tasks by pretraining GNNs with contrastive learning strategies on unlabelled molecules (a stripped-down contrastive loss appears in the sketches below).
  • (3) Prevention of oversmoothing: Liu et al. [4] contributed to our understanding of the oversmoothing issue by linking it to the entanglement of certain key operations in the node feature updating step of common GNNs. They went on to propose technical modifications to enable deeper GNN architectures and demonstrated associated performance gains.
  • (1) Provably expressive GNNs: To overcome the lack of expressivity of popular GNN architectures, Xu et al. [1] designed a new type of GNN, the graph isomorphism network (GIN). They proved that GINs are strictly more expressive than a variety of previous GNNs and that they are in fact as powerful as the commonly used Weisfeiler-Lehman graph isomorphism test. GINs have since become a state-of-the-art choice of GNN in many applications (the GIN node update is also sketched below).
  • (1) (5) Graphormers: Recently, attempts have been made to adapt the transformer architecture from natural language processing to graph-shaped input data. When applied to the set of node features of an input graph, transformers allow for a global receptive field that enables information flow between arbitrary nodes. Ying et al. [5] managed to explicitly incorporate graph structural information into the transformer-based self-attention mechanism, which gave rise to the graphormer model. Many GNN architectures such as the popular GCN or the expressive GIN can be technically formulated as a special type of graphormer. The graphormer achieved impressive results on the OGB-LSC quantum chemistry regression challenge; in particular, it beat all GNNs and achieved first place on the public leaderboard (a simplified version of this structural attention bias is sketched below).
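
To give a flavour of what a trainable readout can look like, here is a toy sum-decomposition readout in PyTorch. It follows the general "embed each node, sum, transform" recipe behind universal permutation-invariant pooling rather than reproducing the exact architecture of [2], and all class and parameter names are invented for the example:

```python
import torch
import torch.nn as nn

class SumDecompositionReadout(nn.Module):
    # Trainable graph readout: embed every node with an MLP (phi), sum the
    # embeddings (permutation-invariant), then transform the result (rho).
    def __init__(self, node_dim, hidden_dim, out_dim):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(node_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, hidden_dim))
        self.rho = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, out_dim))

    def forward(self, node_feats):                   # node_feats: (num_nodes, node_dim)
        return self.rho(self.phi(node_feats).sum(dim=0))

readout = SumDecompositionReadout(node_dim=16, hidden_dim=64, out_dim=32)
graph_vector = readout(torch.randn(5, 16))           # one 32-dimensional vector per graph
```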
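
The contrastive pretraining idea can likewise be boiled down to a few lines. The following is a stripped-down contrastive objective in the spirit of the NT-Xent loss used by methods like MolCLR, not a faithful reimplementation of [3]:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.1):
    # z1, z2: (batch, dim) graph embeddings of two augmented views of the same
    # molecules. Matching rows are positive pairs, all other rows are negatives.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / temperature             # cosine similarities of all pairs
    targets = torch.arange(z1.shape[0])          # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

loss = contrastive_loss(torch.randn(32, 128), torch.randn(32, 128))
```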
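
The GIN node update itself is compact enough to sketch directly. This version uses a dense adjacency matrix for readability; a library implementation such as GINConv in PyTorch Geometric would use sparse message passing and batching:

```python
import torch
import torch.nn as nn

class GINLayer(nn.Module):
    # One GIN update: h_v <- MLP((1 + eps) * h_v + sum of neighbour features).
    # Sum aggregation followed by an MLP is what makes GIN as powerful as the
    # 1-dimensional Weisfeiler-Lehman test.
    def __init__(self, dim):
        super().__init__()
        self.eps = nn.Parameter(torch.zeros(1))       # learnable epsilon
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, node_feats, adj):               # adj: (N, N) 0/1 matrix without self-loops
        neighbour_sum = adj @ node_feats               # sum (not mean) over neighbours
        return self.mlp((1.0 + self.eps) * node_feats + neighbour_sum)

adj = torch.tensor([[0., 1., 0.],
                    [1., 0., 1.],
                    [0., 1., 0.]])                     # 3-node path graph
h = GINLayer(dim=16)(torch.randn(3, 16), adj)
```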
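
Finally, here is a much-simplified, single-head illustration of where the graphormer's structural information enters the attention computation. Real graphormers use multi-head attention plus centrality and edge encodings; this sketch only shows a learnable bias indexed by shortest-path distance, with all names invented for the example:

```python
import torch
import torch.nn as nn

class SpatialBiasSelfAttention(nn.Module):
    # Single-head self-attention over all nodes (global receptive field) with a
    # graphormer-style spatial bias: a learnable scalar added to each attention
    # logit, indexed by the shortest-path distance between the two nodes.
    def __init__(self, dim, max_dist=10):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.dist_bias = nn.Embedding(max_dist + 1, 1)   # one bias per distance bucket

    def forward(self, node_feats, spd):                  # spd: (N, N) integer shortest-path distances
        scores = self.q(node_feats) @ self.k(node_feats).T / node_feats.shape[-1] ** 0.5
        spd = spd.clamp(max=self.dist_bias.num_embeddings - 1)
        scores = scores + self.dist_bias(spd).squeeze(-1)
        return torch.softmax(scores, dim=-1) @ self.v(node_feats)

# Toy usage: 4 nodes, 16-dimensional features, precomputed shortest-path distances.
attn = SpatialBiasSelfAttention(dim=16)
out = attn(torch.randn(4, 16), torch.randint(0, 4, (4, 4)))
```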

The world of molecular machine learning is incredibly exciting and things are moving at an impressive pace. Graph neural networks are not yet for molecules what convolutional neural networks are for images. But the very reasons why they are not yet what we would like them to be might just lead us in the right direction.

References:

[1] Xu, Keyulu, et al. “How powerful are graph neural networks?.” arXiv preprint arXiv:1810.00826 (2018).

[2] Navarin, Nicolò, Dinh Van Tran, and Alessandro Sperduti. “Universal readout for graph convolutional neural networks.” 2019 International Joint Conference on Neural Networks (IJCNN). IEEE, 2019.

[3] Wang, Yuyang, et al. “MolCLR: Molecular contrastive learning of representations via graph neural networks.” arXiv preprint arXiv:2102.10056 (2021).

[4] Liu, Meng, Hongyang Gao, and Shuiwang Ji. “Towards deeper graph neural networks.” Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020.

[5] Ying, Chengxuan, et al. “Do Transformers Really Perform Bad for Graph Representation?.” arXiv preprint arXiv:2106.05234 (2021).
