ISMB 2018: Collaborative Structural Biology using Machine Learning and Jupyter Notebook

This post is a summary of the talk, Collaborative Structural Biology using Machine Learning and Jupyter Notebook, given by Fergus Imrie and Fergus Boyles at ISMB 2018. Materials for the experiments can be found here and here.

Myself and four other members of the Oxford Protein Informatics Group (a.k.a. OPIGlets) recently had the pleasure of attending the Intelligent Systems for Molecular Biology (ISMB) conference in Chicago. Organised by the International Society of Computational Biology (ISCB), ISMB is the largest computational biology conference in the world, with several thousand attendees.

Spread over four action-packed days in July (not including workshops/tutorial sessions), it was an eye-opening experience, showcasing the depth and breadth of computational biology research; particularly striking was the range of problems tackled, techniques applied, and data sources used.

I was fortunate enough to have the opportunity to present alongside my colleague, Fergus Boyles, as part of the 3DSIG Community of Special Interest (COSI). We led the first hands-on practical demonstration at 3DSIG, entitled “Collaborative Structural Biology using Machine Learning and Jupyter Notebook”. While a new format at the conference, with our presentation somewhat of an experiment, I understand the organising committee is keen to repeat the format next year.

In what follows, I’ll briefly outline the key themes and outcomes from our presentation. Full materials to reproduce all results presented in full can be found here and here.

Reproducibility crisis?

In a survey of 1,500 scientists by Nature in 2016 (link), more than 70% of participants had tried and failed to reproduce another scientist’s experiments, while 90% said there was a reproducibility crisis to some extent. Most striking, perhaps, was the revelation that “more than half have failed to reproduce their own experiments”!

Nature, 2016, M. Baker, 1,500 scientists lift the lid on reproducibility

While the focus of the survey was, admittedly, on traditional, lab-based, experimental research, this is certainly also an issue in computational approaches, with the machine learning community under the heaviest scrutiny.

This is clearly unsustainable and many efforts are being taken to address this across the scientific world. As one example, Nature has introduced a code and submission checklist that requires authors to submit custom algorithms or software that are central to the paper for peer review and editorial assessment. While only directly affecting a small portion of research, this is a big step in the right direction and I think we’re only going to see more of this in the future.

Software to the rescue?

With the rise of cloud computing, the open-source community, and much more, there is a plethora of software available that can be used to improve the accessibility of methods and improve the reproducibility of computational experiments. Below, I touch on a couple of general areas that are increasing used in computational pipelines and setups.

  • Cloud computing (such as Amazon Web Services, Google Cloud, and Microsoft Azure) provides widely accessible, standardised compute environments, and allows the use of anything from a single core to near-HPC-level resources for a short period of time at relative inexpensive.
  • Container solutions (such as Docker and Kubernets) allow developers to package an application, with all required libraries and dependencies, into a single executable for the end user, with no further dependencies.

Our approach

We didn’t use any of the above tools for purposes of our talk, but instead constructed our pipeline based on three other widely-used solutions: Conda, Project Jupyter, and Git/GitHub. For those unfamiliar, here is a brief overview of each.

  • Conda is an open-source package and environment management system. It works by creating distinct virtual environments and installing standalone interpreters or compilers within that virtual environment. You can then install additional packages within that virtual environment, that are completely isolated and separate from your system default packages, and other virtual environments.

  • For those of you who are familiar with the iPython notebook, Jupyter is an extension of this format to multiple languages. Jupyter provides an interactive browser-based coding environment in the form of a notebook, that can be thought of as similar to a lightweight IDE. The power of Jupyter notebooks comes from a combination of (1) the ability to intersperse code with markdown, which is much more human readable and friendly on the eye compared to traditional comments; (2) the cell-based format, where small pieces of code are contained in cells that can be run, and re-run, individually and without re-running the remainder of your code; (3) the ability to display inline figures, tables (among other things), rendering in HTML.

  • Git is an open-source version control system. Version control is an essential bedrock of good programming that we don’t have time to go into in more detail, but long-story short, Git takes any headache out of version control.

  • GitHub is a code hosting platform built for collaboration with Git at its core. Beyond a simple code repository, GitHub allows collaboration and development through two key features. “Forking” allows you to clone other projects, and either develop them yourself, or keep a record of a fixed version for integration within another project. “Pull requests” make large scale community collaboration projects possible, with users providing code for specific modifications for the original projects, which the owners/admin of the original project can choose to merge or reject.

Experiments

As a toy problem to showcase this approach to building a reproducible pipeline, we address the problem of protein classification according to the SCOP classification scheme. While the dataset we have shared contains examples of protein pairs that are in the same fold, superfamily, and family (as well as none of these), we focussed on the most straightforward task of determining whether a pair of proteins belong to the same family or not.

Our dataset is based on the Astral data set (06.02.2016 build), and consists of 8 pairwise features computed from the sequences of the two proteins. We won’t go into the details of the exact features here.

Using a simple random forest on these 8 pairwise features between the target and template protein, we achieved an accuracy of 88.0%, and an area under the receiver operative curve of 0.95. A confusion matrix and ROC curve summarising our results can be found below.

Instructions to reproduce these results, together with all materials needed, can be found here and here.

Conclusions

Reproducibility in science is facing a challenging time. All stakeholders, from researchers to funders and publishers, are placing more emphasis on work being reproducible, and are taking measures to ensure this. In computational research, in particular stochastic algorithms such as those prevalent throughout machine learning, the problem is no less serious, and on the face of it should be readily solvable.

In our demonstration, we have illustrated one approach to tackling this in a simple, efficient way. In addition, we only looked to tackle one possible problem or question, and only used a subset of the overall dataset. Please feel free to explore the dataset and pose your own questions. We’d love to hear from you if you do!

Acknowledgements

I’d like to thank all of OPIG for providing feedback on an early version of the talk. Crucially, I’d like to thank Dr Saulo de Oliveira who provided us with the dataset used in our exploratory analysis. Finally, I’d like to thank my co-presenter Fergus Bolyes, without whom I couldn’t have done this.

Author