Category Archives: Journal Club

The Coronavirus Antibody Database (CoV-AbDab)

We are happy to announce the release of CoV-AbDab, our database tracking all coronavirus binding antibodies and nanobodies with molecular-level metadata. The database can be searched and downloaded here: http://opig.stats.ox.ac.uk/webapps/coronavirus

Continue reading

Journal Club: Is our data biased, and should it be?

Jia, X., Lynch, A., Huang, Y. et al. Anthropogenic biases in chemical reaction data hinder exploratory inorganic synthesis. Nature 573, 251–255 (2019) doi:10.1038/s41586-019-1540-5 https://www.nature.com/articles/s41586-019-1540-5

Last week I presented the above paper at group meeting. While a little different from a typical OPIG journal club paper, the data we have access to almost certainly suffers from the same range of (possible) biases explored in this paper.

Continue reading

Learning dynamical information from static protein and sequencing data

I would like to advertise the research from Pearce et al. (https://doi.org/10.1101/401067) whose talk I attended at ISMB 2019. The talk was titled ‘Learning dynamical information from static protein and sequencing data’. I got interested in it as my field of research is structural biology which deals with dynamics systems, e.g. proteins, but data is often static, e.g. structures from X-ray crystallography. They presented a general protocol to infer transition rates between states in a dynamical system that can be represented with an energy landscape.

Continue reading

Journal Club: Investigating Allostery with a lot of Crystals

Keedy et al. 2018: An expanded allosteric network in PTP1B by multitemperature crystallography, fragment screening, and covalent tethering.

Allostery is defined as a conformational/activity change of an active site due to a binding event at a distant (allosteric) site.

The paper I presented in the journal club tried to decipher the underlying mechanics of allostery in PTP1B. It is a protein tyrosine phosphatase (the counter parts of kinases) and a validated drug target. Allosteric binding sites are known but so far neither active site nor allosteric site inhibitors have reached clinical use. Thus, an improved mechanistic understanding could improve drug discovery efforts.

Continue reading

Kernel Methods are a Hot Topic in Network Feature Analysis

The kernel trick is a well known method in machine learning for producing a real-valued measure of similarity between data points in any number of settings. Kernel methods for network analysis provide a way of assigning real values to vertices of the graph. These values may correspond to similarity across any number of graphical properties such as the neighbours they share, or more dynamic context, the influence that change in the state of one vertex might have on another.

By using the kernel trick it is possible to approximate the distribution of features on the vertices of a graph in a way that respects the graphical relationships between vertices. Kernel based methods have long been used, for instance in inferring protein function from other proteins within Protein Interaction Networks (PINs).

Continue Reading

Check My Blob

A brief overview and discussion of: Automatic recognition of ligands in electron density by machine learning .This paper aims to reduce the bias of crystallographers fitting ligands into electron density for protein ligand complexes. The authors train a supervised machine learning model using known ligand sites across the whole protein databank, to produce a classifier that can identify which common ligands could fit to that electron density.

Continue reading

Mol2vec: Finding Chemical Meaning in 300 Dimensions

Embeddings of Amino Acids

2D projections (t-SNE) of Mol2vec vectors of amino acids (bold arrows). These vectors were obtained by summing the vectors of the Morgan substructures (small arrows) present in the respective molecules (amino acids in the present example). The directions of the vectors provide a visual representation of similarities. Magnitudes reflect importance, i.e. more meaningful words. [Figure from Ref. 1]

Natural Language Processing (NLP) algorithms are usually used for analyzing human communication, often in the form of textual information such as scientific papers and Tweets. One aspect, coming up with a representation that clusters words with similar meanings, has been achieved very successfully with the word2vec approach. This involves training a shallow, two-layer artificial neural network on a very large body of words and sentences — the so-called corpus — to generate “embeddings” of the constituent words into a high-dimensional space. By computing the vector from “woman” to “queen”, and adding it to the position of “man” in this high-dimensional space, the answer, “king”, can be found.

A recent publication of one of my former InhibOx-colleagues, Simone Fulle, and her co-workers, Sabrina Jaeger and Samo Turk, shows how we can embed molecular substructures and chemical compounds into a similarly high-dimensional, continuous vectorial representation, which they dubbed “mol2vec“.1 They also released a Python implementation, available on Samo Turk’s GitHub repository.

 

Continue reading

Storing your stuff with clever filesystems: ZFS and tmpfs

The filesystem is a a critical component of just about any operating system, however it’s often overlooked. When setting up a new server, the default filesystem options are often ticked and never thought about again. However, there exist a couple of filesystems which can provide some extraordinary features and speed. I’m talking about ZFS and tmpfs.

ZFS was originally developed by Oracle for their Solaris operating system, but has now been open-sourced and is freely available on linux. Tmpfs is a temporary file system which uses system memory to provide fast temporary storage for files. Together, they can provide outstanding reliability and speed for not very much effort.

Hard disk capacity has increased exponentially over the last 50 years. In the 1960s, you could rent a 5MB hard disk from IBM for the equivalent of $130,000 per month. Today you can buy for less than $600 a 12TB disk – a 2,400,000 times increase.

As storage technology has moved on, the filesystems which sit on top of them ideally need to be able to access the full capacity of those ever increasing disks. Many relatively new, or at least in-use, filesystems have serious limitations. Akin to “640K ought to be enough for anybody”, the likes of the FAT32 filesystem supports files which are at most 4GB on a chunk of disk (a partition) which can be at most 16TB. Bear in mind that arrays of disks can provide a working capacity of many times that of a single disk. You can buy the likes of a supermicro sc946ed shelf which will add 90 disks to your server. In an ideal world, as you buy bigger disks you should be able to pile them into your computer and tell your existing filesystem to make use of them, your filesystem should grow and you won’t have to remember a different drive letter or path depending on the hardware you’re using.

ZFS is a 128-bit file system, which means a single installation maxes out at 256 quadrillion zettabytes. All metadata is allocated dynamically so there isn’t the need to pre-allocate inodes and directories can have up to 2^48 (256 trillion) entries. ZFS provides the concept of “vdevs” (virtual devices) which can be a single disk or redundant/striped collections of multiple disks. These can be dynamically added to a pool of vdevs of the same type and your storage will grow onto the fresh hardware.

A further consideration is that both disks of the “spinning rust” variety and SSDs are subject to silent data corruption, i.e. “bit rot”. This can be caused by a number of factors even including cosmic rays, but the consequence is read errors when it comes time to retrieve your data. Manufacturers are aware of this and buried in the small print for your hard disk will be values for “unrecoverable read errors” i.e. data loss. ZFS works around this by providing several mechanisms:

  • Checksums for each block of data written.
  • Checksums for each pointer to data.
  • Scrub – Automatically validates checksums when the system is idle.
  • Multiple copies – Even if you have a single disk, it’s possible to provide redundancy by setting a copies=n variable during filesystem creation.
  • Self-healing – When a bad data block is detected, ZFS fetches the correct data from a redundant copy and replaces it with the correct data.

An additional bonus to ZFS is its ability to de-duplicate data. Should you be working with a number of very similar files, on a normal filesystem, each file will take up space proportional to the amount of data that’s contained. As ZFS keeps checksums of each block of data, it’s able to determine if two blocks contain identical data. ZFS therefore provides the ability to have multiple pointers to the same file and only store the differences between them.

 

ZFS also provides the ability to take a point in time snapshot of the entire filesystem and roll it back to a previous time. If you’re a software developer, got a package that has 101 dependencies and you need to upgrade it? Afraid to upgrade it in case it breaks things horribly? Working on code and you want to roll back to a previous version? ZFS snapshots can be run with cron or manually and provide a version of the filesystem which can be used to extract previous versions of overwritten or deleted files or used to roll everything back to a point in time when it worked.

Similar to deduplication, a snapshot won’t take up any disk extra space until the data starts to change.

The other filesystem worth mentioning is tmpfs. Tmpfs takes part of the system memory and turns it into a usable filesystem. This is incredibly useful for systems which create huge numbers of temporary files and attempt to re-read them. Tmpfs is also just about as fast as a filesystem can be. Compared to a single SSD or a RAID array of six disks, tmpfs blows them out of the water speed wise.

Creating a tmpfs filesystem is simple:
First create your mountpoint for the disk:

mkdir /mnt/ramdisk

Then mount it. The options are saying make it 1GB in size, it’s of type tmpfs and to mount it at the previously created mount point:

mount –t tmpfs -o size=1024m tmpfs /mnt/ramdisk

At this point, you can use it like any other filesystem:

df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1  218G 128G   80G  62% /
/dev/sdb1  6.3T 2.4T  3.6T  40% /spinnyrust
tank       946G 3.5G  942G   1% /tank
tmpfs      1.0G 391M  634M  39% /mnt/ramdisk

Journal club: Human enterovirus 71 protein interaction network prompts antiviral drug repositioning

Viruses are small infectious agents, which possess genetic code but have no independent metabolism. They propagate by infecting host cells and hijacking their machinery, often killing the cells in the process. One of the key challenges in developing effective antiviral therapies is the high mutation rate observed in viral genomes. A way to circumvent this issue is to target host proteins involved in virion assembly (also known as essential host factors, or EHFs), rather than the virion itself.

In their recent paper, Lu Han et al. [1] consider human virus protein-protein interactions in order to explore possible host drug targets, as well as drugs which could potentially be re-purposed as antivirals. Their study focuses on enterovirus 71 (EV71), one of the leading causes of hand, foot, and mouth disease.

Human virus protein-protein interactions and target identification

EHFs are typically detected by knocking out genes in the host organism and determining which of the knockouts result in virus control. Low repeat rates and high costs make this technique unsuitable for large scale studies. Instead, the authors use an extensive yeast two-hybrid screen to identify 37 unique protein-protein interactions between 7 of the 11 virus proteins and 29 human proteins. Pathway enrichment suggests that the human proteins interacting with EV71 are involved in a wide range of functions in the cell. Despite this range in functionality, as many as 17 are also associated with other viruses, either through known physical interactions, or as EHFs (Fig 1).

Fig. 1. Interactions between viral and human proteins (denoted as EIPs), and their connection to different viruses.

One of these is ATP6V0C, a subunit of vacuole ATP-ase. It interacts with the EV71 3A protein, and is a known essential host factor for five other viruses. The authors analyse the interaction further, and show that downregulating ATP6V0C gene expression inhibits EV71 propagation, while overexpressing it enhances virus propagation. Moreover, treating cells with bafilomycin A1, a selective inhibitor for vacuole ATP-ase, inhibits EV71 infection in a dose-dependent manner. The paper suggests that therefore ATP6V0C may be a suitable drug target, not only against EV71, but also perhaps even for a broad-spectrum antiviral. While this is encouraging, bafilomycin A1 is a toxic antibiotic used in research, but not suitable for human or drug use. Rather than exploring other compounds targeting ATP6V0C, the paper shifts focus to re-purposing known drugs as antivirals.

Drug prediction using CMap

A potential antiviral will ideally disturb most or all interactions between host cell and virion. One way to do this would be to inhibit the proteins known to interact with EV71. In order to check whether any known compounds already do so, the authors apply gene set enrichment analysis (GSEA) to data from the connectivity map (CMap). CMap is a database of gene expression profiles representing cellular response to a set of 1309 different compounds.  Enrichment analysis of the database reveals 27 potential EV71 drugs, of which the authors focus on the top ranking result, tanespimycin.

Tanespimycin is an orphan cancer drug, originally designed to target tumor cells by inhibiting HSP90. Its complex effects on the cell, however, may make it an effective antiviral. Following their CMap analysis, the authors show that tanespimycin reduces viral count and virus-induced cytopathic effects in a dose-dependent manner, without evidence of cytotoxicity.

Overall, the paper presents two different ways to think about target investigation and drug choice in antiviral therapeutics — by integrating different types of known host virus protein-protein interactions, and by analysing cell response to known compounds. While extensive further study is needed to determine whether the results are directly clinically relevant to the treatment of EV71, the paper shows how  interaction data analysis can be employed in drug discovery.

References:

[1] Han, Lu, et al. “Human enterovirus 71 protein interaction network prompts antiviral drug repositioning.” Scientific Reports 7 (2017).

 

Proteins evolve on the edge of supramolecular self-assembly

Inspired by Eoin’s interesting talks on prions and prion diseases, and Nick’s discussion of how Cyro-Electron microscopy is going to be the end of an era for Crystallography. I thought I’d look at a paper that discusses aggregation of protein complexes, with some cryo-electron microscopy thrown in for good measure.

Supramolecular assembbly

a, A molecule gaining a single self-interacting patch forms a finite dimer. A self-interacting patch repeated on opposite sides of a symmetric molecule can result in infinite assembly. b, A point mutation in a dihedral octamer creates a new self-interacting patch (red), triggering assembly into a fibre.

Supramolecular assemblies are folded protein complexes forming into much larger units. This formation can be triggered by a mutation on a copy of the constituent homomers of the complex, acting as a self-interacting patch. If this patch were to form in a non-symmetric complex, it would likely form a finite assemble with a limited number of copies of the complex. However, if the complex has dihedral symmetry such that a patch is accessible at multiple separated locations, then complex can potentially form near infinite supramolecular assemblies. Continue reading