A program to aid primary protein structure determination -1962 style.

This year, OPIG have been doing series of weekly lectures on papers we considered to be seminal in the field of protein informatics. I initially started looking at “Comprotein: A computer program to aid primary protein structure determination” as it was one of the earliest (1960s) papers discussing a computational method of discovering the primary structure of proteins. Many bioinformaticians use these well-formed, tidy, sterile arrays of amino acids as the input to their work, for example:

MGLSDGEWQL VLNVWGKVEA DIPGHGQEVL IRLFKGHPET LEKFDKFKHL KSEDEMKASE DLKKHGATVL TALGGILKKK GHHEAEIKPL AQSHATKHKI PVKYLEFISE CIIQVLQSKH PGDFGADAQG AMNKALELFR KDMASNYKEL GFQG
(For those of you playing at home, that’s myoglobin.)

As the OPIG crew come from a diverse background and frequently ask questions well beyond my area of expertise, if for nothing other than posterior-covering, I needed to do some background reading. Though I’m not a researcher by trade any more, I began to realise despite the lectures/classes/papers/seminars I’d been exposed to, regarding all the clever things you do with a sequence when you have it, I didn’t know how you would actually go from a bunch of cells expressing (amongst a myriad of other molecules) the protein you were interested in, to the neat array of characters shown above. So without further ado:

The first stage in obtaining your protein is: cell lysis and there’s not much in it for the cell.
Mangle your cell using chemicals, enzymes, sonification or a French press (not your coffee one).

The second stage is producing a crude extract by centrifuging the above cell-mangle. This, terrifyingly, appears to be done between 10,000G and 100,000G and removes the cellular debris leaving it as a pellet in the bottom of the container, with the supernatant containing little but a mix of the proteins which were present in the cytoplasm along with some additional macromolecules.

Stage three is to purify the crude extract. Depending on the properties of the protein you’re interested in, one or more of the following stages are required:

• Reverse-phase chromatography to separate based on hydrophobicity
• Ion-exchange to separate based on the charge of the proteins
• Gel-filtration to separate based on the size of the proteins

If all of the above are preformed, whilst the sequence of these variously charged/size-sorted/polar proteins will still be still unknown, they will now be sorted into various substrates based upon their properties. This is where the the third stage departs from science and lands squarely in the realm of art. The detergents/protocols/chemicals/enzymes/temperatures/pressures of the above techniques all differ depending on the hydrophobicity/charge/animal source of the type of protein one is aiming to extract.

Since at this point we still don’t know their sequence, working out the concentrations of the various constituent amino acids will be useful. One of the simplest methods of determining the amino acid concentrations of a protein is follow a procedure similar to:

Heat the sample in 6M HCL at at a temperature of 110C for 18-24h (or more) to fully hydrolyse all the peptide bonds. This may require an extended period (over 72h) to hydrolyse peptide bonds which are known to be more stable, such as those involving valine, isoleucine and leucine. This however can degrade Ser/Thr/Tyr/Try/Gln and Cys which will subsequently skew results. An alternative is to raise the pressure in the vessel to allow temperatures of 145-155C to for 20-240 minutes.

TL;DR: Take the glassware that’s been lying about your lab since before you were born, put 6M hydrochloric acid in it and bring to the boil. Take one difficultly refined and still totally unknown protein and put it in your boiling hydrochloric acid. Seal the above glassware in order to use it as a pressure vessel. Retreat swiftly whilst the apparatus builds up the appropriate pressure and cleaves the protein as required. -What could go wrong?

At this point I wondered if the almost exponential growth in PDB entries was due to humanity’s herd of biochemists now having been thinned to those which remained simply being several generations worth of lucky.

Once you have an idea of how many of each type of amino acid comprise your protein, we can potentially rebuild it. However at this point it’s like we’ve got a jigsaw puzzle and though we’ve got all the pieces and each piece can only be one of a limited selection of colours (thus making it a combinatorial problem) we’ve no idea what the pattern on the box should be. To further complicate matters, since this isn’t being done on but a single copy of the protein at a time, it’s like someone has put multiple copies of the same jigsaw into the box.

Once we have all the pieces, to determine the actual sequence, a second technique needs to be used. Though invented in 1950, Edman degradation appears not to have been a particularly widespread protocol, or at least it wasn’t in the National Biomedical Research Foundation from which the above paper emerged. This means of degradation tags the N-terminal amino acid and cleaves it from the rest of the protein. This can then be identified and the protocol repeated. Whilst this would otherwise be ideal, it suffers from a few issues in that it takes about an hour per cycle, only works reliably on sequences of about 30 amino acids and doesn’t work at all for proteins which have their n-terminus bonded or buried.

Instead, the refined protein is cleaved into a number of fragments at known points using a single enzyme. For example, Trypsin will cleave on the carboxyl side of arginine and lysine residues. A second copy of the protein is then cleaved using a different enzyme at a different point. These individual fragments are then sorted as above and their individual (non-sequential) components determined.

For example, if we have a protein which has an initial sequence ABCDE
Which then gets cleaved by two different enzymes to give:
Enzyme 1 : (A, B, C) and (D, E)
Enzyme 2 : (A, B) and (C, D)

We can see that the (C, D) fragment produced by Enzyme 2 overlaps with the (A, B, C) and (D, E) fragments produced by Enzyme 1. However, as we don’t know the order in which the amino acid appear within in each fragment, thus there are a number of different sequences which can come to light:
 Possibility 1 : A B C D E Possibility 2 : B A C D E Possibility 3 : E D C A B Possibility 4 : E D C B A 

At this point the paper comments that such a result highlights to the biochemist that the molecule requires further work for refinement. Sadly the above example whilst relatively simple doesn’t include the whole host of other issues which plague the biochemist in their search for an exact sequence.

The origins of exponential random graph models

The article An Exponential Family of Probability Distributions for Directed Graphs, published by Holland and Leinhardt (1981), set the foundation for the now known exponential random graph models (ERGM) or p* models, which model jointly the whole adjacency matrix (or graph) $latex X$. In this article they proposed an exponential family of probability distributions to model $latex P(X=x)$, where $latex x$ is a possible realisation of the random matrix $latex X$.

The article is mainly focused on directed graphs (although the theory can be extended to undirected graphs). Two main effects or patterns are considered in the article: Reciprocity, which relates to appearance of symmetric interactions ($latex X_{ij}=1 \iff X_{ji}=1$) (see nodes 3-5 of the Figure below).

and, the Differential attractiveness of each node in the graph, which relates to the amount of interactions each node “receives” (in-degree) and the amount of interactions that each node “produces” (out-degree) (the Figure below illustrates the differential attractiveness of two groups of nodes).

The model of Holland and Leinhardt (1981), called p1 model, that considers jointly the reciprocity of the graph and the differential attractiveness of each node is:

$latex p_1(x)=P(X=x) \propto e^{\rho m + \theta x_{**} + \sum_i \alpha_i x_{i*} + \sum_j \beta_j x_{*j}},$

where $latex \rho,\theta,\alpha_i,\beta_j$ are parameters, and $latex \alpha_*=\beta_*=0$ (identifying constrains).  $latex \rho$ can be interpreted as the mean tendency of reciprocation, $latex \theta$ can be interpreted as the density (size) of the network, $latex \alpha_i$ can be interpreted as as the productivity (out-degree) of a node, $latex \beta_j$ can be interpreted as the attractiveness (in-degree) of a node.

The values $latex m, x_{**}, x_{i*}$ and $latex x_{*j}$ are: the number of reciprocated edges in the observed graph, the number of edges, the out-degree of node i and the in-degree of node j; respectively.

Taking $latex D_{ij}=(X_{ij},X_{ji})$, the model assumes that all $latex D_{ij}$ with $latex i<j$ are independent.

To better understand the model, let’s review its derivation:

Consider the pairs $latex D_{ij}=(X_{i,j},X_{j,i}),\,i<j$ and describe the joint distribution of $latex \{D_{ij}\}_{ij}$, assuming all $latex D_{ij}$ are statistically independent. This can be done by parameterizing the probabilities

$latex P(D_{ij}=(1,1))=m_{ij} \text{ if } i<j,$

$latex P(D_{ij}=(1,0))=a_{ij} \text{ if } i\neq j,$

$latex P(D_{ij}=(0,0))=n_{ij} \text{ if } i<j,$

where $latex m_{ij}+a_{ij}+a_{ji}+n_{ij}=1,\, \forall \, i<j$.

$latex P(X=x)=\prod_{i<j} m_{ij}^{x_{ij}x_{ji}} \prod_{i\neq j}a_{ij}^{x_{ij}(1-x_{ji})} \prod_{i<j}n_{ij}^{(1-x_{ij})(1-x_{ji})} =e^{\sum_{i<j} {x_{ij}x_{ji}} \rho_{ij} + \sum_{i\neq j}{x_{ij}} \theta_{ij}} \prod_{i<j}n_{ij},$

where $latex \rho_{ij}=log(m_{ij}n_{ij} / a_{ij}a_{ji})$ for $latex i<j$, and $latex \theta_{ij}=log(a_{ij}/n_{ij})$ for $latex i\neq j$.

It can be seen that the parameters $latex \rho_{ij}$ and $latex \theta_{ij}$ can be interpreted as the reciprocity and differential attractiveness, respectively. With a bit of algebra we get:

$latex exp(\rho_{ij})=[ P(X_{ij}=1|X_{ji}=1)/P(X_{ij}=1|X_{ji}=0) ]/[ P(X_{ij}=1|X_{ji}=0) / P(X_{ij}=0|X_{ji}=0) ],$
and
$latex exp(\theta_{ij})=P(X_{ij}=1|X_{ji}=0)/P(X_{ij}=0|X_{ji}=0).$

Now, if we consider the following restrictions:

$latex \rho_{ij}=\rho,\, \forall i<j$, and $latex \theta_{ij}=\theta+\alpha_i + \beta_j,\, \forall i\neq j$ where $latex \alpha_*=\beta_*=0$.

With some algebra we get the proposed form of the model

$latex p_1(x)=P(X=x) \propto e^{\rho m + \theta x_{**} + \sum_i \alpha_i x_{i*} + \sum_j \beta_j x_{*j}}.$

Modeling of Protein Interaction Networks

In the group meeting on the 20th of August, I presented the paper by Vazquez et al (2002). This was one of the first papers proposing the duplication divergence model of evolution for protein interaction networks, and thus has had a significant impact on the field, inspiring many variants of the basic model. The paper starts out by noting that the yeast protein network has the ‘small world’ property – following along links in the network, it requires only a handful of steps to go from any one protein to any other. Another property is the manner in which the links are shared out among the various proteins: empirically, the probability that a protein interacts with k other proteins follows a power-law distribution.

Vazquez et al. show how evolution can produce scale-free networks. They explore a model for the evolution of protein networks that accurately reproduces the topological features seen in the yeast S. cerevisae. As the authors point out, proteins fall into families according to similarities in their amino-acid sequences and functions, and it is natural to suppose that such proteins have all evolved from a common ancestor. A favoured hypothesis views such evolution as taking place through a sequence of gene duplications – a relatively frequent occurrence during cell reproduction. Following each duplication, the two resulting genes are identical for the moment, and naturally lead to the production of identical proteins. But between duplications, random genetic mutations will lead to a slow divergence of the genes and their associated proteins. This repetitive, two-stage process can be captured in a relatively simple model for the growth of a protein interaction network .

Simulations by the author show that the model, starting out with a seed capture the degree distribution of the yeast network with high fidelity (Fig 1) and also possesses the quality of high tolerance to random node removal seen in biological networks. While the results are more qualitative in nature, the model still serves as the basis of most biologically rooted explanations of protein network evolution, with minor improvements. Some of these additions have been the use of asymmetric divergence, whole genome duplication events as well as interaction site modelling. As the jury is still out on what model (if any) best fits current interaction data, the basic model is still relevant as a benchmark for newer models.

Fig. 1. Zipf plot for the PIN and the DD model with p = 0.1, q = 0.7 with N = 1,825. k is the connectivity of a node and r is its rank in decreasing order of k. Error bars represent standard deviation on a single network realization.