Sebastian Kelm | Oxford Protein Informatics Group

Recently I’ve been battling furiously with R while analysing some loop modelling accuracy data. The idea was simple:

Fit a general linear model to some data
Get out a formula to predict a variable (let’s call it “accuracy”) based on some input parameters
Apply this formula to new data and see how well the predictor does

It turns out, it’s not that simple to actually implement. Fitting a general linear model in R produces coefficients in a vector.

model <- glm(accuracy ~ param1 + param2 * param3, data=trainingset)
coef(model)

            (Intercept)                  param1                  param2 
            0.435395087            -0.093295388             0.148154339 
                 param3           param2:param3
            0.024399530             0.021100300

There seems to be no easy way to insert these coefficients into your formula and apply the resulting equation to new data. The only easy thing to do is to plot the fitted values against the variable we’re trying to predict, i.e. plot our predictions on the training set itself:

plot(model$fitted.values, trainingset$accuracy, xlab="score", ylab="accuracy", main="training set")

I’m sure there must be a better way of doing this, but many hours of Googling led me nowhere. So here is how I did it. I ended up writing my own parser function, which works only on very simple formulae using the + and * operators and without any R code inside the formula.

coefapply <- function(coefficients, row)
{
  result <- 0
  for (i in 1:length(coefficients))
  {
    subresult <- as.numeric(coefficients[i])
    if (!is.na(subresult))
    {
      name <- names(coefficients[i])
      if (name != "(Intercept)")
      {
        subnames <- strsplit(name, ":", fixed=TRUE)[[1]]
        for (n in subnames)
        {
          subresult <- subresult * as.numeric(row[n])
        }
      }
      result <- result + subresult
    }
  }
  return(result)
}

calculate_scores <- function(data, coefficients)
{
  scores <- vector(mode="numeric", length=nrow(data))
  for (i in 1:nrow(data))
  {
    row <- data[i,]
    scores[i] <- coefapply(coefficients, row)
  }
  return(scores)
}

Now we can apply our formula to a new dataset and plot the accuracy achieved on the new data:

model_coef <- coef(model)

# Test if our scores are the same values as the model's fitted values
training_scores <- calculate_scores(model_coef, trainingset)
sum((training_scores - model$fitted.values) < 0.000000000001) / length(scores)

# Calculate scores for our test set and plot them
test_scores <- calculate_scores(model_coef, testset)
plot(test_scores, testset$accuracy, xlab="score", ylab="accuracy", main="test set")

It works for my purpose. Maybe one day someone will see this post, chuckle, and then enlighten me with their perfectly simple and elegant alternative.

Seb gave a talk at the Oxford Structural Genomics Consortium on Wednesday 9 Jan 2013. The talk mentioned the work of several other OPIG members. Below is the gist of it.

Homology modelling pipeline with several membrane-protein-specific steps. Input is the target protein’s sequence, output is the finished 3D model.

Fragment-based loop modelling pipeline for X-ray crystallography

Given an incomplete model of a protein, as well as the current electron density map, we apply our loop modelling method FREAD to fill in a gap with many decoy structures. These decoys are then scored using electron density quality measures computed by EDSTATS. This process can be iterated to arrive at a complete model.

Over the past five years the Oxford Protein Informatics Group has produced several pieces of software to model various aspects of membrane protein structure. iMembrane predicts how a given protein structure sits in the lipid bilayer. MP-T aligns a target protein’s sequence to an iMembrane-annotated template structure. MEDELLER produces an accurate core model of the target, based on this target-template alignment. FREAD then fills in the remaining gaps through fragment-based loop modelling. We have assembled all these pieces of software into a single pipeline, which will be released to the public shortly. In the future, further refinements will be added to account for errors in the core model, such as helix kinks and twists.

X-ray crystallography is the most prevalent way to obtain a protein’s 3D structure. In difficult cases, such as membrane proteins, often only low resolution data can be obtained from such experiments, making the subsequent computational steps to arrive at a complete 3D model that much harder. This usually involves tedious manual building of individual residues and much trial and error. In addition, some regions of the protein (such as disordered loops) simply are not represented by the electron density at all and it is difficult to distinguish these from areas that simply require a lot of work to build. To alleviate some of these problems, we are developing a scoring scheme to attach an absolute quality measure to each residue being built by our loop modelling method FREAD, with a view towards automating protein structure solution at low resolution. This work is being carried out in collaboration with Frank von Delft’s Protein Crystallography group at the Oxford Structural Genomics Consortium.

Oxford Protein Informatics Group

or "OPIG" to friends

Author Archives: Sebastian Kelm

aRrrrgh! or how to apply a fitted model to new data

Talk: Membrane Protein 3D Structure Prediction & Loop Modelling in X-ray Crystallography