Category Archives: Technical

A javascript function to validate FASTA sequences

I was more than a bit annoyed of not finding this out there in the interwebs, being a strong encourager of googling (or in Jamie’s case duck-duck-going) and re-use.

So I proffer my very own fasta validation javascript function.

 * Validates (true/false) a single fasta sequence string
 * param   fasta    the string containing a putative single fasta sequence
 * returns boolean  true if string contains single fasta sequence, false 
 *                  otherwise 
function validateFasta(fasta) {

	if (!fasta) { // check there is something first of all
		return false;

	// immediately remove trailing spaces
	fasta = fasta.trim();

	// split on newlines... 
	var lines = fasta.split('\n');

	// check for header
	if (fasta[0] == '>') {
		// remove one line, starting at the first position
		lines.splice(0, 1);

	// join the array back into a single string without newlines and 
	// trailing or leading spaces
	fasta = lines.join('').trim();

	if (!fasta) { // is it empty whatever we collected ? re-check not efficient 
		return false;

	// note that the empty string is caught above
	// allow for Selenocysteine (U)
	return /^[ACDEFGHIKLMNPQRSTUVWY\s]+$/i.test(fasta);

Let me know, by comments below, if you spot a no-no.  Please link to this post if you find use for it.

p.s. I already noticed that this only validates one sequence.  This is because this function is taken out of one of our web servers, Memoir, which specifically only requires one sequence.  If there is interested for multi sequence validation I will add it.


aRrrrgh! or how to apply a fitted model to new data

Recently I’ve been battling furiously with R while analysing some loop modelling accuracy data. The idea was simple:

  1. Fit a general linear model to some data
  2. Get out a formula to predict a variable (let’s call it “accuracy”) based on some input parameters
  3. Apply this formula to new data and see how well the predictor does

It turns out, it’s not that simple to actually implement. Fitting a general linear model in R produces coefficients in a vector.

model <- glm(accuracy ~ param1 + param2 * param3, data=trainingset)
            (Intercept)                  param1                  param2 
            0.435395087            -0.093295388             0.148154339 
                 param3           param2:param3
            0.024399530             0.021100300

There seems to be no easy way to insert these coefficients into your formula and apply the resulting equation to new data. The only easy thing to do is to plot the fitted values against the variable we’re trying to predict, i.e. plot our predictions on the training set itself:

plot(model$fitted.values, trainingset$accuracy, xlab="score", ylab="accuracy", main="training set")

I’m sure there must be a better way of doing this, but many hours of Googling led me nowhere. So here is how I did it. I ended up writing my own parser function, which works only on very simple formulae using the + and * operators and without any R code inside the formula.

coefapply <- function(coefficients, row)
  result <- 0
  for (i in 1:length(coefficients))
    subresult <- as.numeric(coefficients[i])
    if (!
      name <- names(coefficients[i])
      if (name != "(Intercept)")
        subnames <- strsplit(name, ":", fixed=TRUE)[[1]]
        for (n in subnames)
          subresult <- subresult * as.numeric(row[n])
      result <- result + subresult

calculate_scores <- function(data, coefficients)
  scores <- vector(mode="numeric", length=nrow(data))
  for (i in 1:nrow(data))
    row <- data[i,]
    scores[i] <- coefapply(coefficients, row)

Now we can apply our formula to a new dataset and plot the accuracy achieved on the new data:

model_coef <- coef(model)

# Test if our scores are the same values as the model's fitted values
training_scores <- calculate_scores(model_coef, trainingset)
sum((training_scores - model$fitted.values) < 0.000000000001) / length(scores)

# Calculate scores for our test set and plot them
test_scores <- calculate_scores(model_coef, testset)
plot(test_scores, testset$accuracy, xlab="score", ylab="accuracy", main="test set")

It works for my purpose. Maybe one day someone will see this post, chuckle, and then enlighten me with their perfectly simple and elegant alternative.

On being cool: arrows on an R plot

Recently I needed a schematic graph of traditional vs on-demand computing (don’t ask) – and in this hand waving setting I just wanted the axes to show arrows and no labels.  So, here it is:

x <- c(1:5)
y <- rnorm(5)
plot(x, y, axes = FALSE)
u <- par("usr") 
arrows(u[1], u[3], u[2], u[3], code = 2, xpd = TRUE) 
arrows(u[1], u[3], u[1], u[4], code = 2, xpd = TRUE)

And here is the output

Arrowed plot

(I pinched this off a mailing list post, so this is my due reference)

Next thing I am toying with are these xkcd like graphs in R here.

A free, sweet, valid HTML4 “Site Maintenance” page

So today we have moved servers from the cloud to a physically local server and we needed a “Site Maintenance” page.  A few google searches turned up a simple HTML5 template which I converted to HTML4 and is reproduced hereunder (could not find the original source, aargh):

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"


    <meta http-equiv="Content-type" content="text/html;charset=UTF-8">
    <title>Site Maintenance</title>
    <style type="text/css">
      body { text-align: center; padding: 150px; }
      h1 { font-size: 50px; }
      body { font: 20px Helvetica, sans-serif; color: #333; }
      #article { display: block; text-align: left; width: 650px; margin: 0 auto; }
      a { color: #dc8100; text-decoration: none; }
      a:hover { color: #333; text-decoration: none; }

    <div id="article">
    <h1>We&rsquo;ll be back soon!</h1>
        <p>Sorry for the inconvenience but we&rsquo;re performing some maintenance at the moment. If you need to you can always contact us on <b>opig AT</b>, otherwise we&rsquo;ll be back online shortly!  Site should be back up on Friday 1st March 2013, 16:00 GMT.</p>
        <p>&mdash; OPIG</p>

And here is what it looks like… (nothing glamorous, you have been warned)


How to install RDKit on Ubuntu 12.04 / 12.10 / 13.04 / 13.10 / 14.04 / 14.10 (with InChI support)

I make extensive use of this brilliant piece of cheminformatics software (RD)kit, and it has saved me writing my own Molecule, Atom, Bond, Conformer, Fingerprint, SimilarityMetric, Descriptor etc. classes time and time again.  It is really neat, and works with C++ and python (and Java I think).  Here are the instructions on how to install it on a relatively recent Ubuntu version (12.04 / 12.10 / 13.04 / 13.10 / 14.04 / 14.10).

Pre-requisite software (this is why I love Debian based systems)

sudo apt-get install flex bison build-essential python-numpy cmake python-dev sqlite3 libsqlite3-dev libboost-dev  libboost-python-dev libboost-regex-dev

Get the latest RDKit goodie from here (watch out – version number has been replaced by ‘X’ below)


Unzip the beast, save it to /opt

sudo tar xzvf RDKit_20XX_XX_X.tgz -C /opt

Add some environment salt, vim ~/.bashrc

export RDBASE=/opt/RDKit_20XX_XX_X

Resource your .bashrc

. ~/.bashrc

if you want the InChI stuff (trust me you do), first:

cd $RDBASE/External/INCHI-API/

Build (compile), install & test

mkdir build
cd build
cmake .. # if you do not care for InChI support OR
cmake -DRDK_BUILD_INCHI_SUPPORT=ON .. # to install InChI generation code 
make # -j 4 to use multiple processors
make install

If all your tests passed successful you are good to go.  Otherwise, get in touch via the post comments below.



How to make a custom latex bibliography style

Imagine you are writing up your latest thrilling piece of science in your favourite odt or docx format. Nothing comes from nothing so you need to cite the 50 or so people whose ideas you built on, or who came to conclusions that contradict yours. Suddenly you realize that your second sentence needs a reference… and this will require you to renumber the subsequent 50. What a drag! There goes 5 minutes of your life that could have been better spent drinking beer.

Had you written your research in latex instead, this drudgery would have been replaced by a range of much more interesting and intractable difficulties. In latex it is easy to automagically renumber everything just by recompiling the document. Unfortunately, it is often hard to direct latex’s magic… just try moving a picture an inch to the right, or reformatting a reference.

Moving figures around is still a black art as far as I’m concerned… but I’ve recently found out an easy way to reformat references. This might be especially handy when you find out that your sort of proteins fall out of the scope of the International Journal of Eating Disorders and you now want to submit to a journal that requires you to list authors in small-caps, and the dates of publication in seconds from the Unix epoch.

A good way of including references in latex is with a “.bib” file and a “.bst” file. An example of the end of a document is shown below.



What’s happening here? All my references are stored in bibtex format in a database file called “mycollection.bib”. A separate file “myfile.bst” says how the information in the database should be presented. For example, are references in the text of the form (Blogs et al 2005) or are they numbered (1)? At the end of the text are they grouped in order of appearance, by date of publication or alphabetically? If alphabetically does “de Ville” come under “d” or “v”? To reformat a reference, we simply need to change “myfile.bst”.

Most latex distributions come with a set of bibliography styles. Some examples can be found here (a page which also explains all of the above much better than I have). However, it is very easy to produce a custom file using the custom-bib package. After a one-click download it is as simple as typing:

latex makebst.ins
latex makebst.tex

Here’s a screenshot to prove it. At the bottom is the first of thirty or forty multiple-choice questions about how you want your references to look. If in doubt, just close your eyes and press return to select the default.

Screen shot 2013-02-17 at 00.34.45

The problem with a multiple-choice fest is that if you make a poor decision at question 28 you have to go through the whole process again. Fortunately, this can be circumvented — as well as generating a pretty “myfile.bst” file, the custom-bib package generates an intermediate file “myfile.dbj”. Changing your multiple-choice answers is just a matter of commenting out the relevant parts and typing “latex myfile.dbj”. A snippet of a “dbj” file is below:

Screen shot 2013-02-17 at 00.41.42

Selected options are those without a “%” sign on the left hand side. Who would have thought that Latex could be so cuddly?

How to make environment variables persist in your web app (Apache2)

Last Sunday, with the Memoir gang, we were talking about using this blog as a technical notebook. This post is in that spirit.

We have moved most of the validation of Memoir before submitting the job to the queue.  This has the advantage that the user knows immediately the submission has failed (instead of waiting for the job to be processed in the queue).  The only issue with this is that the apache user (on most systems www-data) is going to run the validation pipeline.  Memoir requires a ton of PATH variables addition.

In most cases you cannot add variables in .bashrc as the apache user does not have a home directory.  The easiest way how to add environment variables for the apache user is:

sudo vim /etc/apache2/envvars

And add the following line at the bottom (these have to be customized according to the server):

export PATH=$PATH:/opt/tmalign:/opt/muscle:/opt/joy-5.10:/opt/joy-5.10/psa:/opt/joy_related-1.06:/opt/joy_related-1.06/sstruc:/opt/joy_related-1.06/hbond:/opt/medeller-dev/bin:/opt/usearch6.0.307:/opt/ncbi-blast-2.2.27+:/opt/MPT/bin

After this restart apache – and you should be laughing. Or, as is often the case, apache will be laughing at you.

sudo service apache2 restart

The apache user should now “see” the amended path variable.

Here it is this technical note – so in a few weeks time when I forget all of the above I can just copy and paste it …

(This post is short, also because there is the superbowl!!)