
Quick Standalone BLAST Setup for Ubuntu Linux

Some people run into trouble trying to set up a standalone version of BLAST using the NCBI instructions. Here is a streamlined process, targeted at Ubuntu.

I assume that you are aware of the paradigms of BLAST, meaning that there are several executables for searching nucleic acids or proteins and there are different databases you can blast against. If not, you should read up on the available search tools and databases before you attempt to install BLAST. NB, throughout this document, I am using protein blast and protein input – changing to nucleotide sequences is trivial as you just change blastp to blastn and ‘prot’ to ‘nucl’ in obvious places (and of course you use different queries and target databases).
Without further ado, Blast setup for UNIX.
There are two components for the installation:
  1. Executables (blastn, blastp etc.)
  2. Databases (nr, nt etc.)
Both are described below with follow-up examples of usage.
Ad.1 The executables can be downloaded and compiled from here (download the source, then run ./configure, make and finally make install in the directory of the untarred file). However, a much easier way to do it under Ubuntu is:
sudo apt-get install ncbi-blast+
This automatically installs everything. In both cases to check if all went ok, type:
which blastp
If you get a directory such as /usr/local/bin then all went well and that’s where your executables are.
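
If you do choose the source route mentioned above, the steps look roughly like this (the archive and directory names are illustrative and depend on the version you download):

tar -xzf ncbi-blast-*+-src.tar.gz   # archive name depends on the version you downloaded
cd ncbi-blast-*+-src                # move into the untarred directory
./configure
make
sudo make install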
Ad.2 First, you need to decide where to store the databases. Do this by setting the environment variable:
export BLASTDB=/path/to/blastdbs/of/your/choosing
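
If you want this setting to survive between sessions, you could also append it to your shell profile (the path is just an example):

echo 'export BLASTDB=/path/to/blastdbs/of/your/choosing' >> ~/.bashrc   # use your actual database path here
source ~/.bashrc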
Now, we can either use one of the ncbi-curated databases or create our own. We will do both.
A) Downloading and using an ncbi-curated database.
The databases can be downloaded using the update_blastdb script. As an example I will download the non-redundant protein database, which is referred to as ‘nr’:
cd $BLASTDB
sudo update_blastdb --passive --timeout 300 --force --verbose nr
ls *.gz |xargs -n1 tar -xzvf
rm *.tar.gz*

The penultimate command extracts all the files you have downloaded and the last one removes the downloaded archives (and their checksum files).
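
To check that the database is in place and readable, you could query it with blastdbcmd, which is installed alongside the other BLAST+ executables; it should print a short summary of the database:

blastdbcmd -db nr -info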

Now you should be able to use your new database by executing (where somesequence.fasta is your sample query):

blastp -db nr -query somesequence.fasta
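
In practice you will usually want to control the output as well; for example (the E-value cut-off, tabular output format and thread count here are just illustrative choices):

blastp -db nr -query somesequence.fasta -evalue 1e-5 -outfmt 6 -num_threads 4 -out results.tsv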

Done.

B) Creating your own database.

Firstly, put a bunch of fasta protein sequences into a file called sample.fa

Next, execute the following

makeblastdb -in sample.fa -dbtype 'prot' -out NewDb
mv NewDb* $BLASTDB/

We have now created a BLAST protein database from your fasta file, called NewDb. The last line simply moves all the BLAST database files to the database directory.

Now you should be able to use your new database by executing (where somesequence.fasta is your sample query):

blastp -db NewDb -query somesequence.fasta

Done.
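
As noted at the start, the nucleotide version of the same workflow only swaps the executable and the database type; a minimal sketch with illustrative file names:

makeblastdb -in sample_nt.fa -dbtype 'nucl' -out NewNtDb   # 'nucl' marks a nucleotide database; names are illustrative
mv NewNtDb* $BLASTDB/
blastn -db NewNtDb -query somequery.fasta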

Afterword

These instructions are the shortest way I could find to get a working stand-alone BLAST application. If you require more info, you can look here.

 

Journal Club: Statistical Quality Indicators for Electron Density Maps

This week I presented Ian Tickle’s 2012 paper “Statistical quality indicators for electron-density maps”. This paper presented new, statistically robust metrics for describing the agreement between an atomic model and the experimentally derived electron density.

Previous metrics such as the Real-Space R-factor (RSR) and Real-Space Correlation Coefficient (RSCC) (Brändén & Jones, 1990; Jones et al., 1991, and others) are popular electron-density metrics, and can inform on the quality of an atomic model. However, as Tickle argues, they cannot tell us in what way a model is good or bad, as they give no indication of either the accuracy or the precision of the electron density.

Accuracy:

Ian Tickle describes accuracy as – “How close are the results on average to the truth (regardless of precision)?” This is more often referred to as ‘error’. The most accurate model is the one that best agrees with the electron density.

Precision:

Precision is described as – “If you were to repeat the experiment, how much would you expect the results to vary (regardless of accuracy)?” This is more often described as ‘uncertainty’. Precision is a property of the crystal and the experiment. It is independent of the model.

A pictographic representation is shown below –

[Figure: pictographic representation of accuracy and precision. Taken from Tickle, 2012.]

Before the discussion of the new metrics proposed, there are several assumptions that must be made and several influencing factors to be considered.

Assumptions:

  • The electron density, and the phases used to generate it, are accurate. This assumption is reasonable because density-based validation is generally done near to the end of refinement when the model is mostly correct.

Metric usefulness depends critically on:

  • Accurate calculation and representation of the electron density from our atomic model.
  • Accurate scaling of the observed and model density (neither the calculated nor the observed density is on an absolute scale).
  • Accurate determination of the area to be considered for the calculation of the metric. If too large an area is considered, noise and density from other features will influence the metric. Too small an area will not encompass the whole model and its environment.

Calculating the Model Density:

Accurate calculation of the model’s electron density is essential, as the profile of the atoms will of course affect the comparison of the model to the experimental density. Often (as in Jones et al., 1991, and others) a fixed profile is assumed for all atoms. In reality, however, the profile will depend on atom type, B-factors, data completeness, and resolution limits.

Due to the resolution limits, the electron density from an atom is the convolution of a 3D Gaussian with the Fourier transform of a sphere of constant scattering power (Blundell & Johnson, 1976). The truncated density function for an atom then becomes:

[Equation from Tickle, 2012: the resolution-truncated density function for an atom.]
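
In outline (my notation, not taken verbatim from the paper): truncating the Fourier series at the resolution limit d_min is equivalent to convolving the atom’s Gaussian density with the Fourier transform of a reciprocal-space sphere of radius s_max = 1/d_min, which gives the familiar interference function

G(r) = \frac{3\,(\sin x - x\cos x)}{x^{3}}, \qquad x = 2\pi s_{\max} r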

Scaling the calculated density:

This, fortunately, is already calculated by refinement programs (when computing 2mFo – DFc maps), and the correct scaling factor is the resolution-dependent D.

Visualising the quality of a model:

To demonstrate how the (global) quality of a model can easily be seen, Tickle calculates and normalises difference density maps for a good, and a bad, model. If the model is ‘correct’, then the difference density should be gaussian noise, but if the model is ‘incorrect’, it will be skewed. This can easily be seen in Figure 8 from the paper.

[Figure 8 from Tickle, 2012: normalised difference-density distributions for (a) a good model and (b) a bad model.]

A difference density map is calculated, sorted by value and normalised to give a density distribution. For a good model, this should look like (a), where the distribution is a Gaussian ~ N(0,1). For a bad model, (b), the distribution is skewed.

The main feature that appears in a ‘bad’ model is the increased weight in the tails of the distribution. Extra weight on the left-hand side indicates parts of the model that are not supported by the evidence, and extra weight on the right-hand side indicates under-modelled density.

The New Accuracy Metric

Using the ideas from above (that the difference density between a model and the experimental density should be distributed as a gaussian) Tickle goes on to develop metrics for determining the likelihood that a model density and an experimental density differ only by the addition of random noise.

The metric he describes tests a hypothesis – Does the distribution of the difference density reflect that obtained from the propagation of random errors in the experimental data (and phases)?

To do this, statistical tests are developed. First we define the difference density Z-score (ZD)

Z_D = \frac{\Delta\rho}{\sigma(\Delta\rho)}

This quantity is the difference between the calculated electron density and the experimental density (delta rho), divided by the error in the difference density, giving the normal definition of a normalised Z-score.

The difference density (the numerator) has been discussed above, so we now discuss the error in the difference density. Assuming that the experimental data and the phases are ‘correct’, any non-random errors arise only from the model.

That is, errors arising in the experimental data will appear only as random noise, whereas errors in the model will manifest as the signal that we are trying to detect.

To calculate the strength of the noise (that of the experimental data and the phases), we look at the bulk-solvent regions. Here, the atoms are unordered, and so should give uniform density. Any deviations from uniform should be approximately the random noise from the experimental data and the phases.
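
One way to formalise this (my notation; treating the map as values on grid points is an assumption): estimate the noise as the root-mean-square difference density over the N_solv grid points lying in the bulk-solvent region,

\sigma(\Delta\rho) \approx \sqrt{\frac{1}{N_{\mathrm{solv}}} \sum_{\mathbf{x} \in \mathrm{solvent}} \Delta\rho(\mathbf{x})^{2}}

since the flat solvent model leaves nothing there but experimental and phase noise.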

Maximum Z-score analysis

Tickle considers using the maximum ZD of a sample as a test for model accuracy, and discusses merits and failings. In brief, if we were to sample from a difference density distribution, and take only the most significant ZD score, “focusing only on the maximum value inevitably overstates the significance of the results”.

A Chi-Squared test for ZD scores

The solution that Tickle proposes is to allow that all sample values may be significant (rather than just the largest values). He creates a joint probability density function of the absolute sample values (assumed half-normal and iid). This probability density function then becomes a chi-squared distribution.

[Equation: the joint probability density of the sample Z-scores, which reduces to a chi-squared distribution.] By calculating the CDF of the chi-squared (a regularised lower incomplete gamma function), Tickle is able to obtain p-values for a set of observations.

[Equation: conversion of the chi-squared cumulative probability back to an equivalent Z-score.] These can then be converted back to Z-scores, which crystallographers are more comfortable using. As Tickle states, just because the metric is in terms of Z-scores does not mean that the distribution is normal (here it is clearly a chi-squared).
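
Putting the pieces of the test together in symbols (conventions assumed here, not quoted from the paper): for a sample of n difference-density Z-scores,

\chi^{2} = \sum_{i=1}^{n} Z_{D,i}^{2} \;\sim\; \chi^{2}_{n}, \qquad p = P\!\left(\tfrac{n}{2}, \tfrac{\chi^{2}}{2}\right), \qquad Z = \Phi^{-1}(p)

where P is the regularised lower incomplete gamma function and \Phi^{-1} is the standard normal quantile that expresses the cumulative probability as an equivalent Z-score.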

Problem: Large Samples

The problem with the above is that for large samples, a small number of significant values will be drowned out by noise and the signal may be missed. The failure of the above test in this situation is put down to the choice of null hypothesis. Multiple null hypotheses are needed in order to cover all possibilities.

When distinguishing between multiple hypotheses, we must aim to avoid type II errors wherever possible, whilst attempting to minimise type I errors. We must select the hypothesis “that maximises the probability of obtaining a result less extreme than the one actually observed (…) or equivalently the one that minimises the probability of obtaining a result more extreme than that observed”.

Solution: A new JPDF and CDF

To solve this, Tickle takes a subset of the highest values of the original sample of n, say from i = k to n (the n − k + 1 highest scores), and calculates the chi-squared and its associated cumulative probability. We then choose the value of k that gives us the highest probability,

[Equation: the cumulative probability, maximised over the choice of k.] However, the cumulative probability of the chi-squared is no longer the regularised gamma function, due to the bias introduced by selecting the largest values. Recalculating the JPDF and integrating analytically and numerically to obtain the CDF, we could arrive at a result. This, however, has the problem of high dimensionality, which requires very accurate Monte Carlo integration (an accuracy of much better than 0.27% is required, since we are interested in p-values between 0.9973 and 1 – greater than 3 sigma).

Fortunately, an approximation can be made to bring about a practical solution.

Examples of Significant Distributions:

Tickle generates a table which gives several scenarios that will give a significant result, for different extremes of Z-value and different sample sizes. One particularly key point is:

…small samples are statistically less reliable so require a higher proportion of significant data points to achieve the same overall level of significance. Large samples require relatively fewer data points but they must have higher values to overcome the ‘multiple comparisons’ effect, where large values are more likely to occur purely as a result of random error.

Summary

B-factors:

Tickle shows early in the paper that the RSR and RSCC are correlated with the B-factor of the model. RSZD shows no correlation with the B-factor, as is desired.

RSZD+ & RSZD-:

More useful scores can be generated by scoring the negative and positive values of delta-rho separately. This gives two scores, RSZD+ and RSZD-. RSZD+ gives the significance/prevalence of unexplained density (missing atoms) and RSZD- gives the significance/prevalence of unjustified model/misplaced atoms.

Precision & Reliability:

Although not discussed in as much depth in the paper, Tickle also proposes a metric to account for the precision of the electron density

[Equation from Tickle, 2012: definition of RSZO, the observed-density signal-to-noise score.]

This is clearly independent of the model, and is the signal-to-noise ratio of the average observed density in a specified region. Weak density (or large noise) will lead to a small RSZO, implying that any model placed here should be considered unreliable.
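
Written out in the same spirit (the exact form of the denominator is my assumption rather than a quotation from the paper), the score is the mean observed density over the region of interest divided by the noise level of the map:

\mathrm{RSZO} = \frac{\langle \rho_{\mathrm{obs}} \rangle}{\sigma(\rho)}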