Monthly Archives: September 2016

Physical-chemical property predictors as command line tools

The Instant JChem Suite, from ChemAxon, is a fantastic set of software designed for chemists. It allows simple database management for storing both chemical and non-chemical data, and it contains a plethora of physical-chemical prediction and visualisation tools that can be used by bench chemists and computational scientists alike.

I personally believe that the hidden gem of the suite is that these predictive tools are also available on the command line, through the ChemAxon Calculator (cxcalc). Calculator plugins have also been developed by external developers. With a little scripting, this allows you to incorporate the powerful predictive tools of ChemAxon into your larger workflows.

For example, it can be used to predict the dominant protonation state of a ligand before use in MD or docking studies, with the majormicrospecies tool. You can input and output all major file types including SDF, PDB and MOL2 using commands such as:

cxcalc [Input_File].sdf -o [Output_File].sdf majormicrospecies -H [pH] -f [Output_File_Type]:H

You can easily find out which calculator plugins are available, and how to construct input commands, using the cxcalc -h command or the available online information. I would thoroughly recommend looking at the tools available and thinking about how you could incorporate them into your workflows.
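As a sketch of how this might slot into a larger workflow, the loop below batch-protonates every ligand SDF in a directory at pH 7.4 using the majormicrospecies plugin, mirroring the command above. The directory layout, output naming and pH value are illustrative, and it assumes cxcalc is installed, licensed and on your PATH:

```shell
# Illustrative sketch: batch-protonate ligand SDFs with majormicrospecies.
# Paths and the pH value are placeholders; assumes cxcalc is installed.
processed=0
if command -v cxcalc >/dev/null 2>&1; then
    for ligand in ligands/*.sdf; do
        [ -e "$ligand" ] || continue   # skip if the glob matched nothing
        # dominant protonation state at pH 7.4, as SDF with explicit hydrogens
        cxcalc "$ligand" -o "${ligand%.sdf}_pH7.4.sdf" majormicrospecies -H 7.4 -f sdf:H
        processed=$((processed + 1))
    done
    echo "protonated $processed ligand file(s)"
else
    echo "cxcalc not found on PATH; nothing to do" >&2
fi
```

The output files can then go straight into your docking or MD preparation step.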

Counting Threads

When someone talks about “counting threads” the first thing that you think of is probably shopping for bed sheets. But this post is not about the happy feeling of drifting off to sleep on smooth, comfortable Egyptian cotton. This post is about that much less happy feeling when you want to quickly run a bit of code on a couple of data sets to finish the results section of a thesis chapter, and you see this:

Luis blocking the server again….

Obviously someone (*cough*Luis*cough*) is having some fun on the server without nice-ing their code to allow people who are much less organized than they should be (*cough*me*cough*) to do a quick last-minute data analysis run.

The solution: confront the culprit with their excessive server usage (Note: alternatives include manual server restart with a power cord to make the world your enemy – not recommended).

So… now we just need to find out how much of the server the user “ospina” is using. Screenshots won’t convince him… and we can’t take enough screenshots to show the extent of the server-hogging with his 1000s of processes anyway. We need to count…

Luckily there is a handy function to find out information about processes, called pgrep. This is basically a ‘ps | grep’ function which has a bunch of options to reflect the many ways it can be used. We see ospina is running R, so here goes:

pgrep -c R

The -c flag counts processes and the pattern matches the command name that was run. But yeah, it turns out this wasn’t the best idea ever. A lot of people are running R (as might be expected in the Statistics Department), and you get a number that is really too high to be likely. We need to be more specific in our query, so let’s go back to the ps command. Second attempt:

ps -Af | grep ospina | wc -l

What we’re doing now is first showing all processes that are run on the server (ps -A) also showing details of the command run and who ran it (-f flag). Then we’re finding the ones that are labelled with our server culprit (grep ospina) and counting the lines we find. There are annoyingly still a few problems with this approach.

  1. We just ran this command on the server, and will thus also count a command like grep --color=auto ospina,
  2. User “ospina” is probably running a few more things than just his R command (like ssh-ing into the server and maybe a couple of screens)
  3. We get a number that looks far lower than what we expected just by visual inspection.

So… what happened? We can fix problems 1 and 2 by just piping to a further grep command. But problem 3 is different. As it turns out, our culprit is running multiple threads from the same process (which is also why you find so many chrome instances on htop for example). We just counted processes, when really the server is being occupied by his multi-threading exploits. So… if you want to back up your complaint with a nice number, here’s your baby:

ps -ALf | grep ospina | grep R-3.3 | wc -l

The -L flag displays all threads instead of only the processes. I further grepped for R-3.3, as it turns out he is running that specific version of R, which makes the match more specific. Otherwise it also helps to search against the input arguments given to the commands. If your fingers get too tired to press the shift key that often, ps -ALf is equivalent to ps -eLf.
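As a hedged footnote, both the self-matching problem and the user filtering can be pushed into the tools themselves, so no grep is needed at all: ps takes -u to select one user’s processes (and -L to expand them into threads), and pgrep takes -u and -x for user-restricted, exact-name matches. A sketch, assuming a Linux/procps ps; the CULPRIT variable defaults to the current user so the lines run anywhere, and you would swap in the real name:

```shell
# Thread count for one user, with no grep pitfalls:
# -L lists every thread, -u filters by user, --no-headers drops the
# header line so wc -l counts only real entries.
CULPRIT="${CULPRIT:-$(id -un)}"
ps -u "$CULPRIT" -L --no-headers | wc -l

# pgrep equivalent for processes only: -c counts, -u filters by user,
# -x requires an exact command-name match ("R", not everything containing R)
pgrep -c -x -u "$CULPRIT" R || true
```

Note that pgrep still counts processes rather than threads, so the ps line is the one that backs up the complaint.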

For now: moan away, folks!

 

Disclaimer: Any scenarios alluded to in the above text are fictitious and do not represent the behaviour of the individuals mentioned. Read: obviously I do not do my analysis last-minute.

The Protein World

This week’s issue of Nature has a wonderful “Insight” supplement titled, “The Protein World” (Vol. 537 No. 7620, pp 319-355). It begins with an editorial from Joshua Finkelstein, Alex Eccleston & Sadaf Shadan (Nature, 537: 319, doi:10.1038/537319a), and introduces four reviews, covering:

  • the computational de novo design of proteins that spontaneously fold and assemble into desired shapes (“The coming of age of de novo protein design“, by Po-Ssu Huang, Scott E. Boyken & David Baker, Nature, 537: 320–327, doi:10.1038/nature19946). Baker et al. point out that much of protein engineering until now has involved modifying naturally-occurring proteins, but assert, “it should now be possible to design new functional proteins from the ground up to tackle current challenges in biomedicine and nanotechnology”;
  • the cellular proteome is a dynamic structural and regulatory network that constantly adapts to the needs of the cell—and through genetic alterations, ranging from chromosome imbalance to oncogene activation, can become imbalanced due to changes in speed, fidelity and capacity of protein biogenesis and degradation systems. Understanding these complex systems can help us to develop better ways to treat diseases such as cancer (“Proteome complexity and the forces that drive proteome imbalance“, by J. Wade Harper & Eric J. Bennett, Nature, 537: 328–338, doi:10.1038/nature19947);
  • the new challenger to X-ray crystallography, the workhorse of structural biology: cryo-EM. Cryo-electron microscopy has undergone a renaissance in the last 5 years thanks to new detector technologies, and is starting to give us high-resolution structures and new insights about processes in the cell that are just not possible using other techniques (“Unravelling biological macromolecules with cryo-electron microscopy“, by Rafael Fernandez-Leiro & Sjors H. W. Scheres, Nature, 537: 339–346, doi:10.1038/nature19948); and
  • the growing role of mass spectrometry in unveiling the higher-order structures and composition, function, and control of the networks of proteins collectively known as the proteome. High resolution mass spectrometry is helping to illuminate and elucidate complex biological processes and phenotypes, to “catalogue the components of proteomes and their sites of post-translational modification, to identify networks of interacting proteins and to uncover alterations in the proteome that are associated with diseases” (“Mass-spectrometric exploration of proteome structure and function“, by Ruedi Aebersold & Matthias Mann, Nature, 537: 347–355, doi:10.1038/nature19949).

Baker points out that the majority of de novo designed proteins consist of a single, deep minimum-energy state, and that we have a long way to go to mimic the subtleties of naturally-occurring proteins: allostery, signalling, recessed binding pockets for small molecules, functional sites, and hydrophobic binding interfaces all present their own challenges. Only by increasing our understanding and developing better models and computational tools will we be able to accomplish this.

A beginner’s guide to Rosetta

Rosetta is a big software suite, and I mean really big. It includes applications for protein structure prediction, refinement, docking, and design, as well as adaptations of these (and other) applications to particular cases, for example protein-protein docking of membrane proteins to form membrane protein complexes. Some applications are available through hassle-free online servers (e.g. ROSIE, Robetta, rosetta.design), which might work well if you’ve got just a few tests you would like to try using standard parameters and protocols. However, if you’re interested in carrying out a large amount of modelling, or in using an unusual combination of steps or scoring function, it’s likely you will want to download and install a copy. This is not a trivial task, as the source code is a 2.5 GB download, and your machine will then be busy compiling for some time (around 5 hours on two cores on my old laptop). Alternatively, if the protocols and objects you’re interested in are part of PyRosetta, that is available as a pre-compiled package for most common operating systems and is less than 1 GB.

This brings me to the different ways to use Rosetta. Most applications come as an executable which you can find in Rosetta/main/source/bin/ after completing the build. There is documentation available on how to use most of these, and on the different flags which can be used to input PDB structures and parameters. Some applications can be run using RosettaScripts, which uses an XML file to define the protocol, including scoring functions, movers and other options. In this case, Rosetta/main/source/bin/rosetta_scripts.* is run; it reads the XML and executes the required protocol.

(Screenshot: an example RosettaScript, used for the MPrelax protocol.)
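For orientation, a minimal RosettaScripts file has the shape sketched below. This is not the MPrelax script itself, just an illustrative skeleton: talaris2014 and FastRelax are standard Rosetta score-function and mover names, but the exact tag syntax and defaults vary between releases, so check the RosettaScripts documentation for your version:

```xml
<ROSETTASCRIPTS>
    <SCOREFXNS>
        <!-- define a score function to reference below -->
        <ScoreFunction name="sfxn" weights="talaris2014"/>
    </SCOREFXNS>
    <MOVERS>
        <!-- a mover: here, all-atom relaxation with the chosen score function -->
        <FastRelax name="relax" scorefxn="sfxn"/>
    </MOVERS>
    <PROTOCOLS>
        <!-- movers run in the order they are added here -->
        <Add mover="relax"/>
    </PROTOCOLS>
</ROSETTASCRIPTS>
```

You would pass this file to rosetta_scripts.* with the -parser:protocol flag along with your usual input flags.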

PyRosetta is even more flexible, and relatively easy to use for anyone accustomed to programming in Python. There are Python bindings for the fast C++ objects and movers, so the increased usability is generally not greatly compromised by slower speeds. One of the really handy things about PyRosetta is the link to PyMOL, which can be used to view the trajectory of your protein moving while a simulation is running. Just add the following to your .pymolrc file in your home directory to set up the link every time you open PyMOL:

run /PATH/TO/PYROSETTA/PyMOLPyRosettaServer.py
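On the Python side, a minimal sketch of sending a structure to that listening PyMOL session looks like the function below. The module paths are those of PyRosetta-4 and may differ in older releases, and "input.pdb" is a placeholder file name; the import is wrapped so the sketch degrades gracefully where PyRosetta is not installed:

```python
def send_pose_to_pymol(pdb_path):
    """Load a structure and send it to PyMOL; returns False if PyRosetta is absent."""
    try:
        # PyRosetta-4 style imports; older releases used the 'rosetta' module
        from pyrosetta import init, pose_from_pdb
        from pyrosetta.rosetta.protocols.moves import PyMOLMover
    except ImportError:
        return False  # PyRosetta not available: treat this as a dry run
    init()
    pose = pose_from_pdb(pdb_path)
    mover = PyMOLMover()   # talks to the listener loaded via .pymolrc
    mover.apply(pose)      # each .apply() updates the structure shown in PyMOL
    return True
```

Calling mover.apply(pose) inside your simulation loop is what produces the live trajectory view.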

When it comes to finding your way around the Rosetta package, there are a few things it is very useful to know from the start. The demos directory contains plenty of useful example scripts and instructions for running your first jobs. In demos/tutorials you will find introductions to the main concepts. The demos/protocol_capture subdirectory is particularly helpful, as most papers which report a new Rosetta protocol deposit the scripts required to reproduce their results there. These may not always be the best current methods to approach a problem, but if you have found a research article describing results which would be useful to obtain for your system, they are a good starting point for learning how to make an application work. Then the world is your oyster as you explore the many possible options and inputs to change and add!