A few more reasons why UNIX is awesome

One could easily find dozens of reasons why UNIX (in my case, mainly Ubuntu) is simply the best operating system. Although people around me have been saying this for ages, it is only in the last few months that I have realised what the true advantages are. The people teaching and demonstrating in the various modules of my first year in SABS/DTC helped a lot with this: quite often we would be asked to do something in the console rather than by clicking the mouse. In the meantime, I kept wondering why using the console should be any better than a nice, user-friendly GUI (i.e. Windows…). Tools like sed, grep, tar and of course alias-ing form a quick answer. I will not argue more about these, but instead demonstrate two more tools/tricks.

AWK: an ultra-fast and simple tool for manipulating data

AWK is a utility for processing text data, either from files or from streams, with a minimum amount of instructions. In brief, you can quickly parse a document, search for specific values, re-arrange or replace elements, or calculate statistics, no matter how big your data are and, most importantly, without having to write long scripts where you would normally load libraries and so on. You can also include it in pipes with other UNIX tools, or write a script and execute it as with common languages.
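
For instance, here is a tiny pipe (purely an illustration, not part of the examples below) that sums the file sizes reported by ls -l, assuming the size sits in the fifth column as in the usual ls -l layout:

ls -l | awk '{sum += $5} END{print sum " bytes listed"}'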

I’m not aiming to give a tutorial on awk (there are plenty of websites doing that already), but just a few tips to advertise its usefulness. A statement usually has the form awk (how to load) '{what to do}' (how to output). Some important keywords are -F for the input field separator, OFS for the output field separator, -v for passing a variable in, and of course for/if structures. Columns/fields are referred to as $1, $2 and so on, with $0 being the complete row, and awk parses each file line by line. Let’s assume we need to work with a document like the following, saved as toy.csv, which includes two kinds of values in the first two fields followed by some IDs, with an arbitrarily large number of rows.

Value1,Value2,Comp-ID,Targ-ID
382.91,163804,CHEMBL317956,CHEMBL236
167.84,166666,CHEMBL99895,CHEMBL1804
178.39,167742,CHEMBL104951,CHEMBL204
........
awk -F, 'NR>1{print $3,$4,$1} NR==50{exit}' OFS='\t' toy.csv > new.tab

Running the above command makes awk treat commas as the field separator, skip the header, print only three of the four columns in a new order, stop at the fiftieth line, use tabs as the new field separator, and save the result in new.tab. Now, if we run the following,

awk 'END{print "Fields = " NF " and rows = " NR}' new.tab

we’ll get Fields = 3 and rows = 49, since that’s how we created the new file. What if we omit the END keyword? Then awk would print, for every line, its number of fields followed by the line index. Notice how simple it is to print things.
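
For instance, this per-line variant (just a sketch, still on the same new.tab) prints one such message for every row instead of only once at the end:

awk '{print "Fields = " NF " on row " NR}' new.tab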

Now let’s calculate an average:

cat new.tab | awk '{sum+=$3} END{print "mean = " sum/NR}'

Starting at 0 (awk’s default), sum is incremented by the value of the third field on every line; note that I simply fed awk through a pipe this time. When we reach the final line, so that NR equals the number of elements, we ask awk to print the ratio. Remember that awk skips any checking, so it is our responsibility to ensure there are no strings or NaNs (usually treated as zeros) among the values being added. As one last example, what if we need to get the compounds whose values are larger than some threshold thr? The following will do the job (note how the shell variable is passed in with -v).

awk -v thr="$thr" '($3 > thr) {print $1}' new.tab
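
On that note, a slightly more defensive version of the earlier mean calculation might look like the following (just a sketch, still assuming new.tab from above): only fields that look like plain decimal numbers are accumulated, so a stray header or empty value cannot silently distort the result.

awk '$3 ~ /^[0-9.]+$/ {sum += $3; n++} END{if (n > 0) print "mean = " sum/n}' new.tab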

LaTeX and massive plot production

As a second demonstration of the awesomeness of UNIX, let’s turn our attention to LaTeX and mass-producing figures. Assume we need to run myscript.py for six parametrisations and six set-ups, ending up with six^2 figures (replace six with something more realistic…). How could we easily put all of them on the same page? First we’re going to need a LaTeX template like the following, saved as latex-template.tex:

\documentclass{article}
\usepackage[top=2cm,bottom=2cm]{geometry}
\usepackage{graphicx}
\usepackage{subfigure}
\usepackage{caption}
\begin{document}
    \begin{figure}
    \centering
        \subfigure[Figure1]{\includegraphics[width=7cm]{REPLACE-Figure1.png}}
        \subfigure[Figure2]{\includegraphics[width=7cm]{REPLACE-Figure2.png}}
        \vskip3ex
        \subfigure[Figure3]{\includegraphics[width=7cm]{REPLACE-Figure3.png}}
        \subfigure[Figure4]{\includegraphics[width=7cm]{REPLACE-Figure4.png}}
        \vskip3ex
        \subfigure[Figure5]{\includegraphics[width=7cm]{REPLACE-Figure5.png}}
        \subfigure[Figure6]{\includegraphics[width=7cm]{REPLACE-Figure6.png}}
    \caption{Performance of REPLACE on all six Figures.}
    \end{figure}
\end{document}

Then a short bash loop runs the script for each parametrisation, fills in the template with sed, and compiles the result:

for x in {..params..}
do
    python myscript.py $x
    # assume this produces some plots for parameter $x,
    # named $x-Figure1.png, ..., $x-Figure6.png
    sed -e "s/REPLACE/$x/g" latex-template.tex > latex-$x.tex
    pdflatex latex-$x.tex
    rm -f *.log *.aux
done

Provided that the names and indexing are correct, the above snippet will produce one pdf file per parametrisation, each containing the corresponding six figures. Something to keep in mind is that LaTeX struggles with filenames that contain many dots, which is why I’m using dashes. Admittedly, it takes a little time to set the template up properly, but I believe such a pipeline can save a lot of effort when someone needs to repeat an experiment for different algorithms, data sets or metrics, and therefore has to deal with dozens of “similar” plots.
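As a usage sketch with made-up parameter names (the real ones depend on your experiment), running the loop header as

for x in setupA setupB setupC

would leave you with latex-setupA.pdf, latex-setupB.pdf and latex-setupC.pdf, each holding its own six-panel page. If a broken figure makes pdflatex stop and wait for input, adding -interaction=nonstopmode to the pdflatex call keeps the loop running.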

Thank you for taking the time to read this. I hope the examples were clear and useful!
