The stuff MDAnalysis didn’t implement: CPU Parallel HOLE conductance analysis

Some time ago, I needed to find a way to computationally estimate conductance values for every protein frame from several molecular dynamics (MD) trajectories.

In a previous post, I wrote about how to clean the resulting instant conductance timeseries from outliers. But, I never described how I generated these timeseries.

In this post, I will show how you can parallelise the computation of instant conductance given an MD trajectory. I will touch on the difficulties of this process. And why I had to implement a custom tool for it given that MDAnalysis seems to already have implemented a routine of this sort. Finally, I will provide two Python scripts that you can easily adapt to run your parallel calculations – for which I’ll provide some important notes you don’t wanna skip.

Violin plots of conductance distributions from 64 molecular dynamic trajectories with 1000 frames each.

The HOLE in the MDAnalysis implementation

If you were ever faced with the question of how to estimate the geometry and the conductance of a protein pore, such as in an ion channel or a transporter, you probably encountered Oliver Smarts‘ widely-used HOLE programme.

Despite being written nearly 30 years ago, this programme enjoys great popularity to the point that no one else has dared to rewrite it, instead, everyone just wraps code around it, which is MDAnalysis‘ case. I guess, no one – unsurprisingly- wants to deal with the Fortran code (Ahem! Pythoners). However, for all that Monte Carlo annealing that the code performs, it is pretty fast!

The MDAnalysis HOLE analysis routine is quite decent:

  • It allows you to run HOLE with a trajectory file as input, exploiting MDAnalysis multi-format compatibility.
  • It avoids input files to run HOLE, so given a trajectory, you won’t end up with thousands of text files piling up in your disk space.
  • Class methods enable arguments controlling settings for trajectory sub-sampling, Monte Carlo annealing, and output files to visualise pore surfaces with VMD.
  • It can directly output pore radial profile plots and store estimated conductance values in a Python variable without scraping any output file.

But, it is slow.

More importantly for me at the time, the estimated conductance value was not the one that I wanted, and there were no settings to ask for it.

Without delving too much into the technical details, HOLE works out different conductance estimates. Some estimates are corrected with fudge factors based on a pore’s geometrical properties (e.g., pore bottleneck’s radius) or physical properties (e.g., average electric potential) and others without any correction, solely based on a pore’s geometry using standard formulas, i.e., Pouillet’s law.

Speeding things up and filling in the HOLE

After hours of Googling and reading the MDAnalysis.analysis.hole2 source code, I found no solution for my two problems: lack of speed and options to harvest all HOLE’s conductance estimates.

My workaround to these problems essentially consisted of two parts:

  • First, extract in parallel all protein coordinate frames from an input trajectory and save them as PDB files.
  • Second, run HOLE in parallel for every PDB file and save everything into a dictionary.

Here you have my scripts to get an idea of my implementation:

# extract_frames_parallel.py 
import os
import sys
import numpy as np
import mdtraj as md
from time import process_time
from joblib import Parallel, delayed

dirname = sys.argv[1]
n_cores = int(sys.argv[2])

tic = process_time()
traj_file = os.path.join(dirname, 'md_100ns.xtc')
top_file = os.path.join(dirname, 'md_100ns.gro')
trajectory = md.load(traj_file, top=top_file)
protein_indices = trajectory.topology.select("protein")

framesdir = os.path.join(dirname, 'frames')
if os.path.isdir(framesdir):
    pass
    print("INFO: Dir 'frames' exists.")
else:
    os.mkdir(framesdir)
    print("INFO: Created 'frames' dir.")

def extract_protein_frame(i, trajectory, protein_indices, outdir):
    frame = trajectory[i]
    protein_frame = frame.restrict_atoms(atom_indices=protein_indices)
    pdb_output = os.path.join(outdir, 'frame'+f'{i:04d}'+'.pdb')
    protein_frame.save_pdb(filename=pdb_output)
    
print(f"INFO: You have {trajectory.n_frames} frames in your trajectory: {dirname}")
f = lambda i:extract_protein_frame(i, trajectory, protein_indices, framesdir)
p = Parallel(n_jobs=n_cores)(delayed(f)(i) for i in range(trajectory.n_frames))

toc = process_time()
elapsed_time = np.round((toc-tic)/60, decimals=3)
print(f"INFO: Total elapsed time [m]: {elapsed_time}")
# run_hole_parallel.py
import os
import re
import sys
import pickle
import numpy as np
import subprocess
from joblib import Parallel, delayed

def encode_hole_inputs(pdbfile):
    """
    Enconde a sample HOLE input file content and return its string chain
    """
    file_content = ("!HOLE input file",
                    f"coord {pdbfile}",
                    "radius ./hole2/rad/simple.rad",
                    "ignore hoh",
                    "capsule",
                    "endrad 15.")
    
    inputs_encoded = '\n'.join(file_content).encode('utf-8')
    
    return inputs_encoded

def run_hole(inputs_encoded):
    """
    Run hole executable via subprocess with an encoded HOLE input file
    Output file content is returned as an enconded string chain
    """
    command = ['./hole2/exe/hole']
    
    status = 1
    n_trial = 0
    n_max = 4
    while (status != 0 and n_trial <= n_max):
        child = subprocess.Popen(command, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
        outputs_encoded = child.communicate(input=inputs_encoded)[0]
        status = child.returncode
        
        if status != 0:
            n_trial += 1
            mssg = "ERROR. HOLE failed. I will try maximum ", n_max,"times"
            print(mssg, file=sys.stderr)
            
    if status == 0:
        return outputs_encoded
    
    else:
        return 'Failed'.encode('utf-8')
    
    
def process_hole_output(hole_output, key_name = 'pdbname'):
    """
    Compute HOLE conductance estimates and pore lumen dimensions/features.
    If HOLE fails to produce an output, then a null dictionay is returned.
    """
    try:
        for line in hole_output:
            if re.search(r'"Atomic" length of channel',line):
                pore_length = float(line.split()[4]) # [Angstroms]
                
            elif re.search(r'TAG',line):
                # All conductance estimates are in [nano-Siemens]
                x = line.split()
                VDW_Rmin = float(x[3]) # [Angstroms]
                Gmacro = 0.001*float(x[5])
                Gpred_Rmin = 0.001*float(x[8])
                Gpred_Lenght = 0.001*float(x[9])
                Gpred_AvgEPot = 0.001*float(x[10])

                HOLE_dimensions = (VDW_Rmin,pore_length)
                HOLE_conductance_estimates = (Gmacro,Gpred_Rmin,Gpred_Lenght,Gpred_AvgEPot)

        return {key_name:(HOLE_dimensions,HOLE_conductance_estimates)}
    
    except Exception:
        print("ERROR. HOLE failed for:", key_name)
        
        HOLE_dimensions = (0,0)
        HOLE_conductance_estimates = (0,0,0,0)
        
        return {key_name:(HOLE_dimensions,HOLE_conductance_estimates)}
    
def extract_hole_results(pdb):
    inputs_encoded = encode_hole_inputs(pdb)
    outputs_encoded = run_hole(inputs_encoded)
    outputs_decoded = outputs_encoded.decode("utf-8").split('\n')
    
    key_name = os.path.basename(pdb).split('.')[0]
    outputs_dict = process_hole_output(outputs_decoded, key_name=key_name)
    
    return outputs_dict

if __name__ == "__main__":
    from time import process_time
    
    pdbfiles = [x.strip() for x in sys.stdin.readlines()]
    print("INFO. You inputted ", len(pdbfiles), " PDB files to process")
    
    n_cores = int(sys.argv[1])
    output_dir = sys.argv[2]
    
    #tic = process_time()
    results = Parallel(n_jobs=n_cores)(delayed(extract_hole_results)(pdb) for pdb in pdbfiles)
    results_dict = {k: v for x in results for k, v in x.items()}
    
    results_dict.update({'metadata' : {
            'pdb_filename' : ('HOLE_dimensions', 'HOLE_conductance_estimates'),
            'HOLE_dimensions' : ('VDW_Rmin', 'pore_length'),
            'HOLE_conductance_estimates' : ('Gmacro','Gpred_Rmin','Gpred_Lenght','Gpred_AvgEPot')}
                        })
    
    pickle_file = os.path.join(output_dir, "md_100ns.hole.pickle")
    with open(pickle_file, 'wb') as f:
        pickle.dump(results_dict, f)
    
    toc = process_time()
    elapsed_time = np.round((toc-tic)/60, decimals=3)
    print(f"INFO. Total elapsed time [m]: {elapsed_time}")

Important Notes (don’t skip if you want to use my scripts)

The above scripts suited my particular setup. But, this may not work for you. Fortunately, these scripts can be easily adapted provided you consider the following:

Notes on extract_frames_parallel.py

  • Important depedencies: mdtraj and joblib
  • Filenames for the trajectory and topology files are hard-coded as md_100ns.xtc and md_100sn.gro – you probably want to change these to your filenames.
  • All frames available in your trajectory will extracted and saved PDB filenames will be formatted as framesXXXX.pdb – this made sense at the time because I knew I had up to a thousand frames, but for longer trajectories, you may want to change this.


Notes on run_hole_parallel.py

  • Important dependencies: joblib and the HOLE executable file hole2/exe/hole – you can download it from here.
  • The script reads the input PDB files listed from the standard input – the definition of pdbfiles could be changed to read all PDB frames from an input directory path, passed as an input.
  • This implementation uses the HOLE’s capsule argument to output all conductance estimates – because I was only interested in conductance values, the script only focused on saving these and nothing else from the HOLE output, however, this can be modified if you’re interested in radial profiles too.
  • The output dictionary values will be saved as a pickle file named md_100ns.hole.pickle – again, this is a hard-coded name that you are free to change.

Working with several trajectories

The above scripts provide the means to deal with a single MD trajectory, but what if you have hundreds?

Well, you can loop over these trajectories and execute the above scripts in parallel. Feel free to adapt my bash scripts below to run frame extraction and the HOLE analysis.

# extract_frames.sh 
n_cores=6
for dir in `ls -d data/cWza*_conformation[0,1]_[0-9][0-9][0-9][0-9]`; do
	FILE=${dir}/frames
	if [ -d "$FILE" ]; then
	   echo "$FILE exists."
	else
	   echo "INFO: Extracting file in parallel."
	   python extract_frames_parallel.py $dir $n_cores
        fi
	zip -r "${FILE}".zip $FILE
	rm -rf "${FILE}"/frame*.pdb
done
# compute_conductance.sh

n_cpus=7
for dirname in `ls -d data/cWza*_conformation*_[0-9][0-9][0-9][0-9]`; do 
    echo $dirname;
    rm -rf $dirname/frames;
    unzip -j $dirname/frames.zip -d $dirname/frames;
    time ls $dirname/frames/*pdb | python run_hole_parallel.py $n_cpus $dirname;
    rm -rf $dirname/frames;
done

Extracting the protein frames beforehand has the advantage of avoiding loading the whole trajectory into memory, which makes the MDAnalysis HOLE analysis slow and difficult to parallelise. However, this results in increased storage use for the extracted PDBs.

A workaround to lower storage is to compress the output folder with the PDB frames, which can then be decompressed when you need to run the parallel HOLE analysis. This has been implemented in the bash scripts above.

Thankfully, the frame extraction can be run just once! Then, extracted PDB frames can be used multiple times to repeat the HOLE analysis as many times as you want. This is useful if you wish to work out error bars for conductance values.

Performance gains and limitations

Numbers speak louder than words

Here I report some figures from a basic test I did for my two scripts.

The tests were by no means comprehensive, but they give us a fair idea of the gains we get due to parallelisation when compared to a serial execution for the frame extraction and conductance calculation tasks.

Performance of extract_frames_parallel.py

Inputs:

  • Input trajectory: 329 MB (1001 frames)
  • Subsampling: None (extracted all frames!)
  • No CPUs: 8

Testing:

  • No parallel runs: 1
  • No serial runs: 1

Outputs:

  • Frames folder: 1001 PDB files = 329 MB

Performance:

  • Parallel execution time: 14.4 minutes
  • Serial execution time: 23.5 minutes
  • Speed up (serial/parallel time): ~ 1.63
  • Efficiency (speed up / No CPUs): 20%
  • Parallel memory usage: around 30% out of 16.7 GB
  • Serial memory usage: around 10% out of 16.7 GB

Performance of run_hole_parallel.py

Inputs:

  • Input trajectory: 329 MB (1001 frames)
  • Sub-sampling: 100 random PDB frames (took a handful only, this is a long calculation)
  • No CPUs: 6

Testing:

  • No parallel runs: 4 (100 random PDBs)
  • No serial runs: 4 (100 random PDBs)
  • Plus, one single full run (all PDB frames) – just get an idea about the scaling with the number of frames.

Outputs:

  • Frames folder: 1001 PDB files = 329 MB

Performance:

  • Parallel execution time (short runs): 41.6 seconds on average
  • Parallel execution time (long run): 59.5 minutes
  • Serial execution time (short runs): 134.34 seconds on average
  • Speed up (short runs): ~ 3.23
  • Efficiency: 54%
  • Memory usage increase: 1.5 – 1.8% out of 16.7 GB

From the above figures, we see that performance gains are modest for every task when compared to serial execution. We have a 20% speed-up per CPU for trajectory frame extraction. And slightly above 50% speed-up per CPU for HOLE calculations.

But, how does it compare with MDAnalysis?

We ran a quick test for this and found that for the first 100 frames of the same test trajectory, MDAnalysis achieves a serial execution time of 134.85 seconds on average over 4 independent runs. This is roughly in the same order as the one we achieved through our HOLE implementation. So virtually no performance difference with our serial implementation.

This is the code I used

# simple_hole_run.py
from timeit import default_timer as timer
start = timer()

import MDAnalysis as mda
from MDAnalysis.analysis import hole2

# Load MD trajectory
workdir = "data/"
topology_file = workdir+'md_100ns.tpr'
trajectory_file = workdir+'md_100ns.xtc'
u = mda.Universe(topology_file, trajectory_file, in_memory=True)

hole_path = 'hole2/exe/hole'
outdir = 'data/hole_analysis'
ha = hole2.HoleAnalysis(u, executable=hole_path, prefix = outdir)
ha.run(start=0,stop=99,step=1) # Take the first 100 frames

end = timer()
print((end - start)/60)

The “Ouch!” part

Finally, no analysis would be complete if we didn’t talk about the limitations. Some of the limitations of the implementation I introduced here include:

  • Maybe not suitable for seriously long MD trajectories ( micro-seconds to seconds regime) – you might not even be able to fit that into memory to extract frames, so you would potentially need a sub-sampling technique.
  • I never tested my scripts outside any computer with Linux – Sorry Windows and Mac users. Not even tested HOLE itself in those environments.
  • Unlike MDAnalysis, I did not implement arguments to custom the input HOLE parameters nor any of the vast set of arguments to control HOLE’s Monte Carlo annealing or graphical output.
  • Annoyingly, for the parallel HOLE calculation is difficult to get a sense of the progress of the execution. Yep, you have to wait until the end (Sorry) – maybe something a brave soul would like to implement.
  • Finally, the HOLE output is quite rich in terms of information. So, I had to decide not to save any text files as there was a lot of data I just didn’t need. In short, a lot of the output gets thrown away.

Final thoughts

Here I talked through a parallel implementation of the HOLE programme, useful when dealing with several molecular dynamic trajectories. This implementation is far from perfect (which I wrote myself before ChatGPT was a thing!), but at least it can provide us with moderate performance gains which are valuable for time-consuming calculations such as protein pore geometry and conductance estimation by HOLE. Good luck adapting/debugging my scripts if you ever find yourself in dire need of a parallel solution to similar problems I faced.

Author