Estimating uncertainty in MD observables using block averaging

When running molecular dynamics (MD) simulations, we are usually interested in measuring an ensemble average of some metric (e.g., RMSD, RMSF, radius of gyration, …) and use this to draw conclusions about the investigated system. While calculating the average value of a metric is straightforward (we can simply measure the metric in each frame and average it) calculating a statistical uncertainty is a little more tricky and often forgotten. The main challange when trying to calculate an uncertainty of MD oveservables is that individual frames of the simulation are not samped independently but they are time correlated (i.e., frame N depends on frame N-1). In this blog post, I will breifly introduce block averaging, a statistical technique to estimate uncertainty in correlated data.

What is block averaging?

To estimate the statistical uncertainty in the estimation of the mean in a dataset of independent samples we can use the standard error of the mean (SEM):

I will provide some intuition behind this formula. If the samples are independent, every new measurement gives you a “full amount” of new information and improves the estimate of the mean. The denominator sqrt(N) in the formula for the SEM reflects this reduction in uncertainty dependent on the size of the dataset. However, if datapoints are correlated each new measurment will provide less information than in the case of an independent sample as some of it is “repeated” from past measurements. The estimate of the mean will, therefore, not improve as quickly with increased size of the dataset. Dividing by sqrt(N) in the formula for the SEM will underestimates the true uncertainity of the mean.

Block averaging provides a way to address this issue and get an estimte for the true uncertainity. The core idea of block averaging is to divide a trajectory into smaller blocks. We have a trajectory with N total frames and segment this into M blocks with n frames each, so that N = M * n. We then treat each block, rather than each frame, as a sample by taking the mean of our observable of interest over the block. If block size is choosen correctly, the blocks will be statistically independent and we can use the SEM over block averages to get the true uncertainty of the data.

Now we remain with the problem of how to choose the right block size. The assumption that each block provides an independent measurement is only valid if the block size is greater than the correlation time of the data. If we choose a block size smaller than that we will still underestimate uncertainty. To choose an appropriate block size, we can plot the BSE against block size (see Figure 1 (top)). As block size increases, the BSE will increase and eventually plateau indicating that blocks beyond this length yield a stable error estimate. The BSE at the point the curve plateaus provides the true uncertaininty (error bars) of the observable of interest (see Figure 1 bottom). One thing to point out for short simulations, if blocks are too large you may end up with too few samples (blocks) to get good statistics. You should ensure to have at least 5-10 block for a good estimate.

Figure 1: BSE and error bars as a function of block size.

Implementation in Python

Below I provide Python code to calculate the block standard error for the RMSF in an MD simulation.

import MDAnalysis as mda
from MDAnalysis.analysis import align, rms
import numpy as np
import matplotlib.pyplot as plt

# load the trajectory and topology files
topology = 'simulation_top.pdb'
trajectory = 'simulation_trajectory.xtc'

# load the full trajectory and a reference for alignment
traj = mda.Universe(topology, trajectory)
ref = mda.Universe(topology, trajectory)

# align the trajectory to the reference
# here the reference is the first frame of the trajectory
alignment = align.AlignTraj(traj, ref, select=f'name CA', in_memory=True)
alignment.run()

### calculate RMSF across entire trajectory to get mean value
selection = traj.select_atoms(f'name CA')
rmsf = rms.RMSF(selection).run()
rmsf = rmsf.results.rmsf

### block avaraging for error bars
# set parameters for block analysis
block_size = 50
n_frames = 1000
n_residues = traj.residues.resnames.shape[0]
n_blocks = n_frames // block_size
block_rmsf_values = np.zeros((n_blocks, n_residues))

# calculate RMSF for each block
for block in range(n_blocks):
    start = block_size * block + 10
    stop = block_size * (block + 1) + 10

    selection = traj.select_atoms(f'name CA')
    block_rmsf = rms.RMSF(selection).run(start=start, stop=stop)
    block_rmsf = block_rmsf.results.rmsf

    block_rmsf_values[block, :] = block_rmsf

# calculate BSE
block_standard_deviation = block_rmsf_values.std(axis=0)
block_standard_error = block_standard_deviation / np.sqrt(n_blocks)

# plot the results
plt.figure()
plt.errorbar(
    np.arange(n_residues), 
    rmsf, 
    yerr=block_standard_error, 
    fmt='o', 
    capsize=5, 
    label='RMSF with error bars'
)

Author

Fabian Spoendlin

View all posts

Oxford Protein Informatics Group

or "OPIG" to friends

Estimating uncertainty in MD observables using block averaging

What is block averaging?

Implementation in Python

Author