Cleaning outliers in conductance timeseries from molecular dynamics

Have you ever had an annoying dataset that looks something like this?

or, even worse, several of them at once

In this blog post, I will introduce basic techniques you can use and implement with Python to identify and clean outliers. The objective will be to get something more pleasing to the eye (and, more importantly, less troublesome for further data analysis) like this

I will look specifically at the case of timeseries of instant conductance values of simulated ion channels (*). And at the end, I will share a code snippet you can steal and adapt to deal with similar time series.

(*) I will probably write another blog post in the future to expand on my simulations and on how I computed instant conductance for hundreds of simulation trajectories.

Identifying Outliers

If only it were that simple …

Outliers can be defined as data points that significantly differ from other observations.

In general, how you remove outliers will depend on the specific characteristics of your data and the desired outcome. I will not elaborate on that here, but this can get really complicated: the topic of timeseries analysis spans several disciplines and has a long history. Check this reference if you want to get a sense of how deep the rabbit hole goes.

Here, I will simply focus on a few approaches relevant to the kind of timeseries whose values can reasonably be bounded within an interval.

In this case, to spot outliers one can simply plot the data (visual approach) and then manually set threshold values to filter them out. But because this process gets impractical for hundreds of timeseries, drawing information from the data distributions comes in handy (statistical approach). I will cover both cases.

Visual approach

Just plot it

Timeseries and scatter plots are useful means to visualise outliers.

In the time series below, we can see that the bulk of the data lives between 0.5 and 2.0

To clean our data, we can set these as thresholds, replace outliers with NaN values, and fill them in with interpolated data. Using pandas it would look something like this

import pandas as pd

df = pd.DataFrame({'original': timeseries})
# Mask values outside [0.5, 2.0] with NaN, then fill them in by interpolation
df_with_NaNs = df.mask((df < 0.5) | (df > 2.0))
df_new = df_with_NaNs.interpolate(method='linear', axis=0).ffill().bfill()

Box plots

Using a plain plot of our timeseries does the job. However, we can use another graphic representation, a more statistical one: a box plot.

import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
sns.boxplot(x=df['original'], ax=ax)

Box plots graphically represent the anatomy of your data’s distribution. The central box spans the quartiles Q1, Q2, and Q3, which correspond to the 25th percentile, the median, and the 75th percentile of the distribution. The two whiskers extend to the most extreme data points lying within 1.5 times the interquartile range of the box; anything beyond the whiskers is plotted individually and considered an outlier. For a pretty picture of the parts of a box plot, check this.

Again, from looking at the box plot, we can see that 0.5 and 2.0 as threshold values to filter out outliers are indeed quite sensible choices.
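To make that connection concrete, here is a sketch that computes the quartiles and whisker fences directly. The data is synthetic (a normal distribution with made-up parameters standing in for the conductance trace), not my actual simulation output:

```python
import numpy as np

# Synthetic stand-in for a conductance timeseries
rng = np.random.default_rng(7)
timeseries = rng.normal(loc=1.2, scale=0.25, size=2000)

Q1, Q2, Q3 = np.percentile(timeseries, [25, 50, 75])
IQR = Q3 - Q1
lower_fence = Q1 - 1.5 * IQR
upper_fence = Q3 + 1.5 * IQR

print(f"Q1={Q1:.2f}, median={Q2:.2f}, Q3={Q3:.2f}")
print(f"whisker fences: [{lower_fence:.2f}, {upper_fence:.2f}]")
```

With distribution parameters in this ballpark, the fences land close to the 0.5 and 2.0 thresholds we picked by eye.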

Statistical Approach

Visualisation is great for quickly spotting outliers. However, if you have to deal with hundreds or even thousands of timeseries, visual clutter makes it impractical to pick suitable thresholds by eye.

Statistical approaches provide a methodical way to overcome this limitation. Two very well-known approaches I will look at are:

  • The Z-score
  • The Inter-Quartile Range

The Z-score

This method is based on a simple intuition: just use the arithmetic mean and standard deviation of each timeseries to define an interval

[np.mean(timeseries) - np.std(timeseries), np.mean(timeseries) + np.std(timeseries)]

outside which outliers must live. After all, the bulk of the data should fall within this interval. Set this into a script, and “Boom!” Outliers gone!

The Z-score method does exactly this kind of threshold filtering, with an interval of length 2*sigma (for a threshold of 1) centred at the arithmetic mean.

But instead of setting thresholds on the raw timeseries values, the thresholds are set on their normalised counterparts

z_scores = (timeseries - np.mean(timeseries))/np.std(timeseries)

So, any threshold on the z_scores will represent a multiple of the standard deviation.
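A quick sketch (again on synthetic data standing in for the timeseries) checks that thresholding the z-scores at k flags exactly the same points as thresholding the raw values at mean ± k*sigma:

```python
import numpy as np

rng = np.random.default_rng(0)
timeseries = rng.normal(loc=1.2, scale=0.3, size=1000)

z_scores = (timeseries - np.mean(timeseries)) / np.std(timeseries)
k = 1  # threshold, in units of the standard deviation

# Thresholding the z-scores...
mask_z = np.abs(z_scores) > k

# ...flags the same points as thresholding the raw values
lo = np.mean(timeseries) - k * np.std(timeseries)
hi = np.mean(timeseries) + k * np.std(timeseries)
mask_raw = (timeseries < lo) | (timeseries > hi)

print(np.array_equal(mask_z, mask_raw))
```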

Use this snippet to capture the outlier values to be cleaned following this method:

import numpy as np
from scipy import stats

z_scores = stats.zscore(timeseries)
threshold = 1
outliers = timeseries[np.abs(z_scores) > threshold].values

Inter-Quartile Range

This is one of the most widely used and trusted methods for dealing with outliers. And just in case you didn’t notice, it is exactly what gets visualised in box plots.

To define the endpoints of the interval for outlier filtering, instead of using the mean and the standard deviation, this approach uses the interval

[Q1-1.5*IQR, Q3+1.5*IQR]

where Q1 and Q3 are the first and the third quartiles and IQR is the interquartile range, given by their difference.

Unlike the Z-score method, the filtering interval is defined around the median, not the mean, hence taking into account any asymmetry in the data distribution. Another reason why IQR is more robust is that the mean and the standard deviation can get corrupted by a handful of outliers with unusually high values (something I saw when trying Z-scores on my own timeseries).
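A quick numerical sketch of that robustness (synthetic data again; the spike value of 50.0 is made up for illustration): contaminate a well-behaved series with a few huge spikes and compare how much the mean and standard deviation move versus the quartiles.

```python
import numpy as np

rng = np.random.default_rng(1)
clean = rng.normal(loc=1.2, scale=0.3, size=1000)
# Contaminate with 2% unusually large spikes
spiked = np.concatenate([clean, np.full(20, 50.0)])

# The mean and standard deviation are dragged up by the spikes...
mean_shift = abs(np.mean(spiked) - np.mean(clean))
std_shift = abs(np.std(spiked) - np.std(clean))

# ...while the quartiles barely move
q1_shift = abs(np.percentile(spiked, 25) - np.percentile(clean, 25))
q3_shift = abs(np.percentile(spiked, 75) - np.percentile(clean, 75))

print(f"mean shift: {mean_shift:.2f}, std shift: {std_shift:.2f}")
print(f"Q1 shift: {q1_shift:.3f}, Q3 shift: {q3_shift:.3f}")
```

The mean and the standard deviation jump, so any Z-score interval built from them gets dragged towards the spikes; the IQR interval stays put.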

Use this snippet to capture the outlier values to be cleaned following the IQR method:

import numpy as np

# Define first and third quartiles and the IQR
Q1 = np.percentile(timeseries, 25, method='midpoint')
Q3 = np.percentile(timeseries, 75, method='midpoint')
IQR = Q3 - Q1

# Boolean masks for upper and lower outliers
upper_outliers_indices = timeseries >= (Q3 + 1.5 * IQR)
lower_outliers_indices = timeseries <= (Q1 - 1.5 * IQR)

# Extract outlier values
upper_outliers = timeseries[upper_outliers_indices].values
lower_outliers = timeseries[lower_outliers_indices].values
outliers = np.concatenate([lower_outliers, upper_outliers])

The result after removing and filling in outliers

Back to our original timeseries

OK, we have talked through how different visual and statistical methods deal with outliers along with their limitations.

Now, it’s time to go back to our original problem of filtering outliers out from our conductance timeseries.

Trade-offs

You might think that the IQR method is the obvious way to go. However, if you have a careful look at the timeseries above after applying IQR, you will notice that not only the zero and large-value outliers are removed, but also some non-zero data points that tell us something about the evolution of the system, i.e., low-conductance states transiently visited by the ion channel during its simulation.

When dealing with outliers, you must consider whether any data points removed carry some valuable information in the context of your data.

My Python function

For my original data, the particular strategy I used required removing all the upper outliers along with all zeros, while keeping the lower “outliers” as flagged by the IQR method, and, once again, filling in the removed points with interpolated values.

Steal this:

import numpy as np

def clean_outliers(df_timeseries):
    # Determine the interquartile range (IQR)
    Q1 = np.percentile(df_timeseries, 25, method='midpoint')
    Q3 = np.percentile(df_timeseries, 75, method='midpoint')
    IQR = Q3 - Q1
    
    # Flag upper outliers and zeros for removal; lower "outliers" are
    # deliberately kept, as they correspond to transient low-conductance states
    to_remove = (df_timeseries >= (Q3 + 1.5 * IQR)) | (df_timeseries == 0)
    
    # Replace flagged points with NaNs and fill in via interpolation
    df_with_NaNs = df_timeseries.mask(to_remove)
    df_interpolated = df_with_NaNs.interpolate(method='linear', axis=0).ffill().bfill()
    
    return df_interpolated
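To batch-process many trajectories, the function can be applied column by column over a DataFrame. Here is a self-contained sketch (the function is repeated so the snippet runs on its own, and the three noisy traces, with made-up dropouts and spikes injected, are synthetic stand-ins for real conductance data):

```python
import numpy as np
import pandas as pd

def clean_outliers(df_timeseries):
    # Interquartile range of this trace
    Q1 = np.percentile(df_timeseries, 25, method='midpoint')
    Q3 = np.percentile(df_timeseries, 75, method='midpoint')
    IQR = Q3 - Q1
    # Flag upper outliers and zeros; lower "outliers" are kept on purpose
    to_remove = (df_timeseries >= (Q3 + 1.5 * IQR)) | (df_timeseries == 0)
    df_with_NaNs = df_timeseries.mask(to_remove)
    return df_with_NaNs.interpolate(method='linear', axis=0).ffill().bfill()

# Three synthetic "conductance" traces with injected dropouts and spikes
rng = np.random.default_rng(42)
traces = pd.DataFrame(rng.normal(loc=1.2, scale=0.2, size=(500, 3)),
                      columns=['run_1', 'run_2', 'run_3'])
traces.iloc[::50] = 0.0     # dropouts to zero
traces.iloc[25::100] = 8.0  # unphysically large spikes

# Clean each trajectory independently
cleaned = traces.apply(clean_outliers, axis=0)
```

After cleaning, the zeros and the large spikes are gone, replaced by linearly interpolated values, while the lower tail of each trace is left untouched.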

And the outcome:

Before
After 

The bottom line

Dealing with outliers doesn’t have to be a pain if you use the right approach. Sometimes just plotting your data and discarding points outside an interval is enough. At other times you may need the statistical information in your dataset to identify and remove outliers more methodically. Regardless of the method, always consider the trade-off between what information you want to keep and how much you can afford to throw away. Think about the context of your problem to judge whether you are removing informative data.

Author