{"id":8964,"date":"2022-12-07T16:34:21","date_gmt":"2022-12-07T16:34:21","guid":{"rendered":"https:\/\/www.blopig.com\/blog\/?p=8964"},"modified":"2022-12-13T16:22:32","modified_gmt":"2022-12-13T16:22:32","slug":"cleaning-outliers-in-conductance-timeseries-from-molecular-dynamics","status":"publish","type":"post","link":"https:\/\/www.blopig.com\/blog\/2022\/12\/cleaning-outliers-in-conductance-timeseries-from-molecular-dynamics\/","title":{"rendered":"Cleaning outliers in conductance timeseries from molecular dynamics"},"content":{"rendered":"\n<p>Have you ever had an annoying dataset that looks something like this?<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/2000\/0%2AFCzj6dlYD1bnMjpz.png?w=625&#038;ssl=1\" alt=\"\"\/><\/figure>\n\n\n\n<p>or, even worse, several of them at once?<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/2000\/0%2Ahkb3e57WVlkfsw-n.png?w=625&#038;ssl=1\" alt=\"\"\/><\/figure>\n\n\n\n<p>In this blog post, I will introduce basic techniques you can use and implement with <em>Python<\/em> to identify and clean outliers. The objective will be to get something more eye-pleasing (and, more importantly, less troublesome for further data analysis) like this:<\/p>\n\n\n\n<!--more-->\n\n\n\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/2000\/0%2AHD2OVIFBIoLmS1K3.png?w=625&#038;ssl=1\" alt=\"\"\/><\/figure>\n\n\n\n<p>I will look specifically at the case of timeseries of instant conductance values of simulated ion channels (*). 
And at the end, I will share a code snippet you can steal and adapt to deal with similar time series.<\/p>\n\n\n\n<p>(*) I will probably write another blog post in the future to expand on my simulations and on how I computed instant conductance for hundreds of simulation trajectories.<\/p>\n\n\n\n<p><strong>Identifying Outliers<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-image is-resized\"><img data-recalc-dims=\"1\" decoding=\"async\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/2000\/0%2AEhy5e5CyvyEY5WrJ.png?resize=443%2C565&#038;ssl=1\" alt=\"\" width=\"443\" height=\"565\"\/><figcaption class=\"wp-element-caption\">If only it were that simple&nbsp;\u2026<\/figcaption><\/figure>\n\n\n\n<p>Outliers can be defined as data points that significantly differ from other observations.<\/p>\n\n\n\n<p>In general, how you remove outliers will depend on the specific characteristics of your data and the desired outcome. I will not elaborate on that here, but this can get really complicated, as the topic of timeseries analysis spans several disciplines, and it is <a href=\"https:\/\/www.jstor.org\/stable\/j.ctv14jx6sm\" rel=\"noreferrer noopener\" target=\"_blank\">an old one<\/a>. Check <a href=\"https:\/\/academic.oup.com\/biomet\/article\/87\/4\/789\/232743\" rel=\"noreferrer noopener\" target=\"_blank\">this<\/a> reference if you want to get a sense of how deep the rabbit hole goes.&nbsp;<\/p>\n\n\n\n<p>Here, I will focus on a few approaches relevant to timeseries that can be bounded within an interval.&nbsp;<\/p>\n\n\n\n<p>In this case, to spot outliers one can simply plot the data (<em>visual approach<\/em>) and then manually set up threshold values to filter them out. But, because this process gets impractical for hundreds of timeseries, information drawn from the data distributions comes in handy (<em>statistical approach<\/em>). 
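Throughout the post, the snippets assume `timeseries` is a `pandas.Series` of conductance values. If you want to follow along without any simulation data, here is a purely synthetic stand-in with a few planted outliers (the indices and values are illustrative only):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a conductance timeseries (arbitrary units)
rng = np.random.default_rng(42)
timeseries = pd.Series(rng.normal(loc=1.2, scale=0.2, size=500))

# Plant some zeros and large spikes to mimic the outliers shown above
timeseries.iloc[[50, 150, 300]] = 0.0
timeseries.iloc[[75, 220, 410]] = rng.uniform(5.0, 10.0, size=3)
```

Any series-like data with a well-behaved bulk plus a few extreme points will behave the same way in the snippets below.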
I will cover both cases.<\/p>\n\n\n\n<p><strong>Visual approach<\/strong><\/p>\n\n\n\n<p><strong>Just plot&nbsp;it<\/strong><\/p>\n\n\n\n<p>Timeseries and scatter plots are useful means to visualise outliers.<\/p>\n\n\n\n<p>In the time series below, we can see that the bulk of the data lives between 0.5 and 2.0.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/2000\/0%2Azvnhk8heH7x7D3Gb.png?w=625&#038;ssl=1\" alt=\"\"\/><\/figure>\n\n\n\n<p>To clean our data, we can set these as thresholds, replace outliers with <code>NaN<\/code> values, and fill them in with interpolated data. Using <code>pandas<\/code>, it would look something like this:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import pandas as pd\ndf = pd.DataFrame({'original': timeseries})\n# Mask values outside [0.5, 2.0] with NaN, then fill the gaps by interpolation\ndf_with_NaNs = df.mask((df &lt; 0.5) | (df > 2.0))\ndf_new = df_with_NaNs.interpolate(method='linear', axis=0).ffill().bfill()<\/pre>\n\n\n\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/2000\/0%2ASjWJTTWB-b-y4ROl.png?w=625&#038;ssl=1\" alt=\"\"\/><\/figure>\n\n\n\n<p><strong>Box plots<\/strong><\/p>\n\n\n\n<p>Using a plain plot of our timeseries does the job. 
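If you want to double-check the cleaning visually, a quick before-and-after plot is enough. A minimal sketch using matplotlib; the toy data here stands in for the `df`/`df_new` of the snippet above so the example is self-contained:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, safe for scripts
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Rebuild a small example in place of df/df_new from the snippet above
timeseries = pd.Series(np.sin(np.linspace(0, 20, 200)) * 0.5 + 1.2)
timeseries.iloc[[30, 90]] = 4.0  # planted outliers
df = pd.DataFrame({'original': timeseries})
df_new = df.mask((df < 0.5) | (df > 2.0)).interpolate(method='linear', axis=0).ffill().bfill()

# Stack the two series so the removed spikes are easy to spot
fig, (ax_top, ax_bottom) = plt.subplots(2, 1, sharex=True)
ax_top.plot(df['original'])
ax_top.set_title('original')
ax_bottom.plot(df_new['original'])
ax_bottom.set_title('cleaned')
fig.savefig('before_after.png')
```
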
However, we can use another graphic representation, a more statistical one: a box plot.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import seaborn as sns\nimport matplotlib.pyplot as plt\n\nfig, ax = plt.subplots()\nsns.boxplot(df['original'], ax=ax)<\/pre>\n\n\n\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/2000\/0%2AlDGLkTtSXE83aMAX.png?w=625&#038;ssl=1\" alt=\"\"\/><\/figure>\n\n\n\n<p><em>Box plots<\/em> graphically represent the anatomy of your data\u2019s distribution. The central box indicates the <em>quartiles<\/em> <code>Q1<\/code>, <code>Q2<\/code>, and <code>Q3<\/code>, which represent the 25th percentile, the median, and the 75th percentile of the distribution. The two <em>whiskers<\/em> typically extend to the most extreme data points within 1.5 times the interquartile range of the box, and all data points beyond them are considered <em>outliers<\/em>. For a pretty picture of the parts of a box plot, check <a href=\"https:\/\/miro.medium.com\/max\/9000\/1*2c21SkzJMf3frPXPAR_gZA.png\" rel=\"noreferrer noopener\" target=\"_blank\">this<\/a>.<\/p>\n\n\n\n<p>Again, from looking at the box plot, we can see that 0.5 and 2.0 are indeed quite sensible threshold choices for filtering out outliers.<\/p>\n\n\n\n<p><strong>Statistical Approach<\/strong><\/p>\n\n\n\n<p>Visualisation is great for quickly spotting outliers. However, if you have to deal with hundreds or even thousands of timeseries, clutter will make it impractical to pick suitable thresholds by eye.<\/p>\n\n\n\n<p>Statistical approaches provide a methodical way to overcome this limitation. 
Two very well-known approaches I will look at are:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><em>The Z-score<\/em><\/li>\n\n\n\n<li><em>The Inter-Quartile Range<\/em><\/li>\n<\/ul>\n\n\n\n<p><strong>The Z-score<\/strong><\/p>\n\n\n\n<p>This method is based on a simple intuition: just use the arithmetic mean and standard deviation of each time series to define an interval&nbsp;<\/p>\n\n\n\n<p><code>[np.mean(timeseries)-np.std(timeseries), np.mean(timeseries)+np.std(timeseries)]<\/code><\/p>\n\n\n\n<p>outside which outliers must live. After all, the bulk of the data must fall within this interval. Set this into a script, and \u201c<em>Boom!<\/em>\u201d Outliers gone!<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/2000\/0%2Aa17X-17NkItZoBkN.png?w=625&#038;ssl=1\" alt=\"\"\/><\/figure>\n\n\n\n<p>For a threshold of 1, the Z-score method does exactly the same thing as filtering with an interval of length <code>2*sigma<\/code> centred at the arithmetic mean.<\/p>\n\n\n\n<p>But, instead of setting thresholds on the timeseries values, this is done on their normalised values<\/p>\n\n\n\n<p><code>z_scores = (timeseries - np.mean(timeseries))\/np.std(timeseries)<\/code><\/p>\n\n\n\n<p>So, any threshold on the <code>z_scores<\/code> will represent a multiple of the standard deviation.<\/p>\n\n\n\n<p>Use this snippet to capture the outlier values to be cleaned following this method:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import numpy as np\nfrom scipy import stats\n\nz_scores = stats.zscore(timeseries)  # returns a plain numpy array\nthreshold = 1\noutliers = timeseries[np.abs(z_scores) > threshold].values<\/pre>\n\n\n\n<p><strong>Inter-Quartile Range<\/strong><\/p>\n\n\n\n<p>This is 
often regarded as one of the most robust simple methods for dealing with outliers. And, just in case you didn\u2019t notice, it is exactly what box plots visualise.&nbsp;<\/p>\n\n\n\n<p>To define the endpoints of the interval for outlier filtering, instead of using the mean and the standard deviation, this approach uses the interval<\/p>\n\n\n\n<p><code>[Q1-1.5*IQR, Q3+1.5*IQR]<\/code><\/p>\n\n\n\n<p>where <code>Q1<\/code> and <code>Q3<\/code> are the first and the third quartiles and <code>IQR = Q3 - Q1<\/code> is the interquartile range.&nbsp;<\/p>\n\n\n\n<p>Unlike the Z-score method, the filtering interval is defined around the <em>median<\/em>, not the mean, so it takes into account any asymmetry in the data distribution. Another reason the IQR is more robust is that the mean and the standard deviation themselves get distorted if you have several outliers with unusually high values, something I saw when trying Z-scores on my timeseries.&nbsp;<\/p>\n\n\n\n<p>Use this snippet to capture the outlier values to be cleaned following the IQR method:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import numpy as np\n\n# Define first and third quartiles and the IQR\nQ1 = np.percentile(timeseries, 25, method='midpoint')\nQ3 = np.percentile(timeseries, 75, method='midpoint')\nIQR = Q3 - Q1\n\n# Boolean masks for upper and lower outliers\nupper_outliers_indices = timeseries >= (Q3 + 1.5*IQR)\nlower_outliers_indices = timeseries &lt;= (Q1 - 1.5*IQR)\n\n# Extract outlier values\nupper_outliers = timeseries[upper_outliers_indices].values\nlower_outliers = timeseries[lower_outliers_indices].values\noutliers = np.concatenate([lower_outliers, 
upper_outliers])<\/pre>\n\n\n\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/2000\/0%2AGiLWmWS70Q0uC9py.png?w=625&#038;ssl=1\" alt=\"\"\/><figcaption class=\"wp-element-caption\">The result after removing and filling in&nbsp;outliers<\/figcaption><\/figure>\n\n\n\n<p><strong>Back to our original timeseries<\/strong><\/p>\n\n\n\n<p>OK, we have talked through how different visual and statistical methods deal with outliers, along with their limitations.<\/p>\n\n\n\n<p>Now, it\u2019s time to go back to our original problem of filtering outliers out of our conductance timeseries.<\/p>\n\n\n\n<p><strong>Trade-offs<\/strong><\/p>\n\n\n\n<p>You might think the IQR method should be the way to go. However, if you take a careful look at the timeseries above after applying IQR, not only are the zero and large-value outliers removed, but so are some non-zero data points that tell us something about the evolution of our system, <em>i.e.,<\/em> low-conductance states transiently visited by the ion channel during its simulation.<\/p>\n\n\n\n<p>When dealing with outliers, you must consider whether the data points you remove carry valuable information in the context of your data.<\/p>\n\n\n\n<p><strong>My Python&nbsp;function<\/strong><\/p>\n\n\n\n<p>For my original data, the particular strategy I used required me to remove all the upper outliers along with all zeros, while keeping the lower \u201coutliers\u201d as considered by the IQR method. 
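Before settling on a strategy like this, it helps to count how much each rule would actually throw away. A quick check on a toy series (synthetic data; the threshold of 1 matches the Z-score snippet above):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy series: mostly well-behaved values plus a few planted zeros
rng = np.random.default_rng(0)
timeseries = pd.Series(rng.normal(loc=1.2, scale=0.2, size=500))
timeseries.iloc[::100] = 0.0

# How many points would each rule flag?
z_flags = int((np.abs(stats.zscore(timeseries)) > 1).sum())

Q1, Q3 = np.percentile(timeseries, [25, 75])
IQR = Q3 - Q1
iqr_flags = int(((timeseries < Q1 - 1.5 * IQR) | (timeseries > Q3 + 1.5 * IQR)).sum())

print(f"Z-score flags: {z_flags}, IQR flags: {iqr_flags}")
```

On data like this, the Z-score rule with a threshold of 1 flags far more points than the IQR fences do, which is exactly the over-eagerness to watch out for.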
As before, outliers get filled in with interpolated values.&nbsp;<\/p>\n\n\n\n<p>Steal this:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import numpy as np\n\ndef clean_outliers(df_timeseries):\n    # Determine the Interquartile Range (IQR)\n    Q1 = np.percentile(df_timeseries, 25, method='midpoint')\n    Q3 = np.percentile(df_timeseries, 75, method='midpoint')\n    IQR = Q3 - Q1\n\n    # Upper outliers only; lower 'outliers' are deliberately kept\n    upper_outliers_indices = df_timeseries >= (Q3 + 1.5*IQR)\n    upper_outliers = df_timeseries[upper_outliers_indices].values\n\n    # Treat all zeros as outliers too\n    outliers = np.concatenate([upper_outliers, np.zeros(1)])\n\n    # Replace outliers with NaNs and fill in via interpolation\n    df_with_NaNs = df_timeseries.replace(list(outliers), np.nan)\n    df_interpolated = df_with_NaNs.interpolate(method='linear', axis=0).ffill().bfill()\n\n    return df_interpolated\n\n# Usage: df['cleaned'] = clean_outliers(df['original'])<\/pre>\n\n\n\n<p>And the outcome:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/2000\/0%2ARx1-Jxq1zE2flul2.png?w=625&#038;ssl=1\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Before<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/2000\/0%2AUvCbg5x9ikjLRp7W.png?w=625&#038;ssl=1\" alt=\"\"\/><figcaption class=\"wp-element-caption\">After&nbsp;<\/figcaption><\/figure>\n\n\n\n<p><strong>The bottom&nbsp;line<\/strong><\/p>\n\n\n\n<p>Dealing with outliers doesn\u2019t have to be a pain if you use the right 
approach. Sometimes just plotting your data and discarding points outside an interval is enough. At other times, you may need the statistical information in your dataset to identify and remove outliers more methodically. Regardless of the method, always consider the trade-off between what information you want to keep and how much you can afford to discard. Think about the context of your problem to judge whether you are removing informative data.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Have you ever had an annoying dataset that looks something like this? or even worse, just several of them In this blog post, I will introduce basic techniques you can use and implement with Python to identify and clean outliers. The objective will be to get something more eye-pleasing (and mostly less troublesome for further [&hellip;]<\/p>\n","protected":false},"author":100,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","wikipediapreview_detectlinks":true,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"ngg_post_thumbnail":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[29,361,621,14,647,227,278],"tags":[],"ppma_author":[631],"class_list":["post-8964","post","type-post","status-publish","format-standard","hentry","category-code","category-data-science","category-data-visualization","category-howto","category-molecular-dynamics","category-python-code","category-statistics"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"authors":[{"term_id":631,"user_id":100,"is_guest":0,"slug":"broncio","display_name":"Broncio 
Aguilar-Sanjuan","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/98ed21f7b05613d3b21c650ec13af1cafba2e7d8c40a23dba86b9a8b623bd3ef?s=96&d=mm&r=g","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":""}],"_links":{"self":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/8964","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/users\/100"}],"replies":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/comments?post=8964"}],"version-history":[{"count":5,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/8964\/revisions"}],"predecessor-version":[{"id":9173,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/8964\/revisions\/9173"}],"wp:attachment":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/media?parent=8964"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/categories?post=8964"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/tags?post=8964"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=8964"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}