Quantifying dispersion under varying instrument precision

Experimental errors are common whenever new data are generated. Often these errors are simply due to the inability of the instrument to make perfectly precise measurements. In addition, different instruments can have different levels of precision, even though they are used to perform the same measurement. Take, for example, two balances and an object with a mass of 1 kg. The first balance, when measuring this object several times, might record values of 1.0083 and 1.0091, while the second balance might give values of 1.1074 and 0.9828. In this case the first balance has the higher precision, as the spread between its measurements is smaller than the spread between the measurements of the second balance.

To keep some control over the error introduced by the differing precision of the instruments, each instrument is labelled with a measure of its precision 1/\sigma_i^2, or equivalently with its dispersion \sigma_i^2.

Let’s assume that the values these instruments record are of the form X_i = C + \sigma_i Z, where Z \sim N(0,1) is an error term, X_i is the value recorded by instrument i, and C is the fixed true quantity the instrument is trying to measure. But what if C is not a fixed quantity? What if the underlying phenomenon being measured is itself stochastic, like the measurement X_i? If, for example, we are measuring the weight of cattle at different times, the length of a bacterial cell, or the concentration of a drug in an organism, then in addition to the error that arises from the instruments there is also noise introduced by dynamical changes in the object being measured. In this scenario the phenomenon of interest can be described by a random variable Y \sim N(\mu, S^2), and the instruments record quantities of the form X_i = Y + \sigma_i Z.

In this case, estimating \mu, the expected state of the phenomenon of interest, is not a big challenge. Assume that x_1, x_2, \dots, x_n are observed realisations of the independent variables X_i \sim N(\mu, \sigma_i^2 + S^2), one from each of n different instruments (each measurement involves its own realisation of Y). Then \sum x_i / n is still a good estimate of \mu, since E(\sum X_i / n) = \mu. A more challenging problem is to infer the underlying variability of the phenomenon of interest Y. Under the setup above, this reduces to estimating S^2, since we are assuming Y \sim N(\mu, S^2) and that the instruments record values of the form X_i = Y + \sigma_i Z.
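To make the setup concrete, here is a minimal simulation sketch (not taken from the original analysis; all numerical values and variable names are illustrative) of n independent measurements X_i = Y + \sigma_i Z, together with the sample mean as an estimate of \mu:

```python
# Illustrative sketch of the measurement model; mu, S and the sigma_i are made up.
import numpy as np

rng = np.random.default_rng(0)

n = 200
mu, S = 50.0, 5.0                       # assumed mean and SD of the phenomenon Y
sigma = rng.uniform(1.0, 20.0, size=n)  # assumed instrument SDs, one per instrument

# Each measurement sees its own realisation of Y plus its own instrument error,
# so the X_i are independent with X_i ~ N(mu, sigma_i^2 + S^2).
Y = rng.normal(mu, S, size=n)
X = Y + sigma * rng.standard_normal(n)

print("sample mean of X:", X.mean())    # should be close to mu
```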

To estimate S^2, a standard maximum likelihood approach could be used, based on the likelihood function

f(x_1, x_2, \dots, x_n) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi(\sigma_i^2 + S^2)}} \exp\left( -\frac{(x_i - \mu)^2}{2(\sigma_i^2 + S^2)} \right),

from which the maximum likelihood estimator of S^2 is given by the solution to

\sum_{i=1}^{n} \frac{(X_i - \mu)^2 - (\sigma_i^2 + S^2)}{(\sigma_i^2 + S^2)^2} = 0.
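As a hedged sketch of how this step could be carried out in practice (assuming, as the text suggests, that \mu is replaced by the sample mean and that the instrument variances \sigma_i^2 are known), one can maximise the log-likelihood over S^2 \ge 0 numerically; at an interior maximum this is equivalent to solving the score equation above:

```python
# Sketch of the maximum likelihood step; not the author's original code.
import numpy as np
from scipy.optimize import minimize_scalar

def mle_S2(x, sigma2):
    """Maximum likelihood estimate of S^2 given data x and known instrument variances sigma2."""
    mu_hat = x.mean()  # mu estimated by the sample mean, as in the text

    def neg_loglik(S2):
        v = sigma2 + S2  # total variance of each X_i
        return 0.5 * np.sum(np.log(v) + (x - mu_hat) ** 2 / v)

    # Bounded 1-D optimisation; the upper bound is a generous heuristic.
    res = minimize_scalar(neg_loglik, bounds=(0.0, 10 * np.var(x)), method="bounded")
    return res.x

# Quick check with illustrative numbers (true S^2 = 25):
rng = np.random.default_rng(0)
sigma2 = rng.uniform(10.0, 100.0, size=200)
x = rng.normal(0.0, np.sqrt(sigma2 + 25.0))
print(mle_S2(x, sigma2))
```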

Another, more naive, approach could use the following result,

E\left[ \sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2 \right] = \left(1 - \frac{1}{n}\right) \sum_{i=1}^{n} \sigma_i^2 + (n-1) S^2, \qquad \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i,

from which \hat{S}^2 = \left[ \sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2 - \left(1 - \frac{1}{n}\right) \sum_{i=1}^{n} \sigma_i^2 \right] / (n-1).
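This gives a closed-form estimator; a small sketch (with hypothetical variable names) might look like:

```python
# Moment-based alternative obtained by rearranging the expectation above.
import numpy as np

def moment_S2(x, sigma2):
    """Closed-form estimator of S^2 from the expectation identity above."""
    n = len(x)
    ss = np.sum((x - x.mean()) ** 2)  # sum of squared deviations from the sample mean
    # Note: this quantity can come out negative when instrument noise dominates.
    return (ss - (1 - 1 / n) * np.sum(sigma2)) / (n - 1)
```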

Here are three simulation scenarios in which 200 values X_i are taken from instruments of varying precision, or variance, \sigma_i^2, i = 1, 2, \dots, 200, and where the variance of the phenomenon of interest is S^2 = 1500. In the first scenario the \sigma_i^2 are drawn from [10, 1500^2], in the second from [10, 1500^2 \times 3], and in the third from [10, 1500^2 \times 5]. In each scenario S^2 is estimated 1,000 times, each time with a fresh set of 200 realisations of X_i. The values estimated via the maximum likelihood approach are plotted in blue, and the values obtained by the alternative method are plotted in red. The true value of S^2 is marked by the red dashed line in all plots.
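The original simulation code is not included in the post; the following sketch reconstructs the three scenarios under some explicit assumptions (the \sigma_i^2 are drawn uniformly from the stated intervals, and \mu = 0) and compares the two estimators:

```python
# Rough, assumption-laden reconstruction of the three scenarios; illustration only.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
n, S2_true, n_rep = 200, 1500.0, 1000

for factor in (1, 3, 5):                                  # the three scenarios
    sigma2 = rng.uniform(10.0, 1500.0 ** 2 * factor, size=n)
    mle_est, mom_est = [], []
    for _ in range(n_rep):
        x = rng.normal(0.0, np.sqrt(sigma2 + S2_true))    # X_i ~ N(mu, sigma_i^2 + S^2), mu = 0
        mu_hat = x.mean()

        def neg_loglik(S2):
            v = sigma2 + S2
            return 0.5 * np.sum(np.log(v) + (x - mu_hat) ** 2 / v)

        mle_est.append(minimize_scalar(neg_loglik, bounds=(0.0, np.var(x)),
                                       method="bounded").x)
        mom_est.append((np.sum((x - mu_hat) ** 2)
                        - (1 - 1 / n) * sigma2.sum()) / (n - 1))

    print(f"scenario x{factor}: MLE mean {np.mean(mle_est):.0f}, "
          f"moment mean {np.mean(mom_est):.0f}")
```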

First simulation scenario, where \sigma_i^2, i = 1, 2, \dots, 200, lie in [10, 1500^2]. The values of \sigma_i^2 are plotted in the histogram on the right. The 1,000 estimates of S^2 are shown by the blue (maximum likelihood) and red (alternative) histograms.

Second simulation scenario, where \sigma_i^2, i = 1, 2, \dots, 200, lie in [10, 1500^2 \times 3]. The values of \sigma_i^2 are plotted in the histogram on the right. The 1,000 estimates of S^2 are shown by the blue (maximum likelihood) and red (alternative) histograms.

Third simulation scenario, where \sigma_i^2, i = 1, 2, \dots, 200, lie in [10, 1500^2 \times 5]. The values of \sigma_i^2 are plotted in the histogram on the right. The 1,000 estimates of S^2 are shown by the blue (maximum likelihood) and red (alternative) histograms.

For recent advances in methods that deal with this kind of problem, see:

Delaigle, A. and Hall, P. (2016), Methodology for non-parametric deconvolution when the error distribution is unknown. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78: 231–252. doi: 10.1111/rssb.12109
