Benford’s law and OAS | Oxford Protein Informatics Group

Benford’s law is the observation that in numerical data produced by many kinds of process, the leading digit tends to be small. Wikipedia tells you that in datasets obeying Benford’s law, the number 1 appears as the leading digit about 30% of the time while 9 appears less than 5% of the time (p(n) = log10(1 + 1/n), where n is the leading digit). Wikipedia further lists multiple kinds of data where this tends to hold, such as electricity bills, population numbers, and physical and mathematical constants, and particularly data that can be described by a power law.
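As a quick illustration of the formula above, here is a minimal Python sketch that tabulates the expected leading-digit probabilities. The function name `benford_probabilities` is just a label chosen for this example.

```python
import math

def benford_probabilities():
    """Expected leading-digit probabilities p(n) = log10(1 + 1/n) for n = 1..9."""
    return {n: math.log10(1 + 1 / n) for n in range(1, 10)}

probs = benford_probabilities()
for digit, p in probs.items():
    # Print each leading digit with its expected frequency as a percentage.
    print(f"{digit}: {100 * p:.1f}%")
```

The nine probabilities sum to 1, since the terms (n + 1)/n multiply out to 10 and log10(10) = 1.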

Power laws and antibodies have been co-discussed in reference to network descriptions of antigen-experienced BCR repertoires [1], which are often described as scale-free (the network terminology for following a power law). This means a few highly-connected nodes in the network and lots of nodes with few or no connections. This is an obvious candidate for Benford’s law.

This is of no practical relevance, but I wondered if I could see Benford’s law in kinds of data in the Observed Antibody Space (OAS) other than clone counts. For example, I looked at the leading digit of the number of sequences in each of the data units in OAS. The distribution looks like a good fit for Benford’s law (though with more density at the smaller leading digits) and has a chi-squared value of 0.007 (Figure 1A).
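For anyone who wants to try this on their own counts, here is a minimal sketch of the leading-digit comparison. It is not the code used for the post. The post does not say how the chi-squared value was computed, so this version assumes the statistic is taken over observed and expected proportions rather than raw counts. The helper names and the toy input (powers of 2, a textbook Benford-like sequence) are made up for illustration and are not OAS data.

```python
import math
from collections import Counter

def leading_digit(value):
    """Return the first digit of a positive integer count."""
    return int(str(abs(int(value)))[0])

def benford_chi_squared(counts):
    """Chi-squared-style statistic comparing leading-digit proportions
    to Benford's law. Assumption: computed on proportions, not raw counts."""
    values = [c for c in counts if c > 0]  # zero has no leading digit
    observed = Counter(leading_digit(c) for c in values)
    total = len(values)
    stat = 0.0
    for n in range(1, 10):
        expected = math.log10(1 + 1 / n)
        obs = observed.get(n, 0) / total
        stat += (obs - expected) ** 2 / expected
    return stat

# Toy stand-in for "number of sequences per data unit" (hypothetical values).
toy_sizes = [2 ** k for k in range(1, 101)]
print(benford_chi_squared(toy_sizes))
```

A value near zero means the observed leading-digit proportions sit close to the Benford expectation, while uniformly distributed leading digits give a much larger value.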

I also took mutation counts from 100 data units from OAS (amounting to 2,167,727 sequences with non-zero mutation counts). This is a poorer fit to Benford’s law, with a chi-squared value of 0.21, but it still looks Benford-like (Figure 1B). Perhaps the poorer fit is due to combining multiple datasets from different cell types, each of which will have its own mutational profile.

As I said, I can’t think of any practical reason for this, but it is a cool law I learned of recently. Its major applications are in detecting data that has been faked in some way, allowing you to monitor data of various kinds for signs of fraud. Comfortingly, BCR repertoire sequencing datasets conform to this law in various regards (so we don’t need to perform a citizen’s arrest on Tobias for the time being).

[1] Miho, Enkelejda, et al. “Computational strategies for dissecting the high-dimensional complexity of adaptive immune repertoires.” Frontiers in immunology 9 (2018): 224.

Author

Eve Richardson
