Visualisation of very large high-dimensional data sets as minimum spanning trees

Large high-dimensional data sets are frequently used in chemical and biological sciences. For example the ChEMBL database contain millions of bioactive molecules from the scientific literature and their associated biological assay data are usually used for drug discovery. Visualising such databases helps understand the structure of data.

Matrix factorisation (e.g. PCA) and nearest neighbour graph (e.g. t-SNE, UMAP), are usually used to reduce dimensions for visualisation. PCA is able to capture the global structure of the data as it projects the data onto first k principal components, giving the k-dimensional view of the data that best preserve the first two moments. On the other hand, t-SNE and UMAP learn the lower-dimensional embedding, and they are able to recover the local structures of the data. These algorithms have time complexities between O(n1.14) and O(n5), limiting the size of data to be visualised.

Probst and Reymond proposed a new algorithm, “Tree MAP” (TMAP) [1], to visualise large data sets in a tree. The algorithm consists of four steps: (i) LSH forest indexing, (ii) construction of a c-approximate k-nearest neighbour graph, (iii) calculation of a minimum spanning tree (MST) of the c-approximate k-nearest neighbour graph, (iv) generation of a layout for the resulting MST. The key difference between TMAP and previous methods such as UMAP or t-SNE is that UMAP and t-SNE attempt to learn the embedding, while in TMAP, it is replaced by calculation of minimum spanning tree. This significantly lower the computational complexity. The authors show that the algorithm generates visualisation with an empirical sub-linear time complexity O(n0.931), allowing to visualise much larger high dimensional data sets that other methods.

In summary, Tree MAP (TMAP) is a new visualisation method for very large, high-dimensional data sets.

[1] D. Probst and J‑L Reymond. Visualization of very large high-dimensional data sets as minimum spanning trees. J. Cheminf. 2020 12 12

Author