Probabilistic hierarchical clustering

Probabilistic hierarchical clustering

Motivation

In real world we often try to find structure in data without knowing if there is any structure - e.g. if there are clusters and how many clusters are there is they exist. One way to approach this is to do hierarchical clustering. Recently, we suggested t-NEB, a method to do probabilisticly based hierarchical clustering, where the idea is to overcluster data with student-t mixture model (TMM) – as it is more ronus to noise and outliers – and then to merge these mixtures using the geodesic distance along the TMM energy landscape. However, t-NEB requires fitting a density model in high-dimensional space, which requires a lot of data, otherwise, the density model might be suboptimal.

Project

One approach to improve the TMM fitting is to reduce dimensions before fitting the TMM. However, as Johnson-Lindenstrauss lemma says - we cannot unlimitedly reduce dimensions preserving both local and global distances. Practically this means that if the dimensions are reduced without caution, we can merge some clusters in lower-dimensional space or split a single cluster into two. Additionally, we want to preserve the probabilitic properties of our data. Luckily for us, there is a method to reduce dimensional in a probabilistic way, considering preserving clusters separability - Hierarchical mixtures of Gaussians (HMoGs). The idea is first to merge the HMoGs with the t-NEB and then maybe extend the HMoGs towards a mixture of student-t components.

Thesis

Within the context of this project, several research questions could be explored: - How much does reducing dimensions help us to improve performance? - What is the dependency between the amount of data, the dimensionality and the performance of both algorithm? - Does replacing GMM to TMM in HMoGs actually improves the performance in practice? - Does both methods help us to learn something new about the biological data? (We can provide a dataset of functional embeddings for neurons) - How to evaluate hierarchy in hierarchical clustring?

Contact

To apply please email Polina Turishcheva stating your interest in this project and detailing your relevant skills. A part of this project could be also a lab rotation. This project is better suited for people with minimal coding experience. Both codebases use jax. No prior experince in biology is required.

Neural Data Science Group
Institute of Computer Science
University of Goettingen