Probabilistic hierarchical clustering
Motivation
In real world we often try to find structure in data without knowing if there is any structure - e.g. if there are clusters and how many clusters are there is they exist. One way to approach this is to do hierarchical clustering. Recently, we suggested t-NEB, a method to do probabilisticly based hierarchical clustering, where the idea is to overcluster data with student-t mixture model (TMM) – as it is more ronus to noise and outliers – and then to merge these mixtures using the geodesic distance along the TMM energy landscape. However, t-NEB requires fitting a density model in high-dimensional space, which requires a lot of data, otherwise, the density model might be suboptimal.
Project
One approach to improve the TMM fitting is to reduce dimensions before fitting the TMM. However, as Johnson-Lindenstrauss lemma says - we cannot unlimitedly reduce dimensions preserving both local and global distances. Practically this means that if the dimensions are reduced without caution, we can merge some clusters in lower-dimensional space or split a single cluster into two. Additionally, we want to preserve the probabilitic properties of our data. Luckily for us, there is a method to reduce dimensional in a probabilistic way, considering preserving clusters separability - Hierarchical mixtures of Gaussians (HMoGs). The idea is first to merge the HMoGs with the t-NEB and then maybe extend the HMoGs towards a mixture of student-t components.
Thesis
Within the context of this project, several research questions could be explored: - How much does reducing dimensions help us to improve performance? - What is the dependency between the amount of data, the dimensionality and the performance of both algorithm? - Does replacing GMM to TMM in HMoGs actually improves the performance in practice? - Does both methods help us to learn something new about the biological data? (We can provide a dataset of functional embeddings for neurons) - How to evaluate hierarchy in hierarchical clustring?
Contact
To apply please email Polina Turishcheva stating your interest in this project and detailing your relevant skills. A part of this project could be also a lab rotation. This project is better suited for people with minimal coding experience. Both codebases use jax. No prior experince in biology is required.
