Create benchmark dataset for hierarchical clustering
Motivation
In real world we often try to find structure in data without knowing if there is any structure - e.g. if there are clusters and how many clusters are there is they exist. One way to approach this is to do hierarchical clustering. However, evaluating hierarchical clustering is currently an issue, as there are no datasets with unarguable ground truth hierarchucal structure and different metrics have various issues (for example, rely on distances, while not all distances are well-behaved in higher-dimensional space, especially the most common ones like Eucledian distances become more uniformly distributed). Hence, there is a very clear need in a benchmark hierarchical-by-design dataset to evaluate hierarchical clustering methods.
Project
Inspired by hyperbolic space, where hierarchy is the innate property of the space, we suggest to create embeddings in this space. Small explanation why hyperbolic space is good for hierarchical data: in hierarchy the number of nodes at each successive level (children, grandchildren, etc.) often increases exponentially, like $b^n$, where $n$ is the depth of hierarchy. Hyperbolic space is a non-Euclidean geometry with a constant negative curvature (it locally resembles a saddle shape). Because of this negative curvature, the space expands exponentially as you move away from any point. Hence, hyperbolic space has the room that a hierarchy needs, making it a natural, low-distortion “home” for tree-like and hierarchical data. We suggest to take the following papers as inspiration, create a hierarchical-by-design dataset and evaluate current Hierarchical clustering methods and metrics on it:
- Poincaré Embeddings for Learning Hierarchical Representations
- Hyperbolic space for Single-cell Data
- Accelerating hyperbolic t-SNE
Thesis
Within the context of this project, several research questions could be explored: - What is a good data domain to generate hierarchical embedings? Any alternatives for WORDNET mammals subtree from the Meta paper? - How do different hierarchical methods (agglomerative clustering, HBC, t-NEB) perform on these hierarchical datasets? - How do different hierarchicy evaluation methods behave, do they actually reflect if the method is close to the true hierarchy? If no - then why? - Can we somehow match the existing embeddings from euclidean space to hyperbolic space to find hierarchy? What are the prerequisites for the embeddings in the euclidean space?
Contact
To apply please email Polina Turishcheva stating your interest in this project and detailing your relevant skills. A part of this project could be also a lab rotation. This project is better suited for people with minimal coding experience. No prior experince in biology is required.
